The most exciting thing about this world is its ever-changing quality.

Friday, August 07, 2009

The ultimate answer to everything in software - upgrade?!

In the last few days, I have been working on a very nasty problem. To put this into context, I planned to apply one of the existing real-time patches to Linux to give us hard real-time scheduling performance as opposed to soft real-time (I briefly explained the difference in my previous blog post). I adapted Xenomai 2 to the uClinux distribution for the Blackfin BF5xx processor using the Adeos patch supplied with uClinux2008R1.5, which runs kernel 2.6.22. Anyway, the problem was that one of our existing GPIO drivers (on the I2C bus) stopped working.

So, after a bit of digging in the code, it appeared that the kernel was continuously trapped within one of the interrupt handlers registered by this driver. I should say that in this case the interrupt is in the Linux domain. That means no handling is done in the Xenomai primary domain, and the interrupt is passed by ipipe all the way to the non-real-time Linux kernel to handle. The puzzling part is that it worked fine when we did not have the real-time patch in between. In this instance, Google did not help either: no obvious answer. I decided to ask for help in the Blackfin-uClinux forum and also on the Xenomai help list, as people there are usually quite helpful if you are asking about something that has been asked before or is a known issue. Since this seemed to be an outlier, I did not get much out of it until the author of the patch responded. In short, his answer was that the version I used is a legacy version (2.4.0), and unless I upgraded to the latest and greatest Xenomai version and uClinux distribution, which does not use threaded IRQs anymore, I was pretty much on my own. Right, so the first thought that came to my mind was that I was busted. It is not a trivial task to port a heavily modified kernel distribution to another version, let alone that Linux does not really maintain backward compatibility that well. Anyway, I was stuck between a rock and a hard place.

So I started to dig into ipipe to see what the difference was. Unfortunately, the only thing I could see was a positive sign: the interrupt was triggered much more reliably and with lower latency. With this in mind, I had to bet my money on the Blackfin implementation at this point. So I went back to check the hardware reference for the type of processor I am using and found the following:

“When using either rising or falling edge-triggered interrupts, the interrupt condition must be cleared each time a corresponding interrupt is serviced by writing 0x01 to the appropriate bit in the GPIO clear register.”

Right, now I had all the pieces of the puzzle. The problem was that the original driver code did not explicitly clear the GPI pin we had configured for edge-triggered interrupts, relying instead on the kernel's peripheral code to clear the resources it had allocated after the interrupt was serviced. With the Xenomai patch applied, interrupts are delivered more quickly, to the point that the next interrupt kicks in (passed by ipipe) before that work has finished, that is, before the previous interrupt status has been cleared. Hence the kernel was stuck in this particular ISR. The fix itself is easy enough: a one-line change to clear the port interrupt register.
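To make the failure mode concrete, here is a minimal sketch in plain C that simulates it on a desktop machine. It is not the actual driver code: the latch variable, the pin mask, and all the function names are hypothetical stand-ins, and the dispatcher is a crude model of "the interrupt fires again as long as the latched status is still set". The only thing taken from the hardware reference is the write-1-to-clear behaviour quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the latched edge-interrupt status of one GPIO
 * port. On real hardware this would be a memory-mapped register. */
static uint16_t mock_irq_latch;

/* Assumed for illustration: the interrupt source is pin 0 of the port. */
#define GPI_PIN_MASK 0x0001u

/* Models a write-1-to-clear register: each 1 bit in mask clears that latch bit. */
static void mock_gpio_clear(uint16_t mask)
{
    mock_irq_latch &= (uint16_t)~mask;
}

/* Original behaviour: services the event but never acknowledges the pin. */
static void handler_without_clear(void)
{
    /* ... device-specific work would go here ... */
}

/* Fixed behaviour: the one-line change, clearing the latched status. */
static void handler_with_clear(void)
{
    mock_gpio_clear(GPI_PIN_MASK);
    /* ... device-specific work would go here ... */
}

/* Crude dispatcher model: keep calling the handler while the latch is set.
 * The limit stands in for "forever" so the simulation terminates. */
static int dispatch(void (*handler)(void), int limit)
{
    int calls = 0;

    assert(limit > 0);
    while ((mock_irq_latch & GPI_PIN_MASK) != 0 && calls < limit) {
        handler();
        calls++;
    }
    return calls;
}

/* Latch one edge on the pin and report how many times the handler ran. */
static int simulate_edge(void (*handler)(void), int limit)
{
    mock_irq_latch |= GPI_PIN_MASK;
    return dispatch(handler, limit);
}
```

Calling `simulate_edge(handler_with_clear, 1000)` runs the handler once, while `simulate_edge(handler_without_clear, 1000)` only stops because it hits the artificial limit, which is the simulated equivalent of the kernel never leaving the ISR.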

What makes me think, however, is how we normally deal with unknown issues in our systems when it comes to software releases. Of course there are times when finding the root cause of a problem would be expensive, and could also become a major distraction from the current development work. Unfortunately, in my own experience, many organisations take the attitude of offering a system upgrade as a silver bullet when a customer reports problems on a legacy system, and then pray for those problems to go away on the newer release. Obviously, there is so much we do not know about this world; this problem might just be one of those we could not explain, or completely out of our league, or not worth spending our efforts on. I can, to a certain degree, justify it if it was for the last reason, as we all know that sometimes we need to make a balanced decision about where precious resources (as always) should be spent.

As you can clearly see, the suggestion I was offered, blaming threaded IRQs, was completely off target. Unfortunately, we take blind shots a lot. The question is, have you done this before?
