Thursday, March 27, 2008 

Zen and the Art of Fixing

A little known fact about me is that I like fixing things. Mostly TV sets, but over the years I've tried to repair (with various degrees of success :-) any electrical, electronic or mechanical device that needed some fixing. Indeed, when I was about 14 I made a small business out of rescuing old vacuum-tube radios, fixing and then selling them to collectors.
Even today, I occasionally talk about diagnosing and repairing a defective TV when teaching about modular design, modular reasoning, design for testability, and so on.

In fact, fixing a device has a lot in common with fixing a buggy program. Some devices (like old washing machines) are relatively simple. You just need to know some basic concepts, and apply some common sense. Sure, it may take some creativity to fix even the simplest device (especially if you can't find the right replacement parts), but overall, the problem is often self-evident, and you "only" need to resort to your ability, often just manual ability.

Electronic devices, however, can fail in complex, even puzzling ways. You need a better understanding of what's going on under the hood. You may need special tools (you can't get much far repairing a TV if you ain't got an oscilloscope) but sometimes it's just plain old intuition (or sheer luck :-). You need to know some tricks of the trade (like using a light bulb to discriminate problems in the power supply vs. horizontal deflection), but overall, what you need more is rational system thinking.

The same is true when reasoning about complex software failures. We often have a large numbers of parts (hardware, firmware, drivers, OS, libraries, our own code) which can fail for a large number of reasons. Your best allied is rational system thinking. Your worst enemy is the all-so-common uncircumstantiated certainty that the fault must lie exactly somewhere (usually outside our own code :-). Overall, I would say my experience fixing stuff has made me a better troubleshooter, helping me to find obscure bugs in systems I knew very little about.

Still, I don't like to fix computers. I guess I'm already spending too much time around computers, and besides, today the integration scale is so high that you can hardly fix a broken motherboard. However, there is always an exception, and this one is interesting enough to be worth telling.

A couple of years ago (yeah :-) I rescued a notebook before it was thrown away. After a fall, it wouldn't even turn on. The CD-ROM was visibly damaged, but the screen was intact. I took it home and left it alone for a few months, till I had some time to kill.

I'm usually lucky, and in fact, I discovered that by pushing the power plug a little heavier than usual, the notebook would indeed power on. That's usually just a broken joint; it took forever :-) to disassemble the notebook and expose the PCB, but after that, it was quite easy to fix. Now the computer would turn on, start booting XP, and die a few seconds later. I suspected the HD was damaged too, and replaced it with a spare one (with W2K installed). Similar story: it would start booting the OS, then dump a blue screen reporting that “the boot device was not accessible”.

That's usually due to a broken IDE controller, or a faulty RAM chip. I tried replacing the RAM, but still got the same problem. However, the IDE controller was somehow working, as the computer would indeed start booting. Weird :-).

I didn't give up and got another HD, with good old MS-DOS installed. The notebook booted like a charm, and all the applications seemed to work. Weird again: the IDE controller seemed fine. Again, it could be a faulty RAM chip, since MS-DOS only uses the first 640KB.

I connected a USB CD-ROM and tried an old version of Knoppix on CD: knoppix uses all the available RAM, so that was like a definitive RAM test. It worked fine, but as soon as I tried using the HD, it would simply lock up. So, RAM was fine, the IDE controller was fine under MS-DOS but failed under other operating systems.
My diagnosis was that DMA was at fault. I tried to disable DMA at the BIOS level with no luck. I also disabled DMA in that W2K HD (using another PC, of course), but it would still lock, just like Knoppix.

At that point, I contemplated using an external HD (via USB), and perhaps installing a tweaked version of XP which can boot from a USB device (details on the BartPE page). I could even rewire one of the USB ports and install the whole thing internally, since the missing CD-ROM left a lot of space. But it looked like a damn lot of work :-), so I did nothing and left the notebook around for more than one year.

A few days ago, having a little time to kill again, I tried a different experiment. I got a 1GB USB stick, downloaded the latest version of Knoppix, and put it on the stick (instead of a CD-ROM). I followed this tutorial to speed things up.
It worked fine, and booting from the USB stick was very quick. However, at that point I noticed (from the boot log) that this newer version of Knoppix ran with DMA disabled (I guess the old one I tried before didn't). Time to test my DMA controller theory!

I put an HD inside, booted Knoppix from the USB stick, and guess what, no hang-up! I could access the HD with not problem whatsoever. At that point, it was hard to resist: I had to try installing knoppix on the HD itself. Again, I followed a tutorial to speed things up: this one is about a previous version, but still valid. The proof of the pudding is in the eating: I removed the USB stick, tried booting and Lo!, it worked. The world is still somehow deterministic :-).
By the way: Knoppix takes forever to boot from the HD. Overall, it would be better to keep the USB stick, maybe rewiring a USB port to keep everything inside.

Now, in a perfect world, I would have found a way to really "fix" the notebook, that is, to repair the DMA controller. I suspect it's just another loose joint from the fall. Most likely, some pin of some surface-mounted chip suffered some mechanical stress. This world, however, ain’t perfect, so I did nothing else, leaving it as a working Knoppix notebook. Fun is over, time to get back to work :-)

Labels: ,

Non siamo a livello di riparare la motocicletta con un pezzo di latta (Chi ha letto il libro a cui il titolo del tuo post si riferisce dovrebbe ricordare), ma c'è tanto Zen anche nella tua storia.

Mi piace la teoria, ma leggere di tanto in tanto qualcosa di pratico è tutta un'altra cosa.
Datato, ma sempre attuale un mio commento ad un tuo post precedente.

Riflessioni e studi molto interessanti, anche se a mio modo di vedere esistono cambi di prospettiva che possono rivoluzionare del tutto o almeno in parte le teorie esposte, come si può chiaramente vedere da qui.

Non so se avete notato, ma facendo la trasformata di Fourier dell'immagine, scambiando poi tra loro la parte reale ed immaginaria di ogni valore [ovviamente tranne il primo], e facendo infine la trasformata inversa, il pesce ed il gatto si scambiano di posto!
Post a Comment

<< Home