Thursday, March 27, 2008
Zen and the Art of Fixing
A little-known fact about me is that I like fixing things. Mostly TV sets, but over the years I've tried to repair (with varying degrees of success :-) any electrical, electronic or mechanical device that needed some fixing. Indeed, when I was about 14 I made a small business out of rescuing old vacuum-tube radios, fixing them, and then selling them to collectors.
Even today, I occasionally talk about diagnosing and repairing a defective TV when teaching about modular design, modular reasoning, design for testability, and so on.
In fact, fixing a device has a lot in common with fixing a buggy program. Some devices (like old washing machines) are relatively simple. You just need to know some basic concepts, and apply some common sense. Sure, it may take some creativity to fix even the simplest device (especially if you can't find the right replacement parts), but overall, the problem is often self-evident, and you "only" need to resort to your ability, often just manual ability.
Electronic devices, however, can fail in complex, even puzzling ways. You need a better understanding of what's going on under the hood. You may need special tools (you can't get very far repairing a TV if you ain't got an oscilloscope) but sometimes it's just plain old intuition (or sheer luck :-). You need to know some tricks of the trade (like using a light bulb to discriminate problems in the power supply vs. horizontal deflection), but overall, what you need most is rational system thinking.
The same is true when reasoning about complex software failures. We often have a large number of parts (hardware, firmware, drivers, OS, libraries, our own code) which can fail for a large number of reasons. Your best ally is rational system thinking. Your worst enemy is the all-too-common unsubstantiated certainty that the fault must lie exactly somewhere (usually outside our own code :-). Overall, I would say my experience fixing stuff has made me a better troubleshooter, helping me to find obscure bugs in systems I knew very little about.
Still, I don't like to fix computers. I guess I'm already spending too much time around computers, and besides, today the integration scale is so high that you can hardly fix a broken motherboard. However, there is always an exception, and this one is interesting enough to be worth telling.
A couple of years ago (yeah :-) I rescued a notebook before it was thrown away. After a fall, it wouldn't even turn on. The CD-ROM was visibly damaged, but the screen was intact. I took it home and left it alone for a few months, till I had some time to kill.
I'm usually lucky, and in fact, I discovered that by pushing the power plug a little harder than usual, the notebook would indeed power on. That's usually just a broken joint; it took forever :-) to disassemble the notebook and expose the PCB, but after that, it was quite easy to fix. Now the computer would turn on, start booting XP, and die a few seconds later. I suspected the HD was damaged too, and replaced it with a spare one (with W2K installed). Similar story: it would start booting the OS, then dump a blue screen reporting that “the boot device was not accessible”.
That's usually due to a broken IDE controller, or a faulty RAM chip. I tried replacing the RAM, but still got the same problem. However, the IDE controller was somehow working, as the computer would indeed start booting. Weird :-).
I didn't give up and got another HD, with good old MS-DOS installed. The notebook booted like a charm, and all the applications seemed to work. Weird again: the IDE controller seemed fine. Again, it could be a faulty RAM chip, since MS-DOS only uses the first 640KB.
I connected a USB CD-ROM and tried an old version of Knoppix on CD: Knoppix uses all the available RAM, so that was like a definitive RAM test. It worked fine, but as soon as I tried using the HD, it would simply lock up. So, RAM was fine, and the IDE controller was fine under MS-DOS but failed under other operating systems.
My diagnosis was that DMA was at fault. I tried to disable DMA at the BIOS level with no luck. I also disabled DMA on that W2K HD (using another PC, of course), but it would still lock up, just like Knoppix.
At that point, I contemplated using an external HD (via USB), and perhaps installing a tweaked version of XP which can boot from a USB device (details on the BartPE page). I could even rewire one of the USB ports and install the whole thing internally, since the missing CD-ROM left a lot of space. But it looked like a damn lot of work :-), so I did nothing and left the notebook around for more than one year.
A few days ago, having a little time to kill again, I tried a different experiment. I got a 1GB USB stick, downloaded the latest version of Knoppix, and put it on the stick (instead of a CD-ROM). I followed this tutorial to speed things up.
It worked fine, and booting from the USB stick was very quick. However, at that point I noticed (from the boot log) that this newer version of Knoppix ran with DMA disabled (I guess the old one I tried before didn't). Time to test my DMA controller theory!
I put an HD inside, booted Knoppix from the USB stick, and guess what, no hang-up! I could access the HD with no problem whatsoever. At that point, it was hard to resist: I had to try installing Knoppix on the HD itself. Again, I followed a tutorial to speed things up: this one is about a previous version, but still valid. The proof of the pudding is in the eating: I removed the USB stick, tried booting and lo, it worked. The world is still somehow deterministic :-).
By the way: Knoppix takes forever to boot from the HD. Overall, it would be better to keep the USB stick, maybe rewiring a USB port to keep everything inside.
Now, in a perfect world, I would have found a way to really "fix" the notebook, that is, to repair the DMA controller. I suspect it's just another loose joint from the fall. Most likely, some pin of some surface-mounted chip suffered some mechanical stress. This world, however, ain’t perfect, so I did nothing else, leaving it as a working Knoppix notebook. Fun is over, time to get back to work :-)
Wednesday, March 19, 2008
(Simple) Metrics
I've been using metrics for a long time (certainly more than 10 years now). I've been using metrics to control project quality (including my own stuff, of course), to define acceptance criteria for outsourced code, to understand the way people work, to "smell" large projects before attempting a refactoring activity, to help make an informed refactor / rewrite decision, to pinpoint functions or classes in need of a careful review, to estimate residual bugs, and so on.
Of course, I use different metrics for different purposes. I also combine metrics to get the right picture. In fact, you can now find several tools to calculate (e.g.) code metrics. You can also find many papers discussing (often with contradictory results) the correlation between any given metric and (e.g.) bug density. In most cases, those papers are misguided, as they look for correlation between a single metric and the target (like bug density). Reality is not that simple; it can be simplified, but not to that point.
Consider good old cyclomatic complexity. You can use it as-is, and it can be useful to calculate the minimum reasonable number of test cases you need for a single function. It's also known that functions with higher cyclomatic complexity tend to have more bugs. But it's also well known that (on average) there is a strong, positive correlation between cyclomatic complexity (CC) and lines of code (LOC). That's really natural: long functions tend to have a complex control flow. Many people have therefore discounted CC, as you can just look at the highly correlated (and easier to calculate) LOC. Simple reasoning, except it's wrong :-).
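To make the number less abstract, here is a minimal sketch (my own illustration in Python, not what any specific tool does) that approximates CC as "decision points + 1". The name approx_cc and the set of constructs being counted are assumptions; real tools differ on details such as boolean operators or case labels.

```python
import ast

# Node types counted as one decision point each in this approximation.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def approx_cc(source: str) -> int:
    """Approximate the cyclomatic complexity of a piece of Python source."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths because of short-circuiting.
            decisions += len(node.values) - 1
    return decisions + 1


SAMPLE = '''
def classify(x):
    if x > 0 and x < 10:
        return "small"
    for _ in range(x):
        pass
    return "other"
'''

print(approx_cc(SAMPLE))
```

The sample has three decision points (the if, the "and", the for), so the sketch reports 4: a reasonable lower bound on the number of test cases for that function.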
The problem with that, again, is trying to use just one number to understand something that's too complex to be represented by a single number. A better way is to get both CC and LOC for any function (or method) and then use quadrants.
Here is a real-world example, albeit from a very small program: a smart client invoking a few web services and dealing with some large XML files on the client side. It has been written in C# using Visual Studio, therefore some methods are generated by the IDE. Also, the XML parser is generated from the corresponding XSD. Since I'm concerned with code which is under the programmer's control, I've excluded all the generated files, resulting in about 20 classes. For each method, I gathered the LOC and CC count (more on "how" later). I used Excel to get the following picture:

As you can see, every method is just a dot in the chart, and the chart has been split into 4 quadrants. I'll discuss the thresholds later, as it's more important to understand the meaning of each quadrant first.
The lower-left quadrant is home for low-LOC, low-CC methods. These are the best methods around: short and simple. Most code ought to be there (as it is in this case).
Moving clockwise, the next one you get (top-left) is for high LOC, low CC methods. Although most coding standards tend to somehow restrict the maximum length of any given method, it's pretty obvious that a long method with a small CC is not that bad. It's "linear" code, likely doing some initialization / configuration. No big deal.
The next quadrant (top-right) is for high LOC, high CC methods. Although this might seem the worst quadrant, it is not. High LOC means an opportunity for simple refactoring (extract method, create class, stuff like that). The code would benefit from changes, but those changes may require relatively little effort (especially if you can use refactoring tools).
The lower-right quadrant is the worst: short functions with high CC (there are none in this case). These are the puzzling functions which can pack a lot of alternative paths into just a few lines. In most cases, it's better to leave them alone (if working) or rewrite them from scratch (if broken). When outsourcing, I usually ask that no code falls in this quadrant.
For the project at hand, 3 methods were in quadrant 3 (top-right), so candidates for refactoring. I took a look, and guess what, it was pretty obvious that those methods were dealing with business concerns inside the GUI. There were clearly 3 domain classes crying to be born (1 shared by the three methods, 1 shared by 2, one used by the remaining). Extracting them led to better code, with little effort. This is a rather ordinary experience: quadrants pinpoint problematic code, then it's up to the programmer/designer to find the best way to fix it (or decide to leave it as it is).
A few words on the thresholds: 10 is a rather generous, but somewhat commonly accepted threshold for CC. The threshold for LOC depends heavily on the overall project quality. I've been accepting a threshold of 100 in quality-challenged projects. As the quality improves (through refactoring / rewriting) we usually lower the threshold. Since this was a new development, I adopted 20 LOC as a rather reasonable threshold.
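If you prefer a script to a spreadsheet, the quadrant logic can be sketched in a few lines of Python. This is just an illustration: the function name, the quadrant labels, and the sample data are made up, and I'm assuming "above the threshold" means strictly greater.

```python
from collections import Counter

CC_THRESHOLD = 10   # the commonly accepted CC threshold
LOC_THRESHOLD = 20  # the LOC threshold adopted for a new development


def quadrant(loc: int, cc: int) -> str:
    """Place a method in one of the four LOC / CC quadrants."""
    high_loc = loc > LOC_THRESHOLD
    high_cc = cc > CC_THRESHOLD
    if not high_loc and not high_cc:
        return "short and simple"
    if high_loc and not high_cc:
        return "long but linear"
    if high_loc and high_cc:
        return "long and complex (refactor)"
    return "short and complex (worst)"


# Hypothetical (method, LOC, CC) triples, not the data from the real project.
methods = [
    ("LoadSettings", 35, 2),
    ("OnSubmit", 60, 17),
    ("Dispose", 4, 1),
    ("ParseFlags", 12, 13),
]

summary = Counter(quadrant(loc, cc) for _, loc, cc in methods)
for name, count in summary.items():
    print(f"{name}: {count}")
```

Lowering LOC_THRESHOLD as the project improves simply moves more dots out of the "short and simple" quadrant, which is exactly the point of tightening it.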
As I said, I use several different metrics. Some can be used in isolation (like code clones), but in most cases I combine them (for instance, code clones vs. code stability gives a better picture of the problem). Coupling and cohesion should also be considered as pairs, never as single numbers, and so on.
Quadrants are not necessarily the only tool: sometimes I also look at the distribution function of a single metric. This is way superior to what too many people tend to do (like looking at the "average CC", which is meaningless). As usual, a tool is useless if we can't use it effectively.
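A tiny example shows why the average is so poor. The sketch below (Python, with made-up CC values for two imaginary projects) builds two sets of methods with the same average CC but a very different distribution; the function name and the bucket size are arbitrary choices of mine.

```python
from collections import Counter
from statistics import mean

# Hypothetical CC values, one per method, for two imaginary projects.
project_a = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
project_b = [1, 1, 1, 1, 1, 1, 1, 1, 1, 41]


def cc_distribution(values, bucket_size=5):
    """Count methods per CC bucket, keyed by the lower bound of the bucket."""
    buckets = Counter((v // bucket_size) * bucket_size for v in values)
    return dict(sorted(buckets.items()))


for name, values in (("A", project_a), ("B", project_b)):
    print(name, "average CC:", mean(values), "distribution:", cc_distribution(values))
```

Both projects have an average CC of 5, yet project B hides a single method with a CC of 41, which only shows up when you look at the distribution.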
Speaking of tools, the project above was in C#, so I used Source Monitor, a good free tool working directly on C# sources. Many .NET tools work on the MSIL instead, and while that may seem like a good idea, in practice it doesn't help much when you want a meaningful LOC count :-).
Source Monitor can export in CSV and XML. Unfortunately, the CSV didn't contain the detailed data I wanted, so I had to use the XML. I wrote a short XSLT file to extract the data I needed in CSV format (I suggest you use the "save as" feature, as unwanted spacing / carriage returns added by browsers may cripple the result). Use it freely: I didn't put a license statement inside, but all [my] source code in this blog can be considered under the BSD license unless otherwise stated.
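If you'd rather not deal with XSLT, the same kind of extraction can be sketched in Python. Beware: the XML layout below (element and attribute names) is entirely made up for illustration and is NOT Source Monitor's actual export format; you would have to adapt the names to the real file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up layout for illustration only; the real export uses different names.
SAMPLE_XML = """
<metrics>
  <class name="OrderForm">
    <method name="LoadOrders" loc="42" cc="14"/>
    <method name="OnClose" loc="6" cc="1"/>
  </class>
</metrics>
"""


def methods_to_csv(xml_text: str) -> str:
    """Flatten per-method metrics into CSV rows: class, method, loc, cc."""
    root = ET.fromstring(xml_text.strip())
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["class", "method", "loc", "cc"])
    for cls in root.iter("class"):
        for method in cls.findall("method"):
            writer.writerow(
                [cls.get("name"), method.get("name"), method.get("loc"), method.get("cc")]
            )
    return out.getvalue()


print(methods_to_csv(SAMPLE_XML))
```

The resulting CSV has one row per method, which is the shape you need to draw the LOC / CC scatter chart.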
Friday, March 07, 2008
Problem frames and the DNC
With this background, it's hardly surprising that I've always found the notion of a Domain Neutral Component quite uncomfortable. It really sounded like an attempt to shoehorn the world into a predefined model, while we should carefully look for the relevant portion of the world we want to represent in our model.
Still, in many cases (especially for a junior analyst) starting with the DNC might be better than starting with a blank page. How could this be? Does it work all the time? If not, when? Honestly, in the past years I haven't spent too much time trying to answer those questions. The DNC was part of my bag of tricks, but I didn't use it often.
Recently, however, I was thinking (once more :-) about colors and UML, and while looking into some of Peter Coad's works for a specific reference, I stumbled on the DNC again. So I thought, maybe I've learnt something in the past few years that could shed some light on the inner quality of the DNC, and its suitability in any (?) given context.
The DNC can be considered as an overengineered "standard" model representing something (an event / moment / interval) happening somewhere (a place) involving one party or more (originally an actor), usually exchanging or dealing with some good (a thing). The party plays a role, hence the later shift from actor to party + role. Indeed, you can start with a very simple model, and "derive" the DNC by following a very reasonable line of reasoning: see From Associations To Domain Neutral Component for the full story.
Of course, in many cases, the DNC might be overengineered. But you can always simplify the unnecessary parts. The real question, however, is when the DNC can give you a head start, and when it won't (context, context, context :-).
That's where Problem Frames Patterns can shed some light. I recommend that you keep the PFP paper at hand while reading what follows.
Consider, for instance, the Commanded Behavior problem frame. Briefly, the problem is stated as:
There is some part of the world whose behavior is to be controlled in accordance with commands issued by an operator. The problem is to build a machine that will accept the operator's commands and impose the control accordingly.
and the frame concerns are:
1. When the Operator issues a Command
2. AND the Machine rejects invalid Commands
3. AND the Machine either ignores it if unviable, OR issues Control Events
4. AND the Control Events change the Controlled Domain
5. ENSURE the changed state meets the Commanded Behavior in every case.
That's not at odds with the DNC. Control Events map nicely to moment-interval; moreover, the text above suggests that multiple events might be issued for a single command (MomentInterval-MomentIntervalDetails). The Control Events change the Controlled Domain. Therefore, they must describe external entities (each probably having a Role) that must somehow influence internal entities (Party, having a Role, or Places, having a Role).
Using the Email Client example, "Email Retrieval" is an event, composed of individual retrieval events (one for each email message). Each Message is a Thing, although with a dubious Role. Retrieval needs (at least) an Account, which is a Party playing a specific Role (Receiver). Retrieval takes place on a specific Server, playing a Role (POP3 or IMAP server). Not so bad.
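The mapping can be sketched in code. What follows is a deliberately simplified rendition of the DNC in Python (roles are plain attributes here rather than full-blown archetypes), with class names taken from the archetypes mentioned above; the account, the server, and the message subjects are made-up sample data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    name: str


@dataclass
class Party:
    name: str
    role: Role


@dataclass
class Place:
    name: str
    role: Role


@dataclass
class Thing:
    description: str


@dataclass
class MomentIntervalDetail:
    """One individual retrieval event, tied to a single Thing."""
    thing: Thing


@dataclass
class MomentInterval:
    """The overall event: who, where, and the detail events it is made of."""
    party: Party
    place: Place
    details: List[MomentIntervalDetail] = field(default_factory=list)


# "Email Retrieval" expressed with the simplified archetypes above.
account = Party("me@example.org", Role("Receiver"))
server = Place("mail.example.org", Role("POP3 server"))
retrieval = MomentInterval(account, server)
for subject in ("Hello", "Invoice"):
    retrieval.details.append(MomentIntervalDetail(Thing(subject)))

print(len(retrieval.details), "messages retrieved by", retrieval.party.name)
```

Note how the Message ends up as a bare Thing without a Role, which reflects the "dubious Role" mentioned above.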
What if we look into another problem frame? Let's try Transformation:
There are some given inputs which must be transformed to give certain required outputs. The output data must be in a particular format, and it must be derived from the input data according to certain rules. The problem is to build a machine that will produce the required outputs from the inputs.
Concerns:
1. BY traversing the input in sequence, and simultaneously traversing the outputs in sequence
2. AND finding values in the input domain, and creating values in the output domain
3. AND that the input values produce the correct output values
4. ENSURES the I/O relation is satisfied.
Hmmm. Doesn't map so nicely, but the text is really too abstract. Let's try the actual problem: an HTML email to be converted to plain text to be shown on a limited device (I added some context to the equally abstract problem described in the original paper). Well, there is hardly a dominant MomentInterval here. Hardly a party, place, thing triad gravitating around the central MI. Hardly any value in adopting the DNC as a starting point. What can be helpful here? Concepts from grammars, taxonomies, language theory. We're basically modeling a translator, and language theory will give you the head start.
So here it is. The DNC is an interesting concept because it maps nicely to some recurring problem frames (if you've got time to spare, you may want to investigate which frames are a good fit for the DNC). Some problem frames, however, just don't match the DNC. It's not just about individual problems: it's about a whole class of problems, all those within the mismatched problem frames.
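To see how little the DNC has to offer here, consider a minimal sketch of such a translator in Python, built on the standard html.parser module. The class name, the set of tags treated as line breaks, and the whitespace rules are my own assumptions; a real converter would also have to deal with links, lists, styles, scripts and so on.

```python
from html.parser import HTMLParser


class PlainTextTranslator(HTMLParser):
    """Traverse the input tokens in sequence and emit plain-text output."""

    # Tags assumed to start a new line in the output (an arbitrary choice).
    BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    translator = PlainTextTranslator()
    translator.feed(html)
    translator.close()
    return translator.text()


print(html_to_text("<p>Hello <b>world</b></p><p>See you<br>soon</p>"))
```

Everything in there is about tokens, rules and output format: no party, no place, no moment-interval in sight.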
For me, this is actually good news. Once we know which problem frames map nicely to the DNC, I would say the DNC itself becomes a more powerful tool, one that can be applied wisely and not blindly.
Winding down: since it all began with colors, it's interesting to see how people used the DNC to reason about more general issues. For instance, in Whole Part Relationships in Color Models, David J. Anderson starts with the DNC and ends up recommending that we avoid some whole-part colors, like a green whole with a yellow part. There is probably more to investigate along those lines. Next time I get back to my idea of coloring associations and dependencies, I'll give it a deeper thought.
Labels: analysis, article reference
Monday, March 03, 2008
Domain Neutral Component
I mentioned the Domain Neutral Component when answering a comment to my post on the Cognitive Dimensions of Notations. That reminded me I've never been a fan of the DNC, exactly because of its context-independent ambitions. Recently, however, I've reconsidered the DNC in terms of Problem Frames.
It's kinda late, so more on this soon, I hope...