
How Management and Tech People Fool Themselves with Measurement
The software industry seems to have an obsession with metrics and measurement. We want to quantify everything. Once upon a time everything was about counting lines of code (KLOC). Managers ran around asking, “How many lines of code have you written? How many bugs per KLOC are there? What is the size of the project in KLOC?” and so on. When KLOC fell out of fashion, we started counting everything else that was left to quantify, and managers started asking, “How many requirements are there? What is the number of test cases? How many bugs did you find? What is the defect density? How many test cases have passed? What is the requirement coverage?” and so on.
The obsession with quantification is often an influence from the manufacturing industry, where the things being counted are physical and visible to the eye. Counting things in the software industry, however, appears mostly to have helped consultants who sell the premise that charts, graphs and measures based on invalid constructs are meaningful. The problem is often the misinterpretation of these metrics.
At one of my former workplaces, the testing team used to generate a report containing a metric called the “Quality Index” (QI). The objective of this measure, as the team explained it to me, was to give some indication of quality and performance. The managers needed an indicator to assess development performance (for example, are there any issues with understanding, communication, requirements, process, and so forth?). The QI was treated as a yardstick: every time a new build was tested, QI would tell managers how good (or bad) the build was.
However, there is a problem with yardsticks: they can’t be used for measuring something that is subject to interpretation or is subjective in nature. They are often good heuristics and can be useful as a first-order metric, but mostly for physical measurement only.
A metric like the quality index may be used as an indicator to ask, “Is there a problem here?” In that case it becomes a heuristic, because it is fallible: it may help you find a solution, but it can never guarantee one. Using heuristics to assess your key employees’ performance is dangerous. You may lose their trust and respect, and they may eventually leave (unless that is what you actually want).
Metrics are a powerful tool, but they can always be misinterpreted and skewed to show favorable (or unfavorable) results. Without context they are very much meaningless. Quantitative measurement can lead to a false sense of control: it creates an illusion that we understand and control something because we can count it. Someone in an online forum about quality claimed that if you can’t measure it, you can’t have it. I guess this person was echoing Tom DeMarco, who wrote in Controlling Software Projects: Management, Measurement and Estimation (1982, p. 3) that you can’t control what you can’t measure. In a recent discussion Michael Bolton reminded me that a few years ago Tom renounced the opinion he had long held. His recent views can be read here.
As I mentioned earlier, metrics are often used as a sales tool by consultants to gain more business. I was once invited to dinner at a 5-star hotel by the testing group of a large bank in Australia. I wasn’t aware that the dinner was actually hosted by the bank’s testing vendor, a huge I.T. outsourcing firm. The vendor’s introductory presentation included data on the efficiency they had brought to their banking client: automating 3,500 test cases, reducing test preparation time by 70%, and so forth. As James Christie said in his blog post, “100 is bigger than 10. 10,000 is pretty impressive, and 100,000 is satisfyingly humongous. You might not really understand what’s happening, but when you face up to senior management and tell them that you’re managing thousands of things, well, they’ve got to be impressed.” This vendor certainly impressed their naïve client.
What is this beast called Quality Index?
The QI used by the teams I worked with had this definition:
The QI at Company X has been defined as a measure of defect density, such that the percentage of defects as a proportion of the total number of test cases executed is defined.
This is a measure of Company X’s software quality delivery to testing as opposed to Company X production quality.
Lower QI is better.
The report used to have statements like:
171 test cases executed successfully and 93 defects detected, providing a Quality Index (QI) = 54% (this is within the 1 – Unsatisfactory level).
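For the curious, here is a minimal sketch of the arithmetic behind such a statement, assuming (as the definition suggests, though the report never spelled it out) that QI is simply defects as a percentage of test cases executed:

```python
# A minimal sketch of the QI arithmetic as I understood it from the report.
# The exact formula was never published; this is my reconstruction.

def quality_index(defects: int, test_cases_executed: int) -> float:
    """Defects as a percentage of test cases executed. Lower is 'better'."""
    return defects / test_cases_executed * 100

# The numbers quoted in the report statement above:
qi = quality_index(defects=93, test_cases_executed=171)
print(f"QI = {qi:.0f}%")  # -> QI = 54%  (the '1 - Unsatisfactory' band)
```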
There were graphs, like the one below, charting how good (or bad) the quality of each build had been:
“So what’s the problem here?” you may ask.
This seems a valid question, especially when you have been told about these indices and presented with data that seemed accurate. We see such indices on TV every day, where some eminent economist presents his view on the economy, predicts which way the markets will go, and later convincingly explains why the markets did not go the way he predicted. The simplest answer is that no one, including Nobel Prize-winning economists, can predict the future. Humans simply do not have the ability to predict. You may say that you can predict that you will read the next word of this post – but even that is unpredictable. What you would actually mean is, “I predict that I might be able to read the next word, provided the boss doesn’t call right at that moment, or the monitor doesn’t lose power, or the sky doesn’t fall, or…!” The list can go on.
I studied statistics as one of the subjects during my Master’s degree. While that study did not make me an expert in statistics, it did improve my knowledge of the subject, and I think it will help us examine what the problem with this quality index is. Let’s start by looking at definitions.
What is “quality”?
Jerry Weinberg defines quality as “value to some person(s)”. James Bach and Michael Bolton added ‘…who matter.’ to this definition. So the definition that I like is, “Quality is value to some person(s) who matter”.
Michael Bolton suggests that decisions about quality are always political and emotional; made by people with the power to make them; made with the desire to appear rational and yet ultimately based on how those people feel.
Let’s have a look at the overall definition once again. “The QI at Company X has been defined as a measure of defect density, such that the percentage of defects as a proportion of the total number of test cases executed is defined.”
What catches our attention is the term “defect” (although I prefer calling them bugs).
Is there a point in defining defect density?
James Bach says that a bug is anything that threatens the value of a product – something that bugs someone whose opinion matters (this last part was added by Michael Bolton). This definition alone makes the benefits that someone may be seeking from a quality index highly questionable. Just as we cannot predict the future, humans do not have the ability to find and explore all the bugs that might be in a system. Michael Bolton notes that “the idea of a ‘bug’ is subject to the Relative Rule, meaning a bug is not a thing that exists in the world; it doesn’t have a tangible form. However, a bug is a relationship between the product and some person. A bug is a threat to the value of the product to some person. The notion of a bug might be shared among many people, or it might be exclusive to some person.” So there may be little point in quantifying something that has no physical existence. Yet once people start counting bugs, they start falling in love with them; they feel they own these conceptual, non-physical things. Psychology calls this reification: treating an abstraction as though it were a concrete, physical object. Is it worthwhile defining the density of an abstract concept, of something that is subject to the Relative Rule?
The wild world of test cases
The definition of QI also talks about deriving a percentage as a ratio of the number of test cases. What is your definition of a test case? My team stopped writing lengthy, step-by-step test cases or scripts, in a deliberate move away from the idea that testers should develop test cases from a requirements document. What I have observed is that many testers take a requirements document and create a number of test cases for each requirement – positive, negative, sedative, nonsense-itive and so on. These testers wrongly believe that such test cases provide complete coverage¹ of the requirements. They create a Requirement Traceability Matrix (RTM), usually a table with test cases on one axis and requirements on the other: if every requirement has a mapped test case, then coverage is declared complete. Managers believe that these detailed test cases help their testers perform complete testing, and get upset when the coverage metrics are not there or the RTM is missing.
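To make that mechanics concrete, here is a toy sketch of such an RTM check; the requirement and test-case IDs are made up for illustration. Note how the metric reports complete coverage the moment every requirement has at least one mapped test case, saying nothing about what those test cases actually exercise:

```python
# A toy Requirement Traceability Matrix: each requirement is mapped to the
# test cases that claim to cover it. All IDs here are hypothetical.
rtm = {
    "REQ-001": ["TC-001", "TC-002"],  # a "positive" and a "negative" case
    "REQ-002": ["TC-003"],
    "REQ-003": ["TC-004"],
}

covered = sum(1 for test_cases in rtm.values() if test_cases)
coverage = covered / len(rtm) * 100
print(f"Requirement coverage: {coverage:.0f}%")  # -> 100%
# 100% -- yet the number is silent about which interpretation of each
# requirement those test cases checked, or whether they check anything at all.
```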
What they don’t realize is that when they say test, what they really mean is check. Moreover, a single requirement may carry more than one assumption, proposition or assertion. A business analyst who writes down the business stakeholders’ requirements may interpret them entirely differently from the stakeholder herself; a developer may interpret them differently again; and a tester may interpret and intersect the requirements in yet another fashion, not understood or agreed by the others. Hence writing one test case per requirement, or several test cases per requirement, is simply incorrect. And if that count is incorrect, then any ratio based on it is wrong too – which makes the concept of a Quality Index meaningless.
So what do you think of this claim now:
This is a measure of Company X’s software quality delivery to testing as opposed to Company X production quality.
Lower QI is better.
The QI as it is defined is not a measure of anything of value about the software or its quality – and it is easily gamed: just split test cases into smaller test cases and you immediately reduce the QI for exactly the same piece of software under test with exactly the same number of defects! The definition even tells on itself. “Lower QI is better” means you are striving to write test cases that don’t find defects – why would you do that? Ethically, we should not.
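To see how easily it is gamed, consider this small sketch, using the numbers from the report above (the same hypothetical build, the same 93 defects):

```python
def quality_index(defects, test_cases_executed):
    return defects / test_cases_executed * 100

defects = 93  # the same bugs, in the same build

print(f"{quality_index(defects, 171):.0f}%")  # -> 54%: 'Unsatisfactory'
print(f"{quality_index(defects, 342):.0f}%")  # -> 27%: split every test case
                                              #    in two and the 'quality'
                                              #    improves overnight
```

Nothing about the software changed; only the denominator did.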
As skilled craftsmen, we should not waste time counting and showing percentages when we could better spend that time talking to our stakeholders about the things we see, the things that interest us, the things that look suspicious, the risks we observe, and the overall quality as we perceive it.
So, the next time you are counting test cases or bugs, working on an RTM or looking at percentages in reports, ask yourself, “Am I simply counting and producing statistics, or am I helping our organization deliver a quality product with the lowest risk of failure?”
¹ Further reading:
http://developsense.com/articles/2008-09-GotYouCovered.pdf
http://developsense.com/articles/2008-10-CoverOrDiscover.pdf
http://developsense.com/articles/2008-11-AMapByAnyOtherName.pdf