Experiments and Results
1. Experiment with Reversi
In this experiment I want to measure how helpful T2 is in aiding programmers.
As a bit of background, one of the programming assignments our first-year
students get is to write a Reversi (Othello) game in Java. At the heart of
this game is the part that implements the game's rules. It consists of a
data structure representing the game board, and the methods implementing
basic operations, e.g. to determine if a move is possible, to execute a
move, to determine if the game is over, who wins, etc. This part is also
very error prone to code.
This simple game has a huge number of possible move sequences and game
states. Testing it by hand or by a tool is very challenging. Note that
there is also no simple way to express legal game states with a class invariant.
Here is how the experiment is set up. I take the role of the
programmer. The task is to write a class implementing the game board,
along with two basic methods:
- Method move(x,y). If setting a piece at (x,y) is legal for the player currently in turn, the move is carried out, the turn is switched to the other player, and the method returns true. Otherwise the method does nothing and returns false.
- Method gameOver(). It checks whether no more moves are possible; if so, it determines who wins (or whether the game is a draw). A sketch of such a class is given after this list.
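To make the task concrete, here is a minimal sketch of such a class. The name ReversiCore is taken from the text, but the board representation, field names, and helper methods are my own illustrative assumptions, not the actual experiment code:

```java
/** Illustrative sketch only; the representation and helpers are assumed. */
public class ReversiCore {
    static final int SIZE = 8;
    int[][] board = new int[SIZE][SIZE]; // 0 = empty, 1/2 = the two players
    int turn = 1;                        // player currently in turn

    /**
     * If placing a piece at (x,y) is legal for the player in turn,
     * carry out the move (including flipping captured pieces),
     * switch the turn, and return true; else do nothing and return false.
     */
    public boolean move(int x, int y) {
        if (!isLegal(turn, x, y)) return false;
        place(turn, x, y);
        turn = 3 - turn; // switch to the other player
        return true;
    }

    /**
     * Return true if neither player has a legal move left.
     * (The actual assignment also determines who wins, or whether the
     * game is a draw; that part is omitted in this sketch.)
     */
    public boolean gameOver() {
        for (int x = 0; x < SIZE; x++)
            for (int y = 0; y < SIZE; y++)
                if (isLegal(1, x, y) || isLegal(2, x, y)) return false;
        return true;
    }

    // Helpers, left as stubs here:
    boolean isLegal(int player, int x, int y) { return false; /* check the Reversi capture rule */ }
    void place(int player, int x, int y) { /* put the piece and flip captured lines */ }
}
```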
I wrote several versions of ReversiCore:
- Core-1. This is coded with more effort. I use high-level data structures (Collections), e.g. to pass sets of values. This results in a version which is much less error prone. This one is tested with T2.
- Core-2. By the time our students have to do the Reversi assignment they have not learned Collections yet, so they have to do it with only arrays. I do this as well in Core-2. This version is much more error prone, and is also a lot harder to debug. After the initial version, two parallel branches are made: Core-2M is obtained by fixing Core-2 through manual testing, and Core-2T by fixing Core-2 through T2 testing. They are only fixed up to the point where the prepared tests discover no more faults (in manual testing), or repeated test runs expose no more faults (in T2 testing). Note that this does not mean that after the fix all faults are gone; we just fix those that can be exposed by either the test cases (manual) or the specifications (T2, sketched below).
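For flavour, the sketch below shows roughly what an in-code specification could look like. I write the class invariant as a boolean classinv() method; treat that name as an assumption about T2's convention rather than a confirmed API. Note how such an invariant can only approximate "legal game state": it constrains cell values and the turn, but full legality (that the board is reachable by legal play) has no simple formulation, which is the difficulty mentioned earlier:

```java
/** Sketch only; the classinv() convention and the checks are assumptions. */
public class ReversiCore {
    int[][] board = new int[8][8]; // 0 = empty, 1/2 = the two players
    int turn = 1;

    // A class invariant, checked after every call the testing tool generates.
    // It only approximates legality: every cell holds a valid value and it is
    // some player's turn; reachability by legal play is not expressed.
    protected boolean classinv() {
        if (turn != 1 && turn != 2) return false;
        for (int[] row : board)
            for (int cell : row)
                if (cell < 0 || cell > 2) return false;
        return true;
    }
}
```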
Some data on these versions:
| Version | LOC | #Instructions | #Methods | Avg. Cyclomatic | Max. Cyclomatic |
|---|---|---|---|---|---|
| Core-1 (including in-code specs) | 117 | 734 | 11 | 7.3 | 14 |
| Core-2T (including in-code specs) | 113 | 670 | 12 | 6.5 | 14 |
| Core-2M | 89 | 541 | 10 | 4.9 | 8 |
The table below shows the effort spent on the different versions. Effort
is expressed as cost in USD, at a fixed rate of 50 USD/hour.
| Task | Core-1 (tested with T2) | Core-2T (tested with T2) | Core-2M (manually tested) |
|---|---|---|---|
| Coding | 79 USD (70%) | 43 USD (45%) | 43 USD (36%) |
| Testing and fixing(*) | 34 USD (30%) | 53 USD (55%) | 75 USD (64%) |
| Total | 113 USD | 95 USD | 118 USD |
(*) In the manual testing this includes writing the tests; in the T2
testing this includes preparing T2 (which only takes a few minutes) and
writing the in-code specifications.
| Version | #Faults found |
|---|---|
| Core-1 | 4 |
| Core-2T | 5 |
| Core-2M | 4 |
The faults found in Core-2T and Core-2M are complementary: manual
testing failed to expose some faults that T2 found, and likewise
automated testing with T2 missed some faults that manual testing found.
I also applied the same set of hand-written tests used on Core-2M to
Core-1, but they did not expose any fault.
Cost per bug
The actual cost per bug is hard to infer. For that we would need to
develop each version into a near-perfect version, and then calculate the
cost; I did not do this. So we can only extrapolate from the data we
have:
| Testing method | Cost per fault found |
|---|---|
| T2 testing | 7.1 - 12.6 USD |
| Manual testing | 18.8 USD |

(The manual figure follows from the effort table: 75 USD of testing-and-fixing effort divided by the 4 faults found is about 18.8 USD per fault.)
A substantial part of the cost goes to locating the source of a
fault. This is definitely the case in manual testing. In T2 testing,
writing specifications also takes substantial effort.
T2 can find faults that are missed by manual testing. However, this
requires some effort invested in writing specifications. Still, the
overall cost of T2 testing, including writing the specifications, does
not exceed the cost of manual testing.
So, the best way to use T2 is in tandem with manual testing. This
appears to double the cost of unit testing, but the combination is also
more powerful in exposing bugs, preventing them from surfacing at a much
later development stage, where the cost of fixing them multiplies. An
academic survey by Ellims et al. indicates that this multiplication is
between 3x and 7x, whereas a survey by Parasoft indicates an average
cost of 600 USD per bug (converted using our rate of 50 USD/hour);
relative to our manual cost per fault, that is a multiplication by a
factor of about 30.
2. Experiment with Coverage Measurement
In this experiment we measure the coverage delivered by T2 on the small to medium-sized example classes listed below. Note that high coverage does not mean we have purged most bugs! But if T2 delivered low coverage, that would be a bad sign.
On average the classes we measure are about 135 LOC. For comparison, the average LOC per Java class is about 90 [Kaczmarek and Kucharski]. Here they are (in increasing LOC):
| Class | #Methods | Avg. Cyclomatic | Max. Cyclomatic | #Instructions | LOC |
|---|---|---|---|---|---|
| self-invented example | 7 | 3.6 | 6 | 212 | 32 |
| T2's object pool | 10 | 2.9 | 7 | 340 | 62 |
| self-invented example | 18 | 2.4 | 7 | 359 | 65 |
| T2's utility to print the state of an object | 13 | 4.4 | 33 | 844 | 143 |
| part of Metanotion's mini database engine | 17 | 3.3 | 12 | 946 | 176 |
| Java standard library's linked list | 52 | 2.0 | 6 | 1188 | 330 |
Avg. Cyclomatic and Max. Cyclomatic are respectively the average and the
maximum of the cyclomatic count over the methods of the class in question.
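As a quick illustration of the metric (my own made-up example): the cyclomatic count of a method is, roughly, the number of its decision points plus one. The method below therefore has a cyclomatic count of 3:

```java
// 1 (entry) + 2 decision points (the loop condition and the 'if') = 3.
static int firstPositive(int[] xs) {
    for (int x : xs) {          // decision point 1
        if (x > 0) return x;    // decision point 2
    }
    return -1;
}
```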
In terms of cyclomatic complexity, these classes contain non-trivial methods. For comparison, less than 2% of all methods in popular software (Firefox, Apache, MySQL) have a cyclomatic count higher than 10 [RSM Metrics of Popular Software Programs].
How are they measured?
We run T2 on each of these examples. The number of generated trace steps is fixed at 5000. The classes from U2.T2 are given specifications; the other two classes are left untouched, so for those we basically only check for internal errors. In some cases, e.g. U2.T2.Pool, we found no fault. In some cases, e.g. SortedList, we injected faults; T2 could find them. In other cases, e.g. Java's LinkedList, we simply have no specification: we found lots of faults, but most likely these are false positives due to the lack of specifications. In any case, finding faults is not what we want to observe here; it is coverage
that we want to measure.
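To give an idea of the fault-injection part of the methodology, here is a hedged reconstruction; the actual SortedList class and the faults injected in the experiment are not shown in the text, so everything below is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only; not the actual SortedList from the experiment. */
public class SortedList {
    private final List<Integer> elems = new ArrayList<>();

    public void insert(int x) {
        int i = 0;
        // INJECTED FAULT: the comparison is flipped; it should be
        // elems.get(i) < x. With the fault, inserting 2 into the
        // list [1,3] yields the unsorted list [2,1,3].
        while (i < elems.size() && elems.get(i) > x) i++;
        elems.add(i, x);
    }

    // Invariant: the list is ascending. A random test run that produces
    // an unsorted state exposes the injected fault.
    protected boolean classinv() {
        for (int i = 0; i + 1 < elems.size(); i++)
            if (elems.get(i) > elems.get(i + 1)) return false;
        return true;
    }
}
```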
Coverage is measured in terms of block and branch coverage, using Emma
and Cobertura. All numbers are based on 5000 total trace steps.
Good things to notice:
- T2 can pump out thousands of calls in a second or less, so it lives up to its claim of being an almost-interactive testing tool.
- The obtained coverage is quite good, considering the minimal effort needed to set up the testing, and considering that some classes have methods with quite high cyclomatic counts.
A less good thing to notice: the branch coverage is lower. This is because T2 has no mechanism to keep track of which branches have been taken, and thus has no way to direct coverage towards branches. Doing so requires some sophisticated techniques to be added to T2, something which is high on our Future Work list.