TTSome Result

WP

Experiments and Results

1. Experiment with Reversi

In this experiment I want to measure how helpful T2 is for aiding programmers.

As a bit of background, one of the programming assignments our first year students get is to write a Reversi (Othello) game in Java. In the heart of this game is the part that implements the game's rules. It consists of a data structure representing the game board, and the methods implementing basic operations e.g. to determine if a move is possible, to execute a move, to determine if the game is over, who wins, etc. This part is also very error prone to code.

This simple game has a lot of possible sequence of moves and game states. Testing it by hands or by a tool is very challenging. Note that there is also no simple way to express legal game states with a class invariant.

The experiment

Here is how the experiment is setup. I take the role of the programmer. The task is to write a class implementing the game board along with two basic methods:

  1. Method move(x,y). If setting a piece at (x,y) would be legal for the player currently in-turn, this move will be effected. The turn will be switched to the other player, and the method returns true. Else the method does nothing, and returns false.
  2. Method gameOver(). It checks if no more move is possible. If so it determines who wins (or draw).

I wrote several versions of ReversiCore:

  • Core-1. This is coded with more effort. I use high level data structures (Collection) e.g. to pass set of values. This results in a version which is much less error prone. This one is tested with T2.
  • Core-2. By the time our students have to do the Reversi assignment they have not learned Collections yet; so they have to do it with only arrays. I do this as well in Core-2. This version is much more error prone, and is also a lot harder to debug. After the initial version, two parallel branches are made. Core-2M is obtained by fixing Core-2 through manual testing. Core-2T is obtained by fixing Core-2 through T2 testing. They are only fixed up to no more faults are discovered by the prepared tests (in manual testing), or after repeated test runs do not expose faults (in T2 testing). Note that this does not mean that after the fix all faults are gone. We just fix those that can be exposed by either the test cases (manual) or the specifications (T2).

Some data on these versions:

  LOC #Instr #Method Ave. Cyclo. Max. Cyclo.
Core-1 (including in-code specs) 117 734 11 7.3 14
Core-2T (including in-code specs) 113 670 12 6.5 14
Core-2M 89 541 10 4.9 8

Observation

The table below shows effort spent on the different versions. Effort is expressed in USD cost. The cost is fixed at 50 USD/hour.

  Core-1 (tested with T2) Core-2T (tested with T2) Core-2M (manual tested)
Task Cost (USD) Effort in % Cost (USD) Effort in % Cost (USD) Effort in %
Coding 79 70% 43 45% 43 36%
Testing and fixing(*) 34 30% 53 55% 75 64%
Total 113   95   118  

(*) in the manual testing this includes writing the tests; in the T2 testing this includes preparing T2 (which only takes few minutes) and writing specifications.

Faults found:

  #faults found
Core-1 4
Core-2T 5
Core-2M 4

The faults found in Core-2T and Core-2M are complementary. So, manual testing failed to expose all faults; likewise automated testing with T2 also didn't expose all faults.

I also applied the same set of hand written tests used on Core-2M on Core-1 but they didn't expose any fault.

Cost per bug

The actual cost per bug is hard to infer. For that we need to develop each version to a near perfect version, and then calculate the cost. But I didn't do this. So we can only extrapolate based on the data we have:

  Cost per fault
T2 testing 7.1 - 12.6 USD per fault found
Manual testing 18.8 per fault found

A substantial part of the cost goes to locating the source of the fault. This is definitely the case in manual testing. In T2 testing writing specifications also takes substantial effort.

Conclusion

T2 can find faults that are missed by manual testing. However this requires some effort invested in writing specifications. Still the overall cost of T2 testing, including writting specifications, does not exceed the cost of manual testing.

So, the best way to use T2 is to use it in tandem with manual testing. This appears to double the cost of unit testing, but the combination is also more powerful in exposing bugs, so as to prevent them to surface at much later development stage, where the cost of fixing them will multiply. An acedemic survey by Ellims et al indicates that this multiplication is between 3x - 7x, whereas a survey by Parasoft indicates the average cost of (using our 50 USD/h cost) 600 USD/bug; so this is a multiplication by a factor of 30.

2. Experiment with Coverage Measurement

In this experiment we measure the coverage delivered by T2 on the following example small to medium-sized classes. Note that high coverage does not mean we have purged most bugs! But if T2 delivers low coverage, it would be bad.

In average the classes we measure are about 135 LOCs. As comparison, the average LOCs per Java class is about 90 [ Kaczmarek and Kucharski ]. Here they are (in increasing LOCs):

Class About #Method Aver. Cyclomatic Max. Cyclomatic #Instruction LOC
U2.T2.examples.SortedList self invented example 7 3.6 6 212 32
U2.T2.Pool T2's object pool 10 2.9 7 340 62
U2.T2.examples.BinarySearchTree self invented example 18 2.4 7 359 65
U2.T2.Obj.Show T2's utility to print the state of an object 13 4.4 33 844 143
net.metanotion.util.skiplist.SkipSpan part of Metanotion's mini database engine 17 3.3 12 946 176
java.util.LinkedList Java standard library of linked list 52 2.0 6 1188 330

Aver Cyclomatic and Max cyclomatic are respectively the average and the maximum of the cyclomatic count over the methods of the class in question.

In terms of cyclomatic complexity, these contain non trivial methods. As comparison, only less than 2% of all methods in popular software (Firefox, Apache, My-Sql) have cyclomatic count higher than 10 [ RSM Metrics of Popular Software Programs ].

How are they measured?

We run T2 on each of these examples. The number of generated trace steps are fixed to 5000. Classes from U2.T2 are given specifications. The other two classes are left untouched, so we basically only check for internal errors. In some cases, e.g. U2.T2.Pool, we do not found fault. In some cases, e.g. SortedList, we injected faults; T2 could find them. In other cases, e.g. Java's LinkedList, we simply have no specification. We found lots of faults, but more likely they are false positives due to the lack of specifications. Anyway, finding faults is not what we want observe here. It is coverage that we want to measure.

Coverage are measured in terms of block and branch coverage. We use Emma and Corbetura.

Result

All are based on 5000 total trace steps.

Class LOC Aver. Cyclomatic Max. Cyclomatic Block Coverage (%) Branch Coverage (%) Time (ms)
examples.SortedList 32 3.6 6 98 85 203
U2.T2.Pool 62 2.9 7 91 69 515
examples.BinarySearchTree 65 2.4 7 94 72 391
U2.T2.Obj.Show 143 4.4 33 84 75 1047
net.metanotion.util.skiplist.SkipSpan 176 3.3 12 72 70 735
java.util.LinkedList 330 2.0 6 88 85 656

So good things to notice:

  • T2 can pump out thousands calls in a second or less. So it lives up to its claim of being an almost interactive testing tool.
  • The obtained coverage is also quite good, considering the minimum effort to setup the testing, and considering that some classes have methods with quite high cyclomatic counts.

Less good things to notice: the branch coverage is lower. This is because T2 has no mechanism to keep track of which branches have been taken, and thus have no way to direct coverage over branches. To do it requires some sophisticated techniques to be added to T2; something which is high in our Future Work list smile