IEEE International Conference on Data Mining
ICDM 2008 Data Mining Contest:
Radioxenon monitoring for verification of the Comprehensive nuclear-Test-Ban Treaty
Winner of the Contest Crown: The team of
Wei Fan (1), ErHeng Zhong (2), Sihong Xie (2), Yuzhao Huang (2), Kun Zhang (3), Jing Peng (4), and Jiangtao Ren (1)
1) IBM T. J. Watson Research Center
2) Sun Yat-Sen University
3) Xavier University of Louisiana
4) Montclair State University
Winner of the most muscular: Zhongfeng Zhang
Institute of Automation, Chinese Academy of Sciences
The Kangaroo Prize was postponed/cancelled due to an anomaly in the data set.
The ICDM Data Mining Contest 2008 is now officially over. However, we do encourage you to participate in the tasks, or otherwise explore and mine the data. If you do plan on publishing any results, please let us know.
The IEEE ICDM 2008 Data Mining Contest is, simply put, about keeping the world safe
using data mining. This contest is about developing and testing data mining techniques
to verify worldwide compliance with the global ban on nuclear tests.
Such tests can be detected by measuring the amount of special xenon isotopes.
Obviously, it's not just that simple; these isotopes are also emitted during various
The organisers of this year's contest are Health Canada (in particular Kurt Ungar, Trevor Stocki, Ian Hoffman and Jing Yi), Nathalie Japkowicz of the University of Ottawa, and Arno Siebes of Universiteit Utrecht. General questions should be addressed to Arno Siebes, questions regarding the AUC calculator software should be addressed to Trevor Stocki and, last but not least, the submission forms should be sent to Jing Yi. Details regarding the submission deadlines are at the bottom of this page. The papers describing the best entries will be available as a handout accompanying the conference proceedings and will also be made available on this website.
General Description of the Problem
Compliance verification of the Comprehensive Nuclear-Test-Ban Treaty (CTBT), when the treaty enters into force, will employ four remote sensing technologies to detect nuclear explosions. Only radionuclide detection can unequivocally establish that an explosion was due to a nuclear detonation. Radioactive noble gases (the isotopes Xe-131m, Xe-133m, Xe-133, and Xe-135) are sampled and measured in a procedure called radionuclide monitoring. Different relative combinations of these isotopes correspond to different signatures that can be mapped to distinct sources (such as nuclear power plants, medical isotope production facilities, or various types of weapons).
The problem of attributing a specific observation of airborne concentrations of radioxenon to an explosion is twofold. Firstly, in the first few weeks after an explosion the four isotopes are expected to be released in "fingerprint" relative concentrations quite distinct from those of other background sources. Since the CTBT stations are not located at the source of the explosion, the radioxenon is detected at a location which can be well over a thousand kilometres away. This atmospheric transport process can take weeks, which can increase the complexity of the signature. Secondly, one can never observe radioxenon emitted purely from an explosion source, only admixtures of this gas with the radioxenon released from all background sources. These two points constitute an interesting data mining problem for the Preparatory Commission for the Comprehensive Nuclear-Test-Ban Treaty Organization (CTBTO).
Description of the dataset to be used
Radioxenon measurements from four to five CTBTO monitoring sites will be provided. These will comprise a few hundred to a few thousand sets of observations of the four species for each site. A synthesized set of explosion observations at these same sites will be added to actual radioxenon concentrations caused by background sources. The data sets are composed of two classes, Background (B), and Background plus Explosion (B+E). Each class has a set of quadruplets representing the four activity concentrations of Xe-131m, Xe-133m, Xe-133, and Xe-135 for a given air sample.
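The data layout described above can be sketched in a few lines of Python. This is only an illustration: the names Sample, ISOTOPES and add_explosion are hypothetical, the numbers are made up, and element-wise addition is an assumed reading of how the synthesized explosion observations are added to the background concentrations. The actual file format is defined by the contest data sets, not by this sketch.

```python
from dataclasses import dataclass
from typing import Tuple

# Order of the four activity concentrations in every quadruplet.
ISOTOPES = ("Xe-131m", "Xe-133m", "Xe-133", "Xe-135")

Quadruplet = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Sample:
    """One air sample: station of origin, four concentrations, class label."""
    station: str          # hypothetical station identifier
    activity: Quadruplet  # same order as ISOTOPES
    label: str            # "B" or "B+E"


def add_explosion(background: Sample, explosion: Quadruplet) -> Sample:
    """Assumed construction of a B+E sample: add a synthesized explosion
    contribution to a measured background sample, isotope by isotope."""
    mixed = tuple(b + e for b, e in zip(background.activity, explosion))
    return Sample(background.station, mixed, "B+E")


# Made-up numbers, for illustration only.
bg = Sample("STN1", (0.5, 0.0, 1.5, 0.25), "B")
mixed = add_explosion(bg, (0.0, 0.5, 2.0, 1.0))
```

Because an explosion signal is only ever seen on top of the background, a classifier works with the mixed quadruplet and never with the explosion contribution alone.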
We will be issuing labelled data sets containing both classes during the first phase of the competition, while teams develop a classification method appropriate for this task. In a second phase, we will issue a new data set also containing data from both classes, but we will withhold the label. This testing data set will be used for our final evaluation.
Description of the computational tasks
Two versions of the data sets will be provided. The first will have each datum described according to station of origin, a unique randomly assigned tracking number allowing the contest evaluators to trace the datum back to the original scenario of explosion release, and whether it is Background or Background plus Explosion. The second version will have each datum described by station of origin, using the same stations as the first data set, and a unique randomly assigned tracking number allowing the contest evaluators to trace the datum back to the original scenario of explosion release. The second set of data will contain cases of B or B+E, but the class will be unknown to the contestants. The first version of the data will be employed in Tasks 1 and 2. The second version of the data will be employed in Task 3.
Task 1: The first task is to classify the results as Background or Explosion as accurately as possible over the entire set of stations provided, using one classifier. Contestants may combine data as they see fit. They may separately tune classifier parameters for each station, but they may not have separate classifier parameter types for each station, nor separate classifiers. Contestants can report on more than one classifier for this task.
Task 2: In the second task, conversely, the contestant is requested to identify an optimal algorithm for each station given.
Task 3: In the third task, the contestants will apply the classifiers developed in Tasks 1 and 2 using the second data set and report their results for evaluation. The primary goal of this contest is to produce methods that are broadly applicable over different station background measurement distributions and explosion source hypotheses. The best methods will also have a very efficient learning curve. Recognition will also be given to methods more proficient in properly categorizing data arising from specific classes of explosion release hypotheses or station background types, because these methods add a forensic or diagnostic dimension.
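The Task 1 constraint (one classifier type, with parameters that may be tuned per station) can be illustrated with a deliberately simple sketch. Everything here is an assumption made for illustration: the single numeric score per sample, the threshold rule, and the names best_threshold and tune_per_station are hypothetical and are not methods prescribed by the contest.

```python
from collections import defaultdict


def best_threshold(scores, labels):
    """Return the cut that maximizes training accuracy for the rule
    'predict B+E when score >= cut'. Ties keep the smallest cut."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(scores)):
        hits = sum((s >= cut) == (y == "B+E") for s, y in zip(scores, labels))
        acc = hits / len(scores)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut


def tune_per_station(rows):
    """rows: iterable of (station, score, label) triples.
    The same classifier type (a threshold rule) is used everywhere;
    only its one parameter is tuned separately for each station."""
    by_station = defaultdict(list)
    for station, score, label in rows:
        by_station[station].append((score, label))
    return {
        station: best_threshold([s for s, _ in pairs], [y for _, y in pairs])
        for station, pairs in by_station.items()
    }


# Made-up scores, for illustration only.
rows = [
    ("STN1", 1.0, "B"), ("STN1", 2.0, "B"), ("STN1", 5.0, "B+E"),
    ("STN2", 10.0, "B"), ("STN2", 20.0, "B+E"), ("STN2", 30.0, "B+E"),
]
thresholds = tune_per_station(rows)
```

Under Task 2, by contrast, a contestant would be free to choose an entirely different algorithm for each station rather than only a different threshold.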
Timeline for the Contest:
1) September 12, 2008: Release of the Training Data and Test Data sets and the Software tools that will be used to evaluate the results.
2) Late October, 2008: Results from the labeled set are due.
3) November 8, 2008: Results obtained on the unlabeled data set are due.
4) December 15-19, 2008: Results of the competition are announced at the conference.
Detailed instructions on the three tasks of the Contest are available here. The datasets, template submission form and AUC calculation tool can be downloaded from here.
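For readers unfamiliar with the AUC measure used in the evaluation, the following is a minimal sketch of the standard rank-based definition: the fraction of (B+E, B) pairs in which the B+E sample receives the higher score, with ties counted as half. It is not the organisers' AUC calculation tool, and the function name auc and the sample numbers are hypothetical.

```python
def auc(scores, labels, positive="B+E"):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs, higher meaning 'more likely B+E'.
    labels: true classes, "B" or "B+E".
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up scores: three of the four (B+E, B) pairs are ordered correctly.
example = auc([0.1, 0.4, 0.35, 0.8], ["B", "B", "B+E", "B+E"])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few thousand observations per station; a sort-based rank computation would be the usual choice for larger sets.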