Computer Lab

There are two practical assignments: one for the part on statistical pattern recognition, and one for the part on geometric pattern recognition. The assignments will be posted here in due course.

Introduction to R

For the assignment on statistical pattern recognition, we will work with R, a free software environment for statistical computing and graphics. RStudio is an IDE for R.

  1. To start R, go to Start --> All Programs --> Standard Applications --> Science --> Biology --> R for Windows 3.2.3
    and then choose "R for Windows" or "RStudio".
  2. If you want to install R on your own laptop: you can download R from this location.
  3. Go here for more information on R.
  4. You can also install RStudio on your laptop.

In the first computer lab session we are not going to work on the assignment yet. We will use it to gain some familiarity with R.

  1. If you are completely new to R, work through section 2.3 of the book "An Introduction to Statistical Learning" by James et al.
  2. Here is an introduction to R for the first Computer Lab session.
  3. Here is an example to demonstrate overfitting (with nice graphs!).

After working through the examples above, do the following:

  1. Read the optical digit recognition data into R and read the documentation (see under datasets).
  2. Look at a summary of the data: can you already spot useless variables?
  3. Fit a linear regression model on the training data.
  4. Use it to predict the digit for each observation in the test set (see "predict.lm" in R). You may need to round the predictions!
    To make a cross table of the true class labels and the predicted class labels, the R function "table" may come in handy.
  5. Explain why the regression model is in fact not appropriate here. (Hint: suppose a coefficient in the model is positive. Think about what that means (study the documentation of the data), and whether such a model would make any sense here.)
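The steps above can be sketched as follows. The code uses a small synthetic stand-in for the digit data so it runs on its own; in the lab you would read the real data with read.csv (the file name depends on where you saved the dataset).

```r
# Sketch of steps 1-4 above, using synthetic stand-in data.
# In the lab: mnist.dat <- read.csv(...), with the digit in the first column.
set.seed(1)
n <- 200
mnist.dat <- data.frame(digit = sample(0:9, n, replace = TRUE),
                        matrix(runif(n * 5), n, 5))
summary(mnist.dat)                     # step 2: constant columns would be useless

train <- 1:100                         # step 3: fit on the training rows only
fit <- lm(digit ~ ., data = mnist.dat[train, ])
pred <- round(predict(fit, newdata = mnist.dat[-train, ]))   # step 4: round!
table(true = mnist.dat$digit[-train], predicted = pred)      # cross table
```

The cross table puts the true labels in the rows and the predicted labels in the columns, so correct predictions end up on the diagonal.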

Assignment 1

This assignment should be made in groups of two students. It should be handed in no later than Monday, December 19th. Send your report to A.J.Feelders@uu.nl. Put your names and student numbers in the body of the message and in the report.

For this assignment we are going to analyze a data set of handwritten digits. The task is to determine the digit (0..9) that was written from the pixel image (28x28 greyscale values). The dataset to be used and its documentation are given below:

  1. Documentation
  2. Data

As an introduction to programming in R, we recommend this book.

The data consists of 42,000 examples, where each example has 28x28=784 feature values. You probably lack the computational resources to use most of this dataset for training and model selection. Hence you are allowed to draw a random sample of a size you can handle. Note however that the larger the sample, the better the models tend to become! The remaining data should be used to assess the predictive performance of the selected models. For example, if you draw a random sample of size 1,000, you use these 1,000 examples for training and model selection (using cross-validation), and you estimate the error of the finally selected model on the remaining 41,000 examples.
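The split described above can be sketched as follows (with a one-column synthetic stand-in for mnist.dat, so the snippet runs on its own):

```r
# Sketch: 1,000 rows for training and model selection,
# the remaining 41,000 only for the final error estimate.
set.seed(42)                               # fix the seed so the split is reproducible
mnist.dat <- data.frame(digit = sample(0:9, 42000, replace = TRUE))
train.idx <- sample(nrow(mnist.dat), 1000)
train <- mnist.dat[train.idx, , drop = FALSE]
test  <- mnist.dat[-train.idx, , drop = FALSE]
```

Fixing the random seed makes the split reproducible, so your results can be checked.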

Another way to reduce the size of the data is to reduce the number of features. Some features may be superfluous, or maybe you can derive clever features from the raw pixel data that can discriminate very well between the digits. You can also reduce the number of features by reducing the level of detail of the pixel image. For example, you can reduce the level of detail to a 14x14 pixel image. In that case you will have only 196 features instead of 784. Feel free to try anything.
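The 14x14 reduction mentioned above can be sketched by averaging each 2x2 block of pixels. Whether the 784 values are stored row-wise or column-wise is an assumption here; check the data documentation.

```r
# Reduce one 28x28 image (a numeric vector of length 784) to 14x14
# by averaging each 2x2 block of pixels.
downsample <- function(pixels) {
  img <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)  # row-wise: an assumption
  small <- matrix(0, nrow = 14, ncol = 14)
  for (i in 1:14) {
    for (j in 1:14) {
      small[i, j] <- mean(img[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)])
    }
  }
  as.vector(small)                         # 196 features instead of 784
}
```

Applying this function to every row of the pixel data (e.g. with apply) gives the reduced feature matrix.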

Your report should contain at least the following:

  1. Begin with an exploratory analysis of the data. When you apply the "summary" function to the data you already get a lot of useful descriptive statistics. Can you spot superfluous variables by looking at the summary statistics? Consider the class distribution: what percentage of cases would be classified correctly if we simply predict the majority class? Convert the first column (the digit) to a categorical variable using "as.factor" in R.

    Discuss any findings from your exploratory analysis that you think are of interest.
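The exploratory steps of item 1 can be sketched as follows (with a synthetic stand-in for mnist.dat; the real data has the digit in the first column and 784 pixel columns):

```r
# Sketch of the exploratory analysis.
set.seed(1)
mnist.dat <- data.frame(digit = sample(0:9, 1000, replace = TRUE),
                        pix1 = 0,              # a constant, hence superfluous, pixel
                        pix2 = runif(1000))
summary(mnist.dat)                             # constant columns show min == max
class.dist <- table(mnist.dat$digit)
majority.acc <- max(class.dist) / nrow(mnist.dat)  # accuracy of always predicting the majority class
mnist.dat$digit <- as.factor(mnist.dat$digit)      # convert the label to categorical
```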

  2. Derive from the raw pixel data a feature that quantifies "how much ink" a digit costs. Report the average and standard deviation of this feature within each class. Try to infer from these statistics which pairs of classes can be distinguished well, and which pairs will be hard to distinguish using this feature. Hint: Use the R function "tapply" to compute the mean and standard deviation per digit. If your feature is called "density", then "tapply(density,mnist.dat[,1],mean)" will compute the mean density for each digit. Using this single feature, fit a multinomial logit model and evaluate how well this model can distinguish between the different classes. For example, how well can the model distinguish between the digits "1" and "8"? And how well between "3" and "8"? Use the R function "scale" to scale your feature when you apply "multinom" to fit the multinomial logit model.

    Come up with one other feature, and explain why you think it might discriminate well between the digits. Perform the same analysis as you did for the first feature.

    Finally, fit a multinomial logit model using the two features together and see if and how it improves on the single-feature models.

    Your report should contain an unambiguous description of how the features are derived from the raw data. Since so far we are considering models containing just one or two features, you may be able to make use of nice visualisations to illustrate your findings.
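The ink feature and the single-feature multinomial logit model of item 2 can be sketched as follows (synthetic pixel values as a stand-in for the real images; the nnet package ships with R):

```r
# Sketch for the ink feature and the multinomial logit model.
library(nnet)
set.seed(1)
n <- 500
digit <- factor(sample(0:9, n, replace = TRUE))
pixels <- matrix(runif(n * 784), nrow = n)      # stand-in for the raw pixel columns
ink <- rowSums(pixels)                          # total amount of ink per image
tapply(ink, digit, mean)                        # mean ink per class
tapply(ink, digit, sd)                          # sd of ink per class
fit <- multinom(digit ~ scale(ink), trace = FALSE)
pred <- predict(fit, type = "class")
table(true = digit, predicted = pred)           # which digit pairs get confused?
```

In the cross table, pairs of classes that share a row/column with many off-diagonal counts are the ones this feature cannot separate.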

  3. You should analyse the data with the regularized multinomial logit model (using the LASSO penalty), k-nearest neighbor and either neural networks or support vector machines. You may need to pre-process the data in different ways to apply different algorithms! However, you should use the same sample for training/model selection with the different algorithms. Otherwise the comparison wouldn't be fair.

    For each classification method that you apply, discuss the following:

    1. What data pre-processing (including feature extraction/feature selection) did you perform for this classification algorithm? You are allowed to use any feature, not just those that you proposed and analyzed in item 2 above. In particular, you are allowed to use the raw pixel features. New features that you derive from the raw pixel data must be described unambiguously in the report. The reader should be able to reproduce your analysis if he or she didn't have anything better to do.
    2. What are the complexity parameters (if any) of the classification algorithm, and how did you select their values?
    3. What is the estimated accuracy of the best classifier for this method?
    For these experiments, you should train models with different parameter settings, and use cross-validation to select the best parameter setting. Use the remaining data to produce an honest estimate of the accuracy of the model finally selected for each method. For example, the complexity parameter for k-nearest neighbor would be "k" (the number of neighbours used for majority voting). Try different values of "k" and use cross-validation to pick the value that produces the smallest classification error. Finally, to produce an unbiased estimate of the error for the value of "k" selected, compute its classification error on the remaining data.
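For k-nearest neighbour, the parameter selection described above can be sketched with knn.cv from the class package, which performs leave-one-out cross-validation directly (synthetic stand-in data; in the assignment, x and y come from your training sample):

```r
# Sketch: select k by cross-validation on the training sample only.
library(class)
set.seed(1)
n <- 300
x <- matrix(runif(n * 10), nrow = n)            # stand-in feature matrix
y <- factor(sample(0:9, n, replace = TRUE))     # stand-in class labels
ks <- c(1, 3, 5, 7, 9)
cv.err <- sapply(ks, function(k) mean(knn.cv(x, y, k = k) != y))
best.k <- ks[which.min(cv.err)]
# Finally, estimate the error of best.k honestly on the held-out data, e.g.
# mean(knn(train = x, test = x.test, cl = y, k = best.k) != y.test)
```

The final error estimate must come from the held-out data, because the cross-validation error of the selected k is optimistically biased.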
  4. Which classification method(s) produced the best classifier for this problem?

    After step 3 you will have determined the best parameter settings for each algorithm (using cross-validation), and you will have applied the algorithm with these parameter settings to the complete training sample to produce a model. Furthermore you will have estimated the error of these models on the remaining data. Now compare these models with each other. Don't just look at their overall error rate, but also consider how well different digits are distinguished by each model.

To run your experiments, you may want to write some simple R code to call the classification methods with different data or different parameter settings, and collect the results. It can be very time consuming if you do everything "by hand" from the command line! To learn the basics of R, consult the online manual "An Introduction to R" under the Help menu. (Note: the built-in editor in R is fairly awkward. To select another editor, you can type (for example) options(editor="Notepad") at the command line. Then if you type fix(functionname), R will open the Notepad editor and you can edit the function. Don't forget to save in Notepad before you return to R.) However, it is recommended to use RStudio, an IDE for R.

This assignment is not about producing the best possible predictive model. What counts is the quality and breadth of your analysis. The experiments should be performed in a methodologically correct manner, so that your conclusions and findings are reliable. Please also pay attention to the quality and readability of your written report. If you have performed an excellent analysis, but you write it down very badly, then you still have a problem. Make sure I enjoy reading your work!

The report should be 10-15 pages long (excluding a possible appendix with tables and figures) and should be in PDF format.

There has been a competition on Kaggle with this data set. Feel free to use any information you find there, as long as you acknowledge any information used in your report.

Below you find an overview of relevant algorithms and pointers to their availability in R.

Algorithm                        R Package     Function(s)
k Nearest Neighbour              class         knn
(regularized) Multinomial Logit  nnet, glmnet  multinom, glmnet, cv.glmnet
Local ("stepwise") Search        MASS          stepAIC
Neural Networks                  nnet, mxnet   nnet, various
Support Vector Machine           e1071         svm
Parameter Tuning                 e1071         tune, tune.wrapper

Assignment 2: Project with Geometric Pattern Recognition

The teams of four students have been formed and each team has a topic. All further project information will be posted below.

Topic 1: Leadership patterns in trajectories of animals
Alex Klein, Niels Rustenburg, Bram Custers, Marthe Hegeman

Topic 2: Patterns in street networks
Victor Voorhuis, Marjolein Zwerver, Jack van der Drift, Claudio Freda

Topic 3: Optical character recognition for the Latin alphabet
Jaspreer Singh, Daniel Stekenburg, Jorrit Dorrestein, Ivar van der Hoog

Topic 4: Optical character recognition for kanji
Maguell Sandifort, Kevin Ouwehand, Sam de Redelijkheid, Alexey Rodriguez Blanter

Topic 6: Patterns in archaeology
Tom Jimkes, Bram Zijjlstra, Koen van Noortwijk, Marijn Valk

Topic 7: Geological patterns
Hugo Heemskerk, Jose Vargas, Victor Hansen, Pim Agterberg

Topic 8: Plant and tree patterns
Soogeun Park, Wouter van 't Hof, Tues van Oosterom, Casper van Dommelen

Topic 12: Star patterns in astronomy
Martijn Fleuren, Susan Brommer, Steven Langerwerf, Ricky van der Walenburg

Topic 13: Patterns in pedestrian movement
Peter Markotic, Paktwis Hamayun, Peter Blomsma, Floor Eigenhuis

Topic 16: Fingerprint patterns analysis
Sander Vankeste, Joep Hamersma, Geert Lugtenberg, Shu Cheng Jiang

Topic 17: Iris scan patterns analysis
Ruud van Asseldonk, Olaf Schalk, Kevin van Maastrigt, Jan Geertman

Topic 18: Spatial analysis for crime patterns
Petra G, Casper H, Roald N, Roberto P

Topic 19: Seismic patterns
Riccardo Bellana, Alejandro Barreda-Arrieta, Jasper Makkinje, Guillermo Ibanez