## Computer Lab

There are two practical assignments. The assignments will be posted here in due course.

### First Lab Session: Introduction to R and Linear Regression

For the first assignment, we will work with R, a free software environment for statistical computing and graphics. RStudio is an IDE for R.

1. To start R in the computer lab, go to Start --> All Programs --> Standard Applications --> Science --> Biology --> R for Windows [version number], and then choose "R for Windows" or "RStudio".
2. If you want to install R on your own laptop: you can download R from this location.
3. Go here for more information on R.
4. You can also install RStudio on your laptop.

In the first computer lab session we are not going to work on the assignment yet. Instead, we will use the session to gain some familiarity with R and with the linear regression model.

1. If you are completely new to R, work through Section 2.3 of the book "An Introduction to Statistical Learning" by James et al.
2. Otherwise, work through this introduction to R and linear regression.
3. Then work through these examples demonstrating overfitting.

After working through the examples above, do the following:

1. Read the optical digit recognition data into R and read the documentation (see under datasets).
2. Look at a summary of the data: can you already spot useless variables?
3. Fit a linear regression model on the training data.
4. Use it to predict the digit for each observation in the test set (see "predict.lm" in R). You may need to round the predictions! To make a cross table of the true and predicted class labels, the R function "table" may come in handy.
5. Explain why the regression model is in fact not appropriate here. Hint: suppose a coefficient in the model is positive. Think about what that means (study the documentation of the data), and whether such a model would make any sense here.
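The lab steps above can be sketched as follows. The object names ("train", "test", "label") are placeholders for however you read in the data, and the label is assumed to still be stored as a number (linear regression needs a numeric response):

```r
# Sketch of the lab steps; "train" and "test" are assumed data frames with
# a numeric digit column named "label" — adjust the names to your own files.
fit <- lm(label ~ ., data = train)     # linear regression on all pixel columns
pred <- predict(fit, newdata = test)   # "predict.lm" is dispatched here
pred <- round(pred)                    # round predictions to the nearest digit
pred <- pmin(pmax(pred, 0), 9)         # optionally clip values outside 0..9
table(test$label, pred)                # cross table: true vs predicted labels
```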

### Assignment 1

This assignment should be done in groups of two students. It must be handed in by Friday, December 14th 2018 at midnight. Send your report to A.J.Feelders@uu.nl. Put your names and student numbers in the body of the message and in the report.

For this assignment we are going to analyze a data set of handwritten digits. The task is to determine the digit (0..9) that was written from the pixel image (28x28 greyscale values). The dataset to be used and its documentation are given below.

As an introduction to programming in R, we recommend this book.

The data consists of 42,000 examples, each with 28x28 = 784 feature values. The first column contains the class label.

Your report should contain at least the following:

1. Begin with an exploratory analysis of the data. Can you spot superfluous variables by looking at their summary statistics? Consider the class distribution: what percentage of cases would be classified correctly if we simply predict the majority class? Convert the first column (the digit) to a categorical variable using "as.factor" in R.

Discuss any findings from your exploratory analysis that you think are of interest.
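A minimal sketch of this exploratory step, assuming the data frame is called "mnist.dat" (the name used in the hint in part 2) with the class label in the first column:

```r
# Exploratory sketch; "mnist.dat" is an assumed name, label in column 1.
summary(mnist.dat[, 2:6])           # constant (e.g. all-zero) pixels are superfluous
class.dist <- table(mnist.dat[, 1]) # the class distribution
max(class.dist) / nrow(mnist.dat)   # accuracy of always predicting the majority class
mnist.dat[, 1] <- as.factor(mnist.dat[, 1])  # make the label categorical
```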

2. Derive from the raw pixel data a feature that quantifies "how much ink" it takes to write a digit. Report the average and standard deviation of this feature within each class. Try to infer from these statistics which pairs of classes can be distinguished well, and which pairs will be hard to distinguish, using this feature. Hint: use the R function "tapply" to compute the mean and standard deviation per digit. If your feature is called "density", then "tapply(density, mnist.dat[,1], mean)" computes the mean density for each digit.

Using this single feature, fit a multinomial logit model and evaluate how well this model can distinguish between the different classes. For example, how well can the model distinguish between the digits "1" and "8"? And how well between "3" and "8"? Use the R function "scale" to scale your feature when you apply "multinom" to fit the multinomial logit model.

Come up with one other feature, and explain why you think it might discriminate well between the digits. Perform the same analysis as you did for the first feature.

Finally, fit a multinomial logit model using the two features together and see if and how it improves on the single-feature models.

Your report should contain an unambiguous description of how the features are derived from the raw data. Since so far we are considering models containing just one or two features, you may be able to make use of nice visualisations to illustrate your findings. Since in this part of the assignment we only consider very simple models, you may use the complete data set both for training and evaluation.
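The ink feature and the single-feature multinomial logit might look as follows. "ink" is a hypothetical name, and the label in column 1 of "mnist.dat" is assumed to have been converted with "as.factor":

```r
library(nnet)  # provides multinom

# "Ink" feature: total pixel intensity per image (hypothetical derivation;
# pixels are assumed to occupy all columns after the label).
ink   <- rowSums(mnist.dat[, -1])
digit <- mnist.dat[, 1]             # the factor-valued class label
tapply(ink, digit, mean)            # mean ink per digit
tapply(ink, digit, sd)              # standard deviation per digit

# Multinomial logit on the scaled single feature.
ink.s <- scale(ink)
fit1  <- multinom(digit ~ ink.s, maxit = 200)
table(digit, predict(fit1))         # confusion matrix on the full data
```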

3. In this part we are going to train and evaluate more complex models, using a much larger set of potential features. You are allowed to use the raw pixel values as features, and any other feature you can derive from them (such as the two features from part 2).

You probably lack the computational resources to use the larger part of this dataset for training and model selection. Hence you are allowed to draw a random sample of a size you can handle. Note, however, that the larger the sample, the better the models tend to be! The remaining data should be used to assess the predictive performance of the selected models. For example, if you draw a random sample of size 5,000, you use these 5,000 examples for training and model selection (using cross-validation), and you estimate the error of the finally selected model on the remaining 37,000 examples.
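The split described above can be sketched as follows (5,000 is just the example size from the text; "mnist.dat" is an assumed name):

```r
# Random train/test split; the training sample is used for training and
# model selection, the remainder for the final (honest) error estimate.
set.seed(1)                       # for reproducibility
n <- nrow(mnist.dat)
train.idx <- sample(n, 5000)      # random sample of size 5,000
train <- mnist.dat[train.idx, ]
test  <- mnist.dat[-train.idx, ]  # the remaining 37,000 examples
```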

Another way to reduce the size of the data is to reduce the number of pixel features. Some pixels may be superfluous, and you can also reduce the level of detail of the pixel image. For example, you can reduce the level of detail to a 14x14 pixel image. In that case you will have only 196 features instead of 784. Feel free to try anything. You should analyse the data with:
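One possible way to implement the 14x14 reduction is to average 2x2 blocks of the original image. The sketch below assumes the 784 pixels are stored row-wise in the columns after the label of "mnist.dat":

```r
# Shrink one 784-vector (a 28x28 image) to a 196-vector (14x14)
# by averaging each 2x2 block of pixels.
shrink <- function(v) {
  m   <- matrix(v, nrow = 28, ncol = 28, byrow = TRUE)
  m14 <- matrix(0, 14, 14)
  for (i in 1:14) for (j in 1:14)
    m14[i, j] <- mean(m[(2 * i - 1):(2 * i), (2 * j - 1):(2 * j)])
  as.vector(m14)
}
pixels14 <- t(apply(mnist.dat[, -1], 1, shrink))  # n x 196 feature matrix
```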

1. the regularized multinomial logit model (using the LASSO penalty),
2. k-nearest neighbor, and
3. support vector machines.

Note that you may need to pre-process the data in different ways to apply the different algorithms! You should use the same sample for training and model selection with each algorithm; otherwise the comparison would not be fair.
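For the regularized multinomial logit, a sketch using the "glmnet" package; "train" and "test" are assumed to hold the training sample and the remaining data, with the label in column 1:

```r
library(glmnet)  # provides glmnet and cv.glmnet

# LASSO-penalized multinomial logit; glmnet expects a numeric matrix,
# not a data frame, so the pixel columns are converted first.
x <- as.matrix(train[, -1])
y <- train[, 1]
cv <- cv.glmnet(x, y, family = "multinomial", alpha = 1)  # alpha = 1: LASSO
pred <- predict(cv, as.matrix(test[, -1]), s = "lambda.min", type = "class")
mean(pred != test[, 1])  # misclassification rate on the held-out data
```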

For each classification method that you apply, discuss the following:

1. What data pre-processing (including feature extraction/feature selection) did you perform for this classification algorithm? New features that you derive from the raw pixel data must be described unambiguously in the report. The reader should be able to reproduce your analysis.
2. What are the complexity parameters (if any) of the classification algorithm, and how did you select their values?
3. What is the estimated accuracy of the best classifier for this method?
For these experiments, you should train models with different parameter settings and use cross-validation to select the best setting. Use the remaining data to produce an honest estimate of the accuracy of the model finally selected for each method. For example, the complexity parameter for k-nearest neighbor is "k" (the number of neighbors used for majority voting). Try different values of "k" and use cross-validation to pick the value that produces the smallest classification error. Finally, to produce an unbiased estimate of the error for the selected value of "k", compute its classification error on the remaining data.
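For k-nearest neighbor, the tuning procedure described above can be sketched with "tune.knn" from the "e1071" package; again "train" and "test" are assumed names, with a factor-valued label in column 1:

```r
library(e1071)  # provides tune.knn
library(class)  # provides knn

# Select k by cross-validation on the training sample only.
x <- train[, -1]
y <- train[, 1]
cv <- tune.knn(x, y, k = 1:10,
               tunecontrol = tune.control(sampling = "cross", cross = 5))
best.k <- cv$best.parameters$k

# Apply the selected k to the held-out data for an unbiased error estimate.
pred <- knn(x, test[, -1], y, k = best.k)
mean(pred != test[, 1])  # classification error on the remaining data
```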
4. Which classification method(s) produced the best classifier for this problem?

After step 3 you will have determined the best parameter settings for each algorithm (using cross-validation), and you will have applied the algorithm with these parameter settings to the complete training sample to produce a model. Furthermore you will have estimated the error of these models on the remaining data. Now compare these models with each other. Don't just look at their overall error rate, but also consider how well different digits are distinguished by each model.

To run your experiments, you may want to write some simple R code to call the classification methods with different data or different parameter settings and collect the results; doing everything "by hand" from the command line can be very time consuming! To learn the basics of R, consult the online manual "Introduction to R" under the Help menu. (Note: the built-in editor in R is fairly awkward. To select another editor, you can type, for example, options(editor="Notepad") at the command line. Then, if you type fix(functionname), R will open Notepad and you can edit the function. Don't forget to save in Notepad before you return to R.) However, it is recommended to use RStudio, an IDE for R.

This assignment is not about producing the best possible predictive model. What counts is the quality and breadth of your analysis. The experiments should be performed in a methodologically correct manner, so that your conclusions and findings are reliable. Please also pay attention to the quality and readability of your written report. If you have performed an excellent analysis, but you write it down very badly, then you still have a problem. Make sure I enjoy reading your work!

The report should be 10-15 pages long (excluding a possible appendix with tables and figures) and should be submitted in PDF format.

There has been a Kaggle competition with this data set. Feel free to use any information you find there, as long as you acknowledge it in your report.

Below you find an overview of relevant algorithms and pointers to their availability in R.

| Algorithm | R Package | Function(s) |
| --- | --- | --- |
| k-Nearest Neighbour | class | knn |
| (regularized) Multinomial Logit | nnet, glmnet | multinom, glmnet, cv.glmnet |
| Support Vector Machine | e1071 | svm |
| Parameter Tuning | e1071 | tune, tune.wrapper |