GETTING STARTED WITH THE ASSIGNMENT IN R

To start R on computers in the lab, go to Start -> All Programs -> Standard Applications -> Science -> Biology -> R for Windows 3.3.1, and then select "R for Windows" or "Rstudio" (recommended). The latter is an integrated development environment (IDE) for R. Both programs are free software that you can easily install on your own laptop. We start with reading the credit scoring data from the lectures in R. The data can be found here. Suppose you save the data to a file "credit.txt" in the directory "dm" on the C drive. To read it into R type: > credit.dat <- read.csv('C:/dm/credit.txt') You have now assigned this data set to a variable called "credit.dat" (" <- " is the assignment symbol in R). To display its value, just type its name at the command line: > credit.dat age married house income gender class 1 22 0 0 28 1 0 2 46 0 1 32 0 0 3 24 1 1 24 1 0 4 25 0 0 27 1 0 5 29 1 1 32 0 0 6 45 1 1 30 0 1 7 63 1 1 58 1 1 8 36 1 0 52 1 1 9 23 0 1 40 0 1 10 50 1 1 28 0 1 (the first column are row numbers, and the first row are column names) "credit.dat" is now an object of type "data.frame". This is similar to (but subtly different from) a matrix. In any case, you can index a data frame like a matrix. Select the first row of credit.dat: > credit.dat[1,] age married house income gender class 1 22 0 0 28 1 0 Select the fourth column of credit.dat: > credit.dat[,4] [1] 28 32 24 27 32 30 58 52 40 28 Select the element in row 5, column 1: > credit.dat[5,1] [1] 29 Give the distinct values of income, sorted from low to high: > sort(unique(credit.dat[,4])) [1] 24 27 28 30 32 40 52 58 Add all the entries of the sixth column: > sum(credit.dat[,6]) [1] 5 Add the entries of each column of credit.dat: > apply(credit.dat,2,sum) age married house income gender class 363 6 7 351 5 5 Add the entries of each row: > apply(credit.dat,1,sum) 1 2 3 4 5 6 7 8 9 10 51 79 51 53 63 78 125 91 65 81 Select all rows where the first column is bigger than 27: > credit.dat[credit.dat[,1] > 27,] age married house income gender class 2 46 0 1 32 0 0 5 29 1 1 32 0 0 6 45 1 1 30 0 1 7 63 1 1 58 1 1 8 36 1 0 52 1 1 10 50 1 1 28 0 1 Construct a vector "x" with the numbers 2,5,10 in that order: > x <- c(2,5,10) > x [1] 2 5 10 Construct a vector consisting of the numbers 1 through 10: > c(1:10) [1] 1 2 3 4 5 6 7 8 9 10 Select the *row numbers* of the rows where the first column of credit.dat is bigger than 27: > c(1:10)[credit.dat[,1] > 27] [1] 2 5 6 7 8 10 Draw a random sample of size 5 from the numbers 1 through 10 (without replacement): > index <- sample(10,5) > index [1] 5 1 7 4 6 Select the corresponding rows: > train <- credit.dat[index,] > train age married house income gender class 5 29 1 1 32 0 0 1 22 0 0 28 1 0 7 63 1 1 58 1 1 4 25 0 0 27 1 0 6 45 1 1 30 0 1 Select all rows with row number not in "index": > test <- credit.dat[-index,] > test age married house income gender class 2 46 0 1 32 0 0 3 24 1 1 24 1 0 8 36 1 0 52 1 1 9 23 0 1 40 0 1 10 50 1 1 28 0 1 Consult the help page of the function "sample" > help(sample) Consult the R online manuals, and the "swirl" package to learn about the R programming language. At the end of a session (and also during a session), save your workspace to a file (choose "Save Workspace" from the file menu). Otherwise all results (the functions you created, etc.) will be lost after you quit R.

Practice exercise 1

Assume we have a classification problem with only 2 classes that are labeled 0 and 1 respectively. Write a function that computes the impurity of a vector (of arbitrary length) of class labels. Use either gini-index or entropy as impurity measure. Do not use a loop structure in your function, this is not necessary. Example (using gini-index): > y <- c(1,0,1,1,1,0,0,1,1,0,1) > y [1] 1 0 1 1 1 0 0 1 1 0 1 > impurity(y) [1] 0.2314050 If you are not working in Rstudio, to create the function, use: > fix(impurity, editor="Notepad") This will open a Notepad window. Type in the function definition, save the file and exit the editor.

Practice exercise 2

Write a function "bestsplit(x,y)" that computes the best split value on a numeric attribute x. Here x is a vector of numeric values, and y is the vector of class labels (assume there are only two classes, coded as 0 and 1). x and y must be of the same length: y[i] is the class label of the i-th observation, and x[i] is the corresponding value of attribute x. Only consider splits of type "x <= c" where "c" is the average of two consecutive values of x in the sorted order. So one child contains all elements with "x <= c" and the other child contains all elements with "x > c". The best split is the split that achieves the highest impurity reduction. Example (best split on income): > bestsplit(credit.dat[,4],credit.dat[,6]) [1] 36 Hint: Clever use of "subscripting" (selecting elements of vectors and matrices) is important in R. For example, y[x > 29] produces a vector with all elements of y whose corresponding x-element (that is the element of x with the same index) is bigger than 29. More formally: y[x > 29] = {y[i]: x[i] > 29}. The result is a vector, not a set, i.e. duplicate values may occur. Just try it! Hint: Example of how to determine candidate split points > income.sorted <- sort(unique(credit.dat[,4])) > income.sorted [1] 24 27 28 30 32 40 52 58 > income.splitpoints <- (income.sorted[1:7]+income.sorted[2:8])/2 > income.splitpoints [1] 25.5 27.5 29.0 31.0 36.0 46.0 55.0 Note: you may use the "brute force" approach, i.e. you don't have to implement the "segment borders" algorithm.