General
You are supposed to work in groups of two persons.
Although this page is in English (for historical reasons), you are supposed to write your report in Dutch.
Exercise 1
For exercise 1 you will make use of Java and SQLite. Prepare yourself by establishing a JDBC-connection using SQLiteJDBC. This module is offered as freeware by Xerial. If you have a convincing preference for C#, then try to establish a SQLite-connection using information from the blog of Tigran Gasparian. Be sure that you have finished these preparations before the exercise starts in week 20.
Ranking on query results
The goal of this exercise is basically to implement the ideas of the article by Agrawal, Chaudhuri e.a. We offer a table and a query workload.
The program should be able to process conjunctive equality queries (ceq) on the table. A ceq consists of predicates of the kind attr = value, separated by comma's, terminated by a semicolon. An input consists of a ceq and is read by a (simple) gui. The required value for k is also entered according to this syntax. When missing, use a default value k = 10. Example inputs:
k = 6, brand = 'volkswagen';
cylinders = 4, brand = 'ford';
The basic query is
SELECT * FROM autompg WHERE ceq;
In the ceq, the comma's are replaced by ANDs and the k-value is left out.
The output consists of the top-k tuples according to some ranking principle.
Your program will do some preprocessing on the data and/or the workload. This is done once, before processing a batch of queries. During this phase, a meta database will be constructed and filled. This metadb will be used when aswering the actual queries. Note that your metadatabase should be constructed only once, before processing a batch of queries.
Your software is supposed to meet requirement [1], and at least one of requirements [2] and [3]:
[1] deal with similarity properties of numerical attributes
[2] use sophisticated techniques for finding value-similarities
[3] use sophisticated techniques for top-k calculations.
The deliverables are:
- a text file metadb.txt, containing the data definitions required for your meta database (that is
filled during the preprocessing)
- a text file metaload.txt, containing sql-statements used to fill the metadb
- a java program to determine and fill the contents of the metadb, based on the preprocessing of data and the workload
- a java program to deal with the queries
- a description of your approach, explaining choices and describing experiences (we do not expect a user manual). Format: pdf; max 6 pages.
Zip your stuff before submitting. Deadline for exercise1 is Friday May 31, 23:55. A demonstration of your software in the week after the deadline is part of the protocol. Prepare test-queries that make sense.
Before we start with the actual exercise, we are going to practice with R. You can start R from the menu (Start -> Standard Applications -> Biology),
or by saving this R workspace
somewhere under your home directory and then double clicking it. If you use your own laptop, you can download R
here.
- Start by going through this short R tutorial.
- Browse through "An introduction to R" under Help -> Manuals (in PDF) in the R gui.
- The R homepage.
- A useful Online Resource.
- R Studio is an IDE for R. I have no experience with it.
If you decide to try it, I would like to hear about your experiences.
Assignment: Text Categorization
In this practical assignment you are going to apply data mining techniques to a text classification problem. You have a lot of freedom as to
which particular problem you are going to work on, but there are a number of constraints:
- The analysis has to be performed in R. We will provide information on packages in R that you can use for the analysis.
- The analysis has to start from some kind of text format, that is, it is not allowed to start from a fully prepared data file in which
all the features have already been constructed for you. The reason for this constraint is that it is important to consider the impact of
different pre-processing steps on the raw text data.
- The problem has to be a classification problem, either single-label (each document belongs to exactly one class) or multi-label (documents can
belong to more than one class). Multi-label classification is often performed by decomposing the problem into a collection of single-label
classification problems, and then combining the results. Hence, choosing multi-label classification means extra work!
- You have to apply at least two classification algorithms (e.g. kNN and SVM) to your problem.
Text mining in R
The package "tm" allows you to manage a corpus of text
documents, perform different pre-processing operations on it, and to extract document-term matrices with different weighting schemes
(e.g. term frequency, tf-idf,...) from such a corpus.
The document-term matrix can be used as a basis for learning classifiers with different classification algorithms, such as naive Bayes,
k-nearest neighbour and support vector machines.
Overview of useful R packages:
- tm: the text mining package that can be used to manage a text corpus and extract document-term matrices from them
- e1071: contains a function for learning naive Bayes classifiers and support vector machines
- class: contains a k-nearest neighbour algorithm
- MASS: contains functions "lda" and "qda" for linear and quadratic discriminant analysis
Setting up an R working environment
We discuss two options:
- The preferred option is to install R on your own laptop. It is easy to install packages from within R.
First check if a package is already installed.
Go to the menu "Packages --> Install package(s)" etc. To load an installed package, go to "Packages --> Load package".
- Use the PC's in the computer lab. Note: In this setting you probably won't get the "stemming" pre-processing step to work,
but that is not really a problem. Proceed as follows. Save this R workspace
somewhere under your home directory and double click it to start R. You can now install packages from within R (see (1))
but these packages will be removed by the next day, so you would have to install them again each time.
This does not take a lot of time however. The second option is to install the packages somewhere under your own home directory
where they won't be removed every time, e.g. in the directory "S:/dar/Rpackages/".
This works as follows. First select a CRAN mirror by going to the menu item "Packages --> Set CRAN mirror".
It makes sense to select "Netherlands (Utrecht)". Then type in the following in R (make sure the indicated directory exists):
# install package
> install.packages("tm", lib="S:/dar/Rpackages/")
# load package
> library(tm, lib.loc="S:/dar/Rpackages/")
Hints and Suggestions
- Start by working through the examples in "Introduction to the tm Package", see the
package home page.
- Try to analyse the Lyrics Data. See the lecture slides for examples.
To create the reader for the lyrics data, use this function.
- If you lack inspiration, you can use the Reuters-21578 corpus to define a classification problem.
An XML-encoded version of this corpus can be found here.
- Perhaps more challenging is analysis of the Enron e-mail corpus.
- You can try to automatically classify tweets, e.g. by subject or the sentiment (positive/negative) expressed in a tweet.
Since tweets are very short, this problem may be harder than standard text classification. The package "twitteR" alows you to
import tweets into R. Look here for instructions.
- Spam filtering is another problem where text classification can be applied.
- You can also pick up on my Music Genre classification idea. You could for example include more genres and build a larger corpus.
- Or do something completely different ...
Deliverables
Hand in a report (pdf) describing your analysis. The report must contain at least the following:
- A description of the text classification problem you have investigated.
- A description of the data set/corpus you have analysed. Include simple descriptive statistics like
number of documents, number of classes, frequencies of classes, etc.
- The pre-processing steps you have performed on the raw text data. This includes different weightings in the
document-term matrix, and feature selection. Note that you are also allowed to construct features in addition
to the terms that are present in the document-term matrix. For example, if you think total document length is
predictive for the class label, you can add this feature.
- A short description of the classification algorithms you have applied.
- A description of the training and testing of the classifiers, and a comparison of their predictive performance.
How did you select the value(s) of the complexity parameter(s) of a learning method? (e.g. the value of k in kNN,
or the value of C in a support vector machine).
- Interpretation of the results. Why did you get good or bad prediction results? Why do you think did one algorithm or
weighting scheme work better than another?
Send the report ultimately Friday, June 28 to me by e-mail.