Practical Assignment 2

Clustering Images using Compression

The case
Suppose you have a large set of images and you would like to partition this set into groups of similar images. This set of images can be anything, so we cannot make any prior assumptions about it. An example would be the set of all photos on Flickr, but one could also envision images from a specific domain such as medicin, astronomy, security, and so on.
 
Because we are potentially dealing with an immense amount of images, we would rather not do this partitioning manually. This would be a tedious job. We therefore use techniques from data mining to solve this problem, and let the computer compute the results for us.

Clustering
Clustering is a well-known task that is not only studied within data mining, but also in other fields. Its goal is to find clusters of objects that exhibit a large intracluster similarity and a small intercluster similarity. That is, each group of objects clearly belongs together, while the groups are clearly different from eachother.
 
For this assignment, we will use agglomerative hierarchical clustering. Such clustering methods take a distance matrix between objects as input and iteratively cluster objects by linking those object pairs that have the smallest distances.
Luckily, some of these algorithms are already implemented in R and these are easy to use. See this file for an example
of different clustering algorithms available in R.

A compression-based distance measure for images
To be able to cluster images, we need to generate a distance matrix that contains the distance d for each possible pair of images (x,y). In lecture 42A, we already introduced the distance measure we will use:
 


 
where C(x) is the compressed size of image x and C(xy) the compressed size of the concatenation of images x and y. As compressor, we will use JPEG, a lossy compression scheme developed specifically for images.

Implementation in R
First of all, install R from this location in (a subdirectory of) your home directory (or on your laptop!): you need to have writing permission in the directory where R is installed, because you have to install a new package called rtiff.

Finding out all the details of reading and writing images by yourself would be a waste of time and therefore we provide you a textfile containing some directions. You can find this file here.
 
To be able to execute this code, you will need to install the rtiff package. If you cannot load this package after successfull installation, you may need to install some dependencies. Simply download this file and unpack it to the rtiff lib directory, e.g.
if you installed R in the root of U: this is U:\R\R-2.12.0\library\rtiff\libs

Datasets
For this assignment, two image datasets are available:

  1. Small dataset (10 images, ~10Mb)
  2. Large dataset (60 images, ~60Mb)
We recommend you to start with the small dataset, to get a better understanding of the method and make it easier to get quick and easily interpretable results. However, your final results should be obtained exclusively on the large dataset.

Bonus
We think that a simple and straightforward solution of this assignment should not be too complicated. For those of you who want more, we offer you the opportunity to extend your solution. The following extensions are worth investigating:

  • JPEG quality - By default, R uses a JPEG quality of 75% to write files. How does JPEG quality influence the results? Does a lower/higher quality improve the results?
  • Other file formats - R offers other graphic file formats. Can you improve on JPEG by using one of these?
  • Combining images - Simply concatenating the images one next to each other is probably not the best way to combine images for maximal compression. Can you improve on this?
  • Distance measure - Using only compression, can you improve the distance measure?
  • Clustering with pam - Also analyse the data with the partitioning around medoids (pam) algorithm from library (=package) cluster.
    Give a short description of pam in your report, and determine the appropriate number of clusters using the silhouette index.

Final instructions
This assignment should be made in groups of two students. Solutions should be handed in ultimately Friday November 12. Send your solution by e-mail to ad@cs.uu.nl. Your solution should consist of the following:

  1. An R workspace containing your program code, distance matrix and cluster result (on the large dataset!),
  2. A flat ascii file containing the documented program code, and
  3. A PDF file containing a short report (at least 2 pages) of your analysis.
If you did any extras, do not forget to describe them (and the results) in your report. Always put name and student number on your work.

Grading
The following considerations are taken into account to determine the grade for this assignment:

  • Does the program work, and does it return the correct result?
  • Quality of the documentation.
  • Elegance and efficiency of the implementation.
  • Quality of the report.
  • Any extras you did (see also `Bonus').