Assignment: dating site using PHP (part 2)

Introduction

In this practical assignment you will be building "DataDate", a dating site using PHP, with a twist! It offers users the ability to create a profile and get in touch with other users that match their preferences.

Examples of dating sites are plentiful: have a look at Relatieplanet, Lexa or e-Matching for (Dutch) domain-specific examples that you can use for inspiration. But note: we don't require you to rebuild these or any of the other sites out there; we concentrate on creating profiles, picture upload, profile matching and (bonus) a rudimentary messaging system. And, oh yes, the twist!

[marketing mode on]
The twist in our site is that our dating paradigm is based on a unique and scientifically sound profiling and matching technique that brings both personality and lifestyle into the dating equation and learns from user preferences to optimize the user's dating experience. It has been done before to some extent (BrandDating for lifestyle-based dating and Parship for personality-based dating), but we, of course, not only have better technologies, but also combine these systems and extend them with game-changing self-learning features.
[marketing mode off]

This assignment requires a number of techniques that you've already seen during the lectures. Furthermore, the use of some additional techniques may be needed. A description of user profiles, matching and learning is provided, followed by a description of required functionality and the technical requirements that your work has to meet. Finally, some hints and tips are provided. Read it through before you start and don't get scared: the matching and learning is not as hard as it may seem at first.

User profiles, matching and learning

User profiles

A user profile should consist of the following information: username (nickname), real name, password, e-mail address, gender, date of birth, picture, description, gender preference, age preference, personality type, personality type preference and brand preferences. Some of these fields are straightforward, other fields need some clarification:

Finding a user's personality type

To find out the personality type of a user, your dating site should offer a personality test. The test consists of 19 questions. Each question helps decide one of the four dichotomies; for each dichotomy, there are 4 to 6 questions. The questions, the dichotomies they decide and the (straightforward) scoring scheme can be found here. Feel free to translate the questions into Dutch if you'll be making a Dutch dating site. A single form for the whole personality test should do. To make the test more interesting, you may consider (but that's not required) varying the order of the questions and answers in your form, as is done on Kisa.ca. (Take the test a second time by reloading the page and you'll see that it has changed; look at the HTML source of the form and you'll see how it's done.) Again, note that 50% for both opposites means neutrality within a dichotomy (50% E = 50% I, etc.) and that the scores for the two opposites within a dichotomy should add up to 100% (37% E = 63% I). Always show a user's personality type as a combination of the dominant opposites and their scores. So write "You are an I (80%) N (70%) T (51%) J (65%)" rather than "You are an E (20%) S (30%) F (49%) P (35%)" or "You are an I (80%) S (30%) F (49%) J (65%)".
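The "dominant opposites" display rule can be sketched as follows. This is only an illustration: the function name formatPersonalityType, the choice to store one score per dichotomy (for E, S, T and J) and the tie-breaking at exactly 50% are assumptions, not part of the assignment.

```php
<?php
// Hypothetical helper: $scores holds the percentage for the FIRST letter of
// each dichotomy (E, S, T, J); the opposite letter is 100 minus that score.
function formatPersonalityType(array $scores): string
{
    $opposites = ['E' => 'I', 'S' => 'N', 'T' => 'F', 'J' => 'P'];
    $parts = [];
    foreach ($opposites as $first => $second) {
        $score = $scores[$first];
        // Show whichever opposite is dominant. At exactly 50% the first
        // letter is shown (an arbitrary choice made for this sketch).
        if ($score >= 50) {
            $parts[] = sprintf('%s (%s%%)', $first, $score);
        } else {
            $parts[] = sprintf('%s (%s%%)', $second, 100 - $score);
        }
    }
    return implode(' ', $parts);
}

// 20% E, 30% S, 51% T, 65% J
echo formatPersonalityType(['E' => 20, 'S' => 30, 'T' => 51, 'J' => 65]), "\n";
// prints: I (80%) N (70%) T (51%) J (65%)
```

Storing a single score per dichotomy keeps the "add up to 100%" rule true by construction.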

Matching a user's personality type

While age and gender preferences can easily be matched based on ranges or exact matches, matching personality types requires a matching scheme. A user A will usually be attracted to a user B who matches the personality type preference of user A. In other words, we need to find users B with a personality type that has a small "distance" to the preferred personality type of user A. To calculate the "distance" between two personality types, add the absolute differences in score (a percentage) for each dichotomy and divide by 400% (the maximum total difference, reached by two users who score 100% on all four dichotomies, but on exactly the opposite sides). To be able to calculate the difference, you first need to express the scores in the same opposite: if user A is I (70%) and user B is E (60%), first translate E (60%) to I (40% = 100% - 60%); the difference is then 70% - 40% = 30%.

Some full examples: the distance between personality type I (70%) N (60%) T (50%) J (65%) and personality type E (60%) N (95%) T (50%) P (65%) is (70% - 40% (as 60% E = 40% I) + 95% - 60% + 50% - 50% + 65% - 35% (as 65% P = 35% J)) / 400% = (30% + 35% + 0% + 30%) / 400% = 95% / 400% = 0.2375. And the distance between E (100%) N (100%) F (100%) J (100%) and I (100%) S (100%) T (100%) P (100%) is (100% - 0% + 100% - 0% + 100% - 0% + 100% - 0%) / 400% = 400% / 400% = 1, the maximum distance (exact opposites in all dichotomies). The distance between personality types is thus a value between 0 and 1.
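The distance calculation above might look like this in PHP. The function name personalityDistance and the array layout (one score per dichotomy, already expressed in the same opposite: E, S, T and J) are assumptions made for this sketch.

```php
<?php
// Assumed representation: each type is an array with the score (0-100) for
// E, S, T and J, so both types are already expressed in the same opposites.
function personalityDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach (['E', 'S', 'T', 'J'] as $letter) {
        // Absolute difference in score for this dichotomy.
        $sum += abs($a[$letter] - $b[$letter]);
    }
    // 400 is the maximum total difference over four dichotomies.
    return $sum / 400.0;
}

// I (70%) N (60%) T (50%) J (65%) versus E (60%) N (95%) T (50%) P (65%)
$typeA = ['E' => 30, 'S' => 40, 'T' => 50, 'J' => 65];
$typeB = ['E' => 60, 'S' => 5,  'T' => 50, 'J' => 35];
echo personalityDistance($typeA, $typeB), "\n"; // 0.2375
```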

Matching a user's lifestyle/brand preferences

If we want to match two lifestyles (expressed as two sets of brand preferences), we are gently moving into the field of information retrieval. To match two sets of brands, we could simply find the number of brands that they have in common: the size of the overlap. The so-called simple overlap coefficient is defined as |X intersect Y| with X and Y being the two sets of brands. Note that this simple overlap coefficient favours profiles that list a lot of brands: chances are that the size of the overlap is rather large if both lifestyles contain a significant number of brands. Furthermore, if one of the lifestyles contains only one or a few brands and the other lifestyle contains many brands, the brands that are in the profile are much more significant (relevant, discriminating, descriptive) for the user with the small set of brands than they are for the user with the huge set of brands. We should thus somehow normalize the result of matching (using the simple overlap coefficient) for the size of the sets of brands, putting a "penalty" on larger sets. Within the field of information retrieval, several similarity measures have been defined that work just like that. These (both formulas and names) are listed in the following figure:

Note how each of these similarity measures takes the simple overlap coefficient (|X intersect Y|) as its base and normalizes it to some degree. Also note that the result of calculating the similarity of two sets using one of these measures is always a value between 0 and 1. These are so-called unweighted normalized similarity measures. Unweighted, because there's no weight associated with the elements in each set. Weighted similarity measures, which take a quantified preference for brands ("I like this brand and give it an 8 out of 10 and I like that brand as well and give it a 7 out of 10") into account, are generally more complex and beyond the scope of this assignment. Finally, keep in mind that we are interested in users that have a lifestyle that is similar, or, one could say, have sets of brand preferences that have little or no distance between them. As in the case of personality type matching, it might be more convenient to talk about "distance" rather than "similarity". Distance is easily found once similarity is known: if two sets of brands are equal, their similarity should be 1 according to the aforementioned unweighted normalized similarity measures. Distance should then be minimal, or 0. On the other hand, brand sets that have no overlap at all are very dissimilar and their similarity is 0 according to our measures. Distance should then be maximal, or 1. The same goes for values in between: distance is the complement of similarity. In short: distance can be defined as 1 - similarity. Our similarity measures can thus be easily rewritten to (unweighted normalized) distance measures.
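Since the figure with the similarity measures is not reproduced in this text, the sketch below uses two widely known unweighted normalized measures, the Jaccard and Dice coefficients. That these are among the measures in the figure is an assumption, as are the function names and the handling of two empty sets.

```php
<?php
// Jaccard coefficient: |X intersect Y| / |X union Y|
function jaccardSimilarity(array $x, array $y): float
{
    $x = array_unique($x);
    $y = array_unique($y);
    $union = count(array_unique(array_merge($x, $y)));
    if ($union === 0) {
        return 1.0; // assumption: two empty brand sets count as identical
    }
    return count(array_intersect($x, $y)) / $union;
}

// Dice coefficient: 2 * |X intersect Y| / (|X| + |Y|)
function diceSimilarity(array $x, array $y): float
{
    $x = array_unique($x);
    $y = array_unique($y);
    $total = count($x) + count($y);
    if ($total === 0) {
        return 1.0; // same assumption as above
    }
    return 2 * count(array_intersect($x, $y)) / $total;
}

// Distance is defined as 1 - similarity (Jaccard chosen here as an example).
function brandDistance(array $x, array $y): float
{
    return 1.0 - jaccardSimilarity($x, $y);
}

$brandsA = ['Apple', 'Nike', 'Volvo'];
$brandsB = ['Nike', 'Volvo', 'Ikea', 'Heineken'];
echo jaccardSimilarity($brandsA, $brandsB), "\n"; // 2 shared out of 5 distinct brands
echo brandDistance($brandsA, $brandsB), "\n";
```

Both measures start from the size of the overlap and divide by a quantity that grows with the size of the sets, which is the "penalty" on larger sets described above.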

To end our discussion of matching brand preferences, it might be interesting from a scientific point of view to have a look at some desired properties of distance measures in general. They are listed in the following figure:

D(X,Y) means the distance between set X and Y (the result of applying the distance measure D to two sets X and Y). Note how D2 can be identified as the "identity distance". D3 describes symmetry. And D4 describes the triangle inequality (driehoeksongelijkheid in Dutch). The interested reader is encouraged to check whether our unweighted normalized distance measures (remember, 1 - similarity) match all of these desired properties. It might not turn out to be the way you think. Note that a distance measure that satisfies all of these desired properties is called a metric. You'll encounter other metrics later on during your studies.
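The figure with the properties is not reproduced in this text. As a reminder, the usual textbook formulation of these four properties is given below. The labels D2 to D4 follow the description above; that D1 stands for non-negativity is an assumption.

```latex
\begin{align*}
\text{D1 (non-negativity, assumed):} \quad & D(X,Y) \geq 0 \\
\text{D2 (identity):}                \quad & D(X,X) = 0 \\
\text{D3 (symmetry):}                \quad & D(X,Y) = D(Y,X) \\
\text{D4 (triangle inequality):}     \quad & D(X,Z) \leq D(X,Y) + D(Y,Z)
\end{align*}
```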

Ranking matching results

One other thing to think about is what matching based on a complete profile exactly means, given the different fields that are available within a user profile. Of course, the gender and age of users returned by the matching process should at least fulfill the exact values or ranges provided for gender preference and age preference. For personality type preference and brand preference things are not as clear, as the result of matching these against some other profile is a distance value between 0 and 1 as discussed above for each of these two preferences. We can, however, rank the results according to the distance: the higher the distance, the lower the rank. One last thing to consider: how do we balance the two distance values for personality type preference and brand preference in our ranking? An easy way out is to introduce the x-factor "x", with x a value between 0 and 1, and define the overall distance of two profiles as follows:

In other words, the x-factor determines the importance of personality versus lifestyle in our matching process. The outcome of the overall distance, which is still a value between 0 and 1, can be used to rank the results of the matching process: the higher the distance, the lower the rank. Our users are usually only interested in the highest ranking profiles of other users.
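The formula for the overall distance is not reproduced in this text. The sketch below assumes the simplest weighting that fits the description: x times the personality distance plus (1 - x) times the brand distance, which stays between 0 and 1. Which of the two distances is weighted by x, the function names and the example candidates are all assumptions.

```php
<?php
// Assumed formula: a weighted average of both distances, with $x in [0, 1].
function overallDistance(float $personality, float $brand, float $x): float
{
    return $x * $personality + (1.0 - $x) * $brand;
}

// Sort candidates by ascending overall distance: smallest distance first.
function rankCandidates(array $candidates, float $x): array
{
    usort($candidates, function (array $p, array $q) use ($x): int {
        return overallDistance($p['personality'], $p['brand'], $x)
            <=> overallDistance($q['personality'], $q['brand'], $x);
    });
    return $candidates;
}

// Hypothetical candidates that already passed the gender and age filters.
$candidates = [
    ['username' => 'anna',  'personality' => 0.2, 'brand' => 0.6],
    ['username' => 'bram',  'personality' => 0.5, 'brand' => 0.1],
    ['username' => 'carla', 'personality' => 0.9, 'brand' => 0.8],
];

foreach (rankCandidates($candidates, 0.5) as $candidate) {
    echo $candidate['username'], "\n"; // bram, anna, carla
}
```

With x = 0.5 both distances count equally. Moving x towards 1 lets personality dominate the ranking, moving it towards 0 lets lifestyle dominate.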

Learning from user input

Now that we have methods for both matching personality type and lifestyle and for integrating the outcome, the only thing left to be addressed is our desire to have our dating site learn from user input. The idea is that user feedback concerning the results of the matching process can be used to adapt the user profile, more specifically the personality type preference. Remember that we keep track of two personality types: the user's personality type that is the result of our personality test and the user's personality type preference that is initially set to the exact opposite of the user's personality type. As you can imagine, it's this "opposites attract" adage where improvements are likely. It's thus in here that the learning will take place. (In theory, brand preferences could also be learned, but that is more complex and beyond the scope of this assignment.)

To be able to learn from our matching results, we first need user input. The best compliment a user can provide to our matching algorithm is of course deciding to contact the other user by sending him or her an "I would like to get in touch with you" message. However, users are probably not going to send enough of these messages to be able to quickly learn their preference. (On top of that, the messaging system is a bonus part of this assignment.) Therefore, we provide our users with the option to mark another user as "hot" if they are looking at this other user's profile. A specific user A can mark another specific user B as "hot" only once, but a specific user B can be marked "hot" by several different users A, C, D, etc. Marking another user as "hot" is a decision that cannot be undone as that would imply unlearning of learned data (a whole new dimension!). Now how do we use this user input to learn a user's personality type preference? Well, just read on.

If a user A deems another user B to be "hot", we take a look at the personality type of "hot" user B and the current personality type preference of user A, who is giving us the input. For each dichotomy we have a look at the score (percentage) for the opposites within these two profiles. The new score for each dichotomy within the personality type preference of user A will be alpha*(old score in personality type preference of user A) + beta*(score in personality type of user B), with alpha being a value between 0 and 1 and beta being defined as 1 - alpha. Note that for this formula to be useful, you again need to first express the scores in the same opposite (if user A is I (70%) and user B is E (60%) you need to first translate E (60%) to I (40% = 100% - 60%) to be able to compare these). In this formula, alpha is the so-called "remembrance factor"; it signals how much of the old preference is maintained in the new preference. Beta is the so-called "learning factor"; it signals how much of the "hot" user's personality is included in the new preference. The higher the learning factor, the faster the preference changes.

Let's have a look at a small example to clarify this. Let's assume the current personality type preference of user A is I (80%) N (70%) T (51%) J (65%) and the personality type of user B is E (70%) N (70%) F (90%) J (95%). Let's say that alpha = 0.9 and beta = 0.1. We can now calculate the new personality type preference of user A after deeming user B to be "hot" by first expressing the personality type of "hot" user B as I (30%) N (70%) T (10%) J (95%), followed by applying the formula for each dichotomy. So the new score for I in the preference of user A will become alpha*(I in preference of A) + beta*(I in personality of B) = 0.9*80% + 0.1*30% = 72% + 3% = 75%. The new score for N will become 0.9*70% + 0.1*70% = 70% (no change!). For T it will become 0.9*51% + 0.1*10% = 46.9% (actually changing the preference from dominant T to dominant F!). And for J it will become 0.9*65% + 0.1*95% = 68%. So the new personality type preference of user A will be I (75%) N (70%) F (53.1%) J (68%). Note that the change will increase with higher values of beta (and thus lower values of alpha).
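The update rule from this example could be sketched as follows. The function name learnPreference and the array layout (one score per dichotomy, expressed for E, S, T and J) are assumptions. Results are rounded for display only, because floating point arithmetic may introduce tiny deviations.

```php
<?php
// Hypothetical helper: both arrays hold the score (0-100) for E, S, T and J,
// so the preference and the "hot" type are expressed in the same opposites.
function learnPreference(array $preference, array $hotType, float $alpha): array
{
    $beta = 1.0 - $alpha; // learning factor
    $new = [];
    foreach (['E', 'S', 'T', 'J'] as $letter) {
        // alpha * old preference + beta * personality of the "hot" user
        $new[$letter] = $alpha * $preference[$letter] + $beta * $hotType[$letter];
    }
    return $new;
}

// Preference of A: I (80%) N (70%) T (51%) J (65%)
$preferenceA = ['E' => 20, 'S' => 30, 'T' => 51, 'J' => 65];
// Personality of B: E (70%) N (70%) F (90%) J (95%)
$typeB = ['E' => 70, 'S' => 30, 'T' => 10, 'J' => 95];

foreach (learnPreference($preferenceA, $typeB, 0.9) as $letter => $score) {
    echo $letter, ': ', round($score, 1), "\n";
}
// Roughly E 25, S 30, T 46.9, J 68, i.e. I (75%) N (70%) F (53.1%) J (68%)
```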

This example completes our discussion of profiles, matching and learning. If you don't feel comfortable with what you've read, don't hesitate to ask for assistance and clarification during the practical sessions. Let's move on to a description of the required functionality of our dating site.

Description of functionality

The dating site should consist of three parts: an anonymous user front-end, a registered user front-end and a simple administrative back-end. The anonymous user front-end offers limited functionality that allows anonymous users to search for and browse user profiles that are available on DataDate, hiding the picture associated with profiles. The registered user front-end allows registered users to create and maintain a profile, match other users' profiles against their profile and (bonus) contact other users through a rudimentary messaging system, on top of the functionality offered to anonymous users. A full-featured back-end, which allows site maintainers to maintain the user base of registered users and moderate profiles, and which is common for web applications such as this one, is not part of this assignment (Phew! ;-). However, a simple administrative back-end that can be used to configure the matching and learning process should be part of your site. The two front-ends and the back-end that you should build are described below.

Front-end: anonymous access

If an anonymous user accesses the site, he or she is not part of the DataDate community (yet) and should be granted only limited access to the site. This limited access consists of the following functionality:

Front-end: registered access

A registered user should - after logging in - have access to basically the same functionality as an anonymous user. There are, however, some additional features available to registered users. These are:

Back-end: administrative access

An administrator should - after logging in - have access to basically the same functionality as a registered user (for testing purposes only, mind you!) and some additional features. Administrators may be hard coded in the database (you may assign administrator privileges to certain users by marking them by hand as administrator in your database). The additional features for administrators are:

Front-end: registered access, bonus functionality

If you have finished building both front-ends and the back-end and you've still got some time left, please feel free to implement the following bonus functionality:

Technical requirements

Here are some additional technical requirements for this assignment. These should not come as a (major) surprise:

Final words

Note that these specifications are not complete - as they never are in the real world. This is done on purpose. That means that some issues are left to your own interpretation and creativity. (Examples include: what will the layout of the site/display order of the fields be? What do we do if a chosen username already exists? Will we send a registration confirmation by e-mail? Do we offer an "I forgot my password, send it by e-mail" link? Do we store session data in files on the server or in the database? Do we show the username of a logged-in user on each page? Do we offer a log out button? Will sessions expire after n minutes? Do we offer a link to return from viewing a profile to the last search result? Can e-mail address exchange requests be deleted? Do they ever expire? Etc. etc.)

It also means that this assignment could be executed in a minimalistic way while still adhering to (only) the explicitly specified functionality. As you might expect, a minimalistic implementation will yield a minimalistic grade (although maybe even a "voldoende" (a pass), if you are lucky ;-)). You will get out of this assignment what you put into it, in terms of time, pleasure, knowledge and grades. Additional/original features are highly valued as we like to see enthusiasm during this course, but please do document them in your documentation (don't expect us to see the little 5-legged happy purple elephant in the right-hand corner that by clicking on it allows you to play a JavaScript-based flight simulator game - if you don't tell us so ;-).

Although the description of the assignment may be somewhat 'vague' to simulate reality, it has of course been scaled to the amount of time available. We expect each of you to invest 20 hours a week into this course and - frankly - mainly into the practical assignments. Within that given amount of time it should be (more than) doable to produce a working site that fulfils all of the criteria in a sufficient and creative manner. However, tempus fugit, so please start immediately! Waiting for one or two weeks will significantly lower your chances to pass this assignment with a sufficient grade.

If, due to time constraints, you are forced to leave certain functionality out, please note that order of functionality is of importance for grading: first, create a static version of the user interface (assignment 1a), then try to get the whole session/login/profile creation functionality working. Then work on the searching and matching functionality. Then work on the administrative back-end. Then work on the picture upload. Finally, work on the (bonus) messaging system.

Important

The files that constitute your solution should be put in a zip file and submitted using Submit. Note that the maximum file size in Submit is 2000 kB for this assignment, so go easy on the graphics. In a separate README text document (readme.txt), which should accompany your solution, you should tell us:

As mentioned before, in addition to the two regular user accounts, we expect your site to be pre-filled with a number of other users (let's say at least 10 in total), profiles, etc. to be able to get an impression of the "look and feel" of your site. Make sure that some of these profiles are similar and others are more distant.

Do not remove the site before you have obtained your grade. Otherwise it can't and won't be graded. More submission details can be found here.

General and less general remarks that were made in reaction to previously handed in assignments can be found here. Read them.

Tips

Some tips for this assignment:

Help with PHP and PostgreSQL

See the hardware and software section of this site.

Useful links

See the web links section of this site.


lennart@cs.uu.nl