CSE012/CS059 – Data Mining

Winter 2017

Administrative

Class Hours: Tuesday 4:00-7:00 pm.
Instructor: Panayiotis Tsaparas (tsap _at_ cs.uoi.gr), Office Β.3

Past Courses: Spring 2012, Fall 2012, Fall 2013, Fall 2013, Fall 2014, Fall 2015, Spring 2017

Grades: The grade for the course will be determined by the assignments and the project. There will be no final exam in either the January or the September exam period.

Logistics: The slides with the logistics for the class (pdf)

Announcements

· Thursday 1/2: Question 4 of Assignment 3: In this question you are asked to compare a new recommendation algorithm with the algorithms you implemented in Assignment 2, on the data you created in Assignment 2. For this question you can improve upon your solution to Assignment 2 and correct some of the mistakes you made in the data generation. Here are some common errors in Assignment 2:

o Iterative pruning: The question asked for iterative pruning of the users and businesses until, in the data you created, all users had rated at least 10 businesses, and all businesses were rated by at least 10 users. Each time you prune a user or a business, the number of ratings received by the businesses or given by the users changes. You need to do the pruning iteratively, until no further pruning is possible.

o Data Structures: Some students complained that their program was too slow or consumed too much memory. For the latter, you should load into memory only the data from Toronto. The filtering step should happen while you read the data; you should never load all the lines of the file into memory. For the former, you should use appropriate data structures. Lists are very slow when you need to check whether a user or a business is in the data. The most reasonable data structure is a dictionary whose values are themselves dictionaries. You may need more than one such dictionary, one for users and one for businesses.

o Sampling: You should sample ratings, not users or businesses.

o Similarity: To compute similarity, subtract the mean of the non-zero entries (of a row or a column) from the non-zero entries, and then take the cosine similarity. Don’t use the mean of the full row or column, as it contains a lot of zeros.

o Neighbors: When you take the K nearest neighbors of a user, you should take the K nearest neighbors that have also rated the business in question (or, for a business, the K nearest businesses that have been rated by the user in question). Otherwise you may include a lot of zero values.

o SVD: Experiment with many values of K for SVD, and also consider larger values (e.g., K = 100).
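
To make the points on pruning, data structures, similarity and neighbors more concrete, here is a minimal Python sketch. It is not the reference solution for the assignment. The function names, the tab-separated input format (user, business, rating) and the dictionary layout (user -> {business: rating}) are all assumptions made for illustration; adapt them to your own data files.

```python
import math

def read_ratings(lines, keep_businesses):
    """Build user -> {business: rating}, filtering while reading.
    Assumes hypothetical tab-separated lines: user, business, rating.
    keep_businesses should be a set (e.g. the Toronto businesses)."""
    by_user = {}
    for line in lines:  # also works on an open file, one line at a time
        user, business, rating = line.rstrip("\n").split("\t")
        if business in keep_businesses:  # constant-time set lookup
            by_user.setdefault(user, {})[business] = float(rating)
    return by_user

def iterative_prune(ratings, min_count=10):
    """Repeatedly drop users and businesses with fewer than min_count
    ratings, until no further pruning is possible."""
    by_user = {u: dict(r) for u, r in ratings.items()}
    by_biz = {}  # reverse index: business -> {user: rating}
    for u, r in by_user.items():
        for b, v in r.items():
            by_biz.setdefault(b, {})[u] = v
    changed = True
    while changed:
        changed = False
        for u in [u for u, r in by_user.items() if len(r) < min_count]:
            for b in by_user.pop(u):
                del by_biz[b][u]  # this business loses one rating
            changed = True
        for b in [b for b, r in by_biz.items() if len(r) < min_count]:
            for u in by_biz.pop(b):
                del by_user[u][b]  # this user loses one rating
            changed = True
    return by_user, by_biz

def centered(vec):
    """Subtract the mean of the rated entries only (zeros are absent)."""
    if not vec:
        return {}
    mean = sum(vec.values()) / len(vec)
    return {k: v - mean for k, v in vec.items()}

def centered_cosine(a, b):
    """Cosine similarity of two mean-centered sparse rating vectors."""
    ca, cb = centered(a), centered(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_neighbors(user, business, by_user, by_biz, k):
    """K most similar users among those who have rated `business`."""
    candidates = [v for v in by_biz.get(business, {}) if v != user]
    scored = [(centered_cosine(by_user[user], by_user[v]), v)
              for v in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]
```

Keeping both indexes (by user and by business) means that each pruning pass and each neighbor lookup touches only the relevant ratings, instead of scanning a list of all ratings. The same functions can be used for business-to-business similarity by swapping the roles of the two dictionaries.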

· Sunday 24/12. Third Assignment: The third assignment is available on the Assignments page of the course.

· Sunday 26/11. Second Assignment: The second assignment is available on the Assignments page of the course.

· Friday 15/11. First Assignment: The first assignment is available on the Assignments page of the course.

· Tuesday 26/9. Welcome to Data Mining 2017!