CSE012/CS059 - Data Mining
Class Hours: Tuesday
Grades: The grade for the course will be determined by the assignments and the project. There will be no final exam in either the January or the September exam period.
Logistics: The slides with the logistics for the class (pdf)
• Thursday 1/2: Question 4 of Assignment 3: In this question you are asked to compare a new recommendation algorithm with the algorithms you implemented in Assignment 2, on the data you created in Assignment 2. For this question you can improve upon your solution to Assignment 2 and correct some of the mistakes you made in the data generation. Here are some common errors in this question:
o Iterative pruning: The question asked for iterative pruning of the users and businesses until, in the data you created, all users had rated at least 10 businesses, and all businesses were rated by at least 10 users. Each time you prune a user or a business, the number of ratings received by the businesses or given by the users changes. You need to do the pruning iteratively, until no further pruning is possible.
o Data Structures: Some students complained that their program was too slow or consumed too much memory. For the latter, you should load into memory only the data from Toronto. The filtering step should happen while you read the data; you should never load all the lines of the file into memory. For the former, you should use the appropriate data structures. Using lists is very slow if you want to check whether a user or a business is in the data. The most reasonable data structure is a dictionary whose values are themselves dictionaries. You may need more than one such dictionary, one for users and one for businesses.
o Similarity: For the computation of similarity, you should subtract the mean of the non-zero entries (of a row or a column) from the non-zero entries, and then take the cosine similarity. Don't take the mean of the full row or column, as this contains a lot of zeros.
o Neighbors: When you take the K nearest neighbors of a user, you should take the K nearest neighbors that have also rated the business in question (or, for a business, the K nearest businesses that have been rated by the user in question). Otherwise you may include a lot of zero values.
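The points above can be illustrated with a minimal Python sketch. It is not a reference solution. It assumes the ratings have already been filtered to Toronto while reading the file and are available as (user, business, stars) triples. The function names (build_indexes, prune, centered, cosine, nearest_raters) and the min_count and k parameters are hypothetical names chosen for this illustration.

```python
import math


def build_indexes(ratings):
    """Build two dict-of-dicts indexes from (user, business, stars) triples."""
    by_user, by_biz = {}, {}
    for user, biz, stars in ratings:
        by_user.setdefault(user, {})[biz] = stars
        by_biz.setdefault(biz, {})[user] = stars
    return by_user, by_biz


def prune(by_user, by_biz, min_count=10):
    """Prune in place until every user and business has >= min_count ratings."""
    changed = True
    while changed:  # repeat, because each removal changes the other side's counts
        changed = False
        for user in [u for u, r in by_user.items() if len(r) < min_count]:
            for biz in by_user.pop(user):
                del by_biz[biz][user]
            changed = True
        for biz in [b for b, r in by_biz.items() if len(r) < min_count]:
            for user in by_biz.pop(biz):
                del by_user[user][biz]
            changed = True
    return by_user, by_biz


def centered(ratings):
    """Subtract the mean of the existing (non-zero) ratings only."""
    if not ratings:
        return {}
    mean = sum(ratings.values()) / len(ratings)
    return {key: value - mean for key, value in ratings.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_raters(user, biz, by_user, by_biz, k=10):
    """Return the k users most similar to `user` among those who rated `biz`."""
    target = centered(by_user[user])
    candidates = [
        (cosine(target, centered(by_user[other])), other)
        for other in by_biz.get(biz, {})  # only users who rated this business
        if other != user
    ]
    return sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
```

The same functions work for business-to-business similarity by swapping the roles of the two indexes. Membership tests such as `key in b` are constant time on average for dictionaries, which is the reason for avoiding lists here.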