CS059 – Data Mining

Fall 2012

 

Administrative

Class Hours: Tuesday 13:00-16:00, Room Ι3.
Instructor: Panayiotis Tsaparas (tsap _at_ cs.uoi.gr), Office Β.3

Grades: The grade for the course will be determined by the assignments.

 

Announcements

·         Thursday March 7. Final grades. You can see your final grade here. If you have any questions, please contact me within the next one or two days.

·         Friday February 15. Assignment 5. You can submit Assignment 5 until the end of day today.

·         Wednesday February 13. Assignment 4 grades. You can see the grades for Assignment 4 (together with those of the previous assignments) here.

·         Tuesday February 12. PageRank in MATLAB. If you implement PageRank in MATLAB, you should use the efficient code given in the course slides.

·         Tuesday February 5. Deadline for Assignment 5. To accommodate those who are taking the Graphics exam, the deadline for Assignment 5 is moved to February 15, 6pm.

·         Tuesday February 5. Assignment 3 grades. You can see the grades for Assignment 3 (together with those of the previous assignments) here.

·         Tuesday January 29. Assignment 2 grades. You can see the grades for Assignment 2 (together with those of Assignment 1) here.

·         Sunday January 27. Programming exercises. For the programming exercises you are required to submit a written report discussing the results of your program. You should also explain how to run the code and include some comments in it.

·         Tuesday January 22. Assignment 5. Together with your code for Assignment 5, also submit instructions on how to run your program.

·         Wednesday January 16. Deadline for Assignment 5. The new deadline for Assignment 5 is February 11. The assignment has been slightly modified, so download the new version from the web page of the class.

·         Tuesday January 15. Assignment 1 grades. You can see the grades for Assignment 1 here.

·         Tuesday January 15. Assignment 5: Assignment 5 is available on the Assignments web page.

·         Monday January 14. January 15th Lecture. A reminder that tomorrow's class will start at 1:00.

·         Monday January 7. Evaluation. A reminder that tomorrow we will have the course evaluation at the end of class.

·         Wednesday December 25. Deadline extension for Assignment 4. The deadline for Assignment 4 is extended to December 29. The assignment should be handed in by the end of the day.

·         Thursday December 13. Course Evaluation: At the end of class on Tuesday December 18, we will do the course evaluation.

·         Thursday December 13. Assignment 4: Assignment 4 is available on the Assignments web page.

·         Wednesday December 5. Extra class, Extension for Assignment 3: This Friday, December 7, there will be an extra lecture at 10-12 pm. The deadline for Assignment 3 is extended to Tuesday December 11, at the beginning of the class.

·         Sunday December 2. Assignment 3, Question 2: Two corrections for Question 2. The exercise in the textbook is 8.27, not 9.27. In the equation, there is a 2 in the denominator of the proportionality factor. The corrected assignment has been posted on the Assignments web page.

·         Sunday December 2. Assignment 3, Question 4: For the precision and recall values of k-means, report the mean value over 5 runs. In addition to the precision/recall values, also give your empirical observations about the kinds of users that are grouped together in each cluster.

·         Saturday November 24. Assignment 3: Assignment 3 is available on the Assignments web page.

·         Tuesday November 20. Extra class, Assignment 2: This Friday we will have an extra lecture at 12:00 for an hour or two. You can submit Assignment 2 until the end of day today without any penalty.

·         Sunday November 18. Assignment 2, Question 2: For the hash functions provided by the book you should take the value of the function mod 5 as your hash function.

·         Thursday November 14. New class hours: From now on class hours will be 13:30 – 16:00. In case we need to cover some more material we will schedule classes on Fridays.

·         Monday November 12. Class Hours: The class hours are still 13:00-16:00. There is a mistake in the updated schedule on the department’s home page.

·         Friday, November 9. Assignment 2: Assignment 2 is available on the Assignments web page.

·         Friday November 9. Free pass policy for Assignments: For the Assignment deadlines you have 3 “free passes”. That is, you have three days which you can use for extending the deadline of an assignment. Details on the Assignments page.

·         Thursday November 8. Turn-in of Assignment 1, Part B: You can turn in the assignment until the end of the day on Friday without any penalty.

·         Thursday November 8. Clarifications for Assignment 1, Part B: Although integers are the most common type of input, some of the implementations in FIMI may also work with strings as items instead of integers (one of your colleagues mentioned ECLAT). If this is the case, then you do not need to do the conversion to integers.

·         Thursday November 8. Clarifications for Assignment 1, Part B:

o   For question 3, if you plan to use WEKA, then each distinct word should be made into an attribute that takes values true/false, depending on whether the word is present or not. The number of attributes is then too large to fit in memory and you should use the sparse ARFF format. (For example, see the following posting: http://old.nabble.com/convert-market-basket-data-to-binary-form-for-fp-growth-td30651604.html -- there is more information online). Another idea is to throw out the words that are not frequent enough, but you may still be left with too many words. Alternatively, you can use one of the FIMI implementations (e.g., the LCM implementation is easy to use). In the input file each row should be a “basket” and the items are integers (separated with spaces), so you would need to assign an id to each word.

o   For question 2, the correct way to generate and count subsequences is by considering the subsequences generated with the leftmost item in the window. A different way to count the frequency is to count the number of windows that contain a subsequence. Although this is slightly different from what the question asks, it will be accepted.
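As a rough illustration of the id assignment mentioned for question 3, the following Python sketch maps each distinct word to an integer and formats each basket as one line of space-separated ids. The function name `encode_baskets` and the choice to number ids from 1 in order of first appearance are assumptions made for this example; check the input requirements of the FIMI implementation you actually use.

```python
def encode_baskets(baskets):
    """Map words to integer ids and format each basket as a line of ids.

    `baskets` is a list of word lists (for example, one list per user
    description). Ids start at 1 and follow order of first appearance;
    both choices are arbitrary and made only for this sketch.
    """
    word_to_id = {}
    lines = []
    for basket in baskets:
        ids = set()
        for word in basket:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id) + 1
            ids.add(word_to_id[word])
        # Each item appears once per basket; sorting keeps the file readable.
        lines.append(" ".join(str(i) for i in sorted(ids)))
    return word_to_id, lines


if __name__ == "__main__":
    mapping, lines = encode_baskets(
        [["data", "mining", "data"], ["mining", "graphs"]]
    )
    print(mapping)
    print("\n".join(lines))
```

Keeping the returned mapping lets you translate the integer itemsets reported by the tool back into words for the report.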

·         Wednesday November 7. Time of Lecture, November 9: To avoid overlap with the class of Operating Systems the lecture will take place at 11:00-14:00.

·         Thursday November 1. Clarifications for Assignment 1, Part B:

o   For Question 2, the items of a subsequence maintain the order they have in the sequence. For example, the sequence BBAC contains the subsequence BAC, but not the subsequence ABC.

o   For Question 3, from the file with the Twitter profiles we are interested only in the 11th (eleventh) field, which holds the user description. This is the field from which you should extract the frequent itemsets (frequent sets of words). If you want to consider other (additional) fields, you can suggest it as part of option 3.

o   For Question 3, you have some freedom on how to preprocess the data. In your report you should be clear about the choices that you have made.
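The ordering rule for Question 2 above can be checked with a small helper. This is only a sketch in Python; the function name `is_subsequence` is chosen for the example and is not part of the assignment.

```python
def is_subsequence(sub, seq):
    """Return True if the items of `sub` appear in `seq` in the same order.

    The items do not need to be adjacent in `seq`, but their relative
    order must be preserved.
    """
    remaining = iter(seq)
    # `item in remaining` advances the iterator past the first match,
    # so each item of `sub` must be found after the previous one.
    return all(item in remaining for item in sub)


if __name__ == "__main__":
    print(is_subsequence("BAC", "BBAC"))  # True
    print(is_subsequence("ABC", "BBAC"))  # False
```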

·         Thursday November 1. Postponed Lecture – Extension for Assignment 1, Part B: Next week’s lecture (November 6th) is postponed to Friday November 9, 9:00-12:00. The deadline for the second part of the first assignment is extended to the start of the lecture on November 9.

·         Monday October 29. Assignment 1 – Part 2 – Question 2 - Correction: In the second question, when a subsequence appears more than once in a window of length W, it should be counted only once. For example, for the sequence AABC and W = 4, the subsequence AB should be counted only once, and not twice as was originally stated in the question. For the sequence AABCB and W = 4, the subsequence AB should be counted twice, once for the window AABC and once for the window ABCB, due to the new appearance of B at the end of the window. This correction is necessary so that the anti-monotonicity property holds. For bonus marks, give a counterexample that violates the anti-monotonicity property.
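One way to experiment with the once-per-window rule is to count, for each subsequence, the number of windows of length W that contain it. Note that a later clarification describes this window-based count as slightly different from what the question asks, though accepted, so treat the Python sketch below as an illustration and not as the reference method. The function name `window_counts` and the parameter `k` (the subsequence length) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def window_counts(sequence, w, k):
    """Count, for every length-k subsequence, the windows that contain it.

    A subsequence is counted at most once per window of length `w`,
    however many times it occurs inside that window. Sequences shorter
    than `w` have no full window and yield an empty result.
    """
    counts = Counter()
    for start in range(len(sequence) - w + 1):
        window = sequence[start:start + w]
        # combinations() keeps the original order of the items; the set
        # removes repeated occurrences within the same window.
        seen = {"".join(items) for items in combinations(window, k)}
        counts.update(seen)
    return counts


if __name__ == "__main__":
    print(window_counts("AABC", 4, 2)["AB"])
    print(window_counts("AABCB", 4, 2)["AB"])
```

For the two sequences in the correction, this gives a count of 1 for AB in AABC and 2 for AB in AABCB, matching the stated examples.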

·         Friday, October 26. Assignment 1 – part 2: Part 2 of Assignment 1 is out, on the Assignments web page.

·         Thursday, October 25. Turn-in: To turn in the first part of Assignment 1 use the command: turnin assignment1a@ple059 <your files>. Write your name and student number in the submitted files.

·         Friday, October 19. Assignment 1 – part 1: Part 1 of Assignment 1 is out, on the Assignments web page.