CS059 – Data Mining

Fall 2012

 


Material

Books and Slides

·         Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman. Free online book. Slides from the course.

·         Material from the book "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber.

·         Material from the book "Introduction to Data Mining" by Tan, Steinbach, Kumar.

·         Material from the book "Introduction to Information Retrieval" by C. Manning, P. Raghavan, H. Schütze.

·         Material from the book "Networks, Crowds, and Markets" by D. Easley, J. Kleinberg.

 

Software

·         WEKA Data Mining Software: A software package that implements multiple data mining tools.

·         FIMI: Frequent Itemset Mining Implementations: A repository of implementations for frequent itemset mining. All implementations assume the input format of the example datasets: a text file in which each row is a basket of space-separated integers that represent the items.
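The basket format described above can be read with a few lines of Python. This is a minimal sketch, not part of FIMI or WEKA: the function names `read_baskets` and `item_support` and the sample rows are illustrative assumptions, and blank rows are skipped by choice.

```python
from collections import Counter
from typing import Iterable, List


def read_baskets(lines: Iterable[str]) -> List[List[int]]:
    """Parse rows of space-separated integers, one basket per row."""
    baskets = []
    for line in lines:
        tokens = line.split()  # splits on any run of whitespace
        if not tokens:
            continue  # assumption: blank rows carry no basket
        baskets.append([int(tok) for tok in tokens])
    return baskets


def item_support(baskets: List[List[int]]) -> Counter:
    """Count, for each item, the number of baskets that contain it."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))  # count an item once per basket
    return counts


# Hypothetical sample rows in the format described above.
sample = ["1 2 3", "2 3", "", "1 3 4"]
baskets = read_baskets(sample)
support = item_support(baskets)
print(baskets)
print(support[3])
```

To read a real dataset, pass an open file object instead of the sample list, since iterating over a file yields its rows one at a time.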

 

Datasets

·         UCI Machine Learning Repository

o    Data for Assignment 4:

§  The Iris dataset (ARFF file). The link to UCI repository.

§  The Mushroom dataset (ARFF file). The link to UCI repository.

§  The SpamBase dataset (ARFF file). The link to UCI repository.

·         MovieLens Datasets by GroupLens Research

·         Twitter data from the paper “What is Twitter, a Social Network, or a News Media?” by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. For the first assignment, you need the Restricted User Profiles data file. The fields in the file are explained on the page; the one you need is the eleventh field, which holds the profile description.

·         English Stopwords. A text file with a list of English stopwords.

·         SpamAssassin.

·         Stanford Network Analysis Project Datasets.

·         Movie-Actor Graph. Each line in the file is a tab-separated movie-actor pair, i.e., it corresponds to one edge in the graph.
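Two of the files listed above have simple line-oriented layouts that are easy to read directly. The sketch below is an illustration under stated assumptions, not course-provided code: it assumes the movie comes first in each tab-separated movie-actor pair, and that the Twitter profile file is also tab-separated (the page describing the fields is the authority on the actual separator). All function names and sample rows are hypothetical.

```python
from collections import defaultdict


def read_movie_actor_edges(lines):
    """Build both sides of the bipartite graph from 'movie<TAB>actor' rows."""
    movie_to_actors = defaultdict(set)
    actor_to_movies = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the movie is the first column, the actor the second.
        movie, actor = line.split("\t", 1)
        movie_to_actors[movie].add(actor)
        actor_to_movies[actor].add(movie)
    return movie_to_actors, actor_to_movies


def profile_description(record, sep="\t"):
    """Return the eleventh field of a record, or None if it is missing.

    Assumption: fields are separated by `sep`; adjust it to match the
    format documented on the dataset page.
    """
    fields = record.rstrip("\n").split(sep)
    return fields[10] if len(fields) > 10 else None


# Hypothetical rows, for illustration only.
edges = ["Movie A\tActor X", "Movie A\tActor Y", "Movie B\tActor X"]
movies, actors = read_movie_actor_edges(edges)
print(sorted(movies["Movie A"]))
print(sorted(actors["Actor X"]))

record = "\t".join(f"f{i}" for i in range(1, 13))  # twelve dummy fields
print(profile_description(record))
```

Storing both directions of the edge list makes it cheap to ask either "who appeared in this movie" or "which movies did this actor appear in" without rescanning the file.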