CSE012/CS059 – Data Mining

Winter 2025

Material

Books and Slides

· The Data Science Design Manual, by Steven S. Skiena

· Material from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.

· Mining Massive Datasets by Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Free online book. Includes slides from the course.

· All of Statistics by Larry A. Wasserman

· Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze. Free online book.

· Networks, Crowds, and Markets by D. Easley, J. Kleinberg. Free online book.

· Social Media Mining by R. Zafarani, M. Ali Abbasi, H. Liu. Free online book.

· Material from the book “Data Mining: Concepts and Techniques”, by Jiawei Han and Micheline Kamber.

Springer Online Books

Recently, Springer announced a list of free online books on Machine Learning and Data Mining.

Some of the most interesting and relevant books:

· The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, Jerome Friedman

· Data Mining by Charu C. Aggarwal

· The Data Science Design Manual by Steven S. Skiena

· The Python Workbook by Ben Stephenson

Python

· Notes from the course Computational Tools for Data Science at BU

Useful Unix Commands

You may find the following Unix commands useful when pre-processing data:

· cut: allows you to get specific columns from delimited data

· sort: sorts the rows of a file in lexicographic order; use -n for numeric order

· uniq: merges consecutive rows of a file that are identical.

· grep: finds the lines of a file that contain a given string (or match a pattern)

Run “man <command>” in a Unix/Linux shell to get more information.
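
One point that is easy to miss is that uniq only merges rows that are next to each other, so it is usually run after sort. The sketch below mimics cut, sort, uniq and grep in Python to make that behaviour visible. The sample rows and the helper names (`cut`, `uniq_count`, `grep`) are made up for illustration; they are not part of any course material, and the real commands have many more options.

```python
from itertools import groupby

# Hypothetical tab-delimited sample: user id, city
rows = [
    "u1\tathens",
    "u2\tboston",
    "u3\tathens",
    "u4\tboston",
    "u5\tathens",
]

def cut(lines, field, delimiter="\t"):
    """Roughly `cut -f<field>`: keep one 1-based column of delimited data."""
    return [line.split(delimiter)[field - 1] for line in lines]

def uniq_count(lines):
    """Roughly `uniq -c`: collapse runs of identical consecutive lines."""
    return [(len(list(group)), key) for key, group in groupby(lines)]

def grep(lines, needle):
    """Roughly `grep -F`: keep the lines that contain a fixed string."""
    return [line for line in lines if needle in line]

cities = cut(rows, 2)

# Without sorting, identical values are not adjacent, so nothing is merged.
unsorted_counts = uniq_count(cities)

# Sorting first makes identical values adjacent, so the counts are correct.
sorted_counts = uniq_count(sorted(cities))

boston_rows = grep(rows, "boston")

print(unsorted_counts)
print(sorted_counts)
print(boston_rows)
```

In a shell, the sorted variant would correspond to a pipeline along the lines of cut -f2 file | sort | uniq -c, assuming a tab-delimited file.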

Software

· WEKA Data Mining Software: A software package that implements multiple data mining tools.

· FIMI: Frequent Itemsets Mining Implementation: A repository of implementations for frequent itemset mining. All implementations assume the input format of the example datasets: text file where each row is a basket consisting of space separated integers that represent the items.

· Liblinear: Software package for classification. Implements the Logistic Regression and SVM classifiers.
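
The FIMI input format mentioned above (one basket per row, items given as space-separated integers) is simple to read by hand. The following is a minimal sketch of parsing such data and counting how many baskets contain each item. The sample text and the function names `parse_baskets` and `item_support` are assumptions made for this example and do not come from the FIMI repository.

```python
from collections import Counter

# Hypothetical data in the basket format: one basket per row,
# items are space-separated integers.
sample = """1 2 3
2 3
1 3 4
"""

def parse_baskets(text):
    """Turn basket-format text into a list of item sets, skipping blank rows."""
    baskets = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            baskets.append(frozenset(int(token) for token in line.split()))
    return baskets

def item_support(baskets):
    """Count, for each item, the number of baskets that contain it."""
    counts = Counter()
    for basket in baskets:
        counts.update(basket)
    return counts

baskets = parse_baskets(sample)
support = item_support(baskets)

for item, count in sorted(support.items()):
    print(item, count)
```

To read a real dataset, the same function could be given the contents of the file, for example the result of calling read() on the opened file.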

Datasets

· The Yelp Academic Challenge dataset

· UCI Machine Learning Repository

o The Iris dataset (ARFF file). The link to UCI repository.

o The SpamBase dataset (ARFF file). The link to UCI repository.

o The Mushroom dataset (ARFF file). The link to UCI repository.

· Movie Lens Datasets by GroupLens Research

· FourSquare tips with categories: a collection of FourSquare tips on restaurants in New York (thanks to Yiannis Kotrotsios).

· FourSquare tips with categories: a collection of FourSquare tips with the category of the corresponding venue for restaurants, nightlife venues, and shops in New York (thanks to Yiannis Kotrotsios).

· FourSquare users and venues: a collection of pairs of user ids and venue names in New York, where the user with the given id has left a tip for the venue with the given name on FourSquare (thanks to Yiannis Kotrotsios).