CSE012/CS059 – Data Mining

Fall 2025

Lecture Slides

The slides for this course draw on slides and material from other courses and books. Many thanks to Tan, Steinbach and Kumar; Anand Rajaraman, Jeff Ullman and Jure Leskovec; Evimaria Terzi; and Aris Anagnostopoulos for the material from their slides that we have used in this course.

Introduction: Logistics (in Greek) (pptx, pdf)

Lecture 1: Introduction to Data Mining (pptx, pdf)

Chapters 1,2 from the book “Introduction to Data Mining” by Tan, Steinbach and Kumar.
Article: Data Scientist: The Sexiest Job of the 21st Century.

Lecture 2: What is data? The data mining pipeline. Preprocessing and postprocessing. Sampling and normalization. (pptx, pdf)

Chapter 2 from the book “Introduction to Data Mining” by Tan, Steinbach and Kumar.
Chapter 1 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman and Jure Leskovec.

Lecture 3: Data exploration and statistical analysis (pptx, pdf)

Chapter 1 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman and Jure Leskovec.
Chapters 7-8 (confidence interval, standard error), 11 (hypothesis testing), 16 (independence and correlation tests) from the book All of Statistics by Larry A. Wasserman (these chapter numbers refer to the pdf; in the printed book each chapter number is one lower).
Chapters 5,6 from The Data Science Design Manual by Steven S. Skiena
Error bars in experimental biology.

Lecture 4: Similarity and Distance. Recommendation Systems (pptx, pdf)

Chapter 2 from the book “Introduction to Data Mining” by Tan, Steinbach and Kumar.
Chapters 3,9 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman and Jure Leskovec.
Evaluation of Recommendation Systems
Article: The long tail.

Lecture 5: Dimensionality Reduction. Singular Value Decomposition (SVD). Principal Component Analysis (PCA). Model-based collaborative filtering (pptx, pdf)

Chapter 11 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman and Jure Leskovec.
Appendices A,B from the book “Introduction to Data Mining” by Tan, Steinbach and Kumar.
A tutorial on Principal Component Analysis, Jonathon Shlens
PCA notes by Aris Anagnostopoulos

Lecture 6: Clustering. The k-means algorithm. Hierarchical Clustering. The DBSCAN algorithm. Clustering Evaluation. (pptx, pdf)

Chapters 8,9 from the book “Introduction to Data Mining” by Tan, Steinbach and Kumar.
Chapter 7 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman and Jure Leskovec.

Lecture 7: Mixture Models. The EM Algorithm. (pptx, pdf)

Notes on the EM algorithm by Aris Anagnostopoulos, University of Rome La Sapienza

Lecture 8: Introduction to Supervised Learning. Linear Regression. Classification. Decision Trees - Expressiveness. Evaluation. (pptx, pdf)

Chapter 14 (in the pdf) from the book All of Statistics by Larry A. Wasserman
Chapters 4,5 from the book “Introduction to Data Mining” by Tan, Steinbach and Kumar.
Chapter 12 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman and Jure Leskovec.

Lecture 9: Nearest Neighbor Classification, Support Vector Machines, Logistic Regression, (Naive Bayes Classification). Neural Networks. Word Embeddings. The Supervised Learning pipeline. (pptx, pdf)

Chapters 4,5 from the book “Introduction to Data Mining” by Tan, Steinbach and Kumar.
Chapter 12 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman and Jure Leskovec.
Chris Manning, Natural Language Processing with Deep Learning, Lecture Notes, Part I
Chapter 13 from the book "Introduction to Information Retrieval" by C. Manning, P. Raghavan, H. Schutze

Lecture 10: Link Analysis Ranking. Web Ranking. PageRank, Random Walks, HITS. (pptx, pdf)

Chapter 21 from the book "Introduction to Information Retrieval" by C. Manning, P. Raghavan, H. Schutze
Chapter 14 from the book "Networks, Crowds, and Markets" by D. Easley and J. Kleinberg