CSE012/CS059 Data Mining

Fall 2022






Lecture Slides

The slides for this course draw on slides and material from other courses and books. We thank in advance Tan, Steinbach, and Kumar; Anand Rajaraman, Jeff Ullman, and Jure Leskovec; Evimaria Terzi; and Aris Anagnostopoulos for the material from their slides that we have used in this course.

Introduction: Logistics (in Greek) (pptx, pdf)

Lecture 1: Introduction to Data Mining (pptx, pdf)

Tutorial 1: Introduction to discrete probabilities. (pptx, pdf)

  • Thanks to Aris Anagnostopoulos for the slides.
  • Part I from the book All of Statistics by Larry A. Wasserman

Lecture 2: What is data? The data mining pipeline. Preprocessing and postprocessing. Sampling and normalization. (pptx, pdf)

Lecture 3: Data exploration and statistical analysis (pptx, pdf)

  • Chapter 1 from the book Mining Massive Datasets by Anand Rajaraman, Jeff Ullman, and Jure Leskovec.
  • Chapters 7-8 (confidence interval, standard error), 11 (hypothesis testing), and 16 (independence and correlation tests) from the book All of Statistics by Larry A. Wasserman (these chapter numbers refer to the PDF; in the printed book each chapter number is one lower).
  • Error bars in experimental biology.

Tutorial 2: Introduction to notebooks. Python reminders.

Lecture 4: Similarity and Distance. Recommendation Systems (pptx, pdf)

Tutorial 3: Introduction to the Pandas library (ipynb, html)

  • The files for the notebooks
  • Notes from the class of Evimaria Terzi and Mark Crovella

Tutorial 4: Libraries for statistical analysis and plotting

Lecture 5: Dimensionality Reduction. Singular Value Decomposition (SVD). Principal Component Analysis (PCA). Model-based collaborative filtering (pptx, pdf)

Tutorial 5: Introduction to the Numpy and SciPy libraries for matrix manipulation (ipynb, html).

Lecture 6: Clustering. The k-means algorithm. Hierarchical Clustering. The DBSCAN algorithm. Clustering Evaluation. (pptx, pdf)

Tutorial 6: Libraries for data preprocessing (ipynb, html)

Lecture 7: Mixture Models. The EM Algorithm. (pptx, pdf)

Tutorial 7: Introduction to the scikit-learn (sklearn) library for clustering (ipynb, html)

Lecture 8: Introduction to Supervised Learning. Linear Regression. Classification. Decision Trees - Expressiveness. Evaluation. (pptx, pdf)

Lecture 9: Nearest Neighbor Classification, Support Vector Machines, Logistic Regression, (Naive Bayes Classification). Neural Networks. Word Embeddings. The Supervised Learning pipeline. (pptx, pdf)

Tutorial 8: Introduction to the scikit-learn library and applications to classification. The gensim library and word embeddings. (Notebook: ipynb, html).

Lecture 10: Link Analysis Ranking. Web Ranking. PageRank, Random Walks, HITS. Absorbing Random Walks. (pptx, pdf)

Tutorial 9: Introduction to the NetworkX library (Notebook: ipynb, html).