Online Social Networks and Media -- Homework

network

Online Social Networks and Media
Homework

Home

Homework

Slides & References

Reading Material

Resources

Project Proposal Guidelines

You can find some guidelines for the project report here. Make sure that you start the report early!

Presentation Schedule

The updated schedule for the presentations is as follows:

Wednesday 18/1, starting at 13:00:

· Μαρία Ζέρβα, Γιάννης Κουβάτης:

o Discovering top-k teams of experts with/without a leader in social networks. Mehdi Kargar, Aijun An, CIKM 2011

· Χρήστος Σπαθάρης:

o Antisocial Behavior in Online Discussion Communities. Justin Cheng, Cristian Danescu-Niculescu-Mizil, Jure Leskovec, ICWSM 2015

· Μαρία Παππά:

o Tracking the Evolution of Communities in Dynamic Social Networks, Derek Greene, Dónal Doyle, Padraig Cunningham, ASONAM 2010

· Αδαμόπουλος Γιώργος, Βαλεκάρδας Δημήτρης:

o A hashtag recommendation system for twitter data streams. Eriko Otsuka, Scott A. Wallace, David Chiu, Computational Social Networks, 2016

· Κωνσταντίνος Δημολίκας, Chaysri Piyabhum:

o Quantifying Controversy in Social Media. Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, Michael Mathioudakis, WSDM 2016

· Παπαμιχαήλ Άγγελος:

o Near linear time algorithm to detect community structures in large-scale networks. Usha Nandini Raghavan, Reka Albert, Soundar Kumara, Physical Review E 76, 2007

· Βιργινία Τσίντζου:

o Density-based place clustering in geo-social networks. Shi, J., Mamoulis, N., Wu, D., & Cheung, D. W., SIGMOD 2014

Paper Presentation Guidelines

The presentations will be evaluated based on the quality of the presentation, and the comprehension of the material covered. The following are some guideline, tips and advice for preparing your presentation.

· You have 20 minutes for the presentation (1 student group) and 25 minutes (2 students group). We will enforce the time limit and cut you off if you have not completed on time. 10 more minutes will be allocated for questions. We may randomly pick someone from the audience to ask a question, so everyone should pay attention.

· You should prepare around 20-25 slides, given that a slide takes around a minute to talk about on average.

· Break you presentation into thematic units. The following flow is very common:

Motivate why the problem is important and give a high level idea;
Define clearly the problem;
Present the main idea and the fundamental algorithms;
Present the results (experimental or theoretical or both);
Conclusions.

· The talk should be self-contained. Do not assume that the audience has read the paper, or some previous work that you consider known. Define all the concepts you need and all the notation that you use. Refer only to related work that you know.

· Since the time for the talk is short, you will need to focus on the important parts of the paper and avoid going through all the details. The goal is to give a summary of the paper and have a clear message. Just because you read all the paper it does not mean that you should present everything. At the same time, you should not skip important information. Focusing on the right part to present is important since it shows that you understood the paper well.

· Prepare the slides carefully. Do not add too much text, and only the math symbols necessary. Do not use full sentences, but rather keywords and short phrases. Make sure the slides are readable and not too loaded. Never ever project parts of the paper pdf.

· Practice! Good talks are the result of a lot of practice even if they seem spontaneous and fun to the audience. Practice the talk several times, and time yourself to make sure you are within the time bounds.

Some fun advice on how to give a bad talk (and more) here.

Projects

The list of projects is available here. The assignment is First-Come-First-Serve.

The timeline for the projects is as follows:

Week before Christmas: Submit a ~2-page project proposal outlining what you plan to do. This should include the topic of your presentation. Present the proposed project in class. Set up the web page for the project.
January 11: Presentations.
January 31: Submit full project.

Assignment 2

Due on Wednesday 7/12 in class.

For this assignment, you can either write your own code or use implementations provided by SNAP, NetworkX, or other sources. Specify this in your report. Question 2 can be done in teams of two, while the remaining questions should be done individually.

Question 1

Consider an alternative definition of density for a set of nodes S in a graph G to be the minimum degree of any node in S in the induced subgraph G[S], that is,

where G[S] denotes the subgraph induced by the set of nodes S in the graph G, and degree(v,G[S]) denotes the degree of node in the subgraph G[S]. Prove that the greedy algorithm described in class that iteratively removes the node with the minimum degree, and keeps the set with the maximum density is optimal for the density function. (Hint: Consider the first time that the algorithm removes a node of the optimal solution).

Question 2

For this Question you will use again the DBLP10 dataset you used for Assignment 1. Use the link analysis ranking techniques we described in class to propose two algorithms for ranking the conferences in the data. Your algorithms should make use of the network between the authors. You can use existing implementations of the link analysis ranking algorithms when available.

Question 3

In class, we studied the problem of selecting seed nodes to maximize the expected spread of influence in a network for the independent cascade model. Consider the case that we are given a set of seed nodes S that initiate the spread of harmful information. We want to select a set of (non-seed) nodes to “immunize” so as to stop the spread of the harmful information. Immunized nodes do not propagate the information, thus blocking the spread. Given a network G, the set S, and a budget K, we want to find a set of nodes B of size K to immunize such that the expected number of “protected nodes” that never receive the information, P(B), is maximized.

a. Show that the function P(B) is not a submodular function. Note that for this proof it suffices to construct a counter-example where the submodularity property does not hold. (Hint: Assume deterministic propagation of the information).

b. Propose a heuristic algorithm for selecting the set B.

Teams for the Assignment 1

Team 1 (Forest Fire): Μαρία Ζέρβα, Ιωάννης Κουβάτης, Χρήστος Σπαθάρης

Team 2 (Kronecker graph): Άγγελος Παπαμιχαήλ, Δημήτρης Βαλεκάρδας, Βιργινία Τσίντζου, Γιώργος Αδαμόπουλος

Team 3 (Preferential attachment and copying model): Μαρία Παππά, Κωνσταντίνος Δημολίκας, Chaysri Piyabhum

Assignment 1

Due on Wednesday 16/11 in class.

The goal of this assignment is for you to familiarize yourself with network measurements and network generation models, and experiment with community detection/graph clustering algorithms. This is a group assignment. There will be 3 teams, and each team will work with a different graph model.

For this assignment, you can either write your own code or use implementations provided by SNAP, NetworkX, or other sources. Specify this in your report.

You will use the DBLP10 dataset which you can download from here.

Dataset description.

The DBLP₁₀ dataset includes publications from computers science conference between 2006 and 2015. Nodes correspond to authors. There is an edge between two authors if they have written an article together.

Co-authorship: Data in the form (id1, id2) meaning that author with id1 co-authored an article with author with id2.

Authors: Data in the form (id, n) indicating that the author (node) with identifier id has name n.

Label: Data in the form (id, c) indicating that author with identifier id wrote a paper at conference c (a node may have multiple labels).

Question (1) [models and measurements]

Consider the following graphs:

(1) The DBLP₁₀ co-authorship graph,

(2) An Erdos-Renyi random graph with the same number of nodes as DBLP₁₀graph, and

(3) A graph generated using one of the following alternatives: (a) a Kronecker graph, (b) preferential attachment and copying models, or (c) forest fire. Again the number of nodes of the generated graph should be similar to the number of nodes of DBLP.

For these graphs:

(a) Plot the degree distribution. Produce 5 plots (simple distribution, bins of equal size, bins of exponential size, cumulative, zipf). All plots should be in log-log scale.

(b) Report the effective diameter.

Questions (2) [homophily]

Use the labels of the authors in the co-authorship to study the homophily of the DBLP₁₀ co-authorship graph. Devise your own methodology for measuring homophily and describe it in your report.

Questions (3) [clustering]

(a) Apply clustering on the co-authorship graph. Report the number of clusters, information about their size and the name of the authors in the largest cluster. Also, evaluate the quality of clusters using the metrics described in class. You can use any of the clustering algorithms we presented in class. If necessary, experiment with different number of clusters.

(b) Use the labels of the authors to evaluate the homogeneity of the clusters. Devise your own metric for this, and describe it in the report.