Online Social Networks and Media -- Homework

Project Proposal Guidelines

You can find some guidelines for the project report here. Make sure that you start the report early!

Paper Presentation Guidelines

The presentations will be evaluated based on the quality of the presentation and the comprehension of the material covered. The following are some guidelines, tips, and advice for preparing your presentation.

· You have 20 minutes for the presentation if you present alone, and 25 minutes if you present as a group of two students. We will enforce the time limit and cut you off if you have not finished on time. 10 more minutes will be allocated for questions. We may randomly pick someone from the audience to ask a question, so everyone should pay attention.

· You should prepare around 20-25 slides, given that a slide takes around a minute to talk about on average.

· Break your presentation into thematic units. The following flow is very common:

Motivate why the problem is important and give a high level idea;
Define the problem clearly;
Present the main idea and the fundamental algorithms;
Present the results (experimental or theoretical or both);
Conclusions.

· The talk should be self-contained. Do not assume that the audience has read the paper, or some previous work that you consider known. Define all the concepts you need and all the notation that you use. Refer only to related work that you know.

· Since the time for the talk is short, you will need to focus on the important parts of the paper and avoid going through all the details. The goal is to give a summary of the paper and convey a clear message. Just because you read the whole paper does not mean that you should present everything. At the same time, you should not skip important information. Choosing the right parts to present is important, since it shows that you understood the paper well.

· Prepare the slides carefully. Do not add too much text, and include only the necessary math symbols. Do not use full sentences, but rather keywords and short phrases. Make sure the slides are readable and not overloaded. Never project parts of the paper PDF.

· Practice! Good talks are the result of a lot of practice even if they seem spontaneous and fun to the audience. Practice the talk several times, and time yourself to make sure you are within the time bounds.

Some fun advice on how to give a bad talk (and more) here.

Project Assignment

The assignment of projects (projects were handed out in First-Come-First-Serve order):

· Χρυσάνθη Κοσυφάκη, Ελένη Παχή: Topic 2. Food related information from Twitter

· Διονύσης Κεφαλληνός, Αστέριος Μέρκος: Topic 8. Twitter profiles

· Ιωάννης Λάζος, Σοφία Μπάτση: Topic 10. Community detection using random walks

· Δημήτριος Μπουχάρας, Ελένη Σκορδά: Topic 1. Team formation with negative links

· Γιώργος Χριστοδούλου, Αριστείδης Χροναράκης: Topic 7. Teams with status

· Γιώργος Μάμαλης, Στέφανος Μάμαλης: Topic 4. Fair Link Analysis

Projects

The list of projects is available here. The assignment is First-Come-First-Serve.

The timeline for the projects is as follows:

Week before Christmas: Submit a ~2-page project proposal outlining what you plan to do. This should include the topic of your presentation. Present the proposed project in class. Set up a web page for the project.
January 10: Presentations.
January 31: Submit full project.

Assignment 2

Due on Wednesday 6/12 in class.

Question 1

In class, we described the DeGroot opinion formation model and we said that it is guaranteed to converge to consensus. In this question you will prove this claim using a connection of the DeGroot model and random walks.

1. Prove that $z^{t} = P z^{t-1}$ for some matrix $P$, where $z^{t}$ is the vector of opinions at step $t$. What properties does matrix $P$ have? How can we express the vector $z^{t}$ after an infinite number of steps as a function of the initial vector of opinions $z^{0}$?

2. Using what you proved in 1, prove that the DeGroot model after infinite steps will converge to a vector of opinions that are all the same. Explain mathematically and intuitively what the consensus opinion will be.

3. (Optional) Explain why there is consensus for the DeGroot model and not for the Friedkin-Johnsen model.
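A small numerical experiment can help build intuition before attempting the proof. The sketch below assumes the variant of the update in which every node replaces its opinion with the plain average of its own and its neighbors' opinions. The function names `degroot_step` and `run_degroot`, the example graph, and the initial opinions are illustrative assumptions, not part of the assignment.

```python
def degroot_step(adj, opinions):
    """One synchronous update: each node averages its own and its neighbors' opinions."""
    new = {}
    for node, neighbors in adj.items():
        values = [opinions[node]] + [opinions[v] for v in neighbors]
        new[node] = sum(values) / len(values)
    return new


def run_degroot(adj, opinions, steps=200):
    """Apply the averaging step repeatedly and return the final opinions."""
    for _ in range(steps):
        opinions = degroot_step(adj, opinions)
    return opinions


# Hypothetical connected, undirected example graph: the path 0-1-2-3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
initial = {0: 0.0, 1: 0.2, 2: 0.9, 3: 1.0}
final = run_degroot(adj, initial)
print(final)
```

Printing `final` for different initial vectors and different graphs is a quick way to check a conjecture about the consensus value before proving it.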

Question 2

Consider a graph that is a binary tree of depth . Compute the expected spread for the Independent Cascade model if the root of the tree is the seed node, and each edge in the tree has transmission probability p. Give a closed formula for the expectation.

Hint: Consider separately the cases ,, and .
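A closed formula can be sanity-checked against a simulation. The following is a minimal Monte Carlo sketch under several assumptions that are not stated in the question: the tree is a complete binary tree of finite depth, the root sits at depth 0, and the parameter names `depth`, `runs`, and `seed` are made up for illustration.

```python
import random


def simulate_spread(depth, p, rng):
    """One Independent Cascade run on a complete binary tree seeded at the root.

    Returns the number of activated nodes, the root included.
    """
    active_at_level = 1  # the root is always active
    total = 1
    for _ in range(depth):
        # Every active node tries each of its two child edges independently.
        activated = sum(1 for _ in range(2 * active_at_level) if rng.random() < p)
        total += activated
        active_at_level = activated
        if activated == 0:
            break  # the cascade died out
    return total


def estimate_expected_spread(depth, p, runs=10000, seed=0):
    """Average the spread over many independent runs."""
    rng = random.Random(seed)
    return sum(simulate_spread(depth, p, rng) for _ in range(runs)) / runs


print(estimate_expected_spread(depth=3, p=0.5))
```

The two extreme cases are deterministic and easy to verify by hand: with p = 0 only the root is active, and with p = 1 every node of the tree is active.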

Question 3 (To be done in teams of at most 2 persons)

In this question you will experiment with link analysis ranking for finding important hashtags in a Twitter network.

You are given the file clinton_trump_tweets.txt. This file consists of a set of tweets by followers of Trump and Clinton. The file contains tab-separated entries with 14 columns that correspond to the following fields:

Name, ScreenName, UserID, FollowersCount, FriendsCount, Location, Description, CreatedAt, StatusID, Language, Place, RetweetCount, FavoriteCount, Text

Each line of the file is a tweet. Using this file you will create a graph of who retweets whom. Create the graph by adding an edge (xxx,yyy) if the user with screen name xxx retweeted a message from user with handle @yyy. For example, the line

Saint Saint2205 1537088066 70 133 Fri Oct 28 20:27:12 EEST 2016 792055126408048640 en null 27 0 RT @greeneyes0084: Wikileaks Email: Hillary Campaign Struggles to Reach F**king Dumb Young People https://t.co/S6QZhY9rBH via @realalexjo

in the file will result in the creation of the edge (saint2205,greeneyes0084) in the graph.
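The edge extraction for a single entry could be sketched as follows. The sketch assumes that a retweet can be recognized by a Text field that starts with `RT @handle`; the function name `retweet_edge` and the choice to skip entries that do not have exactly 14 fields are assumptions of this example.

```python
import re

# Assumption: a retweet is recognizable by a Text field that starts with "RT @handle".
RETWEET_PATTERN = re.compile(r"^RT @(\w+)")


def retweet_edge(line):
    """Return (retweeter, original_author) in lowercase, or None if the line is not a retweet."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 14:
        return None  # malformed entry, e.g. a tweet text that contains a stray tab
    screen_name, text = fields[1], fields[13]
    match = RETWEET_PATTERN.match(text)
    if match is None:
        return None
    return screen_name.lower(), match.group(1).lower()


# Hypothetical reconstruction of the example entry; the empty Location and
# Description fields and the shortened text are assumptions.
example = "\t".join([
    "Saint", "Saint2205", "1537088066", "70", "133", "", "",
    "Fri Oct 28 20:27:12 EEST 2016", "792055126408048640", "en", "null",
    "27", "0", "RT @greeneyes0084: example text",
])
print(retweet_edge(example))
```

Lowercasing both names matches the example edge above and avoids creating two nodes for the same account.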

Create this graph, and then remove all nodes that are not in the file (that is, nodes that have not tweeted anything). Also, iteratively prune the nodes with degree less than 5. Take the subgraph consisting of the largest connected component of this graph. This is the graph you will work with.
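The pruning and component steps can be sketched with plain dictionaries. This sketch treats the graph as undirected when it counts degrees and finds components, which is an assumption (the assignment does not say whether degree means in-degree, out-degree, or both). The function names are illustrative.

```python
from collections import deque


def build_undirected(edges):
    """Turn (u, v) pairs into a symmetric adjacency dict {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def prune_low_degree(adj, min_degree=5):
    """Repeatedly delete nodes whose degree is below min_degree until none remain."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    queue = deque(u for u, vs in adj.items() if len(vs) < min_degree)
    while queue:
        u = queue.popleft()
        if u not in adj:
            continue  # already removed
        for v in adj.pop(u):
            if v in adj:
                adj[v].discard(u)
                if len(adj[v]) < min_degree:
                    queue.append(v)  # removing u may push v below the threshold
    return adj


def largest_component(adj):
    """Return the node set of the largest connected component, found by BFS."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

The pruning has to be iterative because deleting one node lowers the degree of its neighbors, which may then fall below the threshold themselves.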

Collect the hashtags that are used in the tweets of the users in your graph. The goal is to determine the most important hashtags in the graph. Propose a methodology that uses the PageRank algorithm on the constructed retweet graph in order to determine the importance of hashtags. Report the top-50 hashtags, and comment on the results.
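Two building blocks for this step are hashtag extraction and PageRank itself. The sketch below shows one common formulation of PageRank by power iteration (damping factor 0.85, rank of nodes without out-links spread uniformly); these choices, the regular expression for hashtags, and the function names are assumptions of the example. How node scores are turned into hashtag scores is deliberately left open, since that is the methodology you are asked to propose.

```python
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lowercase hashtags that occur in a tweet text."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration for PageRank on a directed graph {node: set of successors}."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes |= set(targets)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly over all nodes.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in out_links.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


print(extract_hashtags("Vote today! #Election2016 #vote"))
print(pagerank({"a": {"b"}, "b": {"c"}, "c": {"a"}}))
```

Library implementations such as the one in NetworkX can replace the hand-written function; if you use one, say so in the report.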

In the second step we want to determine if the Trump and Clinton followers tend to use different hashtags. You are given the file clinton_trump_user_classes.txt, which is a tab-separated file where for each Twitter UserID (not screen name) we have a 0/1 value. Class 0 corresponds to Trump followers, while class 1 corresponds to Clinton followers. Using this data, split your graph into two graphs: one for the Trump followers and one for the Clinton followers. Take the largest connected component of each graph. Apply the ranking algorithms you devised in the first part and report again the top-50 hashtags. Comment on the results. Is there a difference between the two graphs, and is there a difference from the overall ranking? Also, if you take out the political hashtags, are the top non-political hashtags different between the two populations? (You will need to manually inspect the hashtags to determine that.)
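Since the graph is keyed by screen name while the class file is keyed by UserID, the split needs a mapping between the two, which can be collected from the ScreenName and UserID columns of the tweets file. A minimal sketch, in which the function name and the three dictionary arguments are assumptions of the example:

```python
def split_by_class(adj, user_id_of, class_of):
    """Return two induced subgraphs, one per class label (0 and 1).

    adj:        {screen_name: set of neighbor screen names}
    user_id_of: {screen_name: UserID}
    class_of:   {UserID: 0 or 1}
    Users without a known class are left out of both subgraphs.
    """
    groups = {0: set(), 1: set()}
    for name in adj:
        label = class_of.get(user_id_of.get(name))
        if label in groups:
            groups[label].add(name)
    return {
        label: {u: adj[u] & members for u in members}
        for label, members in groups.items()
    }
```

The UserID keys must have the same type in both dictionaries (for example, both strings as read from the files), otherwise every lookup silently fails.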

Assignment 1

Due on Wednesday 15/11 in class.

The goal of this assignment is for you to familiarize yourself with network measurements and network generation models, and experiment with community detection/graph clustering algorithms. You should submit a report and a zip file with your code. In the report, give a clear presentation of your results, and a detailed analysis of your observations. Your assignment will be graded based on the report, so it is very important that the report is well written and well presented.

You can work in teams of up to 3 members.

You can either write your own code or use implementations provided by SNAP, NetworkX, or other sources. Specify this in your report.

You will use the RETWEETL14 dataset which you can download from here.

Dataset description.

The RETWEETL14 dataset includes 10 graphs. Each graph corresponds to the retweets that contain a given hashtag. Nodes in the graph correspond to users. There is an edge between two users if there is at least one retweet between them.

There are three files for each dataset.

Retweet-graph: An edge exists between two users u and v if there is at least one retweet between them that uses the hashtag.

File format: u \t v

Retweets.txt: Contains the tweets that each user in the network has retweeted that use the specific hashtag.

File format: u \t Retweet

UserLabels.txt: Contains the label (C or T for Clinton or Trump) for each user. Some users may not appear in this file, which means that we have no information about their label. There is no user with both labels.

File format: u \t Label
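All three files share the same two-column, tab-separated layout, so a single reader can serve for all of them. A minimal sketch, where the function name `read_pairs` and the inline sample data are assumptions of the example:

```python
import io


def read_pairs(handle):
    """Parse lines of the form 'left<TAB>right' into a list of (left, right) tuples."""
    pairs = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        # Split on the first tab only, because a retweet text may contain tabs.
        left, right = line.split("\t", 1)
        pairs.append((left, right))
    return pairs


# Hypothetical sample content standing in for the real files.
edges = read_pairs(io.StringIO("u1\tu2\nu2\tu3\n"))
labels = dict(read_pairs(io.StringIO("u1\tC\nu3\tT\n")))
print(edges, labels)
```

With real files, pass an open file object instead of `io.StringIO`, for example `open(path, encoding="utf-8")`. A line without a tab raises a `ValueError`, which makes format problems visible early.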

Question 1 [models and measurements]

Consider the following graphs:

(1) One of the 10 graphs in the RETWEETL14 dataset (each team should select a separate graph).

(2) An Erdős-Rényi random graph.

(3) A graph generated using preferential attachment.

(4) A graph generated using the forest fire model.

The number of nodes and (when possible) the number of edges of each synthetically generated graph should be the same as in the RETWEETL14 graph you selected.
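If you write your own generators, two of the models could be sketched as follows. The sketch uses the G(n, m) variant of the Erdős-Rényi model, which fixes the number of edges exactly, and a simple preferential attachment process in which each new node links to `k` distinct existing nodes. Both variants, the seed clique, and all function and parameter names are assumptions of the example. The forest fire model is not covered here.

```python
import random


def erdos_renyi_gnm(n, m, seed=0):
    """Sample m distinct undirected edges uniformly at random among n nodes."""
    if m > n * (n - 1) // 2:
        raise ValueError("too many edges for a simple graph on n nodes")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return edges


def preferential_attachment(n, k, seed=0):
    """Grow a graph in which each new node links to k existing nodes chosen
    with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = set()
    # Start from a clique on k + 1 nodes so that every node has positive degree.
    for u in range(k + 1):
        for v in range(u + 1, k + 1):
            edges.add((u, v))
    # Each node appears in this list once per incident edge.
    endpoints = [x for edge in edges for x in edge]
    for new in range(k + 1, n):
        targets = set()
        while len(targets) < k:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.add((t, new))
            endpoints.extend((t, new))
    return edges
```

For preferential attachment the edge count can only be matched approximately: the graph has k(k+1)/2 + k(n-k-1) edges, so choosing k close to m/n gives the nearest match.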

For these graphs:

(a) Plot the degree distributions for each graph. Produce 5 plots (simple distribution, bins of equal size, bins of exponential size, cumulative, Zipf). All plots should be in log-log scale. For each of the 5 cases above, include all graphs in the same plot.
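The data behind several of these plots can be computed from the list of node degrees. The sketch below covers the simple distribution, exponential binning, the cumulative plot, and the Zipf (rank) plot. The exact definitions used in class may differ, so the bin normalization, the "degree at least d" convention for the cumulative plot, and the function names are assumptions of the example.

```python
from collections import Counter


def degree_counts(degrees):
    """Simple distribution: {degree: number of nodes with that degree}."""
    return Counter(degrees)


def exponential_bins(degrees, base=2):
    """Histogram with bins [1, base), [base, base**2), ...; counts are divided
    by the bin width so that wide bins are comparable with narrow ones."""
    bins = {}
    for d in degrees:
        if d <= 0:
            continue  # degree 0 cannot be shown on a logarithmic axis
        low = 1
        while low * base <= d:
            low *= base
        bins[low] = bins.get(low, 0) + 1
    return {low: count / (low * base - low) for low, count in sorted(bins.items())}


def cumulative(degrees):
    """Fraction of nodes with degree at least d, for every observed degree d."""
    counts = Counter(degrees)
    total = len(degrees)
    result, remaining = {}, total
    for d in sorted(counts):
        result[d] = remaining / total
        remaining -= counts[d]
    return result


def zipf(degrees):
    """Rank plot: (rank, degree) pairs with the largest degree at rank 1."""
    return list(enumerate(sorted(degrees, reverse=True), start=1))
```

Each function returns plain Python data that can be passed to any plotting library with both axes set to logarithmic scale.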

(b) Report the effective diameter for all graphs.
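One common definition of the effective diameter is the smallest number of hops within which 90% of all connected node pairs can reach each other. The sketch below computes this integer version exactly by running a breadth-first search from every node. The 90% threshold, the absence of interpolation, and the function names are assumptions; check them against the definition given in class.

```python
from collections import Counter, deque


def distance_histogram(adj):
    """Count ordered node pairs by shortest-path distance (BFS from every node)."""
    hist = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    hist[dist[v]] += 1
                    queue.append(v)
    return hist


def effective_diameter(adj, quantile=0.9):
    """Smallest distance d such that at least `quantile` of the connected pairs
    are within d hops of each other (integer version, no interpolation)."""
    hist = distance_histogram(adj)
    total = sum(hist.values())
    reached = 0
    for d in sorted(hist):
        reached += hist[d]
        if reached >= quantile * total:
            return d
    return 0  # graph without any connected pair


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(effective_diameter(path))
```

Running a search from every node takes time proportional to the number of nodes times the number of edges, so for the larger graphs it may be necessary to start from a random sample of source nodes instead.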

Question 2 [community detection]

Select 4 of the graphs in the RETWEETL14 dataset, two with “political” hashtags (e.g., “climate change”) and two with non-political ones (e.g., “supernatural”).

(a) Find communities in these graphs using any two of the community detection algorithms presented in class. Report the number of clusters and the size of each cluster. Also, evaluate the quality of the clusters using the metrics described in class. If necessary, experiment with different numbers of clusters.

(b) Use the labels of the users to evaluate the homogeneity of the clusters. Describe the metric you use in the report, and if necessary devise your own metric for homogeneity.
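As one possible starting point, homogeneity can be measured by the purity of each cluster: the share of its labeled users that carry the majority label. Purity is only an example of such a metric, not the one required by the assignment, and the function name and input layout below are assumptions.

```python
from collections import Counter


def cluster_purity(clusters, labels):
    """clusters: {cluster_id: iterable of users}; labels: {user: 'C' or 'T'}.

    Returns {cluster_id: purity}, where purity is the share of labeled users
    that carry the cluster's majority label. Clusters without any labeled
    user map to None.
    """
    result = {}
    for cid, members in clusters.items():
        counts = Counter(labels[u] for u in members if u in labels)
        labeled = sum(counts.values())
        result[cid] = max(counts.values()) / labeled if labeled else None
    return result


clusters = {0: ["a", "b", "c"], 1: ["d", "e"], 2: ["f"]}
labels = {"a": "C", "b": "C", "c": "T", "d": "T"}
print(cluster_purity(clusters, labels))
```

Unlabeled users are ignored here, so a cluster with very few labeled members can look perfectly pure; reporting the number of labeled users per cluster alongside the purity guards against that.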

Question 3 [homophily] (optional)

Choose two of the “political” retweet graphs. Use the tweets of the corresponding users to study the homophily among the users in each graph. Describe your methodology in your report, and if necessary devise your own methodology for measuring homophily.