Online Social Networks and Media
Project Proposal Guidelines

You can find some guidelines for the project report here. Make sure that you start the report early!

Paper Presentation Guidelines

The presentations will be evaluated on the quality of the presentation and on your comprehension of the material covered. The following are some guidelines, tips, and advice for preparing your presentation.

· You have 20 minutes for the presentation (single-student group) or 25 minutes (two-student group). We will enforce the time limit and cut you off if you have not finished on time. 10 more minutes will be allocated for questions. We may randomly pick someone from the audience to ask a question, so everyone should pay attention.
· You should prepare around 20-25 slides, given that a slide takes around a minute to talk about on average.
· Break your presentation into thematic units. The following flow is very common:
· The talk should be self-contained. Do not assume that the audience has read the paper, or some previous work that you consider known. Define all the concepts you need and all the notation that you use. Refer only to related work that you know.
· Since the time for the talk is short, you will need to focus on the important parts of the paper and avoid going through all the details. The goal is to give a summary of the paper and have a clear message. Just because you read the whole paper does not mean that you should present everything. At the same time, you should not skip important information. Choosing the right parts to present is important, since it shows that you understood the paper well.
· Prepare the slides carefully. Do not add too much text, and include only the math symbols that are necessary. Do not use full sentences, but rather keywords and short phrases. Make sure the slides are readable and not overloaded. Never project parts of the paper PDF.
· Practice! Good talks are the result of a lot of practice, even if they seem spontaneous and fun to the audience. Practice the talk several times, and time yourself to make sure you are within the time bounds.

Some fun advice on how to give a bad talk
(and more) here.

Project Assignment

The assignment of projects (projects were handed out in first-come, first-served order):

· Χρυσάνθη Κοσυφάκη, Ελένη Παχή: Topic 2. Food-related information from Twitter
· Διονύσης Κεφαλληνός, Αστέριος Μέρκος: Topic 8. Twitter profiles
· Ιωάννης Λάζος, Σοφία Μπάτση: Topic 10. Community detection using random walks
· Δημήτριος Μπουχάρας, Ελένη Σκορδά: Topic 1. Team formation with negative links
· Γιώργος Χριστοδούλου, Αριστείδης Χροναράκης: Topic 7. Teams with status
· Γιώργος Μάμαλης, Στέφανος Μάμαλης: Topic 4. Fair Link Analysis

Projects

The list of projects is available here. Projects are assigned in first-come, first-served order. The timeline for the projects is as follows:
Assignment 2

Due on Wednesday 6/12 in class.

Question 1

In class, we described the DeGroot opinion formation model and said that it is guaranteed to converge to consensus. In this question you will prove this claim using a connection between the DeGroot model and random walks.

1. Prove that
2. Using what you proved in 1, prove that after infinitely many steps the DeGroot model converges to a vector of opinions that are all the same. Explain mathematically and intuitively what the consensus opinion will be.
3. (Optional) Explain why there is consensus for the DeGroot model and not for the Friedkin-Johnsen model.

Question 2

Consider a graph that is a binary tree of depth
Hint: Consider separately the cases

Question 3

(To be done in teams of at most 2 persons)

In this
question you will experiment with link analysis ranking for finding important hashtags in a Twitter network.

You are given the file clinton_trump_tweets.txt. This file consists of a set of tweets by followers of Trump and Clinton. The file contains tab-separated entries with 14 columns that correspond to the following fields:

Name, ScreenName, UserID, FollowersCount, FriendsCount, Location, Description, CreatedAt, StatusID, Language, Place, RetweetCount, FavoriteCount, Text

Each line of the file is a tweet. Using this file you will create a graph of who retweets whom. Create the graph by adding an edge (xxx,yyy) if the user with screen name xxx retweeted a message from the user with handle @yyy. For example, the line

Saint Saint2205 1537088066 70 133 Fri Oct 28 20:27:12 EEST 2016 792055126408048640 en null 27 0 RT @greeneyes0084: Wikileaks Email: Hillary Campaign Struggles to Reach F**king Dumb Young People https://t.co/S6QZhY9rBH via @realalexjo

in the file will result in the creation of the edge (saint2205,greeneyes0084) in the graph. Create this graph, and then remove all nodes that are not in the file (that is, nodes that have not tweeted anything). Also, iteratively prune the nodes with degree less than 5. Take the subgraph consisting of the largest connected component of this graph. This is the graph you will work with.

Collect the
hashtags that are used in the tweets of the users in your graph. The goal is to determine the most important hashtags in the graph. Propose a methodology that uses the PageRank algorithm on the constructed retweet graph in order to determine the importance of hashtags. Report the top-50 hashtags, and comment on the results.

In the second step we want to determine whether the Trump and Clinton followers tend to use different hashtags. You are given the file clinton_trump_user_classes.txt, which is a tab-separated file that gives a 0/1 value for each Twitter UserID (not screen name). Class 0 corresponds to Trump followers, while class 1 corresponds to Clinton followers. Using this data, split your graph into two graphs: one for the Trump followers and one for the Clinton followers. Take the largest connected component of each graph. Apply the ranking algorithms you devised in the first part and report again the top-50 hashtags. Comment on the results. Is there a difference between the two graphs, and is there a difference from the overall ranking? Also, if you take out the political hashtags, are the top non-political hashtags different between the two populations? (You will need to manually inspect the hashtags to determine that.)

Assignment 1

Due on Wednesday 15/11 in class.

The goal of
this assignment is for you to familiarize yourself with network measurements and network generation models, and to experiment with community detection/graph clustering algorithms. You should submit a report and a zip file with your code. In the report, give a clear presentation of your results and a detailed analysis of your observations. Your assignment will be graded based on the report, so it is very important that the report is well written and well presented.

You can work in teams of up to 3 members.

You can either write your own code or use implementations provided by SNAP, NetworkX, or other sources. Specify this in your report.

You will use the RETWEETL14 dataset, which you can download from here.

Dataset description. The RETWEETL14 dataset includes 10 graphs. Each graph corresponds to the retweets that contain a given hashtag. Nodes in the graph correspond to users. There is an edge between two users if one of them has retweeted at least one tweet of the other. There are three files for each dataset.

Retweet-graph: An edge exists between two users u and v if there is at least one retweet between them that uses the hashtag.
File format: u \t v

Retweets.txt: Contains the tweets that each user in the network has retweeted that use the specific hashtag.
File format: u \t Retweet

UserLabels.txt: Contains the label (C or T for Clinton or Trump) for each user. Some users may not appear in this file; this means that we do not have information about their label. There is no user with both labels.
File format: u \t Label
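
The three file formats above can be read with a few lines of Python. The following is only a minimal sketch using the standard library: the function names and the inline sample data are hypothetical, and it assumes (this is not stated in the dataset description) that edges can be treated as undirected and that everything after the first tab in a Retweets.txt line is the tweet text.

```python
import io
from collections import defaultdict


def split_lines(lines):
    """Yield (user, rest) for every non-empty 'user<TAB>rest' line."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            user, rest = line.split("\t", 1)
            yield user, rest


def read_edges(lines):
    """Retweet-graph file (u \\t v) -> adjacency map; edges assumed undirected."""
    adj = defaultdict(set)
    for u, v in split_lines(lines):
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def read_retweets(lines):
    """Retweets.txt (u \\t Retweet) -> list of retweeted texts per user."""
    retweets = defaultdict(list)
    for u, text in split_lines(lines):
        retweets[u].append(text)
    return dict(retweets)


def read_labels(lines):
    """UserLabels.txt (u \\t Label) -> label per user ('C' or 'T')."""
    return {u: label for u, label in split_lines(lines)}


# Hypothetical sample data standing in for the real files.
edge_file = io.StringIO("alice\tbob\nbob\tcarol\n")
retweet_file = io.StringIO("alice\tRT @bob: hello\nalice\tRT @carol: hi\n")
label_file = io.StringIO("alice\tC\nbob\tT\n")

adj = read_edges(edge_file)
retweets = read_retweets(retweet_file)
labels = read_labels(label_file)

# Users in the graph with no entry in the label file have an unknown label.
unlabeled = sorted(set(adj) - set(labels))
print(unlabeled)
```

With real data, the `io.StringIO` objects would be replaced by opened files. Users without a label have to be handled explicitly in any label-based evaluation, for example by excluding them.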
Question 1 [models and measurements]

Consider the following graphs:

(1) One of the 10 graphs in the RETWEETL14 dataset (each team should select a different graph).
(2) An Erdős-Rényi random graph.
(3) A graph generated using preferential attachment.
(4) A graph generated using the forest fire model.

The number of nodes and (when possible) the number of edges of each synthetically generated graph should be the same as those of the chosen RETWEETL14 graph.

For these graphs:

(a) Plot the degree distribution of each graph. Produce 5 plots (simple distribution, bins of equal size, bins of exponential size, cumulative, Zipf). All plots should be in log-log scale. Include all graphs in the same plot for each of the above 5 cases.
(b) Report the effective diameter for all graphs.
(c) Report the clustering coefficient for all graphs.

Question 2 [community detection]

Select 4 of the graphs in the RETWEETL14 dataset, two with “political” hashtags (e.g., “climate change”) and two with non-political ones (e.g., “supernatural”).

(a) Find communities in these graphs using any two of the community detection algorithms presented in class. Report the number of clusters and the size of each cluster. Also, evaluate the quality of the clusters using the metrics described in class. If necessary, experiment with different numbers of clusters.
(b) Use the labels of the users to evaluate the homogeneity of the clusters. Describe the metric you use in the report, and if necessary devise your own metric for homogeneity.

Question 3 [homophily] (optional)

Choose two of the “political” retweet graphs. Use the tweets of the corresponding users to study the homophily among the users in each graph. Describe your methodology in your report, and if necessary devise your own methodology for measuring homophily.
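
One possible starting point for Question 3, offered only as a sketch and not as the expected methodology, is to compare how similar the retweeted content of connected users is with the similarity over all pairs of users. The code below uses the Jaccard similarity of the users' retweet sets. The function names, the choice of Jaccard similarity, the all-pairs baseline, and the toy data are assumptions made purely for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_similarity(pairs, retweets):
    """Average Jaccard similarity of the retweet sets over the given user pairs."""
    sims = [jaccard(retweets.get(u, set()), retweets.get(v, set()))
            for u, v in pairs]
    return sum(sims) / len(sims) if sims else 0.0


def homophily_scores(edges, retweets):
    """Return (mean similarity over edges, mean similarity over all user pairs)."""
    all_pairs = list(combinations(sorted(retweets), 2))
    return mean_similarity(edges, retweets), mean_similarity(all_pairs, retweets)


# Hypothetical toy data: tweet identifiers retweeted by each user.
retweets = {
    "a": {"t1", "t2"},
    "b": {"t1", "t2"},
    "c": {"t3"},
    "d": {"t3", "t4"},
}
edges = [("a", "b"), ("c", "d")]

linked, baseline = homophily_scores(edges, retweets)
print(linked, baseline)
```

A similarity over edges that is clearly higher than the all-pairs baseline would suggest that connected users tend to retweet similar content. On a real graph the all-pairs baseline is quadratic in the number of users, so sampling random pairs is a reasonable substitute.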