辅导COMP4880/8880、讲解Resilience留学生、讲解Python程序、辅导Python语言 讲解数据库SQL|辅导留学生Prol
- 首页 >> 其他 COMP4880/8880 Assignment 3 - Resilience,
Contagion, and Homophily
This assignment is graded out of 15, and counts towards 15% of the final grade.
You should submit your answers via Gradescope and your code via wattle. To avoid
losing marks both your answers and code must be submitted before the deadline.
Register for Gradescope at http://gradescope.com using your anu e-mail and include
your student ID number with sign-up. Use the entry code MZ35V8 to sign up for
COMP4880/COMP8880.
Submitting answers: Prepare answers to your homework in a single PDF file and submit
it via GradeScope. If you complete the assignment in a jupyter notebook (or R notebook
… etc) submitting a pdf copy of the notebook (containing all answers) will suffice. You are
responsible for clearly identifying the question being answered, and also make clear
what source code was used to answer the question. Some questions require you to
include source code in the answer pdf. At submission time you will need to indicate
which page(s) of the PDF answer each question. An introduction to submission through
gradescope is available here: https://www.youtube.com/watch?v=KMPoby5g_nE
Submitting code: Upload your code to wattle by the same submission deadline. Put all
the code into a single file and upload it. A jupyter, R notebook, or zip file is acceptable.
Completing the assignment You may choose to use any programming language. We
highly recommend a python notebook with the networkx package. You can use third
party packages such as graph libraries, for example networkx in python, igraph in R, and
so on -- this is strongly recommended.
Question 1: Resilience
This question requires you to use the unweighted undirected global flight network from
Assignment 0 available here: http://seeslab.info/downloads/air-transportation-networks/ .
Please read the description of the data on the page where you download the data. Node
information is stored in “global-cities.dat” (node ids are the second column), and an edge
list (on node ids) is stored in “global-net.dat”.1.1. (1 mark) Use moments of the degree sequence to calculate the critical threshold
under random node removal for the global flight network.
1.2.(2 mark) Simulate 20 random node removal sequences for the global flight
network and calculate the average fraction of nodes that need to be removed
before the largest component contains less than 2% of the number of nodes in the
original graph. Please report: the fraction of nodes removed in each simulation,
and the average fraction of nodes removed across all simulations.
1.3.(1 mark) Does the simulated result support the calculation from the degree
sequence? List the main reasons why they might be different? (Short answer or
bullet points are expected)
1.4.(1 mark) Is the condition that “the largest component contain less than 2% of
nodes in the graph” a realistic condition for breakdown of the flight network?
Come up with another condition you believe to be more realistic. Explain why your
new condition is more realistic.
Question 2: Contagion
A computer virus is spreading through a set of computers and network hubs. We will
model this as a bipartite graph with two types of nodes: computers C and hubs H. Both
computers and network hubs are susceptible to this computer virus. Computers are only
connected to hubs (computers do not infect other computers directly) and hubs are only
connected to computers (hubs do not infect other hubs directly). Hubs may connect to
multiple computers and computers may connect to multiple hubs, all connections are
bi-directional. There is a different infection rate for computer to hub infection than from
hub to computer infection, both are nonzero.
2.1.(1 mark) Write down an expression for the rate of change of the fraction of nodes
infected in the network version of the SI model for this bipartite case. Make sure to
define all the variables you use, and make explicit any assumptions you make.
Your expression should be in matrix form. Hint: Look at the lecture slides
(specifically the first part of the second lecture on contagion) and consider how
you might modify the expression, in particular consider changing beta (in the
lecture slides beta is the probability per unit time that infection will be transmitted
between two individuals).2.2. (1 mark) Use your bipartite SI model to derive an approximate expression for
the probability that each node in the network (both computers and hubs) is
infected at time t. This expression will be vector valued with a domain of the
nonnegative real numbers. You may assume that we care only about the early
time so the fraction of infected individuals is small. Also assume at t=0 there is one
infected node. Tip: your solution should involve finding the dominant eigenvector
of some matrix.
2.3. (2 marks) You have been given a bipartite graph (“computer_hub_graph.csv”
on wattle, stored as an adjacency list, the format is described here:
https://networkx.github.io/documentation/networkx-1.10/reference/readwrite.adjlist
.html ) with 30 computers and 15 hubs. Assume that the probability per unit time
that infection will be transmitted from a computer to a hub is 0.05, and 0.01 from a
hub to a computer. Using your answer to part 2 above plot out the average
number of nodes that are infected at time t (for t < 10). On your plot the x-axis
should be time and the y-axis should be the average number of nodes that are
infected. Also determine the 5 nodes that are most likely to be infected at time
t=10.
2.4. (2 marks) Now simulate the SI model on the network ( given by
“computer_hub_graph.csv” on wattle) assume that the probability per unit time
that infection will be transmitted from a computer to a hub is 0.05, and 0.01 from a
hub to a computer. Start the infection from a single random node. Run this
simulation 1000 times (with a new random start node each time) then plot the
result (for t < 10). On your plot the x-axis should be time and the y-axis should be
the average number of nodes that are infected. Also determine the 5 nodes that
are most likely to be infected at time t=10. Tip: the time until an infection spreads
across an edge can be modelled as an exponential random variable.
2.5. (1 mark) Give at least three reasons why your answer to part 3 is not the same
as your answer to part 4. (bullet points or short answer)
Question 3: Homophily
Download the CiteSeer for Document Classification dataset from
https://linqs.soe.ucsc.edu/data (direct link to data:https://linqs-data.soe.ucsc.edu/public/lbc/citeseer.tgz ). The nodes in this graph are
computer science papers and the edges are citations. Each paper belongs to one
category from: Agents, AI, DB, IR, ML, and HCI. For this question you should make the
graph undirected (to do this interpret each directed edge as undirected). Details of the
graph are in “citeseer.cites” while node categories are in “citeseer.content”. Please
review the README to understand this dataset. Note: there are some nodes in
“citeseer.cites” that are not present in “citeseer.content” and thus do not have an
assigned category, please remove these nodes from the graph before answering any of
the questions below.
3.1. (1 mark) Calculate and report: the number of nodes, the number of edges, the
number of nodes in each category, the number of edges wholly in each category (
to do this count the number of edges where both ends are in the category ).
3.2. (1 mark) Does the CiteSeer for Document Classification graph exhibit
homophily with respect to category? Calculate a relevant measure to support your
statement. Also explain what the measure computes and why it can be used to
detect homophily.
3.3. (1 mark) Does the CiteSeer for Document Classification graph exhibit
homophily with respect to node degree? Calculate a relevant measure to support
your statement. Also explain what the measure computes and why it can be used
to detect homophily.
Contagion, and Homophily
This assignment is graded out of 15, and counts towards 15% of the final grade.
You should submit your answers via Gradescope and your code via wattle. To avoid
losing marks both your answers and code must be submitted before the deadline.
Register for Gradescope at http://gradescope.com using your anu e-mail and include
your student ID number with sign-up. Use the entry code MZ35V8 to sign up for
COMP4880/COMP8880.
Submitting answers: Prepare answers to your homework in a single PDF file and submit
it via GradeScope. If you complete the assignment in a jupyter notebook (or R notebook
… etc) submitting a pdf copy of the notebook (containing all answers) will suffice. You are
responsible for clearly identifying the question being answered, and also make clear
what source code was used to answer the question. Some questions require you to
include source code in the answer pdf. At submission time you will need to indicate
which page(s) of the PDF answer each question. An introduction to submission through
gradescope is available here: https://www.youtube.com/watch?v=KMPoby5g_nE
Submitting code: Upload your code to wattle by the same submission deadline. Put all
the code into a single file and upload it. A jupyter, R notebook, or zip file is acceptable.
Completing the assignment You may choose to use any programming language. We
highly recommend a python notebook with the networkx package. You can use third
party packages such as graph libraries, for example networkx in python, igraph in R, and
so on -- this is strongly recommended.
Question 1: Resilience
This question requires you to use the unweighted undirected global flight network from
Assignment 0 available here: http://seeslab.info/downloads/air-transportation-networks/ .
Please read the description of the data on the page where you download the data. Node
information is stored in “global-cities.dat” (node ids are the second column), and an edge
list (on node ids) is stored in “global-net.dat”.1.1. (1 mark) Use moments of the degree sequence to calculate the critical threshold
under random node removal for the global flight network.
1.2.(2 mark) Simulate 20 random node removal sequences for the global flight
network and calculate the average fraction of nodes that need to be removed
before the largest component contains less than 2% of the number of nodes in the
original graph. Please report: the fraction of nodes removed in each simulation,
and the average fraction of nodes removed across all simulations.
1.3.(1 mark) Does the simulated result support the calculation from the degree
sequence? List the main reasons why they might be different? (Short answer or
bullet points are expected)
1.4.(1 mark) Is the condition that “the largest component contain less than 2% of
nodes in the graph” a realistic condition for breakdown of the flight network?
Come up with another condition you believe to be more realistic. Explain why your
new condition is more realistic.
Question 2: Contagion
A computer virus is spreading through a set of computers and network hubs. We will
model this as a bipartite graph with two types of nodes: computers C and hubs H. Both
computers and network hubs are susceptible to this computer virus. Computers are only
connected to hubs (computers do not infect other computers directly) and hubs are only
connected to computers (hubs do not infect other hubs directly). Hubs may connect to
multiple computers and computers may connect to multiple hubs, all connections are
bi-directional. There is a different infection rate for computer to hub infection than from
hub to computer infection, both are nonzero.
2.1.(1 mark) Write down an expression for the rate of change of the fraction of nodes
infected in the network version of the SI model for this bipartite case. Make sure to
define all the variables you use, and make explicit any assumptions you make.
Your expression should be in matrix form. Hint: Look at the lecture slides
(specifically the first part of the second lecture on contagion) and consider how
you might modify the expression, in particular consider changing beta (in the
lecture slides beta is the probability per unit time that infection will be transmitted
between two individuals).2.2. (1 mark) Use your bipartite SI model to derive an approximate expression for
the probability that each node in the network (both computers and hubs) is
infected at time t. This expression will be vector valued with a domain of the
nonnegative real numbers. You may assume that we care only about the early
time so the fraction of infected individuals is small. Also assume at t=0 there is one
infected node. Tip: your solution should involve finding the dominant eigenvector
of some matrix.
2.3. (2 marks) You have been given a bipartite graph (“computer_hub_graph.csv”
on wattle, stored as an adjacency list, the format is described here:
https://networkx.github.io/documentation/networkx-1.10/reference/readwrite.adjlist
.html ) with 30 computers and 15 hubs. Assume that the probability per unit time
that infection will be transmitted from a computer to a hub is 0.05, and 0.01 from a
hub to a computer. Using your answer to part 2 above plot out the average
number of nodes that are infected at time t (for t < 10). On your plot the x-axis
should be time and the y-axis should be the average number of nodes that are
infected. Also determine the 5 nodes that are most likely to be infected at time
t=10.
2.4. (2 marks) Now simulate the SI model on the network ( given by
“computer_hub_graph.csv” on wattle) assume that the probability per unit time
that infection will be transmitted from a computer to a hub is 0.05, and 0.01 from a
hub to a computer. Start the infection from a single random node. Run this
simulation 1000 times (with a new random start node each time) then plot the
result (for t < 10). On your plot the x-axis should be time and the y-axis should be
the average number of nodes that are infected. Also determine the 5 nodes that
are most likely to be infected at time t=10. Tip: the time until an infection spreads
across an edge can be modelled as an exponential random variable.
2.5. (1 mark) Give at least three reasons why your answer to part 3 is not the same
as your answer to part 4. (bullet points or short answer)
Question 3: Homophily
Download the CiteSeer for Document Classification dataset from
https://linqs.soe.ucsc.edu/data (direct link to data:https://linqs-data.soe.ucsc.edu/public/lbc/citeseer.tgz ). The nodes in this graph are
computer science papers and the edges are citations. Each paper belongs to one
category from: Agents, AI, DB, IR, ML, and HCI. For this question you should make the
graph undirected (to do this interpret each directed edge as undirected). Details of the
graph are in “citeseer.cites” while node categories are in “citeseer.content”. Please
review the README to understand this dataset. Note: there are some nodes in
“citeseer.cites” that are not present in “citeseer.content” and thus do not have an
assigned category, please remove these nodes from the graph before answering any of
the questions below.
3.1. (1 mark) Calculate and report: the number of nodes, the number of edges, the
number of nodes in each category, the number of edges wholly in each category (
to do this count the number of edges where both ends are in the category ).
3.2. (1 mark) Does the CiteSeer for Document Classification graph exhibit
homophily with respect to category? Calculate a relevant measure to support your
statement. Also explain what the measure computes and why it can be used to
detect homophily.
3.3. (1 mark) Does the CiteSeer for Document Classification graph exhibit
homophily with respect to node degree? Calculate a relevant measure to support
your statement. Also explain what the measure computes and why it can be used
to detect homophily.