2019/10/3 homework_3 - Jupyter Notebook
localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 1/12
Stock market clustering
Data Structures and Algorithms Using Python, September 2019
Imperial College Business School
This assignment is divided into three parts. In the first part, you will work on pandas data analysis. In the
second part, you will implement a clustering algorithm to group companies based on their stock price
movements. In the final part, you will explore ways to extend and improve this analysis.
The assignment is due on Monday 7 October.
The assignment is graded not only on correctness but also on the presentation of the results. Try to
make the results of your calculations easy to read with eg string formatting, do some plots if you find them
useful, and comment your code.
There are no OK tests for your functions in this assignment. It is intended to set you up working on a real
problem, where you have to explore the data and the problem itself to figure out your approach. The first part will
also require you to use a search engine to find the right pandas functions to analyse your data. Some
potentially useful pandas functions are listed in the file veryUseful.py .
You're working as a group, so you may wish to divide the work into smaller pieces. Some of you may
want to start working on the Pandas part, and others on the algorithm part. There is a set of intermediary
results available for testing your algorithm, so you can start immediately on both parts. See the details below in
question 3.
Setting up your group
You'll complete this assignment in your study groups. Start by creating this group on OK.
1. Gather the OK login emails of your group.
2. Log in to https://okpy.org.
3. Click on the group assignment in the assignment list.
4. Add your group members' emails and click on Invite to add them.
5. Each invited member should go to the group assignment and Accept the invite.
Submission
When you're ready to submit the assignment, use the command
python ok --submit
on the command line.
You may submit more than once before the deadline; only the final submission will be graded. Only one
submission is needed for your group.
Part 1: Pandas
30% of grade
In the previous homework, we used lists to study stock prices. The pandas library provides some more
effective tools for data analysis.
The assignment comes with two files containing company data:
SP_500_firms.csv with firm and ticker names
SP_500_close_2015.csv with stock price data for 2015
Let's first load up this data.
In [1]:
Question 1: Returns
In the previous homework, we calculated stock price returns over a period of time. The return is defined as the
percentage change, so the return $r_t$ between periods $t-1$ and $t$ for stock price $p_t$ would be

$$
r_t = \frac{p_t - p_{t-1}}{p_{t-1}}.
$$

Calculate the returns in pandas for all the stocks in price_data .
In [2]:
# Load data into Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv

def read_names_into_dict():
    """
    Read company names into a dictionary
    """
    d = dict()
    # Firm list obtained from https://github.com/datasets/s-and-p-500-companies
    with open("SP_500_firms.csv") as f:
        for row in csv.DictReader(f):
            d[row['Symbol']] = [row['Name'], row['Sector']]
    return d

names_dict = read_names_into_dict()
comp_names = names_dict.keys()

# Read price data with pandas
filename = 'SP_500_close_2015.csv'
price_data = pd.read_csv(filename, index_col=0)

# Calculate company returns in this cell
returns = price_data.pct_change()
Question 1.1: Highest and lowest daily returns
Use pandas to find the 10 highest daily returns amongst all companies. Search online for what were the
reasons behind the highest returns. Present your results in a clean and immediately readable form.
Repeat with the lowest daily returns.
In [ ]:
Question 1.2: Highest and lowest yearly returns
Find the 10 highest yearly returns amongst all companies. Present your results in a clean and immediately
readable form.
Repeat with the lowest yearly returns.
In [ ]:
Question 1.3: Highest and lowest volatilities
Find the 10 highest yearly volatilities (standard deviations) amongst all companies. Present your results in a
clean and immediately readable form.
Repeat with the lowest volatilities.
In [ ]:
# Your code here
daily_returns = price_data.pct_change()
# Stack into one Series indexed by (date, ticker), then take the extremes
top_10_daily = daily_returns.stack().nlargest(10)
bottom_10_daily = daily_returns.stack().nsmallest(10)
# Your code here
# Your code here
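For Questions 1.2 and 1.3, one possible pandas approach is sketched below. It assumes the price_data frame loaded above; the helper name yearly_summary and the first-to-last-price definition of the yearly return are illustrative choices, not the only valid ones.

```python
import pandas as pd

def yearly_summary(price_data, n=10):
    """Sketch: the n highest yearly returns and volatilities per ticker."""
    daily_returns = price_data.pct_change()
    # Yearly return: relative change from the first to the last closing price
    yearly_return = price_data.iloc[-1] / price_data.iloc[0] - 1
    # Volatility: standard deviation of daily returns over the year
    volatility = daily_returns.std()
    return yearly_return.nlargest(n), volatility.nlargest(n)
```

Replacing nlargest with nsmallest gives the lowest yearly returns and volatilities in the same way.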
Question 2: Correlations

Analysts often care about the correlation of stock prices between firms. Correlation measures the statistical
similarity between the two prices' movements. If the prices move very similarly, the correlation of their returns
is close to 1. If they tend to make exactly the opposite movements (ie one price moves up and the other one
down), the correlation is close to -1. If there is no clear statistical relationship between the movements of two
stock prices, the correlation in their returns is close to zero.

For a sample of stock price returns $x, y$ with observations for $n$ days, the correlation $r_{xy}$ between $x$ and $y$ can be calculated as:

$$
r_{xy} = \frac{\sum x_i y_i - n \bar{x}\bar{y}}{n s_x s_y} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - (\sum x_i)^2}\;\sqrt{n\sum y_i^2 - (\sum y_i)^2}}.
$$
Here $\bar{x}$ refers to the average value of $x$ over the $n$ observations, and $s_x$ to its standard deviation.

Based on the time series of stock returns we just computed, we can calculate a correlation value for each pair
of stocks, for example between MSFT (Microsoft) and AAPL (Apple). This gives us a measure of the similarity
between the two stocks in this time period.

Calculate all correlations between companies. You can search online for a pandas or numpy function that
does this directly.
In [ ]:
Question 2.1
Next, analyse the correlations between the companies:
Define functions to print out the top and bottom correlated companies for any given company.
Use your functions to study the following companies in the tech sector: Amazon, Microsoft, Facebook,
Apple, and Google. Comment on the results. Which (possibly other) companies are they most closely
related to in terms of highest correlations? Would you have expected the results you see?
In [ ]:
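pandas can compute the full correlation matrix directly with DataFrame.corr() on the returns, eg correl = returns.corr() . For Question 2.1, one possible shape for the helper is sketched below; the function name, the n parameter, and the assumption that correl is that correlation DataFrame are all illustrative.

```python
import pandas as pd

def print_top_bottom(correl, ticker, n=5):
    """Print the n most and least correlated companies for the given ticker."""
    # Drop the ticker's correlation with itself (always 1), then sort
    s = correl[ticker].drop(ticker).sort_values(ascending=False)
    print(f"Most correlated with {ticker}:")
    print(s.head(n).round(3).to_string())
    print(f"Least correlated with {ticker}:")
    print(s.tail(n).round(3).to_string())
```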
Part 2: Clustering
30% of grade
In this part of the assignment, you will develop a clustering algorithm to study the similarity of different stocks.
The general purpose of clustering analysis is dividing a set of objects into groups that are somehow "similar" to
each other. It is a widespread tool used for exploratory data analysis in diverse fields in both science and
business. For example, in marketing analytics, cluster analysis is employed to group consumers into segments
based on their characteristics or _features_, such as age, post code, purchase history, etc. These features are
somehow aggregated to compare the similarity between consumers. Based on this similarity, a clustering
algorithm then divides the consumers into segments.
We will apply this idea to stock market data to identify groups of stocks that perform similarly over time. There
are many reasons for grouping stocks together, such as analysing trading strategies, risk management, or
simply presenting stock market information. Publicly traded companies are often grouped together by simple
features such as the industry they operate in (eg tech companies or pharma companies), but here we'll take a
data-driven approach, grouping together stocks that perform similarly over time.
Cluster analysis is an umbrella term for many different algorithmic approaches. Here you'll develop one that's
based on the concept of greedy algorithm design, specified below. You'll also have the opportunity to
explore other approaches using Python libraries.
What is a good measure of stocks "performing similarly" to use for clustering? Let's use the one we just
calculated: correlations in their returns. How can we use this similarity information for clustering? We now have
access to all correlations between stock returns in the S&P 500. We can think of this as a graph as follows. The
nodes of the graph are the stocks (eg MSFT and AAPL). The edges between them are the correlations we
have just calculated between each pair of stocks, where the value of the correlation is the edge weight. Notice
that since we have the correlations between all companies, this is a dense graph, where all possible edges exist.
We thus have a graph representing pairwise "similarity" scores in correlations, and we want to divide the graph
into clusters. There are many possible ways to do this, but here we'll use a greedy algorithm design. The
algorithm is as follows:
1. Sort the edges in the graph by their weight (ie the correlation), and pick a number $k$ for the number of
iterations of the algorithm
2. Create single-node sets from each node in the graph
3. Repeat $k$ times:
A. Pick the graph edge with the highest correlation
B. Combine the two sets containing the source and the destination of the edge
C. Repeat with the next-highest weight edge
4. Return the remaining sets after the $k$ iterations
What does the algorithm do? It first initializes a graph with no connections, where each node is in a separate
set. Then in the main loop, it runs through the highest-weighted edges, and adds connections at those
edges. This leads to sets being combined (or "merged"). The result is "groups" of stocks determined by the
highest correlations between the stock returns. These are your stock clusters.
For example, suppose that the toy graph below represents four stocks A, B, C, D and their return correlations.
Suppose we pick $k=2$ and run the algorithm.

The algorithm would begin by initializing four separate sets of one node each: {A}, {B}, {C}, {D}. It would then
first connect C and D because of their correlation 0.95, resulting in just three sets: {A}, {B}, and {C,D}. Then it
would connect A and B, resulting in two sets of two nodes each: {A,B} and {C,D}. These would be our clusters
for $k=2$.
Question 3: Implementing the algorithm
Your task is to implement the clustering algorithm using the functions below. First, for convenience in
implementing the algorithm, let's create a list of the correlations from the pandas data.
In [8]:
Next, let's turn to the algorithm itself. Consider the example above, repeated here.
Suppose we have picked $k$ and sorted the edge list in step 1 of the algorithm. How should we represent the
clusters in step 2? One great way is to use a dictionary where each key is a node, and each value is another
node that this node "points to". A cluster is then a chain of these links, which we represent as a dictionary.
In step 2 of the algorithm, we start with four nodes that point to themselves, ie the dictionary
{'A':'A','B':'B','C':'C','D':'D'} . When a node points to itself, it ends the chain. Here the clusters
are thus just the nodes themselves, as in the figure below.
def create_correlation_list(correl):
    """
    Creates a list of correlations from a pandas dataframe of correlations

    Parameters:
    correl: pandas dataframe of correlations

    Returns:
    list of correlations containing tuples of form (correlation, ticker1, ticker2)
    """
    n_comp = len(correl.columns)
    comp_names = list(correl.columns)
    # Faster if we use a numpy matrix
    correl_mat = correl.to_numpy()  # .as_matrix() is deprecated in newer pandas
    L = []  # create list
    for i in range(n_comp):
        for j in range(i + 1, n_comp):
            L.append((correl_mat[i, j], comp_names[i], comp_names[j]))
    return L

edges = create_correlation_list(correl)
Let's walk through the algorithm's next steps. We first look at the highest-weight edge, which is between C and
D. These clusters will be combined. In terms of the dictionary, this means that one of them will not point to
itself, but to the other one (here it does not matter which one). So we make the dictionary at C point to D .
The dictionary becomes {'A':'A','B':'B','C':'D','D':'D'} .
The next highest correlation is between A and B, so these clusters would be combined. The dictionary
becomes {'A':'B','B':'B','C':'D','D':'D'} .
The third highest correlation is between C and B. Let's think about combining these clusters using the
dictionary we have. Looking up B , we get B : the node B is in the bottom of the chain representing its cluster.
But when we look up C , it points to D . Should we make C point to B ? No - that would leave nothing
pointing at D , and C and D should remain connected! We could perhaps have C somehow point at both
nodes, but that could become complicated, so we'll do the following instead. We'll follow the chain to the
bottom. In the dictionary, we look up C and see that it points to D . We then look up D which points to itself,
so D is the bottom node. We then pick one of the bottom nodes B and D , and make it point to the other.
We then have the dictionary {'A':'B','B':'B','C':'D','D':'B'} , and the corresponding clustering in
the figure below.
In other words, we'll keep track of clusters in a dictionary such that each cluster has exactly one bottom
node. To do this, we need a mechanism for following a cluster to the bottom. You'll implement this in the
function find_bottom below. The function takes as input a node and a dictionary, and runs through the
"chain" in the dictionary until it finds a bottom node pointing to itself.
The other thing we'll need to do is combine clusters by connecting two nodes. This means taking the two
nodes, finding the bottom node for each node's cluster, and making one point to the other. You'll implement
this in the function merge_sets below.

Finally, you'll need to set up the algorithm by sorting the correlations and then looping through this merging $k$
times. You'll implement this in the function cluster_correlations below. This completes the algorithm.
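The chain-following and merging just described can be sketched on the toy graph. This is only a sanity check, not the full assignment solution: it ignores the set_starters bookkeeping, and the A-B and B-C edge weights (0.90 and 0.85) are made-up values for illustration; only the C-D correlation of 0.95 appears in the walkthrough.

```python
def find_bottom_sketch(node, next_nodes):
    # Follow the chain of pointers until a node points to itself
    while next_nodes[node] != node:
        node = next_nodes[node]
    return node

def merge_sketch(node1, node2, next_nodes):
    # Find each cluster's bottom node and link one bottom to the other;
    # if the bottoms coincide, the nodes are already in the same cluster
    b1 = find_bottom_sketch(node1, next_nodes)
    b2 = find_bottom_sketch(node2, next_nodes)
    if b1 != b2:
        next_nodes[b1] = b2

next_nodes = {n: n for n in 'ABCD'}  # step 2: single-node clusters
edges = [(0.95, 'C', 'D'), (0.90, 'A', 'B'), (0.85, 'B', 'C')]  # pre-sorted
for _, u, v in edges:  # step 3: merge along the highest-weight edges
    merge_sketch(u, v, next_nodes)
# All four nodes now share one bottom node, so they form a single cluster
```

Note that in the final merge either bottom node (B or D) may point to the other: the walkthrough above makes D point to B, while this sketch makes B point to D. The resulting clusters are the same.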
But there is one more thing. If you only keep track of a dictionary like
{'A':'B','B':'B','C':'D','D':'B'} , how do you actually find the clusters from the dictionary? A
convenient way is to store some extra information: the "starting nodes" of each cluster to which no other node
links. For example, above these "starting nodes" would include all nodes A,B,C,D in the beginning, but only
A and C in the end. If we keep track of these, we can then write a function that starts from each such
remaining "starting node", works through to the bottom, and creates the cluster along the way. You'll
implement this in the function construct_sets below.
Intermediary results
You can load a pre-computed set of results up to this point using the following commands.
In [12]:
# Load intermediary results from a "pickle" file
# You can use these with your algorithm below
import pickle

file_name = 'cluster_correlations'
with open(file_name, "rb") as f:
    correl = pickle.load(f)
    edges = pickle.load(f)
firms = list(correl.columns)
print(firms[:10])
edges[:10]

['MMM', 'ABT', 'ABBV', 'ACN', 'ATVI', 'AYI', 'ADBE', 'AAP', 'AES', 'AET']

Out[12]:
[(0.59866616402973805, 'MMM', 'ABT'),
 (0.32263699601940254, 'MMM', 'ABBV'),
 (0.63205934885601889, 'MMM', 'ACN'),
 (0.41855006701119907, 'MMM', 'ATVI'),
 (0.45089749571328591, 'MMM', 'AYI'),
 (0.46875484430451653, 'MMM', 'ADBE'),
 (0.25713165217544326, 'MMM', 'AAP'),
 (0.33537796741224424, 'MMM', 'AES'),
 (0.31737374099675925, 'MMM', 'AET'),
 (0.50593060558168279, 'MMM', 'AMG')]

Clustering implementation

Complete the following functions to implement the clustering algorithm.
In [ ]:
def find_bottom(node, next_nodes):
    """
    Find the "bottom" of a cluster starting from node in dictionary next_nodes

    Parameters:
    node: starting node
    next_nodes: dictionary of node connections

    Returns:
    the bottom node in the cluster
    """
    # Your code here
    pass

def merge_sets(node1, node2, next_nodes, set_starters):
    """
    Merges the clusters containing node1, node2 using the connections dictionary next_nodes
    Also removes any bottom node which is no longer a "starting node" from set_starters

    Parameters:
    node1: first node the set of which will be merged
    node2: second node the set of which will be merged
    next_nodes: dictionary of node connections
    set_starters: set of nodes that "start" a cluster

    Returns:
    does this function need to return something?
    """
    # Your code here

def cluster_correlations(edge_list, firms, k=200):
    """
    A mystery clustering algorithm

    Parameters:
    edge_list - list of edges of the form (weight, source, destination)
    firms - list of firms (tickers)
    k - number of iterations of the algorithm

    Returns:
    next_nodes - dictionary to store clusters as "pointers"
        - the dictionary keys are the nodes and the values are the node in the same cluster that the key points to
    set_starters - set of nodes that no other node points to (this will be used to construct the clusters)

    Algorithm:
    1 sort edges by weight (highest correlation first)
    2 initialize next_nodes so that each node points to itself (single-node clusters)
    3 take highest correlation edge
      check if the source and destination are in the same cluster using find_bottom
      if not, merge the source and destination nodes' clusters using merge_sets
    4 if max iterations not reached, repeat 3 with next highest correlation
    (meanwhile also keep track of the "set starters" ie nodes that have nothing pointing to them)
    """
    # Sort edges
    sorted_edges = _____
    # Initialize dictionary of pointers
    next_nodes = {node: node for node in firms}
    # Keep track of "starting nodes", ie nodes that no other node points to in next_nodes
    set_starters = {node for node in firms}

    # Loop k times
    for i in range(k):
        # Your algorithm here

    return set_starters, next_nodes

Once we've run the algorithm, we'll need to construct the clusters. You can use the function below to do so.

In [ ]:

def construct_sets(set_starters, next_nodes):
    """
    Constructs sets (clusters) from the next_nodes dictionary

    Parameters:
    set_starters: set of starting nodes
    next_nodes: dictionary of connections

    Returns:
    dictionary of sets (clusters):
        key - bottom node of set; value - set of all nodes in the cluster
    """
    # Initialise an empty dictionary
    all_sets = dict()
    # Loop:
    # Start from each set starter node
    # Construct a "current set" with all nodes on the way to the bottom node
    # If the bottom node is already a key of all_sets, combine the "current set" with the existing set
    # Otherwise add the "current set" to all_sets
    for s in set_starters:
        cur_set = set()
        cur_set.add(s)
        p = s
        while next_nodes[p] != p:
            p = next_nodes[p]
            cur_set.add(p)
        if p not in all_sets:
            all_sets[p] = cur_set
        else:
            for item in cur_set:
                all_sets[p].add(item)
    return all_sets

all_clusters = construct_sets(set_starters, next_nodes)

Question 3.2: analysing the results

After you have implemented the algorithm in Python, add cells below answering the following questions:
Do some detective work: what is the algorithm that you've implemented called? In what other graph
problem is it often used? How are the problems related? (Hint: the algorithm is mentioned on the Wikipedia
page for greedy algorithms.)
Run the algorithm and present the results formatted in a useful way.
Discuss the results for different values of $k$.
Do the resulting clusters "make sense"? (You may need to search online what the companies do.) Verify
that the stocks in your clusters perform similarly by plotting the returns and the (normalised) stock prices
for some of the clusters.
You may use graphs etc. to present your results.
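For the plotting suggestion above, a minimal matplotlib sketch is given below. It assumes price_data as loaded in Part 1; the function name and the normalisation to the first day's price are illustrative choices.

```python
import matplotlib.pyplot as plt

def plot_cluster(price_data, cluster, ax=None):
    """Sketch: plot normalised prices (first day = 1.0) for one cluster's tickers."""
    ax = ax or plt.gca()
    for ticker in cluster:
        # Normalise so every stock starts at 1.0 and trajectories are comparable
        normalised = price_data[ticker] / price_data[ticker].iloc[0]
        ax.plot(normalised.index, normalised.values, label=ticker)
    ax.legend()
    ax.set_ylabel("price (normalised)")
    return ax
```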
Part 3:
40% of grade
Depending on your interests, you may work on either subsection below, or both. You might go deeper into one
question than another, but for an outstanding grade, you should have at least some discussion on both.
In-depth analysis
The project is open in the sense that you can probably think of further interesting questions to look into based
on returns, correlations, and clusters. This is not required but being creative and going further than the above
questions will make your work stand out. You can explore one or several of the ideas below, or come up with
questions of your own.
Depending on your interests, you might look at different things. For example, when researching the algorithm,
you might be interested in its complexity, and in how to improve your implementation's efficiency. On Wikipedia,
you may find a couple of ways to drastically improve the algorithm's speed that nevertheless require only
relatively small changes to your code.
If you're more interested in the financial applications of clustering, there are also opportunities to think about
further steps. For example, some people claim that you can derive trading strategies based on clustering - that
often one of the stocks in a cluster is a leader and the others follow that price. If this is true, you could track the
price of the leader stock and then trade the other stocks in the cluster based on changes in the leader's price.
Do you think this would make sense? Do you have an idea on how to identify a leader stock?
You might also want to repeat the analysis for different time periods. You would be able to do this by looking at
the code for the second homework to figure out how to read data from Yahoo Finance using pandas, and going
through the process for all companies in the csv file for another time period. Perhaps you could explore for
example how correlations between companies have changed over time, or how clusters found by your
algorithm change over time.
Exploring other clustering methods
You've used just one approach to clustering, and arguably not the best one. Research clustering algorithms
and libraries to apply them in Python. Discuss some other algorithms that could be used, and how they differ
from the one you've implemented. Look at the Python library scikit-learn . How would you apply the
clustering algorithms provided by the library to stock price data? Would you need to develop new metrics other
than correlations? If you want to go even further, try running some of these other clustering algorithms on your
data, and report the results. Start from here: http://scikit-learn.org/stable/modules/clustering.html#clustering;
you'll find a stock market example there too.
For future reference, you may also find other interesting machine-learning tools there, for both stock market
analysis and other analytics purposes.
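As a taste of the scikit-learn route, one possible sketch is given below. It assumes scikit-learn is installed and that returns is a DataFrame of daily returns with one column per ticker; running KMeans on the raw return series is just one of several reasonable choices (you might instead cluster on the correlation matrix, as the scikit-learn stock example does).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def kmeans_stock_clusters(returns, n_clusters=8, seed=0):
    """Sketch: group tickers by k-means on their daily-return time series."""
    X = returns.dropna().T.to_numpy()  # one row of features per ticker
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    clusters = {}
    for ticker, label in zip(returns.columns, labels):
        clusters.setdefault(label, set()).add(ticker)
    return list(clusters.values())
```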
Question 4
Create cells below to add your extra part as code and narrative text explaining your idea and results.
All done!
Don't forget to submit with the command
python ok --submit
on the command line.