辅导Stock market、辅导Python Data Structures课

2019.10.06 - 首页 >> Python编程

、Python编程语言作业调试

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 1/12

Stock market clustering

Data Structures and Algorithms Using Python, September 2019

Imperial College Business School

This assignment is divided into three parts. In the first part, you will work on pandas data analysis. In the

second part, you will implement a clustering algorithm to group companies based on their stock price

movements. In the final part, you will explore ways to extend and improve this analysis.

The assignment is due on Monday 7 October.

The assignment is graded not only on correctness but also on the presentation of the results. Try to

make the results of your calculations easy to read with eg string formatting, do some plots if you find them

useful, and comment your code.

There are no OK tests to test your functions in this assignment. It is intended to set you up working on a

real problem where you have to explore data and the problem to figure out your approach. The first part will

also require you to use a search engine to find the right pandas functions to use to analyse your data. Some

potentially useful pandas functions are listed in the file veryUseful.py .

You're working as a group, so you may wish to divide the work into smaller pieces. Some of you may

want to start working on the Pandas part, and others on the algorithm part. There is a set of intermediary

results available for testing your algorithm, so you can start immediately on both parts. See the details below in

question 3.

Setting up your group

You'll complete this assignment in your study groups. Start by creating this group on OK.

1. Gather the OK login emails of your group.

2. Log in to https://okpy.org (https://okpy.org).

3. Click on the group assignment in the assignment list.

4. Add your group members' emails and click on Invite to add them.

5. Each invited member should go to the group assignment and Accept the invite.

Submission

When you're ready to submit the assignment, use the command

python ok --submit

on the command line.

You may submit more than once before the deadline; only the final submission will be graded. Only one

submission is needed for your group.

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 2/12

Part 1: Pandas

30% of grade

In the previous homework, we used lists to study stock prices. The pandas library provides some more

effective tools for data analysis.

The assignment comes with two files containing company data:

SP_500_firms.csv with firm and ticker names

SP_500_close_2015.csv with stock price data for 2015

Let's first load up this data.

In [1]:

Question 1: Returns

In the previous homework, we calculated stock price returns over a period of time. The return is defined as the

percentage change, so the return between periods and for stock price would be

Calculate the returns in pandas for all the stocks in price_data .

In [2]:

# Load data into Python

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import csv

def read_names_into_dict():

"""

Read company names into a dictionary

"""

d = dict()

input_file = csv.DictReader(open("SP_500_firms.csv"))

# Firm list obtained from https://github.com/datasets/s-and-p-500-companies

for row in input_file:

d[row['Symbol']] = [row['Name'], row['Sector']]

return d

names_dict = read_names_into_dict()

comp_names = names_dict.keys()

# Read price data with pandas

filename = 'SP_500_close_2015.csv'

price_data = pd.read_csv(filename, index_col=0)

# Calculate company returns in this cell

return_for_stock=(price_data.tail(251)-price_data.head(251))/price_data.head(251)

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 3/12

Question 1.1: Highest and lowest daily returns

Use pandas to find the 10 highest daily returns amongst all companies. Search online for what were the

reasons behind the highest returns. Present your results in a clean and immediately readable form.

Repeat with the lowest daily returns.

In [ ]:

Question 1.2: Highest and lowest yearly returns

Find the 10 highest yearly returns amongst all companies. Present your results in a clean and immediately

readable form.

Repeat with the lowest yearly returns.

In [ ]:

Question 1.3: Highest and lowest volatilities

Find the 10 highest yearly volatilities (standard deviations) amongst all companies. Present your results in a

clean and immediately readable form.

Repeat with the lowest volatilities.

In [ ]:

# Your code here

max_return=[]

df.sort_values(by=[comp_names],ascending=False)

# Your code here

### Question 2: Correlations

Analysts often care about the _correlation_ of stock prices between firms.

Correlation measures the statistical similarity between the two prices' movements.

If the prices move very similarly, the correlation of their _returns_ is close to

1. If they tend to make exactly the opposite movements (ie one price moves up and

the other one down), the correlation is close to -1. If there is no clear

statistical relationship between the movements of two stock prices, the

correlation in their returns is close to zero.

For a sample of stock price returns $x,y$ with observations for $n$ days, the

correlation $r_{xy}$ between $x$ and $y$ can be calculated as:

r_{xy} = \frac{\sum x_i y_i - n \bar{x}\bar{y}}{ns_x s_y} = {\frac {n\sum

x_{i}y_{i}-\sum x_{i}\sum y_{i}}{{\sqrt {n\sum x_{i}^{2}-(\sum x_{i})^{2}}}~{\sqrt

{n\sum y_{i}^{2}-(\sum y_{i})^{2}}}}}.

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 4/12

Calculate all correlations between companies. You can search online for a pandas or numpy function that

does this directly.

In [ ]:

Question 2.1

Next, analyse the correlations between the companies:

Define functions to print out the top and bottom correlated companies for any given company.

Use your functions to study the following companies in the tech sector: Amazon, Microsoft, Facebook,

Apple, and Google. Comment on the results. Which (possibly other) companies are they most closely

related to in terms of highest correlations? Would you have expected the results you see?

In [ ]:

Part 2: Clustering

30% of grade

In this part of the assignment, you will develop a clustering algorithm to study the similarity of different stocks.

The general purpose of clustering analysis is dividing a set of objects into groups that are somehow "similar" to

each other. It is a widespread tool used for exploratory data analysis in diverse fields in both science and

business. For example, in marketing analytics, cluster analysis is employed to group consumers into segments

based on their characteristics or _features_, such as age, post code, purchase history, etc. These features are

somehow aggregated to compare the similarity between consumers. Based on this similarity, a clustering

algorithm then divides the consumers into segments.

We will apply this idea on stock market data to identify groups of stocks that perform similarly over time. There

are many reasons for grouping stocks together, such as analysing trading strategies, risk management, or

simply presenting stock market information. Publicly traded companies are often grouped together by simple

features such as the industry they operate in (eg tech companies or pharma companies), but here we'll take a

data-driven approach, grouping together stocks that perform similarly over time.

Here $\bar{x}$ refers to the average value of $x$ over the $n$ observations, and

$s_x$ to its standard deviation.

Based on time series of the stock returns we just computed, we can calculate a

correlation value for each pair of stocks, for example between MSFT (Microsoft)

and AAPL (Apple). This gives us a measure of the similarity between the two stocks

in this time period.

# Your code here

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 5/12

Cluster analysis is an umbrella term for many different algorithmic approaches. Here you'll develop one that's

based on the concept of greedy algorithm design, specified below. You'll also have the opportunity to

explore other approaches using Python libraries.

What is a good measure for stocks "performing similarly" to use for clustering. Let's use the one we just

calculated: correlations in their returns. How can we use this similarity information for clustering? We now have

access to all correlations between stock returns in S&P 500. We can think of this as a graph as follows. The

nodes of the graph are the stocks (eg MSFT and AAPL). The edges between them are the correlations, which

we have just calculated between each stock, where the value of the correlation is the edge weight. Notice that

since we have the correlations between all companies, this is a dense graph, where all possible edges exist.

We thus have a graph representing pairwise "similarity" scores in correlations, and we want to divide the graph

into clusters. There are many possible ways to do this, but here we'll use a greedy algorithm design. The

algorithm is as follows:

1. Sort the edges in the graph by their weight (ie the correlation), pick a number for the number of iterations

of the algorithm

2. Create single-node sets from each node in the graph

3. Repeat times:

A. Pick the graph edge with the highest correlation

B. Combine the two sets containing the source and the destination of the edge

C. Repeat with the next-highest weight edge

4. Return the remaining sets after the iterations

What does the algorithm do? It first initializes a graph with no connections, where each node is in a separate

set. Then in the main loop, it runs through the highest-weighted edges, and adds connections at those

edges. This leads to sets being combined (or "merged"). The result is "groups" of stocks determined by the

highest correlations between the stock returns. These are your stock clusters.

For example, suppose that the toy graph below represents four stocks: A,B,C,D and their return correlations.

Suppose we pick and run the algorithm.

The algorithm would begin by initializing four separate sets of one node each: {A}, {B}, {C}, {D}. It would then

first connect C and D because of their correlation 0.95, resulting in just three sets: {A}, {B}, and {C,D}. Then it

would connect A and B, resulting in two sets of two nodes each: {A,B}, and {C,D}. These would be our clusters

for .

Question 3: Implementing the algorithm

Your task is to implement the clustering algorithm using the functions below. First, for convenience in

implementing the algorithm, let's create a list of the correlations from the pandas data.

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 6/12

In [8]:

Next, let's turn to the algorithm itself. Consider the example above, repeated here.

Suppose we pick and have sorted the edge list in step 1 of the algorithm. How should we represent the

clusters in step 2? One great way is to use a dictionary where each key is a node, and each value is another

node that this node "points to". A cluster is then a chain of these links, which we represent as a dictionary.

In step 2 of the algorithm, we start with four nodes that point to themselves, ie the dictionary

{'A':'A','B':'B','C':'C','D':'D'} . When a node points to itself, it ends the chain. Here the clusters

are thus just the nodes themselves, as in the figure below.

def create_correlation_list(correl):

"""

Creates a list of correlations from a pandas dataframe of correlations

Parameters:

correl: pandas dataframe of correlations

Returns:

list of correlations containing tuples of form (correlation, ticker1, ticker

"""

n_comp = len(correl.columns)

comp_names = list(correl.columns)

# Faster if we use a numpy matrix

correl_mat = correl.as_matrix()

L = [] # create list

for i in range(n_comp):

for j in range(i+1,n_comp):

L.append((correl_mat[i,j],comp_names[i],comp_names[j]))

return L

edges = create_correlation_list(correl)

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 7/12

Let's walk through the algorithm's next steps. We first look at the highest-weight edge, which is between C and

D. These clusters will be combined. In terms of the dictionary, this means that one of them will not point to

itself, but to the other one (here it does not matter which one). So we make the dictionary at C point to D .

The dictionary becomes {'A':'A','B':'B','C':'D','D':'D'} .

The next highest correlation is between A and B, so these clusters would be combined. The dictionary

becomes {'A':'B','B':'B','C':'D','D':'D'} .

The third highest correlation is between C and B. Let's think about combining these clusters using the

dictionary we have. Looking up B , we get B : the node B is in the bottom of the chain representing its cluster.

But when we look up C , it points to D . Should we make C point to B ? No - that would leave nothing

pointing at D , and C and D should remain connected! We could perhaps have C somehow point at both

nodes, but that could become complicated, so we'll do the following instead. We'll follow the chain to the

bottom. In the dictionary, we look up C and see that it points to D . We then look up D which points to itself,

so D is the bottom node. We then pick one of the bottom nodes B and D , and make it point to the other.

We then have the dictionary {'A':'B','B':'B','C':'D','D':'B'} , and the corresponding clustering in

the figure below.

In other words, we'll keep track of clusters in a dictionary such that each cluster has exactly one bottom

node. To do this, we need a mechanism for following a cluster to the bottom. You'll implement this in the

function find_bottom below. The function takes as input a node and a dictionary, and runs through the

"chain" in the dictionary until it finds a bottom node pointing to itself.

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 8/12

The other thing we'll need to do is combine clusters by connecting two nodes. This means taking the two

nodes, finding the bottom node for each node's cluster, and making one point to the other. You'll implement

this in the function merge_clusters below.

Finally, you'll need to set up the algorithm by sorting the correlations, and then looping through this merging

times. You'll implement this in the function cluster_correlations below. This completes the algorithm.

But there is one more thing. If you only keep track of a dictionary like

{'A':'B','B':'B','C':'D','D':'B'} , how do you actually find the clusters from the dictionary? A

convenient way is to store some extra information: the "starting nodes" of each cluster to which no other node

links. For example, above these "starting nodes" would include all nodes A,B,C,D in the beginning, but only

A and C in the end. If we keep track of these, we can then write a function that starts from each such

remaining "starting node", works through to the bottom, and creates the cluster along the way. You'll

implement this in the function construct_sets below.

Intermediary results

You can load a pre-computed set of results up to this point using the following commands.

In [12]:

Clustering implementation

Complete the following functions to implement the clustering algorithm.

['MMM', 'ABT', 'ABBV', 'ACN', 'ATVI', 'AYI', 'ADBE', 'AAP', 'AES', 'AE

T']

Out[12]:

[(0.59866616402973805, 'MMM', 'ABT'),

(0.32263699601940254, 'MMM', 'ABBV'),

(0.63205934885601889, 'MMM', 'ACN'),

(0.41855006701119907, 'MMM', 'ATVI'),

(0.45089749571328591, 'MMM', 'AYI'),

(0.46875484430451653, 'MMM', 'ADBE'),

(0.25713165217544326, 'MMM', 'AAP'),

(0.33537796741224424, 'MMM', 'AES'),

(0.31737374099675925, 'MMM', 'AET'),

(0.50593060558168279, 'MMM', 'AMG')]

# Load intermediary results from a "pickle" file

# You can use these with your algorithm below

import pickle

file_name = 'cluster_correlations'

with open(file_name, "rb") as f:

correl = pickle.load(f)

edges = pickle.load(f)

firms = list(correl.columns)

print(firms[:10])

edges[:10]

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 9/12

In [ ]:

def find_bottom(node, next_nodes):

"""

Find the "bottom" of a cluster starting from node in dictionary next_nodes

Parameters:

node: starting node

next_nodes: dictionary of node connections

Returns:

the bottom node in the cluster

"""

# Your code here

pass

def merge_sets(node1, node2, next_nodes, set_starters):

"""

Merges the clusters containing node1, node2 using the connections dictionary nex

Also removes any bottom node which is no longer a "starting node" from set_start

Parameters:

node1: first node the set of which will be merged

node2: second node the set of which will be merged

next_nodes: dictionary of node connections

set_starters: set of nodes that "start" a cluster

Returns:

does this function need to return something?

"""

# Your code here

def cluster_correlations(edge_list, firms, k=200):

"""

A mystery clustering algorithm

Parameters:

edge_list - list of edges of the form (weight,source,destination)

firms - list of firms (tickers)

k - number of iterations of algorithm

Returns:

next_nodes - dictionary to store clusters as "pointers"

- the dictionary keys are the nodes and the values are the node in the s

set_starters - set of nodes that no other node points to (this will be used

Algorithm:

1 sort edges by weight (highest correlation first)

2 initialize next_nodes so that each node points to itself (single-node clu

3 take highest correlation edge

check if the source and destination are in the same cluster using find_b

if not, merge the source and destination nodes' clusters using merge_set

4 if max iterations not reached, repeat 3 with next highest correlation

(meanwhile also keep track of the "set starters" ie nodes that have nothing

"""

# Sort edges

sorted_edges = _____

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 10/12

Once we've run the algorithm, we'll need to construct the clusters. You can use the function below to do so.

In [ ]:

Question 3.2: analysing the results

After you have implemented the algorithm in Python, add cells below answering the following questions:

# Initialize dictionary of pointers

next_nodes = {node: node for node in firms}

# Keep track of "starting nodes", ie nodes that no other node points to in next_

set_starters = {node for node in firms}

# Loop k times

for i in range(k):

# Your algorithm here

return set_starters, next_nodes

def construct_sets(set_starters, next_nodes):

"""

Constructs sets (clusters) from the next_nodes dictionary

Parameters:

set_starters: set of starting nodes

next_nodes: dictionary of connections

Returns:

dictionary of sets (clusters):

key - bottom node of set; value - set of all nodes in the cluster

"""

# Initialise an empty dictionary

all_sets = dict()

# Loop:

# Start from each set starter node

# Construct a "current set" with all nodes on the way to bottom node

# If bottom node is already a key of all_sets, combine the "current set" with th

# Otherwise add "current set" to all_sets

for s in set_starters:

cur_set = set()

cur_set.add(s)

p = s

while next_nodes[p] != p:

p = next_nodes[p]

cur_set.add(p)

if p not in all_sets:

all_sets[p] = cur_set

else:

for item in cur_set:

all_sets[p].add(item)

return all_sets

all_clusters = construct_sets(set_starters,next_nodes)

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 11/12

Do some detective work: what is the algorithm that you've implemented called? In what other graph

problem is it often used? How are the problems related? (Hint: the algorithm is mentioned on the Wikipedia

page for greedy algorithms.)

Run the algorithm and present the results formatted in a useful way.

Discuss the results for different values of .

Do the resulting clusters "make sense"? (You may need to search online what the companies do.) Verify

that the stocks in your clusters perform similarly by plotting the returns and the (normalised) stock prices

for some of the clusters.

You may use graphs etc. to present your results.

Part 3:

40% of grade

Depending on your interests, you may work on either subsection below, or both. You might go deeper into one

question than another, but for an outstanding grade, you should have at least some discussion on both.

In-depth analysis

The project is open in the sense that you can probably think of further interesting questions to look into based

on returns, correlations, and clusters. This is not required but being creative and going further than the above

questions will make your work stand out. You can explore one or several of the ideas below, or come up with

questions of your own.

Depending on your interests, you might look at different things. For example, when researching the algorithm,

you might be interested in its complexity, and how to improve your implementation's efficiency. On Wikipedia,

you may find a couple of ways to drastically improve the algorithm speed, but are relatively small changes to

your code.

If you're more interested in the financial applications of clustering, there are also opportunities to think about

further steps. For example, some people claim that you can derive trading strategies based on clustering - that

often one of the stocks in a cluster is a leader and the others follow that price. If this is true, you could track the

price of the leader stock and then trade the other stocks in the cluster based on changes in the leader's price.

Do you think this would make sense? Do you have an idea on how to identify a leader stock?

You might also want to repeat the analysis for different time periods. You would be able to do this by looking at

the code for the second homework to figure out how to read data from Yahoo Finance using pandas, and going

through the process for all companies in the csv file for another time period. Perhaps you could explore for

example how correlations between companies have changed over time, or how clusters found by your

algorithm change over time.

Exploring other clustering methods

You've used just one approach to clustering, and arguably not the best one. Research clustering algorithms

and libraries to apply them in Python. Discuss some other algorithms that could be used, and how they differ

from the one you've implemented. Look at the Python library scikit-learn . How would you apply the

clustering algorithms provided by the library to stock price data? Would you need to develop new metrics other

than correlations? If you want to go even further, try running some of these other clustering algorithms on your

data, and report the results. Start from here: http://scikit-learn.org/stable/modules/clustering.html#clustering

2019/10/3 homework_3 - Jupyter Notebook

localhost:8891/notebooks/Desktop/hw03/homework_3.ipynb 12/12

(http://scikit-learn.org/stable/modules/clustering.html#clustering); you'll find a stock market example there too.

For future reference, you may also find other interesting machine-learning tools for both stock market analysis

or other analytics purposes.

Question 4

Create cells below to add your extra part as code and narrative text explaining your idea and results.

All done!

Don't forget to submit with the command

python ok --submit

on the command line.