辅导Analytics Individual、 讲解SQL程序设计、讲解Marketing component、辅导SQL语言
- 首页 >> Database作业Marketing and Business Analytics Individual
Assignment
1. This assignment accounts for 22% of your final mark.
15% from the Marketing component
7% from the Business Analytics component
2. This is an individual assignment.
3. The assignment is due before 17:00 26th October 2018. The late penalty for the assignment is 10% of
the assigned mark per day, starting after 5pm on the due date. The closing date 5pm 2nd November
2018 is the last date on which an assessment will be accepted for marking.
4. Please only include your student ID in the submitted report, and do NOT include your name.
Background
You are employed as a Data Scientist at a telecommunications company. One of the biggest issues facing
the company is minimising customer churn. Customer churn is when a customer changes provider. The
business is interested in analysing and predicting churn since the cost of acquiring new customers is higher
than retaining existing customers.
This is particularly problematic for the telecommunications industry as changing telecommunications
provider is relatively easy. There are also a large number of price competitive providers to choose from,
which encourages churning.
Your job is to help the marketing department to undertake the following:
Investigate why and which customers churn
Discover retention oppurtunities and strategies
Identify current customers that are likely to churn so that the retention strategy can be applied
Identify potential customers so that the incentive strategy can be applied
Files
Part 1:
churn_survey.json
Part 2:
tweets.db
Submission Instructions
This assessment will be automatically marked. Any deviation from the stated output form will be marked as
an incorrect answer. It is your responsibility to check that the output matches the given template or
example for each question.
What to Submit
Submit only your .ipynb file. You can choose to use this file as a template OR start from scratch. Just make
sure that running your notebook generates all the answers!
Filename
The filename must be "BUSS6002_MKBA_STUDENTID.ipynb"
Loading the data files when marking
We will run your notebook in the same directory as the data files. We will assume the original file names.
Output
The output for each question should be saved to the same directory as the notebook.
Checking the format of your output files
We have created ED Challenges for each question. The challenges will tell you if the FORMAT of your
output file is correct. It does not tell you if your answer is correct. Please test your output files on Ed before
submitting your assignment.
Timeout
We will automatically run your notebook. Each notebook will be given a maximum of 1 minute to be
completed. Please ensure any model training or optimisation will be easily completed in this time frame.
Marking Criteria
1. Correctness of results
Set your Student ID here
In [ ]:
Helper Function
In [ ]:
YOURSTUDENTID = 480181910
# This function is used to save answers with a non-tabular output
def write_txt(student_id, part_number, data):
file = open("ID_{0}_Q_{1}.txt".format(student_id, part_number), 'w')
file.write(data)
file.close()
Part 1 - Understanding Churn and Identifying
Retention Strategies (20 Marks)
Data
The marketing team has collected information from a subset of current and past customers. To understand
why customers churn and identify why customers have churned use the churn_survey.json file.
Data Dictionary
Variable Description
Churn Whether the customer churned or not
Contract The contract term of the customer
Dependents Whether the customer has dependents or not
DeviceProtection Whether the customer has purchased the device protection service or not
Gender Customer gender
InternetService Customer’s internet service type
MonthlyCharges The amount charged to the customer monthly
MultipleLines Whether the customer has multiple lines or not
OnlineBackup Whether the customer has purchased the additional online backup service or not
OnlineSecurity Whether the customer has purchased the additional online security service or not
PaperlessBilling Whether the customer has paperless billing or not
Partner Whether the customer has a partner or not
PaymentMethod The customer’s payment method
PhoneService Whether the customer has a phone service or not
SeniorCitizen Whether the customer is a senior citizen or not
StreamingMovies Whether the customer has purchased the additional streaming movie service or not
StreamingTV Whether the customer has purchased the additional streaming TV service or not
TechSupport Whether the customer has purchased the additional tech support service or not
Tenure Number of months the customer has stayed with the company
TotalCharges The total amount charged to the customer
In [12]:
In [15]:
Questions
1.1 What is the most popular payment method? (1 Mark)
Output your answer as a .txt file containing the name of the most popular payment method.
FILENAME: ID_STUDENTID_Q_1_1.txt
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3521 entries, 0 to 3520
Data columns (total 20 columns):
Churn 3520 non-null object
Contract 3521 non-null object
Dependents 3521 non-null object
DeviceProtection 3521 non-null object
Gender 3521 non-null object
InternetService 3520 non-null object
MonthlyCharges 3521 non-null float64
MultipleLines 3520 non-null object
OnlineBackup 3521 non-null object
OnlineSecurity 3521 non-null object
PaperlessBilling 3521 non-null object
Partner 3521 non-null object
PaymentMethod 3520 non-null object
PhoneService 3521 non-null object
SeniorCitizen 3521 non-null object
StreamingMovies 3521 non-null object
StreamingTV 3519 non-null object
TechSupport 3520 non-null object
Tenure 3520 non-null float64
TotalCharges 3513 non-null float64
dtypes: float64(3), object(17)
memory usage: 550.2+ KB
# Use this area to load the data
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
import json
with open('churn_survaey.json') as k:
data = json.load(k)
##pprint(data)
df_1 = json_normalize(data)
df_1.columns
df_1.head(8)
df_1.info(8)
In [19]:
In [28]:
1.2 What is the mean amount spent per month for the churn and non-churn
customers? (1 Mark)
Output your answer as a .csv file with the following format to four decimal places. DO NOT include the $
sign.
Churn MonthlyCharges
No 00.0000
Yes 00.0000
FILENAME: ID_STUDENTID_Q_1_2.csv
Out[19]:
Electronic check 1159
Credit card (automatic) 802
Mailed check 799
Bank transfer (automatic) 760
Name: PaymentMethod, dtype: int64
-------------------------------------------------------------------
--------
NameError Traceback (most recent call last)
<ipython-input-28-369f42b8e974> in <module>()
3 YOURSTUDENTID = 480181910
4 # This will save your answer to a .txt file
----> 5 write_txt(YOURSTUDENTID, "1_1", method_name)
NameError: name 'write_txt' is not defined
df_1['PaymentMethod'].value_counts()
# WRITE YOUR CODE HERE
method_name = 'Electronic check'
YOURSTUDENTID = 480181910
# This will save your answer to a .txt file
write_txt(YOURSTUDENTID, "1_1", method_name)
In [26]:
In [30]:
1.3 What is the standard deviation of amount spent per month for the churn
and non-churn customers? (1 Mark)
Output your answer as a .csv file with the following format to four decimal places. DO NOT include the $
sign.
Churn MonthlyCharges
No 00.0000
Yes 00.0000
FILENAME: ID_STUDENTID_Q_1_3.csv
In [31]:
In [32]:
Out[26]:
MonthlyCharges
Churn
No 61.113471
Yes 74.652519
Out[31]:
MonthlyCharges
Churn
No 31.145963
Yes 25.064977
# WRITE YOUR CODE HERE
MonthlyCharges=pd.DataFrame(df_1.groupby('Churn').mean()['MonthlyCharges'])
MonthlyCharges
MonthlyCharges.to_csv('ID_STUDENTID_Q_1_2.csv')
# WRITE YOUR CODE HERE
MonthlyCharges=pd.DataFrame(df_1.groupby('Churn').std()['MonthlyCharges'])
MonthlyCharges
MonthlyCharges.to_csv('ID_STUDENTID_Q_1_3.csv')
1.4 What is the percentage of contract type for the churn and non-churn
customers (2 Marks)
Output your answer as a .csv file with the following format to two decimal places. Do not include the %
symbol.
Churn Month-to-month One year Two year
No 00.00 00.00 00.00
Yes 00.00 00.00 00.00
This percentage should be relative to the churn status NOT the entire sample i.e. the top left cell is the
percentage of customers on month-to-month contracts who didn't churn.
FILENAME: ID_STUDENTID_Q_1_4.csv
In [34]:
1.5 Which feature and value is most correlated with MonthlyCharges? (2.5
Marks)
Output your answer as a .csv file with the following format containing the most correlated feature name and
value.
Feature Value
FEATURE_NAME FEATURE_VALUE
FILENAME: ID_STUDENTID_Q_1_5.csv
Contract Month-to-month One year Two year
Churn
No 1135 647 805
Yes 826 78 29
Contract Month-to-month One year Two year
Churn
No 0.438732 0.250097 0.311171
Yes 0.885316 0.083601 0.031083
# WRITE YOUR CODE HERE
churn_contract=pd.crosstab(df_1['Churn'],
df_1['Contract'],rownames=['Churn'])
print(churn_contract)
T=churn_contract.apply(lambda x:x/x.sum(),axis=1)
print(T)
T.to_csv(' ID_STUDENTID_Q_1_4.csv')
In [ ]:
1.6 What is the count, mean, std, min, quartiles and max of time before a
customer churns? (2.5 Marks)
Output your result as a two column .csv with the following format to four decimal places
Tenure
count 0.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
FILENAME: ID_STUDENTID_Q_1_6.csv
In [ ]:
1.7 What is the proportion of purchase for each account addon for male and
female customers? (4 Marks)
Output your result as a .csv with the following format to four decimal places
Gender ADDON1 ADDON2 ...
Female 0.0000 0.0000 ..
Male 0.0000 0.0000 ..
Please use the original name of the addon from the data. You must use your understanding of the data and
the problem to determine where you can find this information in the dataset.
FILENAME: ID_STUDENTID_Q_1_7.csv
# WRITE YOUR CODE HERE
df1_corr=df_1.corr()
df1_corr_sorted=df1_corr.sort_values('MonthlyCharges',axis=0,ascending=false)
corr_dict={'Feature Name':[df1_corr_sorted.index[1]],'Values':[df1_corr_sorted.iloc[1,0]]}
# WRITE YOUR CODE HERE
In [36]:
In [37]:
1.8 Of the listed account addons, which addon/addons could be offered to
churning customers for free or at a discounted rate in order to best retain
them? (6 Marks)
Output your file as a single column .csv with the following format
Addon
ADDONX
ADDONY
...
where ADDONX is the name of one addon that you suggest. You must suggest at least 1 account addon up
the total amount of addons listed in the dataset. You must exercise your best judgement and supporting
evidence from the data to obtain a list of suggested addons. These addons should reflect the interests of
the churning customers, i.e. which addons they actually care about.
FILENAME: ID_STUDENTID_Q_1_8.csv
In [ ]:
Part 2 - Churn Intervention (24 Marks)
Out[36]:
Index(['Churn', 'Contract', 'Dependents', 'DeviceProtection', 'Gender',
'InternetService', 'MonthlyCharges', 'MultipleLines', 'OnlineBackup',
'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PaymentMethod',
'PhoneService', 'SeniorCitizen', 'StreamingMovies', 'StreamingTV',
'TechSupport', 'Tenure', 'TotalCharges'],
dtype='object')
Out[37]:
['OnlineBackup',
'OnlineSecurity',
'StreamingMovies',
'StreamTV',
'Techsupport']
# WRITE YOUR CODE HERE
df_1.columns
columinNames=['OnlineBackup','OnlineSecurity','StreamingMovies','StreamTV','Techsupport']
columinNames
# WRITE YOUR CODE HERE
Part 2 - Churn Intervention (24 Marks)
The marketing teams wants to know if it is possible to identify customers on social media before they churn
and if they can identify potential customers that want to move away from their existing provider. If a
customer is identified before they churn then the retention strategy that you developed in Part 1.6 can be
applied to stop the customer leaving.
The marketing team has outsourced the collection of data and labels. The data was collected from twitter
and includes tweets referencing your company and competitors. The data is available in the tweets.db
SQLite file.
To achieve the goals of the project you will need to do some EDA to understand the nature of the data and
attempt to build a classifier to predict if an individual is likely to churn based on what they wrote in their
tweet.
Data
Schema
The schema for the tweets.db file is below:
churn
Column Description
tid Tweet ID
churn Churn status
set Training or Hidden
tweets
Column Description
tid Tweet ID
uid User ID
date Datetime of the tweet
text Content of the tweet
Training and Hidden Sets
The data has been divided into two sets:
Set Tweets Target
Training Yes Yes
Hidden Yes No
The Churn labels for the training sets has been made available. However the marketing team wants to know
how well your classifier will work on future and unseen data before deploying it. They will assess your
classification performance on the hidden set.
In [44]:
In [ ]:
In [ ]:
2.2 Find all tweets in the training set containing the strings "AT&T", "Verizon"
and "switch" (2.5 Marks)
Output the tweets as a two column .csv file with the following format:
tid text
tweet_id1 text1
tweet_id2 text2
tweet_id3 text3
... ...
The first column should be the tweet id and the second column should be the original text of the tweet.
Your search should be invariant to capitilisation.
FILENAME: ID_STUDENTID_Q_2_2.csv
In [ ]:
import sqlite3
from sqlite3 import Error
database="tweets.db"
conn=sqlite3.connect(database)
cur=
### 2.1 How many tweets in the training set contain at least one of the strings "AT&T", "Verizon" or "T-Mobile"
Output the number of tweets as an integer to a .txt file. Your search should be invariant to capitilisation.
FILENAME: ID_STUDENTID_Q_2_1.txt
# WRITE YOUR CODE HERE
number_tweets =
# This will save your answer to a .txt file
write_txt(YOURSTUDENTID, "2_1", number_tweets)
# WRITE YOUR CODE HERE
2.3 Identify Churning Customers via Logistic Regression (Total 10
Marks)
Train a Logistic Regression Classifier to identify tweets from churning customers
Requirements
The original features must be the tweet text data
Use dimension reduction to reduce the dimensionality of the problem. In class you have learnt about
PCA. However PCA will not work for TF or TF-IDF data as it is sparse. You must find an alternative
method in scikit-learn that works with sparse data.
Maximum of 5 components
In Q2.3.5 your marks will be assigned based on your classifiers performance on the hidden set. Make sure
you tune your model thoroughly in section Q2.3.3.
2.3.1 Transform Features (1.5 Marks)
Given the original text data, use an sklearn vectoriser to convert the text to a numeric representation.
Output your fitted vectoriser as a pickle file.
FILENAME: ID_STUDENTID_Q_2_3_1.pickle
In [ ]:
In [ ]:
2.3.2 Dimension Reduction (1.5 Marks)
Reduce the dimensionality of your features to a maximum of 5 components.
Output your fitted dimensionality reducing object as a pickle file.
FILENAME: ID_STUDENTID_Q_2_3_2.pickle
# WRITE YOUR CODE HERE
# This code will save your Transformer/Vectoriser object to a file
import pickle
filename = "ID_{0}_Q_2_3_1.pickle".format(YOURSTUDENTID)
# MYTRANSFORMEROBJECT must be a sklearn transformer or vectoriser
s = pickle.dump(MYTRANSFORMEROBJECT, open(filename, 'wb'))
In [ ]:
In [ ]:
2.3.3 Tuning (2 Marks)
Tune your model hyper-parameters for best performance. Make sure to tune thoroughly!
Output your fitted GridSearchCV or RandomisedSearchCV object as a pickle file.
FILENAME: ID_STUDENTID_Q_2_3_3.pickle
In [ ]:
In [ ]:
In [ ]:
2.3.4 Output Model (1 Marks)
Output your trained logistic regression model as a pickle file. In the next part you will be competing against
other students. So make sure you tune your model as best you can!
FILENAME: ID_STUDENTID_Q_2_3_4.pickle
# WRITE YOUR CODE HERE
# This code will save your Dimensionality Reducer object to a file
import pickle
filename = "ID_{0}_Q_2_3_2.pickle".format(YOURSTUDENTID)
# MYREDUCEROBJECT must be a valid dimensionality reducer from sklearn
s = pickle.dump(MYREDUCEROBJECT, open(filename, 'wb'))
# WRITE YOUR CODE HERE
# This code will save your GridSearchCV or RandomisedSearchCV to a file
import pickle
filename = "ID_{0}_Q_2_3_3.pickle".format(YOURSTUDENTID)
# MYGRIDSEARCHOBJECT must be GridSearchCV or RandomisedSearchCV
s = pickle.dump(MYGRIDSEARCHOBJECT, open(filename, 'wb'))
In [ ]:
In [ ]:
2.3.5 Predicting Churn for the Hidden Customers (4 Marks)
We will assign marks to this question based on the relative performance of each students classifier. You
must try and tune your classifier in Question 2.3 as best you can!
Output your predictions as a two column .csv file with the following format:
tid Churn
tweet_id1 0
tweet_id2 1
tweet_id3 0
... ...
where pred1 is the predicted class i.e. 1 is "Churn" and 0 is "Not churn".
FILENAME: ID_STUDENTID_Q_2_3_5.csv
In [ ]:
# WRITE YOUR CODE HERE
# This code will save your LogisticRegression to a file
import pickle
filename = "ID_{0}_Q_2_3_4.pickle".format(YOURSTUDENTID)
# MYLOGISTICREGRESSION must be of type sklearn.linear_model.LogisticRegression
s = pickle.dump(MYLOGISTICREGRESSION, open(filename, 'wb'))
# WRITE YOUR CODE HERE
2.4 Prediction Competition (Total 10 Marks)
We will assign marks to this question based on the relative performance of each students classifier.
Your goal is to build the most accurate classification pipeline for the hidden data. You should do your own
research to find suitable preprocessing steps and classifier. You are allowed to use any preprocessing you
like and any sklearn compatible classifier i.e. it must support the following functions:
fit
predict
You must output your classifier (as a pickle file) and predictions (as csv) using the format from Question
2.3.4 and 2.3.5.
Good luck!
FILENAMES:
ID_{0}_Q_2_4_1.pickle
ID_{0}_Q_2_4_1.csv
In [ ]:
# WRITE YOUR CODE HERE