辅导Analytics Individual、讲解SQL程序设计、讲解Marketing component、辅导SQL语言

2018.10.26 - 首页 >> Database作业

Marketing and Business Analytics Individual

Assignment

1. This assignment accounts for 22% of your final mark.

15% from the Marketing component

7% from the Business Analytics component

2. This is an individual assignment.

3. The assignment is due before 17:00 26th October 2018. The late penalty for the assignment is 10% of

the assigned mark per day, starting after 5pm on the due date. The closing date 5pm 2nd November

2018 is the last date on which an assessment will be accepted for marking.

4. Please only include your student ID in the submitted report, and do NOT include your name.

Background

You are employed as a Data Scientist at a telecommunications company. One of the biggest issues facing

the company is minimising customer churn. Customer churn is when a customer changes provider. The

business is interested in analysing and predicting churn since the cost of acquiring new customers is higher

than retaining existing customers.

This is particularly problematic for the telecommunications industry as changing telecommunications

provider is relatively easy. There are also a large number of price competitive providers to choose from,

which encourages churning.

Your job is to help the marketing department to undertake the following:

Investigate why and which customers churn

Discover retention oppurtunities and strategies

Identify current customers that are likely to churn so that the retention strategy can be applied

Identify potential customers so that the incentive strategy can be applied

Files

Part 1:

churn_survey.json

Part 2:

tweets.db

Submission Instructions

This assessment will be automatically marked. Any deviation from the stated output form will be marked as

an incorrect answer. It is your responsibility to check that the output matches the given template or

example for each question.

What to Submit

Submit only your .ipynb file. You can choose to use this file as a template OR start from scratch. Just make

sure that running your notebook generates all the answers!

Filename

The filename must be "BUSS6002_MKBA_STUDENTID.ipynb"

Loading the data files when marking

We will run your notebook in the same directory as the data files. We will assume the original file names.

Output

The output for each question should be saved to the same directory as the notebook.

Checking the format of your output files

We have created ED Challenges for each question. The challenges will tell you if the FORMAT of your

output file is correct. It does not tell you if your answer is correct. Please test your output files on Ed before

submitting your assignment.

Timeout

We will automatically run your notebook. Each notebook will be given a maximum of 1 minute to be

completed. Please ensure any model training or optimisation will be easily completed in this time frame.

Marking Criteria

1. Correctness of results

Set your Student ID here

In [ ]:

Helper Function

In [ ]:

YOURSTUDENTID = 480181910

# This function is used to save answers with a non-tabular output

def write_txt(student_id, part_number, data):

file = open("ID_{0}_Q_{1}.txt".format(student_id, part_number), 'w')

file.write(data)

file.close()

Part 1 - Understanding Churn and Identifying

Retention Strategies (20 Marks)

Data

The marketing team has collected information from a subset of current and past customers. To understand

why customers churn and identify why customers have churned use the churn_survey.json file.

Data Dictionary

Variable Description

Churn Whether the customer churned or not

Contract The contract term of the customer

Dependents Whether the customer has dependents or not

DeviceProtection Whether the customer has purchased the device protection service or not

Gender Customer gender

InternetService Customer’s internet service type

MonthlyCharges The amount charged to the customer monthly

MultipleLines Whether the customer has multiple lines or not

OnlineBackup Whether the customer has purchased the additional online backup service or not

OnlineSecurity Whether the customer has purchased the additional online security service or not

PaperlessBilling Whether the customer has paperless billing or not

Partner Whether the customer has a partner or not

PaymentMethod The customer’s payment method

PhoneService Whether the customer has a phone service or not

SeniorCitizen Whether the customer is a senior citizen or not

StreamingMovies Whether the customer has purchased the additional streaming movie service or not

StreamingTV Whether the customer has purchased the additional streaming TV service or not

TechSupport Whether the customer has purchased the additional tech support service or not

Tenure Number of months the customer has stayed with the company

TotalCharges The total amount charged to the customer

In [12]:

In [15]:

Questions

1.1 What is the most popular payment method? (1 Mark)

Output your answer as a .txt file containing the name of the most popular payment method.

FILENAME: ID_STUDENTID_Q_1_1.txt

RangeIndex: 3521 entries, 0 to 3520

Data columns (total 20 columns):

Churn 3520 non-null object

Contract 3521 non-null object

Dependents 3521 non-null object

DeviceProtection 3521 non-null object

Gender 3521 non-null object

InternetService 3520 non-null object

MonthlyCharges 3521 non-null float64

MultipleLines 3520 non-null object

OnlineBackup 3521 non-null object

OnlineSecurity 3521 non-null object

PaperlessBilling 3521 non-null object

Partner 3521 non-null object

PaymentMethod 3520 non-null object

PhoneService 3521 non-null object

SeniorCitizen 3521 non-null object

StreamingMovies 3521 non-null object

StreamingTV 3519 non-null object

TechSupport 3520 non-null object

Tenure 3520 non-null float64

TotalCharges 3513 non-null float64

dtypes: float64(3), object(17)

memory usage: 550.2+ KB

# Use this area to load the data

import pandas as pd

import numpy as np

from pandas.io.json import json_normalize

import json

with open('churn_survaey.json') as k:

data = json.load(k)

##pprint(data)

df_1 = json_normalize(data)

df_1.columns

df_1.head(8)

df_1.info(8)

In [19]:

In [28]:

1.2 What is the mean amount spent per month for the churn and non-churn

customers? (1 Mark)

Output your answer as a .csv file with the following format to four decimal places. DO NOT include the $

sign.

Churn MonthlyCharges

No 00.0000

Yes 00.0000

FILENAME: ID_STUDENTID_Q_1_2.csv

Out[19]:

Electronic check 1159

Credit card (automatic) 802

Mailed check 799

Bank transfer (automatic) 760

Name: PaymentMethod, dtype: int64

-------------------------------------------------------------------

--------

NameError Traceback (most recent call last)

<ipython-input-28-369f42b8e974> in <module>()

3 YOURSTUDENTID = 480181910

4 # This will save your answer to a .txt file

----> 5 write_txt(YOURSTUDENTID, "1_1", method_name)

NameError: name 'write_txt' is not defined

df_1['PaymentMethod'].value_counts()

# WRITE YOUR CODE HERE

method_name = 'Electronic check'

YOURSTUDENTID = 480181910

# This will save your answer to a .txt file

write_txt(YOURSTUDENTID, "1_1", method_name)

In [26]:

In [30]:

1.3 What is the standard deviation of amount spent per month for the churn

and non-churn customers? (1 Mark)

Output your answer as a .csv file with the following format to four decimal places. DO NOT include the $

sign.

Churn MonthlyCharges

No 00.0000

Yes 00.0000

FILENAME: ID_STUDENTID_Q_1_3.csv

In [31]:

In [32]:

Out[26]:

MonthlyCharges

Churn

No 61.113471

Yes 74.652519

Out[31]:

MonthlyCharges

Churn

No 31.145963

Yes 25.064977

# WRITE YOUR CODE HERE

MonthlyCharges=pd.DataFrame(df_1.groupby('Churn').mean()['MonthlyCharges'])

MonthlyCharges

MonthlyCharges.to_csv('ID_STUDENTID_Q_1_2.csv')

# WRITE YOUR CODE HERE

MonthlyCharges=pd.DataFrame(df_1.groupby('Churn').std()['MonthlyCharges'])

MonthlyCharges

MonthlyCharges.to_csv('ID_STUDENTID_Q_1_3.csv')

1.4 What is the percentage of contract type for the churn and non-churn

customers (2 Marks)

Output your answer as a .csv file with the following format to two decimal places. Do not include the %

symbol.

Churn Month-to-month One year Two year

No 00.00 00.00 00.00

Yes 00.00 00.00 00.00

This percentage should be relative to the churn status NOT the entire sample i.e. the top left cell is the

percentage of customers on month-to-month contracts who didn't churn.

FILENAME: ID_STUDENTID_Q_1_4.csv

In [34]:

1.5 Which feature and value is most correlated with MonthlyCharges? (2.5

Marks)

Output your answer as a .csv file with the following format containing the most correlated feature name and

value.

Feature Value

FEATURE_NAME FEATURE_VALUE

FILENAME: ID_STUDENTID_Q_1_5.csv

Contract Month-to-month One year Two year

Churn

No 1135 647 805

Yes 826 78 29

Contract Month-to-month One year Two year

Churn

No 0.438732 0.250097 0.311171

Yes 0.885316 0.083601 0.031083

# WRITE YOUR CODE HERE

churn_contract=pd.crosstab(df_1['Churn'],

df_1['Contract'],rownames=['Churn'])

print(churn_contract)

T=churn_contract.apply(lambda x:x/x.sum(),axis=1)

print(T)

T.to_csv(' ID_STUDENTID_Q_1_4.csv')

In [ ]:

1.6 What is the count, mean, std, min, quartiles and max of time before a

customer churns? (2.5 Marks)

Output your result as a two column .csv with the following format to four decimal places

Tenure

count 0.0

mean 0.0

std 0.0

min 0.0

25% 0.0

50% 0.0

75% 0.0

max 0.0

FILENAME: ID_STUDENTID_Q_1_6.csv

In [ ]:

1.7 What is the proportion of purchase for each account addon for male and

female customers? (4 Marks)

Output your result as a .csv with the following format to four decimal places

Gender ADDON1 ADDON2 ...

Female 0.0000 0.0000 ..

Male 0.0000 0.0000 ..

Please use the original name of the addon from the data. You must use your understanding of the data and

the problem to determine where you can find this information in the dataset.

FILENAME: ID_STUDENTID_Q_1_7.csv

# WRITE YOUR CODE HERE

df1_corr=df_1.corr()

df1_corr_sorted=df1_corr.sort_values('MonthlyCharges',axis=0,ascending=false)

corr_dict={'Feature Name':[df1_corr_sorted.index[1]],'Values':[df1_corr_sorted.iloc[1,0]]}

# WRITE YOUR CODE HERE

In [36]:

In [37]:

1.8 Of the listed account addons, which addon/addons could be offered to

churning customers for free or at a discounted rate in order to best retain

them? (6 Marks)

Output your file as a single column .csv with the following format

Addon

ADDONX

ADDONY

...

where ADDONX is the name of one addon that you suggest. You must suggest at least 1 account addon up

the total amount of addons listed in the dataset. You must exercise your best judgement and supporting

evidence from the data to obtain a list of suggested addons. These addons should reflect the interests of

the churning customers, i.e. which addons they actually care about.

FILENAME: ID_STUDENTID_Q_1_8.csv

In [ ]:

Part 2 - Churn Intervention (24 Marks)

Out[36]:

Index(['Churn', 'Contract', 'Dependents', 'DeviceProtection', 'Gender',

'InternetService', 'MonthlyCharges', 'MultipleLines', 'OnlineBackup',

'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PaymentMethod',

'PhoneService', 'SeniorCitizen', 'StreamingMovies', 'StreamingTV',

'TechSupport', 'Tenure', 'TotalCharges'],

dtype='object')

Out[37]:

['OnlineBackup',

'OnlineSecurity',

'StreamingMovies',

'StreamTV',

'Techsupport']

# WRITE YOUR CODE HERE

df_1.columns

columinNames=['OnlineBackup','OnlineSecurity','StreamingMovies','StreamTV','Techsupport']

columinNames

# WRITE YOUR CODE HERE

Part 2 - Churn Intervention (24 Marks)

The marketing teams wants to know if it is possible to identify customers on social media before they churn

and if they can identify potential customers that want to move away from their existing provider. If a

customer is identified before they churn then the retention strategy that you developed in Part 1.6 can be

applied to stop the customer leaving.

The marketing team has outsourced the collection of data and labels. The data was collected from twitter

and includes tweets referencing your company and competitors. The data is available in the tweets.db

SQLite file.

To achieve the goals of the project you will need to do some EDA to understand the nature of the data and

attempt to build a classifier to predict if an individual is likely to churn based on what they wrote in their

tweet.

Data

Schema

The schema for the tweets.db file is below:

churn

Column Description

tid Tweet ID

churn Churn status

set Training or Hidden

tweets

Column Description

tid Tweet ID

uid User ID

date Datetime of the tweet

text Content of the tweet

Training and Hidden Sets

The data has been divided into two sets:

Set Tweets Target

Training Yes Yes

Hidden Yes No

The Churn labels for the training sets has been made available. However the marketing team wants to know

how well your classifier will work on future and unseen data before deploying it. They will assess your

classification performance on the hidden set.

In [44]:

In [ ]:

2.2 Find all tweets in the training set containing the strings "AT&T", "Verizon"

and "switch" (2.5 Marks)

Output the tweets as a two column .csv file with the following format:

tid text

tweet_id1 text1

tweet_id2 text2

tweet_id3 text3

... ...

The first column should be the tweet id and the second column should be the original text of the tweet.

Your search should be invariant to capitilisation.

FILENAME: ID_STUDENTID_Q_2_2.csv

In [ ]:

import sqlite3

from sqlite3 import Error

database="tweets.db"

conn=sqlite3.connect(database)

cur=

### 2.1 How many tweets in the training set contain at least one of the strings "AT&T", "Verizon" or "T-Mobile"

Output the number of tweets as an integer to a .txt file. Your search should be invariant to capitilisation.

FILENAME: ID_STUDENTID_Q_2_1.txt

# WRITE YOUR CODE HERE

number_tweets =

# This will save your answer to a .txt file

write_txt(YOURSTUDENTID, "2_1", number_tweets)

# WRITE YOUR CODE HERE

2.3 Identify Churning Customers via Logistic Regression (Total 10

Marks)

Train a Logistic Regression Classifier to identify tweets from churning customers

Requirements

The original features must be the tweet text data

Use dimension reduction to reduce the dimensionality of the problem. In class you have learnt about

PCA. However PCA will not work for TF or TF-IDF data as it is sparse. You must find an alternative

method in scikit-learn that works with sparse data.

Maximum of 5 components

In Q2.3.5 your marks will be assigned based on your classifiers performance on the hidden set. Make sure

you tune your model thoroughly in section Q2.3.3.

2.3.1 Transform Features (1.5 Marks)

Given the original text data, use an sklearn vectoriser to convert the text to a numeric representation.

Output your fitted vectoriser as a pickle file.

FILENAME: ID_STUDENTID_Q_2_3_1.pickle

In [ ]:

2.3.2 Dimension Reduction (1.5 Marks)

Reduce the dimensionality of your features to a maximum of 5 components.

Output your fitted dimensionality reducing object as a pickle file.

FILENAME: ID_STUDENTID_Q_2_3_2.pickle

# WRITE YOUR CODE HERE

# This code will save your Transformer/Vectoriser object to a file

import pickle

filename = "ID_{0}_Q_2_3_1.pickle".format(YOURSTUDENTID)

# MYTRANSFORMEROBJECT must be a sklearn transformer or vectoriser

s = pickle.dump(MYTRANSFORMEROBJECT, open(filename, 'wb'))

In [ ]:

2.3.3 Tuning (2 Marks)

Tune your model hyper-parameters for best performance. Make sure to tune thoroughly!

Output your fitted GridSearchCV or RandomisedSearchCV object as a pickle file.

FILENAME: ID_STUDENTID_Q_2_3_3.pickle

In [ ]:

2.3.4 Output Model (1 Marks)

Output your trained logistic regression model as a pickle file. In the next part you will be competing against

other students. So make sure you tune your model as best you can!

FILENAME: ID_STUDENTID_Q_2_3_4.pickle

# WRITE YOUR CODE HERE

# This code will save your Dimensionality Reducer object to a file

import pickle

filename = "ID_{0}_Q_2_3_2.pickle".format(YOURSTUDENTID)

# MYREDUCEROBJECT must be a valid dimensionality reducer from sklearn

s = pickle.dump(MYREDUCEROBJECT, open(filename, 'wb'))

# WRITE YOUR CODE HERE

# This code will save your GridSearchCV or RandomisedSearchCV to a file

import pickle

filename = "ID_{0}_Q_2_3_3.pickle".format(YOURSTUDENTID)

# MYGRIDSEARCHOBJECT must be GridSearchCV or RandomisedSearchCV

s = pickle.dump(MYGRIDSEARCHOBJECT, open(filename, 'wb'))

In [ ]:

2.3.5 Predicting Churn for the Hidden Customers (4 Marks)

We will assign marks to this question based on the relative performance of each students classifier. You

must try and tune your classifier in Question 2.3 as best you can!

Output your predictions as a two column .csv file with the following format:

tid Churn

tweet_id1 0

tweet_id2 1

tweet_id3 0

... ...

where pred1 is the predicted class i.e. 1 is "Churn" and 0 is "Not churn".

FILENAME: ID_STUDENTID_Q_2_3_5.csv

In [ ]:

# WRITE YOUR CODE HERE

# This code will save your LogisticRegression to a file

import pickle

filename = "ID_{0}_Q_2_3_4.pickle".format(YOURSTUDENTID)

# MYLOGISTICREGRESSION must be of type sklearn.linear_model.LogisticRegression

s = pickle.dump(MYLOGISTICREGRESSION, open(filename, 'wb'))

# WRITE YOUR CODE HERE

2.4 Prediction Competition (Total 10 Marks)

We will assign marks to this question based on the relative performance of each students classifier.

Your goal is to build the most accurate classification pipeline for the hidden data. You should do your own

research to find suitable preprocessing steps and classifier. You are allowed to use any preprocessing you

like and any sklearn compatible classifier i.e. it must support the following functions:

fit

predict

You must output your classifier (as a pickle file) and predictions (as csv) using the format from Question

2.3.4 and 2.3.5.

Good luck!

FILENAMES:

ID_{0}_Q_2_4_1.pickle

ID_{0}_Q_2_4_1.csv

In [ ]:

# WRITE YOUR CODE HERE