Master's Thesis

Bank Customer Churn Analysis Based on Decision Trees

  Course code and name:

F21RP-Research Methods and Project Planning

Type of assessment:

Individual

  Coursework Title:

  Bank Customer Churn Analysis Based on Decision Trees

Abstract

The rapid development of new information networks is changing the traditional business model. In the financial sector in particular, customers have higher expectations that banks' products and services will meet their needs, and at the same time competition is intensifying. In such a setting, customer loyalty has become an important indicator of a bank's ability to build and retain customer relationships, which in turn enables banks to strengthen their competitive advantage.

Against this background, the research uses basic customer data from a bank and the needs of business marketing scenarios to abstract a problem model for individual customer churn. Feature engineering is combined with the customer labeling system to derive data characteristics suitable for the bank's customer churn early-warning model. A churn model applicable to the bank's effective-value customers is then built using the decision tree algorithm. Based on this model, the study suggests improvements to bank products, marketing strategies, customer maintenance, and services.

Keywords: Decision Tree, Bank Customer Churn, Unbalanced Data, Hybrid Feature Selection.

1. Introduction

1.1 Purpose of The Study

Customers are the foundation of a bank and the fundamental resource for its survival and development. Over time, competition between banks for customer resources has intensified, and attention in the banking industry has shifted towards customer needs, with efforts focused on satisfying those needs through improved customer service. In this context, retaining bank customers and improving their loyalty and dependence is critical. With the gradual development of modern technology, the operation of the modern financial industry has changed fundamentally. The customer relationship management system is an important tool that not only serves for batch maintenance and operation of customers but also offers great possibilities for in-depth data analysis, mining of potential demand, precision marketing, and control of marketing processes.

1.2 Shortcomings of Existing Studies

Statistical analysis shows that individual customers have become the principal customer group for banks. Maintaining and expanding this group has therefore become the top priority in the daily operation of commercial banks. As customers' consumption demand upgrades, their requirements for financial services rise and the supply and demand patterns of the financial market diverge, so the composition of a bank's customer base changes under a combination of factors. Customer attrition not only increases a bank's marketing expenses and opportunity costs but also damages its reputation [Wang, W.Q., Yao, R., Liu, C., et al., 2014]. Studies have shown that customer churn has a large impact on bank profits: reducing customer churn by 5% can bring a 30% to 85% increase in profits. The cost of developing a new customer is five to seven times the cost of retaining one, and the success rate of developing new customers is only one-sixteenth of that of keeping existing customers [Xiao, J., Liu, D.H., and He, C.Z., 2012].

1.3 Research Motivation

This project studies a customer churn early-warning model, identifies the key factors that influence customer churn, and applies them in practice using retail customer labeling data. Classification prediction methods are used to predict customer churn and to find the key factors that influence it. A model is then established to forecast the likelihood of customer churn effectively, and corresponding recovery measures are formulated so that customers are not lost. In this way banks can enhance their core competitiveness.

1.4 Project Research Objectives

The main objective of the research is to examine the churn behavior of bank customers in detail by analyzing large amounts of data, to predict churn using modern modeling techniques, and finally to design an effective customer retention strategy that supports the bank's continued business development and profit growth. The specific objectives of the study are as follows:

O1. Identify key predictors of customer churn

Collect, analyze, and process the bank's customer data, including but not limited to demographic characteristics, account types, transaction behaviors, credit histories, and service interaction records. Descriptive statistics and exploratory data analysis will be used to find the variables most likely to influence a customer's decision to churn.

O2. Model Evaluation and Performance Optimization

Performance metrics include accuracy, recall, F1 score, and area under the ROC curve, among others. We will also study how model performance can be optimized using techniques such as pruning and ensemble learning methods (Random Forest and Boosted Trees).
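The metrics named above can be computed directly from predictions. The following is a minimal sketch, assuming binary labels where 1 means "churned"; the function names and the sample labels and scores are hypothetical and are not taken from the study's data.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = churned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Area under the ROC curve as the share of (churner, non-churner)
    pairs in which the churner gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions and churn scores for ten customers.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.1, 0.4, 0.3, 0.6, 0.2, 0.8, 0.1, 0.3]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

On churn data, where churners are usually the minority class, recall and F1 tend to be more informative than accuracy alone.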

O3. Construct and optimize a decision tree predictive model

Deploy a decision tree algorithm to build the customer churn prediction model. The model will identify customers who are likely to churn and reveal the predictive power of the key factors behind churn. Model building involves selecting appropriate parameters, using cross-validation to avoid overfitting, and checking the model's accuracy and robustness with training and test sets.

O4. Develop data-driven customer retention strategies

Propose strategies, based on the model insights, that could reduce customer churn. These may include enhancing the customer service experience, changing pricing strategies, delivering customized marketing campaigns, and optimizing the product portfolio. The aim is to recommend strategies that best meet the needs and preferences of the different customer segments.

With these research objectives, the present study aims not only to improve customer retention for banks but also to provide a methodological reference for applying the decision tree model in a complex data environment. It can help banks better understand customers' needs for service optimization and compete effectively in a highly competitive financial market.

1.5 Possible innovations and shortcomings

1.5.1 Possible innovations

1. The hybrid feature selection strategy applied to the dataset in this project has strong mining capacity. Experiments show that the method uncovers feature information and intrinsic connections, which helps to improve the model's predictive performance.

2. The project builds a combined model using a two-stage serial combination. The experimental results show that the combined model effectively enhances predictive ability.

1.5.2 Possible shortcomings

1. Feature processing relies mainly on the original data, and features are selected without considering the practical significance of the variables. Some important features may therefore be missed.

2. The project adopts a two-stage serial combination when constructing the combined model and compares it only with the single model. Other forms of combination remain to be tried later for comparative analysis.

2. Background

2.1 Review of bank customer churn prediction methods

The study of customer churn prediction methods encompasses various research efforts that provide insights into how organizations can mitigate customer turnover. Colgate et al. explored the churn behaviors of tertiary students within Irish financial services, using questionnaires to analyze churn causes and patterns. They also examined how these factors align with financial policies and marketing strategies, emphasizing the need to tailor these aspects to reduce churn [Colgate, M., Stewart, K., Kinsella, R., 1996]. Walsh et al. utilized structural equation modeling based on surveys from 462 customers of a German utility company. Their findings suggest that enhancing customer satisfaction through targeted marketing can significantly reduce churn. They also highlighted how corporate reputation and customer satisfaction impact churn, offering actionable insights for refining marketing strategies [Walsh, G., Dinnie, K., Wiedmann, K., 2006]. Sohn et al. developed a competing risk model that incorporates customer characteristics, aimed in particular at the mobile telecommunications sector in South Korea. Their study was prompted by the introduction of mobile number portability, and they proposed management guidelines based on their findings to help companies better handle customer retention under this new regime [Sohn, S.Y., Lee, J.K., 2008]. Chen et al. built a customer value model using logistics industry data, identifying key factors that lead to the loss of valuable customers. Their research provides strategic recommendations for customer management, aiming to enhance retention and prevent churn [Chen, K., Hu, Y.H., Hsieh, Y.C., 2015]. These studies collectively advance the understanding of customer churn and offer a solid foundation for companies to develop targeted strategies to enhance customer loyalty and retention.

Hwang et al. posited that customer churn could be assessed from the perspective of the customer's value to the bank and their potential future revenue generation. This approach focuses on the economic contributions of customers to predict churn, advocating a value-based strategy to identify high-risk customers [Hwang, H., Jung, T., Suh, E., 2004]. Lu et al. implemented logistic regression as a base learner in their churn prediction models, enhancing model accuracy through the use of boosting techniques. Their research targets the creation of early warning systems tailored to different bank customer groups, demonstrating the effectiveness of adaptive learning methods in improving predictive accuracy [Lu, N., Lin, H., Lu, J., et al., 2014]. Vafeiadis et al. examined several common machine learning algorithms for classifying potential churners but found no definitive best learner due to the complex interplay of factors like data type and distribution. Their work underscores the challenges in selecting the optimal machine learning approach in environments with diverse data characteristics [Vafeiadis, T., Diamantaras, K.I., Sarigiannidis, G., et al., 2015].

The existing literature primarily explores the use of integrated algorithms for predicting bank customer churn, with several studies incorporating advanced feature  processing  techniques  like  sampling  and  feature  derivation. These techniques have proven effective at extracting valuable insights from the data, though the overall enhancement  in  model  performance  remains  moderate. The studies suggest that while current methods are capable of identifying key indicators  of  churn,  there  remains  a  significant  opportunity  to  explore combinatorial  modeling  approaches,  which  have  been  less  utilized  in  this domain. This gap in research presents a potential area for further exploration to develop more robust and comprehensive predictive models.

2.2 Overview of modeling methods

Mozer et al. designed an early warning model for subscriber churn based on extensive US domestic subscriber data, totaling nearly 47,000 entries. This data encompassed a wide range of variables including consumption history, billing information, credit card data, application usage, and customer complaints. The researchers applied a variety of predictive modeling techniques, such as logistic regression, decision trees, neural networks, and boosting algorithms. The insights gained from the model were used to tailor subscriber incentives, aiming to improve retention rates and maximize operator profits. This model's effectiveness was not only theoretically proven but also practically verified in real business environments, demonstrating its applicability and impact on business operations [Mozer, M.C., Wolniewicz, R., Grimes, D.B., et al., 2000]. Mohammed et al. focused on comparing decision tree and logistic regression models to determine their efficacy in predicting customer churn. Their study was grounded in empirical data from a mobile operator's business records. The findings indicated that decision trees offered superior performance over logistic regression models in this context, suggesting that decision trees might be more adept at handling the complexities and nuances involved in churn prediction in the telecom sector [Mohammed, H., Ali, T., Tariq, E., et al., 2015]. Vafeiadis et al. utilized a public telecom customer dataset to perform a comparative analysis of five different algorithms using the Monte Carlo method, enhanced with boosting techniques. The algorithms tested included artificial neural networks, support vector machines, decision trees, naive Bayes, and logistic regression.
Their comprehensive evaluation concluded that support vector machines provided the best results among the tested models, particularly in managing and interpreting the dataset's variability and complexity [Vafeiadis, T., Diamantaras, K.I., Sarigiannidis, G., et al., 2015]. Huang conducted an analysis based on real customer data from an Irish telecom company. He compared several algorithms including logistic regression, linear classifiers, naive Bayes, decision trees, multi-layer perceptron neural networks, support vector machines, and genetic algorithms. His study aimed to determine which of these methods was most effective at predicting customer churn, offering a broad perspective on the relative strengths of these diverse techniques in handling real-world data [Huang, B., Kechadi, M.T., Buckley, B., 2012]. Lu focused on real data from a telecommunications company to research churn prediction. Utilizing the boosting algorithm to weight and segment customer groups, Lu developed an early warning model for churn and compared it to a standard logistic regression model. The results showed that the boosted model provided superior predictive accuracy, illustrating the effectiveness of ensemble methods in enhancing churn prediction [Lu, N., Lin, H., Lu, J., et al., 2014]. Bi addressed the challenges of predicting customer churn in big data environments. He proposed a clustering method named SDSCM (semantic-driven subtractive clustering method), which is based on the subtractive clustering method (SCM) and axiomatic fuzzy sets (AFS). This method was applied specifically to tackle churn in China Telecom's vast customer data, providing new strategies for improving customer churn management at scale [Bi, W., Cai, M., Liu, M., et al., 2016]. Amin explored the use of rough set theory in the telecom industry to predict customer churn.
He tested several algorithms, including exhaustive algorithms, genetic algorithms, and covering algorithms based on rough set theory. His work verified the effectiveness of rough set theory as a valuable tool in understanding and predicting customer churn, offering a novel approach to mining complex customer data [Amin, A., Anwar, S., Adnan, A., et al., 2017].

Although there is a large body of academic literature in the area, a comprehensive study of the modeling of bank customer churn has not been pursued. This paper takes the data characteristics described above into account and uses a hybrid feature selection strategy with comprehensive sampling, so that the evaluation has practical application value for research on predicting customer churn in banks.

2.3 Review of customer churn influencing factors and retention strategies

The existing literature has reviewed churn-related factors almost purely quantitatively, from the perspective of customer characteristics, helping banks to develop targeted programs and explore marketing countermeasures for key customer segments. Oskarsdottir et al. study customers in industries such as telecom and banking, which are saturated yet have relatively fast-growing markets, taking into account changes in value over the customer lifecycle and the heterogeneity of Customer Lifetime Value (CLV). The proposed Expected Maximum Profit (EMP) metric, measured with the characteristics of the customer base, offers new insight into customer retention [Oskarsdottir, M., Baesens, B., Vanthienen, J., 2018]. Kumar and Luthra use the analytic hierarchy process and the decision-making trial and evaluation laboratory (DEMATEL) method to propose a customer retention strategy for the automotive industry [Kumar, A., Luthra, S., 2017]. Bahri-Ammari and Bilgihan argue that, in the telecommunications industry, satisfaction and loyalty are the cardinal factors in predicting customer retention with the operator, and that distributive justice enhances customer satisfaction and loyalty [Bahri-Ammari, N., Bilgihan, A., 2017].

Relationship-centered theorists such as Gurit, E.G. likewise link distributive justice and interactional justice to customer loyalty and satisfaction, which are among the key determinants of customer retention in the telecommunications industry. Diaz, using logistic regression and GSEM estimation methods, proposed that service quality has a two-sided effect on customer satisfaction, both positive and negative [Diaz, G.R., 2017]. These assessments were the only ones that showed a positive impact on the categories of customer satisfaction, and they also had a statistically significant impact on other service attributes such as customer care, tariff and plan information, and billing clarity. Similar asymmetric results were found for other economic, socio-economic, and geographic determinants of customer decision making.

2.4 Shortcomings of Existing Research

From the literature reviewed, I found that most research on customer churn focuses on using big data mining technology to build customer churn prediction models. These studies observe customer behavior through the data and explore in depth the problems of massive raw data, class imbalance, and high feature dimensionality. They develop a variety of intelligent algorithms and techniques to solve these problems and use them as a basis for verifying the accuracy of early-warning churn models and for comparing good and bad models. In most studies, the causes of customer churn are examined separately from churn prediction and customer retention. In addition, while a few models predict customer churn with high accuracy, most prediction models are derived only from the data generated by interactions between a customer and the enterprise. Such blind data mining considers only the customer's own factors and ignores the external environment, so the churn factors that arise within the bank and from the external environment are easily concealed. Research on churn early warning should not simply predict whether a customer will be lost but should also indicate which customers are worth retaining and suggest strategies for doing so.

Meanwhile, although imbalanced-data research is relatively mature in fields such as medicine, finance, and the internet, research on banks in this area remains only partly comprehensive. Further, discussion of banks' customer enhancement strategies generally focuses on maintaining and marketing to key customers, which biases the choice of research objects.
Existing research also briefly assesses banks' profitability, competitiveness, cross-regional development, market positioning, risk management, and capital replenishment, but offers very little on early warning of bank customer loss risk. This paper therefore analyzes in detail the factors influencing customer churn, considered comprehensively, and elaborates on the current situation of bank development. After an appropriate model for early warning of customer churn risk is selected, the original data is processed so as to provide a sound predictive method for a bank's customer relationship management, and corresponding retention measures are then formulated based on the research results.

3. Project Requirements and Methodology

3.1 Project requirements/sequence/classification

Table 3.1 Project Requirements

3.2 Decision Tree Models

A decision tree is an analytical model shaped like a tree. The root node corresponds to the whole data collection; each internal node corresponds to a classification question, namely a test on a single attribute that splits the data into two or more subsets; and each leaf node is a data partition with a class label. The path from the root node to a leaf node forms a prediction of the class of the corresponding object. Generating a decision tree is a top-down, divide-and-conquer process that can be applied to classification or rule learning problems.

The internal nodes of a decision tree represent attributes or sets of attributes, and the leaf nodes represent the categories or conclusions of the learned division. The attributes at internal nodes are sometimes called test attributes or split attributes. The branches leading down from a non-terminal node correspond to the values of its attribute. To classify a new instance whose category is unknown, one starts from the root node, tests the attribute at that node, compares the instance's value with the node's attribute values, and moves down the matching branch. The comparison is repeated at each child node until a leaf node is reached, whose class is assigned to the instance.
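The root-to-leaf traversal described above can be sketched in a few lines. This is only an illustration: the nested-dictionary layout, the `predict` helper, and the thresholds are hypothetical and hand-written, not learned from the study's data. The feature names are borrowed from the dataset described in Section 3.2.4.

```python
# Hypothetical hand-written tree; thresholds are illustrative, not learned.
toy_tree = {
    "feature": "Age", "threshold": 45,
    "left": {"label": 0},                      # Age <= 45: predicted to stay
    "right": {
        "feature": "IsActiveMember", "threshold": 0.5,
        "left": {"label": 1},                  # inactive member: predicted churn
        "right": {"label": 0},                 # active member: predicted to stay
    },
}


def predict(node, sample):
    """Walk from the root to a leaf; return (class label, tests taken)."""
    path = []
    while "label" not in node:                 # internal node: test one attribute
        feature, threshold = node["feature"], node["threshold"]
        go_left = sample[feature] <= threshold
        path.append(f"{feature} {'<=' if go_left else '>'} {threshold}")
        node = node["left"] if go_left else node["right"]
    return node["label"], path


label, path = predict(toy_tree, {"Age": 52, "IsActiveMember": 0})
print(label, path)
```

The recorded path is what makes a decision tree interpretable: each prediction can be read back as a short chain of attribute tests.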

3.2.1 Decision Tree Construction Ideas

1. Start with a training set of processed data and an empty tree; the root becomes the current node.

2. If all training samples at the current node belong to the same class, create a leaf node labeled with that class and finish.

3. Otherwise, score every possible partition of the samples at the current node using the chosen splitting measure.

4. Select the best partition as the test for the current node and create a child node for each of its outcomes.

5. Label each edge between the parent and a child with the corresponding outcome of the test, and distribute the training data to the child nodes according to that outcome.

6. Treat each child node as the new current node and repeat steps 2 to 5 until no node can be divided further.
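The steps above can be sketched as a short recursive procedure. This is a minimal illustration under stated assumptions: the splitting measure is taken to be entropy-based information gain (the text leaves the measure open), splits are binary thresholds on numeric features, and the six sample rows of (age, active-member flag) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Step 3: score every (feature, threshold) partition by information gain."""
    base = entropy(labels)
    best = None                                   # (gain, feature index, threshold)
    for f in range(len(rows[0])):
        # Skip the largest value: it would leave the right side empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def build_tree(rows, labels):
    # Step 2: a pure node becomes a leaf labeled with its class.
    if len(set(labels)) == 1:
        return {"label": labels[0]}
    split = best_split(rows, labels)
    # Nothing left to divide on, or no gain: fall back to a majority-class leaf.
    if split is None or split[0] <= 0:
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, f, t = split                               # step 4: best test for this node
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    # Steps 5 and 6: hand the data to each child and recurse.
    return {
        "feature": f, "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left]),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right]),
    }


# Hypothetical rows: (Age, IsActiveMember); label 1 = churned.
rows = [(25, 1), (28, 0), (30, 0), (35, 1), (50, 0), (55, 1), (60, 0), (65, 0)]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
tree = build_tree(rows, labels)
print(tree)
```

Stopping when no split yields a gain is a simplification chosen for brevity; it is one of several possible stopping rules.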

3.2.2 Decision Tree Growth

The root node represents the whole dataset, and the tree grows by recursively dividing the data into subsets at each node. Each division is chosen by testing the available attributes so that the resulting subsets have higher "purity". Growth along a branch stops when the samples in it are pure enough that further division is no longer meaningful, or when the subsets are too small to divide further in a meaningful way. The central question in growing a decision tree is how to establish the division criterion. The accompanying diagram shows the growth process of a decision tree.

Figure 3.1 Decision Tree Pruning

3.2.3 Pruning of decision tree

A decision tree generated by the growth process above often overfits because of noise and outliers, so that it does not predict new samples well. Decision tree pruning is used to avoid this problem. The two commonly used methods are pre-pruning and post-pruning.

Pre-pruning specifies in advance the maximum depth of the tree or the minimum number of samples allowed at a node. These limits prevent the tree from growing to its full depth, although setting an appropriate depth or sample threshold requires iterative trials to tune the parameters. It also relies on the user having a good knowledge of the distribution of the variables.

Post-pruning lets the decision tree grow fully and then removes, according to certain criteria, the branches that contribute little to classification. It is essentially a process of repeated testing and pruning: after each pruning step, the prediction accuracy of the current tree on the output variable is calculated. A maximum tolerable error rate is given in advance; if it is reached, pruning ends immediately, and otherwise pruning continues. Post-pruning uses a test sample set to decide where to stop: pruning should stop once the error rate on the test set becomes markedly larger.
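Both pruning styles can be illustrated on a small dictionary-based tree. The sketch below is an assumption-laden example, not the study's procedure: `should_stop` is a hypothetical pre-pruning check, and `post_prune` is a simple reduced-error variant that collapses a subtree into its majority-class leaf whenever the error on a held-out set does not rise above the tolerated level. The tree, the `majority` field, and the held-out rows are all hypothetical.

```python
def should_stop(depth, n_samples, max_depth=3, min_samples=5):
    """Pre-pruning: refuse to split a node that is too deep or too small."""
    return depth >= max_depth or n_samples < min_samples


def predict(node, row):
    while "label" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]


def error_rate(tree, rows, labels):
    return sum(predict(tree, r) != y for r, y in zip(rows, labels)) / len(labels)


def post_prune(node, root, rows, labels, max_error=0.0):
    """Post-pruning, bottom-up: collapse a subtree into a leaf and keep the
    change unless held-out error rises above both its previous value and
    the tolerated max_error."""
    if "label" in node:
        return
    post_prune(node["left"], root, rows, labels, max_error)
    post_prune(node["right"], root, rows, labels, max_error)
    before = error_rate(root, rows, labels)
    saved = dict(node)
    node.clear()
    node["label"] = saved["majority"]            # try the collapsed version
    if error_rate(root, rows, labels) > max(before, max_error):
        node.clear()
        node.update(saved)                       # pruning hurt: restore subtree


# Hypothetical fully grown tree over rows of (Age, IsActiveMember).
tree = {
    "feature": 0, "threshold": 35, "majority": 0,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 0, "majority": 1,
        "left": {"label": 1},
        "right": {
            "feature": 0, "threshold": 58, "majority": 0,
            "left": {"label": 0},
            "right": {"label": 1},               # branch fitted to a noisy sample
        },
    },
}
held_out_rows = [(30, 1), (33, 0), (50, 0), (62, 0), (52, 1), (61, 1), (64, 1)]
held_out_labels = [0, 0, 1, 1, 0, 0, 0]

error_before = error_rate(tree, held_out_rows, held_out_labels)
post_prune(tree, tree, held_out_rows, held_out_labels)
error_after = error_rate(tree, held_out_rows, held_out_labels)
print(error_before, error_after)
```

In this toy case only the deepest split is removed; the two upper splits are restored because collapsing them would raise the held-out error.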

Figure 3.2 Construction of decision trees

3.2.4 Dataset Overview

Dataset Summary:

Entries: 175,028.

Features: 25 columns.

Description of Features:

Surname: Encoded identifier for customers.

CreditScore:   Numerical   score   representing   the   creditworthiness   of   the customer.

Age: Customer's age.

Tenure: Number of years the customer has been with the bank.

Balance: Current account balance.

NumOfProducts: Number of bank products the customer uses.

HasCrCard: Indicates the possession of a bank-issued credit card (1 = Yes, 0 = No).

IsActiveMember: Active membership status (1 = Yes, 0 = No).

EstimatedSalary: Estimated annual salary of the customer.

Exited: Churn status (1 = Churned, 0 = Not churned).

Geographical and Demographic Data:

France, Germany, Spain: Binary indicators for the customer's country.

Female, Male: Binary gender indicators.

Engineered Features:

Features derived from existing data, such as Surname_tfidf_* (representing transformed surname features), and various ratio or interaction features such as Mem no Products, Cred_Bal_Sal, Bal_sal, and Tenure_Age.

Preparation for Analysis:

The dataset may require preprocessing steps such as scaling, normalization, or encoding prior to modeling. Decision trees will benefit from an examination of feature importance to refine the model and enhance predictive accuracy.

Dataset Source:

The dataset can be accessed and downloaded from:

https://www.kaggle.com/datasets/cybersimar08/binary-classification-of-bank-churn-synthetic-data

3.2.5 Common Decision Tree Algorithms

3.2.5.1 CART Algorithm

Several decision tree algorithms were in use before CART, among them the well-known ID3 and its improved version, C4.5, which were applied specifically to classification tasks. CART is distinctive in that it can be used not only for classification but also for regression problems. The CART decision tree chooses the splitting attribute using the Gini index. The purity of a dataset D can be measured by its Gini value:

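$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2} \tag{3.1}$$

where $p_k$ is the proportion of samples in $D$ that belong to class $k$, and $K$ is the number of classes.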

The smaller Gini(D) is, the higher the purity of the dataset. Using the same notation, the Gini index for attribute a is defined as:

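$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^{v}|}{|D|}\,\mathrm{Gini}(D^{v}) \tag{3.2}$$

where $D^{v}$ is the subset of $D$ in which attribute $a$ takes its $v$-th value, and $V$ is the number of values of $a$.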

The attribute selected from the candidate attribute set A is therefore the one that gives the minimum Gini index after splitting:

$$a_{*} = \arg\min_{a \in A} \mathrm{Gini\_index}(D, a) \tag{3.3}$$
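The Gini value, the Gini index of an attribute, and the minimum-Gini selection rule can be checked numerically with a short sketch. The function names and the six sample rows are hypothetical; the attribute names are borrowed from the dataset description, and the `exited` labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini value of a label set: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(rows, labels, attribute):
    """Weighted Gini value of the subsets produced by splitting on `attribute`."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(y)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())


def best_attribute(rows, labels, attributes):
    """Pick the candidate attribute with the smallest Gini index."""
    return min(attributes, key=lambda a: gini_index(rows, labels, a))


# Hypothetical customers; exited = 1 means churned.
rows = [
    {"IsActiveMember": 1, "HasCrCard": 1},
    {"IsActiveMember": 1, "HasCrCard": 0},
    {"IsActiveMember": 1, "HasCrCard": 1},
    {"IsActiveMember": 0, "HasCrCard": 0},
    {"IsActiveMember": 0, "HasCrCard": 1},
    {"IsActiveMember": 0, "HasCrCard": 0},
]
exited = [0, 0, 0, 1, 1, 0]

chosen = best_attribute(rows, exited, ["IsActiveMember", "HasCrCard"])
print(gini(exited), chosen)
```

Here splitting on `IsActiveMember` yields one pure subset, so its Gini index (2/9) is lower than that of `HasCrCard` (4/9) and it is selected.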

3.2.5.2 Algorithm properties:

1.   Dual purpose: It can be used for classification problems (building classification trees) or regression problems (building regression trees).

2.   Binary tree structure: CART always produces a binary tree, i.e., every internal node is divided into exactly two children, whereas ID3, C4.5, and some other algorithms may produce multiway trees.

3.   Pruning technique: CART uses pruning to reduce the risk of overfitting and so improve generalization. Pruning removes from a fully grown tree the nodes that have little influence on predictive performance.

3.2.5.3 Algorithmic Process:

1.   Segmentation criteria: For classification, the splitting criterion is Gini impurity; for regression, it is the least squares deviation.

2.   Recursive segmentation: Starting from the root node, the algorithm recursively splits the current node's dataset into two subsets, choosing the split point that minimizes the impurity (for classification) or the squared deviation (for regression) of the resulting subsets.

3.   Stop condition: Splitting stops when the samples at a node all belong to the same class or are too few to divide further in a meaningful way (see Section 3.2.2).

The other main benefits of CART are its flexibility (it can address different kinds of data and problems), its limited need for data pre-processing, and its high interpretability. It can also handle missing data and performs feature selection as a built-in part of tree construction.
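The least squares criterion used for regression splits can be sketched as a search over candidate thresholds on one numeric feature. This is an illustrative example only: the function names are hypothetical, candidate thresholds are taken as midpoints between consecutive distinct values (one common convention), and the tenure and balance figures are made up.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_regression_split(xs, ys):
    """Return (threshold, cost) minimizing SSE(left) + SSE(right)."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]          # samples with x <= threshold
        right = [y for _, y in pairs[i:]]         # samples with x > threshold
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical data: tenure in years, balance in thousands.
tenure = [1, 2, 3, 7, 8, 9]
balance = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]

threshold, cost = best_regression_split(tenure, balance)
print(threshold, cost)
```

A full regression tree would apply this search to every feature at every node and recurse on the two resulting subsets, in the same way as the classification case.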


