

Master's Thesis

Bank Customer Churn Analysis Based on Decision Trees

Course code and name:	F21RP-Research Methods and Project Planning
Type of assessment:	Individual
Coursework Title:	Bank Customer Churn Analysis Based on Decision Trees

Abstract

The rapid development of information networks is changing the traditional business model. In the financial sector in particular, customers expect more from banks' products and services, while competition among banks continues to intensify. In this setting, customer loyalty has become an important indicator of a bank's ability to build and retain customer relationships, and therefore of its ability to strengthen its competitive advantage.

Against this background, this research uses the bank's basic customer data and the needs of business marketing scenarios to formulate individual customer churn as a modeling problem. Feature engineering is combined with the bank's customer labeling system to derive the data characteristics used by a customer churn early warning model. A churn model for the bank's effective-value customers is then built with the decision tree algorithm, and its results are used to suggest improvements in bank products, marketing strategies, customer maintenance, and services.

Keywords: Decision Tree, Bank Customer Churn, Unbalanced Data, Hybrid Feature Selection.

1. Introduction

1.1 Purpose of The Study

Customers are the foundation of a bank and the fundamental resource for its survival and development. Over time, competition between banks for customer resources has tightened, and the banking industry has shifted its attention towards customer needs, focusing on satisfying those needs through better customer service. In this context, retaining bank customers and improving their loyalty and dependence is critical. With the development of modern technology, the operation of the financial industry has changed fundamentally. The customer relationship management system is an important tool that not only supports batch maintenance and operation of customers but also offers great possibilities for in-depth data analysis, mining of potential demand, precision marketing, and control of marketing processes.

1.2 Shortcomings of Existing Studies

Statistical analysis shows that individual customers have become the principal customer group for banks. Maintaining and expanding this group is therefore the top priority in the daily operation of commercial banks. As customers' consumption demand upgrades, their requirements for financial services rise and the supply and demand patterns in the financial market shift, so that a bank's customer base becomes more prone to change under a combination of factors. Customer attrition not only raises a bank's marketing expenses and opportunity costs but also damages its reputation [Wang, W.Q., Yao, R., Liu, C., et al., 2014]. Studies have shown that customer churn has a large impact on bank profits: reducing customer churn by 5% can increase profits by 30% to 85%. The cost of developing a new customer is five to seven times the cost of retaining an existing one, and the success rate of developing a new customer is only one-sixteenth of that of retaining one [Xiao, J., Liu, D.H., and He, C.Z., 2012].

1.3 Research Motivation

This research studies a customer churn early warning model, identifies the key factors that influence customer churn, and applies them in practice based on retail customer labeling data. Classification prediction is used to predict which customers will churn and to find the key factors behind churn. A model is then established to forecast the likelihood of churn, so that corresponding retention measures can be formulated before customers are lost. In this way banks can enhance their core competitiveness.

1.4 Project Research Objectives

The main objective of the research is to examine the bank's customer churn behavior in detail by analyzing large volumes of data, to predict churn with modern modeling techniques, and to derive an effective customer retention strategy that supports the bank's continued business development and profit growth. The specific objectives of the study are as follows:

O1. Identify key predictors of customer churn

Collect, analyze, and process the bank's customer data, including but not limited to demographic characteristics, account types, transaction behavior, credit history, and service interaction records. Descriptive statistics and exploratory data analysis will be used to find the variables most likely to influence a customer's decision to churn.

O2. Model Evaluation and Performance Optimization

Performance metrics include accuracy, recall, F1 score, and area under the ROC curve, among others. The study will also examine how model performance can be optimized using techniques such as pruning and ensemble learning methods (random forests and boosted trees).

O3. Construct and optimize a decision tree predictive model

Apply a decision tree algorithm to build the customer churn predictive model. The model identifies which customers are potential churners and shows how much predictive power each key factor carries. Model building involves selecting suitable parameters, using cross-validation to avoid overfitting, and checking the model's accuracy and robustness with training and test sets.

O4. Develop data-driven customer retention strategies

Propose strategies, based on the model insights, that could reduce customer churn. These may include improving the customer service experience, changing pricing strategies, delivering targeted and customized marketing campaigns, and optimizing the product portfolio. The recommendations will be tailored so that the needs and preferences of the different customer segments are met as fully as possible.

With these objectives, the study aims not only to improve customer retention for banks but also to provide a methodological reference for applying the decision tree model in a complex data environment. It can help banks better understand customers' service needs and compete more effectively in a highly competitive financial market.

1.5 Possible innovations and shortcomings

1.5.1 Possible innovations

1. The research project applies a hybrid feature selection strategy with strong mining capacity to the dataset. Experiments show that the method mines feature information and the intrinsic connections among features, which helps improve the model's predictive performance.

2. The research builds a combined model by joining models in a two-stage serial combination. Experimental results show that the combined model effectively improves predictive ability.

1.5.2 Possible shortcomings

1. Feature processing relies mainly on the original data, and features are selected without considering the practical significance of the variables. Some important features may therefore be missed.

2. The research project adopts only a two-stage serial combination when constructing the combined model and compares it with single models. Other forms of combination remain to be tried in later comparative analysis.

2. Background

2.1 Review of bank customer churn prediction methods

The study of customer churn prediction methods spans research that shows how organizations can mitigate customer turnover. Colgate et al. explored the churn behavior of tertiary students within Irish financial services, using questionnaires to analyze churn causes and patterns. They also examined how these factors align with financial policies and marketing strategies, emphasizing the need to tailor both to reduce churn [Colgate, M., Stewart, K., Kinsella, R., 1996]. Walsh et al. used structural equation modeling based on surveys of 462 customers of a German utility company. Their findings suggest that enhancing customer satisfaction through targeted marketing can significantly reduce churn. They also highlighted how corporate reputation and customer satisfaction affect churn, offering actionable insights for refining marketing strategies [Walsh, G., Dinnie, K., Wiedmann, K., 2006]. Sohn et al. developed a competitive risk model that incorporates customer characteristics, aimed at the mobile telecommunications sector in South Korea. Their study was prompted by the introduction of mobile number portability, and they proposed management guidelines to help companies handle customer retention under the new regime [Sohn, S.Y., Lee, J.K., 2008]. Chen et al. built a customer value model using logistics industry data, identifying key factors that lead to the loss of valuable customers. Their research provides strategic recommendations for customer management, aiming to enhance retention and prevent churn [Chen, K., Hu, Y.H., Hsieh, Y.C., 2015]. Together, these studies advance the understanding of customer churn and offer a foundation for companies to develop targeted strategies that enhance customer loyalty and retention.

Hwang et al. posited that customer churn can be assessed from the perspective of the customer's value to the bank and potential future revenue. This approach uses the economic contribution of customers to predict churn, advocating a value-based strategy for identifying high-risk customers [Hwang, H., Jung, T., Suh, E., 2004]. Lu et al. used logistic regression as a base learner in their churn prediction models and improved accuracy with boosting techniques. Their research targets early warning systems tailored to different bank customer groups and demonstrates the effectiveness of adaptive learning methods in improving predictive accuracy [Lu, N., Lin, H., Lu, J., et al., 2014]. Vafeiadis et al. examined several common machine learning algorithms for classifying potential churners but found no definitive best learner, owing to the complex interplay of factors such as data type and distribution. Their work underscores the difficulty of selecting the optimal machine learning approach when data characteristics are diverse [Vafeiadis, T., Diamantaras, K.I., Sarigiannidis, G., et al., 2015].

The existing literature primarily explores the use of integrated algorithms for predicting bank customer churn, with several studies incorporating advanced feature processing techniques like sampling and feature derivation. These techniques have proven effective at extracting valuable insights from the data, though the overall enhancement in model performance remains moderate. The studies suggest that while current methods are capable of identifying key indicators of churn, there remains a significant opportunity to explore combinatorial modeling approaches, which have been less utilized in this domain. This gap in research presents a potential area for further exploration to develop more robust and comprehensive predictive models.

2.2 Overview of modeling methods

Mozer et al. designed an early warning model for subscriber churn based on extensive US domestic subscriber data totaling nearly 47,000 entries. The data covered a wide range of variables, including consumption history, billing information, credit card data, application usage, and customer complaints. The researchers applied several predictive modeling techniques, such as logistic regression, decision trees, neural networks, and boosting algorithms. Insights from the model were used to tailor subscriber incentives, with the aim of improving retention rates and maximizing operator profits. The model's effectiveness was verified in real business environments as well as in theory [Mozer, M.C., Wolniewicz, R., Grimes, D.B., et al., 2000]. Mohammed et al. compared decision tree and logistic regression models for predicting customer churn, using empirical data from a mobile operator's business records. Decision trees outperformed logistic regression in this setting, which suggests that they may handle the complexities of churn prediction in the telecom sector better [Mohammed, H., Ali, T., Tariq, E., et al., 2015]. Vafeiadis et al. used a public telecom customer dataset to compare five algorithms using the Monte Carlo method, enhanced with boosting techniques. The algorithms tested were artificial neural networks, support vector machines, decision trees, naive Bayes, and logistic regression. Their evaluation concluded that support vector machines gave the best results among the tested models, particularly in handling the dataset's variability and complexity [Vafeiadis, T., Diamantaras, K.I., Sarigiannidis, G., et al., 2015].

Huang et al. analyzed real customer data from an Irish telecom company. They compared several algorithms, including logistic regression, linear classifiers, naive Bayes, decision trees, multi-layer perceptron neural networks, support vector machines, and genetic algorithms, to determine which was most effective at predicting customer churn, giving a broad view of the relative strengths of these techniques on real-world data [Huang, B., Kechadi, M.T., Buckley, B., 2012]. Lu et al. studied churn prediction on real data from a telecommunications company. Using a boosting algorithm to weight and segment customer groups, they developed an early warning model for churn and compared it with a standard logistic regression model. The boosted model gave better predictive accuracy, illustrating the effectiveness of ensemble methods in churn prediction [Lu, N., Lin, H., Lu, J., et al., 2014]. Bi et al. addressed the challenges of predicting customer churn in big data environments. They proposed a clustering method named SDSCM (semantic-driven subtractive clustering method), which builds on the subtractive clustering method (SCM) and axiomatic fuzzy sets (AFS). The method was applied to churn in China Telecom's large customer data, providing new strategies for managing customer churn at scale [Bi, W., Cai, M., Liu, M., et al., 2016]. Amin et al. explored the use of rough set theory to predict customer churn in the telecom industry. They tested several algorithms based on rough set theory, including exhaustive, genetic, and covering algorithms. Their work confirmed rough set theory as a useful tool for understanding and predicting customer churn and for mining complex customer data [Amin, A., Anwar, S., Adnan, A., et al., 2017].

Although a large body of academic literature exists in this area, comprehensive modeling of bank customer churn has not been pursued. This paper takes the data characteristics discussed above into account and uses a hybrid feature selection strategy with comprehensive sampling, which is then evaluated for its practical value in predicting bank customer churn.

2.3 Review of customer churn influencing factors and retention strategies

The existing literature reviews the factors related to customer churn largely quantitatively and from the perspective of customer characteristics, helping banks develop targeted programs and explore marketing countermeasures for key customer segments. Oskarsdottir et al. studied customer retention in saturated industries such as telecom and banking, taking into account changes in customer value over the lifecycle and the heterogeneity of Customer Lifetime Value (CLV). Their Expected Maximum Profit (EMP) measure, which reflects the characteristics of the customer base, offers new insight into customer retention [Oskarsdottir, M., Baesens, B., Vanthienen, J., 2018]. Kumar and Luthra used the analytic hierarchy process (AHP) and the Decision Making Trial and Evaluation Laboratory (DEMATEL) method to propose a customer retention strategy for the automotive industry [Kumar, A., Luthra, S., 2017]. Bahri-Ammari and Bilgihan studied the telecommunication industry, where satisfaction and loyalty are regarded as the key predictors of customer retention and of a satisfactory relationship between operator and customer. They found that distributive justice, together with interactional justice, is linked to higher customer satisfaction and loyalty [Bahri-Ammari, N., Bilgihan, A., 2017].

Diaz proposed that service quality has an asymmetric effect, both positive and negative, on customer satisfaction, using logistic regression and GSEM estimation methods [Diaz, G.R., 2017]. The service quality assessments had a positive impact on the categories of customer satisfaction, and other service attributes, such as customer care, tariff and plan information, and billing clarity, also had a statistically significant impact. Similar asymmetric results were found for other economic, socio-economic, and geographic determinants of customer decision making.

2.4 Shortcomings of Existing Research

A review of the current literature shows that most research on customer churn focuses on using big data mining technology to build customer churn prediction models. These studies observe customer behavior through the data, examine the problems of massive raw data, class imbalance, and high feature dimensionality, and develop a variety of intelligent algorithms and techniques to address them, which then serve as the basis for verifying the accuracy of early warning churn models and comparing good and bad models. In most studies, the causes of customer churn are examined separately from churn prediction and customer retention. In addition, although a few models predict customer churn with high accuracy, most prediction models are derived only from the data generated by interactions between a customer and the enterprise. Such blind data mining considers only the customer's own factors and ignores the external environment, so churn factors arising from the bank itself and from the external environment are easily concealed. A customer churn early warning should not simply predict whether a customer will be lost; it should also indicate which customers are worth retaining and suggest strategies for retaining them.

Meanwhile, research on imbalanced data is relatively mature in medicine, finance, and the internet, but research specific to banks is only somewhat comprehensive. Discussion of banks' customer strategies generally focuses on maintaining and marketing to key customers, which biases the choice of research objects. Existing studies also briefly assess profitability, competitiveness, cross-regional development, market positioning, risk management, and capital replenishment, but say very little about early warning of bank customer loss risk. This paper therefore analyzes the factors influencing customer churn comprehensively and in detail, and describes the current situation of bank development. After an appropriate early warning model for customer churn risk is selected, the original data is processed to provide a sound predictive method for a bank's customer relationship management, and corresponding retention measures are then formulated based on the research results.

3. Project Requirements and Methodology

3.1 Project requirements/sequence/classification

Table 3.1 Project Requirements

3.2 Decision Tree Models

A decision tree is an analytical model with a tree-like structure. The root node corresponds to the whole data collection; each branch node corresponds to a classification question, namely a test on a single attribute that splits the data into two or more subsets; and each leaf node is a data partition with a class label. The path from the root node to a leaf node forms a prediction of the class of the corresponding object. Generating a decision tree is a top-down, divide-and-conquer process that can be applied to classification or rule learning problems.

The internal nodes of a decision tree represent attributes or sets of attributes, and the leaf nodes represent the categories or conclusions of the learned partition. The attributes at internal nodes are also referred to as test attributes or split attributes. The branches that descend from an internal node correspond to the values of its attribute. To classify a new instance whose category is unknown, start at the root node, test the attribute at that node, follow the branch matching the instance's attribute value, and repeat this comparison at each node along the way until a leaf node is reached; the class at that leaf is the prediction for the instance.
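The root-to-leaf traversal described above can be sketched in a few lines of Python. The tree below is a hypothetical toy example, not a model fitted in this study; the attribute names are borrowed from the dataset columns purely for illustration, and the dictionary layout (`attribute`, `branches`) is an assumption of this sketch.

```python
# Hypothetical toy tree: each internal node tests one attribute, and each
# branch is keyed by a value of that attribute. Leaves are plain class labels.
toy_tree = {
    "attribute": "IsActiveMember",
    "branches": {
        1: "stay",
        0: {
            "attribute": "NumOfProducts",
            "branches": {1: "churn", 2: "stay"},
        },
    },
}


def classify(tree, instance):
    """Walk from the root to a leaf, following the branch that matches
    the instance's value for the attribute tested at each node."""
    node = tree
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(toy_tree, {"IsActiveMember": 0, "NumOfProducts": 1}))  # churn
```

An attribute value with no matching branch raises a `KeyError` here; a practical implementation would need a fallback such as the majority class at that node.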

3.2.1 Decision Tree Construction Ideas

1. Start with a training set of processed data and an empty tree, and treat the root as the current node to be examined and split.

2. If all training samples at the current node belong to the same class, create a leaf node labeled with that class and finish.

3. Otherwise, evaluate every possible partition of the node's samples using the chosen splitting measure.

4. Select the best partition as the test for the current node and create a child node for each outcome of that test.

5. Label each edge between the parent and a child with the corresponding outcome of the test, and distribute the training data to the child nodes according to that outcome.

6. Treat each child node as the new current node and repeat steps 2 to 5 until no node can be divided further.
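The construction steps above can be illustrated with a minimal recursive builder. This is a sketch under stated assumptions rather than the implementation used in this study: it assumes numeric features, binary threshold splits of the form `x <= t`, and Gini impurity as the splitting measure. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Step 3: try every (feature, threshold) pair and return the one with
    the lowest weighted Gini, or None if no split improves on the parent."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels):
    # Step 2: a pure node becomes a leaf labeled with its class.
    if len(set(labels)) == 1:
        return {"label": labels[0]}
    split = best_split(rows, labels)  # steps 3 and 4
    if split is None:  # nothing left to divide: use the majority class
        return {"label": Counter(labels).most_common(1)[0][0]}
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    # Steps 5 and 6: send the data down each edge and recurse on the children.
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }


# Toy data: (Age, NumOfProducts) per customer, label 1 = churned.
rows = [(25, 1), (40, 2), (52, 1), (61, 2)]
labels = [0, 0, 1, 1]
tree = build_tree(rows, labels)
print(tree)
```

On this toy data the builder splits once, on the first feature at threshold 40, and both children are pure leaves.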

3.2.2 Decision Tree Growth

The root node represents the whole dataset, and the tree grows by recursively dividing the data into subsets at child nodes. Each division is chosen by testing all available attributes so that the resulting subsets have higher "purity". Growth stops at a branch when its samples are so pure that further division no longer makes sense, or when the subsets are too small to divide further in a meaningful way. The central question in growing a decision tree is how to establish the division criterion. The diagram below shows the growth process of a decision tree.

Figure 3.1 Decision Tree Pruning

3.2.3 Pruning of decision tree

A decision tree grown by the process above often overfits because of noise and outliers, so its predictions on new samples are frequently worse than on the training data. Decision tree pruning is used to avoid this problem. The common pruning methods are pre-pruning and post-pruning.

Pre-pruning sets limits in advance, such as a maximum depth for the tree or a minimum number of samples per node, so that no node ends up with too few samples. These limits prevent the tree from growing to its full depth, although choosing an appropriate depth or sample threshold requires iterative trials to tune the parameters, and relies on the user having a good knowledge of the distribution of the variables.

Post-pruning lets the decision tree grow fully and then, based on certain criteria, cuts off the branches that contribute little to classification. It is a process of alternating testing and pruning: the prediction accuracy of the current tree on the output variable is computed after each pruning step. A maximum tolerable error rate is specified in advance; pruning ends as soon as that rate is reached and continues otherwise. Post-pruning uses the test sample set to decide where to stop: pruning should stop once the error rate on the test set starts to rise sharply.
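As an illustration of post-pruning, the sketch below uses reduced-error pruning, one common variant that replaces a subtree with a leaf whenever doing so does not increase the error on held-out data. This is an assumption made for the example; the text above does not fix a specific pruning rule. The hand-built tree, its `majority` field (the majority training class at each internal node), and the validation data are all hypothetical.

```python
import copy


def predict(node, x):
    """Follow threshold tests from the root until a leaf is reached."""
    while "label" not in node:
        side = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["label"]


def errors(node, X, y):
    """Number of misclassified samples."""
    return sum(predict(node, xi) != yi for xi, yi in zip(X, y))


def prune(node, X_val, y_val):
    """Bottom-up reduced-error pruning (modifies the tree in place).
    A subtree is replaced by a majority-class leaf when the leaf makes
    no more validation errors than the subtree does."""
    if "label" in node:
        return node
    f, t = node["feature"], node["threshold"]
    left = [(xi, yi) for xi, yi in zip(X_val, y_val) if xi[f] <= t]
    right = [(xi, yi) for xi, yi in zip(X_val, y_val) if xi[f] > t]
    node["left"] = prune(node["left"], [x for x, _ in left], [y for _, y in left])
    node["right"] = prune(node["right"], [x for x, _ in right], [y for _, y in right])
    leaf = {"label": node["majority"]}
    if errors(leaf, X_val, y_val) <= errors(node, X_val, y_val):
        return leaf
    return node


# Hypothetical fully grown tree whose right subtree has overfitted to noise.
full_tree = {
    "feature": 0, "threshold": 5, "majority": 0,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 2, "majority": 1,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}
X_val = [(1, 1), (2, 3), (7, 1), (8, 3), (9, 4)]
y_val = [0, 0, 1, 1, 1]

pruned = prune(copy.deepcopy(full_tree), X_val, y_val)
print(errors(full_tree, X_val, y_val), errors(pruned, X_val, y_val))  # 2 0
```

Here the right subtree collapses into a single leaf, while the root split is kept because removing it would raise the validation error. Note that a subtree reached by no validation samples is also collapsed under this rule, which is a simplification.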

Figure 3.2 Construction of decision trees

3.2.4 Dataset Overview

Dataset Summary:

Entries: 175,028.

Features: 25 columns.

Description of Features:

Surname: Encoded identifier for customers.

CreditScore: Numerical score representing the creditworthiness of the customer.

Age: Customer's age.

Tenure: Number of years the customer has been with the bank.

Balance: Current account balance.

NumOfProducts: Number of bank products the customer uses.

HasCrCard: Indicates the possession of a bank-issued credit card (1 = Yes, 0 = No).

IsActiveMember: Active membership status (1 = Yes, 0 = No).

EstimatedSalary: Estimated annual salary of the customer.

Exited: Churn status (1 = Churned, 0 = Not churned).

Geographical and Demographic Data:

France, Germany, Spain: Binary indicators for the customer's country.

Female, Male: Binary gender indicators.

Engineered Features:

Features derived from existing data such as Surname_tfidf_* (representing transformed surname features), various ratios or interactive features like Mem no Products, Cred_Bal_Sal, Bal_sal, Tenure_Age.
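To make the idea of ratio and interaction features concrete, the sketch below derives three of the named features from the base columns. The formulas are assumptions inferred from the feature names only; the dataset description above does not state how these columns were actually computed, so they should be checked against the dataset before use. The function name and the sample row are hypothetical.

```python
def add_ratio_features(row):
    """Return a copy of a customer record with three derived features.

    ASSUMED definitions (guessed from the column names, not confirmed):
      Bal_sal      = Balance / EstimatedSalary
      Tenure_Age   = Tenure / Age
      Cred_Bal_Sal = CreditScore * Balance / EstimatedSalary
    EstimatedSalary and Age are assumed to be strictly positive.
    """
    out = dict(row)
    out["Bal_sal"] = row["Balance"] / row["EstimatedSalary"]
    out["Tenure_Age"] = row["Tenure"] / row["Age"]
    out["Cred_Bal_Sal"] = row["CreditScore"] * row["Balance"] / row["EstimatedSalary"]
    return out


# Hypothetical customer record using the base columns listed above.
customer = {
    "CreditScore": 600,
    "Age": 40,
    "Tenure": 4,
    "Balance": 50000.0,
    "EstimatedSalary": 100000.0,
}
features = add_ratio_features(customer)
print(features["Bal_sal"], features["Tenure_Age"], features["Cred_Bal_Sal"])
```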

Preparation for Analysis:

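The dataset may require preprocessing steps such as encoding prior to modeling. Scaling and normalization are not needed by decision trees themselves, because splits depend only on the ordering of feature values, but they may matter for other models used in comparison. Examining feature importance helps refine the decision tree model and enhance its predictive accuracy.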

Dataset Source:

The dataset can be accessed and downloaded from:

https://www.kaggle.com/datasets/cybersimar08/binary-classification-of-bank-churn-synthetic-data

3.2.5 Common Decision Tree Algorithms

3.2.5.1 CART Algorithm

Other well-known decision tree algorithms include ID3 and its improved version, C4.5. Those algorithms are applied specifically to classification tasks. CART is distinctive in that it can be used not only for classification but also for regression problems. In a CART classification tree, the splitting attribute is chosen using the Gini index. Let $p_k$ denote the proportion of samples in dataset $D$ that belong to class $k$, for $k = 1, \dots, |\mathcal{Y}|$. The purity of $D$ can then be measured by the Gini value:

$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k \, p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2 \tag{3.1}$$

The smaller $\mathrm{Gini}(D)$ is, the higher the purity of the dataset $D$. Suppose attribute $a$ takes $V$ possible values, and let $D^v$ be the subset of $D$ whose samples take the $v$-th value of $a$. The Gini index of attribute $a$ is defined as:

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \, \mathrm{Gini}(D^v) \tag{3.2}$$

Among the candidate attributes $A$, the attribute selected for splitting is the one that minimizes the Gini index after the split:

$$a_* = \underset{a \in A}{\arg\min} \; \mathrm{Gini\_index}(D, a) \tag{3.3}$$
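A small numeric sketch of the Gini value, the Gini index of an attribute, and the minimum-Gini attribute choice described in this section. The four-customer dataset is hypothetical and chosen so the results can be checked by hand; the function names are assumptions of this sketch.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini value of a dataset: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(rows, labels, attribute):
    """Gini index of an attribute: size-weighted Gini value of the subsets
    obtained by grouping the samples on that attribute's values."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attribute]].append(y)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())


def select_attribute(rows, labels, attributes):
    """Pick the candidate attribute with the smallest Gini index."""
    return min(attributes, key=lambda a: gini_index(rows, labels, a))


# Hypothetical toy data: label 1 = churned.
rows = [
    {"IsActiveMember": 1, "HasCrCard": 1},
    {"IsActiveMember": 1, "HasCrCard": 0},
    {"IsActiveMember": 0, "HasCrCard": 1},
    {"IsActiveMember": 0, "HasCrCard": 0},
]
labels = [0, 0, 1, 1]

print(gini(labels))                                # 0.5
print(gini_index(rows, labels, "IsActiveMember"))  # 0.0
print(gini_index(rows, labels, "HasCrCard"))       # 0.5
print(select_attribute(rows, labels, ["IsActiveMember", "HasCrCard"]))  # IsActiveMember
```

In this toy data `IsActiveMember` separates the two classes perfectly, so its Gini index is zero and it is selected, while `HasCrCard` leaves both subsets as mixed as the original data.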

3.2.5.2 Algorithm properties:

1. Dual purpose: It can be used for classification problems (building classification trees) or regression problems (building regression trees).

2. Binary tree structure: CART always produces a binary tree, i.e., every internal node is divided into exactly two children, whereas ID3, C4.5, and some other algorithms may produce multiway trees.

3. Pruning technique: CART uses post-pruning (cost-complexity pruning), which reduces the danger of overfitting and hence improves model generalization. Pruning removes from a fully grown tree the nodes that contribute little to the predictive performance of the model.

3.2.5.3 Algorithmic Process:

1. Segmentation criteria: For classification, the splitting criterion is Gini impurity; for regression, it is the least squares deviation.

2. Recursive segmentation: Starting from the root node, the algorithm recursively divides the current node's dataset into two subsets, choosing the split point that minimizes the impurity (for classification) or the squared error (for regression) after the split.

3. Stop condition: Splitting stops when a node is pure or when its samples are too few to divide further in a meaningful way. Beyond this process, the main benefits of CART are its flexibility across different data and problems, its limited need for data pre-processing, and its high interpretability. It can also handle missing data and performs feature selection as a built-in part of tree construction.
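For the regression case, the least squares criterion can be sketched as a search over candidate thresholds on one numeric feature. This is a minimal illustration under assumptions: candidate thresholds are taken as midpoints between consecutive distinct feature values (a common convention, not stated in the text), each side is predicted by its mean, and the data and function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_regression_split(xs, ys):
    """Return (threshold, cost) for the binary split x <= threshold that
    minimizes the total squared error of the two resulting subsets."""
    best_t, best_cost = None, float("inf")
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # assumed convention: midpoint between neighbors
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Hypothetical data with an obvious jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(best_regression_split(xs, ys))  # (6.5, 0.0)
```

A full regression tree would apply this search to every feature at every node and recurse on the two subsets, in the same way as the classification case with Gini impurity.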