Churn Prediction Using Machine Learning Algorithms Assignment
Chapter 1: Introduction
1.1 Introduction
This research project analyzes customer churn and uses machine learning to predict it. Customer churn prediction is a crucial activity for any business that aims to retain its customers. It refers to identifying customers who are likely to leave a company so that proactive measures can be taken to prevent them from doing so. Its importance lies in helping businesses reduce customer acquisition costs and increase revenue. By identifying potential churners, companies can take steps such as offering personalized incentives or improving their products and services to retain these customers. Churn prediction also helps companies improve the overall customer experience by providing insight into what drives customers away, and this information can be used to adjust the company's operations or marketing strategies. Machine learning algorithms are widely used for customer churn prediction, and in this study they are implemented in the Python programming language.
The first step of the research is to define the aim and objectives. Chapter 1 presents the research background together with the research questions and the research significance.
1.2 Research background
Customer churn prediction is a crucial aspect of any business that relies on customer retention. It refers to the ability to forecast which customers are likely to leave a company and take their business elsewhere. This prediction helps businesses take proactive measures to retain their customers, thereby reducing the risk of losing revenue. Customer churn prediction also allows businesses to allocate resources more effectively towards retaining high-value customers (Çelik and Osmanoglu, 2019). Companies can therefore focus on providing personalized experiences for their most valuable customers while also identifying those who are at risk of leaving.
(Source: Lalwani et al. 2022)
The figure above shows how different experiences affect a customer's interaction with a brand, a decline that customer churn prediction aims to analyze. Churn prediction focuses on developing analytics from customer data and trends. The figure shows how the curve changes after one bad experience and how it changes after several bad experiences. Analyzing a large customer dataset with manual techniques is next to impossible (Agrawal et al. 2018). In that scenario, machine learning techniques help by predicting churn across the customer database, and the Python programming language supports their development by providing frameworks and libraries. In conclusion, the ability to predict customer churn is essential for any business looking to maintain long-term success. By taking proactive measures based on this prediction, companies can improve customer satisfaction and loyalty while also increasing revenue and profitability.
1.3 Aim and objective
Aim
The main aim of the research study is to understand how machine learning algorithms can be implemented for customer churn prediction, and to predict customer intentions in support of the development of the business organization.
Objectives
- To understand the main factors that contribute to customer churn in the dataset
- To identify the best algorithm for customer churn prediction and to compare it with different baseline models
- To compare the accuracy of the new algorithm with traditional machine learning algorithms
- To improve customer retention and reduce churn rates by developing new techniques and processes
1.4 Research question
- What are the main factors that contribute to customer churn in this dataset?
- Which algorithm performs the best, and how does its performance compare to a simple baseline model?
- How does this approach compare to traditional machine learning algorithms in terms of accuracy and interpretability?
- How can businesses use churn prediction models to improve customer retention and reduce churn rates?
1.5 Research rationale
What is the issue
The present issue for business organizations is building and maintaining the customer base. Companies focus on developing a large customer base in order to generate more revenue. Among these customers, churn generally arises from dissatisfaction with customer service and from requirements that remain unfulfilled.
Why is it an issue
Customer dissatisfaction generally results from a decline in customer service. A decline in product quality is a further factor that shrinks the customer base.
What is the issue now
At present, analyzing the customer data manually is practically impossible, and no suitable techniques are in place for developing the analytics.
What the research study sheds light upon
The research study has been developed to analyze customer churn data by implementing machine learning techniques. For the analysis, exploratory data analysis (EDA) has been carried out and a machine learning model has been developed to predict churn. The study also focuses on choosing the best algorithm for the prediction, so that the organization can mitigate these problems.
1.6 Research significance
The research study focuses on developing customer churn prediction and on improving the prediction results by implementing machine learning techniques. Machine learning techniques have changed the way businesses approach customer churn prediction. With increasing competition in the market, it has become essential for companies to retain their customers and prevent them from switching to competitors. Machine learning algorithms can analyze vast amounts of data and identify patterns that are not visible to humans, making them an effective tool for predicting customer churn (Rahman and Kumar, 2020). One significant advantage of these techniques is that models can be retrained on new data and so improve their accuracy over time. As more customer data becomes available, the algorithms can refine their predictions and provide more accurate insights into which customers are likely to churn. Machine learning techniques also allow businesses to personalize their approach to each customer by analyzing individual behavior patterns. This enables companies to offer tailored solutions that address the specific issues each customer faces, thereby improving retention rates. In conclusion, machine learning techniques have significant implications for businesses looking to predict customer churn accurately (Gaur and Dubey, 2018). By using these tools effectively, companies can gain a competitive edge by retaining customers and improving overall profitability.
1.7 Summary
The research study analyzes secondary data, and the machine learning models are developed in the Python programming language. Python's extensive collection of libraries and frameworks is used for both model development and exploratory data analysis (EDA). The research paper is divided into several sections. The first chapter presents the aim and objectives together with the research significance and the research background. The research questions are formulated from the aim and objectives of the study. In summary, the study focuses on finding a better process for customer churn prediction. The research rationale explains how the study develops new techniques and processes for churn prediction. The study also focuses on understanding the main factors that contribute to customer churn in the dataset. For the dataset analysis, the appropriate analysis steps were followed and the data was carefully preprocessed.
Chapter 2: Literature review
2.1 Introduction
The research project addresses customer churn analysis using machine learning techniques. These techniques comprise exploratory data analysis and machine learning models: random forest and logistic regression models were built alongside a neural network model. The research data was collected from secondary sources, following a secondary method of data collection. Secondary resources such as journals, articles and research papers were assessed to collect the theories and hypotheses for the study. The literature review presents these theories and hypotheses, and the limitations of the research are set out in the research gap section. The relevant theories of customer churn prediction and the significance of customer churn are presented together with the importance of machine learning algorithms.
2.1.1 Introduction to the customer churn analysis
Customer churn analysis is a crucial aspect of any business that seeks to retain its customers and maintain profitability. It involves identifying customers who are likely to stop using a product or service and taking proactive measures to prevent them from leaving. The process begins with the collection of data on customer behavior, including purchasing habits, frequency of use, and level of engagement with the business. This data is then analyzed to identify patterns and trends that may indicate a likelihood of churn. Once potential churners have been identified, businesses can take steps to address their concerns and improve their experience (Cenggoro et al. 2021). This may involve offering incentives or discounts, providing additional support or resources, or addressing specific pain points that are causing dissatisfaction. Ultimately, effective customer churn analysis requires ongoing monitoring and adjustment as customer needs and preferences evolve over time. By staying attuned to these changes and proactively addressing potential issues, businesses can build stronger relationships with their customers and increase the chances of long-term success.
2.1.2 Customer churn analysis for analyzing the loopholes and retaining the customers
Customer churn analysis is a vital tool for businesses to identify the reasons why customers are leaving and to take steps to retain them. It involves analyzing customer behavior, preferences, and feedback to identify loopholes in the business processes that may be causing dissatisfaction among customers. According to Cenggoro et al. (2021), by analyzing customer churn data, businesses can identify patterns and trends in customer behavior that indicate where improvements can be made. For example, if a large number of customers are leaving due to poor customer service or product quality issues, the business can take steps to address these issues and improve the overall customer experience. Furthermore, by identifying the reasons why customers are leaving, businesses can develop targeted retention strategies that focus on these specific issues. This could involve offering incentives or discounts to loyal customers or implementing new policies or procedures that address common complaints. Customer churn analysis is therefore an essential tool for businesses looking to improve their customer retention rates. By identifying the loopholes in their processes and taking steps to address them, businesses can retain more customers and ultimately increase profitability over time.
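As a minimal illustration of the pattern-finding described above, the following sketch computes the churn rate per complaint segment. The records and segment names are hypothetical and are not taken from the study's dataset; a real analysis would read them from the customer database.

```python
from collections import defaultdict

# Hypothetical customer records: (complaint segment, churned flag).
# These values are illustrative only and do not come from the study's data.
customers = [
    ("poor_service", True),
    ("poor_service", True),
    ("poor_service", False),
    ("quality_issue", True),
    ("quality_issue", False),
    ("no_complaint", False),
    ("no_complaint", False),
    ("no_complaint", False),
    ("no_complaint", True),
    ("no_complaint", False),
]


def churn_rate_by_segment(records):
    """Return {segment: share of customers in that segment who churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for segment, has_churned in records:
        totals[segment] += 1
        if has_churned:
            churned[segment] += 1
    return {segment: churned[segment] / totals[segment] for segment in totals}


rates = churn_rate_by_segment(customers)
# List the segments with the highest churn rate first.
for segment, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{segment}: {rate:.0%} churned")
```

In this toy data the poor-service segment has the highest churn rate, which is the kind of signal a business could use to decide where a retention effort should start.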
2.1.3 Significance of the customer churn analysis
Customer churn analysis is a crucial tool for businesses to understand why customers leave and to prevent it from happening in the future. It helps companies identify patterns and trends that lead to customer attrition, such as poor customer service, product quality issues, or pricing problems. By conducting a churn analysis, businesses can gain valuable insights into their customers' behavior and preferences. They can use this information to improve their products and services, enhance the customer experience, and retain loyal customers. Moreover, it allows them to allocate resources more effectively by focusing on retaining high-value customers rather than acquiring new ones (Dias et al. 2020). The significance of customer churn analysis is considerable in today's competitive business environment. Companies that fail to understand why their customers are leaving risk losing market share and revenue. On the other hand, those that invest in analyzing churn data can make informed decisions that drive growth and profitability.
Moreover, customer churn analysis is an essential tool for any business looking to improve its bottom line. By understanding why customers leave and taking steps to prevent it from happening again, companies can build stronger relationships with their customers and achieve long-term success.
The figure above shows the process of customer churn prediction, which machine learning supports by providing the predictive model. In the first step, patterns are recognised from the history of churned customers. Predictive models are then prepared from the recognised patterns. The predictive model in turn separates satisfied customers from dissatisfied ones among the existing customers (Karvana et al. 2019). In the last step, the business organization develops a re-engagement process in its marketing and improves its customer services. At-risk customers are identified by the churn prediction, and the organization then develops a mitigation plan to retain its customer base.
2.2 Importance of machine learning in customer churn analysis
2.2.1 Classification and boosting algorithms for the customer churn analysis
Classification algorithms and boosting algorithms are two powerful tools for analyzing customer churn in businesses. Customer churn refers to the rate at which customers stop doing business with a company, and it is a critical metric for any organization that wants to maintain its profitability and growth. Classification algorithms are used to assign customers to different categories based on their behavior, demographics, or other factors. These algorithms can help businesses identify which customers are most likely to churn so that they can take proactive steps to retain them (Kavitha et al. 2020). Boosting algorithms, on the other hand, are used to improve the accuracy of classification models by combining multiple weak models into a stronger one. This technique is particularly useful when dealing with complex data sets where a single classification model may not be effective. By using these two types of algorithms together, businesses can gain valuable insights into their customer base and develop targeted strategies for reducing churn rates. Ultimately, this can lead to increased customer loyalty, improved revenue streams, and long-term success for the organization as a whole.
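The idea of combining weak models into a stronger one can be sketched with scikit-learn. The example below is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, with label 1 standing in for "churned"), and the choice of decision stumps and 100 boosting rounds is arbitrary rather than taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a churn table (label 1 = churned); not the study's data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A single weak learner: a decision stump that makes exactly one split.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)

# Boosting: 100 stumps added one after another, each fitted to the errors
# left by the stumps before it.
boosted = GradientBoostingClassifier(
    n_estimators=100, max_depth=1, random_state=0
).fit(X_train, y_train)

stump_acc = stump.score(X_test, y_test)
boosted_acc = boosted.score(X_test, y_test)
print(f"single stump accuracy:   {stump_acc:.3f}")
print(f"boosted stumps accuracy: {boosted_acc:.3f}")
```

Comparing the two printed accuracies on held-out data shows how much the ensemble gains over a single weak classifier for this particular synthetic dataset; the size of the gap will differ on real churn data.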
2.2.2 Advantages of machine learning in customer churn analysis
Machine learning techniques have become widely used in customer churn analysis because of their ability to analyze large volumes of data quickly and accurately. Machine learning algorithms can analyze vast amounts of data, including customer interactions, historical purchase data, demographic information, and more, to predict which customers are likely to churn. This helps businesses proactively identify customers who may be at risk of leaving and take appropriate actions to retain them. The algorithms process data much faster than manual analysis, allowing businesses to quickly identify patterns and trends that may be indicative of customer churn (Stucki, 2019). This enables businesses to respond promptly and take proactive measures to prevent churn. Machine learning algorithms can also generate insights and recommendations from the data with little human intervention. This saves time and effort for businesses, which no longer have to analyze data or interpret results manually and can instead focus on acting on the recommendations provided by the models.
Machine learning models can scale to handle large datasets, making them suitable for businesses with a large customer base or high transaction volumes. As the volume of data grows, the models can continue to analyze it and identify patterns, providing accurate predictions of customer churn. According to Stucki (2019), machine learning techniques offer a wide range of algorithms and approaches that can be tailored to the specific needs and characteristics of different businesses and industries. This flexibility allows businesses to customize their churn analysis models to their unique requirements and obtain more accurate predictions. Machine learning models can also be deployed in real time, allowing businesses to continuously monitor customer behavior and quickly detect changes that may indicate potential churn (He et al. 2020). This enables businesses to take immediate action to prevent churn, such as sending targeted offers or personalized communications to retain at-risk customers. While implementing machine learning models may require an upfront investment in technology and expertise, the long-term benefits can outweigh the costs.
The figure above shows the advantages of machine learning in different scenarios and sectors. By accurately predicting and preventing customer churn, businesses can save on the cost of acquiring new customers to replace those lost to churn, resulting in improved profitability. In summary, machine learning has numerous advantages in customer churn analysis, including predictive accuracy, faster analysis, automated insights, scalability, flexibility, real-time monitoring, and cost-effectiveness (Jain et al. 2020). These benefits can help businesses proactively retain customers and reduce churn, leading to increased customer satisfaction and long-term business success.
2.3 Development of the machine learning algorithm by Python programming language
Python provides a powerful and versatile ecosystem for developing machine learning algorithms, with rich libraries, tools, and community support. By following the proper steps, an effective machine learning algorithm can be developed in Python for customer churn analysis or any other machine learning problem. A vast collection of libraries and frameworks has been developed for the language. Libraries such as NumPy, pandas and Matplotlib are used for dataset operations and data visualization. Developing a machine learning algorithm in Python typically involves the following steps. The first step is to define the problem: clearly state the problem that the machine learning algorithm is going to solve. The second step is to gather and preprocess the data needed to train and evaluate the model (Vo et al. 2021). This may involve tasks such as data cleaning, data integration, feature engineering, and splitting the data for training and testing purposes. The third step is to choose the machine learning algorithm for prediction.
Python provides a rich ecosystem of machine learning libraries such as scikit-learn, TensorFlow, and PyTorch that offer a wide range of algorithms for classification, regression, clustering, and more. In the next step, the data is transformed into a format suitable for the chosen algorithm (Pamina et al. 2020). This may involve feature scaling, encoding categorical variables, and handling missing values. The prepared data is then used to train the model. This typically involves feeding the data into the algorithm, setting hyperparameters, and iteratively optimizing the model, using techniques such as cross-validation to evaluate its performance (Amuda and Adeyemo, 2019). The performance of the trained model is assessed using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC-ROC), depending on the problem type. This shows how well the model is performing and whether it meets the desired performance thresholds. The model is then fine-tuned by adjusting hyperparameters, feature selection, or model architecture to optimize its performance. This may involve experimenting with different hyperparameter values or trying different algorithms to find the best model for the specific problem.
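To make the evaluation metrics named above concrete, the following sketch computes accuracy, precision, recall and F1-score directly from their definitions. The two label lists are hypothetical (1 means "churned"); in practice `y_pred` would come from a trained model.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, where 1 means 'churned'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Precision: of the customers flagged as churners, how many really churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the customers who really churned, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions for ten customers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.80, precision: 0.75, recall: 0.75, f1: 0.75
```

Churn datasets are often imbalanced, with far fewer churners than non-churners, so precision and recall usually say more about a model's usefulness than accuracy alone.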
Once the model is optimized, its performance is validated on an unseen dataset to ensure that it generalizes. This may involve using a hold-out dataset or performing cross-validation on multiple folds of data. Interpreting the results of the trained model is important for gaining insights into the underlying patterns and making informed decisions. This may involve visualizing the model's predictions, analyzing feature importances, or interpreting model output. Once the model is ready, it is deployed to a production environment to start making predictions on new, unseen data (Leung et al. 2021). This may involve integrating the trained model into a web application, an API, or any other suitable deployment method. Following the above steps helps to develop an effective machine learning algorithm in Python for customer churn analysis or any other machine learning problem.
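The workflow described in this section (preprocessing, training, cross-validation and hold-out validation) can be sketched with pandas and scikit-learn as follows. This is a minimal sketch under stated assumptions, not the study's implementation: the table is synthetic, the column names (`tenure_months`, `monthly_charge`, `contract`) are hypothetical, and logistic regression is used only as a placeholder model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a churn table; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n).astype(float),
    "monthly_charge": rng.normal(65, 20, size=n),
    "contract": rng.choice(["monthly", "one_year", "two_year"], size=n),
})
# In this toy data, short-tenure customers churn more often.
churn_prob = 1 / (1 + np.exp(0.08 * (data["tenure_months"] - 24)))
data["churn"] = (rng.random(n) < churn_prob).astype(int)
data.loc[::25, "monthly_charge"] = np.nan  # simulate missing values

# Split into training data and a hold-out set that the model never sees.
X = data.drop(columns="churn")
y = data["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing: impute and scale numeric columns, one-hot encode categories.
numeric = ["tenure_months", "monthly_charge"]
categorical = ["contract"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation on the training data, then a final hold-out check.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_test, y_test)

print(f"mean cross-validated AUC: {cv_scores.mean():.3f}")
print(f"hold-out accuracy:        {holdout_accuracy:.3f}")
```

Wrapping the preprocessing and the classifier in one `Pipeline` means the imputer, scaler and encoder are refitted inside each cross-validation fold, so no information from the validation fold leaks into training.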
2.4 Machine learning algorithms for the customer churn prediction
2.4.1 Logistic regression algorithm for the customer churn prediction
Logistic regression is a statistical method used to analyze the relationship between a binary dependent variable and one or more independent variables. It is widely used in predictive modeling, particularly for customer churn prediction. Customer churn refers to the loss of customers due to various reasons such as dissatisfaction with the product or service, competition, or other factors. The logistic regression algorithm for churn prediction involves building a model that predicts whether a customer is likely to leave or stay with the company based on historical data. According to Leung et al. (2021), the algorithm uses independent variables such as demographics, usage patterns, and transaction history to predict the probability of churn. The model is trained on historical data and then applied to new data to predict future outcomes. Its accuracy can be improved by fine-tuning the parameters and selecting relevant features. Overall, logistic regression is an effective tool for predicting customer churn and can help companies take proactive measures to retain their customers (Sudharsan and Ganesh, 2022). By identifying customers who are at risk of leaving, companies can take targeted actions such as offering discounts or improving the quality of their product or service to retain them.
The figure above presents the steps of customer churn prediction, with logistic regression used in the prediction process. Telecom data is taken for the analysis, and the problem in the dataset is identified in the first step. In step 2, data cleaning is performed: null values and other abnormalities are removed. In step 3, the machine learning model is selected and the dataset is split for model fitting; after the model is built, it is evaluated or tested to analyze the prediction results (Kumar and Kumar, 2019). The last step of model building is the deployment of the machine learning model. Between preprocessing and model development, exploratory data analysis is also performed to visualize the dataset and identify problems.
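How a fitted logistic regression turns customer attributes into a churn probability can be shown in a few lines. In the sketch below the intercept, the coefficients and the feature names (`support_calls`, `tenure_years`) are hypothetical values chosen for illustration; a real model would estimate them from historical data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical intercept and coefficients, chosen for illustration only
# (they are not fitted to any dataset).
INTERCEPT = -1.0
COEFFICIENTS = {"support_calls": 0.6, "tenure_years": -0.5}


def churn_probability(customer):
    """Probability of churn = sigmoid(intercept + sum of coefficient * value)."""
    score = INTERCEPT + sum(
        COEFFICIENTS[name] * customer[name] for name in COEFFICIENTS
    )
    return sigmoid(score)


new_customer = {"support_calls": 5, "tenure_years": 1}
loyal_customer = {"support_calls": 0, "tenure_years": 6}

p_new = churn_probability(new_customer)      # score = -1.0 + 3.0 - 0.5 = 1.5
p_loyal = churn_probability(loyal_customer)  # score = -1.0 + 0.0 - 3.0 = -4.0
print(f"new customer:   {p_new:.3f}")    # 0.818
print(f"loyal customer: {p_loyal:.3f}")  # 0.018
```

The sign of each coefficient gives the direction of the effect: with these assumed values, more support calls raise the predicted churn probability and longer tenure lowers it, which is part of why logistic regression is regarded as an interpretable baseline.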
2.4.2 Random forest algorithm in the customer churn prediction
Random forest is a popular machine learning technique that has been widely used in various fields, including customer churn prediction. Customer churn refers to the situation where customers stop using a company's products or services. Predicting churn is essential for businesses because it helps them retain customers and improve revenue. The random forest algorithm works by creating multiple decision trees and combining their results to make predictions. Each decision tree is trained on a random subset of the data, which helps to reduce overfitting and improve accuracy. The algorithm also provides feature importance measures that identify the most important factors contributing to customer churn (Halibas et al. 2019). In churn prediction, random forest can be used to analyze customer behavior data such as purchase history, usage patterns, and demographics. By analyzing this data, the algorithm can identify patterns and predict which customers are likely to leave. Overall, random forest is an effective tool for predicting customer churn. It allows businesses to take proactive measures to retain their customers and improve their bottom line.
The figure above shows the process of customer churn analysis with a random forest model. In the first step, e-commerce retail data is taken and the research variables are initialized. The data is then standardized and prepared for model development. During data preparation, the data is divided into training and test sets, and the random forest model is built from them (Momin et al. 2020). Finally, customer churn is predicted and the model accuracy is calculated, together with other evaluation parameters.
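The random forest process described above (standardize, split, fit, score, inspect feature importances) can be sketched with scikit-learn. The data is synthetic and the feature names are hypothetical placeholders for e-commerce variables; the hyperparameters are illustrative, not the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names for a synthetic e-commerce churn table.
feature_names = [
    "recency", "frequency", "monetary", "tenure", "returns", "support_calls",
]
X, y = make_classification(
    n_samples=800, n_features=6, n_informative=3, n_redundant=1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Standardization mirrors the step described in the text; tree-based models
# do not strictly need it because they split on thresholds, not distances.
scaler = StandardScaler().fit(X_train)

# 200 trees, each trained on a bootstrap sample of the training data.
forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(scaler.transform(X_train), y_train)

accuracy = forest.score(scaler.transform(X_test), y_test)
importances = dict(zip(feature_names, forest.feature_importances_))

print(f"test accuracy: {accuracy:.3f}")
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

The importances sum to 1 and rank the features by how much they reduce impurity across the trees. Because the names here are attached to synthetic columns, the ranking only demonstrates the mechanism and says nothing about real churn drivers.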
2.4.3 Sequential neural network
A Sequential Neural Network (SNN) is a type of artificial neural network that is designed to process sequential data. It is a powerful tool for modeling time-series data such as speech, audio, and video signals. SNNs are composed of multiple layers of interconnected nodes, or neurons, that process input data and generate output predictions. Labhsetwar (2020) states that the key feature of SNNs is their ability to remember past inputs and use them to make predictions about future outputs. This makes them particularly useful for tasks such as speech recognition, where the context of previous words can greatly influence the interpretation of current words. SNNs have been used successfully in a wide range of applications, including natural language processing, image recognition, and financial forecasting. They have also been used in neuroscience research to model the behavior of biological neurons (Ullah et al. 2019). Sequential neural networks are therefore an important tool for processing sequential data, with many practical applications in both industry and academia. As technology continues to advance, SNNs are likely to become even more powerful and versatile tools for solving complex problems.
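The ability to remember past inputs described above is commonly implemented with a recurrent layer, in which a hidden state is carried from one time step to the next. The NumPy sketch below shows only that mechanism. It is an assumption that this matches the kind of network the study has in mind, and the weights are random and untrained, so the outputs carry no meaning beyond illustrating the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 3, 4  # illustrative sizes, not taken from the study

# Random, untrained weights: this demonstrates the mechanism only.
W_in = rng.normal(scale=0.5, size=(n_hidden, n_features))  # input -> hidden
W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))   # hidden -> hidden
b = np.zeros(n_hidden)


def run_recurrent_layer(sequence):
    """Process a sequence step by step, carrying a hidden state forward."""
    h = np.zeros(n_hidden)  # the "memory" starts empty
    states = []
    for x_t in sequence:
        # The new state depends on the current input AND the previous state.
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
        states.append(h)
    return np.stack(states)


# A hypothetical sequence of 5 time steps (e.g. 5 months of usage features).
sequence = rng.normal(size=(5, n_features))
states = run_recurrent_layer(sequence)
print(states.shape)  # (5, 4): one hidden state per time step
```

Because each hidden state feeds into the next, the final state summarizes the whole sequence; a churn classifier built this way would pass that final state to an output layer.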
2.4.5 XGBoost
XGBoost is a popular machine learning algorithm that has gained significant attention in recent years due to its effectiveness in solving complex problems. It is an ensemble learning method that combines multiple decision trees to make more accurate predictions. XGBoost uses gradient boosting, which involves iteratively adding decision trees to the model while minimizing the loss function. One of its key advantages is the ability to handle large datasets with high dimensionality. It can also handle missing values and outliers, making it a robust algorithm for real-world applications. Additionally, XGBoost provides feature importance scores, which can help identify the most important features for prediction. However, like any machine learning algorithm, XGBoost has its limitations. It can be prone to overfitting if not properly tuned and may require significant computational resources for training on large datasets (El Zarif et al. 2020). Despite these limitations, XGBoost remains a powerful tool for predictive modeling and has been used successfully in various domains such as finance, healthcare, and natural language processing.
Its popularity is expected to continue growing as more researchers and practitioners recognize its potential for solving challenging problems. XGBoost has also been widely used for customer churn prediction, where it analyzes customer behavior and identifies patterns that indicate potential churn. By analyzing factors such as purchase history, demographics, and usage patterns, XGBoost can predict which customers are likely to leave in the near future.
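The core loop of gradient boosting, iteratively adding trees that reduce the loss, can be written out by hand. The sketch below is not XGBoost itself: it omits XGBoost's regularization, second-order gradient information and missing-value handling, and it uses squared error on a synthetic regression target for simplicity (for churn classification, the residual would be replaced by the gradient of the log loss). The learning rate, tree depth and number of rounds are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data; not related to the study's dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1  # illustrative value
n_rounds = 50        # illustrative value

# Start from a constant prediction (the mean of the target).
prediction = np.full_like(y, y.mean())
trees = []
losses = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a shrunken copy of the new tree's output to the ensemble prediction.
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)
    losses.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {losses[0]:.4f}")
print(f"training MSE after {n_rounds} rounds: {losses[-1]:.4f}")
```

The training error falls with every round, which is also why boosted models can overfit if the number of rounds, tree depth and learning rate are not tuned against held-out data.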
2.5 Research Gap
The research was performed to analyze customer churn and to identify the mistakes of the business organization. The organization then created a mitigation plan based on the analysis in order to retain its customer base in the near future. Several limitations were also identified while carrying out the study, since machine learning has a number of disadvantages. One of its major drawbacks is the lack of transparency in decision-making. Machine learning models are often considered black boxes, making it difficult to understand how they arrived at a particular decision. This can be problematic in critical applications such as healthcare and finance, where decisions need to be explainable. Another disadvantage is the potential for bias in machine learning algorithms. If the training data used to develop these algorithms is biased, the model will also be biased, leading to unfair decisions and perpetuating societal inequalities.
Lastly, machine learning requires large amounts of high-quality data for training. This can be costly and time-consuming, especially for small businesses or organizations with limited resources. Model overfitting also needs attention; it is generally seen with uncleaned and unprocessed raw data or with large volumes of noisy data. These problems can be mitigated by proper data preprocessing. A further gap arose during data collection: premium sources could not be accessed because no budget was available for them. In future work, funds need to be allocated to access all the required journals.
2.6 Summary
The literature review in this research study has presented the theories and the hypotheses for the development of machine learning for customer churn prediction. Customer churn prediction is generally performed to understand the loopholes in the organization. Further, it can be summarized that random forest, logistic regression, and neural networks can efficiently analyze customer churn. It can also be summarized that one of the main growth pillars for products with a subscription-based business model is customer retention. The SaaS industry is very competitive, since clients have a wide range of options for providers, even within a single product category. A consumer may stop buying from a provider after one or more negative experiences, and if large numbers of dissatisfied consumers leave quickly, there will be significant financial losses as well as reputational harm. Finally, customer churn analysis involves analyzing customer behavior, preferences, and feedback to identify loopholes in the business processes that may be causing dissatisfaction among customers.
Chapter 3: Research methodology
3.1 Introduction
This chapter presents the methodology that has been implemented in the research study. The research approach, the research design and the data collection methods are defined in this section. Defining a proper research methodology is important for carrying out the research study in a structured manner. The research study has been developed to implement machine learning algorithms for customer churn prediction. Customer churn prediction is generally performed to understand the problems in the business organization that are related to its sales and marketing section. An experimental research design has been used, along with a deductive research approach and a secondary data collection method. Secondary data such as journals, articles, and previous research studies have been accessed. For the dataset, an online platform has been used to collect the customer churn data.
3.2 Research methods
The research project has been prepared by defining the research methodology in the first step. The research methodology needs to be defined because it structures the research study efficiently and helps the study proceed smoothly and without errors. During data collection, the secondary method has been followed for collecting the theories and the hypotheses. Secondary data such as journals, research articles and previous research papers have been assessed for the development of the theories and hypotheses. Along with that, the experimental research design has been followed, as the research study depends entirely on scientific experiments comparing different machine learning algorithms. The deductive research approach has been followed to narrow down the material: it tested the theories and the hypotheses and discarded the unnecessary ones.
The whole machine learning development project has been managed with the waterfall methodology. The steps of the waterfall methodology are requirement gathering, requirement analysis, design, development, testing and maintenance. All of these steps have been followed in developing the prediction model. All the key considerations, such as the ethical, legal and social considerations, have been followed in the research study. For the machine learning environment, the Python programming language has been used, and its libraries and frameworks help in developing the algorithm efficiently.
3.3 Dataset
For the data analysis, the customer churn data has been analyzed by the machine learning algorithm. The customer churn data has been gathered from an online resource: it has been downloaded from Kaggle. Customer churn data is a critical metric for any business that relies on recurring revenue. It refers to the rate at which customers stop doing business with a company over a given period. This data is essential because it helps businesses understand how well they are retaining their customers and identify areas where they can improve (Santharam, and Krishnan, 2018). Understanding customer churn data allows businesses to predict future revenue streams more accurately. By forecasting how many customers are likely to leave in the future, companies can adjust their marketing strategies and retention efforts accordingly. In conclusion, analyzing customer churn data is crucial for any business that wants to maintain its competitive edge in today's market. It provides valuable insights into customer behavior and helps companies make informed decisions about how best to retain their existing customers while attracting new ones.
3.4 Data preprocessing
Data preprocessing is an essential step in machine learning development. It involves transforming raw data into a format that machine learning algorithms can work with. The process includes cleaning, normalization, feature selection, and dimensionality reduction. Cleaning involves removing irrelevant or redundant data from the dataset (Jain et al. 2020). This ensures that the algorithm is not misled by irrelevant information. Normalization involves scaling the data to a common range so that features measured on large scales do not dominate features measured on small scales. Data preprocessing helps to improve accuracy and efficiency while reducing computational complexity. Therefore, developers must pay close attention to this process when developing machine learning models.
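The normalization step described above can be illustrated with min-max scaling, which maps each value x to (x - min) / (max - min). The sketch below is a minimal plain-Python version written for illustration only; the function name and the sample values are hypothetical and are not taken from the study's dataset.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map every value to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical monthly charges on an arbitrary scale.
monthly_charges = [20.0, 50.0, 80.0, 110.0]
scaled = min_max_scale(monthly_charges)
```

After scaling, the smallest value becomes 0 and the largest becomes 1, so this feature sits on the same range as any other feature scaled the same way.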
3.4.1 Data cleaning
Data cleaning is a critical step of data preprocessing that ensures the accuracy and reliability of the data used for decision-making purposes. Data cleaning involves several steps, such as removing duplicate records, correcting typos, standardizing formats, and dealing with missing values. The importance of data cleaning cannot be overstated, as it helps to eliminate errors that can lead to incorrect conclusions or decisions. For instance, if a dataset contains duplicate records or inconsistent values, it may lead to overestimation or underestimation of certain variables (Sabbeh, 2018). This can have serious consequences when making important decisions based on the data. Moreover, data cleaning also helps to improve the quality of the data by ensuring that it is consistent and complete. This makes it easier for analysts to extract meaningful insights from the data.
The above picture has represented the step-by-step process of data cleaning. The first step is removing unwanted observations, such as duplicate and redundant values, which are deleted from the dataset so that the data is represented in a proper format. The next step is missing data handling, in which null values are removed or replaced according to the requirements of the dataset. Structural error fixing and outlier management are also part of data cleaning. Structural errors are errors that occur during data measurement or transfer, or in similar circumstances (Ahn et al. 2020). They include typos in feature names, incorrect class labels, classes of the same category with inconsistent capitalization, and other structural problems. Even after these issues have been cleaned, there is a chance that the model will not produce the desired outcomes. This can result from outliers, that is, values that are noticeably dissimilar to all other observations.
Outliers are extreme cases. In general, outliers are not removed unless there is a good reason to do so. Removing them can sometimes enhance performance, but not always. However, in some circumstances, suspicious values that are highly improbable should be identified and eliminated from the dataset.
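The cleaning steps discussed in this section can be sketched with pandas as follows. The column names and values are hypothetical and stand in for the real dataset. The 1.5 x IQR rule used to flag outliers is one common convention, assumed here for illustration, and not necessarily the rule used in this study.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, inconsistent capitalization,
# missing values and one implausibly large value.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "plan": ["Basic", "premium", "premium", "PREMIUM", "basic", None],
    "monthly_minutes": [120.0, 300.0, 300.0, np.nan, 150.0, 9000.0],
})

# 1. Remove unwanted observations (exact duplicate rows).
clean = raw.drop_duplicates().copy()

# 2. Fix structural errors: same category, inconsistent capitalization.
clean["plan"] = clean["plan"].str.lower()

# 3. Handle missing data: label missing categories, impute numbers with the median.
clean["plan"] = clean["plan"].fillna("unknown")
clean["monthly_minutes"] = clean["monthly_minutes"].fillna(
    clean["monthly_minutes"].median()
)

# 4. Flag (rather than delete) outliers using the 1.5 * IQR rule.
q1, q3 = clean["monthly_minutes"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["is_outlier"] = (
    (clean["monthly_minutes"] < q1 - 1.5 * iqr)
    | (clean["monthly_minutes"] > q3 + 1.5 * iqr)
)
```

Flagging outliers in a separate column, instead of dropping them immediately, matches the advice above that outliers should only be removed when there is a good reason.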
3.4.2 Feature extraction
Feature selection is another crucial step in data preprocessing. It involves identifying and selecting the relevant features that will have a significant impact on the outcome of the algorithm. Dimensionality reduction is also essential, as it reduces the number of features in large datasets, making them easier for algorithms to process. Supervised feature extraction uses labeled data to identify the features that are most relevant to the problem at hand. This approach requires prior knowledge of the problem domain and can be time-consuming, but it often results in more accurate models (Kim and Lee, 2022). Unsupervised feature extraction, on the other hand, identifies patterns in unlabeled data without prior knowledge of the problem domain. This approach is faster but may not produce models as accurate as supervised feature extraction. Overall, feature extraction is an essential step in machine learning and computer vision that helps improve model accuracy by selecting relevant features from raw data. It requires careful consideration of both supervised and unsupervised approaches to ensure optimal results.
The above figure has represented the process of feature extraction. In the first step, the raw data are collected from the different resources and merged into one dataset. The next step is cleaning and transformation, which extracts the features from the dataset. The extracted features are then used for building the machine learning model, and insights are gained from the model. In the feature creation process, the most useful variables for predictive modeling are found (Singh et al. 2018). In the transformation process, the predictor variables are adjusted to improve accuracy and performance. By ensuring that all the variables are on the same scale and that the model is flexible enough to accept input from a range of data, the transformation makes the model simpler to comprehend. It also improves the model's correctness and, to prevent computational errors, makes sure that all of the features are within the permitted range.
3.5 Classification and neural network for the customer churn prediction
Classification and neural networks are two powerful techniques that can be used for customer churn prediction. Customer churn is a critical issue for businesses, as it can lead to a loss of revenue and market share. Therefore, predicting customer churn is essential to retain customers and maintain business growth. Classification involves assigning data to different categories or classes based on specific criteria. In the context of customer churn prediction, classification algorithms can be used to identify customers who are likely to leave the company. This information can then be used to develop targeted retention strategies (Vo et al. 2018). Neural networks, on the other hand, are machine learning models loosely inspired by the structure and function of the human brain. They are particularly useful for analyzing complex datasets and identifying patterns that may not be apparent through traditional statistical methods. By combining classification and neural network techniques, businesses can develop accurate models for predicting customer churn. These models can help companies take proactive measures to retain customers before they decide to leave.
3.5.1 Logistic regression
Logistic regression is a statistical method used to analyze and model the relationship between a dependent variable and one or more independent variables. It is commonly used in fields such as healthcare, finance, and marketing to predict the probability of an event occurring based on certain factors. The logistic regression model works by transforming the linear equation into a sigmoid curve that ranges from 0 to 1. This allows for the prediction of binary outcomes, such as whether a patient will develop a disease or not. One of the key advantages of logistic regression is its ability to handle both categorical and continuous variables (Kozak et al. 2021). It also provides interpretable coefficients that can be used to understand the impact of each independent variable on the dependent variable. However, it is important to note that logistic regression assumes linearity between the independent variables and log odds of the dependent variable. Additionally, it may not perform well when there are high levels of multicollinearity among independent variables.
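The way logistic regression turns a linear combination of the inputs into a probability between 0 and 1 can be shown with a few lines of plain Python. The coefficients below are hypothetical values chosen for illustration, not coefficients fitted in this study, and the two input features (number of complaints and tenure in months) are likewise assumptions.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of the linear score (log odds)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for [number of complaints, tenure in months].
weights = [0.8, -0.05]
bias = -1.0

p_new = churn_probability([2, 3], weights, bias)     # log odds = -1 + 1.6 - 0.15 = 0.45
p_loyal = churn_probability([0, 60], weights, bias)  # log odds = -1 + 0 - 3.0 = -4.0
```

Because the linear score is the log odds, each coefficient is interpretable: with these assumed weights, one additional complaint multiplies the odds of churn by exp(0.8), while each extra month of tenure multiplies them by exp(-0.05).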
3.5.2 Random Forest classification
Random forest classification is a machine learning algorithm that has gained popularity in recent years due to its high accuracy and its ability to handle large datasets. The algorithm works by creating multiple decision trees, each trained on a random subset of the data, and then combining their predictions to make a final classification. The strength of random forest classification lies in its ability to reduce overfitting, which occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new data (Spiteri and Azzopardi, 2018). Because the individual trees are trained on different subsets of the data, their errors partly cancel out when the predictions are combined, which improves generalization. Additionally, random forest classification is relatively tolerant of noisy data, and some implementations can handle missing values with little preprocessing. It also provides feature importance measures that can help identify the most important variables for making predictions.
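The mechanism described above, many trees trained on random subsets whose predictions are combined by voting, can be sketched by hand. This is an illustrative simplification on synthetic data, not the model fitted in this study. It uses scikit-learn decision trees as building blocks, and the helper names `fit_bagged_trees` and `predict_majority` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for i in range(n_trees):
        idx = rng.integers(0, n, size=n)
        # max_features="sqrt" also randomizes the features tried at each split.
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_majority(trees, X):
    """Combine 0/1 predictions of all trees by majority vote."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic stand-in data: four features and a 0/1 label.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

trees = fit_bagged_trees(X[:200], y[:200])
predictions = predict_majority(trees, X[200:])
holdout_accuracy = float((predictions == y[200:]).mean())
```

An odd number of trees is used so that a majority vote on binary labels can never end in a tie.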
3.5.3 Neural network
Neural networks are a type of artificial intelligence that have been developed to mimic the way the human brain works. They are made up of interconnected nodes, or neurons, which work together to process information and make decisions. The basic idea behind neural networks is that they can learn from experience. By analyzing large amounts of data, they can identify patterns and make predictions about future events. This makes them particularly useful in fields such as finance, marketing, and healthcare (Pamina et al. 2019). One of the key advantages of neural networks is their ability to handle complex data sets. Unlike traditional statistical methods, which require a lot of manual input and analysis, neural networks can automatically identify relevant features and relationships within the data. However, there are also some limitations to neural networks. They can be computationally expensive to train and may require large amounts of data in order to achieve high accuracy. Additionally, it can be difficult to interpret the results produced by a neural network, making it hard to understand how it arrived at its conclusions.
3.6 Overall workflow
The overall workflow represents the step-by-step process of machine learning development. In the first step of the data analysis, the important libraries and frameworks are imported; they support the development process and make model fitting efficient. After that, the dataset is imported, and data preprocessing is performed so that the data is represented properly. Exploratory data analysis (EDA) is performed after the data preprocessing and helps to identify the trends and insights in the data. Next, the dataset is split into training and test data. Model fitting and testing are done in the last step.
3.7 Summary
The research methodologies and the process that have been followed throughout the research study have been presented in this chapter. Further, it can be summarized that the step-by-step machine learning process has been developed, in which classification and neural networks are used for customer churn prediction. It can also be summarized that data preprocessing has been performed so that the data fits properly into the machine learning model.
Chapter 4: Result and analysis
4.1 Introduction
The research project has been prepared to explore data analysis through the implementation of machine learning. The customer churn data has been analyzed and predicted by implementing classification techniques. The proper steps of data analysis have been followed throughout the development. The first step is data preprocessing, in which the libraries and the dataset have been imported. After that, the dataset has been checked for null values and some extra features have been added to it. The next step is the exploratory data analysis, for which bar graphs and a histogram have been created to represent the data. After the data visualization, the machine learning model has been created on properly split data, where the split has been made by random sampling and the features have been scaled with the MinMaxScaler. For analyzing the results and the predictions, a table has been created to represent the accuracy of the machine learning model.
4.2 Experimental setup for the development
The research project has been created to develop customer churn analytics for a telecom company. The customer data has been analyzed by implementing classification and boosting techniques. The churned customers and the normal customers have been analyzed in order to identify the loopholes in the business. The experimental analysis has been performed on a personal computer system using Python libraries and frameworks. The table below presents the requirements and the configuration of the system.
Specification of the computer system | Configuration of the system
Operating system | Windows 11
Random access memory | 8 GB
System type | 64-bit Windows operating system
Base processor | x64
Processor | Intel Core i5-10510U processor
Generation | 10th generation
Clock speed | 4.80 GHz
Python | 3.7
Python libraries | Pandas, NumPy, seaborn, matplotlib, sklearn, XGBoost
Table 1: System configuration for the machine learning development
(Source: Self-Created)
4.3 Development interface and development environment
The machine learning algorithm has been developed in Jupyter Notebook, whose easy-to-use interface and open-source platform have helped to develop the analytics efficiently. One of the key advantages of Jupyter Notebook is its ability to combine code with rich text elements such as equations, images, videos, and interactive widgets. This makes it easy to create reports that are both informative and visually appealing. Additionally, Jupyter Notebook allows users to collaborate on projects by sharing notebooks via GitHub or other cloud-based services. Another benefit of Jupyter Notebook is its support for data visualization libraries such as Matplotlib and Seaborn (Scriney et al. 2020). These libraries make it easy to create charts and graphs that help users understand complex datasets. Overall, Jupyter Notebook is a powerful tool for data analysis that makes it easy to work with data in an interactive environment. Its flexibility and ease of use have made it a popular choice among researchers, scientists, analysts, and developers alike.
The Python programming language has also helped the development through its frameworks and libraries. Python, a high-level programming language, has become one of the most popular languages for machine learning development due to its simplicity and flexibility. With its vast array of libraries and tools, Python has made it easier to build powerful machine learning models. Python's scikit-learn library provides a wide range of algorithms for classification, regression, clustering, and dimensionality reduction (Gregory, 2018). Additionally, TensorFlow and Keras are two popular deep learning frameworks in Python that allow developers to build complex neural networks with ease.
The machine learning model has been developed following some crucial steps. In the first step, the raw data are collected from the different resources and merged into one dataset. The next step is cleaning and transformation, which extracts the features from the dataset. The extracted features are then used for building the machine learning model, and insights are gained from the model. In the feature creation process, the most useful variables for predictive modeling are found (Singh et al. 2018). In the transformation process, the predictor variables are adjusted to improve accuracy and performance. By ensuring that all the variables are on the same scale and that the model is flexible enough to accept input from a range of data, the transformation makes the model simpler to comprehend. It also improves the model's correctness and, to prevent computational errors, makes sure that all of the features are within the permitted range.
4.4 Data preprocessing
Data preprocessing is an essential step in machine learning that involves cleaning and transforming raw data into a format suitable for analysis. The goal of data preprocessing is to ensure that the data is accurate, complete, and consistent. This process involves several steps, including data cleaning, feature selection, normalization, and transformation. Data cleaning involves identifying and correcting errors in the dataset. This can include removing duplicates, filling in missing values, and correcting inconsistent data. Feature selection involves selecting the features from the dataset that are most important for the analysis (Ahn et al. 2020). Normalization involves scaling the data so that it falls within a specific range. This can help to improve the accuracy of machine learning algorithms by preventing features with large numeric ranges from dominating the others. Transformation involves converting categorical variables into numerical values or creating new features based on existing ones. This can help to improve the accuracy of machine learning models by providing more information about the relationships between different variables.
The above figure shows the libraries that have been imported; the algorithm has been developed efficiently with the help of the Python libraries and frameworks. First, the basic libraries NumPy and pandas have been imported for the dataset operations. NumPy has been imported for numerical operations, and the dataset has been loaded with the help of pandas. NumPy has been imported as np and pandas as pd (Wang et al. 2019). After the basic libraries, the data visualization libraries matplotlib, seaborn and plotly have been imported. The seaborn library has been imported as sns, matplotlib as plt and plotly.io as pio. After the data visualization libraries, the frameworks for creating the machine learning models, sklearn and xgboost, have been imported. From sklearn, the model selection, preprocessing and metrics modules and the linear and ensemble model modules have been imported in the program.
After importing the libraries, the customer dataset has also been imported, and the above figure has shown this process. The resulting dataframe has been named Data. To import the dataset, the path of the data file on the computer system has been specified. The pd.read_csv command has been used for importing the dataset, after which the .head command has been used to display the head of the dataset, that is, its first five rows.
The information of the dataset has been analyzed with the .info command, which displays the details in a table. The data types of the different columns have been presented; they are object, int64 and float64. The non-null count of every column has been shown, along with the memory usage, which totals 399.9 KB. The features of the data have thus been identified with the .info command.
The null values of the dataset have been checked after extracting the information of the dataset. The .isnull().sum() command has been used for counting the null values per column. No null values have been found in the dataset.
In the data preprocessing steps, feature creation has also been performed after the data cleaning. In this process, categorical features and binary features have been added for proper data visualization. First, the tenure group feature has been added to the dataframe. After that, binary features have been added for the visualization of the churn. Along with the binary features, the total services feature has been added for a proper representation of the churn data.
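The loading and inspection commands named above (pd.read_csv, .head, .info and .isnull().sum()) fit together as in the following sketch. Because the file path of the original dataset is not given in the text, a tiny in-memory CSV with hypothetical columns is used in place of the real file. Only the dataframe name `Data` is taken from the description above.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the columns and rows are hypothetical.
csv_text = """customer_id,tenure_months,monthly_charge,churn
1,3,29.5,1
2,40,70.0,0
3,12,55.25,0
"""

# With a real file this would be pd.read_csv(<path to the downloaded CSV>).
Data = pd.read_csv(io.StringIO(csv_text))

first_rows = Data.head()          # first five rows (here only three exist)
Data.info()                       # prints dtypes, non-null counts and memory usage
null_counts = Data.isnull().sum() # number of missing values per column
```

A total of zero in `null_counts` corresponds to the finding reported above that the dataset contains no null values.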
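The three kinds of created features mentioned above (a tenure group, binary flags and a total services count) could be built as in the sketch below. The column names, the bin edges and the "Yes"/"No" encoding are assumptions made for illustration, since the text does not list the actual columns of the dataset.

```python
import pandas as pd

# Hypothetical customer records.
Data = pd.DataFrame({
    "tenure_months": [2, 14, 30, 61],
    "phone_service": ["Yes", "Yes", "No", "Yes"],
    "internet_service": ["No", "Yes", "Yes", "Yes"],
    "churn": ["Yes", "No", "No", "No"],
})

# Categorical feature: bucket tenure into groups (intervals are right-inclusive).
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
Data["tenure_group"] = pd.cut(Data["tenure_months"], bins=bins, labels=labels)

# Binary features: 1 for "Yes", 0 otherwise.
service_cols = ["phone_service", "internet_service"]
for col in service_cols + ["churn"]:
    Data[col + "_flag"] = (Data[col] == "Yes").astype(int)

# Aggregate feature: number of services each customer subscribes to.
Data["total_services"] = Data[[c + "_flag" for c in service_cols]].sum(axis=1)
```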
4.5 Exploratory data analysis
Exploratory data analysis (EDA) is a crucial step in the machine learning process. It helps to understand the data and to identify patterns and relationships between variables. EDA is essential for building accurate and reliable models. In machine learning, data is the foundation of any model, and without a proper analysis of the data it is very difficult to build an effective model. EDA helps to identify missing values, outliers, and anomalies in the data that can affect the performance of a model. It also helps to select relevant features for training a model. EDA provides insights into the distribution of the data and characteristics such as skewness, kurtosis, and variance (Xiahou and Harada, 2022). This information helps to choose appropriate modeling algorithms and preprocessing techniques such as normalization or scaling. Moreover, EDA visualizes data using graphs and charts, which makes it easier to communicate findings to stakeholders or clients.
In the data preprocessing step, the categorical features and the binary features have been added for proper visualization of the churn data. In the first step of the visualization, the categorical columns have been defined, and a dataframe of the unique categories has been defined for the bar plot creation. After that, a function has been created for the bar plot, which takes the column name and the data as arguments. Inside the plot function, the figure size has been set to (8,4). After the figure size, the column to plot and the colors have been passed to sns.countplot. The x label shows the column name and the y label shows the count, and the plot title has been created from the column name.
After defining the x and y labels, the count values have been added to the bars of the chart. The plt.show command has been used to display the plots. The step-by-step plotting has been performed through these functions using seaborn and matplotlib, where the seaborn library has been imported as sns and matplotlib as plt.
The above bar graph has been created to represent the frequency of use, from which the valuable customers can be recognized. Analyzing the bar graph, it has been seen that the valuable customers fall in the 0 to 50 range and in the 100 to 150 range.
The above histogram has been created in order to analyze the seconds of use. The histogram shows for how many seconds the users use the services of the organization. To create the histogram, the seconds of use column has been used, and a plot title has been added to make the visualization easier to understand. The plot shows that the highest count has been seen for 0 to 2500 seconds of use, while the lowest count has been seen between 12500 and 17500.
4.6 Classification machine learning algorithm for churn prediction
Customer churn prediction is a crucial task for businesses that want to retain their customers and increase revenue. Machine learning has emerged as a powerful tool for predicting customer churn by analyzing customer behavior and identifying patterns that indicate the likelihood of churn. Machine learning algorithms can analyze large amounts of data, including customer demographics, purchase history, and interactions with the company (De Caigny et al. 2020). Using this data, machine learning models can identify factors that contribute to customer churn and predict which customers are most likely to leave. One of the benefits of using machine learning for customer churn prediction is that a model can be retrained on new data. As more information becomes available, the model can adjust its predictions and improve its accuracy over time.
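A plotting helper along the lines described above might look like the following sketch. It is an approximation and not the study's code: the text uses sns.countplot, whereas this sketch counts the categories with pandas and draws them with plain matplotlib so that it depends on fewer libraries. The function name and the sample data are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(data, column):
    """Bar chart of how often each category of `column` occurs, with count labels."""
    counts = data[column].value_counts()
    fig, ax = plt.subplots(figsize=(8, 4))
    bars = ax.bar(list(counts.index.astype(str)), counts.values)
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    ax.set_title(f"Count of {column}")
    # Write the count value above each bar.
    for bar, value in zip(bars, counts.values):
        ax.text(bar.get_x() + bar.get_width() / 2, value, str(value),
                ha="center", va="bottom")
    return ax


# Hypothetical categorical column.
Data = pd.DataFrame({"tenure_group": ["0-12", "0-12", "13-24", "25-48", "0-12"]})
ax = plot_category_counts(Data, "tenure_group")
# In a notebook, plt.show() would display the figure at this point.
```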
4.6.1 Data splitting
“Data splitting” is a crucial step in “machine learning” that involves dividing a “dataset” into two or more subsets for “training and testing” purposes. The primary goal of “data splitting” is to evaluate the “performance” of a ““machine learning “model” accurately. “Splitting data” into “training and testing sets” helps in preventing “overfitting”, which occurs when a “model” performs well on the “training set” but poorly on “new data”. There are various “techniques” for “splitting data”, including “random sampling”, “stratified sampling”, and “k-fold cross-validation”. “Random sampling” involves randomly dividing the “dataset” into “training and testing sets”. “Stratified sampling” ensures that each “subset” has an equal representation of different “classes or categories” present in the “dataset” (Domingos et al. 2021). “K-fold cross-validation” involves dividing the “dataset” into “k subsets” and using each subset as both “training and testing sets”. Moreover, “data splitting” is an essential step in “machine learning” that helps to evaluate the “performance of models” accurately. It prevents “overfitting” by ensuring that “models” generalize well to “new data”. Different “techniques” can be used for “splitting data”, depending on the “nature of the dataset” and the research question at hand.
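The difference between the splitting techniques listed above can be seen in a small scikit-learn sketch. The data is synthetic, with an assumed churn rate of 20%, and the test size, the number of folds and the random seeds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Synthetic data: 100 customers, 20 of whom churned.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# Stratified sampling: the train and test sets keep the 20% churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# K-fold cross-validation: every sample is used for testing exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]
```

Leaving out `stratify=y` gives plain random sampling, in which the churn rate of the test set can drift away from that of the full dataset, especially when the churn class is small.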
Data splitting was performed after creating a dummy data frame, which was named Data_dummy. The figure above shows the dummy dataset that was created for data splitting.
After the dummy dataset was created, it was divided into x and y for model training and testing. From x and y, X_train, X_test, Y_train and Y_test were defined for building the machine learning models. The output also shows the churn values and the sizes of the training and test data.
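As a rough illustration of this step, the sketch below builds dummy (0/1 indicator) columns for one categorical variable and then separates the features x from the target y. The records, the column names and the one_hot helper are hypothetical; the project itself most likely relied on a data frame library for this, and its exact columns are not given in the text.

```python
def one_hot(rows, column):
    """Replace a categorical column with 0/1 indicator (dummy) columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

# Hypothetical customer records (column names are illustrative only).
rows = [
    {"tenure": 1, "InternetService": "Fiber optic", "Churn": 1},
    {"tenure": 34, "InternetService": "DSL", "Churn": 0},
    {"tenure": 2, "InternetService": "DSL", "Churn": 1},
]

data_dummy = one_hot(rows, "InternetService")

# x holds every column except the target; y holds the churn label.
feature_names = [k for k in data_dummy[0] if k != "Churn"]
x = [[row[k] for k in feature_names] for row in data_dummy]
y = [row["Churn"] for row in data_dummy]
print(feature_names, x, y)
```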
Data scaling was performed before model fitting. MinMaxScaler was defined for the scaling, and X_train and X_test were then transformed with it.
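Min-max scaling maps each feature to the range 0 to 1 using the minimum and maximum observed in the training data. The following is a minimal sketch of that idea in plain Python, not the project's code; the numbers and helper names are assumptions. It fits the scaling on the training rows only and reuses the same parameters for the test rows, which is the usual way to avoid leaking test information into training.

```python
def fit_min_max(train_rows):
    """Learn the per-column minimum and maximum from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def transform_min_max(rows, mins, maxs):
    """Map each value to (v - min) / (max - min); constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical features: [tenure in months, monthly charge].
X_train = [[1, 20.0], [34, 60.0], [72, 100.0]]
X_test = [[36.5, 120.0]]

mins, maxs = fit_min_max(X_train)
X_train_scaled = transform_min_max(X_train, mins, maxs)
X_test_scaled = transform_min_max(X_test, mins, maxs)
print(X_train_scaled, X_test_scaled)
```

Test values can fall outside the 0 to 1 range when they lie beyond the training minimum or maximum, as the second test feature does here.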
4.6.2 Random forest model building
Random forest is a powerful machine learning algorithm that has gained immense popularity in recent times. It is an ensemble learning method that combines multiple decision trees to create a robust and accurate predictive model. The algorithm draws random subsets of samples and features from the dataset, builds a decision tree on each subset, and combines the predictions of the trees (Zhong and Li, 2019). The random forest model is known for its ability to handle high-dimensional data with complex relationships between variables. It can be used for both classification and regression tasks, which makes it a versatile tool in data science.
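The core idea of sampling rows and features and then voting can be sketched as follows. This is a deliberately simplified illustration, not a full random forest: each "tree" is a one-split stump on a single randomly chosen feature, whereas a real implementation grows deeper trees and samples features at every split. The data and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Pick the threshold on one feature with the fewest training errors."""
    best = None
    for t in sorted({row[feature] for row in X}):
        for left_label in (0, 1):
            preds = [left_label if row[feature] <= t else 1 - left_label for row in X]
            errors = sum(p != target for p, target in zip(preds, y))
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    return feature, best[1], best[2]

def predict_stump(stump, row):
    feature, t, left_label = stump
    return left_label if row[feature] <= t else 1 - left_label

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]      # bootstrap sample of rows
        feature = rng.randrange(len(X[0]))            # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data: [tenure in months, monthly charge]; 1 = churned.
X = [[1, 95], [2, 100], [3, 90], [4, 105], [5, 110],
     [20, 40], [30, 45], [40, 50], [50, 55], [60, 60]]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

forest = fit_forest(X, y)
print(predict_forest(forest, [1, 110]), predict_forest(forest, [60, 40]))
```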
The first model created was the random forest classifier. In the first step, the result table was defined. The train data and the test data were then defined, and the result was stored as result_RF. The train data was used for model development and the test data for the prediction table. In the last step, the name of the machine learning model was recorded together with its accuracy data.
4.6.3 Logistic regression model building
Logistic regression is a statistical technique used to analyze the relationship between a categorical dependent variable and one or more independent variables. In the context of customer churn prediction, logistic regression has been used to identify which factors are most likely to lead to customer churn. Logistic regression is an important tool for businesses looking to improve their customer retention rates (De Caigny et al. 2020). By identifying the key drivers of customer churn, businesses can take proactive steps to keep their customers happy and loyal.
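To make the link between factors and churn concrete, the sketch below shows how a fitted logistic regression turns feature values into a churn probability, and how a coefficient can be read as an odds ratio. The coefficients and feature names are invented for illustration and are not the values estimated in this project.

```python
import math

def churn_probability(features, weights, bias):
    """Logistic model: p = 1 / (1 + exp(-(bias + w . x)))."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients for [scaled tenure, fiber-optic indicator].
weights = [-2.0, 1.0]
bias = 0.0

p_new_fiber = churn_probability([0.0, 1.0], weights, bias)   # new customer on fiber
p_loyal_dsl = churn_probability([1.0, 0.0], weights, bias)   # long-tenure DSL customer

# exp(coefficient) is the multiplicative change in the odds of churn
# when that feature increases by one unit, other features held fixed.
odds_ratio_fiber = math.exp(weights[1])

print(round(p_new_fiber, 3), round(p_loyal_dsl, 3), round(odds_ratio_fiber, 3))
```

A positive coefficient raises the predicted odds of churn and a negative one lowers them, which is what allows the model to be used for identifying the key drivers of churn.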
The logistic regression model was created after the random forest model. First, the accuracy, recall, roc_auc and precision values were added to the result table. In the next step, the train and test datasets were defined with the machine learning model. The random forest classifier model had been defined as RF. To predict the result, result_df_RF was defined, and the prediction table was built with this command. For every result score, the value was computed from the machine learning model and the test data and stored in the results.
4.6.3 XGBoost model building
The XGBoost algorithm works by building a series of decision trees in sequence, optionally on subsets of the data, where each new tree is trained to correct the errors of the trees built before it. Each tree makes a prediction based on a set of features, and the final prediction is made by combining the predictions from all the trees. In customer churn prediction, the XGBoost algorithm has been used to identify which customers are most likely to leave and why. By analyzing customer data such as purchase history, demographics, and behavior patterns, the algorithm can identify key factors that contribute to churn (Ahn et al. 2020). Overall, the XGBoost algorithm is an effective tool for predicting customer churn and improving retention rates. By using this technique, businesses can gain valuable insights into their customers' behavior and take proactive steps to retain them.
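The sequential, error-correcting nature of boosting can be sketched in a few lines. The example below is plain gradient boosting with squared error and one-split regression stumps on a single feature; it is not XGBoost itself, which additionally uses regularization and second-order gradient information. The data, learning rate and function names are assumptions for illustration.

```python
LEARNING_RATE = 0.3

def fit_regression_stump(xs, residuals):
    """Best single split on a 1-D feature under squared error.

    Assumes xs contains at least two distinct values."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(xs, ys, n_rounds=20):
    base = sum(ys) / len(ys)                 # start from the mean churn rate
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # errors so far
        t, lm, rm = fit_regression_stump(xs, residuals)  # next tree fits the errors
        trees.append((t, lm, rm))
        preds = [p + LEARNING_RATE * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return base, trees

def predict(model, x):
    base, trees = model
    return base + sum(LEARNING_RATE * (lm if x <= t else rm) for t, lm, rm in trees)

# Hypothetical data: tenure in months and whether the customer churned.
xs = [1, 2, 3, 4, 20, 30, 40, 50]
ys = [1, 1, 1, 1, 0, 0, 0, 0]

model = boost(xs, ys)
print(round(predict(model, 2), 3), round(predict(model, 30), 3))
```

Each round shrinks the remaining error by a fixed fraction, so the combined prediction moves step by step towards the observed labels.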
Following the same process, the XGBoost model was created, and in the first step the result table was defined. The random state for the XGBoost classifier was set to 0. The model was fitted on x_train and y_train, and predictions were made on x_test.
4.6.4 Decision tree model building
Figure 4.6.4.1: Decision tree classifier
(Source: Self-Created)
The figure above shows the decision tree classifier that was created as a classification model. The accuracy score for the decision tree was 93. A cross tab was also created to visualize the predictions. The model was built on X_train and y_train, and the accuracy score was then computed from X_test and y_test.
4.6.5 Neural Network
Figure 4.6.5.1: Neural network model building
(Source: Self-Created)
The figure above shows the artificial neural network that was created for classification. In the first step, the library was imported, and the hidden layers were then defined in the neural network model. In the third step, the output layer was defined, and the model was compiled with adam as the optimizer, binary crossentropy as the loss and accuracy as the metric. For model fitting, the batch size was set to 32 and the number of epochs to 100.
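The pieces named above (hidden layers, an output layer, a binary cross-entropy loss, a batch size and a number of epochs) can be illustrated with a small worked example. The sketch below runs one forward pass through a tiny network with hand-picked weights and then counts the weight updates implied by a batch size of 32 and 100 epochs. The layer sizes, the ReLU and sigmoid activations, the weights and the training set size are all assumptions, since the text does not state them.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the weights of unit j."""
    return [sum(w * x for w, x in zip(unit, inputs)) + b
            for unit, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p):
    """Loss for one sample with true label y_true and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
hidden_w, hidden_b = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
out_w, out_b = [[1.0, -1.0]], [0.0]

x = [1.0, 0.0]
hidden = relu(dense(x, hidden_w, hidden_b))     # hidden activations
p = sigmoid(dense(hidden, out_w, out_b)[0])     # predicted churn probability
loss_if_churn = binary_cross_entropy(1, p)      # loss if the customer did churn

# Number of weight updates, assuming the last partial batch is also processed.
n_train, batch_size, epochs = 5000, 32, 100     # n_train is hypothetical
updates = math.ceil(n_train / batch_size) * epochs

print(hidden, round(p, 4), round(loss_if_churn, 4), updates)
```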
4.7 Results and prediction tables for the three different models
The results are visualized in this section. A prediction table was created for each machine learning model, and a confusion matrix was also created for each model to visualize its accuracy. The accuracy table presents the accuracy score, recall, Roc_Auc, Cross_val_Acc and Cross_val_f1 used to analyze each model.
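The relationship between a confusion matrix and the scores in the accuracy table can be sketched as follows. The labels and predictions are invented to show how a model can reach a high accuracy while still missing half of the churners; they are not the project's results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = churned."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of alarms that are real
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of churners that are found
    return accuracy, precision, recall

# Hypothetical test set: 20 churners and 80 retained customers.
y_true = [1] * 20 + [0] * 80
# Hypothetical predictions: 10 churners found, 10 missed, 2 false alarms.
y_pred = [1] * 10 + [0] * 10 + [1] * 2 + [0] * 78

counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall = scores(*counts)
print(counts, accuracy, round(precision, 3), recall)
```

With an imbalanced churn dataset, accuracy is dominated by the retained customers, so recall and precision give a clearer picture of how well the churners themselves are identified.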
The result table shows that logistic regression achieved an accuracy score of 90, a recall of 48 and a Roc_Auc of 73. In addition, its precision is 85, its Cross_val_acc is 88 and its Cross_val_f1 is 51.
The random forest model achieved an accuracy score of 95, a recall of 81 and a precision of 90. Its cross-validation accuracy is 94 and its Cross_val_f1 is 81.
The result for the XGBoost machine learning model is presented in the figure above. Its accuracy score is 96, its recall is 52 and its precision is 92.
The highest accuracy was observed for the XGBoost model, with an accuracy score of 96. Model fitting was carried out properly, and proper predictions were also obtained from the logistic regression machine learning model.
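As a complementary check, the precision and recall values reported above can be combined into an F1 score, which is the harmonic mean of the two. The sketch below assumes that the reported scores are percentages measured on the same test set; the resulting F1 values are derived here for illustration and differ from the cross-validated Cross_val_f1 column, which is computed on different folds.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as reported in the result tables, read as percentages.
reported = {
    "Logistic regression": (0.85, 0.48),
    "Random forest": (0.90, 0.81),
    "XGBoost": (0.92, 0.52),
}

f1_scores = {name: f1(p, r) for name, (p, r) in reported.items()}
for name, value in f1_scores.items():
    print(f"{name}: F1 = {value:.2f}")
```

Under this assumption, the ranking by F1 can differ from the ranking by accuracy, because F1 penalizes a model that misses many churners even when its overall accuracy is high.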
4.8 Summary
Random forest, logistic regression and XGBoost were used to predict customer churn. The research project is focused entirely on predicting customer churn with machine learning algorithms. The proper steps of data analysis were followed, beginning with data preprocessing to prepare the data for model fitting. Preprocessing removed the null values, and some extra features were added for data visualization. The data visualization identified customer churn across the different variables of the dataset. Further, the organization needs to improve its customer service and service quality in order to win back churned customers and to build customer loyalty more efficiently. It can also be summarized that customer churn prediction helps to identify the loopholes in an organization that cause churn in the customer base to increase on a daily basis.
Chapter 5: Conclusion and recommendation
5.1 Summary of the research
The research study was prepared to analyze and predict customer churn. Machine learning techniques were used for the data analysis, and classification techniques were used to predict customer churn. Customer churn prediction is important because it helps to identify the loopholes in the organization that stand in the way of retaining the customer base. Proper methods and processes were followed in carrying out the study. The research paper is divided into chapters. The introduction presents the aim and objectives of the research along with its significance and the research background. Chapter 2 presents the literature together with the theories and the hypothesis, and also identifies the literature gap. The methodology chapter states all the methods and techniques that were implemented in the research process. Finally, the result analysis chapter focuses on the development, where the machine learning models were created and evaluated against the testing results.
The model development, the accuracy scores and the exploratory data analysis are presented in the result and analysis chapter. Significant churn was identified across different services and different groups of customers, and it was associated with poor service quality and reduced customer service. The organization can win back its customer base by improving its services and increasing product quality. This chapter presents the results and the conclusion and recommends solutions in line with the problem statement. The linking with objectives section checks whether all the objectives have been met. The limitation section presents the disadvantages and obstacles that the researcher faced during the study. It can also be noted that the waterfall methodology was followed to manage the development project.
5.2 Linking with objectives
Linking with objective 1
The first objective of the research project is to understand the main factors that contribute to customer churn in the dataset. This objective was met in the literature review, particularly in sections 2.2, 2.3 and 2.4. Section 2.2 discusses the importance of machine learning in customer churn analysis. Section 2.3 discusses the use of the Python programming language in machine learning. Section 2.4 discusses the different machine learning algorithms that are important for customer churn analysis.
Linking with objective 2
The second objective of the research project is to identify the best algorithm for customer churn prediction and to compare the different baseline models. This objective was met in the literature review, particularly in section 2.4, where different models were evaluated for customer churn prediction. Python provides a powerful and versatile ecosystem for developing machine learning algorithms, with its rich libraries, tools, and community support. By following the proper steps, an effective machine learning algorithm can be developed in Python for customer churn analysis or any other machine learning problem.
Linking with objective 3
The third objective of the research study is to compare the accuracy of the new algorithm with that of the traditional machine learning algorithms. This objective was met in the result and analysis chapter, where three classification models were created in the development part and analyzed to find the best accuracy score. The best accuracy score, 96, was found for the XGBoost machine learning model.
Linking with objective 4
The fourth objective of the research study is to improve customer retention and reduce churn rates by developing new techniques and processes for improvement. This objective was met in the result and analysis chapter, where data visualization was used to identify the loopholes in the organization that need mitigation; these loopholes were seen in the fiber optic services. Customer churn represents the dissatisfaction of users with the organization. The organization can therefore retain customers by improving its weak areas and services, such as the fiber optic connections.
Conclusion
The research project focused on analyzing customer data and predicting customer churn with machine learning techniques, which were implemented in the Python programming language. It can be concluded that the study predicted churn from the data and that service quality has decreased in the internet service section. Churn was found in different customer segments, such as male, female and senior citizen customers. Service provision can therefore be improved. It can also be concluded that the XGBoost machine learning model presented the highest accuracy among the prediction models: the accuracy score was 90 for the logistic regression model, 95 for the random forest model and 96 for the XGBoost model.
Limitation
The research project was prepared to analyze customer churn, and machine learning was used to predict it. Customer churn prediction is a crucial aspect of any business that aims to retain its customers. It refers to the process of identifying customers who are likely to leave a company and taking proactive measures to prevent them from doing so. Machine learning helped the analysis efficiently by providing data visualization and a predictive model. However, several disadvantages and limitations were also found during the machine learning development. The main problem was model overfitting, which was attributed to the irregular division of the data and to the oversized dataset. A limitation was also found during data collection. The data was collected from secondary sources, and some premium articles could not be accessed because of the lack of funds in the research study.
Recommendation
Given the limitations of the research study, future research can be carried out more efficiently by implementing more advanced and modern machine learning techniques. With the development of technology, significant changes have been seen in different areas. Model overfitting can be reduced by proper data preprocessing, which reduces redundancy in the dataset and tunes it for producing proper results. It is also recommended that proper funds be allocated to the researcher for accessing all types of journals and articles.
References
Çelik, O. and Osmanoglu, U.O., 2019. Comparing to techniques used in customer churn analysis. Journal of Multidisciplinary Developments, 4(1), pp.30-38.
Lalwani, P., Mishra, M.K., Chadha, J.S. and Sethi, P., 2022. Customer churn prediction system: a machine learning approach. Computing, pp.1-24.
Agrawal, S., Das, A., Gaikwad, A. and Dhage, S., 2018, July. Customer churn prediction modelling based on behavioural patterns analysis using deep learning. In 2018 International conference on smart computing and electronic enterprise (ICSCEE) (pp. 1-6). IEEE.
Rahman, M. and Kumar, V., 2020, November. Machine learning based customer churn prediction in banking. In 2020 4th international conference on electronics, communication and aerospace technology (ICECA) (pp. 1196-1201). IEEE.
Gaur, A. and Dubey, R., 2018, December. Predicting customer churn prediction in telecom sector using various machine learning techniques. In 2018 International Conference on Advanced Computation and Telecommunication (ICACAT) (pp. 1-5). IEEE.
Cenggoro, T.W., Wirastari, R.A., Rudianto, E., Mohadi, M.I., Ratj, D. and Pardamean, B., 2021. Deep learning as a vector embedding model for customer churn. Procedia Computer Science, 179, pp.624-631.
Dias, J., Godinho, P. and Torres, P., 2020. Machine learning for customer churn prediction in retail banking. In Computational Science and Its Applications–ICCSA 2020: 20th International Conference, Cagliari, Italy, July 1–4, 2020, Proceedings, Part III 20 (pp. 576-589). Springer International Publishing.
Mohammad, N.I., Ismail, S.A., Kama, M.N., Yusop, O.M. and Azmi, A., 2019, August. Customer churn prediction in telecommunication industry using machine learning classifiers. In Proceedings of the 3rd international conference on vision, image and signal processing (pp. 1-7).
Karvana, K.G.M., Yazid, S., Syalim, A. and Mursanto, P., 2019, October. Customer churn analysis and prediction using data mining models in banking industry. In 2019 International Workshop on Big Data and Information Security (IWBIS) (pp. 33-38). IEEE.
Kavitha, V., Kumar, G.H., Kumar, S.M. and Harish, M., 2020. Churn prediction of customer in telecom industry using machine learning algorithms. International Journal of Engineering Research & Technology (IJERT), 9(5), pp.181-184.
Stucki, O., 2019. Predicting the customer churn with machine learning methods: case: private insurance customer data.
He, Y., Xiong, Y. and Tsai, Y., 2020, April. Machine learning based approaches to predict customer churn for an insurance company. In 2020 Systems and Information Engineering Design Symposium (SIEDS) (pp. 1-6). IEEE.
Panjasuchat, M. and Limpiyakorn, Y., 2020, August. Applying Reinforcement Learning for Customer Churn Prediction. In Journal of Physics: Conference Series (Vol. 1619, No. 1, p. 012016). IOP Publishing.
Jain, H., Yadav, G. and Manoov, R., 2020. Churn prediction and retention in banking, telecom and IT sectors using machine learning techniques. In Advances in Machine Learning and Computational Intelligence: Proceedings of ICMLCI 2019 (pp. 137-156). Singapore: Springer Singapore.
Vo, N.N., Liu, S., Li, X. and Xu, G., 2021. Leveraging unstructured call log data for customer churn prediction. Knowledge-Based Systems, 212, p.106586.
Amuda, K.A. and Adeyemo, A.B., 2019. Customers churn prediction in financial institution using artificial neural network. arXiv preprint arXiv:1912.11346.
Pamina, J., Beschi Raja, J., Sam Peter, S., Soundarya, S., Sathya Bama, S. and Sruthi, M.S., 2020. Inferring machine learning based parameter estimation for telecom churn prediction. In Computational Vision and Bio-Inspired Computing: ICCVBIC 2019 (pp. 257-267). Springer International Publishing.
Leung, C.K., Pazdor, A.G. and Souza, J., 2021, October. Explainable artificial intelligence for data science on customer churn. In 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA) (pp. 1-10). IEEE.
Sudharsan, R. and Ganesh, E.N., 2022. A Swish RNN based customer churn prediction for the telecom industry with a novel feature selection strategy. Connection Science, 34(1), pp.1855-1876.
Bhatnagar, A. and Srivastava, S., 2019, December. A robust model for churn prediction using supervised machine learning. In 2019 IEEE 9th international conference on advanced computing (IACC) (pp. 45-49). IEEE.
Kumar, S. and Kumar, M., 2019. Predicting customer churn using artificial neural network. In Engineering Applications of Neural Networks: 20th International Conference, EANN 2019, Xersonisos, Crete, Greece, May 24-26, 2019, Proceedings 20 (pp. 299-306). Springer International Publishing.
Halibas, A.S., Matthew, A.C., Pillai, I.G., Reazol, J.H., Delvo, E.G. and Reazol, L.B., 2019, January. Determining the intervening effects of exploratory data analysis and feature engineering in telecoms customer churn modelling. In 2019 4th MEC International Conference on Big Data and Smart City (ICBDSC) (pp. 1-7). IEEE.
Tariq, M.U., Babar, M., Poulin, M. and Khattak, A.S., 2022. Distributed model for customer churn prediction using convolutional neural network. Journal of Modelling in Management, 17(3), pp.853-863.
Momin, S., Bohra, T. and Raut, P., 2020. Prediction of customer churn using machine learning. In EAI International Conference on Big Data Innovation for Sustainable Cognitive Computing: BDCC 2018 (pp. 203-212). Springer International Publishing.
Labhsetwar, S.R., 2020. Predictive analysis of customer churn in telecom industry using supervised learning. ICTACT Journal on Soft Computing, 10(2), pp.2054-2060.
Ullah, I., Raza, B., Malik, A.K., Imran, M., Islam, S.U. and Kim, S.W., 2019. A churn prediction model using random forest: analysis of machine learning techniques for churn prediction and factor identification in telecom sector. IEEE Access, 7, pp.60134-60149.
El Zarif, O., Da Costa, D.A., Hassan, S. and Zou, Y., 2020, June. On the relationship between user churn and software issues. In Proceedings of the 17th International Conference on Mining Software Repositories (pp. 339-349).
Santharam, A. and Krishnan, S.B., 2018. Survey on customer churn prediction techniques. International Research Journal of Engineering and Technology, 5(11), p.3.
Jain, H., Khunteta, A. and Srivastava, S., 2020. Churn prediction in telecommunication using logistic regression and logit boost. Procedia Computer Science, 167, pp.101-112.
Sabbeh, S.F., 2018. Machine-learning techniques for customer retention: A comparative study. International Journal of advanced computer Science and applications, 9(2).
Wu, S., Yau, W.C., Ong, T.S. and Chong, S.C., 2021. Integrated churn prediction and customer segmentation framework for telco business. IEEE Access, 9, pp.62118-62136.
Ahn, J., Hwang, J., Kim, D., Choi, H. and Kang, S., 2020. A survey on churn analysis in various business domains. IEEE Access, 8, pp.220816-220839.
Kim, S. and Lee, H., 2022. Customer churn prediction in influencer commerce: an application of decision trees. Procedia Computer Science, 199, pp.1332-1339.
Amin, A., Al-Obeidat, F., Shah, B., Adnan, A., Loo, J. and Anwar, S., 2019. Customer churn prediction in telecommunication industry using data certainty. Journal of Business Research, 94, pp.290-301.
Singh, M., Singh, S., Seen, N., Kaushal, S. and Kumar, H., 2018, November. Comparison of learning techniques for prediction of customer churn in telecommunication. In 2018 28th International Telecommunication Networks and Applications Conference (ITNAC) (pp. 1-5). IEEE.
Vo, N.N., Liu, S., Brownlow, J., Chu, C., Culbert, B. and Xu, G., 2018. Client churn prediction with call log analysis. In Database Systems for Advanced Applications: 23rd International Conference, DASFAA 2018, Gold Coast, QLD, Australia, May 21-24, 2018, Proceedings, Part II 23 (pp. 752-763). Springer International Publishing.
Kozak, J., Kania, K., Juszczuk, P. and Mitręga, M., 2021. Swarm intelligence goal-oriented approach to data-driven innovation in customer churn management. International Journal of Information Management, 60, p.102357.
Spiteri, M. and Azzopardi, G., 2018, September. Customer churn prediction for a motor insurance company. In 2018 Thirteenth international conference on digital information management (ICDIM) (pp. 173-178). IEEE.
Pamina, J., Raja, B., SathyaBama, S., Sruthi, M.S. and VJ, A., 2019. An effective classifier for predicting churn in telecommunication. Jour of Adv Research in Dynamical & Control Systems, 11.
Scriney, M., Nie, D. and Roantree, M., 2020. Predicting customer churn for insurance data. In Big Data Analytics and Knowledge Discovery: 22nd International Conference, DaWaK 2020, Bratislava, Slovakia, September 14–17, 2020, Proceedings 22 (pp. 256-265). Springer International Publishing.
Gregory, B., 2018. Predicting customer churn: Extreme gradient boosting with temporal data. arXiv preprint arXiv:1802.03396.
Ahn, Y., Kim, D. and Lee, D.J., 2020. Customer attrition analysis in the securities industry: a large-scale field study in Korea. International Journal of Bank Marketing, 38(3), pp.561-577.
Wang, C., Han, D., Fan, W. and Liu, Q., 2019. Customer churn prediction with feature embedded convolutional neural network: An empirical study in the internet funds industry. International Journal of Computational Intelligence and Applications, 18(01), p.1950003.
Xiahou, X. and Harada, Y., 2022. B2C E-commerce customer churn prediction based on K-means and SVM. Journal of Theoretical and Applied Electronic Commerce Research, 17(2), pp.458-475.
De Caigny, A., Coussement, K., De Bock, K.W. and Lessmann, S., 2020. Incorporating textual information in customer churn prediction models based on a convolutional neural network. International Journal of Forecasting, 36(4), pp.1563-1578.
Domingos, E., Ojeme, B. and Daramola, O., 2021. Experimental analysis of hyperparameters for deep learning-based churn prediction in the banking sector. Computation, 9(3), p.34.
Zhong, J. and Li, W., 2019, March. Predicting customer churn in the telecommunication industry by analyzing phone call transcripts with convolutional neural networks. In Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence (pp. 55-59).