Posted by Mohamad Ivan Fanany
This writing summarizes and reviews the first reported work on deep learning for churn prediction (churn being the loss of customers who move to competitors): Using Deep Learning to Predict Customer Churn in a Mobile Telecommunication Network.
- In telecommunication companies, churn costs roughly $10 billion per year.
- Acquiring new customers costs five to six times more than retaining existing ones.
- The current focus is to move from customer acquisition towards customer retention.
- Being able to predict customer churn in advance provides a company with highly valuable insight for retaining and growing its customer base. Tailored promotions can be offered to specific customers who are not satisfied.
- Deep learning attempts to learn multiple levels of representation and automatically comes up with good features and representation for the input data.
- Motivation: investigate the application of deep learning as a predictive model, to avoid the time-consuming feature engineering effort and ideally to increase the predictive performance of previous models.
Addressed problem: Using deep learning for predicting churn in a prepaid mobile telecommunication network
- Most advanced models make use of state-of-the-art machine learning classifiers such as random forests.
- Use graph processing techniques.
- Predict customer churn by analyzing the interactions between the customer and the Customer Relationship Management (CRM) data.
- Base their effectiveness on the feature engineering process.
- The feature engineering process is usually time consuming and tailored only to specific datasets.
- Machine learning classifiers work well if there is enough human effort spent in feature engineering.
- Having the right features for each particular problem is usually the most important thing.
- Features obtained in this human feature engineering process are usually over-specified and incomplete.
- Introduce a data representation architecture that allows efficient learning across multiple layers of detailed user behavior representations. This data representation enables the model to scale to full-sized high dimensional customer data, like the social graph of a customer.
- The first work reporting the use of deep learning for predicting churn in a mobile telecommunication network.
- Churn in prepaid services is actually measured based on the lack of activity.
- Infer when this lack of activity may happen in the future for each active customer.
Network architecture: A four-layer feedforward architecture.
Learning algorithms: Autoencoders, deep belief networks and multi-layer feedforward networks with different configurations.
Dataset: Billions of call records from an enterprise business intelligence system. This is large-scale historical data from a telecommunication company with ≈1.2 million customers, spanning sixteen months.
- Churn rate is very high and all customers are prepaid users, so there is no specific date about contract termination and this action must be inferred in advance from similar behaviors.
- There are complex underlying interactions amongst the users.
- Churn prediction is viewed as a supervised classification problem where the behavior of previously known churners and non-churners are used to train a binary classifier.
- During the prediction phase new users are introduced in the model and the likelihood of becoming a churner is obtained.
- Depending on the balance replenishment events, each customer can be in one of the following states: (i) new, (ii) active, (iii) inactive or (iv) churn.
- Customer churn is always preceded by an inactive state; since the goal is to predict churn, a future inactive state is used as a proxy for churn. In particular, t = 30 days without a balance replenishment event is the threshold used to change a customer's state from active to inactive.
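The state rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function and variable names are assumptions.

```python
from datetime import date, timedelta

# The paper's rule as described above: a customer whose last balance
# replenishment is more than t = 30 days old is labeled inactive, and a
# future inactive state is used as a proxy for churn.
INACTIVITY_THRESHOLD = timedelta(days=30)

def customer_state(last_topup: date, today: date) -> str:
    """Return 'active' or 'inactive' from the last replenishment date."""
    if today - last_topup > INACTIVITY_THRESHOLD:
        return "inactive"
    return "active"

print(customer_state(date(2014, 1, 1), date(2014, 1, 20)))  # active
print(customer_state(date(2014, 1, 1), date(2014, 2, 15)))  # inactive
```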
Input data preparation:
- Two main sources of information: Call Detail Records (CDR) and balance replenishment records.
- Each CDR provides a detailed record about each call made by the customer, having (at least) the following information:
- Id of the cell tower where the call is originated.
- Phone number of the customer originating the call.
- Id of the cell tower where the call is finished.
- Destination number of the call.
- Time-stamp of the beginning of the call.
- Call duration (secs).
- Unique identification number of the phone terminal.
- Incoming or outgoing call.
- Balance replenishment records, on the other hand, contain (at least) the following information:
- Phone number related to the balance replenishment event.
- Time-stamp of the balance replenishment event.
- Amount of money the customer spent in the balance replenishment event.
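The two record types above can be captured as simple schemas. This is an illustrative sketch; the field names are assumptions, not the operator's actual column names.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CallDetailRecord:
    origin_cell_id: str    # id of the cell tower where the call originated
    caller_number: str     # phone number of the customer originating the call
    dest_cell_id: str      # id of the cell tower where the call finished
    callee_number: str     # destination number of the call
    start_time: datetime   # time-stamp of the beginning of the call
    duration_secs: int     # call duration in seconds
    terminal_id: str       # unique identification number of the phone terminal
    incoming: bool         # incoming (True) or outgoing (False) call

@dataclass
class BalanceRecord:
    phone_number: str      # number related to the balance replenishment event
    timestamp: datetime    # time-stamp of the replenishment event
    amount: float          # money spent in the replenishment event
```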
- Input vector preparation steps:
- Compute one input vector per user-id for each month.
- Input vector contains both calls events and balance replenishment history of each customer.
- The full user-user adjacency matrix would be extremely large (roughly 1.2M x 1.2M entries) and not really meaningful.
- Consider only the call records involving each customer's top-3 most frequently called users.
- For each of these users, create a 48-dimensional vector X where each position holds the total seconds spent on the phone in that 30-minute daily time slot, summed over the complete month.
- This yields a 145-dimensional input vector (≈ 3 x 48).
- Add another 48-dimensional vector per user with the total amount of cash spent in each time slot over the month.
- Include 5 features specific to the business.
- Add the binary class indicating if the user will be active in month M + 1 or not.
- Finally, we end up with a 199-dimensional input vector X.
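The core aggregation step above, summing call seconds into 48 half-hour daily slots over a month, can be sketched as follows. The function name and input format are assumptions for illustration.

```python
import numpy as np

def monthly_slot_vector(calls):
    """calls: iterable of (hour, minute, duration_secs) tuples for one
    user over one month. Returns a 48-dimensional vector where slot k
    holds the total seconds spoken in the k-th half-hour of the day,
    summed over the whole month."""
    x = np.zeros(48)
    for hour, minute, secs in calls:
        slot = hour * 2 + (1 if minute >= 30 else 0)  # 00:00-00:30 -> 0, ...
        x[slot] += secs
    return x

# three calls: 09:15 (120 s), 09:45 (60 s), 21:05 (300 s)
x = monthly_slot_vector([(9, 15, 120), (9, 45, 60), (21, 5, 300)])
# slot 18 (09:00-09:30) holds 120 s, slot 19 holds 60 s, slot 42 holds 300 s
```

Concatenating three such vectors (one per top-3 contact), the 48-dimensional cash vector, the 5 business features, and the binary label gives the 199-dimensional instance described above.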
Deep learning model:
- Use a standard multi-layer feedforward architecture.
- The function tanh is used for the nonlinear activation.
- Each layer computes an output vector using the output of the previous layer.
- The last output layer generates the predictions by applying the softmax function.
- Use the negative conditional log-likelihood as the loss function, minimizing its expected value over (training example, target) pairs.
- Apply standard stochastic gradient descent (SGD) with the gradient computed via backpropagation.
- Initialize the weights randomly from a normal distribution with a standard deviation of 0.8.
- Use dropout as a regularization technique to prevent over-fitting while improving generalization.
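The forward pass of the architecture described above can be sketched in plain numpy: tanh hidden activations, a softmax output, weights drawn from a normal distribution with standard deviation 0.8, and inverted dropout on the hidden layers. The hidden-layer widths are assumptions; this summary does not give the paper's exact sizes, and the training loop (SGD with backpropagation) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, std=0.8):
    # random initialization from N(0, 0.8), as described above
    return rng.normal(0.0, std, size=(n_in, n_out)), np.zeros(n_out)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# four-layer feedforward architecture; hidden widths are assumed
layers = [init_layer(199, 64), init_layer(64, 64),
          init_layer(64, 32), init_layer(32, 2)]

def forward(x, dropout_p=0.0, training=False):
    h = x
    for i, (W, b) in enumerate(layers):
        z = h @ W + b
        if i < len(layers) - 1:
            h = np.tanh(z)  # nonlinear activation
            if training and dropout_p > 0:
                # inverted dropout: drop units, rescale the survivors
                mask = rng.random(h.shape) > dropout_p
                h = h * mask / (1.0 - dropout_p)
        else:
            h = softmax(z)  # probabilities over {active, inactive}
    return h

probs = forward(np.zeros((5, 199)))  # each row sums to 1
```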
Training and validation:
- Deep neural network is trained to distinguish between active and inactive customers based on learned features associated with them.
- In the training phase each instance contains input data for each customer together with the known state in the following month.
- In the validation phase data from the next month is introduced into the model and the prediction errors are computed.
- Each instance in the training and validation data may refer to different customers; hence, customer identity is not taken into account in the predictions.
- The model has been evaluated over 12 months of real customer data (from March 2013 to February 2014).
- 12 models are trained, and each model generates predictions for all the months.
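The evaluation protocol above, one model per month, each scored against every month, can be sketched as follows. `train` and `evaluate` are placeholders standing in for the authors' fitting and AUC-scoring routines, not real code from the paper.

```python
# the 12 evaluation months: March 2013 through February 2014
months = [f"2013-{m:02d}" for m in range(3, 13)] + ["2014-01", "2014-02"]

def rolling_evaluation(data_by_month, train, evaluate):
    """Train one model on each month's instances (whose labels come from
    the following month), then score every model on every month."""
    scores = {}
    for train_month in months:
        model = train(data_by_month[train_month])
        for val_month in months:
            scores[(train_month, val_month)] = evaluate(model, data_by_month[val_month])
    return scores  # 12 x 12 = 144 (train month, validation month) scores
```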
Results: On average, the model achieves 77.9% AUC on validation data, significantly better than the authors' prior best performance of 73.2%, obtained with random forests and extensive custom feature engineering on the same datasets.
- Multi-layer feedforward models are an effective algorithm for predicting churn and capture the complex dependencies in the data.
- Experiments show that the model is quite stable across different months; it thus generalizes well to future instances and does not overfit the training data.
- Its success on churn prediction can be potentially applied to other business intelligence prediction tasks like fraud detection or upsell.
- Including location data (latitude and longitude) of each call in the input data may improve the obtained results.
- Applying deep belief networks for unsupervised pre-training may improve the predictive performance.
- The input data architecture could also encode long-term interactions among users for a better model, but the full user-user input data is extremely sparse and becomes very large once long-term user interactions are considered.
- This is a nice and interesting article that highlights the potential of deep learning to automatically extract better features for churn prediction.
- Even though they are mentioned in the abstract, the autoencoders and deep belief networks are not yet implemented.
- The result on the same data is 4.7 percentage points higher in terms of AUC than the state-of-the-art random forest technique. However, the computational cost overhead compared to the previous method is unclear.