Review on The First Deep Learning for Churn Prediction

Posted by Mohamad Ivan Fanany


This post summarizes and reviews the first reported work on deep learning for churn prediction (churn being the loss of customers who move to competitors): Using Deep Learning to Predict Customer Churn in a Mobile Telecommunication Network.


  • In telecommunication companies, churn costs roughly $10 billion per year.
  • Acquiring new customers costs five to six times more than retaining existing ones.
  • The current focus is to move from customer acquisition towards customer retention.
  • Being able to predict customer churn in advance gives a company highly valuable insight for retaining and growing its customer base. Tailored promotions can then be offered to specific customers who are not satisfied.
  • Deep learning attempts to learn multiple levels of representation and automatically comes up with good features and representation for the input data.
  • The paper investigates deep learning as a predictive model in order to avoid time-consuming feature engineering and, ideally, to improve on the predictive performance of previous models.

Addressed problem: Using deep learning for predicting churn in a prepaid mobile telecommunication network

Previous works:

  • Most advanced models make use of state-of-the-art machine learning classifiers such as random forests [6], [10].
  • Use graph processing techniques [8].
  • Predict customer churn by analyzing the interactions between the customer and the Customer Relationship Management (CRM) data [9]
  • These approaches depend on the feature engineering process for their effectiveness.
  • The feature engineering process is usually time consuming and tailored only to specific datasets.
  • Machine learning classifiers work well if there is enough human effort spent in feature engineering.
  • Having the right features for each particular problem is usually the most important thing.
  • Features obtained in this human feature engineering process are usually over-specified and incomplete.

Key ideas:

  • Introduce a data representation architecture that allows efficient learning across multiple layers of detailed user behavior representations. This data representation enables the model to scale to full-sized high dimensional customer data, like the social graph of a customer.
  • The first work reporting the use of deep learning for predicting churn in a mobile telecommunication network.
  • Churn in prepaid services is actually measured based on the lack of activity
  • Infer when this lack of activity may happen in the future for each active customer.

Network architecture: A four-layer feedforward architecture.

Learning algorithms: Autoencoders, deep belief networks and multi-layer feedforward networks with different configurations.

Dataset: Billions of call records from an enterprise business intelligence system. This is large-scale historical data from a telecommunication company with ≈1.2 million customers, spanning sixteen months.


  • Churn rate is very high and all customers are prepaid users, so there is no specific date about contract termination and this action must be inferred in advance from similar behaviors.
  • There are complex underlying interactions amongst the users.


  • Churn prediction is viewed as a supervised classification problem in which the behavior of previously known churners and non-churners is used to train a binary classifier.
  • During the prediction phase new users are introduced in the model and the likelihood of becoming a churner is obtained.
  • Depending on the balance replenishment events, each customer can be in one of the following states: (i) new, (ii) active, (iii) inactive or (iv) churn.
  • Customer churn is always preceded by an inactive state, so the future inactive state is used as a proxy for churn. In particular, t = 30 days without a balance replenishment event is the threshold at which the state changes from active to inactive.
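The state rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `customer_state` is hypothetical, a customer with no replenishment so far is treated as "new", exactly 30 days without a replenishment already counts as inactive, and the "churn" state is left out because the review does not give its threshold.

```python
from datetime import date, timedelta

# t = 30 days without a balance replenishment, as stated in the review.
INACTIVITY_THRESHOLD = timedelta(days=30)

def customer_state(replenishment_dates, as_of):
    """Return "new", "active" or "inactive" for one customer on the day `as_of`.

    Assumptions (not from the paper): no replenishment yet means "new",
    and a gap of exactly 30 days already counts as inactive.
    """
    past = [d for d in replenishment_dates if d <= as_of]
    if not past:
        return "new"
    if as_of - max(past) >= INACTIVITY_THRESHOLD:
        return "inactive"
    return "active"

top_ups = [date(2013, 3, 1)]
state_soon = customer_state(top_ups, date(2013, 3, 20))   # 19 days since last top-up
state_later = customer_state(top_ups, date(2013, 3, 31))  # 30 days since last top-up
```

The training label for month M would then be whether the customer's state in month M + 1 is inactive.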

Input data preparation:

  • Two main sources of information: Call Detail Records (CDR) and balance replenishment records.
  • Each CDR provides a detailed record about each call made by the customer, having (at least) the following information:
    • Id of the cell tower where the call is originated.
    • Phone number of the customer originating the call.
    • Id of the cell tower where the call is finished.
    • Destination number of the call.
    • Time-stamp of the beginning of the call.
    • Call duration (secs).
    • Unique identification number of the phone terminal.
    • Incoming or outgoing call.
  • On the other hand, balance replenishment records have (at least) the following information:
    • Phone number related to the balance replenishment event.
    • Time-stamp of the balance replenishment event.
    • Amount of money the customer spent in the balance replenishment event.
  • Input vector preparation steps:
    • Compute one input vector per user-id for each month.
    • Input vector contains both calls events and balance replenishment history of each customer.
    • The full user-user adjacency matrix is extremely large (roughly 1.2M x 1.2M entries) and not really meaningful.
    • Consider only the call records to each user's top-3 contacts (the numbers the user calls most often).
    • For each of these, create a 48-dimensional vector X where each position holds the total number of seconds the user spent on the phone in that 30-minute interval, summed over the complete month.
    • This gives a 145-dimensional input vector (~ 3 x 48; since 3 x 48 = 144, the count implies one additional entry).
    • Add another 48-dimensional vector per user with the total amount of cash spent in each slot over the month.
    • Include 5 features specific to the business.
    • Add the binary class indicating if the user will be active in month M + 1 or not.
    • Finally, we end up with a 199-dimensional input vector X (145 + 48 + 5 + 1).
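The preparation steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `slot_totals` and `build_features` and the record layouts are hypothetical, a call's full duration is credited to the slot in which it starts, and "top-3" is taken to mean the three most frequently called numbers. The sketch yields 3 x 48 + 48 + 5 = 197 features; the one extra entry implied by the reported 145 dimensions is not described in this review and is left out, and the class label is kept separate.

```python
from collections import Counter

SLOTS_PER_DAY = 48        # 24 hours split into 30-minute intervals
SLOT_SECONDS = 30 * 60

def slot_totals(events):
    """Sum each event's value into the 30-minute slot in which it starts.

    events: iterable of (seconds_since_midnight, value) pairs for one month.
    """
    totals = [0.0] * SLOTS_PER_DAY
    for start_sec, value in events:
        totals[int(start_sec // SLOT_SECONDS) % SLOTS_PER_DAY] += value
    return totals

def build_features(calls, topups, business_features, top_k=3):
    """Assemble one user's monthly feature vector.

    calls:  list of (callee, seconds_since_midnight, duration_secs)
    topups: list of (seconds_since_midnight, amount)
    """
    counts = Counter(callee for callee, _, _ in calls)
    top = [callee for callee, _ in counts.most_common(top_k)]
    features = []
    for callee in top:
        features += slot_totals((t, d) for c, t, d in calls if c == callee)
    # Pad with zeros when the user called fewer than top_k distinct numbers.
    features += [0.0] * (SLOTS_PER_DAY * (top_k - len(top)))
    features += slot_totals(topups)
    features += list(business_features)
    return features

calls = [("A", 9 * 3600, 120), ("A", 9 * 3600 + 600, 60), ("B", 23 * 3600 + 1800, 30)]
topups = [(12 * 3600, 10.0)]
x = build_features(calls, topups, [0.0] * 5)
```

Here both calls to "A" start between 09:00 and 09:30, so they land in slot 18 of the first block, while the single call to "B" lands in the last slot of the second block.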

Deep learning model:

  • Use a standard multi-layer feedforward architecture.
  • The function tanh is used for the nonlinear activation.
  • Each layer computes an output vector using the output of the previous layer.
  • The last output layer generates the predictions by applying the softmax function.
  • Use the negative conditional log-likelihood as a loss function whose expected value over pairs (between training example and output value) is minimized.
  • Apply standard stochastic gradient descent (SGD) with the gradient computed via backpropagation.
  • Weights are initialized randomly from a normal distribution with a standard deviation of 0.8.
  • Use dropout as a regularization technique to prevent over-fitting while improving generalization.
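To make the description concrete, here is a minimal forward-pass sketch in plain Python. It is not the authors' implementation. The hidden layer sizes are hypothetical, "four layers" is read as four weight layers, biases start at zero, dropout is written in its "inverted" form, and the input size of 198 assumes the class label is held out of the 199-dimensional vector. Only the forward pass and the loss are shown; the SGD update via backpropagation is omitted.

```python
import math
import random

def init_layer(n_in, n_out, rng, std=0.8):
    """Weights drawn from a normal distribution with std 0.8, as reported.
    Zero biases are an assumption."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(layers, x, rng=None, drop_p=0.0):
    """tanh hidden layers followed by a softmax output layer."""
    h = x
    for layer in layers[:-1]:
        h = [math.tanh(v) for v in affine(layer, h)]
        if rng is not None and drop_p > 0.0:
            # Inverted dropout (assumption): drop units, rescale the survivors.
            h = [0.0 if rng.random() < drop_p else v / (1.0 - drop_p) for v in h]
    return softmax(affine(layers[-1], h))

def nll(probs, label):
    """Negative conditional log-likelihood of the true class."""
    return -math.log(probs[label])

rng = random.Random(0)
sizes = [198, 64, 32, 16, 2]  # hypothetical sizes: 198 inputs, 2 output classes
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
x = [rng.random() for _ in range(sizes[0])]
probs = forward(layers, x)   # prediction mode, no dropout
loss = nll(probs, 1)         # loss if the true class were 1
```

During training, `forward(layers, x, rng=rng, drop_p=0.5)` would apply dropout to every hidden layer; at prediction time dropout is switched off, as in the call above.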

Training and validation:

  • The deep neural network is trained to distinguish between active and inactive customers based on the learned features associated with them.
  • In the training phase each instance contains input data for each customer together with the known state in the following month.
  • In the validation phase data from the next month is introduced into the model and the prediction errors are computed.
  • The instances in the training and validation data may refer to different customers. Hence, customer identity is not used in the predictions.
  • The model has been evaluated over 12 months of real customer data (from March 2013 to February 2014).
  • 12 models are trained, and each model generates predictions for all the months.

Results: On average, the model achieves 77.9% AUC on validation data, significantly better than the authors' prior best performance of 73.2%, obtained with random forests and extensive custom feature engineering applied to the same datasets.
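For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc` and the toy scores are illustrative only, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: one of the four positive/negative pairs is ranked the wrong way round.
example_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

On this scale, 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the reported 77.9% and 73.2% correspond to 0.779 and 0.732.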


  • Multi-layer feedforward models are effective for predicting churn and capture the complex dependencies in the data.
  • Experiments show that the model is quite stable across different months, so it generalizes well to future instances and does not overfit the training data.
  • The approach that succeeded on churn prediction could potentially be applied to other business intelligence prediction tasks, such as fraud detection or upsell.

Future works:

  • Include the location data (latitude and longitude) of each call in the input data, in the hope of improving the results.
  • Apply deep belief networks for unsupervised pre-training, which is reported to improve predictive performance [20].
  • The input data architecture could also encode long-term interactions among users for a better model. However, the full user-user input data is extremely sparse, and it becomes very large once long-term user interactions are considered.

My Review:

  • This is a nice and interesting article that highlights the success of deep learning in extracting better features, in an unsupervised way, for churn prediction.
  • Even though stated in the abstract, the autoencoders and deep belief networks are not yet implemented.
  • On the same data, the result is 4.7 percentage points higher in terms of AUC than the state-of-the-art random forest technique. However, the computational overhead compared to the previous method is unclear.

2 thoughts on “Review on The First Deep Learning for Churn Prediction”

  1. Hi, I read your review on the first deep learning for churn prediction. I cannot see any deep learning algorithms being applied in that article. It was just predicting churn with an ANN containing 4 hidden layers. There was no unsupervised learning (feature extraction) happening there.


    1. Dear bbnsumanth,

      Thank you for your comments.

      The NN is actually a special case of deep learning.
      The key differences of deep learning compared with
      conventional NN are two-fold:

      1. Deep learning uses raw features rather than the human-made
      features used by a conventional NN.
      2. Deep learning goes through pre-training (greedy layer-wise
      training) before fine-tuning, whereas a conventional NN
      only does fine-tuning (error backpropagation).

      As for the features used in the paper, they are mainly raw features:
      the total number of seconds of a user's calls to their top-3 contacts
      (the most frequently called) during a month (a 145-dimensional input
      vector). Another set of raw features, the total amount of cash spent
      in each month (48 dimensions), is then appended to the previous raw
      features, resulting in a 193-dimensional input vector. Five additional
      dimensions (business features) are added. Together with the binary
      class label, the total dimension of the input vector is 199. A
      conventional NN does not use such raw features; it might use features
      such as age, living expenses, living area, etc.

      As for the learning method, one of the biggest problems in a conventional
      NN when we add more layers is the “diminishing error” problem (usually
      called the vanishing gradient problem), i.e., the error gets smaller as
      it is back-propagated to the lower layers. This problem is usually
      tackled using appropriate initialization, activation functions, etc.

      I agree with you that the paper uses a conventional NN, and it is
      unclear whether the paper tried to address the second point, doing
      pre-training before fine-tuning. As I stated in my review:
      “Even though stated in the abstract, the autoencoders and deep
      belief networks are not yet implemented”.

      I still think, however, that the paper does deep learning, mainly
      for the first reason (using raw features). In addition, they use
      a NN with more than 2 layers (in fact 4 layers). The reason for choosing
      4 layers may be that the authors wanted to extract 3-degree
      relationships between a user and their top-3 callees. In such a “not
      really deep” structure, the “diminishing error” problem might be solved
      by using a good initialization and good error or activation functions
      (in the paper, they use tanh and softmax).

      By the way, I am now interested in asking the authors
      about this directly.

      Thank you.

