Review on Deep Learning for Signal Processing

Posted by Mohamad Ivan Fanany

This post summarizes and reviews the article "Deep Learning and Its Applications to Signal and Information Processing."


Background:

  • Signal processing research has significantly widened its scope [4].
  • Machine learning has been an important technical area of signal processing.
  • Since 2006, deep learning—a new area of machine learning research—has emerged [7], impacting a wide range of signal and information processing.


Aims of the paper:

  • Introduce the emerging technologies enabled by deep learning.
  • Review deep learning research relevant to signal processing.
  • Point out future research directions.
  • Provide a brief survey of deep learning applications in three main categories:
    1. Speech and audio
    2. Image and video
    3. Language processing and information retrieval

Introduction to Deep Learning:

  • Traditional machine learning and signal processing exploit shallow architectures (architectures with at most one layer of nonlinear feature transformation) such as:
    • Hidden Markov models (HMMs),
    • Linear or nonlinear dynamical systems,
    • Conditional random fields (CRFs),
    • Maximum entropy (MaxEnt) models,
    • Support vector machines (SVMs),
    • Kernel regression,
    • Multilayer perceptron (MLP) with a single hidden layer.
  • An SVM is a shallow linear separation model with one feature transformation layer when the kernel trick is used, and with zero feature transformation layers otherwise.
  • Human information processing mechanisms (e.g., vision and speech) need deep architectures for extracting complex structure and building internal representations from rich sensory inputs (e.g., natural images and their motion, speech, and music).
  • Human speech production and perception systems are layered hierarchical structures that transform information from the waveform level to the linguistic level and vice versa.
  • Processing of such human-centric media signals will keep advancing if efficient and effective deep learning algorithms are developed.
  • Signal processing systems with deep architectures are composed of many layers of nonlinear processing stages, where each lower layer’s outputs are fed to its immediate higher layer as the input.
  • Two key properties of successful deep learning techniques:
    • The generative nature of the model, which typically requires an additional top layer to perform the discriminative task;
    • Unsupervised pretraining that effectively uses large amounts of unlabeled training data for extracting structures and regularities in the input features.
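As a minimal illustration of such a deep architecture, the sketch below (Python with NumPy; all layer sizes and weights are hypothetical, not from the paper) passes an input through a stack of nonlinear layers, each lower layer's output feeding the next:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Deep architecture: each lower layer's output is fed to its
    immediate higher layer as the input."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h

# Hypothetical stack of three nonlinear layers: 8 -> 6 -> 4 -> 2 units.
rng = np.random.default_rng(0)
dims = [8, 6, 4, 2]
weights = [rng.normal(scale=0.1, size=(dims[i], dims[i + 1])) for i in range(3)]
biases = [np.zeros(d) for d in dims[1:]]
out = forward(rng.normal(size=8), weights, biases)
print(out.shape)  # (2,)
```

Each layer here is a plain sigmoid transformation; in a DBN these layers would be pretrained one at a time before any supervised fine-tuning.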

Brief history:

  • The concept of deep learning originated from artificial neural network research.
  • Multilayer perceptron with many hidden layers is a good example of deep architectures.
  • Backpropagation is a well-known algorithm for learning the weights of multilayer perceptron.
  • Backpropagation alone does not work well with more than a small number of hidden layers (see a review and analysis in [1]).
  • The pervasive presence of local optima in the nonconvex objective function of the deep networks is the main source of difficulty in learning.
  • Backpropagation is based on local gradient descent and starts usually at some random initial points.
  • Backpropagation often gets trapped in local optima and the severity increases significantly as the depth increases.
  • Due to this local optima problem, much machine learning and signal processing research steered away from neural networks toward shallow models with convex loss functions (e.g., SVMs, CRFs, and MaxEnt models), for which the global optimum can be efficiently obtained at the cost of less powerful models.
  • An unsupervised learning algorithm, which efficiently alleviates the local optima problem, was introduced in 2006 by Hinton et al. [7] for a class of deep generative models called deep belief networks (DBNs).
  • A core component of the DBN is a greedy, layer-by-layer learning algorithm that optimizes DBN weights with time complexity linear in the size and depth of the network.
  • Separately and somewhat surprisingly, initializing the weights of an MLP with a correspondingly configured DBN often produces much better results than initializing with random weights [1], [5].
  • Deep networks that are learned with unsupervised DBN pretraining followed by the backpropagation fine-tuning are also called DBNs (e.g., [8] and [9]).
  • DBN attractive properties:
    1. Makes effective use of unlabeled data;
    2. Can be interpreted as Bayesian probabilistic generative models;
    3. The values of the hidden variables in the deepest layer can be computed efficiently;
    4. The overfitting problem (often observed in models with millions of parameters such as DBNs) and the underfitting problem (often occurring in deep networks) are effectively addressed by the generative pretraining step.
  • Since the publication of the seminal work of [7], numerous researchers have been improving and applying the deep learning techniques with success.
  • Another popular technique is to pretrain the deep networks layer by layer by considering each pair of layers as a denoising auto-encoder [1].
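A minimal sketch of this layer-by-layer pretraining idea, using a denoising auto-encoder with tied weights for one layer. The toy data, layer sizes, learning rate, and masking-noise level below are illustrative assumptions, not the setup of [1]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dae_layer(X, n_hid, noise=0.3, lr=0.5, epochs=200, seed=0):
    """Greedily pretrain one layer as a denoising auto-encoder:
    corrupt the input, encode, decode with tied weights, and reduce
    squared reconstruction error against the clean input."""
    rng = np.random.default_rng(seed)
    n, n_vis = X.shape
    W = rng.normal(scale=0.1, size=(n_vis, n_hid))
    b, c = np.zeros(n_hid), np.zeros(n_vis)
    losses = []
    for _ in range(epochs):
        Xn = X * (rng.random(X.shape) > noise)   # masking corruption
        H = sigmoid(Xn @ W + b)                  # encode corrupted input
        R = sigmoid(H @ W.T + c)                 # reconstruct
        losses.append(float(np.mean((R - X) ** 2)))
        dR = (R - X) * R * (1 - R)               # decoder pre-activation grad
        dH = (dR @ W) * H * (1 - H)              # encoder pre-activation grad
        W -= lr / n * (Xn.T @ dH + dR.T @ H)     # tied-weight gradient
        b -= lr / n * dH.sum(0)
        c -= lr / n * dR.sum(0)
    return W, b, losses

# Toy data: 200 samples drawn from 4 binary prototype patterns.
rng = np.random.default_rng(1)
protos = (rng.random((4, 16)) > 0.5).astype(float)
X = protos[rng.integers(0, 4, size=200)]
W, b, losses = pretrain_dae_layer(X, n_hid=8)
print(losses[0] > losses[-1])  # True: reconstruction error drops
```

After one layer is trained, its codes sigmoid(X @ W + b) become the training input for the next layer, which is how the greedy stacking proceeds.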

Applications of Deep Learning to Signal Processing:

  • The technical scope of signal processing has expanded from traditional types of signals (audio, speech, image, and video) to also include text, language, and documents that convey high-level, semantic information for human consumption.
  • The scope of processing has been extended from the conventional coding, enhancement, analysis, and recognition to include more human-centric tasks of interpretation, understanding, retrieval, mining, and user interface [4].
  • The signal processing areas can be defined by a matrix constructed with the two axes of “signal” and “processing”.
  • The deep learning techniques have recently been applied to quite a number of extended signal processing areas.

Speech and audio:

  • The traditional MLP has been in use for speech recognition for many years.
  • Used alone, MLP performance is typically lower than the state-of-the-art HMM systems with observation probabilities approximated with Gaussian mixture models (GMMs).
  • Deep learning techniques were successfully applied to phone recognition [8], [9] and large-vocabulary continuous speech recognition (LVCSR) by integrating the powerful discriminative training ability of DBNs with the sequential modeling ability of HMMs.
  • Such a model is typically named DBN-HMM, where the observation probability is estimated using the DBN and the sequential information is modeled using the HMM.
  • In [9], a five-layer DBN was used to replace the Gaussian mixture component of the GMM-HMM, and the monophone state was used as the modeling unit.
  • Although the monophone model was used, the DBN-HMM approach achieved competitive phone recognition accuracy with the state-of-the-art triphone GMM-HMM systems.
  • The DBN-CRF in [8] improved the DBN-HMM used in [9] by using a CRF instead of an HMM to model the sequential information, and by applying the maximum mutual information (MMI) criterion in training speech recognition.
  • The sequential discriminative learning technique developed in [8] jointly optimizes the DBN weights, transition weights, and phone language model, and achieved higher accuracy than the DBN-HMM phone recognizer with the frame-discriminative training criterion implicit in the DBN fine-tuning procedure implemented in [9].
  • The DBN-HMM can be extended from the context-independent model to the context-dependent model and from the phone recognition to the LVCSR.
  • Experiments on the challenging Bing mobile voice search data set collected under the real usage scenario demonstrate that the context-dependent DBN-HMM significantly outperforms the state-of-the-art HMM system.
  • Three factors contribute to the success of context-dependent DBN-HMM:
    • Triphone senones as the DBN modeling units,
    • Triphone GMM-HMM to generate the senone alignment,
    • the tuning of the transition probabilities.
  • Experiments indicate that the decoding time of a five-layer DBN-HMM is almost the same as that of the state-of-the-art triphone GMM-HMM.
  • In [5], the deep auto-encoder [7] is explored for speech feature coding, with the goal of compressing the data to a predefined number of bits with minimal reproduction error.
  • DBN pretraining is found to be crucial for high coding efficiency.
  • When DBN pretraining is used, the deep auto-encoder is shown to significantly outperform a traditional vector quantization technique.
  • If weights in the deep auto-encoder are randomly initialized, the performance is substantially degraded.
  • Another popular deep model: convolutional DBN
  • Applications of the convolutional DBN to audio and speech data show strong results for music artist and genre classification, speaker identification, speaker gender classification, and phone classification.
  • Deep-structured CRFs, which stack many layers of CRFs, have been successfully used in the speech-related tasks of language identification, phone recognition, sequential labeling [15], and confidence calibration.
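The division of labor in a hybrid DBN-HMM can be sketched as follows: the network's per-frame state posteriors are converted to scaled likelihoods (posteriors divided by state priors) and combined with HMM transition scores by Viterbi decoding. This is a generic hybrid-decoding sketch, not the authors' exact system; all posteriors, priors, and transition values below are toy numbers:

```python
import numpy as np

def viterbi_scaled(posteriors, priors, logA, log_init):
    """Hybrid decoding: per-frame state posteriors p(s|x) are divided
    by state priors p(s) to act as scaled likelihoods p(x|s), then
    combined with HMM transitions via Viterbi."""
    log_obs = np.log(posteriors) - np.log(priors)    # scaled log-likelihoods
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA               # scores[prev, cur]
        back[t] = scores.argmax(0)                   # best predecessor
        delta = scores.max(0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 3-state example: posteriors strongly favor the path 0,0,1,1,2.
post = np.array([[.8, .1, .1], [.7, .2, .1], [.1, .8, .1],
                 [.1, .7, .2], [.1, .1, .8]])
priors = np.array([1 / 3, 1 / 3, 1 / 3])
logA = np.log(np.full((3, 3), 0.2) + 0.4 * np.eye(3))  # self-loops favored
path = viterbi_scaled(post, priors, logA, np.log(priors))
print(path)  # [0, 0, 1, 1, 2]
```

In a real system the states would be senones, the priors estimated from the training alignment, and the transition probabilities tuned as noted above.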

Image and video:

  • The original DBN and deep auto-encoder (AE) were developed and first found success on simple image recognition and dimensionality-reduction (coding) tasks (MNIST) in [7].
  • Interesting finding: the gain of coding efficiency of DBN-based auto-encoder (on the image data) over the conventional method of principal component analysis as demonstrated in [7] is very similar to the gain reported in [5] on the speech data over the traditional technique of vector quantization.
  • In [10], Nair and Hinton developed a modified DBN where the top-layer uses a third-order Boltzmann machine.
    • They applied the modified DBN to the NORB database—a three-dimensional object recognition task.
    • They reported an error rate close to the best published result on this task.
    • The DBN substantially outperforms shallow models such as SVMs.
  • Tang and Eliasmith developed two strategies to improve the robustness of the DBN in [14].
    1. Use sparse connections in the first layer of the DBN as a way to regularize the model.
    2. Developed a probabilistic denoising algorithm. Both techniques are shown to be effective in improving the robustness against occlusion and random noise in a noisy image recognition task.
  • Image recognition with a more general approach than the DBN appears in [11].
  • DBNs have also been successfully applied to create compact but meaningful representations of images for retrieval purposes.
  • On large-collection image retrieval tasks, deep learning approaches have also produced strong results.
  • The use of conditional DBN for video sequence and human motion synthesis was reported in [13].
  • The conditional DBN makes the DBN weights associated with a fixed time window conditioned on the data from previous time steps.
  • Temporal DBNs open the opportunity to improve the DBN-HMM toward efficient integration of temporal-centric human speech production mechanisms into DBN-based speech models.
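Convolutional models like the convolutional DBN mentioned in the speech section, and widely used on images, rest on one core operation: a small shared filter slid across the image, so the same local feature detector is applied at every position. A minimal sketch, with a hypothetical 1×2 edge filter:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, as in most deep
    learning code): slide a shared filter over the image so the same
    local feature detector is reused at every position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter responds where intensity changes left-to-right.
img = np.zeros((5, 6))
img[:, 3:] = 1.0                  # right half of the image is bright
edge = np.array([[-1.0, 1.0]])    # hypothetical 1x2 edge detector
fmap = conv2d_valid(img, edge)
print(fmap.max(), fmap.min())  # 1.0 at the edge columns, 0.0 elsewhere
```

Weight sharing is what keeps the parameter count low and gives these models their robustness to translation in the image.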

Language processing and information retrieval:

  • Research in language, document, and text processing has seen increasing popularity in signal processing research.
  • The society’s audio, speech, and language processing technical committee designated language, document, and text processing as one of its main focus areas.
  • Long history of using (shallow) neural networks in language modeling (LM)—an important component in speech recognition, machine translation, text information retrieval, and in natural language processing.
  • Recently, a DBN-HMM model was used for speech recognition. The observation probabilities are estimated using the DBN. The state values can be syllables, phones, subphones, monophone states, or triphone states and senones.
  • Temporally factored RBM has been used for LM. Unlike the traditional N-gram model, the factored RBM uses distributed representations not only for context words but also for the words being predicted. This approach can be directly generalized to deeper structures.
  • Collobert and Weston [2] developed and employed a convolutional DBN as the common model to simultaneously solve a number of classic problems including part-of-speech tagging, chunking, named entity tagging, semantic role identification, and similar word identification.
  • A similar multitask learning technique with DBN is used in [3] to attack the machine transliteration problem, which may be generalized to the more difficult problem of machine translation.
  • DBN and deep auto-encoder are used for document indexing and retrieval [11], [12].
    • The hidden variables in the last layer are easy to infer.
    • Gives a much better representation of each document (based on the word-count features) than the widely used latent semantic analysis.
    • Using compact codes produced by deep networks, documents are mapped to memory addresses in such a way that semantically similar text documents are located at nearby addresses to facilitate rapid document retrieval.
    • This idea is explored for audio document retrieval and speech recognition [5].
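This semantic-hashing idea, mapping each document to a short binary code so that similar documents sit at nearby addresses, can be sketched as follows. The single random projection below is an illustrative stand-in for a trained deep encoder, not the method of [11], [12]:

```python
import numpy as np

def to_binary_code(features, W, b):
    """Map a word-count vector to a short binary code by thresholding
    a sigmoid layer; here W is a stand-in for pretrained weights."""
    return (1.0 / (1.0 + np.exp(-(features @ W + b))) > 0.5).astype(int)

def hamming_neighbors(query_code, codes):
    """Retrieve documents at nearby 'addresses': rank all stored codes
    by Hamming distance to the query's code."""
    d = (codes != query_code).sum(axis=1)
    return np.argsort(d)

rng = np.random.default_rng(0)
W = rng.normal(size=(30, 8))      # hypothetical encoder weights
b = np.zeros(8)
docs = rng.poisson(1.0, size=(5, 30)).astype(float)  # toy word counts
docs[4] = docs[0] + rng.poisson(0.1, 30)             # doc 4 resembles doc 0
codes = to_binary_code(docs, W, b)
q = to_binary_code(docs[0:1], W, b)[0]
order = hamming_neighbors(q, codes)
print(order[0])  # the query document itself is at distance zero
```

With a trained deep encoder in place of the random projection, the codes reflect document semantics, so nearby addresses hold semantically similar documents.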


Summary:

  • Deep learning has already demonstrated promising results in many signal processing applications.

Future directions:

  • Better understanding the deep model and deep learning:
    • Why is learning in deep models difficult?
    • Why do the generative pretraining approaches seem to be effective empirically?
    • Is it possible to change the underlying probabilistic models to make the training easier?
    • Are there other more effective and theoretically sound approaches to learn deep models?
  • Better feature extraction models at each layer.
    • Without derivative and accelerator features in the DBN-HMM, the speech recognition accuracy is significantly reduced.
    • The current Gaussian-Bernoulli layer is not powerful enough to extract important discriminative information from the features.
    • Using a three-way associative model called mcRBM, derivative and accelerator features are no longer needed to produce state-of-the-art recognition accuracy.
    • No reason to believe mcRBM is the best first-layer model for feature extraction either.
    • Theory needs to be developed to guide the search of proper feature extraction models at each layer.
  • More powerful discriminative optimization techniques.
    • Although current strategy of generative pretraining followed by discriminative fine-tuning seems to work well empirically for many tasks, it failed to work for some other tasks such as language identification.
    • The features extracted at the generative pretraining phase seem to describe the underlying speech variations well but do not contain enough information to distinguish between different languages.
    • A learning strategy that can extract discriminative features for language identification tasks is needed.
    • Extracting discriminative features may also greatly reduce the model size needed in the current deep learning systems.
  • Better deep architectures for modeling sequential data.
    • The existing approaches, such as DBN-HMM and DBN-CRF, represent simplistic and poor temporal models.
    • Models that can use DBNs in a more tightly integrated way and learning procedures that optimize the sequential criterion are important to further improve the performance of sequential classification tasks.
  • Adaptation techniques for deep models.
    • Many conventional models such as GMM-HMM have well-developed adaptation techniques that allow for these models to perform well under diverse and changing real-world environments.
    • Without effective adaptation techniques, deep techniques cannot outperform conventional models when the test set differs from the training set, which is common in real applications.

My Review:

  • This is an introductory and easy read on the application of deep learning to the continuously expanding area of signal processing.
  • The deep learning coverage is slightly biased toward DBNs.
  • The convolutional DBN referred to is actually a convolutional neural network.
  • The future directions section is the most interesting part.
