Posted by Mohamad Ivan Fanany. This post summarizes and reviews a deep learning paper on large-scale sentiment classification (sentiment analysis): Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach
- With the rise of social media such as blogs and social networks, reviews, ratings, and recommendations are rapidly proliferating.
- The ability to automatically filter them is a current key challenge for businesses to sell their wares and identify new market opportunities.
- Why is evaluating Deep Learning for sentiment analysis interesting?
- There exist generic concepts that characterize product reviews across domains.
- Deep Learning can disentangle the underlying factors of variation.
- Domain adaptation for sentiment analysis becomes a medium for better understanding deep architectures.
- Even though Deep Learning has not yet been evaluated for domain adaptation of sentiment classifiers, several very interesting results have been reported on other tasks involving textual data, beating the previous state of the art in several cases (Salakhutdinov and Hinton, 2007; Collobert and Weston, 2008; Ranzato and Szummer, 2008).
- Reviews can span so many different domains that it is difficult to gather annotated training data for all of them.
- Sentiment classification (or sentiment analysis): determine the judgment of a writer with respect to a given topic based on a given textual comment.
- Tackle the problem of domain adaptation for sentiment classifiers: a system is trained on labeled reviews from one source domain but is meant to be deployed on another.
- Sentiment analysis is now a mature machine learning research topic (Pang and Lee, 2008).
- The large variety of data sources makes it difficult and costly to design a robust sentiment classifier.
- Reviews deal with various kinds of products or services for which vocabularies are different.
- Data distributions are different across domains -> Solutions:
- Learn a different system for each domain:
- High cost to annotate training data for a large number of domains,
- Cannot exploit information shared across domains.
- Learn a single system from the set of domains.
- The problem of training and testing models on different distributions is known as domain adaptation (Daumé III and Marcu, 2006).
- Learning setups relating to domain adaptation have been proposed before and published under different names.
- Daumé III and Marcu (2006) formalized the problem and proposed an approach based on a mixture model.
- Ways to address domain adaptation:
- Instance weighting (Jiang and Zhai, 2007): in which instance-dependent weights are added to the loss function
- Data representation: seek a representation under which the source and target domains present the same joint distribution of observations and labels.
- Formal analysis of the representation change (Ben-David et al. (2007))
- Structural Correspondence Learning (SCL): makes use of the unlabeled data from the target domain to find a low-rank joint representation of the data.
- Ignoring the domain difference (Dai et al., 2007): consider source instances as labeled data and target ones as unlabeled data.
- Dai et al.'s (2007) approach is very close to self-taught learning (Raina et al., 2007), in which one learns from labeled examples of some categories as well as unlabeled examples from a larger set of categories.
- Like the proposed method in this paper, Raina et al. (2007) relies crucially on the unsupervised learning of a representation.
Inspiration from previous works:
- RBMs with (soft) rectifier units were introduced in (Nair and Hinton, 2010). The authors use such units because they have been shown to outperform other non-linearities on a sentiment analysis task (Glorot et al., 2011).
- Support Vector Machines (SVMs) are known to perform well on sentiment classification (Pang et al., 2002). The authors use a linear SVM with squared hinge loss. This classifier is eventually tested on the target domain(s).
- Deep learning for extracting a meaningful representation in an unsupervised fashion.
- Deep learning for domain adaptation of sentiment classifiers.
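The squared hinge loss mentioned above can be sketched as follows; this is a minimal illustration of the loss itself, not the authors' implementation, and the weights and data are invented:

```python
import numpy as np

def squared_hinge_loss(w, X, y):
    """Mean squared hinge loss for a linear classifier.
    Labels y must be in {-1, +1}; w has one weight per feature (bias omitted)."""
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins) ** 2)

# Toy example with made-up data (not from the paper).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([2.0, -2.0])
print(squared_hinge_loss(w, X, y))  # 0.0: both margins are >= 1
```

Compared to the standard hinge loss, the squared variant penalizes margin violations quadratically and is differentiable at the margin boundary.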
- Existing domain adaptation methods for sentiment analysis focus on the information from the source and target distributions, whereas the proposed unsupervised learning (SDA) can use data from other domains, sharing the representation across all those domains.
- Such representation sharing reduces the computation required to transfer to several domains, because only a single round of unsupervised training is required, and allows the method to scale well with large amounts of data and consider real-world applications.
- Existing domain adaptation methods for sentiment analysis map inputs into a new or an augmented space using only linear projections. The code learned by the proposed SDA is a non-linear mapping of the input and can therefore encode complex data variations.
- Rectifier non-linearities have the nice ability to naturally provide sparse representations (with exact zeros) for the code layer; such representations are well suited to linear classifiers and are efficient in terms of computational cost and memory use.
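The exact-zero sparsity of rectifier units can be seen in a tiny numpy example (the pre-activation values below are invented for illustration):

```python
import numpy as np

# Hypothetical pre-activation values for a code layer (not from the paper).
pre_activations = np.array([-1.2, 0.4, -0.3, 2.1, -0.7, 0.0, 1.5])

# Rectifier non-linearity: max(0, x).
codes = np.maximum(0.0, pre_activations)

# Negative pre-activations become exact zeros, giving a sparse code.
sparsity = np.mean(codes == 0.0)
print(codes)     # → [0.  0.4 0.  2.1 0.  0.  1.5]
print(sparsity)  # → 0.5714285714285714 (4 of 7 units exactly zero)
```

Unlike sigmoid or tanh units, which output small-but-nonzero values, the rectifier yields true zeros, so the resulting code vectors can be stored and multiplied as sparse vectors.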
- The training and testing data are sampled from different distributions.
- Deep Learning algorithms learn intermediate concepts between the raw input and the target.
- These intermediate concepts could yield better transfer across domains.
- Exploit the large amounts of unlabeled data across all domains to learn these intermediate representations.
- Amazon data: More than 340,000 reviews regarding 22 different product types and for which reviews are labeled as either positive or negative.
- Challenges: heterogeneous, heavily unbalanced and large-scale.
- A smaller and more controlled version has been released:
- Only 4 different domains: Books, DVDs, Electronics and Kitchen appliances.
- 1000 positive and 1000 negative instances for each domain
- A few thousand unlabeled examples.
- The positive and negative examples are also exactly balanced
- The reduced version is used as a benchmark in the literature.
- The paper presents the first published results on the large Amazon dataset.
- Structural Correspondence Learning (SCL) for sentiment analysis (Blitzer et al. 2007)
- Multi-label Consensus Training (MCT) approach which combines several base classifiers trained with SCL (Li and Zong 2008).
- Spectral Feature Alignment (Pan et al., 2010)
- Stacked Denoising Auto-encoder (Vincent et al., 2008).
- Access to unlabeled data from various domains, but access to the labels for one source domain only.
- Two-step procedure:
- Learn, in an unsupervised fashion, higher-level features from the text reviews of all the available domains using a Stacked Denoising Autoencoder (SDA) with rectifier units (i.e., max(0, x)).
- Train a linear classifier on the transformed labeled data of the source domain.
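The two-step procedure can be sketched in plain numpy. This is a deliberately simplified stand-in for the paper's setup: a single denoising layer with tied weights trained by per-sample SGD on squared reconstruction error, and a least-squares classifier in place of the linear SVM; all sizes, labels, and data below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary bag-of-words data standing in for reviews (the paper uses
# 5000-dimensional vectors; tiny sizes here for illustration).
n_docs, n_features, n_hidden = 100, 50, 20
X = (rng.random((n_docs, n_features)) < 0.1).astype(np.float64)

# Step 1: one denoising autoencoder layer with rectifier units.
W = rng.normal(0, 0.01, (n_features, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_features)
lr, mask_p = 0.01, 0.8  # masking-noise probability; 0.8 was usually optimal

for epoch in range(30):
    for x in X:
        x_tilde = x * (rng.random(n_features) >= mask_p)  # masking noise
        h = np.maximum(0.0, x_tilde @ W + b)              # rectifier code
        x_hat = h @ W.T + c                               # tied-weight decoder
        err = x_hat - x                                   # reconstruction error
        dh = (err @ W) * (h > 0)                          # backprop through ReLU
        W -= lr * (np.outer(x_tilde, dh) + np.outer(err, h))
        b -= lr * dh
        c -= lr * err

# Step 2: train a linear classifier on the learned codes of the labeled
# source domain (least squares here as a stand-in for the linear SVM).
codes = np.maximum(0.0, X @ W + b)
y = rng.integers(0, 2, n_docs) * 2 - 1                    # fake +/-1 labels
w_clf, *_ = np.linalg.lstsq(codes, y.astype(np.float64), rcond=None)
preds = np.sign(codes @ w_clf)
```

At deployment time the same encoder (W, b) is applied to target-domain reviews, and the classifier trained on source-domain codes is reused unchanged.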
- Preprocessing follows (Blitzer et al., 2007):
- Each review text is treated as a bag-of-words and transformed into binary vectors encoding the presence/absence of unigrams and bigrams.
- Keep the 5000 most frequent terms of the vocabulary of unigrams and bigrams in the feature set.
- Split train/test data.
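The preprocessing steps above can be sketched without any NLP library; the reviews and the vocabulary size of 5 below are invented for illustration (the paper keeps 5000 terms):

```python
from collections import Counter
from itertools import chain

# Toy reviews standing in for Amazon review text.
reviews = [
    "great book great story",
    "terrible plot not great",
    "battery life is great",
]

def terms(text):
    """Unigrams and bigrams of a whitespace-tokenized review."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

# Keep only the most frequent terms across all reviews.
counts = Counter(chain.from_iterable(terms(r) for r in reviews))
vocab = [t for t, _ in counts.most_common(5)]

# Binary vectors encoding presence/absence of each retained term.
def vectorize(text):
    present = set(terms(text))
    return [1 if t in present else 0 for t in vocab]

X = [vectorize(r) for r in reviews]
```

Note that the vectors are binary (presence/absence), not term counts, following (Blitzer et al., 2007).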
- Baseline: a linear SVM trained on the raw data
- The proposed method is also a linear SVM, but trained and tested on data whose features have been transformed by the SDA.
- The hyper-parameters of all SVMs are chosen by cross-validation on the training set.
- Explored an extensive set of hyper-parameters:
- The masking noise probability (its optimal value was usually high: 0.8);
- The standard deviation of the Gaussian noise for upper layers;
- The size of the hidden layers (5000 always gave the best performance);
- The L1 regularization penalty on the activation values;
- The learning rate.
- All algorithms were implemented using the Theano library (Bergstra et al., 2010).
- Basic Metric:
- Transfer error: the test error obtained by a method trained on the source domain and tested on the target domain.
- In-domain error: the test error when the source domain and the tested domain are the same.
- Test error: the test error obtained by the baseline method, i.e., a linear SVM on raw features, trained and tested on the raw features of the target domain.
- Transfer loss: the difference between the transfer error and the in domain baseline error.
- For a large number of heterogeneous domains with different difficulties (as with the large Amazon data), the transfer loss is not a satisfactory measure.
- Advanced metric:
- Transfer ratio: it also characterizes the transfer, but is defined by replacing the difference with a quotient. It is less sensitive to large variations of in-domain errors, and thus better suited to averaging.
- In-domain ratio.
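The metrics above reduce to two one-line formulas; the error values in the example are invented, not results from the paper:

```python
def transfer_loss(transfer_error, in_domain_baseline_error):
    """Difference between the transfer error and the in-domain baseline error."""
    return transfer_error - in_domain_baseline_error

def transfer_ratio(transfer_error, in_domain_baseline_error):
    """Quotient version: less sensitive to large variations of in-domain
    errors, hence better suited to averaging over heterogeneous domains."""
    return transfer_error / in_domain_baseline_error

# Hypothetical example: trained on one domain, tested on another.
t_err, base_err = 0.20, 0.15
print(transfer_loss(t_err, base_err))   # → 0.05 (up to float rounding)
print(transfer_ratio(t_err, base_err))  # → about 1.33
```

A transfer ratio of 1.0 means transferring costs nothing relative to training in-domain; values above 1.0 quantify the degradation multiplicatively, which averages more meaningfully across domains of very different difficulty.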
- Compare with the results from the original papers of the three compared methods (SCL, MCT, SFA), which were obtained using the whole feature vocabulary and on different splits of identical sizes:
- Results are consistent regardless of the train/test splits as long as the set sizes are preserved.
- All baselines achieve similar performances.
- Compare with the results from a Transductive SVM (Sindhwani and Keerthi, 2006) trained in a standard semi-supervised setup: the training set of the source domain is used as the labeled set, and the training sets of the other domains as the unlabeled set:
- The unsupervised feature extractor is made of a single layer of 5000 units.
- Sentiment classifiers trained with this high-level feature representation clearly outperform state-of-the-art methods on a benchmark composed of reviews of 4 types of Amazon products.
- This method scales well and allowed us to successfully perform domain adaptation on a larger industrial-strength dataset of 22 domains.
- The paper demonstrated that a Deep Learning system based on Stacked Denoising Auto-Encoders with sparse rectifier units can perform an unsupervised feature extraction which is highly beneficial for the domain adaptation of sentiment classifiers.
- Experiments have shown that linear classifiers trained with this higher-level learnt feature representation of reviews outperform the current state-of-the-art.
- Furthermore, the paper successfully performs domain adaptation on an industrial-scale dataset of 22 domains, significantly improving generalization over the baseline and over a similarly structured but purely supervised alternative.
- This paper demonstrates nicely that learnt high-level features produced by deep learning lead to lower classification error than state-of-the-art classifiers.
- One important motivation for using the high-level features is the domain adaptation problem, which can be specifically addressed by Deep Learning.
- It would be very nice if the authors put the data online so that we could also test it using different Deep Learning algorithms and techniques.