Sentiment analysis with Bayesian features

Movies are great! Sometimes… But what if we want to find out whether one is worth watching? A good start would be to read the reviews of other film enthusiasts on the biggest reviewing platform, IMDB. However, this takes some time… so what about making our computer read the reviews and assess whether they are rather positive or negative?

Thanks to the size of this database, this toy problem has been studied a lot, with different algorithms. Aditya Timmaraju and Vikesh Khanna from Stanford University give a really nice overview of the various methods that can be used to tackle this problem, achieving a maximum accuracy of 86.5% with support vector machines. James Hong and Michael Fang used paragraph vectors and recurrent neural networks to correctly classify 94.5% of the reviews. Today, we explore a much simpler, yet very effective, algorithm proposed by Sida Wang and Christopher D. Manning: the Naive Bayes Support Vector Machine (NBSVM). We will propose a geometric interpretation of this method, in addition to a Python implementation that yields 91.6% accuracy on the IMDB dataset in only a few lines of code.

On sentiment analysis

Sentiment analysis, or more generally text classification, is a very hot topic nowadays. Flagging spam, fake news or toxic content on social media is a thankless task for which very sophisticated algorithms have been designed. As often, deep learning seems to achieve state-of-the-art performance when the dataset is large enough. However, there are still some simple and elegant solutions that do not rely on huge corpora and computing power. In this article, we present one such solution to text classification: simple and fast, yet yielding great performance.

Multinomial Naive Bayes classifier

Bayesian classifiers are a very popular and efficient way to tackle text classification problems. With this method, we represent a text by a vector $f$ of occurrences, in which each element denotes the number of times a certain word appears in this text. The order of the words in the sentence doesn’t matter, only the number of times each word appears. The Bayes formula gives us the probability that a certain text is a positive review (label $Y=1$):

$$P(Y=1\mid f) = \frac{P(f\mid Y=1)\,P(Y=1)}{P(f)}$$

We want to find the probability that a given text $f$ is a positive review ($Y=1$). Thanks to this formula, we only need to know the probability that this review was written, knowing that it is positive ($P(f\mid Y=1)$), and the overall probability that a review is positive, $P(Y=1)$. Although $P(f)$ appears in the formula, it does not really matter for our classification, as we will see.

Probability of a review? The phrase “the probability of a review being written” can be confusing when taken out of context. When speaking of randomness, we generally picture a coin flip or a die roll, and it is fairly intuitive to assign a probability to the outcomes of these actions. But can we do the same thing for a tweet that I’m about to write? The short answer is yes, thanks to statistical language models. These models are just a systematic way of assigning a probability to a sequence of words. Multinomial naive models are probably the simplest ones, but a lot of other, more sophisticated methods exist too.
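As a toy illustration, a unigram model, the simplest such language model, assigns a sentence the product of its per-word probabilities. The vocabulary and probabilities below are made up for the example:

```python
import math

# Toy unigram language model: each word has a fixed probability,
# and a sentence's probability is the product over its words.
# All probabilities here are invented for illustration.
unigram = {"the": 0.07, "movie": 0.01, "was": 0.03, "great": 0.005}

def sentence_log_prob(words, model):
    """Log-probability of a word sequence under a unigram model."""
    return sum(math.log(model[word]) for word in words)

lp = sentence_log_prob(["the", "movie", "was", "great"], unigram)
print(f"log P(sentence) = {lp:.2f}")
```

Working in log-space avoids underflow when sentences get long, which is also why the classifier below works with log-ratios.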

$P(Y=1)$ can be easily estimated: it is the frequency of positive reviews in our corpus (noted $\frac{N^+}{N}$). However, $P(f|Y=1)$ is more difficult to estimate, and we need to make some very strong assumptions about it. In fact, we will consider that the appearance of each word of the text is independent of the appearance of the other words. This assumption is very naive, hence the name of the method.

We now consider that $f\mid Y$ follows a multinomial distribution: for a review of $n$ words, what is the probability that these words are distributed as in $f$? If we denote by $p_i$ the probability that a given word $i$ appears in a positive review (and $q_i$ the probability that it appears in a negative review), the multinomial model assumes that $f\mid Y$ is distributed as follows:

$$P(f\mid Y=1) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!}\,\prod_i p_i^{f_i} \qquad\text{and}\qquad P(f\mid Y=0) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!}\,\prod_i q_i^{f_i}$$

Thus, we can predict that the review is positive if $P(Y=1\mid f) \geq P(Y=0\mid f)$, that is, if the likelihood ratio $L$ is greater than one (the multinomial coefficients cancel, and $\frac{P(Y=1)}{P(Y=0)} = \frac{N^+}{N^-}$, the ratio of positive to negative reviews in the corpus):

$$L = \frac{P(Y=1\mid f)}{P(Y=0\mid f)} = \frac{N^+}{N^-}\,\prod_i \left(\frac{p_i}{q_i}\right)^{f_i} \geq 1$$

Or, equivalently, if its logarithm is greater than zero:

$$\log L = \log\frac{N^+}{N^-} + \sum_i f_i \log\frac{p_i}{q_i} \geq 0$$

Which can be written as:

$$w^T.f + b \geq 0, \qquad\text{with } w_i = \log\frac{p_i}{q_i} \text{ and } b = \log\frac{N^+}{N^-}$$

We see that our decision boundary is linear in the log-space of the features.
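To make this concrete, here is a tiny sketch of the whole classifier in pure numpy, on made-up word counts (Laplace smoothing with $\alpha=1$; the vocabulary and counts are invented for the example):

```python
import numpy as np

# Toy corpus: word counts over a 3-word vocabulary
# ("great", "boring", "plot"); the first two reviews are positive.
# All counts are invented for the example.
f_train = np.array([[2, 0, 1],
                    [1, 0, 2],
                    [0, 2, 1],
                    [0, 1, 2]])
y_train = np.array([1, 1, 0, 0])

alpha = 1.0  # Laplace smoothing
pos_counts = alpha + f_train[y_train == 1].sum(axis=0)
neg_counts = alpha + f_train[y_train == 0].sum(axis=0)
p = pos_counts / pos_counts.sum()  # word probabilities given Y=1
q = neg_counts / neg_counts.sum()  # word probabilities given Y=0

w = np.log(p / q)  # weights: log-ratios of the word probabilities
b = np.log(2 / 2)  # log(N+/N-) = 0 with balanced classes

f_new = np.array([3, 0, 1])  # a new review full of "great"
score = w @ f_new + b
print("positive" if score >= 0 else "negative")  # → positive
```

The decision is entirely carried by the sign of the linear score $w^T.f + b$, exactly as in the derivation above.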

Alternative representation

However, I like to see this formula written differently:

$$w^T.f + b \geq 0$$

is equivalent to

$$1^T.(w\circ f) + b \geq 0$$

where $\circ$ stands for the element-wise product and $1$ for the all-ones vector $(1,1,...,1)$. Now our Bayesian features vector is $(w\circ f)$ and our hyperplane is orthogonal to $1$. However, we can wonder whether this particular hyperplane is the most efficient one for classifying the reviews… and the answer is no! Here is our free lunch: we will use support vector machines to find a better separating hyperplane for these Bayesian features.
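A quick numerical sanity check of this equivalence, with arbitrary random values standing in for the weights and counts:

```python
import numpy as np

# Check that w^T.f + b equals 1^T.(w∘f) + b: the same score, read as
# a fixed hyperplane (normal vector of ones) acting on the Bayesian
# features w∘f.
rng = np.random.default_rng(0)
w = rng.normal(size=5)                      # log-ratios (illustrative)
f = rng.poisson(2.0, size=5).astype(float)  # word counts of one review
b = 0.3

score_linear = w @ f + b                 # w^T.f + b
score_bayes = np.ones(5) @ (w * f) + b   # 1^T.(w∘f) + b
print(np.isclose(score_linear, score_bayes))  # True
```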

SVM or logistic regression?

The geometry of support vector machines

A support vector machine tries to find a separating plane $w^T.f + b = 0$ that maximises the distance between the plane and the closest points. This distance, called the margin, can be expressed in terms of $w$:

$$\text{margin} = \frac{2}{||w||}$$

A point is correctly classified if it is on the right side of the plane, and outside of the margin: a sample is correctly classified if $w^T.f + b > 1$ and $Y=1$, or $w^T.f + b < -1$ and $Y=0$. This can be summarised as $(2y_i - 1)(w^T.f_i+b) > 1$. We want to maximise the margin $\frac{2}{||w||}$, thus the optimisation problem of a support vector classifier is:

$$\min_{w,b}\ \frac{1}{2}||w||^2 \qquad\text{subject to}\qquad (2y_i - 1)(w^T.f_i+b) \geq 1$$

However, if our observations are not linearly separable, such a solution doesn’t exist. Therefore we introduce slack variables $\xi_i$ that allow our model to incorrectly classify some points, at some cost $C$:

$$\min_{w,b,\xi}\ \frac{1}{2}||w||^2 + C\sum_{i=1}^n \xi_i \qquad\text{subject to}\qquad (2y_i - 1)(w^T.f_i+b) \geq 1 - \xi_i,\quad \xi_i \geq 0$$

Logistic regression

In logistic regression, the probability of a label being $1$ given a vector $f$ is:

$$P(Y=1\mid f) = \frac{1}{1+e^{-(w^T.f+b)}}$$

If we add an l2-regularisation penalty to our regression, the objective function becomes:

$$\min_{w,b}\ \lambda||w||^2 + \sum_{i=1}^n \ln\left(1+e^{-y_i(w^T.f_i+b)}\right)$$

Where $\sum_{i=1}^n \ln\left(1+e^{-y_i(w^T.f_i+b)}\right)$ is the negative log-likelihood of our observations (with labels encoded as $y_i \in \{-1, 1\}$). If you like statistics, it is worth noting that adding the l2-penalty is the same as maximising the likelihood with a Gaussian prior on the weights (or a Laplacian prior for an l1-penalty).
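To spell out that remark: with a prior $w \sim \mathcal{N}(0, \sigma^2 I)$ on the weights, maximising the log-posterior recovers exactly the l2-penalised objective, with $\lambda = \frac{1}{2\sigma^2}$:

```latex
\begin{aligned}
\max_{w,b}\ & \ln\Big( P(w)\,\prod_{i=1}^n P(y_i \mid f_i; w, b) \Big) \\
=\ \max_{w,b}\ & -\frac{||w||^2}{2\sigma^2}
  - \sum_{i=1}^n \ln\left(1+e^{-y_i(w^T.f_i+b)}\right) + \text{const} \\
=\ \min_{w,b}\ & \lambda ||w||^2
  + \sum_{i=1}^n \ln\left(1+e^{-y_i(w^T.f_i+b)}\right)
\end{aligned}
```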

Why are they similar?

We define the log-likelihood ratio of an example as

$$t = w^T.f + b$$

With this notation, the cost of a positive example for the support vector machine is:

$$\max(0,\ 1-t)$$

and for the logistic regression with an l2-regularisation penalty:

$$\ln\left(1+e^{-t}\right)$$

If we plot the cost of a positive example for the two models, we see that the two losses are very similar. This is why an SVC with a linear kernel gives results similar to those of an l2-penalised logistic regression.
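To compare the two losses concretely, here is a small numpy sketch evaluating both at a few values of the score $t$:

```python
import numpy as np

# Cost of a positive example as a function of the score t = w.f + b:
# hinge loss for the SVM, log-loss for logistic regression.
t = np.linspace(-2, 3, 6)
hinge = np.maximum(0.0, 1.0 - t)
logistic = np.log1p(np.exp(-t))
for ti, h, l in zip(t, hinge, logistic):
    print(f"t={ti:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
```

Both losses penalise confident mistakes almost linearly, and both vanish (or nearly vanish) for well-classified points.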

In our classification problem, we have 25000 training examples and more than 130000 features, so an SVC would be extremely slow to train. However, a linear classifier with an l2 penalty is much faster to train than an SVC when the number of samples grows, and gives very similar results, as we just saw.

Dual formulations

When the number of samples is smaller than the number of features, as is the case here, one might consider solving the dual formulation of the logistic regression. If you are interested in this formulation, I recommend the paper by Hsiang-Fu Yu, Fang-Lan Huang, and Chih-Jen Lin, which makes a nice comparison between the linear SVC and the dual formulation of the logistic regression, uncovering more similarities between these techniques.

Implementation

From reviews to vectors

The original dataset can be found here. The script IMDB.py loads the reviews as lists of strings for both the train and the test sets:

Feel free to use it: it downloads and unzips the database automatically if needed. We will use scikit-learn’s TfidfVectorizer to transform our texts into vectors. Instead of only counting the words, it returns their frequencies and applies some very useful transformations, such as giving more weight to uncommon words. The vectorizer I used is a slightly modified version of TfidfVectorizer with a custom pre-processor and tokenizer (which keeps exclamation marks, useful for sentiment analysis). By default, it doesn’t only count words but also bi-grams (pairs of consecutive words), as this gives the best results at the cost of a larger feature space. You can find the code here, and use it to run your own test:
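The customised vectorizer itself lives in the article's companion code and is not reproduced here; a stock scikit-learn TfidfVectorizer with the same n-gram setting behaves like this (the two example reviews are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stock TfidfVectorizer standing in for the customised one described
# above (the exclamation-mark-preserving tokenizer is not included).
texts = ["A great movie, great actors!",
         "Boring plot, terrible acting."]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # words and bi-grams
X = vectorizer.fit_transform(texts)
print(X.shape)  # (number of reviews, number of distinct words and bi-grams)
```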

You can tune every parameter of it, just as with a standard TfidfVectorizer. For instance, if you want to keep only individual words and not bi-grams:
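With the standard scikit-learn class, that restriction would look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep single words only: n-grams of length 1 to 1.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
```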

From now on, we will only use:
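The exact call is in the companion code; with a stock TfidfVectorizer, the matching configuration would be the following (note that `min_df=5` is a document-frequency threshold, a close stand-in for a raw occurrence threshold):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Words and bi-grams, dropping terms seen in fewer than 5 documents.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
```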

This will keep all words and bi-grams that appear more than 5 times in our corpus. This is a lot of terms: our feature space has 133572 dimensions, for 25000 training points! Now that we know how to transform our reviews into vectors, we need to choose a machine learning algorithm. We talked about support vector machines; however, they scale very poorly and are too slow to be trained on 25000 points with more than 100000 features. We will thus use a slightly different model, the dual formulation of an l2-penalised logistic regression, which, as explained above, is very similar to a support vector classifier.

The model

As seen before, we define

$$p = \alpha + \sum_{i:\, y_i=1} f_i \qquad\text{and}\qquad q = \alpha + \sum_{i:\, y_i=0} f_i$$

for some smoothing parameter $\alpha$ (the sums run over the positive and the negative training reviews respectively). The log-ratio $R$ is defined as:

$$R = \log\left(\frac{p\,/\,||p||_1}{q\,/\,||q||_1}\right)$$

where $||.||_1$ stands for the $L^1$ norm.

At last, the Bayesian features used to fit our classifier will be

$$R \circ f$$

Of course, we will use a sparse matrix to save memory (our vectors are mostly zeros). Wrapped in some python code, this gives:
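The original snippet is not shown here; below is a sketch of the computation with scipy sparse matrices, following the definitions of $p$, $q$ and $R$ above (the toy counts are invented):

```python
import numpy as np
from scipy.sparse import csr_matrix

def bayesian_features(X, y, alpha=0.1):
    """Scale each feature of X by the Naive Bayes log-count ratio R."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    p = alpha + np.asarray(X[pos].sum(axis=0)).ravel()  # positive counts
    q = alpha + np.asarray(X[neg].sum(axis=0)).ravel()  # negative counts
    r = np.log((p / p.sum()) / (q / q.sum()))           # L1-normalised ratio
    return X.multiply(r.reshape(1, -1)).tocsr(), r      # element-wise R∘f

# Tiny demonstration: 4 documents, 3 features.
X = csr_matrix(np.array([[2., 0., 1.],
                         [1., 0., 2.],
                         [0., 2., 1.],
                         [0., 1., 2.]]))
y = np.array([1, 1, 0, 0])
X_bayes, r = bayesian_features(X, y)
```

The multiplication by a row vector scales every column $j$ by $R_j$, which is exactly the element-wise product $R \circ f$ applied to each review, while keeping the matrix sparse.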

And finally (I chose the parameters $\alpha = 0.1$ and $C=12$ by cross-validation):
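The fitting code itself is in the companion code; here is a sketch of what it looks like with scikit-learn, on random data standing in for the Bayesian features (the solver and `dual=True` mirror the dual-formulation discussion above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Dual-form, l2-penalised logistic regression with the article's C=12.
# Random data stands in for the Bayesian features R∘f of the reviews.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
y_train = (X_train[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression(C=12, dual=True, solver="liblinear")
clf.fit(X_train, y_train)
print(f"Toy accuracy: {clf.score(X_train, y_train):.3f}")
```

On the actual IMDB Bayesian features, this setup produces the accuracy reported below.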

Accuracy: 0.91648


That was a pretty painless way of achieving 91.6% accuracy!

One last word: Natural language processing is a hot topic, and transfer learning is becoming a viable solution. I’m not expecting simple models such as this one to perform better than deep models forever, even on small datasets. So keep this algorithm in mind as a strong baseline, but don’t forget to keep up with the literature and try other things!