What is Sentiment Analysis?
Sentiment Analysis is the process of identifying opinion-bearing information in text and categorizing it into tones such as Positive, Negative, and Neutral.
In a more practical sense, if we were dealing with political comments, we would want the sentence ‘great job. may the lord give him strength.’ to be labelled Positive, and the sentence ‘we know who you are and nothing can expect good from your hands.’ to be labelled Negative.
Getting machines to do this is not an easy task, and it involves skills from different fields of knowledge such as Computer Science, Statistics, etc. Moreover, Sentiment Analysis is a common Natural Language Processing (NLP) task that Data Scientists use to analyse the feelings (emotions, opinions, attitudes, thoughts, etc.) behind words.
Why Sentiment Analysis?
Business: In the marketing field, companies use it to understand consumer feelings towards a product or a brand, how people respond to their campaigns, why consumers don’t buy certain products, and to develop marketing strategies.
Politics: In the political field, it is used to track political views and to detect consistency or inconsistency between statements and actions at the government level. It can also be used to predict election results.
Sentiment analysis allows businesses to quickly process and extract actionable insights from huge volumes of text without having to read all of it.
So, in this article we will use a dataset containing a collection of YouTube comments and apply Machine Learning to classify the sentiment of each comment as positive or negative.
Dataset Description
Given a training sample of YouTube comments and their labels, where the label ‘Positive’ denotes that a comment is positive and ‘Negative’ denotes that it is negative, our objective is to predict the sentiment of comments in the given dataset.
Id: The id associated with each comment in the dataset
Comment: The comments collected from YouTube videos in the political domain
Sentiment: The label; ‘Positive’ denotes positive sentiment and ‘Negative’ denotes negative sentiment.
Step 01: Pre-processing and Cleaning
Text comments contain many slang words, punctuation marks, links, special characters, numbers, block letters, and terms which don’t carry much weight in the text.
Therefore, pre-processing of the text data is an essential step, as it makes the raw text ready for model training. It becomes easier to extract information from the text and apply machine learning algorithms to it. Skipping this step leaves noisy and inconsistent data, which leads to wrong predictions.
The pre-processing steps are given below.
Lowercasing
Lowercasing all the text data, although often overlooked, is one of the simplest and most effective forms of text pre-processing.
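A minimal sketch of this step, assuming the comments are loaded into a pandas DataFrame with a ‘Comment’ column (the file name here is hypothetical):

```python
import pandas as pd

# Load the comments; the file name is a placeholder for your own dataset
df = pd.read_csv("youtube_comments.csv")

# Lowercase every comment and keep the result in a new column
df["clean_comment"] = df["Comment"].str.lower()
```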
Remove Special Characters
The text data contains tags, HTML entities, punctuation, non-alphabetic characters, and other characters which may not be part of the language. The general method for such cleaning involves regular expressions, which can be used to filter out most of the unwanted text.
Emoticons, which are made up of non-alphabetic characters, also play an important role in sentiment analysis. “:), :(, -_-, :D, xD”, all of these, when processed correctly, can help produce a better sentiment analysis.
Given below is a user-defined function to remove unwanted text patterns from texts. It takes two arguments: the original string of text and the pattern of text that we want to remove from the string. The function returns the same input string without the given pattern. We will use this function to remove the pattern ‘@user’ from all the texts in our data.
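A minimal sketch of such a function, assuming it simply substitutes the pattern with an empty string using Python’s re module (the article’s original implementation is not shown):

```python
import re

def remove_call_sign(input_text, pattern):
    """Return input_text with every occurrence of the given regex pattern removed."""
    return re.sub(pattern, "", input_text)
```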
Now apply the cleaning and pre-processing functions to the text data. Note that we pass “@[\w]*” as the pattern to the remove_call_sign function. It is a regular expression which will pick up any word starting with ‘@’.
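Applying the cleaning could look like the sketch below; the clean_comment column carries over from the lowercasing step, and the second substitution (dropping everything that is not a letter or space) is an assumed example of the special-character removal described above:

```python
# Remove @user handles from every comment
df["clean_comment"] = df["clean_comment"].apply(
    lambda text: remove_call_sign(text, r"@[\w]*"))

# Strip numbers, punctuation, and other special characters (assumed cleaning rule)
df["clean_comment"] = df["clean_comment"].str.replace(r"[^a-z\s]", " ", regex=True)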
Here you can see the pre-processed comments. Only the important words in the texts have been retained, and the noise (numbers, punctuation, and special characters) has been removed.
Step 02: Encoding the label
Since the response variable, Sentiment, is a categorical variable, we need to convert it to numeric values. Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values: LabelEncoder is used to encode the label ‘Positive’ as 1 and ‘Negative’ as 0.
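A sketch of the encoding step with scikit-learn’s LabelEncoder (the DataFrame column names are assumptions carried over from the earlier sketches):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Labels are assigned in alphabetical order: 'Negative' -> 0, 'Positive' -> 1
df["label"] = le.fit_transform(df["Sentiment"])
```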
Step 03: Vectorization
In order for this text data to make sense to our machine learning algorithm, we need to convert each comment to a numeric representation, a step we call vectorization (feature extraction).
Bag of Words Model (BoW)
Here each word is assigned a unique number, and any document can be encoded as a fixed-length vector whose length is the size of the vocabulary of known words. The value at each position in the vector is filled with the count or frequency of the corresponding word in the encoded document.
1. Convert text to word count vectors using CountVectorizer
The CountVectorizer tokenizes a collection of text documents and builds a vocabulary of known words.
An encoded vector is returned with the length of the entire vocabulary and an integer count of the number of times each word appears in the document.
Below is an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.
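A small illustrative sketch; the two sample sentences are made up for demonstration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great job may the lord give him strength",
        "nothing good can be expected from his hands"]

vectorizer = CountVectorizer()
vectorizer.fit(docs)             # tokenize and build the vocabulary of known words
print(vectorizer.vocabulary_)    # mapping of word -> column index

vector = vectorizer.transform([docs[0]])
print(vector.toarray())          # integer count for each vocabulary word
```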
2. Convert text to word frequencies using TfidfVectorizer
Since word counts are very basic, one issue with simple counts is that some words like “the” will appear many times, and their large counts will not be very meaningful in the encoded vectors.
An alternative to this is to calculate the word frequencies by TF-IDF method.
The TfidfVectorizer will tokenize documents, learn the vocabulary and the inverse document frequency weightings, and allow you to encode new documents. Alternatively, if you already have a fitted CountVectorizer, you can use it with a TfidfTransformer to calculate the inverse document frequencies and start encoding documents.
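A sketch of encoding the cleaned comments with TfidfVectorizer, assuming the clean_comment column from the earlier steps:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
# Learn the vocabulary and IDF weights, then encode every comment
X = tfidf.fit_transform(df["clean_comment"])
print(X.shape)   # (number of comments, vocabulary size)
```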
Step 04: Build Classifier
Now the dataset has been transformed into a format suitable for modelling. Logistic regression is a good baseline model to use here, and hyper-parameter tuning was done to select the best parameter values.
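A sketch of the training step, assuming a simple train/validation split and a grid search over the regularisation strength C (the exact split and parameter grid used here are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Hold out part of the data for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, df["label"], test_size=0.2, random_state=42)

# Tune the regularisation strength C with cross-validated grid search
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
print("Validation accuracy:", grid.score(X_valid, y_valid))
```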
The logistic regression model achieved an accuracy of about 94% on the validation set. This model can now be used to make predictions on the test data.