The following blog post is a joint venture between Fredrik SCHALLING and Hampus PETTERSSON. For more information about the authors, please see our Authors section.
Customer experience has long been measured by actively reaching out to customers, through phone calls or questionnaires, to gauge their current satisfaction. Now good businesses can get acknowledged without spending a penny on marketing, through word of mouth on steroids: social media. The emergence of social media has enabled customers, both happy and furious, to express their opinions for the world to see, and that is exactly what we do. We share opinions like never before! This is a radical change to the rules of competition, since with only a few clicks on the phone we now have access to a huge opinion database.
So the question arises: how does a company stay afloat and use this gold mine of voluntarily given, freely accessible feedback? Through opinion mining, that is, scanning social media and aggregating information related to a certain topic. The first essential part of opinion mining is to acquire data, often through the available APIs of social media services or manually through web scraping. The second step is to analyze the acquired data in order to extract a comparable measure from the massive amount of information gathered, which is often text.
The field concerned with how a computer processes and analyzes natural language is commonly referred to as Natural Language Processing (NLP), a subfield of linguistics and artificial intelligence. Traditionally, computers require humans to communicate with them in a precise and highly structured manner, for example in a programming language. A programming language is just that: a common set of rules that makes communication between computers and humans easy for the computer to understand (and most often challenging for the human…). NLP applications aim to flip the coin and make natural language processable for machines. Interpreting human language, however, is a real challenge, since the meaning is often ambiguous and depends on several complex parameters such as social context, slang and what the receiver supposedly already knows about the speaker. Not convinced? Try saying the following sentence with emphasis on a different word every time: “I never said she stole my money.”
Let’s attack the matter from a different point of view. Computers are really good at processing simple and structured data. The idea is that if we can create a simple and structured representation of natural language, we can make use of computers’ speed to analyze large amounts of information very efficiently. Early methods relied on rule-based representations created by hand, whereas current methods are statistical. Statistical methods work by giving a computer a large set of typical real-world examples of language, possibly including annotations of meaning, and then letting the computer find the patterns itself. This is what is called Machine Learning: the computer is given data to find patterns in, in contrast to traditional programming, where the computer is given the rules.
What we have is an approach that enables computers to analyze large amounts of information stored in natural language, that is, text. We now have the two parts: a large amount of information in text (social media) and an approach to make computers understand it (NLP). This connects back to opinion mining, one of the many use cases for NLP: checking the polarity of the current voice of the customer or user. We will now continue by creating an algorithm that can check for polarity in tweets. To train the model (find patterns in text) we use an annotated dataset where people have stated whether each tweet is positive, negative or neutral.
The dataset used in this experiment consists of 14,000 tweets that mention at least one of the US airlines United Airlines, US Airways, American Airlines, Southwest Airlines, Delta Air Lines or Virgin America. The data was scraped from Twitter during 2015 and contributors were asked to label the tweets as positive, neutral or negative. The data was made publicly available through Kaggle (https://www.kaggle.com/crowdflower/twitter-airline-sentiment). We read the dataset using the Python package Pandas and show some example entries.
import pandas as pd

tweets_raw = pd.read_csv('data/Tweets.csv')
tweets_raw.head()
Here airline_sentiment is the label that the contributor gave the tweet, airline_sentiment_confidence is the confidence the contributor had in that label, and negativereason_confidence is the confidence the contributor had in the reason they gave for a negative tweet. To give a hint of the content of the dataset we also check the distribution of the labels and of the featured companies.
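One way to check such distributions is with pandas' `value_counts` and `crosstab`. The sketch below runs on a tiny hand-made frame whose rows are invented for illustration. The column name `airline_sentiment` is taken from the text above, while `airline` is assumed here to be the column holding the company name.

```python
import pandas as pd

# Toy stand-in for tweets_raw; the rows are invented for illustration.
toy = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive", "negative"],
    # Assumption: the company name is stored in a column called "airline"
    "airline": ["United", "Delta", "United", "Virgin America", "United"],
})

# Number of tweets per label, and the share of each label
label_counts = toy["airline_sentiment"].value_counts()
label_share = toy["airline_sentiment"].value_counts(normalize=True)

# Labels broken down per airline (rows: airline, columns: sentiment)
per_airline = pd.crosstab(toy["airline"], toy["airline_sentiment"])

print(label_counts)
print(per_airline)
```

Applied to the real data, the same calls would be made on `tweets_raw` instead of `toy`.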
Something to keep in mind moving on is the large skew towards negative tweets; this is probably something useful to address. Concerning the different companies, the distribution is relatively even except for Virgin America, which is significantly less represented. If we were to compare the different companies this could be an issue. With this annotated data we can now create a model for determining the sentiment of short texts, which can then be used in a variety of applications, not limited to tweets.
Before training a model on the data we clean it by removing the parts we do not want to affect the model. This could be information present in the training sample which we know does not represent the full population. An example of information we want to exclude from the training sample is the tags of other accounts, the “@names”. We do this since we want our model to generalize to tweets in general and not take into account whether a specific user or company is tagged. To find these tags we can search the text with the regular expression “@[^\s]+”, which in English reads as “an at sign followed by one or more non-whitespace characters”. All the matches can temporarily be changed to some string we remember, like “A_USER_TAG”.
import re

tweet = "@VirginAmerica plus you've added commercials to the experience... tacky."
# Replace every @mention with a fixed placeholder
tweet = re.sub(r'@[^\s]+', 'A_USER_TAG', tweet)
tweet
>> A_USER_TAG plus you've added commercials to the experience... tacky.
The next step of our preprocessing is to extract a list of the words we want to train our model on. If we just split the tweets on every blank space, we get a list with informative words but also words that are needed to form sentences yet are not very informative by themselves, for example “the”, “is”, “at”, “which” and “on”. There exists a multitude of already compiled lists of such so-called stop words. We import one from the Python package “Natural Language Toolkit”, to which we also add punctuation and “A_USER_TAG”.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

tweet_tokens = word_tokenize(tweet)
tweet_tokens
>> ['A_USER_TAG', 'plus', 'you', "'ve", 'added', 'commercials', 'to', 'the', 'experience', '...', 'tacky', '.']

# Filter out stop words, punctuation and the user-tag placeholder
stop_words = set(stopwords.words('english') + list(punctuation) + ['A_USER_TAG'])
tweet_tokens = [token for token in tweet_tokens if token not in stop_words]
tweet_tokens
>> ['plus', "'ve", 'added', 'commercials', 'experience', '...', 'tacky']
An alternative to excluding the words that are not very informative by themselves is to include them in their context. To include the context, one dimension is added to the features: instead of training on single words only, sequences of words are used. These sequences are called n-grams, where n is the number of words in the sequence. With this method the common stop word “very” can have a valuable impact on the sample, for example when the 2-gram is added: “…very good…” -> [“very”, “good”, “very good”]
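To make the n-gram idea concrete, here is a minimal sketch in plain Python. The helper `ngrams` is a hypothetical name introduced for illustration only; it is not part of NLTK or Scikit-learn.

```python
def ngrams(tokens, n_max):
    """Return all 1-grams up to n_max-grams, shorter sequences first."""
    grams = []
    for n in range(1, n_max + 1):
        # Slide a window of n tokens over the list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = ngrams(["very", "good"], n_max=2)
print(features)  # ['very', 'good', 'very good']
```

The number of features grows quickly with n, which is one reason n is usually kept small.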
Let’s create the model!
We will train a logistic regression model with a one-vs-all approach to enable our model to predict multiple classes. To avoid turning this post into an academic literature study, we will not discuss the choice of model type and only briefly mention the essentials. In short, one-vs-all means that we train one binary logistic regression model per class, each separating that class from all the others, and then take the class whose model gives the highest score as the most likely label. For the interested reader: [StatQuest with Josh Starmer](https://youtu.be/yIYKR4sgzI8)
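As a small illustration of what the one-vs-rest wrapper in Scikit-learn does, the sketch below fits it on invented toy data with three classes. It shows that one binary model is kept per class and that the predicted label is the class whose model scores highest. The data points and the name `toy_ovr` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 2-D data with three well-separated clusters (invented for illustration)
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 0], [5, 1], [6, 0],
              [0, 5], [1, 5], [0, 6]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

toy_ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y_toy)

# One binary "this class vs. everything else" model per class
print(len(toy_ovr.estimators_))

# Scores of the binary models for one sample; the prediction is the
# class whose model gives the highest score
sample = np.array([[0.5, 0.5]])
scores = toy_ovr.decision_function(sample)
print(scores.argmax(axis=1), toy_ovr.predict(sample))
```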
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

lr = LogisticRegression()
ovr = OneVsRestClassifier(lr)
Next, we preprocess the data. Here we implement an n-gram approach with sequences of up to three words (1-grams, 2-grams and 3-grams), using the open-source Python packages Natural Language Toolkit and Scikit-learn.
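from nltk.tokenize import TweetTokenizer
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: the post does not show how `tokenizer` is defined;
# NLTK's TweetTokenizer is used here as one possible choice.
tokenizer = TweetTokenizer()

# TF-IDF features over 1-grams, 2-grams and 3-grams
vectorizer = TfidfVectorizer(ngram_range=(1, 3), tokenizer=tokenizer.tokenize)
full_text = list(tweets_raw['text'].values)
vectorizer.fit(full_text)
train_vectorized = vectorizer.transform(tweets_raw['text'])

# Encode the string labels as integers (sorted: negative=0, neutral=1, positive=2)
labels = tweets_raw['airline_sentiment']
enc = preprocessing.LabelEncoder()
enc.fit(labels)
y = enc.transform(labels)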
Now it is time to fit the model to our training set. To be able to assess the performance of the trained model we split the dataset into one training set (80 %) and one set for testing only (20 %).
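from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

x_train, x_val, y_train, y_val = train_test_split(train_vectorized, y, test_size=0.2)
ovr.fit(x_train, y_train)

# Note: scikit-learn's signature is classification_report(y_true, y_pred).
# The calls below pass the predictions first, which is how the output shown
# was produced, so the precision and recall columns are swapped and
# "support" counts predicted labels. Accuracy is unaffected by the order.
print(classification_report(ovr.predict(x_val), y_val))
print("Accuracy ", accuracy_score(ovr.predict(x_val), y_val))
>>
     precision  recall  f1-score  support
0         0.99    0.74      0.85     2512
1         0.25    0.81      0.38      180
2         0.42    0.83      0.56      236
Accuracy 0.7517076502732241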
The performance shows typical behavior for a model trained on unbalanced data. The accuracy is high, and so is the precision reported for the dominating class (negative tweets). The underlying issue is the incentive to predict the dominating class, which gives a high number of false negatives on the minority classes. Note that the report was called with the predictions as its first argument, whereas scikit-learn expects the true labels first, so the precision and recall columns are swapped: the low values of 0.25 and 0.42 listed under precision for the minority classes are in fact their recall, which is exactly where false negatives show up. To improve the balance in the sample we remove some tweets with a negative sentiment. Since we have a feature stating the confidence of the label, we use it to remove all negative tweets below a chosen confidence threshold for the sentiment (100 %) and a threshold for the reason for the negativity (80 %). Then we train and evaluate again.
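The filtering step itself is not shown in the post, so the following is only a sketch of how it could look with pandas. It runs on a small invented frame and assumes that a negative tweet is kept only when it meets both thresholds. The column names come from the dataset description above, and `tweets_filtered` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for tweets_raw; the rows are invented for illustration.
toy = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "negative", "neutral", "positive"],
    "airline_sentiment_confidence": [1.0, 0.65, 1.0, 0.7, 1.0],
    "negativereason_confidence": [0.9, 0.65, 0.5, None, None],
})

is_negative = toy["airline_sentiment"] == "negative"

# Assumption: a negative tweet is kept only if it passes both thresholds
confident = (
    (toy["airline_sentiment_confidence"] >= 1.0)
    & (toy["negativereason_confidence"] >= 0.8)
)

# Keep every non-negative tweet, and only the confidently labeled negative ones
tweets_filtered = toy[~is_negative | confident]
print(tweets_filtered["airline_sentiment"].value_counts())
```

After a filter like this, the vectorization and the train/test split would have to be redone on the filtered frame before fitting again.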
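# x_train, x_val, y_train and y_val are assumed to have been rebuilt from the
# filtered data (vectorize and split again); that step is not shown here.
lr_filter = LogisticRegression()
ovr_filter = OneVsRestClassifier(lr_filter)
ovr_filter.fit(x_train, y_train)

# As before, the predictions are passed first, so the precision and recall
# columns are swapped relative to scikit-learn's (y_true, y_pred) convention.
print(classification_report(ovr_filter.predict(x_val), y_val))
print("Accuracy ", accuracy_score(ovr_filter.predict(x_val), y_val))
>>
     precision  recall  f1-score  support
0         0.98    0.76      0.86      927
1         0.61    0.81      0.69      293
2         0.62    0.84      0.72      292
Accuracy 0.7876984126984127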
We see a large increase in the precision column for the two minority classes (from 0.25 and 0.42 to 0.61 and 0.62) at the loss of only 0.01 for the dominant class, and even a small increase in the recall column for the dominant class (from 0.74 to 0.76). An explanation for the latter could be that the confidence threshold makes this class more distinct than before, since the vague cases could previously have introduced noise into the model. We can also observe a small increase in accuracy, from 0.75 to 0.79, despite training on less data!
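Since the discussion leans on precision and recall, a small hand computation may help keep the two apart. Precision for a class is the share of predictions of that class that are correct; recall is the share of true members of that class that were found. The helper `precision_recall` and the labels below are invented for illustration.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted
    fp = sum(t != label and p == label for t, p in pairs)  # predicted, but wrong
    fn = sum(t == label and p != label for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A classifier that leans heavily towards the dominant class "neg"
true_labels = ["neg", "neg", "neg", "neu", "neu", "pos", "pos", "pos"]
pred_labels = ["neg", "neg", "neg", "neg", "neu", "neg", "neg", "pos"]

print(precision_recall(true_labels, pred_labels, "neg"))
print(precision_recall(true_labels, pred_labels, "pos"))
```

In this toy case, over-predicting the dominant class gives it perfect recall but mediocre precision, while the minority class gets perfect precision but low recall.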
Conclusion and discussion
In this example we show the importance that preprocessing of the data has for the quality of the trained model. Even removing samples from a dataset can improve the result, since the data was skewed towards negativity, i.e. if the model was uncertain, a guess on negative was a safe bet. It all boils down to the use case for a model: a model biased towards negative tweets will find most of the negative tweets, but possibly at the cost of not giving a representative general opinion. Hence, defining the use case is highly relevant when creating and tuning the model.
There exist a multitude of ways to address ambiguities in tweets, some almost philosophical. Does a misspelt word represent something useful (tweet might be written in excitement or fury) or should you try to spellcheck text? Before looking into the data itself it is valuable to think of parameters that could induce bias. When do you typically write a review? When you are disappointed or when the experience was just fine?
After making these choices you can end up with an algorithm able to determine the opinion expressed in written text. With such an algorithm one could include (and make use of) a free-text field in questionnaires with a large number of respondents without having an entire team decoding the answers. Or, with access to social media posts, one could read how the public opinion on a matter changes over time and quantitatively show rising trends.