The customer is always right….

….a brute force method to analyze customer reviews

Two travellers headed for the same final destination but booked on different flights, one is satisfied and the other can’t stop swearing about the service and that the airline couldn’t do it’s job properly. Who hasn’t heard this one before? This could of course be isolated unfortunate events and one should not base decisions about which airline to fly with on anecdotal evidence…..but what if this is something reccurant? And how bad or good is a given company compared to all others? Airlines fly, ater all, millions of passengers every year and you cannot expect everyone to be happy.

Happy costumers

This blog is not about pointing fingers at a particular airline (even though this recently happened to me and my spouse, but we are well behaved individuals and will not name the culprit) but rather to show how one can use brute force methods to quick retrieve reviews on comparable companies. This is turn gives enough information on how costumer friendly companies are….or aren’t. The reviews themselves might not be sufficient to close the case on a particular airline (in this example) as different regions in the world experience competition between airlines in different ways and that some nationalities are more prone to make their complaints heard than others.

I do not either intend to rank all airlines in the world as there are over 5000 airlines and will restrict to major airlines with over 100 destinations. It would be unfair for to compare Emirate with some obscure little airline transporting people between Stockholm and Sveg on small 4-seat aircrafts, although the ride on the latter might be an experience in itself.

So, lets start of by determining which airline companies actually have at least 100 destinations. As in many cases, our dear friend Wikipedia has the answer. There is a total of 35 airlines that have at least 100 companies. As this blog intends to give methods to retrieve internet data it comes as natural to start of lightly by showing how to scrape tables from any site that allows scraping. How do we know if a site allows scraping you might ask? One needs to look at the robots.txt for wikipedia (or any site you wish to scrape your data from). In this case

https://en.wikipedia.org/robots.txt

The list contains all the engines that are forbidden to be use to retrieve data, often become their speed and hits per second exceed a given threshold and thus cause the service to slow down, others because of their commercial purposes.  Whenever the mention

User-agent: *
Allow: /

is given you are free to scrape, at least as long as you do not cause mayhem for the site. So, how do we do this? The easiest way (not copying and pasting the table with Excel) is to use the R-package rvest. We will here skip the technical details about how data is stored on servers and how your browser knows how to display data in a for the user useful manner. We need however know what to scrape on the site and since we are only interested in obtaining the following table

table

we need to know which part of the html code generates this table. The easiest way to do so is to right-click the webpage and choose to Inspect the content of the page, at which stage hovering over the code will show which web content is generated by the code.

hoveringtable

Once this is identified you simply need to right-click and copy the Xpath in order to feed it the R-code associated with rvest to retrieve the entire table.

hoveringtableCopyXpath

library("rvest")
url <- "https://en.wikipedia.org/wiki/List_of_airlines_with_more_than_100_destinations"
airlines <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table') %>%
html_table()
airlines <- airlines[[1]]

airlines = as.data.frame(airlines)

Et voilà!

airlinesDF

To retrieve reviews from any site we obviously need to acquire the URL of every company. My choice is to use Trustpilot, a service gathering customer reviews of all companies and brands. The idea is to batch import all reviews using the url from Trustpilot (https://www.trustpilot.com/review/) concatenated with the airline url (e.g. airchina.com), i.e., https://www.trustpilot.com/review/airchina.com. This thus implies that I need to obtain every single airliner’s url. There are several ways to do this, one of them of course being to execute the tedious task of googeling all the addresses….which I wouldn’t recommend if your list of brands is long and you wish to keep your sanity. A straight forward approach would be to use Google Sheets endowed with the Blockspring Add-on as described in the following video:

The only requirement to do this is to be connected to Bings API, which is easy to do. Since the function “I am feeling lucky!” is used, one must still check for misstakes. In our case, one mistake was introduced out of the 35 companies I am looking at but I swiftly replaced it with the right URL.

To avoid having too much or too little data to handle, I chose to eliminate a few companies. For some airlines, a sentiment analysis might be unnecessary because they are notoriously costumer unfriendly and I therefore chose not to include them. Besides, these companies often do not have the same business ideology as major international airlines and a comparision might be unfair or skewed. Other companies may have more than 100 destinations but are restricted to a very little geographical area. As I only consider reviews written in English there might be to few comments written in that language, the rest being in the languages of the destinations served by the airline. I therefore excluded these companies and retained a total of 22 airlines out of the initial 35.

Now to the fun part of retrieving the reviews. First of all, one needs to append the airline name and URL to the Truspilot URL for that particular airline. At Trustpilot, all reviews for companies named COMPANY_NAME are found under the link

https://www.trustpilot.com/review/URL(COMPANY_NAME).

For instance

https://www.trustpilot.com/review/www.britishairways.com for British Airways.

airlines = as.data.frame(airlines)
AirlinesURL = fread("C:**********/Airlines/AirlinesURL.csv")
AirlinesURL = as.data.frame(AirlinesURL)
AirlinesConc = merge(airlines,AirlinesURL, by="Airline")
AirlinesConc$Trustlink = paste("https://www.trustpilot.com/review/", AirlinesConc$URL, sep="")
AirlinesConc$Trustlink[AirlinesConc$Trustlink == "https://www.trustpilot.com/review/www.koreanair.com"]="https://www.trustpilot.com/review/www.korean-air.vn"

The last line of the code above illustrates how to change the URL-address as mentioned above.

It is important before creating any data-retrieval code to know which information is of relevance for the purposes of your analysis.

Informationpages

As you can see from the picture above, there are several pieces of data that can be useful. The most relevant for our purposes is the customer review and perhaps the reviewer’s id and the rating given to the airline by this specific individual. Also not that there for British Airways exists 58 pages containing review. For the purposes of my blog I am only interested in the reviews and the reviewer’s id to do a personal ranking of the airlines. I could retrieve the star ratings and rank the airlines by them, but the issue here is that many of them may have the same rating. Also, the ratings are given by each individual and is a rather uncertain. What is te difference between 2 and 3 stars? Between 3 and 4? I want to determine if the overall rating of the airline is reflected in the content of the reviews. To do so, I will use Sentiment Analysis.

First of all, I need to determine how many pages of review each airline has. As before,  we need to determine. If we inspect the content of the webpage we see that the tag needed to point to the right page is given by pagination-page.

pages

To determine the total number of pages for all companies we create a function that determines that number:

url2 = AirlinesConc$Trustlink[i]
   company_name = AirlinesConc$Airline[i] 
  get_last_page = function(html){
                  pages_data = html %>% 
                  html_nodes('.pagination-page') %>% 
                 html_text() 
pages_data[(length(pages_data)-1)] %>% 
unname() %>% 
as.numeric() 
}

where i goes from 1 to N (N being the number of airlines). When scraping the first page and all subsequent pages we need to feed our program with their respective URL. We therefore need to create a list of all URLs for the given airlines reviews as a string:

first_page = read_html(url2)
latest_page_number = get_last_page(first_page)
All_pages = str_c(url2, '?page=', 1:latest_page_number)

Now that we know where to look for reviews we need to identify the correct htlm node for the reviewer information and that particular individual’s review:

consumerinforeviewtag

Once this information has been identified I created two functions to retrieve that data:

Reviews = function(html){
html %>% 
html_nodes('.review-info__body__text') %>% 
html_text() %>% 
str_trim() %>% 
unlist() 
}

Reviewer_info = function(html){
html %>% 
html_nodes('.consumer-info') %>% 
html_text() %>% 
str_trim() %>% 
unlist()
}

These functions are designed to read the information on individual URLs and need to be incorporated into a new function gathering all gathered information into one single data frame. I chose the package tibble to perform this task, thus having one row for each review.

Data_table = function(html, company_name){
Reviews = Reviews(html)
Reviewer_info = Reviewer_info(html)

combined_data = tibble(reviewer = Reviewer_info, review = Reviews) 
combined_data %>% 
mutate(company = company_name) %>% 
select(company,reviewer, review)
}

I finally chose to write a function which scans a list of URLs associated to one single company and scrapes the relevant data from each page:

get_data_from_url = function(url, company_name){
html = read_html(url)
Data_table(html, company_name)
}
get_data_from_url(url = url2,company_name)

scrape_write_table = function(url, company_name){
first_page = read_html(url)
latest_page_number = get_last_page(first_page)
list_of_pages = str_c(url,'?page=', 1:latest_page_number)
list_of_pages %>% 
map(get_data_from_url, company_name) %>% 
bind_rows() %>% 
write_tsv(str_c(company_name,'.tsv')) 
}
scrape_write_table(url2, company_name)
}

Note that the data the retrieved information is saved into a tab-separated file *.tsv rather that a comma separated fine *.csv. I chose to use this format to avoid confusion since the reviews themselves can contain commas. Let’s see now how the result is stored.

americanAirline = read_tsv("American Airlines.tsv")
head(americanAirline, 50)
AmericanAirlines = as.data.frame(americanAirline)

which gives

americanAirlinesdump

As our goal is to analyze the reviews of all 22 companies it makes life a tiny bit easier if all our data is combined into one single data frame.

list.files(pattern=".tsv$")
list.filenames = list.files(pattern=".tsv$")
#list.filenames
list.data = list()
Company.data = list()

combined = data.frame("company" = as.character(),"reviewer"= as.character(),"review"= as.character())
for (i in list.filenames){
combined = rbind(combined,read_tsv(i))
}
AllCompanies = as.data.frame(combined)

Furthermore, as we are interested in having the aggregated reviews, i.e., we are not intersted in the opinion of an individual reviewer (although that might also be of interest), we need to collapse the review column by airlines. The best tool for this is to use dplyr,

Reduced_AllCompanies = AllCompanies[,c("company","review")]
Collapsed            = Reduced_AllCompanies %>%
                       group_by(company) %>%
summarise_all(funs(paste(., collapse = " ")))

which now allows us to perform a sentiment analysis by company name. Once this step is done, one needs to perform the necessary for all the companies at once. Repeating the steps need therefore to be done simultaneously or at least in a loop. This will be show below. But before we get there, we need to choose what kind of sentiment analysis we want to get. One option is a simple count of positive and negative words using the bing-dictionary (see my previous blog post Sentiments in a romantic classic for a detailed despcription). This approach can be sufficient to classify or rank the different airlines. A lot of feelings, both negative and positive, can be associated to reviews. Indeed, a lot of people write reviews using a wide range of emotional words. To capture the different nuances of a review one might want to use a dictionary with more than just a bipolar spectrum (positive/negative). To this end, the NRC-dictionary can come to good use. This dictionary contains eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). To know more, read my previous blog post Sentiments in a romantic classic. For my analysis I chose to use both dictionaries.

I give the complete code to perform the sentiment analysis:

list.files(pattern=".tsv$")
list.filenames = list.files(pattern=".tsv$")
#list.filenames
list.data = list()
Company.data = list()
combined = data.frame("company" = as.character(),
"reviewer"= as.character(),
"review"= as.character())

for (i in list.filenames){
combined = rbind(combined,read_tsv(i))
}
AllCompanies = as.data.frame(combined)


AllCompaniesnames = as.vector(unique(AllCompanies$company))
Reduced_AllCompanies = AllCompanies[,c("company","review")]

Collapsed = Reduced_AllCompanies %>%
group_by(company) %>%
summarise_all(funs(paste(., collapse = " ")))

## Create empty dataset to gather the counts of different emotions and sentiments

NRC_anger              = data.frame()
NRC_anticipation       = data.frame()
NRC_disgust            = data.frame()
NRC_fear               = data.frame()
NRC_joy                = data.frame()
NRC_negative           = data.frame()
NRC_positive           = data.frame()
NRC_sadness            = data.frame()
NRC_surprise           = data.frame()
NRC_trust              = data.frame()
NRC_sentiment          = data.frame()
NRC_Words              = data.frame()
BINGPositiveSentiments = data.frame()
BINGNegativeSentiments = data.frame()
BINGOverAllSentiments  = data.frame()

Airlines = as.vector(AllCompaniesnames) 
for(i in 1:length(Airlines)){

## Read all the data scraped on trustpilot and saved in tsv-files
## Use glue-package to this end

fileName = glue("C:/********/Airlines/", list.filenames[i],sep = "")
fileName = trimws(fileName) # trimws erases trailing spaces in file names
fileText = glue(read_file(fileName))
fileText = gsub("\\$", "", fileText) # removes dollar signs as they are reserved for variables definitions 

### Tokenize the scripts (each word is considered as a singleton

tokens = data_frame(text = fileText) %>% unnest_tokens(word, text)
Words = sapply(gregexpr("\\W+", fileText), length) + 1
sentimentNRC = tokens %>%
inner_join(get_sentiments("nrc")) %>%      # Get only sentiment words using the nrc dictionnary 
count(sentiment) %>%                       # count the number of positive & negative words
spread(sentiment, n, fill = 0) %>%         # made data wide rather than narrow
mutate(sentiment = positive - negative)

## Use BING-dictionary to get positive and negative sentiments 

sentimentBING = tokens %>%
inner_join(get_sentiments("bing")) %>%      # Get only sentiment words using the bing dictionnary 
count(sentiment) %>%                        # count the number  of positive & negative words
spread(sentiment, n, fill = 0) %>%          # made data wide rather than narrow
mutate(sentiment = positive - negative)     # of positive words - # of negative owrds

## Append to NRC_emotion files and BING-sentiment files the data for each airline
NRC_anger        = rbind(NRC_anger, sentimentNRC$anger)
NRC_anticipation = rbind(NRC_anticipation, sentimentNRC$anticipation)
NRC_disgust      = rbind(NRC_disgust, sentimentNRC$disgust)
NRC_fear         = rbind(NRC_fear, sentimentNRC$fear)
NRC_joy          = rbind(NRC_joy, sentimentNRC$joy)
NRC_negative     = rbind(NRC_negative, sentimentNRC$negative)
NRC_positive     = rbind(NRC_positive, sentimentNRC$positive)
NRC_sadness      = rbind(NRC_sadness, sentimentNRC$sadness)
NRC_surprise     = rbind(NRC_surprise, sentimentNRC$surprise)
NRC_trust        = rbind(NRC_trust, sentimentNRC$trust)
NRC_sentiment    = rbind(NRC_sentiment, sentimentNRC$sentiment)
NRC_Words        = rbind(NRC_Words, Words)

BINGPositiveSentiments = rbind(BINGPositiveSentiments,sentimentBING$positive)
BINGNegativeSentiments = rbind(BINGNegativeSentiments,sentimentBING$negative)
BINGOverAllSentiments = rbind(BINGOverAllSentiments,sentimentBING$sentiment)
}

## Create master dataset
NRC_ALL = cbind(Airlines,NRC_Words,NRC_sentiment,NRC_trust,NRC_surprise,NRC_sadness,NRC_positive,NRC_negative,NRC_joy,NRC_fear,NRC_disgust,NRC_anticipation,NRC_anger,BINGPositiveSentiments,BINGNegativeSentiments,BINGOverAllSentiments)
## Rename columns
colnames(NRC_ALL) = c("Airlines","NRC_Words","NRC_sentiment","NRC_trust","NRC_surprise","NRC_sadness","NRC_positive","NRC_negative","NRC_joy","NRC_fear","NRC_disgust","NRC_anticipation","NRC_anger","BINGPositiveSentiments","BINGNegativeSentiments","BINGOverAllSentiments")

NRC_ALL

Before analyzing this data one needs to deal with the fact that some airlines have far more reviews than others and the sentiment count between them aren’t comparable. We therefore put these count in relation with the total number of written words for each airline. As this blog is mainly a way to to show how techniques can be used to retrieve data I have not taken to the time consuming task to look for other information than the reviews. Of course, to do a complete analysis one would want to find data on which destinations all these airlines serve, the number of passengers cariied, the number of kilometer per passenger flewn and so forth. We shall just stick to the sentiments expressed by passengers and rank the airline according to these data.

If we simply look at the overall sentiment (positive – negative) using the Bing-doctionary we can see that that SAS and KLM are regarded as the best airlines while Air France and Vueling are seen as the ones waking the most negative feelings.

newBing

Note that only SAS, KLM and JetBlue are the only airlines giving reviewers an overall positive sentiment. This is somewhat surprising and should make anyone suspicious about using review as a measure of how good airlines are. Can it be so that only people having had negative experiences take the time to write reviews? Can there be other factors such as the kind of language used when writing posts? People from different cultures or nationalities might have different ways to express feelings of unsatisfaction.

Another observation that needs to be made is that the ranking depends partly on the sentiment dictionary used. Indeed, the classification of words into sentiment classes is somewhat arbitry. Usinge the NRC-dictionary gives a slightly different classification.

OVERALLNRC

In this case, United Airlines and SAS are the airlines inspiring overall positive sentiments while Ethiopian Airlines and Vueling mostly negative sentiments. Air France, which was ranked 21 in the bing dictionary is now ranked 13. One needs therefore to check the content of these dictionaries and perhasp make contextual modification.

On the subject of sentiments and context, one might want to be careful using the NRC dictionary when classifying airlines. Indeed, “fear” is not a sentiment an airline wants to be associated with and one needs to really check what words in the dictionary are classified as sentiments of fear.

library(tidytext)
sentiments_all = as.data.frame(sentiments)
sentiment_fear = sentiments_all[which(sentiments_all$sentiment=="fear"),]

Words like crash, accident, catastrophe are among the 1476 word long list, but so are Shanghai, socialist and stormy. Why Shanghai (a lovely and charming city) and stormy would be classified as words inspiring fear is for me a mystery but it is worthwhile to notice that they should be, at least in this context, excluded from the list.

Another indication that the dictionary used needs further investigation is the fact tthat the ratiosbetween positive and negative words for the reviews differ depending on whether one uses Bing or NRC, thus altering the ranking.

RatioNRCRatioBING

Now, ranking any entities is strongly dependent on what is considered as important variables, e.g. comfort, economy, time tables, meals and so forth. So, scanning the internet for airline rankings will give you several rankings and not one single airline will be given as the best airline by all ranking systems. What we have given here is a ranking system based on reviews written in english. A better ranking by customers would have to deal with all reviews in all the avaible languages, something that is beyond the scope of this blog. The technique, however, are exactly the same but require some work.

Note that this could be used by any company wanting to investigate their products reputation compared to other companies manufacturing the same type of product. This can also be useful to evaluate the development of products and their ture value for consumers. Why produce a certain functionality in a product if the customer has no use for it or if only some markets find it useful? Can reviews be a predictor for future number of products to be manufactured, i.e., can it be useful to plan production lines and orders of necessary components used in manufacture?

 

 

 

 

 

 

One thought on “The customer is always right….

Add yours

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress.com.

Up ↑

%d bloggers like this: