Sentiments in a Romantic Classic – A short introduction to Sentiment Analysis

Apart from my interest in mathematics, I have throughout the years been fascinated by literature, and though I have read hundreds of authors, nothing moves my heart as much as French writers such as Stendhal, Diderot, Mme de Staël or Flaubert. The French are known to be romantic and very much guided by emotions, and so are the works of French writers. We speak of “sentiments”, hence the idea of performing a sentiment analysis on one of my favorite books, Madame Bovary by Gustave Flaubert.

While this masterpiece is on the curriculum of every high-school student in France, it is probably unknown to many, so to satisfy your curiosity here is a brief summary of the book:

Charles Bovary is a mediocre country doctor who falls in love with Emma, the daughter of one of his patients. After the wedding, the couple moves to Tostes, where Charles has his practice. Marriage doesn’t live up to Emma’s romantic expectations, as she had dreamt of love and marriage as a solution to all her problems. She grows bored and depressed when she compares her fantasies to real rural life, and she eventually falls ill. When Emma becomes pregnant, Charles decides to move to a different town in hopes of reviving her health.

In the new town of Yonville, the Bovarys meet Homais, the town pharmacist, a pompous windbag who loves to hear himself speak. Emma also meets Leon, a law clerk, who, like her, is bored with rural life and loves to escape through romantic novels. When Emma gives birth to her daughter Berthe, motherhood disappoints her, and she continues to be despondent. Romantic feelings blossom between Emma and Leon. However, when Emma realizes that Leon loves her, she feels guilty and throws herself into the role of a dutiful wife. Leon grows tired of waiting and departs to study law in Paris, which makes Emma miserable.

At a fair, a wealthy neighbor named Rodolphe declares his love to Emma, seduces her, and they begin a passionate affair. Emma is indiscreet, and gossip spreads about her. Charles, however, suspects nothing. His adoration for his wife and his stupidity combine to blind him to her indiscretions. His professional reputation, meanwhile, suffers a severe blow when he and Homais attempt an experimental surgical technique to treat a club-footed man named Hippolyte and end up having to call in another doctor to amputate the leg. Disgusted with her husband’s incompetence, Emma throws herself even more passionately into her affair with Rodolphe. She borrows money to buy him gifts and suggests that they run off together and take little Berthe with them. Soon enough, though, the jaded and worldly Rodolphe has grown bored of Emma’s demanding affections. Refusing to elope with her, he leaves her. Heartbroken, Emma grows desperately ill and nearly dies.

By the time Emma recovers, Charles is almost ruined by Emma’s debt. Leon then re-enters her life at the opera house in Rouen. This meeting rekindles the old romantic flame between Emma and Leon, and this time the two embark on a love affair. As Emma continues sneaking off to Rouen to meet Leon, she also grows deeper and deeper in debt to the moneylender Lheureux, who lends her more and more money at exorbitant interest rates. She grows increasingly careless in conducting her affair with Leon. As a result, on several occasions, her acquaintances nearly discover her infidelity.

Over time, Emma grows bored with Leon. Not knowing how to abandon him, she instead becomes increasingly demanding. Meanwhile, her debts mount daily. Eventually, Lheureux orders the seizure of Emma’s property to compensate for the debt she has accumulated. Terrified of Charles finding out, she frantically tries to raise the money that she needs, appealing to Leon and to all the town’s businessmen. Eventually, she even attempts to prostitute herself by offering to get back together with Rodolphe if he will give her the money she needs. He refuses, and, driven to despair, she commits suicide by eating arsenic. She dies in horrible agony.

For a while, Charles idealizes the memory of his wife. Eventually, though, he finds her letters from Rodolphe and Leon, and he is forced to confront the truth. He dies alone in his garden, and Berthe is sent off to work in a cotton mill.

Tragic, no? This tragedy plays out in three parts with 9, 15 and 11 chapters, some displaying hope and others utter misery. Even though we may get a feeling for the levels of joy and misery in each chapter by reading them, one would like to have some measure of these emotions.

Over the past few years we have witnessed the development of novel methods in text analysis and, given the ever-growing flow of text-based information, it has become necessary to find ways to determine the moods and feelings of groups of citizens. Tweets and Facebook posts during electoral periods are, for instance, among the most studied subjects in this context, to the point of being regarded as fairly good indicators of election outcomes. Many companies also analyze product reviews to project future sales or to identify new trends. The sky is the limit when it comes to natural-language data.

This blog post is in no way an extensive survey of techniques, and it does not intend to make the reader an expert on the subject. It simply gives a fairly easy example of what can be done. As you might have understood, if nothing else from the summary given above, the tragedy in Madame Bovary is created by the interactions between a number of characters, and a complete sentiment analysis of the book would therefore require tracking the mood of each one of them in every chapter.

We also want to point out that sentiment analysis is far from being the solution to all problems involving language and, as with any automatic analysis of language, there will be errors in the results. As you might imagine, some languages are harder to analyze than others because of their ambiguities. English is complicated enough: take for instance words like “book”, “set” or “out”, each of which has several distinct meanings. Other languages, like Chinese, contain even more ambiguities. Take for instance these three sentences:

很难说 (Hěn nán shuō): “It’s hard to say.” True meaning: I have no idea OR I know, but don’t want to say.
马上到了 (Mǎ shàng dào le): “I’ll be there immediately.” True meaning: I’ll be there sometime in the near future… probably.
应该没问题 (Yīng gāi méi wèn tí): “Should be no problem.” True meaning: Everything is under control OR you’re in deep trouble.

Their meaning is not their direct translation. Sentiment analysis can nevertheless be useful for quickly summarizing the overall feelings in a text when the amount of text is too large for a human reader to read and analyze.

Sentiments are usually divided into two groups, positive and negative, and for most purposes, in particular when studying opinions, such a dichotomy is sufficient. It is however evident that positive and negative feelings are associated with a wide range of emotions. Depending on the purpose of a sentiment analysis, one may want to rely on different dictionaries (lexicons) to perform the task. Several lexicons have been developed, and in this exercise we have chosen to use two of them, namely “bing” and “nrc”.

The first of these lexicons is called the “Bing Opinion Lexicon”. It was developed by Bing Liu, and its words are divided into two groups, one positive and one negative. This lexicon has been incorporated, together with two others, in the R package tidytext.

library(tidytext)
get_sentiments("bing")
# A tibble: 6,788 x 2
          word sentiment
         <chr>     <chr>
 1     2-faced  negative
 2     2-faces  negative
 3          a+  positive
 4    abnormal  negative
 5     abolish  negative
 6  abominable  negative
 7  abominably  negative
 8   abominate  negative
 9 abomination  negative
10       abort  negative
# ... with 6,778 more rows

We shall not enter into a discussion of how the classification was done; there is still an ongoing debate about whether some words belong in one class or the other. The classification is fully binary and might not be very useful when analyzing literary works, but it does a fairly good job when it comes to tweets and political opinion “surveys”. Another useful lexicon is the NRC Emotion Lexicon, which is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

get_sentiments("nrc")
# A tibble: 13,901 x 2
          word sentiment
         <chr>     <chr>
1      abacus     trust 
2     abandon      fear 
3     abandon  negative 
4     abandon   sadness 
5   abandoned     anger 
6   abandoned      fear 
7   abandoned  negative 
8   abandoned   sadness 
9 abandonment     anger
10 abandonment      fear
# ... with 13,891 more rows

The NRC lexicon contains 14,182 unigrams (words) and approximately 25,000 senses, while the Bing lexicon contains approximately 6,800 words. This has, of course, implications when analyzing texts.

Analysis often demands data preparation, and text mining and sentiment analysis are unfortunately not exempt from this time-consuming but necessary step. When it comes to literature, the task is slightly easier than for other written material thanks to Project Gutenberg (https://www.gutenberg.org/) and the gutenbergr package in R, which gives direct access to the project's library. For the sake of this blog, I have also chosen to demonstrate how one would go about dividing a book into its parts and chapters, and how I performed the analysis on each part. Although Project Gutenberg makes some 60k books available, one can very well imagine wanting to analyze a book that is not in its library.

The gutenbergr way

The gutenbergr package allows one to search the Project Gutenberg library, either by displaying its content or by filtering on author or title. For instance, looking up Gustave Flaubert’s works is simply done by

gutenberg_works(author == "Flaubert, Gustave")
gutenberg_id                                                             title                                                                        
1        1253                                                      A Simple Soul 
2        1290                                                      Salammbo 
3        1291                                                      Herodias 
4        2413                                                     Madame Bovary 

# ... with 12 more variables: author , gutenberg_author_id , language , gutenberg_bookshelf

The Gutenberg Project library metadata is made available through the gutenbergr package:

MadameBovary  = gutenberg_metadata %>%
                filter(title == "Madame Bovary")
MadameBovary
# A tibble: 2 x 8
gutenberg_id         title            author         gutenberg_author_id  language
                                                       
1         2413 Madame Bovary Flaubert, Gustave                 574           en
2        14155 Madame Bovary Flaubert, Gustave                 574           fr
# ... with 3 more variables: gutenberg_bookshelf , rights , has_text 

and works are downloadable using the gutenberg_download() function.

A manual preparation of files

In many cases one is not lucky enough to obtain text files or books arranged the way they are in the Project Gutenberg library. Instead, parts are (at best) marked by words such as “Part” or “Chapter” or by consecutive line breaks. To exemplify this, I downloaded a version of Madame Bovary from the Gutenberg library and divided it into its three parts of 9, 15 and 11 chapters in separate *.txt files, organized in a folder on my PC. This is a painstaking process which easily turns into a nightmare for prolific authors such as Peter F. Hamilton, Tolstoy or Stendhal. I have also kept a copy of the entire book in order to perform a global word count and determine the frequency of certain words.
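When the headings are regular, the splitting does not have to be done by hand. The following is a minimal sketch and not the procedure used for this post: it assumes one row per line of text and headings of the form “Part I” and “Chapter One” (an assumption about the file layout, so adjust the patterns to your edition). The toy tibble book is a hypothetical stand-in for the real text.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for a book read line by line (hypothetical content)
book <- tibble(text = c(
  "Part I",
  "Chapter One",
  "We were in class when the head-master came in.",
  "Chapter Two",
  "One night they were awakened by the noise of a horse.",
  "Part II",
  "Chapter One",
  "Yonville is a market-town some miles from Rouen."
))

# Assumed heading patterns; adapt them to the file at hand
part_pattern    <- regex("^part [ivx]+$", ignore_case = TRUE)
chapter_pattern <- regex("^chapter ", ignore_case = TRUE)

chapters <- book %>%
  # cumsum() over a logical vector increases by one at every heading line
  mutate(part = cumsum(str_detect(text, part_pattern))) %>%
  group_by(part) %>%
  # chapter numbering restarts within each part
  mutate(chapter = cumsum(str_detect(text, chapter_pattern))) %>%
  ungroup() %>%
  # drop the heading lines themselves and keep only the body text
  filter(chapter > 0, !str_detect(text, chapter_pattern))

chapters
```

Each remaining line now carries its part and chapter number, so later steps can group by these two columns instead of looping over files.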

The first step in the analysis is to load the text, remove any leading or trailing whitespace from the file name, and strip characters such as the dollar sign, which has a special meaning in R:

library(tidyverse)
library(tidytext)
library(glue)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library("gutenbergr")

# Folder holding the complete book as a single text file
folder   =   "C:\\Users\\sergos\\Desktop\\EGET\\Blogg\\TextMining Blogg\\Complete\\"
files    =   list.files(folder)
fileName =   paste0(folder, files)
fileName =   trimws(fileName)               # trimws removes leading and trailing whitespace from the file name
fileText =   read_file(fileName)            # read the whole file into a single string
fileText =   gsub("\\$", "", fileText)      # strip dollar signs from the text

Word statistics

The next step consists of tokenizing the book, that is, breaking it into its constituent words.

tokens              =   tibble(text = fileText) %>% unnest_tokens(word, text)

We can then do a frequency analysis for each word. It goes without saying that words such as a, to, the, and and so on will be over-represented. These words are considered stop words and can be removed from the frequency count using an anti-join (an exclusion criterion). The list of stop words is predetermined, but it can be customized if needed. One can imagine that words like “like” could be problematic in some contexts and might need special treatment.

MadameBovary_en       = gutenberg_download(2413)   # English translation, Project Gutenberg id 2413
words_MadameBovary_en = MadameBovary_en %>%
  unnest_tokens(word, text)

# Word frequencies after removing stop words
word_counts <- words_MadameBovary_en %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Bar chart of the words that occur more than 80 times
word_counts %>%
  filter(n > 80) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("Words in Madame Bovary") +
  ylab("Word frequency") +
  coord_flip()

[Figure: frequency of the most common words in Madame Bovary]
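Customizing the stop-word list amounts to adding rows to it before the anti-join. The sketch below is only an illustration: the extra words (the character names emma and charles) and the toy sentence are my own assumptions, chosen because names tend to dominate the frequency counts of a novel.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Extend the stop-word list shipped with tidytext with a few custom entries.
# The added words are hypothetical examples, not a recommendation.
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("emma", "charles"), lexicon = "custom")
)

# Toy text standing in for the book
toy <- tibble(text = "Emma told Charles nothing about the arsenic")

toy_words <- toy %>%
  unnest_tokens(word, text) %>%                 # lower-cases and splits into words
  anti_join(custom_stop_words, by = "word")     # drop standard and custom stop words

toy_words
```

The same custom_stop_words table can be passed to the anti-join in the frequency count in place of stop_words.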

Sentiments

As mentioned above, two main lexicons can be used to determine the sentiments in a text. We give here examples for both the bing and nrc lexicons, with visualisations of these sentiments. We have chosen to treat some emotions as negative (sadness, fear and so on) and inverted their scales. Depending on one's nature, these emotions could be viewed as positive, but we chose here to view them as the majority of people would.

bing_word_counts <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%     # keep only words found in the bing lexicon
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

tmp <- bing_word_counts %>%
  filter(n > 10) %>%                                      # words occurring more than 10 times
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%  # negative words get negative counts
  mutate(word = reorder(word, n))

ggplot(data = tmp, mapping = aes(x = word, y = n, fill = sentiment)) +
  geom_col(alpha = 0.8) +
  labs(y = "Positive and negative contributions to sentiment in Madame Bovary", x = NULL) +
  coord_flip()

[Figure: contribution of individual words to positive and negative sentiment, bing lexicon]

We can thus see the contribution of each word to the general sentiment of the book using the bing lexicon. The classification is completely binary (positive and negative), and it is the individual tokens (words) that determine the sentiment. Further analysis based on combinations of unigrams may be necessary. Indeed, combinations of words that are negative by themselves can have a positive meaning and vice versa, just as words in different contexts can have different meanings in terms of sentiment.
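One simple way to look at combinations of words is to tokenize into bigrams and check whether a sentiment word is preceded by a negation. The following sketch is an illustration under stated assumptions: the three-word lexicon, the list of negation words and the toy sentence are all made up for the example, and flipping the polarity after a negation is a crude heuristic rather than the method used in this post.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Tiny hypothetical lexicon and negation list, for illustration only
toy_lexicon <- tibble(
  word      = c("happy", "good", "miserable"),
  sentiment = c("positive", "positive", "negative")
)
negations <- c("not", "no", "never", "without")

toy <- tibble(text = "Emma was not happy and Charles was never miserable")

negated <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%     # consecutive word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  inner_join(toy_lexicon, by = c("word2" = "word")) %>%        # score the second word
  mutate(
    is_negated = word1 %in% negations,
    # flip the polarity when the preceding word is a negation
    adjusted = case_when(
      is_negated & sentiment == "positive" ~ "negative",
      is_negated & sentiment == "negative" ~ "positive",
      TRUE ~ sentiment
    )
  )

negated
```

In this toy sentence a unigram count would score “happy” as positive and “miserable” as negative, while the bigram view reverses both.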

Using the NRC lexicon can actually improve our understanding of the wide range of emotions in our novel. Indeed, as we pointed out in our introduction of the Bing and NRC lexicons, the latter considers not only positive and negative sentiments but also eight emotions. This feature gives us a deeper understanding of the novel's emotional landscape.

As done with the bing lexicon, we begin by classifying the words using the NRC lexicon and mutate negatively loaded words (by sentiments and emotions) to visualize the full range of sentiments in Madame Bovary.

nrc_word_counts <- tokens %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%   # keep only words found in the nrc lexicon
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Sentiments and emotions that are treated as negative
negative_classes <- c("negative", "fear", "disgust", "anger", "sadness")

tmp <- nrc_word_counts %>%
  filter(n > 20) %>%                                            # words occurring more than 20 times
  mutate(n = ifelse(sentiment %in% negative_classes, -n, n)) %>%  # negative classes get negative counts
  mutate(word = reorder(word, n))

ggplot(data = tmp, mapping = aes(x = word, y = n, fill = sentiment)) +
  geom_col(alpha = 0.9) +
  labs(y = "Positive and negative sentiment in Madame Bovary", x = NULL) +
  coord_flip()

[Figure: NRC classification of emotions and sentiments. Only words appearing more than 20 times in the entire book are taken into account.]

[Figure: NRC classification of emotions and sentiments. Only words appearing more than 40 times in the entire book are taken into account.]

 

It seems quite evident that the NRC lexicon is better suited for literary works and that the Bing lexicon should not be used for these purposes. This statement can of course be discussed and debated, but it is my opinion that Bing’s lexicon should only be used to classify opinions in microblogs and political opinion statements, as they involve a narrower range of emotions (even though fear, disgust and trust may be involved).

A full inspection of Madame Bovary

Obtaining the general sentiment of a novel is quite meaningless, as the author of a well-written book most often wishes to take the reader on a journey or an emotional roller coaster. The plot often takes us from hope to despair and back again through landscapes of fear, love, anger and dismay. A better way to do sentiment analysis on a book is therefore to analyze it chapter by chapter, as is done in the code below. Data frames are created to store the consecutive sentiment analyses done with both the NRC and Bing lexicons.

rm(list = ls())
library(tidyverse)
library(tidytext)
library(glue)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library("gutenbergr")

### The book was downloaded from Project Gutenberg as a text file
### and divided into its parts and chapters. All chapters of Parts 1, 2 and 3
### were saved into separate folders
###
### 

files=list.files("C:\\Users\\sergos\\Desktop\\EGET\\Blogg\\TextMining Blogg\\Madame Bovary\\AllPartsAndChapters\\")
NRCanger            = data.frame()
NRCanticipation     = data.frame()
NRCdisgust          = data.frame()
NRCfear             = data.frame()
NRCjoy              = data.frame()
NRCnegative         = data.frame()
NRCpositive         = data.frame()
NRCsadness          = data.frame()
NRCsurprise         = data.frame()
NRCtrust            = data.frame()
NRCsentiment        = data.frame()
NRCWords            = data.frame()

BINGPositiveSentiments   =   data.frame()
BINGNegativeSentiments   =   data.frame()
BINGOverAllSentiments    =   data.frame()

Chapters                 =   as.data.frame(as.vector(files))

for (i in seq_along(files)) {

  fileName = paste0("C:\\Users\\sergos\\Desktop\\EGET\\Blogg\\TextMining Blogg\\Madame Bovary\\AllPartsAndChapters\\", files[i])
  fileName = trimws(fileName)                # remove leading and trailing whitespace from the file name
  fileText = read_file(fileName)             # read the whole chapter into a single string
  fileText = gsub("\\$", "", fileText)       # strip dollar signs from the text
  tokens   = tibble(text = fileText) %>% unnest_tokens(word, text)
  Words    = sapply(gregexpr("\\W+", fileText), length) + 1   # approximate word count of the chapter

  sentimentNRC = tokens %>%
    inner_join(get_sentiments("nrc"), by = "word") %>%   # keep only words found in the nrc lexicon
    count(sentiment) %>%                                 # number of words per sentiment and emotion
    spread(sentiment, n, fill = 0) %>%                   # make the data wide rather than narrow
    mutate(sentiment = positive - negative)              # net sentiment of the chapter

  sentimentBING = tokens %>%
    inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words found in the bing lexicon
    count(sentiment) %>%                                 # number of positive and negative words
    spread(sentiment, n, fill = 0) %>%                   # make the data wide rather than narrow
    mutate(sentiment = positive - negative)              # positive words minus negative words

  # Append the results for this chapter to the accumulating data frames
  NRCanger            = rbind(NRCanger, sentimentNRC$anger)
  NRCanticipation     = rbind(NRCanticipation, sentimentNRC$anticipation)
  NRCdisgust          = rbind(NRCdisgust, sentimentNRC$disgust)
  NRCfear             = rbind(NRCfear, sentimentNRC$fear)
  NRCjoy              = rbind(NRCjoy, sentimentNRC$joy)
  NRCnegative         = rbind(NRCnegative, sentimentNRC$negative)
  NRCpositive         = rbind(NRCpositive, sentimentNRC$positive)
  NRCsadness          = rbind(NRCsadness, sentimentNRC$sadness)
  NRCsurprise         = rbind(NRCsurprise, sentimentNRC$surprise)
  NRCtrust            = rbind(NRCtrust, sentimentNRC$trust)
  NRCsentiment        = rbind(NRCsentiment, sentimentNRC$sentiment)
  NRCWords            = rbind(NRCWords, Words)

  BINGPositiveSentiments = rbind(BINGPositiveSentiments, sentimentBING$positive)
  BINGNegativeSentiments = rbind(BINGNegativeSentiments, sentimentBING$negative)
  BINGOverAllSentiments  = rbind(BINGOverAllSentiments, sentimentBING$sentiment)

}
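The same per-chapter bookkeeping can also be written without a loop, by keeping all chapters in one table and counting by chapter. The sketch below is an alternative formulation and not the code that produced the figures in this post. The chapter labels, the two toy sentences and the four-word lexicon are hypothetical; with real data one would join against get_sentiments("bing") or get_sentiments("nrc") instead.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# One row per chapter; labels and texts are made up for the example
chapters <- tibble(
  chapter = c("Part1_Ch1", "Part1_Ch2"),
  text    = c("A happy and hopeful wedding day",
              "A sad and miserable winter, then one happy evening")
)

# Tiny stand-in for a sentiment lexicon
toy_lexicon <- tibble(
  word      = c("happy", "hopeful", "sad", "miserable"),
  sentiment = c("positive", "positive", "negative", "negative")
)

chapter_sentiment <- chapters %>%
  unnest_tokens(word, text) %>%                  # one row per word, chapter label kept
  inner_join(toy_lexicon, by = "word") %>%       # keep only words found in the lexicon
  count(chapter, sentiment) %>%                  # counts per chapter and sentiment
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%               # a missing class becomes 0 instead of NA
  mutate(sentiment = positive - negative)        # net sentiment per chapter

chapter_sentiment
```

A side benefit of this layout is that a chapter lacking one of the classes still gets a row with a zero, whereas in the loop a missing column such as sentimentNRC$anger would be NULL and the accumulated data frames could fall out of step with the chapter list.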

To visualize the results I chose to use Power BI because of the flexibility in creating graphs. Of course, this can be done (with a little work) in R.
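For readers who prefer to stay in R, a chart of this kind can be sketched with ggplot2. The numbers below are placeholders invented for the example, not results from the book, and the column names are assumptions about how the per-chapter counts might be stored.

```r
library(ggplot2)
library(tibble)

# Placeholder counts for three chapters (made-up values, for illustration only)
chapter_sentiment <- tibble(
  chapter  = factor(c("P1 C1", "P1 C2", "P1 C3"),
                    levels = c("P1 C1", "P1 C2", "P1 C3")),  # keep reading order on the axis
  positive = c(120, 95, 140),
  negative = c(80, 110, 90)
)
chapter_sentiment$sentiment <- chapter_sentiment$positive - chapter_sentiment$negative

# Net sentiment per chapter: bars above zero are positive, below zero negative
p <- ggplot(chapter_sentiment, aes(x = chapter, y = sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0) +
  labs(x = "Chapter", y = "Positive minus negative words")

print(p)
```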

[Figure: The positive and negative word counts in Madame Bovary by part and chapter. The red graph represents the sentiment, defined as the difference between the number of positive and negative words.]

[Figure: The positive sentiment ratio, defined as the number of positive words divided by the sentiment, i.e. Positive / (Positive - Negative).]
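As a small worked example of this definition, the snippet below computes the ratio for three placeholder chapters (the counts are invented for illustration). Since the denominator is zero when a chapter has as many positive as negative words, the sketch returns NA in that case. The share of positive words, Positive / (Positive + Negative), is shown next to it as a bounded alternative; it is my own addition and is not used in the figure.

```r
# Placeholder counts for three chapters (made-up values)
positive <- c(120, 95, 100)
negative <- c(80, 110, 100)

sentiment <- positive - negative

# Ratio as defined in the caption; undefined when positive equals negative
ratio <- ifelse(sentiment == 0, NA_real_, positive / sentiment)

# Alternative (assumption, not used in the post): share of positive words,
# which always lies between 0 and 1
share <- positive / (positive + negative)

data.frame(positive, negative, sentiment, ratio, share)
```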

Surprisingly enough, given the theme of the book, the overall sentiment throughout the book is positive. Given that the book is a closed set of words (i.e. the story is fully told by the author), this might seem counterintuitive. But it is actually a magic trick of Flaubert's! He manages to give some sense of hope to a story that should be anything but joyful, and he plays gracefully with the proverb “what goes around, comes around” to show that truth is always victorious, leaving the reader with a positive feeling, even though some chapters (Part 2 Chapter 13 and Part 3 Chapter 3) might leave the reader with a broken heart.

I have also chosen to visualize the eight emotions of the NRC lexicon in the following graph:

[Figure: The eight emotions of Madame Bovary by part and chapter.]

 

One word of caution is in order in the context of sentiment analysis. We studied here a closed set of words and analyzed the entire content of a novel. One should be aware that in opinion polling using sentiment analysis, as in any other polling, the results depend on the sampling of phrases. Tweets can be manipulated in order to give the impression that a majority of individuals think one way or the other, just as demographic sampling within particular socioeconomic groups can lead to the same effect.

I hope that this blog has inspired you to do some text mining of your own! Until next time, mine away!

Serge DE GOSSON DE VARENNES
