Estimation and Visualization of Canton-Level Moods using Hybrid Fine-Grained Emotion Analysis Methodologies

Rationale

This website provides an online platform for capturing the sentiments and emotions expressed on Twitter in the Swiss cantons. The platform uses Twitter as a data source and combines sentiment analysis, hybrid fine-grained emotion analysis methodologies, and visualization techniques to provide users with an interactive interface for data exploration.

Data Aggregation

To produce meaningful results and let users zoom in to a specific period of time and visualize the emotions and sentiments expressed in tweets for each canton, we aggregated the data in five ways. First, we aggregated the data by season (Winter, Spring, Summer, and Fall) from 2012 to 2016: Winter covers December, January, and February; Spring covers March, April, and May; Summer covers June, July, and August; Fall covers September, October, and November. Second, we aggregated the data by day: we chose the year 2012 and aggregated emotions and sentiments for each canton on a particular day. Third, we aggregated the data by month, from 2012 to 2016. Fourth, we aggregated the data by week, from 2012 to 2016. Fifth, we aggregated the data by day of the week (Mondays, Tuesdays, etc.) from 2012 to 2016. This lets users explore the emotions and sentiments expressed in the tweets for each canton at different time granularities.
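The five aggregations above can be sketched as a single key function over a tweet's date. This is an illustrative stdlib-only version; the exact season-to-year convention (e.g. which winter December belongs to) is an assumption here, not taken from the pipeline:

```python
from datetime import date

# Month -> season, matching the groupings described above.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}

def aggregation_keys(d: date) -> dict:
    """Return the five bucketing keys for a tweet's date.

    December is bucketed with its own calendar year's winter here;
    the platform's actual convention may differ (assumption)."""
    return {
        "season": (d.year, SEASONS[d.month]),
        "day": d.isoformat(),                 # e.g. "2012-07-14"
        "month": (d.year, d.month),
        "week": tuple(d.isocalendar()[:2]),   # (ISO year, ISO week)
        "weekday": d.strftime("%A"),          # "Monday", "Tuesday", ...
    }
```

Grouping tweets by any one of these keys reproduces the corresponding view on the maps.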

How to Use?

This platform provides interactive maps that give users a comprehensive overview of the sentiment and emotion analysis results for the Swiss cantons, with the opportunity to zoom in to a specific canton and visualize the distribution of emotions and sentiments during a specific time.

The platform allows users to tune their view of this large amount of information and interactively reduce its inherent complexity, possibly hinting at meaningful patterns and correlations between moods/emotions and time. Each page with a map includes a dropdown menu to specify the aggregation method; changing it rebuilds the slider and the map. Play/pause buttons control the animation, and clicking on a canton reveals more information about it. In addition to the polarity and emotion scores, conveyed through different colors on a diverging scale and through emoticons, density information (the number of tweets) is shown for each canton as circles: the bigger the circle, the more tweets were posted in that canton.

Canton mapping

For canton mapping, we use the GeoJSON file of Switzerland to extract the boundaries of the Swiss cantons, store them in range trees, and use a computational geometry library (Shapely) to query the locations of tweets and map them to cantons efficiently. Finally, we show the number of tweets for each canton; Geneva has the highest number of tweets! Related repository.
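Shapely performs the point-in-polygon queries for us, but the underlying test can be sketched with a plain ray-casting check against a single GeoJSON ring. The `geneva_ring` square below is a hypothetical, heavily simplified boundary for illustration only, not real canton data:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: does (lon, lat) fall inside the polygon?

    `polygon` is a list of (lon, lat) vertices, as found in a
    GeoJSON ring. Counts how many edges a horizontal ray crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the ray's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical simplified "boundary" (a square), purely illustrative.
geneva_ring = [(6.0, 46.1), (6.3, 46.1), (6.3, 46.3), (6.0, 46.3)]
```

In the real pipeline, Shapely's prepared geometries plus the range-tree index avoid testing every tweet against every canton.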





Language detection

We use langdetect 1.0.7, a "language detection library ported from Google's language-detection," to detect the language of each tweet we have in Switzerland. Before applying the language detection algorithm, we clean each tweet's text by removing URLs, mentions, hashtags, and numbers. We then analyse the distribution of tweet languages in Switzerland and observe that the top 9 languages used are French, English, German, Italian, Spanish, Portuguese, Turkish, Arabic, and Dutch. We did the same analysis for some cantons, such as Vaud, where the top 5 languages used in tweets are French, English, Portuguese, Spanish, and German. We also analysed the distribution of languages for some cantons considering only 4 languages (English, French, German, and Italian). Related notebook.
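The pre-detection cleaning step can be sketched with stdlib regexes; the exact patterns used in the notebook may differ, so treat this as an illustrative version:

```python
import re

def clean_for_langdetect(text: str) -> str:
    """Strip URLs, @mentions, #hashtags and numbers before language
    detection, since such tokens mislead n-gram based detectors."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    text = re.sub(r"\d+", " ", text)           # numbers
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```

The cleaned string is then passed to `langdetect.detect`.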

Fine Grained Emotion Recognition

We used Plutchik's Wheel of Emotions as our model for fine-grained emotion analysis. It consists of 8 emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. We use the NRC Lexicon, which enumerates a set of representative keywords for each emotion category in addition to polarity (positive and negative). It contains on average 2,000 words per category and has been translated into more than 40 languages.

Sentiment Analysis:

We apply a simple approach to analyse the sentiment of tweets. We use the TextBlob library, which supports 3 languages (English, French, and German), the three most frequent languages in Switzerland. We do some simple cleaning before using the TextBlob sentiment analyzer, such as removing http(s) links, hashtags, and mentions. Related notebook.
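As a minimal sketch of the flow, with a toy polarity lexicon standing in for TextBlob's per-language analyzers (an assumption; TextBlob itself returns a polarity score in [-1, 1]):

```python
import re

# Toy lexicon standing in for TextBlob's analyzers (assumption).
POLARITY = {"happy": 0.8, "love": 0.5, "sad": -0.6, "terrible": -1.0}

def tweet_polarity(text: str) -> float:
    """Clean the tweet (links, mentions, hashtags), then average the
    polarities of scored words; 0.0 (neutral) if none are scored."""
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    scores = [POLARITY[w] for w in text.split() if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0
```

With the real library, the cleaned text is simply passed to `TextBlob(text).sentiment`.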

Emotion Analysis:

We apply a simple approach to capture the emotions expressed in tweets. First, we clean each tweet by removing http(s) links, hashtags, and mentions; then we remove stop words. After that, we use the NRC Emotion Lexicon to look up the emotion mapping for each word in a tweet and generate an emotion vector with 8 entries per tweet, where each entry represents a single emotion. Related notebook.
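The lookup-and-count step can be sketched as follows; the tiny `NRC` dictionary below is an invented excerpt standing in for the full NRC Emotion Lexicon:

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

# Invented excerpt standing in for the NRC Emotion Lexicon
# (word -> set of associated emotions).
NRC = {"happy": {"joy", "anticipation", "trust"},
       "afraid": {"fear"},
       "angry": {"anger", "disgust"}}

def emotion_vector(tokens):
    """8-entry count vector: how many tweet words map to each emotion."""
    vec = [0] * len(EMOTIONS)
    for tok in tokens:
        for emo in NRC.get(tok, ()):
            vec[EMOTIONS.index(emo)] += 1
    return vec
```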


Preliminary Data Cleaning and Pre-Processing:

After parsing and dealing with encoding issues, we tokenize the tweets into sets of words using RegexpTokenizer, which splits based on whitespace and punctuation. We perform further data cleaning, such as stop-word removal and lowercase conversion, at a later stage, since the accuracy of syntactic parsing and named entity recognition highly depends on it. Related notebook.
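The tokenization is equivalent to matching runs of word characters; assuming a `\w+` pattern (one common way to configure nltk's RegexpTokenizer), the same behaviour in plain stdlib is:

```python
import re

def tokenize(text: str):
    """Split on whitespace and punctuation by keeping only runs of
    word characters (equivalent to RegexpTokenizer(r"\w+"))."""
    return re.findall(r"\w+", text)
```

Note that punctuation-splitting also breaks contractions ("can't" becomes "can", "t"), which is why the heavier normalization is deferred to a later stage.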

Affective Feature Extraction:

Next, we perform rigorous syntactic analysis by applying part-of-speech tagging with the NLTK POS tagger to recognize N.A.V.A words (Nouns, Adjectives, Verbs, and Adverbs), which are the best candidates for carrying emotional content, and dependency parsing with the Stanford Dependency Parser to detect dependency relationships between the words in a sentence. We focus on three types of dependencies: negation modifier ("this is not funny"), adjectival complement ("I feel depressed"), and adverbial modifier ("I struggled happily"). The objective is to adjust the representation of words based on the presence of stronger relationships. For example, by detecting the negation dependency in the sentence "I am not happy," its affective representation will not be the same as that of the sentence "I am happy," since the word "happy" depends on a negation, which cancels its happiness emotion towards neutral. The emotion of a word is normalized using the mean of the dependent word ("happy") and the word it depends on ("not"). After that, we further refine the word features of the sentence using Named Entity Recognition to remove words that are proper nouns, places, monetary currencies, and so on, since those don't carry any affective content. Related notebook.
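The N.A.V.A filter and the negation adjustment can be sketched over pre-tagged input (Penn Treebank tags, as produced by `nltk.pos_tag`); the zero-score-for-the-negator assumption below is our reading of the mean-based normalization described above:

```python
# Penn Treebank tag prefixes for nouns, adjectives, verbs, adverbs.
NAVA_PREFIXES = ("NN", "JJ", "VB", "RB")

def nava_words(tagged):
    """Keep only N.A.V.A words from a (word, Penn-tag) list."""
    return [word for word, tag in tagged if tag.startswith(NAVA_PREFIXES)]

def negate(score: float) -> float:
    """Pull a word's emotion score towards neutral when it sits under a
    negation modifier, by averaging it with the negator's score (taken
    as 0.0 here -- an assumption about the mean described above)."""
    return (score + 0.0) / 2
```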

More Data Cleaning:

At this stage, we perform more term normalization by converting to lowercase and lemmatizing with WordNetLemmatizer to reduce each word to its root form. We choose not to use stemming, since it cuts words down to meaningless forms that are not contained in the lexicon. We also remove a customized list of stopwords, to which we add some commonly used verbs that don't express any emotion, like "be", "go", and "do".
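The ordering matters: lowercasing and lemmatizing first means that inflected forms of the added stopword verbs are also filtered. A sketch with a toy lemma table standing in for WordNetLemmatizer:

```python
# Toy lemma table standing in for WordNetLemmatizer (assumption).
LEMMAS = {"feelings": "feeling", "went": "go", "better": "good"}

# Base stopwords plus the emotion-free verbs added above.
STOPWORDS = {"the", "a", "an", "is", "be", "go", "do"}

def normalize(tokens):
    """Lowercase, lemmatize, then drop stopwords, in that order,
    so that lemmatized forms such as 'go' (from 'went') are filtered."""
    out = []
    for tok in tokens:
        tok = LEMMAS.get(tok.lower(), tok.lower())
        if tok not in STOPWORDS:
            out.append(tok)
    return out
```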

Computation of Emotional Vectors:

We compute word-level emotional vectors, where each word has a vector of 8 emotion values plus 2 polarity values based on semantic relatedness scores between the word and the set of representative words for the emotion/polarity category. Then we average, taking either the geometric or the arithmetic mean, to get tweet-level emotional vectors. After that, we assign the dominant emotion to be the index with the highest score among the emotion indices, and the dominant polarity to be the index with the highest score among the polarity indices, provided the highest score exceeds a certain threshold. If the highest score doesn't exceed the threshold, or the tweet doesn't contain any N.A.V.A words, we assign neutral. Related notebook.
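The dominant-emotion decision over the first 8 entries can be sketched as follows; the threshold value of 0.3 is a hypothetical placeholder, not the one used in the notebook:

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

def dominant_emotion(vec, threshold=0.3):
    """Arg-max over the 8 emotion entries of a tweet-level vector;
    'neutral' if the best score doesn't clear the threshold, or the
    tweet produced no vector at all (no N.A.V.A words)."""
    if not vec:
        return "neutral"
    best = max(range(len(EMOTIONS)), key=lambda i: vec[i])
    return EMOTIONS[best] if vec[best] > threshold else "neutral"
```

The dominant polarity is chosen the same way over the remaining 2 entries.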

Semantic Similarity Methodology:

Instead of relying solely on lexicon spotting to compare the words in a sentence with the set of representative words for each emotion category, we also extend it with an approach from predictive distributional semantics based on word embeddings. We use word2vec to compute semantic relatedness scores between each word and the set of representative words of an emotion category by taking the geometric mean, score(w_i, e_j) = (∏_{w_k ∈ K_j} sim(w_i, w_k))^(1/|K_j|), where w_i is the word and K_j is the set of representative words for a particular emotion category e_j. Related notebook.
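The geometric mean is straightforward to implement once a pairwise similarity function (word2vec cosine similarity in the real pipeline) is available; the `sim` stub in the test below is illustrative:

```python
import math

def relatedness(word, rep_words, sim):
    """Geometric mean of similarity scores between `word` and each
    representative word of an emotion category:
    (prod over k in K_j of sim(word, k)) ** (1 / |K_j|)."""
    scores = [sim(word, k) for k in rep_words]
    return math.prod(scores) ** (1.0 / len(scores))
```

A design note: unlike the arithmetic mean, the geometric mean strongly penalizes a word that is close to some representative words but unrelated (near-zero similarity) to others.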


Semi-Supervised Approach:

We also tried training machine learning models on already-annotated data from another domain and used domain adaptation to increase the generalizability of the model to the unannotated tweets. We follow the same pipeline used for pre-processing and affective representation of words, then convert them into numerical vectors using a Continuous Bag of Words word2vec model with a context window of 10, a minimum word occurrence of 4, and a dimensionality of 300. Then we average over the word vectors to get the numerical vector representation of each tweet. We follow the standard machine learning pipeline: we split the data into train and test sets, deal with emotion class imbalances, and try different algorithms, including random forest, SVM, and kNN, and compare them. Related notebook.
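The tweet-vector averaging step can be sketched without any dependencies (the real pipeline reads vectors from the trained word2vec model; `dim=300` matches the CBOW setup above):

```python
def tweet_vector(tokens, word_vectors, dim=300):
    """Average the word2vec vectors of in-vocabulary tokens to get a
    tweet-level feature vector; zero vector if no token is known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    # Column-wise mean across the word vectors.
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

These tweet vectors are then fed to the random forest, SVM, and kNN classifiers for comparison.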
