
Digital Transformation

As we reach the end of the 2010s, the term Digital Transformation has been occurring with increasing frequency. We can be quite sure that it will continue to do so as we enter the new decade. What does it actually mean? I caught up with Tern Poh, one of our consultants at AI Singapore, to get his take on it.

Making Interoperable AI Models a Reality

In one of the last meetup events of the year, a graduate of the AI Apprenticeship Programme (AIAP™), Lee Yu Xuan, was invited to speak at Google Singapore by TensorFlow and Deep Learning Singapore. This is one meetup group that consistently delivers cutting-edge material, especially the latest in TensorFlow development, for the hardcore technical geek, and I always try not to miss it. Organiser Martin Andrews had been looking for a way to convert between PyTorch and TensorFlow deep learning models and came across an article Yu Xuan had written earlier this year. Since Yu Xuan had already done prior work in this area with the ONNX open format, Andrews invited him to share his experience using it, which he gladly accepted.

What and Why ONNX

ONNX stands for Open Neural Network Exchange, an open format that provides interoperability between different deep learning frameworks. Even after the demise of Theano in 2017, it is clear that AI research and production will remain a multi-polar ecosystem, with TensorFlow and PyTorch currently the most popular frameworks. While both are open source, they are not interoperable. The ability to share models and to move training and inference between frameworks are among the goals that ONNX seeks to fulfill.

Our goal is to make it possible for developers to use the right combinations of tools for their project. We want everyone to be able to take AI from research to reality as quickly as possible without artificial friction from toolchains.

– From the ONNX webpage

ONNX started in 2017 as a community project by Facebook and Microsoft. It has also received support from notable names, including Intel, AWS, Huawei and NVIDIA, among others. Last month, the LF AI Foundation welcomed it as a graduate level project. ONNX also supports a collection of pre-trained SOTA models contributed by the community.

While much has been accomplished, perhaps the biggest stride forward would be getting Google on board. To date, the Mountain View giant has been absent from the community and its absence has been felt. For one thing, TensorFlow 2.0, released at the end of September this year, is still not supported by ONNX. This is clearly something the industry needs, and the community is no doubt working hard to fill the gap. In time to come, we hope that the friction of switching between frameworks will be a thing of the past.
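
To make the idea concrete, here is a minimal sketch of exporting a PyTorch model to the ONNX format with torch.onnx.export (not taken from the talk; the model and file names are purely illustrative):

import torch
import torchvision

# A pretrained model used purely for illustration
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# ONNX export traces the model with a dummy input of the expected shape
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx",
                  input_names=["input"], output_names=["output"])

# The resulting .onnx file can then be loaded by other runtimes or converters,
# e.g. onnxruntime or the ONNX-TensorFlow backend.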

Analysis of Tweets on the Hong Kong Protest Movement 2019 with Python

(By Griffin Leow, republished with permission)

(Disclaimer: This article is not intended to make any form of political or social commentary whatsoever on the current situation in Hong Kong. The analysis done is purely based on deductions made of the data set at hand.)

I was motivated to do a pet project on Sentiment Analysis after recently completing Andrew Ng's Deep Learning Specialisation on Coursera, where one of the courses covers Sequence Models. I wrote this article to consolidate and share my learning and code.

With the Hong Kong protest movement having gone on for close to six months, I had the sudden idea of scraping Twitter tweets about the protests and using them for this project. I did not want to use existing (and possibly already cleaned-up) data sets that are easily available on Kaggle, for instance. I figured this was my chance to get my hands dirty and learn the process of scraping tweets.

The aim for this project is to discover the:

  1. general sentiment of the tweets regarding the protest, in particular their stances/opinions towards the central government in China, the Hong Kong administration and the police force
  2. demographics of Twitter users
  3. popularity of hashtags
  4. behaviour of top users and users in general
  5. daily top tweets

The structure of this article differs from the usual tutorials, where the flow would be data cleaning and preprocessing, followed by exploratory data analysis and then model training and tuning. Here, we want the reader to focus first on the data analysis and visualisation; the data cleaning and preprocessing steps are covered afterwards. You may access the source code from this repository.

Scraping Twitter Tweets using Tweepy

Since scraping Twitter tweets is not the focus of this article, I have put up a separate article describing the process in detail. Click on this link if you need a step-by-step guide.
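
For orientation, below is a minimal, hedged sketch of what such scraping can look like with Tweepy 3.x. The credentials, search query and fields are placeholders, not the author's actual setup (which is described in the linked article):

import tweepy
import pandas as pd

# Placeholder credentials - replace with your own Twitter API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

rows = []
# Cursor pages through the standard 7-day search endpoint
for tweet in tweepy.Cursor(api.search,
                           q="#hongkongprotests OR #hkpolicebrutality",
                           lang="en",
                           tweet_mode="extended").items(1000):
    rows.append({"username": tweet.user.screen_name,
                 "text": tweet.full_text,
                 "retweetcount": tweet.retweet_count,
                 "tweetcreatedts": str(tweet.created_at)})

pd.DataFrame(rows).to_csv("sample_tweets.csv", index=False)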


Exploratory Data Analysis (EDA)

Let’s explore and visualise the already processed tweets with the usual data visualisation libraries — seaborn and matplotlib.

1. WordCloud — Quick Preview of Popular words found in tweets regarding the protest

First of all, we use a word cloud, which can immediately show us the most heavily used words in tweets pertaining to the protest. The code required is as follows:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color = 'white',
        max_words = 200,
        max_font_size = 40,
        scale = 3,
        random_state = 42
    ).generate(str(data))

    fig = plt.figure(1, figsize = (15, 15))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize = 20)
        fig.subplots_adjust(top = 2.3)

    plt.imshow(wordcloud)
    plt.show()

# print wordcloud
show_wordcloud(data['cleaned_text'])

We generated a word cloud capped at 200 words, where the more popular a word is, the bigger it appears in the cloud (you can adjust this cap by changing the value of ‘max_words’).

Some of the words that hit us quickly are: disgust, police, fireman, protestors, tear gas, citizen, failed and trust. In general, without the context of the tweet we cannot determine whether each word, taken on its own, represents a negative or positive sentiment towards the government or the protesters. But for those of us who have been following social media and the news, there has been a big backlash against the police.

2. No. of Positive Sentiments vs No. of Negative Sentiments

Next, we look at the distribution of positive and negative tweets. We use the SentimentIntensityAnalyzer from NLTK's VADER lexicon, which scores how positive, neutral or negative a piece of text is.

We can interpret the sentiment in the following manner: a positive sentiment could mean that the tweet is pro-government and/or pro-police, whereas a negative sentiment could mean that it is anti-government and/or anti-police, and supportive of the protesters.

The analyser returns four scores for each sentence, namely ‘positive’, ‘negative’, ‘neutral’ and ‘compound’. The ‘compound’ score gives the overall sentiment of a sentence in the range [-1, 1]. For our current purpose, we classify each tweet into 5 classes using the ‘compound’ score and assign a range of values to each class:

  1. Very positive ‘5’ — [0.55, 1.00]
  2. Positive ‘4’ — [0.10, 0.55)
  3. Neutral ‘3’ — (-0.10, 0.10)
  4. Negative ‘2’ — (-0.55, -0.10]
  5. Very negative ‘1’ — [-1.00, -0.55]

Note: the range of values for a neutral sentiment is deliberately kept narrow.
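
As a quick illustration of what the analyser returns (a minimal sketch; the sample sentence is made up, and the vader_lexicon resource must be downloaded once):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-off download of the VADER lexicon
sid = SentimentIntensityAnalyzer()

# Returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
scores = sid.polarity_scores("The police failed to protect the citizens")
print(scores['compound'])  # this compound value is what we bin into the five classes below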

As it turns out, analysing the sentiment of the tweets with a rule-based approach is quite inaccurate because of the nature of the protests: the sentiment of each tweet can be directed at either the government or the protesters. In other settings, such as hotel reviews, the sentiment of each review is about the hotel rather than the guests who wrote it, so a good sentiment score clearly means the hotel was reviewed favourably and a bad score means it was not. In our case study, however, a positive sentiment score for a tweet can mean support for one party and, implicitly, opposition to its counterpart. This will become apparent in the examples that follow.

Assign the classes to each data according to its ‘compound’ score:

# Focus on 'compound' scores
# Create a new column called 'sentiment_class'
sentimentclass_list = []

for i in range(0, len(data)):
    
    # current 'compound' score:
    curr_compound = data.iloc[i,:]['compound']
    
    if (curr_compound <= 1.0 and curr_compound >= 0.55):
        sentimentclass_list.append(5)
    elif (curr_compound < 0.55 and curr_compound >= 0.10):
        sentimentclass_list.append(4)
    elif (curr_compound < 0.10 and curr_compound > -0.10):
        sentimentclass_list.append(3)
    elif (curr_compound <= -0.10 and curr_compound > -0.55):
        sentimentclass_list.append(2)
    elif (curr_compound <= -0.55 and curr_compound >= -1.00):
        sentimentclass_list.append(1)

# Add the new column 'sentiment_class' to the dataframe
data['sentiment_class'] = sentimentclass_list

# Verify if the classification assignment is correct:
data.iloc[0:5, :][['compound', 'sentiment_class']]
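
As an aside, roughly the same assignment can be done in a single vectorised step with pandas' cut (a sketch only; note that pd.cut uses right-closed bins, so values exactly on the 0.10 and 0.55 boundaries are handled slightly differently from the loop above):

import pandas as pd

# Bin the compound score into five ordered classes 1..5
data['sentiment_class'] = pd.cut(
    data['compound'],
    bins = [-1.0, -0.55, -0.10, 0.10, 0.55, 1.0],
    labels = [1, 2, 3, 4, 5],
    include_lowest = True   # so that a compound score of exactly -1.0 is included
).astype(int)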

We make a seaborn countplot to show us the distribution of sentiment classes in the dataset:

import seaborn as sns

# Distribution of sentiment_class
plt.figure(figsize = (10,5))
sns.set_palette('PuBuGn_d')
sns.countplot(data['sentiment_class'])
plt.title('Countplot of sentiment_class')
plt.xlabel('sentiment_class')
plt.ylabel('No. of classes')
plt.show()

Let’s take a look at some of the tweets in each sentiment class:

  • 10 random tweets that are classified with ‘negative sentiment’ — classes 1 and 2
# Display full text in Jupyter notebook:
pd.set_option('display.max_colwidth', -1)

# Look at some examples of negative, neutral and positive tweets

# Filter 10 negative original tweets:
print("10 random negative original tweets and their sentiment classes:")
data[(data['sentiment_class'] == 1) | (data['sentiment_class'] == 2)].sample(n=10)[['text', 'sentiment_class']]

It is clear that the tweets are about denouncing alleged police violence, rallying for international support — especially from the United States — and reporting on police activities against the protesters.

  • 10 random tweets that are classified with ‘neutral sentiment’ — class 3
# Filter 10 neutral original tweets:
print("10 random neutral original tweets and their sentiment classes:")
data[(data['sentiment_class'] == 3)].sample(n=10)[['text', 'sentiment_class']]

Most of these tweets, except for the last one with index 114113, are supposed to be neutral in stance. But given the context, it can be inferred that the tweets are about supporting the protesters and their cause.

  • 20 random tweets that are classified with ‘positive sentiment’ — classes 4 and 5
# Filter 20 positive original tweets:
print("20 random positive original tweets and their sentiment classes:")
data[(data['sentiment_class'] == 4) | (data['sentiment_class'] == 5)].sample(n=20)[['text', 'sentiment_class']]

20 random tweets were picked out, but almost all, if not all, are actually negative in sentiment, meaning they are against the Hong Kong government and/or police. A quick observation reveals that the tweets cover these topics: passing the Hong Kong Human Rights and Democracy Act in the United States; removing fellowships from Hong Kong politician(s); and general support for the Hong Kong protesters.

This supports the argument made earlier that a rule-based sentiment analysis of tweets using the VADER lexicon is, in this case, inaccurate at identifying REAL positive sentiments, leaving us with many false positives. It fails to take the context of the tweets into account. Most of the ‘positive sentiment’ tweets contain more ‘positive’ words than ‘negative’ ones, but they actually show support for the protesters and their cause, NOT for the Hong Kong government and/or China.

3. Popularity of Hashtags

Recall that the tweets were scraped using a pre-defined search term containing a list of specific hashtags pertaining to the protests. A tweet can also contain other hashtags that are not defined in the search term, as long as it contains at least one hashtag that is.

In this section, we want to find out the most and least popular hashtags used by Twitter users in their tweets.

# the column data['hashtags'] returns a list of string(s) for each tweet. Build a list of all hashtags in the dataset.

hashtag_list = []

for i in range(0, len(data)):
    # Obtain the current list of hashtags
    curr_hashtag = data.iloc[i, :]['hashtags']
    
    # Extract and append the hashtags to 'hashtag_list':
    for j in range(0, len(curr_hashtag)):
        hashtag_list.append(curr_hashtag[j])

The total number of hashtags used can be determined by:

# No. of hashtags
print('No. of hashtags used in {} tweets is {}'.format(len(data), len(hashtag_list)))

No. of hashtags used in 233651 tweets is 287331

We build a simple dataframe for visualisation purposes:

df_hashtag = pd.DataFrame(
    {'hashtags': hashtag_list}
)

print(df_hashtag.head())
print('Shape of df_hashtag is:', df_hashtag.shape)
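
As a design note, the flattening loop above can also be written in one step with pandas' explode, assuming the 'hashtags' column holds Python lists (a sketch, equivalent in spirit to the loop):

# Flatten the lists of hashtags into one long Series, dropping tweets with no hashtags
df_hashtag = (data['hashtags']
              .explode()
              .dropna()
              .to_frame(name = 'hashtags')
              .reset_index(drop = True))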

Basic Visualisation: All-time Top 15 Hashtags used

Let’s take a look at the top 15 hashtags used by users:

# Define N to be the top number of hashtags
N = 15
top_hashtags = df_hashtag.groupby(['hashtags']).size().reset_index(name = 'counts').sort_values(by = 'counts', ascending = False).head(N)
print(top_hashtags)

# seaborn countplot on the top N hashtags
plt.figure(figsize=(30,8))
sns.set_palette('PuBuGn_d')
sns.barplot(x = 'hashtags', y = 'counts', data = top_hashtags)
plt.title('Barplot of Top ' + str(N) + ' Hashtags used')
plt.xlabel('Hashtags')
plt.ylabel('Frequency')
plt.show()

As expected, 14 of the 15 hashtags contain the keywords ‘hongkong’ or ‘hk’, because users use them to identify their tweets with Hong Kong and the protests. The only hashtag that differs from the rest is #china.

Several hashtags explicitly show support for the protesters and decry the Hong Kong police’s actions and behaviour, among them #fightforfreedom, #freehongkong, #hkpoliceterrorism, #hkpolicestate and #hkpolicebrutality.

Advanced Visualisation: Time-series of Top 10 Hashtags over the Last 7 Days (Not exactly a time series…)

We want to see the growth in the usage of hashtags from 3rd November 2019, when the scraping of data started. Did one or a few hashtags become more popular over time? Let’s find out:

from datetime import datetime
import re

ind_to_drop = []
date = []

# First find out which 'tweetcreatedts' is not a string or in other weird formats
for i in range(0, len(data)):
    ith_date_str = data.iloc[i,:]['tweetcreatedts']
    ith_match = re.search(r'\d{4}-\d{2}-\d{2}', ith_date_str)
    if ith_match == None:
        ind_to_drop.append(i)
    else:
        continue

# Drop these rows using ind_to_drop
data.drop(ind_to_drop, inplace = True)

# Create a new list of datetime date objects from the tweets:
for i in range(0, len(data)):
    ith_date_str = data.iloc[i, :]['tweetcreatedts']
    ith_match = re.search(r'\d{4}-\d{2}-\d{2}', ith_date_str)
    ith_date = datetime.strptime(ith_match.group(), '%Y-%m-%d').date()
    
    date.append(ith_date)
    
# Size of list 'date'
print('Len of date list: ', len(date))

Len of date list: 233648

# Append 'date' to dataframe 'data' as 'dt_date' aka 'datetime_date'
data['dt_date'] = date

# Check to see that we have the correct list of dates from the dataset
data['dt_date'].value_counts()
# Create a new dataframe first
timeseries_hashtags = pd.DataFrame(columns = ['hashtags', 'count', 'date', 'dayofnov'])

# Obtain a set of unique dates in 'date' list:
import numpy as np
unique_date = np.unique(date)

We define a function that lets you choose the top N hashtags to show and the number of days T to look back over, instead of plotting every day since 3rd Nov 2019 (which you could, but it would clutter the plot):

def visualize_top_hashtags(main_df, timeseries_df, N, T, unique_dates):
    # main_df - main dataframe 'data'
    # timeseries_df - a new and empty dataframe to store the top hashtags
    # N - number of top hashtags to consider
    # T - number of days to consider
    # unique_dates - list of unique dates available in the table

    # Returns:
    # timeseries_df

    # Start counter to keep track of number of days already considered
    counter = 1

    # Starting from the latest date in the list
    for ith_date in reversed(unique_dates):
        # Check if counter exceeds the number of days required, T:
        if counter <= T:

            # Filter tweets created on this date:
            ith_date_df = main_df[main_df['dt_date'] == ith_date]

            # From this particular df, build a list of all possible hashtags:
            ith_hashtag_list = []
            for i in range(0, len(ith_date_df)):
                # Obtain the current list of hashtags:
                curr_hashtag = ith_date_df.iloc[i,:]['hashtags']

                # Extract and append the hashtags to 'hashtag_list':
                for j in range(0, len(curr_hashtag)):
                    ith_hashtag_list.append(curr_hashtag[j])

            # Convert the list into a simple DataFrame
            ith_df_hashtag = pd.DataFrame({
                'hashtags': ith_hashtag_list
            })

            # Obtain top N hashtags:
            ith_top_hashtags = ith_df_hashtag.groupby(['hashtags']).size().reset_index(name = 'count').sort_values(by = 'count', ascending = False).head(N)

            # Add date as a column
            ith_top_hashtags['date'] = ith_date
            ith_top_hashtags['dayofnov'] = ith_date.day

            # Finally, concat this dataframe to timeseries_hashtags
            timeseries_df = pd.concat([timeseries_df, ith_top_hashtags], axis = 0)

            # Increase counter by 1
            counter += 1

        else: # break the for loop
            break

    print('The newly created timeseries_hashtag of size {} is: '.format(timeseries_df.shape))
    timeseries_df.reset_index(inplace = True, drop = True)

    # Visualization
    plt.figure(figsize=(28,12))
    ax = sns.barplot(x = 'hashtags',
                     y = 'count',
                     data = timeseries_df,
                     hue = 'dayofnov')

    # plt.xticks(np.arange(3, 6, step=1))
    # Moving legend box outside of the plot
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    # for legend text
    plt.setp(ax.get_legend().get_texts(), fontsize='22')
    # for legend title
    plt.setp(ax.get_legend().get_title(), fontsize='32')
    plt.xlabel('Top Hashtags')
    plt.ylabel('Count of Hashtags')
    plt.title('Top ' + str(N) + ' Hashtags per day')
    sns.despine(left=True, bottom=True)
    plt.xticks(rotation = 45)
    plt.show()

    return timeseries_df

We can finally make the plot:

timeseries_hashtags = visualize_top_hashtags(main_df = data,
                       timeseries_df = timeseries_hashtags,
                       N = 10,
                       T = 7,
                       unique_dates = unique_date)

I figured that it would be easier to view the trend for each hashtag with a bar plot than with a scatter plot (even though the latter would show the time series), because it is hard to relate each point to the legend when there are so many colours and categories (hashtags).

A plot of the top 10 hashtags per day for the 7 days up to 16 Nov 2019 shows the commonly used hashtags for the movement — #hongkong, #hongkongprotests, #hongkongpolice, #standwithhongkong, and #hkpolice.

Other than the usual hashtags, the visualisation function can reveal unique and major events/incidents, because users use these hashtags in their tweets when they talk about them.

Throughout the period, we see the sudden appearance of hashtags such as #blizzcon2019 (not shown because of different parameters N and T), #周梓樂 (represented by squares in the graph above), #japanese, #antielab, #pla, and #hkhumanrightsanddemocracyact.

These hashtags are tied to significant milestones:

  1. #blizzcon2019/blizzcon19 — https://www.scmp.com/tech/apps-social/article/3035987/will-blizzcon-become-latest-battleground-hong-kong-protests
  2. #周梓樂 — https://www.bbc.com/news/world-asia-china-50343584
  3. #hkust (related to #周梓樂) — https://www.hongkongfp.com/2019/11/08/hong-kong-students-death-prompts-fresh-anger-protests/
  4. #japanese (a Japanese man was mistaken for a mainland Chinese and attacked by protesters) — https://www3.nhk.or.jp/nhkworld/en/news/20191112_36/#:~:targetText=Hong%20Kong%20media%20reported%20on,take%20pictures%20of%20protest%20activities.
  5. #antielab (protesters boycotted the election because Joshua Wong was banned from running in it) — https://www.scmp.com/news/hong-kong/politics/article/3035285/democracy-activist-joshua-wong-banned-running-hong-kong
  6. #pla (China’s People’s Liberation Army (PLA) troops stationed in Hong Kong helped to clean and clear streets that were blocked and damaged by protesters) — https://www.channelnewsasia.com/news/asia/china-s-pla-soldiers-help-clean-up-hong-kong-streets-but-12099910

It is expected that major incidents will continue to trigger new hashtags in future, in addition to the usual ones.

4. Most Popular Tweets

In this section, we focus on the most popular tweets. There are two indicators that can help us here — a tweet’s retweet count and favourite count. Unfortunately, we could only extract the retweet count, because there was some trouble retrieving the favourite count from the mess of dictionaries in the .json format (please feel free to leave a comment if you know how to do it!).

We will do the following:

  • Top N tweets of all time
  • Top N tweets for a particular day
  • Top N tweets for the past T days

Top 10 Tweets of All Time

# Convert the data type of the column to all int using pd.to_numeric()

print('Current data type of "retweetcount" is:',data['retweetcount'].dtypes)

data['retweetcount'] = pd.to_numeric(arg = data['retweetcount'])

print('Current data type of "retweetcount" is:',data['retweetcount'].dtypes)

Current data type of “retweetcount” is: object
Current data type of “retweetcount” is: int64

We define a function to pull out the top 10 tweets:

def alltime_top_tweets(df, N):
    # Arguments:
    # df - dataframe
    # N - top N tweets based on retweetcount
    
    # Sort according to 'retweetcount'
    top_tweets_df = df.sort_values(by = ['retweetcount'], ascending = False)
    # Drop also duplicates from the list, keep only the copy with higher retweetcount
    top_tweets_df.drop_duplicates(subset = 'text', keep = 'first', inplace = True)
    # Keep only N rows
    top_tweets_df = top_tweets_df.head(N)
    
    # Print out only important details 
    # username, tweetcreatedts, retweetcount, original text 'text'
    return top_tweets_df[['username', 'tweetcreatedts', 'retweetcount', 'text']]

print('All-time top 10 tweets:')
print('\n')
alltime_top_tweets(data, 10)

Top 10 Tweets for Any Particular Day

We can also create another function to pull out top N tweets for a specified day:

def specified_toptweets(df, spec_date, N):
    # Arguments
    # df - dataframe
    # N - top N tweets
    # date - enter particular date in str format i.e. '2019-11-02'
    
    # Specific date
    spec_date = datetime.strptime(spec_date, '%Y-%m-%d').date()
    
    # Filter df by date first
    date_df = df[df['dt_date'] == spec_date ]
    
    # Sort according to 'retweetcount'
    top_tweets_date_df = date_df.sort_values(by = ['retweetcount'], ascending = False)
    # Drop also duplicates from the list, keep only the copy with higher retweetcount
    top_tweets_date_df.drop_duplicates(subset = 'text', keep = 'first', inplace = True)
    # Keep only N rows
    top_tweets_date_df = top_tweets_date_df.head(N)
    
    print('Top ' + str(N) + ' tweets for date ' + str(spec_date) + ' are:')
    # Print out only important details 
    # username, tweetcreatedts, retweetcount, original text 'text'
    return top_tweets_date_df[['username', 'tweetcreatedts', 'retweetcount', 'text']]

Let’s try 5th Nov 2019:

specified_toptweets(data, '2019-11-05', 10)

Top 2 Tweets for The Past 5 Days

Finally, we can also extract the top N tweets for the last T days with the following function:

def past_toptweets(df, T, N, unique_date):
    # Arguments:
    # df - dataframe 'data'
    # T - last T days 
    # N - top N tweets
    # List of all unique dates in dataset
    
    # Create a df to store top tweets for all T dates, in case there is a need to manipulate this df
    past_toptweets_df = pd.DataFrame(columns = ['username', 'tweetcreatedts', 'retweetcount', 'text'])
    print(past_toptweets_df)
    
    # Filter data according to last T dates first:
    # Do a check that T must not be greater than the no. of elements in unique_date
    if T <= len(unique_date):
        unique_date = unique_date[-T:] # a list
    else:
        raise Exception('T must be smaller than or equal to the number of dates in the dataset!')
    
    # Print out top N for each unique_date one after another, starting from the latest:
    for ith_date in reversed(unique_date):
        # Filter tweets created on this date:
        ith_date_df = df[df['dt_date'] == ith_date]
        
        # Sort according to 'retweetcount'
        top_tweets_date_df = ith_date_df.sort_values(by = ['retweetcount'], ascending = False)
        # Drop also duplicates from the list, keep only the copy with higher retweetcount
        top_tweets_date_df.drop_duplicates(subset = 'text', keep = 'first', inplace = True)
        # Keep only N rows
        top_tweets_date_df = top_tweets_date_df.head(N)
        # Keep only essential columns
        top_tweets_date_df = top_tweets_date_df[['username', 'tweetcreatedts', 'retweetcount', 'text']]
        
        # Append top_tweets_date_df to past_toptweets_df
        past_toptweets_df = pd.concat([past_toptweets_df, top_tweets_date_df], axis = 0)
        
        # Print out the top tweets for this ith_date
        print('Top ' + str(N) + ' tweets for date ' + str(ith_date) + ' are:')
        # print only essential columns:
        print(top_tweets_date_df)
        print('\n')
    
    return past_toptweets_df

past_toptweets(data, T = 5, N = 2, unique_date = unique_date)

One flaw of ‘past_toptweets’ is that it can return identical tweets: a popular tweet on day 1 can be retweeted again on subsequent days by other users, and the function will pick it up again because no logic has been implemented to consider only tweets that have not already been chosen on earlier dates.
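
One possible fix, sketched below under the assumption that identical text means an identical tweet, is to keep a running set of tweet texts already selected and exclude them on later dates (this would modify the loop body of past_toptweets; it is not part of the original code):

seen_texts = set()  # texts already selected on earlier (more recent) dates

# inside the loop over dates, before taking the head(N):
top_tweets_date_df = ith_date_df.sort_values(by = ['retweetcount'], ascending = False)
top_tweets_date_df.drop_duplicates(subset = 'text', keep = 'first', inplace = True)

# drop tweets whose text has already been picked for another day
top_tweets_date_df = top_tweets_date_df[~top_tweets_date_df['text'].isin(seen_texts)]
top_tweets_date_df = top_tweets_date_df.head(N)

# remember the texts we just selected
seen_texts.update(top_tweets_date_df['text'].tolist())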

5. Behaviour of Twitter Users

No. of Tweets Daily

Let’s check out the trend in the number of tweets daily. We will build another dataframe that will be used to plot out the visualisation.

top_user_df = pd.DataFrame(columns = ['username', 'noTweets', 'noFollowers', 'dt_date'])

# Convert datatype of 'totaltweets' to numeric
data['totaltweets'] = pd.to_numeric(data['totaltweets'])

for ith_date in unique_date:
    print('Current loop: ', ith_date)
    
    temp = data[data['dt_date'] == ith_date]
    
    # pd.DataFrame - count number of tweets tweeted in that day - noTweets
    temp_noTweets = temp.groupby(['username']).size().reset_index(name = 'noTweets').sort_values(by = 'username', ascending = False)
    
    # pd.Series - count max followers - might fluctuate during the day
    temp_noFollowing = temp.groupby(['username'])['followers'].max().reset_index(name = 'noFollowers').sort_values(by = 'username', ascending = False)['noFollowers']
    
    # *** NOT WORKING
    # pd.Series - count max totaltweets - might fluctuate during the day. Note this is historical total number of tweets ever since the user is created.
    # temp_noTotaltweets = temp.groupby(['username'])['totaltweets'].max().reset_index(name = 'noTotaltweets').sort_values(by = 'username', ascending = False)['noTotaltweets']
    
    # Concat series to temp_noTweets, which will be the main df
    final = pd.concat([temp_noTweets, temp_noFollowing], axis = 1) # add as columns
    final['dt_date'] = ith_date
    
    print(final)
    
    # Append 'final' dataframe to top_user_df
    top_user_df = pd.concat([top_user_df, final])

Plotting the visualisation:

# hue = retweetcount and followers, totaltweets
f, axes = plt.subplots(3, 1, figsize = (22,22))
sns.set_palette('PuBuGn_d')
sns.stripplot(x = 'dt_date', y = 'noTweets', data = top_user_df, jitter = True, ax = axes[0], size = 6, alpha = 0.3)
sns.boxplot(y = 'dt_date', x = 'noTweets', data = top_user_df, orient = 'h', showfliers=False, ax = axes[1])
sns.boxplot(y = 'dt_date', x = 'noTweets', data = top_user_df, orient = 'h', showfliers=True, fliersize = 2.0, ax = axes[2])

# Axes and titles for each subplot
axes[0].set_xlabel('Date')
axes[0].set_ylabel('No. of Tweets')
axes[0].set_title('No. of Tweets Daily')

axes[1].set_xlabel('No. of Tweets')
axes[1].set_ylabel('Date')
axes[1].set_title('No. of Tweets Daily')

axes[2].set_xlabel('Date')
axes[2].set_ylabel('No. of Tweets')
axes[2].set_title('No. of Tweets Daily')
plt.show()

From the Seaborn box plots and strip plot, we see that most of the users in the dataset do not tweet a lot in a day. From the strip plot alone, we might not be able to discern the outliers in the dataset, and might think that most users tweeted in the range of 1 to 30-plus tweets daily.

However, the box plots tell a different story. The first box plot, in the middle of the visualisation, reveals that most users tweeted roughly between 1 and 8 times a day. On the other hand, there are many outliers shown in the second box plot at the bottom of the visualisation; these users tweeted a lot, from around 10 tweets a day upwards. At least 7 users tweeted more than 100 times per day within the timeframe considered.

Top 5 Users with the Most Number of Tweets Daily

Let’s zoom in further by finding out exactly who these top users are.

# To change the number of users, adjust the value in head()
# top_user_df.set_index(['dt_date', 'username']).sort_values(by = ['dt_date','noTweets'], ascending = False)
user_most_tweets_df = top_user_df.sort_values(by = ['dt_date', 'noTweets'], ascending = False, axis = 0).groupby('dt_date').head(5)

# Extract 'days' out of dt_date so we can plot a scatterplot
# Will return an int:
user_most_tweets_df['dayofNov'] = user_most_tweets_df['dt_date'].apply(lambda x: x.day)
user_most_tweets_df['noTweets'] = user_most_tweets_df['noTweets'].astype(int)

# Plot 2 subplots
# 1st subplot - show who are the users who tweeted the most
# 2nd subplot - trend in number of tweets
f, axes = plt.subplots(2, 1, figsize = (20,20))
f = sns.scatterplot(x = 'dayofNov', y = 'noTweets', hue = 'username', data = user_most_tweets_df, size = 'noFollowers', sizes = (250, 1250), alpha = 0.75, ax = axes[0])
sns.lineplot(x = 'dayofNov', y = 'noTweets', data = user_most_tweets_df, markers = True)

# Axes and titles for each subplot
# First subplot
axes[0].set_xlabel('Day in Nov')
axes[0].set_ylabel('No. of Tweets')
axes[0].set_title('Most no. of tweets daily')

# Legends for first subplot
box = f.get_position()
f.set_position([box.x0, box.y0, box.width * 1.0, box.height]) # resize position

# Put a legend to the right side
f.legend(loc='center right', bbox_to_anchor=(1.5, 0.5), ncol=4)

# Second subplot
axes[1].set_xlabel('Date')
axes[1].set_ylabel('No. of Tweets')
axes[1].set_title('Trend of no. of tweets by top users')
plt.show()

6. Demographics of Twitter Users

Location of Twitter Users

location = data['location']
print('No. of distinct locations listed by twitter users is:', len(location.value_counts()))
unique_locations = location.value_counts()

# Remove n.a.
unique_locations = pd.DataFrame({'locations': unique_locations.index,
                                'count': unique_locations.values})
unique_locations.drop(0, inplace = True)

# See top few locations
unique_locations.sort_values(by = 'count', ascending = False).head(10)

As expected, many of these users claim to be residing in Hong Kong, since such users would be closer to the ground and could spread news quickly from what they see in person.

We will discount the various ‘Hong Kong’ entries from the visualisation and focus on the distribution of the remaining locations:

# To remove 香港
hk_chinese_word = unique_locations.iloc[1,0]

# Obtain the row index of locations that contain hong kong:
ind_1 = unique_locations[unique_locations['locations'] == 'hong kong'].index.values[0]
ind_2 = unique_locations[unique_locations['locations'] == 'hk'].index.values[0]
ind_3 = unique_locations[unique_locations['locations'] == 'hong kong '].index.values[0]
ind_4 = unique_locations[unique_locations['locations'] == 'hongkong'].index.values[0]
ind_5 = unique_locations[unique_locations['locations'] == hk_chinese_word].index.values[0]
ind_6 = unique_locations[unique_locations['locations'] == 'kowloon city district'].index.values[0]

list_ind = [ind_1,ind_2,ind_3,ind_4,ind_5, ind_6]

# Drop these rows from unique_locations
unique_loc_temp = unique_locations.drop(list_ind)

# Focus on top 20 locations first
# Convert any possible str to int/numeric first
count = pd.to_numeric(unique_loc_temp['count'])
unique_loc_temp['count'] = count
unique_loc_temp = unique_loc_temp.head(20)

# Plot a bar plot
plt.figure(figsize=(16,13))
sns.set_palette('PuBuGn_d')
sns.barplot(x = 'count', y = 'locations', orient = 'h',data = unique_loc_temp)
plt.xlabel('Count')
plt.ylabel('Locations')
plt.title('Top 20 Locations')
plt.show()

A quick count of the top 20 locations, excluding Hong Kong, shows that the majority of these locations are in Western countries. We see the expected ones such as the United States, Canada, the UK and Australia, where some members of the public and politicians are also watching the protest movement and speaking out against the ruling government and police.

Top 30 Users with Most Followers

# Reuse code from top_user_df
# Sort according to noFollowers
top_user_df = top_user_df.sort_values(by = 'noFollowers', ascending = False)
user_most_followers = top_user_df.groupby('username')[['noFollowers', 'dt_date']].max().sort_values(by = 'noFollowers', ascending = False)
user_most_followers['username'] = user_most_followers.index
user_most_followers.reset_index(inplace = True, drop = True)

# plot chart
plt.figure(figsize = (25, 8))
sns.set_palette('PuBuGn_d')
sns.barplot(x = 'noFollowers', y = 'username', orient = 'h', data = user_most_followers.head(30))
plt.xlabel('No. of Followers')
plt.ylabel('Usernames')
plt.title('Top Twitter Accounts')
plt.show()

In the list of top 30 accounts, the majority belong to news agencies or media outlets such as AFP, CGTNOfficial, EconomicTimes, and ChannelNewsAsia. The rest belong to individuals such as journalists and writers. Joshua Wong’s account is the only one in the list that can be identified as part of the protests.

Activities of Top Accounts

user_most_followers_daily = top_user_df.sort_values(by = ['dt_date', 'noFollowers'], ascending = False, axis = 0).groupby('dt_date').head(5)
print(user_most_followers_daily)

# Extract 'days' out of dt_date so we can plot a scatterplot
# Will return an int:
user_most_followers_daily['dayofNov'] = user_most_followers_daily['dt_date'].apply(lambda x: x.day)
user_most_followers_daily['noFollowers'] = user_most_followers_daily['noFollowers'].astype(int)

f, axes = plt.subplots(1, 1, figsize = (15,10))
f = sns.scatterplot(x = 'dayofNov', y = 'noTweets', hue = 'username', data = user_most_followers_daily, size = 'noFollowers', sizes=(50, 1000))

# Axes and titles for each subplot
# First subplot
axes.set_xlabel('Day in Nov')
axes.set_ylabel('No. of Tweets')
axes.set_title('Daily activity of users with most number of followers')

# Legends for first subplot
box = f.get_position()
f.set_position([box.x0, box.y0, box.width * 1, box.height]) # resize position

# Put a legend to the right side
f.legend(loc='center right', bbox_to_anchor=(1.5, 0.5), ncol=3)

Although these top accounts have a lot of followers, the number of tweets they post per day is, on average, fewer than 10. This level of activity pales in comparison with that of the top 5 users with the most tweets daily in the section ‘Top 5 Users with the Most Number of Tweets Daily’.

7. Most Mentioned Usernames

Can we uncover more of the popular figures in the protest movement from these tweets? Twitter users might be tagging these people to inform them of events happening on the ground. Their backgrounds can range from lawyers, lawmakers, politicians and reporters to protest leaders.

def find_users(df):
    # df: dataframe to look at
    # returns a list of usernames
    
    # Create empty list
    list_users = []
    
    for i in range(0, len(df)):
        users_ith_text = re.findall('@[^\s]+', df.iloc[i,:]['text'])
        # returns a list
        # append to list_users by going through a for-loop:
        for j in range(0, len(users_ith_text)):
            list_users.append(users_ith_text[j])
    
    return list_users

# Apply on dataframe data['text']
list_users = find_users(data)

mentioned_users_df = pd.DataFrame({
    'mentioned_users': list_users
})

mentionedusers = mentioned_users_df.groupby('mentioned_users').size().reset_index(name = 'totalcount').sort_values(by = 'totalcount', ascending = False)
mentionedusers.head()
plt.figure(figsize=(30,8))
sns.set_palette('PuBuGn_d')
sns.barplot(x = 'mentioned_users', y = 'totalcount', data = mentionedusers.head(15))
plt.xlabel('Mentioned users in tweets')
plt.ylabel('Number of times')
plt.title('Top users and how many times they were mentioned in tweets')
plt.show()

Most of the 15 most mentioned users, if not all, are directly related to Hong Kong and the protest movement. A quick Google search on each of them returns results showing that they are either supportive of the protesters and protests and/or against the Hong Kong administration and the police force. In summary:

  1. @SolomonYue — a Chinese-American politician associated with the Hong Kong Human Rights and Democracy Act passed in the U.S.
  2. @joshuawongcf — a local protest leader who was planning to take part in the elections but was banned from running.
  3. @GOVUK — the UK government, which threatened to sanction Hong Kong officials over their handling of the protests.
  4. @HawleyMO — a U.S. politician.
  5. @HeatherWheeler — the UK Minister for Asia and the Pacific, who sent a letter to Hong Kong government officials on the proposed sanctions.

Conclusion on EDA

All in all, more than 200k tweets regarding the Hong Kong protest movement of 2019, over a period of 14 days from 3rd Nov 2019 to 16th Nov 2019, were scraped. The main steps were setting up the Twitter API calls; cleaning and processing the tweets; and creating Seaborn visualisations. The data analysis/visualisation of this project focused on several themes:

  1. Most Popular Words with a Word Cloud
  2. Sentiment Analysis with Vader-Lexicon from NLTK
  3. Popularity of Hashtags
  4. Most Popular Tweets
  5. Activity of Twitter Users
  6. Demographic of Twitter Users
  7. Most Mentioned Usernames

Usefulness of Hashtags

Notably, it is possible to identify significant milestones in the protests by monitoring the daily popularity and trend of the hashtags used. In theory, this means that one could simply monitor the hashtags, without reading the news or scrolling through social media, to stay updated on the protest movement.

Usefulness of Popular Tweets

The most popular tweets can also help reveal the ongoing sentiments of general Twitter users pertaining to the movement. We can understand it as follows: when a tweet about a certain event or piece of content gets retweeted by many people, it could mean that these people resonate with the message and want to share it with as many people as possible. As an example, the Hong Kong Human Rights and Democracy Act was one such hot topic. The most popular tweets can also provide further detail and granularity on the main topics/events of a day. As shown in the section entitled ‘Most Popular Tweets’, these tweets often involve cases of alleged police brutality and inappropriate use of force.

Personal Observation

Completing this project has strengthened my belief that the overall sentiment towards a topic/idea may depend on the social media platform, i.e. Twitter, Facebook, Weibo, etc.

So far, we have seen that the overall sentiment of these Twitter tweets about the Hong Kong protest movement is overwhelmingly negative towards the Hong Kong government and the police force, but positive and supportive towards the protesters. Conversely, we might well get the opposite reaction on platforms like Weibo and other Chinese media sites, with support and praise for the Hong Kong government and police.

Nevertheless, it is also possible that I am wrong and that there was a flaw in the collection of tweets because of the hashtags used in the search term. Some hashtags used in our search were explicitly about the negative aspects of the police, such as #hkpolicebrutality, and the people who used them obviously did so to denounce this alleged brutality. In retrospect, it would be fairer to also consider hashtags such as #supporthongkongpolice, #supporthkgovt, #supporthkpolice, etc. I will leave it to the reader to explore this.

Shortcomings of Rule-Based Sentiment Analysis

The rudimentary sentiment analysis that we did above using the VADER library from NLTK revealed plenty of false positives — upon closer inspection, random tweets rated as positive towards the government and police actually turned out to be either negative towards them or supportive of the protesters’ cause. Hence, we need to turn to deep learning techniques that would give us better and more reliable results.

It is not within the scope and aim of this project to cover the deep learning work. More work needs to be done to label a small dataset of tweets, so that it can be used for transfer learning with pre-trained models.

However, based on the tweets we have on hand, there has been overwhelming support for the protesters and their cause, together with public outcry over police brutality and misbehaviour. Any attempt at classifying the tweets into positive or negative sentiments would likely end up with a highly skewed distribution of negative sentiment towards the Hong Kong government and police force. Hence, it might not be worthwhile to proceed with predicting the sentiments with deep learning. In my opinion, the sentiment of the related Twitter tweets is largely negative.


Cleaning up and Preprocessing the Tweets

From here onwards, readers who are keen on the data cleaning flow for this project may continue through the remaining sections of this article.

Import Libraries and Dataset

In a separate Jupyter Notebook:

# Generic ones
import numpy as np
import pandas as pd
import os

# Word processing libraries
import re
from nltk.corpus import wordnet
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

# Widen the size of each cell
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

Each round of tweet scraping results in the creation of a .csv file. Read each .csv file into a dataframe first:

# read .csv files into Pandas dataframes first
tweets_1st = pd.read_csv(os.getcwd() + '/data/raw' + '/20191103_131218_sahkprotests_tweets.csv', engine='python')
..
..
tweets_15th = pd.read_csv(os.getcwd() + '/data/raw' + '/20191116_121136_sahkprotests_tweets.csv', engine='python')

# Check the shape of each dataframe:
print('Size of 1st set is:', tweets_1st.shape)

# You can also check out the summary statistics:
print(tweets_1st.info())

Concatenate all dataframes into a single dataframe:

# Concat the two dataset together:
data = pd.concat([tweets_1st, tweets_2nd, tweets_3rd, tweets_4th, tweets_5th, tweets_6th, tweets_7th, tweets_8th, tweets_9th, tweets_10th, tweets_11th, tweets_12th, tweets_13th, tweets_14th, tweets_15th], axis = 0)

print('Size of concatenated dataset is:', data.shape)

# Reset_index
data.reset_index(inplace = True, drop = True)
data.head()
print(data.info())
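
If all the raw files sit in the same folder, the fifteen separate read_csv calls can also be replaced with a small loop over a glob pattern (a sketch only; the folder and filename pattern are assumed to match the naming convention shown above):

import glob
import os
import pandas as pd

# Collect every scraped file matching the naming pattern under data/raw
csv_paths = sorted(glob.glob(os.path.join(os.getcwd(), 'data', 'raw', '*_sahkprotests_tweets.csv')))

# Read and concatenate them into a single dataframe
data = pd.concat((pd.read_csv(p, engine='python') for p in csv_paths), axis = 0)
data.reset_index(inplace = True, drop = True)
print('Size of concatenated dataset is:', data.shape)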

A snippet of what you will see in the dataframe:

Checking for Duplicated Entries and Removing Them

Since the scraping runs were performed close to one another, it is possible to scrape the same tweets more than once, as long as they fall within the 7-day search window from the search date. We remove these duplicated rows from our dataset.

# Let's drop duplicated rows:
print('Initial size of dataset before dropping duplicated rows:', data.shape)
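# Note: keep = False drops every copy of a duplicated row, rather than keeping one occurrence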
data.drop_duplicates(keep = False, inplace = True)

print('Current size of dataset after dropping duplicated rows, if any, is:', data.shape)

Initial size of dataset before dropping duplicated rows: (225003, 11)
Current size of dataset after dropping duplicated rows, if any, is: (218652, 11)

Remove Non-English Words/Tokens

Filtering out every non-English token could also remove words that legitimately appear in English-language tweets, such as names. It is therefore safer to strip out the Chinese characters specifically.

# Remove empty tweets
data.dropna(subset = ['text'], inplace = True)

# The unicode accounts for Chinese characters and punctuations.
def strip_chinese_words(string):
    # list of english words
    en_list = re.findall(u'[^\u4E00-\u9FA5\u3000-\u303F]', str(string))
    
    # Remove word from the list, if not english
    for c in string:
        if c not in en_list:
            string = string.replace(c, '')
    return string

# Apply strip_chinese_words(...) on the column 'text'
data['text'] = data['text'].apply(lambda x: strip_chinese_words(x))
data.head()
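
As a side note, the same stripping can be done with a single regex substitution over the matched Unicode range, which avoids the character-by-character loop (a sketch with equivalent intent, not the original implementation):

import re

def strip_chinese_words(string):
    # Delete any character in the CJK and CJK-punctuation ranges in one pass
    return re.sub(u'[\u4E00-\u9FA5\u3000-\u303F]', '', str(string))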

Extract Twitter Usernames Mentioned in Each Tweet

We want this useful information from each tweet because we can analyse who are the popular figures in the protest movement.

# Define function to sieve out @users in a tweet:
def mentioned_users(string):
    usernames = re.findall('@[^\s]+', string)
    return usernames

# Create a new column and apply the function on the column 'text'
data['mentioned_users'] = data['text'].apply(lambda x: mentioned_users(x))
data.head()

Main Text Cleaning and Preprocessing

With Chinese words and usernames removed and extracted from each text, we can now do the heavy lifting:

# Define Emoji_patterns
emoji_pattern = re.compile("["
         u"\U0001F600-\U0001F64F"  # emoticons
         u"\U0001F300-\U0001F5FF"  # symbols & pictographs
         u"\U0001F680-\U0001F6FF"  # transport & map symbols
         u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
         u"\U00002702-\U000027B0"
         u"\U000024C2-\U0001F251"
         "]+", flags=re.UNICODE)

# Define the function to implement POS tagging:
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Define the main function to clean text in various ways:
def clean_text(text):
    
    # Apply regex expressions first before converting string to list of tokens/words:
    # 1. remove @usernames
    text = re.sub('@[^\s]+', '', text)
    
    # 2. remove URLs
    text = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', '', text)
    
    # 3. remove hashtags entirely i.e. #hashtags
    text = re.sub(r'#([^\s]+)', '', text)
    
    # 4. remove emojis
    text = emoji_pattern.sub(r'', text)
    
    # 5. Convert text to lowercase
    text = text.lower()
    
    # 6. tokenise text and remove punctuation
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    
    # 7. remove numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    
    # 8. remove stop words
    stop = stopwords.words('english')
    text = [x for x in text if x not in stop]
    
    # 9. remove empty tokens
    text = [t for t in text if len(t) > 0]
    
    # 10. pos tag text and lemmatize text
    pos_tags = pos_tag(text)
    text = [WordNetLemmatizer().lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    
    # 11. remove words with only one letter
    text = [t for t in text if len(t) > 1]
    
    # join all
    text = " ".join(text)
    
    return(text)

# Apply function on the column 'text':
data['cleaned_text'] = data['text'].apply(lambda x: clean_text(x))
data.head()
# Check out the shape again and reset_index
print(data.shape)
data.reset_index(inplace = True, drop = True)

# Check out data.tail() to validate index has been reset
data.tail()

Process the Column ‘hashtags’

The data type of the column ‘hashtags’ is initially a string, so we need to convert it to a Python list.

# Import ast to convert a string representation of list to list
# The column 'hashtags' is affected
import ast

# Define a function to convert a string rep. of list to list
## Function should also handle NaN values after conversion
def strlist_to_list(text):
    
    # Remove NaN
    if pd.isnull(text) == True: # if true
        text = ''
    else:
        text = ast.literal_eval(text)
    
    return text

# Apply strlist_to_list(...) to the column 'hashtags'
# Note that doing so will return a list of dictionaries, where there will be one dictionary for each hashtag in a single tweet.
data['hashtags'] = data['hashtags'].apply(lambda x: strlist_to_list(x))

data.head()

Since each ‘hashtag’ entry contains a list of dictionaries, we need to loop through the list to extract each hashtag:

# Define a function to perform this extraction:
def extract_hashtags(hashtag_list):
    # argument:
    # hashtag_list - a list of dictionary(ies), each containing a hashtag
    
    # Create a list to store the hashtags
    hashtags = []
    
    # Loop through the list:
    for i in range(0, len(hashtag_list)):
        # extract the hashtag value using the key - 'text'
        # For our purposes, we can ignore the indices, which tell us the position of the hashtags in the string of tweet
        # lowercase the text as well
        hashtags.append(hashtag_list[i]['text'].lower())
        
    return hashtags

# Apply function on the column - data['hashtags']
data['hashtags'] = data['hashtags'].apply(lambda x: extract_hashtags(x))

# Check out the updated column 'hashtags'
print(data.head()['hashtags'])

Process the Column ‘location’

# Replace NaN (empty) values with n.a to indicate that the user did not state his location
# Define a function to handle this:
def remove_nan(text):
    if pd.isnull(text) == True: # entry is NaN
        text = 'n.a'
    else:
        # lowercase text for possible easy handling
        text = text.lower()
        
    return text

# Apply function on column - data['location']
data['location'] = data['location'].apply(lambda x: remove_nan(x))

# Check out the updated columns
print(data.head()['location'])

# Let's take a quick look at the value_counts()
data['location'].value_counts()

Unsurprisingly, most of the tweets are posted by users who are from or in Hong Kong. Since these are the self-reported locations of the users behind each tweet, it is still too early to determine the actual demographics. We will deal with this later.

Process the Column ‘acctdesc’

We clean up this column — the account descriptions of Twitter users — by removing NaN values and replacing them with the string ‘n.a’.

# Apply the function already defined above: remove_nan(...)
# Apply function on column - data['acctdesc']
data['acctdesc'] = data['acctdesc'].apply(lambda x: remove_nan(x))

# Check out the updated columns
print(data.head()['acctdesc'])

Feature Engineering — Rule-based Word Processing

So far, we have removed duplicated rows, extracted important information such as hashtags, mentioned users and users’ locations, and also cleaned up the tweets. In this section, we focus on rule-based word processing for our sentiment analysis. Exploratory data visualisation will be done later once we have all the ingredients.

Generate Sentiments from Tweets with NLTK Vader_Lexicon Library

We use the VADER lexicon from NLTK to generate a sentiment for each tweet. VADER uses a lexicon of words to determine which words in the tweet are positive or negative, and returns a set of four scores for the positivity, negativity and neutrality of the text, plus an overall compound score of how positive or negative the text is. We will refer to these as:

  1. Positivity — ‘pos’
  2. Negativity — ‘neg’
  3. Neutrality — ‘neu’
  4. Overall Score — ‘compound’
# Importing VADER from NLTK
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create a sid object called SentimentIntensityAnalyzer()
sid = SentimentIntensityAnalyzer()

# Apply polarity_score method of SentimentIntensityAnalyzer()
data['sentiment'] = data['cleaned_text'].apply(lambda x: sid.polarity_scores(x))

# Expand the dictionary of scores into separate columns ('neg', 'neu', 'pos', 'compound')
data = pd.concat([data.drop(['sentiment'], axis = 1), data['sentiment'].apply(pd.Series)], axis = 1)

Extract additional Features — no. of characters and no. of words in each tweet

# New column: number of characters in the cleaned tweet
data['numchars'] = data['cleaned_text'].apply(lambda x: len(x))

# New column: number of words in the cleaned tweet
data['numwords'] = data['cleaned_text'].apply(lambda x: len(x.split(" ")))

# Check the new columns:
data.tail(2)

Word Embeddings — Training Doc2Vec using Gensim

Word embeddings map the words in a text corpus to numerical vectors, such that words sharing similar contexts have similar vectors. They are learned with a shallow two-layer neural network that trains a matrix/tensor called the embedding matrix; taking the matrix product of the embedding matrix and the one-hot vector representation of a word in the corpus gives that word's embedding vector.

We will use Gensim — an open-source Python library — to generate doc2vec.

Note: doc2vec should be used over word2vec to obtain the vector representation of a ‘document’, in this case, an entire tweet. Word2vec will only give us the vector representation of a word in a tweet.

# Import the Gensim package
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(data["cleaned_text"].apply(lambda x: x.split(" ")))]

# Train a Doc2Vec model with our text data
model = Doc2Vec(documents, vector_size = 10, window = 2, min_count = 1, workers = 4)

# Transform each document into a vector data
doc2vec_df = data["cleaned_text"].apply(lambda x: model.infer_vector(x.split(" "))).apply(pd.Series)
doc2vec_df.columns = ["doc2vec_vector_" + str(x) for x in doc2vec_df.columns]
data = pd.concat([data, doc2vec_df], axis = 1)

# Check out the newly added columns:
data.tail(2)

Compute TF-IDF Columns

Next, we will compute the TF-IDF of the tweets using the sklearn library. TF-IDF stands for Term Frequency-Inverse Document Frequency, which reflects how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.

  1. Term Frequency — the number of times a term occurs in a document.
  2. Inverse Document Frequency — an inverse document frequency factor that diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

Since NLTK does not provide a TF-IDF vectoriser, we will use the TfidfVectorizer class from the sklearn library.

from sklearn.feature_extraction.text import TfidfVectorizer

# Call the function tfidfvectorizer
# min_df is the document frequency threshold for ignoring terms with a lower threshold.
# stop_words is the words to be removed from the corpus. We will check for stopwords again even though we had already performed it once previously.
tfidf = TfidfVectorizer(
    max_features = 100,
    min_df = 10,
    stop_words = 'english'
)

# Fit_transform the cleaned tweets (the corpus) using the tfidf object from above
tfidf_result = tfidf.fit_transform(data['cleaned_text']).toarray()

# Extract the frequencies and store them in a temporary dataframe
tfidf_df = pd.DataFrame(tfidf_result, columns = tfidf.get_feature_names())

# Rename the column names and index
tfidf_df.columns = ["word_" + str(x) for x in tfidf_df.columns]
tfidf_df.index = data.index

# Concatenate the two dataframes, 'data' and 'tfidf_df'
# Note: axis = 1 -> append the 'tfidf_df' columns as new columns of 'data'.
data = pd.concat([data, tfidf_df], axis = 1)

# Check out the new 'data' dataframe
data.tail(2)

Closing

I hope you have gained as much insight as I have. Feel free to leave a comment to share your thoughts or to correct me on any technical aspects or on my analysis of the data.

Thank you for your time in reading this lengthy article.

This article is a reproduction of the original by Griffin Leow. The support of local content creators is part of AI Singapore’s broader mission of building up the nation’s AI ecosystem. If you know of content worth sharing, email us at shareai-editor@aisingapore.org.

Griffin is a Junior Data Science Engineer who battles with large data sets on a daily basis.  He has been picking up new programming languages and technologies as he believes strongly in continuous learning to stay relevant. Previously, Griffin had also worked as a Business Intelligence Product Consultant. In his free time, he enjoys backpacking around the world and telling stories of his travel through photography.

Related Articles

Artificial Intelligence: 2020 and Beyond

In the past year, we have seen exciting developments in what Artificial Intelligence has to offer. In 2013, Artificial Intelligence started to dominate simple Atari games. Fast forward six years to today, and we see it dominating real-time strategy games like StarCraft. AlphaStar, created by Google’s DeepMind, was able to compete at the professional level in January this year and went on to achieve Grandmaster level by October. The improvement achieved in less than a year is amazing!

OpenAI, an organization devoted to developing Artificial General Intelligence (AGI), has managed to train computer agents to play hide-and-seek and, if you check out the YouTube video, you will be amazed at the strategies they could come up with, for instance using blocks to barricade themselves.

There are other areas that have come to the attention of the Artificial Intelligence community. I will be discussing them below to help the readers understand what they are.

Artificial General Intelligence soon?

One question I get asked a lot during any discussion with the community or general public is, “Are we going to have human-like intelligence soon?”

I have been doing research into AGI in recent years and have been following media reports on Artificial Intelligence for quite a while. The vision painted seems to be that we will be able to develop human-like intelligence soon and that we humans will be out of jobs in the very near future. My short answer is that I do not think we will be developing AGI that soon. My guesstimate is that I will not (sadly and unfortunately) be able to see it during my lifetime, i.e. in the next 40 years. There are still many areas that need to be developed before we get to AGI. For instance, knowledge representation and abstraction are still being developed and have not been integrated into the current “intelligence” that we have built.

The current level of Artificial Intelligence that scientists have been able to achieve is called “Artificial Narrow Intelligence”, or ANI. ANI is very good at performing a single well-defined task, which means that more and more such tasks can and will be automated.

Let me use an example to illustrate the difference between ANI and AGI. An agent with ANI can only make coffee in Kitchen A and in no other kitchen. An agent with AGI, however, can not only make coffee very well in Kitchen A, it is also able to make coffee in any other kitchen, using the learning it has acquired from making coffee in Kitchen A.

The interested reader can read my previous blog post on the different definitions here.

Jobs-wise, readers do not have to worry as our jobs usually consist of many tasks and the boring tasks usually have a chance of being automated. This means we can look forward to having higher value-added (and usually more interesting) tasks/projects to work on. Better productivity hopefully translates to better pay!

AI Ethics & Explainability

With the proliferation of AI into our business processes and more decision-making power moving over to AI, we are starting to see AI having more influence and greater impact on consumers and clients. With that in mind, businesses must start thinking about how their use of AI can impact their customers, both positively and adversely. Customers, especially the more disgruntled ones, are going to start asking more questions about how the decisions affecting them were made. As such, business organizations need to understand how decisions are made inside the machine learning models they deploy in order to be able to provide that explanation. This is very challenging given that biases can creep into many aspects of the model training process – during the data collection, model training and validation stages, etc.

Moreover, most of the machine learning models in use, such as deep learning and support vector machines, are not built to show their inner workings with full transparency. For instance, most instructors/lecturers tend to call neural networks black boxes to avoid the tedium of explaining the mathematics behind them. Even with an understanding of the mathematics, neural networks are not transparent enough to reveal how they make their decisions.

As the general public gets more educated about AI, questions on how decisions are made will come in fast and furious. To avoid a PR disaster or damage to brand reputation, businesses should start looking at the use of data and machine learning in their processes, understand its impact and avoid any foreseeable biases in their data and machine learning process.

Data Privacy & Security

Many businesses have jumped onto the data bandwagon. With tons of data being collected to develop AI models, maintaining privacy and security will get more complicated and, at the same time, be a challenge that cannot be avoided. The good news is that technologies such as differential privacy and federated learning are being developed to protect individual privacy and security by restricting data exchange between different parties while still allowing reliable machine learning models to be trained. Companies can start looking at these technologies to overcome their data privacy and security challenges. Such technologies usually draw on research in both the machine learning and cryptography fields, so talent with a good blend of multidisciplinary skills will help to better leverage their power.

Talent Development

When it comes to the research front of Artificial Intelligence, I believe research is starting to plateau and breakthroughs might become fewer going forward. Even the Head of AI at Facebook has said as much. Having said that, I believe that many businesses can still take advantage of Artificial Intelligence, given the symbiotic relationship between technology improvement and the data collected. To take advantage of it, both the quantity and the quality of talent are going to be important. Given the current circumstances, where many countries believe there is a strong need to dominate (or at least be very good at) Artificial Intelligence to attract investments and move their economies forward, there will be very high demand for AI talent, and this is especially true for very experienced data/AI scientists.

Businesses, both small and large, that are serious about building Artificial Intelligence capabilities should start planning out their talent roadmap: how to identify, attract and retain the best talent. It will not be easy, given that businesses may have to compete with technology firms and then financial institutions for talent, but I believe they can still attract their fair share with a good roadmap, outreach, project types and management. All in all, moving into 2020 and beyond, the war for AI talent will get more intense.

Conclusion

For the next few years, with the increased adoption of Artificial Intelligence (i.e. ANI), there is going to be a stronger emphasis on understanding how decisions are made by Artificial Intelligence (i.e. ethics, transparency and explainability), stronger measures needed in privacy and security, and a VERY intense war for the best talent.

These are what I have observed so far and I will be more than happy to discuss these different areas further.

Promoting Ethical AI for Human Well-being

On the Saturday morning of December 7, AI Singapore joined a few other AI communities across the Pacific Rim to discuss a number of topics within the complex subject of AI ethics with thought leader Andrew Ng. Connected via video conferencing, we got to hear about Andrew’s views directly from him on this increasingly important area within the unfolding narrative of AI’s permeation into our everyday lives.

Organised by deeplearning.ai, the AI teaching project founded by Andrew, communities in Hong Kong, Tokyo, Manila and Singapore each discussed, in their own locations, specific topics relevant to AI ethics before coming together to share their insights. For Singapore, we dived into how we can ensure that the AI systems we develop promote the well-being of humans.

Since AI Singapore was established 2 years ago, we have grown to an engineering strength of about 70 and have worked on close to 40 projects (known as 100Es) from start-ups to MNCs, both private and public, across various verticals like healthcare, finance, engineering, IT and media. In each and every project we handled, we have always been mindful of not just the business benefits to organisations, but also the wider impact on individuals and society which AI potentially has.

With the experience accumulated from building AI systems across many domains, we have produced an internal document titled “AI Ethics & Governance Protocols”. Among the 12 protocols defined in the document, our AI engineers have chosen the top 3 to share with Andrew and the global community. Read more about them below.

Personal Privacy

Protection of data containing personal attributes is of utmost importance in any project. Many models we create draw upon such data, from models in healthcare to finance. As we work with organisations, we often have to play the role of gatekeepers to ensure that data provided by them is properly anonymised and does not contain personally identifiable information.

Creativity and Intellect

AI systems are built from many simpler parts, very often from open source code. While much attention is focused on the consumers of AI systems, we believe that producers of such systems, directly or indirectly, should not be neglected. It is important to acknowledge their contributions and respect their terms and conditions. This will encourage greater creativity in the AI ecosystem and by extension greater good in what AI can bring.

Data Integrity and Inclusiveness

Training datasets should be sufficient in amount, unbiased, labelled, machine readable and accessible to a reasonable degree. The benefits of AI should also not favour or disadvantage any group of individuals. This is especially so in Singapore, where we have a very diverse population. Datasets should be rigorously questioned for representativeness and biases. For example, we have developed wound diagnosis models, and being able to classify wounds across different racial types with equal accuracy has always been a target of the highest priority.


AI Singapore is pleased to have been able to do our part to contribute to the ongoing conversation on AI ethics at a global level. A common lament about the event is that it was too short and we had to end the discussions on time. However, hearing from Andrew was an unforgettable experience and we are also gratified to know that other AI communities in the world are very much involved in this subject. This certainly bodes well for our AI-driven future. You can read more about the event on deeplearning.ai’s blog here.

Modeling and Output Layers in BiDAF — an Illustrated Guide with Minions

BiDAF is a popular machine learning model for Question and Answering tasks. This article explains the modeling and output layers of BiDAF with the help of some cute Minions.

(By Meraldo Antonio)

This article is the last in a series of four articles that aim to illustrate the working of Bi-Directional Attention Flow (BiDAF), a popular machine learning model for question and answering (Q&A).

To recap, BiDAF is a closed-domain, extractive Q&A model. This means that, to answer a Query, BiDAF needs to consult an accompanying text that contains the information needed to answer the Query. This accompanying text is called the Context. BiDAF works by extracting a substring of the Context that best answers the query — this is what we refer to as the Answer to the Query. I intentionally capitalize the words Query, Context and Answer to signal that I am using them in their specialized technical capacities.

An example of Context, Query and Answer.

A quick summary of the three preceding articles is as follows:

  • Part 1 of the series provides a high-level overview of BiDAF.
  • Part 2 explains how BiDAF uses 3 embedding algorithms to get the vector representations of words in the Context and the Query.
  • Part 3 explores BiDAF’s attention mechanism that combines the information from the Context and the Query.

The output of the aforementioned attention step is a giant matrix called G. G is an 8d-by-T matrix that encodes the Query-aware representations of Context words. G is the input to the modeling layer, which will be the focus of this article.

What Have We Been Doing? What Does G Actually Represent? Minions to the Rescue!

Ok, so I know we’ve been through a lot of steps in the past three articles. It is extremely easy to get lost in the myriad of symbols and equations, especially considering that the choices of symbols in the BiDAF paper aren’t that “user friendly.” I mean, do you even remember what each of H, U, Ĥ and Ũ represents?

Hence, let’s now step back and try to get the intuition behind all these matrix operations we have been doing so far.

Practically, all the previous steps can be broken down into two collections of steps: the embedding steps and the attention steps. As I mentioned above, the result of all these steps is an 8d-by-T matrix called G.

An example of G can be seen below. Each column of G is an 8d-by-1 vector representation of a word in the Context.

An example of G. The length of the matrix, T, equals the number of words in the Context (9 in this example). Its height is 8d, where d is a number that we preset in the word embedding and character embedding steps.

Let’s now play a little game that (hopefully) can help you understand all the mathematical mumbo-jumbo in the previous articles. Specifically, let’s think of the words in the Context as an ordered bunch of Minions.

Think of our Context as a bunch of Minions, with each Context word corresponding to one Minion.

Each of our Minions has a brain in which he can store some information. Right now, our Minions’ brains are already pretty cramped. The current brain content of each Minion is equivalent to the 8d-by-1 column vector of the Context word that the Minion represents. Here I present the brain scan of the “Singapore” Minion:

The Minions’ brains haven’t always been this full! In fact, when they came into being, their brains were pretty much empty. Let’s now go back in time and think about what were the “lessons” that the Minions went through to acquire their current state of knowledge.

The first two lessons the Minions had were “Word Embedding” and “Character Embedding.” During these lessons, the Minions learned about their own identities. The teacher of the “Word Embedding” class, Prof. GloVe, taught the Minions basic information about their identities. The “Character Embedding” class, on the other hand, was an anatomy class in which the Minions gained an understanding of their body structure.

Here is the brain scan of the “Singapore” Minion after these two lessons.

The “Singapore” Minion understands his identity after attending the “Word Embedding” and “Character Embedding” lessons.

Right after, the Minions moved on and attended the “Contextual Embedding” lesson. This lesson is a conversational lesson during which the Minions had to talk to one another through a messenger app called bi-LS™. The bi-LS™-facilitated convo allows the Minions to learn each other’s identities—which they learned in the previous two lessons. Pretty neat, huh?

Two Minions having a fun conversation through bi-LS™, sharing information about themselves.

I took another MRI scan of the “Singapore” Minion right after the “Contextual Embedding” class. As you can see, now our little guy knows a bit more stuff!

Now “Singapore” knows both his and his neighbors’ identities!

Our Minions were happily studying when suddenly a man barged into their school😱 It turns out that his name is Mr. Query and he is a journalist. Here he is:

The inquisitive Mr. Query. He has an urgent question —”Where is Singapore situated” — and he knows some of our Minions hold relevant information for this question.

Mr. Query urgently needs to collect some information for an article he is writing. Specifically, he wants to know “where is Singapore situated.” Mr. Query knows that some of our Minions hold this information in their brains.

Our Minions, helpful as they are, want to help Mr. Query out. To do so, they will need to select several members of their team to meet with Mr. Query and deliver the information he’s seeking. This bunch of Minions that have relevant information for Mr. Query and will be dispatched to him is known as the Answer Gang.

The Answer Gang, which collectively holds the answer to Mr. Query’s question. Only relevant Minions can join the Answer Gang!

Now, our Minions have a task to do — they need to collectively decide who should and shouldn’t join the Answer Gang. They need to be careful when doing so! If they leave out too many Minions that should have been included in the Answer Gang, Mr. Query won’t get all the information he needs. This situation is called Low Recall and Mr. Query hates that.

On the other hand, if too many unnecessary Minions join the Answer Gang, Mr. Query will be inundated with superfluous information. He calls such a situation Low Precision and he doesn’t like that either! Mr. Query is known to have some anger management issues 👺 so it’s in our Minions’ best interest to supply Mr. Query with just the right amount of information.

So how do the Minions know which of them should join the Answer Gang?

The answer to this is by organizing several meet-up sessions that are collectively called “Attention.” During these sessions, each Minion gets to talk to Mr. Query separately and understand his needs. In other words, the Attention sessions allow the Minions to measure their importance to Mr. Query’s question.

This is the MRI scan of the “Singapore” Minion’s brain as he saunters away from the Attention sessions. This is equivalent to the first brain scan image I showed.

Singapore’s current brain content. He knows quite a bit— but he is still missing one thing!

As we can see, our Minions’ brains are now pretty full. With the Minions’ current state of knowledge, are they now in a position to start choosing the members of the Answer Gang? Nope, not quite! They are still missing one key piece of information. Each of our Minions knows his own importance to Mr. Query. However, before making this important decision, they will also need to be aware of each other’s importance to Mr. Query.

As you might’ve guessed, this implies that the Minions have to talk to each other for the second time! And now you know that this conversation is done through the bi-LS™ app.

The Minions during the modeling step meeting. Here, they talk to each other through bi-LS™ and share their relative importance to Mr. Query.

This bi-LS™-facilitated conversation is also known as the ‘modeling step’ and is the focus of our current article. Let’s now learn this step in detail!

Step 10. Modeling Layer

Okay, let’s leave our Minions for a while and get back to symbols and equations, shall we? It’s not that complicated, I promise!

The modeling layer is relatively simple. It consists of two layers of bi-LSTM. As mentioned above, the input to the modeling layer is G. The first bi-LSTM layer converts G into a 2d-by-T matrix called M1.

M1 then acts as an input to the second bi-LSTM layer, which converts it into another 2d-by-T matrix called M2.

The formation of M1 and M2 from G is illustrated below.

Step 10. In the modeling layer, G is passed through two bi-LSTM layers to form M1 and M2.

M1 and M2 are yet another matrix representation of Context words. The difference between M1 and M2 and the previous representations of Context words is that M1 and M2 have embedded in them information about the entire Context paragraph as well as the Query.
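
For readers who prefer code to diagrams, here is a minimal PyTorch sketch of the shape bookkeeping described above. The values of d and T and the random G tensor are placeholders; this is not the authors’ implementation, just an illustration of how two bi-LSTM layers turn an 8d-by-T input into two 2d-by-T outputs.

import torch
import torch.nn as nn

d, T = 100, 9                      # hidden size and Context length (placeholder values)
G = torch.randn(1, T, 8 * d)       # (batch, T, 8d): Query-aware Context representation

# Two stacked bi-LSTM layers; each direction has hidden size d, so outputs are 2d wide
modeling_lstm1 = nn.LSTM(input_size=8 * d, hidden_size=d, bidirectional=True, batch_first=True)
modeling_lstm2 = nn.LSTM(input_size=2 * d, hidden_size=d, bidirectional=True, batch_first=True)

M1, _ = modeling_lstm1(G)          # (1, T, 2d)
M2, _ = modeling_lstm2(M1)         # (1, T, 2d)
print(M1.shape, M2.shape)          # torch.Size([1, 9, 200]) torch.Size([1, 9, 200])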

In Minion-speak, this means that our Minions now have all the information they need to make the decision about who should be in the Answer Gang.

The “Singapore” guy now has all he needs to decide if he should join the Answer Gang.

Step 11. Output Layer

Okay, now we’ve reached the finale! Just one more step and then we’re done!

For each word in the Context, we have at our disposal two numeric vectors that encode the word’s relevance to the Query. That’s awesome! The very last thing we need is to convert these numeric vectors into two probability values so that we can compare the Query-relevance of all Context words. And this is exactly what the output layer does.

In the output layer, M1 and M2 are first vertically concatenated with G to form [G; M1] and [G; M2]. Both [G; M1] and [G; M2] have a dimension of 10d-by-T.

We then obtain p1, the probability distribution of the start index over the entire Context, by the following steps:

Similarly, we obtain p2, the probability distribution of the end index, by the following steps:

The steps to get p1 and p2 are depicted in the diagram below:

Step 11. The output layer, which converts M1 and M2 into two vectors of probabilities, p1 and p2.
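
In the BiDAF paper, p1 and p2 are obtained by applying trainable weight vectors to [G; M1] and [G; M2] and taking a softmax over the T Context positions. Below is a rough PyTorch sketch of that computation; the tensors and weight vectors are random placeholders, not the trained model.

import torch
import torch.nn as nn

d, T = 100, 9
G  = torch.randn(1, T, 8 * d)      # output of the attention layer (placeholder values)
M1 = torch.randn(1, T, 2 * d)      # output of the first modeling bi-LSTM
M2 = torch.randn(1, T, 2 * d)      # output of the second modeling bi-LSTM

# Vertically concatenate along the feature dimension: [G; M1] and [G; M2] are 10d wide
GM1 = torch.cat([G, M1], dim=-1)   # (1, T, 10d)
GM2 = torch.cat([G, M2], dim=-1)   # (1, T, 10d)

# Trainable 10d weight vectors (standing in for the paper's w_p1 and w_p2)
w_p1 = nn.Linear(10 * d, 1, bias=False)
w_p2 = nn.Linear(10 * d, 1, bias=False)

# Softmax over the T Context positions gives the start/end probability distributions
p1 = torch.softmax(w_p1(GM1).squeeze(-1), dim=-1)   # (1, T)
p2 = torch.softmax(w_p2(GM2).squeeze(-1), dim=-1)   # (1, T)
print(p1.sum(), p2.sum())                           # both sum to (approximately) 1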

p1 and p2 are then used to find the best Answer span. The best Answer span is simply a substring of the Context with the highest span score. The span score, in turn, is simply the product of the p1 score of the first word in that span and the p2 score of the last word in the span. We then return the span with the highest span score as our Answer.

An example will make this clear. As you know, we are currently dealing with the following Query/Context pair:

  • Context: “Singapore is a small country located in Southeast Asia.” (T = 9)
  • Query: “Where is Singapore situated?” (J = 4)

After running this Query/Context pair through BiDAF, we obtain two probability vectors, p1 and p2.

Each word in the Context is associated with one p1 value and one p2 value. p1 values indicate the probability of a word being the start word of the Answer span. Below are the p1 values for our example:

We see that the model thinks that the most probable start word for our Answer span is “Southeast.”

p2 values indicate the probability of a word being the last word of the Answer span. Below are the p2 values for our example:

We see that our model is very sure, with almost 100% certainty, that the most probable end word for our Answer span is “Asia.”

If, in the original Context, the word with the highest p1 comes before the word with the highest p2, then we already have our best Answer span — it will simply be the one that begins with the former and ends with the latter. This is the case in our example. As such, the Answer returned by the model will simply be “Southeast Asia.”

That’s it, ladies and gentlemen — finally after 11 long steps we obtain the Answer to our Query!

Here is Mr. Query with “Southeast” and “Asia”, both of whom have been selected to join the Answer Gang. It turns out that the information provided by “Southeast” and “Asia” is just what Mr. Query needs! Mr. Query is happy🎊

Okay, one caveat before we end this series. In the hypothetical case that the Context word with the highest p1 comes after the Context word with the highest p2, we still have a bit of work to do. In this case, we’d need to generate all possible answer spans and calculate the span score for each of them. Here are some examples of possible answer spans for our Query/Context pair:

  • Possible answer span: “Singapore” ; span score: 0.0000031
  • Possible answer span: “Singapore is” ; span score: 0.00000006
  • Possible answer span: “Singapore is a” ; span score: 0.0000000026
  • Possible answer span: “Singapore is a small” ; span score: 0.0000316

We then take the span with the highest span score to be our Answer.
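
To make this span-selection rule concrete, here is a minimal NumPy sketch. The p1 and p2 values are made up for illustration and are not the model’s actual outputs; the logic scores every valid span by p1[start] × p2[end] and keeps the best one.

import numpy as np

context = "Singapore is a small country located in Southeast Asia".split()

# Made-up start/end probabilities for the 9 Context words (illustrative only)
p1 = np.array([0.01, 0.01, 0.01, 0.02, 0.02, 0.02, 0.01, 0.85, 0.05])
p2 = np.array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.92])

# Score every valid span (start <= end) and keep the one with the highest span score
best_score, best_span = -1.0, (0, 0)
for start in range(len(context)):
    for end in range(start, len(context)):
        score = p1[start] * p2[end]
        if score > best_score:
            best_score, best_span = score, (start, end)

print(" ".join(context[best_span[0]: best_span[1] + 1]))   # "Southeast Asia"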


So that was it — a detailed illustration of each step in BiDAF, from start to finish (sprinkled with a healthy dose of Minion-joy). I hope that this series has helped you in understanding this fascinating NLP model!

If you have any questions/comments about the article or would like to reach out to me, feel free to do so either through LinkedIn or via email at meraldo.antonio AT gmail DOT com.


Glossary

  • Context : the accompanying text to a query that contains an answer to that query
  • Query : the question to which the model is supposed to give an answer
  • Answer : a substring of the Context that contains information that can answer the query. This substring is to be extracted out by the model.
  • T : the number of words in the Context
  • G : a big, 8d-by-T matrix that contains Query-aware Context representations. G is the input to the modeling layer.
  • M1 : a 2d-by-T matrix obtained by passing G through a bi-LSTM. M1 contains vector representations of Context words that carry information about the entire Context paragraph as well as the Query.
  • M2 : a 2d-by-T matrix obtained by passing M1 through a bi-LSTM. M2, just like M1, contains vector representations of Context words that carry information about the entire Context paragraph as well as the Query.
  • p1 : a probability vector with length T. Each Context word has its own p1 value, which indicates the probability of the word being the first word in the Answer span.
  • p2 : a probability vector with length T. Each Context word has its own p2 value, which indicates the probability of the word being the last word in the Answer span.

References

[1] Bi-Directional Attention Flow for Machine Comprehension (Minjoon Seo et al., 2017)

This article is a reproduction of the original by Meraldo Antonio. At the time of publication, Meraldo was doing his AI apprenticeship in the AI Apprenticeship Programme (AIAP)™. He has since graduated from the programme.

DevFest Singapore 2019 Takeaways

Last Saturday, I attended DevFest Singapore 2019 at the Google Developer Space. Among the twenty-odd technical sessions in the lineup, I would like to share two which I found particularly interesting from an AI perspective.

Firstly, Thanh Hien May from Google tackled the difficult (to me!) subject of fairness in machine learning model development and discussed how biases might be detected. A large part of her sharing revolved around the appropriately named What-If Tool (WIT). First released in September 2018, WIT offers a way to probe the behaviour of machine learning models in a visual way with a minimal need to code. Model transparency is a starting point towards understanding fairness and bias and I am curious to learn more about it.

Thanh Hien May from Google.

On a more fun note, Preston Lim and Tan Kai Wei from GovTech introduced the deep learning model they developed to apply vivid colours to vintage black-and-white photographs. Trained on images from our local Singapore context, the model can be readily tried out by the public through the ColouriseSG website. For more details on how the model was developed and deployed, there is an excellent blog post by the team which I encourage the reader to read.

Tan Kai Wei from GovTech.

I decided to give the colourisation model a spin by taking some colour photographs, converting them to B&W and then running them through the colourisation process to see if I could get back the original. I stacked them side-by-side (original on left) to make the comparison.

Source : https://roots.sg/Content/Places/national-monuments/telok-ayer-chinese-methodist-church

Interestingly, the sky and trees in the picture above came out looking natural. Clearly, the model has learned to recognise a cloudy sky and natural foliage. However, the roof tiles of the building did not come out orange, which is what I would expect to be the most common colour. Next, I decided to try it on Singapore’s favourite fruit.

The greenness of the king of fruits is fairly discernible and it makes me wonder how many durian images the model has been trained on. The skin colour of the people also appears natural for the most part. A big challenge in this particular picture is the colour of the 福 (“blessing”) decor on the wall. This should always be in bright red, with no exceptions. I think perhaps more data from Chinatown would be required. 🙂

Our deep learning model performs best on higher resolution images that prominently feature human subjects and natural scenery.

As the team blog post makes clear, the model works very well in certain cases and falls short in others. Hopefully, the team can continue to improve upon it as it has proven popular even among overseas users. Do check it out yourself!

Quick Win from Data – Visualization

In the course of my work, many people have asked me the following question, in one form or another.

“How can I score quick wins from my existing data?”

It is a reasonable question, given that many organizations have spent tons of money on infrastructure and collected much data. My answer is :

“Data Visualization.”

Many of us, unfortunately, have come across bad visualizations that make us cringe: garish colours, over-detailed graphics with no focus, and visualizations that are not self-explanatory and waste the audience’s time as they try to understand them. If you think about it, we have come across lots of bad visualizations, while good visualizations are few and far between.

Why Visualization? Why is it important?

According to Wikipedia, the definition of data visualization is the “graphic representation of data”. “Well, we have been doing pie charts, line charts and bar charts for a long while already… why is it important NOW?”

Data visualization is more important now because of the following trends. Firstly, we are collecting more data. Data which used to fit easily into Excel has now expanded to millions of rows. Secondly, data is getting more granular. Ever been to a convenience store these days to buy milk? In the past, you probably had to make a simple decision between full-cream and low-fat milk. Now, besides full-cream or low-fat, you have to think about whether you want UHT, high-calcium and so on.

As businesses rely more and more on data to make better decisions, sieving out the signals from the data is going to get more important, and good visualization will be an important tool for doing so.

How to Build a Good Visualization?

All of us can easily evaluate whether a visualization is good or bad. But how do we build a good visualization? I have benchmarks that I set to help me with that. Let me share them briefly here.

A good visualization is one which, if you come back to it a few days after building it, you can still understand VERY EASILY what information you were trying to present. How many of us have seen presenters who looked at their own visualization and said, “Hmmm… what is it that I want to share here?” If the presenter, who had seen the data and created the visualization, could not understand what they wanted to present, the audience definitely could not be expected to understand the points the visualization was intended to accentuate.

Secondly, if your audience is asking more questions about how to read the visualization rather than asking for more information on the points presented, then you have a bad visualization on hand. For instance,

“What does the blue color represent?”

versus

“What are the assumptions made when you calculate the average?”

The former is an attempt to read the visualization while the latter is an attempt to explore the insights presented further.

From conducting many training sessions and assessing visualization projects, I have distilled the following points for building better visualizations.

  • Design Thinking – “How can I present more dimensions without crowding the visualization with more details?”
  • Empathy – “How will my audience see my visualization?”, “What stands out from the visualization (i.e. what features of the visualization capture my attention immediately)?”
  • Planning & Experimentation – “Who are my target audience?”, “Will adding a reference line or colour show which category is above average?”
  • Storytelling – “After I have shown this visualization, what would my audience like to know next? Which rabbit hole should I bring them towards? What can add value to my audience?”

A good visualization tells a main story point and contextualizes it. It requires a lot of design thinking, and good design thinking comes from understanding your audience.

Conclusion

Creating visualizations that can continually deliver the “Aha” moment is never easy; it comes with practice and the willingness to take in feedback. Having said that, this is still the route to scoring quick wins from your existing data. I strongly encourage readers to start their learning journey in data visualization and to get better at it. To conclude, I would like to share a quote here.

Good visualization requires ten minutes of planning and creating but ten seconds to understand.

Bad visualization requires ten seconds of planning and creating but ten minutes to understand.
