Want an AI Team? Don’t Recruit. Grow your own!

If you want to build your AI team, unless you have the deep pockets of Google or Facebook, you need to find more creative ways to build up that AI/ML engineering talent.

AI Singapore is a Singapore government-funded programme hosted by the National University of Singapore, which also means the salaries we can pay our AI engineering talent are constrained. But this has not stopped us from building one of the largest AI engineering teams in Singapore, where more than 90% of our staff are Singaporeans. The other 10% are talented foreigners whom we have brought in to fill some of the talent gaps.

Lifelong Learners

We noticed a trend of professionals re-skilling and learning AI, ML, Python, R, Big Data, Cloud and so on by themselves. They are passionate, self-directed learners. And most of them do not have a computer science background! They come from Engineering, Business, Finance, Physics, Chemistry, Biology, Mathematics, Statistics and Economics. What they have in common is a love of learning, curiosity, a love of data, and enjoyment in being challenged by difficult business or engineering problems.

Human Resource departments totally missed out on them. They continued to recruit for academic degrees and computer science disciplines, specifying the exact same criteria that the big AI companies use: “2 – 5 or 10 years of experience in Computing or AI or Machine Learning”. And then companies complained that they could not find the talent!

Skills, skills and skills.

In AI Singapore – we recruit based on skills. We focus on the skills we expect you to have, not what you studied 1, 5 or 10 years ago. It does not matter.

In the last 2 years, our full-time AI engineering staff strength grew from 0 to 21 with 19 Singaporeans! This team is complemented by an additional 36-50 Singaporean AI apprentices. That is a total of 60-70 Singaporean AI engineers developing AI/ML applications for companies and agencies in Singapore. We are currently working on more than 15 AI projects from healthcare, finance, engineering, IT and media.


These AI Apprentices are from our AI Apprenticeship Programme (AIAP)™, a 9-month full-time apprenticeship programme where we bring on board carefully selected candidates (based on skills alone) who are mostly self-taught AI/ML/Python enthusiasts. We provide them with an environment to continue deepening their skills and to work on a real-world AI/ML problem statement. Each of these AI problem statements from our 100Experiments programme is worth $360,000 – $500,000, with AI Singapore co-funding 50% and the company funding the other 50% of the costs. The outcome after 9 months is an AI minimum viable product which is deployed at the company site. The company is also encouraged to hire the apprentices at the end of the 9 months.

You may say 9 months is a very long time to wait for an apprentice to be trained. Yes, but we run 3 batches of 25 apprentices per year with a gap of 4 months between batches. So, after the initial 9 months, we have potential new hires every 4 months. These are candidates whom we have worked with for 9 months and know well, and who, if hired by AI Singapore, can start work immediately! No ramp-up time is required, as they will already have gotten into the rhythm of our AI engineering practices and work culture.

The rest of the apprentices, whom AI Singapore is not able to hire (even when we make them an offer), are released to the industry. Nearly all of them (AIAP™ batches #1 and #2) are working in an AI/ML role in Singapore today.

We are now interviewing for our fifth batch of 25 AI Apprentices, who are expected to start in Feb 2020. We continue to refine our AIAP™ process, both internally and externally. To ensure a steady stream of quality AI Apprentice candidates, we also introduced an AI Certification initiative in October 2019, with a structured 12-month AI/ML learning journey to help guide and incentivise self-directed learners. We hope some of them will join the AIAP™ once they complete the 12-month learning journey.

Grow your own timber

In the last two and a half years at AI Singapore, I must have met more than 300 companies keen to do AI. Some still have the same complaint after a year: they cannot find Singaporean AI talent. I say to them, “grow your own timber”.

AI and Curve Fitting

This article from last year popped up in my newsfeed recently. It contains a discussion on whether AI systems today display true intelligence. I would like to focus on the following quote from computer scientist Prof Judea Pearl mentioned in the article.

As much as I look into what’s being done with deep learning, I see they’re all stuck there on the level of associations. Curve fitting. That sounds like sacrilege, to say that all the impressive achievements of deep learning amount to just fitting a curve to data. From the point of view of the mathematical hierarchy, no matter how skillfully you manipulate the data and what you read into the data when you manipulate it, it’s still a curve-fitting exercise, albeit complex and nontrivial.

To provide more context, Prof Pearl said the above during an interview in response to a comment on the public excitement over the possibilities in AI. He was actually very impressed that curve fitting could solve so many practical problems. So, what exactly is the “curve fitting” he was talking about?

It turns out that many problems in the real world can be represented as spaces. Spaces which are populated by data points which represent objects of interest in the real world. Take the rather simple examples illustrated below.

The first diagram is a classification model. It predicts healthy (blue) and diseased (red) specimens based on two quantitative characteristics – gene 1 and gene 2.

The second diagram is a regression model. That is just a jargon-y way of saying that instead of predicting a category, the model predicts a numerical value – in this case the salary of an individual – based on work experience in years.

In each case, the model is represented by a line (the eponymous “curve”). How good the model is will be determined by how well the curve is fitted. For the classification model, any specimen falling above the line is deemed diseased. For the regression model, the predicted salary is given by drawing a vertical line from the number of years in experience to the model line and reading the value off the Salary axis.

That, in very simple terms, is what Prof Pearl was talking about when he said “curve fitting”. It lies at the heart of many problems that have been solved by AI, like spam detection and object classification.
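To make the regression case concrete, here is a minimal sketch of fitting a straight line to salary data with ordinary least squares. The data points and the names `fit_line` and `predict_salary` are made up for illustration; they are not taken from the diagrams above.

```python
# Toy data, invented for this example: years of experience and monthly salary.
years = [1, 2, 3, 4, 5]
salary = [3000, 3500, 4100, 4400, 5000]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, salary)


def predict_salary(years_of_experience):
    """Read the predicted salary off the fitted line."""
    return slope * years_of_experience + intercept


print(predict_salary(6))
```

Calling `predict_salary` is the code equivalent of drawing the vertical line from the experience axis up to the model line and reading off the salary.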

Represent, Transform, Optimise

Of course, there are a lot more things going on in developing an AI model. For starters, it is not always immediately clear what characteristics of the real world (which can number in the hundreds and thousands) should be considered. That is what feature selection is concerned with. The characteristics must also be represented in numerical form (quantified). Often, there is even a need to perform mathematical transformations on the original data before a good curve can be fitted. This is called feature engineering.

All machine learning algorithms consist of automatically finding such transformations that turn data into more useful representations for a given task.

– Deep Learning with Python by François Chollet

Finally, there must be a way to measure the “goodness-of-fit” of the curve because AI model development is basically an iterative optimisation problem where the curve becomes increasingly better at performing the particular task, be it classification or regression.
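As a rough illustration of this iterative optimisation, the sketch below fits a line by gradient descent, using mean squared error as the goodness-of-fit measure. The data, the learning rate and the number of steps are arbitrary choices for this example, and real projects would normally rely on a library rather than a hand-written loop.

```python
# Toy data generated from y = 2x + 1, so the best line is known in advance.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mean_squared_error(w, b):
    """Goodness-of-fit measure: lower means the line fits the data better."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0          # start from a poor guess
learning_rate = 0.05     # assumed step size, small enough to converge here
losses = []

for step in range(2000):
    losses.append(mean_squared_error(w, b))
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Move the line a small step in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, losses[0], losses[-1])
```

Each pass through the loop nudges the curve towards a better fit, which is the sense in which the model "becomes increasingly better" at its task.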

In conclusion, for all the apparent magic of present-day AI, such as facial recognition, it is basically a number-crunching exercise. It is probably not the same thing as the intelligence displayed by human beings, but it is here to stay and is finding an increasing number of use cases in our lives.

(To get a formal introduction to AI, sign up for the AI for Everyone and AI for Industry programmes!)

Supporting Our Healthcare Workers and Kidney Disease Patients

AI Singapore and RenalTeam were pleased to be able to share with the public the collaborative work they have done in the area of renal health on November 21, 2019. Founded by Mr Chan Wai Chuen in 2012, RenalTeam is an outpatient kidney dialysis service provider with operations in Singapore, Malaysia and Indonesia.

Since 2014, the healthcare provider has been amassing data captured from the treatment of its patients. Kidney patients are often at risk of hospitalisation for various reasons. Being able to dynamically and accurately identify high-risk patients will enable limited resources to be channeled to ensure quality care. This is a significant challenge for any medical team. Would an AI model be able to provide additional support in this area? With inspiration from the 2018 HIMSS conference, Wai Chuen brought the company into the 100E programme offered by AI Singapore to get started on the AI journey.

Starting in January this year, Mr Lim Tern Poh, then an apprentice in the AI Apprenticeship Programme (AIAP)™ and now a senior AI engineer at AI Singapore, took on the challenge of developing an imbalanced-class model with a predictive window of one week. Starting with no medical knowledge, he and his teammates quickly got up to speed with scoping out the problem, exploring the data, selecting the features and, finally, training the model. In the span of seven months, they succeeded in delivering a model whose performance is comparable with that of RenalTeam's medical team.

The audience consisted of a mix of healthcare and technology workers. Interesting questions were asked, such as what course of action to take when human and machine arrive at different conclusions, and how the model integrates into the everyday routines of the medical team. With kidney disease on the rise in the nation, AI Singapore is proud to have done our part in supporting our healthcare workers and improving the quality of life of citizens.

Predicting Hospitalisation of Kidney Dialysis Patients with AI

Under AI Singapore (AISG)’s 100 Experiments Programme, my fellow apprentice and I were assigned to work with a regional kidney dialysis company to develop an AI model that predicts the hospitalisation of patients. This is a key component of our AI Apprenticeship Programme (AIAP)™, where we get to work on a real-world AI industry problem. Our model served as a decision support tool and has helped the kidney dialysis company’s medical team achieve 36% better precision (i.e. fewer false positives). It is currently deployed in their dialysis centres.

In this article, I will share the key challenges, processes and insights from developing my first medical AI model.

Ok… why are you interested in predicting the hospitalisation of dialysis patients?

Patients undergoing dialysis have higher morbidities and high risks of hospitalisation. By the time they are hospitalised, their medical conditions usually have become full-blown and their mortality risks would have increased. The ability to predict hospitalisation risk will allow early medical intervention.

Even though there is research on key predictors of hospitalisation, the current process is fuzzy and depends on the experience of the medical staff.

The current process of hospitalisation prediction

With the vast amount of data collected from each patient before, during, and after their dialysis, there is potential in using this data to train an AI model that predicts the hospitalisation of patients. The prediction from the model could be used as decision support for the medical team.

Envisioned future of hospitalisation prediction, with AI-model as decision-making support

All you need is to feed those medical data into the model?

No. We need to pre-process the raw medical data to something useful for the model.

We put ourselves in the shoes of a medical professional and asked: how would a doctor assess the hospitalisation risk of patients? From this thought process, we learnt we could teach medical knowledge and supply patients’ medical histories to the model.

We could teach medical knowledge to our model by incorporating medical research done into our data.

For patients’ medical histories, we must find ways to aggregate patients’ medical parameters without losing excessive information. If a patient’s medical parameter is deteriorating, it is usually foretelling a grim outlook. This is what we want our model to know as well.

Cool… how do you teach medical knowledge to the model?

In the raw dataset given to us, most medical parameter readings are just numbers that carry no medical meaning on their own.

There are guidelines on healthy ranges for most medical parameters. For example, an individual with hypertension has blood pressure above 140/90 mmHg. To give meaning to the raw blood pressure data, we converted it into categories of 0, 1, 2, or 3 for low, healthy, pre-hypertensive, and hypertensive blood pressure, respectively.

Feature Engineering: Categorising medical information to embed medical domain knowledge

We did the same by converting other medical parameters into established categories.
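A minimal sketch of this kind of categorisation is shown below. Only the hypertensive cut-off (above 140/90 mmHg) comes from the text above; the other thresholds (below 90/60 for low, 120/80 and above for pre-hypertensive) and the function name `categorise_blood_pressure` are assumptions for illustration, not the values used in the project.

```python
def categorise_blood_pressure(systolic, diastolic):
    """Map a raw reading (mmHg) to 0=low, 1=healthy, 2=pre-hypertensive, 3=hypertensive.

    Only the 140/90 cut-off is taken from the article; the rest are assumed.
    """
    if systolic > 140 or diastolic > 90:
        return 3  # hypertensive
    if systolic >= 120 or diastolic >= 80:
        return 2  # pre-hypertensive (assumed range)
    if systolic < 90 or diastolic < 60:
        return 0  # low (assumed cut-off)
    return 1      # healthy


readings = [(150, 95), (130, 85), (110, 70), (85, 55)]
categories = [categorise_blood_pressure(s, d) for s, d in readings]
print(categories)
```

The model then sees an ordered category that already encodes the clinical guideline, instead of having to rediscover the thresholds from raw numbers.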

And for incorporating the history of patients’ medical information?

We took different approaches based on the type of data.

For continuous variables, such as patients’ blood pressure, we used an exponential moving average (EMA) across 12 periods. This acts like a sliding window over the 12 most recent readings and calculates their average, with more recent data given higher weight.

Feature Engineering: Using exponential moving average to aggregate time-series data

Why 12? In dialysis, each patient undergoes 3 sessions weekly, so 12 sessions cover about 4 weeks. A period of 12 therefore means averaging roughly the past month of dialysis data.
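For readers who want to see the mechanics, here is a small sketch of an exponential moving average. It assumes the common convention of a smoothing factor alpha = 2 / (span + 1) and seeds the average with the first reading; the project may have used a different formulation, and the function name and sample readings are made up.

```python
def exponential_moving_average(readings, span=12):
    """Return the running EMA of a list of readings.

    Assumes alpha = 2 / (span + 1); newer readings get a higher weight and the
    influence of older readings decays geometrically.
    """
    alpha = 2 / (span + 1)
    ema = []
    for value in readings:
        if not ema:
            ema.append(value)  # seed with the first reading
        else:
            ema.append(alpha * value + (1 - alpha) * ema[-1])
    return ema


# Hypothetical systolic readings from consecutive dialysis sessions.
systolic = [100, 113, 120, 118, 135]
print(exponential_moving_average(systolic))
```

With this recursion, older readings never drop out completely; they simply fade, which is why a rising trend in recent sessions shows up quickly in the feature. In pandas, `Series.ewm(span=12, adjust=False).mean()` computes the same recursion.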

For discrete variables, such as past hospitalisation count, we created a ‘cumulative count’ column that records the number of times a patient has been hospitalised. We increased this cumulative count by 1 each time the patient was hospitalised.

Feature Engineering: Calculating the cumulative sum of events to embed patient's history

This is based on medical literature, which found that past hospitalisation count is a strong predictor of future hospitalisation (a patient who is hospitalised often is likely to be sicker and therefore has a higher chance of being hospitalised in future).
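The cumulative count can be sketched in a couple of lines. The flags below are invented for one hypothetical patient, with one entry per dialysis session, and the `prior_count` variant is an extra suggestion rather than something described in the project.

```python
from itertools import accumulate

# 1 means the patient was hospitalised around that session, 0 means not.
hospitalised_flags = [0, 0, 1, 0, 1, 1, 0]

# Running total of hospitalisations up to and including each session.
cumulative_count = list(accumulate(hospitalised_flags))

# Variant that only counts hospitalisations before each session, which avoids
# leaking the outcome being predicted into the feature.
prior_count = [0] + cumulative_count[:-1]

print(cumulative_count)
print(prior_count)
```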

Any interesting finding from your project?

We tried using NLP (natural language processing) to extract information from patients’ discharge notes written by doctors. We thought providing this additional information to the model would improve its performance, but we were wrong: the model performance did not increase.

Our hypothesis is that whatever information is contained in patients’ discharge notes is already present in the patients’ medical parameters. For instance, for the doctor to indicate ‘high blood pressure’ in the discharge note, the doctor must have referred to the blood pressure reading of the patient.

Furthermore, adding NLP slowed the model down significantly because of the additional data processing required. We eventually decided to exclude patients’ discharge notes from our final AI model.

How do you know if your model is good enough?

We don’t. Everyone wants a perfect model, especially medical professionals, and rightly so. They are concerned about false negative and false positive results, which will negatively affect patients’ medical outcomes.

How did you then convince the medical team to implement your model?

We got creative. Instead of setting an arbitrary benchmark, we proposed a model validation exercise with the medical team. If our model can enable the medical team to make better predictions, then deploying our model will help the patients.

For one month, the medical team assessed patients and predicted which patients would be hospitalised. We did the same with our model. We then compared both sets of predictions with the actual hospitalisation of patients.

After tallying the results, our AI model performed 36% better in precision. This means using our model as a decision support tool will help the kidney dialysis company’s medical team make fewer false positive predictions.
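For readers unfamiliar with the metric, precision is the share of predicted hospitalisations that actually happened: true positives divided by the sum of true positives and false positives. The sketch below shows the calculation on invented labels; the numbers are purely illustrative and do not reproduce the validation exercise or its 36% figure.

```python
def precision(predicted, actual):
    """Fraction of positive predictions that turned out to be correct."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if true_pos + false_pos == 0:
        return 0.0  # no positive predictions were made
    return true_pos / (true_pos + false_pos)


# Made-up outcomes for 8 patients: 1 = hospitalised, 0 = not hospitalised.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
team_predictions = [1, 1, 1, 1, 0, 0, 1, 0]
model_predictions = [1, 0, 1, 1, 0, 0, 0, 0]

print(precision(team_predictions, actual))
print(precision(model_predictions, actual))
```

Both sets of predictions catch the same two hospitalisations here, but the second raises fewer false alarms, so its precision is higher.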

Oh no! Did something bad happen?

No. On the contrary, something positive happened to the patients during the model validation period.

The hospitalisation rate of patients dropped significantly compared to the average hospitalisation rate before the model validation period. This happened even though the medical team did not know the predictions of our AI model.

After the model validation period, the average hospitalisation rate crept back up to its original level. Therefore, it is unlikely that this drop was due to randomness or other confounding factors.

How do you explain this decrease in hospitalisation rate?

This phenomenon is known as Social Facilitation, where individual performance improves when working alongside others.

Interestingly, it seems that just by knowing an AI model is competing with them on hospitalisation prediction, the overall performance of the medical team improves.

We suspected it could be due to the medical team taking a closer look at patients during this period. A little healthy competition never hurts.

Interesting… should I start telling my colleagues an AI model is running in the background (even if there isn’t) to improve their performance?

I will leave it to you to decide. A better alternative is to contact AISG and let us develop an AI model for your company. 

Attention Mechanism in Seq2Seq and BiDAF — an Illustrated Guide

Sequence-to-sequence (seq2seq) and Bi-Directional Attention Flow (BiDAF) are influential NLP models. These models make use of a technique called “attention” that involves the comparison of two sequences. In this article, I explain how the attention mechanism works in these two models.

(By Meraldo Antonio)

This article is the third in a series of four articles that aim to illustrate the working of Bi-Directional Attention Flow (BiDAF), a popular machine learning model for question answering (Q&A).

To recap, BiDAF is a closed-domain, extractive Q&A model. This means that to be able to answer a Query, BiDAF needs to consult an accompanying text that contains the information needed to answer the Query. This accompanying text is called the Context. BiDAF works by extracting a substring of the Context that best answers the Query; this is what we refer to as the Answer to the Query. I intentionally capitalize the words Query, Context and Answer to signal that I am using them in their specialized technical capacities.

An example of Context, Query, and Answer. Notice how the Answer can be found verbatim in the Context.

In the first article in the series, I presented a high-level overview of BiDAF. In the second article, I talked about how BiDAF uses 3 embedding layers to get vector representations of the Context and the Query. The final outputs of these embedding steps are two matrices: H (which represents words in the Context) and U (which represents words in the Query). H and U are the inputs of the attention layers, whose function is to combine their informational content.

These attention layers are the core component in BiDAF that differentiates it from earlier models and enables it to score highly on the SQuAD leaderboard. The workings of the attention layers will be the focus of this article. Let’s dive in!

Conceptual Introduction to Attention Mechanism

Before delving into the details of the attention mechanism used in BiDAF, it behooves us to first have an understanding of what attention is. Attention was first introduced in 2014 as part of a sequence-to-sequence (seq2seq) model. As its name suggests, seq2seq is a neural network model whose aim is to convert one sequence to another sequence. An example application of seq2seq is the translation of a 🇫🇷 French sentence to an 🇬🇧 English sentence, such as the one below:

A sample translation task for seq2seq model.

A seq2seq model consists of two Recurrent Neural Networks (RNNs):

  • The first RNN, called “encoder”, is responsible for understanding the input sequence (in our example, a French sentence) and converting its informational content into a fixed-size intermediary vector.
  • The second RNN, called “decoder”, then uses this intermediary vector to generate an output sequence (in our example, the English translation of the French sentence).

Prior to the incorporation of the attention mechanism, seq2seq models could only deal with short sequences. This restriction arises because “vanilla” seq2seq models can only fit a limited amount of information into the intermediary vector, so some informational content is lost in the process.

This situation is akin to an attempt to translate a French book by first reading the whole book, memorizing its content and then translating it into English just from memory. As you can imagine, such an attempt is bound to fail: our poor translator will forget most of the book by the time he starts writing the English translation!

The attention mechanism was developed to solve this information bottleneck. The core idea of attention is that on each step of the decoding process, we are to make a direct connection to specific parts of the encoder.

In the context of our French-English translation task, this means that every time our model generates the next English word, it will focus only on the most relevant portions of the input French sentence.

Conceptually, a seq2seq translation model with attention works just like how a normal human translator would translate a French text. He would read the first paragraph of the French text and translate it into English, move on to the second paragraph and translate this one, and so on. By doing so, he doesn’t have to commit the whole content of the book to memory and run the risk of forgetting most of it.

Implementation of Attention in Sequence-to-Sequence Model

Practically, we can include an attention mechanism in a seq2seq model by performing the following three steps that are depicted in the diagram below:

Attention mechanism in seq2seq.

1. Comparison of Sequences and Calculation of Attention Scores

At each time step during the decoding process, we will compare the decoder hidden state with all of the encoder hidden states. This comparison can be done using any function that takes two vectors and outputs a scalar that reflects their similarity. The simplest such function is a dot product. The scalar output of the similarity function is called an “attention score”; these attention scores are depicted by blue circles 🔘 in the diagram above.

2. Conversion of Attention Scores to Attention Distribution

We then take the softmax of all these attention scores. The softmax function normalizes these attention scores into a probability distribution (a set of numbers that sum up to one). This probability distribution is called the “attention distribution”; it signals the parts of the input sequence that are most relevant to the decoding process at hand.

The blue bars in the diagram above show the attention distribution. We see that the bar corresponding to the second 🇫🇷 French word, “me”, is the tallest; this is because this word translates to “I” in 🇬🇧 English, which is the first word in our output sequence.

3. Multiplying Attention Distribution with Encoder Hidden States to Get Attention Output

We then multiply each element of the attention distribution with its corresponding encoder hidden state and sum up all of these products to produce a single vector called the “attention output”. You can think of the attention output as a selective summary of the input sequence. The attention output will then become the input to the next decoding step.

Although the three attention steps above were first applied to seq2seq, they are generalizable to other applications. As we will see later, BiDAF uses the same three steps in its implementation of attention, albeit with some minor modifications.
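The three steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code of any particular seq2seq library: the function names (`dot`, `softmax`, `attention`) and the toy vectors are made up for this example, and the dot product is assumed as the similarity function.

```python
import math


def dot(a, b):
    """Similarity function: a simple dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Normalize scores into a probability distribution that sums to one."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(decoder_state, encoder_states):
    # Step 1: compare the decoder state with every encoder state.
    scores = [dot(decoder_state, h) for h in encoder_states]
    # Step 2: turn the attention scores into an attention distribution.
    distribution = softmax(scores)
    # Step 3: weighted sum of the encoder states gives the attention output.
    size = len(encoder_states[0])
    output = [
        sum(w * h[i] for w, h in zip(distribution, encoder_states))
        for i in range(size)
    ]
    return scores, distribution, output


# Toy hidden states: three encoder steps, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

scores, distribution, output = attention(decoder_state, encoder_states)
print(scores, distribution, output)
```

In this toy run the second and third encoder states receive most of the weight, because they are the ones that align with the decoder state.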

With this quick overview of the attention mechanism and its implementation in seq2seq, we are now ready to see how this concept is applied in BiDAF. C’est parti!

Step 6. Formation of Similarity Matrix

Just to remind you where we left off in the last article: our last step (step 5) was a contextual embedding step that produced two output matrices, H for the Context and U for the Query. Our overarching goal for the attention steps (steps 6 to 9) is to fuse together information from H and U to create several matrix representations of the Context that also contain information from the Query.

Our sixth step, the first attention-related step, is the formation of the so-called similarity matrix S. S is a tall, skinny matrix with a dimension of T-by-J (number of words in the Context by number of words in the Query).

The generation of the similarity matrix S corresponds to the first step in the seq2seq attention mechanism discussed above. It entails applying a comparison function to each column in H and each column in U. The value in row t and column j of S represents the similarity of the t-th Context word and the j-th Query word.

Let’s take a look at an example of similarity matrix S. Suppose we have this Query/Context pair:

  • Context: “Singapore is a small country located in Southeast Asia.” (T = 9)
  • Query: “Where is Singapore situated?” (J = 4)

The similarity matrix produced from the above Query/Context pair is shown below:

An example of similarity matrix S.

We can make a couple of observations from the matrix above:

  • As we expect, the matrix has a dimension of 9-by-4, 9 being the length of the Context (T) and 4 being the length of the Query (J).
  • The cell in row 1, column 3 contains a relatively high value as indicated by its bright yellow color. This implies that the Query word and Context word associated with this coordinate are highly similar to each other. These words turn out to be the exact same word — “Singapore” — hence it makes sense that their vector representations are very similar.
  • Just because a Context word and a Query word are identical doesn’t necessarily imply that their vector representations are highly similar! Look at the cell in row 2, column 2: this cell encodes the similarity of the Context word “is” and the identical Query word “is”. However, its value is not as high as that of the “Singapore” pair above. This is because these vector representations also incorporate information from the surrounding phrases. This contextual contribution is especially important for small copulas such as “is”.
  • On the other hand, we can see that the similarity value of two distinct words with close semantic and grammatical meaning, such as “situated” and “located” is relatively high. This is thanks to our word and character embedding layers, which can generate vector representations that pretty accurately reflect a word’s meaning.

Now let me tell you about how we calculate the values in S. The comparison function used to perform this calculation is called α. α is more complex than the dot product used in seq2seq; here is the equation for α:
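For reference, the BiDAF paper defines the similarity function as below. Here h is a column of H (a Context word), u is a column of U (a Query word), the semicolon denotes vector concatenation, ∘ denotes element-wise multiplication, and w_(S) is a trainable weight vector of size 6d (each of h and u has 2d entries).

```latex
\alpha(h, u) = w_{(S)}^{\top} \, [\, h \,;\, u \,;\, h \circ u \,],
\qquad
S_{tj} = \alpha(H_{:t}, U_{:j})
```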

As the function α comprises a multiplication of a row vector and an equally-sized column vector, it always returns a scalar.

The diagram below shows all the matrix operations performed in this step.

Step 6. Using the similarity function α, we combine context matrix and query matrix to form similarity matrix S.
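The formation of S can be sketched in plain Python as follows. This assumes the form of α given in the BiDAF paper, a weight vector applied to the concatenation of h, u and their element-wise product. The toy vectors, the hand-picked weights and the choice to store H and U as lists of word vectors are all illustrative; in the real model the weights are learned.

```python
def alpha(h, u, w):
    """Similarity of one Context word vector h and one Query word vector u."""
    # Concatenate h, u and their element-wise product into one long vector.
    features = h + u + [a * b for a, b in zip(h, u)]
    # Multiply by the weight vector to obtain a single scalar.
    return sum(wi * fi for wi, fi in zip(w, features))


def similarity_matrix(H, U, w):
    """Return S with T rows (Context words) and J columns (Query words)."""
    return [[alpha(h, u, w) for u in U] for h in H]


# Toy example with word vectors of size 2 (so the weight vector has size 6).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 Context words
U = [[1.0, 0.0], [0.0, 2.0]]              # J = 2 Query words
# Hypothetical weights that ignore h and u and keep only their product,
# which reduces alpha to a plain dot product.
w = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]

S = similarity_matrix(H, U, w)
print(S)
```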

Step 7. Context-to-Query Attention (C2Q)

The similarity matrix serves as an input to the next two steps: Context-to-Query attention (C2Q) and Query-to-Context attention (Q2C).

In this section, we will focus on C2Q. The goal of this step is to find which Query words are most relevant to each Context word.

Performing C2Q is similar to performing the second and the third steps of the seq2seq attention. First, we use the scalar values in S to calculate the attention distribution. This is done by taking the row-wise softmax of S. The result is another matrix. This matrix is not explicitly named in the BiDAF paper, but let’s call it matrix A.

The matrix A, whose dimension is the same as S, indicates which Query words are the most relevant to each Context word. Let’s look at an example of A:

An example of matrix A, the row-wise softmaxed version of S.

By observing the heatmap above, we can conclude that:

  • Semantic similarity strongly contributes to relevance. We see that for the Context words [“Singapore”, “is”, “located”], the most relevant Query words are [“Singapore”, “is”, “situated”]. These are also words with which they share strong semantic similarity.
  • Context words “understand” the information requested by Query words. We see that the Query word “Where” is the most relevant Query word for the Context words [“a”, “small”, “country”, “in”, “Southeast”, “Asia”], which are words related to geographical location.

We then take every row of A to get the attention distribution At: , which has a dimension of 1-by-J. At: reflects the relative importance of each Query word for the t-th Context word.

We then calculate the weighted sum of the Query matrix U with respect to each element in the attention distribution At: . The result of this step is the attention output matrix called Ũ, which is a 2d-by-T matrix.

Ũ, just like H, is a matrix representation of the Context. However, Ũ contains different information from H! Whereas H encapsulates the semantic, syntactic and contextual meaning of each Context word, Ũ encapsulates the information about the relevance of each query word to each Context word.

The whole process to generate Ũ from similarity matrix and query matrix is depicted below:

Step 7. Context-to-Query attention.
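Here is a minimal sketch of C2Q in plain Python. The function names and the toy values of S and U are invented for illustration, and the matrices are stored as lists of rows and lists of word vectors respectively, which is a storage choice for this example.

```python
import math


def softmax(scores):
    """Normalize a list of scores into a distribution that sums to one."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_to_query(S, U):
    """S has T rows and J columns; U is a list of J Query word vectors."""
    # Row-wise softmax: one attention distribution per Context word.
    A = [softmax(row) for row in S]
    size = len(U[0])
    # For each Context word, take the weighted sum of the Query word vectors.
    U_tilde = [
        [sum(a * u[i] for a, u in zip(row, U)) for i in range(size)]
        for row in A
    ]
    return A, U_tilde


# Toy input: T = 2 Context words, J = 2 Query words, vectors of size 2.
S = [[0.0, 0.0], [10.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]

A, U_tilde = context_to_query(S, U)
print(A)
print(U_tilde)
```

The first Context word is equally similar to both Query words, so its row of Ũ is their average; the second is dominated by the first Query word, so its row is almost a copy of that word's vector.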

Step 8. Query-to-Context (Q2C) Attention

The next step is Q2C, which, like C2Q, also starts with the similarity matrix S. In this step, our goal is to find which Context words are most similar to any one of the Query words and hence are critical for answering the Query.

We first take the maximum across each row of the similarity matrix S to get a column vector of length T. This column vector is not explicitly named in the paper, but let’s call it z.

Let’s now step back and think about what z symbolizes. Remember, our similarity matrix S records the similarity between each Context word and each Query word. Let’s take a second look at the example S we created above.

Similarity matrix S.

Now let’s focus our attention on the fourth row of this matrix, which corresponds to the Context word “small”. One can see that there isn’t any bright cell across this row! This indicates that there is no word in the Query that is similar in meaning to the Context word “small”. When we take the maximum across this row, the maximum value obtained will be close to zero.

Contrast this with the Context words “Singapore” and “located”, in whose rows we do find at least one bright cell, indicating the existence of Query word(s) similar to these words. When we take the maximum of these two rows, the corresponding values in the vector z will also be relatively high.

Here is the z obtained for our example:

In Q2C, the values in z serve as our attention scores. We apply softmax on z to get an attention distribution called b. We then use b to take a weighted sum of the Context matrix H. The resulting attention output is a 2d-by-1 column vector called ĥ.

The last step of Q2C is copying ĥ T times and combining these copies into a 2d-by-T matrix called Ĥ. Ĥ is yet another representation of the Context that encapsulates the information about the most important words in the Context with respect to the Query.

The whole process to generate Ĥ from similarity matrix and Context matrix is depicted below:

Step 8. Query-to-Context attention.
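For readers who prefer code to diagrams, the Q2C step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (softmax, q2c_attention) and the toy values in S and H are made up for this example and do not come from the paper or from any BiDAF implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def q2c_attention(S, H):
    """Query-to-Context attention.

    S: T-by-J similarity matrix (a list of T rows with J values each).
    H: 2d-by-T Context matrix (a list of 2d rows with T values each).
    Returns (z, b, h_hat, H_hat).
    """
    T = len(S)
    # z: the maximum of each row of S, one value per Context word
    z = [max(row) for row in S]
    # b: attention distribution over the T Context words
    b = softmax(z)
    # h_hat: weighted sum of the columns of H, a vector of length 2d
    h_hat = [sum(b[t] * row[t] for t in range(T)) for row in H]
    # H_hat: h_hat copied T times side by side, giving a 2d-by-T matrix
    H_hat = [[value] * T for value in h_hat]
    return z, b, h_hat, H_hat

# Toy input (made-up numbers): T = 3 Context words, J = 2 Query words, 2d = 2
S = [[0.1, 0.0],
     [2.0, 0.3],
     [0.2, 1.5]]
H = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]

z, b, h_hat, H_hat = q2c_attention(S, H)
```

In this toy input, the first Context word has no similar Query word, so its entry in z is small and it receives the least attention in b. Every column of H_hat is identical, which mirrors the copy step described above.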

Step 9. “Megamerge”

The matrices produced in steps 5, 7 and 8 are then combined to form a giant matrix G. To refresh your memory, these three matrices are as follows:

  • H : the original Context matrix that encapsulates the semantic, syntactic and contextual meaning of Context words.
  • Ũ : Context matrix that encapsulates the relevance of each Query word to each Context word.
  • Ĥ : Context matrix that encapsulates the information about the most important words in the context with respect to the Query.

These three matrices have the same dimension — 2d-by-T.

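This “megamerging”, unfortunately, is not as simple as stacking these three matrices vertically! Here, we make use of another custom function called β. Here is the equation for β, as defined in the paper [1]: β(h, ũ, ĥ) = [h; ũ; h∘ũ; h∘ĥ], where “;” denotes vertical concatenation and “∘” denotes element-wise multiplication.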

We define our megamerged matrix G by G:t = β(H:t, Ũ:t, Ĥ:t), where G:t is the t-th column vector of G, which corresponds to the t-th Context word. G has a dimension of 8d-by-T.

The whole process to generate G from H, Ũ and Ĥ is depicted below:

Step 9. Merging of the three Context matrices H, Ũ and Ĥ to form G.

The giant matrix G contains all information in H, Ũ and Ĥ. That is, you can think of each column vector in G as a vector representation of a Context word that is “aware” of the existence of the Query and has incorporated relevant information from it.
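The merge can also be sketched in plain Python. The sketch assumes the definition of β given in the BiDAF paper [1], which concatenates h, ũ, h∘ũ and h∘ĥ (with ∘ meaning element-wise multiplication); that is why three 2d-long vectors yield an 8d-long one. The helper names (beta, column, megamerge) and the toy matrices are made up for illustration.

```python
def beta(h, u_tilde, h_hat):
    """Merge three vectors of length 2d into one vector of length 8d."""
    h_times_u = [a * b for a, b in zip(h, u_tilde)]    # element-wise product h∘ũ
    h_times_h_hat = [a * b for a, b in zip(h, h_hat)]  # element-wise product h∘ĥ
    # List concatenation plays the role of vertical stacking
    return h + u_tilde + h_times_u + h_times_h_hat

def column(M, t):
    """Return the t-th column of a matrix stored as a list of rows."""
    return [row[t] for row in M]

def megamerge(H, U_tilde, H_hat):
    """Apply beta column by column to build the 8d-by-T matrix G."""
    T = len(H[0])
    merged_columns = [
        beta(column(H, t), column(U_tilde, t), column(H_hat, t))
        for t in range(T)
    ]
    # merged_columns holds T columns; transpose back to a list of 8d rows
    return [list(row) for row in zip(*merged_columns)]

# Toy matrices (made-up numbers) with 2d = 2 and T = 3
H = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]
U_tilde = [[0.5, 0.5, 0.5],
           [2.0, 2.0, 2.0]]
H_hat = [[2.0, 2.0, 2.0],
         [1.0, 1.0, 1.0]]

G = megamerge(H, U_tilde, H_hat)
```

With 2d = 2, G has 8 rows and 3 columns, one column per Context word.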

So that’s all there is to the attention layers in BiDAF! I know this might be a lot to take in, especially with the myriad of symbols and matrix operations involved. If you really want to study BiDAF in detail, I recommend that you print out the glossary below as well as all the diagrams above and study them side-by-side.

In the next article, which will be the last article of the series, I will discuss how G serves as an input to the modeling and output layers. The output layer will be the one that gives out probabilities for each Context word being included in the Answer span. I hope to see you in that last article!

Glossary

  • Context : the accompanying text to a Query that contains an answer to that Query
  • Query : the question to which the model is supposed to give an answer
  • Answer : a substring of the Context that contains information that can answer the Query. This substring is to be extracted by the model.
  • T : the number of words in the Context
  • J : the number of words in the Query
  • U : the original Query matrix that encapsulates the semantic, syntactic and contextual meaning of Query words. U has a dimension of 2d-by-J.
  • H : the original Context matrix that encapsulates the semantic, syntactic and contextual meaning of Context words. H has a dimension of 2d-by-T.
  • S : the similarity matrix that records the similarity between each Context word and each Query word. S has a dimension of T-by-J (number of words in the Context by number of words in the Query).
  • α : the comparison function used to get similarity values in S.
  • A : the matrix that results from the row-wise softmax of S and indicates which Query words are the most relevant to each Context word. A has a dimension of T-by-J.
  • z : the column vector obtained by taking the maximum across each row of the similarity matrix S. z has a dimension of T-by-1.
  • b : an attention distribution vector that comes as the result of applying softmax to z. b has a dimension of T-by-1.
  • ĥ : an attention output obtained by multiplying b with Context matrix H. ĥ has a dimension of 2d-by-1.
  • Ũ : Context matrix that encapsulates the relevance of each Query word to each Context word. Ũ has a dimension of 2d-by-T.
  • Ĥ : Context matrix that encapsulates the information about the most important words in the Context with respect to the Query. Ĥ has a dimension of 2d-by-T.
  • G : a big, 8d-by-T matrix that contains all information in H, Ũ and Ĥ. G is the input to the modeling layer.
  • β : a custom concatenation function used to build G


[1] Bi-Directional Attention Flow for Machine Comprehension (Minjoon Seo et al., 2017)

[2] Character-Aware Neural Language Models (Yoon Kim et al., 2015)

This article is a reproduction of the original by Meraldo Antonio. At the time of publication, Meraldo was doing his AI apprenticeship in the AI Apprenticeship Programme (AIAP)™. He has since graduated from the programme.

Technical Debt in Building Machine Learning Systems

Developing a machine learning (ML) system is not the same as software engineering, as I recently wrote. However, there is much to learn from the long history of software engineering. This is hardly surprising as the ML engineer writes code and must own and maintain it. One of the concepts which can be leveraged from software engineering is that of technical debt. Just like in the case of building software systems, costs are incurred when ML systems are developed and deployed (often under tight schedules) without well-thought-out processes and tools to maintain and update them. This eventually leads to disappointment in ML projects a year from deployment.

The term technical debt comes from the world of finance (obviously). Just like financial debts, a technical debt must eventually be repaid. And just like in the world of finance, a technical debt often results from an engineering decision made to incur a future cost in order to achieve a more profitable goal (e.g. time-to-market). This is all well and good. However, there are often times when technical debts are incurred not deliberately (the finance analogy breaks down here), but inadvertently. This is where things get messy. Studies into technical debt aim to bring to attention the cases where certain development pathways might lead to undesirable outcomes.

As the world entered an AI spring in the early 2010s, the concept of technical debt in software engineering began to be applied to machine learning. Google researchers in particular published a series of papers at NIPS (now renamed NeurIPS) in 2014, 2015 and 2016 addressing this topic. The papers are worth reading in full. If pressed for time, consider the 6-min read here and 18-min read here. An important takeaway point is that deferring the payment of such debts results in compounding costs. And such compounding happens silently.

A non-comprehensive list of areas where technical debt can develop in machine learning systems.

The big challenge of technical debt is that it can come in many forms and there is no single silver bullet to address them. There is often not even an objective way to measure it. The Google researchers propose a set of qualitative questions in their papers which a development team should continually ask itself. These questions help keep the topic in mind, as the race to develop and deploy complex ML systems tends to crowd out concerns of lesser immediacy.

NLP in a Great Hurry

(Abstracted with permission from NLP in a Hurry by Pier Lim.)

Here is a collection of different Python libraries for natural language processing (NLP) which can be invaluable for rapid prototyping.

Semantic Similarity

Sentence Transformers

BERT/XLNet produce rather bad sentence embeddings out of the box. This library helps you produce your own sentence embeddings tuned for your specific task. This would be useful for anything to do with semantic textual similarity, clustering and semantic search.
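Once sentences have been turned into embedding vectors, semantic similarity is commonly measured with cosine similarity, and semantic search amounts to ranking a corpus by that score. The sketch below shows only this last step in plain Python. The three-dimensional vectors are hand-made stand-ins for real sentence embeddings, and the function names are hypothetical; no call to the Sentence Transformers library is shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_search(query_vec, corpus, top_k=2):
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in corpus.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hand-made toy vectors standing in for real sentence embeddings
toy_embeddings = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "I forgot my login credentials.": [0.8, 0.3, 0.1],
    "What is the weather today?": [0.0, 0.2, 0.9],
}
# Stand-in embedding for a query such as "Help me recover my account"
query_embedding = [1.0, 0.2, 0.0]

results = semantic_search(query_embedding, toy_embeddings)
```

The two account-related sentences rank above the weather question because their vectors point in nearly the same direction as the query vector.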

Rule-based Text Sentiment for Social Media


While deep learning models are cool, rule-based models still have their place under the sun, especially when you don’t have a lot of data or time to tune your model. VADER describes itself as specifically attuned to sentiments expressed in social media. This means that emoticons and sentiment intensity markers (e.g. “!!!”) are taken into account.
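To make the idea of a rule-based scorer concrete, here is a deliberately tiny toy in plain Python. It is not the library's lexicon or algorithm: the word list, the emoticon scores and the 10% boost per exclamation mark are all invented for illustration. It only shows how a lexicon lookup combined with a few hand-written rules can handle emoticons and intensity markers without any training data.

```python
# Hypothetical mini-lexicon mapping words and emoticons to a valence score
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "awful": -2.0, "hate": -2.0,
    ":)": 1.5, ":(": -1.5,
}

def toy_sentiment(text):
    """Sum lexicon scores, then amplify the total for exclamation marks."""
    score = 0.0
    for token in text.lower().split():
        # Try the raw token first (emoticons), then the word without
        # trailing or leading punctuation
        stripped = token.strip("!?.,")
        score += LEXICON.get(token, LEXICON.get(stripped, 0.0))
    # Invented rule: each "!" (up to three) boosts the magnitude by 10%
    emphasis = min(text.count("!"), 3)
    return score * (1 + 0.1 * emphasis)

plain = toy_sentiment("This is great")
emphatic = toy_sentiment("This is great!!!")
negative = toy_sentiment("I hate delays :(")
```

Here the emphatic version scores higher than the plain one, and the emoticon pushes the last sentence further into negative territory.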

Named Entity Recognition


spaCy is a popular open-source library that can be used in production. Apart from the default entity types, spaCy also lets us add arbitrary classes to the NER model by updating it with new training examples. This blog post shows how.

Production-Ready BERT Models


BERT-as-a-Service wraps the BERT code and serves it using ZeroMQ, allowing one to serve BERT embeddings in a fast (optimised), scalable and reliable way with just a few lines of code.

Related Articles

Analysis of Tweets on the Hong Kong Protest Movement 2019 with Python demonstrates the use of the Vader tool.

Data for the People by Andreas Weigend : a Book Review

I was introduced to this book by a close friend of mine who is currently working as a data analyst in a prominent start-up. The audio-book is available at the National Library which you can find here. It is written by Andreas Weigend, former Chief Scientist of Amazon.

The book goes into how to make data work for people, making a positive impact on people while taking individual privacy into account. Granted, many governments provide data protection frameworks, such as the PDPA (Personal Data Protection Act) in Singapore. Still, after being in the field for a while, I believe businesses can do more to protect individual privacy while making a positive impact through the customer data they collect.

The first part of the book shares many examples of how companies such as Amazon, Facebook, Google and the like use data for the benefit of people. It gives everyone, both in and out of the industry, a very good idea of how their data is used to provide detailed insights on customers. I am very sure the examples provided will surprise readers with a few exclamation marks.

After the examples, the author moves on to how data can work for the people. His proposal is Transparency and Agency.

Transparency is about getting the technology firms or any firm that uses data (the author calls them “data refineries”) to be transparent about how they use the data, how the data is being processed and what kinds of insights are provided by the data. My take on transparency is, data refineries can, to a certain extent, provide transparency, but whether the transparency level is enough will definitely be debatable. This is because the ways these data refineries use the data provide them with a competitive advantage. No sane business owner will be willing to show everything to their competitors for fear of losing their competitive edge.

Agency is about giving control back to consumers, to let consumers decide how much and which data they are willing to let these data refineries have so that data refineries can continue to provide data-driven services to customers. It is an interesting concept but the premise is that more people will start to understand the value of their data, how they can benefit from their own data and what the potential red flags are when data is shared with the data refineries.


I enjoyed the audio-book tremendously, getting new ideas on what insights can be derived from data. I like the two ideas, transparency and agency, that the author has proposed, giving information and power back to consumers. The author also proposed ways to implement them. I like most of these proposals, but I feel that some of them are not feasible. Overall, my rating is 4 out of 5 stars.

Software Engineering for Machine Learning : A Case Study

Some take-away points from Microsoft’s paper

Microsoft presented this paper at this year’s International Conference on Software Engineering (ICSE 2019). It is a distillation of the experiences gained by the numerous software teams within the company as they implement machine learning (ML) features as diverse as search, machine translation and image recognition in their products. With decades of experience in software engineering and no stranger to ML, Microsoft is well-placed to teach a thing or two about developing machine learning systems vis-à-vis software engineering.

A commonly used machine learning workflow at Microsoft is depicted in Figure 1. It should look familiar to those already conversant with machine learning. If you find this workflow unfamiliar or need a refresher, head over to Section II-B of the paper where they give a pretty good summary of the various steps.

Figure 1 : The nine stages of the machine learning workflow. Some stages are data-oriented (e.g., collection, cleaning, and labeling) and others are model-oriented (e.g., model requirements, feature engineering, training, evaluation, deployment, and monitoring). There are many feedback loops in the workflow. The larger feedback arrows denote that model evaluation and monitoring may loop back to any of the previous stages. The smaller feedback arrow illustrates that model training may loop back to feature engineering (e.g., in representation learning).

Among other things, the paper highlights three fundamental differences between the software and ML domains (Section VII), which I find most relevant for our purpose :

  • Data discovery and management.
  • Customisation and reuse.
  • ML modularity.

Data Discovery and Management

The collection, curation, cleaning and processing of data are central to machine learning. While software development can be supported by neatly defined APIs which do not often change during the development cycle (relatively speaking), datasets rarely have explicit and stable schema definitions across the many rounds of iterations involved in ML. All data must be stored, tracked and versioned. There are well-established technologies to version code, but the same cannot be said for data. A given dataset may contain data from several different schema regimes. When a single engineer gathers and processes this data, they can keep track of these unwritten details, but when project sizes scale, maintaining this common knowledge becomes non-trivial.
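The two problems named above, versioning a dataset and spotting mixed schema regimes, can be illustrated with a small standard-library sketch. It is a toy under stated assumptions, not how any particular data versioning tool works: the function names and the sample records are hypothetical, and the fingerprint treats the dataset as an ordered list of JSON-serialisable records.

```python
import hashlib
import json
from collections import Counter

def dataset_fingerprint(records):
    """Content hash that changes whenever any record (or their order) changes."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def schema_regimes(records):
    """Count how many records follow each distinct set of field names."""
    return Counter(tuple(sorted(record)) for record in records)

# Hypothetical records gathered over time
records = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 51},
    {"user_id": 3, "birth_year": 1990},  # a later schema regime
]

version_1 = dataset_fingerprint(records)
regimes = schema_regimes(records)

records[0]["age"] = 35  # a silent correction to one record
version_2 = dataset_fingerprint(records)
```

Storing the fingerprint alongside each trained model makes it possible to tell which exact data a model saw, and the regime count surfaces the unwritten detail that the third record follows a different schema.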

Customisation and Reuse

Model customisation and reuse require very different skills than are typically found in software teams. In software engineering, reuse involves forking a library and making the required changes to the code. In ML model reuse, there are more considerations to be made, for example the original domain the model was trained on and the input format of the data. The developer cannot do without ML knowledge, and those coming from a purely software background should be cognizant of this point.

ML Modularity

Modularity is a key principle in software engineering, often strengthened by Conway’s Law. The final software is divided into modules with interactions between them controlled by APIs. Maintaining strict module boundaries can be challenging in ML systems. As an example :

… one cannot (yet) take an NLP model of English and add a separate NLP model for ordering pizza and expect them to work properly together. Similarly, one cannot take that same model for pizza and pair it with an equivalent NLP model for French and have it work. The models would have to be developed and trained together.

Another point mentioned is the non-obvious ways in which models interact.

In large-scale systems with more than a single model, each model’s results will affect one another’s training and tuning processes. In fact, one model’s effectiveness will change as a result of the other model, even if their code is kept separated. Thus, even if separate teams built each model, they would have to collaborate closely in order to properly train or maintain the full system. This phenomenon (also referred to as component entanglement) can lead to non-monotonic error propagation, meaning that improvements in one part of the system might decrease the overall system quality because the rest of the system is not tuned to the latest improvements. This issue is even more evident in cases when machine learning models are not updated in a compatible way and introduce new, previously unseen mistakes that break the interaction with other parts of the system which rely on it.
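The effect described in the quote can be reproduced with a toy two-stage system. Everything in the sketch below is invented for illustration: an upstream model emits a score, and a downstream component applies a threshold that was tuned while version 1 of the upstream model was in place. Version 2 separates the classes better, yet the system as a whole gets worse until the downstream threshold is retuned.

```python
# Each example: (true_label, upstream_v1_score, upstream_v2_score); made-up numbers
EXAMPLES = [
    (1, 0.30, 0.90),
    (1, 0.35, 0.85),
    (0, 0.10, 0.40),
    (0, 0.15, 0.45),
]
DOWNSTREAM_THRESHOLD = 0.20  # tuned while upstream v1 was in place
SCORE_INDEX = {"v1": 1, "v2": 2}

def downstream_predict(score, threshold=DOWNSTREAM_THRESHOLD):
    """Downstream component: turn an upstream score into a label."""
    return 1 if score >= threshold else 0

def system_accuracy(version, threshold=DOWNSTREAM_THRESHOLD):
    """End-to-end accuracy of upstream model plus downstream threshold."""
    index = SCORE_INDEX[version]
    correct = sum(
        downstream_predict(example[index], threshold) == example[0]
        for example in EXAMPLES
    )
    return correct / len(EXAMPLES)

def separation(version):
    """Gap between the lowest positive and highest negative score."""
    index = SCORE_INDEX[version]
    positives = [ex[index] for ex in EXAMPLES if ex[0] == 1]
    negatives = [ex[index] for ex in EXAMPLES if ex[0] == 0]
    return min(positives) - max(negatives)

acc_v1 = system_accuracy("v1")
acc_v2 = system_accuracy("v2")
acc_v2_retuned = system_accuracy("v2", threshold=0.65)
```

In this toy, the upstream improvement (a larger separation) drops system accuracy from 1.0 to 0.5 because every version 2 score clears the old threshold. Retuning the downstream threshold restores it, which is the kind of close collaboration between teams that the quote calls for.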

Some Thoughts

There is much to agree with in Microsoft’s paper. In particular, the central role that data plays cannot be overemphasized. Initiatives like DVC, the open-source version control system running atop Git, are evidence of the importance of and the progress made towards the tracking and versioning of data.

Software engineering and machine learning are distinct disciplines with the deliverables being expressed in code as the common denominator. It should therefore be recognised as a matter of course that skills in software engineering do not naturally transfer over to projects which include ML features.

On the modularity of ML, I feel there is potential to go a little deeper. Modularisation is not a no-go per se when it comes to machine learning. After all, Andrew Ng devoted a few chapters of his book Machine Learning Yearning to how a machine learning task can be handled by a pipeline of components. The more pertinent question is, when can modularisation go wrong? Perhaps this is a whole topic best handled by an entire paper by itself.

AI Makerspace is officially launched today!

AI Makerspace is a platform offered by AI Singapore (AISG) to help SMEs and Start-ups accelerate the adoption of AI in Singapore.

They will benefit from the quick turnaround and deployable AI solutions to address their business needs. 

AI Makerspace provides a suite of AI tools, APIs and pre-built solutions (Makerspace Bricks) for specific use cases which SMEs and Start-ups can download and use. The pre-built solutions include Open Source Software[1] and IPs developed by our local universities.

AI Makerspace also hosts open datasets, training courses as well as limited consulting and engineering services for them to jumpstart their AI journey.

Benefits to SMEs and Start-ups:

AI Makerspace will also benefit System Integrators: 

Makerspace Bricks

Makerspace Bricks are pre-built AI solutions. They are designed to address specific use cases which are inspired by real-world 100E projects[2] and common AI requests from the industry.

The Makerspace Bricks are developed by AISG’s engineers and will be put up as FREE downloadable tools, libraries and assets for open source software[3] or APIs (for IPs developed by Singapore local universities[4]) for SMEs and Start-ups to try.

At launch, four core AI technologies will be available on AI Makerspace:

  1. Robotic Process Automation
  2. Natural Language Processing (NLP)
  3. Speech To Text (S2T)
  4. Computer Vision (CV)

AI Quick Start Programme

AI Quick Start is a programme to help SMEs which require further assistance in using the Makerspace Bricks to solve their business problems. AI Singapore will provide a limited 7 person-day AI consultancy to SMEs to jumpstart their AI projects.

An AISG engineer will guide the SME/Start-up through the following stages of AI development:

  1. Scoping and Problem Definition
  2. Data Acquisition and AI Architecture
  3. Clean and Explore Data
  4. Architect AI Models
  5. Deploy for Usage

The typical stages are shown in the figure below:

The deliverables of the AI Quick Start Programme include a validated business model, AI Minimum Viable Model and an AI architecture.

The fee for the AI Quick Start Programme is SGD 20,000 (excluding GST).

AI Makerspace Website:

Click here to view the video and interviews with “Bricks” users.


[4] Source-available Bricks can be used for testing via API calls for FREE with conditions, subsequent licensing will be required for commercialisation.
