Building on the recognised success of our AI Apprenticeship Programme (AIAP), which recently won the ‘Talent Accelerator in Singapore’ award at IDC’s Digital Transformation (DX) Awards, AISG looks to expand the pool of Singaporeans who can become future AI Apprentices, and to enable those who are technically proficient in AI to be recognised as AI Certified Engineers (AICE) in Singapore.
We will do this through structured learning pathways that allow Singaporeans to reskill, upskill and deep-skill themselves in AI – one of the most in-demand technical skills globally.
We are pleased to announce this newly launched initiative – AISG AI Certification.
The learning pathways and assessment framework are shown in Annex A and Annex B respectively. The framework was developed based on experience drawn from the 100 Experiments (100E) programme. Some of our 100E partners, such as Daimler, IBM, KroniKare, NDR Medical, NUS, Q&M Dental and Surbana Jurong, are supporting this initiative as it will help them identify and hire more AI Engineers.
The same framework can also be used by companies’ Learning & Development teams as a template/reference to reskill selected staff and start building their own AI and Data Science teams. IBM is participating in this AISG initiative.
With the learning pathways, Singaporeans can choose to enroll in our existing talent development programmes and upgrade their knowledge and skills progressively. Most of these programmes are free and only require a personal investment of time. There are assessments at each stage to track learning progress and recognise attendees’ mastery of AI knowledge and skills. Self-directed learners can choose to skip the programmes and take the assessments directly to validate their proficiencies.
AISG will scale our existing AI4E and AI4I programmes to help more Singaporeans build their foundational knowledge and skills in AI. We aim to enable 100,000 Singaporeans to complete the free “Literacy in AI” course, of which 25,000 Singaporeans will be encouraged to attain the “Practical Foundations in AI” certification.
AISG will work with companies and organizations to promote and recognize the AI Certification, and to identify technically proficient Singaporean working professionals to be certified as AI Associate Engineers and AI Certified Engineers. We target to start accepting applicants for our Professional Certification by Q1 2020.
In addition, AISG will also invite local IHLs and local training providers to align their courses to our assessment framework. Interested organizations can write to firstname.lastname@example.org.
Last month, in a Facebook post, Prime Minister Lee Hsien Loong drew attention to what is purportedly the first publicly reported AI heist. The known facts of the case are simple. On a Friday afternoon in March this year, the managing director of a British energy company was duped into believing that he was in a voice call with his boss, with the voice at the other end demanding that he immediately wire hundreds of thousands of dollars to a supplier in Hungary to save the company from late-payment fines. Apart from the detailed homework done by the thief or thieves to make the request appear convincing, what made the subordinate comply was the quality of the fake voice. As the unfortunate managing director later recalled, the voice was realistic down to the tonality, punctuation and German accent.
While the case might seem Hollywood-like, those in the AI community are already cognizant of the possibilities of such technologies. Back in 2017, the startup Lyrebird published a video of what appeared to be Barack Obama making a pitch for the company. Judge for yourself how realistic the voice sounds.
According to the company, now the AI research division of Descript, only one minute of audio data was required to generate the Obama pitch. Running on publicly available cloud compute resources, machine learning models were trained to determine the features that make every voice unique, including accent, pitch, cadence and speed. These features are not pre-defined by humans, but automatically learned by the models over training cycles.
Fast forward to 2019 and we can readily appreciate the progress since then. At the Neural Information Processing Systems Conference (formerly NIPS, now renamed NeurIPS) at the end of last year, Google – building upon the previous work of others – published a paper detailing how speech audio in the voice of many different speakers, including those unseen during training, can be generated with the help of transfer learning. To provide a simplified technical view, the system can be broken down into 3 components which are trained independently.
A more in-depth technical explanation would be beyond the scope of this article and the interested reader is encouraged to follow the references. Readers who are more inclined toward engineering work may be interested to note that an open source implementation of the system was recently released by a master’s graduate from the University of Liège. Do check it out and see how well it works.
In the 1991 action film, Terminator 2, the protagonist John Connor’s mother was killed by the T-1000 assassin android, but not before it learned to mimic the way she talked. While other properties of the shape-shifting machine still belong firmly in the realm of fiction, the ability to clone a human being’s voice no longer appears very far fetched.
Machine learning algorithms can take significant amounts of time to run. Intel’s distribution of Python aims to accelerate machine learning workloads without users having to switch to C or C++. The prospect of code acceleration within Python itself is exciting, so a simple experiment was set up to analyse Intel Python’s performance. The experiment covers four commonly used machine learning algorithms: random forest, linear regression, hierarchical clustering and K-means clustering.
Intel Python performs the speed up through Intel® MKL, Intel® DAAL and the Intel® AVX. These are libraries which aim to optimise matrix and vector operations (common in machine learning).
Random Forest – 10,000 training rows, 10 training columns, 1,000 test rows, 100 estimators.
Linear Regression – 10,000,000 training rows, 10 training columns, 1,000,000 test rows.
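As a rough illustration, the linear regression benchmark can be reproduced with a timing sketch like the one below. It uses scikit-learn with row counts scaled down from the setup above so it runs quickly; run the same script under each environment (stock Python, Anaconda, Intel Python) and compare the wall-clock times.

```python
# Hedged sketch of the linear regression benchmark. Sizes are scaled
# down from the article's 10M/1M rows so the script finishes quickly;
# the point is to run it unchanged in each Python environment.
import time

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, n_cols = 100_000, 10_000, 10

X_train = rng.standard_normal((n_train, n_cols))
y_train = X_train @ rng.standard_normal(n_cols) + 0.1 * rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, n_cols))

start = time.perf_counter()
model = LinearRegression().fit(X_train, y_train)  # training
model.predict(X_test)                             # inference
elapsed = time.perf_counter() - start
print(f"fit + predict took {elapsed:.3f}s")
```

The other three benchmarks can be timed the same way by swapping in `RandomForestRegressor`, `AgglomerativeClustering` and `KMeans`.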
Stock Python is significantly slower for linear regression and K-means clustering, while it is faster at hierarchical clustering. It is interesting to note that IdpPip performs ~8x faster than conda and ~16x faster than stock Python.
With the exception of K-means clustering, the performance of Intel’s Python distribution appears comparable to Anaconda’s. IdpPip shows a significant speedup over the other environments, but stock Python still performs better at hierarchical clustering. It is therefore not conclusive whether Anaconda or Intel Python is definitively better.
Fortunately, Intel is constantly working on performance enhancements for Python code. We can look forward to a near future when Intel Python becomes definitively better than other distributions of Python.
As an economist by training, I did a lot of mathematics and statistics modules during my university days. I only had a one-module experience in programming and that was with Java, which in my opinion was one of the most difficult languages to pick up (hats off to Java Programmers), especially for someone with limited computer science knowledge at that time (or even now).
Being a data professional these days, it is essential to know programming, which is the reason I picked up two of the most popular programming languages for data science – R and Python.
I thought of sharing my learning journey in programming to help people, especially the non-computer science folks, understand more about programming and, hopefully, take the first steps in coding.
As mentioned, my first programming language was Java and I was really bad at it because I could not grasp the mechanism behind the ‘public’ and ‘private’ thing even after many explanations from my friends. What was most frustrating was that, back in those days, coding was done in a plain text editor like Notepad, so I could not detect the syntax errors that colour coding would have surfaced early on.
Moreover, the error messages were so cryptic that I had no idea how to go about correcting them. There was no Google or Stack Overflow around to help poor souls like me. I had to constantly reference very thick physical textbooks and narrow my solution space so that I could pinpoint the actual error. It was very painful, and even more so after I received my grade for the Java programming module. It was a terrible experience and I have the grades to show for it.
I graduated during a very bad economic downturn. The reality then was brutal for fresh graduates, as there were many experienced professionals around and companies were more willing to give them opportunities than fresh grads like me. It was only after a very long time that I was finally hired into a research centre. At the research centre, they used software called SAS, a statistical analysis tool that originated in the US in the late 1970s. To use the software, I had to revisit programming. Fortunately, when I looked at the syntax and how it worked behind the scenes, I was able to grasp it quickly. With my data concepts and background backing me, I knew what I wanted to do with SAS. Google was available then to help with code and syntax details, although the search results were not as good as today’s. Long story short, I found new confidence in programming, and proficiency in SAS made me more effective in my analysis and modelling work.
Knowledge of SAS programming helped me get into the banks because proficiency in it was very rare at that time. While I was in the banks, I came across R. Just in case you are wondering, the banks were not using R. Rather, I got to know R through my professional network.
R is open-source software used mostly for statistics and data science. A friend of mine told me I should learn R programming to broaden my skill set and have more to offer prospective employers. With renewed confidence in programming, I decided to pick up R. There were already a lot of online tutorials on R, and I proceeded to research and learn it. Again, strong concepts in data analytics helped me pick up R quickly. I noticed that the syntax of R actually looks very much like Excel functions, with the exception of the Tidyverse packages. With this realisation, I was able to learn R even faster. After this successful round of learning another programming language, I went on to pick up Python, a popular general-purpose language, paying special attention to the data science portion. That meant focusing on packages like pandas, NumPy, scikit-learn and so on.
Nowadays, there is an abundance of online material available for anyone to start learning a programming language. The first challenge is no longer picking up the language basics but having an actual environment installed on your computer for practice.
I encourage all my training participants, and you, the reader, to pick up programming and gain at least some programming experience. With Industry 4.0, most jobs will be associated with data, and to achieve better and smarter automation, we cannot run away from interacting with computers. Programming gives us that ability to interact. You can learn programming from online sources, or perhaps even use social media to gather a group of like-minded folks to learn together. With Google and Stack Overflow, even if the error messages are pretty cryptic (by the way, they still are!), debugging is much easier compared to the days when I was learning Java.
What about coding bootcamps? My advice is to attend a coding bootcamp if you want to maintain a structure in your learning, but do check out the trainer’s background beforehand. Make sure that the trainer has enough experience to share industry best-practices and the common pitfalls in that particular programming language.
Here are a few recommended resources to kick things off.
All in all, I strongly encourage everyone to at least try their hand at programming, especially folks who want to become data professionals, because there are very few ways to avoid programming in such a role. Have fun learning programming!
The annual PyCon Singapore conference took place from the 10th to 12th of October this year. AI Singapore had the honour of contributing to the community through the delivery of a keynote address, as well as through participation in the technical sharing in the subsequent breakout sessions.
To kick things off on the right note, the director of AI Industry Innovation in AI Singapore, Laurence Liew, gave the keynote address on the first day of the conference. Looking back at the historical trends in the development of AI over the preceding decades, he reflected on how AI has so permeated our lives that hardly a day goes by today without this technology being used in one form or another. Whether it is the spam filter working 24/7 in our email systems or the recommendations made by Netflix, it is almost impossible to escape AI in the modern world.
As powerful as AI technology has become, Laurence also reminded us that it is fundamentally a series of computations strung together and engineered to serve an intended purpose. He took the audience on a quick tour of how a neural network actually works, dispelling the notion that the oft-hyped deep learning is a step toward Terminator-like thinking machines. On a more serious note, while AI may not take over the world Hollywood-style, its impact is certainly being felt in tasks that used to be performed by humans. Is AI a job-killer? The reality is a bit more complex. As Laurence put it,
AI replaces tasks, not jobs. Programmers will not be replaced by AI, but by programmers who use AI.
The same holds true if you substitute “programmer” with another profession. This is a theme that will be echoed again, as we will see later.
Recognising that mastery of AI offers a competitive edge at the national level, AI Singapore has rolled out the 100 Experiments (100E) programme to increase AI adoption in the nation. In tandem, the AI Apprenticeship Programme (AIAP)™ serves to strengthen our local AI engineering talent base. So successful has AIAP™ been so far that it has been recognised as one of the country winners in the “Talent Accelerator” category in this year’s IDC Digital Transformation Awards (DX Awards). Laurence is of the view that an important focus of the national effort should be on engineers and developers:
We need many more AI/ML engineers to deploy AI systems than scientists to build new AI algorithms.
For aspiring AI engineers, the good news is that application for the next batch of apprentices in AIAP™ is already open and will close on 20th Oct. Click here for the application form.
When it comes to AI talent development, it is never too early to start. Laurence was pleased to announce an update to the ongoing AI for Students (AI4S) programme. Henceforth, all students in MOE-registered schools can directly register for the programme without going through their teachers. Certainly a good initiative to allow the young to take charge of their own learning and be prepared for the future workplace!
In recent years, there has been a surge in interest in robotic process automation (RPA). A quick search on Google Trends reveals the exploding number of queries. So, what is RPA? Thu Ya Kyaw, a graduate of the inaugural batch of AIAP™, was well placed to answer the question in one of the breakout sessions. In his words,
RPA is a process of creating software programs to automate non-decision making repetitive activities or processes.
In a nutshell, RPA mimics human actions. Actions that do not require strategic thinking and higher-level reasoning. Actions that rank low on the scale of human cognitive satisfaction. Coupled with AI technologies like computer vision (CV) and natural language processing (NLP), RPA can be used to chain together work packages traditionally done by humans to deliver complete work solutions. This naturally raises the question (again), “Will RPA replace jobs?”. To that, Thu Ya had a word of advice, re-quoting Steve Jobs: “Stay hungry, stay foolish.”
With the burgeoning interest in RPA, it comes as no surprise that many players have entered a fragmented market, offering enterprise solutions. While enterprise software is generally more user-friendly and has better support and functionality, the absence of a common standard gives rise to a lack of interoperability and results in high switching costs. An alternative is open source solutions. Ken Soh, the creator of TagUI for Python, shared the tool in one of the last breakout sessions at PyCon. Offering features such as website automation, visual automation, OCR automation, and keyboard and mouse automation in one seamless API, the tool, as Ken demonstrated, can be employed in use cases like making repeat orders on RedMart, capturing forex rates and mass-editing MS Word documents.
TagUI for Python is a Python package built on TagUI, an open source RPA software maintained by AI Singapore and released under the Apache 2.0 licence.
New technologies are disruptive. They have the potential to improve lives, as well as throw them off balance. In the few jam-packed days of PyCon Singapore 2019, AI Singapore sought to bring insights to how to stay ahead of the curve and help citizens live better lives.
The buzzword “RPA” has been gaining traction in recent years across different industries, but what is it exactly and why is it so popular? I, too, had a similar chain of thoughts when I was fresh out of school and starting my first job as an RPA Analyst.
RPA stands for “Robotic Process Automation”. It is nothing more than the process of creating software programs to automate non-decision-making, repetitive activities or processes. An important point to note is that making RPA programs is not the same as making physical robots. RPA programs are often regarded as robots simply because they mimic human actions in carrying out tasks. A software robot, if you will.
RPA is also known as RDA (Robotic Desktop Automation)
A brief history of RPA
Although RPA is very popular nowadays, a similar form of automation has been in the market for a long time. During those early periods, RPA existed in the form of screen scraping and automated testing tools. Soon after, it was used as one of the automation tools in business process management. The main factor for businesses to adopt RPA in business process management was that more and more systems were being used in end-to-end business processes and it got harder and harder to automate those processes just by writing macros in Excel sheets. Most importantly, scheduling is necessary for efficient time management (e.g. generating reports throughout the night or exactly at 1.00 am).
With the help of AI (Artificial Intelligence), present day RPA can do so much more than just automate business processes. It can extract text from images using OCR (Optical Character Recognition) technology. By using CV (Computer Vision), it can identify the images and process them with a given set of rules accordingly. It can also understand the context of a sentence or paragraph, thanks to NLP (Natural Language Processing), and so on.
How does RPA automate a process?
RPA programs are known to automate processes by mimicking human actions in those processes. If you are someone with a software engineering background, or are familiar with APIs (application programming interfaces) and cron jobs, you must be wondering how it differs from scripted automation: a variant where API calls are made from scripts and scheduled on a server using a scheduler program. The Lab Consulting has a nice comparison chart of the standard manual, scripted automation and RPA versions of a process for creating a user account in a system.
Given a process, creating a user account in this case, the scripted (APIs + Cron jobs) approach may require additional changes to be made to the existing processes. On top of building an API server for handling requests, the system has to expose the SQL(Structured Query Language) connection for that API server as well. Sometimes, this approach is not feasible if connecting to the SQL database using the connector from external scripts violates the software usage agreement contract of the database service provider. Moreover, producing an API server and the scripts requires additional tech talent.
This is where RPA does better. It doesn’t require the system to change in any way. It can simply follow the current process of doing things in the system and automate the steps. It mimics human actions, remember? Looking at the standard manual process and the robotic automated process in Figure 1, it is clear that only the manual process steps are replaced by automated steps, nothing more. RPA frameworks nowadays also offer a code-free environment to design and automate simple to mildly complex processes. Hence, only an investment in training existing talent to use RPA tools is required, rather than hiring entirely new tech talent.
What are the benefits of using RPA?
Have you ever encountered a colleague who doesn’t take her annual or sick leave, and asks for neither a pay increment nor a promotion? Furthermore, that person works very long hours without asking for any compensation. Ridiculous, right? But that is what an RPA program truly is. If it is done right, of course.
“A picture is worth a thousand words”
In a nutshell, RPA programs are super employees that do all the ridiculously mundane tasks accurately, without complaining or competing with you for a promotion. That brings me to my next point: job security. Since RPA can do those tasks just like me, what does it mean for my future in the organization?
Will RPA replace jobs?
With all the benefits of using RPA programs and their ability to mimic human actions, will RPA replace jobs? The answer to that question is “it depends”. It depends on many factors. To the best of my knowledge, all I can say is that RPA will replace tasks. Repetitive, boring and mundane tasks that nobody wants to spend their time on. However, if your job mainly consists of those tasks, you are at risk of being replaced by RPA programs.
You can also calculate the chance of your job being replaced by robots from sites like https://willrobotstakemyjob.com/. Although it may not be able to give you the definitive answer, it is almost always good to know your risks. Then you can be prepared and plan to upgrade your skills.
You may hear about people getting laid off because automation has taken over their jobs. I strongly believe that those unfortunate events only occur due to either the employees not upgrading their skills for a long time or the employers not providing the opportunities for them to learn new things in the organization. Hence, whether or not you let RPA take over your job is entirely up to you. Stay hungry, stay foolish.
This article is the second in a series of four articles that aim to illustrate the working of Bi-Directional Attention Flow (BiDAF), a popular machine learning model for question and answering (Q&A).
To recap, BiDAF is a closed-domain, extractive Q&A model. This means that to be able to answer a Query, BiDAF needs to consult an accompanying text that contains the information needed to answer the Query. This accompanying text is called the Context. BiDAF works by extracting a substring of the Context that best answers the Query — this is what we refer to as the Answer to the Query. I intentionally capitalize the words Query, Context and Answer to signal that I am using them in their specialized technical capacities.
The first article in the series provided a high-level overview of BiDAF. In this article, we will focus on the first portion of the BiDAF architecture — the first thing that takes place when the model receives an incoming Query and its accompanying Context. We will go through the following sequence of steps:
1. Tokenization
2. Word Level Embedding
3. Character Level Embedding
4. Highway Network
To facilitate your learning, a glossary containing the mathematical notations involved in these steps is provided at the end. Let’s dive in!
Step 1. Tokenization
In BiDAF, the incoming Query and its Context are first tokenized, i.e. these two long strings are broken down into their constituent words. In the BiDAF paper, the symbols T and J are used to denote the number of words in Context and Query, respectively. Here is a depiction of the tokenization:
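In code, this step can be sketched as follows. Real BiDAF implementations use a proper NLP tokenizer; plain whitespace splitting on a pre-spaced toy example is used here only to illustrate T and J (the example sentences are made up).

```python
# Minimal tokenization sketch: break the Context and Query strings into
# word tokens and count them. T and J follow the BiDAF paper's notation.
context = "Singapore is a city-state in Southeast Asia ."
query = "Where is Singapore ?"

context_tokens = context.split()
query_tokens = query.split()

T = len(context_tokens)  # number of words in the Context
J = len(query_tokens)    # number of words in the Query
print(T, J)  # here: 8 and 4
```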
Step 2. Word Level Embedding
The resulting words are then subjected to the embedding process, where they are converted into vectors of numbers. These vectors capture the grammatical function (syntax) and the meaning (semantics) of the words, enabling us to perform various mathematical operations on them. In BiDAF, embedding is done on three levels of granularity: on the character, word and contextual levels. Let’s now focus on the first embedding layer — the word embedding.
The word embedding algorithm used in the original BiDAF is GloVe. In this article, I will only give a brief overview of GloVe because there already exist several excellent resources that explain how the algorithm works. But if you are short on time, here is a very simplified summary of GloVe:
GloVe is an unsupervised learning algorithm that uses co-occurrence frequencies of words in a corpus to generate the words’ vector representations. These vector representations numerically represent various aspects of the words’ meaning.
As the numbers that make up GloVe vectors encapsulate semantic and syntactic information about the words they represent, we can perform some cool stuff using these vectors! For instance, we can use subtraction of GloVe vectors to find word analogies, as illustrated below.
BiDAF uses pre-trained GloVe embeddings to get the vector representation of words in the Query and the Context. “Pre-trained” means that the GloVe representations used here have already been trained; their values are frozen and won’t be updated during training. Thus, you can think of BiDAF’s word embedding step as a simple dictionary lookup step where we substitute words (the “keys” of the GloVe “dictionary”) with vectors (the “values” of the “dictionary”).
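Both ideas can be sketched in a few lines of NumPy. The table below is a toy 3-dimensional stand-in for real 50/100/200/300-dimensional GloVe vectors (the numbers are made up purely so the analogy works out exactly):

```python
# Toy "GloVe dictionary": word -> vector lookup, plus the classic
# king - man + woman ≈ queen analogy via vector subtraction.
import numpy as np

glove = {
    "king":  np.array([0.8, 0.3, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.5, 0.3, 0.1]),
    "woman": np.array([0.5, 0.9, 0.1]),
}

# Embedding a tokenized Query is just a per-token dictionary lookup,
# stacking the vectors as columns of a (d1 x J) matrix.
query_tokens = ["king", "man", "woman"]
query_matrix = np.stack([glove[t] for t in query_tokens], axis=1)
print(query_matrix.shape)  # (d1, J) = (3, 3)

# Word analogy by subtraction: king - man + woman lands on queen here.
analogy = glove["king"] - glove["man"] + glove["woman"]
print(np.allclose(analogy, glove["queen"]))  # True for this toy table
```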
The output of the word embedding step is two matrices — one for the Context and one for the Query. The lengths of these matrices equal the number of words in the Context and the Query (T for the Context matrix and J for the Query matrix). Meanwhile, their height, d1, is a preset value equal to the vector dimension from GloVe; this can be 50, 100, 200 or 300. The figure below depicts the word embedding step for the Context:
Step 3. Character Level Embedding
Okay, so with GloVe, we obtain the vector representations of most words. However, the GloVe representations are not enough for our purpose!
The pretrained GloVe “dictionary” is huge and contains millions of words; however, there will come a time when we encounter a word in our training set that is not present in GloVe’s vocabulary. Such a word is called an out-of-vocabulary (OOV) word. GloVe deals with these OOV words by simply assigning them some random vector values. If not remedied, this random assignment would end up confusing our model.
Therefore, we need another embedding mechanism that can handle OOV words. This is where the character level embedding comes in. Character level embedding uses one-dimensional convolutional neural network (1D-CNN) to find numeric representation of words by looking at their character-level compositions.
You can think of 1D-CNN as a process where we have several scanners sliding through a word, character by character. These scanners can focus on several characters at a time. As they sweep along, they extract information from the characters they are focusing on. At the end of this scanning process, the information from the different scanners is collected to form the representation of the word.
The output of the character embedding step is similar to the output of the word embedding step. We obtain two matrices, one for the Context and the other for the Query. The lengths of these matrices equal the number of words in the Context and in the Query — T and J — while their height depends on the number of convolutional filters used in 1D-CNN (to know what a “convolutional filter” is, do read the next section). The height is denoted as d2 in the diagram. These two matrices will be concatenated with the matrices that we obtained from the word embedding step.
Additional Details on 1D-CNN
The section above only presents a very conceptual overview of the workings of 1D-CNN. In this section, I will explain how 1D-CNN works in detail. Strictly speaking, these details are not necessary to understand how BiDAF works; as such, feel free to jump ahead if you are short on time. However, if you are the type of person who can’t sleep well without understanding every moving part of an algorithm you are learning about, this section is for you!
The idea that motivates the use of 1D-CNN is that not only words as a whole have meanings — word parts can carry meaning, too!
For example, if you know the meaning of the word “underestimate”, you will understand the meaning of “misunderestimate”, although the latter isn’t actually a real word. Why? Because you know from your knowledge of the English language that the prefix “mis-” usually indicates the concept of “mistaken”; this allows you to deduce that “misunderestimate” refers to “mistakenly underestimate” something.
1D-CNN is an algorithm that mimics this human capability to understand word parts. More broadly speaking, 1D-CNN is an algorithm capable of extracting information from shorter segments of a long input sequence. This input sequence can be music, DNA, voice recording, weblogs, etc. In BiDAF, this “long input sequence” is words and the “shorter segments” are the letter combinations and morphemes that make up the words.
To understand how 1D-CNN works, let’s look at the series of illustrations below, which are taken from slides by Yoon Kim et al., a group from Harvard University.
Let’s say we want to apply 1D-CNN on the word “absurdity”. The first thing we do is represent each character in that word as a vector of dimension d. These vectors are randomly initialized. Collectively, these vectors form a matrix C. d is the height of this matrix, while its length, l, is simply the number of characters in the word. In our example, d and l are 4 and 9, respectively.
Next, we create a convolutional filter H. This convolutional filter (also known as a “kernel”) is a matrix with which we will “scan” the word. Its height, d, is the same as the height of C, but its width, w, is a number smaller than l. The values within H are randomly initialized and will be adjusted during model training.
We overlay H on the leftmost corner of C and take an element-wise product of H and its projection on C (a fancy way to describe this process is taking a Hadamard product of H and its projection on C). This process outputs a matrix with the same dimensions as H — a d × w matrix. We then sum up all the numbers in this output matrix to get a scalar. In our example, the scalar is 0.1. This scalar is set as the first element of a new vector called f.
We then slide H one character to the right and perform the same operations (get the Hadamard product and sum up the numbers in the resulting matrix) to get another scalar, 0.7. This scalar is set as the second element of f.
We repeat these operations character by character until we reach the end of the word. In each step, we add one more element to f, lengthening the vector until it reaches its maximum length of l − w + 1. The vector f is a numeric representation of the word “absurdity” obtained by looking at the word three characters at a time. One thing to note is that the values within the convolutional filter H don’t change as H slides through the word. In fancier terms, we say H is “position invariant”. The position invariance of convolutional filters enables us to capture the meaning of a certain letter combination no matter where in the word that combination appears.
We record the maximum value in f. This maximum can be thought of as the “summary” of f. In our example, this number is 0.7. We shall refer to this number as the “summary scalar” of f. This process of taking a maximum value of the vector f is also referred to as “max-pooling”.
We then repeat all of the above steps with yet another convolutional filter (yet another H!). This convolutional filter might have a different width. In our example below, our second H, denoted H’, has a width of 2. As with the first filter, we slide H’ across the word to get the vector f and then perform max-pooling on f (i.e. get its summary scalar).
We repeat this scanning process several times with different convolutional filters, with each scanning process resulting in one summary scalar. Finally, the summary scalars from these different scanning processes are collected to form the character embedding of the word.
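To make the whole scanning process concrete, here is a small numpy sketch of it. The matrix values, filter widths and random seeds below are made up for illustration; they are not the numbers from the pictures above.

```python
import numpy as np

def char_cnn_filter(C, H):
    """Slide filter H across character matrix C and max-pool the result.

    C: d x l matrix (one column per character of the word)
    H: d x w convolutional filter, with w < l
    Returns the summary scalar max(f), where f has length l - w + 1.
    """
    d, l = C.shape
    _, w = H.shape
    f = np.empty(l - w + 1)
    for i in range(l - w + 1):
        # Hadamard product of H with its projection on C, summed to a scalar
        f[i] = np.sum(H * C[:, i:i + w])
    return f.max()

# toy example: d = 4, l = 9 (like "absurdity"), two filters of widths 3 and 2
rng = np.random.default_rng(0)
C = rng.standard_normal((4, 9))
filters = [rng.standard_normal((4, 3)), rng.standard_normal((4, 2))]

# one summary scalar per filter; collected, they form the character embedding
embedding = np.array([char_cnn_filter(C, H) for H in filters])
print(embedding.shape)
```

Each additional filter simply appends one more summary scalar to the word’s character embedding.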
So that’s it — now we’ve obtained a character-based representation of the word that can complement its word-based representation. That’s the end of this little digression on 1D-CNN; now let’s get back to talking about BiDAF.
Step 4. Highway Network
At this point, we have obtained two sets of vector representations for our words — one from the GloVe (word) embedding and the other from 1D-CNN (character) embedding. The next step is to vertically concatenate these representations.
This concatenation produces two matrices, one for the Context and the other for the Query. Their height is d, which is the sum of d1 and d2. Meanwhile, their lengths are still the same as their predecessor matrices (T for the Context matrix and J for the Query matrix).
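The concatenation itself is a one-liner. Here is a toy sketch — the dimensions d1, d2, T and J below are illustrative placeholders, not the actual values BiDAF uses.

```python
import numpy as np

d1, d2 = 100, 50        # illustrative GloVe and char-CNN dimensions
T, J = 120, 12          # number of words in the Context and the Query

word_context = np.zeros((d1, T)); char_context = np.zeros((d2, T))
word_query   = np.zeros((d1, J)); char_query   = np.zeros((d2, J))

# vertical concatenation: stack the two embeddings on top of each other
context = np.vstack([word_context, char_context])   # (d1 + d2) x T
query   = np.vstack([word_query, char_query])       # (d1 + d2) x J
print(context.shape, query.shape)
```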
These matrices are then passed through a so-called highway network. A highway network is very similar to a feed forward neural network, which you are probably already familiar with. As a reminder, if we insert an input vector y into a single layer of a feed forward neural network, three things happen before the output z is produced:
y will be multiplied with W, the weight matrix of the layer
A bias, b, will be added to W*y
A nonlinear function g, such as ReLU or Tanh will be applied to W*y + b
In a highway network, only a fraction of the input will be subjected to the three aforementioned steps; the remaining fraction is permitted to pass through the network untransformed. The ratio of these fractions is managed by t, the transform gate, and by (1 - t), the carry gate. The value of t is calculated using a sigmoid function and is always between 0 and 1. Now, our equation becomes as follows:

z = t * g(W*y + b) + (1 - t) * y
Upon exiting the network, the transformed fraction of the input is summed with its untransformed fraction.
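Putting the two gates together, a single highway layer can be sketched in numpy as below. The choice of ReLU for the nonlinearity g and all of the weight values are illustrative placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(y, W, b, Wt, bt):
    """One highway layer: a gated mix of transformed and untransformed input.

    t is the transform gate; (1 - t) is the carry gate.
    """
    t = sigmoid(Wt @ y + bt)                    # fraction to transform, in (0, 1)
    transformed = np.maximum(0.0, W @ y + b)    # g(W*y + b), here g = ReLU
    return t * transformed + (1.0 - t) * y      # sum of the two fractions

d = 6
rng = np.random.default_rng(1)
y = rng.standard_normal(d)
z = highway_layer(y, rng.standard_normal((d, d)), np.zeros(d),
                  rng.standard_normal((d, d)), np.zeros(d))
print(z.shape)
```

Note that when t is driven towards 0, the layer simply carries y through unchanged; when t is near 1, it behaves like an ordinary feed forward layer.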
The highway network’s role is to adjust the relative contribution from the word embedding and the character embedding steps. The logic is that if we are dealing with an OOV word such as “misunderestimate”, we would want to increase the relative importance of the word’s 1D-CNN representation because we know that its GloVe representation is likely to be some random gibberish. On the other hand, when we are dealing with a common and unambiguous English word such as “table”, we might want to have more equal contribution from GloVe and 1D-CNN.
The outputs of the highway network are again two matrices, one for the Context (a d-by-T matrix) and one for the Query (a d-by-J matrix). They represent the adjusted vector representations of words in the Query and the Context from the word and character embedding steps.
Step 5. Contextual Embedding
It turns out that these representations are still insufficient for our purpose! The problem is that these word representations don’t take into account the words’ contextual meaning — the meaning derived from the words’ surroundings. When we rely on word and character embedding alone, a pair of homonyms such as the words “tear” (watery excretion from the eyes) and “tear” (rip apart) will be assigned the exact same vector representation although these are actually different words. This might confuse our model and reduce its accuracy.
Thus, we need an embedding mechanism that can understand a word in its context. This is where the contextual embedding layer comes in. The contextual embedding layer consists of Long Short-Term Memory (LSTM) sequences. Here is a quick introduction to LSTM:
An LSTM is a neural network architecture that can memorize long-term dependencies. When we enter an input sequence (such as a string of text) into a normal forward LSTM layer, the output sequence for each timestep will encode information from that timestep as well as past timesteps. In other words, the output embedding for each word will contain contextual information from words that came before it.
BiDAF employs a bidirectional LSTM (bi-LSTM), which is composed of both forward- as well as backward-LSTM sequences. The combined output embeddings from the forward- and backward-LSTM simultaneously encode information from both past (backward) and future (forward) states. Put differently, each word representation coming out of this layer now includes contextual information about the phrases surrounding the word.
The output of the contextual embedding step is two matrices — one for the Context and the other for the Query. The BiDAF paper refers to these matrices as H and U, respectively (terminology alert — this matrix H is distinct from the convolutional filter H mentioned earlier; it is an unfortunate coincidence that the two sources use the same symbol for two different concepts!). Because the forward and backward LSTM outputs are concatenated, the Context matrix H is a 2d-by-T matrix while the Query matrix U is a 2d-by-J matrix.
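To see where the factor of 2 comes from, here is a shape-only numpy sketch. The zero matrices below are mere stand-ins for the real per-timestep outputs of the forward and backward LSTMs, and the dimensions are illustrative.

```python
import numpy as np

d, T, J = 150, 120, 12   # illustrative dimensions

# stand-ins for the per-timestep outputs of the forward and backward LSTMs
fwd_context = np.zeros((d, T)); bwd_context = np.zeros((d, T))
fwd_query   = np.zeros((d, J)); bwd_query   = np.zeros((d, J))

# concatenating forward and backward outputs doubles the height: d -> 2d
H = np.vstack([fwd_context, bwd_context])   # 2d-by-T Context matrix
U = np.vstack([fwd_query, bwd_query])       # 2d-by-J Query matrix
print(H.shape, U.shape)
```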
So that’s all there is to it about the embedding layers in BiDAF! Thanks to the contribution from the three embedding layers, the embedding outputs H and U carry within them the syntactic, semantic as well as contextual information from all words in the Query and the Context. We will use H and U in the next step — the attention step — in which we will fuse together the information from them. The attention step, which is the core technical innovation of BiDAF, will be the focus of the next article in the series — do check it out!
Context : the accompanying text to the Query that contains an answer to that Query
Query : the question to which the model is supposed to give an answer
Answer : a substring of the Context that contains the information needed to answer the Query. This substring is to be extracted by the model
T : the number of words/tokens in the Context
J : the number of words/tokens in the Query
d1 : the dimension from the word embedding step (GloVe)
d2 : the dimension from the character embedding step
d : the dimension of the matrices obtained by vertically concatenating word and character embeddings. d is equal to d1 + d2.
H : the Context matrix outputted by the contextual embedding step. H has a dimension of 2d-by-T
U : the Query matrix outputted by the contextual embedding step. U has a dimension of 2d-by-J
Singapore, 8 October 2019 – AI Singapore’s AI Apprenticeship Programme (AIAP) is recognized as one of the country winners in the “Talent Accelerator” category in this year’s IDC Digital Transformation Awards (DX Awards).
IDC’s DX Awards recognizes outstanding organizations that have made critical breakthroughs in digital transformation across the Asia Pacific region, spread across seven different categories.
The AIAP digital transformation project was selected among over 1000 high-quality entries received from end-user organizations across Asia/Pacific. This initial win at a country-level allowed AI Singapore to qualify as one of IDC’s finalists for the regional awards. AI Singapore will be benchmarked against other winners in the same category to ultimately determine the region’s best of the best.
The AIAP is a full-time 9-month programme where we train and groom local Singaporean Artificial Intelligence (AI) talent. This is in line with the nation’s focus on deepening national capabilities in AI, not only in terms of technology, but also skillsets and talents. The AIAP is a deep-skilling programme, and successful applicants are expected to have prior knowledge of AI and machine learning. Throughout the programme, each apprentice is paid a stipend of between SGD$3500 – $5500 per month. There is no programme fee as the programme is fully subsidised by the Singapore government.
In a 2+7 months approach, the apprentices deepen their skills not only in AI, machine learning and deep learning, but also in software engineering for model deployment into production, during the first two months of the programme. From the third month onwards, they complete ten 3-week or fifteen 2-week sprints on a real-world AI project from AI Singapore’s 100 Experiments (100E) programme. Each 100E is co-funded with the industry up to SGD$500,000, and is expected to solve a real-world business problem leading to significant productivity gains, cost savings and/or new products and solutions for the industry.
IDC’s DX Awards follows a two-phased approach to determine the country and regional winners. Each nomination is evaluated by a local and regional IDC analyst against a standard assessment framework based on IDC’s DX taxonomy. All country winners will qualify for the regional competition, which will be decided by a regional panel of judges comprised of IDC Worldwide analysts, industry thought leaders and academia. Singapore winners will be awarded at the end of the IDC DX Summit Singapore on 24 October, 2019 in Singapore, along with the regional DX Award winners.
“We are indeed honoured and excited to be given this award for our first participation in the DX Awards. The AIAP was conceived to solve a problem AI Singapore faced ourselves i.e. the lack of Singaporean AI engineers to work on our 100 Experiments programme. In the last 18-months, we have undertaken more than 38 AI projects and deployed more than 10 projects into production. The AI apprentices work on these real-world AI projects together with our AI Singapore engineers, mentors and Institutes of Higher Learning/Research Institutes collaborators,” said Laurence Liew, Director, AI Industry Innovation, AI Singapore.
“Digital disruption is impacting all industries and countries around the world. By 2022, the digital economy will go mainstream with 46% of global GDP expected to come from digital products and services. Organizations, both large and small, are seeking digital talent to help them to compete better. Emerging technologies such as AI and machine learning are critical capabilities for the next phase of digital transformation. It is therefore important for governments and educational institutions to provide the necessary programs to prepare the next generation, train young professionals and reskill experienced/senior workforce to contribute effectively in the digital economy”, says Sandra Ng, Group Vice President, Practice Group, IDC Asia/Pacific.
“To date, 2 batches of Singaporean AI apprentices have graduated; 11 out of 57 have been hired into AI Singapore with the rest joining organizations such as DSTA, DBS, Grab and CapGemini, amongst others. Our plan is to train up to 500 Singaporean AI apprentices over the next 5 years for the industry.” said Laurence.
The AIAP is the first TechSkills Accelerator Company-Led Training (TeSA-CLT) initiative in AI. This is a collaboration between AI Singapore and IMDA to develop a pipeline of AI professionals for the industry.
As many of you know, I conduct training at a few institutions in Singapore, such as NUS SCALE, SAS Institute and SMU Academy. A frequent comment that participants make during our discussions is,
There is a myriad of terms out there – Big Data, Artificial Intelligence, Machine Learning, Data Science, Data Analytics etc. It is very challenging to make sense of them and also how they are different from each other.
In particular, during most discussions, the group thought they were talking about the same thing, but that was not actually the case!
In this column, I will attempt to demystify these popular terms so as to give readers a better understanding of what they constitute and, hopefully, as more people are onboard with what these terms are, there will be more meaningful discussions on them. A very important note is that the following are my own definitions and may differ from others, but the definitions I have so far were formed through my work experiences in the field.
Let us start with Data Science. Ok, first things first: let me state that I do not see any difference between Data Analytics and Data Science; they refer to the same thing. So what is Data Science? Data Science is the transformation of data, through maths and statistics, into products that we can use to make better decisions: for instance, insights, visualizations, probability scores or estimates. The main goal is to use data to make informed business decisions. The diagram below summarises this.
Let us move on to the next term from here, Machine Learning. Inside the “Maths & Statistics” part we have the machine learning models. So what does machine learning do? There are three branches of Machine Learning:
Let us start with supervised learning. Now, for supervised learning, you have to work with what is known as labelled data. Labelled data consists of two parts – a single column known as the target and multiple columns known as features. The dataset needs to be structured in that way before it is ‘fed’ into the relevant machine learning model.
The target is something you wish to know as early as possible so that you can act when it is not favorable. Here are a few examples:
A bank would like to know if a person is going to default on a credit card loan.
A company would like to know if its star talent is leaving.
An e-commerce business would like to know how much spending power a potential customer has before making the right marketing offers.
Now, in order to have a good prediction for the above examples, we need some ‘early’ signals. This is where the features come in. Features are what we believe can help us peek into the future. For instance, in the credit card example, the income of the loan holder can be a useful feature, since we believe that the higher the income, the less likely the loan holder is to default on the loan.
So, the target is what we would like to know as early as possible and the features are what will help us know the probability of an event happening (default on loan repayment) or an estimate of a value we are interested in, as seen in the e-commerce example (estimating the customer’s spending power).
Supervised learning helps us understand the relationship between the features and the target.
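To make this concrete, here is a toy sketch of supervised learning on the credit card example, using a nearest-neighbour rule (deliberately the simplest possible method). The incomes and labels are made-up numbers, not real data.

```python
import numpy as np

# labelled data: each row of X holds the features of one loan holder,
# y is the target (1 = defaulted, 0 = repaid) -- toy numbers only
X = np.array([[20_000.], [25_000.], [90_000.], [110_000.]])  # annual income
y = np.array([1, 1, 0, 0])

def predict_nearest(income):
    """Predict the target of a new example from its closest labelled example."""
    distances = np.abs(X[:, 0] - income)
    return y[np.argmin(distances)]

print(predict_nearest(22_000.))   # near the low-income defaulters
print(predict_nearest(100_000.))  # near the high-income repayers
```

The model has "learned" the relationship between the feature (income) and the target (default) from the labelled examples, and uses it to predict the target for new, unseen examples.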
Unsupervised learning deals with unlabelled data. Great! Now that you have come across both labelled and unlabelled data, this gives me the opportunity to explain the difference: labelled data, as mentioned, has both features and a target, whereas unlabelled data only has the features.
So what do we do with the features here? Unsupervised learning is a way for us to find patterns in our data. For instance, for a supermarket we can use unsupervised learning to find out which products are frequently bought together, or the segments that make up the supermarket’s clientele. Namely, based on the features provided, the data analyst/scientist can find out what are the types of customers that they are serving. For instance, health-conscious, families, singles etc.
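Here is a toy sketch of one such pattern-finding method, k-means clustering, applied to made-up supermarket data. The spend figures and starting centres are purely illustrative.

```python
import numpy as np

# unlabelled data: weekly spend of ten shoppers (toy numbers, no target column)
spend = np.array([5., 6., 7., 8., 40., 42., 45., 48., 50., 44.])

def kmeans_1d(data, centers, steps=10):
    """Tiny k-means: alternately assign each point to its nearest centre,
    then move each centre to the mean of the points assigned to it."""
    for _ in range(steps):
        labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
        centers = np.array([data[labels == k].mean()
                            for k in range(len(centers))])
    return labels, centers

labels, centers = kmeans_1d(spend, centers=np.array([5., 50.]))
print(labels)   # two discovered segments, e.g. "light" vs "heavy" spenders
```

No one told the algorithm which shopper belongs to which segment; the segments emerge from the features alone, which is exactly the point of unsupervised learning.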
Reinforcement Learning is not a newly-coined term but receives a lot of mention these days because of Artificial Intelligence.
Let us start with an agent. The human provides the agent with an objective function, for instance, maximizing the score of a computer game the agent is playing, or finding an object in a maze in the shortest amount of time. With the objective function established, the agent starts operating, and as it takes different actions and interacts with the environment, it gets feedback on whether it is closer to or further away from the objective. For instance, if we put the agent in a maze, it will try the different available directions and receive feedback on whether it is closer to or further from the ending point. You can see from here why it is called reinforcement learning: the agent will continue taking positive actions that bring it towards the objective and avoid negative actions wherever possible.
Different from Supervised and Unsupervised Learning, Reinforcement Learning requires an objective function, usually to maximise or minimise certain results. Data is collected along the way and analyzed to determine the optimal solution towards the objective function. These are the main differences between them.
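Here is a minimal sketch of the idea using tabular Q-learning, one of the simplest reinforcement learning methods, on a made-up corridor world. The objective function rewards reaching the goal cell; all of the numbers (learning rate, discount, episode count) are illustrative choices.

```python
import numpy as np

# a 5-cell corridor; the agent starts at cell 0 and the goal is cell 4.
# actions: 0 = step left, 1 = step right. Reward +1 on reaching the goal.
N_STATES, GOAL = 5, 4
Q = np.zeros((N_STATES, 2))          # the agent's learned action values
rng = np.random.default_rng(0)

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

for _ in range(200):                 # episodes
    s = 0
    while s != GOAL:
        a = rng.integers(2)          # explore by taking random actions
        nxt, r = step(s, a)
        # feedback nudges Q towards actions that move closer to the objective
        Q[s, a] += 0.5 * (r + 0.9 * Q[nxt].max() - Q[s, a])
        s = nxt

policy = Q.argmax(axis=1)            # greedy policy after learning
print(policy[:GOAL])                 # the learned policy heads right
```

The data (state, action, reward triples) is collected along the way as the agent acts, and the Q-table distils it into a policy that pursues the objective.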
Most of you have heard about this term called Deep Learning in recent years. Deep learning is not totally new. Deep learning is a form of neural network. Neural networks were created back in the 1940s with the brain as a blueprint. The basic neural network only has three layers. The first layer takes in the input data and passes it to the second layer known as the computation layer, then followed by the last layer which is the output layer (see diagram below).
In recent years, because of the increase in computation power and the amount of data collected, we can now tap on the potential of neural networks, especially since we can now add in more computation layers. Deep Learning is actually a neural network with multiple computation layers. Computation layers can have many ways of computing the data passing through them, which is why there are different types of neural networks these days. For instance, the Recurrent Neural Network, which is very good with sequential data such as languages, and the Convolutional Neural Network which is very good with images, creating tools such as facial recognition and object recognition.
Referring to the diagram above, on the left is the traditional Neural Network. On the right, once we add multiple computation (also known as hidden) layers, the Neural Network starts to get deep. Hence, the name Deep Learning.
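The difference between the two networks can be sketched in a few lines of numpy: the same forward pass, first with one computation layer and then with several stacked on top. All weights here are random placeholders and ReLU is an assumed choice of activation.

```python
import numpy as np

def forward(x, layers):
    """Pass input x through a stack of computation (hidden) layers.
    Each layer is a (weights, bias) pair; more pairs = a 'deeper' network."""
    for W, b in layers:
        x = np.maximum(0.0, W @ x + b)   # ReLU activation
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                                  # input layer
shallow = [(rng.standard_normal((4, 8)), np.zeros(4))]      # one hidden layer
deep = shallow + [(rng.standard_normal((4, 4)), np.zeros(4))
                  for _ in range(3)]                        # several layers
print(forward(x, shallow).shape, forward(x, deep).shape)
```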
Here comes the star of the show, Artificial Intelligence. What is Artificial Intelligence? Artificial Intelligence is about teaching a machine to perform like human intelligence. It involves machine learning, reasoning methods, knowledge graphs, philosophy, algorithms and a lot more. There are actually three levels of Artificial Intelligence.
Level 1: Artificial Narrow Intelligence (ANI)
Level 2: Artificial General Intelligence (AGI)
Level 3: Super-intelligence
Artificial Narrow Intelligence
Artificial Narrow Intelligence, ANI, are what most people are talking about these days. They are the ones that are generating a lot of news and magazine articles. ANI, are artificial intelligence that humans have built, to replace humans in a SINGLE task. For instance, playing a specific chess such as GO, flipping burgers only etc. ANI are very good with a specific task and nothing else. ANI are what I would call, smarter automation. Some of the tasks can now be automated and be taken up by ANI, which is good news, because most of these tasks are repetitive and boring. They free up working time for humans, giving them opportunities to move on to tasks that require more human touch, tasks that require more critical thinking, design thinking, creativity, empathy and more. It will definitely increase the productivity of humans.
Artificial General Intelligence
Artificial General Intelligence, AGI for short, is the holy grail for most Artificial Intelligence researchers. AGI displays human-like intelligence, for instance an AGI can go to anyone’s kitchen and still figure out how to make a cup of coffee, whereas for ANI, it can only make a cup of coffee in a very specific kitchen and nowhere else.
We are still a very long way from building AGI given the technology that we currently have, in my opinion. The reason is that, while we have had a huge breakthrough in the area of machine learning with multi-layer neural networks (Deep Learning, see above), AGI is much more than Deep Learning. Researchers have to know how to teach the machine to build abstractions and relations from the data that is collected. Tools such as reasoning methods, knowledge graphs and control theory can help with that, but we are still a distance away from incorporating them into the apparatus of AGI.
Having said that, I do believe it is time that we start discussing and paying more attention to how we are going to build AGI, to ensure it is for the benefit of humankind. In fact, there is now a movement in advocating for ethics, explain-ability, transparency in the usage of artificial intelligence. Most governments have now started to work on such research and usage frameworks as well, which is very encouraging.
Super-intelligence comes about when there is no control over how we grow, equip and manage our AGI, and it thus gains the momentum to become more intelligent (i.e. smarter than any human). There is a lot of discussion on this right now and if you want to get up to speed, I recommend the book “Superintelligence” by Nick Bostrom, a Swedish philosopher at the University of Oxford. In it, he paints many scenarios where super-intelligence might go wrong. (Side note: after reading the book, you will start seeing paperclips in a different perspective.) My opinion after reading the book is that there is no need to panic; we just need to pay more attention to how we build our AGI, and this is where AI ethics comes into the picture, imbuing our AI scientists and engineers with morals and ethics and paying more attention to the impact of their work on stakeholders.
I, for one, am in favor of having AGI and humans working together, side-by-side creating more synergy and, in turn, creating bigger and positive impact on society.
I hope the above has helped to clarify some doubts that you might have had. As you can see, Machine Learning is just a branch of Artificial Intelligence. Currently we are having a lot more breakthroughs in Machine Learning, which creates the illusion that we are nearing the human-like intelligence called AGI, but in actual fact we are still a distance away, in my opinion.
And also, please do not take my word as final; go forth and read up more to understand the field. I can only say that the understanding I have of these terms so far was accumulated across many years of working in the field, and things are changing rapidly, so your understanding and definitions should be updated along the way.
There are a few more terms I have not clarified in this post, such as Big Data. I will attempt to clarify them in a later post. Thank you for taking the time to read through the post. I wish you, the reader, a pleasant and rewarding journey in understanding the field of Artificial Intelligence.
The year 2016 saw the publication of BiDAF by a team at the University of Washington. BiDAF handily beat the best Q&A models at that time and for several weeks topped the leaderboard of the Stanford Question and Answering Dataset (SQuAD), arguably the most well-known Q&A dataset. Although BiDAF’s performance has since been surpassed, the model remains influential in the Q&A domain. The technical innovation of BiDAF inspired the subsequent development of competing models such as ELMo and BERT, by which BiDAF was eventually dethroned.
When I first read the original BiDAF paper, I was rather overwhelmed by how seemingly complex it was.
BiDAF exhibits a modular architecture — think of it as a composite structure made out of lego blocks with the blocks being “standard” NLP elements such as GloVe, CNN, LSTM and attention. The problem with understanding BiDAF is that there are just so many of these blocks to learn about and the ways they are combined can seem rather “hacky” at times. This complexity, coupled with the rather convoluted notations used in the original paper, serves as a barrier to understanding the model.
In this article series, I will deconstruct how BiDAF is assembled and describe each component of BiDAF in (hopefully) an easy-to-digest manner. Copious amounts of pictures and diagrams will be provided to illustrate how these components fit together.
Here is the plan :
Part 1 (this article) provides an overview of BiDAF.
Part 2 describes the embedding layers.
Part 3 describes the attention layers.
Part 4 talks about the modeling and output layers. It also includes a recap of the whole BiDAF architecture presented in a very easy language. If you aren’t technically inclined, I recommend you to simply jump to part 4.
BiDAF vis-à-vis Other Q&A Models
Before delving deeper into BiDAF, let’s first position it within the broader landscape of Q&A models. There are several ways with which a Q&A model can be logically classified. Here are some of them:
Open-domain vs closed-domain. An open-domain model has access to a knowledge repository which it will tap on when answering an incoming Query. The famous IBM Watson is one example. On the other hand, a closed-domain model doesn’t rely on pre-existing knowledge; rather, such a model requires a Context to answer a Query. A quick note on terminology here — a “Context” is an accompanying text that contains the information needed to answer the Query, while “Query” is just the formal technical word for question.
Abstractive vs extractive. An extractive model answers a Query by returning the substring of the Context that is most relevant to the Query. In other words, the answer returned by the model can always be found verbatim within the Context. An abstractive model, on the other hand, goes a step further: it paraphrases this substring to a more human-readable form before returning it as the answer to the Query.
Ability to answer non-factoid queries. Factoid Queries are questions whose answers are short factual statements. Most Queries that begin with “who”, “where” and “when” are factoid because they expect concise facts as answers. Non-factoid Queries, simply put, are all questions that are not factoids. The non-factoid camp is very broad and includes questions that require logic and reasoning (e.g. most “why” and “how” questions) and those that involve mathematical calculations, ranking, sorting, etc.
So where does BiDAF fit in within these classification schemes? BiDAF is a closed-domain, extractive Q&A model that can only answer factoid questions. These characteristics imply that BiDAF requires a Context to answer a Query. The Answer that BiDAF returns is always a substring of the provided Context.
With this knowledge at hand, we’re now ready to explore how BiDAF is structured. Let’s dive in!
Another quick note: as you may have noticed, I have been capitalizing the words “Context”, “Query” and “Answer”. This is intentional. These terms have both technical and non-technical meaning and the capitalization is my way of indicating that I am using these words in their specialized technical capacities.
Overview of BiDAF Structure
BiDAF’s ability to pinpoint the location of the Answer within a Context stems from its layered design. Each of these layers can be thought of as a transformation engine that transforms the vector representation of words; each transformation is accompanied by the inclusion of additional information.
The BiDAF paper describes the model as having 6 layers, but I’d like to think of BiDAF as having 3 parts instead. These 3 parts along with their functions are briefly described below.
Embedding Layers
BiDAF has 3 embedding layers whose function is to change the representation of words in the Query and the Context from strings into vectors of numbers.
Attention and Modeling Layers
These Query and Context representations then enter the attention and modeling layers. These layers use several matrix operations to fuse the information contained in the Query and the Context. The output of these steps is another representation of the Context that contains information from the Query. This output is referred to in the paper as the “Query-aware Context representation.”
Output Layer
The Query-aware Context representation is then passed into the output layer, which will transform it to a bunch of probability values. These probability values will be used to determine where the Answer starts and ends.
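As a toy sketch of this last step, here are made-up scores being turned into probabilities with a softmax, from which the start and end positions of the Answer are read off. The scores below are invented for illustration; they are not produced by an actual BiDAF model.

```python
import numpy as np

def softmax(x):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

# hypothetical unnormalised scores over a 10-word Context (toy numbers)
start_scores = np.array([0.1, 0.2, 3.0, 0.5, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1])
end_scores   = np.array([0.0, 0.1, 0.2, 0.3, 2.5, 0.1, 0.0, 0.2, 0.1, 0.0])

p_start, p_end = softmax(start_scores), softmax(end_scores)
start, end = int(p_start.argmax()), int(p_end.argmax())
print(start, end)   # the Answer spans the Context words from start to end
```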
A simplified diagram that depicts the BiDAF architecture is provided below:
If all these don’t make sense yet, don’t worry; in the next articles, I will delve into each BiDAF component in detail. See you in Part 2!