
The Rise of the Data Catalogs

As the need for data to fuel AI applications grows, tools to organise data are proliferating

As discussed in a previous post, the data engineering (DE) team at AI Singapore was formed to architect and deploy a data platform for our organisation. Our technical leaders have identified a more robust and consistent approach to data management as a critical need. They want to know where data is located, who has been accessing that data, what data products and other artefacts have been generated from the original dataset, whether there is an expiry date on the data and many other details. This data platform will support our engineers with both improved processes and tools that provide a comprehensive view of our data resources. This, in turn, will enable data governance policies and processes to be applied to better protect our data assets.

Enter the Data Catalog

Over the last few months we have been surveying the technology landscape for data catalogs for the AI Singapore data platform, and were surprised at the level of activity in this domain. A number of open source and commercial projects have been launched by big tech companies to better organise their data layer and help data scientists and engineers locate relevant information. This is yet another strong indicator that organisations view data as a strategic asset and want to manage it more effectively.

Our team is reviewing the feature sets across these frameworks and will then collaborate with the platform engineering and AI engineering team leaders to align on what the important capabilities are. As an organisation with the mission to develop the AI ecosystem in Singapore, we have data needs that are somewhat unique; we work with external collaborators who provide data that should be restricted only to the project team members. This data and subsequently generated data products should be erased when the project is completed, while metadata about the artefacts created should be maintained so that metrics can be collected to help us refine our practices. Without providing a formal comparison or detailed laundry list of features, this post highlights some of the reasons we found these frameworks interesting and how the technology might fit into the AI Singapore environment.
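To make the retention requirement above concrete, here is a minimal sketch of a catalog record whose underlying data can be erased while its metadata is kept. The class name, field names and the example location string are all hypothetical and are not taken from any of the frameworks discussed in this post.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: metadata outlives the data it describes."""
    name: str
    project: str
    location: Optional[str]  # where the data lives; None once it has been erased
    allowed_users: Set[str] = field(default_factory=set)
    erased_on: Optional[date] = None

    def can_read(self, user: str) -> bool:
        # Access is restricted to project members, and only while the data exists.
        return self.location is not None and user in self.allowed_users

    def erase(self, when: date) -> None:
        # Deleting the bytes from storage would happen elsewhere; here we only
        # record that the data is gone and keep the rest of the metadata.
        self.location = None
        self.erased_on = when


entry = CatalogEntry("tax_docs_v1", "project-x", "bucket/project-x/raw", {"alice"})
entry.erase(date(2021, 1, 31))
print(entry.name, entry.erased_on, entry.can_read("alice"))
```

Keeping the record after erasure is what would allow metrics about past projects to be collected without retaining the collaborator's data.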

Mature Solutions

CKAN is an acronym for Comprehensive Knowledge Archive Network, although hardly anyone knows that. When we initially began this effort, CKAN was probably the only data catalog that most data engineers could name off the top of their heads other than TDS. CKAN has been the data portal standard for a number of years with a large community and many plugins and extensions. So many extensions, in fact, that at this point it is a bit daunting to know which to choose and what combinations will work well. Users can also write their own extensions. Initially oriented toward making datasets available to the public or large institutions for search and download, it would be interesting to see if it has evolved to support the more heterogeneous world of AI data.

Apache Atlas is a project oriented to data governance. It provides robust authentication and authorisation features using Apache Ranger. Metadata objects are stored internally as graphs in JanusGraph and a searchable index is generated. Atlas can ingest objects from HBase, Hive, Storm and Kafka, of which we only use Kafka frequently. It also supports data expiry.

Recent Offerings

What is clear to us is that managing metadata is a common challenge if well known tech companies are creating fully featured solutions for their internal development teams. Multiple projects reference the Ground project from Berkeley as shaping their thinking about design. Ground used a layered graph structure to track versions, models and lineage. Just as clear is that there is not a silver bullet available if they have all decided to build their own solutions as this requires a large commitment of resources to create and maintain.

Google Data Catalog is a relatively new product that integrates with Google's data storage tools (BigQuery, Pub/Sub, and GCS) to extract and make available both technical and business metadata. The flexible tag schema and the ability to attach metadata as a tag to any data asset (down to a specific column in a table) are well designed, and they make faceted search over those tags possible. As we use a different cloud provider, this will probably not be our choice. However, Google has a knack for delivering simple, flexible solutions and we want to understand how they approach this challenge.
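The tag-and-facet idea described above can be illustrated with a small in-memory index. This is only a sketch of the concept and not Google's API; the `TagIndex` class, its methods and the asset names are hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory index mapping (tag key, tag value) to assets."""

    def __init__(self):
        self._index = defaultdict(set)

    def tag(self, asset, **tags):
        # An asset can be a table or a single column, e.g. "sales.orders.email".
        for key, value in tags.items():
            self._index[(key, value)].add(asset)

    def search(self, **facets):
        # A faceted search is the intersection of the assets matching each facet.
        matches = [self._index.get((k, v), set()) for k, v in facets.items()]
        return set.intersection(*matches) if matches else set()


index = TagIndex()
index.tag("sales.orders", owner="finance", pii="no")
index.tag("sales.orders.customer_email", owner="finance", pii="yes")
index.tag("hr.staff", owner="hr", pii="yes")
print(index.search(owner="finance", pii="yes"))
```

Because tags attach to any asset name, a column-level tag is found by the same search as a table-level one.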

DataHub is an open source project at LinkedIn which is still in development even though it has already been deployed internally for their data engineering teams. The UI is intuitive and provides strong search capabilities and a broad feature set. One initial concern is that many of the supporting technologies are other LinkedIn open source projects, so an organisation will be buying into a lot of LinkedIn technologies. Additionally, it is important to know that this is a complex system composed of about a dozen Docker images and many technologies. Currently, there is no standard Kubernetes deployment script, although one is apparently in the works, so we deployed with Docker Compose to do some further prototyping.

Also still in development, Amundsen from Lyft seems a less complex system than DataHub, but it also appears to have a narrower scope. Designed as a set of microservices and based on well-known open source projects such as Neo4j, Elasticsearch and Airflow, Amundsen uses discovery as well as user annotation to gather information and context, currently focusing more on the technical metadata than the business metadata. The underlying engine is Neo4j, which seems a good match for tracking and surfacing the relationships between users, datasets, reports, etc. Unfortunately, in the current version, there are no authorisation capabilities to restrict metadata to project members only (this is on the roadmap). In addition, there are only three types of resources (Users, Tables and Dashboards), although the data model is extensible. The community on Slack is also very active.

Making the Choice

Ideally, the solution we choose will provide integrations that collect much of the technical metadata (schemas, size, create/modified timestamps, etc.) automatically from the repositories, which would include RDBMS, logs, and object stores. However, only so much collection can be automated; other information needs to be added by the team that is responsible for the associated data. The documentation for both Amundsen and DataHub notes that a metadata system is only as good as the community that maintains it. Clear processes and team buy-in will be as important as the features of the framework.

Once a solution is chosen there will still be a considerable amount of work to configure the framework to support our projects, integrate it with existing frameworks, and augment the environment to make it as intuitive as possible for the AI engineers.

We could, of course, roll out our own and build only what we need, but like many modern organisations, we run a lean development team and to commit to a significant internal development project at this time is not appealing. It is not only the initial design, development, test and deployment effort, but also the subsequent maintenance effort. A viable open source solution is preferred.

We are still in evaluation mode as part of our effort to enable our engineering teams with a more organised approach to data management. Our team will provide an update down the road on our decisions and progress overall. If you have ideas, success stories with a certain solution or similar challenges, feel free to leave a comment.


Artificial Intelligence for Financial Analysis

(By Nick Lik Khian Yeo)

In the world of finance, you will never run out of reading material. Dozens of documents, reports, and studies show up in your inbox daily demanding your attention. However, not all of these are relevant to your interests, especially if you are an analyst for a specific product in a specific region. In this project, which was completed in October 2019, I was part of a team that trained and deployed a machine learning system to classify tax update documents by topic and location.

The Problem: More documents than you could shake a stick at

Our client, GIC, is a sovereign wealth fund established by the Singapore government, with a portfolio spanning dozens of countries and territories. They were one of the project sponsors in AISG’s 100 Experiments (100E) programme. We were engaged by their tax advisory team, whose job (among others) was to keep track of changes in the tax code and to study the implications for their portfolio of investments. This was a time-consuming task as they had to sift through mountains of documents in their inbox to identify information specific to changes in their specialised tax categories before they could even get started on the analysis.

Our solution was to build a document labeling algorithm that would parse a document and identify the specific tax topics it related to, as well as the geographical region it affected. This is known in machine learning as a multi-label classification problem as each document could cover multiple topics.

Data drives the solution to every AI problem

Before we could train our machine learning model, we first needed data. Due to content sensitivity, our client could not simply give us a dump of their emails. Instead we worked with them to construct an initial labeled dataset of 200 publicly-available documents. This dataset was too small to support any significant training, but it served as a ‘gold standard’ to help validate our model accuracy, and allowed us to do some exploratory data analysis.

Our initial exploration of the data identified 10 main categories, and over 100 sub-categories that fell under these 10 categories. In the course of our discovery process, we found that the 10 main categories were easily distinguishable and in fact, the updates the analysts received were already sorted according to these main categories. The real value thus lay in identifying which of the sub-categories each document belonged to, and this required a deeper understanding of each document.

To deal with the lack of training data, we went online to find documents for each sub-category. We downloaded all the documents that had words matching that sub-category in the title. This gave us a ‘weakly’ labeled training dataset of several thousand documents.

One problem remained: our training data had one label for each document, but we were supposed to build a model that could predict multiple labels per document.

Training the machine learning model

I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times.
– Bruce Lee

The problem of training on a multi-class dataset while producing a multi-label output was addressed in the model design: instead of training one model that could predict 100 labels, I trained 100 models that each predicted one label. Each model trained on only one topic and became the expert at identifying whether that particular topic was present. When a new document is encountered, every model makes a prediction, and the results are collated to retrieve multiple labels for the document. This design had the added benefit of future-proofing the system. If I want to add categories after the models have been trained, I do not need to retrain everything. Instead, I only have to train a model on the additional category and add it to the existing set.
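The one-model-per-label idea can be sketched in a few lines. The toy classifier below (a smoothed word log-odds scorer), the function names and the example documents are assumptions made for illustration only; the production system used a far richer ensemble per topic.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BinaryTopicModel:
    """Toy stand-in for one per-topic expert: is this topic present or not?"""

    def __init__(self, label):
        self.label = label
        self.log_odds = {}

    def fit(self, docs, labels):
        # Documents carrying this label are positives; all others are negatives.
        pos, neg = Counter(), Counter()
        for text, doc_label in zip(docs, labels):
            (pos if doc_label == self.label else neg).update(tokenize(text))
        vocab = set(pos) | set(neg)
        pos_total = sum(pos.values()) + len(vocab)
        neg_total = sum(neg.values()) + len(vocab)
        # Laplace-smoothed log-odds of each word appearing in a positive document.
        self.log_odds = {
            w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
            for w in vocab
        }
        return self

    def predict(self, text):
        # Unknown words contribute nothing to the score.
        return sum(self.log_odds.get(w, 0.0) for w in tokenize(text)) > 0


def train_one_model_per_label(docs, labels):
    return {lab: BinaryTopicModel(lab).fit(docs, labels) for lab in sorted(set(labels))}


def predict_labels(models, text):
    # Every expert votes independently, so a document can receive several labels.
    return sorted(lab for lab, model in models.items() if model.predict(text))


docs = [
    "withholding tax on dividends", "dividend withholding rates change",
    "transfer pricing rules updated", "new transfer pricing documentation",
    "vat rate increase announced", "vat registration threshold",
]
labels = ["withholding", "withholding", "transfer_pricing",
          "transfer_pricing", "vat", "vat"]
models = train_one_model_per_label(docs, labels)
print(predict_labels(models, "transfer pricing and withholding tax"))
```

Adding a category later only requires fitting one more `BinaryTopicModel` and inserting it into the `models` dictionary; the existing experts are left untouched.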

The model itself was actually an ensemble of several models. Some models focused on how often each word occurs in a document, weighted by how rare that word is across all documents (known as term frequency-inverse document frequency, or TF-IDF), while others tried to gain a semantic understanding of the document with a language model pre-trained on the English language.
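As a rough illustration of TF-IDF, the sketch below uses the plain textbook weighting (term frequency multiplied by the log of the inverse document frequency). The function name and sample documents are hypothetical, and library implementations typically use smoothed variants of this formula.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(["tax treaty update", "tax rate update", "tax audit notice"])
print(weights[0])
```

In this example "tax" appears in every document and so receives a weight of zero, while "treaty", which appears in only one document, receives the highest weight in the first document.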

Additional features were generated in the following manner:

  1. The spaCy matcher was used to highlight certain important keywords identified by subject matter experts
  2. An algorithm called k-means clustering was used to automatically group documents into unsupervised categories
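To show how k-means can turn unlabeled documents into an extra feature, here is a minimal sketch of the standard iterative procedure on small numeric vectors. It assumes the documents have already been converted to vectors, and it seeds the centroids with the first k points purely to keep the example deterministic; neither detail comes from the project itself.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    # Assumption: seed with the first k points (real code would randomise this).
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
cluster_ids, centroids = kmeans(points, k=2)
print(cluster_ids)
```

The resulting cluster id of each document can then be supplied to the classifiers as one more input feature.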

Will this model be useful?

We decided to evaluate the performance of the model by a classic human vs. computer comparison, with the target that a satisfactory model should perform no worse than a human analyst. We collected a batch of unseen documents and had the model predict their labels. At the same time, three analysts also worked to label these documents.

With these data points, we could look at both how the model performed compared to humans, and how the analysts compared with one another. The inter-analyst comparison was necessary because at this high level of topic granularity, many topics overlap and there is some degree of subjectivity to the labels.

Our model achieved an F1 score of 0.65, which was essentially the same as the inter-analyst F1 score of 0.64. We had successfully built a model that performed no worse than an analyst at identifying document topics!
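For readers unfamiliar with the metric, the sketch below shows one way to compute an F1 score over multi-label annotations and to average it across pairs of annotators. It assumes micro-averaging, which the article does not specify, and the function names and sample labels are hypothetical.

```python
from itertools import combinations


def micro_f1(reference, predicted):
    """Micro-averaged F1 between two lists of label sets (one set per document)."""
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        ref, pred = set(ref), set(pred)
        tp += len(ref & pred)   # labels both agree on
        fp += len(pred - ref)   # labels only in the prediction
        fn += len(ref - pred)   # labels only in the reference
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_f1(annotators):
    """Average agreement over every pair of annotators."""
    scores = [micro_f1(a, b) for a, b in combinations(annotators, 2)]
    return sum(scores) / len(scores)


analyst_a = [{"vat"}, {"withholding", "treaty"}, {"transfer_pricing"}]
analyst_b = [{"vat"}, {"withholding"}, {"transfer_pricing", "treaty"}]
print(micro_f1(analyst_a, analyst_b))
```

Micro-averaged F1 is symmetric in its two arguments, so it does not matter which analyst is treated as the reference when comparing a pair.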

Deploying the model

All incoming documents are automatically tagged, and feedback given is used to retrain the model.

The model is deployed in a Docker container to make it work across different environments, and consists of three key services:

  1. An automated training script that can be used to add additional categories or incorporate user feedback
  2. A prediction API that is triggered when a new document is added
  3. A feedback module, which collects feedback from analysts, accounts for conflicting feedback, and updates the database
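The article does not describe how conflicting feedback was reconciled, so the following is only one plausible approach, sketched under that assumption: a strict majority vote per document and label, with ties set aside. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def resolve_feedback(feedback):
    """Collapse analyst feedback into one decision per (document, label).

    feedback: iterable of (doc_id, label, is_correct) tuples.
    Returns {(doc_id, label): bool}, keeping only strict-majority decisions.
    """
    votes = defaultdict(lambda: [0, 0])  # [confirmations, rejections]
    for doc_id, label, is_correct in feedback:
        votes[(doc_id, label)][0 if is_correct else 1] += 1
    resolved = {}
    for key, (yes, no) in votes.items():
        if yes != no:  # ties are left out rather than guessed
            resolved[key] = yes > no
    return resolved


feedback = [
    ("doc1", "vat", True), ("doc1", "vat", True), ("doc1", "vat", False),
    ("doc2", "treaty", True), ("doc2", "treaty", False),
]
print(resolve_feedback(feedback))
```

Only the resolved decisions would be written back to the database and used for retraining; the tied case for "doc2" is dropped here.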


Deploying a model in this manner is what makes the system ‘intelligent’ as it is able to get better over time by learning from user feedback and improving itself. This ensures that the model will remain relevant as new tax topics are introduced, or the discussion surrounding a particular topic changes over time (concept drift).

The model has proven effective during user acceptance tests and has since been deployed into production for their local and overseas offices. This is just one of the ways that artificial intelligence can be used to augment workflows and improve efficiency. Artificial intelligence is now at a point where many of the techniques originally found in research papers are ready to be adopted by industry.


This article is a reproduction of the original by Nick Lik Khian Yeo. He is a graduate of the AI Apprenticeship Programme (AIAP®) and is plying his craft as a data scientist at GIC at the time of publication.

AI Summer School 2020 Thrives in Virtual Setting with Innovative Tweaks

AI Summer School has grown, COVID-19 notwithstanding. More than 250 attendees from 30 countries took part in this year’s event, with leading experts in the field of artificial intelligence (AI) coming together to share their work and foster collaboration amongst the next generation of researchers. The latest edition, which took place from 3 to 7 August, follows the success of the inaugural event last year which attracted 142 attendees from 15 countries.

Held in a virtual setting because of the pandemic, AI Summer School 2020 featured innovative tweaks to the traditional format to enable students, academic researchers and industrial practitioners to explore exciting possibilities surrounding the use of AI in real-world application domains and raise awareness of data innovation challenges and issues.

Unconference Sessions

“The main difficulty this year was in providing opportunities for participants to interact,” said Dr Stefan Winkler, Chair of Organising Committee and Deputy Director of AI Technology, AI Singapore. To address this, “Unconference Sessions” were held to give participants an opportunity to break into small discussion groups with the flexibility of exploring different groups based on their interests.

“The main aim was to facilitate the exchange and cross-pollination of ideas from a ground up rather than a top down approach,” he explained.

A general “Hangout” table was created to help participants navigate to topics they may be interested in, while individual topic tables gave participants the option to start or join a table with its own Zoom session and Google Doc sharing.

Poster videos

Poster video sessions were also held, where participants shared a video of their AI-related work on YouTube and facilitated discussions on their project by replying to comments posted on the social media platform.

Awards were presented for the three top poster videos. Christian Alvin H. Buhat from the University of the Philippines Los Baños received the nod for his animated agent-based model of COVID-19 infection inside a train wagon. The second award went to Jiafei Duan from the Artificial Intelligence Initiative at the Institute for Infocomm Research, A*STAR, for his video on ActioNet. This is a platform for task-based data collection and augmentation in a 3D environment, which has the potential to catalyse research in the field of embodied AI.

The third video poster that caught the judges’ eye was M Ganesh Kumar’s presentation on schemas for few-shot learning, which involves feeding a learning model with a very small amount of training data. Ganesh is from the Graduate School of Integrative Sciences and Engineering at the National University of Singapore (NUS).

DinerDash challenge

Another unique component introduced in AI Summer School 2020 was the DinerDash challenge, which was organised as part of the Reinforcement Learning workshop on Day 2. This is a game where a single waiter makes complex decisions on customer seating arrangements, taking orders, serving food and many other tasks. Participants worked in small groups to test reinforcement learning baselines and competed with one another for the highest score in the DinerDash simulator.

For Ong Chi Wei, a Post-doctoral Research Fellow from the Department of Biomedical Engineering, NUS, this was the best experience he had at the summer camp. “The key takeaway for me was the Reinforcement Learning (RL) Diner Dash challenge. It was well organised and interesting. We were required to submit our proposal on the same day after the question was posted using reinforcement learning. I learnt from my teammates and we managed to solve the problem with different algorithm testing. Overall the challenge made us think creatively and how to work as a team to solve AI problems.”

Distinguished alumni

Adding gravitas to Summer School were presentations by experts in the field of AI.

In a keynote on “AI @ Scale – Trends and Lessons Learnt from Large-scale Machine Learning Projects”, Dr Tok Wee Hyong, Principal Data Science Manager at Microsoft Corporation, shared his insights into key trends in machine learning and deep learning, grounded in practical experience evolving AI ideas from proof of concept to production at some of the world’s largest Fortune 500 companies.

Dr Tok, who is with the AzureCAT team in Redmond, was one of three overseas-based speakers, all alumni of the graduate schools of NUS and Nanyang Technological University (NTU), who have gone on to carve out distinguished careers in the field of AI. The others were fellow Singaporean Dr Yi Tay, a research scientist at Google AI, Mountain View, and Dr Trọng-Nghĩa Hoang, a research staff member at the MIT-IBM Watson AI Lab, IBM Research Cambridge.

Personalised learning at scale

The second keynote at the event was delivered by Prof Zhai Chengxiang, Donald Biggar Willett Professor, University of Illinois at Urbana-Champaign. His presentation on “AI for Education: Towards Personalised Learning at Scale” highlighted the exciting opportunities for applying AI techniques to transform education to make it both more affordable and more effective.

In other sessions, speakers shared their work in areas such as federated learning, self-supervised deep learning, multi agent interaction, Gaussian processes and low-resource machine learning, and also covered AI applications in sectors such as healthcare. Additional important aspects of AI such as ethics and governance were discussed too, as were career-related topics such as job hunting and entrepreneurship.

The plus side of a virtual camp

Looking back on AI Summer School 2020, Dr Winkler felt the virtual format had its advantages. “We could not hold any social events, such as the buffet dinner and Night Safari outing that we had last year, but on the plus side, we could offer much lower registration fees, and open up the school to a larger number of people with no auditorium size constraints.”

New AI Makerspace at Singapore Polytechnic

To meet the rising local demand for Artificial Intelligence (AI) skills and to assist local industries in their digital transformation, Singapore Polytechnic (SP) through its Data Science and Analytics Centre (DSAC) has collaborated with AISG to set up an AI Makerspace on the Dover Road campus.

The new AI Makerspace, which is a satellite node of AISG’s existing Makerspace, will give SP students, including those from the Diploma in Applied AI & Analytics, the opportunity to intern with and be mentored by AISG engineers and DSAC staff, and to leverage AI Bricks to build AI solutions.

As part of the collaboration, DSAC will work with AISG to offer relevant training courses as well as AI Clinics for Small and Medium Enterprises (SMEs). Through the courses and clinics, employees and owners of SMEs will gain a better understanding of AI and Makerspace’s AI Bricks, and learn to harness AI solutions to increase productivity and business opportunities.

The latest AI Makerspace is timely as the Services and Digital Economy Technology Roadmap cites AI as one of the key technology areas that will change the world and take Singapore’s economy forward in the coming years. As Singapore embarks on its Smart Nation journey, SP is glad to partner AISG to equip our youths and companies with much needed technological skills that will help our industries transform.

Ask JARVIS – The Personalised AI Agent for DHL Care

Prototype developed for AI in Health Grand Challenge helps pave the way for predictive care, personalised care and patient empowerment

What is the likelihood of a Diabetes, Hypertension and hyperLipidemia (DHL) patient developing complications over the next five years, and what are the factors that contribute to this risk? A JARVIS-DHL prototype developed for the AI in Health Grand Challenge has the answers.

Launched in June 2018, the AI in Health Grand Challenge seeks to explore how AI technologies and innovations can help solve important problems faced by Singapore and the world. The focus was on healthcare, and the challenge was on how AI can be used to help primary care teams stop or slow disease progression and complication development in 3H (hyperlipidemia, hyperglycemia, hypertension) patients by 20 percent in five years.

It is estimated that 3H is present in up to 20 percent of the adult population in Singapore and will rise with an ageing population, leading to an increase in healthcare spending and impacting the quality of life of those who are affected.


JARVIS-DHL is one of three proposals that have been awarded. It is led by researchers from the Institute of Data Science NUS (IDS), in collaboration with SingHealth Health Services Research Centre (HSRC), Singapore National Eye Centre (SNEC), National Heart Centre Singapore (NHCS) and Duke-NUS. JARVIS-DHL aims to build a consolidated AI platform which can be used to improve the 3H care delivery process by facilitating the practice of evidence-based personalised care and shared decision-making.

The researchers’ focus was on transforming local DHL primary care through the following three-pronged approach:

  • From reactive to predictive care by enabling accurate predictive stratification of DHL patients
  • From “one-size-fits-all” to personalised care by enabling customised care based on local and individual contexts
  • From passive to active patients by enabling patient education, self-care and monitoring

Benefits to Primary Care Teams

For primary care teams, early screening and risk stratification enable them to right-site care for 3H patients instead of relying on the reactive event-driven sequential referral model. This allows patients to spend less time in healthcare institutions, and also enables healthcare resources to be put to optimal use.

By facilitating evidence-based personalised care and shared decision-making, JARVIS-DHL also enables primary care physicians to work with patients to increase treatment adherence. For example, the system is able to recommend evidence-based treatment options, quantify personalised treatment benefits and the risk of complications, and adapt the treatment regimen based on the patient’s lifestyle. This helps alleviate the patient’s anxiety over perceived side effects and supports holistic clinical decision-making.

Benefits to Users

Through the use of technologies for patient education, self-care and monitoring, patients are empowered to take ownership of their healthcare journey beyond their visits to the clinic, supporting shared decision-making with primary care physicians.

12-month report card

The team obtained access to local clinical datasets pertinent to their research and went on to develop the prototype for JARVIS-DHL, a consolidated AI platform which can be used to improve the care delivery process by facilitating evidence-based personalised care and shared decision-making.

The prototype incorporates a diabetes risk calculator that can compute the risk profiles of DHL patients that are likely to develop complications over a five-year period. The system gathers local primary care data as well as healthcare and lifestyle tracking data to create AI algorithms and models that can help identify at-risk patients. It identifies the specific factors that contribute to their risk and stratifies patients into various risk groups for the delivery of predictive care.

Next Steps

Whilst advancing AI research is a key goal of the AI Grand Challenge, one of the important takeaways for the team was the need to balance the aspirations for cutting-edge AI research against its practical impact in clinical applications.

With this in mind and as they approach Stage 2 of the development, the team has adopted a balanced approach that will deliver practical real-world impact as it validates and refines its AI model for deployment in clinics.



About the Team

Lead Principal Investigator: Prof Wynne Hsu (NUS)

Co-Principal Investigators:

  • Professor Ng See-Kiong (NUS)
  • Professor Lee Mong Li (NUS)
  • Associate Professor Chee Yong Chan (NUS)
  • Professor Wong Tien-Yin (SingHealth)
  • Professor Marcus Ong Eng Hock (SingHealth)
  • Associate Professor Tan Ngiap Chuan (SingHealth)
  • Dr Teh Ming Ming (SingHealth)
  • Adjunct Associate Professor Yeo Khung Keong (SingHealth)

Host Institution: National University of Singapore (NUS)

Partner Institution(s): SingHealth Group (SingHealth)

In 2019, the team published papers at top international AI venues such as the Conference on Computer Vision and Pattern Recognition (CVPR), the IEEE International Conference on Image Processing (ICIP), and the IEEE International Conference on Tools with Artificial Intelligence (ICTAI). It has also received a request from the American Diabetes Association (ADA) to feature JARVIS-DHL in the association’s Thought Leadership Film Series.

The AI in Health Grand Challenge

The AI in Health Grand Challenge is a five-year, two-stage programme with a total funding quantum of $35 million. AI Singapore, together with an International Review Panel, selected three projects to be awarded Stage 1 funding of $5 million per project for the first two years. The projects focused on applying AI technologies in innovative ways across the continuum of 3H (hyperlipidemia, hyperglycemia, hypertension) care.

Taking A Leap of Faith From Cancer Biology to AI

As a PhD student in cancer biology, Simon Chu took a huge leap of faith when he dived into artificial intelligence (AI) without any formal background in computer science or mathematics.

The odds seemed stacked against him. “I knew that it will be difficult to get into an AI role with my background,” he said.

To get a foothold in the field, he signed up for the AI for Industry (AI4I) programme – a self-paced, self-directed learning programme which gave him a year of access to DataCamp, an online learning resource for data science and analytics.

“I studied religiously on DataCamp, completing one to two courses per day. I completed the coursework requirement for AI4I fairly quickly, and I went beyond that to further enhance my knowledge with other courses on DataCamp,” he recalled. As part of the old requirement for AI4I, he had to attend two face-to-face workshops. The first workshop was actually a talk on AI for Everyone (AI4E). The session was not too technical and he learnt about the product development cycle from the talk.

And although he did not make it through his first attempt at the AI Apprenticeship Programme (AIAP), the knowledge he accumulated gave him the confidence to apply once again.

A different lens

Today, Simon is well on the way to completing the AIAP, and the experience has been an eye opener for him. He finds that AI presents a different lens for understanding data and ways in which the world works. Instead of the hypothesis-driven approach which was central to biology experiments, his experience in AIAP challenges him to let the data tell the story, instead of finding the data to support a story.

His passion for AI has also grown as he developed a firmer grasp of ways to develop his own AI models. He is currently working on a Singlish language model in the field of Natural Language Processing.

Despite the progress he has made, family and friends still ask him why he chose to go into AI after spending almost a decade in the field of biology. His answer is that biology and AI are not mutually exclusive options, and he firmly believes that his years in biology have not gone to waste.

Bilingual in Biology and AI

As he picks up skills in AI, he understands that AI is a way of dealing with data, and that it needs to be applied within the context of domain knowledge. In this regard, a biology background enables him to “speak both languages” and there will be opportunities for him to return to it and apply his AI skills, he said.

He also emphasises the importance of staying “teachable”. In an industry that is evolving very quickly, where research papers written three to five years ago could already be outdated, passion in AI has to be accompanied by a willingness to keep learning, he said.

Sharing his experiences with others who are planning to re-apply for AIAP, he said, “When you have a sense of what the technical assessment/interview is like, you know exactly what you are lacking in terms of skills and knowledge. So work on improving those areas.”

Besides working on technical skills such as coding and machine learning concepts, it is also important to understand the product development cycle. “Attend talks and workshops organised by AI Singapore and other parties, they might be helpful,” he advised. “Don’t give up! If a biologist like me can do it (eventually), so can you.”

If you are keen to prepare for the AIAP, see “Becoming an AI Apprentice – Field Guide”.



Making Inroads into A Male-Dominated World

Traditionally, the field of artificial intelligence (AI) has been dominated by men. When the AI Apprenticeship Programme (AIAP) was first launched in May 2018, only two women took up the gauntlet. Since then, however, a growing number of women have been proving their mettle in this field. Among them is Fiona Lim, one of 19 women who have joined the programme to date, and who has since graduated from AIAP.

First-hand experience

For Fiona, it all began in March 2019 when she was working as a data analyst at a consulting firm. Fiona had graduated from the National University of Singapore with a degree in Statistics but felt the need to build a stronger technical foundation for her role. Looking around for suitable online courses that could help her, she came across AIAP, which is run by AI Singapore.

The nine-month programme presented her with an opportunity that she could not pass up – a chance to do a deep dive into AI concepts through self-directed learning, learn alongside passionate mentors and peers, and apply the knowledge to a real industry project.

The end-to-end project would provide her with first-hand experience not only in developing AI models, but also in building the data pipeline and deploying it as an application programming interface (API).

In the beginning, Fiona found the going tough, especially given her lack of programming experience. However, her knowledge of statistics came in handy. “I was able to grasp learning content quickly, and the challenges and obstacles along the way actually motivated me because ultimately, what I wanted to take away was the learning experience,” she said.

In the process of trying to understand how an industry expert thinks and finding the best model that can automate part of human work, Fiona was also introduced to machine learning and deep learning models. This stoked her interest in research.

Reaching out to more women

After completing the AIAP in Dec 2019, Fiona started work as a research assistant in the field of Natural Language Processing at Nanyang Technological University. She hopes to use her new-found knowledge to one day build a machine-learning product that can help people communicate better, especially the elderly who may not be fluent in English.

She would also like to see more women joining the field of AI. At AI Singapore, she was given the opportunity to present to visiting guests and to share her experiences with female students through community involvement projects. She has also spoken with women who reached out to her on LinkedIn to find out more about AIAP and is a member of Women Who Code and Coding Girls. These are online communities where women share their experiences and give each other tips on conducting presentations, carrying out AI conversations and surviving in the AI world.

For women who are keen to explore the field of AI, Fiona encourages them to give it a try, and not to be afraid to seek help when they come across difficulties. “There are plenty of people out there who are very willing to share their experiences and help you out. As long as you have the right attitude, never give up learning and always give it your best, everything else will follow.”

Link up with Fiona.


Discovering the Science behind Hyperparameter Tuning

Companies hire large teams of data scientists to manually tune hyperparameter configurations of deep learning models. These hyperparameters control the learning process, and tuning them is extremely tedious and time-consuming because the model has to be trained to find out how each configuration performs.

For Bryan Low, an associate professor at the National University of Singapore’s Department of Computer Science, the burning question is: “Can we transform this process of optimising the hyperparameters of a machine learning (ML) model into a rigorous ‘science’?”

Prof Low is intrigued by this possibility, which will free up data scientists to work on results analysis and other more meaningful tasks. It also dovetails with his wider research vision, which is to enable “learning with less data”.

The quest for answers led him to delve deeper into the area of automated machine learning (AutoML), specifically Bayesian optimisation algorithms, which help simplify and speed up the search for optimal settings by identifying which parameters are dependent on one another.

Tackling the fundamental questions

“Traditionally, it is considered an ‘art’ to tune the hyperparameter configurations of deep learning and ML models such as learning rate, number of layers and number of hidden units, so as to optimise their predictive performance,” explained Prof Low.

To transform this into a science, several fundamental questions had to be tackled. For example: How can Bayesian optimisation be scaled to handle a large number of hyperparameters and large batches of hyperparameter queries? How can auxiliary information be exploited to boost its performance? How can Bayesian optimisation be performed under privacy settings?

In seeking answers to these questions, one of the interesting things that Prof Low uncovered was that AutoML/Bayesian optimisation tools can have many applications beyond the hyperparameter optimisation of ML models.

“There are many complex ‘black-box’ problems to which Bayesian optimisation can be applied, to reduce the number of costly trials/experiments needed to find an optimal solution,” he noted. Examples included optimising properties in material design or battery design, optimising the environmental conditions for maximising crop yield, the performance of adversarial ML, and single- and multi-agent reinforcement learning.
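To make the idea of "fewer costly trials" concrete, here is a minimal sketch of a Bayesian optimisation loop in plain Python. It is not Prof Low's method: it assumes a single hyperparameter (the base-10 log of the learning rate), a Gaussian process surrogate with a squared-exponential kernel, and an upper-confidence-bound rule for choosing the next trial. The function `validation_score` is a hypothetical stand-in for "train the model and measure its performance".

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))

def validation_score(log_lr):
    """Hypothetical stand-in for an expensive training run (higher is better)."""
    return -(log_lr + 2.5) ** 2

def bayes_opt(objective, candidates, n_trials=8, kappa=2.0):
    """Spend n_trials evaluations, each chosen by an upper-confidence-bound rule."""
    xs = [candidates[0], candidates[-1]]      # two initial probes at the ends
    ys = [objective(x) for x in xs]
    for _ in range(n_trials - len(xs)):
        def ucb(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean + kappa * std         # favour promising or unexplored points
        untried = [c for c in candidates if c not in xs]
        x_next = max(untried, key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], len(xs)

# 25 candidate values of log10(learning rate), from 1e-6 to 1
candidates = [-6 + 0.25 * i for i in range(25)]
best_log_lr, best_score, n_evals = bayes_opt(validation_score, candidates)
```

In this toy setting the loop evaluates only 8 of the 25 candidates, whereas a grid search would train the model for every one of them. The surrogate model decides where the next costly trial is most likely to pay off.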

Multi-party machine learning

More recently, Prof Low has embarked on another line of research to achieve his vision of “learning with less data”. He is working on multi-party machine learning, where a party with some data tries to improve its ML model by collaborating with other parties that also hold data.

There are two key challenges involved in this. The first lies in having to combine heterogeneous black-box models without any knowledge of their internal architecture and local data, in order to create a single predictive model that is more accurate than its constituent models.

One way to address this is to find a common language to unite the disparate models. This paves the way for the creation of a surrogate model from the different machine learning models, and has the potential to elevate machine learning to another level by combining multiple models to harness their collective intelligence.

The second challenge lies in trusted data sharing and data valuation, where Prof Low and his research team ask questions such as: “How can multiple parties be incentivised to share their data? How do we value their data?”

In this pioneering work, Prof Low has introduced a novel and intuitive perspective – a party that contributes more valuable data will receive a more valuable model in return (instead of monetary reward). To achieve this, formally-defined incentives such as fairness and stability have been adapted from cooperative game theory to encourage collaboration in machine learning.
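One classic fairness notion from cooperative game theory is the Shapley value, which credits each party with its average marginal contribution over every order in which the parties could join. The article does not say which notion Prof Low's team uses, so the sketch below is only an illustration of the general idea. The three parties and the accuracy figures in `ACCURACY` are made up.

```python
from itertools import permutations

def shapley_values(parties, value):
    """Average each party's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in parties}
    orders = list(permutations(parties))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical accuracy of a model trained on each coalition's pooled data
ACCURACY = {
    frozenset(): 0.0,
    frozenset("A"): 0.60, frozenset("B"): 0.50, frozenset("C"): 0.50,
    frozenset("AB"): 0.80, frozenset("AC"): 0.80, frozenset("BC"): 0.60,
    frozenset("ABC"): 0.90,
}

values = shapley_values(["A", "B", "C"], ACCURACY.__getitem__)
for party, share in sorted(values.items()):
    print(party, round(share, 3))
```

Here party A, whose data adds the most to every coalition, receives the largest share, and the three shares add up to the accuracy of the model trained on everyone's data. A scheme of the kind described above could then use such a valuation to decide how good a model each party receives in return.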

His research journey

For Prof Low, research can be described as a hobby – one that he has been pursuing for nearly two decades. During his final year as an undergraduate, it even replaced gaming as something that he would “naturally indulge in”, and he has not looked back since.

The field of AI/ML has likewise powered on. Prof Low remembers that when he first presented at the AAAI (Association for the Advancement of Artificial Intelligence) conference back in 2004, there were only 453 papers submitted for review. This year, there were 7,737.

Indeed, as his passion for research continues to burn, his chosen field of AI/ML has gone “from cold to scorching hot”.

Imbuing ML with Human-like Intelligence

How can a financial fraud detection model trained in one country be applied in another? How does mastery of C++ lead to rapid mastery of Java and C#?

Associate Professor Sinno Jialin Pan from Nanyang Technological University cited the first as an application of transfer learning, and the second as an analogy to explain how he would like to bring it forward.

Prof Pan believes that a machine can be said to be intelligent only if it has the ability to transfer learning. This is because the ability to learn and transfer skills or knowledge to a new situation or context is a particularly strong aspect of human intelligence.

He first heard of the term “transfer learning” in 2006 as a PhD student working on a Wi-Fi-based indoor localisation system using machine learning (ML) techniques. It referred to an ML paradigm motivated by human beings’ ability to transfer learning.

Guided by intuition

Intuition told Prof Pan that transfer learning could hold the answer to the Wi-Fi localisation problem that he was working on. When doing experiments, he found that the distributions of Wi-Fi signals changed over time due to the dynamic environment and the use of different mobile devices. To ensure that a localisation system performs accurately, he had to figure out how to adapt a machine-learning-based model to the changing environment and different types of mobile devices.

Prof Pan set out to develop general transfer learning methodologies that would give machines the ability to learn by transferring knowledge across different tasks automatically. Unlike heuristic transfer learning methods which are designed for specific applications (such as image classification, sentiment classification, etc.), general transfer learning methodologies require two fundamental research issues to be addressed. They are: How to automatically measure the “distance” between any pair of domains or tasks, and how to design learning objectives based on domain/task-invariant information derived from the measurable distance.

Through his research, he found that kernel embedding of distributions was ideal for measuring the distance between domains or tasks. Based on this non-parametric technique, he developed several transfer learning methods to train a model on domain/task-invariant information and build a bridge between different domains/tasks for knowledge transfer.
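A common distance built on kernel embeddings of distributions is the maximum mean discrepancy (MMD), which compares the average kernel similarity within each sample to the similarity across the two samples. The sketch below assumes that this is the kind of measure meant here, and it is not Prof Pan's implementation. The Gaussian kernel, its bandwidth and the synthetic two-dimensional "domains" are illustrative choices.

```python
import math
import random

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def mean_kernel(xs, ys):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(rbf_kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_squared(source, target):
    """Biased estimate of the squared MMD between two samples."""
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2 * mean_kernel(source, target))

rng = random.Random(0)

def sample(centre, n=200):
    """Synthetic 2-D 'domain': Gaussian features around a given centre."""
    return [(rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)) for _ in range(n)]

source = sample(0.0)    # e.g. signal features collected with one device
similar = sample(0.1)   # a slightly different device
shifted = sample(2.0)   # a very different environment

mmd_similar = mmd_squared(source, similar)
mmd_shifted = mmd_squared(source, shifted)
```

The estimate is close to zero for the two similar domains and clearly larger for the shifted one. No parametric form is assumed for either distribution, which is why the technique is described as non-parametric.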

Transfer learning in fraud detection

One of the many potential applications of this was in fraud detection. Prof Pan noted that ML techniques have been widely used to capture patterns in customers’ behaviours and build fraud detection models based on historical data. However, as behaviours are region-dependent, a fraud detection model trained with historical data from one region or country may fail to make accurate detections in another region or country.

At the same time, it requires a lot of historical data to train an accurate fraud detection model, and this may not be available in, for example, a new market. In this case, transfer learning is a promising technique to help adapt a well-trained fraud detection model to new regions or countries with only limited historical data.

But Prof Pan is still not satisfied. “Though many promising transfer learning methods have been developed, most existing methods fail to accumulate knowledge when performing transfer learning,” he said. In other words, for each specific transfer learning task involving a specific pair of domains or tasks, the transfer learning procedure has to be run from scratch.

Reuse of knowledge

What Prof Pan is now embarking on is an attempt to develop a continual transfer learning framework, where the machine gets “smarter” and “smarter” after solving more and more transfer learning tasks. He likens this to a computer science student spending six months to master the C++ programming language. After that, when he/she wants to learn the Java programming language, he/she may need less than three months to master it.

If the student further wants to learn the C# programming language, he/she may only need to spend days to master it. “The reason behind this is that with the transfer of learning, the student’s understanding or knowledge of object-oriented languages becomes deeper after he/she learns Java, which also helps him/her to learn C# faster,” he explained.  

To translate this learning behaviour into transfer learning algorithms, the knowledge needs to be distilled and accumulated after performing each transfer learning task. A key research issue is how to represent knowledge in a more compact form after the “learning”, so that it can be refined and reused in the next transfer learning task. “In this way, knowledge can be accumulated, which makes machines’ transfer learning ability more powerful,” he said.

Understanding the Behaviour of Learning Algorithms in Zero-sum Games

In economics and game theory, zero-sum games are settings for perfect competition where the gain of one player is exactly equal to the loss of the other. But what happens in environments where many intelligent agents – human or artificial – interact with one another? Do these systems attain a state of equilibrium or do they become chaotic? And what are the conditions that influence these outcomes?

These are some of the questions that Georgios Piliouras, assistant professor of Engineering Systems and Design, Singapore University of Technology and Design (SUTD), is trying to answer through his work on multi-agent reinforcement learning in games.

For Prof Piliouras, the focus on this research area stems from his fascination with how complex phenomena emerge from simple components, such as neurons coming together to form the brain, an ant colony self-organising and building complex structures, or how the global economy works.

“In every one of these cases we can create pretty reasonable models of the behaviour of the individual constituents of these networks,” he noted. “But when we scale them up, the global emergent behaviour can, in many cases, be unexpected.” 

Unexpected chaos

Prof Piliouras’s objective is to create a robust and scalable theory of how learning algorithms behave in general decentralised environments. One of the standard classes of these environments is the zero-sum game which lies at the core of many recent artificial intelligence (AI) architectures.

An example is Generative Adversarial Networks, where two neural networks compete against each other. One of them, the Generator, tries to create realistic looking images whereas the other one, the Discriminator, tries to predict whether the images presented are real world images or synthetic ones. “By having the networks compete against each other, we can create AI that produces very realistic looking images,” explained Prof Piliouras.

The same mathematical concept lies at the core of AlphaGo/AlphaZero, the AI systems produced by DeepMind, which learned to master the game of Go through self-play.

However, Prof Piliouras’s research found that many standard learning dynamics, such as gradient descent (an optimisation algorithm that is used to update the parameters of a machine learning model), are unstable and in fact chaotic in zero-sum games. This suggests that zero-sum games and other similar multi-agent settings can be more complex than standard economic theory suggests.
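The instability can be seen in the simplest possible zero-sum game. The sketch below is a textbook toy example, not taken from Prof Piliouras's papers. It assumes a bilinear payoff f(x, y) = x * y, which one player tries to minimise and the other to maximise, with both taking a gradient step at the same time. The only equilibrium is (0, 0).

```python
import math

def simultaneous_gradient_play(x, y, step=0.1, rounds=100):
    """Both players update at once in the zero-sum game f(x, y) = x * y.

    Player 1 controls x and minimises f; player 2 controls y and maximises f.
    Returns the distance from the equilibrium (0, 0) after every round.
    """
    distances = [math.hypot(x, y)]
    for _ in range(rounds):
        grad_x, grad_y = y, x                  # df/dx = y, df/dy = x
        x, y = x - step * grad_x, y + step * grad_y
        distances.append(math.hypot(x, y))
    return distances

distances = simultaneous_gradient_play(x=1.0, y=1.0)
growth = distances[-1] / distances[0]
```

Each round multiplies the distance from equilibrium by sqrt(1 + step**2), so the players spiral outwards instead of settling down, however small the step size. This toy case shows only divergence. The chaotic behaviour described above concerns richer games and learning dynamics.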

New multi-agent AI architectures

To improve the performance of self-learning systems, Prof Piliouras is working to create learning algorithms that behave predictably and converge to equilibrium instead of behaving chaotically.

To date, he has co-authored several joint papers with researchers from DeepMind to leverage these ideas and create new multi-agent AI architectures.

His research group also published five papers in the Conference on Neural Information Processing Systems (NeurIPS) in 2019, with two of them receiving spotlight awards. The same year, the team received a best paper award nomination at the International Conference on Autonomous Agents and Multiagent Systems, which is the premier conference on multi-agent systems.

“Publishing in these top ML conferences provides a great opportunity for communicating our ideas to a wide audience and getting some valuable feedback,” said Prof Piliouras, who plans to keep probing deeper into the structure of multi-agent reinforcement learning in games.

“There are a lot of challenges and questions that we still do not quite understand especially when we have a large number of users and complex action spaces,” he said. “There is definitely a lot of exciting work to be done both on the theoretical as well as the experimental front.”  

Prof Piliouras counts himself lucky to have had the opportunity to collaborate with many brilliant researchers around the globe in the course of his work. “My research journey so far has been very rewarding,” he said. “I am happy with the progress we have made already on some of the fundamental questions in the area, and at the same time I am excited about where we are going next.”
