
TagUI in the Enterprise

Empower Employees with Robotic Process Automation

Robotic Process Automation (RPA) refers to a form of business process automation where well-defined sets of human-computer interactions are programmed to be executed by software (known as robots) at high volume and speed. Not only does RPA increase productivity, it also improves job satisfaction by alleviating the need for humans to perform repetitive tasks like pointing-and-clicking on web forms and entering structured data from a source document.

Gartner predicts that 90% of large organizations globally will have adopted RPA in some form by 2022 as they look to digitally empower critical business processes through resilience and scalability, while recalibrating human labor and manual effort.

– Gartner Newsroom, September 2020

Over the last 5 years, large enterprises have fueled the rise of RPA as the fastest growing market segment in enterprise software. If software could be used to dramatically reduce costs and increase productivity, which organisation would not be tempted to have a taste of RPA?

Traditionally, RPA robots are deployed in a centralised manner, where everything is controlled and managed from a central server that orchestrates many RPA robot nodes. These robot nodes can be physical hardware computers or virtual machines, but not individual employees’ laptops. This model has been in place for many years but is now reaching a point of diminishing returns. The industry is in need of a new model.

In this new model, an RPA champion within the organisation takes care of the learning, teaching and guidance on the development and deployment of RPA robots. This champion spreads the usage of RPA across the organisation and is the go-to person for any questions or blockers. More importantly, employees are empowered to create their own RPA mini-robots on their laptops. However, this is difficult to realise if the cost of the software license is high and the RPA tooling is too sophisticated for the average employee. The TagUI free RPA tool by AI Singapore solves this problem.

Here are some of the features TagUI offers.

Scalable across the enterprise

  • It works in an air-gapped environment without internet access.
  • The cost of the TagUI software license is $0 (no hidden catch).
  • Works on Windows, macOS and Linux, so everyone can do RPA.

Unparalleled ease of use

  • Easy learning curve – about 1 day to be productive versus UiPath’s one week.
  • Users can create and deploy RPA robots from MS Word / Excel.
  • Data science users can choose to use a Python version instead (see the short example below).
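For a flavour of what an employee-built mini-robot can look like in the Python version, here is a minimal sketch assuming the open-source RPA for Python package (installed with pip install rpa); the website and element identifiers are illustrative only:

import rpa as r

r.init()                                        # start the TagUI automation engine
r.url('https://www.google.com')                 # open a web page
r.type('//*[@name="q"]', 'tagui rpa[enter]')    # type into the search box and press Enter
r.snap('page', 'results.png')                   # save a screenshot of the results page
r.close()                                       # shut down the engine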

Unique selling points

  • Free support – 24h soft SLA, weekly Zoom 1-to-1, Telegram group.
  • Modifiable – being open source, the tool is fully customisable.
  • Uses the more popular JavaScript / Python instead of UiPath’s .NET.
  • Plays extremely well with Microsoft Power Automate Desktop.

Hear, too, from the community of active users.

Interested to know more? Get started with TagUI here

Or check out the latest TagUI webinar below.


(Top image: The TagUI development roadmap)

Lessons From Managing AI Projects

Tips for Successful AI Project Delivery

In a previous article, we learnt about how Artificial Intelligence/Machine Learning (AI/ML) projects are managed at AI Singapore (AISG). In this article, we will go through some of the lessons learnt from managing AI/ML projects in general.

What is expected of a Project Manager?

Typical expectations are: know your stakeholders and manage their expectations, communicate continuously, manage the scope, stay on top of the timeline, control the project cost, foresee all the pitfalls, be prepared to face off, etc. In my view, what is equally important is what the project manager brings to the table in terms of lessons learnt from past projects. This is where seasoned project managers stand out from the rest, but even they are not immune to failing to foresee certain pitfalls, particularly in less familiar territory like managing AI/ML projects.


Managing AI/ML projects is not an art but a science. Due to the dynamic nature of AI/ML projects, all the pitfalls a project manager faces while managing traditional software engineering projects, such as scope creep, missed timelines, oversights in risk management and misaligned objectives, are amplified here.

So how different are machine learning projects from software engineering projects? In traditional programming, a set of rules/business logic is built, and the software applies that logic to the input data to produce an output. In machine learning, on the other hand, the ML algorithm automatically formulates the rules from the data.


From a project management perspective, software engineering projects are far better understood and mostly well documented, whereas AI/ML projects are still evolving and experimental. The differences could derail an AI/ML project to a large extent if it were managed the same way as a software engineering project. Some of these differences are subtle enough that a mistake might not be apparent to the project manager until it occurs. Let us look at the potential pitfalls of managing AI/ML projects in the areas of stakeholders’ expectations, data and project timeline.

STAKEHOLDERS’ EXPECTATIONS – At the starting point of an AI/ML project, it is very important that all the stakeholders are aligned and are on the same page in terms of understanding the project objectives and the expected outcomes from the project. Some of the important questions to ask at this stage are:

  1. Do the stakeholders understand the very nature of Machine Learning (or Deep Learning) processes?
  2. Do the stakeholders have the same understanding of the definitions of ML model performance metrics like Accuracy, Precision, Recall, F1 Score, Mean Average Precision, etc., or do they know which metrics serve them and their businesses better?
  3. Are the project acceptance criteria or project success metrics clearly defined with no ambiguity, i.e. what is expected from the AI/ML project?

For successful completion and implementation of any AI/ML project, the answer to all the above questions should be yes. We need to have a good understanding of the business case and ask the stakeholders the right questions to:

  1. Choose the appropriate ‘target’ dataset (list of features).
  2. Choose the suitable performance metrics (Accuracy, Precision, etc.).
  3. Be aware of and decide on the trade-offs between available choices.

For project acceptance, the project success metrics can be business-centric, e.g. improved productivity of X hours, but one should also state the expected performance of the ML model and the performance metrics to adopt, e.g. Y% Recall. To better manage the stakeholders’ expectations, it is suggested that the model performance be expressed as a range, e.g. (Y1 to Y2)% Recall. It is also suggested to qualify that, although a suitable model architecture plays a big part in the model performance, the performance ultimately depends on the data, in terms of both quantity and quality.
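Since these metrics are easy to conflate, here is a quick illustration of how they are computed for a binary problem (using scikit-learn purely as an example, with toy values):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground truth and predictions for a binary problem (1 = positive class)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))   # correct predictions over all predictions
print("Precision:", precision_score(y_true, y_pred))  # of the predicted positives, how many are truly positive
print("Recall   :", recall_score(y_true, y_pred))     # of the true positives, how many were found
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall

In a defect or fraud detection setting, for instance, Recall is often the metric that matters most to the business, which is why agreeing on the metric up front is so important.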

One may want to consider the following additional non-functional acceptance criteria, along with the performance criteria:

  • Model training and inference times.
  • How frequently the ML model is to be trained or used – this helps indicate how crucial the model training and inference times are.

DATA, DATA AND DATA – Data is key, in terms of its availability, quantity and quality for the training and evaluation of the ML model, and it is central to the success of AI/ML projects. All ML algorithms depend on a sizable set of data to train the model and the accuracy of the model largely depends on the quality of the training data.

Data Readiness – It is imperative that the required size and quality of data is available at the onset of an AI/ML project. This enables critical data-related activities, such as data checking and exploratory data analysis, to be performed according to the project timeline. It is from such activities that data quantity and/or data quality issues, if any, are discovered. Early discovery of data issues allows remedial actions to be taken to keep the project moving forward. To better pre-empt data issues, it is recommended to obtain some samples of the data prior to the commencement of the project to perform sanity checks.

Data Volume – A large dataset, in the magnitude of thousands of data points, if not more, is needed in most AI/ML projects. A shortfall in the required data volume is likely to result in poor model performance and, consequently, unsatisfied customers. Training models on large datasets, especially deep learning models, requires reasonably high-end compute and storage resources, so one has to have the required GPUs and gigabytes of storage space up and ready.

In the case where the large dataset is extracted from multiple data sources and/or comes in various data formats, one also has to be prepared for a potential increase in the data extraction, transformation and ingestion effort, as extracting data from different sources requires the data pipeline to set up many more connections and different interfaces (e.g. web crawling, APIs), which takes time and effort. This has cost and schedule implications for the project.

Data Quality – Data quality issues encompass many different aspects, such as invalid data formats (e.g. wrong datetime format), incomplete data (e.g. missing values), noisy data (e.g. outliers), imbalanced data (i.e. the distribution of examples across the known classes is biased or skewed), unlabeled data or missing annotations, unscaled data, etc. Poor data quality, if not treated, will impact the ML model performance regardless of the model architecture employed. Having said that, one has to be aware of the additional time and effort needed to treat data (data analysis and visualization, data cleaning, data imputation, etc.) and cater for it in the project timeline.
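For tabular data, a few quick checks can surface several of these issues early. The sketch below assumes a pandas DataFrame loaded from a hypothetical CSV file with a hypothetical label column:

import pandas as pd

df = pd.read_csv("training_data.csv")   # hypothetical dataset

print(df.isna().mean().sort_values(ascending=False))   # fraction of missing values per column
print(df["label"].value_counts(normalize=True))        # class distribution, to spot imbalance
print(df.describe())                                   # ranges and spreads, to spot outliers or unscaled features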

PROJECT TIMELINE – Failing to plan is planning to fail. It is understandable that stakeholders expect a short implementation timeline for projects, but it is important to note that AI/ML projects may require a special degree of patience and flexibility. First and foremost, during the initial stage of a project, if no minimum set of (adequately annotated) training data is available and it cannot be made available soon, the project is bound to be delayed. Without the training data, work related to exploratory data analysis, feature engineering and ML model training will be impeded. In such a situation, it is better to postpone the project.

Even if the training data is sufficient and ready, a lot of experimentation must be carried out in an AI/ML project before a suitable algorithm is settled on, so enough time must be allocated in the project timeline for the various experiments. Also, progress could well appear to be slow initially, so one must constantly educate and update the stakeholders on the incremental work completed, to assure them that the project is on track to meet its objectives according to the timeline.

It is also important to note that, during the acceptance stage of a project, if the stakeholders are not aligned with the project outcomes and, while testing the ML model, they start to develop unwarranted assumptions and expectations about what the model can or cannot deliver, the testing and acceptance process will drag on and the timeline will be impacted. This ties back to stakeholders’ expectations and can be mitigated if those expectations are properly managed.

IN CONCLUSION, this article aims to share and highlight the pitfalls of managing AI/ML projects in the areas of stakeholders’ expectations, data and project timeline. Knowing these pitfalls and being conscious of them goes a long way in helping a project manager manage and deliver AI/ML projects successfully.




The AI Project Management Series

Introduction to Computer Vision (CV) Hub

Since its establishment in 2017, AI Singapore (AISG) has embarked on numerous Computer Vision projects within the 100E programme, which helps organisations use AI to solve their business problems. One such collaboration was with IBM, in which AISG delivered a system to help their product engineers better predict the return rates of different hardware products (check out the article, video interview and podcast to know more).

With each successful completion of a project, the engineering team’s experience and expertise with Computer Vision (CV) solutions have grown. It was realised that many solutions had common components which could be shared. Thus, the CV Hub was established to develop a customisable and user-centric AI Engine that could speed up the development work in future projects.

After a year of development, Version 1.0 of the AI Engine (Figure 1) has the ability to train models, perform inference, and evaluate models. The engine has been tested and its performance benchmarked in different edge and cloud deployment scenarios.

Figure 1: Current AI Engine (Version 1.0) for computer vision

One example of such a deployment was the system built in collaboration with HP Inc to assist with social distancing within their manufacturing facility. Another example is the mobile app currently in development to assist with tracking exercises such as sit-ups and push-ups. The team is actively working to add more models and use cases to expand the engine’s capabilities. With the successful deployment of the engine for real-world use cases, the team is now planning to open-source some components of the engine by the end of 2021.

In addition, owing to the ongoing COVID-19 pandemic, there has been a surge in interest in home-based fitness. To help meet the demand, the CV Hub team has extended the pose estimation use case within the engine to provide real-time feedback for any motor-skills-based activity. This use case has been packaged into an application known as SensAI, a play on the word Sensei, which means teacher in Japanese.

The video below offers a sneak peek into how users can interact with the application while exercising. This application is scheduled for launch in Q4 2021.

As the demand for deployable Computer Vision solutions increases in the near future, the CV Hub team at AI Singapore is prepared to assist and collaborate with companies to solve real-world problems. You can get in touch with us via email at cvhub@aisingapore.org.

 

Data Versioning for CD4ML – Part 2

(This post is the second of a two-part series touching on a component of CD4ML – data versioning – and how it can be incorporated into an end-to-end CI/CD workflow. This second part is a deeper and more technical dive into the topic, giving the audience a better understanding of how the tool can be applied across the different aspects of a workflow, while the first part presented a high-level overview of the topic, ending off with an introduction to DVC.)

Table of Contents

  1. Preface
  2. Prerequisites & Tools: The required services & tools to get our hands dirty.
  3. Problem Statement & Overview of Workflows: What are we working towards and why bother?
  4. Practical Section 1 – Beginning Continuous Integration: Automating CI workflows through GitHub Actions.
  5. Practical Section 2 – Automated Testing & Versioning with DVC: Testing code on GitHub Actions and using DVC beyond versioning of data.
  6. Practical Section 3 – Reviewing Model Experiments: Track model experiments with DVC and review through GitHub.
  7. Practical Section 4 (Optional) – Packaging & Serving: Packaging the trained models, for serving.
  8. Next Steps: Where to go from here.

Preface

In the previous post, we touched on the deliberations one can make when it comes to versioning data, as well as an introduction to DVC. In this post, we will explore more of DVC’s functions while diving deeper into some aspects of CD4ML, particularly data & pipeline versioning and continuous integration. With that said, this whole post will be much more technical than the first one. Naturally, we will be using different tools along the way to help us facilitate the intended workflows.

IMPORTANT NOTE: While this post does its best to be as beginner-friendly as possible, it does assume some basic technical proficiency of the reader.

Back to top

Prerequisites & Tools

This section covers the tools, service accounts and basic knowledge required to follow along with the post.

Some tools are to be installed on our local machine, and a few accounts are needed for services that facilitate some of the workflows. Even though we will be training a deep learning model later on, the local machine need not have a GPU, but it should have decent computing power.

  1. Git

We will be leveraging a version control system, and the choice for this walkthrough is Git. If you went through the whole of Part 1, Git would already be installed, as DVC by default leverages Git for versioning its metafiles. This post/walkthrough also assumes that one is comfortable with the basics of Git. Otherwise, the basics can be picked up through resources like this or this.

  2. GitHub

GitHub is a platform for hosting source code to be versioned; it will be used for several purposes in this walkthrough:
– as a remote for the code repository that we will be creating
– GitHub Actions for automating some of the workflows that will be introduced along the way
Do create an account if you have yet to do so.

  3. Weights & Biases

As we work towards a working predictive model, we will be using an experiment tracking tool, and that’s where Weights & Biases comes in. One can create a free account, download the relevant CLI/library and get started right away with access to a dashboard. Once signed up, do take note of the W&B API key for your account under ‘Settings’.

  4. DVC (CLI and Remote Storage)

CLI
Of course, we will be having DVC in the spotlight, as indicated by the name of the post. Installation of DVC on your local machine is required. This post assumes the installation of the tool through the following methods, as opposed to through pip or conda:
– Installation for MacOS
– Installation for Windows
– Installation for Linux
One can still install the tool through pip or conda but do note that the YAML file for creating our development environment later will not include it and you would have to install it yourself.

Remote Storage for DVC
In this walkthrough, artefacts and cache created locally are to be pushed to a remote storage and we will be accessing them across differing machines or different directories within the same machine. You are to set up your own remote storage and have the credentials for access on standby. While the walkthrough uses Azure Blob storage as an example, one can use any remote storage that’s desired, as long as DVC supports it. Below are some suggestions for which you can get started with:

Azure Blob Storage – If you have not signed up for an Azure account before, you can get complimentary $200 credit, which means you can gain instant access to a multitude of Azure services. You would still need a credit card on sign-up, but this walkthrough would not incur much cost, if at all. We highly recommend this service as it is one of the easiest to get started with. This link will walk you through how to set up a container for your blobs, after which you can obtain the connection string from the ‘Access keys’ tab on your Azure portal.

Google Drive – If you do not have access to a credit card, this is the next best alternative, as all you need is a Google account and you are automatically provided with 15GB of storage. However, this method is the most time-consuming: using Google Drive as a remote storage for DVC gives a suboptimal experience, i.e. you will spend more time following along as DVC pushes and pulls will be much slower. DVC’s documentation has a comprehensive walkthrough on how to get your Google Drive all set up as a remote storage.

  5. Docker and Docker Hub account

Through GitHub Actions, we will be automating builds of a container (Docker) image as well as running workflows using this containerised environment. Installing Docker on your local machine is optional; however, it would be a more holistic experience if you can try out building images on your own machine. A Docker Hub account is required to store our Docker images online; we will be pushing and pulling images from our account’s Docker image repositories.

If you are unfamiliar with Docker, you can refer to the following resources on the basics:
– What is a Container?: https://www.docker.com/resources/what-container
– Docker for Beginners: https://docker-curriculum.com/
– Docker Simplified – A Hands-On Guide for Absolute Beginners: https://www.freecodecamp.org/news/docker-simplified-96639a35ff36/

  6. Anaconda

If you have worked with Python’s virtual environments before, Anaconda is just an alternative that also gives you access to Anaconda packages/libraries instead of just those from PyPI. For those who have never delved into virtual environments before, you should. For those finding Anaconda too hefty in its footprint, you can install the lightweight alternative Miniconda instead. We will be using conda environments for this walkthrough.

Back to top

Problem Statement & Overview of Workflows

Before we dive right into the tools and workflows, let us lay out a problem statement relevant to the topic of the day. This will give us some direction as we explore the different concepts.

What is the problem?

Good question. Well, let’s say we are tasked with developing a deep learning model that is able to conduct binary sentiment classification on input texts. If you have accumulated plenty of experience building such models, you would know that this is not too hard a task to get started on (performance of the model aside). Here’s an example of a tutorial that basically lets you do that (in fact, the code for the model training aspect later is derived from this tutorial). However, when we consider an environment in which we have to work with other individuals, development workflows contained mostly within Jupyter or Colab notebooks might not translate well into collaboration. Deliberating on this raises several considerations, which include the following:

  • Modularisation

The utility of Jupyter notebooks is certainly not to be dismissed; they allow for quick experimentation and iteration. However, functions or control structures written within notebooks would be made much more useful if they were modularised and encapsulated within a proper project structure (a small sketch follows after this overview). Reasons for this include how cumbersome it can be to version-control notebooks, as well as how easily they introduce bad habits to programming newbies. Notebooks do have their merits (lower barrier of entry for experimentation, persisted outputs, cool plugins), but one should be wary of dead code and how easily mess accumulates within such environments.

  • Continuous Integration (CI)

In a project with many moving parts worked on by different people, a workflow with seamless feedback loops and automated testing is important. There should be no blind assumption that any code or refactoring work committed and merged into the mainline repository will work as expected, or that it follows good conventions. For example, how can we ensure that the functions written for text processing would still work the same should we change the regex expressions being used? Unit and integration testing are mechanisms we can utilise for automatic provision of feedback. We will also take a look into automating the builds of the development environment needed for running these tests and checks.

  • Versioning

Versioning of source code is a relatively straightforward matter. It’s hard to say the same for data and model artefacts. As mentioned in the first part of this series, we cannot use Git for versioning our dataset and model files, so that’s where DVC and remote storage come in. However, beyond just versioning file artefacts, there lies great utility in versioning the pipelines that have these artefacts as dependencies or outputs.

In summary, we will be churning out binary sentiment classification models using modularised pipelines whose dependencies and outputs are tracked. Some workflows will be automated through a CI tool.
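To make the modularisation point concrete, here is a minimal, hypothetical example (the module and function names are illustrative, not the actual project code) of lifting a text-cleaning step out of a notebook cell into a module that can be imported and unit-tested:

# src/text_cleaning.py (hypothetical module; in a notebook, this would live in a cell)
import re

def clean_text(text: str) -> str:
    """Lowercase the text, strip HTML tags and collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags such as <br />
    text = re.sub(r"\s+", " ", text)       # collapse repeated whitespace
    return text.strip()

# Any script, notebook or test can now simply do:
# from src.text_cleaning import clean_text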

Outline of Workflows

  1. Beginning Continuous Integration
    – Repository Set Up
    – Create a GitHub Action Workflow
    – Populating GitHub Secrets
    – Automating a Container (Docker) Build
    – Triggering a Workflow Manually
  2. Automated Testing & Versioning with DVC
    – DVC with Remote Storage
    – Adding & Versioning Raw Data
    – DVC Pipelines
    – Adding & Automating Unit Tests
  3. Reviewing Model Experiments
    – Experiment Tracking
    – Reviewing Model Experiments Through Comments
    – Reproducing The Model Training Pipeline
  4. Packaging & Serving

Back to top

– Practical Section 1 –

Beginning Continuous Integration

From here on, we will be getting our hands dirty, so get ready with all the needed accounts/keys/tools (read: ‘Prerequisites & Tools’).

Repository Set Up

We will begin by creating a new repository on our GitHub account and then cloning it to our local machine.

Screenshot of a newly created repository on GitHub.

In the example above, the repository is named dvc-for-cd4ml so it will be referenced as such from hereon. Again, let’s clone the repository to our local machine:

$ git clone https://github.com/<YOUR_GITHUB_USERNAME_HERE>/dvc-for-cd4ml.git
$ cd dvc-for-cd4ml

IMPORTANT NOTE: If you have set up 2FA for your GitHub account, you have to use either a personal access token or SSH key instead of your password when accessing GitHub through your terminal. Please refer to the linked resources on how to create a personal access token or connecting to GitHub with SSH.

GitHub Actions

Viewing your repository on GitHub, you would notice a tab called ‘Actions’:

GitHub Actions tab on the repository.

So what is GitHub Actions? If you have used other CI/CD tools like Jenkins, Travis CI or Circle CI, GitHub Actions is basically GitHub’s alternative offering, with very minimal set-up – you have access to it from the get-go, and it is free to use for the most part, especially if your project is open-sourced. If you have never used CI/CD tools before: they let you automatically execute workflows following certain events, like pushes to a Git repository or manual triggers. With such capabilities, you can seamlessly test, conduct automated checks, integrate and even deploy your code. While the initial learning curve might seem daunting, the payoff can be huge (or not, depending on execution), and hopefully its benefits will be more observable once you have gone through this technical walkthrough. With that said, let’s go ahead and create our first GitHub Action workflow.

Additional Reference(s):
– GitHub Actions’ Documentation: https://docs.github.com/en/free-pro-team@latest/actions
– https://martinfowler.com/articles/continuousIntegration.html

Creating a GitHub Action Workflow

Now, copy the files linked below into the repository within their own prefixed subdirectories. (If a file is prefixed with .github/workflows/, make sure that the directory .github/workflows/ exists beforehand.)

You can either manually create the files, copy-pasting the contents from the links above, or you can run the following commands (using curl) to instantly create the files:

# NOTE: Run these commands from the repository's root location
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/.github/workflows/build-environment.yaml --create-dirs -o .github/workflows/build-environment.yaml
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/.github/workflows/retag-docker-image.yaml --create-dirs -o .github/workflows/retag-docker-image.yaml
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/scripts/check-file-matches.sh --create-dirs -o scripts/check-file-matches.sh

Now, add (track) and commit the files:

$ git add .github scripts
$ git commit -m "First commit; add GA workflows and script for building environment."
$ git push origin master

After pushing the files, you would observe that the ‘Actions’ tab is updated with a couple of workflows ('Change Latest Docker Image' and 'Build Environment'):

GitHub Actions tab updated with workflows.

IMPORTANT NOTE: Understanding the basic overview, concepts and terminologies of GitHub Actions would be advantageous before we do a deep dive into the workflows below.

Let’s go through each of the files we have pushed:

  • .github/workflows/build-environment.yaml

Click here for a detailed rundown of the file.

  • scripts/check-file-matches.sh

Click here for a detailed rundown of the file.

  • .github/workflows/retag-docker-image.yaml

Click here for a detailed rundown of the file.

Back to top

Populating GitHub Secrets

In the workflows shared above, before being able to push the Docker builds we have to log in to our Docker Hub account first, using credentials stored on GitHub in an encrypted manner. We are to store our Docker Hub username and password in the GitHub repository under the ‘Settings’ tab, in the ‘Secrets’ section of the sidebar (reference link). The credentials to be stored are to be named as such:

  • DOCKERHUB_USER: Docker Hub account username.
  • DOCKERHUB_PW: Docker Hub password.
The repository’s secrets page after adding the credentials.

Automating a Container (Docker) Build

Now that we have the first set of credentials configured, let us work towards executing the workflows above. Before we add the Dockerfile and Conda environment YAML file to the repository, let us create a new branch (build-env) which would eventually be merged into the master branch through a pull request. We then push this newly created branch to the remote.

$ git checkout -b build-env
$ git push -u origin build-env

Create the files linked below, add, commit, and then push to remote:

$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/docker/dvc-for-cd4ml.Dockerfile --create-dirs -o docker/dvc-for-cd4ml.Dockerfile
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/dvc-for-cd4ml-conda.yaml --create-dirs -o dvc-for-cd4ml-conda.yaml
$ git add docker dvc-for-cd4ml-conda.yaml
$ git commit -m "Add Dockerfile and Conda YAML file for building the custom dev environment."
$ git push origin build-env

Let’s go through each of the files we have pushed:

  • docker/dvc-for-cd4ml.Dockerfile

Click here for a detailed rundown of the file.

  • dvc-for-cd4ml-conda.yaml

Click here for a detailed rundown of the file.

After the push above, your GitHub ‘Actions’ tab would look something like this (if you see an orange circle icon, wait for a few minutes):

GitHub Actions tab after first push to build-env.

You can inspect the workflow run corresponding to the latest commit message above and you would see that all the steps defined in .github/workflows/build-environment.yaml have been executed (as indicated by the circled check icons).

Detailed view of executed job steps for the Build Environment workflow.

The relevant commit SHA for this workflow run is indicated by a dotted box in the image above. If you were to inspect the ‘dvc-for-cd4ml’ repository in your Docker Hub account, you would see something like the following:

List of repositories in a Docker Hub namespace (account).
‘dvc-for-cd4ml’ repository on Docker Hub.

Triggering a Workflow Manually

In this subsection, we will take a look at how we can trigger a workflow on the ‘Actions’ tab through a button, where an input can be provided as well. To enable this feature, one has to define a workflow_dispatch event in a workflow file (see the explanation for .github/workflows/retag-docker-image.yaml).

You would notice that the ‘dvc-for-cd4ml’ Docker image on Docker Hub has a tag that is the same as the commit SHA indicated on the GitHub Action workflow run page (in this example it’s ‘2afcd31’). Taking note of this commit SHA/tag, head back to the GitHub repository’s ‘Actions’ tab and inspect the Change Latest Docker Image workflow.

Change Latest Docker Image workflow on repository’s GitHub Actions tab.

With reference to the dotted red box in the image above, click on the grey ‘Run workflow’ button. In the popup, paste the commit SHA obtained from the previous workflow into the value box and click on the green ‘Run workflow’ button. The workflow would run (initially accompanied with a circular orange icon) and after a few minutes, you should see a green check mark.

Change Latest Docker Image workflow successfully ran.

Check back on the ‘dvc-for-cd4ml’ repository on Docker Hub and you will observe that this time, there exists another tag called 'latest'.

‘dvc-for-cd4ml’ repository on Docker Hub with the 'latest' tag.

Now that we have a Docker image with the 'latest' tag, we can use it for a workflow in the subsequent section.

Now one might wonder: why is this workflow manually triggered? One reason is that we might not want every recent image that has been automatically built to be tagged as 'latest' without further verification/testing. Exposing broken images to other workflows can cause bottlenecks and even a false sense of security. Verification that a Docker image with a specific tag is good to go can be done on one’s local machine or in separate isolated workflows. Do note that this workflow you are being exposed to might not be an optimal one, but alas, this is an introductory walkthrough.

Before moving on to our next set of workflows, we would have to merge the build-env branch into the master branch. First navigate to the ‘Pull requests’ tab and then click on the green ‘New pull request’ button. Comparing the build-env branch with the master branch as the base, click on the green ‘Create pull request’ in the ‘Comparing changes’ page.

The ‘Comparing changes’ page to create a pull request on the GitHub repo.

In the following page, you can give the pull request a customised name and leave a comment providing more details to the reviewer as to what the request entails.

Page for detailing the pull request to be created.

A common use case for pull requests is for a developer to merge some work (a feature branch, say) that they have done into a mainline branch (like master). This work is to be reviewed by authorised individual(s); the pull request would either be approved, returned with comments for further revisions, rejected or ignored completely. The following page is what would be observed in a clean pull request where there are no conflicts and the changes can be merged:

Pull request page, build-env ready to be merged into master.

Oh, these wonderful green checks. Anyway, since you are the sole developer/reviewer of this repository, you can go ahead and click the button to merge the pull request and then confirm it. We will role-play as a reviewer this time round. (Also assuming the reviewer does not mind the typo ‘condo’ in the pull request title.)

View after the build-env branch has been merged into master.

You can see that once you’ve merged the branch, you have the option to delete that said branch. Since the changes in the build-env branch have been merged into master, it is okay to delete it. For now, we can just leave it; you can delete it later if you want. Now, on our local machine’s terminal, let us retrieve the changes merged into the remote master branch into our local master branch.

$ git checkout master
$ git pull origin master

Now, you have covered the complete workflow of making changes to the custom containerised development environment through automated builds. Let’s move on to the next section where we do more with automated testing as well as versioning with DVC.

Back to top

– Practical Section 2 –

Automated Testing & Versioning with DVC

The previous section focused on automated builds for containerised environments, providing some of you a first look into CI workflows. There was little work done on the local machine, but we will see more of that in this section, where data preparation and versioning of such pipelines are carried out. Also, there will be less hand-holding from this section onwards, so we will move faster with fewer step-by-step pointers.

First, let us create a Conda environment that will contain the libraries needed:

$ conda env create -f dvc-for-cd4ml-conda.yaml -n dvc-for-cd4ml
$ conda activate dvc-for-cd4ml

We also have to ensure that we have the following:

  • DVC is installed (within or outside the conda environment); check the installation with the following command: dvc version
  • A remote storage account all set up (in this walkthrough we will use Azure Blob Storage)

DVC with Remote Storage

Create a new branch for this section and push it to the remote repository:

$ git checkout -b test-code-check-exp
$ git push -u origin test-code-check-exp

Now, we initialise this local repository as a DVC project:

$ dvc init
# The `.dvc` folder is automatically tracked by Git after initialisation
$ git commit -m "DVC init."

Set up a blob container on Azure storage (this example assumes a container named dvc-remote) and retrieve the connection string through the ‘Access keys’ tab on the left-hand panel.

A storage account page on the Azure Portal, where the access keys can be retrieved.

Now, add the remote storage to our local DVC project:

$ dvc remote add -d azremote azure://dvc-remote/dvc-for-cd4ml
$ dvc remote modify --local azremote connection_string '<PASTE YOUR CONNECTION STRING HERE>'

IMPORTANT NOTE: Setting up remote storages for DVC differs between different providers. For more information on supported storage types and how to set them up for DVC, refer to this link.

Adding a remote storage for DVC changes the configuration file .dvc/config. This file is tracked so that any user of the repository can gain access to these configurations. However, as shown above, any configuration meant to be used by DVC but kept hidden and untracked by Git can be added with the --local flag. These untracked configurations can be found in the file .dvc/config.local instead. Let’s track the global DVC configuration.

$ git add .dvc/config
$ git commit -m "Add remote storage for DVC."

Adding & Versioning Raw Data

Remote storage for DVC has been configured and now we shall add the raw data. (This raw data is a subset of the Large Movie Review Dataset provided by the Association for Computational Linguistics.)

$ wget https://ryzalkdev.blob.core.windows.net/ryzal-pub-misc/dvc-for-cd4ml-resources/acl-imdb-movie-reviews.zip --directory-prefix=./data/raw/
$ unzip data/raw/acl-imdb-movie-reviews.zip -d data/raw
$ rm data/raw/acl-imdb-movie-reviews.zip
$ ls -la data/raw
# Output for raw data directory
$ tree data/raw -L 2
data/raw
└── acl-imdb-movie-reviews
    ├── test
    └── train

Here, we are going to track the subdirectory data/raw using the dvc add command. What this command basically does is track the data/raw folder, add it to the local DVC cache, create a .dvc file (data/raw.dvc) and add the subdirectory to a .gitignore file. The documentation for the command provides more details on what happens behind the scenes.

$ dvc add data/raw
$ git add data/raw.dvc data/.gitignore
$ git commit -m "Add raw data for DVC to track."

Now that the data has been tracked locally, we have to push it to the remote storage, which acts as long-term storage. Different versions of data pushed to such remote storage can be retrieved by individuals in a team. One might wonder: how does DVC know which version to pull in from the cache or remote storage? It is through the .dvc files, which contain MD5 checksums for verifying the integrity of the data or artefacts in question. Again, DVC’s documentation on the relevance of the .dvc files and directories will provide you with sufficient details.
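As an aside, here is a simplified Python illustration of this kind of checksum-based identification (the general idea only, not DVC's exact hashing scheme, which also handles directories and caching):

import hashlib

def md5_of_file(path: str) -> str:
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Two byte-identical files yield the same digest; change a single byte and the
# digest changes, which is how a checksum pins down an exact version of the data.
print(md5_of_file("data/raw.dvc"))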

$ dvc push data/raw.dvc

After running the command above, you can check the Azure portal’s storage viewer and you would observe that your blob container now has a folder named 'dvc-for-cd4ml' containing many subdirectories. Let’s process the raw dataset now. Create and add the following files:

$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/src/__init__.py --create-dirs -o src/__init__.py
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/src/data_prep.py --create-dirs -o src/data_prep.py
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/params.yaml --create-dirs -o params.yaml
$ git add src/__init__.py src/data_prep.py params.yaml
$ git commit -m "Add Python scripts for data preparation and config (params) file."

For more explanation on the files we have just tracked and committed, click on the expandable sections below:

  • src/__init__.py

Click here for a detailed rundown of the file.

  • src/data_prep.py

Refer to the docstrings contained within each function for further explanation.

  • params.yaml

Click here for a detailed rundown of the file.

DVC Pipelines

In the following step, we will be executing the script to process each of the text files in the directory data/raw/acl-imdb-movie-reviews. However, rather than a simple execution through python src/data_prep.py, we will make use of the dvc run command instead.

# Making sure that the conda environment has been activated
$ conda activate dvc-for-cd4ml
$ dvc run -n data_prep \
    -d src/data_prep.py -d data/raw \
    -o data/processed \
    python src/data_prep.py

Using this method automatically adds and tracks the dependencies and outputs of a stage that is defined by the user. When this repository is eventually cloned and other developers intend to reproduce the pipelines, they can just execute the dvc repro command, which checks the dependencies’ integrity. Should the dependencies deviate, dvc repro will rerun the pipeline(s) and the new outputs resulting from the changed dependencies will be tracked. In the example above, the data/raw folder (as well as the data preparation script itself) is added as a dependency of the data_prep stage. The output of the data_prep stage is the data/processed folder.

However, this time we didn’t use the dvc add command like we did with the raw data. The reason is that, as mentioned, dependencies and outputs defined in dvc run pipelines are automatically tracked (not pushed though, do note the difference). You would also notice the creation of two new files: dvc.yaml and dvc.lock. Summarising the utility of these files: dvc.yaml keeps track of the pipeline stages defined by the user, and dvc.lock contains the MD5 checksums relating to the dependencies and outputs of the defined pipelines. These DVC files are to be versioned by Git so that we can check out differing versions of pipelines, data, or other artefacts.

$ git add data/.gitignore dvc.yaml dvc.lock
$ git commit -m "Add and execute data preparation pipeline to/through DVC."

For greater details on the files generated by DVC (like dvc.yaml and dvc.lock), please refer to DVC’s documentation here. I can’t recommend DVC’s documentation enough; it’s great.

Don’t forget to push the tracked data and artefacts related to the data_prep stage to the remote storage.

$ dvc push data_prep

Adding & Automating Unit Tests

So we have created a script for processing the raw data. That is nice. What would be even nicer is to introduce some unit tests to ensure that the functions we have written for the data processing pipeline always work as intended, especially after changes are incorporated by fellow developers. What we are working towards is a workflow in which code changes are tested following pull requests. The tests will be automatically invoked and the results will be made visible to the reviewer(s). If one sees red, tests have failed and this warrants further changes to the branch, until green check marks can be seen again. In addition to unit tests, we will also conduct checks for code quality through linting. Checks and validations introduced through unit tests are more explicit. Linting is more holistic in the sense that it checks for potential bugs, bad coding conventions and stylistic errors. We will be using the Python libraries pytest and pylint for unit testing and linting respectively.

Let’s add such unit tests to our repository by creating and committing the following files:

$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/tests/__init__.py --create-dirs -o tests/__init__.py
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/tests/test_data_prep.py --create-dirs -o tests/test_data_prep.py
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/.pylintrc --create-dirs -o .pylintrc
$ git add tests/__init__.py tests/test_data_prep.py .pylintrc
$ git commit -m "Add unit tests and linter configuration."

For more explanation on the files we have just tracked and committed, click on the expandable sections below:

  • tests/test_data_prep.py

Click here for a detailed rundown of the file.

  • .pylintrc

Click here for a detailed rundown of the file.

We can try out the unit tests and linting capabilities locally using the following commands:

$ pytest tests --cov=src
$ pylint --fail-under=8.0 src

Running the commands above will yield outputs detailing the results. The unit tests’ results usually detail the number of passes, fails and warnings. The linter showcases a score, in this case out of 10. Depending on the standard (i.e. score) that your team intends to uphold, the linter can output a binary result. In the example above, a threshold of 8.0 is specified, so if any code passed through it does not meet the mark, the command will exit with an error.
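For readers new to pytest, a unit test of the kind found in tests/test_data_prep.py looks roughly like the following sketch, which exercises the hypothetical clean_text function from the earlier modularisation example (the repository's actual tests and assertions will differ):

# tests/test_text_cleaning.py (hypothetical example, not the repository's actual tests)
from src.text_cleaning import clean_text   # hypothetical module from the earlier sketch

def test_clean_text_strips_html_and_lowercases():
    assert clean_text("This movie was <br /> GREAT!") == "this movie was great!"

def test_clean_text_collapses_whitespace():
    assert clean_text("good   movie") == "good movie"

pytest discovers functions prefixed with test_ and reports each assertion failure, which is exactly the feedback loop the CI workflow surfaces on every pull request.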

Now, we add a GitHub Action workflow which runs the aforementioned tests and checks every time a pull request is made to the master branch:

$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/.github/workflows/test-analyse-code.yaml --create-dirs -o .github/workflows/test-analyse-code.yaml

IMPORTANT NOTE: The Docker image stated in the workflow file above has the user prefix ryzalk. Before adding and tracking the file, modify it such that your Docker Hub username is used instead. At the time of writing, there’s no way to pass on the Docker Hub username for the container image definition.

$ git add .github/workflows/test-analyse-code.yaml
$ git commit -m "Add GitHub Action workflow for running unit tests and linter."
$ git push origin test-code-check-exp

For more explanation on this workflow, click on the expandable section below:

Click here for a detailed rundown of the file.

Now, create a pull request comparing the test-code-check-exp branch with master. This time, instead of a single GitHub Action workflow being accounted for, you would see another one called ‘Test & Analyse Code’ running, and the end result would look something like the screenshot below:

Pull request where all checks have passed, including unit tests and linter.

Rejoice! However, DO NOT MERGE this pull request just yet. We are going to push more commits to this branch in the next section which are going to be reflected in this same pull request. So with the tests and checks configured above, we can mitigate any possible degradation of the code base and check that conventions and standards are well-adhered to. Let’s move on to model experimentation and training.

Back to top

– Practical Section 3 –

Reviewing Model Experiments

With the processed set of data, we can now train a model that takes in a text and conducts binary sentiment classification (positive vs negative). We will be using the Keras framework for training the model. As mentioned previously, the script for training the model is derived from this tutorial.

Experiment Tracking

Before we train the model, make sure that your Weights & Biases (W&B) account has been set up. With the API key retrieved from ‘Settings’, you can log in to W&B through the following command:

$ wandb login <YOUR_WANDB_API_KEY>

You can also execute the following to save you the task of logging in to W&B beforehand:

$ export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

Let’s create and commit the model training script linked below:

$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/src/train.py --create-dirs -o src/train.py
$ git add src/train.py
$ git commit -m "Add model training script."

(Refer to the docstrings contained within each function in the script for further explanation.)
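For orientation, the sketch below shows the general shape of such a script: a Keras text classification model with a metric logged to W&B. It is a hypothetical, simplified sketch and not the actual src/train.py; the layer sizes, dataset path and project name are illustrative assumptions.

# Hypothetical sketch of a training script (NOT the actual src/train.py)
import wandb
import tensorflow as tf
from tensorflow.keras import layers

wandb.init(project="dvc-for-cd4ml")                        # start a W&B experiment run

train_ds = tf.keras.utils.text_dataset_from_directory(     # assumes processed text organised by class folders
    "data/processed/train", batch_size=32)

vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
vectorizer.adapt(train_ds.map(lambda text, label: text))   # build the vocabulary from the training texts

model = tf.keras.Sequential([
    vectorizer,
    layers.Embedding(20000, 64),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),                  # probability of the positive sentiment class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(train_ds, epochs=3)
wandb.log({"accuracy": history.history["accuracy"][-1]})    # log the final metric to the W&B run

model.save("models/text-classification-model")              # the output directory tracked by DVC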

Like with the pipeline for processing the raw data, we are to create and track the pipeline for training the model through DVC:

$ dvc run -n train_model \
    -d data/processed -d src/train.py \
    -o models/text-classification-model -o train-run-metrics.md \
    -p train.epochs,train.bs,train.metric,train.pretrained_embedding \
    python src/train.py

If you are training the model without a GPU in the picture, it will take some time. In the meantime, let’s go through the execution. As seen above, the dependencies for the train_model stage are the processed data (data/processed) and the training script itself (src/train.py), while the outputs are the resulting model (models/text-classification-model) and a Markdown file (train-run-metrics.md) which contains the following information:

  • ‘Link to run’ – A link that directs one to the experiment run page on the W&B portal/dashboard. A screenshot of an experiment run page can be found below.
  • ‘Model loss’ – A float value that indicates the resulting test loss of the model that was trained.
  • ‘Model accuracy’ – A float value that indicates the accuracy metric of the model that was trained.
A run page for a model experiment that has finished running.

What is different from the previous time we created a DVC pipeline stage is that we use the -p flag here. Usage of this flag allows us to also track the different parameters defined in the file params.yaml, which we intend to modify from time to time between different experiments. In this example, we keep track of the parameters (keys) epochs, bs, metric, and pretrained_embedding. See DVC’s documentation on params here.
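Inside a training script, these parameters are typically read from params.yaml along the following lines (a hedged sketch: the keys mirror the -p flags above, but the actual src/train.py may read them differently):

import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]     # the 'train' section referenced by the -p flags

epochs = params["epochs"]                   # e.g. number of training epochs
batch_size = params["bs"]
metric = params["metric"]
pretrained_embedding = params["pretrained_embedding"]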

$ git add models/.gitignore .gitignore dvc.yaml dvc.lock
$ git commit -m "Add a DVC pipeline stage for training the binary sentiment classification model."
$ git tag -a "model-v1.0" -m "First version of text classification model."
$ dvc push train_model

With the commands above, we add the updated DVC files (dvc.yaml and dvc.lock) as well as the newly created .gitignore files, which tell Git to ignore the outputs of the train_model stage. Of course, we push these tracked artefacts to the remote storage with dvc push again. We also provide the latest commit with a Git tag. The utility of this tag is that we can easily check out a specific commit/reference with it. With the “model-v1.0” tag provided above, we can just execute git checkout model-v1.0 in the future to get back to this state, without having to scramble for the exact commit SHA of that specific commit/reference. Tags are also useful for indicating releases. We will make use of this tag later on.

Reviewing Model Experiments Through Comments

Now, we are to create another GitHub Action workflow which allows us to post a comment in the pull request. The contents of this comment will be derived from the Markdown file (train-run-metrics.md) generated by the training script. Recall that this file is an output of the train_model stage tracked by DVC. We create and commit the following file:

$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/.github/workflows/comment-pull-req.yaml --create-dirs -o .github/workflows/comment-pull-req.yaml

IMPORTANT NOTE: The Docker image stated in the workflow file above has the user prefix ryzalk. Before adding and tracking the file, modify it such that your Docker Hub username is used instead. At the time of writing, there’s no way to pass on the Docker Hub username for the container image definition.

$ git add .github/workflows/comment-pull-req.yaml
$ git commit -m "Add GitHub action workflow for displaying model training experiment results. Trigger: comment-model-exp"

More explanation for the workflow file can be found in the collapsible section below:

  • .github/workflows/comment-pull-req.yaml

Click here for a detailed rundown of the file.

You may notice that the commit message for adding this new workflow contains a keyword: comment-model-exp. This is a custom keyword configured to trigger the act of commenting on the pull request with the Markdown file through the defined GitHub Action workflow. However, remember that the Markdown file (train-run-metrics.md) is tracked by DVC, which means that it is not ‘uploaded’ to Git but to DVC’s remote storage. The runner spun up for GitHub Actions therefore requires access to the remote storage, so we need to add the Azure connection string as a GitHub secret; name it AZ_CONN_STRING.

The updated repository’s secrets page.
$ git push origin test-code-check-exp
The pull request updated with a comment that details the performance of the model we have trained.

As observed above, a comment containing information from the Markdown file has been posted. So what’s the use or relevance of this? Well, for one, such comments can quickly initiate discussions on a model training experiment, with relevant information at hand. As this job of posting a comment is only triggered by a keyword in the commit message, such discussions are selectively invoked. Let’s just assume that the reviewer in our scenario is satisfied with the model; we can merge the pull request.

Reproducing The Model Training Pipeline

Remember the dvc repro command from above? We are going to use it for this section; let’s pretend that we have a fellow engineer working on this same project and they have to run the same model training pipeline, just with a different parameter value. To begin this ‘simulation’, navigate to another directory that does not have this ‘dvc-for-cd4ml’ repository and clone the repo:

$ git clone https://github.com/<YOUR_GITHUB_USERNAME_HERE>/dvc-for-cd4ml.git
$ cd dvc-for-cd4ml

Create a new branch improve-model:

$ git checkout -b improve-model
$ git push -u origin improve-model

From this new branch, we are to run another model experiment. However, before we are able to train the model, we need the processed data to be provided as input to this pipeline. The remote storage set up for DVC allows us to easily retrieve the processed data without having to run the data_prep stage again. Remember, we are treating the repository we have just cloned as though it is located on another machine. So, we have to configure the connection string for DVC to connect to the remote storage:

$ dvc remote modify --local azremote connection_string '<PASTE YOUR CONNECTION STRING HERE>'

After adding the necessary configuration above, we pull the artefacts pertaining to the relevant pipeline stages (data_prep and train_model) from the remote storage:

$ dvc pull data_prep train_model

Now recall that the model training pipeline stage train_model is tracked by DVC, and the repository has a dvc.yaml file that retains the definition of the stage. We do not have to run the multiline dvc run command again. With dvc repro, DVC will check for any deviations in the dependencies or parameters set for the targeted pipeline stage. Should that be the case, DVC will run the pipeline again, taking into account those deviations. Let’s try out the dvc repro command here:

# Make sure that the relevant conda environment has been activated
$ conda activate dvc-for-cd4ml
$ dvc repro --downstream train_model
Stage 'train_model' didn't change, skipping
Data and pipelines are up to date.

The resulting output tells us that since no tracked dependencies or parameters of the train_model stage have deviated, we can just make do with the artefacts that currently exist and the train_model stage can be skipped. The --downstream flag indicates which stage dvc repro should start from. The example above simply means that we are to reproduce pipelines starting from train_model onwards. Without the flag, DVC will check through every stage, which might not be ideal and could be too time-consuming. Now, let’s create a deviation. Change the value for the epoch parameter in the params.yaml file and rerun the same dvc repro command:

Text editor (VS Code) highlighting the changed value for the ‘train.epoch’ key.
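For reference, the relevant portion of params.yaml after the edit might look something like this. This is a sketch only: any other keys in the file are omitted, and the new value of 5 epochs is taken from the commit message used later in this section.

train:
  epoch: 5   # changed value; triggers a new run of the train_model stage

With that change in place, rerun the same dvc repro command: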
$ dvc repro --downstream train_model
Running stage 'train_model' with command:
	python src/train.py
...

After the training is done, the dvc.lock file will be modified to contain the updated MD5 checksums. We have to track and commit the files that have been modified, and tag the commit to indicate that it contains the files for this new model version. Don’t forget to push the relevant artefacts to the remote storage!

$ git add dvc.lock params.yaml
$ git commit -m "Train second version of sentiment classification model (5 epochs). Trigger: comment-model-exp"
$ git tag -a "model-v2.0" -m "Second version of text classification model."
$ git push origin improve-model
$ dvc push train_model

Create a pull request comparing the improve-model branch with master. A comment containing details of the newly trained model will be posted once the triggered workflow has finished running. The pull request can then be merged. Within this same local repository, check out the master branch and pull in all the changes.

$ git checkout master
$ git pull origin master

So in this master branch, for which DVC has been configured, we have two different versions of the binary sentiment classification model tracked. Say we would like to retrieve the first version of the model (perhaps because it has better accuracy); how do we do that? Well, that’s where the utility of the dvc.lock file comes in. We retrieve the version of the dvc.lock file that belongs to the tag for the first version of the model ('model-v1.0'), and then use dvc checkout to retrieve that version of the model.

$ git tag
...
model-v1.0
model-v2.0
$ git checkout model-v1.0 -- dvc.lock
$ dvc checkout train_model
...
M       models/text-classification-model/
M	train-run-metrics.md

The resulting output from the last command above shows that the artefacts in the folder models/text-classification-model/ and the file train-run-metrics.md have been modified to reflect the version from the tag 'model-v1.0'. From here, we can see how easy it is to toggle between the different versions of the models and other relevant artefacts. With the Markdown file/artefact, we can use the link embedded within it to retrieve more information about the model experiment.
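To switch back to the latest version afterwards, the same two-step pattern applies, for example:

$ git checkout master -- dvc.lock
$ dvc checkout train_model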


– Practical Section 4 (Optional) –

Packaging & Serving

So we know how to toggle between different versions of trained models. However, these models won’t be of much use unless we package them so that end-users can request predictions from them. In this section, we will create a simple Flask app that loads a trained model and serves predictions (positive vs negative sentiment) in response to end-user requests. At the end, we package this app into a Docker container and test it out, all on our local machine. This requires Docker to be installed on your machine. Let’s create and add the following files:

$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/src/serve_model.py --create-dirs -o src/serve_model.py
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/docker/dfc-model-flask-server.Dockerfile --create-dirs -o docker/dfc-model-flask-server.Dockerfile
$ curl https://raw.githubusercontent.com/ryzalk/dvc-for-cd4ml/master/.dockerignore --create-dirs -o .dockerignore
  • src/serve_model.py

This script contains the code for the Flask application that loads the trained model (located under models/text-classification-model) and returns a prediction indicating whether the submitted text carries a positive or negative sentiment. (A minimal illustrative sketch of such an app is shown after this list.)

  • dfc-model-flask-server.Dockerfile

This Dockerfile contains the commands for building the Docker image for serving the model. This Docker image will have the relevant Conda environment installed and the trained model within it. Basically, this image packages all that is relevant for us to conduct binary sentiment classification.

  • .dockerignore

When building a Docker image from a folder, that folder is sent to the Docker daemon as the build context. If the folder is huge (and with the datasets and model artefacts, ours definitely is), a lot of time is consumed before the layer building even starts. The .dockerignore file is therefore crucial for excluding unneeded files from the build context in our Docker build workflows. Do not ignore .dockerignore.
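For orientation, a minimal Flask app of this kind might look like the sketch below. This is not the actual src/serve_model.py from the repository: the model path and the JSON request/response formats follow the walkthrough, but the assumption that the model is a TensorFlow SavedModel accepting raw text strings is mine.

# serve_model_sketch.py - illustrative stand-in for src/serve_model.py.
# Assumption (not from the repository): the trained model is a TensorFlow
# SavedModel that maps raw text strings to a probability of positive sentiment.
import tensorflow as tf
from flask import Flask, jsonify, request

MODEL_DIR = "models/text-classification-model"

app = Flask(__name__)
model = tf.keras.models.load_model(MODEL_DIR)  # load the model once at startup


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload of the form {"text": "..."}
    payload = request.get_json(force=True)
    score = float(model.predict([payload["text"]])[0][0])
    sentiment = "positive" if score >= 0.5 else "negative"
    return jsonify({"sentiment": sentiment})


if __name__ == "__main__":
    # The Dockerfile is expected to expose the app on port 80
    app.run(host="0.0.0.0", port=80)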

Build the Docker image.

$ docker build . -t dfc-model-flask-server:model-v1.0 -f ./docker/dfc-model-flask-server.Dockerfile

After the build is done, run a container from the image that has been built. We bind port 80 of the container to port 80 on the host machine and name the container 'dfc-serving'.

$ docker run -d -p 80:80 --name dfc-serving dfc-model-flask-server:model-v1.0

To make a POST request to the Flask app, you can use the curl command to specify the endpoint, header and data to be sent, like so:

$ curl -X POST "localhost/predict" -H "Content-Type: application/json" -d '{"text": "This movie was unpleasant, like the year 2020."}'
{"sentiment":"negative"}

To stop the container:

$ docker stop dfc-serving

Don’t forget to track and push these files.

# To get rid of changes to the dvc.lock file
$ git checkout master -- dvc.lock
$ git add .dockerignore docker/dfc-model-flask-server.Dockerfile src/serve_model.py
$ git commit -m "Add Python script and Dockerfile for model serving."
$ git push origin master

You can easily share this Docker image with your colleagues so that they can test out the inference server (Flask app), and perhaps even deploy it in a proof-of-concept/staging environment.

IMPORTANT NOTE: This Docker image is not suitable for a production environment. You would need a production-grade server (such as a WSGI server) for matters like handling multiple concurrent requests or asynchronous workloads. Deployment of the prediction app in a production-ready environment is out of this walkthrough’s scope.
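As an illustration only (not part of this walkthrough), one common option is to run the Flask app behind Gunicorn, a WSGI server. This assumes the Flask application object inside src/serve_model.py is named app:

$ pip install gunicorn
$ gunicorn --workers 4 --bind 0.0.0.0:80 src.serve_model:app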


Next Steps

This walkthrough takes a deeper dive into the different CD4ML concepts, but it only scratches the surface. There are plenty of improvements that can be made to the workflows introduced here.

While this walkthrough places greater emphasis on the usage of DVC and GitHub Actions, the concepts we have come across are definitely applicable to alternative tools; there are many different pathways to achieving the same objectives, so have fun exploring them!


How Did AI Singapore Build a “200” Strong All-Singaporean AI Engineering Team With the Blue Ocean Strategy?

The classic management book Blue Ocean Strategy by Chan Kim & Renée Mauborgne describes the business environment using the terms ‘red ocean’ and ‘blue ocean’.

Blue ocean strategy is the simultaneous pursuit of differentiation and low cost to open up a new market space and create new demand. It is about creating and capturing uncontested market space, thereby making the competition irrelevant.

– From https://www.blueoceanstrategy.com/

In short, avoid the red market and hunt in the blue market where your competitors are not playing.

When AI Singapore (AISG) started in June 2017, we were tasked to help 100 companies build their AI products and solutions through the 100Experiments (100E) programme. We started with four AI engineers/data scientists, co-workers from my previous organisations whom I persuaded to join me.

We then put out a recruitment advertisement and received around 300 resumes, of which only ten were from Singaporeans. I only managed to hire one Singaporean to join our team; he is still with us today.

Being a government-funded programme hosted by a local university, our salary structure and incentives could not match what the industry was willing to pay. How could we build up an AI engineering team to meet our 100E KPIs when big tech giants such as Google, Microsoft, Grab and Facebook were looking to hire the exact same talent profiles in Singapore?

We decided to hunt in the blue ocean for these AI talents instead.

AI Apprenticeship Programme

AI Singapore launched the AI Apprenticeship Programme (AIAP)® in early 2018, where we deliberately hunted for AI talents using only two criteria:

  1. Must be a Singapore citizen.
  2. Pass the AIAP technical test and interview (➥ read what we look for, test and interview).

We purposefully avoided stating the academic qualifications the candidate needs to have, such as a computer science or engineering degree with a specialisation in AI or ML. Instead, we stated the skills and knowledge the candidate needed to have to join the apprenticeship programme. We looked for talents that could be nurtured and developed in a short period of time. 

Hence, we looked for passionate individuals who had taught themselves AI/ML, Python (or R) programming and basic software engineering skills. They would typically have participated in various MOOCs, read up on and practised their AI/ML skills on platforms such as Kaggle, and be familiar with cloud computing and data management. Only some would have had formal education in AI/ML at the university level.

One of the biggest misconceptions that hiring managers have is that only Computer Science graduates can do AI and ML. Some of my best AI engineers and data scientists have degrees in economics, psychology, business, biology or industrial engineering.

academic background

What we have found is that the ability to learn fast, a passion for solving data problems and a love of working with data are key to being a good AI engineer or data scientist.

Everyone can learn to program. Not everyone has a passion for data and using data to solve a business problem.

This strategy has worked out exceedingly well for us as we have found passionate individuals from various disciplines who have decided to cross over to the AI/ML sector. They have taken it upon themselves to learn AI/ML at a deep level. Some have spent up to three years poring over their old university mathematics and statistics books. 

Kevin, one of our senior candidates from AIAP batch #3, commented, “I borrowed my son’s CMU maths/stats and AI textbooks as I found online MOOC was not sufficient”.
➥ Watch the interview with Kevin.

What they lacked was the opportunity to apply their knowledge to a real-world AI/ML business problem. The AIAP provided the chance to work on a real-world industry AI project (100E is explained in the next section) or an internal AI Singapore project to build AI/MLOps platform tooling, AI solutions and products (AI Bricks).

The seven-month project phase is executed in-house on AI Singapore premises. The AI engineers assigned to the 100E project work together with the apprentices, who are treated like full-time staff of AI Singapore, and they are in turn supported by our AI/ML DevOps and Data Engineering teams and the various AI Heads, who are experts in the various fields of AI/ML.

Internship vs Apprenticeship

We do not believe in the internship model where “students” are sent to an organisation to intern; the model is broken in Singapore. In our AIAP apprenticeship model, we hot-house the apprentices. This accelerates their learning and skill development. By working so closely with their peers on a daily basis, in an environment where everyone eats, drinks and sleeps AI/ML and deployment, real-world AI projects get delivered.

The outcome is a Minimum Viable Product (MVP) co-created by AI Singapore and the sponsoring organisation. The combined team will typically work on 10-15 sprints over the seven months and deploy the model into production by the final month.

In Provost & Fawcett’s 2013 Data Science for Business, the authors opined on page 319,

  • Data science is a craft.
  • Craft is learned by experience.
  • The most effective learning path resembles the classic trade where aspiring data scientists work as apprentices to masters.
  • Apprentices become “journeyman.”
  • Some become masters and take on apprentices.

The AI Apprenticeship Programme offers one such learning path. The Singaporean AI apprentices are mentored by AI Singapore’s AI mentors, some of whom might have gone through the apprenticeship programme themselves only a few months earlier.

The 100 Experiment (100E)

The 100E programme is where AI Singapore matches and assembles a team of AI scientists, AI engineers, and AI apprentices to help an organisation solve their AI problem statement. The organisation brings the problem statement, dataset, domain expertise, engineering/IT resources and co-funding. 

For a project to be undertaken by AI Singapore under the 100E programme, the company must meet the following criteria:

  1. be a Singapore-registered entity with a local engineering/IT/AI team (or plans to have one)
  2. have the dataset
  3. have a problem statement with ROI or strategic value to Singapore

AI Singapore has engaged more than 500 companies and 15,000 professionals through our outreach and industry programmes to date. We have approved 66 100E proposals, and completed and deployed 23 of them into production.

AI Singapore has done projects in nearly all of Singapore’s primary industry clusters!

Sompo Holdings Asia collaborated with AI Singapore to build an AI-powered fraudulent claims detection system for enhanced claims processing. Sompo received an AI Award for general insurance at Singapore Business Review’s recently concluded Technology Excellence Awards in 2020. Watch the Sompo interview.

IBM Singapore collaborated with us to better classify the quality of different hardware products into risk categories based on product return rates, as well as to predict future product return rates. The AI project involved image processing, deep learning and time-series analytics. Engineers now only need a few minutes to analyse the data, giving them a glimpse into the future to pre-empt potential customer issues. The cost savings from this solution also tripled! Watch the IBM interview.

Conclusion

Many of the apprentices have multiple offers even before they complete the programme. AI Singapore hires up to 30% of these apprentices, and the rest end up in organisations such as DSTA, Grab and DBS Bank today. Some have gone on to form their own AI startups.

Of the nearly 162 Singaporean AI apprentices trained to date, AI Singapore has retained around 50 as AI engineers in various roles across our AI/MLOps, AI modelling, Data Engineering and AI Products teams. Many take on the role of mentors to train incoming batches of apprentices. We also continue to hire experienced AI engineers from the industry, both Singaporeans and foreigners, to supplement our home-grown team.

In recognition of the AI Apprenticeship Programme’s talent development innovation, IDC awarded AI Singapore the “Talent Accelerator for Singapore” award in 2019. Read the IDC AISG award story.

The AIAP works because apprentices get to work on real-world AI projects, not toy or synthetic datasets and problem statements. They get to experience what it takes to deliver an AI project to an actual customer, including all the pains of working with demanding customers, missing or limited datasets and changing user requirements.

One of our pioneer batch of AIAP graduates who now works for the defence industry said: “There are many technical tutorials out there, but few offer the hands-on experience needed to address real-world problems, and that is one of the key differentiators of AI Singapore’s AI Apprenticeship Programme.”
➥ Read the interview with Derek.

Growing our own timber – one apprentice at a time.
