
Robustness testing pipeline for NLP with Microsoft CheckList

To streamline robustness testing into the AI engineering process, the SecureAI team has made it a priority to integrate robustness testing tools into AI Singapore’s MLOps pipelines. In this article, we share our experience integrating Microsoft CheckList, a black-box robustness testing tool, into our CI/CD processes, together with the supporting technologies shown in Figure 1. We use an example from the SG-NLP project developed by AI Singapore’s NLP Hub for demonstration. The example is currently unavailable on the SG-NLP site (as of writing) and will be added to the suite of models in a future release.


Figure 1. Tool Stack

What is CheckList?

CheckList is an evaluation methodology and tool developed by Microsoft, for comprehensive behavioural testing of NLP models. NLP models are typically very large and complex, with the same backbone being adapted for a diverse range of downstream tasks. A single statistic may not be able to provide useful insights to understand and improve the model. 

CheckList guides users in designing tests targeted toward specific language capabilities, which better reflect the complexities of language tasks. A suite of capability tests gives the user a more comprehensive understanding of model performance than a single statistic can.

CheckList introduces different test types, which assess relative changes in the model’s predictions in response to changes in input, rather than simply comparing the predictions to the ground truth as done in standard functional tests.

The test types included in CheckList are:

  • Minimum Functionality Test (MFT): Similar to unit tests in software engineering, composed of simple examples that verify a specific behaviour.
  • Invariance (INV): Applies label-preserving perturbations to inputs, and expects the model prediction to remain the same.
  • Directional Expectation (DIR): Applies perturbations to inputs, and expects the model to behave in a specified way.

The tests are black-box in nature, as they only require the model’s output in response to certain inputs. This allows them to be applied to any model, regardless of implementation, unlike white-box testing techniques, which require knowledge about a model’s implementation and may only be applicable to certain types of models.
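The three test types can be illustrated with a minimal, self-contained sketch. The stub model and all names below are purely illustrative (this is not CheckList's actual API); the point is that each test only needs the model's predictions, never its internals.

```python
def toy_sentiment(text: str) -> str:
    """Stub model: a naive keyword-based sentiment classifier,
    standing in for a real black-box NLP model."""
    return "positive" if "good" in text.lower() else "negative"

# MFT: simple examples with known expected labels, like unit tests.
mft_cases = [("The food was good.", "positive"),
             ("Terrible service.", "negative")]
mft_failures = sum(toy_sentiment(x) != y for x, y in mft_cases)

# INV: a label-preserving perturbation (here, a typo) should not
# change the prediction.
original = "The food was good."
perturbed = "Teh food was good."  # typo; the true label is unchanged
inv_failure = toy_sentiment(original) != toy_sentiment(perturbed)

# DIR: a perturbation with a known expected effect on the output,
# e.g. appending clearly negative words should not yield "positive".
dir_failure = toy_sentiment("The food was bad, awful even.") == "positive"
```

Each check compares model outputs to each other or to a simple expectation, which is what makes the methodology applicable to any model exposing a predict function.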

Designing tests with the CheckList methodology

The use case we have selected is the ‘Learning to Identify Follow-Up Questions’ (LIF) task. As illustrated in Figure 2, given a passage as context, a series of question-answer pairs as conversation history, and a candidate question, the model identifies whether the candidate is a valid follow-up question.


Figure 2. LIF task

As part of the CheckList process, we sat down with the team from NLP Hub to ideate and design suitable tests for the model. The predefined list of capabilities provided by CheckList served as useful prompts during this process.

The following are some examples of the tests that we designed for the demonstration:




  • Robustness (typos): introducing typos into the text, e.g. brother → brohter
  • Robustness (contractions): expanding and contracting contractions, e.g. They’re → They are; They are → They’re
  • Taxonomy (synonyms): replacing words with their synonyms, e.g. see → envision

When a test is run, the data is perturbed to generate test cases, which are passed to the model for predictions. When the model does not behave as expected on a test case, it is counted as a failed case. The failure rate on each test may indicate potential areas of improvement for the model.

Implementing tests in CheckList

Most of the basic perturbations are readily available as functions in CheckList; however, they are designed to be applied directly to strings. Due to the additional complexity of the model input (a single input is represented as a JSON object with predefined keys for the various subcomponents, as shown in Figure 2), we had to implement adapters for the built-in perturbation functions in order to use them on our data.
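The adapter pattern can be sketched as follows. For self-containedness, the typo function below is a simplified stand-in (a single adjacent-character swap, similar in spirit to CheckList's built-in typo perturbation), and the LIF field names are assumptions, not the actual schema.

```python
import copy
import random

def swap_typo(text: str, rng: random.Random) -> str:
    """Simplified stand-in for a typo perturbation: swap two adjacent
    characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def perturb_lif_input(example: dict, string_fn, fields=("passage", "candidate")) -> dict:
    """Adapter: apply a string-level perturbation to selected fields of a
    LIF-style JSON input, leaving the rest of the structure untouched.
    Field names here are illustrative."""
    out = copy.deepcopy(example)
    for field in fields:
        if isinstance(out.get(field), str):
            out[field] = string_fn(out[field])
    return out

rng = random.Random(0)
example = {"passage": "John has a brother named Tom.",
           "history": [["Who is Tom?", "John's brother."]],
           "candidate": "Does John have a brother?"}
perturbed = perturb_lif_input(example, lambda s: swap_typo(s, rng))
```

The same adapter can wrap any string-in, string-out perturbation, so each new test only needs to specify which fields of the input it targets.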

To carry out tests with the perturbed data, CheckList requires a function that can return predictions and confidences from the model for a given set of data. As the model being tested was deployed on the cloud, this was easily accomplished by implementing a function that interacts with the model via its REST API.
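A prediction function of this kind might look like the sketch below. The endpoint URL and the request/response schema are assumptions for illustration; the transport is injectable so the function can be exercised without a live service.

```python
import json
import urllib.request

def make_predict_fn(url, post=None):
    """Build a CheckList-style prediction function that queries a model
    served behind a REST API and returns (predictions, confidences).
    The URL and response schema are illustrative assumptions."""
    def default_post(payload: bytes) -> bytes:
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    post = post or default_post

    def predict_fn(inputs):
        payload = json.dumps({"instances": inputs}).encode("utf-8")
        results = json.loads(post(payload).decode("utf-8"))["predictions"]
        preds = [r["label"] for r in results]
        confs = [r["confidence"] for r in results]
        return preds, confs
    return predict_fn

# Usage with a stubbed transport (no network needed):
def fake_post(payload: bytes) -> bytes:
    n = len(json.loads(payload.decode("utf-8"))["instances"])
    return json.dumps({"predictions": [
        {"label": 1, "confidence": 0.9}] * n}).encode("utf-8")

predict = make_predict_fn("https://example.com/lif/predict", post=fake_post)
preds, confs = predict([{"candidate": "Does John have a brother?"}])
```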

Data versioning and DVC pipelines

We explored DVC and used it as part of our workflow with two objectives in mind. The first is data versioning. The second is to make use of DVC pipelines, which reduce unnecessary runtime by caching intermediate artifacts. The robustness testing pipeline can be organized into stages, with some stages depending on artifacts from previous ones, as shown in Figure 3. When the pipeline is executed with DVC, it only runs the stages whose dependencies (data or code) have changed, and reuses intermediate artifacts from the previous run. This allows us to update individual tests or add new ones without re-running the entire pipeline every time.
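A stage layout of this shape could be expressed in a `dvc.yaml` along the following lines. Stage, script, and artifact names here are illustrative, not the actual pipeline.

```yaml
# dvc.yaml — illustrative stage layout for the robustness pipeline
stages:
  generate_cases:
    cmd: python generate_cases.py data/lif_samples.json
    deps:
      - generate_cases.py
      - data/lif_samples.json
    outs:
      - artifacts/test_cases.json
  run_tests:
    cmd: python run_checklist.py artifacts/test_cases.json
    deps:
      - run_checklist.py
      - artifacts/test_cases.json
    outs:
      - artifacts/results.json
  report:
    cmd: python make_report.py artifacts/results.json
    deps:
      - make_report.py
      - artifacts/results.json
    outs:
      - artifacts/report.md
```

Running `dvc repro` then re-executes only the stages whose `deps` have changed; for instance, editing `make_report.py` would rebuild the report without re-running the tests.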


Figure 3. Stages and artifacts of robustness pipeline in DVC

Integration of robustness testing in Git workflow

We implemented robustness testing as a job in our GitLab CI pipeline, as shown in Figure 4. The CheckList tests are run as a DVC pipeline, configured with a remote storage. CML is used to post the results as a comment to the commit associated with the pipeline.
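The CI job could be sketched roughly as below. The image, stage, and file names are assumptions for illustration, not our exact configuration.

```yaml
# .gitlab-ci.yml fragment — sketch of the robustness-testing job
robustness_test:
  stage: test
  image: dvcorg/cml:latest             # an image with CML available (illustrative)
  script:
    - pip install -r requirements.txt
    - dvc pull                          # fetch cached artifacts from remote storage
    - dvc repro                         # run only stages whose dependencies changed
    - dvc push                          # store updated artifacts back to the remote
    - cml comment create artifacts/report.md   # post results as a commit comment
```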


Figure 4. Integration of robustness testing in CI pipeline

An example of the comment in a GitLab merge request is shown in Figure 5. This allows the user to easily view the results within their development platform, and decide if the model performance is sufficient for deployment or use the insights from the evaluation to improve the model in an informed manner.


Figure 5. Report posted as GitLab comment in a merge request

Interactive result analysis with Voila

CheckList comes with an interactive Jupyter widget to facilitate analysis of the perturbed text and test results. In order to integrate it into the CI/CD pipeline, we turned it into a standalone application with Voila, as shown in Figure 6. We configured the CI/CD pipeline to deploy a containerized Voila application on a Kubernetes cluster, using the results of each run. The user can access the container for each run at a unique URL, and look through the detailed results of each test and examples of the test cases, as shown in Figure 7. The user may analyze these and use the insights to make informed decisions on using or improving the model.
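Containerizing the widget could look roughly like the Dockerfile below; the notebook name, port, and copied paths are assumptions for illustration.

```dockerfile
# Dockerfile — sketch of serving the CheckList summary widget via Voila
FROM python:3.9-slim
RUN pip install --no-cache-dir voila checklist
COPY results/ /app/results/
COPY summary.ipynb /app/
WORKDIR /app
EXPOSE 8866
CMD ["voila", "summary.ipynb", "--no-browser", "--port=8866", "--Voila.ip=0.0.0.0"]
```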

Figure 6. Visual summary Jupyter widget deployed as Voila app
Figure 7. Test details and an example of a failed test case


All in all, CheckList provides us with a systematic process for designing a comprehensive suite of tests for NLP models. The black-box nature of the tests and flexibility of the tooling allows it to be applied across a diverse range of tasks, possibly even beyond the realm of NLP applications.

By combining CheckList with various technologies, we have demonstrated how robustness testing can be integrated into the ML development process in a reproducible and convenient manner.

Moving forward, the SecureAI team aims to continue the progress in this area and contribute to the development of more secure and trustworthy AI systems. Stay tuned!





  • Wayyen manages the on-premise and on-cloud infrastructure resources used by AISG engineers and apprentices. He also works with MLOps, SecureAI and Synergos teams to bring out new tools and platforms for better CI/CD/CT in machine learning.

  • Ernest is an AI Engineer with a background in Physics who enjoys coding and problem solving. He has a keen interest in science and technology and is currently expanding and growing his repertoire in the area of Artificial Intelligence. Being part of the SecureAI team, Ernest spends his time exploring the robustness and uncertainty quantification of Machine Learning models as well as the best practices for MLOps.

  • AI Engineer in the SecureAI team with a background in Physics.

  • Enjoys solving problems with the use of technology. Currently in the SecureAI team, exploring tools for testing the robustness of machine learning models and incorporating best practices into the ML development process.

  • Jianshu has many years of AI/Data Science research and consulting experience. He has good track records in delivering values to clients and also quality academic research. One of his papers has also been awarded the Test of Time award by one of the leading AI conferences. In the recent years, he has spent most of his time in putting AI/ML into real-world usage and promoting ethical aspect of AI/ML, e.g. explainability, fairness, robustness, and privacy-preserving of AI/ML models.

  • A 20-year veteran in tech startups and MNCs, Najib focuses on High- Performance Computing (HPC ) as well as Cloud, Data and Artificial Intelligence (AI). He has led engineering teams in several organisations, some of which were startups that were acquired or exited successfully. He has helped build several of the first generation HPC cluster systems and infrastructure in Singapore and the region. He was also a lecturer for NUS School of Continuing and Lifelong Education (NUS SCALE) where he conducted workshops on Reproducible Data Science, Data Engineering and Conversational AI bots (Chatbots). He currently heads the AI Platforms Engineering team in the Industry Innovation Pillar at AI Singapore (AISG) where his team focuses on building the AI infrastructure and platforms for researchers, engineers and collaborators to solve challenging problems.
