
Robustness testing pipeline for NLP with Microsoft CheckList

In order to streamline robustness testing into the AI engineering process, the SecureAI team has made it a priority to integrate robustness testing tools into AI Singapore’s MLOps pipelines. In this article, we will share our experience integrating Microsoft CheckList, a black-box robustness testing tool, into our CI/CD processes, along with the various technologies shown in Figure 1. We will use an example from the SG-NLP project developed by AI Singapore’s NLP Hub for demonstration. Unfortunately, the example is currently unavailable on the SG-NLP site (as of writing) and will be added to the suite of models in a future release.


Figure 1. Tool Stack

What is CheckList?

CheckList is an evaluation methodology and tool developed by Microsoft for comprehensive behavioural testing of NLP models. NLP models are typically very large and complex, with the same backbone being adapted for a diverse range of downstream tasks. A single statistic may not provide enough insight to understand and improve such a model.

CheckList guides users in designing tests targeted toward specific language capabilities, which better reflects the complexities of language tasks. A suite of capability tests gives the user a more comprehensive understanding of model performance compared to a single statistic.

CheckList introduces different test types, which assess relative changes in the model’s predictions in response to changes in input, rather than simply comparing the predictions to the ground truth as done in standard functional tests.

The test types included in CheckList are:

  • Minimum Functionality Test (MFT): Similar to unit tests in software engineering, composed of simple examples that verify a specific behaviour.
  • Invariance (INV): Applies label-preserving perturbations to inputs, and expects the model prediction to remain the same.
  • Directional Expectation (DIR): Applies perturbations to inputs, and expects the model to behave in a specified way.
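As an illustration, the logic of an INV test can be sketched in a few lines of plain Python. This is a self-contained toy, not the CheckList API; `toy_model` and `add_suffix` are hypothetical stand-ins for a real black-box model and a label-preserving perturbation:

```python
# Minimal sketch of an Invariance (INV) test: perturb each input and
# check that the model's prediction does not change.
def run_inv_test(model_fn, inputs, perturb_fn):
    failures = []
    for text in inputs:
        original = model_fn(text)
        for perturbed in perturb_fn(text):
            if model_fn(perturbed) != original:
                failures.append((text, perturbed))
    return failures

# Toy stand-in model: labels text as 'positive' if it contains 'good'.
def toy_model(text):
    return "positive" if "good" in text.lower() else "negative"

# Label-preserving perturbation: append an innocuous suffix.
def add_suffix(text):
    return [text + " indeed", text + " overall"]

failures = run_inv_test(
    toy_model, ["The food was good", "Service was slow"], add_suffix
)
print(len(failures))  # a robust model produces 0 failures here
```

A DIR test follows the same shape, except the comparison checks for an expected change (e.g. confidence should not increase) instead of equality.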

The tests are black-box in nature, as they only require the model’s output in response to certain inputs. This allows them to be applied to any model, regardless of implementation, unlike white-box testing techniques, which require knowledge about a model’s implementation and may only be applicable to certain types of models.

Designing tests with the CheckList methodology

The use case we have selected is the ‘Learning to Identify Follow-Up Questions’ (LIF) task. As illustrated in Figure 2, given a passage as context, a series of question-answer pairs as conversation history, and a candidate question, the model identifies whether the candidate is a valid follow-up question.


Figure 2. LIF task

As part of the CheckList process, we sat down with the team from NLP Hub to ideate and design suitable tests for the model. The predefined list of capabilities provided by CheckList served as useful prompts during this process.

The following are some examples of the tests that we designed for the demonstration:




  • Robustness (typos): introducing typos into the text, e.g. brother → brohter
  • Robustness (contractions): expanding and contracting contractions, e.g. They’re → They are; They are → They’re
  • Taxonomy (synonyms): replacing words with their synonyms, e.g. see → envision

When a test is run, the data is perturbed to generate test cases, which are passed to the model for prediction. When the model does not behave as expected on a test case, it is counted as a failed case. The failure rate on each test may indicate potential areas of improvement for the model.

Implementing tests in CheckList

Most of the basic perturbations are readily available as functions in CheckList; however, they are designed to be applied directly to strings. Because of the additional complexity of the model input (a single input is represented as a JSON object with predefined keys for the various sub-components, as shown in Figure 2), we had to implement adapters for the built-in perturbation functions in order to use them on our data.
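A minimal sketch of such an adapter, assuming a dict-shaped example. The key names here ("context", "candidate") are illustrative stand-ins, not the actual SG-NLP schema:

```python
import copy

# Sketch of an adapter that lifts a string-level perturbation (such as a
# CheckList built-in) to a structured JSON-style input: the perturbation
# is applied only to the chosen keys, leaving the rest untouched.
def adapt_perturbation(perturb_fn, keys):
    def wrapped(example):
        perturbed = copy.deepcopy(example)  # never mutate the original
        for key in keys:
            perturbed[key] = perturb_fn(perturbed[key])
        return perturbed
    return wrapped

# Demo with a trivial "perturbation" applied to the candidate question only.
shout = adapt_perturbation(str.upper, keys=["candidate"])
example = {"context": "A passage...", "candidate": "is he her brother?"}
print(shout(example)["candidate"])  # IS HE HER BROTHER?
```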

To carry out tests with the perturbed data, CheckList requires a function that can return predictions and confidences from the model for a given set of data. As the model being tested was deployed on the cloud, this was easily accomplished by implementing a function that interacts with the model via its REST API.
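A sketch of such a prediction function using only the standard library; the endpoint URL and the response fields ("label", "confidence") are hypothetical:

```python
import json
import urllib.request

# Placeholder endpoint; the real model is served behind a REST API.
API_URL = "https://example.com/lif/predict"

def parse_response(body):
    """Split a JSON array of results into parallel lists of
    predictions and confidences, the shape CheckList expects."""
    results = json.loads(body)
    preds = [r["label"] for r in results]
    confs = [r["confidence"] for r in results]
    return preds, confs

def predict_and_conf(examples):
    """POST a batch of examples to the model API and return
    (predictions, confidences)."""
    payload = json.dumps(examples).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read())
```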

Data versioning and DVC pipelines

We explored DVC and used it as part of our workflow with two objectives in mind. The first is data versioning. The second is to make use of DVC pipelines, which reduce unnecessary runtime through intermediate artifact caching. The robustness testing pipeline can be organized into stages, with some stages depending on artifacts from previous ones, as shown in Figure 3. When the pipeline is executed with DVC, only the stages whose dependencies (data or code) have changed are run; intermediate artifacts from the previous run are reused. This allows us to update individual tests or add new ones without re-running the entire pipeline every time.
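A hypothetical dvc.yaml illustrating this structure. The stage, script, and file names are invented for illustration; the actual pipeline has more stages, as shown in Figure 3:

```yaml
# Sketch of a two-stage DVC pipeline: each stage re-runs only when one of
# its declared deps changes; its outs are cached and reused otherwise.
stages:
  generate_cases:
    cmd: python generate_cases.py
    deps:
      - data/seed_examples.json
      - generate_cases.py
    outs:
      - artifacts/test_cases.json
  run_tests:
    cmd: python run_tests.py
    deps:
      - artifacts/test_cases.json
      - run_tests.py
    outs:
      - artifacts/results.json
```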


Figure 3. Stages and artifacts of robustness pipeline in DVC

Integration of robustness testing in Git workflow

We implemented robustness testing as a job in our GitLab CI pipeline, as shown in Figure 4. The CheckList tests are run as a DVC pipeline, configured with a remote storage. CML is used to post the results as a comment to the commit associated with the pipeline.
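A sketch of what such a job can look like in .gitlab-ci.yml; the container image tag and report file name are illustrative:

```yaml
# Sketch of a CI job that reproduces the DVC pipeline and posts the
# results back to the commit using CML.
robustness-test:
  image: dvcorg/cml:latest
  script:
    - dvc pull                         # fetch cached artifacts from remote storage
    - dvc repro                        # run only stages whose dependencies changed
    - cml comment create report.md     # post the results as a commit comment
```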


Figure 4. Integration of robustness testing in CI pipeline

An example of the comment in a GitLab merge request is shown in Figure 5. This allows the user to easily view the results within their development platform, and decide if the model performance is sufficient for deployment or use the insights from the evaluation to improve the model in an informed manner.


Figure 5. Report posted as GitLab comment in a merge request

Interactive result analysis with Voila

CheckList comes with an interactive Jupyter widget to facilitate analysis of the perturbed text and test results. In order to integrate it into the CI/CD pipeline, we turned it into a standalone application with Voila, as shown in Figure 6. We configured the CI/CD pipeline to deploy a containerized Voila application on a Kubernetes cluster, using the results of each run. The user can access the container for each run at a unique URL, and look through the detailed results of each test and examples of the test cases, as shown in Figure 7. The user may analyze these and use the insights to make informed decisions on using or improving the model.

Figure 6. Visual summary Jupyter widget deployed as Voila app
Figure 7. Test details and an example of a failed test case


All in all, CheckList provides us with a systematic process for designing a comprehensive suite of tests for NLP models. The black-box nature of the tests and flexibility of the tooling allows it to be applied across a diverse range of tasks, possibly even beyond the realm of NLP applications.

By combining CheckList with various technologies, we have demonstrated how robustness testing can be integrated into the ML development process in a reproducible and convenient manner.

Moving forward, the SecureAI team aims to continue the progress in this area and contribute to the development of more secure and trustworthy AI systems. Stay tuned!




Secure AI Engineering in AI Singapore

With more AI systems being deployed into production, it becomes critical to ensure that the systems are secure and trustworthy. Here in AI Singapore, the SecureAI team is dedicated to developing processes and tools to support the creation of secure and trustworthy AI systems.

As shared in the previous article, one of the key ingredients of robust AI systems is process. Currently, there is a lack of operationalizable process guidelines to guide organizations in developing, verifying, deploying, and monitoring AI systems.

To fill this gap, the SecureAI team has worked on developing a set of guidelines that draws upon AI Singapore’s experience in delivering 100E projects, and consolidates knowledge and best practices from the larger AI community – notably from the Berryville Institute of Machine Learning (BIML) Architectural Risk Analysis (ARA) and Google’s ML test score paper.

In this article, we will share an overview of our findings and how we operationalized them in the organization.

Engineering AI Systems Securely

An AI system is a specific type of software system. The field of software engineering has a relatively well-established set of best practices for the development of software systems. In comparison, the domain of AI engineering is in its infancy and the best practices are constantly being updated and improved. 

The full life cycle of an AI system generally consists of the stages as shown in Figure 1.

Figure 1. Life cycle of an AI system.

The considerations for engineering an AI system can be grouped into one of the following four areas of focus: data, modelling, infrastructure, and monitoring. Each of these areas can pertain to one or more parts of the life cycle. The following are a selection of key considerations under each area, which we have identified to be important for the development of secure AI systems.


Data

Data is a key area where AI projects differ from traditional software projects. Traditional software systems have their logic coded in their source code, whereas AI systems rely on learning from the data provided. This means that any bias or compromise in the data can result in vastly different behaviours and unwanted outcomes in the AI system. Therefore, it is critical to ensure that the data used is trustworthy and reliable.

As data is arguably one of the most important components of an AI system, there are many other considerations in this category. This includes, but is not limited to, checking for input-output feedback loops, proper representation of the problem space, data splitting methodology, avoiding unwanted bias from data processing, and ensuring privacy/anonymity.


Modelling

The model or algorithm is typically what people think of when it comes to AI systems. The model chosen needs to be suitable for the complexity of the problem. It is also important to identify and verify the assumptions associated with the model.

Beyond the choice of algorithm, model development is a complex process in which many small decisions have to be made along the way, each of which can have a critical impact on the performance of the model. It is important to examine these choices systematically: for example, evaluating the sensitivity of hyperparameters and checking whether the metric used for the machine learning task is appropriate.

Beyond basic functional requirements, an AI system can also be tested for non-functional requirements [1] such as fairness, robustness, and interpretability. Robustness testing, in particular, is an area of focus for the SecureAI team, and we will share our work in this area in much greater detail in subsequent articles of this series.


Infrastructure

Infrastructure supports the entire life cycle of the AI system, not just training and testing, but also deployment and future enhancement of models. The infrastructure should facilitate model training, model validation, and model rollback when needed.

It is important to have proper access control and versioning of the data, model, and code, for traceability, reproducibility, and security. The development and production environments should also be properly isolated.


Monitoring

The performance of an AI system can change in unexpected ways over time, due to reasons such as changing trends or the degradation of physical hardware, e.g. the sensors that provide input data or the computational devices that the model runs on. It is important to continually monitor the performance of the system to ensure that it meets requirements. The monitoring should automatically alert the relevant teams when performance deviates from expectations, so that the necessary actions can be taken promptly, e.g. retraining the model, updating dependencies, or maintaining hardware.
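At its simplest, such monitoring amounts to comparing a rolling window of a production metric against a baseline and raising an alert on degradation. A minimal sketch, with illustrative thresholds and a print statement standing in for a real alerting channel:

```python
from collections import deque

# Sketch of a simple performance monitor: track a rolling window of a
# production metric and alert when it drops below baseline - tolerance.
class PerformanceMonitor:
    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # rolling window of recent scores

    def record(self, score):
        self.scores.append(score)
        if self.degraded():
            self.alert()

    def degraded(self):
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance

    def alert(self):
        # Stand-in for paging or notifying the relevant team.
        print(f"ALERT: rolling metric below {self.baseline - self.tolerance:.2f}")

monitor = PerformanceMonitor(baseline=0.90)
for score in [0.91, 0.89, 0.80, 0.78]:  # gradually degrading accuracy
    monitor.record(score)
```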

All of the four aforementioned aspects must be managed properly in order to ensure that the AI system is reliable and secure. This is not an exhaustive list but rather an introduction to the topic of secure AI engineering. For more in-depth discussion on the topic, interested readers may refer to the linked resources.

Operationalizing the Principles

In order to put the above principles into practice, the SecureAI team has developed a process that involves a knowledge-sharing session and a security review.

Knowledge Sharing

At the start of a project, the SecureAI team conducts a sharing session on the common risks faced during the development and deployment of AI systems. The target audience comprises AI practitioners, engineers, and project stakeholders from both AI Singapore and its industry partners.

The primary goal of the sharing is to ensure that everybody involved with the project understands the importance and implications of AI risks and is aligned with the goal of minimizing them.

It also enables AI developers and engineers to proactively secure the AI system as they develop it, and helps practitioners from traditional cybersecurity backgrounds understand the security implications of deploying AI in their systems.

Security Review

The project team is provided with a checklist that consists of questions that are designed to aid them in systematically identifying and mitigating potential risks in an AI system. Throughout the development process, the project team can refer to the checklist for guidance. 

When the project team is ready, they can fill in the checklist with the details of their system design. Based on the responses, the SecureAI team provides an overall risk assessment and recommendations for mitigating potential risks. This process can be iterative in nature, to facilitate the development of more secure AI systems.

At the end of the project, the final version of the report will be handed over along with the project deliverables to the project sponsors.

Risk Control Checklist Examples

The questions in the checklist are organized into four sections reflecting the typical life cycle of an AI system as mentioned above (data, modelling, infrastructure, and monitoring). Sample questions and recommendations are shown in Tables 1 and 2, respectively.

Table 1. Example questions from the risk control checklist. Each question is answered with Y/N/NA, and all answers (including ‘NA’) must be justified in the elaboration.

  • Data: Is your dataset representative of the problem space? Elaboration: Please describe the problem space that the ML system aims to address. Please elaborate on how you have ensured that the distribution of the data is representative of the problem (e.g. data covers all intended operating conditions/target demographic, term frequency matches the natural distribution of the target corpus, classes are balanced). Please note down any constraints in obtaining a representative dataset, if any.
  • Modelling: Have you ensured that your model is sufficiently robust to noise in the inputs? Elaboration: Please elaborate on how the model was tested for robustness.
  • Infrastructure: Is your ML pipeline integration tested? Elaboration: Please elaborate on how you have ensured that your full ML pipeline is integration tested and how often (e.g. an automated test that runs the entire pipeline – data prep, feature engineering, model training and verification, deployment to the production serving system – using a small subset of data, at regular intervals or whenever changes are made to the code, model or server).
  • Monitoring: Will any degradations in model quality or computational performance be detected and reported for the deployed model? Elaboration: Please elaborate on how degradations of model performance in the production environment are detected and reported.
Table 2. Examples of recommendations that may be provided to a project team.

  • Modelling: The explainability of the model is relatively low due to the use of a deep learning model. Recommendation: post-hoc explainers, such as LIME or Grad-CAM, could be applied.
  • Infrastructure: The data and model artefacts are manually versioned with timestamps. Recommendation: use a proper model lifecycle management tool. This would help to keep an inventory of the different models, their corresponding performance, and model stage transitions (e.g. promoting a model to production, or rolling it back from production).

Following this process allows us to have more confidence that the AI systems developed in AI Singapore are secure and trustworthy. This checklist is continually improved based on feedback and experience from executing projects.

Hopefully, this article has given the reader an idea of how we practice secure AI engineering in AI Singapore. In the subsequent articles of this series, we will be diving into a focus area of SecureAI as mentioned in the ‘modelling’ section above: robustness testing. Stay tuned to learn more about the topic and our work in this area!

[1] Machine Learning Testing: Survey, Landscapes and Horizons
