
Robustness Testing of AI Systems

Standard model evaluation involves measuring accuracy (or other relevant metrics) on a hold-out test set. However, performance on such a test set does not always reflect the model’s ability to perform in the real world. This is because a fundamental assumption when deploying AI models is that all future data will follow a distribution similar to that of the data the model was trained on. In practice, it is very common to encounter data that is statistically different from the training set, which can cause AI systems to become brittle and fail.

An AI model will always be exposed to a variety of new inputs after deployment because the test data is limited (i.e. a finite subset of all available data). Robustness testing therefore aims to assess the behaviour of the model on such new inputs and to identify its limitations before deployment. One way to achieve this is to curate additional data from other sources to test the model more comprehensively. However, that can be difficult in practice. An alternative approach is to introduce mutations into the test data, with the aim of systematically mutating the data towards new and realistic inputs, similar to what the AI system will encounter in the real world. This forms the basis of many robustness testing techniques.

The research community has developed many different approaches[1] for robustness testing, which can be broadly categorised into white-box[2] and black-box testing[3]. White-box testing requires knowledge of the way the system is designed and implemented, whereas black-box testing only requires the system’s outputs in response to a certain input. These different testing techniques provide different insights about the models.

Evaluating the robustness of a Computer Vision (CV) deep learning model with NTU DeepHunter

One white-box robustness testing tool that we are exploring comes from AI Singapore’s collaborator, the NTU Cybersecurity Lab (CSL). We will briefly introduce the tool before sharing our insights from using it with a computer vision use case.

In traditional software testing, fuzzing is used to detect anomalies by randomly generating or modifying inputs and feeding them to the system[4]. A complementary concept is test coverage, which measures how much of the program has been tested and is used to quantify the rigour of the test. The goal is to maximise test coverage and uncover as many bugs as possible.

Analogously, fuzz testing can also be applied to machine learning systems. The NTU CSL group under Prof Liu Yang developed DeepHunter[5], a fuzzing framework for identifying defects (cases where the model does not behave as expected) in deep learning models. DeepHunter aims to increase the overall test coverage by applying adaptive heuristics based on run-time feedback from the model. We will attempt to give a brief overview of the tool in the next few paragraphs.

A key component of the fuzzing framework is the mechanism by which new inputs to the system are generated: metamorphic mutations. Metamorphic mutations are transformations in the input that are expected to yield unchanged or certain expected changes in the predictive output[7]. These transformed inputs are known as mutants. For example, some mutations for CV tasks can be varying the brightness of the picture or performing a horizontal flip. For NLP tasks, it can be contracting words or changing words to their synonyms. The mutation strategies should be specified by the user depending on their use case and requirements.
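To make the idea concrete, here is a minimal sketch of the two CV mutations mentioned above, applied to a tiny grayscale image stored as nested lists. The function names, the toy classifier and the "label must stay the same" relation are our own illustrative assumptions and are not part of DeepHunter.

```python
from typing import Callable, List

Image = List[List[int]]  # grayscale pixels in [0, 255], row-major


def adjust_brightness(image: Image, delta: int) -> Image:
    """Add delta to every pixel, clipping to the valid [0, 255] range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def horizontal_flip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def toy_classifier(image: Image) -> str:
    """Hypothetical stand-in model: labels an image by its mean intensity."""
    pixels = [p for row in image for p in row]
    return "bright" if sum(pixels) / len(pixels) >= 128 else "dark"


def violates_relation(model: Callable[[Image], str],
                      seed: Image, mutant: Image) -> bool:
    """Metamorphic relation assumed here: the predicted label must not change."""
    return model(seed) != model(mutant)


seed = [[100, 110, 120], [110, 120, 130]]
flipped = horizontal_flip(seed)          # mutant 1
brighter = adjust_brightness(seed, 40)   # mutant 2

print("flip changes label:", violates_relation(toy_classifier, seed, flipped))
print("brightness changes label:", violates_relation(toy_classifier, seed, brighter))
```

In this toy setup the flip leaves the prediction unchanged, while the brightness shift pushes the mean intensity across the classifier's threshold. If the use case expects the label to be stable under such a brightness change, that mutant would be reported as a defect.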

Another component is the coverage criteria. These criteria are computed for each mutant to determine whether it contributes to a coverage increase. There are various definitions of coverage for deep learning models[8], which are based on the behaviour of the neurons in a neural network. For example, Neuron Coverage (NC) measures the proportion of neurons that are activated beyond a predefined threshold, while Neuron Boundary Coverage (NBC) measures how far the tests reach into the corner-case regions, i.e. activations outside the major functional range. Regardless of the specific criteria used, the general idea is that tests with higher coverage are expected to capture more diverse behaviours of the model and allow more defects to be identified. In other words, a mutant that increases coverage is perceived to be new to the model. For more details on how the coverage criteria are computed, please refer to the literature.
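As an illustration, the sketch below computes a simplified threshold-based neuron coverage and checks whether a mutant increases it. It assumes that a neuron counts as covered once its activation exceeds a chosen threshold for at least one input. The activation values, the threshold of 0.5 and the function names are made up for the example, and the criteria in [8] are more detailed than this.

```python
from typing import List, Set


def activated_neurons(activations: List[float], threshold: float) -> Set[int]:
    """Indices of neurons whose activation exceeds the threshold for one input."""
    return {i for i, a in enumerate(activations) if a > threshold}


def neuron_coverage(covered: Set[int], num_neurons: int) -> float:
    """Fraction of neurons covered by at least one input."""
    return len(covered) / num_neurons


THRESHOLD = 0.5   # assumed value, chosen for illustration only
NUM_NEURONS = 4

# Made-up activations of a 4-neuron layer for two existing test inputs.
test_activations = [
    [0.9, 0.1, 0.0, 0.2],
    [0.7, 0.6, 0.1, 0.3],
]

covered: Set[int] = set()
for acts in test_activations:
    covered |= activated_neurons(acts, THRESHOLD)
coverage_before = neuron_coverage(covered, NUM_NEURONS)

# Made-up activations for a new mutant: does it reach neurons not yet covered?
mutant_activations = [0.2, 0.1, 0.8, 0.1]
new_neurons = activated_neurons(mutant_activations, THRESHOLD) - covered
increases_coverage = bool(new_neurons)
coverage_after = neuron_coverage(covered | new_neurons, NUM_NEURONS)

print(coverage_before, coverage_after, increases_coverage)
```

Here the mutant activates a neuron that none of the existing tests reached, so it would be kept as an interesting input.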

Figure 1. The overall workflow of DeepHunter. Image is adapted from [6].

The overall workflow of DeepHunter is illustrated in Figure 1. It starts with an initial set of ‘seeds’ (inputs to the model), which are added to a seed queue for mutation. The core of DeepHunter is a fuzzing loop that combines a seed selection strategy (heuristics to select the next seed for mutation) with the metamorphic mutation, the coverage criteria and the runtime model prediction. The seed selection strategy is chosen such that mutants that increase the coverage, or that the model fails to predict correctly, are added back to the queue for further mutation. The test cases that the model failed to predict correctly are collected for analysis, e.g. to check whether the mutant is realistic. This coverage-guided fuzzing technique was demonstrated[5] to be more effective than random testing at identifying defects in the model. For more details on the methodology, please refer to the literature.
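The loop described above can be sketched in a few lines. This is a deliberately simplified toy and not DeepHunter's implementation: the model, the mutation, the bucket-based coverage signal and the first-in-first-out seed selection are all assumptions made for illustration, and every name in the code is hypothetical.

```python
import random
from collections import deque
from typing import Set


def predict(x: float) -> str:
    """Hypothetical stand-in model with a decision boundary at 0.5."""
    return "high" if x >= 0.5 else "low"


def coverage_of(x: float) -> Set[int]:
    """Toy coverage signal: which of ten input buckets x falls into.
    A real tool would derive this from neuron activations instead."""
    return {min(9, int(x * 10))}


def mutate(x: float, rng: random.Random) -> float:
    """Toy mutation: a small random shift, clipped to [0, 1]."""
    return max(0.0, min(1.0, x + rng.uniform(-0.1, 0.1)))


def fuzz(seeds, mutate_fn, predict_fn, coverage_fn, iterations, rng):
    """Coverage-guided fuzzing loop over (input, expected_label) seeds."""
    queue = deque(seeds)
    covered: Set[int] = set()
    for x, _ in seeds:
        covered |= coverage_fn(x)
    defects = []
    for _ in range(iterations):
        seed, label = queue.popleft()      # simplest selection strategy: FIFO
        queue.append((seed, label))        # keep the seed available for reuse
        mutant = mutate_fn(seed, rng)
        new_coverage = coverage_fn(mutant) - covered
        failed = predict_fn(mutant) != label   # expected label is unchanged
        if failed:
            defects.append((mutant, label))    # collected for later analysis
        if new_coverage or failed:
            covered |= new_coverage
            queue.append((mutant, label))      # mutate this one further
    return defects, covered


defects, covered = fuzz([(0.45, "low"), (0.8, "high")], mutate, predict,
                        coverage_of, iterations=200, rng=random.Random(0))
print(f"{len(defects)} defects found, {len(covered)} of 10 buckets covered")
```

Replacing the coverage check with a constant `False` turns this into plain random mutation of the seeds, which is a convenient baseline for comparing against the coverage-guided version.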

Figure 2. Illustration of the deep learning model inference pipeline for activity classification.

One of the first users of the tool in AI Singapore is the CV Hub team, for their activity classification use case. A typical CV use case consists of a pre-trained object detection or pose estimation model, combined with use-case specific heuristics or models downstream. The CV Hub team was interested in learning about the robustness of the deep learning model that they developed for activity classification. As illustrated in Figure 2, the model takes in key point coordinates of a human pose, from a pre-trained pose estimation model upstream, and classifies it into an activity.

Figure 3. Example renders of pose key points (input to model) before and after mutation. Data is from the JHMDB dataset.

To identify suitable mutation strategies for testing the model, we conducted a discovery session with the CV Hub team to understand the requirements of the use case. We identified a number of possible mutation strategies, and one of them is to mirror key points by flipping the image horizontally, as shown in Figure 3. This mutation strategy is provided to the tool, which uses it as part of its fuzzing process to generate mutants.
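As a rough sketch of this mutation, the snippet below mirrors a list of (x, y) key points about the vertical centre line of the image. It assumes pixel coordinates and a known image width. The optional `swap_pairs` argument is our own hypothetical addition for models that expect left and right joints at fixed indices, and whether such a swap is needed depends on the use case.

```python
from typing import List, Sequence, Tuple

KeyPoint = Tuple[float, float]  # (x, y) in pixel coordinates


def mirror_keypoints(keypoints: Sequence[KeyPoint],
                     image_width: float,
                     swap_pairs: Sequence[Tuple[int, int]] = ()) -> List[KeyPoint]:
    """Flip key points horizontally; y coordinates are left unchanged."""
    mirrored = [(image_width - x, y) for x, y in keypoints]
    # Optionally exchange left/right joints so index semantics stay consistent.
    for left, right in swap_pairs:
        mirrored[left], mirrored[right] = mirrored[right], mirrored[left]
    return mirrored


# Hypothetical 3-point pose: head, left shoulder, right shoulder.
pose = [(160.0, 40.0), (100.0, 80.0), (200.0, 85.0)]

mirrored = mirror_keypoints(pose, image_width=320.0)
mirrored_swapped = mirror_keypoints(pose, image_width=320.0, swap_pairs=[(1, 2)])

print(mirrored)
print(mirrored_swapped)
```

Applying the mirror twice without swapping returns the original pose, which is a quick sanity check for the mutation itself.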

We ran the robustness testing process on the original model and the results are shown in Table 1. The coverage-guided fuzzing identified a large number of defects, which implies that the model was not robust to the mutations.

Model | Accuracy on test set | Number of fuzzer iterations | Number of defects
Original | 65.3% | 5000 | 2193
Retrained | 64.3% | 5000 | 94
Table 1. Test set accuracy and results of coverage-guided fuzzing for each of the models.

After analysing the results of the coverage-guided fuzzing, a strategy was developed to improve the robustness of the model by retraining it with augmented data. The results of the robustness testing on the retrained model are also shown in Table 1. The smaller number of defects identified implies that it is more robust to the mutations. (Note: In this article, we have demonstrated robustness testing using just one mutation strategy. Additional mutation strategies should be used to obtain a more complete picture of the model’s robustness.)

To compare robustness testing with a standard model evaluation, we have also included the test set accuracy of each model in Table 1. Judging by this metric alone, we would have inferred that the two models perform roughly the same. However, the results of the robustness testing revealed that the models behave very differently when subjected to mutations. Through this testing process, we are therefore more confident that the retrained model will be able to handle new and unseen inputs when deployed.
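The contrast can be made explicit with a little arithmetic on the numbers in Table 1. The snippet below is only a rough normalisation. It divides defects by fuzzer iterations, which assumes that iterations are comparable between the two runs, and the variable names are our own.

```python
from typing import Dict

# Values copied from Table 1.
results: Dict[str, Dict[str, float]] = {
    "Original": {"accuracy": 65.3, "iterations": 5000, "defects": 2193},
    "Retrained": {"accuracy": 64.3, "iterations": 5000, "defects": 94},
}


def defects_per_iteration(row: Dict[str, float]) -> float:
    """Average number of defects found per fuzzer iteration."""
    return row["defects"] / row["iterations"]


original_rate = defects_per_iteration(results["Original"])
retrained_rate = defects_per_iteration(results["Retrained"])

# Relative drop in the number of defects after retraining.
relative_reduction = 1 - results["Retrained"]["defects"] / results["Original"]["defects"]

# Difference in test set accuracy, in percentage points.
accuracy_gap = results["Original"]["accuracy"] - results["Retrained"]["accuracy"]

print(f"defects per iteration: {original_rate:.4f} vs {retrained_rate:.4f}")
print(f"relative reduction in defects: {relative_reduction:.1%}")
print(f"accuracy gap: {accuracy_gap:.1f} percentage points")
```

On these figures the retrained model yields roughly 96% fewer defects than the original, while its test set accuracy is only about one percentage point lower.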

In summary, we have demonstrated how robustness testing can give us additional insights about a model’s performance beyond the standard evaluation, as well as actionable insights for improvement. This gives us more confidence when using the model. In the next article, we will continue our exploration of robustness testing with a different tool, Microsoft Checklist, and its application to an NLP use case.


[1] J. Zhang, M. Harman, L. Ma and Y. Liu, “Machine Learning Testing: Survey, Landscapes and Horizons” in IEEE Transactions on Software Engineering, vol. , no. 01, pp. 1-1, 5555.
doi: 10.1109/TSE.2019.2962027

[2] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2019. DeepXplore: automated whitebox testing of deep learning systems. Commun. ACM 62, 11 (November 2019), 137–145. DOI:

[3] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017. Practical Black-Box Attacks against Machine Learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (ASIA CCS ’17). Association for Computing Machinery, New York, NY, USA, 506–519. DOI:

[4] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz and S. Yoo, “The Oracle Problem in Software Testing: A Survey,” in IEEE Transactions on Software Engineering, vol. 41, no. 5, pp. 507-525, 1 May 2015, doi: 10.1109/TSE.2014.2372785.

[5] Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Minhui Xue, Hongxu Chen, Yang Liu, Jianjun Zhao, Bo Li, Jianxiong Yin, and Simon See. 2019. DeepHunter: a coverage-guided fuzz testing framework for deep neural networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2019). Association for Computing Machinery, New York, NY, USA, 146–157. DOI:

[6] X. Xie, H. Chen, Y. Li, L. Ma, Y. Liu and J. Zhao, “Coverage-Guided Fuzzing for Feedforward Neural Networks,” 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 1162-1165, doi: 10.1109/ASE.2019.00127.

[7] Chen, T.Y., Cheung, S.C., & Yiu, S. (2020). Metamorphic Testing: A New Approach for Generating Next Test Cases. ArXiv, abs/2002.12543.

[8] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). Association for Computing Machinery, New York, NY, USA, 120–131. DOI:


  • Enjoys solving problems with the use of technology. Currently in the SecureAI team, exploring tools for testing the robustness of machine learning models and incorporating best practices into the ML development process.

  • AI Engineer in the SecureAI team with a background in Physics.

  • Ernest is an AI Engineer with a background in Physics who enjoys coding and problem solving. He has a keen interest in science and technology and is currently expanding and growing his repertoire in the area of Artificial Intelligence. Being part of the SecureAI team, Ernest spends his time exploring the robustness and uncertainty quantification of Machine Learning models as well as the best practices for MLOps.

  • A 20-year veteran in tech startups and MNCs, Najib focuses on High- Performance Computing (HPC ) as well as Cloud, Data and Artificial Intelligence (AI). He has led engineering teams in several organisations, some of which were startups that were acquired or exited successfully. He has helped build several of the first generation HPC cluster systems and infrastructure in Singapore and the region. He was also a lecturer for NUS School of Continuing and Lifelong Education (NUS SCALE) where he conducted workshops on Reproducible Data Science, Data Engineering and Conversational AI bots (Chatbots). He currently heads the AI Platforms Engineering team in the Industry Innovation Pillar at AI Singapore (AISG) where his team focuses on building the AI infrastructure and platforms for researchers, engineers and collaborators to solve challenging problems.

  • Jianshu has many years of AI/Data Science research and consulting experience. He has good track records in delivering values to clients and also quality academic research. One of his papers has also been awarded the Test of Time award by one of the leading AI conferences. In the recent years, he has spent most of his time in putting AI/ML into real-world usage and promoting ethical aspect of AI/ML, e.g. explainability, fairness, robustness, and privacy-preserving of AI/ML models.

  • Wayyen manages the on-premise and on-cloud infrastructure resources used by AISG engineers and apprentices. He also works with MLOps, SecureAI and Synergos teams to bring out new tools and platforms for better CI/CD/CT in machine learning.

  • David is a Ph.D. candidate at NTU, School of Computer Science and Engineering. His research focuses on enabling quality measures for responsible AI. Thereby, AI systems can be validated for quality standards before deployed in the real world. David co-leads the national AI standardisation committee for Secure AI where he creates the standards and certification process for high-risk AI applications of Singapore together with leading industry, academic and governmental institutions.
