
Current LLM evaluations do not sufficiently measure all we need
Written By:
Tristan Koh Ly Wey
Research Assistant, AI Singapore
Evaluating Large Language Models (LLMs) presents a complex challenge. Although evaluations provide metrics that seem to measure LLM performance objectively, these figures do not always capture the models' nuanced behaviour in real-world applications. Evaluations, while useful, are therefore not absolute and require careful interpretation. The following questions will be addressed: (a) Why are LLM evaluations important? (b) What are some issues with current evaluation approaches? (c) What are the impacts of imperfect evaluations on the generative AI (genAI) ecosystem? and (d) What can be done about it?
Importance of LLM evaluations
At first glance, LLMs seem impressive: they can generate creative, human-like text, images and video from simple, open-ended natural-language input across multiple domains, tasks and contexts. Nevertheless, beyond anecdotal first impressions, how can LLMs' capabilities be objectively and rigorously measured?
Evaluations are designed to meet this need. An evaluation is a methodology for testing a system-under-test (i.e. an LLM) for a particular purpose and interpreting the results. It consists of (a) a set of tests with metrics, and (b) a summarisation of the tests using the metrics (MLCommons, 2024). Evaluations attempt to measure the performance of LLMs on varying tasks such as reasoning, coding and knowledge retrieval. Measuring LLM safety concerns such as toxicity or “dual-use capabilities” (Barrett et al., 2024) is also of increasing importance. By testing specific LLM capabilities, evaluations also serve multiple broader purposes within the genAI ecosystem. These include marketing (e.g. GPT-4o mini has “superior textual intelligence and multimodal reasoning” (OpenAI, 2024)), regulatory compliance such as under the EU AI Act, assessing use-case suitability by app developers and consumers, and serving as an indicator of improvement in model inference capabilities (Hoffmann et al., 2022).
LLM evaluation can be thought of as similar to traditional software testing, which aims to measure system behaviour and produce metrics that are reproducible, generalisable across different contexts, consistent over time and objective (McIntosh et al., 2024). However, unlike traditional software testing, LLM evaluation is especially challenging because LLMs are black boxes and their outputs are probabilistic. Errors cannot be traced back to a specific part of the code, and system characteristics can only be generalised by testing many prompts for a particular task at scale rather than through anecdotal examples.
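To make the two components above concrete, below is a minimal Python sketch of what such an evaluation loop might look like. The model client (`query_model`) and the test items are hypothetical placeholders, and the repeated sampling reflects the probabilistic nature of LLM outputs just described.

```python
import statistics
from typing import Callable

# A minimal sketch of the two parts of an evaluation described above:
# (a) a set of tests with a metric, and (b) a summarisation of the results.

def query_model(prompt: str) -> str:
    return "Paris"  # hypothetical placeholder; replace with a real API call

def exact_match(response: str, expected: str) -> float:
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0

def run_evaluation(tests: list[dict], metric: Callable[[str, str], float],
                   repeats: int = 5) -> dict:
    """Run each test several times (outputs are probabilistic) and summarise."""
    per_test_scores = []
    for test in tests:
        scores = [metric(query_model(test["prompt"]), test["expected"])
                  for _ in range(repeats)]
        per_test_scores.append(statistics.mean(scores))
    return {"mean_score": statistics.mean(per_test_scores),
            "n_tests": len(tests), "repeats": repeats}

tests = [{"prompt": "What is the capital of France?", "expected": "Paris"}]
print(run_evaluation(tests, exact_match))
```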
Measurement theory & issues with evaluations
Measurement theory concerns the conditions under which mathematical objects (including numbers) can be used to represent the properties of objects, and how these representations can then express relationships between those objects (Tal, 2020). Measurement theory provides a suitable theoretical framework for assessing evaluations because evaluations use numbers to claim certain relations between different LLMs (e.g. that GPT-4o is more capable at reasoning than Claude 3.5). A good measurement has to be both reliable and valid. Reliability is often associated with consistency, stability and predictability; validity with truthfulness, accuracy, authenticity and soundness. A measurement can be reliable but invalid, but not the converse: reliability is a necessary but insufficient condition for validity (Hubley & Zumbo, 1996).

Figure 1: Reliability vs validity (Trochim, n.d.)
When evaluations are reliable and valid, these evaluations can be useful predictors of future performance on similar tasks. However, as explained below, LLM evaluations have various reliability and validity issues that undermine their robustness.
Reliability
First, LLM evaluations have low internal consistency reliability. This type of reliability assesses consistency between different evaluations that purport to measure the same construct (Hubley & Zumbo, 1996). Some examples that demonstrate low internal consistency reliability are provided below:
- Making spurious correlations during in-context learning: Adding spurious triggers to prompts used for in-context learning, such as random characters (e.g. # / *) and rare words (e.g. solipsism, serendipity) at varying positions, led to an average performance drop of 33.7% for GPT-2 (amongst other tested models) on sentiment analysis tasks (Tang et al., 2023). The performance drop also increased with model size.
- Formatting: For MCQ-style evaluations, changing the ordering of options, the numbering format (e.g. from (A) to (1)), or the number of options, or adding a “none of the above” option, each results in a significant performance decrease (Wang et al., 2024; Ganguli et al., 2023).
- Phrasing of prompts: Rephrasing questions so that they still test the same construct produces large variations in performance. For instance, each option of an MCQ was converted into a true/false statement, generating three questions whose correct answer is “false” and one whose correct answer is “true”. While consistency between the responses to the latter questions and the responses to the original MCQ format was high (up to 91%), consistency for the former questions was low (only 51%) because the answers were wrongly predicted as “true” (Wang et al., 2024). In other words, while the model could correctly select the one true option out of three false options when the question was phrased as an MCQ, it failed to identify that the other three options were explicitly “false” when they were phrased as true/false questions (a toy version of this check is sketched after this list).
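To illustrate the consistency check referenced in the last bullet, here is a toy sketch (not the authors' code) that converts an MCQ into true/false statements and measures agreement with the MCQ ground truth. The `ask_model` function is a hypothetical stand-in for a real LLM call.

```python
# Convert one MCQ into four true/false judgements and measure how often the
# true/false answers agree with the MCQ ground truth.

def ask_model(prompt: str) -> str:
    return "A"  # hypothetical placeholder; replace with a real API call

def mcq_prompt(question: str, options: dict) -> str:
    lines = [question] + [f"({k}) {v}" for k, v in options.items()]
    return "\n".join(lines) + "\nAnswer with the letter of the correct option."

def tf_prompt(question: str, option_text: str) -> str:
    return (f"{question}\nStatement: the answer is '{option_text}'. "
            "Reply 'true' or 'false'.")

def consistency(question: str, options: dict, correct: str) -> float:
    """Fraction of true/false judgements that agree with the MCQ ground truth."""
    agreements = []
    for letter, text in options.items():
        tf_answer = ask_model(tf_prompt(question, text)).strip().lower()
        expected = "true" if letter == correct else "false"
        agreements.append(tf_answer == expected)
    return sum(agreements) / len(agreements)

options = {"A": "Paris", "B": "Lyon", "C": "Nice", "D": "Marseille"}
print(consistency("What is the capital of France?", options, correct="A"))
```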
Overfitting & predictive validity
Second, models may be overfitting to popular evaluations because most of these evaluations, such as MMLU (Massive Multitask Language Understanding) and HellaSwag, are open-sourced and likely included in training data. This contravenes a principle of machine learning: a model's capabilities should be tested on an out-of-distribution dataset to assess whether it can generalise beyond its training data. The prevailing data science practice separates datasets into train, validation and test sets. The test set is not seen by the model during training and is used after the model's parameters have been optimised on the validation set, providing an unbiased indicator of model performance (Baheti, 2021). However, when evaluation answers are included in the training data, the evaluation is no longer out-of-distribution and no longer provides an unbiased indicator of performance. Overfitting thus reduces the evaluation's predictive validity because the evaluation no longer measures the LLM's capability to generalise to unseen data distributions.
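As a rough illustration of how such contamination might be detected, the sketch below flags benchmark items whose word n-grams also occur in a training corpus. This is only a naive approximation; real contamination analyses are considerably more sophisticated, and the corpora shown are toy placeholders.

```python
# Naive data-contamination check: flag benchmark items that share word n-grams
# with documents in the training corpus.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(benchmark_items: list[str], training_docs: list[str],
                 n: int = 8) -> list[str]:
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & train_ngrams]

benchmark = ["Which planet is known as the red planet? (A) Mars (B) Venus"]
training = ["... which planet is known as the red planet? (a) mars (b) venus ..."]
print(contaminated(benchmark, training, n=6))
```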
Construct validity
Third, evaluations may not actually be measuring the intended construct (construct validity). Because LLMs are black boxes, high scores on popular evaluations may be an imitation of capability (i.e. “stochastic parrots”) rather than a reflection of the models' actual capabilities, and there is no easy or established method to distinguish the two. Evaluations may be measuring other, irrelevant constructs rather than what was intended.
For example, chain-of-thought (CoT) prompting is frequently used to increase performance on evaluations. The step-by-step explanations generated by the model could be seen as a demonstration of its underlying reasoning capabilities. However, these explanations can be convincing yet unfaithful to the LLM's underlying reasoning when biasing features are inserted into the CoT prompts. For instance, when all the answers in the examples used to pre-prompt the model were given as (A), the model generated CoT explanations justifying (A) as the correct option even when (A) was incorrect, without ever stating that it was, in fact, the biasing features that caused it to choose (A) (Turpin et al., 2023). Therefore, although CoT produces higher evaluation scores, the model is unlikely to be exhibiting genuine underlying reasoning; rather, CoT appears to produce explanations driven by irrelevant statistical correlations in the CoT prompts.
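The sketch below is a rough reconstruction, not Turpin et al.'s code, of this kind of biased few-shot prompt: each worked example is reordered so that its correct answer sits in slot (A), while the target question's correct answer does not. `ask_model` is a hypothetical stand-in for an LLM call.

```python
# Build a few-shot CoT prompt in which every worked example's answer is (A),
# then see whether the model still answers (A) on a question where (A) is wrong.

def ask_model(prompt: str) -> str:
    return "(A)"  # hypothetical placeholder; replace with a real API call

def format_question(question: str, options: list[str]) -> str:
    letters = "ABCD"
    lines = [question] + [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

def biased_example(question: str, options: list[str], correct_idx: int) -> str:
    # Move the correct option into slot (A); the worked answer is then always (A).
    reordered = [options[correct_idx]] + [o for i, o in enumerate(options) if i != correct_idx]
    return format_question(question, reordered) + "\nLet's think step by step. Answer: (A)\n\n"

examples = [
    ("Which gas do plants absorb for photosynthesis?",
     ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"], 1),
    ("What is 7 x 8?", ["54", "63", "56", "48"], 2),
]
prompt = "".join(biased_example(*ex) for ex in examples)
# Target question: the correct answer is (C), but a biased model may still pick (A).
prompt += format_question("What is the capital of Australia?",
                          ["Sydney", "Melbourne", "Canberra", "Perth"])
prompt += "\nLet's think step by step. Answer:"
print(ask_model(prompt))
```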
Typical scoring methods may also lack construct validity. Responses are usually aggregated using predefined metrics to summarise model performance on a particular evaluation. Typical metrics in machine learning include RMSE or R-squared (for regression models) and precision, recall or F1 (for classification models). The prevailing convention in LLM evaluation restricts model outputs to an MCQ format so that conventional precision/recall/F1 scores can be used. “The critical weakness is whether the metric actually reliably tracks the property of interest, not the rigour with which the metric is evaluated” (Olah & Jermyn, 2024). Using such restricted metrics risks reducing high-dimensional concepts such as creativity, coherence or reasoning to a single number.
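The sketch below illustrates this conventional scoring path: predicted option letters are compared against a gold key and collapsed into accuracy and macro-F1. It assumes scikit-learn is available, and the predictions are invented for illustration.

```python
# Conventional MCQ scoring: constrain outputs to option letters, compare with
# a gold key, and summarise as one or two numbers.

from sklearn.metrics import accuracy_score, f1_score

gold        = ["A", "C", "B", "D", "A", "B"]
predictions = ["A", "C", "C", "D", "B", "B"]  # hypothetical model outputs

print("accuracy:", accuracy_score(gold, predictions))
print("macro F1:", f1_score(gold, predictions, average="macro"))
# Whatever nuance there was in the model's free-text behaviour (hedging,
# partially correct reasoning, refusals) has been discarded by this point.
```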
Convergent validity
Fourth, popular benchmarks such as MMLU and HellaSwag appear to have poor convergent validity with other evaluations that target closely related reasoning constructs.
Convergent validity refers to how well the scores of a test converge with measures of related constructs, based on what would be expected from theory (Shou, Sellbom & Chen, 2022). Convergent validity is also used as evidence of high construct validity. For instance, depression is theoretically closely linked to anxiety, so to demonstrate that a measure of depression has high construct validity, an established measure of anxiety can be used to show correlation between the scores of the two measures (Hubley, 2014).
Nezhurina et al. (2024) formulated a simple reasoning problem called AIW (Alice-In-Wonderland): “Alice has X brothers and Y sisters. How many sisters does Alice's brother have?” [where X and Y are integers varied across prompts]. The correct answer is Y + 1, since Alice herself is one of her brother's sisters. A large discrepancy was found between models' performance on popular evaluations (i.e. MMLU, HellaSwag) and on AIW. While the latest and largest models (GPT-4o, Claude 3 Opus, GPT-4) performed consistently well on the popular evaluations (85% and above), on AIW only these models scored above roughly 40%, with the best, GPT-4o, scoring only 65%. AIW, MMLU and HellaSwag all purport to measure related constructs of reasoning capability. Though AIW is not an established measure of reasoning, its problem is intuitively straightforward to solve. The large discrepancy between scores on AIW and on MMLU therefore raises concerns about a potential lack of construct validity in MMLU and HellaSwag.
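The sketch below is a toy reconstruction of the AIW problem family (the exact templates in Nezhurina et al. differ); it generates prompts with varying X and Y and scores responses against the ground truth Y + 1. `ask_model` is a hypothetical placeholder.

```python
# Generate AIW-style prompts and score responses against the ground truth Y + 1.

import itertools
import re

def aiw_prompt(x: int, y: int) -> str:
    return (f"Alice has {x} brothers and {y} sisters. "
            "How many sisters does Alice's brother have? "
            "Answer with a single number.")

def check(response: str, y: int) -> bool:
    numbers = re.findall(r"\d+", response)
    return bool(numbers) and int(numbers[-1]) == y + 1

def ask_model(prompt: str) -> str:
    return "4"  # hypothetical placeholder; replace with a real API call

results = [check(ask_model(aiw_prompt(x, y)), y)
           for x, y in itertools.product(range(1, 5), range(1, 5))]
print(f"AIW accuracy: {sum(results) / len(results):.2f}")
```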
External validity
Last, even where construct validity holds, external validity may be an issue because evaluation results may not generalise. The most common use-cases for such models are downstream chatbot applications, such as drafting text across different disciplines, asking for advice, and explaining concepts over multi-turn conversations that require more creative output (Wiggers, 2024). Though the capability to conduct natural, factual and helpful conversations undoubtedly requires competency in the knowledge, reasoning and inference measured by existing popular evaluations, measuring these characteristics in isolation through an MCQ format does not necessarily translate to competency in downstream applications. App developers also consider more than raw performance metrics when integrating foundation models into software, balancing costs, model type (e.g. multimodal, text, multilingual) and latency of generation (Spisak et al., 2024), none of which is usually assessed as part of existing evaluations.
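As a purely illustrative sketch of this kind of trade-off, the snippet below scores hypothetical candidate models on invented quality, cost and latency figures with arbitrary weights; none of the names or numbers reflect real models.

```python
# Toy multi-criteria model selection: weigh evaluation quality against cost and
# latency. All figures and weights are invented for illustration.

candidates = {
    # quality on an internal eval (0-1), USD per 1M tokens, median latency (s)
    "model-a": {"quality": 0.86, "cost": 15.0, "latency": 2.4},
    "model-b": {"quality": 0.78, "cost": 0.6,  "latency": 0.9},
    "model-c": {"quality": 0.81, "cost": 3.0,  "latency": 1.3},
}
weights = {"quality": 0.5, "cost": 0.2, "latency": 0.3}

def score(m: dict) -> float:
    # Higher quality is better; lower (normalised) cost and latency are better.
    return (weights["quality"] * m["quality"]
            - weights["cost"] * m["cost"] / 15.0
            - weights["latency"] * m["latency"] / 2.4)

best = max(candidates, key=lambda name: score(candidates[name]))
print(best, {name: round(score(m), 3) for name, m in candidates.items()})
```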
Impact on the genAI ecosystem
Combined with the necessity of evaluations for achieving essential broader societal objectives, these issues with LLM evaluations have wide-ranging downstream impact. For instance, some financial experts note that genAI is not “designed to solve the complex problems that would justify the costs [of investment]” and predict that the genAI hype bubble may be about to burst (Goldman Sachs, 2024). Whether genAI indeed has such capability or is merely hype depends, amongst other things, on reliable and valid evaluations.
Evaluations also risk being viewed as ends of model development in themselves rather than as fallible means of measuring broader, abstract concepts. One well-examined instance of this phenomenon outside AI is how IQ tests came to be relied upon as definitive predictors of future educational performance rather than what they were originally designed to be: a measure of current cognitive abilities, used to curate appropriate educational aids and tutoring for the child (van Hoogdalem & Bosman, 2023). Muller (2020) terms this “metric fixation”. A similar phenomenon could arise for stakeholders in the genAI ecosystem.
- Model developers and regulators (e.g. OpenAI, the EU AI Office): An underlying assumption is that abstract concepts such as intelligence, understanding or safety can always be fully measured through evaluations. Over-reliance on the current state of evaluations by these major stakeholders has a strong normative effect on model development; metric fixation results in the development of model capabilities that may not generalise to real-life needs and risks.
- App deployers: Because evaluation results do not generalise well, product-market fit may suffer, and app deployers cannot rely on evaluations to select the right product or to design appropriate safety mitigations for their specific use-case.
- Minority actors: Regulatory capture is a natural risk when major actors control the design of evaluations, which in turn limits external validity. For example, if evaluations for bias reflect only the concerns of predominantly white, Western, economically and socially privileged demographics (Bergman et al., 2023), model development would focus on culturally narrow capabilities at the expense of representation for minority demographics.
What’s next: Possible solutions
Existing “quick fixes” can be instituted to improve reliability and validity. For instance, qualitative human review of a subset of prompts and model responses, rather than reliance on a single metric, may improve reliability. Some evaluation suites also apply various types of irrelevant perturbations to design “adversarial” prompts that still aim to measure the same construct; such perturbations change tokens while maintaining semantic, reasoning and answer invariance, which increases internal consistency reliability (Li et al., 2024). To increase predictive validity, evaluations should not be entirely open-sourced, to avoid overfitting; instead, a hold-out test set should be kept closed-source. Formalising these quick fixes as standardised protocols can improve reliability and predictive validity.
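In the spirit of such answer-invariant perturbations (though not Li et al.'s implementation), the sketch below shuffles the option order of each MCQ, which leaves the knowledge being tested unchanged, and compares accuracy before and after. `ask_model` is again a hypothetical stub.

```python
# Measure the accuracy drop under an answer-invariant perturbation (shuffling
# MCQ option order). The content and the correct answer are unchanged.

import random

def ask_model(prompt: str) -> str:
    return "A"  # hypothetical placeholder; replace with a real API call

def render(question: str, options: list[str]) -> str:
    letters = "ABCD"
    body = "\n".join(f"({letters[i]}) {o}" for i, o in enumerate(options))
    return f"{question}\n{body}\nAnswer with a single letter."

def accuracy(items: list[dict], shuffle: bool, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for item in items:
        options = list(item["options"])
        if shuffle:
            rng.shuffle(options)
        answer_letter = "ABCD"[options.index(item["answer"])]
        if ask_model(render(item["question"], options)).strip().upper() == answer_letter:
            correct += 1
    return correct / len(items)

items = [{"question": "What is the boiling point of water at sea level?",
          "options": ["100 °C", "90 °C", "80 °C", "120 °C"], "answer": "100 °C"}]
print("original:", accuracy(items, shuffle=False))
print("perturbed:", accuracy(items, shuffle=True))
```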
However, evaluations can be reliable yet still invalid. To substantially improve validity, given the multifaceted nature of evaluation, evaluations should shift away from the current testing paradigm. Some have proposed adapting the methodology used to test human capability in the cognitive sciences (Burden, 2024; Zhuang et al., 2023). Cognitive science actively grapples with the measurement of human intelligence as an abstract, social construct. For instance, psychometric testing has developed standardised test protocols that allow an individual's performance to be compared with an appropriate comparison group across varying purposes, such as assessing cognitive impairment or functional capacity for everyday tasks. Test scores have to be interpreted and applied differently depending on the specific purpose and context (Institute of Medicine, 2015). In this spirit, the MLCommons LLM evaluation test specification schema (MLCommons, 2024) tries to achieve such clarity in the design of evaluations, though more can be done to learn from and adapt the experimental controls identified in cognitive science.
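As a toy illustration of the psychometric idea of interpreting a score against an explicit comparison group, the sketch below converts a raw benchmark score into a z-score and percentile relative to a set of peer models; all numbers are invented.

```python
# Interpret a raw evaluation score relative to a comparison group of peer
# models rather than in isolation.

import statistics

comparison_group = [0.62, 0.68, 0.71, 0.74, 0.77, 0.80, 0.83]  # invented peer scores
candidate_score = 0.79

mean = statistics.mean(comparison_group)
stdev = statistics.stdev(comparison_group)
z = (candidate_score - mean) / stdev
percentile = 100 * sum(s < candidate_score for s in comparison_group) / len(comparison_group)

print(f"z-score vs comparison group: {z:.2f}")
print(f"beats {percentile:.0f}% of the comparison group")
```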
In the same vein, instead of treating evaluations as akin to comparing the speed and acceleration of simple technical objects such as cars, a more suitable comparison could be with the mature fields of sovereign credit ratings, university rankings or cognitive science. While each of these fields aims to measure its respective systems objectively, each has faced similar problems. Sovereign credit ratings have been criticised for their failure to anticipate financial crises (external validity) (Haspolat, 2015). University rankings have similarly been critiqued for their inability to measure educational quality, as educational experience cannot be reduced to a simple ranking (construct validity); some universities have reallocated resources to improve their ranking at the expense of quality research and teaching (metric fixation) (Robinson, 2014).
LLMs should be evaluated as sociotechnical systems rather than merely as technical objects like conventional software. Sociotechnical systems are a configuration of technologies, services, infrastructures, regulations and actors that fulfil a societal function (Schot et al., 2016). The interaction between the social and technical components determines the risks that manifest (Weidinger et al., 2023). Because finance and education are sociotechnical systems, the evaluation of credit risk is not a simple application of financial forecasting models, and the assessment of educational quality cannot be fully captured by university rankings. Similarly, LLM evaluations aim to measure capabilities with existing societal meanings such as intelligence, reasoning, toxicity and bias, and genAI is more than just technology: it includes services, infrastructures and regulators. Accordingly, typical software testing approaches may no longer apply. LLM evaluations should eventually move towards sociotechnical methods so that they can sufficiently measure all that we need.
References
Artificial Analysis. (n.d.). Artificial Analysis: Model & API Providers Analysis. Retrieved July 5, 2024, from https://artificialanalysis.ai
Baheti, P. (2021, September 13). Train Test Validation Split: How To & Best Practices [2024]. V7 Labs. Retrieved July 4, 2024, from https://www.v7labs.com/blog/train-validation-test-set
Barrett, A. M., Jackson, K., Murphy, E. R., Madkour, N., & Newman, J. (2024, May). Benchmark Early and Red Team Often. UC Berkeley Centre for Long-Term Cybersecurity. Retrieved July 16, 2024, from https://cltc.berkeley.edu/wp-content/uploads/2024/05/Dual-Use-Benchmark-Early-Red-Team-Often.pdf
Bergman, A. S., Hendricks, L. A., Rauh, M., Wu, B., Agnew, W., Kunesch, M., Duan, I., Gabriel, I., & Isaac, W. (2023). Representation in AI Evaluations. In 2023 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’23: the 2023 ACM Conference on Fairness, Accountability, and Transparency. ACM. https://doi.org/10.1145/3593013.3594019
Burden, J. (2024). Evaluating AI Evaluation: Perils and Prospects (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2407.09221
Chen, E. (2023). HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors. Surge AI. Retrieved July 8, 2024, from https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this-popular-llm-benchmark-contains-errors
European Commission. (2024, July 16). European AI Office. Shaping Europe’s digital future. Retrieved July 19, 2024, from https://digital-strategy.ec.europa.eu/en/policies/ai-office
Ganguli, D., Schiefer, N., Favaro, M., & Clark, J. (2023, October 4). Challenges in evaluating AI systems. Anthropic. Retrieved July 5, 2024, from https://www.anthropic.com/news/evaluating-ai-systems
Goldman Sachs. (2024, June 30). Gen AI: Too much spend, too little benefit. Goldman Sachs Global Macro Research. https://www.goldmansachs.com/intelligence/pages/gs-research/gen-ai-too-much-spend-too-little-benefit/report.pdf
Haspolat, F. B. (2015). Analysis of Moody’s Sovereign Credit Ratings: Criticisms Towards Rating Agencies Are Still Valid? In Procedia Economics and Finance (Vol. 30, pp. 283–293). Elsevier BV. https://doi.org/10.1016/s2212-5671(15)01296-4
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. de L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. van den, Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2203.15556
Hubley, A. M. (2014). Discriminant Validity. In Encyclopedia of Quality of Life and Well-Being Research (pp. 1664–1667). Springer Netherlands. https://doi.org/10.1007/978-94-007-0753-5_751
Hubley, A. M., & Zumbo, B. D. (1996). A Dialectic on Validity: Where We Have Been and Where We Are Going. In The Journal of General Psychology (Vol. 123, Issue 3, pp. 207–215). Informa UK Limited. https://doi.org/10.1080/00221309.1996.9921273
Institute of Medicine, Board on the Health of Select Populations, & Committee on Psychological Testing, Including Validity Testing, for Social Security Administration Disability Determinations. (2015). Psychological testing in the service of disability determination. https://doi.org/10.17226/21704
Keegan, J. (2024, July 17). Everyone is judging AI by these tests. But experts say they're close to meaningless. The Markup. https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless
Li, J., Hu, R., Huang, K., Zhuang, Y., Liu, Q., Zhu, M., Shi, X., & Lin, W. (2024). PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2405.19740
McIntosh, T. R., Susnjak, T., Liu, T., Watters, P., & Halgamuge, M. N. (2024). Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2402.09880
MLCommons. (2024, May 22). MLCommons AI Safety v0.5 Benchmark Proof of Concept Test Specification Schema. Retrieved August 1, 2024, from https://drive.google.com/file/d/1gUjDvwRIqRsLmJ2lfnCygnXzlgIHBrMG/view
Muller, J. Z. (2020). The perils of metric fixation. In Medical Teacher (pp. 1–3). Informa UK Limited. https://doi.org/10.1080/0142159x.2020.1840745
Narayanan, A., & Kapoor, S. (2023, October 4). Evaluating LLMs is a minefield. cs.Princeton. Retrieved July 15, 2024, from https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/
Olah, C., & Jermyn, A. (2024, March 1). Reflections on Qualitative Research. Transformer Circuits Thread. Retrieved July 5, 2024, from https://transformer-circuits.pub/2024/qualitative-essay/index.html
OpenAI. (2024, July 18). GPT-4o mini: advancing cost-efficient intelligence. OpenAI. Retrieved July 22, 2024, from https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
Ravi, S. S. (n.d.). Enterprise Scenarios Leaderboard – a Hugging Face Space by PatronusAI. Hugging Face. Retrieved July 5, 2024, from https://huggingface.co/spaces/PatronusAI/enterprise_scenarios_leaderboard
Robinson, D. (2014). The mismeasure of higher education? The corrosive effect of university rankings. In Ethics in Science and Environmental Politics (Vol. 13, Issue 2, pp. 65–71). Inter-Research Science Center. https://doi.org/10.3354/esep00135
Schot, J., Kanger, L., & Verbong, G. (2016). The roles of users in shaping transitions to new energy systems. In Nature Energy (Vol. 1, Issue 5). Springer Science and Business Media LLC. https://doi.org/10.1038/nenergy.2016.54
Shou, Y., Sellbom, M., & Chen, H.-F. (2022). Fundamentals of Measurement in Clinical Psychology. In Comprehensive Clinical Psychology (pp. 13–35). Elsevier. https://doi.org/10.1016/b978-0-12-818697-8.00110-2
Spisak, J., Wampler, D., & Bnayahu, J. (2024, June 4). Evaluation of Generative AI – what’s ultimately our goal? AI Alliance. Retrieved July 5, 2024, from https://thealliance.ai/blog/evaluation-of-generative-ai—whats-ultimately-our
Tang, R., Kong, D., Huang, L., & Xue, H. (2023). Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning. In Findings of the Association for Computational Linguistics: ACL 2023. Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.284
Tal, E. (2020). Measurement in Science. Retrieved from The Stanford Encyclopedia of Philosophy website: https://plato.stanford.edu/archives/fall2020/entries/measurement-science/
Trochim, W. M. (n.d.). Reliability & Validity. Conjointly. Retrieved August 2, 2024 from https://conjointly.com/kb/reliability-and-validity/
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2305.04388
van Hoogdalem, A., & Bosman, A. M. (2023). Intelligence tests and the individual: Unsolvable problems with validity and reliability. In Methodological Innovations (Vol. 17, Issue 1, pp. 6–18). SAGE Publications. https://doi.org/10.1177/20597991231213871
Wang, H., Zhao, S., Qiang, Z., Xi, N., Qin, B., & Liu, T. (2024). Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2402.01349
Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L. A., Mateos-Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach, B., Gabriel, I., Rieser, V., & Isaac, W. (2023). Sociotechnical Safety Evaluation of Generative AI Systems (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2310.11986
Wiggers, K. (2024, March 7). Why most AI benchmarks tell us so little. TechCrunch. Retrieved July 4, 2024, from https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/