Empowering Representativeness: Strengthening LLMs Through Southeast Asia’s Linguistic Tapestry

Project SEALD (Southeast Asian Languages in One Network Data) pioneers multilingual data collection for Large Language Models (LLMs) in the Southeast Asia (SEA) region. It represents one of the most extensive data collections of Southeast Asian languages: Indonesian, Malay, Tamil, Burmese, Filipino, Vietnamese, Thai, Lao and Khmer.

AI Singapore (AISG) and Google Research have embarked on Project SEALD, a research collaboration to enhance datasets that can be used to train, fine-tune, and evaluate large language models (LLMs) in languages spoken across Southeast Asia (SEA). This collaboration seeks to improve cultural context awareness and capabilities in SEA LLMs, and advance their applicability across the region to bring broad benefits to society.

Improving Inclusivity in SEA LLMs

The region’s local and regional cultures, values and norms differ from those of Western countries, where most large language models originate. To address that, AISG developed SEA-LION (Southeast Asian Languages in One Network) — a family of LLMs specifically pre-trained and instruct-tuned to be more representative of Southeast Asia’s cultural contexts and linguistic nuances.

A key part of SEA-LION involves building up a diverse and high-quality data corpus that supports the training not only of SEA-LION models but also of other models that can add value to SEA-centric use cases. The research under Project SEALD will enhance the availability of evaluation datasets and recipes for the data corpus, starting with Indonesian, Thai, Tamil, Filipino and Burmese.

As part of Project SEALD, the teams will also work on:
• Developing translocalization and translation models,
• Establishing best practices for instruction tuning datasets,
• Creating tools to enable translocalization at scale, and
• Publishing pre-training recipes for SEA languages.

AISG and Google will release the datasets and outputs from Project SEALD as open source, to enable the progress of the SEA LLM ecosystem and foster strong regional expertise.

Project SEALD will engage with ecosystem partners across academia, industry and government in various ways. These include working with industry players on data collection, curation and quality checks, collaborating with academia in different SEA countries to implement state-of-the-art techniques in evaluation and benchmarking, and partnering with government stakeholders in Singapore and across the region to advance use cases that bring broad benefits to society.

“The SEA-LION LLM project has always been about building a community and ecosystem that will continuously work together to enhance the quality of the SEA-LION data corpus and continuously improve on SEA-LION’s capabilities. We are happy that Google now stands as a key part of the SEA-LION ecosystem and we look forward to building better datasets through Project SEALD in collaboration with Google for the benefit of the entire community.”

Dr Leslie Teo, Senior Director of AI Products, AI Singapore

Read the announcement details here.

DATA

Unlock the Treasure Trove of Southeast Asian Languages

The data involved in this project spans a variety of text and speech formats in Southeast Asian languages: Indonesian, Malay, Tamil, Burmese, Filipino, Vietnamese, Thai, Lao and Khmer. The table below lists the population and official language(s) of each country in the region.

Country                     Population (in millions)   Official Language(s)
Brunei Darussalam           0.45                       Malay
Cambodia                    16.95                      Khmer
East Timor (Timor-Leste)    1.36                       Portuguese & Tetum
Indonesia                   277.53                     Indonesian
Laos                        7.63                       Lao
Malaysia                    34.31                      Malay
Myanmar                     54.58                      Burmese
Philippines                 117.34                     Filipino & English
Singapore                   6.02                       English, Malay, Chinese (Mandarin) & Tamil
Thailand                    71.80                      Thai
Vietnam                     98.86                      Vietnamese
Source: Worldometers (This table is alphabetically arranged based on the country name)

Once ready, the collected data will be made openly accessible to the public for direct download, in line with our commitment to open-source principles. We anticipate releasing it in the near future and appreciate your patience as we bring this project to fruition. Stay tuned for updates.

Quotes from Key Partners

“Google is proud to be partnering with AI Singapore to put Singapore and Southeast Asia on the map of AI development. By focusing on Southeast Asian languages and cultural understanding, Project SEALD will significantly improve the existing corpus and evaluation benchmarks in these languages. This will open new opportunities and make AI more inclusive and accessible for individuals and businesses throughout the region.”

Yolyn Ang, Google APAC

“VISTEC is excited to join the pan-ASEAN NLP development offered by Project SEALD, a vital collaborative mechanism that brings our diverse NLP communities into one strategic direction. In particular, Project SEALD will alleviate the low-resource constraints within many Southeast Asian languages with new pretrained language models, datasets, and benchmarks. VISTEC is proud to be an official partner contributing our expertise in Thai NLP to this project.”

Assoc. Prof. Sarana Nutanong, Vidyasirimedhi Institute of Science and Technology, Thailand

“As we continue to work with AI Singapore through XFORM, INC in developing localized, comprehensive and inclusive datasets, we are looking forward to contributing to Project SEALD, which will be significant work in building localized, culture-driven, context-sensitive and open-source LLMs for Southeast Asia through the Ateneo Social Computing Science Laboratory.”

Maria Regina Estuar, Head, Ateneo Social Computing Science Laboratory; CEO, XFORM, Inc., Philippines

Contact Us

Help shape the future of AI in SEA! Partner with Google and AISG to enhance regional LLMs and create language solutions tailored to our region. Researchers, developers and businesses, your expertise is needed to drive innovation in this exciting field. Contact us here to get involved.
