Welcome to the AI Products webpage, dedicated to Large Language Models (LLMs). Our team builds the SEA-LION (Southeast Asian Languages In One Network) family of LLMs, pre-trained and instruct-tuned specifically for the Southeast Asian (SEA) region.
The SEA-LION model is a significant step forward in Natural Language Processing (NLP) for the region. It is built on the robust MPT architecture and has a vocabulary size of 256K. For tokenization, the model employs our custom SEABPETokenizer, which is tailored to SEA languages so that the model represents them efficiently.
SEA-LION currently comes in two base model variants: a 3-billion-parameter model and a 7-billion-parameter model.
A 7B instruction-tuned version, SEA-LION 7B Instruct, is also available; a minimal loading sketch is shown below.
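As a rough illustration, the sketch below loads the instruction-tuned model with the Hugging Face transformers library. The model identifier, prompt, and generation settings are assumptions for illustration only; please refer to the SEA-LION GitHub page and the official model cards for the exact identifiers and recommended usage.

```python
# Minimal sketch (not the official usage): loading an instruction-tuned
# SEA-LION checkpoint with Hugging Face transformers.
# The model ID "aisingapore/sea-lion-7b-instruct" is an assumption; check the
# SEA-LION GitHub page for the published identifiers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b-instruct"  # assumed Hugging Face model ID

# trust_remote_code=True lets transformers load the MPT-based model code and
# the custom SEABPETokenizer that ship with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Apakah ibu kota Indonesia?"  # "What is the capital of Indonesia?" (Indonesian)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```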
Why SEA-LION?
Large Language Models (LLMs) are a type of artificial intelligence model designed to understand and generate human language. They are trained on vast amounts of text data and can perform a wide range of tasks such as translation, summarization, answering questions, and even writing code.
Existing LLMs display strong biases in cultural values, political beliefs, and social attitudes. This stems from their training data, especially data scraped from the Internet, which is disproportionately influenced by Western, educated, industrialized, rich, and democratic (WEIRD) societies. People from non-WEIRD societies are less likely to be literate, to use the Internet, and to have their output easily accessed.
Our work on SEA-LION aims to create LLMs that cater to under-represented population groups and low-resource languages in the SEA region. The following figure shows our training data distribution.
At the heart of serving the SEA region well lies language tokenization: the process of breaking text into the sub-word pieces used to train the LLM. The tokenizers of popular LLMs are often English-centric, which makes them less efficient at representing SEA languages. After testing a variety of tokenization approaches and evaluating their results, we created our custom SEABPETokenizer. With a vocabulary size of 256K tokens, it is designed to balance fertility (the average number of tokens produced per word) and the proportion of continued words against the general performance of existing models, informed by our linguistic understanding of SEA languages; a rough sketch of how fertility can be measured is shown below.
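To make the fertility metric concrete, here is a small sketch of how it can be measured, assuming the tokenizers are available through the Hugging Face transformers library; the model identifiers are placeholders for whichever tokenizers you wish to compare.

```python
# Minimal sketch of the "fertility" metric: the average number of tokens a
# tokenizer produces per whitespace-delimited word. Lower fertility on SEA
# text generally means the language is represented more compactly.
# The model IDs below are assumptions used only for illustration.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """Average tokens per word over a list of sentences."""
    total_tokens = sum(
        len(tokenizer.encode(s, add_special_tokens=False)) for s in sentences
    )
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# Small illustrative sample; a real evaluation would use a large, balanced
# corpus across SEA languages (and a proper word segmenter for languages
# such as Thai that do not use spaces between words).
sentences = [
    "Selamat pagi, apa khabar anda hari ini?",   # Malay
    "Chúc bạn một ngày tốt lành!",               # Vietnamese
    "Magandang umaga sa inyong lahat.",          # Filipino
]

for model_id in ["gpt2", "aisingapore/sea-lion-7b"]:  # assumed comparison pair
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    print(f"{model_id}: fertility = {fertility(tok, sentences):.2f}")
```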
More information is available at the SEA-LION GitHub page.
Open Source for the Community
SEA-LION is an open-source model focused on SEA languages and cultures, and it will be made freely available to the general public. We are working with many regional partners on data collaborations and use cases, but there is a lot more to do. Together, we stand on the verge of a breakthrough for SEA, and we invite you to contribute and be a part of this exciting journey.
Contact us with this SEA-LION Inquiry Form to find out more.
LLM Gallery
Stay tuned for future SEA-LION application showcases.
In the News
- Sea-Lion explained: Southeast Asia’s first large language model (Computerweekly, 5 Feb 2024)
- Singapore leads regional LLM development with SEA-LION (Deeptech Times, 5 Feb 2024)
- How SEA-LION aims to bridge the cultural gap existing in popular AI tools (e27, 31 Jan 2024)
- Why Singapore’s LLM isn’t sweating GPT-4 (Tech in Asia, 31 Jan 2024)
- Tokopedia, NCS among companies in pilot run of AI language model for South-east Asia (Business Times, 24 Jan 2024)
- AI Singapore brings inclusive Generative AI models to Southeast Asia with AWS (AWS, 24 Jan 2024)
- $70m S’pore AI initiative to develop first large language model with South-east Asian context (Straits Times, 4 Dec 2023)
Partners
Quotes and Logo