Welcome to the AI Products webpage dedicated to Large Language Models (LLMs). Our team builds the SEA-LION (Southeast Asian Languages In One Network) family of LLMs, which are specifically pre-trained and instruction-tuned for the Southeast Asian (SEA) region.

The SEA-LION model is a significant leap in the field of Natural Language Processing (NLP). It is built on the robust MPT architecture and has a vocabulary size of 256K. For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages to ensure strong model performance.

SEA-LION currently comes in two variants for the base model: a 3 billion parameter model and a 7 billion parameter model.

An instruction-tuned version of SEA-LION is also available: the 7B instruct model.
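
As a minimal sketch, one of these checkpoints might be loaded with the Hugging Face transformers library as shown below. The model identifier and the need for trust_remote_code are assumptions based on the MPT architecture; see the SEA-LION GitHub page for the published model names and usage instructions.

```python
# Minimal sketch: loading a SEA-LION checkpoint with Hugging Face transformers.
# "aisingapore/sea-lion-7b" is an assumed identifier; check the SEA-LION
# GitHub page for the released model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b"

# MPT-based checkpoints typically ship their architecture code alongside the
# weights, so trust_remote_code=True is usually required.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Selamat pagi! Hari ini saya ingin"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```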

Why SEA-LION?

Large Language Models (LLMs) are a type of artificial intelligence model designed to understand and generate human language. They are trained on vast amounts of text data and can perform a wide range of tasks such as translation, summarization, answering questions, and even writing code.

Existing LLMs display strong bias in terms of cultural values, political beliefs and social attitudes. This stems from the training data, especially data scraped from the Internet, which is disproportionately influenced by western, industrialized, rich, educated, and democratic (WIRED) societies. People from non-WIRED societies are less likely to be literate, to use the Internet, and to have their output easily accessible.

Our work on SEA-LION aims to create LLMs that cater to under-represented population groups and low-resource languages in the SEA region. The following figure shows our training data distribution.

At the heart of understanding the SEA region lies the issue of language tokenization, the vital process of breaking text down into individual word pieces for training the LLM. The tokenizers of existing popular LLMs are often English-centric. After testing a variety of tokenization approaches and evaluating their results, we created a custom SEABPETokenizer for optimal model performance. With a vocabulary size of 256K tokens, the SEABPETokenizer is designed to balance tokenizer metrics such as fertility and the proportion of continued words, the general performance of existing models, and our linguistic understanding of SEA languages.
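
The sketch below illustrates how the two tokenizer metrics mentioned above could be measured; it is not our evaluation code, and the tokenizer identifier is an assumption. Fertility is the average number of subword tokens produced per word, and the proportion of continued words is the fraction of words split into more than one token; lower values generally mean the tokenizer represents a language more compactly.

```python
# Illustrative sketch of two tokenizer metrics:
#   fertility                     = subword tokens produced per word
#   proportion of continued words = fraction of words split into >1 token
# The tokenizer ID below is an assumption; see the SEA-LION GitHub page for
# the released SEABPETokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "aisingapore/sea-lion-7b",  # assumed Hugging Face identifier
    trust_remote_code=True,
)

# Whitespace-delimited samples (a rough word proxy; it does not suit scripts
# such as Thai that omit spaces between words).
samples = [
    "Selamat pagi, apa khabar?",        # Malay
    "Chúc một ngày tốt lành!",          # Vietnamese
    "Magandang umaga sa inyong lahat",  # Filipino
]

words = [w for s in samples for w in s.split()]
token_counts = [len(tokenizer.tokenize(w)) for w in words]

fertility = sum(token_counts) / len(words)
continued = sum(1 for c in token_counts if c > 1) / len(words)
print(f"fertility: {fertility:.2f} tokens/word")
print(f"proportion of continued words: {continued:.2%}")
```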

More information is available at the SEA-LION GitHub page.

Open Source for the Community

SEA-LION is an open-source model focused on SEA languages and cultures, and it will be made freely available to the general public. We are working with many regional partners on data collaborations and use cases. But there is a lot more to do. Together, we stand on the verge of a breakthrough for SEA, and we invite you to contribute and be a part of this exciting journey.

Contact us with this SEA-LION Inquiry Form to find out more.

LLM Gallery

Stay tuned for future SEA-LION application showcases.

Partners

Quotes and Logo