Welcome to the AI Products webpage, which is dedicated to Large Language Models (LLMs). Our team builds SEA-LION (Southeast Asian Languages In One Network), a family of LLMs specifically pre-trained and instruct-tuned for the Southeast Asian (SEA) region.
The SEA-LION model is a significant step forward for Natural Language Processing (NLP) in SEA languages. It is built on the MPT architecture and has a vocabulary size of 256K tokens. For tokenization, the model uses our custom SEABPETokenizer, which is tailored for SEA languages to improve model performance.
A non-commercial 7B instruct model is also available.
Large Language Models (LLMs) are a type of artificial intelligence model designed to understand and generate human language. They are trained on vast amounts of text data and can perform a wide range of tasks such as translation, summarization, answering questions, and even writing code.
Existing LLMs display strong bias in terms of cultural values, political beliefs and social attitudes. This is due to the training data, especially data scraped from the Internet, which often originates disproportionately from WEIRD societies. WEIRD refers to Western, Educated, Industrialized, Rich, Democratic societies. People of non-WEIRD origin are less likely to be literate, to use the Internet, and to have their output easily accessed.
Our work in SEA-LION aims to create LLMs with increased representation of the non-WEIRD population groups and the low resource languages in the SEA region. The following figure shows our training data distribution.
At the heart of understanding the SEA region lies the issue of language tokenization, the process of breaking down text into individual word pieces for training the LLM. Existing tokenizers of popular LLMs are often English-centric. After testing a variety of tokenization approaches and evaluating their results, we created a custom SEABPETokenizer aimed at optimal model performance. With a vocabulary size of 256K tokens, it is designed to balance three considerations: fertility (the average number of tokens produced per word) together with the proportion of continued words (the share of words split into more than one token), the general performance of existing models, and our linguistic understanding of SEA languages.
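To make the two tokenizer metrics concrete, here is a minimal Python sketch of how fertility and the proportion of continued words could be computed for any tokenizer. It is not the SEA-LION evaluation code. The names `tokenizer_stats` and `toy_tokenize` are hypothetical, the toy tokenizer simply cuts words into fixed-size chunks as a stand-in for a real one, and words are assumed to be separated by whitespace, which does not hold for every SEA language (Thai, for example, is written without spaces between words).

```python
from typing import Callable, Iterable, List, Tuple


def tokenizer_stats(
    words: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[float, float]:
    """Return (fertility, proportion of continued words) for a list of words.

    fertility: average number of tokens produced per word.
    proportion of continued words: share of words split into 2+ tokens.
    """
    total_words = 0
    total_tokens = 0
    continued_words = 0
    for word in words:
        pieces = tokenize(word)
        total_words += 1
        total_tokens += len(pieces)
        if len(pieces) > 1:
            continued_words += 1
    if total_words == 0:
        raise ValueError("need at least one word")
    return total_tokens / total_words, continued_words / total_words


def toy_tokenize(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut a word into chunks of max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


# Assumption: whitespace-separated words (an Indonesian/Malay sample sentence).
sample = "saya suka makan nasi goreng".split()
fertility, continued = tokenizer_stats(sample, toy_tokenize)
print(f"fertility={fertility:.2f}, continued={continued:.2f}")
# prints: fertility=1.40, continued=0.40
```

Lower values on both metrics mean the tokenizer splits words less aggressively. To compare real tokenizers, pass each one's own tokenize function in place of `toy_tokenize` and run it over the same text sample in each target language.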
More information is available at the SEA-LION GitHub page.
Open Source for the Community
SEA-LION is an open source model focused on SEA languages and cultures, and it will be made freely available to the general public. We are working with many regional partners on data collaborations and use cases, but there is a lot more to do. Together, we stand on the verge of a breakthrough for SEA, and we invite you to contribute and be a part of this exciting journey.
Contact us at firstname.lastname@example.org to find out more.
Stay tuned for future SEA-LION application showcases.
Quotes and Logo