November 12, 2023 – SiloGen is today announcing a release of the first model checkpoints of a family of multilingual open source large language models (LLMs), covering all official European languages and code.
- Together with the University of Turku and HPLT, SiloGen launched at the end of August an initiative to build open multilingual LLMs, with the aim of ensuring European digital sovereignty and democratizing access to LLMs.
- The unique open source initiative combines a world-class team, access to a record amount of compute on Europe’s most powerful supercomputer LUMI, a record amount of data, and a distinctive software layer to train LLMs.
- Two months after initiating the training efforts for a family of models, we are excited to release the first checkpoint milestones for Poro 34B.
Named ‘Poro’ after the Finnish word for reindeer, this new 34 billion parameter LLM for English, Finnish and code is an early look at what is in store from our multilingual model family. Future Poro releases will expand support to other European languages and add capabilities such as an updated model architecture, an expanded context window and additional modalities. As one of the first LLM projects to do so, we will also give external researchers unprecedented access to the training of the models. In a program called Poro Research Checkpoints, we will release a series of checkpoints for the model during the training process. Sharing these checkpoints will give visibility into language model training to researchers and practitioners who do not have the resources to train their own large models from scratch.
Poro’s advanced capabilities with European languages like Finnish stem from how it addresses the core challenge for low-resource languages: training LLMs requires enormous amounts of data, but for low-resource languages like Finnish, sufficient data is simply not available. Poro addresses this by training on low-resource languages together with high-resource languages. This takes advantage of a cross-lingual signal that allows the model to achieve higher performance in the low-resource language than a monolingual model would, and has the further advantage of teaching the model basic translation capability.
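As a rough illustration of the idea of training on a low-resource and a high-resource language together, the sketch below interleaves documents from two corpora according to sampling weights. It is not Poro's actual data pipeline: the function name `mixed_stream`, the toy corpora and the 70/30 weights are all hypothetical, and the mixing ratios used for Poro are not stated here.

```python
import itertools
import random


def mixed_stream(corpora, weights, num_samples, seed=0):
    """Yield (language, document) pairs drawn according to `weights`.

    Each corpus is cycled, so a small (low-resource) corpus is repeated
    as often as needed to keep its share of the stream.
    """
    rng = random.Random(seed)
    names = list(corpora)
    probs = [weights[name] for name in names]
    cycles = {name: itertools.cycle(corpora[name]) for name in names}
    for _ in range(num_samples):
        # Pick the language first, then take its next document.
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(cycles[name])


# Hypothetical toy corpora: English is large, Finnish is small.
corpora = {
    "en": [f"english document {i}" for i in range(1000)],
    "fi": [f"finnish document {i}" for i in range(50)],
}
weights = {"en": 0.7, "fi": 0.3}  # illustrative values only

sample = list(mixed_stream(corpora, weights, num_samples=1000))
```

With a fixed seed the stream is reproducible, and because the smaller corpus is cycled, its documents recur more often than those of the larger one.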
After 30% of training, Poro already surpasses the state of the art among base models (e.g., FinGPT, Llama, Mistral) on the Finnish language benchmark FIN-bench, and in light of current experiments we expect similar results as we expand to other languages. This is achieved without compromising performance in English, for which Poro is on course to achieve performance on par with, and beyond, comparable open English-oriented models (e.g., Llama and Mistral).
Poro is the result of a collaboration between Silo AI’s generative AI arm SiloGen and the University of Turku’s TurkuNLP Group and HPLT project, bringing together cutting-edge research and industry expertise.
Features of Poro 34B
Below is a summary of key features of Poro 34B. Once training completes, we will also release instruction-tuned and chat-tuned variants of the Poro 34B base model. For details on model architecture, data and other technical information, please refer to the official model card.
- Poro Research Checkpoints: Checkpoints for the model are released throughout the training process, providing external researchers with unprecedented access to investigate the model training process.
- Model architecture: Poro 34B has 34.2 billion parameters and uses a BLOOM architecture with ALiBi embeddings to allow for context window extrapolation. While the architecture of the initial model has been kept simple, future models in progress will support additional capabilities, such as flash attention, rotary embeddings and grouped query attention.
- Multilingual capabilities: Poro is designed to process English and Finnish, and has proficiency with a variety of programming languages. Additionally, it can perform basic translation between English and Finnish.
- Open source: Poro is freely available under the Apache 2.0 License, which permits both commercial and research use.
- Dataset: The model is trained with a dataset of 1 trillion tokens, with English, Finnish and a variety of programming languages represented.
- Training details: Poro is trained using 512 AMD MI250X GPUs on the LUMI supercomputer in Finland.
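To make the note on ALiBi and context window extrapolation more concrete, here is a minimal sketch of how ALiBi (Attention with Linear Biases) is commonly described: each attention head adds a penalty to its attention scores that grows linearly with the distance between query and key. The function names, the head count of 8 and the sequence lengths are illustrative assumptions and are not taken from Poro's configuration; the sketch also only covers power-of-two head counts.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], to be added to attention scores before softmax.

    A key j at or before query i gets -slope * (i - j); keys after the
    query are masked with -inf (causal attention).
    """
    neg_inf = float("-inf")
    return [
        [
            [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for slope in alibi_slopes(num_heads)
    ]


# Illustrative sizes only (not Poro's real head count or context length).
short = alibi_bias(num_heads=8, seq_len=4)
longer = alibi_bias(num_heads=8, seq_len=16)
```

Because the bias depends only on the distance between query and key, and no learned position table is involved, the same function produces biases for any sequence length, which is the property that makes extrapolation to longer contexts possible. For 8 heads, the slopes run from 1/2 down to 1/256.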
Considerations for Use
The intended audience for Poro Research Checkpoints is academic and industry research. These checkpoints are not suitable for deployment in a production use case without further training, fine-tuning and testing.
We wish to thank the operators of the LUMI/EuroHPC supercomputer for computational resources and technical support, including AMD, HPE and CSC – the IT Center for Science, Finland. TurkuNLP researchers have received funding from the European Union’s Horizon Europe research and innovation programme High Performance Language Technologies (HPLT) under grant agreement No 101070350.
About Silo AI
Silo AI is Europe’s largest private AI lab – a trusted AI partner that brings competitive advantage to product R&D. We build AI-driven solutions and products to enable smart devices, autonomous vehicles, industry 4.0, and smart cities. Silo AI provides its customers unique access to world-class AI expertise, as well as the Silo OS infrastructure to speed up AI development and deployment. Established in 2017, Silo AI is on a mission to build a European flagship AI company, with offices currently in Finland, Sweden, Denmark, the Netherlands, Germany, the United States and Canada.
SiloGen is a large-scale initiative with the aim of building generative AI technology for Europe’s digital sovereignty. As Silo AI’s generative AI arm, SiloGen combines some of Europe’s leading generative AI and large language model (LLM) experts with access to data sources, powerful computational resources and infrastructure to train, run and operate LLMs. SiloGen has been operational since late 2022 and is currently working with clients like Allianz, Happeo, Sandvik and Tietoevry. As a trusted provider, SiloGen offers base and specialized models as well as a development platform to ensure accurate, trustworthy and robust downstream applications.
The TurkuNLP Group is a group of researchers at the University of Turku, with a research focus on various aspects of natural language processing, language technology and digital linguistics. TurkuNLP has contributed to a large number of open source NLP resources, such as FinBERT, WikiBERT, FinGPT, Turku Dependency Treebank, Universal Dependencies, Turku Neural Parsing Pipeline, Large internet corpora, Turku Paraphrase Corpus, Turku Sentiment Corpus, Wikidata normalization, TurkuONE etc. The University of Turku is an international academic community of 25,000 students and staff and was ranked among the 301–400 best universities in the 2023 Shanghai Ranking.