Europe’s Open Language Model Family Poro Extends Checkpoints, Languages and Modalities

November 30, 2023 – Poro is a family of multilingual open source large language models (LLMs), with the aim of strengthening European digital sovereignty and democratizing access to LLMs. To ensure transparency and openness, and as part of the Poro Research Checkpoint program, we are today announcing new model checkpoints, as well as the next-generation models with additional languages and modalities.

Together with the University of Turku and HPLT, SiloGen launched an initiative to build a family of open multilingual LLMs with a world-class team, access to a record amount of compute and data, and a distinctive software layer to train LLMs.
Two months later, we are now releasing the next two checkpoint milestones, covering a total of 50% of training for Poro 34B. Model evaluations show strong performance on low-resource languages, with best-in-class performance for the Finnish language.
As a next step, the model family adds support for the Nordic languages, including Swedish, Norwegian, Danish and Icelandic, and announces a partnership with LAION, adding vision capability and commencing the training of multimodal models.

In mid-November, we published the first three checkpoints of Poro 34B, a multilingual, open European language model that shows evidence of strong performance on low-resource languages like Finnish without compromising performance in English. We’re now publishing the next two checkpoints for Poro 34B, bringing the total to 50% of the model’s training. After five model checkpoints, the results show that Poro 34B already outperforms all existing open language models on the Finnish language, including FinGPT, Mistral, Llama and the BLUUMI 176 billion parameter model, among others. FinGPT is the first large generative Finnish language model (Luukkonen et al., forthcoming, EMNLP).

“I’m proud of the results we have already been able to achieve with the Poro models. Already at this stage, I believe it’s safe to say that Poro 34B is, to date, the best open Finnish language model available. It’s inspiring to see how we have been able to use some of the learnings from FinGPT and the BLUUMI 176 billion parameter model, improve on those, and now have an even better model. We expect to reach 100% of training Poro 34B in the coming weeks,” says Research Fellow Sampo Pyysalo from TurkuNLP.

Added languages and modalities

Encouraged by the strong initial model results, we are now excited to announce a set of new models in the Poro model family, with additional capabilities. We have commenced training Poro 2, a model which covers English, Finnish, Swedish, Norwegian, Danish, Icelandic and code. Poro 2 has an updated and more modern architecture, and comes in a variety of model sizes. This is an important step towards the aim of covering all European languages, and our vision of European digital sovereignty with AI infrastructure for European companies to benefit from.

Poro 2: a modern architecture with more languages

Below is a summary of the key features of Poro 2. When training completes, we will additionally release instruction-tuned and chat-tuned variants of the Poro base models. For transparency with respect to model architecture, data and other technical information, please refer to the official model card.

Poro Research Checkpoints: Checkpoints for the model are released throughout the training process, providing external researchers with unprecedented access to investigate the model training process.
Model architecture: Poro 2 comes in several model sizes and upgrades the Poro 1 architecture (BLOOM with ALiBi embeddings) with capabilities such as flash attention, rotary embeddings and grouped query attention, along with support for additional languages.
Multilingual capabilities: Poro 2 is designed to process English and Nordic languages, and has proficiency with a variety of programming languages. Additionally, it can perform basic translation between English and Nordic languages.
Open source: Poro 2 is freely available under the Apache 2.0 License, which permits both commercial and research use.
Training details: Poro is trained using 512 AMD MI250X GPUs on the LUMI supercomputer in Finland.
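
For readers less familiar with the two positional schemes named in the architecture summary, the following is a minimal, dependency-free sketch of how ALiBi (used in Poro 1) and rotary embeddings (listed for Poro 2) encode position. It is an illustration only, not Poro's training code. The function names are hypothetical, and the slope formula and the base of 10000 are the commonly published defaults, assumed here rather than taken from the Poro model card.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8*(h+1)/n_heads); assumes n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(slope, seq_len):
    """Causal bias added to attention scores: slope * (j - i) for j <= i, -inf for future keys."""
    return [
        [slope * (j - i) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2k], vec[2k+1]) by the angle pos * base^(-2k/d)."""
    d = len(vec)
    out = []
    for k in range(d // 2):
        angle = pos * base ** (-2.0 * k / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * c - y * s)
        out.append(x * s + y * c)
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Example inputs (arbitrary values chosen for illustration).
query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.7, 0.2]

# With rotary embeddings the score depends only on the relative offset:
# positions (3, 1) and (10, 8) both have offset 2.
score_near = dot(rope(query, 3), rope(key, 1))
score_far = dot(rope(query, 10), rope(key, 8))
```

The design difference is where position enters the computation. ALiBi leaves queries and keys untouched and adds a distance-proportional penalty to the attention scores, while rotary embeddings rotate the queries and keys themselves so that their dot product depends only on the relative offset between positions.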

Language models with vision

While extending support to additional European languages, we are now also announcing that the upcoming model generations will add vision to their capabilities. This is enabled through a partnership with LAION (Large-scale Artificial Intelligence Open Network) for building a set of multimodal models. LAION is a global non-profit organization, with an aim to make large-scale data sets, machine learning models and related code publicly available. They provide assets, such as the LAION-5B dataset and the open toolbox for NSFW and toxicity detection LAION-SAFETY, for developing safe, trustworthy and reliable multimodal models. Among other things, their assets underpin the image generation tool Stable Diffusion. This partnership will introduce vision capabilities to the Poro model family through a modular architecture by providing vision to existing models, as well as opening up opportunities for additional multimodal architectures in the future.

“In line with the plan to cover all European languages, it’s a natural step to start with an extension to the Nordic languages. And it’s likewise natural to extend Poro with vision. Through a partnership with LAION, multimodal models help in expanding the potential use cases and possibilities for value creation. Models with vision capabilities will be able to interpret, summarize, and describe documents containing both text and images. Like textual data, we see an even larger potential for generative AI to consolidate large amounts of data of different modalities,” Peter Sarlin, Silo AI CEO and co-founder, notes.

The collaboration with LAION brings together industry expertise and experience, strong and rigorous academic research, and an open source philosophy. This is a strong foundation for ensuring trustworthy, reliable and robust models. We hope the level of transparency enabled by our open source approach, in combination with the Poro Research Checkpoint program, will add to the trust we have been able to build with partners and clients alike.

Considerations for Use

The intended audience for Poro Research Checkpoints is academic and industry research. These checkpoints are not suitable for deployment in a production use case without further training, fine-tuning and testing.

Acknowledgments

We wish to thank the operators of the LUMI/EuroHPC supercomputer for computational resources and technical support, including AMD, HPE and CSC – the IT Center for Science, Finland. TurkuNLP researchers have received funding from the European Union’s Horizon Europe research and innovation programme High Performance Language Technologies (HPLT) under grant agreement No 101070350.

About Silo AI

Silo AI is Europe’s largest private AI lab – a trusted AI partner that brings competitive advantage to product R&D. We build AI-driven solutions and products to enable smart devices, autonomous vehicles, industry 4.0, and smart cities. Silo AI provides its customers unique access to world-class AI expertise, as well as the Silo OS infrastructure to speed up AI development and deployment. Established in 2017, Silo AI is on a mission to build a European flagship AI company, with offices currently in Finland, Sweden, Denmark, the Netherlands, Germany, the United States and Canada.

www.silo.ai

About SiloGen

SiloGen is a large-scale initiative with the aim of building generative AI technology for Europe’s digital sovereignty. As Silo AI’s generative AI arm, SiloGen combines some of Europe’s leading generative AI and large language model (LLM) experts with access to data sources, powerful computational resources and infrastructure to train, run and operate LLMs. SiloGen has been operational since late 2022 and is currently working with clients like Allianz, Happeo, Sandvik and Tietoevry. As a trusted provider SiloGen offers base and specialized models as well as a development platform to ensure accurate, trustworthy and robust downstream applications.

About TurkuNLP

The TurkuNLP Group is a group of researchers at the University of Turku, with a research focus on various aspects of natural language processing, language technology and digital linguistics. TurkuNLP has contributed to a large number of open source NLP resources, such as FinBERT, WikiBERT, FinGPT, Turku Dependency Treebank, Universal Dependencies, Turku Neural Parsing Pipeline, Large internet corpora, Turku Paraphrase Corpus, Turku Sentiment Corpus, Wikidata normalization, TurkuONE etc. The University of Turku is an international academic community of 25,000 students and staff and was ranked among the 301–400 best universities in the 2023 Shanghai Ranking.

About LAION

LAION is a non-profit organization bringing together a diverse community passionate about advancing the field of machine learning for the greater good. Our mission is to democratize access to large-scale machine learning models, datasets, and code, fostering collaboration and innovation on a worldwide scale. We invite you to be part of our movement. Join us and explore the possibilities of a future where machine learning is a force for positive change.
