This is ALIA, the AI promoted by the Spanish Government

The Spanish government has announced the creation of ALIA, an open and public AI. What are its objectives, is it a good idea and what applications will it have?
The President of the Government, Pedro Sánchez, used his speech at the closing of the meeting ‘HispanIA: how artificial intelligence will improve our future’ to announce the development of ALIA, a public artificial intelligence (AI) infrastructure in Spanish and the co-official languages, open and funded 100% with public resources, as reported a few weeks ago.
The project is coordinated by the Barcelona Supercomputing Center-Centro Nacional de Supercomputación (BSC-CNS), and is driven and led by the Secretary of State for Digitalisation and Artificial Intelligence.
‘The Government of Spain is promoting ALIA with the aim of strengthening the country’s technological sovereignty and positioning it as a benchmark in AI in Europe. This project seeks to reduce dependence on international models, promote the use of Spanish and the co-official languages in the technological field and democratise access to AI. Moreover, it is aligned with the Artificial Intelligence Strategy and the European AI Regulation, promoting a transparent, responsible and accessible technology for citizens, companies and public institutions’, says Marc Bara, professor at OBS Business School.
‘ALIA also responds to the need to improve Spain’s economic competitiveness through technological innovation, generating capabilities in key sectors such as public administration, scientific research and business development. In addition, President Pedro Sánchez stressed that this initiative seeks to ensure that Spain plays a relevant role in the global development of AI, especially in languages other than English,’ he adds.
Juan Ignacio Moreno, head of AI Solutions & Strategy at Innova-tsn, points out that ‘it seeks to promote Spanish and the co-official languages, correcting the biases and limitations generated by models trained in English’.
‘The LLMs we use have been trained on large amounts of text, called a corpus, by technology companies, mainly American, such as Google, OpenAI, Meta… This means that most of that text was written in English, so it is in this language that they work most accurately. With ALIA, the aim is to obtain LLMs trained on texts written in the languages of our country, which would make it possible to have language models optimised in our languages,’ adds Pablo Méndez, director of Altia’s Artificial Intelligence area.
This could also help Spain gain influence across Latin America, bearing in mind that Spanish is the fourth most spoken language in the world, after English, Chinese and Hindi. ‘The Spanish administration has considered it essential for a Spanish-speaking country to lead the development of generative AI models specifically trained in Spanish,’ says Pablo Beldarrain, Generative AI solutions leader at Neoris Spain.
In addition, Moreno notes that ALIA aims to ‘reduce dependence on AI models controlled by large US corporations and guarantee its own infrastructure to avoid risks such as censorship, data loss or interruptions in critical services’.
Similarly, Beldarrain says that ‘this initiative not only benefits Spain, but also represents a strategic advance for Europe’. ‘In a context where AI development is dominated by US tech giants, it is crucial for Europe to strengthen its independence in this area. The recent emergence of DeepSeek, a cost-effective LLM developed in China, is evidence of how other regions are already investing in models adapted to their own languages and needs. This further reinforces the need for Europe to play an active role in the evolution of generative AI,’ he says.
What is ALIA all about?
‘ALIA is a pre-trained language model. To generate this model, a system has been trained with a corpus of more than 17 billion words, spread over 34 million documents. In other words, it is the equivalent of ChatGPT, but instead of having to pay a US company to use it, it is freely accessible and feeds the Spanish innovation ecosystem,’ explains Luis de la Fuente, deputy director of Research at the School of Engineering and Technology and principal investigator of the “Data Driven Science” group at the International University of La Rioja (UNIR).
In this way, Méndez specifies that the first step will be ‘to create a very extensive and high quality corpus in the target Iberian languages, to then train from scratch open foundational language models that improve their performance in these target languages and subsequently allow the creation of high quality products in these languages’.
Thus, ALIA encompasses a family of AI models with the aim of promoting technological solutions adapted to the characteristics of our country. ‘Specifically, pilot projects are currently being implemented to validate and evolve this strategy, such as an internal chatbot in the Tax Agency, to streamline its operation and citizen service; and an application in primary care that, through advanced data analysis, improves the early diagnosis of heart failure,’ says José Antonio Lozano, head of AI & Business Innovation at Tokiota.
In addition, Bara emphasises that these projects have been conceived ‘with principles of ethics and transparency, minimising bias and complying with the European AI Regulation’.
Moreno also emphasises that ALIA is an open source project, ‘which allows both public and private entities to use and adapt it according to their needs’.
In this sense, the OBS Business School expert explains that ALIA ‘offers accessible resources for researchers, companies and developers, including tools such as “ALIA Kit” for commercial, educational and scientific applications’.
Another aspect that distinguishes ALIA from other better-known generative AI models is its funding and support scheme, with an initial investment of 10.1 million euros. ‘It has the backing of the public administration, through European and Spanish funds, which guarantees its viability and long-term projection. In addition, specific aid is planned to promote the development of applications in strategic sectors such as health, public administration and research,’ explains the Neoris manager.
He also stresses the importance of the project having the support of the Barcelona Supercomputing Center. ‘The training of large-scale language models requires massive computational capacity, an investment that only the European public administration or US and Chinese technology giants can afford. In this respect, ALIA has the backing of the Barcelona supercomputer, enabling it to compete with other major international infrastructures. This commitment to European technological autonomy is a significant difference with respect to previous projects,’ he explains.
Spanish languages, a differential element
In addition to its open and public nature, there is no doubt that ALIA’s differential element is its focus on Spanish and the different co-official languages of our country: Galician, Basque, Catalan and Valencian.
‘The main difference is the text corpus on which the model is trained: a collection of high quality texts, not just extracted from the internet, in the target languages. This corpus will also be open and available to any organisation on the planet that wants to train an LLM that is versed in our languages. Moreover, thanks to the care and quality standards with which this corpus is built, there will be no intellectual property issues, unlike with other corpora downloaded from the internet,’ says Méndez.
For example, Moreno points out that ‘it uses public language corpora such as parliamentary documents and scientific repositories, as well as anonymised health data, which reduces bias and improves accuracy in technical fields such as administration, science and medicine’.
He also notes that ‘ALIA’s family of models is scalable, with various sizes adapted to the needs of different sectors, from small companies to public bodies, thus facilitating its adoption’. He also highlights that ‘ALIA promotes sustainability through energy efficiency and carbon footprint reduction’.
A necessary project?
Several language models are already on the market, so it is fair to ask whether a publicly funded project of this kind is really worthwhile.
‘Although ALIA is presented with distinctive features, a detailed analysis reveals that these differences may not be as significant as claimed. The model boasts training from scratch, using the MareNostrum 5 supercomputer, processing 6.9 trillion tokens in 35 European languages. However, the results question the effectiveness of this approach,’ says Bara.
‘The reality is that ALIA comes into an ecosystem where highly competent multilingual solutions already exist. The performance data is particularly revealing. In standard natural language inference (NLI) tests, ALIA achieves just 51.77% accuracy in the XNLI_en test, while Llama 2, launched in July 2023, achieves 66%. In question answering tasks, the gap is even more significant. ALIA scores 81.53% on SQuAD_en, compared to 93-94% for Llama 2. These numbers are not just statistics, but represent the model’s actual ability to understand and process natural language, a fundamental skill for any practical application,’ he says.
He therefore believes that ‘the justification for developing a Spanish model of its own could lie in arguments of technological sovereignty and control over the data’. However, he adds that ‘ALIA’s current approach raises significant doubts about its ability to meet even this objective’.
‘Despite presenting itself as a tool to enhance Spanish, only 16.12% of its training data is in our language, while English dominates, with 39.31%. This linguistic distribution is reflected in its performance. The model shows better results in English than in Spanish, contradicting its fundamental purpose,’ he warns.
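The figures Bara cites can be put side by side with a few lines of arithmetic. The following is a minimal sketch in Python, assuming the percentages quoted above are accurate; they are not independently verified here, and the variable and function names are illustrative only. Since the article gives Llama 2's SQuAD_en score as a range (93-94%), the lower bound is used, so that gap is a minimum.

```python
# Figures as quoted in the article (attributed to Marc Bara); not independently verified.
benchmarks = {
    "XNLI_en": {"ALIA": 51.77, "Llama 2": 66.0},
    # The article gives 93-94% for Llama 2; the lower bound is assumed here.
    "SQuAD_en": {"ALIA": 81.53, "Llama 2": 93.0},
}

# Share of training data by language, in percent, as quoted in the article.
training_share = {"Spanish": 16.12, "English": 39.31}


def gap_in_points(scores):
    """Return Llama 2's lead over ALIA in percentage points."""
    return round(scores["Llama 2"] - scores["ALIA"], 2)


gaps = {name: gap_in_points(scores) for name, scores in benchmarks.items()}

# How many times larger the English share is than the Spanish share.
english_to_spanish = round(training_share["English"] / training_share["Spanish"], 2)

for name, gap in gaps.items():
    print(f"{name}: Llama 2 leads by {gap} percentage points")
print(f"English share is {english_to_spanish}x the Spanish share")
```

On these numbers, Llama 2 leads by 14.23 points on XNLI_en and by at least 11.47 points on SQuAD_en, while the English share of the training data is roughly 2.44 times the Spanish share.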
The economic aspect is particularly worrying when considering the alternative pointed out by industry experts. ‘Google engineers have pointed out that a process of fine-tuning existing models with 17 billion tokens could have produced results at a fraction of the current cost. This observation is particularly relevant considering that the Llama 2 comparison tables were removed from the official documentation, raising concerns about transparency in the management of public resources,’ he says.
Furthermore, he believes that the transparency and open source code under the Apache 2.0 licence with which it has been developed ‘are positive elements, but not unique’. ‘Other projects, such as Llama 2, also offer open access to their models. The recent DeepSeek is also open source,’ he says.
On the other hand, he points out that the pilot projects announced for ALIA, mentioned above, ‘although interesting, do not seem to justify the magnitude of the investment made’.
‘The real need, perhaps, was not to develop a model from scratch, but to strengthen the basic digital infrastructure of the public administration, an area where the shortcomings are evident and the impact on citizens would be immediate,’ he says.
‘We could say that, while the intention behind ALIA may be laudable – to promote Spanish technological innovation and independence in AI – the actual execution suggests that it was not the most necessary or efficient approach. A more strategic approach would have been to invest in the specialisation of existing models for specific use cases of the Spanish public administration, combined with a substantial upgrade of the basic digital infrastructure,’ he concludes.
Despite this, other experts believe that such a project is indeed necessary. For example, the head of Innova-tsn emphasises that ALIA addresses ‘three critical problems’.
‘On the one hand, there is the language gap, given that Spanish is the third most spoken language on the internet, but few AI models are trained in this language, which causes biases in key sectors such as medicine or law.’
‘It is true that the accuracy of translations of US and Chinese foundational models when interacting in Spanish is very high, but training the ALIA foundational model with a mostly local document corpus can improve its adaptability to the specific needs of Spanish citizens,’ he adds.
Furthermore, Altia’s AI director believes that ‘all states should have language models trained on large corpora of text written in the languages spoken by their citizens, as only in this way can the best solutions based on linguistic AI, generative or otherwise, be developed’.
The second aspect refers to strategic autonomy, ‘since dependence on foreign models leads to sensitive data being processed outside Spain and the European Union, affecting privacy and security’, says the Innova-tsn expert.
‘A policy that generates competitiveness was needed, with models and tools that can compete in this market and that are based within the European Union,’ agrees the UNIR professor. ‘It is important that Spain and the European Union continue to take steps in this direction to achieve greater technological autonomy and compete with other powers such as the United States and China,’ adds Lozano.
And the last open front has to do with inclusive innovation, ‘since ALIA, being a public and open source infrastructure, democratises access to AI, boosting competitiveness and local technological development,’ Moreno explains.
Likewise, the Tokiota representative believes that ‘this project also helps to democratise advanced technologies among small and medium-sized enterprises, which contributes to increasing their competitiveness’, since it gives companies the opportunity to create customised applications.
‘The fact that the licence is open means that users have total freedom to use it, without being subject to licences that may change over time,’ adds the UNIR professor.
On the other hand, Beldarrain emphasises that this initiative could be fundamental in attracting and retaining talent. ‘Spain has enormous potential in terms of technological and engineering talent, but it has not yet established itself as a global hub of reference. Some cities, such as Malaga, have already become technology hubs with a growing community of AI developers and professionals. This project will help to strengthen this trend and expand it to other areas of the country, consolidating Spain as a centre of AI innovation,’ he explains.
What will it be used for?
ALIA’s potential applications are many and extend to various sectors. ‘Wherever computers need to interact with people in natural language or understand information written in free text, ALIA can be applied: state-of-the-art chatbots specialised in our languages, extracting structured information from unstructured text so that people do not have to key in large amounts of information by hand, the best text translators between our co-official languages, exploitation of free text such as medical reports, summaries of press or judicial texts and much more,’ lists the director of Altia’s AI area.
For example, De la Fuente points out that Spanish companies and administrations will be able to use ALIA to generate chatbots that facilitate service with their users. ‘It is another competitor in the LLM market, with the characteristic of being free to use and developed in our context,’ he explains.
Likewise, the head of Innova-tsn points out that ‘digitised access to public services supported by ALIA could be particularly beneficial due to the knowledge of the specific regulations and organisational structure of the Administration’.
For his part, the Neoris expert believes that ‘it will contribute to the optimisation of services and to the improvement in the management of aid and resources, especially in key sectors such as assistance to people with special needs’, as it will allow the automation of bureaucratic processes, improve communication with citizens and offer more agile and efficient solutions.
In the field of research, he predicts that ‘it could contribute to the creation of specialised models in linguistic and medical studies in Spanish, boosting the development of AI in scientific and academic contexts’. In fact, he argues that ‘perhaps in this field its exploitation could provide a differential added value, subject to the challenges of continuous investment, adaptability and speed of evolution that the market for AI products will have to face’.
Beldarrain maintains that ‘its use will make it possible to speed up scientific processes through advanced image analysis and the optimisation of large volumes of data’, as generative AI ‘will represent a qualitative leap in the ability of scientists to interpret information and draw conclusions more quickly and accurately’.
Similarly, he predicts that ‘ALIA will have a significant impact on improving the detection of diseases such as cancer, the personalisation of treatments and the optimisation of medical resource management’. ‘From the analysis of medical records to the development of support tools for healthcare workers, generative AI will facilitate access to more accurate diagnoses and more efficient healthcare systems,’ he adds.
Lozano also believes it will be able to facilitate the translation of information into and out of Spanish and the co-official languages in which the AI has been trained. ‘Thanks to the automation provided by artificial intelligence, content can be translated efficiently and accurately, reducing the effort that the administration and companies must make to maintain linguistic co-officiality.’
In short, the OBS Business School expert believes that the main beneficiaries of ALIA will be ‘academic and research institutions, which could study and improve the model thanks to its open code; public administrations seeking solutions adapted to local regulatory frameworks; and Spanish technology companies, which could develop specific applications’.
What benefits can it bring?
Moreno predicts that ALIA ‘can offer significant benefits in different areas, especially for SMEs, citizens and the positioning of Spain in Europe in terms of AI’.
In the business sphere, he says it can help reduce the costs of using AI, ‘thanks to access to public and open models and possible government subsidies that encourage the adoption of AI, boosting innovation and competitiveness’.
In the case of citizens, he says that ‘the implementation of ALIA in public services could improve speed and accuracy in areas such as healthcare and tax administration, resulting in faster medical diagnoses, simplified tax procedures and better interaction with the administration’.
At the institutional level, he considers that this initiative ‘reinforces Spain’s cultural identity, adapting AI to co-official languages, ensuring that Catalan, Galician, Basque and Valencian speakers have access to advanced technology without language barriers and preserving the country’s cultural diversity’.
Finally, he believes that ALIA positions our country as an important player in ethical and multilingual AI in the European context, ‘contributing to European technological sovereignty and strengthening the continent’s independence from models controlled by large foreign corporations, promoting technological development based on transparency and accessibility’.
On this last point, it is worth being cautious. Although it may help to reduce dependence on technology providers based outside European borders, the Tokiota expert believes that ‘it is premature to say that ALIA guarantees Spain’s technological sovereignty’. He stresses that ‘its success will depend on the effective implementation and evolution of the project’.
Furthermore, Bara believes that ALIA is born with important limitations in the face of the competition already established in the market, as we saw earlier. Therefore, he believes that ‘a more effective strategy for technological sovereignty could have been to invest in the adaptation and specialisation of existing models for specific Spanish needs, combined with the development of robust digital infrastructure and the training of local AI talent’.