The MarIA project, the family of language models created at the Barcelona Supercomputing Center – National Supercomputing Center (BSC-CNS), built from the web archives of the National Library of Spain (BNE) and framed and financed within the Language Technologies Plan of the Secretary of State for Digitization and Artificial Intelligence (SEDIA), has advanced in its development: its new version can summarize existing texts and create new texts from headlines or a few words.
The MarIA project is the first massive artificial intelligence system specialized in understanding and writing the Spanish language. By its volume and capabilities, it places Spanish third among the languages with massive open-access models, after English and Mandarin. It has been built from the digital documentary heritage of the National Library of Spain, which crawls and archives websites written in Spanish, and has been trained on the MareNostrum 4 supercomputer. It is published openly so that application developers, companies, research groups and society at large can use it in countless applications.
The latest advances in MarIA constitute a milestone toward the objectives of the National Artificial Intelligence Strategy and the Recovery, Transformation and Resilience Plan, with which Spain intends to lead the world in the development of tools, technologies and applications for the projection and use of the Spanish language in the areas where AI is applied. Specifically, the National Language Technologies Plan, of which this project forms part, aims to promote the development of natural language processing, machine translation and conversational systems in Spanish and the co-official languages.
Models for understanding language and models for generating texts
A language model is an artificial intelligence system made up of a set of deep neural networks that have been trained to acquire an understanding of the language, its lexicon, and its mechanisms for expressing meaning and writing at an expert level. These complex statistical models, which relate words in texts in a systematic and massive way, are capable of "understanding" not only abstract concepts, but also their context. With these models, developers of different applications can create tools for multiple uses, such as classifying documents or creating proofreaders or translation tools.
The first version of MarIA was developed with RoBERTa, a technology that creates "encoder"-type language models. Given a text sequence, these models generate an interpretation of it that can be used, for example, to classify documents, answer multiple-choice questions, find semantic similarities between texts or detect the sentiments expressed in them.
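The encoder idea, mapping a text to a representation that can then be compared or classified, can be illustrated with a deliberately tiny sketch. The bag-of-words vectors and cosine similarity below are a toy stand-in for the dense contextual embeddings a RoBERTa-style encoder actually produces; none of this code, names included, is part of MarIA itself.

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    # Crude stand-in for an encoder: map a text to a sparse
    # bag-of-words vector instead of a dense contextual embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "the library preserves millions of websites in spanish",
    "the national library archives spanish web pages",
    "the supercomputer trains deep neural networks",
]
query = vectorize(docs[0])
scores = [cosine(query, vectorize(d)) for d in docs]
print(scores)  # the two library texts score closer to each other than to the third
```

With real contextual embeddings the comparison is far more robust (synonyms and word order matter), but the pipeline — encode each text, then compare representations — is the same.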
The new version has been created with GPT-2, a more advanced technology that creates generative "decoder" models and adds new capabilities to the system. Given a text sequence, decoder models can generate new text. They can therefore be used, for example, to produce automatic summaries, simplify complicated writing for different user profiles, generate questions and answers, hold complex dialogues with users and even write complete texts, which could appear to have been written by humans, from a headline or a small number of words.
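The decoder idea can likewise be sketched at toy scale. A GPT-style model predicts the next token from everything generated so far; the bigram table below is a drastically simplified stand-in for that mechanism (the corpus and all names here are invented for illustration, not taken from MarIA).

```python
import random
from collections import defaultdict

# Toy "decoder": learn which word tends to follow which from a tiny corpus.
corpus = ("maria is a language model trained on spanish text . "
          "the model can generate spanish text from a headline . "
          "the model can summarize text .").split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def generate(prompt, length=8, seed=0):
    # Continue the prompt left to right, sampling one next word at a
    # time -- the same autoregressive scheme GPT-style models use at scale.
    random.seed(seed)
    words = prompt.split()
    for _ in range(length):
        candidates = follows.get(words[-1])
        if not candidates:
            break
        words.append(random.choice(candidates))
    return " ".join(words)

print(generate("the model"))
```

A real decoder conditions on the whole preceding context with billions of parameters rather than on a single previous word, which is what makes coherent summaries and full articles possible.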
These new capabilities make MarIA a tool that, with "ad hoc" training adapted to specific tasks, can be very useful for application developers, companies and public administrations. For example, the models that have been developed in English so far are used to generate text suggestions in writing applications, to summarize contracts or the complicated documents that detail the benefits of a product, depending on what each user wants to know, and to search for specific information within large text databases and relate it to other relevant information.
“With projects like MarIA, which will be incorporated into the PERTE for the development of a digital economy in Spanish, we are taking firm steps towards an artificial intelligence that thinks in Spanish, which will multiply economic opportunities for companies and the Spanish technology industry. Because language is much more than a means of communication: it is a projection of our way of seeing the world, also in the new digital reality,” points out the Secretary of State for Digitization and Artificial Intelligence, Carme Artigas.
“As the institution responsible for electronic legal deposit, the National Library of Spain (BNE) preserves millions of websites: millions of words repeated in a given context, the product of many crawls of the Spanish web (both of the .es domain and selective) carried out over the years by the BNE teams, which together make up the great corpus of the Spanish spoken in our country today,” explains Ana Santos, director of the BNE. “For us it is a great satisfaction that these archives are useful for this pioneering project, based on artificial intelligence technologies, which will allow machines to understand and write in Spanish, a milestone in the field of natural language processing.”
"We appreciate SEDIA's initiative to promote future issues, such as the promotion of the Spanish language in the digital world and the AI environment," says the director of the BSC-CNS, Mateo Valero. We are delighted to put our experts in natural language and artificial intelligence and the calculation capacity of our infrastructures at the service of relevant challenges for society, such as the one that this initiative responds to”.
For her part, the director of the BNE's Division of Digital Processes and Services, Mar Pérez Morillo, highlighted that "in the collections we focus on events that have influenced or marked society and its language". Likewise, the BNE actively cooperates with the regional collection centers, which use the tools that the BNE makes available to them. "We are in a race against time, developing strategies and tools to fight against what is called the digital dark age," explained Pérez Morillo.
Trained with more than 135 billion words and 969 exaflops of computation
In language models, the number of parameters with which the system is trained is what gives it the greatest capacity for generalization and, therefore, intelligence. The data from the National Library with which MarIA has been trained consists of more than 135 billion words (135,733,450,668, to be exact), occupying a total of 570 gigabytes.
To create and train MarIA, the BSC's MareNostrum supercomputer has been used, and computing power of 969 exaflops, roughly 9.7 × 10^20 floating-point operations, was needed. A flop (floating-point operation) is the basic unit of this computation; a supercomputer's speed is expressed in flops per second, and exa is the prefix for 10^18, that is, a trillion in the long scale used in Spanish.
Of these 969 exaflops, 201 were needed to process the data from the National Library, removing everything that was not well-formed text (page numbers, graphics, unfinished sentences, wrong encodings, duplicate sentences, other languages, etc.) and keeping only correct Spanish text as it is actually used. The remaining 768 exaflops were used to train the neural networks of the GPT-2 model.
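The compute figures above are internally consistent, and the corpus numbers allow a rough cross-check. The short sketch below just redoes that arithmetic; the bytes-per-word average is derived here and is not stated in the source.

```python
# "exa" is the SI prefix for 10**18, so the exaflop figures here count
# total floating-point operations (not operations per second).
EXA = 10 ** 18

cleaning = 201 * EXA  # filtering the BNE web archive down to clean Spanish text
training = 768 * EXA  # training the GPT-2-style neural networks
total = cleaning + training
assert total == 969 * EXA  # matches the 969-exaflop total
print(f"total: {total:.2e} operations")  # about 9.7e+20

# Corpus-size cross-check: 570 GB holding ~135.7 billion words.
words = 135_733_450_668
bytes_per_word = 570 * 10 ** 9 / words
print(f"average: {bytes_per_word:.1f} bytes per word")
```

The average of roughly 4 bytes per word is plausible for plain Spanish text plus whitespace, which supports the stated corpus figures.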
The current version of MarIA will now give rise to specialized versions in different application areas, including biomedical and legal, and will evolve to solve the specific problems mentioned above.
In parallel, PlanTL will continue to expand MarIA in order to: adapt it to new technological developments in natural language processing (models more complex than the GPT-2 now implemented, trained with greater amounts of data); create workspaces that make it easier for companies and research groups to use MarIA in appropriate computing environments; and embed the models in systems for evaluating and certifying the quality of the systems developed in the different domains.