Researchers from Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in Abu Dhabi and Paris University have unveiled Atlas-Chat, the first large language model (LLM) designed for Darija, Morocco’s Arabic dialect.
Developed specifically for Darija, the model is a significant step toward addressing the underrepresentation of low-resource languages such as Moroccan Arabic, which LLMs often overlook.
The team of researchers constructed their dataset by consolidating existing Darija language sources, “creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control,” the research abstract states.
Darija is the everyday language of most Moroccans and is a reflection of the richness of the Kingdom’s vibrant culture and multilingual landscape. It blends Arabic, Tamazight, French, Spanish, and increasingly English, making it a dynamic and notoriously difficult language to master or to process digitally.
Written sources of the language are scarce. Originally an oral language, it has only begun to be written down with the rise of social media and online communications. Many Moroccans use Latin script to write in Darija, while others use the Arabic alphabet.
Despite the many challenges these nuances pose, the researchers have reported “a 13% performance boost” over preexisting LLMs.
“Notably, our models are outperforming both state-of-the-art and Arabic-specialized LLMs such as LLaMa, Jais, and AceGPT,” they noted.
The researchers have released their findings for free on ResearchGate. They are committed to keeping the project open access, in hopes of encouraging others to join the effort to highlight underrepresented languages and dialects.
“We believe this is just the beginning. Countless languages and dialects are currently underrepresented in NLP [natural language processing], and we hope Atlas-Chat paves the way for more inclusive language models that reflect the true diversity of global communication,” the researchers said.