New open source tool catalogs African language resources

Lanfrica enables research on any of the current and extinct languages from the African continent.
Pixelated globe

Geralt. CC0.

The last few months have been full of activity at Lanfrica, and we are happy to announce that Lanfrica has been officially launched.

What is Lanfrica?

Lanfrica aims to mitigate the difficulty encountered when seeking African language resources by creating a centralized, language-first catalog.

For instance, if you're looking for resources such as linguistic datasets or research papers in a particular African language, Lanfrica will point you to sources on the web with resources in the desired language. If those resources do not exist, we adopt a participatory approach by allowing you to contribute papers or datasets.

Infographic listing the sources in Lanfrica, including African dataset collections, African language policy documents, media coverage, and linguistic, legal, sociological, and political publications

(Chris Emezue, CC BY-SA 4.0)

At Lanfrica, we take a language-focused approach. With 2,199 African languages accounted for, our language section covers all the African languages (yes, all of them, including the extinct ones!). We have created algorithms that can tell quite effectively which African language(s) a resource involves, enabling us to curate even works that do not explicitly state the African languages they cover (and there are many).
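To give a feel for what identifying the languages in a resource can involve, here is a minimal sketch that matches known language names against a resource description. It is an illustration only, not Lanfrica's actual algorithm: the alias table, the `detect_languages` function, and the sample description are all hypothetical, and a real system would need far more than keyword matching.

```python
import re

# Hypothetical alias table: canonical language name -> lowercase names it is known by.
# Only three entries are shown for illustration; this is not Lanfrica's data.
LANGUAGE_ALIASES = {
    "Afrikaans": ["afrikaans"],
    "Fon": ["fon", "fongbe"],
    "Swati": ["swati", "siswati"],
}


def detect_languages(text):
    """Return the sorted canonical names of the languages mentioned in text."""
    # Split the lowercased text into alphabetic tokens so that matching is
    # done on whole words rather than on substrings.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        name
        for name, aliases in LANGUAGE_ALIASES.items()
        if tokens.intersection(aliases)
    )


# Hypothetical resource description, used only to exercise the function.
description = "A parallel corpus for Fongbe and siSwati machine translation"
print(detect_languages(description))
```

Matching whole tokens rather than substrings keeps a short name such as "fon" from firing inside unrelated words, though it would still miss works that never name their languages at all.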

Lanfrica offers enormous potential for better discoverability and representation of African languages on the web. Lanfrica can provide useful statistics on the progress of African languages. As a simple illustration, the language filter section offers an immediate overview of the number of existing natural language processing (NLP) resources for each African language.

Screenshot of a list of 26 South African languages with a number next to each indicating the number of resources found for each, in descending order

(Chris Emezue, CC BY-SA 4.0)

From this search result, you can easily see that among South African languages, Afrikaans has 28 NLP resources, while Swati has just eight. Or, to take another example, the Gbe cluster languages of Benin have far fewer NLP resources than some of the South African languages.

Screenshot of a list of 10 Gbe languages. Only one, Fon, has any resources available; the others all have zero.

(Chris Emezue, CC BY-SA 4.0)
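The per-language overview shown in these screenshots amounts to counting catalog entries by language. The sketch below shows one way such a tally could be computed. The record layout, the `resources_per_language` function, and the sample records are assumptions made for illustration and do not reflect Lanfrica's actual data model or figures.

```python
from collections import Counter


def resources_per_language(resources, languages):
    """Count resources per language, listing the largest counts first.

    resources: iterable of dicts with a "languages" list (assumed layout).
    languages: every language to report, including those with no resources.
    """
    counts = Counter()
    for resource in resources:
        # A resource covering several languages counts once for each of them.
        counts.update(set(resource["languages"]))
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(
        ((lang, counts[lang]) for lang in languages),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical catalog records, not real Lanfrica entries.
catalog = [
    {"title": "Example corpus A", "languages": ["Afrikaans", "Fon"]},
    {"title": "Example corpus B", "languages": ["Afrikaans"]},
]

for language, count in resources_per_language(catalog, ["Afrikaans", "Ewe", "Fon"]):
    print(f"{language}: {count}")
```

Passing the full language list separately is what lets languages with no resources appear with a count of zero, which is exactly the gap the overview is meant to expose.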

Such insight can lead to better allocation of funds and efforts towards bringing the more under-researched languages forward in NLP, thereby fostering the equal progress of African languages.

Lanfrica v1 is just the beginning. We have major updates planned:

  • We plan to enable our users to sign up and add to or edit the resources on Lanfrica.

  • Our resources currently consist of NLP datasets. Next, we plan to add publications in computational linguistics and linguistics. See the infographic above for all the types of resources planned for inclusion.

  • We are exploring various techniques to simplify the process through which relevant resources are identified and connected to Lanfrica.

For more updates as we move forward, become part of the Lanfrica community by joining our Slack or following us on Twitter.


This article originally appeared on the Lanfrica blog and is republished with permission.

Chris Emezue is the founder of Lanfrica. He is a master's student at the Technical University of Munich, studying Mathematics in Data Science. He has worked extensively on a number of AfricaNLP projects, such as MMTAfrica and OkwuGbe, including contributions through Masakhane [https://www.masakhane.io/].


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.