🌍 CIRAL

CIRAL (Cross-Lingual Information Retrieval for African Languages) is a test collection focused on promoting the research and evaluation of Cross-Lingual Information Retrieval (CLIR) for African languages. Our collection covers cross-lingual retrieval between English and 4 African languages, with queries in English and passages in the African languages. This repo provides details of the test collection, guidelines for system evaluations and baselines.

Hosted as a track at the Forum for Information Retrieval Evaluation (FIRE) 2023, the goal of our track was to promote participation and community evaluations in CLIR for African languages. More information regarding the track can be found at the website: Ciral@Fire2023

📚 Corpora

The current languages in CIRAL are Hausa, Swahili, Somali and Yoruba. The corpora consist of passages from news articles, mined from indigenous websites in each language.

Link to Dataset: https://huggingface.co/datasets/CIRAL/ciral-corpus/

Statistics and details of the collection are found below.

| Language | News Sources | # of Passages | # of Articles | Link |
|---|---|---|---|---|
| Hausa (hau) | LegitNG, DailyTrust, VOA, Isyaku, etc. | 715,355 | 240,883 | 🤗 |
| Somali (som) | VOA, Tuko, Risaala, Caasimada, etc. | 827,552 | 629,441 | 🤗 |
| Swahili (swa) | VOA, UN Swahili, MTanzania, etc. | 949,013 | 146,669 | 🤗 |
| Yoruba (yor) | Alaroye, VON, BBC, Asejere, etc. | 82,095 | 27,985 | 🤗 |

For each language, passages are stored in JSONL files where each line corresponds to a passage in JSON format. The fields provided for each passage include:

  • docid: Unique identifier of the passage
  • title: Title of the news article from which the passage was extracted
  • text: Text of the passage
  • url: News article url
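
A minimal sketch of reading one such file with the Python standard library, assuming every line carries the four fields listed above. The function name `read_passages` and the sample docids, titles and URLs are hypothetical and used only for illustration.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def read_passages(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one passage dict per non-empty JSONL line."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Keep only the four documented fields.
        yield {
            "docid": record["docid"],
            "title": record["title"],
            "text": record["text"],
            "url": record["url"],
        }


# Hypothetical two-line sample standing in for a real corpus file.
sample = io.StringIO(
    '{"docid": "doc-1", "title": "Title A", "text": "First passage.", '
    '"url": "https://example.com/a"}\n'
    '{"docid": "doc-2", "title": "Title A", "text": "Second passage.", '
    '"url": "https://example.com/a"}\n'
)
passages = list(read_passages(sample))
print(len(passages), passages[0]["docid"])
```

For a file on disk, the same function can be given `open(path, encoding="utf-8")` in place of the in-memory sample.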

📚 Queries and Relevance Judgements

CIRAL's queries and relevance judgements are provided for the four languages in three sets: development set, test set A and test set B. Additionally, test set A consists of pools curated from CIRAL's shared task. The queries and relevance judgement files can be accessed in the Hugging Face repo.

Statistics for the queries and relevance judgements are given below. The development set consists of a few samples, which can be used to analyze relevance and to evaluate proposed systems against the provided judgements.

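| Language | Dev #Q | Dev #J | Test A #Q | Test A #J | Test A Pool Size | Test B #Q | Test B #J | Link |
|---|---|---|---|---|---|---|---|---|
| Hausa (ha) | 10 | 165 | 80 | 1,447 | 7,288 | 312 | 5,930 | 🤗 |
| Somali (so) | 10 | 187 | 99 | 1,798 | 9,094 | 239 | 4,324 | 🤗 |
| Swahili (sw) | 10 | 196 | 85 | 1,656 | 8,079 | 113 | 2,175 | 🤗 |
| Yoruba (yo) | 10 | 185 | 100 | 1,921 | 8,311 | 554 | 10,569 | 🤗 |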

Both the query and relevance judgement files are in .tsv format. Each line in the query file has the form:

qid\tquery

while the judgements are in the standard TREC format:

qid Q0 docid relevance
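
A minimal sketch of parsing both file types, assuming the layouts shown above: one tab between `qid` and `query`, and four whitespace-separated columns per judgement line. The helper names `parse_queries` and `parse_qrels`, and the sample ids and query text, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_queries(lines: Iterable[str]) -> Dict[str, str]:
    """Map each qid to its query text (lines look like 'qid<TAB>query')."""
    queries: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        qid, query = line.split("\t", 1)  # split on the first tab only
        queries[qid] = query
    return queries


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map qid -> {docid: relevance} from 'qid Q0 docid relevance' lines."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        qid, _q0, docid, relevance = parts  # second column is unused
        qrels[qid][docid] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines standing in for real files.
queries = parse_queries(["1\tIs this a sample question?\n"])
qrels = parse_qrels(["1\tQ0\tdoc-1\t1\n", "1\tQ0\tdoc-2\t0\n"])
print(queries["1"], qrels["1"])
```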

📚 Guidelines and Resources

Task Description

This task entails queries formulated as natural language questions in English, with retrieval done at the passage level for the different African languages. Information retrieval systems developed for the task receive the collection of passages and a set of queries for each African language. For each query in the test set, a system should return a ranked list of passages ordered by likelihood of binary relevance to the query. Up to 1000 passages per query can be submitted; results with more than 1000 passages are truncated.
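
The 1000-passage limit can be applied before submission with a small helper. This is a sketch under stated assumptions: the in-memory layout (a dict from qid to a list of `(docid, score)` pairs) and the name `truncate_run` are hypothetical, and the submission file format itself is not covered here.

```python
from typing import Dict, List, Tuple

MAX_PASSAGES = 1000  # per-query limit from the task description

Run = Dict[str, List[Tuple[str, float]]]


def truncate_run(run: Run, limit: int = MAX_PASSAGES) -> Run:
    """Rank each query's passages by descending score and keep the top `limit`."""
    truncated: Run = {}
    for qid, scored in run.items():
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        truncated[qid] = ranked[:limit]
    return truncated


# Hypothetical run: one query with 1200 scored passages.
run = {"q1": [(f"doc-{i}", float(i)) for i in range(1200)]}
submitted = truncate_run(run)
print(len(submitted["q1"]))
```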

Details regarding participation can be found in this section of the website.

Getting started with IR and CIRAL

For more details on getting started with IR and understanding the task, see the provided Quick Start.

🔎 Baselines and Evaluation

Baselines and reproduction guides are provided in this section. Please note that this only covers searching, as the indexes have already been built.

The baselines can be reproduced using Pyserini. To reproduce the baselines:

  1. Install the development version of Pyserini by following this guide.
  2. Follow the commands in the 2-click-reproduction (2CR).

Our reranking baseline models are also available on Hugging Face: mT5, AfrimT5.

Citation

@inproceedings{10.1145/3626772.3657884,
author = {Adeyemi, Mofetoluwa and Oladipo, Akintunde and Zhang, Xinyu and Alfonso-Hermelo, David and Rezagholizadeh, Mehdi and Chen, Boxing and Omotayo, Abdul-Hakeem and Abdulmumin, Idris and Etori, Naome A. and Musa, Toyib Babatunde and Fanijo, Samuel and Awoyomi, Oluwabusayo Olufunke and Salahudeen, Saheed Abdullahi and Mohammed, Labaran Adamu and Abolade, Daud Olamide and Lawan, Falalu Ibrahim and Sabo Abubakar, Maryam and Nasir Iro, Ruqayya and Imam Abubakar, Amina and Mohamed, Shafie Abdi and Mohamed, Hanad Mohamud and Ajayi, Tunde Oluwaseyi and Lin, Jimmy},
title = {CIRAL: A Test Collection for CLIR Evaluations in African Languages},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657884},
doi = {10.1145/3626772.3657884},
pages = {293–302},
numpages = {10},
keywords = {african languages, cross-lingual information retrieval},
location = {Washington DC, USA},
series = {SIGIR '24}
}
