Labs

CLEF 2026 hosts a total of 16 labs.

BioASQ

Large-scale biomedical semantic indexing and question answering

BioASQ organizes a series of challenges (shared tasks) on biomedical information access and machine learning in two complementary research directions: (a) the automated indexing of large volumes of unlabelled data, such as scientific articles, with biomedical concepts, and (b) the processing of biomedical questions and the generation of comprehensible answers. In the first direction, this year BioASQ introduces (i) the new Task BioNNE-R on Nested Relation Extraction in Russian and English, (ii) a new edition of Task ELCardioCC on Clinical Coding of Greek Cardiology Discharge Letters, focusing on document-level annotation, and (iii) a new edition of Task GutBrainIE on Gut-Brain Interplay Information Extraction, incorporating more diverse and relevant biomedical literature. In the biomedical Question Answering (QA) direction, a complete infrastructure has been developed to support the established QA task (Task B) as well as the innovative Task Synergy on QA for developing problems. In addition, a new edition of Task MultiClinSum on multilingual summarization of clinical case reports (Task MultiClinSum-2) is introduced this year, extended to additional languages, namely German, Dutch, Catalan, Swedish, Norwegian, and Italian.

CheckThat!

Developing technologies for identifying and verifying claims

The 9th edition of the CheckThat! lab at CLEF targets three tasks: (i) scientific web discourse, (ii) generating full fact-checking articles, and (iii) fact-checking numerical and temporal claims. These tasks pose challenging classification and retrieval problems, including in multilingual settings.

ELOQUENT

New evaluation methods for generative language models

The ELOQUENT evaluation lab experiments with new evaluation methods for generative language models, addressing some of the challenges on the path from laboratory to application. The organisers include commercially active AI developers as well as research groups. The lab explores the following important characteristics of generative language model quality: (1) trustworthiness, a many-faceted notion involving topical relevance and truthfulness, discourse competence, reasoning in language, controllability, and robustness across varied input, which is at the forefront of current development projects for generative language models; (2) multilinguality and cultural fit, i.e., the suitability of a language model for a given cultural and linguistic area, a question of prime importance, not least for the European arena; (3) self-assessment, i.e., how reliably a language model can assess the quality of itself or another language model with as little human effort as possible; and (4) the limits of language models, i.e., delimiting their world knowledge and generative capacity.

eRisk

Early risk prediction on the internet

We propose eRisk 2026, the next edition of CLEF's lab series on early risk prediction in online data, building on nine previous editions (2017–2025) that explored important tasks such as depression, anorexia, self-harm, pathological gambling, and eating disorders. This edition introduces three main challenges. The first task involves interacting with conversational agents that have been instructed to simulate different user behaviours and conditions: participants must interact with the LLMs and predict depression severity and the main symptoms present, if any. The second task is the second edition of the Contextualised Early-Depression Detection task, which leverages full Reddit conversation threads, providing richer conversational and contextual scenarios for emitting timely risk predictions. The third task, symptom sentence ranking for Attention-Deficit Hyperactivity Disorder (ADHD), extends our ranking framework to a previously unexplored condition, with ADHD symptoms defined according to the ASRS-v1.1 clinical questionnaire. The lab continues the established three-year task cycle, offers baselines and high-quality datasets, and advances conversational and symptom-level analysis as key elements of mental health solutions.
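To make the early-detection setting of the second task concrete: a system reads a user's posts in chronological order and may flag risk at any point, trading accuracy against delay. Below is a minimal sketch of that decision loop, with a hypothetical keyword-based score_post standing in for a real classifier; names and thresholds are illustrative, not the lab's API.

```python
from typing import Iterable

def score_post(text: str) -> float:
    """Hypothetical per-post risk scorer in [0, 1]; a stand-in for a real model."""
    cues = ("hopeless", "worthless", "can't sleep")
    return min(1.0, 0.4 * sum(cue in text.lower() for cue in cues))

def early_decision(posts: Iterable[str], threshold: float = 0.7) -> tuple[str, int]:
    """Read posts chronologically and decide as soon as evidence suffices.

    Returns the label and the number of posts seen before deciding; early-risk
    metrics such as ERDE penalize correct decisions that come late.
    """
    evidence, seen = 0.0, 0
    for post in posts:
        seen += 1
        # Keep the strongest signal observed so far; a real system would
        # aggregate a calibrated model's outputs instead.
        evidence = max(evidence, score_post(post))
        if evidence >= threshold:
            return "at-risk", seen
    return "not-at-risk", seen

print(early_decision(["slept fine", "feeling hopeless and worthless lately"]))
```

The loop returns the decision point alongside the label because early-risk evaluation rewards correct decisions made after fewer posts.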

EXIST

Sexism identification in social networks

This lab focuses on the detection of sexist messages in social networks. The inequality and discrimination against women that remain embedded in society are increasingly being replicated online; the Internet perpetuates and even naturalizes gender differences and sexist attitudes. The EXIST 2026 lab will continue to focus on the detection of sexism in social networks, while introducing a novel paradigm that integrates human-centered signals into the AI development pipeline. In this edition, we extend the "learning with disagreement" framework by incorporating sensor-based data from people exposed to potentially sexist content, including measurements such as skin conductance, heart rate variability, and other signals that reflect unconscious responses to sexism. Given the nature of these multimodal signals, we will concentrate on analyzing memes and short videos, formats that combine visual and textual cues and are especially suited to capturing the emotional and cognitive impact of online content. This human-in-the-loop approach not only acknowledges the diversity of subjective reactions to sexism, but also opens new avenues for building more robust, equitable, and interpretable systems. By integrating both conscious feedback and unconscious reactions from annotators, EXIST 2026 aims to foster a more nuanced and ethically grounded understanding of sexism across platforms and formats.

FinMMEval

Multilingual and multimodal evaluation of financial AI systems

This is the inaugural edition of FinMMEval, a CLEF 2026 workshop dedicated to the multilingual and multimodal evaluation of financial AI systems. Real-world financial decision-making relies on diverse data sources, including textual reports, visual documents, and time-series signals, spread across languages and modalities. To support robust, interpretable, and auditable AI in this domain, FinMMEval proposes three complementary pilot tasks: (1) Document Parsing and Structured Extraction from scanned financial filings, (2) Financial Exam Question Answering, and (3) Financial Decision Making based on market context, combining historical prices, news, and portfolio status. The lab includes datasets in six languages, namely English, Spanish, Arabic, Greek, Bulgarian, and Hindi, and covers the spectrum from low-level perception to high-level decision-making. We aim to foster cross-disciplinary collaboration across NLP, CV, time-series modeling, and the financial industry, and to establish FinMMEval as a benchmark for the next generation of financial AI systems.

HIPE

Evaluating accurate and efficient person–place relation extraction from multilingual historical texts

Based on a pilot study that confirmed the feasibility of the task, HIPE-2026 targets a single but fundamental relation type (person–isAt–place) and additionally requires participants to (1) determine the temporal scope of this relation and (2) assess the textual evidence that supports it. Working with challenging materials, i.e., OCR-noisy, multilingual, and domain-diverse newspaper articles, participants will contribute to the development of approaches that are key to constructing historical knowledge graphs, reconstructing biographies, enabling spatial analysis, and advancing text understanding of historical material. Given the energy costs of frontier models and the need to process large-scale cultural heritage collections, we identify efficiency as a critical challenge. HIPE-2026 will therefore offer two sub-tracks: one targeting maximum accuracy, the other prioritizing a trade-off between accuracy and computational efficiency. A surprise dataset will be included to evaluate generalization across domains. All datasets will be released to support transparency, reuse, and further research.
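The target output is essentially a qualified triple: a person, a place, a temporal scope, and the supporting evidence. Purely as an illustration (the field names and values below are ours, not the official HIPE-2026 submission schema), one relation instance could be represented like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonIsAtPlace:
    """Illustrative container for one person-isAt-place relation instance.

    Field names are hypothetical, not the official HIPE-2026 schema.
    """
    person: str                # person mention in the article
    place: str                 # place mention linked to the person
    time_start: Optional[str]  # temporal scope of the relation, if recoverable
    time_end: Optional[str]
    evidence: str              # text span supporting the relation
    confidence: float          # system's assessment of that evidence

relation = PersonIsAtPlace(
    person="Jean Dupont",      # placeholder name
    place="Genève",
    time_start="1859-06",
    time_end=None,
    evidence="M. Dupont, de retour à Genève en juin 1859, ...",
    confidence=0.82,
)
```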

ImageCLEF

Multimodal challenge in CLEF

Building on the success of previous editions from 2003 to 2025, ImageCLEF 2026 continues as part of the CLEF initiative's multimodal challenge. The campaign focuses on advancing and benchmarking technologies for multimodal data analysis, including annotation, classification, indexing, and retrieval. The primary goal of ImageCLEF 2026 is to facilitate research through access to large-scale, diverse multimodal datasets tailored to a wide range of practical domains and scenarios. Continuing the momentum of recent successful editions, this year's challenge will again embrace interdisciplinary problem solving across key areas such as medical imaging, knowledge representation, and data generation, encouraging innovative, real-world solutions. ImageCLEF 2026 will retain the tasks from the 2025 edition, with some improvements, while adding a new task to expand the campaign further. Participants will be invited to tackle new and evolving tasks that reflect emerging needs and technologies, fostering collaboration across disciplines.

JOKER

Automatic humour analysis

Humour poses a unique challenge for artificial intelligence, as it often relies on non-literal language, cultural references, and linguistic creativity. The JOKER lab, now in its fourth year, aims to advance computational humour research through shared tasks on curated, multilingual datasets, with applications in education, computer-mediated communication and translation, and conversational AI. The 2026 edition of the JOKER lab will continue the three main tasks from last year: (1) humour-aware information retrieval, which involves searching a document collection for humorous texts relevant to user queries in either English or Portuguese; (2) pun translation, focused on humour-preserving translation of paronomastic jokes from English into French; and (3) onomastic wordplay translation, addressing the translation of name-based wordplay from English into French. We observed significant changes in the participants' approaches in 2025.

LifeCLEF

Biodiversity monitoring using AI-powered tools

Biodiversity monitoring using AI-powered tools has become vital for tracking species distributions and assessing ecosystem health on a large scale. Automated image- and sound-based species recognition, in particular, continues to accelerate conservation efforts by enabling rapid, low-cost surveys of vulnerable populations. However, the ever-growing variety of algorithms and data sources underscores the need for standardized benchmarks to assess real-world performance. Since 2011, the LifeCLEF lab has filled this role by organizing annual evaluations that promote collaboration among AI experts, citizen scientists, and ecologists. The 2026 edition comprises five challenges: (i) AnimalCLEF, discovery of individual animals; (ii) BirdCLEF+, multi-taxonomic species identification in soundscape recordings; (iii) MarineCLEF, location-aware classification of marine species in underwater imagery; (iv) PestCLEF, information extraction on plant pests from news articles; and (v) PlantCLEF, multi-species plant identification in quadrat images.

LongEval

Longitudinal evaluation of model performance

Most Information Retrieval (IR) benchmarks evaluate systems at a single point in time, even though data and user behavior change over time. Research shows that IR and text classification systems lose effectiveness as data patterns evolve, especially when test data is temporally distant from training data. This lab encourages the development of models that maintain performance over time by providing training and testing data from different periods. The fourth LongEval lab will further focus on evaluating IR systems' ability to generalize across time, using datasets split at various temporal distances to assess how well systems handle evolving documents and queries. For 2026 we plan four tasks, widening the scope of long-term IR beyond evolving documents, topics, and qrels toward evolving user behavior, including user simulation tasks.
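The underlying protocol is simple: fit or tune a system on one time slice, then measure effectiveness on slices at increasing temporal distance and watch for decay. A minimal sketch, assuming placeholder train and evaluate functions and invented slice labels rather than the official lab API:

```python
# Minimal sketch of LongEval-style temporal evaluation: train on one time
# slice, test on slices at increasing temporal distance. The train/evaluate
# functions and slice labels are placeholders, not the official lab API.

def train(train_slice: str):
    """Fit or tune an IR system on data from the training period."""
    return {"trained_on": train_slice}  # stand-in for a real system

def evaluate(system, test_slice: str) -> float:
    """Return an effectiveness score (e.g. nDCG) on one test slice."""
    return 0.0  # stand-in for running the system against that slice's qrels

slices = ["2022-06", "2022-07", "2022-09", "2023-01"]  # hypothetical periods
system = train(slices[0])
for test_slice in slices[1:]:
    # A temporally robust system yields a flat curve here; a brittle one
    # decays as the test slice moves further from the training period.
    print(test_slice, evaluate(system, test_slice))
```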

PAN

Stylometry and digital text forensics

The PAN evaluation lab on digital text forensics includes three returning and two new tasks, all of which tackle recent and relevant challenges from the field of text forensics, with a special focus on the detection and analysis of text produced by generative AI. The tasks are: (1) Voight-Kampff Generative AI Detection, (2) Text Watermarking, (3) Multi-author Writing Style Analysis, (4) Generative Plagiarism Detection, and (5) Reasoning Trajectory Detection.

QuantumCLEF

Quantum computing at CLEF

The goal of QuantumCLEF is to establish an evaluation infrastructure for quantum computing (QC) algorithms with a focus on applications in the Information Access domain. This initiative aims to: (1) explore novel problem formulations that enable the efficient and effective use of QC techniques; (2) assess the performance of QC methods in comparison to conventional, non-quantum approaches executed on classical hardware; (3) foster interdisciplinary collaboration among researchers from areas such as Information Retrieval, Recommender Systems, and Operations Research, to facilitate knowledge exchange and engagement with QC technologies.
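One concrete example of such a problem reformulation, used in earlier QuantumCLEF editions, is recasting feature selection as a QUBO (quadratic unconstrained binary optimization) instance that a quantum annealer can minimize. The toy sketch below brute-forces a 3-variable QUBO classically; the Q matrix values are illustrative only:

```python
import itertools
import numpy as np

# Toy QUBO for feature selection: diagonal entries reward individually
# useful features, off-diagonal entries penalize redundant pairs.
# Minimizing x^T Q x over binary x selects a small, non-redundant subset.
Q = np.array([
    [-3.0,  2.0,  0.0],
    [ 0.0, -2.0,  2.5],
    [ 0.0,  0.0, -1.0],
])

def qubo_energy(x: np.ndarray) -> float:
    return float(x @ Q @ x)

# Exhaustive search stands in for the quantum annealer on this tiny instance.
best = min((np.array(bits) for bits in itertools.product([0, 1], repeat=3)),
           key=qubo_energy)
print("selected features:", best, "energy:", qubo_energy(best))
```

On real hardware the same Q matrix is handed to an annealer instead of being enumerated, which is where the efficiency comparison against classical baselines becomes interesting.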

SimpleText

Simplify scientific text (and nothing more)

Over the last few years, the SimpleText Track has created an active community of researchers in NLP and IR working together to improve access to scientific text. Its benchmarks on scientific passage retrieval, scientific terminology detection and explanation, and scientific text simplification have become standard references. After using a similar track setup in 2021–2024, we significantly changed the track's setup and tasks in 2025, and we will continue this successful setup in 2026. Hence, the CLEF 2026 SimpleText track will contain the following 2+1 tasks: Task 1 (Text Simplification), simplify scientific text; Task 2 (Controlled Creativity), identify and avoid hallucination; and Task 3 (SimpleText Revisited), selected tasks from earlier editions, returning by popular request.

TalentCLEF

Skill and job title intelligence for human capital management

The second edition of the Skills and Job Title Intelligence for Human Capital Management (TalentCLEF) workshop aims to foster the development and thorough assessment of NLP-based decision support systems in the field of human resources, and to offer a meeting place for professionals and researchers interested in the application of these technologies. The workshop will be conducted as an evaluation lab featuring two shared tasks that aim to advance fair talent matching in Human Capital Management: (1) Task A, Contextualized Job-Person Matching, focused on finding the right candidates for specific job positions, and (2) Task B, Job-Skill Matching with Skill Type Classification, focused on identifying the skills relevant to a given job position.

Touché

Argumentation systems

Decision-making and opinion-forming are everyday tasks that involve weighing pro and con arguments for or against different options. With ubiquitous access to all kinds of information on the web, everybody has the chance to acquire knowledge for these tasks on almost any topic. However, current information systems are primarily optimized for returning relevant results and do not support deeper analyses of arguments or multimodality. To close this gap, the Touché lab series, running since 2020, offers several tasks that advance both argumentation systems and their evaluation.