2018 Challenge

Tentative Timeline

February 20 — Registration opens
March 5 — Training data release
May 7 — Test data release
May 9 — System outputs due
June 2 — Abstracts due
TBD — Workshop

Task Description

This task aims to answer the question, “Can NLP systems use narrative medical records to identify which patients meet selection criteria for clinical trials?” Systems must compare each patient’s records against a list of selection criteria and determine whether the patient meets, does not meet, or possibly meets each criterion.

Definitions and background

Identifying patients who meet certain criteria for placement in clinical trials is a vital part of medical research. Finding patients for clinical trials is a challenge, as medical studies often have complex criteria that cannot easily be translated into a database query, but rather require examining the clinical narratives in a patient’s records. This is time-consuming for medical researchers who need to recruit patients, so researchers are often limited to patients who either seek out the trial for themselves, or who are pointed towards a particular trial by their doctor. Recruitment from particular places or by particular people can result in selection bias towards certain populations (e.g., people who can afford regular care, or people who exclusively use free clinics), which in turn can bias the results of the study (Mann, 2003; Geneletti et al., 2009). Developing NLP systems that can automatically assess whether a patient is eligible for a study can both reduce the time it takes to recruit patients and help remove bias from clinical trials (Stubbs, 2013).

However, matching patients to selection criteria is not a trivial task for machines, due to the complexity the criteria often exhibit. For example, consider the phrase “Patient must be taking aspirin for MI prevention.” In this case, it is insufficient for a system to extract only the medication (i.e., aspirin); rather, the system must also determine the reason why the patient is taking the medication (i.e., for MI prevention). This shared task aims to identify whether a patient meets, does not meet, or possibly meets a selected set of eligibility criteria based on their longitudinal records. The eligibility criteria come from real clinical trials and focus on patients’ medications, past medical histories, and whether certain events have occurred in a specified timeframe in the patients’ records.
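To make the aspirin example concrete, the following is a purely illustrative sketch (the function name, regular expressions, and 80-character context window are invented for this illustration and are not part of the task data or any official baseline). It shows why extracting the drug name alone is insufficient: the stated reason must also be located near the mention.

```python
import re

def meets_aspirin_for_mi_prevention(note: str) -> bool:
    """Toy check for the criterion "taking aspirin for MI prevention".

    Spotting the word "aspirin" is not enough; we also look for an
    MI-prevention rationale in a small window after each mention.
    """
    for match in re.finditer(r"\baspirin\b", note, re.IGNORECASE):
        # Examine a short span of text following the drug mention.
        window = note[match.start():match.start() + 80]
        if re.search(
            r"\b(MI|myocardial infarction)\b.*prevent"
            r"|prevent\w*\b.*\b(MI|myocardial infarction)\b",
            window,
            re.IGNORECASE,
        ):
            return True
    return False
```

A real system would of course need to handle negation, abbreviations, and reasons stated far from the medication mention, which is precisely what makes the task non-trivial.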


This task uses data from the 2014 i2b2/UTHealth Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data, which featured tasks on de-identification and heart disease risk factors. The data consists of nearly 300 sets of longitudinal patient records, annotated by medical professionals to determine whether each patient matches a list of 13 selection criteria. Example criteria include whether the patient has taken a dietary supplement (excluding Vitamin D) in the past 2 months, whether the patient has a major diabetes-related complication, and whether the patient has advanced cardiovascular disease.

All the files have been annotated at the document level to indicate whether the patient meets, possibly meets, or does not meet each criterion. The gold standard annotations will provide each patient’s category for each criterion. A limited number of gold standard span annotations will also be provided to show how decisions about each patient were made. Participants will be evaluated on the predicted category of each patient in the held-out test data.

The data for this task is provided by Partners HealthCare. All records have been fully de-identified and manually annotated for whether they meet, possibly meet, or do not meet clinical trial eligibility criteria.

Data for the challenge will be released under a Rules of Conduct and Data Use Agreement. Obtaining the data requires completing a registration, which opens February 20, 2018.

Two sample gold standard files and the annotation guidelines are available for download.

Evaluation Format

The evaluation will be conducted using withheld test data. Participating teams are asked to stop development as soon as they download the test data. Each team is allowed to upload (through this website) up to three system runs. System output must be submitted in the exact format of the ground truth annotations, which will be provided by the organizers.
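The official evaluation script will come from the organizers; as a hedged sketch only, the comparison of predicted categories against the gold standard might look like the following, assuming labels of "met", "not met", and "possibly met" keyed by (patient, criterion) pairs (the dictionary format and per-label F1 scoring here are illustrative assumptions, not the official metric).

```python
def score(gold: dict, pred: dict) -> dict:
    """Compute per-label F1 from document-level gold and predicted categories.

    Both arguments map (patient_id, criterion) -> label. This is an
    illustrative sketch, not the organizers' evaluation script.
    """
    results = {}
    for label in ("met", "not met", "possibly met"):
        # True positives, false positives, and false negatives per label.
        tp = sum(1 for k, g in gold.items() if g == label and pred.get(k) == label)
        fp = sum(1 for k, p in pred.items() if p == label and gold.get(k) != label)
        fn = sum(1 for k, g in gold.items() if g == label and pred.get(k) != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[label] = f1
    return results
```

Whatever the exact metric, the key point is that systems are judged on the predicted document-level category for each patient and criterion, not on intermediate extractions.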


Participants are asked to submit a 500-word abstract describing their methodologies. Abstracts may also include a graphical summary of the proposed architecture. The document should not exceed 2 pages (1.5" line spacing, 12-point font). The authors of top-performing systems or particularly novel approaches will be invited to present or demonstrate their systems at the workshop. A special issue of a journal will be organized following the workshop.

Organizing Committee

  • Ozlem Uzuner, co-chair, George Mason University
  • Amber Stubbs, co-chair, Simmons College
  • Michele Filannino, co-chair, MIT
  • Kevin Buchan, SUNY at Albany
  • Susanne Churchill, Harvard Medical School
  • Isaac Kohane, Harvard Medical School
  • Hua Xu, UTHealth
  • Ergin Soysal, UTHealth


References

Sara Geneletti, Sylvia Richardson, and Nicky Best. 2009. Adjusting for selection bias in retrospective, case-control studies. Biostatistics, 10(1):17–31.

C. J. Mann. 2003. Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emergency Medicine Journal, 20:54–60.

Amber Stubbs. 2013. A Methodology for Using Professional Knowledge in Corpus Annotation. Ph.D. thesis, Brandeis University.