Data Sets

The following table is a summary of the data that are available for download by approved users. For more details on the challenge that produced the data, click on the challenge year.

The n2c2 data sets are provided as a community service. They consist of fully deidentified clinical notes and products of challenges. They are freely available for the research community but subject to a Data Use Agreement (DUA) that must be honored. Each individual user must access the data independently through the DBMI Data Portal. Under no circumstances are copies of any data files to be provided to additional individuals or posted to other websites, including GitHub. Preview the DUA here.

# | Challenge Year | Challenge Topic | Citation | Data | Combined Filesize
1 2006 Deidentification & Smoking

Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc. 2007;14(5):550-63. https://doi.org/10.1197/jamia.M2444

Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15(1):14-24. https://doi.org/10.1197/jamia.M2408

Dataset 1a:
889 unannotated, de-identified discharge summaries

Dataset 1b:
889 de-identified discharge summaries with de-identification challenge annotations, training and test sets and ground truth

Dataset 1c:
A subset of the above 889 (N = 502) de-identified discharge summaries with smoking challenge annotations, training and test sets and ground truth

14 MB
2 2008 Obesity

Uzuner Ö. Recognizing Obesity and Co-morbidities in Sparse Data. Journal of the American Medical Informatics Association. July 2009; 16(4): 561-570. https://doi.org/10.1197/jamia.M3115  

Obesity Challenge data consisted of 1237 discharge summaries from the Partners HealthCare Research Patient Data Repository. These data were chosen from the discharge summaries of patients who were overweight or diabetic and had been hospitalized for obesity or diabetes sometime since 12/1/04.

13.6 MB
3 2009 Medication

Uzuner Ö, Solti I, Cadag E. Extracting Medication Information from Clinical Text. Journal of the American Medical Informatics Association. 2010;17(5):514-518. https://doi.org/10.1136/jamia.2010.003947

Uzuner Ö, Solti I, Xia F, Cadag E. Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Journal of the American Medical Informatics Association. 2010;17(5):519-523. https://doi.org/10.1136/jamia.2010.004200

A total of 1243 deidentified discharge summaries were used for the medication challenge; 696 of these summaries were released during the development period. The i2b2 team generated ‘gold standard’ annotations for 17 of the 696. A total of 547 discharge summaries were held out for testing. Challenge participants (after turning in their system submissions) collectively developed the gold standard annotations for 251 summaries out of the 547.

34.7 MB
4 2010 Relations

Uzuner Ö, South B, Shen S, DuVall S. 2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text. Journal of the American Medical Informatics Association. 2011;18(5):552-556. https://doi.org/10.1136/amiajnl-2011-000203

Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center (UPMC) contributed discharge summaries to the 2010 i2b2/VA challenge. In addition, UPMC contributed progress reports. A total of 349 training reports, 477 test reports, and 877 unannotated reports were de-identified and released to challenge participants with data use agreements. (Note that because of IRB limitations, only part of the original 2010 data is available for research beyond the original challenge.)

2.7 MB
5 2011 Coreference

Uzuner Ö, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. J Am Med Inform Assoc. 2012;19(5):786-791. https://doi.org/10.1136/amiajnl-2011-000784

Two corpora (978 files in total):

1. From 2010 i2b2/VA Relations Challenge corpus: discharge summaries from Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center (UPMC); and progress reports from UPMC (814 files in total).

2. Ontology Development and Information Extraction (ODIE) corpus: de-identified clinical reports and pathology reports from Mayo Clinic, and de-identified discharge records, radiology reports, surgical pathology reports, and other reports from UPMC (164 files in total).
3.8 MB
6 2012 Temporal Relations

Sun W, Rumshisky A, Uzuner Ö. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. J Am Med Inform Assoc. 2013;20(5):806-813. https://doi.org/10.1136/amiajnl-2013-001628

Sun W, Rumshisky A, Uzuner Ö. Annotating temporal information in clinical narratives. J Biomed Inform. 2013;46(Supplement):S5-S12. https://doi.org/10.1016/j.jbi.2013.07.004

Clinical history and hospital course sections of 310 de-identified discharge summaries from Partners Healthcare and the Beth Israel Deaconess Medical Center, with annotations of clinical events, temporal expressions and temporal relations.

 
7 2014 Deidentification & Heart Disease Risk

Kumar V, Stubbs A, Shaw S, Uzuner Ö. Creation of a new longitudinal corpus of clinical narratives. J Biomed Inform. 2015;58(Supplement):S6-S10. https://doi.org/10.1016/j.jbi.2015.09.018

Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J Biomed Inform. 2015;58(Supplement):S11-S19. https://doi.org/10.1016/j.jbi.2015.06.007

Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015;58(Supplement):S20-S29. https://doi.org/10.1016/j.jbi.2015.07.020

1304 de-identified longitudinal medical records from Partners HealthCare describing 296 patients, selected to support research into the progression of Coronary Artery Disease (CAD) in diabetic patients

 
8 2016 Deidentification and Symptom Severity (RDoC for Psychiatry)

Uzuner Ö, Stubbs A, Filannino M. A natural language processing challenge for clinical records: Research Domains Criteria (RDoC) for psychiatry. Journal of Biomedical Informatics 2017;75(Supplement):S1-S3. https://doi.org/10.1016/j.jbi.2017.10.005

Stubbs A, Filannino M, Uzuner Ö. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1. Journal of Biomedical Informatics 2017;75(Supplement):S4-S18. https://doi.org/10.1016/j.jbi.2017.06.011

Filannino M, Stubbs A, Uzuner Ö. Symptom severity prediction from neuropsychiatric clinical records: Overview of 2016 CEGS N-GRID shared tasks Track 2. Journal of Biomedical Informatics 2017;75(Supplement):S62-S70. https://doi.org/10.1016/j.jbi.2017.04.017

1,000 deidentified and annotated psychiatric intake notes

(Note that because of IRB limitations, the 2016 data will not be made available for research outside the original challenge.)

n/a
9 2018 Cohort Selection and Adverse Drug Events & Medication Extraction

Stubbs A, Filannino M, Soysal E, Henry S, Uzuner Ö. Cohort selection for clinical trials: n2c2 2018 shared task track 1. Journal of the American Medical Informatics Association 2019;26(11):1163–1171. https://doi.org/10.1093/jamia/ocz163

Henry S, Buchan K, Filannino M, Stubbs A, Uzuner Ö. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association 2020;27(1):3-12. https://doi.org/10.1093/jamia/ocz166

Cohort Selection:
De-identified longitudinal medical records from Partners HealthCare for 288 patients from the 2014 challenge, with 2-5 records per patient. All the patients in this dataset have diabetes, and most are at risk for heart disease.

Adverse Drug Events & Medication Extraction:
505 discharge summaries drawn from the MIMIC-III (Medical Information Mart for Intensive Care-III) clinical care database, selected using a query that searched for an ADE in the International Classification of Diseases code description of each record

 
10 2019 Clinical Semantic Textual Similarity, Family History Extraction, and Clinical Concept Normalization

Wang Y, Fu S, Shen F, Henry S, Uzuner Ö, Liu H. The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview. JMIR Med Inform 2020;8(11):e23375. https://doi.org/10.2196/23375

Henry S, Wang Y, Shen F, Uzuner Ö. The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task on clinical concept normalization for clinical records. Journal of the American Medical Informatics Association 2020;27(10):1529-1537. https://doi.org/10.1093/jamia/ocaa106