The following table is a summary of the data that are available for download by approved users. For more details on the challenge that produced the data, click on the challenge year.
The n2c2 data sets are provided as a community service. They consist of fully deidentified clinical notes and products of challenges. They are freely available for the research community but subject to a Data Use Agreement (DUA) that must be honored. Each individual user must access the data independently through the DBMI Data Portal. Under no circumstances are copies of any data files to be provided to additional individuals or posted to other websites, including GitHub. Preview the DUA here.
# | Challenge Year | Challenge Topic | Citation | Data | Combined Filesize |
---|---|---|---|---|---|
1 | 2006 | Deidentification & Smoking |
Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc. 2007;14(5):550-63. https://doi.org/10.1197/jamia.M2444 Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15(1):14-24. https://doi.org/10.1197/jamia.M2408 |
Dataset 1a:
Dataset 1b:
Dataset 1c: |
14 MB |
2 | 2008 | Obesity |
Uzuner Ö. Recognizing Obesity and Co-morbidities in Sparse Data. Journal of the American Medical Informatics Association. 2009;16(4):561-570. https://doi.org/10.1197/jamia.M3115 |
Obesity Challenge data consisted of 1237 discharge summaries from the Partners HealthCare Research Patient Data Repository. These data were chosen from the discharge summaries of patients who were overweight or diabetic and had been hospitalized for obesity or diabetes sometime since 12/1/04. | 13.6 MB |
3 | 2009 | Medication |
Uzuner Ö, Solti I, Cadag E. Extracting Medication Information from Clinical Text. Journal of the American Medical Informatics Association. 2010;17(5):514-518. https://doi.org/10.1136/jamia.2010.003947 Uzuner Ö, Solti I, Xia F, Cadag E. Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Journal of the American Medical Informatics Association. 2010;17(5):519-523. https://doi.org/10.1136/jamia.2010.004200 |
A total of 1243 deidentified discharge summaries were used for the medication challenge; 696 of these summaries were released during the development period. The i2b2 team generated ‘gold standard’ annotations for 17 of the 696. A total of 547 discharge summaries were held out for testing. Challenge participants (after turning in their system submissions) collectively developed the gold standard annotations for 251 summaries out of the 547. | 34.7 MB |
4 | 2010 | Relations | Uzuner Ö, South B, Shen S, DuVall S. 2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text. Journal of the American Medical Informatics Association. 2011;18(5):552-556. https://doi.org/10.1136/amiajnl-2011-000203 | Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center (UPMC) contributed discharge summaries to the 2010 i2b2/VA challenge. In addition, UPMC contributed progress reports. A total of 349 training reports, 477 test reports, and 877 unannotated reports were de-identified and released to challenge participants with data use agreements. (Note that because of IRB limitations, only part of the original 2010 data is available for research beyond the original challenge.) | 2.7 MB |
5 | 2011 | Coreference | Uzuner Ö, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. J Am Med Inform Assoc. 2012;19(5):786-791. https://doi.org/10.1136/amiajnl-2011-000784 |
Two corpora (978 files in total): 1. From 2010 i2b2/VA Relations Challenge corpus: discharge summaries from Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center (UPMC); and progress reports from UPMC (814 files in total). 2. Ontology Development and Information Extraction (ODIE) corpus: de-identified clinical reports and pathology reports from Mayo Clinic, and de-identified discharge records, radiology reports, surgical pathology reports, and other reports from UPMC (164 files in total). |
3.8 MB |
6 | 2012 | Temporal Relations |
Sun W, Rumshisky A, Uzuner Ö. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. J Am Med Inform Assoc. 2013;20(5):806-813. https://doi.org/10.1136/amiajnl-2013-001628 Sun W, Rumshisky A, Uzuner Ö. Annotating temporal information in clinical narratives. J Biomed Inform. 2013;46(Supplement):S5-S12. https://doi.org/10.1016/j.jbi.2013.07.004 |
Clinical history and hospital course sections of 310 de-identified discharge summaries from Partners Healthcare and the Beth Israel Deaconess Medical Center, with annotations of clinical events, temporal expressions and temporal relations. |
|
7 | 2014 | Deidentification & Heart Disease Risk |
Kumar V, Stubbs A, Shaw S, Uzuner Ö. Creation of a new longitudinal corpus of clinical narratives. J Biomed Inform. 2015;58(Supplement):S6-S10. https://doi.org/10.1016/j.jbi.2015.09.018
Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J Biomed Inform. 2015;58(Supplement):S11-S19. Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015;58(Supplement):S20-S29. https://doi.org/10.1016/j.jbi.2015.07.020
|
1304 de-identified longitudinal medical records from Partners HealthCare describing 296 patients, selected to support research into the progression of Coronary Artery Disease (CAD) in diabetic patients |
|
8 | 2016 | Deidentification and Symptom Severity (RDoC for Psychiatry) |
Uzuner Ö, Stubbs A, Filannino M. A natural language processing challenge for clinical records: Research Domains Criteria (RDoC) for psychiatry. Journal of Biomedical Informatics 2017;75(Supplement):S1-S3. https://doi.org/10.1016/j.jbi.2017.10.005 Stubbs A, Filannino M, Uzuner Ö. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1. Journal of Biomedical Informatics 2017;75(Supplement):S4-S18. https://doi.org/10.1016/j.jbi.2017.06.011 Filannino M, Stubbs A, Uzuner Ö. Symptom severity prediction from neuropsychiatric clinical records: Overview of 2016 CEGS N-GRID shared tasks Track 2. Journal of Biomedical Informatics 2017;75(Supplement):S62-S70. https://doi.org/10.1016/j.jbi.2017.04.017 |
1,000 deidentified and annotated psychiatric intake notes (Note that because of IRB limitations, the 2016 data will not be made available for research outside the original challenge.) |
n/a |
9 | 2018 | Cohort Selection and Adverse Drug Events & Medication Extraction |
Stubbs A, Filannino M, Soysal E, Henry S, Uzuner Ö. Cohort selection for clinical trials: n2c2 2018 shared task track 1. Journal of the American Medical Informatics Association 2019;26(11):1163–1171. https://doi.org/10.1093/jamia/ocz163 Henry S, Buchan K, Filannino M, Stubbs A, Uzuner Ö. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association 2020;27(1):3-12. https://doi.org/10.1093/jamia/ocz166 |
Cohort Selection:
Adverse Drug Events & Medication Extraction: |
|
10 | 2019 | Clinical Semantic Textual Similarity, Family History Extraction, and Clinical Concept Normalization |
Wang Y, Fu S, Shen F, Henry S, Uzuner Ö, Liu H. The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview. JMIR Med Inform 2020;8(11):e23375. https://doi.org/10.2196/23375
Henry S, Wang Y, Shen F, Uzuner Ö. The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task on clinical concept normalization for clinical records. Journal of the American Medical Informatics Association 2020;27(10):1529-1537. https://doi.org/10.1093/jamia/ocaa106 |