Data Sets

The following table is a summary of the data that are available for download by approved users. For more details on the challenge that produced the data, click on the challenge year.

The n2c2 data sets are provided as a community service. They consist of fully deidentified clinical notes and products of challenges. They are freely available for the research community but subject to a Data Use Agreement (DUA) that must be honored. Each individual user must access the data independently through the DBMI Data Portal. Under no circumstances are copies of any data files to be provided to additional individuals or posted to other websites, including GitHub. Preview the DUA here.

#	Challenge Year	Challenge Topic	Citation	Data	Combined Filesize
1	2006	Deidentification & Smoking	Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc. 2007;14(5):550-63. https://doi.org/10.1197/jamia.M2444 Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15(1):14-24. https://doi.org/10.1197/jamia.M2408	Dataset 1a: 889 unannotated, de-identified discharge summaries Dataset 1b: 889 de-identified discharge summaries with de-identification challenge annotations, training and test sets and ground truth Dataset 1c: A subset of the above 889 (N = 502) de-identified discharge summaries with smoking challenge annotations, training and test sets and ground truth	14 MB
2	2008	Obesity	Uzuner Ö. Recognizing Obesity and Co-morbidities in Sparse Data. Journal of the American Medical Informatics Association. July 2009; 16(4): 561-570. https://doi.org/10.1197/jamia.M3115	Obesity Challenge data consisted of 1237 discharge summaries from the Partners HealthCare Research Patient Data Repository. These data were chosen from the discharge summaries of patients who were overweight or diabetic and had been hospitalized for obesity or diabetes sometime since 12/1/04.	13.6 MB
3	2009	Medication	Uzuner Ö, Solti I, Cadag E. Extracting Medication Information from Clinical Text. Journal of the American Medical Informatics Association. 2010;17(5):514-518 https://doi.org/10.1136/jamia.2010.003947 Uzuner Ö, Solti I, Xia F, Cadag E. Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Journal of the American Medical Informatics Association. 2010;17(5):519-523 https://doi.org/10.1136/jamia.2010.004200	A total of 1243 deidentified discharge summaries were used for the medication challenge; 696 of these summaries were released during the development period. The i2b2 team generated ‘gold standard’ annotations for 17 of the 696. A total of 547 discharge summaries were held out for testing. Challenge participants (after turning in their system submissions) collectively developed the gold standard annotations for 251 summaries out of the 547.	34.7 MB
4	2010	Relations	Uzuner Ö, South B, Shen S, DuVall S. 2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text. Journal of the American Medical Informatics Association. 2011;18(5):552-556. https://doi.org/10.1136/amiajnl-2011-000203	Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center (UPMC) contributed discharge summaries to the 2010 i2b2/VA challenge. In addition, UPMC contributed progress reports. A total of 394 training reports, 477 test reports, and 877 unannotated reports were de-identified and released to challenge participants with data use agreements. (Note that because of IRB limitations, only part of the original 2010 data is available for research beyond the original challenge.)	2.7 MB
5	2011	Coreference	Uzuner Ö, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. J Am Med Inform Assoc. 2012;19(5):786-791. https://doi.org/10.1136/amiajnl-2011-000784	Two corpora (978 files in total): 1. From 2010 i2b2/VA Relations Challenge corpus: discharge summaries from Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center (UPMC); and progress reports from UPMC (814 files in total). 2. Ontology Development and Information Extraction (ODIE) corpus: de-identified clinical reports and pathology reports from Mayo Clinic, and de-identified discharge records, radiology reports, surgical pathology reports, and other reports from UPMC (164 files in total).	3.8 MB
6	2012	Temporal Relations	Sun W, Rumshisky A, Uzuner Ö. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. J Am Med Inform Assoc. 2013;20(5):806-813. https://doi.org/10.1136/amiajnl-2013-001628 Sun W, Rumshisky A, Uzuner Ö. Annotating temporal information in clinical narratives. J Biomed Inform. 2013;46(Supplement):S5-S12. https://doi.org/10.1016/j.jbi.2013.07.004	Clinical history and hospital course sections of 310 de-identified discharge summaries from Partners Healthcare and the Beth Israel Deaconess Medical Center, with annotations of clinical events, temporal expressions and temporal relations.
7	2014	Deidentification & Heart Disease Risk	Kumar V, Stubbs A, Shaw S, Uzuner Ö. Creation of a new longitudinal corpus of clinical narratives. J Biomed Inform. 2015;58(Supplement):S6-S10. https://doi.org/10.1016/j.jbi.2015.09.018 Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J Biomed Inform. 2015;58(Supplement):S11-S19. https://doi.org/10.1016/j.jbi.2015.06.007 Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015;58(Supplement):S20-S29. https://doi.org/10.1016/j.jbi.2015.07.020	1304 de-identified longitudinal medical records from Partners HealthCare describing 296 patients, selected to support research into the progression of Coronary Artery Disease (CAD) in diabetic patients
8	2016	Deidentification and Symptom Severity (RDoC for Psychiatry)	Uzuner Ö, Stubbs A, Filannino M. A natural language processing challenge for clinical records: Research Domains Criteria (RDoC) for psychiatry. Journal of Biomedical Informatics 2017;75(Supplement):S1-S3. https://doi.org/10.1016/j.jbi.2017.10.005 Stubbs A, Filannino M, Uzuner Ö. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1. Journal of Biomedical Informatics 2017;75(Supplement):S4-S18. https://doi.org/10.1016/j.jbi.2017.06.011 Filannino M, Stubbs A, Uzuner Ö. Symptom severity prediction from neuropsychiatric clinical records: Overview of 2016 CEGS N-GRID shared tasks Track 2. Journal of Biomedical Informatics 2017;75(Supplement):S62-S70. https://doi.org/10.1016/j.jbi.2017.04.017	1,000 deidentified and annotated psychiatric intake notes (Note that because of IRB limitations, the 2016 data will not be made available for research outside the original challenge.)	n/a
9	2018	Cohort Selection and Adverse Drug Events & Medication Extraction	Stubbs A, Filannino M, Soysal E, Henry S, Uzuner Ö. Cohort selection for clinical trials: n2c2 2018 shared task track 1. Journal of the American Medical Informatics Association 2019;26(11):1163–1171. https://doi.org/10.1093/jamia/ocz163 Henry S, Buchan K, Filannino M, Stubbs A, Uzuner Ö. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association 2020;27(1):3-12. https://doi.org/10.1093/jamia/ocz166	Cohort Selection: De-identified longitudinal medical records from Partners HealthCare for 288 patients from the 2014 challenge, with 2-5 records per patient. All the patients in this dataset have diabetes, and most are at risk for heart disease. Adverse Drug Events & Medication Extraction: 505 discharge summaries drawn from the MIMIC-III (Medical Information Mart for Intensive Care-III) clinical care database, selected using a query that searched for an ADE in the International Classification of Diseases code description of each record
10	2019	Clinical Semantic Textual Similarity, Family History Extraction, and Clinical Concept Normalization	Wang Y, Fu S, Shen F, Henry S, Uzuner Ö, Liu H. The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview. JMIR Med Inform 2020;8(11):e23375. https://doi.org/10.2196/23375 Henry S, Wang Y, Shen F, Uzuner Ö. The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task on clinical concept normalization for clinical records. Journal of the American Medical Informatics Association 2020;27(10):1529-1537. https://doi.org/10.1093/jamia/ocaa106

National NLP Clinical Challenges (n2c2)

Continuing the legacy of the i2b2 NLP Shared Tasks
