10 excellent healthcare data sources

Nathan Sutton
5 min read · Mar 30, 2022

Spanning surveys, electronic records, and medical images.

Over the years I have taken on a number of mentees interested in applying data to healthcare. Their biggest roadblock to developing useful products is the lack of available healthcare data. The same applies to startups: the sales cycle of convincing a HIPAA covered entity to share data in exchange for a new product can take months or even years.

Fortunately, depending on your needs, there is a reasonable diversity of healthcare data available for free. You just have to be willing to jump through some hoops.

tl;dr: you can find the entire list and contribute your own on my GitHub.

Electronic Health Records

MIMIC — EHRs are the golden goose of healthcare data sources, but good luck getting even read-only access to one. My first recommendation for people interested in healthcare data is the Medical Information Mart for Intensive Care, or MIMIC, database. In particular, focus on the translation of MIMIC into the Observational Medical Outcomes Partnership (OMOP) data model, a harmonized relational schema. That will at least give you a fighting chance of not having to rewrite every pipeline you build when you land a real client.

  • Cost: Free
  • Size: 49,785 hospital admissions of 38,597 patients
  • Requires: Application, human subjects research course, turnaround time of several weeks
  • Includes: demographics, vitals, labs, procedures, medications, nursing notes, imaging reports, mortality events
  • Reference: https://doi.org/10.13026/C2XW26
  • Access: https://mimic.mit.edu/docs/gettingstarted
  • Temporal coverage: 2001 through 2012
  • Source: Single Institution — Beth Israel Deaconess Medical Center
  • Owner: MIT Laboratory for Computational Physiology
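
To illustrate why a harmonized schema helps, here is a minimal sketch of a query written only against standard OMOP tables. The table and column names (person, visit_occurrence, person_id, visit_start_date) follow the OMOP Common Data Model, but the in-memory SQLite database and its rows are invented stand-ins, not MIMIC data, and the helper name is hypothetical.

```python
import sqlite3

# Toy in-memory stand-in for an OMOP-formatted database. Table and column
# names follow the OMOP Common Data Model; every row is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
    CREATE TABLE visit_occurrence (
        visit_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(person_id),
        visit_start_date TEXT,
        visit_end_date TEXT
    );
    INSERT INTO person VALUES (1, 1950), (2, 1980);
    INSERT INTO visit_occurrence VALUES
        (10, 1, '2010-01-01', '2010-01-05'),
        (11, 1, '2011-03-02', '2011-03-03'),
        (12, 2, '2012-06-10', '2012-06-20');
""")


def visits_per_person(connection):
    """Count visits per person using only standard OMOP tables."""
    query = """
        SELECT p.person_id, COUNT(v.visit_occurrence_id) AS n_visits
        FROM person AS p
        LEFT JOIN visit_occurrence AS v ON v.person_id = p.person_id
        GROUP BY p.person_id
        ORDER BY p.person_id
    """
    return connection.execute(query).fetchall()


visit_counts = visits_per_person(conn)
print(visit_counts)
```

Because the query touches only OMOP tables, the same function should in principle run against any other source that has been mapped to OMOP, which is the portability argument made above.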

SYNTHEA — If you don’t have time for a full application process, you can quickly spin up synthetic data from the SYNTHEA project that realistically models patient healthcare pathways.

Medical Images

There are a number of open medical image datasets, since images are much easier to anonymize than electronic health records. This list is by no means exhaustive. You’ll notice a common theme: chest x-rays dominate the available data.

MIDRC — The Medical Imaging and Data Resource Center has an interesting project meant to solve many of the problems startups have faced in taking a deep-learning model through FDA clearance. I especially appreciate their emphasis on multi-institution data with a stratified sampling scheme to ensure the validity of any results across diverse populations.

  • Cost: Free
  • Size: 29,630 imaging studies of 10,835 patients (and growing)
  • Labels: COVID
  • Requires: No application
  • Includes: imaging, demographics
  • Reference: https://doi.org/10.1117/1.JMI.8.S1.010902
  • Access: https://data.midrc.org/DD
  • Temporal coverage: 2019 onwards
  • Source: Multiple Institutions
  • Owner: Center for Translational Data Science @ University of Chicago

NIH — The National Institutes of Health chest x-ray data was the catalyst for many deep learning applications in radiology.

  • Cost: Free
  • Size: 108,948 chest radiographs of 32,717 patients
  • Labels: 8 different pathologies
  • Requires: No application
  • Includes: imaging
  • Reference: Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald M. Summers; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2097–2106
  • Access: https://stanfordmlgroup.github.io/competitions/chexpert/
  • Source: Multiple Institutions
  • Owner: National Institutes of Health

CheXpert — More recently, Stanford released CheXpert, with more pathologies and radiographs than the NIH data.

Medical Surveys

NHIS — The National Health Interview Survey is a great place to start if you’re looking for anonymous healthcare data on individuals outside of an electronic health record. Take it with a grain of salt: it’s a survey, so the values are self-reported. These surveys go back to the 1960s, but in practice they are more consistent over the last decade. You can get the data directly from the government, but I would skip the work of harmonizing it yourself and go straight to the IPUMS project at the University of Minnesota. They maintain a slick interactive web application that lets you add variables to a cart and generate an extract.

  • Cost: Free
  • Size: Variable, 50–100k responses annually
  • Requires: No application
  • Includes: self-reported health outcomes, demographics, socioeconomic data
  • Reference: https://doi.org/10.18128/D070.V6.4
  • Access: https://nhis.ipums.org/
  • Temporal coverage: 1963 through present
  • Source: Census Microdata
  • Owner: IPUMS

Medical Claims Databases

The greatest diversity of healthcare data is found in the data trail left by the payments infrastructure.

NIS — The National Inpatient Sample approximates a 20% sample of all hospital discharges, so it is huge. Students can access most of these data at a reasonable cost, at least relative to other claims databases.

  • Cost: $150 to $1000 per year depending on your situation and data selected
  • Size: Huge, this approximates a 20-percent stratified sample of all hospital discharges
  • Requires: Application and 15-minute training course
  • Includes: fine-grain billing details, hospital reported health outcomes, demographics
  • Access: https://www.distributor.hcup-us.ahrq.gov/Databases.aspx
  • Temporal coverage: 2012 through present
  • Source: Multiple Institutions
  • Owner: Healthcare Cost and Utilization Project
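
Because the NIS is a stratified sample rather than a census, raw row counts are not national figures; each record carries a discharge weight that scales it up. The sketch below shows the basic arithmetic of a weighted total and a weighted rate. DISCWT is the discharge-weight variable in the NIS, but the records, the `died` field, and the function names are invented for illustration, and a real analysis would also need survey-aware variance estimates.

```python
# Hypothetical discharge records. DISCWT is the NIS discharge weight;
# the values and the `died` flag are placeholders, not real data.
discharges = [
    {"DISCWT": 5.0, "died": 0},
    {"DISCWT": 5.0, "died": 1},
    {"DISCWT": 4.8, "died": 0},
    {"DISCWT": 5.2, "died": 0},
]


def weighted_total(records, weight_key="DISCWT"):
    """Estimate the population count by summing the sampling weights."""
    return sum(r[weight_key] for r in records)


def weighted_rate(records, flag_key, weight_key="DISCWT"):
    """Estimate the population share of records where flag_key is truthy."""
    flagged = sum(r[weight_key] for r in records if r[flag_key])
    return flagged / weighted_total(records, weight_key)


national_discharges = weighted_total(discharges)
mortality_rate = weighted_rate(discharges, "died")
print(national_discharges, mortality_rate)
```

With weights near 5, each sampled record stands in for roughly five discharges, which is what a 20-percent sample implies.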

MarketScan — If you have a pile of cash or are supported by an institution, you can look into IBM’s all-payer claims database.

SynPUF — Years ago, the Centers for Medicare & Medicaid Services (CMS) created a synthetic extract that mimics their limited claims dataset (the real one is available for purchase). This would make sense if Medicare plans are your primary endpoint.

  • Cost: Free
  • Requires: No application
  • Includes: synthetic billing data that mirrors the limited CMS datasets
  • Reference: Borton, J. M., et al. “Data Entrepreneurs’ Synthetic PUF: A Working PUF as an Alternative to Traditional Synthetic and Non-synthetic PUFs.” JSM Proceedings, Survey Research Methods Section (2010).
  • Access: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF
  • Temporal coverage: 2008 through 2010
  • Source: Multiple institutions
  • Owner: Centers for Medicare & Medicaid Services

Summary Statistics

Dartmouth Health Atlas — If you are just looking for statistics relating to healthcare cost and utilization in a geographic region, look no further than the Dartmouth Health Atlas. I have had a lot of success tying these in as features in ML models when my data is already spatially coded.

  • Cost: Free
  • Requires: No application or restrictions
  • Includes: medical reimbursement rates, access to care, mortality, hospital capacity
  • Access: https://data.dartmouthatlas.org/#rates
  • Temporal coverage: 2011 through 2019
  • Source: Multiple institutions
  • Owner: Dartmouth
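
Tying regional statistics into a model usually comes down to a lookup join on a shared region key. Here is a minimal sketch, assuming patient rows already carry a hospital referral region (HRR) code, one of the geographic units the Atlas reports on. The rates, column names, and function name are all hypothetical.

```python
# Hypothetical region-level statistics keyed by HRR code (values invented).
atlas_rates = {
    "101": {"reimbursement_per_enrollee": 9500.0, "beds_per_1000": 2.1},
    "102": {"reimbursement_per_enrollee": 11200.0, "beds_per_1000": 2.8},
}

# Hypothetical patient-level rows that are already spatially coded.
patients = [
    {"patient_id": "a", "age": 67, "hrr": "101"},
    {"patient_id": "b", "age": 54, "hrr": "102"},
    {"patient_id": "c", "age": 71, "hrr": "999"},  # region not in the table
]


def add_region_features(rows, region_table, key="hrr"):
    """Left-join region-level statistics onto each row as extra features."""
    feature_names = sorted(
        {name for feats in region_table.values() for name in feats}
    )
    enriched_rows = []
    for row in rows:
        region_feats = region_table.get(row[key], {})
        new_row = dict(row)  # copy so the input rows are left untouched
        for name in feature_names:
            # None marks a region with no match, to be imputed downstream
            new_row[name] = region_feats.get(name)
        enriched_rows.append(new_row)
    return enriched_rows


enriched = add_region_features(patients, atlas_rates)
for row in enriched:
    print(row)
```

Keeping unmatched regions as explicit missing values, rather than dropping the rows, leaves the imputation decision to the modeling step.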
