An open database of hospital prices

Nathan Sutton
Sep 7, 2021

Normalized data for healthcare research

tl;dr

You can visit this GitHub repository or the Kaggle dataset to find an open source database of reported hospital prices. So far I have covered most hospitals in North Carolina, and I plan to expand across the Southeast.

Motivation

The Centers for Medicare and Medicaid Services recently required hospitals under 45 CFR §180.50 to publish a list of prices on their websites. They specifically instruct hospitals to make these lists…

  • As a comprehensive machine-readable file with all items and services.
  • In a display of shoppable services in a consumer-friendly format.

Adherence to these policies varies widely. Without strong formatting guidance from CMS, it is no wonder hospitals are all over the map. Many hospitals have complied with the new rules, but in ways that are not consumer friendly. 500 megabytes of JSON data is not a strong start!

Turquoise Health has created a consumer-friendly tool to interactively look up reported prices in different hospital systems. However, my guess is that they would not be happy to share the underlying data they have monetized (but I will ask). This repository fills the gap with open data for health researchers and data people.

Supplied Data

If you don’t have the proclivity to transform these data yourself with Docker, there are CSV extracts available in ./volumes/data/extracts or directly on Kaggle. They are broken down into four distinct groups.

  • gross: this is often the top line item that the hospital never actually charges.
  • cash: this is the self-pay discounted price you would pay without insurance.
  • max: this is the maximum negotiated rate by an insurance company in the hospital network.
  • min: this is the minimum negotiated rate by an insurance company in the hospital network.

A minority of hospitals included the payer- and plan-specific charges as their own columns. I found that hospitals much more frequently reported the de-identified maximum and minimum negotiated prices, so I started there.
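To make the four groups concrete, here is a minimal sketch of how the extracts could be combined into one record per hospital and procedure code. The column names (hospital_id, code, price) and the sample rows are made up for illustration; the real extracts may be laid out differently.

```python
import csv
import io

# Made-up sample extracts, one CSV per price group.
# The column names are assumptions, not the repository's actual schema.
SAMPLE_EXTRACTS = {
    "gross": "hospital_id,code,price\n1,99213,250.00\n",
    "cash": "hospital_id,code,price\n1,99213,150.00\n",
    "min": "hospital_id,code,price\n1,99213,80.00\n",
    "max": "hospital_id,code,price\n1,99213,210.00\n",
}


def combine_price_groups(extracts):
    """Merge per-group CSV rows into one dict per (hospital_id, code)."""
    combined = {}
    for group, text in extracts.items():
        for row in csv.DictReader(io.StringIO(text)):
            key = (row["hospital_id"], row["code"])
            # Store each group's price under its group name.
            combined.setdefault(key, {})[group] = float(row["price"])
    return combined


prices = combine_price_groups(SAMPLE_EXTRACTS)
```

With real files you would read each extract from disk instead of from an in-memory string; the merge step stays the same.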

Ontology

Using standardized codes will ensure that we are talking about the same thing in different hospitals. To accomplish this goal I built on the excellent work of the Athena vocabulary to define the ontology of healthcare procedures. This maps CPT and HCPCS codes into a common data model.

The disadvantage of this normalization is that it excludes hospital-specific items such as room and board charges. For example, a ‘deluxe single’ does not carry the same charge in different hospital systems.
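The idea can be sketched as a lookup from a (vocabulary, code) pair to a shared concept ID, with anything that lacks a standard code set aside. This is only an illustration, not the repository's actual pipeline: the concept IDs, field names, and sample items below are all invented.

```python
# Made-up concept table: (vocabulary, code) -> concept ID.
# Real concept IDs would come from the downloaded vocabulary files.
CONCEPTS = {
    ("CPT4", "99213"): 1001,
    ("HCPCS", "J1100"): 1002,
}


def normalize(items, concepts):
    """Split items into those that map to a standard concept and those that do not."""
    mapped, unmapped = [], []
    for item in items:
        key = (item.get("vocabulary"), item.get("code"))
        concept_id = concepts.get(key)
        if concept_id is None:
            # Hospital-specific items (e.g. room and board) land here.
            unmapped.append(item)
        else:
            mapped.append({**item, "concept_id": concept_id})
    return mapped, unmapped


# Hypothetical chargemaster rows from one hospital.
items = [
    {"description": "Office visit", "vocabulary": "CPT4", "code": "99213"},
    {"description": "Deluxe single", "vocabulary": None, "code": None},
]
mapped, unmapped = normalize(items, CONCEPTS)
```

Keeping the unmapped rows in a separate list, rather than silently dropping them, makes it easy to see how much of a hospital's file falls outside the standard codes.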

A word of caution

I took great care to make the data generation process reproducible inside a Docker container. However, I sacrificed some scalability in the name of development speed. There are some excellent examples of how you could scrape your way to complete automation. Instead, I introduced a manual step: downloading each file and naming it by my generated hospital ID. All other transformations are codified and reproducible in the container.

Where to go from here

I will use these data as a launching point to investigate questions around the economics of hospital prices. The New York Times had a great quote in their recent article ‘Hospitals and Insurers Didn’t Want You to See These Prices. Here’s Why’.

The trade association for insurers said it was “an anomaly” that some insured patients got worse prices than those paying cash.

This seems like an excellent question for data people to answer.
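As a starting point, here is a rough sketch of how one might flag the cases behind that quote, assuming each record carries the cash, min, and max prices for one hospital and procedure. The record layout and the sample numbers are made up for illustration.

```python
def compare_cash_to_negotiated(records):
    """Label each record by how the cash price compares to negotiated rates."""
    results = []
    for rec in records:
        cash, low, high = rec.get("cash"), rec.get("min"), rec.get("max")
        if cash is None or low is None or high is None:
            continue  # Skip records missing any of the three prices.
        if cash < low:
            label = "cash below every negotiated rate"
        elif cash < high:
            label = "cash below at least one negotiated rate"
        else:
            label = "cash at or above all negotiated rates"
        results.append((rec["code"], label))
    return results


# Made-up sample records, not real hospital prices.
records = [
    {"code": "A", "cash": 50.0, "min": 80.0, "max": 200.0},
    {"code": "B", "cash": 150.0, "min": 80.0, "max": 200.0},
    {"code": "C", "cash": 250.0, "min": 80.0, "max": 200.0},
    {"code": "D", "cash": None, "min": 80.0, "max": 200.0},
]
labels = compare_cash_to_negotiated(records)
```

Because only the de-identified minimum and maximum are available, the middle label can say that some insurer negotiated a higher rate than cash, but not which one or how many.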
