Deploy your H2O model without hassle in plain SQL

Nathan Sutton
3 min readMar 6, 2019
or is SQL the new ML?

I consistently feel (warranted) pressure to improve my model with more elaborate feature engineering only to later facepalm when I need to put that model into production. The path forwards to deploy a complex model is clear, just make an API! This post represents the other end of the spectrum. Let’s do as much in the database as possible.

tl;dr

You can still leverage all the power of Python/R to build and validate a generalized linear model. These functions will allow you to quickly deploy your model in a database as raw SQL code.

Who this is for

I have written this especially for

  • Data Science practitioners
  • who are asked to deliver interpretable linear models quickly
  • whose data starts in a SQL database
  • whose predictions end up in a SQL database
  • who want to keep is simple

Functions

In this post we’ll use the R version.

But you could just as easily use the equivalent Python.

Walkthrough

In this post I’ll build a purposefully simple model for all-cause readmissions based on a demo sample of the MIMIC-III critical care database. This is a relational database that keeps track of all of a patient’s medical information (labs, diagnoses, vitals, etc) during a single hospital stay.

I purposefully did all the feature engineering / denormalization in SQL so that my deployment script would be this simple.

In another post I’ll speak about how I got to this view from the MIMIC source tables. The important thing to know here is that the rows represent a cross-section of individuals across time. This is a really simple model, and so each of the columns represent a binary indicator (encoded as 1=present, 0=absent).

Now that we’ve inspected the data, let’s bring it into a dataframe.

With our data in hand we can build a simple logistic regression model with H2O.

Then translate it into SQL.

The power here is that we can cut R (or Python) out of the loop entirely. All that we need to do is put this into an appropriate SQL structure, like a view.

This function is only useful if it faithfully returns the same predictions. Hooray for the 1:1 line!

Make your life simple

Some of my most successful deployments have followed this pattern. Without any complications from R or Python code to mess anything up, these views silently did their job and never failed.

Feel free to reach out and connect if this is helpful to you!

--

--