Machine Learning With A Heart - Benchmark

We've all got to start somewhere. In this post we'll show you how to start using data science to predict heart health!

Casey Fitzpatrick
DrivenData

So you want to harness the power of machine learning but need a place to start? We've got just the task for you: detecting heart disease! Heart disease is the number one cause of death worldwide, so if you're looking to use data science for good you've come to the right place. To learn how to prevent heart disease we must first learn to reliably detect it. That's where you––yes you––come in!

To join the competition, follow this link.

In our brand new warm up competition we're asking you to predict the presence or absence of heart disease given various data about a patient, including resting blood pressure, maximum heart rate, and EKG readings, as well as other information like age and sex. The data comes from the Statlog Heart dataset via the UCI Machine Learning repository. This is one of the smallest, least complex datasets on DrivenData, and a great place to dive into the world of data science competitions.

In this post, we'll walk through a very simple first pass model for predicting heart disease from patient data, showing you how to load the data, make some predictions, and then submit those predictions to the competition.

To get started, we import libraries for loading, manipulating, and visualizing the data.

In [1]:
%matplotlib inline

from pathlib import Path

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
DATA_DIR = Path('..', 'data', 'final', 'public')

Loading the Data

Image and quote from the [Centers for Disease Control and Prevention](https://www.cdc.gov/heartdisease/facts.htm): As plaque builds up in the arteries of a person with heart disease, the inside of the arteries begins to narrow, which lessens or blocks the flow of blood. Plaques can also rupture (break open) and when they do a blood clot can form on the plaque, blocking the flow of blood.

On the data download page, we provide everything you need to get started:

  • Training Values: These are the features you'll use to train a model. There are 13 features in the data, including resting blood pressure, maximum heart rate, and EKG readings, as well as other information like age and sex. Each patient is identified by a unique (random) patient_id, which you can use as an index.
  • Training Labels: These are the labels. Every patient_id in the training values data has a corresponding label in this file. A 0 indicates no heart disease present, whereas a 1 indicates the presence of heart disease.
  • Test Values: These are the features you'll use to make predictions after training a model. We don't give you the labels for these samples; it's up to you to generate probabilities of the presence or absence of heart disease for these patient_ids!
  • Submission Format: This shows the exact rows and columns your submission predictions should have, filled with 0.5 for every patient as a baseline. Your submission to the leaderboard must be in this exact format (with different prediction values, of course) in order to be scored successfully!

Since this is a benchmark, we're only going to use a subset of the features in the dataset. It's up to you to take advantage of all the information!

In [3]:
# for training our model
train_values = pd.read_csv(DATA_DIR / 'train_values.csv', index_col='patient_id')
train_labels = pd.read_csv(DATA_DIR / 'train_labels.csv', index_col='patient_id')

Let's take a look at the head of our training features.

In [4]:
train_values.head()
Out[4]:
| patient_id | slope_of_peak_exercise_st_segment | thal | resting_blood_pressure | chest_pain_type | num_major_vessels | fasting_blood_sugar_gt_120_mg_per_dl | resting_ekg_results | serum_cholesterol_mg_per_dl | oldpeak_eq_st_depression | sex | age | max_heart_rate_achieved | exercise_induced_angina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0z64un | 1 | normal | 128 | 2 | 0 | 0 | 2 | 308 | 0.0 | 1 | 45 | 170 | 0 |
| ryoo3j | 2 | normal | 110 | 3 | 0 | 0 | 0 | 214 | 1.6 | 0 | 54 | 158 | 0 |
| yt1s1x | 1 | normal | 125 | 4 | 3 | 0 | 2 | 304 | 0.0 | 1 | 77 | 162 | 1 |
| l2xjde | 1 | reversible_defect | 152 | 4 | 0 | 0 | 0 | 223 | 0.0 | 1 | 40 | 181 | 0 |
| oyt4ek | 3 | reversible_defect | 178 | 1 | 0 | 0 | 2 | 270 | 4.2 | 1 | 59 | 145 | 0 |
In [5]:
train_values.dtypes
Out[5]:
slope_of_peak_exercise_st_segment         int64
thal                                     object
resting_blood_pressure                    int64
chest_pain_type                           int64
num_major_vessels                         int64
fasting_blood_sugar_gt_120_mg_per_dl      int64
resting_ekg_results                       int64
serum_cholesterol_mg_per_dl               int64
oldpeak_eq_st_depression                float64
sex                                       int64
age                                       int64
max_heart_rate_achieved                   int64
exercise_induced_angina                   int64
dtype: object
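Note that thal is a categorical (object) column. The benchmark below sticks to numeric features, but if you want to use it, one simple option is one-hot encoding with pandas (a sketch; the train_values_encoded name is just for illustration):

# one-hot encode the categorical `thal` column into indicator variables
train_values_encoded = pd.get_dummies(train_values, columns=['thal'])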

And the labels

In [6]:
train_labels.head()
Out[6]:
heart_disease_present
patient_id
0z64un 0
ryoo3j 0
yt1s1x 1
l2xjde 1
oyt4ek 0

Explore the Data

In [7]:
train_labels.heart_disease_present.value_counts().plot.bar(title='Number with Heart Disease')
Out[7]:
[Bar plot: Number with Heart Disease]

The data is relatively well-balanced, so we won't take any steps here to equalize the classes.
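If you'd rather see the raw numbers behind that plot, a quick check is:

# proportion of patients in each class
train_labels.heart_disease_present.value_counts(normalize=True)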

In [8]:
selected_features = ['age', 
                     'sex', 
                     'max_heart_rate_achieved', 
                     'resting_blood_pressure']
train_values_subset = train_values[selected_features]

A quick look at the relationships between our features and labels

In [9]:
sns.pairplot(train_values.join(train_labels), 
             hue='heart_disease_present', 
             vars=selected_features)
Out[9]:
[Pair plot of the selected features, colored by heart_disease_present]

The Error Metric – LogLoss

The metric in this competition is logarithmic loss, or log loss, which uses the probabilities of class predictions and the true class labels to generate a number that is closer to zero for better models, and exactly zero for a perfect model.
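For N patients with true labels $y_i$ (0 or 1) and predicted probabilities $\hat{y}_i$, the binary log loss is

$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$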

You can see from the formula for log loss that highly confident (probability close to one) wrong answers will contribute more to the total log loss number. This property of log loss makes it a more informative alternative to accuracy. Below we'll use the Scikit Learn implementation of log loss to evaluate our model before submitting to the leaderboard.
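As a quick illustration (two made-up patients, not competition data), a confidently wrong prediction costs far more than a confidently correct one:

from sklearn.metrics import log_loss

# two hypothetical patients: one with heart disease (1), one without (0)
y_true = [1, 0]

# confident and correct predictions: log loss ≈ 0.11
print(log_loss(y_true, [0.9, 0.1]))

# confident and wrong predictions: log loss ≈ 2.30
print(log_loss(y_true, [0.1, 0.9]))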

Build the Model

When it comes to classic first pass models, few can contend with logistic regression. This linear model is fast to train, easy to understand, and typically does pretty well "out of the box".

Below we'll combine the Scikit Learn logistic regression model with a preprocessing tool using Pipeline and GridSearchCV, two of sklearn's tools for streamlining the process of model training and hyperparameter optimization. You may be new to machine learning, but there's no better time to start developing good habits, and using pipelines is a good habit that you won't regret learning.

Logistic Regression

In [10]:
# for preprocessing the data
from sklearn.preprocessing import StandardScaler

# the model
from sklearn.linear_model import LogisticRegression

# for combining the preprocess with model training
from sklearn.pipeline import Pipeline

# for optimizing parameters of the pipeline
from sklearn.model_selection import GridSearchCV

In Pipelines you pass a list of "steps" in the form of tuples, with a step name in the first field and the object associated with the step in the second. Pipeline then creates one "estimator" object that you can pass data into for training and prediction.

In [11]:
pipe = Pipeline(steps=[('scale', StandardScaler()), 
                       ('logistic', LogisticRegression())])
pipe
Out[11]:
Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logistic', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

Grid search allows you to try out different parameters in the pipeline. Below, we try out different values for Scikit Learn's logistic regression "C" parameter as well as its regularization method.

To specify that these parameters are for the LogisticRegression part of the pipeline and not the StandardScaler part, the keys in our parameter grid (a python dictionary) take the form stepname__parametername. (Note the double underscore!)

The CV in GridSearchCV stands for cross-validation, another best practice that helps prevent overfitting your model.

In [12]:
param_grid = {'logistic__C': [0.0001, 0.001, 0.01, 1, 10], 
              'logistic__penalty': ['l1', 'l2']}
gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  cv=3)

With the parameter grid we've created (10 parameter combinations) and 3-fold cross-validation, we're about to fit 30 models and take the best one!

In [13]:
gs.fit(train_values_subset, train_labels.heart_disease_present)
Out[13]:
GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logistic', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'logistic__C': [0.0001, 0.001, 0.01, 1, 10], 'logistic__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Let's take a look at the best parameters.

In [14]:
gs.best_params_
Out[14]:
{'logistic__C': 1, 'logistic__penalty': 'l2'}

And the in-sample log loss score. Notice that since log loss wants the class probabilities, we call predict_proba and not simply predict, which would return the predicted labels, not their probabilities.

In [15]:
from sklearn.metrics import log_loss

in_sample_preds = gs.predict_proba(train_values[selected_features])
log_loss(train_labels.heart_disease_present, in_sample_preds)
Out[15]:
0.546686009615344
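Keep in mind this is an in-sample score, which tends to be optimistic. If you want a rough out-of-sample estimate before submitting, one option is scikit-learn's cross_val_score with the neg_log_loss scorer (a sketch; your exact numbers will vary):

from sklearn.model_selection import cross_val_score

# 3-fold cross-validated log loss for the best pipeline
# (sklearn negates the loss so that higher is better, so flip the sign back)
cv_scores = -cross_val_score(gs.best_estimator_,
                             train_values_subset,
                             train_labels.heart_disease_present,
                             cv=3,
                             scoring='neg_log_loss')
cv_scores.mean()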

Time to Predict and Submit

For the log loss, we'll be using the class probabilities, not the class predictions. This means we call the predict_proba method as above. Further, we'll keep only the second column, which corresponds to the probability of the positive case, heart_disease_present = 1.

Let's load up the data, process it, and see what we get on the leaderboard.

In [16]:
test_values = pd.read_csv(DATA_DIR / 'test_values.csv', index_col='patient_id')

Select the subset of features we used to train the model.

In [17]:
test_values_subset = test_values[selected_features]

Make Predictions

Again, note that we take only the second column.

In [18]:
predictions = gs.predict_proba(test_values_subset)[:, 1]

Save Submission

We can use the column name and index from the submission format to ensure our predictions are in the right form.

In [19]:
submission_format = pd.read_csv(DATA_DIR / 'submission_format.csv', index_col='patient_id')
In [20]:
my_submission = pd.DataFrame(data=predictions,
                             columns=submission_format.columns,
                             index=submission_format.index)
In [21]:
my_submission.head()
Out[21]:
heart_disease_present
patient_id
olalu7 0.604407
z9n6mx 0.053553
5k4413 0.752436
mrg7q5 0.097412
uki4do 0.791249
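Before saving, a quick sanity check (optional, but a good habit) confirms that our rows and columns line up with the submission format:

# our index and columns should match the submission format exactly
assert (my_submission.index == submission_format.index).all()
assert (my_submission.columns == submission_format.columns).all()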
In [22]:
my_submission.to_csv('submission.csv')

Check the head of the saved file.

In [23]:
!head submission.csv
patient_id,heart_disease_present
olalu7,0.6044072540043635
z9n6mx,0.05355289481734789
5k4413,0.7524357630080116
mrg7q5,0.09741203030747234
uki4do,0.7912486789601635
kev1sk,0.08747047862618101
9n6let,0.5409852616595142
jxmtyg,0.6565676676376131
51s2ff,0.301744665671968

Submit to leaderboard

Woohoo! It's a start! And that's exactly what we intend with these benchmarks. We're sure you'll be able to top this model in no time, and we can't wait to see what you come up with. Happy importing!
