Huzzah. Our new competition predicting fog water collection from weather data is up! Let's put on our DATA SCIENCE HATS and get to work.

TL;DR version

Here's what we're doing right now: treating the microclimate measurements as independent snapshots in time and using them as predictors of the outcome variable. You can grab the notebook on Github and follow along at home.

Cool stuff we're not doing (but that the winners will undoubtedly explore)

Here's what we're NOT doing right now:

  • Using macro-climate weather data at different levels of sampling to add information and reduce variance.
  • Any form of time series modeling, like, at all.
  • Incorporating more complex models of the underlying system we wish to model. And by complex here we mean "vaguely realistic."

Here's a new term for y'all courtesy of the DrivenData team:

BM;IC: Basic Model; Ignored Complexity

Somebody throw it on Urban Dictionary — let's make fetch happen. Okay, time to load up the tools of the trade.

In [1]:
%matplotlib inline
from __future__ import print_function

from matplotlib import pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

Loading the data

For this benchmark, we are only going to use microclimate data.

In [2]:
microclimate_train = pd.read_csv('data/eaa4fe4a-b85f-4088-85ee-42cabad25c81.csv',
                                 index_col=0, parse_dates=[0])

microclimate_test = pd.read_csv('data/fb38df29-e4b7-4331-862c-869fac984cfa.csv',
                                index_col=0, parse_dates=[0])

labels = pd.read_csv('data/a0f785bc-e8c7-4253-8e8a-8a1cd0441f73.csv',
                     index_col=0, parse_dates=[0])

submission_format = pd.read_csv('data/submission_format.csv',
                                index_col=0, parse_dates=[0])

Looking at the training data (we are given labels for these)

First of all, let's just "walk around" this data a bit and use the common summary functions to get a better idea of what it looks like.

In [3]:
microclimate_train.shape
Out[3]:
(5802, 9)
In [4]:
microclimate_train.describe()
Out[4]:
percip_mm humidity temp leafwet450_min leafwet460_min leafwet_lwscnt gusts_ms wind_dir wind_ms
count 5781.000000 5781.000000 5781.000000 5781.000000 4617.000000 5781.000000 5794.000000 5794.000000 5794.000000
mean 0.078972 0.554852 15.566805 0.991033 0.799972 457.476362 3.381701 135.184925 2.827167
std 0.973970 0.282715 7.126274 1.903983 1.743959 48.172783 1.832613 96.186550 1.637490
min 0.000000 0.000000 0.000000 0.000000 0.000000 297.625000 0.000000 0.000000 0.000000
25% 0.000000 0.319978 9.937500 0.000000 0.000000 438.583333 2.007544 60.760417 1.588485
50% 0.000000 0.496859 14.470833 0.000000 0.000000 441.625000 3.143336 102.041667 2.619792
75% 0.000000 0.827473 20.937500 0.000000 0.000000 447.000000 4.538977 209.656250 3.867351
max 24.250000 1.072792 36.508334 5.173913 5.000000 1023.000000 11.518700 355.000000 10.092204
In [31]:
microclimate_train.tail()
Out[31]:
percip_mm humidity temp leafwet450_min leafwet460_min leafwet_lwscnt gusts_ms wind_dir wind_ms
2015-12-22 14:00:00 0 0.342318 15.287500 0 0 439.041667 4.471919 78.458333 3.277452
2015-12-22 16:00:00 0 0.343302 14.754167 0 0 439.958333 3.109807 80.166667 2.099749
2015-12-22 18:00:00 0 0.351736 13.675000 0 0 440.916667 3.344510 71.375000 2.556580
2015-12-22 20:00:00 0 0.363110 12.862500 0 0 441.000000 4.375524 72.875000 3.973177
2015-12-22 22:00:00 0 0.377436 12.741667 0 0 441.000000 4.798827 65.333333 4.400671
In [6]:
microclimate_train.isnull().sum(axis=0)
Out[6]:
percip_mm           21
humidity            21
temp                21
leafwet450_min      21
leafwet460_min    1185
leafwet_lwscnt      21
gusts_ms             8
wind_dir             8
wind_ms              8
dtype: int64

Here's the big takeaway: we will have to deal with some missing values.

Looking at the test data (we will be predicting labels for these rows)

In [7]:
microclimate_test.shape
Out[7]:
(1110, 9)
In [8]:
microclimate_test.describe()
Out[8]:
percip_mm humidity temp leafwet450_min leafwet460_min leafwet_lwscnt gusts_ms wind_dir wind_ms
count 1110.000000 1110.000000 1110.000000 1110.000000 918.000000 1110.000000 1110.000000 1110.000000 1110.000000
mean 0.015060 0.517156 15.563383 0.884949 0.527505 453.707082 3.348732 132.093260 2.805653
std 0.154859 0.265408 6.979713 1.816510 1.448019 35.884056 2.051050 94.138572 1.831789
min 0.000000 0.052849 2.237500 0.000000 0.000000 374.666667 0.000000 0.000000 0.000000
25% 0.000000 0.300331 10.662500 0.000000 0.000000 438.333333 1.857712 58.104167 1.433361
50% 0.000000 0.455093 14.272917 0.000000 0.000000 440.895833 2.902347 100.500000 2.405700
75% 0.000000 0.719250 20.813542 0.000000 0.000000 444.833333 4.625943 209.270833 3.971081
max 3.575000 1.037941 35.525000 5.041667 5.000000 690.833333 10.276614 349.666667 9.354568
In [10]:
microclimate_test.isnull().sum(axis=0)
Out[10]:
percip_mm           0
humidity            0
temp                0
leafwet450_min      0
leafwet460_min    192
leafwet_lwscnt      0
gusts_ms            0
wind_dir            0
wind_ms             0
dtype: int64

So again, we will definitely need to pay attention to missing values and figure out smart ways to deal with them. File under #datascientistproblems.
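One simple, time-respecting way to handle those gaps is to forward-fill each column so that an imputed value only draws on measurements from earlier in time. This is just a sketch under that assumption (the `_filled` variable names are ours and it is not part of the benchmark model below):

# Sketch only: forward-fill so imputed values only use past observations,
# then back-fill whatever gaps remain at the very start of the series.
microclimate_train_filled = microclimate_train.fillna(method='ffill').fillna(method='bfill')
microclimate_test_filled = microclimate_test.fillna(method='ffill').fillna(method='bfill')

Anything fancier, like leaning on the macro-climate data to fill gaps, is left as an exercise for the leaderboard.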

Let's also plot all the data points in the training and test data to get a sense for how the data set is split:

In [11]:
fig, axs = plt.subplots(nrows=microclimate_train.shape[1], ncols=1, sharex=True, figsize=(16, 18))

columns = microclimate_train.columns
for i, ax in enumerate(axs):
    col = columns[i]
    ax.plot_date(microclimate_train.index, microclimate_train[col], ms=1.5, label='train')
    ax.plot_date(microclimate_test.index, microclimate_test[col], ms=1.5, color='r', label='test')
    ax.set_ylabel(col)
    
    if i == 0:
        ax.legend(loc='upper right', markerscale=10, fontsize='xx-large')

This isn't your grandpa's random train/test split

Here's another fun insight: this problem has a time component, and in the real world we are trying to predict the future. That is, we're trying to figure out the upcoming yield based on current weather. For those of us concerned about overfitting (hint: all of us), we will need to think hard about our modeling assumptions (a sketch of a more honest, time-aware validation split follows below).

So, things that we could do but probably shouldn't:

  • Imputing missing values using all of the data.
  • Treating every data point as if it stands alone and is independent from other points in time.
  • Drawing on weather that hasn't happened yet to inform our current predictions.

But this is a benchmark and we're not all about rules on this blog. Watch us break every single one of these cautionary warnings below! (That's why they call it a benchmark. (Actually that statement doesn't make sense, we don't know why it's called a benchmark. (HOLY MOLY, TOO MANY PARENTHESES #inception #common-lisp)))
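For contrast, if you want to validate your own models without peeking into the future, one reasonable approach is to hold out the most recent chunk of the training period. This is only a sketch, and the cutoff date is arbitrary and ours, not something baked into the competition:

# Sketch only: split on a date cutoff rather than shuffling rows at random,
# so the validation set sits strictly after the training set in time.
cutoff = pd.Timestamp('2015-06-01')  # arbitrary illustrative cutoff
in_past = microclimate_train.index < cutoff

X_train, X_valid = microclimate_train[in_past], microclimate_train[~in_past]
y_train, y_valid = labels[in_past], labels[~in_past]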

Looking at relationships between inputs and yield

In [12]:
print('train', microclimate_train.shape)
print('labels', labels.shape)
train (5802, 9)
labels (5802, 1)

In [13]:
microclimate_train.columns.tolist()
Out[13]:
['percip_mm',
 'humidity',
 'temp',
 'leafwet450_min',
 'leafwet460_min',
 'leafwet_lwscnt',
 'gusts_ms',
 'wind_dir',
 'wind_ms']
In [14]:
wanted_cols = [u'percip_mm', u'humidity', u'temp', u'leafwet450_min', u'leafwet_lwscnt', u'wind_ms']
wanted = microclimate_train[wanted_cols].copy().dropna()
wanted['yield'] = labels['yield']

sns.pairplot(wanted, diag_kind='kde')
plt.show()
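If the pairplot is a lot to squint at, a quick complementary check (again just a sketch, not something the benchmark relies on) is to rank the chosen inputs by their linear correlation with the yield:

# Sketch only: Pearson correlation of each input with the yield. This ignores
# lags and nonlinearity entirely, so treat it as a rough first look.
correlations = wanted.corr()['yield'].drop('yield').sort_values()
print(correlations)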