The BioMassters Challenge Starter Code

Getting Started with MATLAB

Hello all! We at MathWorks®, in collaboration with DrivenData, are excited to bring you The BioMassters Challenge. The objective of this challenge is to estimate the annual aboveground biomass (AGBM) in a given patch of Finland when provided satellite imagery of that patch. In this blog we are providing a basic starter example in MATLAB®. In this code, I create a basic image-to-image regression model and train it to predict annual AGBM for each pixel in the input data. Then I use this model on test data and save the results in the format required for the challenge.
This should serve as basic starting code to help you to start analyzing the data and work towards developing a more efficient, optimized, and accurate model using more of the training data available. To request your complimentary MATLAB license and other getting started resources, visit the MathWorks BioMassters challenge homepage. You can also access and download this code by visiting this GitHub repository.
agbm_Banner.png
[Fig 1.1: a visualization of AGBM in color. Image derived from data provided by the Finnish Forest Centre under the CC BY 4.0 license]

Table of Contents

1. High level Overview of the Dataset
2. Import and Preprocess the data
3. Create the Model
3.1. Using Deep Network Designer
3.2. Programmatically
4. Set Training Preferences
5. Train the Network
6. Evaluate the Model on New Data
6.1. Visualize
6.2 Calculate Performance Metrics
7. Possible Next Steps for Improvement
8. Predict on Test Data & Export Results
9. Additional Resources
10. Helper Functions

The Data

Each chip_id represents one patch of land in a given year. For each chip, you are provided approximately 24 satellite images and 1 AGBM image.
The satellite imagery comes from two satellites called Sentinel-1 (S1) and Sentinel-2 (S2), covering nearly 13,000 patches of forest in Finland from 2016 to 2021. Each chip is 2,560 by 2,560 meters, and the images of these chips are 256 by 256 pixels, so each pixel represents a 10 by 10 meter area of land within the chip. You are provided a single image from each satellite for each calendar month. For S1, each image is generated by taking the mean across all images acquired by S1 for the chip during that time. For S2, you are provided the best image for each month.
The AGBM image serves as the label for each chip in a given year. Just like the satellite data, the AGBM data is provided in the form of images that cover 2,560 meter by 2,560 meter areas at 10 meter resolution, which means they are 256 by 256 pixels in size. Each pixel in the satellite imagery corresponds to a pixel in the AGBM image with the same chip ID.
For the competition, you will use this data to train a model that can predict this AGBM value when provided with only the satellite imagery. To learn more about the images, features, labels and submission metrics, head over to the challenge's Problem Description page!

Preview the Data

To understand the data that we will be working with, let's look at a few example images for a specific chip_id. In the sections below, the images correspond to chip_id 0a8b6998.
First, define a variable that points to the S3 bucket so that we can access the data. You can find this path in the 'biomassters_download_instructions.txt' file provided on the data download page. Make sure this is the path for the entire bucket, not any specific folder - it should start with 's3://'.
% Example path, you will need to replace this
s3Path = 's3://competition-bucket-name-location/';
This will be used throughout the blog.

Sentinel-1:

For each chip_id, we expect to see 12 images from Sentinel-1 with the naming convention {chip_id}_S1_{month}, where month is a value between 00 and 11. There are cases where there may be missing data, which could result in one or more of these images missing.
Each Sentinel-1 image has four bands, where each band is one 256x256 matrix that contains a specific measurement for the chip. Let's visualize each band of one of these S1 images:
exampleS1Path = fullfile(s3Path, 'train_features', '0a8b6998_S1_00.tif');
exampleS1 = imread(exampleS1Path);
 
% To visualize each layer, rescale the values of each pixel to be between 0 and 1
% Darker pixels indicate lower values, ligher pixels indicate higher values
montage(rescale(exampleS1));

Sentinel-2:

Much like Sentinel-1, for each chip_id, we expect to see 12 images from Sentinel-2 with the naming convention {chip_id}_S2_{month}, where month is a value between 00 and 11. There are cases where there may be missing data, which could result in one or more of these images missing.
Each Sentinel-2 image has 11 bands, where each band is one 256x256 matrix that contains a specific measurement for the chip. Let's visualize each band of one of these S2 images:
exampleS2Path = fullfile(s3Path, 'train_features', '0a8b6998_S2_00.tif');
exampleS2 = imread(exampleS2Path);
 
% To visualize each layer, rescale the values of each pixel to be between 0 and 1
% Darker pixels indicate lower values, ligher pixels indicate higher values
montage(rescale(exampleS2));

AGBM:

For each chip_id, there will be one AGBM image, with the naming convention {chip_id)_agbm.tif. This image is a 256x256 matrix, where each element is a measurement of aboveground biomass in tonnes for that pixel. For 0a8b6998, it looks like this:
exampleAGBMPath = fullfile(s3Path, 'train_agbm', '0a8b6998_agbm.tif');
exampleAGBM = imread(exampleAGBMPath);
 
% Since we only have to visualize one layer, we can use imshow
imshow(rescale(exampleAGBM))

Import and Preprocess the Data

Before we can start building a model, we have to find a way to get the data into the MATLAB Workspace. The data for this competition is contained in a public Amazon S3® bucket. The URL for this bucket will be provided once you have registered, so make sure you have signed up for the challenge so you can access the data. In total, all of the imagery provided is about 290GB in size, which is too much to work with all at once. So that we can work with all of the data, I will be taking advantage of MATLAB's imageDatastore, which allows us to read the data in one chip_id at a time and will make it easy to train a neural network later on. If you want to learn more about datastores, you can refer to the following resources:
  1. Getting Started with Datastore
  2. Datastores for Deep Learning
  3. Datastore for Image Data
We use the s3Path variable we created earlier to create a agbmFolder, which points specifically to the AGBM training data.
agbmFolder = fullfile(s3Path, 'train_agbm');
We can then use agbmFolder to create a datastore for our input (Satellite imagery) and output (AGBM imagery) data, named imInput and imOutput respectively. When you use an imageDatastore, you can change the way images from the specified directory are read in to the MATLAB Workspace using the 'ReadFcn' option. Since I want to read one AGBM image but 24 satellite images at a time, I define a helper function readTrainingSatelliteData that reads the filename of the AGBM file we will read, which contains the chip_id, and instead reads in and preprocesses all corresponding satellite images. Then I use the built-in splitEachLabel function to divide the dataset into training, testing, and validation data, so that we can evaluate its performance during and after training. For this example, I chose to use 95% of the data for training, 2.5% for validation and 2.5% for testing because I wanted to use most of the data for training, but you can play around with these numbers.
The readTrainingSatelliteData helper function does the following:
  1. Extracts the chip_id from the filename of the AGBM image that we will read
  2. Reads in and orders all satellite images that correspond to this chip_id
  3. Handles missing data. Since this is just our first model, I have decided to omit any images that contain missing data.
  4. With the remaining data, finds the average value of each pixel for each band.
  5. Rescales the values to be between 0 and 1. Each satellite has different units of measurement, which can make it difficult for some algorithms to learn from the data properly. By normalizing the data scale, it may allow the neural network to learn better.
This results in a single input image of size 256x256x15, where each 256x256 matrix represents the average values for one band from S1 or S2 over the course of the year. Since S1 has 4 bands and S2 has 11, this results in 15 matrices. This is a very simplified way to represent the data, as this will only be our starting model.
imInput = imageDatastore(agbmFolder, 'ReadFcn', @(filename)readTrainingSatelliteData(filename, s3Path), 'LabelSource', 'foldernames');
[inputTrain,inputVal,inputTest] = splitEachLabel(imInput,0.95,0.025);
For the output data, we will use the default read function, as we only need to read one image at a time and don't need to do any preprocessing. Since we are passing the same directory to each datastore, we know that they will read the images in the same chip_id order. Once again, split the data into training, testing, and validation data.
imOutput = imageDatastore(agbmFolder, 'LabelSource', 'foldernames');
[outputTrain,outputVal,outputTest] = splitEachLabel(imOutput,0.95,0.025);
Once the data has been preprocessed, we combine the input and output sets so they may be used with our neural network later.
dsTrain = combine(inputTrain, outputTrain);
dsVal = combine(inputVal, outputVal);
dsTest = combine(inputTest, outputTest);
The preview function allows me to view the first item in the datastore, so that we can validate that the inputs (the first item) and outputs (the second item) are what we are expecting them to be:
sampleInputOutput = preview(dsTrain);
montage(rescale(sampleInputOutput{1})); % Input Data
imshow(rescale(sampleInputOutput{2})); % Output Data

Create the Model

Now that the data is imported and cleaned up, we can get started on actually developing a neural network! This challenge is interesting, in that the inputs and outputs are images. Often, neural networks will be used to take an image as input and output a class (image classification) or maybe a specific value (image-to-one regression), as shown below:
[Fig 2.1: visualization of an image classification convolutional neural network]
In this challenge, we are tasked with outputting a new image, so our network structure will need to look a little different:
[Fig 2.2: visualization of an image-to-image convolutional neural network]
No matter the type of network you're using, there are two different ways you can make or edit a deep learning model: with the Deep Network Designer app or programmatically.

Option 1: Create with the Deep Network Designer App

First, we have to choose a network architecture. For this blog, I have decided to create a starting network architecture using the 'unetLayers' function. This function provides a network for semantic segmentation (an image-to-image classification problem), so it can be easily adapted for image-to-image regression. If you want to learn more about other starting architectures, check out this documentation page on Example Deep Learning Networks Architectures.
Since the input images will be 256x256x15, this must also be the input size of the network. For the other options, I chose an arbitrary number of classes since we will change the output layers anyway, and a starting depth of 3.
lgraph = unetLayers([256 256 15], 2,'encoderDepth',3);
From here, I can open the Deep Network Designer app and modify the model interactively. I like this option as it lets me visualize what the network looks like and it's easier to see that I've made the changes I want.
deepNetworkDesigner(lgraph)
When the app opens, it should look similar to the image below. If it's zoomed in on certain layers, you can zoom out to see the full network by pressing the space bar.
[Fig 3.1: Deep Network Designer]
From here, remove the last two layers, and change the "Final-ConvolutionLayer" so that NumFilters is equal to 1. Some tips for using the app for this step: