Monitoring water quality from satellite imagery

DrivenData created an open-source package to detect harmful algal blooms using machine learning and satellite imagery. The work included running a machine-learning competition, conducting end-user interviews, and engineering a deployable pipeline.

The challenge

Inland water bodies provide a variety of critical services for both human and aquatic life, including drinking water, recreational and economic opportunities, and marine habitats. A significant challenge water quality managers face is the formation of harmful algal blooms (HABs). One of the major types of HABs is cyanobacteria. HABs produce toxins that are poisonous to humans and their pets, and threaten marine ecosystems by blocking sunlight and oxygen.

While there are established methods for using satellite imagery to detect cyanobacteria in larger water bodies like oceans, detection in small inland lakes and reservoirs remains a challenge. Manual water sampling is accurate, but too time-intensive and difficult to perform continuously. Satellite imagery and other remote sensing data can enable faster, more comprehensive monitoring.

The approach

DrivenData hosted the Tick Tick Bloom challenge to rapidly test a wide variety of possible data sources, model architectures, and features. Over 1,300 participants competed to detect cyanobacteria blooms in small, inland water bodies using publicly available satellite, climate, and elevation data.

The competition provided critical research code and a proof of concept for detection. DrivenData then carried the results forward in three ways to turn them into an actionable tool for end users.

  • User interviews: Conducted human-centered design (HCD) interviews to better understand on-the-ground user needs and how best to address them. Understanding current workflows is crucial to technical decisions like prediction format, the most relevant performance metrics, and compute constraints.
  • Model experimentation: Combined and iterated on the most useful pieces from competition-winning models to determine which approaches were the most robust, accurate, and generalizable outside of the competition setting.
  • Code organization: Simplified and restructured code to create a more efficient, configurable, and deployable pipeline.

The results

DrivenData developed an open-source tool, CyFi (Cyanobacteria Finder), which enables satellite-based detection of HAB outbreaks in lakes, reservoirs, and rivers. In a benchmark comparison, we found that CyFi performs at least as well as Sentinel-3-based tools but captures ten times the number of lakes across the U.S. Our work helps water quality managers better allocate resources for in situ sampling, make more informed decisions around when to issue public health warnings, and ultimately keep the human and marine life that rely on small inland water bodies safe and healthy.

Figure: The landing page for CyFi's documentation, showing an overview, quickstart instructions, and a navigation bar with other useful topics.


CyFi is written to reflect best practices in open and reproducible data science, and anyone can use the package or contribute to the code on GitHub. A paper published in the SciPy proceedings documents the full process of creating CyFi in detail.

As part of the Tick Tick Bloom competition, DrivenData aggregated manual cyanobacteria labels from 14 data providers across the U.S., creating a unique, nationally representative ground truth dataset. The full dataset of 23,570 measurements is now publicly available for anyone to learn from.

Partners

The Tick Tick Bloom competition was created on behalf of NASA, with collaboration from NOAA, EPA, USGS, DOD's Defense Innovation Unit, Berkeley AI Research, and Microsoft AI for Earth.

Our real-world impact

Partners: EverFree

A production application to support survivors of human trafficking

Built the Freedom Lifemap platform, a digital tool designed to support survivors of human trafficking on their journey toward reintegration and independence

Partners: Max Planck Institute for Evolutionary Anthropology, Arcus Foundation, WILDLABS

Automating wildlife identification for research and conservation

Detected wildlife in images and videos, automatically and at scale, by building the winning algorithm from a DrivenData competition into an open-source Python package and a web application running models in the cloud.

Partners: Private sector, social sector

Building LLM solutions

Built solutions using LLMs for multiple real-world applications, across tasks including semantic search, summarization, named entity recognition, and multimodal analysis. Work has ranged from research on state-of-the-art models tuned for specific use cases to production-ready, retrieval-augmented AI applications.

Partners: The World Bank, The Conflict and Environment Observatory

Identifying crop types using satellite imagery in Yemen

Used satellite imagery to identify crop extent, crop types and climate risks to agriculture in Yemen, informing World Bank development programs in the country after years of civil war.

Partners: IDEO.org

Illuminating mobile money experiences in Tanzania

Analyzed millions of mobile money records to uncover patterns in behavior, and then combined these insights with human-centered design to shape new approaches to delivering mobile money to low-income populations in Tanzania.

Partners: Insecurity Insight, Physicians for Human Rights

Tracking attacks on health care in Ukraine

Built a real-time, interactive map to visualize attacks on the Ukrainian health care system since the Russian invasion began in February of 2022. The map will support partner efforts to provide aid, hold aggressors accountable in court, and increase public awareness.

Partners: CABI Plantwise

Mining chat messages with plant doctors using language models

Automated recognition of agricultural entities (such as crops, pests, diseases, and chemicals) in WhatsApp and Telegram messages among plant doctors, enabling new ways to surface emerging trends and improve science-based guidance for smallholder farmers.

Partners: Data science company foundation

Matching students with schools where they are likely to succeed

Used machine learning to match students with higher education programs where they are more likely to get in and graduate based on their unique profile, with a focus on backgrounds traditionally less likely to attend college or apply to more competitive programs.

Partners: Fair Trade USA

Mapping fair trade products from source to shelf

Visualized the flow of fair trade coffee products from the farms where they are grown to the stores where they are sold, connecting the nodes in supply chain transactions and increasing transparency for customers and auditors.

Partners: The World Bank, Angaza, GOGLA, Lighting Global

Developing performance indicators and repayment models in off-grid solar

Analyzed repayment behaviors across dozens of pay-as-you-go (PAYG) solar energy companies serving off-grid populations throughout Africa, and developed KPIs to facilitate standardized reporting for PAYG portfolios.

Partners: Haystack Informatics

Modeling patient pathways through hospitals

Mapped out the probabilistic patient journeys through hospitals based on tens of thousands of patient experiences, giving hospitals a better view into the timing of the activities in their departments and how they relate to operational efficiency.

Partners: Yelp, Harvard University, City of Boston

Predicting public health risks from restaurant reviews

Flagged public health risks at restaurants by combining Yelp reviews with open city data on past inspections. An algorithmic approach discovers 25% more violations with the same number of inspections.

Partners: Education Resource Strategies

Smart auto-tagging of K-12 school spending

Built algorithms that put apples-to-apples labels on school budget line items so that districts understand how their spending stacks up and where they can improve, saving months of manual processing each year.

Partners: Love Justice

Building data tools to fight human trafficking in Nepal

Aided anti-trafficking efforts at border crossings and airports by combining data across locations and surfacing insights that give interviewers greater intelligence about the right questions to ask and how to direct them.

Partners: GO2 Foundation for Lung Cancer

Putting AI into the hands of lung cancer clinicians

Translated advances in machine learning research to practical software for clinical settings, building an open source application through a new kind of data challenge.

Partners: Microsoft

Driving data education through custom competitions

Developed online, white-label data science competitions for students to synthesize their learnings and test their skills on applied challenges. Each capstone features a real-world dataset that focuses on an important issue in the social sector.

Work with us to build a better world

Learn more about how our team is bringing the transformative power of data science and AI to organizations tackling the world's biggest challenges.