
1 May 2020

Mining chat messages with plant doctors using language models

Automated recognition of agricultural entities (such as crops, pests, diseases, and chemicals) in WhatsApp and Telegram messages among plant doctors enables new ways to surface emerging trends and improve science-based guidance for smallholder farmers.

The organization

CABI's mission is to improve people’s lives worldwide by providing information and applying expertise to solve problems in agriculture and the environment. Plantwise is a global programme led by CABI that helps farmers lose less of what they grow to plant health problems by providing timely, appropriate, and actionable advice.

CABI has established a global plant clinic network, run by trained plant doctors, where farmers can find practical plant health advice. Farmers visit with samples of their crops, and plant doctors diagnose the problem and make science-based recommendations on ways to manage it. Plant doctors have access to the Plantwise Knowledge Bank, which includes diagnostic resources and best-practice pest management advice.

The challenge

Plant doctors use WhatsApp and Telegram chats to communicate. These messages contain valuable real-time information on plant health and crop issues. Yet without systematic, automated analysis, it is very difficult to identify places where farmers are receiving bad advice or to surface important patterns like emerging pests and trends in agricultural issues.

The approach

This project focused on prototyping methods for entity extraction: the process of automatically identifying and categorizing the agricultural entities that users are discussing, such as plants and pests.

Example hand-labeled message containing a crop, a pathogen, and many symptoms.

To accomplish this objective, the DrivenData team worked with CABI to:

Collect data sources into an entity knowledge base
Create "gold labels" based on subject matter knowledge for evaluation
Develop a baseline pattern matching model
Train a statistical named entity recognition (NER) model
Visualize results in interactive dashboards
Hand off a reproducible pipeline
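
As an illustration of the baseline pattern-matching step, here is a minimal sketch in Python. It assumes the knowledge base can be reduced to a mapping from surface forms to entity labels; the entries, the function names, and the regex-based matching are all hypothetical choices made for this example and are not taken from the project's actual pipeline.

```python
import re

# Hypothetical miniature knowledge base: surface form -> entity label.
# The real knowledge base was assembled from CABI data sources; these
# entries are illustrative only.
KNOWLEDGE_BASE = {
    "maize": "crop",
    "tomato": "crop",
    "fall armyworm": "pest",
    "aphids": "pest",
    "leaf curl": "symptom",
}


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any known term."""
    # Longest terms first so multi-word names win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def extract_entities(message, knowledge_base=KNOWLEDGE_BASE):
    """Return (start, end, text, label) for each knowledge-base term found."""
    pattern = build_pattern(knowledge_base)
    return [
        (m.start(), m.end(), m.group(0), knowledge_base[m.group(0).lower()])
        for m in pattern.finditer(message)
    ]


if __name__ == "__main__":
    for entity in extract_entities("Fall armyworm on my maize, also leaf curl."):
        print(entity)
```

A dictionary lookup like this only finds terms spelled exactly as listed, which is one reason a statistical NER model is a natural next step for informal chat text with misspellings and local names.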

Two thousand chat messages were hand-labeled by the team to identify entities (crop, pest, chemical, pathogen, fungus, and symptom) to provide the corpus for evaluation.

The results

The prototyped NER model correctly identified 69% of entities in chat messages. In 93% of cases where the extracted entity text was correct, the assigned label was correct too. The best-performing categories were crops, fungus, and chemicals. A proof-of-concept comparison of automated extraction from chats with clinic records reflected the expected spike in fall armyworm mentions due to the outbreak in 2017.
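
The case study does not spell out how these two figures were computed. One plausible reading, sketched below in Python, is to measure the share of hand-labeled entities whose text span the model recovered, and then, among those recovered spans, the share that also received the right label. The function name, the tuple layout, and the exact-span matching rule are assumptions made for this example.

```python
def evaluate(gold, predicted):
    """Compare gold and predicted entities given as (start, end, label) tuples.

    Returns (span_recall, label_accuracy), where label_accuracy is computed
    only over gold spans whose boundaries the model matched exactly.
    """
    predicted_labels = {(s, e): label for s, e, label in predicted}
    # Gold entities whose exact character span was also predicted.
    span_hits = [(s, e, label) for s, e, label in gold if (s, e) in predicted_labels]
    # Of those, the ones where the predicted label agrees with the gold label.
    label_hits = [1 for s, e, label in span_hits if predicted_labels[(s, e)] == label]
    span_recall = len(span_hits) / len(gold) if gold else 0.0
    label_accuracy = len(label_hits) / len(span_hits) if span_hits else 0.0
    return span_recall, label_accuracy


if __name__ == "__main__":
    # Made-up spans for illustration only; not project data.
    gold = [(0, 13, "pest"), (20, 25, "crop"), (32, 41, "symptom"), (50, 55, "chemical")]
    predicted = [(0, 13, "pest"), (20, 25, "pest"), (32, 41, "symptom")]
    print(evaluate(gold, predicted))
```

Separating span matching from label matching makes it possible to tell whether errors come from missing an entity entirely or from finding it and assigning the wrong category.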

Diagram of the named entity recognition (NER) pipeline.

This work demonstrates the ability to extract entities that can enable trend-level analysis from plant doctor messages and help guide interventions and early action.
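
To make the idea of trend-level analysis concrete, the sketch below counts mentions of a single entity per month once entities have been extracted. It assumes each extracted mention is available as a (date, entity text) pair; the record layout, the function name, and the sample values are hypothetical and do not reproduce project data.

```python
from collections import Counter
from datetime import date


def monthly_mentions(records, entity):
    """Count mentions of one entity per (year, month), in calendar order."""
    counts = Counter(
        (day.year, day.month)
        for day, text in records
        if text.lower() == entity.lower()
    )
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        (date(2017, 1, 5), "fall armyworm"),
        (date(2017, 3, 2), "Fall armyworm"),
        (date(2017, 3, 9), "maize"),
        (date(2017, 3, 20), "fall armyworm"),
    ]
    print(monthly_mentions(records, "fall armyworm"))
```

Plotting counts like these over time, alongside counts from clinic records, is one simple way to check whether a spike in chat mentions lines up with a known outbreak.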
