Open Source Projects
DrivenData maintains a number of popular open source projects for the data science, machine learning, and software engineering communities. Check them out here!
Cookiecutter Data Science
A logical, reasonably standardized, and flexible project structure for doing and sharing data science work
Since starting DrivenData, we’ve seen a lot of data science in the wild. As the field develops, it’s becoming increasingly important to organize data science work so that it’s easy to reproduce and build upon.
Cookiecutter Data Science is a widely used project template that keeps data scientists organized and on track.
Deon: An Ethics Checklist for Data Scientists
A command line tool that allows you to easily add an ethics checklist to your data science projects
When there's a lot at stake, checklists make sure big questions don't slip through the cracks and tough conversations happen even (especially) in fast-moving environments. The goal of deon is to push that conversation forward and provide concrete, actionable reminders to the developers who have influence over how data science gets done.
One command jumpstarts the conversation all data teams should be having. Explore the checklist here!
cloudpathlib
pathlib-style classes for cloud storage services
Have you wished for a consistent and easy interface in Python to access files in cloud storage like S3 and Azure? cloudpathlib is an extensible Python library that provides pathlib.Path-style classes for dealing with files in various cloud storage services, with seamless local caching.
Our goal is to be the meringue of file management libraries: the subtle sweetness of pathlib working in harmony with the ethereal lightness of the cloud.
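As a minimal sketch of why a pathlib.Path-style interface helps: code written against ordinary pathlib methods does not need to know where a file lives. The helper name `summarize_text_file` and the bucket name in the comment are made up for illustration, and the cloud call is left as a comment because it assumes cloudpathlib is installed and credentials are configured.

```python
import tempfile
from pathlib import Path


def summarize_text_file(path):
    """Summarize a text file using only pathlib-style attributes and methods.

    Hypothetical helper: it relies on .read_text(), .name, and .suffix,
    so any object offering that interface can be passed in.
    """
    text = path.read_text()
    return {
        "name": path.name,
        "suffix": path.suffix,
        "n_lines": len(text.splitlines()),
    }


# Local demonstration with a plain pathlib.Path in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    local_file = Path(tmp_dir) / "notes.txt"
    local_file.write_text("alpha\nbeta\n")
    local_summary = summarize_text_file(local_file)

print(local_summary)

# Assumption: with cloudpathlib installed and cloud credentials available,
# the same helper should accept a cloud path (bucket name is hypothetical):
#
#   from cloudpathlib import CloudPath
#   summarize_text_file(CloudPath("s3://my-bucket/notes.txt"))
```

The design point is that the calling code stays the same whether the argument is a local path or a cloud path.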
erdantic
entity relationship diagrams for Python data model classes like Pydantic
Looking for an easy, clean way to visualize your data model? erdantic is a simple tool for drawing entity relationship diagrams (ERDs) that show how data model classes are connected. Generate ERDs from models defined with multiple supported frameworks, such as Pydantic and dataclasses.
If you have data models in Python, this is a great way to illustrate your schema and add a visual reference to your documentation.
nbautoexport
Making it easier to code review Jupyter notebooks, one script at a time
nbautoexport automatically exports Jupyter notebooks to various file formats (.py, .html, and more) upon save while using Jupyter. One great use case is automatically keeping script versions of your notebooks so that reviewers can comment on them during code review.
pandas_path
Path style access for pandas
Love pathlib.Path? Love pandas? Wish it were easy to use pathlib methods on pandas Series? This package is for you.
Just one import adds a .path accessor to any pandas Series or Index so that you can use all of the methods on a Path object.
Winning Models from DrivenData Competitions
Prize-winning algorithms from DrivenData’s competitions
DrivenData runs machine learning competitions to help non-profits, NGOs, governments, and other social impact organizations use data science in service of humanity. Part of our mission is to enable data scientists and mission-driven organizations to learn from the work done in these competitions. To this end, the code submitted by winners is released under an open source license for others to learn from, use, and adapt.
Check out how ML experts built their winning algorithms!
Project Zamba
Computer vision for wildlife research and conservation
Zamba (meaning "forest" in Lingala) is an open-source Python package that uses machine learning and computer vision to help automate time-intensive video processing tasks for wildlife monitoring, enabling researchers to focus on interpreting the content and using the results.
Zamba builds on the winning solution from the Pri-matrix Factorization challenge and includes multiple state-of-the-art, pretrained machine learning models for species and blank detection in different geographies. It can also be used to train custom models on new species and geographies based on user-provided labeled data.
CyFi: Cyanobacteria Finder
harmful algal bloom detection from satellite imagery
Harmful algal blooms, such as those caused by cyanobacteria, occur all around the world and endanger both human and marine health.
CyFi is a command line tool that uses satellite imagery and machine learning to detect dangerous concentrations of cyanobacteria in small, inland water bodies. Built on the winning solutions of the Tick Tick Bloom competition, the goal of CyFi is to help water quality managers better allocate resources for in situ sampling and make more informed decisions around public health warnings for critical resources like lakes and reservoirs.
Concept to Clinic
An AI-powered application for early lung cancer detection built for radiologists
In the Concept to Clinic challenge, hundreds of data scientists and engineers from around the world came together to build open source tools to fight the world’s deadliest cancer. The prototype developed during the live challenge period between August 2017 and January 2018 focused on helping clinicians flag, assess, and report concerning nodules from CT scans.
This open-source project is an end-to-end application that allows radiologists to better interact with state-of-the-art AI as part of their diagnostic process.