Clinical-stage biotechnology firm Recursion recently announced the release of an open-source biological dataset, RxRx1, which the company has been building for more than five years.
At more than two petabytes and spanning more than 10 million different biological contexts, Recursion’s data is the world’s largest image-based dataset designed specifically for the development of machine learning algorithms in experimental biology and drug discovery.
The release will be accompanied by a competition available through the NeurIPS 2019 Competition Track and co-sponsored by NVIDIA and Google Cloud. The goal of the competition is to inspire the development of effective machine learning methods that can identify representations of biology from the complex experimental data in RxRx1.
“To answer fundamental questions facing biology and disease, and reimagine the drug discovery paradigm, we’re building the world’s largest, relatable, empirical biological dataset,” said Chris Gibson, Ph.D., CEO, Recursion. “We expect that the richness of this dataset, combined with the context surrounding the scale of our efforts, will inspire the world’s machine learning and AI community to help us in our mission to decode biology to radically improve lives.”
The RxRx1 dataset is composed of images of human cells from more than 1,000 experimental conditions with dozens of biological replicates produced weeks and months apart in a variety of human cell types. Each batch of experimental data contains unique experimental variations, giving data scientists a rich proving ground to experiment with methods to tackle the noise inherent in even the most well-run empirical studies.
Experimental complexity and variability are major challenges in the application of machine learning to biological datasets, particularly in drug discovery. Machine learning approaches have the potential to accelerate drug discovery, but only if models can separate genuine biological signal from the batch-to-batch variation that accompanies it.
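To make the batch-variation problem concrete, the sketch below shows one common baseline for it: standardizing each feature within its own experimental batch so that batch-wide shifts are removed before samples are compared. This is an illustration only, not Recursion's method or part of the RxRx1 release; the function name `standardize_per_batch`, the toy feature values, and the batch labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_batch(features, batch_ids):
    """Center and scale every feature column separately within each batch.

    features:  list of equal-length lists of floats, one row per sample
    batch_ids: list of batch labels, one per row (hypothetical labels)
    """
    # Group row indices by the batch they came from.
    rows_by_batch = defaultdict(list)
    for row_index, batch in enumerate(batch_ids):
        rows_by_batch[batch].append(row_index)

    n_cols = len(features[0]) if features else 0
    normalized = [[0.0] * n_cols for _ in features]

    for rows in rows_by_batch.values():
        for col in range(n_cols):
            values = [features[i][col] for i in rows]
            center = mean(values)
            spread = pstdev(values)
            # Avoid dividing by zero when a feature is constant in a batch.
            scale = spread if spread > 0 else 1.0
            for i in rows:
                normalized[i][col] = (features[i][col] - center) / scale
    return normalized


# Toy data (assumed): two batches whose first feature differs by a
# large batch-wide offset even though the within-batch pattern is the same.
features = [[1.0, 10.0], [3.0, 14.0], [101.0, 20.0], [103.0, 24.0]]
batch_ids = ["batch_a", "batch_a", "batch_b", "batch_b"]
normalized = standardize_per_batch(features, batch_ids)
```

After this step, the two toy batches have the same normalized values, since the offset between them was purely a batch effect. Real batch effects are rarely this simple, which is why the problem remains open.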