The best experiment doesn't exist

as shown by an analysis of DepMap data

Jun 25, 2026

Important service announcement: we’ve just launched a new virtual-cell research site at research.turbine.ai featuring a curated set of research problems where we can support collaborations with data, feedback and validation.

What is the effect of knocking out the EGFR gene in HCT-116 cells?

Does this question even have a clear answer?

We’re generally very interested in experimental reproducibility, as it is a cornerstone of good AI in biology.

It's well known that reproducing biology within the same lab is much easier than reproducing it across labs. But we generally expected stronger signals to reproduce more reliably across labs.

Diving Dep(Map)

That expectation started to change when we took a deeper look at the Dependency Map CRISPR data.

Most of the data comes from two major contributors: Broad and Sanger. Broad uses the Avana CRISPR guide library, while Sanger uses KY.

At first glance, the agreement looks respectable, r=0.73.

Fig. 1. Overall gene dependency results of the Broad Avana (X axis) and Sanger KY (Y axis) screens compared. Each point represents a gene/cell-line pair.

A veteran bioinformatician might raise an eyebrow at the spread in the high-effect region, but overall the plot looks reasonable.

However, when we filter for significant gene dependencies, the correlation craters (r=0.347).

Fig. 2. Broad Avana (X axis) vs Sanger KY (Y axis) gene dependencies filtered for gene/cell-line pairs where at least one screen identified a significant dependency. The stronger the gene effect, the **greater** the variance seems to be.
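For readers who want to try this kind of comparison themselves, here is a minimal sketch of the two correlations described above: one over all gene/cell-line pairs, and one over pairs where at least one screen calls a dependency. Everything in it is an assumption for illustration. The data is synthetic (not DepMap), the -0.5 cutoff is a hypothetical stand-in for a significance call, and the noise model deliberately makes disagreement grow with effect strength, so it will not reproduce the r values quoted in this post.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def filter_dependencies(pairs, threshold=-0.5):
    """Keep pairs where at least one screen falls below the threshold.

    The -0.5 cutoff is a hypothetical stand-in for a significance call.
    """
    return [(a, b) for a, b in pairs if a < threshold or b < threshold]


def make_synthetic_pairs(n=5000, seed=0):
    """Synthetic (screen_a, screen_b) gene effects; NOT DepMap data."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Most pairs have no effect; a minority are true dependencies.
        true_effect = rng.uniform(-2.0, -0.5) if rng.random() < 0.1 else 0.0
        # Assumption: between-screen disagreement grows with effect strength.
        noise_sd = 0.1 + 0.5 * abs(true_effect)
        a = true_effect + rng.gauss(0, noise_sd)
        b = true_effect + rng.gauss(0, noise_sd)
        pairs.append((a, b))
    return pairs


pairs = make_synthetic_pairs()
r_all = pearson_r(*zip(*pairs))
hits = filter_dependencies(pairs)
r_hits = pearson_r(*zip(*hits))
print(f"all pairs: n={len(pairs)}, r={r_all:.2f}")
print(f"filtered:  n={len(hits)}, r={r_hits:.2f}")
```

The point of the sketch is only the mechanics: the large cloud of no-effect pairs can prop up the overall correlation, and once it is filtered out, what remains is the agreement on the pairs we actually care about.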

That’s concerning. But could it simply be natural variability?

This is what the Broad dataset’s internal replicates look like:

Fig. 3. A few internal biological replicates from the Broad screen. Here, stronger gene effects correspond to lower variance, which is what we would expect from a consistent signal.

Even when replicate correlations are mediocre or low, strong gene effects still nicely hug the diagonal. That’s a clear and reproducible signal within the lab.
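One way to put a number on "hugging the diagonal" is to bin pairs of measurements by their mean effect and look at the spread of their difference in each bin. The sketch below is an assumption about how this could be done, not the analysis behind the figures; the function name, the bin width and the toy values are all hypothetical.

```python
import math
from collections import defaultdict


def spread_by_effect_bin(pairs, bin_width=0.5):
    """Return {bin_lower_edge: SD of (a - b)} for pairs binned by mean effect."""
    diffs = defaultdict(list)
    for a, b in pairs:
        mean_effect = (a + b) / 2
        bin_edge = math.floor(mean_effect / bin_width) * bin_width
        diffs[bin_edge].append(a - b)
    result = {}
    for bin_edge, values in sorted(diffs.items()):
        if len(values) < 2:
            continue  # sample SD is undefined for a single point
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        result[bin_edge] = math.sqrt(var)
    return result


# Tiny hand-made example (hypothetical values, not DepMap data):
# two weak-effect pairs that disagree more, two strong-effect pairs that agree.
replicate_pairs = [(0.1, 0.3), (0.3, 0.1), (-1.6, -1.7), (-1.9, -1.8)]
spread = spread_by_effect_bin(replicate_pairs)
for bin_edge, sd in spread.items():
    print(f"mean effect from {bin_edge:+.1f}: SD of difference = {sd:.2f}")
```

Run on replicate pairs from one lab, a pattern like the one described here would show the spread shrinking toward the strong-effect bins; run on pairs from two labs, a spread that grows in those bins would match the filtered cross-lab comparison.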

This is not the first time we’ve encountered this pattern: inter-lab reproducibility of interesting events being surprisingly low while their intra-lab reproducibility remains reasonably high.

What does this mean?

A tale of many truths

What if the original question is ill-defined?

If we ask, “What is the effect of a CRISPR knock-out of the gene EGFR in HCT-116 cells using the Avana library?”, we can already give a more precise answer.

And we can be more precise still if we ask: “What is the effect of a CRISPR knock-out of the gene EGFR in HCT-116 cells using the Avana library, with a 21-day transfection protocol, ….”

You can probably see where this is heading.

Reproducibility improves as we define the experiment better and better, but in doing that, we may be painting ourselves into a corner.

Every additional condition narrows the scope of the truth we are learning.

If all of these conditions are necessary to establish ground truth, then the knowledge we accumulate applies only within that very specific experimental setting.

This is not ideal.

Okay, if there is no single truth, can we at least say which data is better?

Better is what translates

Well, why are we doing any of this in the first place?

In this case, what we are trying to model are patients’ tumors with dysfunctional genes.

Of course, the slow accumulation of somatic mutations in a tumor is very different from the abrupt DNA damage introduced by CRISPR. But maybe, albeit imperfect, one protocol gives us a better model than the other.

So the quick answer is that the better data is the one that translates better.

But translation depends on the application

Can we now answer the original question?

Does Avana translate better than KY?

Well, translate where?
To patient tumors?
To neurodevelopmental processes?
To toxicity predictions?

If translation depends on the application, so must “better”.

Which probably means there is no single truth to learn from.

And if there is no single truth to learn from, there may never be a single experimental method that makes biology universally learnable. Such an understanding could emerge from a lot of different data sources and protocols somehow made to work together.

But what if nothing we can measure really translates?

Context transfer

The consensus is that CRISPR screens rarely translate as-is to patients [1].

Fig. 4. If the blue surface describes the actual cell states, what you can see from different endpoint measurements are just separate projections of the original space, a different one for each endpoint.

If that’s true, then, in terms of this badly drawn image of biological surfaces and their measurement projections from the multimodality post, whatever you can measure in an in vitro CRISPR screen and whatever happens in the patient cancer microenvironment may not even be on overlapping slices of the blue space of cell states.

Which means that to create a general virtual cell, we cannot avoid starting to understand and map how the immeasurable blue surface behaves.

So, perhaps the goal is not to generate ever more data from a single protocol.

Perhaps the goal is to collect many different protocols and learn to translate among them.

Thanks to Miklos Laczik, Csaba Papp and Balazs Szabo for teaching us this with your investigation!

[1] For example: “A systematic, genome-wide association analysis that integrated CRISPR–Cas9 screens with pharmacological responses for 397 drugs found clear associations between drug sensitivity and the knockout of their canonical targets for only ~25% of the tested compounds.” Gonçalves, E. et al. Drug mechanism-of-action discovery through the integration of pharmacological and CRISPR screens.

Turbine AI
