Turbine AI

The best experiment doesn't exist

Kris Szalay — Thu, 25 Jun 2026 14:02:02 GMT

Important service announcement: we’ve just launched a new virtual-cell research site at research.turbine.ai featuring a curated set of research problems where we can support collaborations with data, feedback and validation.

What is the effect of knocking out the EGFR gene in HCT-116 cells?

Does this question even have a clear answer?

We’re generally very interested in experimental reproducibility, as it is a cornerstone of good AI in biology.

It's well known that reproducing biology within the same lab is much easier than reproducing it across labs. But we generally expected stronger signals to reproduce more reliably across labs.

Diving Dep(Map)

That expectation started to change when we took a deeper look at the Dependency Map CRISPR data.

Most of the data comes from two major contributors: Broad and Sanger. Broad uses the Avana CRISPR guide library, while Sanger uses KY.

At first glance, the agreement looks respectable, r=0.73.

Fig. 1. Overall gene dependency results of the Broad Avana (X axis) and Sanger KY (Y axis) screens compared. Each point represents a gene/cell line pair.

A veteran bioinformatician might raise an eyebrow at the spread in the high-effect region, but overall the plot looks reasonable.

However, when we start to filter for significant gene dependencies, the correlation craters (r=0.347).

Fig. 2. Broad Avana (X axis) vs Sanger KY (Y axis) gene dependencies filtered for gene/cell-line pairs where at least one screen identified a significant dependency. The stronger the gene effect, the greater the variance seems to be.
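The two numbers are easy to compute side by side. The sketch below runs on synthetic data, not on DepMap: the share of true dependencies, the noise levels and the -0.5 "significant dependency" cutoff are all assumptions picked to mimic the pattern described here (noisier measurements for stronger effects).

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
CUTOFF = -0.5  # hypothetical threshold for calling a dependency

screen_a, screen_b = [], []
for _ in range(10_000):
    # Toy assumption: 10% of gene/cell-line pairs are true dependencies
    # (effect -1), and those are measured with more lab-specific noise.
    dependent = rng.random() < 0.1
    effect = -1.0 if dependent else 0.0
    noise = 0.5 if dependent else 0.15
    screen_a.append(effect + rng.gauss(0, noise))
    screen_b.append(effect + rng.gauss(0, noise))

r_all = pearson(screen_a, screen_b)

# Keep only pairs where at least one screen calls a dependency.
kept = [(a, b) for a, b in zip(screen_a, screen_b) if a < CUTOFF or b < CUTOFF]
r_filtered = pearson([a for a, _ in kept], [b for _, b in kept])

print(f"all pairs: r={r_all:.2f}, filtered pairs: r={r_filtered:.2f}")
```

In this toy setting most of the overall correlation comes from telling dependencies apart from non-dependencies, so filtering out the non-dependencies removes most of it. Whether the same mechanism explains the real data is a separate question.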

That’s concerning. But could it simply be natural variability?

This is what the Broad dataset’s internal replicates look like:

Fig. 3. A few internal biological replicates from the Broad screen. Here, stronger gene effects correspond to lower variance, which is what we would expect from a consistent signal.

Even when replicate correlations are mediocre or low, strong gene effects still nicely hug the diagonal. That’s a clear and reproducible signal within the lab.

This is not the first time we’ve encountered this pattern: inter-lab reproducibility of interesting events being surprisingly low while their intra-lab reproducibility remains reasonably high.

What does this mean?

A tale of many truths

What if the original question is ill-defined?

If we ask, “What is the effect of a CRISPR knock-out of the gene EGFR in HCT-116 cells using the Avana library?”, we can already give a more precise answer.

Even more so if we ask: “What is the effect of a CRISPR knock-out of the gene EGFR in HCT-116 cells using the Avana library, with a 21-day transfection protocol, ….”

You can probably see where this is heading.

Reproducibility improves as we define the experiment better and better, but in doing that, we may be painting ourselves into a corner.

Every additional condition narrows the scope of the truth we are learning.

If all of these conditions are necessary to establish ground truth, then the knowledge we accumulate applies only within that very specific experimental setting.

This is not ideal.

Okay, if there is no single truth, can we at least say which data is better?

Better is what translates

Well, why are we doing any of this in the first place?

In this case, what we are trying to model are patients’ tumors with dysfunctional genes.

Of course, the slow accumulation of somatic mutations in a tumor is very different from the abrupt DNA damage introduced by CRISPR. But maybe, albeit imperfect, one protocol gives us a better model than the other.

So the quick answer is that the better data is the one that translates better.

But translation depends on the application

Can we now answer the original question?

Does Avana translate better than KY?

Well, translate where?
To patient tumors?
To neurodevelopmental processes?
To toxicity predictions?
If translation depends on the application, so must “better”.

Which probably means there is no single truth to learn from.

And if there is no single truth to learn from, there may never be a single experimental method that makes biology universally learnable. Such an understanding could emerge from a lot of different data sources and protocols somehow made to work together.

But what if nothing we can measure really translates?

Context transfer

The consensus is that CRISPR screens rarely translate as-is to patients [1].

If that’s true, then, using this badly drawn image of biological surfaces and their measurement projections from the multimodality post, whatever you can measure in an in vitro CRISPR screen and whatever happens in the patient cancer microenvironment may not even be on overlapping slices of the blue space of cell states.

If the blue surface describes the actual cell states, what you can see from different endpoint measurements are just separate projections of the original space, a different one for each endpoint.

Which means that to create a general virtual cell, we cannot avoid starting to understand and map how the immeasurable blue surface behaves.
So, perhaps the goal is not to generate ever more data from a single protocol.

Perhaps the goal is to collect many different protocols and learn to translate among them.

Thanks to Miklos Laczik, Csaba Papp and Balazs Szabo for teaching us this with your investigation!

[1] For example: “A systematic, genome-wide association analysis that integrated CRISPR–Cas9 screens with pharmacological responses for 397 drugs found clear associations between drug sensitivity and the knockout of their canonical targets for only ~25% of the tested compounds.” Gonçalves, E. et al. Drug mechanism-of-action discovery through the integration of pharmacological and CRISPR screens.

A cautionary tale on biological truths

Kris Szalay — Fri, 13 Mar 2026 14:21:36 GMT

Your biology textbook lies.

Image from: https://wallpapercave.com/w/wp11528256

By implication.
The cell you learned so much about in college and grad school doesn’t exist.

Let me tell you a story and we’ll get back to this.

Biological Truth Benchmarking

In the early days of Turbine, our model was a hand-wired signaling graph, painstakingly tuned to reproduce textbook behavior. It was fragile and at best passable in terms of predictive performance, but people using the simulations knew exactly what was going on inside.

Then we started using AI to do the wiring.

We shattered previous performance ceilings, but also mechanistic expectations along with them. When biologists looked into the pathway activation patterns, they were alarmed. The virtual cell was clearly doing the wrong thing in response to drugs.

It was not a surprising outcome - from the AI’s perspective the whole protein graph was merely a bunch of tunable parameters to fit a million data points with. And there are a lot of ways to fit a million data points with tens of thousands of parameters.

OK, so we need a way to guide the AI. But how?

We started writing a set of textbook rules.

Suppose protein A and protein B are both required to activate protein C, and C has no other inputs. Any wiring where C activates when only A or only B is present must be prohibited.
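As a minimal sketch of what checking such a rule can look like: the function names and the boolean view of "present" and "active" are illustrative assumptions here, not our actual BTB implementation. A wiring is represented as a function that says whether C is active given the presence of A and B.

```python
from itertools import product

def rule_violations(activates_c):
    """List the (A present, B present) states in which a wiring breaks the rule.

    `activates_c(a, b)` is a hypothetical stand-in for a simulated wiring.
    The rule only forbids C being active unless both A and B are present.
    """
    return [
        (a, b)
        for a, b in product([False, True], repeat=2)
        if activates_c(a, b) and not (a and b)
    ]

def and_wiring(a, b):
    return a and b  # C needs both inputs: allowed

def or_wiring(a, b):
    return a or b   # C fires on either input: prohibited

print(rule_violations(and_wiring))  # → []
print(rule_violations(or_wiring))   # → [(False, True), (True, False)]
```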

We crafted more than a thousand such rules and called them Biological Truth Benchmarks, or BTBs for short. Here are a few examples.

A few of the G1 cell cycle entry BTBs from the past.

As you see, we quickly had to realize that real protein interactions are much messier. Pathways interact, with so many ifs and yes-buts (the “biological context”) that rules like these cannot fully capture them.

Still, it’s better than nothing, right?
So we started building these rules into the loss function.

But given all those ifs, how strongly should the model obey them? We didn’t know the exact cell lines in which these rules were originally established. Even if we did, they were measured in only a handful of cellular contexts, making them a drop in the ocean compared to other training data.

So we decided that each rule should hold in the majority of cells.

For an ML engineer, this is a horrifying criterion. Optimization can no longer happen independently per sample. The order of samples starts to matter. Sweat and tears were had. But eventually a working setup emerged.
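A small sketch shows why this criterion couples samples. The penalty below is an assumed, illustrative form and not the loss we actually trained with: it is zero once a rule holds in at least half of the cells in a batch, so its value depends on the whole batch rather than on any single sample.

```python
def majority_penalty(rule_holds, threshold=0.5):
    """Penalty for one rule over a batch of cells.

    `rule_holds` holds one boolean per cell in the batch. The penalty is the
    shortfall of the satisfied fraction below `threshold` (an assumed form).
    """
    fraction = sum(rule_holds) / len(rule_holds)
    return max(0.0, threshold - fraction)

batch_1 = [True, True, True, False]    # rule holds in 75% of cells
batch_2 = [True, False, False, False]  # rule holds in 25% of cells

print(majority_penalty(batch_1))  # → 0.0
print(majority_penalty(batch_2))  # → 0.25
```

A cell that breaks the rule costs nothing in `batch_1` but contributes to the penalty in `batch_2`, so how samples are grouped changes the gradient each of them receives.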

And the results were underwhelming.

Training was much harder, performance didn’t improve, and many horrible mechanistic quirks remained.

There is no general cell

Investigation revealed that in clear contrast to what we wanted, performance and BTB scores were disconnected. You can have excellent predictive performance with terrible BTB scores (which is where we started) but also excellent BTB scores with terrible predictive performance.

Good textbook biology does not make for better generalization performance (and vice versa). The image shows two example sentences illustrating that the success of BTBs has no bearing on phenotypic outcome - both wrong and good predictions have very close BTB success rates.

Which defeats the whole purpose.
Worse, there was a trade-off in training between BTB and viability performance.

Some argued that performance didn’t matter if the mechanism is more correct. But if the performance degrades, the model generalizes less. And if it generalizes less, it is — by definition — capturing less of cells’ internal workings.

This is deeply counterintuitive — if the model obeys more of the scientifically established rules, how can it generalize worse?

Because scientific truths are themselves overgeneralizations of the underlying data.

Science is built by humans, for humans. It’s an explicit job of science to compress the messy, complicated reality into understandable decision rules. Each rule, individually, would be a sound decision in the context in which it was derived.

It blows up when you start combining these rules.

Our mistake was saying that each rule should work in the majority of the cells without precisely defining said majority.

Because if you just optimize any majority (and the rule indeed applies to roughly a majority), you’ll get, on average, half the cell lines wrong with each rule.

With 1000 rules the chance of getting any appreciable number of cell lines right for most rules is, for most intents and purposes, zero.
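To put a number on it, take a simplifying assumption that is ours and not measured: each rule is right for a given cell line with probability 0.5, independently of the other rules. The number of rules a cell line gets right is then binomial, and the tail probabilities can be computed exactly.

```python
from math import comb

def prob_at_least(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5): at least k of n coin-flip rules right."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

n_rules = 1000
for k in (500, 600, 700):
    print(f"right on at least {k} of {n_rules} rules: {prob_at_least(k, n_rules):.3g}")
```

Under this assumption a cell line is right on half of the rules about half of the time, but the chance of being right on 60% of them is already vanishingly small. Real rules are correlated, so treat this as an order-of-magnitude illustration only.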

So by training to BTBs, we were explicitly teaching the wrong idea to the machine. The idea of the general cell described in the textbooks to which all rules apply equally.

But that general cell doesn’t exist. There is no single cell in which all these rules are simultaneously true. Some rules will be observable in a certain type of cell, but not others — there will always be some cell-specific activity that interferes with some rules.

The pattern of which rules apply where is precisely what distinguishes cell types. It is the keystone of predictive performance.

Now what?

If you want mechanistic correctness, you need mechanistic ground truth.
Post-treatment RNASeq, protein phosphorylation assays — something that gives you an additional trace inside the cell besides your endpoint.

The implication is unsettling, but also fascinating: real biology is much more complex than what can be contained by textbooks — most of that world is still waiting to be discovered.

Does this mean we should fire the experts and just collect more data?

Of course not.

First, you need the experts to match the right ML metric to the downstream application: does a better score on your chosen metric give results they’re happier using? If not, you are optimizing the wrong objective.

Second, humans are horrible at detecting whether a fact is true; that’s what rigorous metrics are for. But we’re remarkable at detecting when something feels off, much better than any predefined ML metric.

So that’s the workflow I’m trying to drive in-house:
When experts detect a smell, capture it in data.
Make sure the metric is sensitive to those new examples.
Optimize against that expanded ground truth and iterate if needed.

And assume as little as possible.

As we got closer to correctly simulating biology, one textbook assumption after another fell apart. First the fixed-form equations, then the fixed activity patterns, and finally even the graph abstraction itself.

But that’s a story for another time.

Thanks for all that work 3 years ago, Laszlo Mero, Csilla Hegedus, Dora Kallai and Robert Sipos (and to Imre Gaspar and Balint Paholcsek for the reviews)!

Make multimodality work for you, not against

Kris Szalay — Wed, 25 Feb 2026 15:02:52 GMT

Large-scale RNA data generation is starting to show diminishing returns. So, naturally, multimodality is starting to look very attractive.

After all, RNA is just one slice of reality. If we want to be able to build a general virtual cell, RNA alone may not be enough. So it makes sense to collect and add other types of data out there. More data could only improve our results, right?

Illustration gracefully lifted from here

Well, not necessarily. If you just mindlessly add multiple types of data to a training, it can just blow up, drastically decreasing performance.

A case in point

One of our teams was recently training a new model to power virtual assays for drug target discovery.

We didn’t just want to predict the overall effect of knocking out a gene; DepMap already measured that. We wanted something that can predict what happens inside the cell when that gene is knocked out.
A DepMap-wide Perturb-seq, if you will.

The obvious starting point for the training was to combine the DepMap and Perturb-seq data, and see what happens.

This.

First training performance on RNA only vs combined data. We only calculate correlation between the true and predicted RNA gene levels for samples where the RNA patterns are relatively stable to make sure we’re not trying to fit noise.

That is, instead of improving (or, in the worst case, doing nothing), the performance dropped significantly when the model tried to combine the two data types.

After spending some time chasing the culprit (as the two datasets are hard to compare due to being different readouts), the team came up with the following plot:

This heatmap shows the effect of knocking out a single gene (y-axis, lower is “more dead”) versus how much the molecular state of the cell changed in the RNA readout (L1 norm of the differential expression, x-axis). A correlation this small is counter-intuitive - why do changes that lead to cell death rarely induce large RNA swings?

The plot reveals that, even in the ground truth data, how much the RNA levels change and how the cells fare seem quite unrelated.
This doesn’t make too much sense. Surely knocking out a gene that eventually kills the cell should start a cascade of repair processes which should be visible in RNA.
Either our metric (using L1 norm) is wrong, or the data doesn’t line up - that is, the biology is very different.
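For reference, the metric in question fits in a few lines. The expression values and gene effects below are made-up toy numbers, and `l1_shift` and `pearson` are illustrative helper names. Only the definition (L1 norm of the differential expression, correlated against the gene effect) comes from the plot above.

```python
import math

def l1_shift(perturbed, control):
    """L1 norm of the differential expression between two expression profiles."""
    return sum(abs(p - c) for p, c in zip(perturbed, control))

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

control = [5.0, 3.0, 8.0, 1.0]  # toy log-expression of four genes

knockouts = {
    # name: (post-knockout expression profile, viability gene effect)
    "KO_1": ([5.1, 3.0, 7.9, 1.0], -0.05),
    "KO_2": ([4.0, 3.5, 6.5, 2.0], -0.90),
    "KO_3": ([5.0, 2.0, 8.5, 1.2], -0.40),
}

shifts = [l1_shift(profile, control) for profile, _ in knockouts.values()]
effects = [effect for _, effect in knockouts.values()]

print("L1 shifts:", [round(s, 2) for s in shifts])
print("correlation with gene effect:", round(pearson(shifts, effects), 2))
```

In these toy numbers, bigger RNA shifts go with more negative gene effects, which is the relationship we expected to see and initially did not.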

Different biology was a plausible explanation, as the gene effects are measured in a 21-day assay, while the Perturb-seq experiments were generated in different labs, with a different setup, 7 days after transduction.

But if there’s no shared biology at all, there’s no way to connect these modalities together.

We got lucky. In the end, what was missing was a simple correction on cell counts:

The effect of knocking out a single gene (y-axis, lower is “more dead”) and how much the molecular state of the cell changed in the RNA readout (L1 norm of the differential expression, x-axis) are much better correlated after cell count correction. It’s not perfect, but I’m more willing to chalk up this level of mismatch to the L1 norm being an imprecise metric.

After training on the corrected data, we got all the missing performance back, maybe [1] even some more:

First training on combined data after cell count correction. Looks much better.

So there was a way to make the modalities work together despite the obvious differences in protocol.

That’s what seems to be the core question at play when building something multimodal: am I measuring the same biology when putting two datasets together?

The elephant in the room

Which got me thinking. Is there a way to generally quantify how good your assumption of shared biology is?

Imagine biology as cells moving on some kind of latent manifold describing the internal cell state. Then every modality is a projection, a lower dimensional representation of that mathematical surface.

You can collect as many projections as your budget allows, but you never get to see the surface itself. Hence the elephant and the blind men.

In theory, if your surface is smooth, and your sampling is dense enough, you could reconstruct the blue surface close enough to recover the true “working manifold” of biology.

You know you’re close when you can reliably predict across all modalities from just a few of them - RNA to phenotype, proteome to epigenome, etc.

But to do that, you need to learn which regions of the green space are close to which regions of the red space in the original blue space.
Which means points that are somehow externally paired in the training data.

That’s why Ron argued so passionately about paired data.

Truly, in an ideal world, every sample would come with every modality.
But most molecular measurements are destructive. So every pairing decision is fundamentally an assumption of shared biology — an assumption that the two measurements originate from the same spot in the blue surface.
It’s a stronger assumption if the data comes from the same patient, lab and protocol, but an assumption nonetheless.

On the other hand, sometimes data that look wildly unpaired can still work, as the example above shows.

So what actually makes multi-modal data work? I think it’s the following assumptions.

If the data doesn’t exist, following these could be a structured way to build it, testing each assumption as you iterate on your protocols.

If the data already exist, maybe it’s worth trying the opposite: train first. If it fails, start tracing back the chain of assumptions and see whether there’s something you can fix.

The pairing assumption

The deepest assumption is that each multi-modal pair truly describes the same biological state. The test is this: can you produce matching pairs with a single readout?

Are you pairing samples based on them being in the same experimental condition? Or are you matching individual cells in heterogeneous assays based on microscopic features?

Both could be valid pairing algorithms. But in your setup, do they really produce consistent RNA-Seq vectors, before considering any other modality?

If they are consistent to your liking, it’s now safer to assume the other readouts from the same batch will be consistent as well (come from the same distribution).

If not, your pairing is off. Maybe you are missing experimental variables to control (synchronizing the cells, for example). Maybe you need better markers to match cells in imaging.
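One way to run this test, sketched on toy vectors: compare how well RNA profiles agree within a pairing group against how well they agree across groups. The group names, the numbers and the choice of Pearson correlation as the consistency measure are assumptions for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def pairing_consistency(groups):
    """Mean correlation of RNA vectors within and between pairing groups.

    `groups` maps a pairing key (e.g. an experimental condition) to the
    RNA vectors that the pairing rule assigns to it.
    """
    labelled = [(key, vec) for key, vecs in groups.items() for vec in vecs]
    within, between = [], []
    for (k1, v1), (k2, v2) in combinations(labelled, 2):
        (within if k1 == k2 else between).append(pearson(v1, v2))
    return sum(within) / len(within), sum(between) / len(between)

groups = {
    "condition_A": [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]],
    "condition_B": [[4.0, 3.0, 2.0, 1.0], [3.8, 3.1, 2.2, 0.9]],
}

within_r, between_r = pairing_consistency(groups)
print(f"within groups: {within_r:.2f}, between groups: {between_r:.2f}")
```

If the within-group agreement is not clearly higher than the between-group agreement, the pairing rule is not picking out a shared biological state.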

The smoothness assumption

Next up the ladder is checking if you have enough data to approximate the biological surface. The test: Can you reliably predict one modality from the others? Under which conditions?

Suppose you have some RNA-Seq data points and immunohistochemistry (IHC) images. Let’s say they come from different patients, but some are matched by indication, so maybe they are close enough.

If you can reliably predict one modality from another, it means your sampling density is sufficient relative to the smoothness of the surface between those points.

Just don’t forget to test how far that generalizes: crossing tissue types is the barrier where it usually breaks.

If the prediction doesn’t work at all, or the training blows up, then the data points are too few and far apart (or, if you’re coming from the opposite direction, they may not represent the same biology).
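The link between sampling density and cross-modal prediction can be seen in a toy setup. Everything below is an assumption made for illustration: a one-dimensional hidden state, two made-up smooth "modalities" derived from it, and a k-nearest-neighbour predictor standing in for a real model.

```python
import math
import random

rng = random.Random(0)

def sample(n):
    """Toy paired data: both modalities are smooth functions of a hidden state z."""
    zs = [rng.random() for _ in range(n)]
    mod_a = [(z, z * z) for z in zs]                 # stand-in for one readout
    mod_b = [math.sin(2 * math.pi * z) for z in zs]  # stand-in for another readout
    return mod_a, mod_b

def knn_predict(query, train_a, train_b, k=3):
    """Predict modality B as the mean over the k nearest neighbours in modality A."""
    order = sorted(range(len(train_a)), key=lambda i: math.dist(query, train_a[i]))
    return sum(train_b[i] for i in order[:k]) / k

def heldout_error(n_train, n_test=50):
    """Mean absolute error of A-to-B prediction on freshly sampled test points."""
    train_a, train_b = sample(n_train)
    test_a, test_b = sample(n_test)
    errors = [abs(knn_predict(a, train_a, train_b) - b) for a, b in zip(test_a, test_b)]
    return sum(errors) / len(errors)

dense_error = heldout_error(200)
sparse_error = heldout_error(6)
print(f"dense sampling: {dense_error:.3f}, sparse sampling: {sparse_error:.3f}")
```

With dense sampling the neighbours sit close together on the hidden surface and the prediction holds up. With only a handful of points the same predictor falls apart, even though the underlying biology in the toy is identical.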

Extrapolating

If all you wanted was something that translates from modality A to modality B, you’re done once the above prediction works. But usually we are asked to predict responses for new drugs or cell lines. Those responses won’t have any of the modalities measured.

In the end, the arbiter is whether your model improved by being multi-modal. Did you gain the ability to extrapolate to new interventions?

Because that’s why we ultimately generate all this data - to learn enough about the hidden blue surface to trace the missing parts.

Because that’s where the cures for disease lie.

Thanks to Laszlo Mero, Milan Sztilkovics and Zsolt Gyure for the case study data, and Daniel Veres for the review and comments!

[1] There is a clear improvement, but ablation studies show that most of it is driven by the correction, not the integration.

Assessing a Virtual Cell’s utility

Gerold Csendes — Wed, 07 Jan 2026 14:02:15 GMT

Hitting 2026 running, I can tell you that the team has often been asked internally to evaluate and share their opinion on published approaches to creating virtual cells.
Gerold Csendes was inspired to write down the process he developed for those evaluations.
I converted it into Substack’s format, but had to cut the last chapter due to length limitations. You can get the full PDF here. Gerold, the floor is yours:

We assume the reader has some familiarity with Virtual Cells. For readers new to the field, we advise starting with our blogpost.

Intro
Step 1. Specify the claim
- The claim is defined by the test data
  - Genetic perturbations
  - Chemical perturbations
Step 2. Validate the evidence
- Check for data leakage
- Are the baselines strong and domain-informed?
- Can the score measure the claim?
  - Example 1: Evaluating differential effects in raw expression space
  - Example 2: Correlation (alone) in wide screens
- Is it learning actual biology?
Summary: Where is the field right now? (only in the PDF)

Intro

Virtual Cells are one of the hottest topics in computational biology right now. The field is chasing an “AlphaFold moment”: a model that turns messy biology into reliable perturbation predictions, at scale. That ambition is worth pursuing, but a sober reality check arrived in early 2025. Multiple groups—including Ahlmann-Eltze et al. and us—showed that several “foundation model” claims on drug discovery benchmarks could be matched or beaten by simple, classical baselines. Since then, new methods have continued to appear, and headline performance on popular benchmarks has often improved.

This note is a practical, drug discovery-facing guide to interpreting such Virtual Cell results. Concretely, we offer a lightweight review checklist that helps you (1) scope what a benchmark result can legitimately claim, and (2) validate whether the evidence is trustworthy.

In this note, we use a narrower definition of Virtual Cells: models that predict perturbation outcomes—e.g., the effect of CRISPRi or a drug on a cellular phenotype (most commonly post-perturbation gene expression). Most widely used Virtual Cell benchmarks follow this framing.

From the outside, it’s tempting to accept a straightforward narrative: Virtual Cells could approach experimental performance, the gap is “just” scaling data, model size, and compute until we get the equivalent of AlphaFold for perturbations. We (Turbine) are building Virtual Cells because we believe this will indeed eventually happen. Still, we think important context is often left out when people assess “utility” from benchmark tables alone. Drug discovery is expensive and failure-prone, and most proof points don’t survive the next step after a leaderboard read.

Our day-to-day work forces that disciplined downstream focus. We constantly benchmark state-of-the-art models, try to separate real progress from evaluation artifacts, and make the case to partners that our model is worth using instead of an open-source alternative. In practice, we found ourselves repeatedly applying a two-stage process: (1) specifying the claim and (2) validating the evidence.

What we generally find is that we are still far from the ambitious definition of Virtual Cells: models that generalize across cell types, perturbation modalities, and datasets. We also find that “validating the evidence” is not straightforward—one needs to read result tables through a critical lens.

The goal of this note is to provide a practical reading guide for benchmark-based claims: first, define what a reported score legitimately implies; second, identify failure modes—such as leakage, weak baselines, or misaligned metrics—that can overstate progress.

Step 1. Specify the claim

It is not always easy to parse what a method is claiming to do – that is, in what kind of downstream applications would the presented results be applicable. This is often fine in research; authors don’t always need to precisely describe in which context method X can be used. However, if we are talking about Virtual Cell utility, we need to translate that into a claim about usability. Terms like context generalization, out-of-distribution, or capturing cellular behavior frequently appear, but they are hard to validate unless they are grounded in a specific benchmark and split strategy.

The claim is defined by the test data

The most important factor for specifying the claim of a Virtual Cell is the benchmark(s) it is evaluated on. Another central axis is the dataset split strategy, which has a large impact on what conclusions are justified. Today, two benchmark families are especially common in the Virtual Cell space: Perturb-seq CRISPRi genetic benchmarks and the Tahoe-100M drug perturbation benchmark.

Genetic perturbations

Accurate modeling of genetic perturbations is a core challenge for drug discovery. If successful, it could support target identification by narrowing the hypothesis space and generating testable, context-specific predictions (e.g., expected pathway shifts or compensatory programs after target knockdown). Today’s most widely used datasets are Perturb-seq CRISPRi screens measured by post-perturbation transcriptomics, typically spanning only a handful of cell contexts and a perturbation set of essential genes [1].

Because the data are limited, benchmarks often use a gene-exclusive split (GEX, Fig. 1. right) —holding out genes at test time while evaluating in the same cell contexts. This setup has two practical limitations. First, for Target ID, we usually care about predictions in the disease-relevant cellular context (and about selectivity across contexts), whereas GEX evaluates generalization across genes within a small, fixed set of cell types. Second, restricting perturbations largely to essential genes can make the task systematically easier: essential-gene knockdowns often induce stronger and more stereotyped transcriptional programs, which can inflate aggregate metrics without guaranteeing robustness to more diverse, weaker, or more selective perturbations (e.g., non-essential targets, selectivity).

Zooming out, GEX is an artificial holdout from an application standpoint—targets are not “unseen genes” in deployment as much as they are new contexts (new disease state, new genetic background, new cell type). For that reason, strong GEX performance is best interpreted as a diagnostic (does the model properly represent genetic perturbations) rather than as sufficient evidence for Target ID utility.

A more practically useful evaluation is cell-exclusive generalization (CEX, Fig. 1. mid.)—holding out cellular contexts and asking whether a model can predict how the same perturbation behaves across cell types, including selectivity patterns. Unfortunately, CEX is underrepresented in current Perturb-seq benchmarks, making it difficult to test stronger claims about real-world Target ID workflows.

Ultimately, causal understanding of drug and biomarker interactions would necessitate being able to generalize to combinatorial perturbations, with at least one of the perturbations being unseen. This is the setup we most frequently encounter in pharmaceutical applications. The scarcity of public dual-perturbation data means that open-source, large-scale benchmarking of this capability is still unsolved.

Currently the most practical proxy for combinatorial benchmarking would be larger perturbational transcriptomics resources—datasets closer to “DepMap-scale” breadth in cell contexts, but with post-perturbation readouts.

In the near term, the community could adopt more realistic benchmarks by leaning on already available genome-wide Perturb-seq datasets (X-Atlas/Orion [2], Replogle et al. K562 [3]) and using split protocols that explicitly measure cross-context generalization.

Chemical perturbations

Tahoe-100M is a popular drug perturbation benchmark containing ~50 cell lines and ~1100 compounds. Compared to GDSC, Tahoe-100M is more extensive on the compound axis (1100 vs ~300) but more limited in cellular context (50 vs ~1000 cell lines). It is commonly evaluated with a “context generalization” split, often corresponding to few-shot generalization along an axis (Fig. 1. left). How relevant is this setup for drug discovery decisions? We argue that few-shot generalization can mimic some workflows, but in many settings a fully exclusive setup is more broadly applicable: either drug-exclusive (DEX; Fig. 1. right) or cell-exclusive (CEX; Fig. 1. mid). Data is often scarce, and assuming that the compound in question has little-to-no assay data is a robust default that covers a broader range of applications. For this reason, we propose more challenging and clearly defined Tahoe-100M split protocols (CEX/DEX).

Under current splits, strong results support split-limited claims. Currently, high Perturb-seq performance is a diagnostic tool that clears the model for more specific downstream tests. High Tahoe-100M performance (in the usual context generalization setup, see below) mainly justifies few-shot screening claims. Strong performance on stricter CEX/DEX splits would support broader, more ambitious claims.

Figure 1. Schematic representation of dataset splits. Perturbation Exclusive is an umbrella term for any kind of perturbation exclusivity while DEX (drug exclusive) and GEX (gene / genetic exclusive) refer to the specific perturbation modality.
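The exclusive splits in the schematic can be sketched as follows. This is a minimal illustration on made-up records, not the protocol of any specific benchmark: the record layout, the `exclusive_split` helper and the 20% holdout fraction are assumptions.

```python
import random

def exclusive_split(records, axis, holdout_fraction=0.2, seed=0):
    """Hold out whole entities along one axis.

    axis="cell" gives a cell-exclusive (CEX) split; axis="perturbation" gives
    a perturbation-exclusive split (DEX for drugs, GEX for genes).
    """
    entities = sorted({r[axis] for r in records})
    random.Random(seed).shuffle(entities)
    n_holdout = max(1, int(len(entities) * holdout_fraction))
    heldout = set(entities[:n_holdout])
    train = [r for r in records if r[axis] not in heldout]
    test = [r for r in records if r[axis] in heldout]
    return train, test

# Toy screen: every cell line measured under every drug.
records = [
    {"cell": cell, "perturbation": drug}
    for cell in ["cell_1", "cell_2", "cell_3", "cell_4", "cell_5"]
    for drug in ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"]
]

cex_train, cex_test = exclusive_split(records, "cell")
dex_train, dex_test = exclusive_split(records, "perturbation")

# Held-out entities must never appear in training.
assert not {r["cell"] for r in cex_train} & {r["cell"] for r in cex_test}
assert not {r["perturbation"] for r in dex_train} & {r["perturbation"] for r in dex_test}
print(len(cex_train), len(cex_test), len(dex_train), len(dex_test))
```

A few-shot "context generalization" split would instead hold out individual (cell, perturbation) pairs, so both the cell and the perturbation of a test pair can still appear elsewhere in training.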

Step 2. Validate the evidence

Up until now we have discussed how we can specify a Virtual Cell’s claim given the benchmark it uses. But knowing what a result could claim is only half the story: we also need to decide whether the reported gains are trustworthy and would survive outside the benchmark setting. Reading the results table alone is rarely enough. Below are three questions we find imperative to ask when validating the evidence.

Check for data leakage

Before interpreting any results table, the first question is whether the reported score reflects generalization or memorization. Leakage—when evaluation information sneaks into training—directly undermines trust in benchmark performance.

Leakage is especially easy to introduce when models and data pipelines become complex, and it is a well-known concern in LLM evaluation: models can look strong on public benchmarks yet fail on genuinely new test sets when the data is truly unseen (see Math Olympiad). Similar dynamics appear in biology: when benchmarks contain near-duplicates (e.g., high sequence similarity across train/test) or share structure with training corpora, sophisticated models can appear to win by learning shortcuts rather than transferable biology.

There is a cautionary tale for leakage and benchmarks in the binding affinity benchmarking community. There, the commonly used PDBBind 2020 dataset contained very large, sometimes even 100%, sequence similarity to evaluation sequences (Fig. 4, paper).

Virtual Cells are particularly vulnerable to this risk because many are foundation models pretrained on vast, heterogeneous datasets. If downstream benchmarks (or close variants of them) make their way into pretraining, it becomes difficult to disentangle genuine generalization from memorization. This risk will likely grow as pretraining corpora expand and become harder to audit.

Leakage can also be subtle and “context-dependent.” Examples include:

Using proprietary data from the same perturbation domain as model features while reporting performance on an open benchmark.

Constructing perturbation graphs or neighborhood features using information that is only available because the benchmark perturbations are known in advance.

The number of cells per perturbation in single-cell datasets, which usually correlates with the strength of the perturbation (VCC was affected, see Fig. 1 here).

These are not necessarily data leaks if the benchmark setup is careful enough—but in most evaluation settings they effectively are, and they should be disclosed and justified.

One practical way to mitigate leakage is evaluation on held-out, non-public test sets, where the training data and benchmark are separated by design. This is why community competitions (e.g., CASP/DREAM-style) are so valuable for measuring “in-the-wild” performance. The Virtual Cell Challenge (VCC) was an important step in that direction for the new era of Virtual Cells, even if the first iteration was largely about learning how to benchmark these new models.

Are the baselines strong and domain-informed?

Benchmarking complex models only against other complex models, especially if the benchmark itself is new, is risky. It can ignore substantial prior work in the domain and makes it hard to contextualize task difficulty. In much of modern AI, the “moment” progress happens is clear: a new architecture must beat strong incumbents (e.g., ViTs on ImageNet). Virtual Cells are not in that regime yet. We do not have a single, widely accepted “ImageNet for perturbations”, nor have we reached a point where a single model family cleanly dominates across settings.

What we do have are decades of biological and experimental insights about how cells, perturbations, and readouts should be represented. These insights already lead to surprisingly strong baselines, and they remain relevant even as model scale increases.

For that reason, we believe a Virtual Cell should be evaluated not only against “mean” and simple linear additive models, but also against a class of stronger, domain-informed methods—what we call computational biologist baselines (what an experienced computational biologist does on first approach trying to predict from the same data - see our random forest or ridge as an example). Concretely, these are models that deliberately encode biological structure (e.g., sensible covariates, perturbation graph structure, batching and control modeling, and cell-context conditioning) without requiring foundation-model scale.

It is encouraging that reporting simple baselines on perturbation benchmarks is becoming standard practice. Still, we often find that baselining is not yet rigorous enough: there is a large space between trivial baselines and the most complex Virtual Cells, and that middle ground is frequently underexplored. In our own evaluations, investing effort into this “computational biologist baseline” space often closes much of the apparent gap to SOTA—and not so rarely matches or outperforms it—changing the conclusion about whether a complex Virtual Cell is worth deploying.

Figure 2. Schematic representation of baselines and Virtual Cell (VC) results

Can the score measure the claim?

A strong benchmark score does not automatically mean a Virtual Cell is making useful predictions. In post-perturbation transcriptomics it is possible to achieve impressive-looking metrics while missing what matters in practice, e.g., capturing differential response, pathway-level shifts, or clinically meaningful rankings (which compound/target moves the biology in the desired direction). Several recent efforts argue for improved evaluation (e.g., Systema, PerturBench, Diversity by Design, STATE), but the field is still far from consensus, and community efforts like the Virtual Cell Challenge have highlighted how difficult it is to select a robust metric suite.

Why is this problem so hard? One key difference from protein structure prediction is that structure has a relatively intuitive geometry: predictions can be compared to ground truth in 3D, and “closeness” has a fairly direct interpretation. Post-perturbation transcriptomics, by contrast, is a ~20k-dimensional vector whose biological meaning is largely read out indirectly (e.g., through differential expression, pathway activity, and phenotype proxies). As a result, many “reasonable” metrics end up rewarding the wrong behavior.

This is not to say current metrics are useless; some are informative. But there are common choices that systematically overstate progress. Below are two failure modes we see often.

Example 1: Evaluating differential effects in raw expression space

In perturbation studies, what we care about is the differential effect: the change relative to an unperturbed (or matched control) state. Raw expressions mostly reflect cell identity, and in many settings the perturbation effect is comparatively small (Fig. 3). If a metric evaluates predictions in raw expression space, it can reward models for predicting baseline identity rather than perturbation response—artificially inflating scores and encouraging “do nothing” behavior. A simple no-change baseline (predict the control) is a useful sanity check here: if it scores surprisingly well, the metric is likely not measuring the perturbation biology you care about.

Figure 3. Schematic representation of raw and differential expressions. A no-change predictor would correlate well with the perturbed.
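A minimal sketch of that sanity check, using made-up log-expression values for five genes and a hand-written Pearson helper. The numbers and variable names are illustrative assumptions, not benchmark data; the point is only that a predictor which returns the control profile looks excellent in raw space and has nothing to show in differential space.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / (sx * sy)

# Toy log-expression values for five genes (made-up numbers).
control = [8.0, 2.0, 5.0, 0.5, 6.5]
perturbed = [8.1, 2.0, 4.2, 0.6, 6.5]  # small shift on top of cell identity

# The no-change baseline predicts the control profile for every perturbation.
no_change_prediction = control

# Raw expression space: dominated by cell identity, so the score is high.
raw_score = pearson(no_change_prediction, perturbed)

# Differential space: the same baseline predicts an all-zero delta,
# so its correlation with the true delta is undefined (NaN), not high.
true_delta = [p - c for p, c in zip(perturbed, control)]
pred_delta = [p - c for p, c in zip(no_change_prediction, control)]
delta_score = pearson(pred_delta, true_delta)

print(f"raw-space r = {raw_score:.3f}, delta-space r = {delta_score}")
```

If a metric reports a high score for `no_change_prediction`, that is a hint it is measuring cell identity rather than perturbation response.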

Example 2: Correlation (alone) in wide screens

Correlation in differential expression space can be useful—but only when the perturbation induces a reproducible, sufficiently strong signature. If the true effect is near zero, the measured differential expression is dominated by noise and won’t reliably reproduce in the lab (Fig. 4). It is unrealistic (and undesirable) to expect a model to match such examples. This is not a corner case: many genetic perturbations have weak transcriptomic effects, and many drug–cell–dose combinations produce negligible response. In these regimes, correlation becomes unstable and can penalize sensible behavior while rewarding overfitting or noise chasing.

A practical implication: metrics should be effect-size aware (e.g., stratify or weight by signature strength / experimental reproducibility), and evaluation should explicitly distinguish “predicting a real signal” from “predicting noise.”

Figure 4. Schematic relationship between signature strength (#DEGs) and replicate reproducibility of the perturbation; weak signatures tend to show low reproducibility in typical experimental settings.
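One way to make an evaluation effect-size aware is to report the score separately per signature-strength bin instead of as a single average. The sketch below assumes you already have a per-perturbation score and a #DEG count for each perturbation; the bin cut-offs (10 and 100 DEGs) and the bin names are hypothetical placeholders, not recommended values.

```python
from collections import defaultdict

def strength_bin(n_degs, weak_below=10, strong_from=100):
    """Assign a perturbation to a signature-strength bin (cut-offs are assumptions)."""
    if n_degs < weak_below:
        return "weak"
    if n_degs >= strong_from:
        return "strong"
    return "medium"

def stratified_mean_score(records):
    """Average a per-perturbation score within each signature-strength bin.

    records: iterable of (n_degs, score) pairs, one per perturbation.
    """
    buckets = defaultdict(list)
    for n_degs, score in records:
        buckets[strength_bin(n_degs)].append(score)
    return {label: sum(v) / len(v) for label, v in buckets.items()}

# Made-up example: (number of DEGs, per-perturbation correlation).
example = [(3, 0.05), (5, -0.10), (40, 0.30), (250, 0.70), (400, 0.80)]
by_strength = stratified_mean_score(example)
print(by_strength)
```

Reading the "weak" bin on its own makes it visible when a headline number is mostly averaging over noise.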

Is it learning actual biology?

In drug discovery, we don’t just care about a good average prediction—we care about whether a model captures, in a way we can measure and quantify, the causal functional impact of a perturbation on a cell. In practice, this means asking whether the model’s internal representation supports the manipulations we routinely reason with: dose changes, target engagement, pathway relationships, and cross-modality links (genetic vs. chemical perturbations). We treat these as diagnostic checks: passing them does not prove the model is “right,” but failing them is often a warning sign that the model is exploiting shortcuts or missing structure that will matter in real deployments.

We’re cautious with “priors”: scientific history is full of surprises, and some intuitions will be wrong. Still, when a Virtual Cell both performs well and exhibits coherent, controllable behavior under these diagnostics, it is a strong reason to investigate deeper. Conversely, incoherence often points directly to where evaluation, data assumptions, or modeling choices are failing. The Large Perturbation Model is an example of a model that exhibits coherent representation patterns.

Among other things, we look for:

Dose response and directionality: does increasing dose move the predicted state consistently (and saturate plausibly), rather than producing erratic jumps?

MoA sensitivity vs chemistry sensitivity: Do drugs with shared mechanism of action cluster more strongly than drugs with superficial chemical similarity? Can the model separate on-target effects from confounders such as general toxicity or stress responses?

Target ↔ pathway consistency: if a pathway is genetically perturbed in training, does the model generalize to drugs acting on that pathway in a predictable way?

Cross-modality alignment: are a single-target inhibitor and its KO/CRISPRi analogue represented as “nearby” in the right contexts? More importantly, does genetic evidence improve out-of-distribution drug predictions in a measurable way?

Attribution sanity: when the model predicts a response, can we attribute it to plausible genes/pathways rather than dataset artifacts (batch, cell line identity proxies, etc.)?
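As an illustration of how the first of these diagnostics could be turned into a number, here is a rough sketch of a dose-consistency check. It assumes you can summarize the predicted effect at each dose as a single scalar (for example, the norm of the predicted differential expression); the function name and the scoring rule are hypothetical, and a real check would also need to handle saturation and replicate noise.

```python
def dose_consistency(doses, effect_sizes, tol=0.0):
    """Fraction of consecutive dose steps that move in the dominant direction.

    1.0 means the predicted effect changes monotonically with dose;
    values near 0.5 indicate erratic jumps.
    """
    pairs = sorted(zip(doses, effect_sizes))  # order predictions by dose
    steps = [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]
    if not steps:
        return 1.0  # a single dose cannot be inconsistent
    ups = sum(s >= -tol for s in steps)
    downs = sum(s <= tol for s in steps)
    return max(ups, downs) / len(steps)

# Made-up predicted effect sizes at four doses.
smooth = dose_consistency([0.1, 1, 10, 100], [0.10, 0.40, 0.80, 0.85])
erratic = dose_consistency([0.1, 1, 10, 100], [0.10, 0.90, 0.20, 0.70])
print(smooth, erratic)
```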

Thanks to Bence Szalai and Krishna Bulusu for reviewing and improving this note.

These are datasets from Replogle et al. and Nadig et al.

Spanning 2 cell lines: HCT116 and HEK293T

The “pan-expression” wide version

How did we get a regression model to the top of the Virtual Cell Challenge

Kris Szalay — Tue, 09 Dec 2025 15:00:31 GMT

This post is a high-level summary of our key learnings. You can find a more detailed and technical write-up on our Challenge experience here: Mean Predictors write-up

It’s been a while since the last DREAM challenge tested our ability to predict cellular responses. ARC took up the mantle with their Virtual Cell Challenge.

Naturally, we joined. We are the meanest predictors, after all.

In the end, a simple model first described in 1970 climbed to the top of the leaderboard for a while, ultimately landed 15th overall, and became one of the best “generalist” models under ARC’s newly announced criteria.

The placement itself is not what matters. What’s interesting is why this model was the right choice.

State of the VCC leaderboard on the 14th of October

The newly announced Generalist Leaderboard

Single cells are only useful if they are different

There’s a strong temptation to treat single-cell measurements as “big data.” After all, Perturb-Seq gives you reads from hundreds of thousands of cells. But what you get from each cell is just a somewhat redundant fragment of the biological reality.

In practice, we’ve always found that the information content comes from how many distinct pseudobulks (clusters) you can form. In heterogeneous ex vivo material, you can get 20–50 clusters per condition.

However, in Perturb-Seq, colonies are relatively homogeneous. The true value of the technology lies in making it possible to do many conditions together in one flask, not in it being single-cell. So the number of effective data points you have is 1 RNASeq data point per condition.

So, as an exercise, how many effective RNASeq data points does Tahoe-100M have?

Based on the description, they multiplex 50 cell lines with 1,100 perturbations, giving an effective data size of 55,000 data points.

Which means that in practice, for all the cells sequenced in VCC, the effective dataset looked roughly like this:

150 training points
50 for the leaderboard
100 for the final evaluation

which means 300 effective data points.
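The arithmetic behind these counts can be written down in a few lines. This is only a back-of-the-envelope sketch of the "one effective data point per condition" rule described above; the nominal cell count is an assumption read off the "100M" in the Tahoe-100M name, and the function name is made up for illustration.

```python
def effective_points(n_contexts, n_perturbations):
    """One pseudobulk profile per (cell context, perturbation) condition."""
    return n_contexts * n_perturbations

# Tahoe-100M: 50 cell contexts x 1,100 perturbations, as quoted in the text.
tahoe_effective = effective_points(50, 1100)

# Assumption: roughly 100 million sequenced cells, taken from the dataset name.
nominal_cells = 100_000_000
cells_per_condition = nominal_cells / tahoe_effective

# VCC: effective points per split, as listed in the text.
vcc_splits = {"training": 150, "leaderboard": 50, "final evaluation": 100}
vcc_effective = sum(vcc_splits.values())

print(tahoe_effective, round(cells_per_condition), vcc_effective)
```

Under these assumptions, each effective data point is backed by on the order of a couple of thousand largely redundant cells.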

This is why many teams like us chose to ignore the individual cells entirely and predicted only the clusters’ behavior, still finishing in the top spots. I think this is pretty strong evidence that using all the single cells adds little to no additional information despite using orders of magnitude more compute.

The bottleneck today is data, not AI

If you add up the useful RNASeq-like data available to the field - Perturb-Seq, Tahoe (Drug-Seq), regular RNASeq, even LINCS (Broad’s public microarray-like dataset) - you will get to something like half a million effective datapoints. Compare this to the number of possible configurations of a 20,000-gene system and it becomes clear we still only see the tip of the iceberg. So regularization and inductive bias are still paramount.

Therefore, an AI architecture is only as good as (1) its ability to ingest the limited data we have, and (2) the inductive biases that allow it to exploit symmetry and structure.

So don’t dismiss simple models like regression or Random Forest. Given the same data, they can perform on par with far more complex architectures, and often generalize better to other tasks.

The person I probably learned the most from about machine learning is Prof. Abu-Mostafa. He said that your model should match the complexity of your data, not the complexity of your problem.

In 2025, we can formulate this as a philosophical razor:
The simplest model that covers your data and inductive bias will generalize best.

Relevant data beats more data. Put the work in.

We did not use any pretraining to get this high on the leaderboard. This reinforced our view that, today, foundation models offer little to no added value predicting perturbation effect.

What did matter were the handful of other Perturb-Seq datasets we found (Replogle et al., Nadig et al.) which contributed most of our predictive power1. These represent a tiny amount of data both compared to pre-training datasets (Geneformer) and to large-scale perturbation screens (LINCS or DepMap), but are much closer to the physical assay we aimed to predict.

However, incorporating these datasets required weeks of cleaning, alignment, and integration. Smaller teams recoiled from this work. We had the experience and tooling to do it, and it made a substantial difference. What matters is putting the work in.

Even a small set of 50–100 data points can materially improve prediction accuracy when those points are closer to the real scenario you aim to model than anything else available.

Metrics are hard, but especially in 20,000 dimensions

Measuring prediction performance in a way that reflects real-world use is notoriously difficult. Models hammer any scoring function thousands of times per GPU-second. If your metric has loopholes, the models will find and exploit them.

We always generate scatterplots, as strange patterns often reveal metric artifacts. For example, this is how we just discovered that ARC’s AUPRC method was unexpectedly generous to pseudo-bulk predictions2.

Now all of this becomes exponentially harder if you are trying to score predictions in a 20,000-dimensional space.

The core VCC metric PDS (Perturbation Discrimination Score) ranks which true perturbation vector your prediction is closest to. But what does “closest” really mean?

Teams quickly learned that the magnitude of the predictions matters more than the direction of the RNA vector, which led them to apply scaling transformations to get better (but not necessarily more biologically correct) PDS scores.
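To see why magnitude can dominate, consider a simplified rank-based discrimination score. This is a toy stand-in, not ARC's implementation: it assumes L1 distance and a normalized rank, and the three two-gene "perturbation profiles" are made up. A prediction with exactly the right direction but half the magnitude loses points, and a plain rescaling recovers them.

```python
def l1(a, b):
    """Manhattan distance between two profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def rank_discrimination_score(predicted, truth):
    """Mean normalized rank of the matching true profile (1.0 = always closest)."""
    n = len(truth)
    scores = []
    for i, pred in enumerate(predicted):
        dists = [l1(pred, t) for t in truth]
        # Rank = number of other true profiles strictly closer than the right one.
        rank = sum(d < dists[i] for d in dists)
        scores.append(1 - rank / (n - 1))
    return sum(scores) / n

# Made-up differential profiles for three perturbations over two genes.
truth = [[1.0, 0.0], [4.0, 0.0], [0.0, 2.0]]

# Right direction for every perturbation, but half the magnitude.
shrunk = [[0.5 * v for v in row] for row in truth]
# The same predictions after a global rescaling by 2.
rescaled = [[2.0 * v for v in row] for row in shrunk]

score_shrunk = rank_discrimination_score(shrunk, truth)
score_rescaled = rank_discrimination_score(rescaled, truth)
print(score_shrunk, score_rescaled)
```

In this toy case the shrunk prediction for the second perturbation lands closer to the first true profile than to its own, even though its direction is perfect.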

Are you more interested in which genes change, or how much they change? And is a model that gets a subset of genes quantitatively right better than one that gets most genes only qualitatively right?

I don’t think we can clearly resolve this without fully understanding how RNA levels map onto the cell’s internal state, and that’s still ahead of us. Until then, the “right” metric is fundamentally tied to the downstream application you care about.

ARC’s metrics received plenty of criticism, but as general-purpose measures, they are not nearly as flawed as portrayed. They capture different facets of the prediction vector. They can (and should) be hardened, but given the lack of a universal downstream task, VCC’s choices were a reasonable first attempt.

For our applications, we prefer Pearson delta, which tests whether the pattern across the top differentially expressed genes is preserved3.
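A rough sketch of a Pearson-delta style score is shown below. It is an assumption-laden illustration rather than our production metric: genes are ranked by the absolute observed change versus control (a real pipeline would more likely use a differential expression test), and `top_k`, the helper names, and the toy values are all hypothetical.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant vector
    return cov / (sx * sy)

def pearson_delta(pred, truth, control, top_k):
    """Correlate predicted and observed changes over the top_k most changed genes."""
    true_delta = [t - c for t, c in zip(truth, control)]
    pred_delta = [p - c for p, c in zip(pred, control)]
    # Rank genes by the size of the observed change and keep the top_k.
    order = sorted(range(len(true_delta)),
                   key=lambda g: abs(true_delta[g]), reverse=True)
    top = order[:top_k]
    return pearson([pred_delta[g] for g in top], [true_delta[g] for g in top])

# Made-up log-expression values for six genes.
control = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
truth = [7.0, 3.0, 6.0, 5.1, 4.9, 5.0]
pred = [6.5, 3.5, 5.5, 5.3, 4.6, 5.2]

score = pearson_delta(pred, truth, control, top_k=3)
print(round(score, 3))
```

Restricting to the most changed genes keeps the score from being dominated by genes that only carry measurement noise.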

But don’t forget that RNA-seq prediction itself is not the end goal.

It’s best to evaluate virtual cells at the application

In applied settings, you want to know whether your predictions hold in real life: cell viability, cytokine levels, metabolism, motility - anything tied directly to phenotype. RNA-seq is only a fingerprint of the cell’s internal, unobservable state. And a fingerprint of a proxy will never fully answer the question you care about.

If you want to evaluate virtual cells, measure them at the application layer. That removes the guesswork about whether the model captured the part of the cell state from RNA that matters for your task.

This is not to diminish the value of RNA-seq itself. Post-treatment transcriptomics remains the closest thing we have to a generic, reusable data type - a kind of connective tissue that can link many downstream assays together.

So, do we have Virtual Cells or not?

To some degree. The VCC results show that with relevant data you can approach wet-lab reproducibility for certain perturbation effects.

But the challenge was not difficult enough to demonstrate robust cross-assay generalization, which is a key requirement for truly general virtual cells. Our internal experiments confirm this remains unsolved.

A huge thank you to the meanest predictors: Bence Szalai, Gerold Csendes, Bence Czako and Gema Sanz, who did the lion’s share of the work.

And of course, thanks to the ARC team for all the work they put in to make this challenge happen! We really enjoyed it.

We added the external Perturb-Seq datasets as input features, not as additional training data. That is mostly an implementation detail, but worth noting.

How do you draw an AU-PRC curve when you only have a single point? Under the standard trapezoidal rule, you would project that point onto the Y-axis. ARC’s more generous interpretation is that if you push the threshold all the way to zero, you trivially achieve infinite precision.

Pearson delta only works when the perturbation produces a sufficiently large effect. Otherwise, you are simply correlating your predictions with experimental noise. You must first filter for responders. The number of data points that survive that filter will probably make you sad.

What can Virtual Cells do for you today?

Kris Szalay — Tue, 11 Nov 2025 15:02:20 GMT

The term Virtual Cell has taken off over the past year. On one hand, it’s great to see so much attention on computational biology. On the other, it has created confusion about what current and emerging technologies can actually do.

Let me try to clear that up.

What is a virtual cell?

The current trend around virtual cells traces back to this article by Bunne et al., where they introduced the term AIVC, or AI Virtual Cell. According to their definition:

In particular, an AIVC needs to have capabilities that allows researchers to (1) create a universal representation (UR) of biological states across species, modalities, datasets, and contexts, including cell types, developmental stages, and external conditions; (2) predict cellular function, behavior, and dynamics, as well as uncover the underlying mechanisms; and (3) perform in silico experiments to generate and test new scientific hypotheses and guide data collection to efficiently expand the virtual cell’s abilities.

In other words, a virtual cell would be a complete, general simulation - one that can accurately predict biology in new contexts without retraining.

That’s the dream.

Explainer box on biological context

Biological context is an umbrella term for all of the known or unknown variables and interactions that influence an experiment's outcome but are not explicitly measured.

The same cell can respond differently depending on:
- neighboring cells
- extracellular environment
- experimental protocol
- the natural evolution of the cell line over time
- or anything like the current moon phase (I kid you not)

which then become implicit conditions on any biological measurement. (Drug A is effective in X cells, but only if... after X days, without support cells close enough, etc.)

The Bunne et al. paper states that the main bottleneck to achieving this general virtual cell is integration of diverse data types and scales. Some readers interpreted this as suggesting that, with the right (foundation) modeling approach, we could build a general virtual cell from the data we already have,

which is where my opinion differs.

Explainer box on foundation models vs virtual cells

It’s easy to confuse these terms because they’re often mentioned together, but they are different concepts. 

Foundation modeling is a method. Foundation models are AI models first trained on a massive, messy, unlabeled dataset (in our case, data with no specific treatment or outcome target; it is just there) to understand the general problem space. Then, with a smaller set of task-specific data, they can be fine-tuned to perform specific tasks. This is how today's (Transformer-based) large language models work.

A virtual cell is an application. Virtual cells are models that predict how a biological system will respond to novel inputs. Virtual cells can use any type of model internally from a simple linear regression to a deep neural network like a Transformer.

So you can have virtual cells that aren't foundation models and foundation models that aren't virtual cells.
But foundation models are probably our current best bet for moving toward that elusive general virtual cell.

Understanding context will enable general virtual cells

If we want models that generalize to new biological contexts, we first need to understand what context actually means. You can’t predict outcomes in a new situation without knowing whether your training data can tell you that the situation is different.

We don’t need to list every single variable that influences an experiment - there are too many - but we do need cell “snapshots” rich enough to capture the fingerprints of all meaningful influences.

Put simply: if the same input can lead to multiple outcomes (beyond random noise), then the input is incomplete.

In theory, a perfect RNA-seq dataset might capture all this information, since most biological processes leave some trace in RNA levels. In practice, though, today’s signal-to-noise ratio likely erases many of those traces.

This probably isn’t a fundamental limit of biology, but of our experimental tools and how we use the data. Because

biology itself is remarkably reproducible; monozygotic twins (and clones) prove that you can replay embryonic development with great fidelity.

Yet, we struggle even trying to combine data from different labs or protocols.

So to build a general virtual cell, the first thing - I think - we’d need is a numeric description of cells detailed enough that no additional biological context needs to be specified. A digital cell, if you’d like.

This isn’t entirely out of reach, but I don’t think any data we currently have (in the forms we usually process them) carries that level of context.

So for general virtual cells, rather than throwing more compute at existing datasets, the better path forward is to determine which kinds of data actually enable predictions across new biological contexts, and start generating that.

The virtual cells we already have

What today’s data and methods give us is narrow virtual cells. Given calibration data for a well-defined assay, we can already virtualize many laboratory experiments with impressive accuracy. We call these virtual assays.

When we looked at which experiments could be simulated most reliably, one pattern stood out: success didn’t depend on the type of perturbation (drug, CRISPR, RNAi) or on the specific endpoint (viability, transcriptomics, phenotype). It depended almost entirely on the nature of existing data from that assay. Like this:

How well can virtual experiments replace wet-lab ones? Prediction performance is measured on a scale where 0 means no better than random (baseline/bias) and 1 equals the reproducibility achieved between independent wet labs running the same assay using one of our robust metrics. We consider a problem solved when performance ≥ 0.8, okay above 0.6, and marginal above 0.25. Each row shows how much task-specific training data was available. The more overlap between the training data and the target experiment (same cell type, same perturbation), the closer virtual results come to real lab reproducibility.
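The scale described in this caption can be sketched as a simple rescaling. The linear form is an assumption (the caption only fixes the two endpoints), the function names are made up, and the label for values at or below 0.25 is our own placeholder since the caption does not name that region.

```python
def normalized_performance(model_score, baseline_score, reproducibility_score):
    """Rescale a metric so that 0 = baseline and 1 = wet-lab reproducibility."""
    span = reproducibility_score - baseline_score
    if span <= 0:
        raise ValueError("reproducibility must exceed the baseline")
    return (model_score - baseline_score) / span

def verdict(value):
    """Map a normalized score to the categories used in the caption."""
    if value >= 0.8:
        return "solved"
    if value > 0.6:
        return "okay"
    if value > 0.25:
        return "marginal"
    return "below marginal"  # placeholder name, not from the caption

# Made-up raw metric values for one assay.
example = normalized_performance(model_score=0.55,
                                 baseline_score=0.25,
                                 reproducibility_score=0.75)
print(round(example, 2), verdict(example))
```

Normalizing this way keeps a noisy assay from making a model look bad, and a strong baseline from making it look good.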

A few interesting observations from this.

Virtual cells can already help scale your experiments. You can run hundreds of millions of simulated experiments, choose the best ones, and the majority of them will work as expected1 (if you have previous training data for your drug in which case you are in one of the bottom rows).
Adding new perturbations without training data makes the problem much harder. There is real signal in the simulations, but you need specific expertise to untangle it from the noise. This is the region where a good model and clean data can make the biggest difference.
You can only be as good as the underlying assay is. Some wet-lab assays are bad at reproducing their own results. I’m looking at you, Bliss synergy2. I imagine that as our understanding of what constitutes biological context grows, we can design experiments that replicate better.
Most foundation model tests target the easy problems. The currently running Virtual Cell Challenge provides training data that already includes the exact cell line and perturbations being tested. There is not much benefit to be gained here by heavy-duty foundation models compared to simple ML models.

The “AGI tests” of biology

So, we are, for the time being, in the narrow era. Every new advance is appreciated, but be careful - it’s easy to claim far too wide and general usability for the next narrow win.

I think there are some biological problems which can’t be solved without genuine context transfer. Mastering those would mean our models have begun to understand biology itself.

If you wanted to define an “AGI3 test” for biology, the ladder of progress could look like this:

Take a new drug with known structure and binding data and predict how cells will respond using only their CRISPR response.
Predict drug synergy between two compounds that have only been tested in monotherapy before.
Predict how two cell types, previously tested alone, respond when co-cultured.
Predict how a drug tested only in vitro will behave in ex vivo or in vivo models.

These sound simple, but none - to my knowledge - have been solved to experimental reproducibility4, not even close. Some models and approaches are useful, but far from reproducibility.

Still, step by step, as we expand the context distance and the time horizon of what our models can predict, we’ll eventually get to the ultimate test of biological understanding:

Given only the DNA of a zygote, replay embryonic development until the baby is born.

The work of Laszlo Mero, Gerold Csendes, Murat Cem Kose, Dana Zemel and Robert Sipos was instrumental in getting the data package together for the article. Also thanks for the reviews, Louisa Roberts, Krishna Bulusu, Daniel Veres, Imre Gaspar and Valer Kaszas!

I mean half of the signal will still be there. Interesting signals are rare. If you have 50% (coin-flip) accuracy, and 5% interesting signal, most likely all of your top 10 experiments will fail. You need more than 90% accuracy on the average data point to keep half your signal in the wet-lab. Here’s an illustrative example below (with 6% signal to keep numbers nice).
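One way to read these numbers is as a hit-rate calculation. The sketch below assumes the model is equally accurate on interesting and uninteresting data points (sensitivity = specificity = accuracy), which is a simplification, and the function name is made up for illustration.

```python
def hit_rate(accuracy, signal_fraction):
    """Share of predicted hits that are real hits.

    Assumes the same accuracy on true signals and on non-signals.
    """
    true_pos = accuracy * signal_fraction
    false_pos = (1 - accuracy) * (1 - signal_fraction)
    return true_pos / (true_pos + false_pos)

coin_flip = hit_rate(accuracy=0.50, signal_fraction=0.05)
decent = hit_rate(accuracy=0.90, signal_fraction=0.06)
break_even = hit_rate(accuracy=0.94, signal_fraction=0.06)

# Chance that all of the top 10 picks fail under coin-flip accuracy.
all_ten_fail = (1 - coin_flip) ** 10

print(round(coin_flip, 3), round(decent, 3), round(break_even, 3),
      round(all_ten_fail, 2))
```

Under this reading, a coin-flip model leaves only 5% real hits among its picks, and with 6% signal you need about 94% accuracy before half of the selected experiments hold up.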

Combination synergy metrics, like Bliss, are famous for being noisy. It is because they are fundamentally difference metrics of the effectiveness of the drugs administered individually vs in combinations. A difference - especially if the two arguments are close - amplifies noise, sometimes extremely so. The figure below should give an intuition why that happens.
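A small worked sketch of the noise amplification, using the usual Bliss independence expectation for fractional effects between 0 and 1. The error propagation is first-order and assumes independent measurement noise with the same standard deviation on all three measurements; both are simplifying assumptions, and the example numbers are made up.

```python
import math

def bliss_excess(e_a, e_b, e_ab):
    """Observed combination effect minus the Bliss independence expectation."""
    expected = e_a + e_b - e_a * e_b
    return e_ab - expected

def bliss_excess_sd(e_a, e_b, sigma):
    """First-order noise on the excess for independent noise of sd sigma.

    Partial derivatives of the excess: 1 for e_ab, -(1 - e_b) for e_a,
    and -(1 - e_a) for e_b.
    """
    return sigma * math.sqrt(1 + (1 - e_b) ** 2 + (1 - e_a) ** 2)

# Two drugs at 50% effect each, 80% in combination, 5% measurement noise.
excess = bliss_excess(0.5, 0.5, 0.8)
noise = bliss_excess_sd(0.5, 0.5, sigma=0.05)
print(round(excess, 3), round(noise, 3))
```

In this example the synergy signal (0.05) is smaller than its own noise (about 0.06), even though each individual measurement is only 5% noisy.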

Artificial General Intelligence - simply put, the quest in general AI to get to a model which is really intelligent, not just sounds intelligent. I think it’s a good analogue, but otherwise it’s a huge can of worms I really don’t want to open here.

With a robust metric - I know I’m belaboring this point, but none of these are apparent if you’re using a simple metric. It will just give you a nice 70-90% accuracy, and you’ll be happy until the wet-lab experiments start failing. Since the outliers will be selected for lab validation, your model needs to work there, not just on the average data point.

The Patient Prediction Puzzle

Kris Szalay — Thu, 16 Oct 2025 15:02:35 GMT

OK, we need to talk about patient response predictions.

The baseline is surprisingly high

There’s a striking disconnect between how pharma insiders view AI’s relevance in clinical trials (close to zero) and how it’s often marketed (AI is already here! 70% accuracy!).

If someone tells you they can predict how patients will respond to a drug with 70% accuracy, how surprised should you be? Given that most phase 2 trials - where efficacy is first measured in humans - fail, that sounds like a big leap forward, right?

Let’s test that.

The gold standard for fully mapped patient data is the Cancer Genome Atlas (TCGA). The team found around 800 data points where we know all three of:

the patient’s molecular profile before the treatment
the treatment itself
the outcome

On first approach, how does a simple linear regression do?

Simple predictive models’ AUROC scores on the TCGA data. Train-test was split over the patient dimension. Different bars show the inclusion of different data points. The “clinical” bar is the one referred to in the text, including only three pieces of information: the stage and indication of the cancer and the drug identifier.

Wait, 75% already?

What’s the real question?

By predicting clinical patient response data, the question you’re really asking is:

Can you predict how an approved drug will work on a patient treated according to protocol?

But this data contains very little molecular information. The best-performing indications and biomarkers are already built into the treatment protocol, and doctors rarely go off-script.

No wonder the only meaningful variables that show up are the cancer’s indication, its stage, and the drug used - essentially, the survival statistics of the state-of-the-art.

That’s not the question we actually want to answer.

The real question for getting a new drug into the clinic is:

Can you predict how well a new, never-before-tested drug will work in different indications and where it could beat today’s standard of care1?

That question, almost by definition, can’t be answered from existing patient data. You only get 2-3 data points per patient, and from a very limited set of drugs. There’s simply no way to infer what would have worked.

More data?

Maybe the problem is just scale. Let’s add more data. Why just 800 data points, anyway? There are hundreds of thousands of patient data points out there.

On one hand, good data is surprisingly scarce.
We started with 12,000 samples from TCGA and 200,000 from GENIE, but it quickly started eroding.

Most samples had no response data at all.
Many represented early-stage cancers removed surgically - no way to tell whether the drug was even necessary.
Others were sequenced after treatment, so the tumor’s molecular profile represents the recurrent tumor, not the one you treated.

After cleaning, we were left with about 800 usable data points from TCGA and 3,000 from GENIE.

In theory, with unlimited data, the problem becomes solvable: for every patient who doesn’t respond to a given drug, there would eventually be others similar enough who responded to a different drug. At that point, an AI model could start connecting the genetic dots.

That’s exactly the hypothesis we’re testing now in collaboration with Memorial Sloan Kettering Cancer Center in this year’s iHub Challenge program - how much additional signal would emerge if we had an order of magnitude more data?

Huge thanks to Rick Peng, Jae Zhong and the MSK team for making this possible.

Building bridges

Still, my intuition is that patient data will remain just one piece of the puzzle. To reach the right level of molecular understanding, we’ll likely need to mix in data with much higher information density - where we can actually generate true counterfactuals (what-ifs).

That means simpler, but repeatable models like in vitro, organoid or PDX (patient-derived xenograft) models.

Explainer box: The acronym PDX stands for Patient-Derived Xenograft. By implanting human tumor cells into mice, we can test drugs not yet approved for clinical trials on fresh human cells, in a tumor-like living environment, without compromising patient safety (unfortunately, the same can't be said for the mice).

Image from https://doi.org/10.3390/cells10030712

But how do we bridge these worlds?

An in vitro cell line doesn’t map directly to a patient. If you change both the treatment and the input system simultaneously, you can’t tell whether the outcome came from the drug or the model.

Turns out there might be a bridge. It’s faint and thin for now, but it could just become strong enough if we put the work in. See the following plot:

Predictive performance of the in vivo PDX outcomes on the corresponding patients (red line), and when shuffled within an indication (black line). The cyan line is the baseline from clinical data. n=23

PDX models show a predictive signal above that of the clinical baseline - but only when matched to their original patient. Just the same indication isn’t enough.

Now all our studies are very preliminary, hundreds of patients at best. The image above is made from just 23 patients.

That’s because there isn’t much data that meets these strict criteria, especially publicly. That’s why I’m glad to have our collaboration with Champions Oncology (shout-out to Matt Newman for helping us make this happen!). Together, we’re expanding that bridge to thousands of data points.

If it holds, this connection could let us integrate all the “unmatched” PDX datasets as well into one coherent predictive system.

I’ll be back with details when we get there.

Kudos to the many people on the Turbine team who made the studies and these collaborations possible: Richard Izrael for leading the Champions partnership, Emese Sallai-Simon for supporting us during the MSK process, Gema Sanz, Bence Czako, Richard Izrael and Bence Szalai for the PDX/patient studies!


1. While not being unreasonably toxic, of course.

This is All an Experiment

Kris Szalay — Mon, 14 Jul 2025 14:02:48 GMT

I have to come clean: this isn’t just a company blog. It’s also an attempt to test a new way of publishing - one that tries to fix some of the gaps in the scientific system.

You can’t force science to work

Academia is built on a business model of publishing papers. I still feel uneasy about some of the papers I published during my PhD. By my current standards, most would never have seen the light of day, like this first paper mentioning Turbine. But I try to empathize with other young researchers. Why do we expect every PhD student to make a significant, standalone discovery, in isolation?

That’s probably why I’ve wanted to help fix how science operates ever since I learned the ropes outside academia. I got burned so many times when a nicely tuned model fell apart on new data.

For a while, I chased the idea of a magical metric that could extract the truth from a pile of publications no matter how biased they individually were. Some techniques help. But I have never found any way to fully escape Goodhart’s law 1 or the garden of forking paths 2.

So what’s the next best thing?

Playing our own hand well.

How our research can be different

Getting the incentives right

Being a startup gives us the freedom to do research differently.

We are incentivized to deliver real value - value customers are willing to pay for. That means our models have to actually work. They must be predictive on new data, in the environments that matter.

That’s harder than it sounds. In statistics - and by extension, machine learning - it’s shockingly easy to fool yourself, especially with messy data. And under pressure, even easier to fool others.

That’s why this is the first rule in my research teams:

No one is incentivized to produce a specific outcome in any given experiment.

Their job is to give their best estimate of what the truth might be. (We only need to ensure that the whole set of experiments is designed so they lead us - eventually but inevitably - toward some solution.)

Staying focused

We aim to solve specific problems in biology. For example: building models that can accurately predict how cancer patients will respond to new drugs.

We’re not bound by the usual academic constraints: no need for everyone to make a separate, novel discovery to earn a PhD, and no need to follow grant requirements written five years ago. Entire teams can focus on the one big question, pivot quickly from dead ends and iterate faster. This lets us make much more rapid progress.

Doing replications

This focus gives us the bandwidth to not just learn from others, but to integrate their work by replicating key studies on our own datasets, using our internal benchmarks.

That often takes weeks - adapting the code, reshaping the data and running it through our evaluation pipelines. But it’s worth it. Replications help us pick better paths forward.

And this different process gives us something valuable to contribute back:
publishing replications and the dead ends we’ve ruled out.

The current scientific publication system isn’t designed for that. But no matter. The Web is large.

A different way of publishing

So I’m experimenting with a new way to publish.

When I come across something useful, I’ll first publish it here. Before posting, I run it by people I trust - people willing to add their name below. It’s a kind of open review. If someone from the team is willing to expand it into a full article, they’re welcome to.
Personally, I think reviewed preprints are perfectly valid scientific outputs. But if they want to take it through the full journal process, I won’t stand in the way.

A springboard to write papers

So these posts are breadcrumbs. Useful on their own, but also possible seeds of future papers. That comes with a few advantages.

No need to retrofit a story

Many papers cram disconnected results into a single “story” just to meet the bar of publication in journals. That creates brittle arguments and logic that’s hard to follow. Here, each post tells one story. It’s easier to write and easier to read.

Easier to read

Scientific writing doesn’t have to be dry, it just often ends up that way. Partly due to authorship-by-committee and partly because most of us aren’t native English speakers.

But mainly because that’s the convention. I remember my PhD supervisor redlining my writing back into passive voice and past tense. “This is how you write scientific text”, he said. “It needs to sound serious”.

That’s just the norm. Journals themselves are trying to push for plainer language, but it hasn’t caught on.

Well, it can here.

Trade secrets

Some of what makes Turbine work has to stay confidential. That’s what funds the science we want to do tomorrow.

But there’s a lot I can share. Even just knowing what didn’t work can be valuable, even without full access to our code or data.

Since we can’t open-source everything, many of these insights would otherwise go unpublished.

Less friction for publishing

In the hectic startup life, putting the full timeline of getting something published on a roadmap is daunting. So publishing usually ends up staying in the backlog. Starting with a blog post, then evolving it into a preprint, and finally submitting for peer-review may help break it up into manageable chunks.

Strengthen existing science

Since most published research findings are false, publishing negative results and replications is essential. But they don’t fit into the current system.

This format lets me contribute in a way that helps strengthen the science we all build on - and maybe show that failures often teach more than success stories. They should be celebrated, not hidden in the drawer.

So we can all keep pushing biology forward. Together.


1. “When a measure becomes a target, it stops being a good measure”.

2. Unintentionally increasing the degrees of freedom of a statistical test when preparing the data for analysis, thereby inflating the significance of findings, producing false positives. I have never found a method which can reliably correct for unobserved multiple comparisons.

IC50 is a deep rabbit hole

Kris Szalay — Mon, 30 Jun 2025 14:00:58 GMT

Data harmonization is often an afterthought in the brave new world of large ML models. Just feed the AI everything, and let the model figure it out. Right?

Real ground truth doesn’t exist in biology

This approach worked beautifully for language, but there’s a reason why biology is an entirely different beast.

In language, text is its own ground truth. Every sentence is a sequence of characters. All meaning there is emerges entirely from that sequence. An A is an A no matter the font. All letters of the same kind are completely interchangeable, so the alphabet is a complete catalog, and the text is a complete description.

But a cell is not just a cell.

Every cell has state. In theory, every single living cell traces its lineage back to LUCA - the last universal common ancestor - but each one has a slightly different history. Therefore no two cells are really identical. Some of those differences in the cell state probably don’t matter for predicting response, but others certainly do. The problem is we don’t yet know which is which. If we could measure everything that matters, we could have a complete description similar to text. That’s not the case for the time being.

We don’t yet have any way to completely describe a cell with numbers.

So any experimental readout - whether it’s viability, protein assays, sequencing, or omics data - is just a snapshot of a yet invisible generating process. That’s ultimately what we’d like to understand - a gargantuan hidden Markov model with bajillions of states.

A quick detour on hidden Markov models

…Yes? You don’t know what hidden Markov models are? Oh.

If you grew up after deep learning happened, you might have missed hidden Markov models (HMMs).

The idea is that there is an internal “generating state” which you cannot directly observe (X₁, X₂, X₃), but you know that each state has slightly different probabilities (b₁₁, b₁₂, etc.) of producing visible outputs (y₁, y₂, …).

Source: Wikipedia/Hidden Markov model

The goal is to learn these probabilities and infer the generating state from the observed sequence. A classic example was predicting whether an unknown DNA segment is part of a gene. Coding regions tend to have higher GC content than non-coding regions, which the HMMs could recognize.
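To make the GC-content example concrete, here is a minimal sketch of a two-state HMM decoded with the Viterbi algorithm. All probabilities below are invented for illustration and are not fitted to any genome; the state names and the `viterbi` helper are likewise just assumptions for this example.

```python
import math

STATES = ("noncoding", "coding")

# Illustrative parameters only (assumed, not estimated from real data).
START = {"noncoding": 0.5, "coding": 0.5}
TRANSITION = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMISSION = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(sequence):
    """Return the most likely hidden-state path for a DNA string."""
    # Log-probability of the best path ending in each state.
    best = {
        s: math.log(START[s]) + math.log(EMISSION[s][sequence[0]])
        for s in STATES
    }
    backpointers = []
    for base in sequence[1:]:
        step, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive from when moving into state s.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANSITION[p][s]))
            step[s] = (
                best[prev]
                + math.log(TRANSITION[prev][s])
                + math.log(EMISSION[s][base])
            )
            pointers[s] = prev
        best = step
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# 8 AT-rich bases, 12 GC-rich bases, 8 AT-rich bases.
path = viterbi("ATATTAATGCGCGGCCGCGCATTATAAT")
```

With these toy numbers the decoded path labels the GC-rich stretch in the middle as "coding" and both flanks as "noncoding". You never observe the state itself, only the bases it tends to emit.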

So, where were we?

You can be training the wrong idea

Since we don’t have a complete cell descriptor, the context in which your data was generated matters a lot. Metadata is key.

Otherwise you will confuse the AI.

Say you’re training a model using both gene knockout (KO) and drug response data. There’s a discrepancy: a pair of data points where a drug kills the cell but knocking out its target doesn’t.

Why? A drug may bind to other proteins we don’t know about. If your system just assumes that KO = drug (in high enough dose), you just taught the model that the drug kills the cells at low doses, and brings them back to life as the dose tends to infinity.

I’ve been surprised how often these curation issues get overlooked, especially by teams coming primarily from the ML world. But biology is messy. We’ve worked with big-name ML groups relatively new to biopharma and much of what follows came as a surprise to them. So let me show you just the tip of the iceberg.

Nothing fancy, no RNA-seq, Perturb-seq, no CRISPR, no complex combinations. Not even multiple modalities. Just plain old drug response data.

Curating an “IC50” data point

Let’s say you have this data point:

Dabrafenib, HeLa, 50% inhibition @ 10 nM.

What should you tell the AI? What is missing from here?
Let’s check the underlying data. You have that, right?

A simple, straightforward IC50 curve. Real data is rarely this nice.

This is what it might look like. We take the data points, fit a sigmoid curve to them, and take the halfway point on the y-axis, right? Simple.

But wait:

1. What do you mean by inhibition?

Remember, cells keep growing during an assay. So did you measure:

the point where the cells grow half as fast as without the drug?
the point where you have half as many cells as you started with?
the point where you have exactly half as many cells as a control growing in a different well?

All of these have, at some point, been labeled “IC50”. That’s incorrect.
The first one is called GI50 (for Growth Inhibition).
The second is LD50 (for Lethal Dose).
And only the third is truly IC50.

So unless you compare to a control, your value is not IC50, no matter what it’s called.

If you are not comparing to the control but to the t=0 cell count, you can have a GI50 or LD50 but not an IC50.
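As a rough illustration of how far these readouts can diverge on the very same well, here is a minimal sketch. The helper name `readouts` and the cell counts are made up for the example; the three ratios follow the three definitions above.

```python
import math


def readouts(n_start, n_control, n_treated):
    """Three ratios that have all been reported as 'inhibition' (hypothetical helper)."""
    return {
        # IC-style: treated cells relative to the untreated control well.
        "vs_control": n_treated / n_control,
        # LD-style: treated cells relative to the count at t=0.
        "vs_start": n_treated / n_start,
        # GI-style: treated growth rate relative to the control growth rate.
        "growth_rate_ratio": math.log(n_treated / n_start)
        / math.log(n_control / n_start),
    }


# 1000 cells seeded; the control well reaches 8000, the treated well 4000.
r = readouts(n_start=1000, n_control=8000, n_treated=4000)
```

In this toy case the well sits exactly at 50% versus control, yet the treated cells still grow at about two thirds of the control rate and there are four times as many of them as at seeding. The same dose would be the IC50, but nowhere near a GI50 or LD50.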

2. What do you mean by “50%”?

But what if your dose-response curve never hits 0?

Most drugs don’t kill all cells. So we must also distinguish between absolute 50% inhibition, and 50% of the maximum observed inhibition. Only the first one is IC50. The second is formally EC50, but of course these terms are used interchangeably all the time. It doesn’t help that they coincide on example curves where the dose-response curve does hit 0.

EC50 and IC50 turn out to be different measures when the maximum cell death is less than 100%.
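A small sketch of the difference, assuming a standard Hill-type curve with a non-zero plateau. The function names and the parameter `floor` (the fraction of cells surviving at very high dose) are assumptions for this example, not a description of any particular dataset.

```python
def viability(dose, ec50, hill, floor):
    """Fraction of control remaining; `floor` is the plateau at very high dose."""
    return floor + (1.0 - floor) / (1.0 + (dose / ec50) ** hill)


def absolute_ic50(ec50, hill, floor):
    """Dose where viability is exactly 0.5, or None if the curve never gets there."""
    if floor >= 0.5:
        return None
    # Solve floor + (1 - floor) / (1 + (d / ec50) ** hill) = 0.5 for d.
    return ec50 * ((1.0 - floor) / (0.5 - floor) - 1.0) ** (1.0 / hill)


full_kill = absolute_ic50(ec50=10.0, hill=1.0, floor=0.0)     # coincides with EC50
partial_kill = absolute_ic50(ec50=10.0, hill=1.0, floor=0.3)  # shifted to a higher dose
no_ic50 = absolute_ic50(ec50=10.0, hill=1.0, floor=0.6)       # never reaches 50%
```

Under this model the two values coincide only when the curve goes all the way to zero. With 30% of cells surviving, the same EC50 of 10 corresponds to an absolute IC50 of 25, and with 60% surviving there is no IC50 at all.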

3. How long was the treatment?

Some drugs act on different timescales. For instance DNA damage response (DDR) inhibitors generally do nothing in the standard 3-4 day assays. You need 7 days before enough DNA damage accumulates to visibly affect cell behavior.

But there’s also a mathematical problem.

Cells grow exponentially. At the same time, the 50% is a linear ratio of the untreated and treated cell counts. What you’re really modulating is the rate of growth (and cell death). So even if the drug effect is constant, a 3 and a 4 day measurement will yield different IC50 values.

Fortunately, you can correct for this mathematically, but only if you know how long the experiment lasted.
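Here is a toy model of that time dependence. It is a sketch under stated assumptions and not the correction we use: control cells grow exponentially, and the drug slows the growth rate by a fraction f(d) = d / (d + k_half). The function name and all parameters are hypothetical.

```python
import math


def ic50_at(days, doubling_time_days, k_half):
    """Dose at which treated/control = 0.5 after `days`, under the toy model.

    Assumed model: control grows as exp(g * t); the drug reduces the growth
    rate by a fraction f(d) = d / (d + k_half), so treated/control equals
    exp(-g * f(d) * t).
    """
    g = math.log(2) / doubling_time_days
    # exp(-g * f * t) = 0.5  =>  f = ln(2) / (g * t)
    f_needed = math.log(2) / (g * days)
    if f_needed >= 1.0:
        return None  # even complete growth arrest cannot halve the count yet
    return k_half * f_needed / (1.0 - f_needed)


day3 = ic50_at(days=3, doubling_time_days=1.0, k_half=100.0)
day4 = ic50_at(days=4, doubling_time_days=1.0, k_half=100.0)
```

The drug's effect on the growth rate is identical in both calls, yet the apparent IC50 drops from 50 to about 33 just by reading the plate a day later. In this model, knowing the assay duration (and the doubling time) lets you map both readouts back to the same growth-rate effect.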

4. How did you fit the curve?

If your data is well behaved, most curve-fitting methods will give you similar IC50 values.

But what if your data looks like this?

The sigmoid assumption doesn’t apply to resistant cell lines. If you’re lucky, you might get a faint slope, but generally it’s just going to be noise. NLME fit is designed to handle hard cases leveraging expected behavior from similar cell lines, but as you may see (red curves), these can easily become wild extrapolations.

So what should you do? Just say there’s no prediction? In many use-cases you need to give a value no matter what. It depends on your use-case, but make sure you use the right data. For example, GDSC’s non-linear mixed effects (NLME) fitting always returns a value: it uses other cell lines’ data to give you a prediction no matter how ugly. We prefer just saying no if we can, so we had to refit all the curves with a different algorithm. But to do that, we needed to be aware of the issue and we needed to get the individual dose point data.
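To show what "just saying no" can look like in code, here is a deliberately simple sketch. It is not our refitting algorithm and not GDSC's NLME fit: it only interpolates between measured dose points and returns nothing when the data never crosses 50% within the tested range. The function name is hypothetical.

```python
import math


def interpolated_ic50(doses, viabilities):
    """Log-linear interpolation of the 50% crossing, or None if there is none.

    `doses` must be ascending and positive; `viabilities` are fractions of
    the untreated control.
    """
    points = list(zip(doses, viabilities))
    for (d_lo, v_lo), (d_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 0.5 > v_hi:
            # Interpolate on log10(dose) between the two bracketing points.
            frac = (v_lo - 0.5) / (v_lo - v_hi)
            log_ic50 = math.log10(d_lo) + frac * (
                math.log10(d_hi) - math.log10(d_lo)
            )
            return 10 ** log_ic50
    return None  # no crossing inside the tested range: refuse to extrapolate


sensitive = interpolated_ic50([1, 10, 100, 1000], [0.95, 0.75, 0.25, 0.05])
resistant = interpolated_ic50([1, 10, 100, 1000], [1.00, 0.97, 1.02, 0.93])
```

The sensitive line gets a value between the two bracketing doses, while the resistant line gets `None` instead of an extrapolated number. Note that this is only possible with the individual dose points, not with a pre-computed IC50.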

5. What did you actually measure?

How do you tell which cells are dead? Unless they completely dissolve (which they don’t in typical assay timescales) there’s no large “DEAD” sign written on a dead cell. You need some way to tell them apart.

And that’s the next issue. Some assays (like CellTiter-Glo) measure ATP levels to estimate live cell count. Others measure their metabolic rate by tracking NADH use (resazurin - Alamar Blue). You can also use DNA intercalators to highlight individual dead cells undergoing apoptosis, and count them under a microscope.

But these methods don’t agree perfectly. Not all cells die via apoptosis. Some cells may have metabolism but are already doomed. Some drugs just mess with the assay readout directly (see below).

Some compounds inhibit viability measurements, but don’t really induce cell death. Fig. 1b. from 10.1186/s12943-016-0517-3

Do you have the metadata to tell them apart?

The data’s in the details

I’m leaving out plenty of advanced topics like seeding density or the physical location of the well on the plate - you get the idea.

Until we have an “ultimate data type” that fully captures cell state, understanding and harmonizing the data is not optional - it’s paramount.

Thanks to Balazs Szabo and Miklos Laczik for their help with the NLME fitting and Richard Izrael for the ideas and reviewing!


Why can we predict biology?

Kris Szalay — Thu, 22 May 2025 14:32:18 GMT

Yes, biology is complex

The first thing most people hear about biology is that it’s unbelievably complex.

Just this week, someone asked me how it’s possible to predict cell behavior from just 20,000 numbers (an RNASeq vector). It feels hard to imagine given the known complexity of cells.

Indeed. Every single cell contains billions of proteins in an incomprehensible number of possible combinations. Each protein is a whole machinery in and of itself, consisting of thousands of atoms in elaborate combinations. We are made of trillions of these cells.

Why do we even dare to think it is possible to predict any of this?

We can predict because cells - and we - are not random configurations of atoms. If we were, the only way to predict what will happen to us is to simulate from the atomic level with Schrödinger’s equation.

Just for fun, let’s calculate how infeasible that is. There are 10^24 atoms in an average human. Schrödinger’s equation scales quadratically, and we’d need femtosecond precision. We’re suddenly looking at ~10^63 operations per second of atomic simulation. An H100 GPU does, say, a petaFLOP (10^15) a second, so we’d still need a quindecillion (10^48) GPUs for real-time simulation. That’s almost the number of atoms on Earth. Yeah, that’s not gonna happen.
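The back-of-the-envelope arithmetic can be written out in a few lines. The inputs below are simply the figures from the paragraph above, taken as given; the variable names are mine.

```python
atoms = 10**24                 # atoms in a human, as assumed in the text
ops_per_step = atoms**2        # quadratic scaling with the number of atoms
steps_per_second = 10**15      # femtosecond time steps
ops_per_simulated_second = ops_per_step * steps_per_second

gpu_flops = 10**15             # roughly one petaFLOP per second per GPU
gpus_for_real_time = ops_per_simulated_second // gpu_flops

print(f"{ops_per_simulated_second:.0e} operations per simulated second")
print(f"{gpus_for_real_time:.0e} GPUs for real-time simulation")
```

Changing any single assumption by a few orders of magnitude does not rescue the idea, which is the point of the exercise.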

Learning from our ancestors

And yet, molecular biology is a thing. AlphaFold is predictive. What are we really learning?

My hypothesis is this:
We are learning the optimization surface carved by evolution.

The only reason there is anything to learn is that our ancestors were subject to evolution for billions of years. We were being sculpted over the aeons to respond in certain ways to certain inputs. In ways that keep our germ line alive, that is.

That means the reason why AlphaFold can work1 is that the proteins were themselves subject to evolutionary pressure. If my hypothesis is true, it follows that AlphaFold would struggle with random, synthetic proteins that lack any evolutionary context2.

I don’t think we’re going to find any deterministic equation at these higher levels that resembles the ones we learned to use in physics. The rulebook to survive 3 billion years is much better described as a surface optimized by evolution. As a lucky coincidence3, this is exactly how machine learning algorithms work inside - they approximate mathematical surfaces.

Why does this matter?

It gives us some insight into how cells think.

We tend to analyze biology in terms of DNA, RNA, proteins and molecules as these are the tangible entities we can measure. From there, we have built pathway diagrams and regulatory networks which have taught us a lot about how cells work in general.

However, this has limited practical use. A drug that kills the majority of cells is just poison. To move forward, it’s exactly the outliers which you want to capture. The one outstanding indication where your drug is especially powerful, exactly because there’s a special constellation of factors causing the cell behavior to deviate from textbook form (“Can a biologist fix a radio?”).

In order to progress, we needed to let go of our old way of thinking about individual proteins responsible for cell behavior, and yield more control to the AI to organize the collective behavior of the components the way data tells us.

Maybe cells think in functions

I think the reason this worked is that cells are evolved to perform functions, making them “think” at a process level. Evolution was hacking whatever proteins were available, constantly tweaking and repurposing them to solve the most immediate problems. Hence all the different RAS proteins with overlapping functions, or moonlighting proteins with completely unrelated roles.

So I suspect that in a future where virtual cells are commonplace, peeking into what lies within the ultimate virtual cell’s neurons, we would find a mapping of functions - a web of ancestral functions evolved over time and their correspondence to the physical elements - genes, proteins, probably even parts of proteins (domains).

This would explain why we see our models learning much more from drug perturbations than from genetic ones - the job of a working drug is to disrupt a function, not a gene, after all.

(Thanks a lot for reviewing, Balint Pfliegel, Gerold Csendes and Bence Szalai!)

1. This is just why it works, not how it works.

2. There is an argument to be made here that the 22 amino acids we are made of are also pre-selected by evolution, hence there may also be a simplified ruleset that describes their behavior without resorting to full atomic simulation.

3. It is not exactly coincidence, but that’s not very relevant to the topic at hand.

Pretraining virtual cells is useless

Kris Szalay — Wed, 16 Apr 2025 16:53:36 GMT

UPDATE: this was extended into a poster at scVerse 2025, you can check it out here.

Clickbaiting aside, theoretically, pretraining is still a good idea:
Let’s try to make use of the troves of unlabeled, pre-treatment single-cell RNASeq data to teach the system biological fundamentals. Then the actual training on the targeted, expensive, and scarce data doesn’t have to start from ground zero, so maybe it goes farther.

Before we dive in and talk about the results, a quick primer on pretraining.

What are we pretraining?

What does the AI really learn during pretraining? There is no shortage of fancy names like “language of life” or “fundamental biology” (yes, I am guilty too). But deep inside, it’s just correlations of gene expression. Something like this: if gene A is highly expressed and gene B is not expressed, then most likely gene C is highly expressed as well.

Some articles mention learning gene regulatory networks (GRNs). It’s the same thing. If you take these complex correlations, simplify them to 1-to-1 rules and chain them together, what you get is a GRN.

The way most bioGPTs go about this pretraining is that we mask some of the expression values, and ask the system to predict them.

Here’s how it would look in an LLM:

It’s like asking the system to correct some broken English. There’s a word that’s clearly wrong there, what could the author have intended?
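Translated back to expression data, the masking step might look like the sketch below. The helper `mask_expression`, the `MASK` placeholder and the toy expression vector are all assumptions for illustration; real pipelines differ in how they encode masked values.

```python
import random

MASK = None  # placeholder for a hidden value (hypothetical convention)


def mask_expression(values, ratio, rng):
    """Hide a fraction `ratio` of expression values; return inputs and targets."""
    n_masked = round(len(values) * ratio)
    masked_idx = set(rng.sample(range(len(values)), n_masked))
    # The model sees `inputs` and is scored on how well it recovers `targets`.
    inputs = [MASK if i in masked_idx else v for i, v in enumerate(values)]
    targets = {i: values[i] for i in masked_idx}
    return inputs, targets


rng = random.Random(0)
expression = [float(i % 7) for i in range(20)]  # made-up toy profile, 20 genes
inputs, targets = mask_expression(expression, ratio=0.15, rng=rng)
```

The pretraining loss is then computed only on the masked positions, so the only way to do well is to exploit correlations between the visible genes and the hidden ones.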

Proof of the pudding

So one of our research teams went in, and did the same thing - masked out 15% of the gene expressions in all samples. They then ran a pre-training on scRNASeq data from CELLxGENE, 30M cells (filtering out 10% of the original 33M samples not coming from the 10x Genomics platform).

It worked perfectly. The predictions had a 76% correlation with the ground truth, the same as experimental replicates. Great!

Then came fine-tuning. Ultimately we want to understand how drugs affect the cells. We want to be able to predict how their transcriptomic landscape looks after treatment. So we added some perturbation data.

Perturbation prediction performance (Pearson correlation of differential gene expression) on MixSeq data with and without pretraining.

Not much difference. If anything, pretraining made the results slightly worse. So to understand drug response1, pretraining was not really helpful. But maybe we were doing it wrong. How much information is really there in untreated RNA data?

Less information than you think

So the team went back to work. The interesting moment came with a set of experiments where they started increasing the ratio of masked genes.

As the ratio of masked genes changes from left to right (red line, right axis), the model performance (measured as Pearson correlation of predicted vs actual gene expression level) stays the same (blue bars, left axis).

Wait, what? We are at the experimental replicate level of predictivity even with all genes masked out? What does the system even base its predictions on?

Turns out the remaining information with all genes’ expressions masked out is which genes were even expressed in the cell to start with. Single-cell RNASeq differs from bulk in that it captures relatively little RNA from each individual cell, so what you see is just the most abundant genes rather than the full transcriptome.

The issue isn’t that more information from RNASeq doesn’t help - that can be on us. The problem is that this piece of information is enough to get to experimental-level performance - i.e. that’s all the information there is about single cells!

It seems that in pre-treatment cells the genes detected by scRNASeq are heavily correlated. Probably there aren’t that many major cell types to start with, and single-cell technology is not yet good enough to reliably detect smaller changes in individual cells.

Which leaves us with three possible explanations:

either the gene regulatory information in the data is not good enough,
or we’re not extracting data the right way (there is additional information in how cells associate which we’re not using),
or healthy gene regulation is just not the right kind of information we need to understand drug response.

Thanks to our great Skunkworks Infinity team — Gerold Csendes, Gema Sanz, Bence Czako and Bence Szalai for working this out!


1. We could design a very specific task where the missing information is gene regulation, and pre-training does help - predicting unknown gene expression values post-perturbation given some of the post-treatment expression values from the same perturbation. The figure below shows the effect of pre-training. Whether this is useful in any practical application is anyone’s guess.

How to make data points worthwhile?

Kris Szalay — Sat, 11 Jan 2025 11:58:42 GMT

Over the years as we worked with more and more clients, we encountered training data in all shapes and sizes. It quickly became evident that just having a lot of raw data points doesn’t automatically result in a good model - it matters what you measure.

The team began developing intuitions about which types of training data work best. For example, drug perturbations seemed about 10 times more impactful than the same number of CRISPR data points. Post-treatment RNASeq data (measuring the whole transcriptome after treatment rather than just cell viability) gave better models, but definitely not 20,000 times better1. Similarly, single cell measurements are better than bulk measurements, but not 10,000 times better2.

I asked to have these intuitions quantified.
Why is CRISPR worse? Is the technology inherently less robust?

It matters what you perturb

Turns out the problem was that cells are pretty robust, so most single-gene KOs don’t really do anything. Drugs, on the other hand, are specifically designed to elicit a response and usually bind to multiple proteins, especially at higher concentrations.

Significant expression change events in the LINCS drug perturbation and the LINCS CRISPR dataset. X axis is the average distance (Euclidean distance in RNA space) of a sample replica group from the bias (global average expression change for any perturbation), Y axis shows the average replica-to-replica distance within a sample. Significant expression change events are shown with filled dots below the diagonal, where the replicas are closer to each other than to the bias.

Therefore, in the CRISPR training we were showing a lot of different-looking RNASeq data to the model, but most of those amounted to no actual change in cell behavior, essentially teaching noise to the model. No wonder these models performed worse.

Compressing training data

In this case, is it true that only perturbations that cause meaningful changes carry most of the information in the training? Could we “compress” the training data by only using the meaningful data points?

In order to do that, the first question we had to answer was what “meaningful change” means.

Different teams came up with different definitions.

“Horizons” team devised the idea that a sample is significant if the RNASeq pattern is significantly different in any pathway. This gives a compression factor of ~6x, so 1500-2000 points from every 10000 data points in LINCS.

“Skunkworks” took the RNA pattern as a whole, hypothesizing that those RNASeq patterns are useful where the replicates are closer to each other than to the bias (average RNASeq value over many perturbations). This is the data you can see on the first figure above.
Surprisingly, samples where this is true are very rare, yielding a compression factor of 30-50x (200-300 significant samples from 10000).
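The replicate-versus-bias criterion can be sketched in a few lines. This is an illustration of the idea as described above, not the team's actual code; the function names and the toy three-gene vectors are made up.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression-change vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_significant(replicates, bias):
    """True if replicates sit closer to each other than to the global bias.

    Needs at least two replicates; `bias` is the average expression change
    over many perturbations.
    """
    pairs = list(combinations(replicates, 2))
    within = sum(euclidean(a, b) for a, b in pairs) / len(pairs)
    to_bias = sum(euclidean(r, bias) for r in replicates) / len(replicates)
    return within < to_bias


bias = [0.0, 0.0, 0.0]
consistent_shift = [[5.0, 5.0, 0.0], [5.0, 6.0, 0.0], [6.0, 5.0, 0.0]]
just_noise = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

Here `consistent_shift` passes the test because its replicates agree with each other and are far from the bias, while `just_noise` fails because its replicates scatter around the bias in different directions.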

So how well do these compressed datasets work in an actual training?
Here’s how.

Post-treatment pathway activity prediction performance for 3 selected pathways on different datasets (partial correlation above the bias). The baseline “full dataset” score is trained on the full LINCS drug dataset, the “outlier” is the Horizons-defined subsetting, “mse” is the Skunkworks-defined subsetting (bias-z is a third different definition we’ve been experimenting with, don’t worry about that). The “filter” column shows models’ performance trained only on their respective subset, “inverted” shows model performance trained only on the negative of their respective subsets. (Also, “padded” is a dataset padded to 10000 data points in all cases with random inclusion, but that’s immaterial for now.)

Disclaimer: this was just a quick investigation, not a fully robust analysis.

What the results appear to suggest though, is that it is at least possible to find small subsets of data points carrying most of the useful information in a training
(this is shown by the performance of the “filter” column being equal or better than the baseline “full” dataset performance while at the same time datasets trained on the inverse subsets (“inverted”) missing these key data points performed significantly worse.)

Given how little perturbed RNASeq data is out there to train from, there is no need to speed up trainings with data compression yet. We still do some subsetting to remove the noise, but otherwise generally try to use as much info as we can.

A better way to generate wet-lab data

The real value of this compression becomes apparent when generating wet-lab data. Wouldn’t it be wonderful if there was a way to only generate the important data points? To get 6-30x the value for your experimental dollars?

The problem is that we only know whether a perturbation did anything after performing the experiment itself.

But what if there was a way to guess?
Maybe if you had something like, I don’t know, a Simulated Cell?😊

Let’s run some numbers, how much could such a predictive model help?

Based on the previous tests, the prevalence of meaningful events in a randomly generated data set is somewhere around 10%.

Now suppose you have a predictive model that has a 70% chance of successfully predicting these meaningful events, but also a 30% chance of false negatives (70% sensitivity). This results in the following contingency table.

So, generating 100 “smart” data points yields 70 valuable training samples compared to the 10 we would have originally gotten using random generation: a 7x compression rate.

The catch is that there are important points you will always discard due to the false negatives of the simulator, 3.3% of the significant data points in this case.
It’s a tradeoff I’m willing to take.
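
The contingency table itself did not survive in this text, so here is a small sketch that rebuilds one consistent with the numbers quoted here. It assumes a hypothetical pool of 1000 candidate experiments, 10% prevalence, 70% sensitivity, and that 70% of the points the model flags are truly meaningful (70% precision). The pool size and the precision reading are assumptions, not figures taken from the original table. Under these assumptions, the 3.3% corresponds to the share of discarded points that were actually meaningful (30 out of 900).

```python
def contingency(pool=1000, prevalence=0.10, sensitivity=0.70, precision=0.70):
    """Rebuild a 2x2 table from prevalence, sensitivity and precision (all assumed)."""
    positives = pool * prevalence      # truly meaningful points in the pool
    tp = positives * sensitivity       # meaningful and flagged by the model
    fn = positives - tp                # meaningful but discarded
    flagged = tp / precision           # everything the model flags
    fp = flagged - tp                  # flagged but not meaningful
    tn = pool - positives - fp         # correctly discarded
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


table = contingency()
flagged = table["tp"] + table["fp"]
discarded = table["fn"] + table["tn"]

hit_rate = table["tp"] / flagged          # share of generated points that are useful
compression = hit_rate / 0.10             # relative to random generation at 10% prevalence
missed_share = table["fn"] / discarded    # meaningful points hiding among the discarded

print(table)
print(f"hit rate {hit_rate:.0%}, compression {compression:.1f}x, "
      f"meaningful among discarded {missed_share:.1%}")
```

Changing `precision` or `prevalence` shows how quickly the compression rate moves with either.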

In real life, our actual performance varies depending on how hard the problem is (see the previous benchmarking post), but it’s around 60-70% for the hardest specific subsets with a lot of outliers (resulting in a 6-7x compression rate, just as shown in the table above) and around 95% on the full dataset, yielding a 20x effective data compression rate.

AI in biology can be done today if you use data wisely

We believe that for many practical use-cases in drug discovery, AI for complex biology is already possible, today.

However, to make your platform viable in current industry practice, you need to use your data wisely. The reality is that you can’t go around asking for hundreds of thousands of training data points to fine-tune every new project you get, because very few pharma projects generate that kind of data. At the same time, few drug discovery teams out there would be able or willing to accept the data generation “fit-out phase” for their indications of interest to take more than a year and cost many millions of dollars. It’s just not sustainable.

But if you can provide useful results, quickly, from datasets of a size that pharma companies can actually generate in realistic time (and cost) and keep doing that, it will add up over time.

In the end, probably no single AI platform will be able to match the simultaneous data generation capability of all pharma companies rushing to get their next drug to market. So we think our best bet now is to tap into that stream.

That necessitates a platform that can provide partners value now. Which is why we’ve always had dedicated teams working on making our production models actionable and easy-to-use rather than only focusing on the next model generation.

Data kindly lifted from the work performed by Laszlo Mero, Milan Sztilkovics, Imre Gaspar and Valer Kaszas - thank you for your work! Additional thanks for the review & edits to Balint Paholcsek, Szabolcs Nagy, Imre Gaspar and Bence Szalai.

RNASeq generates one data point per gene in each sample.

Single-cell methods measure somewhere between 1000 and 10000 cells per sample.

In defense of RNASeq

Kris Szalay — Thu, 24 Oct 2024 14:01:28 GMT

We only use DNA and RNA for defining a cell (whether it’s a patient cell, PDX or cell line). I was asked many times about using proteomics data. In particular: why are we not using proteomics for modeling? Why do we think RNASeq is deep enough to get to a good model?

Information flow in a cell

Because - as the argument goes - RNASeq only tells you something about the rate of change in the proteome. It does not tell you a lot about protein concentration, let alone protein activity which eventually defines the phenotype - what the cell actually does.

Given that this is how information flows in a cell, the reasoning makes sense:

So - the logic follows - you could work your way backwards through the chain of causality and guess what the genome or the epigenome looks like based on how they impacted the transcriptome. In the same way, you can guess the transcriptome based on the proteome.
But the other way is impossible. Since there is separate regulation at the proteome level, a single transcriptome can produce many different proteomes.

This is the gist of the argument. And I would agree, if these methods were at the same technology level. Sadly, (phospho)proteomics doesn’t really scale (for the time being). Proteins don’t readily amplify like nucleic acids do, and sequencers are much higher throughput than mass spectrometers.

So the question is: would you choose one proteomic sample over a thousand RNA samples (like a Perturb-Seq study)?

It depends. Is it even theoretically possible to figure out proteomics from RNA?
If not, then we need to go deeper, no matter the cost.

Defining cells with proteomics

First, it’s generally accepted that RNA levels correlate with protein levels in the steady state [1]. There is a good reason for that: proteins constantly break down and thus need to be constantly replenished via transcription, independent from anything else that may happen inside the cell. This results in a baseline correlation of around 0.6 (Pearson) between RNA and protein levels in the steady state.
So RNA could be good enough to initialize the model state.

Is it good enough?

The CCLE consortium has released a rather large-scale proteomic dataset [2] which we tested in early 2023.

We trained three random forest models: one using only RNA data, one using only proteomics and one using both RNA and proteomics¹.

Here they are:

Histograms of random forest (RF) gene essentiality prediction performance. A separate RF model was trained for each gene on 72% of 292 cell lines. The models then needed to predict the essentiality of the gene in question for the other 28% of the cell lines. A) results using both proteomics and transcriptomics features of the cell lines, B) transcriptomics only, C) proteomics only.

The improvement is slight at best. Now we’re not huge experts in proteomics, so we may not have used the data the best way, but based on this data it certainly does not merit the 1000 to 1 cost of generating proteomics - for us, for the time being, of course.

Reverse engineering protein logic

All this was steady-state, initialization data.

But could we use transcriptomics to figure out the unknown logic of protein regulation?
Establishing causality requires running perturbation experiments. We are specifically interested in reading out how the cell changes in response to perturbation.
In this case the steady-state assumption no longer holds: RNA and proteomics will be nowhere close to each other.

Still, we happily use tens of thousands of perturbation RNA signatures (differential expression vectors).

Why does this work?

Because the information flow is actually a loop.

It works because perturbations (e.g. drugs) cause a cascade of protein activity changes, which - cascading down through the transcription factors - cause a host of transcriptomic changes.

It may be hard for us to deconvolute what pattern belongs to which change, but this is exactly what machine learning was invented for. Neural networks are great pattern matchers, even our own squishy ones. Can you tell me what animals these footprints belong to?

Easily, right? None of these footprints contain anything that actually makes an elephant or a human, but given the constraints that it must be an animal there is not much ambiguity.

These RNA footprints (a term coined by Schubert et al. [5]) gave us the opportunity to reverse engineer protein activity from RNA.

A case for multimodality

OK, how about having proteomics besides RNASeq? Surely it would increase the model performance, right?

Yes, it could. Take the following example:

Why is protein A inactive? Is it because it is inhibited (left scenario), or just because the gene is damaged (right scenario)?

These cases are theoretically impossible to distinguish unless given more data. Same goes mutatis mutandis for any potential pair of layers.

However, you could still figure out the right gene regulation pattern using only RNA if you add one more perturbed sample.

If knocking out gene B makes the protein turn on in the second study, we were observing the left scenario, otherwise we were observing the right one.
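
To make the logic concrete, here is a toy Boolean sketch of the two scenarios. It is not our simulation engine; the function name and the rule "A is active if its gene is intact and no present inhibitor acts on it" are illustrative assumptions, and gene B is assumed to be the inhibitor in the left scenario.

```python
def protein_a_active(gene_a_intact, gene_b_present, b_inhibits_a):
    """Toy rule: A is active if its gene works and no present inhibitor blocks it."""
    inhibited = b_inhibits_a and gene_b_present
    return gene_a_intact and not inhibited


# Two hypothetical wirings that look identical in the unperturbed sample.
scenarios = {
    "left": dict(gene_a_intact=True, b_inhibits_a=True),     # A is inhibited by B
    "right": dict(gene_a_intact=False, b_inhibits_a=False),  # gene A is damaged
}

results = {}
for name, params in scenarios.items():
    baseline = protein_a_active(gene_b_present=True, **params)
    after_b_knockout = protein_a_active(gene_b_present=False, **params)
    results[name] = (baseline, after_b_knockout)
    print(f"{name}: A active at baseline={baseline}, after B knockout={after_b_knockout}")
```

Both scenarios show an inactive A at baseline; only the B knockout sample separates them.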

Of course, having another view at the system will also help noise correction, but, similarly, so could another sample. So this is my current opinion:

Multimodality helps, but only if it doesn’t cost much more than another round of samples.

This is why we use DNA as a second mode. DNA sequencing is also cheap, and while many DNA changes could be reverse engineered just by looking at the transcribed RNA, not all of them can. As the example above shows, if a protein is not expressed at all, you would need additional experiments to decipher whether the underlying mechanism is specific downregulation of that protein or DNA damage.

Thank you Tamas Beke for the proteomics trainings!

References

[1] Liu et al. “On the Dependency of Cellular Protein Levels on mRNA Abundance”

[2] Nusinow et al. “Quantitative Proteomics of the Cancer Cell Line Encyclopedia”

[3] Broad “DepMap 2022 Q2 Public”

[4] Ghandi et al. “Next-generation characterization of the Cancer Cell Line Encyclopedia”

[5] Schubert et al. “Perturbation-response genes reveal signaling footprints in cancer gene expression”

RNA and gene sensitivity data comes from the DepMap 2022 Q2 release [3].

So how do you benchmark biology?

Kris Szalay — Mon, 07 Oct 2024 06:59:15 GMT

This is a human-readable summary of the EFFECT paper [1]. I found that the process of scientific writing (especially with multiple authors and review rounds) makes texts significantly harder to read than intended - hence this post.

Good benchmarking is a cornerstone of progress in all fields of AI.

The whole premise of AI is being able to figure out hidden connections within your data. However - exactly because these connections are hidden - what the AI actually latches onto may not be what you like.

Benchmarking tanks

There is an urban legend floating in the AI community about a team who tried to train an AI to recognize tanks. The system worked perfectly on the test benchmarks then immediately failed miserably in the field.

Here’s a set of training images. Guess what happened.

An illustration of what some training samples might have looked like. Generated with DALL-E. Thanks Bálint Farkas for the images!

The clouds cleared after the tank photos were taken, so all negative images were taken in sunshine.

So all the AI learned was that no sun = tank, sun = no tank.

But back to biology.

Benchmarking biology

All of this is made exponentially harder by the fact that in biology, we don’t even really have a ground truth to work with. There are tons of unknown, uncontrolled variables in any experiment, and data gets more noisy the deeper you go.

What’s the actual information content of an RNASeq experiment? Which RNA values are trustworthy? How trustworthy are they? Hard to say.

But we need to start somewhere. Let’s try establishing a beachhead and building from there. Let’s start from the most controlled setup possible: secondary, cultured cell lines. Let’s take the metric that’s closest to some form of ground truth: cell viability. Do the cells survive or die in response to your intervention (a drug or a gene knockout, for example)?

This is probably the least noisy information we have, and it even has a particular application in oncology, where you actually want to kill a certain population of cells, but only those cells if possible.

There are two gold standard datasets for this kind of setup: Depmap[2] (for gene knockouts) and GDSC2[3] (for drug effects).

OK, fine. So let’s take Depmap or GDSC2, remove say 20% of the data randomly, and if I can predict the missing data well enough, my method works, right?

Your method will work wonderfully. You will get ~90% accuracy. A lot of people tend to think that viability prediction is a solved challenge in computational biology, since we can get methods to be as precise as experimental replicates.

Unfortunately, we were collectively doing the tank thing.

Three things to change

We have found three problems:

1. Most of the data is not surprising. You want to predict the outliers.

Check this sample plot. This is how the data really looks:

Histogram of the IC50 sensitivity values of the GDSC2 cell line panel to Bortezomib, a generally strong drug, and Cyclophosphamide, a relatively weak drug. Excerpted from Figure 2 of the EFFECT paper.

Some drugs are just generally stronger than others. Some cell lines are just generally more frail than others.

What’s really useful is finding the outliers. Is there a mighty resistant cell line which is surprisingly sensitive to your drug? Is there a cell line in which a gene is suddenly essential?

To focus your metrics on the important question, don’t use (uncorrected) global correlations¹. Calculate a separate correlation value per gene (or per cell) and median-aggregate them. Pearson is better than Spearman as you want to emphasize the outliers. This is similar to what z-score does.
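
A minimal sketch of why this matters, on made-up numbers (the drug names and values are hypothetical, not GDSC2 data). The toy predictor knows each drug's average strength but gets the within-drug ordering wrong, so the global correlation looks excellent while the per-drug, median-aggregated one does not.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical sensitivity values for four cell lines per drug.
truth = {"strong_drug": [-3.0, -3.2, -2.9, -3.1], "weak_drug": [2.0, 2.1, 1.9, 2.2]}
# A predictor that captures the drug average but not which cell lines are the outliers.
pred = {"strong_drug": [-3.1, -3.0, -3.1, -3.0], "weak_drug": [2.1, 2.0, 2.1, 2.0]}

# Global correlation: pool every drug/cell-line pair together.
all_truth = [v for drug in truth for v in truth[drug]]
all_pred = [v for drug in truth for v in pred[drug]]
global_r = pearson(all_pred, all_truth)

# Per-drug correlation, then median-aggregate.
per_drug_r = {drug: pearson(pred[drug], truth[drug]) for drug in truth}
median_r = median(per_drug_r.values())

print(f"global r = {global_r:.3f}, median per-drug r = {median_r:.3f}")
```

The same pattern applies per cell line or per gene; only the grouping key changes.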

2. We were not splitting to answer the right question.

In how many real situations are both the cell line and the drug of interest known? I’m either giving you a new drug which has not been tested before, or I want to understand how my drug would fare in patients - who are each actually a new environment.

So either leave some cell lines entirely out of training (we call this a CEX split - cell-line exclusive) or leave entire drugs out (DEX - drug-exclusive).

Now when you leave drugs out, be careful that the drugs you left in are different enough. Some drugs are actually pretty similar to each other, so you may inadvertently leak data. Targeting the same pathway is fine. Depending on your use-case, having some shared protein targets could be OK as long as there’s not too much overlap. This needs some work, but we have some curated splits you can use below².
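
A minimal sketch of a group-exclusive split, assuming each record is a dict with hypothetical `cell_line` and `drug` keys. It only holds out whole groups; it does not check how similar the held-out drugs are to the remaining ones, which is the part that needs curation.

```python
import random


def exclusive_split(records, key, test_fraction=0.2, seed=0):
    """Hold out whole groups: every record sharing a held-out value of `key` goes to test."""
    groups = sorted({r[key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[key] not in test_groups]
    test = [r for r in records if r[key] in test_groups]
    return train, test


# Hypothetical 5 x 5 screen: every cell line tested with every drug.
records = [
    {"cell_line": c, "drug": d}
    for c in ["C1", "C2", "C3", "C4", "C5"]
    for d in ["D1", "D2", "D3", "D4", "D5"]
]

cex_train, cex_test = exclusive_split(records, key="cell_line")  # cell-line exclusive
dex_train, dex_test = exclusive_split(records, key="drug")       # drug exclusive

print(len(cex_train), len(cex_test), {r["cell_line"] for r in cex_test})
print(len(dex_train), len(dex_test), {r["drug"] for r in dex_test})
```

A random 20% holdout of individual records would instead leave every cell line and every drug visible in training, which is exactly the leak these splits avoid.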

There are many other ways to split the data if you have more complex data. Two others we tend to use frequently:

Can you predict a drug’s effect in vivo only from in vitro data? So your drug of interest can only have in vitro data in training, but you can have some other drugs with in vivo data so you understand the environment. We call this PDX-exclusive, as we usually do this when predicting to patient-derived xenograft models implanted in mice.
If given combination data to work with: could you have predicted this combination effect using mono data only for the two drugs in question? Single drugs are easy to test on many cell lines (with a PRISM screen for example), but the cost explodes when you want to test all possible combinations, so in silico is also useful here.

3. There can be more sources of bias present simultaneously.

Many years ago, with some of our first models, we were very happy with our performance. We used z-scores, we did drug-exclusive splits and got surprisingly high scores. Sadly, it was a mirage.

The problem was the general sensitivity of cell lines to drugs [4], as both the split and the metric were only adjusting for drugs.

Under the hood, train/test splitting adjusts for one kind of bias, and a good metric can also adjust for one. However, in this setup the two biases we adjusted for happened to coincide, and cell line bias was left in. You can also easily have more than two sources of bias, which you cannot remove just by combining metrics and splitting.

We needed a more powerful tool.

It’s impossible to programmatically correct for unknown sources of bias, but it is possible to correct for any number of known sources.

We called this tool (creative, I know) the Bias Detector. Just these three steps:

Take a bias prediction from each bias source you want to filter (the mean of all the samples in the training set from the same source, e.g. how strong your drug is on average on all of the cell lines in the train set)
Combine all these bias predictions into one bias predictor with a linear model
Your corrected metric is the partial correlation of your prediction and the ground truth with the bias prediction as the controlling variable.
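
The three steps can be sketched in plain Python as follows. This is an illustration of the idea on a made-up 4 x 3 screen, not the implementation from our repository: the helper names, the toy data, and the choice to fall back to the global train mean for unseen drugs or cell lines are all assumptions.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def partial_corr(pred, truth, bias):
    """Step 3: correlation of pred and truth, controlling for one variable (bias)."""
    r_pt, r_pb, r_tb = pearson(pred, truth), pearson(pred, bias), pearson(truth, bias)
    return (r_pt - r_pb * r_tb) / sqrt((1 - r_pb ** 2) * (1 - r_tb ** 2))


def group_means(rows, key):
    """Step 1: mean training value per bias source (per drug, per cell line, ...)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row["value"])
    return {name: mean(values) for name, values in groups.items()}


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_bias_model(train, sources):
    """Step 2: least-squares fit of value ~ intercept + one train mean per bias source."""
    means = {s: group_means(train, s) for s in sources}
    fallback = mean(row["value"] for row in train)  # assumed fallback for unseen groups

    def features(row):
        return [1.0] + [means[s].get(row[s], fallback) for s in sources]

    xs = [features(row) for row in train]
    ys = [row["value"] for row in train]
    k = len(sources) + 1
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    coef = solve(xtx, xty)
    return lambda row: sum(c * f for c, f in zip(coef, features(row)))


# Made-up screen: value = cell effect + drug effect + a few "surprising" interactions.
cell_eff = {"C1": -1.0, "C2": 0.0, "C3": 1.0, "C4": 2.0}
drug_eff = {"D1": -2.0, "D2": 0.0, "D3": 3.0}
surprise = {("C1", "D3"): -2.0, ("C4", "D1"): 1.5, ("C2", "D2"): 0.5}
rows = [{"cell": c, "drug": d, "value": cell_eff[c] + drug_eff[d] + surprise.get((c, d), 0.0)}
        for c in cell_eff for d in drug_eff]
held_out = {("C1", "D3"), ("C2", "D2"), ("C3", "D1"), ("C4", "D3")}
train = [r for r in rows if (r["cell"], r["drug"]) not in held_out]
test = [r for r in rows if (r["cell"], r["drug"]) in held_out]

bias_model = fit_bias_model(train, sources=["drug", "cell"])
bias = [bias_model(r) for r in test]
truth = [r["value"] for r in test]

drug_means = group_means(train, "drug")
naive_pred = [drug_means[r["drug"]] for r in test]  # a "model" that only knows drug strength

raw_score = pearson(naive_pred, truth)
corrected_score = partial_corr(naive_pred, truth, bias)
print(f"raw r = {raw_score:.2f}, bias-corrected partial r = {corrected_score:.2f}")
```

The naive predictor keeps a decent raw correlation just by knowing which drugs are strong; the partial correlation removes most of that credit. Note that a prediction that is exactly the bias predictor makes the denominator zero, so a real implementation needs to guard against that case.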

This works with most continuous metrics. It is possible to build a version of the Bias Detector for a classifier based on the same considerations, but it will be slightly less intuitive³.

Solving for viability

The mean partial correlation among drug replicas in GDSC2 is 0.51, so that’s how good we can theoretically be.

How did standard ML algorithms fare after correction? Not well:

Partial correlations of drug predictive performance of three different ML methods (LR=linear regression, RF=random forest, MLP=multi-layer perceptron) on four different training regimens in CEX (left panel) and DEX (right panel) split. Excerpted from Figure 4 of the EFFECT paper.

Even viability is clearly still unsolved. Which means that it might make sense to ask all the fancy biology foundational models this simple question first: Can it tell us whether the cell would live or die better than a Random Forest?

Only after this would it probably make sense to focus on getting the actual RNA levels right.

RNA prediction does give additional biological insights, but it is also a whole new can of worms. That’s because you don’t just have to fit a single number anymore, but a whole twenty thousand of them. How to balance each? Maybe I could cover some of this in a next post.

We have made a reference benchmark set and our Bias Detector implementation freely available. Do try it if you have something, maybe you finally figured it all out.

Bias Detector code: https://github.com/turbine-ai/DrugKOPred-benchmark-bias

Reference benchmark: https://benchmark.turbine.ai

Thank you Bence Szalai, Imre Gáspár, Valér Kaszás, László Mérő, Milán Sztilkovics for working on this together.
Thanks to Andreas Bender, Aviad Tsherniak, and Daniel Veres for proof-reading the paper and suggesting how to make it more understandable.

References:

[1]: Szalai et al. “The EFFECT benchmark suite: measuring cancer sensitivity prediction performance - without the bias”

[2]: Tsherniak et al. “Defining a Cancer Dependency Map”

[3]: Iorio et al. “A Landscape of Pharmacogenomic Interactions in Cancer”

[4]: Geeleher et al. “Cancer biomarker discovery is improved by accounting for variability in general levels of drug sensitivity in pre-clinical models”

Global correlations corrected for bias are perfectly usable.

The plots above show the performance of standard ML methods measured by bias-corrected global Pearson correlation. As more cell line and drug features are introduced, the performance logically improves (only by adding cell features in the cell-exclusive (CEX) and only by adding drug features in the drug-exclusive (DEX) setup)

The drug splits you can find at https://benchmark.turbine.ai also have some drug targets overlapping between the train and test set, but we took care so that drugs in the test set have at least somewhat different target profiles from any drug in the train set.

What you can do is take a cellwise or drugwise metric, and compare with an appropriate statistical test whether your performance is significantly better than the bias. You can then tally the # of drugs or cells where you are significantly over bias for a global performance score. What this does not tell you as a global score is how much better you are than the bias on the average drug / cell line.

Do cell foundation models work?

Kris Szalay — Mon, 23 Sep 2024 20:41:00 GMT

Biology is hard, and it’s very easy to accidentally produce misleading results. This post only aims to help us improve together and get closer to the dream of understanding biology instead of calling individual methods or authors out.

UPDATE: Others have also reached the same conclusion, Constantin et al. just published this.
It’s becoming more important than ever to keep your benchmarks healthy!

UPDATE #2: We have turned this post into a paper. You can find it at https://www.biorxiv.org/content/10.1101/2024.09.30.615843v1 with some additional analyses which have been omitted from here for clarity.

Foundation models in biology are a great concept.

The core idea is to show the AI what cells can look like – what are the “allowed” transcriptomic states (gene expression-level combinations) that describe actual cells versus “disallowed” states which don’t describe an actual working cell.

Why is any state disallowed in the first place?

A live cell is a very carefully regulated environment. Changing its protein levels is just like going in and reorganizing the gears in a steam engine – done randomly the engine will most likely blow up. The underlying assumption is that for an AI to learn the shared surface of the cells, it needs to build an internal representation of gene regulation – a gene regulatory network (GRN) hidden inside the Transformer’s clockwork: the attention weights.

This is an idea that might just work; but does it?

As with most large AI models, theoretical reasoning only gets you so far; the proof of the pudding is in the eating. Do we have a task which would be infeasible to do well without knowing the “true” human GRN? It turns out there is: perturbation prediction. If the model can figure out how cells respond to a gene (or a combination of genes) getting knocked out, it must have built a useful internal understanding of biology.

Fortunately, the authors of scGPT[1] did pose this question to the model, here are the results:

ScLLM model performance comparison on the Adamson et al.[2] dataset. Score on the y-axis is the gene-wise Pearson correlation of differential expressions – true vs predicted

Are we done? This seems to imply that we can predict differential expression of genes with 0.6+ correlation for new perturbations. This seems respectable, and while not perfect, probably already useful.

Let’s add a simple bias predictor, predicting only the mean gene expression for each gene in the trainset.
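
A minimal sketch of such a baseline on made-up differential expression (DE) vectors. The perturbation names and values are hypothetical, and scoring each test perturbation by the Pearson correlation across genes is my reading of the metric, not a reproduction of the scGPT evaluation code.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical DE vectors (perturbed minus control) over five genes.
train_de = {
    "pert_A": [1.0, -0.5, 0.2, 2.0, -1.0],
    "pert_B": [0.8, -0.4, 0.1, 1.8, -1.2],
    "pert_C": [1.2, -0.6, 0.3, 2.2, -0.8],
}
test_de = {
    "pert_D": [0.9, -0.5, 0.2, 2.1, -1.1],  # responds like the training perturbations
    "pert_E": [-1.0, 1.5, 0.0, -0.3, 0.4],  # a genuinely different response
}

# The "model": for every gene, predict its mean DE over the training perturbations.
n_genes = len(next(iter(train_de.values())))
mean_de = [mean(vec[g] for vec in train_de.values()) for g in range(n_genes)]

# Score: correlation of the same mean vector with each unseen perturbation.
scores = {name: pearson(mean_de, de) for name, de in test_de.items()}
for name, score in scores.items():
    print(f"{name}: r = {score:.2f}")
```

The baseline never looks at which gene was perturbed, yet it scores well whenever the test responses resemble the training ones.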

ScLLM model performance comparison on the Adamson et al.[2] dataset, with train mean predictor performance. Score on the y-axis is the gene-wise Pearson correlation of differential expressions – true vs predicted

Has our previously respectable performance been just outmatched by a simple mean predictor? Let’s dive into the data to understand what’s going on. Calculating how well differential expressions correlate between all pairs of samples yields the following plot. (Note that the samples have been pseudo-bulked which is an interesting lesson on its own – despite having tens of thousands of cells, you may still only have a few dozen samples of information, as shown below.)

Pairwise DE correlations of sample pairs in the Adamson et al.[2] dataset

Reading the plot gives us a clue on why the mean predictor performs so well: most of the responses are very similar! #0, #1, #57 and #58 form a clear outlier group. Most of the other examples are extremely correlated. While you can observe a few separate response mechanisms in the bulk of the samples, the correlation is 0.5+ even across response types (and easily 0.8+ within a response group for different samples).
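
The check behind such a plot is cheap to run on any dataset before trusting a benchmark. Here is a sketch on hypothetical pseudo-bulked DE vectors (names and values are made up): compute all pairwise correlations, then look at how similar each sample is to the rest on average.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical pseudo-bulked DE vectors, one per perturbation.
samples = {
    "pert_A": [1.0, -0.5, 0.2, 2.0, -1.0],
    "pert_B": [0.8, -0.4, 0.1, 1.8, -1.2],
    "pert_C": [1.2, -0.6, 0.3, 2.2, -0.8],
    "pert_E": [-1.0, 1.5, 0.0, -0.3, 0.4],
}

names = list(samples)
corr = {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}

# Mean correlation of each sample with all the others (diagonal excluded).
mean_to_others = {
    a: mean(corr[a][b] for b in names if b != a) for a in names
}
most_unusual = min(mean_to_others, key=mean_to_others.get)

for a in names:
    print(a, " ".join(f"{corr[a][b]:+.2f}" for b in names))
print("least similar to the rest:", most_unusual)
```

If almost every off-diagonal value is high, a mean predictor will look good, and the few low-correlation samples are the ones worth benchmarking on.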

Probably the reason behind these responses being so highly correlated is that all perturbations have been targeting different proteins in the same growth pathway, hence the highly similar results.

Having data that’s so biased is not ideal in any dataset, but especially in biology you want to have outliers; they are the bread and butter of your benchmark. When your AI gets used in the real world, these become the shiny sparks that lead scientists to the golden ore of finding selective patient populations. (You might even want to focus your scoring function on finding outliers, but that is a story for another time.)

Our findings above tell us that this is not a useful benchmark set, but not that the method itself is bad.

The Replogle et al.[3] dataset gives us a much better picture indeed:

Pairwise DE correlations of sample pairs in the Replogle et al.[3] dataset

While we still have correlated rows, there are many more unique responses in this dataset. Indeed, the performance of the mean predictor drops significantly: to 0.35 from the previous 0.7 in the Adamson set. Unfortunately, all models’ performance drops below the mean predictor too; scGPT, in particular, drops to 0.24.

ScLLM model performance comparison on the Replogle et al.[3] dataset, with train mean predictor performance. Score on the y-axis is the gene-wise Pearson correlation of differential expressions – true vs predicted

And even here, we’re still only working with a single cell line, K562! A good benchmark should be able to measure how well your model works on untrained cell lines, as frankly, each new patient will behave like a cell line you’ve never seen before.

If there is one lesson here, it’s this: check your benchmarks! Having good benchmarks is harder than it looks. Let me plug our EFFECT paper[4] at the end which can give you some considerations to start.

Most of the actual work underlying this post was done by Gerold Csendes and Bence Szalai – thanks for your work!

References

[1]: Cui et al. “scGPT: toward building a foundation model for single-cell multi-omics using generative AI”

[2]: Adamson et al. “A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response”

[3]: Replogle et al. “Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq”

[4]: Szalai et al. “The EFFECT benchmark suite: measuring cancer sensitivity prediction performance – without the bias”