Why can we predict biology?

and what are we really predicting?

May 22, 2025

Yes, biology is complex

The first thing most people hear about biology is that it’s unbelievably complex.

Just this week, someone asked me how it’s possible to predict cell behavior from just 20,000 numbers (an RNA-seq vector). It feels hard to imagine given the known complexity of cells.

Indeed. Every single cell contains billions of proteins in an incomprehensible number of possible combinations. Each protein is a whole machine in and of itself, consisting of thousands of atoms in elaborate combinations. We are made of trillions of these cells.

Why do we even dare to think it is possible to predict any of this?

We can predict because cells - and we - are not random configurations of atoms. If we were, the only way to predict what will happen to us would be to simulate everything from the atomic level up with Schrödinger’s equation.

Just for fun, let’s calculate how infeasible that is. There are 10^24 atoms in an average human. Say solving Schrödinger’s equation scales quadratically with the number of atoms, and we need femtosecond precision. We’re suddenly looking at ~10^63 operations per second of simulated time. An H100 GPU does, say, a petaFLOP (10^15 operations) a second, so we’d still need a quindecillion (10^48) GPUs for real-time simulation. That’s almost the number of atoms on Earth. Yeah, that’s not gonna happen.
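For the curious, here is the same back-of-envelope arithmetic as a small Python sketch. It only uses the round figures from the paragraph above; the quadratic cost per time step is an assumption, and the function and constant names are made up for illustration.

```python
# Order-of-magnitude inputs, all taken from the paragraph above.
ATOMS_IN_HUMAN = 10**24      # the post's figure for atoms in a human
STEPS_PER_SECOND = 10**15    # femtosecond time steps in one second
GPU_FLOPS = 10**15           # ~1 petaFLOP per second for one GPU (the post's figure)


def ops_per_simulated_second(n_atoms: int, steps_per_second: int) -> int:
    """Operations per simulated second, ASSUMING cost per step grows as n_atoms**2."""
    return n_atoms**2 * steps_per_second


def gpus_for_real_time(ops_per_second: int, gpu_flops: int) -> int:
    """GPUs needed so that one simulated second takes one wall-clock second."""
    return ops_per_second // gpu_flops


def exponent(n: int) -> int:
    """Power of ten for an exact power-of-ten integer (digits minus one)."""
    return len(str(n)) - 1


ops = ops_per_simulated_second(ATOMS_IN_HUMAN, STEPS_PER_SECOND)
gpus = gpus_for_real_time(ops, GPU_FLOPS)

print(f"operations per simulated second: 10^{exponent(ops)}")
print(f"GPUs needed for real time:       10^{exponent(gpus)}")
```

Python integers are arbitrary precision, so the numbers stay exact instead of turning into floating-point approximations. Changing any of the three constants shows how little the conclusion depends on the exact inputs.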

Learning from our ancestors

And yet, molecular biology is a thing. AlphaFold is predictive. What are we really learning?

My hypothesis is this:
We are learning the optimization surface carved by evolution.

The only reason there is anything to learn is that our ancestors were subject to evolution for billions of years. We were sculpted over the aeons to respond in certain ways to certain inputs. In ways that keep our germ line alive, that is.

That means the reason why AlphaFold can work [1] is that the proteins were themselves subject to evolutionary pressure. If my hypothesis is true, it follows that AlphaFold would struggle with random, synthetic proteins that lack any evolutionary context [2].

I don’t think we’re going to find any deterministic equation at these higher levels that resembles the ones we learned to use in physics. The rulebook to survive 3 billion years is much better described as a surface optimized by evolution. As a lucky coincidence [3], this is exactly how machine learning algorithms work inside - they approximate mathematical surfaces.
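To make "approximating a surface" a bit more concrete, here is a toy Python sketch. The learner never sees the formula, only sampled points, and still predicts values at locations it has not seen. Everything in it is an assumption chosen for illustration: hidden_surface is an arbitrary stand-in, and the nearest-neighbour averaging is about the simplest possible learner, not what AlphaFold or any virtual cell model actually uses.

```python
import math
import random


def hidden_surface(x: float, y: float) -> float:
    """Arbitrary stand-in for an unknown surface (hypothetical, for illustration)."""
    return math.sin(x) * math.cos(y)


def predict(samples, x, y, k=4):
    """Inverse-distance-weighted average of the k nearest sampled points."""
    nearest = sorted(samples, key=lambda s: (s[0] - x) ** 2 + (s[1] - y) ** 2)[:k]
    weighted = []
    for sx, sy, sz in nearest:
        d2 = (sx - x) ** 2 + (sy - y) ** 2
        if d2 == 0.0:
            return sz  # the query coincides with a sampled point
        weighted.append((1.0 / d2, sz))
    total = sum(w for w, _ in weighted)
    return sum(w * z for w, z in weighted) / total


rng = random.Random(0)  # fixed seed so the run is repeatable

# "Training data": points sampled from the surface, with no access to its formula.
samples = []
for _ in range(5000):
    x, y = rng.uniform(0, 3), rng.uniform(0, 3)
    samples.append((x, y, hidden_surface(x, y)))

# Evaluate on fresh points the learner has not seen.
test_points = [(rng.uniform(0.2, 2.8), rng.uniform(0.2, 2.8)) for _ in range(200)]
mean_abs_error = sum(
    abs(predict(samples, x, y) - hidden_surface(x, y)) for x, y in test_points
) / len(test_points)

print(f"mean absolute error on unseen points: {mean_abs_error:.4f}")
```

The predictions are only good where samples are dense, which is the same intuition behind expecting a model to struggle far from anything evolution has explored.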

Why does this matter?

It gives us some insight into how cells think.

We tend to analyze biology in terms of DNA, RNA, proteins and molecules as these are the tangible entities we can measure. From there, we have built pathway diagrams and regulatory networks which have taught us a lot about how cells work in general.

However, this has limited practical use. A drug that kills the majority of cells is just poison. To move forward, it’s exactly the outliers that you want to capture: the one outstanding indication where your drug is especially powerful, precisely because a special constellation of factors causes the cell’s behavior to deviate from textbook form (“Can a biologist fix a radio?”).

In order to progress, we needed to let go of our old way of thinking about individual proteins responsible for cell behavior, and yield more control to the AI to organize the collective behavior of the components the way data tells us.

Maybe cells think in functions

I think the reason this worked is that cells evolved to perform functions, making them “think” at a process level. Evolution was hacking whatever proteins were available, constantly tweaking and repurposing them to solve the most immediate problems. Hence all the different RAS proteins with overlapping functions, or moonlighting proteins with completely unrelated roles.

So I suspect that in a future where virtual cells are commonplace, peeking into what lies within the ultimate virtual cell’s neurons, we would find a mapping of functions - a web of ancestral functions evolved over time and their correspondence to the physical elements - genes, proteins, probably even parts of proteins (domains).

This would explain why we see our models learning much more from drug perturbations than from genetic ones - the job of a working drug is to disrupt a function, not a gene, after all.

(Thanks a lot for reviewing, Balint Pfliegel, Gerold Csendes and Bence Szalai!)

[1] This is just why it works, not how it works.

[2] There is an argument to be made here that the 20 standard amino acids we are made of are also pre-selected by evolution, hence there may also be a simplified ruleset that describes their behavior without resorting to full atomic simulation.

[3] It is not exactly a coincidence, but that’s not very relevant to the topic at hand.

Turbine AI
