6 Comments
Jérôme Salignon

Great post, thanks for the clear insights! Identifying the true dimensionality of a dataset and adjusting the modeling strategy accordingly sounds like great advice for such AIxBio challenges. Also, the notion that data relevance is far more important than the number of data points is spot on.

I am a bit puzzled, though, that no significant clusters per cell line were found in this dataset, since cell line subclones and heterogeneity have been shown in several publications (https://pubmed.ncbi.nlm.nih.gov/30089904/, https://www.nature.com/articles/s41467-023-43991-9).

So maybe the real dimensionality should be a bit higher than 1 data point per condition? Or maybe this variability was not relevant for the metrics of the challenge?

Kris Szalay

Thanks, Jérôme! My thoughts on this:

On one hand, 1000 cells per condition is not a lot to reliably identify subclones.

MCF7 is famously unstable, but yes, all cell lines drift apart when cultured separately. In the VCC, IIRC, it's a single PerturbSeq, single culture.

The Zhu et al. article you linked also shows in Fig. 1c that it's rare for one cell line to split into multiple robust clusters - in which case it would indeed become two data points (or as many as the number of clusters present).

Jérôme Salignon

Hi Kris, thank you for your message.

Zhu et al. highlight that even cell lines whose cells form continuous clusters can show important transcriptional diversity between clones (Fig. 2, 7), with different clones responding differently in stressful/perturbed environments (Fig. 8). But again, maybe these differences are minor compared to the responses of other cell lines, and, as you mentioned, a low number of cells per cell line may prevent reliable sub-clonal analyses.

Thanks again for the nice blog post!

Kris Szalay

So, yes, your hypothesis is an equally valid prior; it's just that our previous experience with immune modeling led us to think otherwise.

But this is exactly why the VCC results are interesting - the fact that pseudobulk-based methods can tie single-cell methods is evidence that - at least for this dataset measured on these scores - the individual cells don't really add much info.

We generally found the same with other metrics and setups as well.

Whether this is because the technology is just too noisy or it's really how biology works is anyone's guess.

The AI Architect

Outstanding work reducing Perturb-Seq complexity down to its effective info content. The 300 datapoint calculation makes so much intuitive sense when you consider that homogeneous colonies don't really add new info. I'd be curious how this framework applies to datasets with more heterogeneous cell populations though, where maybe there's actually more than one pseudobulk per condition worth teasing apart?

Kris Szalay

Thanks! Yes, in heterogeneous populations you'll have more clusters. These usually correspond pretty well to cell types or subtypes, like different types and states of lymphocytes in PBMC samples. 10-20 pseudobulks per condition is normal, but it heavily depends on the sample and the cell count per condition (is it really a cluster if there are only five cells in there?).
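
To make the "pseudobulks per condition" idea concrete, here is a minimal sketch in plain Python. It is not the method used in the post: the function names, the toy data, the `KO_geneX` label, and the `min_cells=5` cutoff are all assumptions for illustration. It averages expression per (condition, cluster) group, drops groups with too few cells, and counts how many pseudobulks each condition ends up contributing.

```python
from collections import defaultdict


def pseudobulk(cells, min_cells=5):
    """Average expression per (condition, cluster), skipping tiny groups.

    cells: iterable of (condition, cluster, expression_vector) tuples.
    min_cells: hypothetical cutoff below which a group is not treated
               as a real cluster.
    """
    groups = defaultdict(list)
    for condition, cluster, expr in cells:
        groups[(condition, cluster)].append(expr)

    profiles = {}
    for key, vectors in groups.items():
        n = len(vectors)
        if n < min_cells:
            continue  # too few cells to trust this as a cluster
        # Column-wise mean over all cells in the group.
        profiles[key] = [sum(col) / n for col in zip(*vectors)]
    return profiles


def effective_points(profiles):
    """Count the retained pseudobulks for each condition."""
    counts = defaultdict(int)
    for condition, _cluster in profiles:
        counts[condition] += 1
    return dict(counts)


# Toy data (entirely made up): two genes, three clusters in the control.
cells = (
    [("ctrl", "T", [1.0, 0.0])] * 6
    + [("ctrl", "B", [0.0, 2.0])] * 6
    + [("ctrl", "rare", [5.0, 5.0])] * 2  # only two cells, dropped
    + [("KO_geneX", "T", [3.0, 1.0])] * 6
)

profiles = pseudobulk(cells, min_cells=5)
print(effective_points(profiles))
```

In this toy case the two-cell "rare" group falls below the cutoff, so the control contributes two pseudobulks rather than three. Real pipelines would cluster the cells first and work on normalized count matrices; the sketch only shows the bookkeeping.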
