Discussion about this post

Jérôme Salignon

Great post, thanks for the clear insights! Identifying the true dimensionality of a dataset and adjusting the modeling strategy accordingly sounds like great advice for such AIxBio challenges. Also, the notion that data relevance matters far more than the number of data points is spot on.

I am a bit puzzled, though, that no significant clusters per cell line were found in this dataset, since cell line subclones and heterogeneity have been shown in several publications (https://pubmed.ncbi.nlm.nih.gov/30089904/, https://www.nature.com/articles/s41467-023-43991-9).

So maybe the real dimensionality should be a bit higher than one data point per condition? Or maybe this variability simply wasn't relevant for the challenge metrics?
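
Not something from the post, just a quick sketch of how one could probe this: cluster the cells of a single cell line / condition and check whether they split into more than one robust group. The AnnData object `adata` and the obs columns `cell_line` and `condition` are hypothetical names, and the pipeline is a standard scanpy + scikit-learn workflow rather than whatever the challenge organizers actually ran.

```python
import scanpy as sc
from sklearn.metrics import silhouette_score

def subcluster_check(adata, cell_line, condition, resolution=0.5):
    """Cluster cells from one cell line / condition and report how many clusters emerge."""
    # Subset to the cells of interest; the obs column names are assumptions.
    sub = adata[(adata.obs["cell_line"] == cell_line)
                & (adata.obs["condition"] == condition)].copy()
    # Standard scanpy preprocessing followed by Leiden clustering.
    sc.pp.normalize_total(sub, target_sum=1e4)
    sc.pp.log1p(sub)
    sc.pp.highly_variable_genes(sub, n_top_genes=2000, subset=True)
    sc.pp.pca(sub, n_comps=30)
    sc.pp.neighbors(sub)
    sc.tl.leiden(sub, resolution=resolution)

    labels = sub.obs["leiden"].to_numpy()
    n_clusters = len(set(labels))
    # Silhouette in PCA space: values near 0 suggest the split is not robust,
    # so a single pseudobulk per condition is probably a fair summary.
    sil = silhouette_score(sub.obsm["X_pca"], labels) if n_clusters > 1 else float("nan")
    return n_clusters, sil
```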

The AI Architect

Outstanding work reducing Perturb-Seq complexity down to its effective information content. The 300-data-point calculation makes intuitive sense once you consider that homogeneous colonies don't really add new information. I'd be curious how this framework applies to datasets with more heterogeneous cell populations, though, where maybe there's actually more than one pseudobulk per condition worth teasing apart?
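
If it helps, here's a rough sketch (my own assumption, not the post's method) of what "more than one pseudobulk per condition" could look like in practice: average expression per (condition, cluster) group instead of per condition, so a heterogeneous population contributes several effective data points. The `condition` and `leiden` column names are placeholders.

```python
import numpy as np
import pandas as pd

def pseudobulks_per_cluster(adata, condition_key="condition", cluster_key="leiden"):
    """Mean expression per (condition, cluster) group — one pseudobulk row per group."""
    # Densify the expression matrix if it is stored as a sparse array.
    X = adata.X.toarray() if hasattr(adata.X, "toarray") else np.asarray(adata.X)
    expr = pd.DataFrame(X, index=adata.obs_names, columns=adata.var_names)
    groups = adata.obs.groupby([condition_key, cluster_key], observed=True).groups
    # Each heterogeneous condition now contributes as many rows as it has clusters.
    return pd.DataFrame(
        {key: expr.loc[idx].mean(axis=0) for key, idx in groups.items()}
    ).T
```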
