The increasing availability of multi-omics and imaging data enables more comprehensive modeling of disease mechanisms and drug responses. However, it also presents challenges arising from the scale of the datasets and the heterogeneity of their structure. This thesis investigates, in two case studies, how machine learning can help integrate diverse data modalities to advance drug discovery and development.
The first case study introduces a Bayesian framework for predicting missing in vitro assay measurements that characterize the impact of drugs on liver function, based on chemical structure, gene expression, and fluorescence microscopy data. Crucially, Bayesian models can propagate uncertainty to downstream hepatotoxicity predictions. Multiple Bayesian approaches are systematically benchmarked, showing both the strong predictive performance of chemical descriptors and the superiority of an early-fusion data integration strategy. The impact of dataset size on performance is also assessed, revealing only modest improvements with substantially larger datasets. Overall, this framework supports decision-making in the preclinical phase and can be used for in silico screening in the early stages of drug discovery.
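To make the two central ideas of this case study concrete, the following minimal sketch illustrates early fusion (concatenating the features of all modalities before fitting a single model) together with a conjugate Bayesian linear regression that returns a predictive standard deviation alongside each prediction. It is an illustration under stated assumptions, not the models benchmarked in the thesis: the function names, the prior precision `alpha`, the noise precision `beta`, and all numbers are hypothetical toy values.

```python
import math

def early_fusion(*modalities):
    """Concatenate per-sample feature vectors from every modality into one vector."""
    n = len(modalities[0])
    assert all(len(m) == n for m in modalities), "modalities must share samples"
    return [sum((list(m[i]) for m in modalities), []) for i in range(n)]

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_posterior(X, y, alpha=1.0, beta=25.0):
    """Posterior over weights for prior N(0, I/alpha) and Gaussian noise precision beta."""
    d = len(X[0])
    # Posterior precision: alpha * I + beta * X^T X
    A = [[alpha * (i == j) + beta * sum(x[i] * x[j] for x in X)
          for j in range(d)] for i in range(d)]
    S = invert(A)  # posterior covariance
    b = [beta * sum(x[i] * t for x, t in zip(X, y)) for i in range(d)]
    m = [sum(S[i][j] * b[j] for j in range(d)) for i in range(d)]  # posterior mean
    return m, S

def predict(x, m, S, beta=25.0):
    """Predictive mean and standard deviation (noise plus weight uncertainty)."""
    mean = sum(xi * mi for xi, mi in zip(x, m))
    d = len(x)
    var = 1.0 / beta + sum(x[i] * S[i][j] * x[j] for i in range(d) for j in range(d))
    return mean, math.sqrt(var)

# Hypothetical toy features for four compounds (not real measurements).
chem = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.7]]  # chemical descriptors
expr = [[1.0], [0.0], [0.5], [0.2]]                      # gene expression
img = [[0.3], [0.6], [0.1], [0.8]]                       # microscopy features
y = [0.2, 0.7, 0.4, 0.9]                                 # toy assay readout

fused = early_fusion(chem, expr, img)
m, S = fit_posterior(fused, y)
mean, std = predict(fused[0], m, S)
# A query far outside the training range receives a wider predictive interval.
far_mean, far_std = predict([10 * v for v in fused[0]], m, S)
```

The predictive standard deviation is the quantity that a downstream model could consume in order to propagate uncertainty, for example by sampling imputed assay values rather than using point estimates.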
The second contribution is uniDINO, a generalist fluorescence microscopy feature extractor. It is trained using self-supervised learning on approximately 900,000 images from diverse microscopy assays designed to characterize cellular responses to genetic and compound perturbations. By integrating heterogeneous assay formats, uniDINO is trained on datasets spanning a wide range of cell lines, fluorescent markers, and experimental conditions. After training, uniDINO extracts biologically relevant embeddings for different treatment conditions without the need for fine-tuning. When compared to baselines such as transfer learning from natural images or computationally intensive tools like CellProfiler, predictions based on uniDINO embeddings demonstrate robustness to technical artifacts and achieve strong performance in tasks such as identifying mechanisms of action. Moreover, uniDINO outperforms or matches these baselines in terms of generalization to non-human cell lines and non-imaging assays that measure cell health. By design, uniDINO can extract features from any fluorescence microscopy assay, eliminating the need to train separate models for individual datasets or assay formats, and enabling its application to datasets too small for training deep learning models.
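As a rough illustration of how embeddings from a frozen feature extractor might be used without fine-tuning, the sketch below averages per-image embeddings into one profile per treatment and assigns a query profile the mechanism of action of its most similar reference treatment by cosine similarity. This is one common profiling recipe assumed here for illustration, not necessarily the evaluation protocol of the thesis; the function names, treatment labels, and three-dimensional vectors are hypothetical placeholders rather than uniDINO outputs.

```python
import math
from collections import defaultdict

def treatment_profiles(embeddings, treatments):
    """Average the per-image embeddings belonging to each treatment."""
    groups = defaultdict(list)
    for vec, treatment in zip(embeddings, treatments):
        groups[treatment].append(vec)
    return {t: [sum(col) / len(vecs) for col in zip(*vecs)]
            for t, vecs in groups.items()}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_moa(query, reference_profiles, moa_of):
    """Return the mechanism of action of the most similar reference treatment."""
    best = max(reference_profiles,
               key=lambda t: cosine(query, reference_profiles[t]))
    return moa_of[best]

# Hypothetical per-image embeddings (toy values) and their treatments.
embeddings = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0],
              [0.0, 1.0, 0.1], [0.1, 0.9, 0.3]]
treatments = ["cmpd_A", "cmpd_A", "cmpd_B", "cmpd_B"]
moa_of = {"cmpd_A": "moa_1", "cmpd_B": "moa_2"}  # placeholder annotations

profiles = treatment_profiles(embeddings, treatments)
predicted = nearest_moa([0.9, 0.1, 0.1], profiles, moa_of)
```

Because the extractor stays frozen, the only dataset-specific computation in this recipe is the aggregation and the similarity search, which is what makes such an approach feasible for datasets too small to train a deep model.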
Together, these case studies illustrate how data integration combined with machine learning can reduce costs and accelerate research efforts, ultimately improving predictive accuracy and decision-making across the drug discovery pipeline.