dc.contributor.author
Mejia Morelli, Flavio
dc.date.accessioned
2026-03-25T09:08:14Z
dc.date.available
2026-03-25T09:08:14Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/51654
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-51382
dc.description.abstract
The increasing availability of multi-omics and imaging data enables more comprehensive modeling of disease mechanisms and drug responses.
However, it also presents challenges due to dataset scale and data structure heterogeneity.
This thesis investigates, in two case studies, how machine learning can help integrate diverse data modalities to advance drug discovery and development.
The first case study introduces a Bayesian framework for predicting missing in vitro assay measurements used to characterize the impact of drugs on liver function, based on chemical structure, gene expression, and fluorescence microscopy data.
Crucially, Bayesian models can propagate uncertainty to downstream hepatotoxicity predictions.
Multiple Bayesian approaches are systematically benchmarked, showing both the strong predictive performance of chemical descriptors and the superiority of an early fusion data integration strategy.
The impact of dataset size on performance is also assessed, revealing only modest improvements with substantially larger datasets.
Overall, this framework supports decision-making in the preclinical phase and can be used for in silico screening in the early stages of drug discovery.
The second contribution is uniDINO, a generalist fluorescence microscopy feature extractor.
It is trained using self-supervised learning on approximately 900,000 images from diverse microscopy assays designed to characterize cellular responses to genetic and compound perturbations.
By integrating heterogeneous assay formats, uniDINO is trained on datasets spanning a wide range of cell lines, fluorescent markers, and experimental conditions.
After training, uniDINO extracts biologically relevant embeddings for different treatment conditions without the need for fine-tuning.
When compared to baselines such as transfer learning from natural images or computationally intensive tools like CellProfiler, predictions based on uniDINO embeddings demonstrate robustness to technical artifacts and achieve strong performance in tasks such as identifying mechanisms of action.
Moreover, uniDINO outperforms or matches these baselines in terms of generalization to non-human cell lines and non-imaging assays that measure cell health.
By design, uniDINO can extract features from any fluorescence microscopy assay, eliminating the need to train separate models for individual datasets or assay formats, and enabling its application to datasets too small for training deep learning models.
Together, these case studies illustrate how data integration combined with machine learning can reduce costs and accelerate research efforts, ultimately improving predictive accuracy and decision-making across the drug discovery pipeline.
dc.format.extent
xv, 117 pages
dc.rights.uri
http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen
dc.subject
Data Integration
en
dc.subject
Machine Learning
en
dc.subject
Computer Vision
en
dc.subject
Probabilistic Models
en
dc.subject
Fluorescence Microscopy
en
dc.subject
Drug Discovery
en
dc.subject.ddc
500 Natural sciences and mathematics::510 Mathematics::519 Probabilities and applied mathematics
dc.subject.ddc
500 Natural sciences and mathematics::570 Life sciences::570 Life sciences
dc.subject.ddc
000 Computer science, information, and general works::000 Computer Science, knowledge, systems::004 Data processing and Computer science
dc.title
Machine learning-driven data integration for drug discovery
dc.contributor.gender
male
dc.contributor.firstReferee
Baum, Katharina
dc.contributor.furtherReferee
Schulz, Marcel
dc.date.accepted
2026-02-13
dc.identifier.urn
urn:nbn:de:kobv:188-refubium-51654-9
refubium.affiliation
Mathematik und Informatik
dcterms.accessRights.dnb
free
dcterms.accessRights.openaire
open access