In recent years, advances in sequencing technologies have enabled population-scale sequencing studies. These studies sequence and analyze a large set of individuals from one or multiple populations, with the aim of gaining insight into underlying genetic structure, similarities and differences. Collections of genetic variation and possible connections to various diseases are some of the products of this area of research. The potential of population studies is widely considered to be huge, and many more endeavors of this kind are expected in the near future. This opportunity comes with a big challenge, because many computational tools that are used for the analysis of sequencing data were not designed for cohorts of this size and may suffer from limited scalability. It is therefore vital that the computational tools required for the analysis of population-scale data keep up with the quickly growing amounts of data.
This thesis contributes to the field of population-scale genetics through the development and application of a novel approach for structural variant detection. The approach has been designed explicitly with the large amounts of population-scale sequencing data in mind and is capable of jointly analyzing tens of thousands of whole-genome short-read sequencing samples. This joint analysis is driven by a tailored joint likelihood ratio model that integrates information from many genomes. The efficient approach not only saves computational resources but also makes it possible to combine the data across all samples to make sensitive and specific predictions about the presence and genotypes of structural variation within the analyzed population. This thesis demonstrates that this approach, and the computational tool PopDel that implements it, compare favorably to current state-of-the-art structural variant callers that have been used in previous population-scale studies. Extensive benchmarks on simulated and real-world sequencing data are provided to show the performance of the presented approach. Further, a first finding of medical relevance that directly stems from the application of PopDel to the genomes of almost 50,000 Icelanders is presented.
This thesis therefore provides a novel tool and new ideas to further push the boundaries of the analysis of massive amounts of next-generation sequencing data and to deepen our understanding of structural variation and its implications for human health.