dc.contributor.author
Krannich, Thomas
dc.date.accessioned
2022-06-02T09:10:42Z
dc.date.available
2022-06-02T09:10:42Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/35041
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-34757
dc.description.abstract
Non-reference sequence (NRS) variants are a less frequently investigated class of genomic structural variants (SVs). They are DNA sequences found within an individual that are novel with respect to a given reference genome. NRS occur predominantly because a linear reference genome that was derived from a single or a few individuals lacks biological diversity and ancestral sequence. Newly sequenced individuals can therefore yield genomic sequences that are absent from the reference genome.
With the increasing throughput of sequencing technologies, SV detection has become possible across tens of thousands of individuals. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly, which is a complex computational problem and requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data from multiple genomes can be combined to detect NRS variants reliably. However, the algorithms proposed in these studies have a limited capability to process large sets of genomes.
This thesis introduces novel contributions to the discovery of NRS variants in many genomes, which scale to considerably larger numbers of genomes than previous methods. PopIns2, a practical software tool developed to apply the presented methods, is described in detail. The highlight among the new contributions is a procedure that merges contig assemblies of unaligned reads from many individuals into a single set of NRS by heuristically generating a weighted minimum path cover for a colored de Bruijn graph. Tests on simulated data show that PopIns2 ranks among the best approaches in terms of quality and reliability and that its approach yields the best precision as the number of processed genomes grows. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for application to population-scale datasets.
en
dc.format.extent
141 pages
dc.rights.uri
https://creativecommons.org/licenses/by-sa/4.0/
dc.subject
Bioinformatics
en
dc.subject
Structural variants
en
dc.subject
de Bruijn Graph
en
dc.subject.ddc
000 Computer science, information, and general works::000 Computer Science, knowledge, systems::004 Data processing and Computer science
dc.subject.ddc
500 Natural sciences and mathematics::570 Life sciences::570 Life sciences
dc.title
Contributions to the detection of non-reference sequences in population-scale NGS data
dc.contributor.gender
male
dc.contributor.firstReferee
Reinert, Knut
dc.contributor.furtherReferee
Kehr, Birte
dc.date.accepted
2022-04-29
dc.identifier.urn
urn:nbn:de:kobv:188-refubium-35041-7
refubium.affiliation
Mathematik und Informatik
dcterms.accessRights.dnb
free
dcterms.accessRights.openaire
open access