Aspects of Quality Control for Next Generation Sequencing Data in Medical Genetics

Heinrich, Verena

Metadata

dc.contributor.author

Heinrich, Verena

dc.date.accessioned

2018-06-07T16:37:42Z

dc.date.available

2017-03-17T08:57:16.279Z

dc.date.issued

2017

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/2810

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-7011

dc.description

1 Introduction
  1.1 Biological Background
    1.1.1 Next Generation Sequencing
    1.1.2 The Human Reference Genome
    1.1.3 Genetic Variability in the Human Genome
    1.1.4 The 1000 Genomes Project
    1.1.5 A Broad Outlook of Next Generation Sequencing in Human Genetics
  1.2 Bioinformatics Processing of Next Generation Sequencing Data
    1.2.1 Methods for Sequence Alignment
    1.2.2 Variant Calling and Allele Frequencies at Heterozygous Positions
  1.3 Quality Control and Filtering Techniques
    1.3.1 Quality Measurements in NGS Data
    1.3.2 Strategies to Filter for Diseases in NGS Studies
2 Thesis Organization
3 Distribution of Short Read Fragments in Next Generation Sequencing Experiments
  3.1 Overview of this Chapter
  3.2 Introduction to (Bienaymé-) Galton-Watson (BGW) Branching Processes
    3.2.1 Stochastic Processes
    3.2.2 (Bienaymé-) Galton-Watson Branching Processes
  3.3 Asymptotic Behaviour of the Relative Frequencies
  3.4 Sequence Fragments after Amplification
    3.4.1 Fragment Amplification as a BGW Branching Process
    3.4.2 Variance of ∆_k^i
  3.5 Experimental Validation
    3.5.1 Experimental Whole Exome Sequencing Data
    3.5.2 Independency of Positions and Individuals
    3.5.3 Simulation of Allele Frequencies After Amplification
    3.5.4 Influence on Error Rates in Heterozygous Variant Detection
  3.6 Summary of this Chapter
4 Exome Genotyping Accuracy in Next Generation Sequencing Data
  4.1 Overview of this Chapter
  4.2 Introduction to Distance Metrics
    4.2.1 Distance Metrics and Similarity Measures
    4.2.2 Embedding of the Distance Matrix D
  4.3 Generation and Processing of Exome Data
  4.4 An Error Sensitive Genotype-Weighted Metric
    4.4.1 Computation of the Weighted Distance Function
    4.4.2 Computation of a Standardized Dissimilarity Score (SDS)
  4.5 Experimental Validation
    4.5.1 Analysis of Exomes With Different Genotyping Accuracies
    4.5.2 The Influence of Coverage and Error Rates on d_{ij}^W
    4.5.3 Comparison of WES Data from Different NGS Studies and Target Sizes
  4.6 Similarity Metrics in Rare Variant Association Studies
  4.7 Summary of this Chapter
5 Prediction of Family Structures in Next Generation Sequencing Data
  5.1 Overview of this Chapter
  5.2 Introduction to Approaches to Analyse Family Structures
    5.2.1 Genetic Identity Coefficients
    5.2.2 Likelihood Ratios
  5.3 Definition of Hypotheses and LOD Scores
  5.4 Experimental Validation
    5.4.1 Separation Efficiency of LOD Scores
    5.4.2 Directionality of Pairwise Relationships
    5.4.3 Precision of LOD Scores
    5.4.4 Influence of Inbred Structures
  5.5 Summary of this Chapter
6 Discussion
i Summary
ii Zusammenfassung
iii Acknowledgments
iv Curriculum Vitae
v Abbreviations
vi Bibliography

dc.description.abstract

Over the last decade, methods based on next generation sequencing (NGS) have revolutionized the field of medical genetics. Sequencing the protein-coding regions of the genome by whole exome sequencing (WES) makes it possible to identify the genetic variants underlying Mendelian disorders. Additional approaches have been introduced to reduce the search space for potentially pathogenic mutations, for example by including more family members in the analysis. Rapid technical advances have also brought new challenges, and methods for quality control (QC) are crucial for increasing the sensitivity of variant detection. This work presents QC strategies that address three levels of the analysis of a WES experiment. The distribution of allele frequencies (AFs) at heterozygous positions is associated with the amplification step during library preparation before sequencing. Strong deviations from the expected mean of 0.5 lead to an increased error rate in the detection of genetic variants. It is shown that the variance of this distribution can be modelled with a two-type (Bienaymé-) Galton-Watson (BGW) branching process. From this model, conclusions can be drawn on how to reduce the stochastic fluctuations caused by the amplification step. In addition, the derived variance can be used as an indicator of the error rate of a WES sample. Furthermore, variant detection is strongly influenced by the ethnic background of an individual, because single nucleotide polymorphism (SNP) frequencies have population-specific characteristics. Here, the exome-wide accuracy is estimated by comparing all variants of a WES sample to a high-quality reference set with a matching background population, using a distance metric that places greater weight on rare variants. The distance to the reference set is highly associated with the genotyping quality of the sample, and the overall genotyping accuracy can be estimated by comparing the result to simulated error groups.
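The branching-process view of amplification described in this abstract can be illustrated with a small simulation. This is a minimal sketch, not the model derived in the thesis: it assumes that every fragment is copied independently with probability p_dup in each cycle, that the reference and alternate alleles are the two types and amplify with the same efficiency, and it ignores the later sampling of reads. The function names, the parameter values (6 cycles, p_dup = 0.8) and the starting fragment counts are illustrative assumptions.

```python
import random
from statistics import pvariance


def amplify(n_start, cycles, p_dup, rng):
    """Fragment count of one allele after `cycles` Galton-Watson steps.

    In each cycle every fragment is kept and, with probability p_dup,
    contributes one additional copy (assumed amplification model).
    """
    n = n_start
    for _ in range(cycles):
        n += sum(1 for _ in range(n) if rng.random() < p_dup)
    return n


def allele_frequency_variance(n_start, cycles=6, p_dup=0.8,
                              replicates=200, seed=1):
    """Variance of the alternate allele frequency at a heterozygous site.

    Both alleles start with n_start fragments (expected frequency 0.5)
    and are amplified independently of each other.
    """
    rng = random.Random(seed)
    freqs = []
    for _ in range(replicates):
        ref = amplify(n_start, cycles, p_dup, rng)
        alt = amplify(n_start, cycles, p_dup, rng)
        freqs.append(alt / (ref + alt))
    return pvariance(freqs)


if __name__ == "__main__":
    for n_start in (5, 50):
        print(n_start, allele_frequency_variance(n_start))
```

In this toy model the simulated variance is smaller when more fragments enter amplification. Whether this matches the conclusions of the thesis is not stated in this record.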
Most strategies to filter for potentially pathogenic variants are based on the simultaneous analysis of several family members, for example when filtering for de novo mutations. However, these techniques rely strongly on correct pedigree information; sample mix-ups considerably affect the analysis and can lead to false conclusions. In this work, relatedness structures between samples are inferred by calculating logarithm of the odds (LOD) scores based on population genotype (GT) frequencies. These approaches complement existing quality control recommendations and help to indicate the accuracy of a whole exome sequencing sample.
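A LOD score for a pair of samples can be sketched as follows. The hypotheses actually used in the thesis are not given in this record, so the example falls back on the standard formulation with identity-by-descent (IBD) sharing probabilities k = (k0, k1, k2). It assumes biallelic sites in Hardy-Weinberg equilibrium, independent sites, alternate allele frequencies strictly between 0 and 1, and no genotyping error. All function and constant names are hypothetical.

```python
import math
from itertools import product

UNRELATED = (1.0, 0.0, 0.0)      # (k0, k1, k2): no alleles shared IBD
PARENT_CHILD = (0.0, 1.0, 0.0)   # exactly one allele shared IBD


def genotype_prob(g, p):
    """Hardy-Weinberg probability of carrying g (0, 1 or 2) alternate alleles."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[g]


def joint_prob(g1, g2, p, k):
    """P(g1, g2) for a pair with IBD sharing probabilities k = (k0, k1, k2)."""
    allele = (1.0 - p, p)  # P(reference allele), P(alternate allele)
    ibd0 = genotype_prob(g1, p) * genotype_prob(g2, p)
    ibd2 = genotype_prob(g1, p) if g1 == g2 else 0.0
    # IBD1: one shared allele s plus one private allele each (a and b)
    ibd1 = sum(allele[s] * allele[a] * allele[b]
               for s, a, b in product((0, 1), repeat=3)
               if s + a == g1 and s + b == g2)
    return k[0] * ibd0 + k[1] * ibd1 + k[2] * ibd2


def lod_score(genotypes_1, genotypes_2, alt_freqs, k_related=PARENT_CHILD):
    """log10 likelihood ratio of 'related as k_related' versus 'unrelated'."""
    lod = 0.0
    for g1, g2, p in zip(genotypes_1, genotypes_2, alt_freqs):
        num = joint_prob(g1, g2, p, k_related)
        den = joint_prob(g1, g2, p, UNRELATED)
        if num == 0.0:
            return -math.inf  # genotypes impossible under the relationship
        lod += math.log10(num / den)
    return lod


if __name__ == "__main__":
    sample_a = [1, 2, 0, 1]          # made-up genotypes for illustration
    sample_b = [1, 1, 0, 2]
    freqs = [0.1, 0.05, 0.3, 0.2]    # made-up population allele frequencies
    print(lod_score(sample_a, sample_b, freqs))
```

A positive score favours the tested relationship over unrelatedness. Because no error term is modelled, a single Mendelian-inconsistent site (opposite homozygotes under the parent-child hypothesis) drives the score to minus infinity, so real data would need a genotyping error model.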

dc.description.abstract

Sequencing the protein-coding region of the genome makes it possible to identify genetic variations that underlie Mendelian diseases. Methods for quality control are an essential component for estimating and increasing the sensitivity of the detection of genetic variants. This work presents different quality control strategies that focus on three different phases in the analysis of an exome. The distribution of allele frequencies at heterozygous positions is associated with an amplification step that precedes sequencing. It was shown that the variance of this distribution can be modelled with a branching process. With the help of this simulation, conclusions can be drawn about the stochastic fluctuations during the amplification step, which allows the error rate of an experiment to be estimated. The detection of variants is strongly influenced by the ethnic background of an individual, because SNP frequencies exhibit population-specific characteristics. By comparing all variants of an exome with a high-quality reference set that has a similar population background, the accuracy of an experiment can be estimated. For this purpose, a distance metric was used in this work that weights rare variants more strongly than common ones. Many strategies applied to filter for potentially pathogenic mutations are based on the analysis of several family members. However, these approaches depend on correct pedigrees, and possible sample mix-ups hinder the analysis and lead to false results. In this work, relatedness was determined by means of likelihood ratio tests based on genotype frequencies.
The approaches presented complement existing quality control recommendations and help to determine the accuracy of an exome experiment.

dc.format.extent

vi, 137, XXVII pages

dc.language

eng

dc.rights.uri

http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen

dc.subject

Quality Control

dc.subject

Next Generation Sequencing

dc.subject.ddc

000 Computer science, information and general works

dc.subject.ddc

000 Computer science, information and general works::000 Computer science, knowledge, systems::004 Data processing; computer science

dc.subject.ddc

000 Computer science, information and general works::000 Computer science, knowledge, systems::005 Computer programming, programs, data

dc.subject.ddc

500 Natural sciences and mathematics::570 Life sciences; biology

dc.subject.ddc

500 Natural sciences and mathematics::570 Life sciences; biology::576 Genetics and evolution

dc.title

Aspects of Quality Control for Next Generation Sequencing Data in Medical Genetics

dc.type

Dissertation

dcterms.format

Text

dc.contributor.contact

heinrich@molgen.mpg.de

dc.contributor.gender

dc.contributor.firstReferee

Prof. Dr. Martin Vingron

dc.contributor.furtherReferee

Prof. Dr. Nick Robinson

dc.date.accepted

2016-11-22

dc.identifier.urn

urn:nbn:de:kobv:188-fudissthesis000000104369-1

dc.title.translated

Aspekte zur Qualitätskontrolle für Hochdurchsatzsequenzierungsdaten in der medizinischen Genetik

refubium.affiliation

Mathematik und Informatik

refubium.mycore.fudocsId

FUDISS_thesis_000000104369

refubium.mycore.derivateId

FUDISS_derivate_000000021201

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access

This Item appears in the following Collection(s)

Dissertationen FU

Files in This Item

HeinrichDiss.pdf

Description: PDF: Dissertation

Size: 10.53MB

Format: PDF

Checksum (MD5): c5187ab146dcfacf124c516a97c84238

Refubium - Freie Universität Berlin Repository
