Approximate String Matching: Improving Data Structures and Algorithms

Pockrandt, Christopher Maximilian

Metadata

dc.contributor.author

Pockrandt, Christopher Maximilian

dc.date.accessioned

2019-04-15T08:26:19Z

dc.date.available

2019-04-15T08:26:19Z

dc.date.issued

2019

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/24413

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-2185

dc.description.abstract

This thesis addresses important algorithms and data structures used in sequence analysis for applications such as read mapping. First, we give an overview of state-of-the-art FM indices and present the latest improvements. In particular, we introduce a recently published FM index based on a new data structure: EPR dictionaries. This rank data structure allows search steps in constant time for unidirectional and bidirectional FM indices. To our knowledge, this is the first and only constant-time implementation of a bidirectional FM index at the time of writing. We show that its running time is not only optimal in theory but also currently outperforms all available FM index implementations in practice. Second, we cover approximate string matching in bidirectional indices. To improve the running time and make higher error rates feasible for index-based searches, we introduce an integer linear program for finding optimal search strategies. We show that the resulting strategies are significantly faster than other search strategies in indices, and we cover additional improvements such as hybrid approaches that combine index-based search with in-text verification, i.e., at some point the partially matched string is located and verified directly in the text. Finally, we present an as yet unpublished algorithm for fast computation of the mappability of genomic sequences. Mappability measures the uniqueness of regions of a genome by counting how often each $k$-mer of the sequence occurs, within a given error threshold, in the genome itself. We suggest two applications of mappability with prototype implementations: first, a read mapper that incorporates mappability information to improve the running time when mapping reads that match highly repetitive regions; second, the use of mappability information to identify phylogenetic markers in a set of similar strains of the same species, using E. coli as an example.
Unique regions allow identifying and distinguishing even highly similar strains using unassembled sequencing data. The findings in this thesis can speed up many applications in bioinformatics, as we demonstrate for read mapping and the computation of mappability, and we give suggestions for further research in this field.
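
To make the counting behind mappability concrete, the following is a minimal brute-force sketch and not the algorithm presented in the thesis. It assumes Hamming distance (mismatches only) as the error model, which the abstract does not specify, and the function names `hamming_within` and `kmer_frequencies` are hypothetical. It compares every $k$-mer against every other, so it is only suitable for tiny inputs.

```python
def hamming_within(a: str, b: str, max_errors: int) -> bool:
    """Return True if equal-length strings a and b differ in at most max_errors positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_errors:
                return False
    return True


def kmer_frequencies(genome: str, k: int, max_errors: int) -> list[int]:
    """For each k-mer start position, count the positions in the genome whose
    k-mer matches it with at most max_errors mismatches (itself included).

    Assumption: Hamming distance is used as the error model.
    Naive quadratic comparison, for illustration only.
    """
    n = len(genome) - k + 1
    kmers = [genome[i:i + k] for i in range(max(n, 0))]
    return [sum(hamming_within(q, t, max_errors) for t in kmers) for q in kmers]


if __name__ == "__main__":
    # With one allowed mismatch, ACGT and ACGA match each other.
    print(kmer_frequencies("ACGTACGA", 4, 1))  # [2, 1, 1, 1, 2]
```

A count of 1 means the $k$-mer is unique in the sequence at the given error threshold; higher counts indicate repetitive regions.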

dc.format.extent

175 pages

dc.language

eng

dc.rights.uri

https://creativecommons.org/licenses/by/4.0/

dc.subject

approximate

dc.subject

fm index

dc.subject

string matching

dc.subject

epr dictionaries

dc.subject

BWT

dc.subject.ddc

000 Computer science, information and general works::000 Computer science, knowledge and systems::000 Computer science, information and general works

dc.title

Approximate String Matching

dc.type

Dissertation

dcterms.format

Text

dc.contributor.gender

male

dc.contributor.firstReferee

Reinert, Knut

dc.contributor.furtherReferee

Rahmann, Sven

dc.date.accepted

2019-04-11

dc.identifier.urn

urn:nbn:de:kobv:188-refubium-24413-4

dc.title.subtitle

Improving Data Structures and Algorithms

refubium.affiliation

Mathematics and Computer Science

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access

This document appears in:

FU Dissertations

Files for this resource

Dissertation_Pockrandt.pdf

Size: 2.086 MB

Format: PDF

Checksum (MD5): c124be1026ad006cd5c21365bb6be9c0

Refubium - Repository of Freie Universität Berlin