Linear Programming and Integer Linear Programming in Bioinformatics

Andreotti, Sandro

Linear Programming and Integer Linear Programming in Bioinformatics

Metadaten

dc.contributor.author

Andreotti, Sandro

dc.date.accessioned

2018-06-08T01:09:14Z

dc.date.available

2015-02-06T10:16:21.938Z

dc.date.issued

2015

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/13001

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-17199

dc.description.abstract

A wide range of important problems related to bioinformatics and computational biology are optimization problems asking for a solution that minimizes or maximizes a certain objective function. Often, these problems are combinatorial optimization problems that can be formulated as integer linear programs. While for some of these problems polynomial time algorithms are known, for many other problems it is unlikely that such algorithms exist. However, much work has been dedicated to develop algorithms that are capable of solving many interesting integer linear programming problems on real live instances with acceptable memory and running time requirements. These algorithms are implemented in a variety of free or commercial solver software packages. In situations where the performance of general purpose solvers is insufficient, often problem specific integer linear programming techniques can be applied that take advantage of knowledge about the particular structure of the integer linear programming formulation to solve the problem in a much more time- or space-efficient way. In this thesis we present our algorithmic approaches to three relevant bioinformatic problems, each involving certain linear programming and integer linear programming techniques. The first problem is the de novo peptide sequencing problem, which consists in identifying a peptide's sequence solely from its tandem mass spectrum without any additional information stored in genome databases or protein databases. This problem can be formulated as a graph theoretical problem asking for the computation of a longest antisymmetric path in a directed acyclic graph. The particular structure of the associated integer linear programming formulation facilitates the application of a technique called Lagrangian relaxation, which yields an algorithm that outperforms state-of-the-art commercial integer linear programming solvers by orders of magnitude. The second problem is the isoform inference and abundance estimation problem from RNA-Seq data. This problem consists in predicting a set of expressed RNA isoforms, i.e., full length RNA transcripts corresponding to alternative splice variants, together with an estimate of their individual expression levels. We apply a linear programming technique called delayed column generation, which allows us to increase the search space without explicitly enumerating the potentially huge set of candidate isoforms. As a consequence, our approach allows for the identification of isoforms that otherwise could not be recovered due to incomplete read coverage. A central component of our delayed column generation algorithm is an integer linear programming formulation. The third problem is the duplication-loss alignment problem, which asks for a labeled alignment of two genome sequences that implies the minimal number of loss and duplication events in the evolutionary history from an unknown nearest common ancestor. In a labeled alignment, every unaligned gene must be labeled either as a loss or as the product of a duplication event. Once an optimal labeled alignment has been computed, a common ancestor genome with minimal implied evolutionary operations can be derived in a straight forward way. In our approach we identified problem specific cutting planes and developed efficient separation algorithms to obtain a branch and cut algorithm that is several orders of magnitude faster than existing approaches based on integer linear programming.

dc.description.abstract

Viele bedeutende Probleme im Forschungsbereich der Bioinformatik sind Optimierungsprobleme, bei denen eine Lösung errechnet werden muss, welche eine gegebene Zielfunktion minimiert oder maximiert. Bei einigen dieser Probleme handelt es sich um kombinatorische Optimierungsprobleme, welche als ganzzahlige lineare Optimierungsprobleme formuliert werden können. Während für eine Teilmenge dieser Probleme effiziente Polynomialzeit-Algorithmen bekannt sind, ist es für viele Probleme unwahrscheinlich, dass solche Algorithmen überhaupt existieren. Infolge kontinuierlicher Forschung wurden jedoch Algorithmen und Methoden entwickelt, die es erlauben viele ganzzahlige lineare Optimierungsprobleme für reale Instanzen mit vertretbarem Rechenaufwand zu lösen. Diese Algorithmen sind in einer Vielzahl von freien oder kommerziellen Löser Software Paketen implementiert. In Situationen, in denen die Effizienz solcher Löser unzureichend ist, kann die Anwendung problemspezifischer Methoden der ganzzahligen linearen Optimierung, welche die spezielle Struktur der Formulierung berücksichtigen, den Zeit- oder Speicherbedarf signifikant reduzieren. In dieser Arbeit werden algorithmische Ansätze für drei relevante Probleme der Bioinformatik vorgestellt, die jeweils bestimmte Methoden der linearen Optimierung und ganzzahligen linearen Optimierung verwenden. Als erstes wird das Problem der de novo Peptidsequenzierung betrachtet, welches darin besteht, die Sequenz eines unbekannten Peptids aus seinem Tandem Massenspektrum zu rekonstruieren, ohne zusätzliche Informationen aus Genom- oder Proteindatenbanken zu verwenden. Dieses Problem lässt sich als das graphentheoretische Problem des längsten antisymmetrischen Pfads in einem gerichteten azyklischen Graphen formulieren. Die besondere Struktur der Formulierung als ganzzahliges lineares Optimierungsproblem ermöglicht eine effektive Anwendung von Lagrange Relaxierung. Der resultierende Algorithmus ist um mehrere Größenordnungen schneller als leistungsfähige kommerzielle Löser Software. Das zweite Problem behandelt die Vorhersage exprimierter RNA- Isoformen, welche alternativen Spleißvarianten desselben Gens entsprechen, basierend auf RNA-Seq Daten. Gleichzeitig sollen die individuellen Expressionsniveaus der einzelnen Isoformen bestimmt werden. Der vorgestellte Ansatz verwendet die Methode der verzögerten Spaltengenerierung, die es ermöglicht den Suchraum der Isoform Kandidaten zu erweitern, ohne eine vollständige Aufzählung dieser Kandidaten erforderlich zu machen. Durch die gezielte Erweiterung des Suchraums können auch Isoformen korrekt vorhergesagt werden die zuvor aufgrund unvollständiger Abdeckung durch sequenzierte reads nicht betrachtet werden konnten. Ein zentraler Bestandteil des vorgestellten Algorithmus zur Generierung neuer Spalten ist ein ganzzahliges lineares Optimierungsproblem.

dc.format.extent

XIV, 168 S.

dc.language

eng

dc.rights.uri

http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen

dc.subject

linear programming

dc.subject

integer linear programming

dc.subject

combinatorial optimization

dc.subject.ddc

500 Naturwissenschaften und Mathematik::570 Biowissenschaften; Biologie::570 Biowissenschaften; Biologie

dc.subject.ddc

000 Informatik, Informationswissenschaft, allgemeine Werke::000 Informatik, Wissen, Systeme::004 Datenverarbeitung; Informatik

dc.title

Linear Programming and Integer Linear Programming in Bioinformatics

dc.type

Dissertation

dcterms.format

Text

dc.contributor.gender

dc.contributor.firstReferee

Prof. Dr. Knut Reinert

dc.contributor.furtherReferee

Prof. Dr. Gunnar W. Klau

dc.date.accepted

2014-12-15

dc.identifier.urn

urn:nbn:de:kobv:188-fudissthesis000000098537-4

dc.title.translated

Lineare und ganzzahlige lineare Optimierung in der Bioinformatik

refubium.affiliation

Mathematik und Informatik

refubium.mycore.fudocsId

FUDISS_thesis_000000098537

refubium.mycore.derivateId

FUDISS_derivate_000000016535

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access

Zur Kurzanzeige

Das Dokument erscheint in:

Dissertationen FU

Dateien zu dieser Ressource

thesis.pdf

Größe: 1.904MB

Format: PDF

Prüfsumme (MD5): 929dca08ca624bcb00d93be2a1c8ef0c

Öffnen

Linear Programming and Integer Linear Programming in Bioinformatics

Refubium - Repositorium der Freien Universität Berlin

Linear Programming and Integer Linear Programming in Bioinformatics

Metadaten

Das Dokument erscheint in:

Dateien zu dieser Ressource

Metadaten exportieren