Linear Programming and Integer Linear Programming in Bioinformatics

Andreotti, Sandro

Linear Programming and Integer Linear Programming in Bioinformatics

Haupttitel:

Linear Programming and Integer Linear Programming in Bioinformatics

Titel übersetzt:

Lineare und ganzzahlige lineare Optimierung in der Bioinformatik

Autor*in:

Andreotti, Sandro

Erscheinungsjahr:

2015

Datum der Freigabe:

2015-02-06T10:16:21.938Z

Abstract:

A wide range of important problems related to bioinformatics and computational biology are optimization problems asking for a solution that minimizes or maximizes a certain objective function. Often, these problems are combinatorial optimization problems that can be formulated as integer linear programs. While for some of these problems polynomial time algorithms are known, for many other problems it is unlikely that such algorithms exist. However, much work has been dedicated to develop algorithms that are capable of solving many interesting integer linear programming problems on real live instances with acceptable memory and running time requirements. These algorithms are implemented in a variety of free or commercial solver software packages. In situations where the performance of general purpose solvers is insufficient, often problem specific integer linear programming techniques can be applied that take advantage of knowledge about the particular structure of the integer linear programming formulation to solve the problem in a much more time- or space-efficient way. In this thesis we present our algorithmic approaches to three relevant bioinformatic problems, each involving certain linear programming and integer linear programming techniques. The first problem is the de novo peptide sequencing problem, which consists in identifying a peptide's sequence solely from its tandem mass spectrum without any additional information stored in genome databases or protein databases. This problem can be formulated as a graph theoretical problem asking for the computation of a longest antisymmetric path in a directed acyclic graph. The particular structure of the associated integer linear programming formulation facilitates the application of a technique called Lagrangian relaxation, which yields an algorithm that outperforms state-of-the-art commercial integer linear programming solvers by orders of magnitude. The second problem is the isoform inference and abundance estimation problem from RNA-Seq data. This problem consists in predicting a set of expressed RNA isoforms, i.e., full length RNA transcripts corresponding to alternative splice variants, together with an estimate of their individual expression levels. We apply a linear programming technique called delayed column generation, which allows us to increase the search space without explicitly enumerating the potentially huge set of candidate isoforms. As a consequence, our approach allows for the identification of isoforms that otherwise could not be recovered due to incomplete read coverage. A central component of our delayed column generation algorithm is an integer linear programming formulation. The third problem is the duplication-loss alignment problem, which asks for a labeled alignment of two genome sequences that implies the minimal number of loss and duplication events in the evolutionary history from an unknown nearest common ancestor. In a labeled alignment, every unaligned gene must be labeled either as a loss or as the product of a duplication event. Once an optimal labeled alignment has been computed, a common ancestor genome with minimal implied evolutionary operations can be derived in a straight forward way. In our approach we identified problem specific cutting planes and developed efficient separation algorithms to obtain a branch and cut algorithm that is several orders of magnitude faster than existing approaches based on integer linear programming.

Viele bedeutende Probleme im Forschungsbereich der Bioinformatik sind Optimierungsprobleme, bei denen eine Lösung errechnet werden muss, welche eine gegebene Zielfunktion minimiert oder maximiert. Bei einigen dieser Probleme handelt es sich um kombinatorische Optimierungsprobleme, welche als ganzzahlige lineare Optimierungsprobleme formuliert werden können. Während für eine Teilmenge dieser Probleme effiziente Polynomialzeit-Algorithmen bekannt sind, ist es für viele Probleme unwahrscheinlich, dass solche Algorithmen überhaupt existieren. Infolge kontinuierlicher Forschung wurden jedoch Algorithmen und Methoden entwickelt, die es erlauben viele ganzzahlige lineare Optimierungsprobleme für reale Instanzen mit vertretbarem Rechenaufwand zu lösen. Diese Algorithmen sind in einer Vielzahl von freien oder kommerziellen Löser Software Paketen implementiert. In Situationen, in denen die Effizienz solcher Löser unzureichend ist, kann die Anwendung problemspezifischer Methoden der ganzzahligen linearen Optimierung, welche die spezielle Struktur der Formulierung berücksichtigen, den Zeit- oder Speicherbedarf signifikant reduzieren. In dieser Arbeit werden algorithmische Ansätze für drei relevante Probleme der Bioinformatik vorgestellt, die jeweils bestimmte Methoden der linearen Optimierung und ganzzahligen linearen Optimierung verwenden. Als erstes wird das Problem der de novo Peptidsequenzierung betrachtet, welches darin besteht, die Sequenz eines unbekannten Peptids aus seinem Tandem Massenspektrum zu rekonstruieren, ohne zusätzliche Informationen aus Genom- oder Proteindatenbanken zu verwenden. Dieses Problem lässt sich als das graphentheoretische Problem des längsten antisymmetrischen Pfads in einem gerichteten azyklischen Graphen formulieren. Die besondere Struktur der Formulierung als ganzzahliges lineares Optimierungsproblem ermöglicht eine effektive Anwendung von Lagrange Relaxierung. Der resultierende Algorithmus ist um mehrere Größenordnungen schneller als leistungsfähige kommerzielle Löser Software. Das zweite Problem behandelt die Vorhersage exprimierter RNA- Isoformen, welche alternativen Spleißvarianten desselben Gens entsprechen, basierend auf RNA-Seq Daten. Gleichzeitig sollen die individuellen Expressionsniveaus der einzelnen Isoformen bestimmt werden. Der vorgestellte Ansatz verwendet die Methode der verzögerten Spaltengenerierung, die es ermöglicht den Suchraum der Isoform Kandidaten zu erweitern, ohne eine vollständige Aufzählung dieser Kandidaten erforderlich zu machen. Durch die gezielte Erweiterung des Suchraums können auch Isoformen korrekt vorhergesagt werden die zuvor aufgrund unvollständiger Abdeckung durch sequenzierte reads nicht betrachtet werden konnten. Ein zentraler Bestandteil des vorgestellten Algorithmus zur Generierung neuer Spalten ist ein ganzzahliges lineares Optimierungsproblem.

Identifier:

https://refubium.fu-berlin.de/handle/fub188/13001
http://dx.doi.org/10.17169/refubium-17199
urn:nbn:de:kobv:188-fudissthesis000000098537-4

Sprache:

Englisch

Freie Schlagwörter:

linear programming
integer linear programming
combinatorial optimization

DDC-Klassifikation:

570 Biowissenschaften; Biologie
004 Datenverarbeitung; Informatik

Publikationstyp:

Dissertation

Fachbereich/Einrichtung:

Mathematik und Informatik