The function of non-coding RNA sequences is largely determined by their spatial conformation. This is the secondary structure of the molecule, which is formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. Essential tasks for discovering yet unknown RNA families and inferring their possible functions are the structural alignment of RNAs and the subsequent search of the derived structural motifs. These tasks demand a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains pseudoknots, which are overlapping interactions that add additional complexity to the analysis and are often ignored in available software.
In this thesis, I present LaRA 2 and MaRs, two SeqAn-based software tools that implement algorithms for finding sequence-structure motifs in genomic sequences. In contrast to other programs, my tools can handle arbitrary pseudoknots. They use multithreading for parallel execution and are implemented in modern C++ code for maximal longevity and performance.
LaRA2 is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. It uses a new heuristic for computing a lower boundary to the solution and employs vectorization techniques for speeding up the time-critical parts of the algorithm.
MaRs can be applied in a workflow right after LaRA2 and derives sequence-structure motifs from the structural alignments. The motifs are descriptors of the RNA sequences’ properties and drive the search for homologs in genomic sequences. MaRs employs a bidirectional index on the genomic sequences and an optimized multithreaded search strategy for finding the matches really fast. The use of a thread pool, effective pruning strategies, and a low memory footprint ensure that MaRs is capable of processing extremely large data sets.