Protein Secondary Structure Prediction Using a Vector Valued Classifier

Korff, Matti Gerrit

Metadata

dc.contributor.author

Korff, Matti Gerrit

dc.date.accessioned

2018-06-07T17:57:19Z

dc.date.available

2016-02-01T09:03:14.029Z

dc.date.issued

2016

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/4467

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-8667

dc.description

1 Introduction
  1.1 Protein structure
    1.1.1 Amino acid propensities
    1.1.2 Protein secondary structure
      1.1.2.1 Helices
      1.1.2.2 β sheet and β strand
      1.1.2.3 Turns
      1.1.2.4 Isolated β bridges
      1.1.2.5 Coil
    1.1.3 Protein secondary structure assignment
  1.2 Protein structure prediction
    1.2.1 Protein secondary structure prediction
      1.2.1.1 PSIPRED
      1.2.1.2 C3-Scorpion
      1.2.1.3 Jpred
  1.3 Statistical classification
    1.3.1 Linear regression
      1.3.1.1 Parameter estimation
      1.3.1.2 Alternative representation
      1.3.1.3 Solving the system of linear equations
      1.3.1.4 Classification
    1.3.2 Artificial neural networks
      1.3.2.1 Perceptron
      1.3.2.2 Sigmoid neuron
      1.3.2.3 Gradient descent
      1.3.2.4 Backpropagation
      1.3.2.5 Conclusion
    1.3.3 Regularization
      1.3.3.1 Cross validation
      1.3.3.2 Regularization terms
      1.3.3.3 Tikhonov regularization
2 Methods
  2.1 Dataset
  2.2 Reduction schemes
  2.3 Input data representation
    2.3.1 Representation of the sequence window
    2.3.2 PSI BLAST profiles
    2.3.3 Terminal residues
  2.4 Multi class classifier
    2.4.1 Two class classification
    2.4.2 Multi class classification
      2.4.2.1 Class vectors
      2.4.2.2 Generalized tetrahedron
    2.4.3 Vector valued classification function
      2.4.3.1 Definition of the scoring functions
    2.4.4 Optimization of the classifier
      2.4.4.1 Linearization of the scoring function
      2.4.4.2 Derivation of the parameters of the scoring functions
      2.4.4.3 Regularization of the objective function
  2.5 Implementation of the classifier
    2.5.1 Implementation of the first prediction step
    2.5.2 Implementation of the second prediction step
  2.6 Quality measures
    2.6.1 Accuracy index QC
    2.6.2 Generalized Matthews correlation coefficient
    2.6.3 Confidence measure of the vector-valued scoring function f
      2.6.3.1 Three classes
      2.6.3.2 Confidence of predicted class
3 Results
  3.1 Dataset
  3.2 Cross validation
    3.2.1 Significant differences
  3.3 Setup
    3.3.1 Regularization parameter
    3.3.2 BLAST databases
    3.3.3 Secondary structure class reduction schemes
    3.3.4 Window size
    3.3.5 Classifier enhancement
      3.3.5.1 Additional sequence feature
      3.3.5.2 Second prediction step
      3.3.5.3 Additional profile feature
      3.3.5.4 Overall impact of prediction enhancements
  3.4 Validation
    3.4.1 Amino acid preferences
    3.4.2 Confidence measurements
      3.4.2.1 Correlation between confidence and Q3 accuracy
  3.5 Benchmark
    3.5.1 CASP datasets
    3.5.2 ASTRAL40 2.03 subset
4 Discussion
  4.1 Super class distinction
  4.2 Hyperparameters
    4.2.1 Sequence profiles
    4.2.2 Class reduction schemes
    4.2.3 Window size and classifier enhancements
  4.3 *SPARROW+ compared to other methods
  4.4 *SPARROW+ compared to its predecessor
5 Conclusions and Outlook
  5.1 Weight optimization
  5.2 Improvement of the sequence profiles
  5.3 Predictions of more than three classes
  5.4 Secondary structure profiles as input data
  5.5 Terminal residue feature type
6 Summary
7 References

dc.description.abstract

Knowing a protein's structure is an essential prerequisite for understanding its function. The rate of protein sequencing greatly exceeds the rate at which protein structures can be solved experimentally. Methods that predict protein structure from sequence are therefore in great demand. Predicting the protein secondary structure is the first step in predicting the spatial structure. In addition, knowledge of the secondary structure allows a protein to be assigned to a structural category.

In this work a new protein secondary structure prediction method, *SPARROW+, is presented. *SPARROW+ is a further development of its predecessor *SPARROW (Rasinski, 2011). Like its predecessor, it uses a vector valued classifier for prediction. The vector valued classifier projects high dimensional input data into a low dimensional classification space. Input data are classified according to the orientation of the vector valued classifier relative to class vectors. *SPARROW+ consists of two consecutive prediction steps. From a target sequence a PSI BLAST (Altschul et al., 1997) sequence profile is generated, which together with the protein sequence is the input of the first prediction step. The secondary structure prediction from the first step and the PSI BLAST profile are the combined input for the second prediction step.

*SPARROW+ achieves a Q3 accuracy of 84 % for the ASTRAL40 (Fox et al., 2014) dataset and a Q3 accuracy at least 1 % higher than that of the respective second best method for the CASP9, 10 and 11 datasets. Hence, *SPARROW+ outperforms all other currently available methods, such as PSIPRED (Buchan et al., 2013). In its current form, *SPARROW+ overestimates the coil class, the largest secondary structure class, at the expense of the strand class, which is the smallest class. This results in a high Q3 accuracy for coil and a low one for strand. However, the Matthews Correlation Coefficient (MCC) (Gorodkin, 2004) is similar for both classes.

During the development of *SPARROW+, central parameters with a major influence on the prediction quality were identified. These parameters include the choice of the BLAST (Altschul et al., 1990) database used to generate the sequence profiles with PSI BLAST and the class reduction scheme that reduces the eight DSSP (Kabsch and Sander, 1983) classes to three. Using the UniRef90 (Suzek et al., 2014) BLAST database, which limits homologs to at most 90 % sequence identity, and filtering it with pfilt (Jones and Swindells, 2002) to remove transmembrane and unordered proteins improved prediction quality. *SPARROW+ uses a class reduction scheme that accounts for the peculiarities of DSSP and new insights concerning the π helix secondary structure.

During the implementation of the enhancements of *SPARROW+ it became obvious that the size of the sequence window is of critical importance for the prediction quality. The gains in prediction quality through enhancements of the vector valued classifier, such as a second prediction step or the combination of different types of input data, depend on the window size. The smaller the considered sequence window, the greater the corresponding gains in prediction quality. Consequently, the enhancements diminish the gain in prediction quality that is obtained by increasing the window size.

A multiclass confidence measure was developed specifically for the vector valued classifier of *SPARROW+. The confidence correlates with prediction quality measures, which allows these measures to be estimated. At a confidence of 0.8 or higher, a Q3 accuracy of 90 % can be expected. Furthermore, the vector valued classifiers show different confidence distributions for true and false positive classifications.
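The classification idea summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the construction of the class vectors (unit vectors pointing to the vertices of a regular simplex, in the spirit of the "generalized tetrahedron" listed in the table of contents), the purely linear projection, and the toy scores are all assumptions made for illustration, since this record contains neither the trained weights nor the exact scoring functions. The sketch also shows how a Q3-style accuracy is commonly computed as the fraction of correctly predicted residues.

```python
import math

CLASSES = ("H", "E", "C")  # helix, strand, coil (three-class labels)


def dot(a, b):
    """Plain dot product of two equally long sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def simplex_class_vectors(n_classes):
    """Unit vectors pointing to the vertices of a regular simplex.

    Assumption: built by centering the standard basis of R^n and
    normalizing. The vectors then span an (n-1)-dimensional subspace and
    every pair has the same dot product, -1/(n-1).
    """
    vectors = []
    for k in range(n_classes):
        v = [(1.0 if i == k else 0.0) - 1.0 / n_classes for i in range(n_classes)]
        norm = math.sqrt(dot(v, v))
        vectors.append([x / norm for x in v])
    return vectors


def project(weights, features):
    """Hypothetical linear projection of a high dimensional feature vector.

    Each row of `weights` yields one component of the low dimensional score.
    """
    return [dot(row, features) for row in weights]


def classify(score_vector, class_vectors):
    """Index of the class vector best aligned with the score vector.

    All class vectors have unit length, so the largest dot product
    corresponds to the smallest angle.
    """
    projections = [dot(score_vector, c) for c in class_vectors]
    return max(range(len(projections)), key=projections.__getitem__)


def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted class equals the observed class."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


if __name__ == "__main__":
    vectors = simplex_class_vectors(len(CLASSES))
    # Toy score vectors, one per residue; real scores would come from a
    # trained classifier applied to sequence and profile features.
    toy_scores = [[0.9, -0.3, -0.6], [0.1, 0.8, -0.2], [-0.5, -0.1, 0.7]]
    predicted = "".join(CLASSES[classify(s, vectors)] for s in toy_scores)
    print(predicted, q3_accuracy(predicted, "HEE"))
```

Using equally spaced class vectors treats all classes symmetrically, which is one plausible reason for choosing a simplex; the thesis itself should be consulted for the actual definition of the class vectors and scoring functions.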

dc.description.abstract

Knowledge of a protein's structure is of utmost importance for understanding its function. The speed at which protein sequences are determined far exceeds the rate at which protein structures are solved experimentally. Methods that predict a protein's structure from its sequence are therefore in great demand. Predicting the secondary structure of a protein is the first step toward predicting its three-dimensional spatial structure. Furthermore, knowledge of a protein's secondary structure allows it to be assigned to a fold class.

In this work a new program for predicting the secondary structure of proteins, *SPARROW+, is presented. *SPARROW+ is the further development of its predecessor *SPARROW (Rasinski, 2011). As in its predecessor, a vector valued classifier is used for prediction. This classifier allows high dimensional input data to be projected into a low dimensional space. The input data are classified through the orientation of the vector valued classifier relative to class vectors. *SPARROW+ consists of two consecutive prediction steps. From the input sequence a PSI BLAST (Altschul et al., 1997) profile is generated, which together with the sequence is the input for the first stage. The prediction of the first stage and the PSI BLAST profile are the combined input data for the second stage.

*SPARROW+ achieves a Q3 accuracy of 84 % on the ASTRAL40 (Fox et al., 2014) dataset and, on the CASP9, 10 and 11 datasets, a Q3 accuracy 1 % higher than the respective second best method. *SPARROW+ is thus superior to all other current methods, such as PSIPRED (Buchan et al., 2013). In its current form *SPARROW+ overestimates the share of coil structures, the largest secondary structure class, at the expense of strand, the smallest class. This leads to a high Q3 accuracy for coil and a low one for strand, while the Matthews Correlation Coefficient (MCC) (Gorodkin, 2004) is similar for both classes.

During the development of *SPARROW+, central parameters with a large influence on the prediction quality were identified. These parameters include the choice of the BLAST (Altschul et al., 1990) database for generating the sequence profiles with PSI BLAST and the reduction scheme that reduces the eight DSSP (Kabsch and Sander, 1983) classes to three. For the BLAST database, reducing the homologs from 100 to 90 % by using the UniRef90 (Suzek et al., 2014) database, together with removing transmembrane and unordered proteins by means of pfilt (Jones and Swindells, 2002), was found to increase the prediction quality. *SPARROW+ uses a reduction scheme that takes into account the peculiarities of DSSP as well as findings concerning the π helix.

During the implementation of extensions for *SPARROW+ it became apparent that the size of the sequence window is of decisive importance for the prediction quality. The gain in prediction quality from extending the vector valued classifier with a second stage or with a combination of different types of input data depends on the window size. The smaller the window, the greater the gain in prediction accuracy. However, these extensions reduce the improvements in prediction quality obtained with increasing window size.

A multi-class confidence measure was developed specifically for the vector valued classifier of *SPARROW+. The confidence can be correlated with prediction quality measures and thus makes it possible to predict them. From a confidence of 0.8 a Q3 accuracy of 90 % can be expected. Furthermore, the vector valued classifier shows different confidence distributions for true and false positive classifications.

dc.format.extent

V, 118 pages

dc.language

eng

dc.rights.uri

http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen

dc.subject

protein

dc.subject

secondary structure

dc.subject

prediction

dc.subject

linear regression

dc.subject

vector-valued function

dc.subject.ddc

500 Natural sciences and mathematics::570 Life sciences; biology::572 Biochemistry

dc.title

Protein Secondary Structure Prediction Using a Vector Valued Classifier

dc.type

Dissertation

dcterms.format

Text

dc.contributor.contact

gerrit.korff@gmail.com

dc.contributor.firstReferee

Prof. Dr. Ernst-Walter Knapp

dc.contributor.furtherReferee

Prof. Dr. Hermann-Georg Holzhütter

dc.date.accepted

2016-01-19

dc.identifier.urn

urn:nbn:de:kobv:188-fudissthesis000000101199-8

dc.title.translated

Protein Sekundärstrukturvorhersage mittels eines vektorwertigen Klassifikators

refubium.affiliation

Biology, Chemistry, Pharmacy

refubium.mycore.fudocsId

FUDISS_thesis_000000101199

refubium.mycore.derivateId

FUDISS_derivate_000000018566

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access

This Item appears in the following Collection(s)

Dissertationen FU

Files in This Item

Dissertation_Gerrit_Korff.pdf

Size: 3.215MB

Format: PDF

Checksum (MD5): 5a5194f1bdd7057dac9e8945139053eb

Refubium - Freie Universität Berlin Repository
