An Incrementally Trainable Statistical Approach to Information Extraction
Based on Token Classification and Rich Context Models

Siefkes, Christian

Metadata

dc.contributor.author

Siefkes, Christian

dc.date.accessioned

2018-06-07T21:47:53Z

dc.date.available

2007-02-20T00:00:00.649Z

dc.date.issued

2007

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/8437

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-12636

dc.description

Title, Abstract & TOC
1 Introduction
I The Field of Information Extraction
  2 Information Extraction
  3 Architecture and Workflow
  4 Statistical Approaches
  5 Non-Statistical Approaches
  6 Comparison of Existing Approaches
II Analysis
  7 Aims and Requirements
  8 Assumptions
  9 Target Schemas and Input/Output Models
III Algorithms and Models
  10 Modeling Information Extraction as a Classification Task
  11 Classification Algorithm and Feature Combination Techniques
  12 Preprocessing and Context Representation
  13 Merging Conflicting and Incomplete XML Markup
  14 Weakly Hierarchical Extraction
IV Evaluation
  15 Evaluation Goals and Metrics
  16 Text Classification Experiments
  17 Extraction of Attribute Values
  18 Ablation Study and Utility of Incremental Training
  19 Comparison of Tagging Strategies
  20 Weakly Hierarchical Extraction
  21 Mistake Analysis
V Conclusions
  22 Conclusion and Outlook
Bibliography
Appendix
  A Schema for Augmented Text
  B Curriculum Vitae
  C Zusammenfassung in deutscher Sprache

dc.description.abstract

Most of the information stored in digital form is hidden in natural language (NL) texts. While information retrieval (IR) helps to locate documents that might contain the facts needed, it provides no way to answer queries directly. The purpose of information extraction (IE) is to find desired pieces of information in NL texts and store them in a form that is suitable for automatic querying and processing. The goal of this thesis has been the development and evaluation of a trainable statistical IE approach. This approach introduces new functionality not supported by current IE systems, such as support for incremental training, which reduces the human training effort by allowing a more interactive workflow. The IE system introduced in this thesis is designed as a generic framework for statistical classification-based information extraction that allows all core components (such as the classification algorithm, context representations, and tagging strategies) to be modified and exchanged independently of each other. The thesis includes a systematic analysis of switching one such component (the tagging strategies). Several new sources of information are explored for improving extraction quality. In particular, we introduce rich tree-based context representations that combine document structure and generic XML markup with more conventional linguistic and semantic sources of information. Preparing these rich context representations makes it necessary to unify various, partially conflicting sources of information (such as structural markup and linguistic annotations) in XML-style trees. For this purpose, we develop a merging algorithm that can repair nesting errors and related problems in XML-like input. As the core of the classification-based IE approach, we introduce a generic classification algorithm (Winnow+OSB) that combines online learning with novel feature combination techniques.
We show that this algorithm is suitable not only for information extraction, but also for other tasks such as text classification. Among other good results, the classifier was found to be one of the two best filters submitted for the 2005 Spam Filtering Task of the Text REtrieval Conference (TREC). The thesis includes a detailed evaluation of the resulting IE system, which shows that the results reached by our system are better than or competitive with those of other state-of-the-art IE systems. The evaluation includes an ablation study that measures the influence of various factors on the overall results and finds that all of them contribute to the good results of our system. It also includes an analysis of the utility of interactive incremental training, which confirms that this newly introduced training regimen can be very helpful for reducing the human training effort. The quantitative evaluation is complemented by an analysis of the kinds of mistakes made during extraction and their likely causes. This analysis allows a better understanding of where and how further improvements in information extraction quality can be expected and which limits might exist for information extraction systems in general.
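The abstract names Winnow+OSB without spelling out its mechanics. As a rough illustration only, the sketch below pairs the textbook Winnow rule (multiplicative weight updates applied only on mistakes, which is what makes training incremental) with orthogonal sparse bigram (OSB) features as they are commonly described: each token is paired with each of a few preceding tokens, and the gap between them is recorded. The names `osb_features` and `Winnow`, the window size, the threshold (the number of active features), and the promotion and demotion factors are all assumptions chosen for the example, not the thesis implementation or its parameter settings.

```python
def osb_features(tokens, window=5):
    """Pair each token with up to window-1 preceding tokens.

    Each feature is (earlier_token, skipped_count, current_token), so the
    same two words at different distances yield different features.
    """
    feats = []
    for i, tok in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            feats.append((tokens[j], dist - 1, tok))
    return feats


class Winnow:
    """Binary Winnow-style classifier with mistake-driven updates (illustrative)."""

    def __init__(self, promotion=1.5, demotion=0.5):
        self.weights = {}           # unseen features implicitly weigh 1.0
        self.promotion = promotion  # assumed value, applied on false negatives
        self.demotion = demotion    # assumed value, applied on false positives

    def predict(self, feats):
        active = set(feats)
        score = sum(self.weights.get(f, 1.0) for f in active)
        # Assumed threshold: positive if the average active weight exceeds 1.
        return score > len(active)

    def train(self, feats, label):
        """Process one example; weights change only if the prediction was wrong."""
        active = set(feats)
        if self.predict(active) == label:
            return
        factor = self.promotion if label else self.demotion
        for f in active:
            self.weights[f] = self.weights.get(f, 1.0) * factor


# Usage: examples can be fed one at a time, in any order, as they are labeled.
clf = Winnow()
clf.train(osb_features("buy cheap pills now".split()), True)
clf.train(osb_features("meeting notes attached here".split()), False)
```

Because each call to `train` touches only the weights of the features present in that one example, a model of this kind can be corrected interactively while a user annotates documents, which is the workflow that incremental training is meant to support.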

dc.description.abstract

A large part of the information available in digital form today exists as natural language texts. The goal of information extraction (IE) is to extract specific desired information from such texts and to store it in a form that permits structured queries (in contrast to information retrieval, which focuses on searching for documents and document fragments). This dissertation develops a trainable statistical information extraction system. Unlike previous approaches, our system can be trained incrementally, which reduces the human training effort. The system is designed as a generic framework: all components of the classification-based information extraction model can be modified and exchanged independently of each other. The systematic exchange of one of these components (the tagging strategies) is examined as part of the thesis. Several new sources of information are investigated to improve extraction quality. The use of rich context representations based on tree structures allows us to exploit document structure as a source of information in addition to semantic and linguistic information. To bring the various, partially contradictory structures into a uniform tree structure, we develop a merging algorithm for XML that can resolve nesting conflicts and other errors. As the core of the classification-based approach, we introduce a generic classification algorithm (Winnow+OSB) that combines online learning with a new kind of extended bigrams. We show that this algorithm is suitable not only for information extraction but also for other applications such as text classification: in the spam filtering competition of the Text REtrieval Conference (TREC) 2005, it achieved one of the two best results.
The thesis includes a detailed evaluation of our extraction system, which shows that it achieves results comparable to or better than those of other modern approaches. We also examine the influence of various factors and sources of information on the overall system, with the result that all of them play a positive role. Furthermore, we measure the usefulness of the interactive incremental training that we propose; this confirms that the human training effort can be greatly reduced in this way. Complementing the quantitative evaluation, we analyze the errors that occur and their presumed causes, which allows a better understanding of possible improvements and of what are presumably more fundamental limitations of information extraction.

dc.language

eng

dc.rights.uri

http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen

dc.subject

information extraction

dc.subject

classification-based extraction

dc.subject

statistical methods

dc.subject

natural language processing

dc.subject

incremental training

dc.subject

H.3.1

dc.subject.ddc

000 Computer science, information, general works::000 Computer science, knowledge, systems::004 Data processing; computer science

dc.title

An Incrementally Trainable Statistical Approach to Information Extraction Based on Token Classification and Rich Context Models

dc.type

Dissertation

dcterms.format

Text

dc.contributor.gender

dc.contributor.firstReferee

Prof. Dr. Heinz F. Schweppe

dc.contributor.furtherReferee

Prof. Dr. Bernhard Thalheim

dc.date.accepted

2007-02-16

dc.date.embargoEnd

2007-02-21

dc.identifier.urn

urn:nbn:de:kobv:188-fudissthesis000000002705-8

dc.title.translated

Ein inkrementell trainierbarer statistischer Ansatz zur Informationsextraktion basierend auf Tokenklassifikation und reichhaltigen Kontextmodellen

refubium.affiliation

Mathematics and Computer Science

refubium.mycore.fudocsId

FUDISS_thesis_000000002705

refubium.mycore.transfer

http://www.diss.fu-berlin.de/2007/173/

refubium.mycore.derivateId

FUDISS_derivate_000000002705

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access

This document appears in:

FU Dissertations

Files for this resource

00_siefkes.pdf

Description: FUDISS_derivate_000000002705

Size: 151.2 KB

Format: PDF

Checksum (MD5): 8f69784146de6211a643b2e4fa185c4b

Refubium - Repository of Freie Universität Berlin
