Fast Similarity Search in XML Data

Schlieder, Torsten

Fast Similarity Search in XML Data

Metadata

dc.contributor.author

Schlieder, Torsten

dc.date.accessioned

2018-06-07T23:15:44Z

dc.date.available

2003-05-07T00:00:00.649Z

dc.date.issued

2003

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/10259

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-14457

dc.description

Title Page and Contents 1. Introduction 1 2. State of the Art 9 3. The approXQL Query Language 31 4. Modeling Documents and Queries 41 5. Querying by Approximate Tree Embedding 51 6. Direct Query Evaluation 67 7. Schema-Driven Query Evaluation 109 8. Efficient Algorithms for Plan Operators 127 9. The approXQL Query Engine 145 10. Experimental Efficiency Analysis 153 11. Conclusion 175 Bibliography 183 A The Grammar of approXQL 199 B Symbols used in the Thesis 201 C Anlagen gemäß Promotionsordnung 205

dc.description.abstract

The eXtensible Markup Language (XML) is a widely accepted standard for the representation of data. The more data is stored in XML documents, the more important become methods for effective and efficient searching. An important characteristics of XML documents is their self-describing structure. Queries that specify selection conditions for the structure promise to greatly improve the precision of the search. However, the use of the structure can also be problematic, because it is hard for users to learn all of the details of the often complex and heterogeneous structure required to phrase a query, and because structural selection conditions often lead to overspecified queries that miss relevant results. In this thesis, we propose an innovative method for searching in XML data, which uses the descriptive structure as a guide to locate the requested information. A user needs only partial knowledge of the structure to formulate queries that specify conditions on both the content and structure of documents. A query is interpreted in such a way that it retrieves not only exact matches, but also results considered to be similar to the query. To find the similar results, sequences of transformations are applied to the query so that its structure is adapted to the structure of each document in the collection. Each transformation within a sequence has a cost; the total cost of a sequence measures the similarity between the original query and a document matched by the transformed query. This total cost is assigned to the document and determines its position in the list of results, which is sorted by decreasing similarity. By adjusting the costs, the interpretation of queries can be tailored to the needs of different users, and also to the varied characteristics of XML documents. We present all necessary algorithms and data structures to implement a query processor that answers a query in polynomial - typically sublinear - time with respect to the size of the database. For a given query, the query processor creates a compact query-execution plan that represents all possible query transformations. It evaluates the plan by executing operators that successively calculate the transformation costs for each document in the collection. We present techniques to effectively optimize the evaluation of query-execution plans by exploiting equivalences between operators. To reduce the query-evaluation times even more, we propose a method to retrieve the best n results, without computing similarity scores for all documents in the collection. This method uses a structural summary of the data to estimate the best k transformed queries, which are successively evaluated until the best n results are found. The theoretical concepts are validated by a prototypical implementation. We describe the architecture of the prototype, and discuss the results of systematic tests carried out to analyze the evaluation times for a representative set of queries with respect to various collections of real and synthetic XML documents.

dc.description.abstract

Die Markup-Sprache XML ist ein weithin akzeptierter Standard zur Darstellung von Daten. Je mehr Daten in Form von XML-Dokumenten vorliegen, desto wichtiger werden effektive und effiziente Verfahren zur Informationssuche. Ein für die Suche sehr wichtiges Merkmal von XML-Dokumenten ist deren selbstbeschreibende Struktur. Durch Ausnutzung dieser Struktur können sehr präzise Suchanfragen formuliert werden. Doch leider kann die Dokumentstruktur sehr komplex und heterogen sein. Dem Nutzer sind oftmals nicht alle strukturellen Details bekannt, so dass die Formulierung von treffenden Anfragen schwierig ist und nicht selten relevante, aber nicht exakt zur Anfrage passende Dokumente verfehlt werden. In dieser Abeit stellen wir einen innovativen Ansatz zur Suche in XML-Daten vor, der die selbstbeschreibende Struktur als Leitfaden zum Auffinden der gewünschten Information verwendet. Ein Nutzer benötigt lediglich partielle Kenntnisse über die Struktur, um Anfragen zu formulieren, die Selektionsbedingungen bezüglich Inhalt und Struktur der Dokumente spezifizieren. Eine Anfrage wird derart interpretiert, dass nicht nur exakt passende Ergebnisse gefunden werden, sondern auch solche, die als ähnlich zur Anfrage eingeschätzt werden. Um diese ähnlichen Ergebnisse zu finden, wird die Anfrage mit Hilfe von Transformationssequenzen an die Struktur eines jeden Dokuments in der Kollektion angepasst. Dabei hat jede Einzeltransformation bestimmte Kosten. Die Gesamtkosten einer Sequenz werden zur Bewertung der Ähnlichkeit zwischen der originalen Anfrage und den von der transformierten Anfrage selektierten Dokumenten verwendet. Diese Kosten werden den Dokumenten zugewiesen und bestimmen deren Positionen in der Ergebnisliste, die nach fallender Ähnlichkeit sortiert ist. Wir stellen alle notwendigen Algorithmen und Datenstrukturen für die Realisierung eines Anfrage-Prozessors vor, der Suchanfragen in polynomieller - typischerweise sogar sublinearer - Zeit bezüglich der Kollektionsgröße beantwortet. Der Anfrage-Prozessor wertet Ausführungspläne bestehend aus Operatoren aus. Die Operatoren berechnen sukzessive die Transformationskosten. Wir beschreiben Techniken zur effektiven Optimierung von Ausführungsplänen, die Äquivalenzen zwischen den Operatoren ausnutzen. Um die Ausführungszeit für eine Anfrage weiter zu verringern, schlagen wir eine Methode zum effizienten Auffinden der besten n Ergebnisse vor, die eine strukturelle Zusammenfassung der Dokumente in der Kollektion verwendet, um die besten k transformierten Anfragen zu bestimmen. Diese Anfragen werden nacheinander ausgewertet, bis die besten n Ergebnisse gefunden sind. Eine prototypische Implementierung bestätigt die Effizienz der entwickelten Methoden zur Anfrage-Auswertung. Wir beschreiben die Architektur des Prototyps und diskutieren die Ergebnisse von systematischen Tests zur Analyse von Anfrage-Auswertungszeiten für verschiedene Kollektionen mit realen und synthetischen XML-Dokumenten.

dc.language

eng

dc.rights.uri

http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen

dc.subject

XML

dc.subject

similarity search

dc.subject

query language

dc.subject

semistructured data

dc.subject

H.3.3

dc.subject.ddc

000 Informatik, Informationswissenschaft, allgemeine Werke::000 Informatik, Wissen, Systeme::004 Datenverarbeitung; Informatik

dc.title

Fast Similarity Search in XML Data

dc.type

Dissertation

dcterms.format

Text

dc.contributor.gender

dc.contributor.firstReferee

Prof. Dr. Heinz F. Schweppe

dc.contributor.furtherReferee

Prof. Dr. Myra Spiliopoulou

dc.contributor.furtherReferee

Prof. Dr. Klaus U. Schulz

dc.date.accepted

2003-04-14

dc.date.embargoEnd

2003-05-12

dc.identifier.urn

urn:nbn:de:kobv:188-2003001085

dc.title.translated

Schnelle Ähnlichkeitssuche in XML-Daten

refubium.affiliation

Mathematik und Informatik

refubium.mycore.fudocsId

FUDISS_thesis_000000000957

refubium.mycore.transfer

http://www.diss.fu-berlin.de/2003/108/

refubium.mycore.derivateId

FUDISS_derivate_000000000957

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access

Show Simple Item Record

This Item appears in the following Collection(s)

Dissertationen FU

Files in This Item

00_title.pdf

Description: FUDISS_derivate_000000000957

Size: 44.76KB

Format: PDF

Checksum (MD5): b1239ff78e69e50fde5a566151d3b51a

Refubium - Freie Universität Berlin Repository

Fast Similarity Search in XML Data

Metadata

This Item appears in the following Collection(s)

Files in This Item

Export metadata