Usage-dependent maintenance of structured Web data sets

Luczak-Rösch, Markus

Metadata

dc.contributor.author

Luczak-Rösch, Markus

dc.date.accessioned

2018-06-07T22:59:22Z

dc.date.available

2014-02-26T09:19:32.726Z

dc.date.issued

2014

dc.identifier.uri

https://refubium.fu-berlin.de/handle/fub188/9899

dc.identifier.uri

http://dx.doi.org/10.17169/refubium-14097

dc.description.abstract

The Web of Data is the current form of the Semantic Web, one that has gained momentum outside the research community and become publicly visible. The Web of Data does not, however, fully exploit the originally intended technology stack. Instead, the so-called Linked Data design issues, which are the basis for the Web of Data, rely on much more lightweight technologies. Openly available structured Web data sets are only beginning to be used in real-world applications. The Linked Data research community pursues the overall goal of approaching the Web-scale data integration problem in a way that distributes effort among three contributing stakeholders on the Web of Data: the data publishers, the data consumers, and third parties. This includes methods and tools to publish and consume Linked Data on the Web, to deal with data quality issues in such structured Web data sets, and to improve the linkage between data sets as well as the mappings between underlying vocabularies. Web ontologies are the Web-compatible and machine-processable schema for structured Web data. With reference to the Semantic Web standards, one can claim that a structured Web data set as a whole is nothing other than a specific form of ontology, one that applies subsets of the conceptual knowledge of numerous ontologies to populate instance data. Research on ontology engineering advances the processes, guidelines, and tools for the creation and management of ontologies and has evolved from describing their development from scratch towards integrated life cycle support. Situated at the interface between research on ontology engineering and research on structured Web data, this thesis analyzes the usage of structured Web data sets with the goal of supporting data publishers in managing maintenance activities as part of the data set life cycle. We focus on data sets that follow the Linked Data design issues and offer a SPARQL endpoint to query the data.
Within this frame we design, develop, instantiate, and study approaches to answer the following three research questions: (1) What are the blind spots between ontology engineering and Linked Data? (2) How can classical Web usage mining methods be applied in the context of structured Web data sets, and how does that affect managing data set maintenance? (3) To what extent can usage-dependent metrics help to assess the quality of a Web data set? Since the scope of this thesis spans more than one discrete research area, its contributions do so as well. Furthermore, it is characteristic of research on methodological innovation that not only are process guidelines designed in theory, but the necessary tools are also built when an approach cannot be realized with state-of-the-art technology, which likewise results in multiple contributions. This thesis contributes an analytical survey among a representative set of Linked Open Data providers to understand the role of ontology engineering methodologies in structured Web data publication and management. Based on the requirements derived from this study, we propose a methodology that describes the processes, activities, methods, and tools of a usage-dependent data set life cycle. The evaluation of data set quality is a core component of our proposed methodology. We present an instantiation of this by performing Web usage mining on SPARQL query logs. For classical server log files that contain SPARQL queries, we introduce a preprocessing and enrichment algorithm that allows for an in-depth analysis of successful and failing queries as well as their atomic parts. A statistical framework on top of our enriched usage database features a plain and a hierarchical data quality score function. The evaluation takes multiple perspectives and rests on three pillars.
First, the data set life cycle is put into context with the most recent and most established methodologies for ontology and data set development or management by applying the state-of-the-art framework for the qualitative evaluation of ontology engineering methodologies and the ONTOCOM cost model. Second, the representativeness of our data quality framework is critically discussed by comparing it to the state of the art in data quality research, which is based on empirical evidence about the importance of particular quality dimensions for the data consumer. Third, in a number of experiments we analyzed real-world log files of different Linked Open Data data sets to demonstrate the applicability of the entire approach in practice and to gain an understanding of how our usage-dependent data quality score functions perform in comparison to the state of the art in data set maintenance as a baseline procedure.

dc.description.abstract

With the Linked Data principles, a paradigm has become established over the past several years that describes how, in full conformance with the Web architecture, structured data are published in so-called data sets and Web resources are related to one another across the boundaries of individual data sets. These Web data can be called structured because all atomic parts of the data, which are expressed as triples, are assigned URIs as globally unique identifiers. Because the URIs resolve to the corresponding base vocabularies, this amounts to a complete typing of the instance and schema levels that is retrievable via the Web architecture. Taking the Semantic Web standards into account, one can conclude that a data set is nothing other than a special form of Web ontology. The research discipline of ontology engineering has for several decades been concerned with developing and evaluating standardized process and life cycle models for the development of ontologies. In the domain of information systems, ontologies are a means of representing knowledge about areas of interest in a structured, machine-processable form. This thesis conducts research at the interface between ontology engineering and Linked Data, and I pursue the overarching hypothesis that principles of ontology engineering can be transferred to Linked Data. To establish the relevance of the problem area, a study on the application of established ontology engineering methods in the context of Linked Data was first conducted in the form of an online survey of Linked Data providers. From the results of the study, requirements for data set life cycles were derived, on the basis of which an abstract life cycle was designed, described in detail, and instantiated in order to investigate its application experimentally.
Here the evaluation of data sets in relation to their usage plays a particular role. Feedback on the usage of a data set is obtained from server log files that contain the SPARQL queries executed against the data set. This touches on the field of Web usage mining in the context of structured Web data and raises questions about determining data quality on the basis of a data set's usage. The thesis proposes a statistical framework for determining data quality from usage data that covers selected data quality dimensions from the state of the art. The designed life cycle is evaluated by applying rigorous methods from ontology engineering research, which allow methodologies to be compared qualitatively and cost functions to be determined for the individual phases of the ontology life cycle. In addition, an experimental study is presented in which real usage data from different data sets in the Linked Open Data cloud are analyzed. The baseline against which the proposed approach is compared is the current practice of data set maintenance, as carried out and documented, for example, in the DBpedia project.

dc.format.extent

XV, 235 pp.

dc.language

eng

dc.rights.uri

http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen

dc.subject

data management

dc.subject

web usage mining

dc.subject

data quality

dc.subject

linked data

dc.subject

life cycle

dc.subject

data engineering: data set maintenance

dc.subject

dataspaces

dc.subject.ddc

000 Computer science, information and general works::000 Computer science, knowledge and systems::004 Data processing and computer science

dc.subject.ddc

000 Computer science, information and general works::000 Computer science, knowledge and systems::005 Computer programming, programs and data

dc.title

Usage-dependent maintenance of structured Web data sets

dc.type

Dissertation

dcterms.format

Text

dc.contributor.contact

mail@markus-luczak.de

dc.contributor.gender

dc.contributor.firstReferee

Prof. Dr.-Ing. Robert Tolksdorf

dc.contributor.furtherReferee

Natalya F. Noy, PhD

dc.contributor.furtherReferee

Dr. rer. nat. Elena Simperl

dc.date.accepted

2014-01-13

dc.identifier.urn

urn:nbn:de:kobv:188-fudissthesis000000096138-5

dc.title.translated

Nutzungsbasierte Wartung strukturierter Web-Datensätze

refubium.affiliation

Mathematics and Computer Science

refubium.mycore.fudocsId

FUDISS_thesis_000000096138

refubium.note.author

The research data on which this thesis is based were removed for data protection reasons. They are available as anonymized raw data as part of the USEWOD research data set at http://usewod.org/data-sets.html

refubium.mycore.derivateId

FUDISS_derivate_000000014794

refubium.mycore.derivateId

FUDISS_derivate_000000014797

dcterms.accessRights.dnb

free

dcterms.accessRights.openaire

open access
