Trend Mining

Streibel, Olga Katarzyna

Title:

Trend Mining

Translated Title(s):

Trend Mining

Author(s):

Streibel, Olga Katarzyna

Year of publication:

2014

Available Date:

2014-02-03T07:14:42.678Z

Abstract:

In Information Retrieval (IR), a trend is defined as a topic area that grows in interest and utility over time. Examples are the general topic of the financial crisis, which started to appear on the market in late 2007 and early 2008, and the Arab Spring, which started to appear in the news in 2011. Several approaches based on methods from text mining and machine learning can be applied successfully to the problem of mining trends in text collections. The most popular among them are probabilistic topic models and various clustering methods. Existing research on automatic trend detection in texts has three weaknesses:

1. inconsistency in the definition of a trend
2. lack of a general scientific approach to trend mining
3. lack of integration of explicit knowledge, which makes the algorithms' results difficult to interpret

The scientific contribution of this research is the proposal to treat trend detection from the perspective of trend mining, which is defined here. To address the difficulty of interpreting the results of common trend detection techniques, this research proposes the trend template, a knowledge-based trend mining approach. Based on this trend template, two directions of implementation are introduced: the trend ontology and trend-indication (the trend weighting method). The trend ontology works as an a-priori model and enables the discovery of a trend structure in the web document corpus. Tests with this method on a test corpus show that mining trends with an a-priori model that integrates explicit knowledge yields results that are easier to interpret. The trend-indication approach is based on time-incorporating weighting methods for selecting trend features from web documents. It reduces the number of features considered in the trend mining process, and therefore reduces the data so that only time-relevant information is passed on for further analysis. The results of this method on our web document corpus show that time-based weighting functions alone can help in discovering trend-relevant features. Both the trend ontology and the trend-indication approach are implemented in the tremit tool (TREnd MIning Tool), a test tool developed for this thesis, and are tested on a test corpus. The test corpus consists of 35,635 business news items and 4,696 DAX (Deutscher Aktienindex, the German stock market index) reports published on German web sites in late 2007 and early 2008. The results are compared with those of two standard methods, an LDA-based topic model and the k-means clustering algorithm, on the same test corpus. The results are discussed in the experimental part of the thesis.
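The abstract does not state which weighting functions the trend-indication approach uses, so the following is only a minimal sketch of the general idea of time-based feature selection. It assumes a hypothetical scoring rule (the least-squares slope of a term's relative document frequency across time slices) and hypothetical names (`trend_scores`, `select_trend_features`); neither the rule nor the names are taken from the thesis or from the tremit tool.

```python
from collections import Counter, defaultdict


def trend_scores(docs, min_df=2):
    """Score terms by how their share of documents changes over time.

    docs: iterable of (time_slice, tokens) pairs, where time_slice is an
    integer index (0, 1, 2, ...) and tokens is a list of strings.
    Returns {term: slope}, the least-squares slope of the term's relative
    document frequency over the time slices (an assumed weighting rule).
    """
    docs_per_slice = Counter()
    df = defaultdict(Counter)  # term -> time slice -> documents containing it
    for time_slice, tokens in docs:
        docs_per_slice[time_slice] += 1
        for term in set(tokens):
            df[term][time_slice] += 1

    slices = sorted(docs_per_slice)
    n = len(slices)
    mean_t = sum(slices) / n
    var_t = sum((t - mean_t) ** 2 for t in slices)

    scores = {}
    for term, per_slice in df.items():
        if sum(per_slice.values()) < min_df:
            continue  # too rare to say anything about its development
        rel = [per_slice[t] / docs_per_slice[t] for t in slices]
        mean_r = sum(rel) / n
        cov = sum((t - mean_t) * (r - mean_r) for t, r in zip(slices, rel))
        scores[term] = cov / var_t if var_t else 0.0
    return scores


def select_trend_features(docs, k=3, min_df=2):
    """Keep the k terms with the steepest growth; ties break alphabetically."""
    scores = trend_scores(docs, min_df)
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Toy corpus (invented for illustration): three time slices, two documents each.
docs = [
    (0, ["dax", "gewinn", "bank"]),
    (0, ["dax", "bank"]),
    (1, ["dax", "krise", "bank"]),
    (1, ["dax", "gewinn"]),
    (2, ["dax", "krise", "bank"]),
    (2, ["krise", "dax"]),
]
print(select_trend_features(docs, k=2))
```

In this toy corpus "krise" receives the highest score because its share of documents rises in every slice, while "dax" appears in every document and therefore scores zero. Only the selected terms would be kept for later steps such as topic modelling or clustering.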

In the context of Information Retrieval (IR), a trend is a topic area that gains in utility and interest over a period of time, for example the general topic of the financial crisis in the period 2008-2012 or the Arab Spring in the period 2010-2011. Methods rooted in data mining, text mining, and machine learning are used to address the problem of trend detection in texts. Frequently used ones include probabilistic topic models and various clustering methods. The weaknesses of existing research on automatic trend detection in texts are:

1. inconsistent definitions of a trend
2. the lack of a scientific approach to trend mining
3. the lack of reference to explicit knowledge and hence poor interpretability of the results

The scientific contribution of this thesis is the proposal to view research on automatic trend detection from the perspective of trend mining, a definition of which is proposed in this thesis. As a solution to the poor interpretability of the results of common trend detection algorithms, the trend template is proposed, a knowledge-based approach to trend mining. Starting from this trend template, two directions of implementation are shown: the trend ontology and the trend-indication method. The trend ontology works on the principle of an a-priori model and enables the discovery of a trend structure in the web document corpus. Tests with this method on the test corpus show that trend detection with an a-priori model that incorporates explicit knowledge leads to qualitatively better results, above all with regard to interpretability. The trend-indication method builds on time-based weighting functions and concentrates on selecting trend features from the web documents. With this method, the dimensionality of the data under examination is reduced in a way that is meaningful for trend detection, so that only the time-relevant information from the texts is provided for further analysis. Tests with this method show that time-relevant trend terms are uncovered well by suitable weighting functions alone. Both methods are implemented in tremit (TREnd MIning Tool), the test tool developed for this thesis, and tested on the test corpus. The test corpus consists of 35,635 business news items and 4,696 DAX reports from the German-language web from the period September 2007 to April 2008. The results are compared with those of common methods, an LDA-based topic model and k-means clustering, on the same corpus and are discussed and evaluated in the experimental part of the thesis.

Identifier:

https://refubium.fu-berlin.de/handle/fub188/7212
http://dx.doi.org/10.17169/refubium-11411
urn:nbn:de:kobv:188-fudissthesis000000096106-4

Language:

English

Keywords:

trend mining
trend template
trend ontology
trend indication
knowledge-based model
tremit tool
temporal
analysis

DDC-Classification:

000 Computer science, knowledge, systems
004 Data processing; computer science

Publication Type:

Dissertation

Department/institution:

Mathematics and Computer Science

This Item appears in the following Collection(s)

Dissertationen FU

Files in This Item

streibel-diss-online-1.pdf

Size: 47.39MB

Format: PDF

Checksum (MD5): 754f36a7fd9b6a969398e6c025fc2d38
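
A downloaded copy of the file can be compared against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_record` are hypothetical, and the local path in the usage line is an assumption about where the file was saved.

```python
import hashlib

# Checksum as listed in this record for streibel-diss-online-1.pdf
EXPECTED_MD5 = "754f36a7fd9b6a969398e6c025fc2d38"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file at path has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Usage, assuming the PDF was saved to the current directory: `matches_record("streibel-diss-online-1.pdf")`. MD5 is suitable here only for detecting accidental corruption during download, not for protection against deliberate tampering.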

License

http://www.fu-berlin.de/sites/refubium/rechtliches/Nutzungsbedingungen

Refubium - Freie Universität Berlin Repository
