dc.contributor.author
Fillies, Jan
dc.contributor.author
Teich, Maximilian
dc.contributor.author
Karam, Naouel
dc.contributor.author
Paschke, Adrian
dc.contributor.author
Rehbein, Malte
dc.date.accessioned
2026-01-26T10:37:13Z
dc.date.available
2026-01-26T10:37:13Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/50570
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-50297
dc.description.abstract
As the availability of historical biodiversity data continues to grow, ensuring its usability through adherence to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) has become increasingly essential. This study focuses on key challenges in interpreting biodiversity data from historical texts, particularly identifying common species names and aligning them with their modern scientific counterparts. We address five main challenges: spelling variations, the invention of new terms, semantic shifts between broad and narrow naming conventions, and the renaming or reclassification of historical terms. To tackle these issues, we tested a range of large language models (LLMs) (GPT-4, LLaMA3-405B, Mistral-8B, and Qwen3-30B-A3B) for their ability to resolve these challenges and support terminology alignment. The initial entity detection was performed using GPT-4o, which achieved a 92% success rate in detecting historical common names and correctly identified 98% of scientific terms on a test dataset. A comparative evaluation of the models' ability to match historical common names with modern equivalents showed that GPT-4o consistently delivered the most accurate and nuanced outputs across four of the five challenges, demonstrating strong contextual understanding. The results highlight the potential of advanced LLMs not only to identify entities but also to interpret historical naming conventions, thereby enhancing the reusability and interoperability of biodiversity data in line with the FAIR principles.
en
dc.format.extent
8 pages
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
dc.subject
Large Language Models
en
dc.subject
FAIR Principles
en
dc.subject
Language Standardization
en
dc.subject
Data Interoperability
en
dc.subject
Historic Data
en
dc.subject
Semantic Annotation
en
dc.subject.ddc
000 Computer science, information science, general works::000 Computer science, knowledge, systems::004 Data processing; computer science
dc.title
Historic to FAIR: Leveraging LLMs for Historic Term Identification and Standardization
dc.type
Scientific article
dc.title.translated
Historisch zu FAIR: Einsatz von LLMs zur Identifikation und Standardisierung historischer Begriffe
de
dcterms.bibliographicCitation.doi
10.1007/s13222-025-00519-3
dcterms.bibliographicCitation.journaltitle
Datenbank-Spektrum
dcterms.bibliographicCitation.number
3
dcterms.bibliographicCitation.pagestart
179
dcterms.bibliographicCitation.pageend
186
dcterms.bibliographicCitation.volume
25
dcterms.bibliographicCitation.url
https://doi.org/10.1007/s13222-025-00519-3
refubium.affiliation
Mathematik und Informatik
refubium.affiliation.other
Institut für Informatik

refubium.funding
Springer Nature DEAL
refubium.note.author
Funded by open-access funds of Freie Universität Berlin.
refubium.resourceType.isindependentpub
no
dcterms.accessRights.openaire
open access
dcterms.isPartOf.eissn
1610-1995