dc.contributor.author
Jansen, Jacqueline A.
dc.contributor.author
Manukyan, Artur
dc.contributor.author
Al Khoury, Nour
dc.contributor.author
Akalin, Altuna
dc.date.accessioned
2025-04-10T08:56:40Z
dc.date.available
2025-04-10T08:56:40Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/47278
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-46996
dc.description.abstract
Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data analysis and its subsequent interpretation are vital for understanding complex biological processes and developing new treatments and diagnostics. One possible solution to this shortage is to use Large Language Models (LLMs) to generate data analysis pipelines. However, although LLMs have shown great potential in code generation tasks, questions remain about their accuracy when prompted with domain-expert questions, such as omics-related data analysis questions. To address this, we developed mergen, an R package that leverages LLMs for data analysis code generation and execution. We evaluated the performance of this data analysis system on a variety of genomics data analysis tasks. Our primary goal is to enable researchers to conduct data analysis by simply describing their objectives and the desired analyses for specific datasets in plain text. Our approach improves code generation via specialized prompt engineering and error feedback mechanisms. In addition, our system can execute the data analysis workflows prescribed by the LLM, providing the results of the workflow for human review. Our evaluation reveals that while LLMs effectively generate code for some data analysis tasks, challenges remain in generating executable code, especially for complex tasks. The best performance was achieved with the self-correction mechanism, which increased the percentage of executable code over the simple strategy by 22.5% for tasks of complexity 2; for tasks of complexity 3, 4, and 5, the increase was 52.5%, 27.5%, and 15%, respectively. A chi-squared test showed that the differences between the prompting strategies were significant.
Our study contributes to a better understanding of LLM capabilities and limitations, providing software infrastructure and practical insights for their effective integration into data analysis workflows.
en
dc.format.extent
17 pages
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
dc.subject
Bioinformatics
en
dc.subject
Data visualization
en
dc.subject
Genome complexity
en
dc.subject.ddc
000 Computer science, information, general works::000 Computer science, knowledge, systems::004 Data processing; computer science
dc.title
Leveraging large language models for data analysis automation
dc.type
Scientific article
dcterms.bibliographicCitation.articlenumber
e0317084
dcterms.bibliographicCitation.doi
10.1371/journal.pone.0317084
dcterms.bibliographicCitation.journaltitle
PLoS ONE
dcterms.bibliographicCitation.number
2
dcterms.bibliographicCitation.volume
20
dcterms.bibliographicCitation.url
https://doi.org/10.1371/journal.pone.0317084
refubium.affiliation
Mathematik und Informatik
refubium.affiliation.other
Institut für Bioinformatik
refubium.resourceType.isindependentpub
no
dcterms.accessRights.openaire
open access
dcterms.isPartOf.eissn
1932-6203
refubium.resourceType.provider
WoS-Alert