dc.contributor.author
Tschisgale, Paul
dc.contributor.author
Maus, Holger
dc.contributor.author
Kieser, Fabian
dc.contributor.author
Kroehs, Ben
dc.contributor.author
Petersen, Stefan
dc.contributor.author
Wulff, Peter
dc.date.accessioned
2025-09-22T11:45:17Z
dc.date.available
2025-09-22T11:45:17Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/49486
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-49208
dc.description.abstract
Large language models (LLMs) are now widely accessible, reaching learners across all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in both instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants in the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes the characteristic strengths and limitations of LLM-generated solutions. The results indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o's performance, and o1-preview almost consistently outperformed both GPT-4o and the human benchmark. The main implications of these findings are twofold: LLMs pose a challenge for summative assessment in unsupervised settings, as they can solve advanced physics problems at a level that exceeds that of top-performing students, making it difficult to ensure the authenticity of student work. At the same time, their problem-solving capabilities offer potential for formative assessment, where LLMs can support students in evaluating their own solutions to problems.
en
dc.format.extent
21 pages
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
dc.subject
Scientific reasoning & problem solving
en
dc.subject.ddc
000 Computer science, information & general works::000 Computer science, knowledge & systems::004 Data processing; computer science
dc.title
Evaluating GPT- and reasoning-based large language models on Physics Olympiad problems: Surpassing human performance and implications for educational assessment
dc.type
Scientific article
dcterms.bibliographicCitation.articlenumber
020115
dcterms.bibliographicCitation.doi
10.1103/6fmx-bsnl
dcterms.bibliographicCitation.journaltitle
Physical Review Physics Education Research
dcterms.bibliographicCitation.number
2
dcterms.bibliographicCitation.volume
21
dcterms.bibliographicCitation.url
https://doi.org/10.1103/6fmx-bsnl
refubium.affiliation
Physics
refubium.resourceType.isindependentpub
no
dcterms.accessRights.openaire
open access
dcterms.isPartOf.eissn
2469-9896
refubium.resourceType.provider
WoS-Alert