dc.contributor.author
Tschisgale, Paul
dc.contributor.author
Maus, Holger
dc.contributor.author
Kieser, Fabian
dc.contributor.author
Kroehs, Ben
dc.contributor.author
Petersen, Stefan
dc.contributor.author
Wulff, Peter
dc.date.accessioned
2025-09-22T11:45:17Z
dc.date.available
2025-09-22T11:45:17Z
dc.identifier.uri
https://refubium.fu-berlin.de/handle/fub188/49486
dc.identifier.uri
http://dx.doi.org/10.17169/refubium-49208
dc.description.abstract
Large language models (LLMs) are now widely accessible, reaching learners across all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in both instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants in the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes the characteristic strengths and limitations of LLM-generated solutions. The results indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o's performance, and o1-preview almost consistently outperformed both GPT-4o and the human benchmark. The main implications of these findings are twofold: LLMs pose a challenge for summative assessment in unsupervised settings, as they can solve advanced physics problems at a level that exceeds that of top-performing students, making it difficult to ensure the authenticity of student work. At the same time, their problem-solving capabilities offer potential for formative assessment, where LLMs can support students in evaluating their own solutions to problems.
en
dc.format.extent
21 pages
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
dc.subject
Scientific reasoning & problem solving
en
dc.subject.ddc
000 Computer science, information & general works::000 Computer science, knowledge & systems::004 Data processing; computer science
dc.title
Evaluating GPT- and reasoning-based large language models on Physics Olympiad problems: Surpassing human performance and implications for educational assessment
dc.type
Scientific article
dcterms.bibliographicCitation.articlenumber
020115
dcterms.bibliographicCitation.doi
10.1103/6fmx-bsnl
dcterms.bibliographicCitation.journaltitle
Physical Review Physics Education Research
dcterms.bibliographicCitation.number
2
dcterms.bibliographicCitation.volume
21
dcterms.bibliographicCitation.url
https://doi.org/10.1103/6fmx-bsnl
refubium.affiliation
Physics
refubium.resourceType.isindependentpub
no
dcterms.accessRights.openaire
open access
dcterms.isPartOf.eissn
2469-9896
refubium.resourceType.provider
WoS-Alert