Despite the complexity of real-world environments, natural vision is seamlessly efficient. To explain this efficiency, researchers often use predictive processing frameworks, in which perceptual efficiency is determined by the match between the visual input and internal models of what the world should look like. In scene vision, predictions derived from our internal models of a scene should play a particularly important role, given the highly reliable statistical structure of our environment. Despite the importance of these models for scene perception, we still do not fully understand what they contain. Here, we highlight that the current literature disproportionately focuses on an experimental approach that tries to infer the contents of internal models from arbitrary, experimenter-driven manipulations of stimulus characteristics. To make progress, additional participant-driven approaches are needed, focusing on participants’ descriptions of what constitutes a typical scene. We discuss how recent studies on memory and perception have used methods like line drawings to characterize internal representations in unconstrained ways and at the level of individual participants. These emerging methods show that it is now time to also study natural scene perception from a different angle, starting with a characterization of an individual’s expectations about the world.