We live in a structured world in which objects rarely exist in isolation and instead typically appear in similar environments. When objects consistently co-occur with particular other objects and scene contexts, our neural systems can implicitly extract and learn these regularities. Predictive processing theories propose that the brain uses learned statistical regularities to predict the structure of incoming sensory input across space and time during visual processing. Such predictions may allow us to recognize objects and understand scenes efficiently, thereby forming coherent visual experiences in natural vision. In this dissertation, we conducted three studies that used neuroimaging techniques (EEG and fMRI) and multivariate pattern analysis (MVPA) to explore how the brain uses real-world structure to create coherent visual experiences. Study 1 investigated how scene context affects object processing across time by recording EEG signals while participants viewed objects that were semantically consistent or inconsistent with the scenes in which they appeared. The results revealed that semantically consistent scenes facilitate object representations, but that this facilitation is task-dependent rather than automatic. Study 2 investigated how cortical feedback mediates the integration of visual information across space by manipulating the spatiotemporal coherence of naturalistic video stimuli shown in both visual hemifields. By analytically combining EEG and fMRI data, we demonstrated that spatial integration of naturalistic visual inputs is mediated by cortical feedback in alpha dynamics that fully traverse the visual hierarchy. Study 3 further investigated what level of spatiotemporal coherence is needed to trigger such integration-related alpha dynamics. The findings suggest that these dynamics are flexible enough to accommodate information from videos belonging to the same basic-level category.
Together, these studies provide multimodal evidence that contextual information facilitates object perception and scene integration, highlighting the critical role of predictions based on real-world regularities in constructing coherent visual experiences.