Humans are remarkably fast at processing scenes and making decisions based on the information they contain. Within a few hundred milliseconds of viewing a scene, the brain can extract the most important information through a hierarchical cascade starting with perceptual attributes (color, edges, etc.) and ending with abstract properties (category, relationship between objects, etc.), eventually supporting decision-making. Despite the central role of scene processing, many aspects of how it unfolds in the brain remain poorly understood. In particular, the intermediate stages linking perceptual and abstract scene understanding, i.e., mid-level feature processing, are largely unresolved. Moreover, the link between neural activity and behavior, i.e., when, where, and what kind of scene information arising in the brain influences decision-making, remains unclear. This thesis addresses these gaps through three studies combining empirical and computational methods. In Study 1, we used a novel stimulus set to reveal that various mid-level features of scenes are processed in humans between ∼100 ms and ∼250 ms after stimulus onset, bridging low- and high-level feature representations, with a temporal hierarchy that is mirrored by convolutional neural networks (CNNs). In Study 2, we showed that neural representations of scenes are suitably formatted for behavioral readout of scene naturalness between ∼100 ms and ∼200 ms, i.e., in the intermediate processing stages, and that intermediate CNN layers correlated best with the neural representations in this time window, suggesting that mid-level features underlie behaviorally relevant representations.
In Study 3, we showed that neural representations of scenes are suitably formatted for behavioral readout of scene naturalness in early visual cortex and in object-selective high-level cortex, and that intermediate CNN layers best explain this brain-behavior relationship, indicating that behaviorally relevant representations in these areas are driven by mid-level features. Taken together, the studies in this thesis reveal the timing, spatial localization, and behavioral relevance of mid-level feature representations in scene processing, contributing to a better understanding of how the human brain extracts information from the surrounding world.