This paper presents a convolutional neural network (CNN)-based enhancement to inter prediction in Versatile Video Coding (VVC). Our approach aims to improve the prediction signal of inter blocks with a residual CNN that takes both spatial and temporal reference samples as input. It is motivated by two theoretical considerations: neural network-based methods offer a higher degree of signal adaptivity than conventional signal processing methods, and spatially neighboring reference samples can improve the prediction signal by adapting it to the reconstructed signal in its immediate vicinity. We show that adding a polyphase decomposition stage to the CNN results in a significantly better trade-off between computational complexity and coding performance. Incorporating spatial reference samples in the inter prediction process is challenging: because the input of the CNN for one block may depend on the output of the CNN for preceding blocks, the blocks cannot be processed in parallel. We solve this by introducing a novel signal plane that contains specifically constrained reference samples, enabling parallel decoding while maintaining high compression efficiency. Overall, experimental results show average bit rate savings of 4.07% and 3.47% for the random access (RA) and low-delay B (LB) configurations of the JVET common test conditions, respectively.
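As an illustration of the polyphase decomposition mentioned above, the following is a minimal sketch of the general technique, not the paper's implementation. It assumes a single-channel block stored as a list of rows and a subsampling factor of 2 in each dimension; the function names and the factor are hypothetical, since the abstract does not specify them.

```python
def polyphase_decompose(block, s=2):
    """Split a 2-D block into s*s subsampled phases.

    Phase i*s + j holds the samples block[i::s][j::s]. The factor s=2 is
    an assumption made for illustration only.
    """
    height, width = len(block), len(block[0])
    if height % s or width % s:
        raise ValueError("block dimensions must be divisible by s")
    return [
        [row[j::s] for row in block[i::s]]
        for i in range(s)
        for j in range(s)
    ]


def polyphase_reconstruct(phases, s=2):
    """Invert polyphase_decompose by interleaving the phases again."""
    sub_h, sub_w = len(phases[0]), len(phases[0][0])
    out = [[None] * (sub_w * s) for _ in range(sub_h * s)]
    for i in range(s):
        for j in range(s):
            phase = phases[i * s + j]
            for a in range(sub_h):
                for b in range(sub_w):
                    out[a * s + i][b * s + j] = phase[a][b]
    return out


# A 4x4 block with sample values 0..15, decomposed into four 2x2 phases.
block = [[4 * r + c for c in range(4)] for r in range(4)]
phases = polyphase_decompose(block)
restored = polyphase_reconstruct(phases)
```

The decomposition is lossless: it rearranges a block of size H x W into four phases of size H/2 x W/2 that can be stacked as input channels. In general, a convolutional layer applied to these phases visits a quarter of the spatial positions, which is one way such a stage can reduce computational complexity. How the stage is integrated into the network is not stated in this abstract.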