In many songbird species, young birds learn their song from adult conspecifics. Like much animal communication, birdsong is multimodal: singing is accompanied by beak and body movements. We hypothesized that these visual cues could enhance vocal learning, thus partly explaining why many birdsong studies find reduced learning from unimodal audio playbacks compared to multimodal live social tutoring. To test this, juvenile zebra finches, Taeniopygia guttata, were tutored in a yoked design in which replicate tutoring groups of three male–female dyads were exposed to the same live tutor simultaneously in three different ways. (1) Tutees were housed with the tutor in a central compartment; hence they could hear, see and interact with their tutor (‘live’). (2) Tutees placed in one of two adjacent compartments could hear but not see the same tutor from behind a black loudspeaker cloth (‘audio-only’). (3) Tutees in the other adjacent compartment could likewise hear the tutor through loudspeaker cloth but could also see the tutor through a one-way mirror (‘audiovisual’). Comparisons of subadult and adult song showed more changes in the audio-only tutees than in the audiovisual or live-tutored tutees, suggesting that the audio-only group's song development was delayed. According to (blinded) human observer similarity scoring, the audio-only tutees' singing was least similar and the live tutees' singing most similar to their tutor's singing, while the audiovisual tutees showed an intermediate level of similarity; however, the between-treatment differences in similarity were not significant. Conversely, the audio-only group showed the highest similarity values with their father's song, which they had heard only before the experimental tutoring. Given that the quantity and quality of the tutor song input were the same across treatments within tutoring groups, the results support the hypothesis that visual exposure to a tutor, in addition to auditory exposure, can affect the timing and possibly also the amount of vocal learning.