Configural processing is a specialised perceptual mechanism that allows adult humans to process facial information quickly. It emerges before the first birthday and can be disrupted by presenting the face upside down (inversion). To date, little is known about how configural face processing relates to the emerging knowledge of audiovisual (AV) speech in infancy. Using eye-tracking, we measured attention to the speaking mouth in upright and inverted faces that were either congruent or incongruent with the speech sound. Face inversion affected looking at AV speech only in the older infants (9- to 11-month-olds and 12- to 14-month-olds). The youngest group (5- to 7-month-olds) showed no difference in looking duration between upright and inverted faces, whereas in both older groups face inversion reduced looking at the articulating mouth. We also observed a stronger interest in the eyes in the youngest infants, followed by an increase in looking time to the mouth in both older groups. Our findings suggest that configural face processing is already involved in AV speech processing in infancy, indicating early integration of face and speech processing mechanisms in cognitive development.