Speech rate in women and men: who’s really taking up the most space?

Many of us would hold our hands up to harboring the intuition that women tend to talk faster than men.  But while myths of differences in word count over the course of the day have been debunked without much complication (cf. Gender Jabber: Do Women Talk More than Men?), the case of speaking rate is a much more curious illusion.

Not only do women not in fact speak faster than men, but they are reliably found to speak slower in both read and spontaneous speech (Simpson, 1998).  So where is this impression coming from?  Is our calibration just out of whack from watching ‘Bringing Up Baby’ one too many times or are our gender stereotypes so ingrained that we simply can’t see women as anything other than over-excitable motor-mouths?

As Weirich and Simpson (2014) found, the answer is a lot more interesting and may lie in actual properties of the speech signal.

Speech tempo perception does not seem to be as simple as tracking pacing according to an abstract, independent metric of time.  Stimuli with more complex acoustic properties are generally perceived as having taken longer to perform than simple ones, despite taking an equal length of time to play out.  Leaving gender aside for the moment, a number of illusory effects potentially driven by this bias within speech have already been identified:

  • Vowels spoken with varying and dynamic pitch contours are perceived as longer than the same vowel spoken in a monotone or with simpler pitch variation (Lehiste, 1976; Cumming, 2011)
  • Utterances are heard as faster when pause, syllable or segment content increases in number or duration within the same time frame (Trouvain, 2004)

So, how might this relate to sex differences?

Sex and Vowel Space

Acoustic vowel space is a co-ordinate representation of the locations of an individual’s vowels, according to two key frequency properties of the speech signal called formants (F1 & F2).  Formant values are related to the properties of an individual’s vocal tract, which is composed of two cavities (pharyngeal & oral) that can differ in overall length as well as in length ratio.  On average, men have longer vocal tracts and particularly a longer pharyngeal cavity (Fant, 1966; Chiba & Kajiyama, 1941).  The cavities and the moving articulators of the vocal tract act as spectral filters during speech, which is what gives rise to the different characteristic formant structures we see in vowel space.

We can map out this space by taking measurements from people reading out words containing key vowels of their native language.  These spaces vary from person to person, due to sociocultural and biological factors, but cross-linguistically it has been demonstrated that on average women’s vowel spaces tend to be larger, as you can see from the example below.

3 vowels from male (black) & female (grey) German speakers. From Weirich & Simpson (2014).
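
To make "vowel space size" a little more concrete, here is a minimal sketch.  It assumes size is measured as the area of the triangle spanned by a speaker's three corner vowels in F1/F2 space, which is one common way of quantifying it; I am not claiming this is exactly how Weirich and Simpson computed it.  The function name and the formant values are invented for illustration only.

```python
def vowel_space_area(vowels):
    """Area of the polygon whose corners are (F1, F2) points in Hz.

    Uses the shoelace formula; corners must be listed in order
    around the polygon (either direction).
    """
    n = len(vowels)
    twice_area = 0.0
    for i in range(n):
        f1_a, f2_a = vowels[i]
        f1_b, f2_b = vowels[(i + 1) % n]  # wrap around to the first corner
        twice_area += f1_a * f2_b - f1_b * f2_a
    return abs(twice_area) / 2.0


# Invented mean formants (F1, F2) in Hz for /i/, /a/, /u/.
# These are illustrative numbers, not measurements from any study.
speaker_a = [(300, 2300), (750, 1300), (320, 800)]
speaker_b = [(350, 2000), (650, 1250), (370, 900)]

area_a = vowel_space_area(speaker_a)
area_b = vowel_space_area(speaker_b)
print(area_a, area_b)
```

With these made-up values, speaker A's triangle is roughly twice the size of speaker B's, which is the kind of between-speaker difference the rest of this post is concerned with.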

While these cavity lengths are a biological constraint, behaviourally our articulations are also malleable to cultural influence; for example, women might speak more clearly, while men tend to mumble more1 (Labov, 1990; Heffernan, 2010).  However, I’ll leave those issues for now.

Since women will generally have to traverse a greater acoustic space over the course of an utterance, listeners might be subject to the aforementioned bias of tying perceived duration to acoustic complexity.  If this is the case, within-sex temporal difference effects according to vowel space size should be apparent.  This is what Weirich and Simpson (2014) set out to investigate.

Weirich & Simpson’s (2014) Design

The authors hypothesised that speakers with larger vowel spaces would be perceived as covering a larger span of frequencies, which means their articulations would have to have been performed more quickly within a given time frame.  Thus, they predicted that listeners would rate the utterances of those with larger acoustic vowel spaces as having been spoken faster.  Risking yawn-making digs about multi-tasking ability, could it be the case that women are simply fitting more in?

They constructed controlled utterances from real recordings of both male and female German speakers with variable vowel space sizes.  The segments of all utterances were manipulated (according to sex) to be the length of the average segment and the pitch contours and amplitudes for all utterances were normalised.  This provided a set of stimuli in which temporal and pitch factors were kept constant, leaving the acoustic vowel space structure intact.
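
For anyone wondering what "manipulated to be the length of the average segment" might look like in practice, here is a rough sketch of the arithmetic only.  It assumes each segment is stretched or compressed to the mean duration of that segment across the speakers of one sex; the actual resynthesis of the audio is not shown, and the function names and durations are hypothetical rather than taken from the paper.

```python
def target_durations(utterances):
    """Mean duration of each segment position across all speakers in a group.

    `utterances` holds one list of segment durations per speaker,
    all for the same utterance and in the same segment order.
    """
    n_speakers = len(utterances)
    return [sum(segment) / n_speakers for segment in zip(*utterances)]


def stretch_factors(utterance, targets):
    """Factor by which each segment must be scaled to reach its target."""
    return [target / duration for duration, target in zip(utterance, targets)]


# Invented segment durations in ms for three speakers of the same sex.
group = [
    [80, 120, 100],
    [100, 100, 140],
    [120, 140, 120],
]

targets = target_durations(group)
factors = stretch_factors(group[0], targets)
print(targets, factors)
```

A factor above 1 means the segment is lengthened and a factor below 1 means it is shortened, so after scaling every speaker in the group ends up with identical segment timing.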

Prior to the main perceptual test, ratings were also taken on the perceived naturalness, quality, pitch, speed and age of speaker for the utterances.  This was necessary to ensure the stimuli were not too odd to listen to, but more importantly to check for any subjective impressions of the stimuli that might account for any effects found in the main experiment.

Participants were required to listen to two identical utterances spoken by different speakers2 and rate whether the second speaker spoke slower or faster than the first on a seven-point scale from -3 to +3 (including 0 for perceiving an identical rate).  Their response reaction times were also recorded.  All that varied between the speakers was the size of their vowel space and the critical parameter of each pair was not absolute vowel space size, but difference in vowel space size between speakers.
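
The pairing logic can be sketched roughly as follows.  This assumes every ordered pair of same-sex speakers is a possible trial and that the predictor is simply the second speaker's vowel space area minus the first's; whether the authors used every ordering is my assumption, and the speaker labels, areas and function names are all invented.

```python
from itertools import permutations


def speaker_pairs(areas):
    """List (first, second, difference) for every ordered pair of speakers.

    difference = vowel space area of the second speaker minus the first.
    """
    return [
        (first, second, areas[second] - areas[first])
        for first, second in permutations(areas, 2)
    ]


def predicted_direction(difference):
    """Direction of the rating the hypothesis predicts on the -3..+3 scale."""
    if difference > 0:
        return "faster"  # second speaker has the larger vowel space
    if difference < 0:
        return "slower"  # second speaker has the smaller vowel space
    return "same"


# Invented vowel space areas (Hz squared) for three same-sex speakers.
areas = {"S1": 157500.0, "S2": 327500.0, "S3": 240000.0}

pairs = speaker_pairs(areas)
for first, second, difference in pairs:
    print(first, second, difference, predicted_direction(difference))
```

The point of framing it this way is that a given speaker can be the "large" member of one pair and the "small" member of another, so it is the difference within the pair, not any speaker's absolute size, that is expected to drive the rating.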

The Verdict!

Vowel space size was indeed a predictor for perceived tempo in both men and women, with difference in vowel space size proving highly predictive of reaction time data in females (i.e., the larger the difference in vowel space size, the quicker the subjects responded).  The lack of this effect in men was likely due to the much lower variability in vowel space size between the male speakers, which is also probably responsible for the slightly less significant effect of vowel space size within men compared to within women.

Further Thoughts

Cool, right?!

Well, I definitely think so, especially in terms of discovering something concrete potentially underlying this stereotype, as well as going further to demonstrate the applicability of the underlying hypothesis at work within sexes.  However, there are a few things which I find perplexing about the hypothesis, which slightly alter my interpretation of these results.

As Weirich and Simpson quite rightly point out, they have identified an effect of the size of acoustic vowel space on tempo perception (i.e., crossing larger formant frequency territory).  However, according to the approach I espouse, the Direct Realist Perspective (Fowler, 1986), what is perceived is the event of the speech gesture, not the acoustic properties that make it up.  While this isn’t the place to fully elaborate on the theory, it will suffice to say that frequencies alone are not important to an organism, but the events that they help to specify are.  However, Simpson (2002) examined male and female productions from a database of articulatory recordings and found that while the acoustic consequences were more minor for men’s tongue movements, their trajectories were both faster and covered greater distance than women’s.  If the tempo ratings are based on compensation for perceived action within a timeframe, we should expect the opposite effect to the one found here.

On the face of things, this seems to be bad news for my perspective.  However, looking back at the rationale for the predictions made about acoustic vowel space size on tempo ratings, I feel there has been an oversight.

Cumming’s (2011) overview of the dynamic pitch experiments demonstrates illusory lengthening of the more complex speech sounds, whereas Weirich and Simpson (2014) appear to be setting out to demonstrate compensation for illusory lengthening, but (to all intents and purposes) within the same experimental paradigm.  This is a problematic shift in perspective, as it results in a hypothesis in the opposite direction from those of the experiments that motivate the study.  In most cases the studies covered by Cumming (2011) presented pairs of utterances (typically vowels) and asked which was longer.  While Weirich and Simpson asked participants to judge rate of speech, this is technically equivalent, since the utterances were the same (i.e., faster rate = shorter duration).  It is unclear why compensation is expected in the latter and not the former.  What a compensation effect in fact relies on is that participants were reliably detecting the equivalent duration of the utterances (only then taking account of the larger frequency span), which we already know is subject to bias.

I am not claiming that predictions on speech rate according to size of articulatory vowel space will be the exact inverse of those predicted by acoustic vowel space, so we cannot simply claim that the significant results in fact support Direct Realist predictions.  However, the fact that Simpson’s (2002) findings uncover a reversed relationship between the two spaces at least opens up the possibility that illusory lengthening might be occurring on the basis of real energy expenditure within articulatory space.  In other words, the participants were not compensating for illusion in acoustic space, but potentially experiencing the established illusion according to articulatory space.  For this to be the case, there would have to be information in the acoustic signal about this increased energy expenditure, so any future research on this effect would have to start looking around here.

It certainly doesn’t surprise me that we get these lengthening effects in non-linguistic stimuli (e.g. moving compared to static pitch in buzz stimuli; Cumming, 2011), since this is often the only information at hand about any potential ‘event’.  However, it doesn’t seem intuitive that we would latch onto this abstract non-informative variable for judging speech rate unless it reliably tells us something about the event.  Indeed, there is plenty of evidence that many of these speech impressions are not simply a broad acoustic effect.  For example, the lengthening effect of dynamic pitch contour of vowels is not even cross-linguistically stable3 and seems to be dependent on factors within a language, such as the existence of vowel length contrasts (Lehnert-LeHouillier, 2007) or the fact that pitch is an acoustic correlate of stress in English (Lehiste, 1976).

Since judging speech rate isn’t a typical task that we engage in, it’s possible that we are exploiting informational variables (correctly or incorrectly) that are available to us as part of the normal course of speech perception and related to informative events that compose our typical language productions.

Anyway, those are my current impressions, but feel free to comment if you have any thoughts.

1 As it happens, there might be a physiological underpinning to this, with evidence that guys simply might not be physically able to open their jaws as wide (Weirich, Simpson, Fuchs, Winkler & Perrier, 2014).  And no, this is not an excuse to relegate “gobby female” to the status of neutral scientific parlance.

2 These were always same-sex speakers to avoid stereotype influence.  More specifically, a between-subjects design allowed mixed-sex groups to be exposed only to speakers of one sex throughout the experiment.

3 No lengthening effect is found in German, Swedish, Thai or Spanish speakers (cf. Cumming, 2011).


Chiba, T., & Kajiyama, M. (1941). The vowel: Its nature and structure. Tokyo: Kaiseikan.

Cumming, R. (2011). The effect of dynamic fundamental frequency on the perception of duration. Journal of Phonetics, 39(3), 375-387.

Fant, G. (1966). A note on vocal tract size factors and non-uniform F-pattern scalings. Speech Transmission Laboratory Quarterly Progress and Status Report, 1, 22-30.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14(1), 3-28.

Lehiste, I. (1976). Suprasegmental features of speech. In Contemporary issues in experimental phonetics (pp. 225-239).

Lehnert-LeHouillier, H. (2007). The influence of dynamic F0 on the perception of vowel duration: Cross-linguistic evidence. In Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, Germany.

Simpson, A. P. (1998). Phonetische Datenbanken des Deutschen in der empirischen Sprachforschung und der phonologischen Theoriebildung (No. 33). Institut für Phonetik und Digitale Sprachverarbeitung Universität.

Simpson, A. P. (2002). Gender-specific articulatory-acoustic relations in vowel sequences. Journal of Phonetics, 30, 417-435.

Trouvain, J. (2004). Tempo variation in speech production: Implications for speech synthesis. Univ. Institut für Phonetik.

Weirich, M. & Simpson, A. P. (2014). Differences in acoustic vowel space and the perception of speech tempo. Journal of Phonetics, 43, 1-10.

Weirich, M., Simpson, A., Fuchs, S., Winkler, R., & Perrier, P. (2014, May). Mumbling is morphology? In 10th International Seminar on Speech Production (ISSP 2014) (pp. 457-460). Köln Universität.