Speech Gestures: The new normal(isation)

Speech, we are told, is a messy beast.  Hockett’s Easter egg analogy (1955) is perhaps the most drastically pessimistic and most telling metaphor for the received wisdom about the phonologist’s struggle.

Imagine a row of differently coloured Easter eggs carried along a conveyor belt, only to be smashed to bits by a mechanical hammer, leaving debris in its wake.  With a mess of albumen, yolk and smashed shell intermingled from neighbouring eggs, it’s hard to see how we’d begin to put them together again, let alone work out what they were supposed to look like in the first place.

Easter egg tradition

The idealised mental units of intended speech tokens, in this view, must contend with the interfering, material medium of the vocal tract.  In the quest for meaning and regularity in the speech signal, all we seem to find is the wreckage of these mental tokens’ passage into the world.  We find acoustic smearing by coarticulatory effects, making it seemingly impossible to know where one sound ends and another begins, and we find acoustic properties to be infuriatingly contingent on the properties of each speaker’s vocal tract, rather than simply their linguistic identity.

It seems that all we’re left with is the ingredients for an omelette we never really wanted to make.

In this, the first of two posts, I’m going to look at the second of these issues, speaker-dependence; coarticulation gets its own treatment in the follow-up.

Speaker Normalisation

Vowels and their acoustic identities are a paradigm case for the problem at hand.  The speech signal carries signature properties for vowels called formants: the effect of the resonant frequencies of the vocal tract on the periodic glottal pulse from the vocal folds.  The first formant varies relatively systematically with tongue height and the second with tongue backness, and the majority of vowels can be described in terms of these two parameters (Fant, 1960)1.

While these properties covary fairly reliably within an individual, every vocal tract is different. Individual variation in shape and size affects both the resonant frequencies of the tract and the fundamental frequency of a person’s voice, and thus the harmonic structure of voicing.  This predicts problems for the listener, for whom the acoustic properties specifying a vowel vary depending on speaker context.  The ambiguity is demonstrated in the following graph, where there is substantial overlap between vowel clusters in this space.  Despite this, all of the plotted tokens were robustly recognised as the intended vowels.


Vowel clusters – Peterson & Barney (1952)

Quest for mechanisms

Since we are in fact very good at recognising phonemes across speaker contexts – from a young age, at that (e.g. Grieser & Kuhl, 1989) – the solution has been assumed to lie in the search for a normalisation mechanism: one that would operate on formants to factor out individual variation, with the resulting representation yielding access to vowel identity.

Most theories propose an online mechanism to transform F1-F2 space into speaker-independent units (see Johnson, 2005).  These transformations have ranged from formant ratio normalisation – drawing analogy to chord recognition independent of pitch (Potter & Steinberg, 1950) – to F0 and F3 modulation of values.  However, these rarely perform better than statistical standardisation techniques used as a practical measure to reduce between-speaker scatter in sociophonetic research (Adank et al., 2004), rather than as a putative cognitive mechanism.
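To make the statistical end of this contrast concrete, here is a minimal Python sketch of z-score standardisation (in the spirit of the Lobanov procedure, one of those compared by Adank et al., 2004), applied to invented, illustrative formant values rather than real measurements:

```python
import statistics

def lobanov(formants):
    """Z-score a set of formant measurements within one speaker
    (Lobanov-style standardisation): subtract the speaker's mean
    and divide by the speaker's standard deviation."""
    mean = statistics.mean(formants)
    sd = statistics.stdev(formants)
    return [(f - mean) / sd for f in formants]

# Illustrative (not measured) F1 values, in Hz, for the same three vowels
# produced by a longer-tracted and a shorter-tracted speaker:
speaker_a_f1 = [300.0, 500.0, 700.0]
speaker_b_f1 = [360.0, 600.0, 840.0]   # uniformly scaled up by 20%

# The raw values differ, but the within-speaker z-scores coincide,
# because each speaker's offset and scaling are factored out.
za = lobanov(speaker_a_f1)
zb = lobanov(speaker_b_f1)
```

Within-speaker standardisation of this kind reduces between-speaker scatter very effectively in corpus work, but, as noted above, it is a practical measure rather than a claim about what listeners actually do.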

Perception of gestures

According to the direct realist theory of speech perception (e.g. Fowler, 1986), listeners do not perceive the acoustic properties of speech sounds, but the gestures that structure the acoustic medium.  Because physical gestures constrain the available acoustics in a lawful manner over time, the acoustic medium likely specifies the gesture, but this could be in the form of a higher-level acoustic property, rather than the supposed acoustic primitives of F1 and F2.  While F1 and F2 indeed covary with tongue positioning in the tract, they clearly do not uniquely specify it, i.e. they are not an invariant property.

An analogous case exists in visual perception: a triangle is still perceived as a triangle, independent of the absolute lengths of its sides.  Critically, though, the lengths of the sides still covary with the percept, but only in relative terms.  For example, a right-angled triangle is still defined by the relative lengths of its sides, such that the square of the hypotenuse equals the summed squares of the other two sides.
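The scale-invariance point is trivially checkable in code (a toy demonstration, not a model of vision): the Pythagorean relation holds among relative lengths, so it survives any uniform rescaling.

```python
import math

def is_right_triangle(a, b, c):
    """Check the Pythagorean relation between the two shorter sides and
    the longest one -- a property of relative, not absolute, lengths."""
    x, y, h = sorted((a, b, c))
    return math.isclose(x * x + y * y, h * h)

# The same shape is recognised at any size:
assert is_right_triangle(3, 4, 5)
assert is_right_triangle(0.3, 0.4, 0.5)   # the same triangle, scaled down
assert not is_right_triangle(3, 4, 6)     # a different shape
```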

With this in mind, it is clear that the formant-ratio approaches were trying to capture a similar property, in which the relative formant values are preserved as a possible invariant.  However, the direct realist (ecological) approach to perception crucially reinforces the notion that the structure in the sensory array specifies its source, i.e. the event in the world that gave rise to it – in this case, gestures of the vocal tract.  Rather than applying abstract transformations to the sensory array in search of invariant identities, we should start with the source identity – speech gestures – and ask whether and how it structures the array in a way that could support identification.  The latter problem is much more constrained than the former.

Something akin to this has been attempted in vocal tract normalisation, since the length of the vocal tract is known to have a dramatic effect on a speaker’s formant values.  Monahan and Idsardi’s (2010) approach explores the possibility that listeners exploit the third formant, since it is a relatively reliable indicator of vocal tract length, and there is EEG evidence that we are sensitive to the ratios between F3 and the first two formants.  This is successful to some extent; however, it fits comfortably in the vein of mechanistic transformations on basic primitives, and the existence of such a mechanism is not needed if we accept that our perceptual systems evolved to perceive events, rather than abstract physical parameters of the world.
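The core intuition behind the F3 proposal can be sketched in a few lines (my own simplification, not Monahan and Idsardi’s actual model), under the idealisation that a shorter vocal tract scales all formants by roughly the same factor:

```python
def f3_normalise(f1, f2, f3):
    """Represent a vowel by its formant ratios F1/F3 and F2/F3.
    If F3 tracks vocal tract length, dividing by it should factor
    out the speaker-specific scaling (an idealisation)."""
    return (f1 / f3, f2 / f3)

# Illustrative values: a longer tract, and a tract whose resonances are
# all 20% higher. The ratio representation is identical for both.
long_tract = f3_normalise(500.0, 1500.0, 2500.0)
short_tract = f3_normalise(600.0, 1800.0, 3000.0)
```

The paragraph below explains why this tidy picture is too simple: F3 does not uniquely specify tract length, so the real mapping cannot be a plain ratio.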

The reason for caution regarding this type of mechanism is that it requires that F3 specify vocal tract length uniquely.  While it does indeed covary somewhat with length, it is not a simple identity relationship and is also not the only property impacting vowel identity.  For example, F3 alters with lip rounding and vowel category boundaries are sensitive, though weakly, to changes in formants from F3-F5.  Animals have also demonstrated categorical perception of speech sounds, speaking against the primacy of a specialised human speech mechanism (e.g. Kuhl & Miller, 1978).

This falls into the same trap of taking a merely correlative relationship and presuming causation. It is indeed likely that vocal tract length, and its impact on F3 among other things, forms a necessary component of the information supporting gesture identity; it is quite another thing to propose F3 as a modulating cue participating in a transformation of vowel space.  It is also arbitrary to propose that the normalisation mechanism would involve a simple ratio relationship between resonant frequencies and vocal tract length, unless modelling of airflow through the tract supports such a transformation of these parameters.

Where do we go from here?

The solution begged by the direct realist account is unfortunately trickier than simple transformation hacks on F1-F2 vowel space: it dictates that we look for acoustic invariants that uniquely specify the actions of the vocal tract.  Luckily, because the acoustics are constrained by articulation, we can start from the gestural invariant and work outwards, using techniques such as articulatory synthesis from detailed models of the vocal tract and its simulated interaction with airflow in structuring the sound (e.g. Browman & Goldstein, 1986).  This approach can account for animal sensitivity to speech sounds and for the effects of higher-formant manipulation on vowel perception: a simple, lawful causal relationship between gestural identity and movement at any frequency can still be picked up and exploited by a generalised perceptual mechanism.  By a cue-based account, sensitivity to higher formants appears odd, since these primitives by themselves are rarely uniquely or robustly supportive of vowel identity, making them poor candidates for a mechanistic transformative process.

More will be explored about defining speech sounds as gestural invariants in the next post on coarticulation.

1Lip rounding and rhoticity also affect the second and third formants, but these will not be explored here.


Adank, P., Smits, R., & Van Hout, R. (2004). A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America, 116(5), 3099-3107.

Browman, C. P., & Goldstein, L. M. (1986). Towards an articulatory phonology. Phonology, 3(01), 219-252.

Fant, G. (1960). Acoustic theory of speech production. Mouton.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14(1), 3-28.

Grieser, D., & Kuhl, P. K. (1989). Categorization of speech by infants: Support for speech-sound prototypes. Developmental Psychology, 25(4), 577.

Hockett, C. F. (1955). A manual of phonology (No. 11). Waverly Press.

Johnson, K. A. (2005). Speaker normalization in speech perception. In D. B. Pisoni & R. E. Remez (Eds.), Handbook of speech perception (pp. 363–389). Oxford: Blackwell.

Kuhl, P. K., & Miller, J. D. (1978). Speech perception by the chinchilla: Identification functions for synthetic VOT stimuli. The Journal of the Acoustical Society of America, 63(3), 905-917.

Monahan, P. J., & Idsardi, W. J. (2010). Auditory sensitivity to formant ratios: Toward an account of vowel normalisation. Language and Cognitive Processes, 25(6), 808-839.

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175-184.

Speech rate in women and men: who’s really taking up the most space?

Many of us would hold our hands up to harbouring the intuition that women tend to talk faster than men.  While myths of differences in daily word count have been debunked without much complication (cf. Gender Jabber: Do Women Talk More than Men?), the case of speaking rate is a much more curious illusion.

Not only do women not in fact speak faster than men, but they are reliably found to speak slower in both read and spontaneous speech (Simpson, 1998).  So where is this impression coming from?  Is our calibration just out of whack from watching ‘Bringing Up Baby’ one too many times or are our gender stereotypes so ingrained that we simply can’t see women as anything other than over-excitable motor-mouths?

As Weirich and Simpson (2014) found, the answer is a lot more interesting and may lie in actual properties of the speech signal.

Speech tempo perception does not seem to be as simple as tracking pace against an abstract, independent metric of time.  Stimuli with more complex acoustic properties are generally perceived as having taken longer than simple ones, despite taking the same length of time to play out.  Leaving gender aside for the moment, a number of illusory effects potentially driven by this bias have already been identified within speech:

  • Vowels spoken with varying and dynamic pitch contours are perceived as longer than the same vowel spoken in a monotone or with simpler pitch variation (Lehiste, 1976; Cumming, 2011)
  • Utterances are heard as faster when pause, syllable or segment content increases in number or duration within the same time frame (Trouvain, 2004)

So, how might this relate to sex differences?

Sex and Vowel Space

Acoustic vowel space is a co-ordinate representation of the locations of an individual’s vowels, according to two key frequency properties of the speech signal called formants (F1 & F2).  Formant values are related to the properties of an individual’s vocal tract, which is composed of two cavities (pharyngeal & oral) that can differ in overall length as well as in length ratio.  On average, men have longer vocal tracts and, in particular, a longer pharyngeal cavity (Fant, 1966; Chiba & Kajiyama, 1941).  The cavities and the moving articulators of the vocal tract act as spectral filters during speech, which is what gives rise to the different characteristic formant structures we see in vowel space.
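The effect of tract length alone can be illustrated with the textbook idealisation of the vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at the odd quarter-wavelength frequencies.  This is a sketch only: real tracts are not uniform tubes, and the two lengths below are round illustrative figures rather than measurements.

```python
def neutral_formants(length_cm, n_formants=3, c=35000.0):
    """Resonances of a uniform tube closed at one end and open at the
    other (the textbook schwa-like vocal tract):
        F_n = (2n - 1) * c / (4 * L)
    with c the speed of sound in cm/s and L the tube length in cm."""
    return [(2 * n - 1) * c / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

# An illustrative 17.5 cm tract vs a shorter 14.6 cm one:
longer = neutral_formants(17.5)    # [500.0, 1500.0, 2500.0]
shorter = neutral_formants(14.6)   # every formant shifted up by ~20%
```

Halving nothing and changing only L shifts every formant by the same factor, which is exactly why a single vowel category lands in different places in F1-F2 space for different speakers.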

We can map out this space by taking measurements from people reading out words containing key vowels of their native language.  These spaces vary from person to person, due to sociocultural and biological factors, but cross-linguistically it has been demonstrated that women’s vowel spaces tend, on average, to be larger, as you can see from the example below.


3 vowels from male (black) & female (grey) German speakers. From Weirich & Simpson (2014).
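For concreteness, vowel space “size” is commonly operationalised as the area of the region spanned by a speaker’s point vowels in F1-F2 space.  A minimal sketch, using the shoelace formula and invented (not measured) corner-vowel values:

```python
def vowel_space_area(points):
    """Area of the polygon spanned by a speaker's point vowels in F1-F2
    space (shoelace formula). `points` are (F1, F2) pairs in Hz, given
    in order around the hull."""
    n = len(points)
    area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# Illustrative /i a u/ triangles for two speakers; the second space is
# the first scaled up by 20% in both dimensions:
small = vowel_space_area([(300, 2300), (800, 1300), (350, 800)])
large = vowel_space_area([(360, 2760), (960, 1560), (420, 960)])
```

A uniform 20% expansion of the space multiplies the area by 1.44, so even modest formant scaling differences translate into sizeable differences in this measure.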

While these cavity lengths are a biological constraint, our articulations are also behaviourally malleable to cultural influence; for example, women might speak more clearly, while men tend to mumble more (Labov, 1990; Heffernan, 2010).  However, I’ll leave those issues for now.

Since women will generally have to traverse a greater acoustic space over the course of an utterance, listeners might be subject to the aforementioned bias of tying perceived duration to acoustic complexity.  If this is the case, within-sex temporal difference effects according to vowel space size should be apparent.  This is what Weirich and Simpson (2014) set out to investigate.

Weirich & Simpson’s (2014) Design

The authors hypothesised that speakers with larger vowel spaces would be perceived to cover a larger span of frequencies, which means their articulatory actions would have had to be performed more quickly within a given time frame.  Thus, they predicted that listeners would rate the utterances of those with larger acoustic vowel spaces as having been spoken faster.  Risking yawn-making digs about multi-tasking ability: could it be that women are simply fitting more in?

They constructed controlled utterances from real recordings of male and female German speakers with variable vowel space sizes.  The segments of all utterances were manipulated (within each sex) to the length of the average segment, and the pitch contours and amplitudes of all utterances were normalised.  This provided a set of stimuli in which temporal and pitch factors were kept constant, leaving the acoustic vowel space structure intact.

Prior to the main perceptual test, ratings were also taken on the perceived naturalness, quality, pitch, speed and age of speaker for the utterances.  This was necessary to ensure the stimuli were not too odd to listen to, but more importantly to check for any subjective impressions of the stimuli that might account for any effects found in the main experiment.

Participants listened to two identical utterances spoken by different speakers2 and rated whether the second speaker spoke slower or faster than the first on a seven-point scale from -3 to +3 (with 0 for perceiving an identical rate).  Response reaction times were also recorded.  All that varied between the speakers was the size of their vowel space, and the critical parameter of each pair was not absolute vowel space size, but the difference in vowel space size between the speakers.

The Verdict!

Vowel space size was indeed a predictor of perceived tempo for both male and female speakers, and the difference in vowel space size proved highly predictive of reaction times for female speakers (i.e., the larger the difference in vowel space size, the quicker subjects responded).  The absence of this reaction-time effect for male speakers was likely due to the much lower variability in vowel space size among them, which is probably also responsible for the weaker effect of vowel space size within men compared to within women.

Further Thoughts

Cool, right?!

Well, I definitely think so, especially in terms of discovering something concrete potentially underlying this stereotype, as well as going further to demonstrate the applicability of the underlying hypothesis at work within sexes.  However, there are a few things which I find perplexing about the hypothesis, which slightly alters my interpretation of these results.

As Weirich and Simpson quite rightly point out, they have identified an effect of the size of acoustic vowel space on tempo perception (i.e., crossing larger formant frequency territory).  However, according to the approach I espouse, the direct realist perspective (Fowler, 1986), what is perceived is the event of the speech gesture, not the acoustic properties that make it up.  While this isn’t the place to fully elaborate the theory, it will suffice to say that frequencies alone are not important to an organism; the events they specify are.  However, Simpson (2002) examined male and female productions from a database of articulatory recordings and found that, while the acoustic consequences of men’s tongue movements were more minor, their trajectories were both faster and covered greater distances than women’s.  If tempo ratings are based on compensation for perceived action within a time frame, we should expect the opposite of the effect found here.

On the face of things, this seems to be bad news for my perspective.  However, looking back at the rationale for the predictions made about acoustic vowel space size on tempo ratings, I feel there has been an oversight.

Cumming’s (2011) overview of the dynamic pitch experiments demonstrates illusory lengthening of the more complex speech sounds, whereas Weirich and Simpson (2014) appear to set out to demonstrate compensation for illusory lengthening, but (to all intents and purposes) within the same experimental paradigm. This is a problematic shift in perspective, as it results in a hypothesis in the opposite direction from those of the experiments that motivate the study.  In most cases, the studies covered by Cumming (2011) presented two utterance pairs (typically vowels) and asked which was longer.  While Weirich and Simpson asked participants to judge rate of speech, this is technically equivalent, since the utterances were the same (i.e., a faster rate means a shorter duration).  It is unclear why compensation is expected in the latter and not the former.  What a compensation effect in fact relies on is participants reliably detecting the equivalent duration of the utterances (and only then taking account of the larger frequency span), which we already know is subject to bias.

I am not claiming that predictions on speech rate according to size of articulatory vowel space will be the exact inverse of those predicted by acoustic vowel space, so we cannot simply claim that the significant results in fact support Direct Realist predictions.  However, the fact that Simpson’s (2002) findings uncover a reversed relationship between the two spaces at least opens up the possibility that illusory lengthening might be occurring on the basis of real energy expenditure within articulatory space.  In other words, the participants were not compensating for illusion in acoustic space, but potentially experiencing the established illusion according to articulatory space.  For this to be the case, there would have to be information in the acoustic signal about this increased energy expenditure, so any future research on this effect would have to start looking around here.

It certainly doesn’t surprise me that we get these lengthening effects in non-linguistic stimuli (e.g. moving compared to static pitch in buzz stimuli – Cumming, 2011), since this is often the only information at hand about any potential ‘event’.  However, it doesn’t seem intuitive that we would latch onto this abstract, non-informative variable for judging speech rate unless it reliably tells us something about the event.  Indeed, there is plenty of evidence that many of these speech impressions are not simply a broad acoustic effect.  For example, the lengthening effect of a dynamic pitch contour on vowels is not even cross-linguistically stable3 and seems to depend on factors within a language, such as the existence of vowel length contrasts (Lehnert-LeHouillier, 2007) or the fact that pitch is an acoustic correlate of stress in English (Lehiste, 1976).

Since judging speech rate isn’t a task we typically engage in, it’s possible that we exploit informational variables (correctly or incorrectly) that are available to us in the normal course of speech perception and related to the informative events that compose typical language production.

Anyway, those are my current impressions, but feel free to comment if you have any thoughts.

1As it happens, there might be a physiological underpinning to this, with evidence that guys simply might not be physically able to open their jaws as wide (Weirich, Simpson, Fuchs, Winkler & Perrier, 2014).  And no, this is not an excuse to relegate “gobby female” to the status of neutral scientific parlance.

2These were always same-sex speakers, to avoid stereotype influence.  More specifically, a between-subjects design allowed mixed-sex participant groups to be exposed only to speakers of one sex throughout the experiment.

3No lengthening effect is found in German, Swedish, Thai or Spanish speakers (cf. Cumming, 2011).


Chiba, T., & Kajiyama, M. (1941). The vowel: Its nature and structure. Tokyo-Kaiseikan.

Cumming, R. (2011). The effect of dynamic fundamental frequency on the perception of duration. Journal of Phonetics, 39(3), 375-387.

Fant, G. (1966). A note on vocal tract size factors and non-uniform F-pattern scalings. Speech Transmission Laboratory Quarterly Progress and Status Report, 1, 22-30.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14(1), 3-28.

Lehiste, I. (1976). Suprasegmental features of speech. Contemporary Issues in Experimental Phonetics, 225-239.

Lehnert-LeHouillier, H. (2007). The influence of dynamic F0 on the perception of vowel duration: Cross-linguistic evidence. In Proceedings of the 16th international congress of phonetic sciences, Saarbrucken, Germany.

Simpson, A. P. (1998). Phonetische Datenbanken des Deutschen in der empirischen Sprachforschung und der phonologischen Theoriebildung (No. 33). Institut für Phonetik und Digitale Sprachverarbeitung Universität.

Simpson, A. P. (2002). Gender-specific articulatory-acoustic relations in vowel sequences. Journal of Phonetics, 30, 417–435.

Trouvain, J. (2004). Tempo variation in speech production: Implications for speech synthesis. Univ. Institut für Phonetik.

Weirich, M. & Simpson, A. P. (2014). Differences in acoustic vowel space and the perception of speech tempo. Journal of Phonetics, 43, 1-10.

Weirich, M., Simpson, A., Fuchs, S., Winkler, R., & Perrier, P. (2014, May). Mumbling is morphology? In 10th International Seminar on Speech Production (ISSP 2014) (pp. 457-460). Köln Universität.

Presenting the World

A previous post gave a brief introduction to the history of mainstream cognitive science, with background on the notions of ‘mental representation’ and the computational metaphor, which are going to be looked at critically in this post.

The Computational Metaphor

Anthony Chemero, author of Radical Embodied Cognitive Science (2009), argues in his opening chapter that a description or metaphor in science is acceptable only as long as it furthers our understanding of a problem.

Sticking points arise, however, when a metaphor becomes so entrenched in the intellectual community that it becomes the object of study itself. Within cognitive science, the computational metaphor has successfully embedded and reinforced itself by rewriting the central aim of psychology: the field has become preoccupied with elucidating the nature of representations (supposedly functionally invoked entities), rather than examining critically whether the metaphor is fit for purpose.  Metaphors not only constrain our understanding of the behaviour under scrutiny, but also constrain the questions that are asked.

This is a trivial issue if, like Fodor (1987), we believe that a representational language of thought is theoretically non-negotiable for explaining human perceptual and cognitive capabilities. While Fodor (2003) himself admits that this approach is currently without a psychosemantic theory of content (which I’ll get to later), the lack of urgency in finding such an account is implicitly vindicated by the lazy mantra of ‘what else could it be?’. Traditional cognitivism will eventually have to bear this burden of explanation in the face of any viable alternative for studying human behaviour.

One such alternative is radical embodied cognitive science, which studies behaviour from a systems-theory perspective.  This post will look at what such a view entails.  Having considered this alternative account of complex behaviour, it will then consider the metaphysical consequences of assuming that we have contentful mental representations which are computed over, and the feasibility of “information transmission”. Hopefully this will go some way towards questioning whether representationalism warrants its status as the default theoretical stance in cognitive science.

Dynamical Systems

The “radical” of “Radical Embodied Cognitive Science” refers to its anti-representationalist stance; it seeks instead to explain behaviour in terms of the interacting system of brain, body and environment.

Van Gelder’s (1995) seminal paper “What Might Cognition Be, If Not Computation?” makes useful example of the centrifugal governor, a device designed by the mechanical engineer James Watt to regulate steam engines. The aim was to maintain the speed of the driving flywheel smoothly in the face of large fluctuations in steam pressure and workload, which could be controlled via the turn of the throttle, the gateway for the steam.

From the perspective of a computational designer, we might regulate the speed of the flywheel by measuring the speed, comparing it to the desired speed, calculating how to adjust the throttle that restricts/facilitates steam flow and then implementing the change. This appears to be a task which perfectly illustrates the need to posit some sort of information processing mechanism, in which information on current and desired speed (content) is transmitted, interpreted and computed over.

However, the actual Governor does not perform any of these complex calculations at all, but still regulates the engine – and it does so incredibly well.

In the governor, a spindle is geared to the flywheel, so that its rotation speed directly depends on the speed of the flywheel. Attached by hinges to the spindle are two arms, each with a metal ball at the end, and these in turn are connected to the throttle itself. The rotation of the spindle creates a centrifugal force, driving the balls up and out, which means that when speed increases, the rising arms immediately begin restricting the flow of steam (and vice versa). The solution to the problem is immediate, continuous and smooth; the system is robust to large changes in pressure and load, and maintains the desired speed effectively.
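Ironically, the governor’s behaviour is easy to reproduce in code precisely because a single differential equation captures it; no measuring, comparing or rule-following components appear anywhere in the loop. The sketch below uses a standard textbook form of the governor equation; the gearing, arm-length and friction values are my own illustrative choices, not van Gelder’s:

```python
import math

def simulate_governor(omega, steps=100000, dt=0.0005,
                      n=6.0, g=9.81, l=0.25, r=2.0):
    """Integrate a textbook idealisation of the centrifugal governor:
        theta'' = (n*omega)^2 * sin(theta)*cos(theta)
                  - (g/l) * sin(theta) - r * theta'
    where theta is the arm angle from the vertical spindle, omega the
    engine speed, n a gearing ratio, l the arm length and r a friction
    term (illustrative values). Returns the settled arm angle."""
    theta, dtheta = 0.1, 0.0            # arms nearly at rest, hanging down
    for _ in range(steps):
        ddtheta = ((n * omega) ** 2 * math.sin(theta) * math.cos(theta)
                   - (g / l) * math.sin(theta)
                   - r * dtheta)
        dtheta += ddtheta * dt          # semi-implicit Euler step
        theta += dtheta * dt
    return theta

# A faster engine drives the arms further up and out; via the linkage,
# that would close the throttle. Nothing in the loop measures, compares
# or follows rules -- coupled quantities simply evolve together.
slow = simulate_governor(omega=2.0)
fast = simulate_governor(omega=4.0)
```

Run for two engine speeds, the arms settle at a higher angle for the faster spindle, and that settled angle matches the equilibrium the equation predicts analytically: that is the whole of the control story.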


The Nature of Representation

The Watt governor is a dynamical system in which the arm angle covaries with the speed of the flywheel. This could conceivably be called the system’s “representation” of speed; however, the label is misleading. The system is not taking a measure of speed and performing a computation on it: there are no identifiable sub-components performing discrete operations, whereas the computational solution clearly has such modules. The governor has no schedule of rules to follow, and its task is not one of translation in any meaningful sense of the word. Its activity is continuous, and there is no point in time at which any part of the system is not influencing the behaviour of all the other parts.

Van Gelder raises the point that the correlative relationship between arm angle and flywheel speed in fact breaks down when the system is outside an equilibrium state, meaning that the supposed representational relationship (specified as correlation) is not even an enduring one1. However, when described within the framework of dynamical systems, a mathematical description of the coupling of the parts adequately characterises the relationship between the system’s components over time in its entirety. A representational narrative not only adds nothing, but encourages us to ask misleading questions, such as how the elements ‘communicate’, how information is ‘processed’ or how a proposed ‘algorithm’ might be implemented.

While this has laid out that there do in fact exist alternative approaches to studying behaviour, it is also worth looking at traditional computationalism’s conceptual commitments and viability.

The Hard Problem of Content

Hutto and Myin (2013) define the classic representationalism stance as CIC (Cognition necessarily Involves Content) and characterise its biggest credibility hurdle as The Hard Problem of Content. This challenge holds for any theory which aims to characterise cognition within the bounds of explanatory naturalism while maintaining a CIC stance that cognition is about manipulating contentful representations.

Problems arise when attempting to explain how information maintains its integrity through transmission in different physical mediums. Invariably, attempts to ground representations in the physical world lead to fuzzy distinctions between representational vehicles (“information carriers” which are potentially amenable to physical description) and their contents. If our cognitive architecture is specialised to deal with the physical vehicles that hold the content, then what or who is the attached content meaningful for?

While covariance relations could be sufficient to constitute information, Hutto and Myin (2013) claim that this is insufficiently constrained to account for meaningful content, mirroring van Gelder’s concerns that representation-as-correlation opens the term “representation” up to trivialisation. Covariance does not constitute a state carrying information about something else without an external interpretive process imposed on it (e.g. using the number of tree rings to derive the age of the tree). In order for a state to be contentful, it must have conditions of satisfaction. Covariance is not a semantic relationship, as it is not propositional about the truth of states in the world.  While organisms might respond to natural signs, this does not necessarily entail that they respond to them as stand-ins for something else.

RECS does not deny that organisms are informationally sensitive (Hutto & Myin, 2013) in that they “exploit correspondences in their environments to adaptively guide their environments” (p. 82). However, this is fundamentally distinct from claiming that information is transmitted as semantic content and presupposing some form of internal language to support the required interpretive process.

For those interested in reading about these problems in detail, their book ‘Radicalising Enactivism’ goes on to discuss potential CIC rebuttals and the consequent revisions of the notions of ‘representation’ and ‘content’, laying out the inevitable dilution of concepts that occurs during this redefining process. These terms appear to be rendered empirically implausible at worst and, at best, explanatorily irrelevant.

While this post has served to introduce a systems approach as a potential alternative to computationalism, I will later discuss in more detail a particular theoretical approach which characterises organisms as dynamical systems coupled with their environment through information.  This approach does not rely on the concepts of representation or information transmission and therefore, unlike cognitivist theories, avoids the troublesome and persistent demand to provide a coherent theory of content.

1. There is a small but significant differential between arm angle and speed when the flywheel slows quickly.  Whilst the flywheel can slow almost instantly, the rate at which the arms can fall is dictated by gravity, and during this fall their angle cannot be correlated with the speed of the flywheel.


Chemero, A. (2009). Radical embodied cognitive science. MIT Press.

Fodor, J. (1987). Psychosemantics. Cambridge, MA: MIT Press.

Fodor, J. (2003). Hume Variations. Oxford: Oxford University Press.

Hutto, D. D., & Myin, E. (2013). Radicalizing enactivism: Basic minds without content. Cambridge, MA: MIT Press.

Van Gelder, T. (1995). What might cognition be, if not computation? The Journal of Philosophy, 92(7), 345-381.

Spatiotemporal coupling between speech and manual motor actions (Parrell, Goldstein, Lee & Byrd, 2014) – HIBAR

Another summary and HIBAR post on our most recent lab group reading.

Co-ordination between speech and motor activity has long been observed and has been empirically demonstrated across the lifespan, from the co-occurrence of babbling and rhythmic limb movement in infancy (e.g. Iverson, Hall, Nickel & Wozniak, 2007) to adults’ co-production of gestures with speech at high temporal resolution. For example, gestures such as pointing tend to align precisely with the stressed syllable of a word (Rochet-Capellan, Laboissière, Galván & Schwartz, 2008).

The utility of speech and gesture coupling. Donald showing us how it’s done.

Parrell et al. (2014) aimed to probe further the role of prosodic structure in the spatiotemporal coordination of speech and gesture.


The experimenters had four participants tap their right finger on their left shoulder while synchronously repeating a monosyllable. Midway through each 30 s trial, they were required to impose a stress either on the finger-tapping movement or on the production of the word, while maintaining unaffected performance in the other domain.  This was achieved by asking participants to watch a clock dial and impose the stress as it reached one of the quarter markers.

The kinematics monitored were lip aperture (LA) and fingertip movement (FT), tracked by transducers attached to the articulators.  To address spatial effects, the magnitude of the emphasised repetition was compared with the mean magnitude of the unemphasised repetitions in the trial. Magnitude in LA and FT was characterised as the wideness/height (amplitude) of the aperture or tap.
Inter-response interval (IRI) was used to measure temporal effects of the emphasis, calculated as the time between the onsets of successive repetitions of each articulator.
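As a rough sketch of how these two measures could be computed from tracked kinematic data (this is not the authors’ actual analysis pipeline; the minima-based cycle segmentation and the example onset times are assumptions for illustration):

```python
import numpy as np

def repetition_magnitudes(signal):
    """Peak-to-trough amplitude of each repetition cycle.

    `signal` is a 1-D array (e.g. lip aperture or fingertip height),
    split into cycles at its local minima.
    """
    minima = [i for i in range(1, len(signal) - 1)
              if signal[i] < signal[i - 1] and signal[i] <= signal[i + 1]]
    bounds = [0] + minima + [len(signal) - 1]
    return [signal[a:b + 1].max() - signal[a:b + 1].min()
            for a, b in zip(bounds[:-1], bounds[1:])]

def inter_response_intervals(onset_times):
    """IRI: time between onsets of successive repetitions."""
    return np.diff(np.asarray(onset_times))

# Hypothetical onsets (seconds) of five repetitions of one articulator:
iris = inter_response_intervals([0.0, 0.50, 1.00, 1.65, 2.15])
# the lengthened third interval (0.65 s vs 0.5 s) is the kind of
# IRI lengthening reported near the stress boundary
```

The emphasised repetition’s magnitude could then be compared against the mean of the remaining values returned by `repetition_magnitudes`, mirroring the spatial comparison described above.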

Spatial Effects

While participants had been instructed to maintain unstressed movement in the uninstructed domain, three subjects showed increased magnitude in repetitions of the other articulator concurrent with an emphasis.  A significant correlation was also found between the general movement magnitudes of the two articulators (i.e. during unstressed repetitions).

Temporal Effects

A lengthening of IRIs was found near the stress boundary for both LA and FT regardless of stress domain, but the lengthening in the unstressed domain tended to be delayed by one repetition.  As with the spatial effects, a general correlation was found between the temporal effects in the two articulators.  Consistent with previous literature, the authors also found more robust effects of speech emphasis on finger tapping than vice versa, suggesting a closer coupling in that direction.

Theoretical Implications

The authors venture some possible interpretations of the data, including the π-gesture model of speech prosody, which proposes that all motor activity is controlled by a single internal clock recruited in accordance with prosodic structure. While comparable IRI lengthening within and across modalities might indicate a common clock, a single driver does not account for the observed delay in the unstressed modality. In addition, the asymmetrical effects of stress across domains indicate that simple coupling of the speech and finger articulators is an insufficient explanation.

The alternative hypothesis is that emphasis acted as a perturbation to the coordinative dynamics of the articulators, with the delayed IRI in the unstressed domain reflecting a restoration of relative phase between them. Prosody is suggested to act as a means of grouping certain information and making it salient, so it is possible that this process recruits bodily resources beyond the speech apparatus. This hypothesis would also account for the asymmetry of the stress-domain effect: stress in the speech domain forms part of the prosodic structure of language and therefore calls on a wider set of motor resources within a larger prosodic architecture, whereas stress in the finger-tapping domain does not do the reverse (1).

There are some points we would have raised, had we been reviewers of this paper:

1. We were unsure why the sample size for this study was so small (four participants), especially given the large individual variation in performance and the relatively small observed effects. This makes it harder to interpret “majority” (3/4) effects such as the augmented cross-modal effect for speech stress (see the paper for a number of these instances).

2. The authors note that for each condition (stressed finger tap/syllable) they presented two blocks, with the syllables /ma/ and /mop/. The reason cited was to investigate effects of the syllable coda (the optional final part of a syllable) on the amplitude and timing of articulator movement. This was not elaborated on, and it is unclear to us what justified including this factor, as no previous literature was cited (2).

3. While the authors aimed to reproduce a more natural prosodic context than previous studies (3), it could be argued that timing a stress according to an external stimulus (a point on a clock dial) is simply unnaturalistic in another manner. While the emphasis is ostensibly quasi-linguistic (i.e. placed at a specific point in a speech string), explicitly constraining the placement of the stress could have dynamical consequences for the preceding and following movements that would not be present in a natural language activity. The authors cite some previous literature for using this type of stress, but we currently don’t have access to these references, so it is unclear whether participants in those studies volitionally chose which gesture to stress or whether they also had to time it to a clock.

(1) General coupling principles do explain the domain-general effects, in which a smaller amplitude change is seen in the unstressed domain, and also the correlation between domains in the spatial and temporal effects during unstressed repetitions.
(2) N.B. The authors found no effect of the presence of a coda on spatial or temporal effects.
(3) These typically employed alternating stressed-unstressed patterns which imposed a rhythm, something the current authors were keen to avoid.


Iverson, J. M., Hall, A. J., Nickel, L., & Wozniak, R. H. (2007). The relationship between reduplicated babble onset and laterality biases in infant rhythmic arm movements. Brain and Language, 101(3), 198-207.

Parrell, B., Goldstein, L., Lee, S., & Byrd, D. (2014). Spatiotemporal coupling between speech and manual motor actions. Journal of Phonetics, 42, 1-11.

Rochet-Capellan, A., Laboissière, R., Galván, A., & Schwartz, J. L. (2008). The speech focus position effect on jaw–finger coordination in a pointing task. Journal of Speech, Language, and Hearing Research, 51(6), 1507-1521.


Olmstead, Viswanathan, Aicher & Fowler (2009)

Fellow Leeds Met PhD student Liam Cross and I have collaborated this week to review a paper investigating a simulation theory approach to language comprehension.  This theory comes under the umbrella term of embodiment, but is distinct from RECS and will be revisited in future posts.

Liam conducts research on the role of synchrony in social behaviour and his blog can be found at: http://tosyncornottosync.wordpress.com/

HIBAR – Sentence comprehension affects the dynamics of bimanual co-ordination: Implications for embodied cognition

Simulation theory (Barsalou, 1999) is a subset of embodied approaches that attempts to reconcile embodiment within a representational framework.  This account seeks to ground high-level cognition in sensorimotor representations, in an attempt to overcome issues of how purely symbolic or abstract representations might be instantiated in the brain.  Within the domain of language, it predicts that mental simulation of the appropriate sensorimotor representation is integral to the process of comprehension.  This would predict interaction of activities which draw on the same representations, in either an interfering or facilitatory manner (e.g. Glenberg & Kaschak, 2002).  For example, motions congruent with an action specified in a statement should facilitate comprehension, while incongruent motions should interfere.


Olmstead et al. (2009) used a pre-established bilateral rhythmic co-ordination task (Kugler & Turvey, 1987) involving well-understood behavioural measures to investigate how language comprehension of performable sentences interferes with the behavioural organisation.

φ = relative phase, V = potential

The task involved bimanual movements, where participants were required to swing a pendulum from each wrist.  This was tested with participants swinging the pendulums in-phase (0°) and anti-phase (180°).  Without interference, people show stable relative phase during these particular movements, but 0° is more stable.  180° tends to show more variability in performance around its attractor location; this variability is the standard deviation of relative phase (SDRP).

While performing this task, participants were asked to read sentences on a screen and indicate verbally whether they were plausible or implausible.  While plausibility was varied, the variable under investigation was the performability vs. inanimacy of the sentences: performable sentences implied movements concerning the hands/fingers/arms.  Differences in the dynamics of the movement were then compared to a baseline in which participants engaged in a swinging-only condition.

The researchers also had a detuned condition, in which the preferred frequencies of the limbs were manipulated by giving participants pendulums of different natural frequencies.  Detuning has the effect of shifting the attractors away from 0° and 180°, captured by “relative phase shift” (RPS), and also increases SDRP.
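To make these measures concrete, here is a minimal sketch of how relative phase, its shift from the required phase (RPS), and its variability (SDRP) could be estimated from two wrist-movement time series. This is not the authors’ analysis code; the Hilbert-transform phase estimate is one common approach, and the simulated signals and noise level are assumptions:

```python
import numpy as np
from scipy.signal import hilbert

def relative_phase_deg(x, y):
    """Continuous relative phase (degrees) between two oscillatory
    signals, estimated from the analytic (Hilbert) phase of each."""
    phase_x = np.angle(hilbert(x - x.mean()))
    phase_y = np.angle(hilbert(y - y.mean()))
    # wrap the difference into (-180, 180]
    return np.degrees(np.angle(np.exp(1j * (phase_x - phase_y))))

def rps_and_sdrp(x, y, required_phase_deg=0.0):
    """RPS: mean deviation from the required phase (0 or 180 degrees);
    SDRP: standard deviation of that deviation."""
    rp = relative_phase_deg(x, y)
    deviation = np.degrees(np.angle(np.exp(
        1j * np.radians(rp - required_phase_deg))))
    return deviation.mean(), deviation.std()

# Two simulated 1 Hz pendulum swings, the second lagging by 10 degrees,
# with a little measurement noise added:
t = np.linspace(0, 10, 2000)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * t)
y = np.sin(2 * np.pi * t - np.radians(10)) + 0.05 * rng.standard_normal(t.size)
rps, sdrp = rps_and_sdrp(x, y, required_phase_deg=0.0)
# rps comes out close to +10 degrees (an attractor shift away from
# the required 0-degree phase); sdrp reflects the added noise
```

On this scheme, a change in attractor location shows up in `rps` while a change in attractor shape (increased variability) shows up in `sdrp`, which is exactly the distinction the predictions below turn on.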

Under simulation theory, it would be predicted that movements would differ in the performable sentences condition compared to the inanimate sentences condition and the baseline (swinging only).  Since performable sentence comprehension and the movement task should be using overlapping neural resources with respect to the limb involved, simulation theory would expect a change in SDRP, but not a shift in RPS.  Sentence comprehension is presumed to cause intermittent interruption of the continuous neural oscillators active in the control of the movement task (Grossberg, Pribe, & Cohen, 1997).  In other words, performable sentence comprehension would act as a perturbation to the system and thus increase variability (SDRP).


Performability of sentences did not interact with detuning or required phase for either RPS or SDRP, leading the researchers to aggregate the detuned and non-detuned conditions and compare the single- and dual-task conditions.

Further analysis revealed a main effect of the comprehension task on RPS, but none on SDRP.  Judging performable sentences affected the relative phase (RP) of the co-ordination, whereas inanimate sentences saw no shift from baseline RP.  With no effect of task on SDRP, the attractor shape did not change; it merely shifted significantly from baseline when judging performable sentences, but not when judging inanimate sentences.

The authors conclude that the results are inconsistent with predictions made by simulation theory as it currently stands, since an unexpected shift in attractor occurred and no increase in SDRP was observed.

While this is indeed an interesting effect, there are some questions and issues we would have raised, had we been reviewers of this paper:

Nature of Landscape Shift:

The authors plot task against difference scores between the single- and dual-task conditions on grounds of clarity.  We argue that the figure is in fact misleading, especially coupled with its use of graph lines between conditions; these give the impression of RPS directionality, i.e. that the significant shift in RP for performable sentences followed the non-significant rightward deviation in the inanimate condition.  We find it odd that the authors did not explicitly mention in the results that the significant shift was actually a leftward shift from the baseline.

Without access to the raw data for the detuned and non-detuned conditions at baseline and during sentence comprehension, it is also unclear whether this can be characterised as a leftward shift or as facilitation towards the desired phase, since the raw RPS scores reported are aggregates of the two.  Because of this, the aggregated baseline RPS lies at -8.66°, not 0°; inanimate sentence comprehension lies at -9.07° and performable sentence comprehension at -4.78°.  The leftward shift is briefly mentioned in the discussion, but could have benefitted from further explication in the results.  The authors report that right-handed participants appeared to be the culprits for the left-leading behaviour in the performable condition, but this is not substantiated by any formal analysis.

Temporal Resolution of Movement Analysis

The authors indicate that they chose to measure the global effects of the sentence types (averaging across all performable vs. inanimate sentences), rather than adopt an event-based design, but give no explanation of why this was opted for, whether for practicality’s sake or otherwise.  The current design does not allow analysis of potentially interesting data about how the movement changes over time during comprehension.

An event-based design would allow observation of when movement changes occur and how these link on a moment-to-moment basis to presentation of the stimulus. Here, sentences are not treated as discrete events, but rather as a continuous activity averaged over three sentence instances.  Temporal resolution could be further improved through auditory stimuli, or eye-tracking for visual stimuli, as movement data could then be accurately time-stamped in relation to the information present in the environment at any one time.  It is curious that the lack of temporal resolution was not mentioned, given the time spent in the discussion considering what sort of influence transient perturbations (comprehension of performable sentences) would have on the continuous control task under a simulation account.

Control of Facilitatory/Competition Effects

It would be helpful to investigate sentence comprehension that involves the same effector when comprehending both congruent or incongruent actions; there are key predictions in simulation theory about how this should affect neural resource-sharing.  Despite it being unclear why the shift took the form it did in the performable sentences condition, it would have been interesting to see whether actions predicted to facilitate or compete for resources under simulation theory would show qualitatively distinct shifts in behaviour.

The authors present this as a paradigm to be utilised in the study of embodied cognition.  While we applaud the use of a well-defined movement task with clear behavioural measures, we feel that the design could be smoothed out for future use.


Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22(4), 577-660.

Grossberg, S., Pribe, C., & Cohen, M. (1997). Neural control of interlimb oscillations: 1. Human bimanual coordination. Biological cybernetics, 77, 131–140.

Kugler, P. N., & Turvey, M. T. (1987). Information, natural law, and the self-assembly of rhythmic movement. Lawrence Erlbaum Associates, Inc.

Olmstead, A. J., Viswanathan, N., Aicher, K. A., & Fowler, C. A. (2009). Sentence comprehension affects the dynamics of bimanual coordination: Implications for embodied cognition. The Quarterly Journal of Experimental Psychology, 62(12), 2409-2417.

Introduction to mainstream Cognitive Science or: How we learned to stop worrying and love the representation

My first post will set out a very brief history of cognitive science.  While this will be woefully incomplete as a comprehensive historical account, the aim is to focus on some key movements that will hopefully shed light on the ideological shifts responsible for its current incarnation.

Psychology is generally accepted to have begun experimentally in the 1880s in the laboratory of Wilhelm Wundt, who focused his efforts on discovering the basic units of consciousness, such as sensations, through phenomenological report or “introspection”.

The early 1900s saw a shift in practice in the face of increasing criticism concerning the unreliability of introspective methods in a world where science aimed to champion the “objective”.  This led to the eventual development of behaviourism, an approach which switched the focus from fuzzy, undefined conceptions of mind and constructs (e.g. beliefs and feelings) to overt behaviour, which could be empirically tested and more easily experimentally controlled.  Those such as Watson and Skinner pioneered this science of behaviour, in which attempts were made to characterise human capacities in terms of stimulus-response associations and reinforcement schedules.

However, in the 1950s the American linguist Noam Chomsky famously reviewed Skinner’s account of language acquisition, questioning the capacity of operant conditioning to explain the complex and productive nature of language competence.  He argued that a fundamental part of the picture was missing, as mere exposure to natural language would be insufficient to account for the robust knowledge of language humans display.  Concerns were highlighted about the lack of information available in speech signals and also the lack of negative evidence (feedback on errors in production).  This is known as the poverty of the stimulus argument, and the conclusion drawn was that internal mental structures must exist to enrich the impoverished input.

This came during a period when the increasing prominence of information processing led to parallels being drawn between the way computers deal with information and how a mind might process sensory input.  The computer metaphor captivated the cognitive science community and had a profound impact on the direction Psychology would take up to the present day, in what is now termed the “cognitive revolution”.

The focus had turned full circle in under a century and was squarely back on the empirically inaccessible mind and its internal causal contents, whose nature could only be inferred.  It was not solely language that received this treatment.  The study of perception became a domain in which the basic building blocks started at the level of sense data.  Since stimulation of a sense cell contains no information about the causal stimulus itself, the job of the brain was conceived to be to derive and infer the content of the world on the basis of this sparse input.  The role of the environment in any explanatory sense was theoretically dispensable; the conception was not that it was the world, but the world-as-represented, which influenced behaviour.

This approach has enjoyed a lengthy period of stability in the history of the science, and countless models of human behaviour take the form of input/output models in which representations are computed over by something akin to a computer’s central processor.  Representations have done a lot of the heavy lifting for a wide range of cognitive tasks, with highly complex calculations ascribed to them in order to account for both simple and complex human behaviour.  It is rarely detailed how these calculations are performed, but such explanations are frequently believed to be on the horizon and thought unlikely to be a problem for the frighteningly complex human brain.

Mental representations nominally play the necessary explanatory role in most accounts of human behaviour and have to be presupposed, since they cannot be empirically accessed.  As a consequence, research is frequently dedicated, not to proving their existence, but to supposedly indirectly probing their character and modelling how they might be put to work.

Is this a problem or a scientific necessity?  For example, are mental representations providing a similar function to the hypothesised role of dark matter in astrophysics?  Can models simply not function without invoking mental representations?

One obvious benefit of invoking mental representations is their decouplability.  They appear to cater as a simple explanation of how we achieve competent behaviour when a stimulus is currently absent in the real world.  Things do not come in and out of existence for us just because they are not in contact with our senses at any one time, so the brain must surely contain some form of stand-in for the world.  This also seems to serve to explain highly competent behaviour that appears to require knowledge of future states of the world, such as knowing when and where to be to catch a fly ball.  How else could this be achieved, but through some hidden computation that allows prediction?

Language is perhaps the most obvious domain begging for a representational narrative.  With no lawful relationship between arbitrary sounds and their referents, a mental stand-in seems called for to explain how language and concepts can be used to control action.  How could we create sense-making but novel utterances, if there are not abstract structures in place to scaffold this?  Why else would children overgeneralise regular grammatical rules to irregular forms they had already mastered?

In essence, the resounding chorus in the cognitive science community was and continues to be: What else could it be?

A future post will look at an alternative approach to the current paradigm and why assumptions of mental representations are not as benign as they initially appear.