Evidence for a Spoken Word Lexicon in the Auditory Ventral Stream

Srikanth R. Damera, Lillian Chang, Plamen P. Nikolov, James A. Mattei, Suneel Banerjee, Laurie S. Glezer, Patrick H. Cox, Xiong Jiang, Josef P. Rauschecker, Maximilian Riesenhuber


The existence of a neural representation for whole words (i.e., a lexicon) is a common feature of many models of speech processing. Prior studies have provided evidence for a visual lexicon containing representations of whole written words in an area of the ventral visual stream known as the visual word form area. Similar experimental support for an auditory lexicon containing representations of spoken words has yet to be shown. Using functional magnetic resonance imaging rapid adaptation techniques, we provide evidence for an auditory lexicon in the auditory word form area in the human left anterior superior temporal gyrus that contains representations highly selective for individual spoken words. Furthermore, we show that familiarization with novel auditory words sharpens the selectivity of their representations in the auditory word form area. These findings reveal strong parallels in how the brain represents written and spoken words, showing convergent processing strategies across modalities in the visual and auditory ventral streams.


Speech perception is perhaps the most remarkable achievement of the human auditory system and one that likely is critically dependent on its overall cortical architecture. It is generally accepted that the functional architecture of auditory cortex in human and nonhuman primates comprises two processing streams (Hickok & Poeppel, 2007; Rauschecker & Scott, 2009; Rauschecker & Tian, 2000). There is an auditory dorsal stream that is involved in the processing of auditory space and motion (van der Heijden et al., 2019) as well as in sensorimotor transformations such as those required for speech production (Archakov et al., 2020; Hickok et al., 2011; Rauschecker, 2011, 2018). There is also an auditory ventral stream specialized for recognizing auditory objects such as spoken words. This stream is organized along a simple-to-complex feature hierarchy (Rauschecker & Scott, 2009), akin to the organization of the visual ventral stream (Kravitz et al., 2013).

Visual object recognition studies support a simple-to-complex model of cortical visual processing in which neuronal populations in the visual ventral stream are selective for increasingly complex features and ultimately visual objects along a posterior-to-anterior gradient extending from lower-to-higher-order visual areas (Felleman & Van Essen, 1991; Hubel & Wiesel, 1977). For the special case of recognizing written words, this simple-to-complex model predicts that progressively more anterior neuronal populations are selective for increasingly complex orthographic patterns (Dehaene et al., 2005; Vinckier et al., 2007). Thus, analogous to general visual processing, orthographic word representations are predicted to culminate in representations of whole visual words: an orthographic lexicon. Evidence suggests that these lexical representations are subsequently linked to concept representations in downstream areas like the anterior temporal lobe (Damera et al., 2020; Lambon Ralph et al., 2017; Liuzzi et al., 2019; Malone et al., 2016). The existence of this orthographic lexicon in the brain is predicted by neuropsychological studies of reading (Coltheart, 2004). Indeed, functional magnetic resonance imaging (fMRI; Glezer et al., 2009, 2015) and, more recently, electrocorticographic data (Hirshorn et al., 2016; Woolnough et al., 2021) have confirmed the existence of such a lexicon in a region of the posterior fusiform cortex known as the visual word form area (VWFA; Dehaene & Cohen, 2011; Dehaene et al., 2005).

It has been proposed that an analogous simple-to-complex hierarchy exists in the auditory ventral stream as well (Kell et al., 2018; Rauschecker, 1998), extending anteriorly from Heschl’s gyrus along the superior temporal cortex (STC; DeWitt & Rauschecker, 2012; Rauschecker & Scott, 2009). Yet, the existence and location of a presumed auditory lexicon, that is, a neural representation for the recognition (and storage) of real words, has been quite controversial (Bogen & Bogen, 1976): The traditional posterior view is that the auditory lexicon should be found in posterior STC (pSTC; Geschwind, 1970). In contrast, a notable meta-analysis (DeWitt & Rauschecker, 2012) provided strong evidence for the existence of word-selective auditory representations in anterior STC (aSTC), consistent with imaging studies of speech intelligibility (Binder et al., 2000; Scott et al., 2000) and proposals for an auditory word form area (AWFA) in the human left anterior temporal cortex (Cohen et al., 2004; DeWitt & Rauschecker, 2012). Such a role of the aSTC is compatible with nonhuman primate studies that show selectivity for complex communication calls in aSTC (Ortiz-Rios et al., 2015; Rauschecker et al., 1995; Tian et al., 2001) and demonstrate, in humans and nonhuman primates, that progressively anterior neuron populations in the STC pool over longer timescales (Hamilton et al., 2018; Hullett et al., 2016; Jasmin et al., 2019; Kajikawa et al., 2015). In this anterior account of lexical processing, the pSTC and speech-responsive regions in the inferior parietal lobule (IPL) are posited to be involved in “inner speech” and phonological reading (covert articulation), but not auditory comprehension (DeWitt & Rauschecker, 2013; Rauschecker, 2011). Yet, despite this compelling alternative to traditional theories, there is still little direct evidence for an auditory lexicon in the aSTC.

Investigating the existence and location of auditory lexica is critical for understanding the neural bases of speech processing and, consequently, the neural underpinnings of speech processing disorders. However, finely probing the selectivity of neural representations in the human brain with fMRI is challenging, in part because it is difficult to assess the selectivity of these populations. Many studies have identified speech processing areas by contrasting speech stimuli with various nonspeech controls (Evans et al., 2014; Okada et al., 2010; Scott et al., 2000). However, these coarse contrasts cannot reveal what neurons in a particular auditory word-responsive region of interest (ROI) are selective for, for example, phonemes, syllables, or whole words. More sensitive techniques such as fMRI rapid adaptation (fMRI-RA; Grill-Spector & Malach, 2001; Krekelberg et al., 2006) are needed to probe the selectivity of speech representations in the brain and resolve the question of the existence of auditory lexica. In the current study, we used fMRI-RA to test the existence of lexical representations in the auditory ventral stream. Paralleling previous work in the visual system that used fMRI-RA to provide evidence for the existence of an orthographic lexicon in the VWFA (Glezer et al., 2009, 2015, 2016), we first performed an independent auditory localizer scan that we used to identify the AWFA in individual subjects, and then conducted three fMRI-RA scans that probed the representation in the AWFA and its plasticity. The first two scans consisted of real words (RWs) and pseudowords (PWs; i.e., pronounceable nonwords), respectively. These scans revealed an adaptation profile consistent with lexical selectivity in the putative AWFA for RWs, but not novel PWs, directly replicating results for written words in the VWFA. We then tested the lexicon hypothesis by predicting that training subjects to recognize novel PWs would add them to their auditory lexica, leading them to exhibit lexical-like selectivity in the AWFA following training, as previously shown for written words in the VWFA (Glezer et al., 2015). To do so, we conducted a third fMRI-RA scan after PW training. Results from this scan showed RW-like lexical selectivity to the now-familiar PWs following training, supporting the role of the AWFA as an auditory lexicon shaped by experience with auditory words.