Cognitive / Developmental Brown Bag | Jim Magnuson (UConn)

Wednesday, November 28, 2018 12:00pm to 1:30pm

Location: 

Tobin 521B

EARSHOT: A minimal neural network model of human speech recognition that learns to map real speech to semantic patterns

James S. Magnuson, Heejo You, Hosung Nam, Paul Allopenna, Kevin Brown, Monty Escabi, Rachel Theodore, Sahil Luthra, Monica Li, & Jay Rueckl

One of the great unsolved challenges in the cognitive and neural sciences is understanding how human listeners achieve phonetic constancy (seemingly effortless perception of a speaker's intended consonants and vowels under typical conditions) despite a lack of invariant cues to speech sounds. Models of human speech recognition (mathematical, neural network, or Bayesian) have been essential tools for theory development over the last forty years. However, they have been of little help in understanding phonetic constancy, because most do not operate on real speech (they instead map a sequence of consonants and vowels to words in memory) and most do not learn. The few models that do work on real speech borrow elements from automatic speech recognition (ASR), but they do not achieve high accuracy and are arguably too complex to provide much theoretical insight.

Over the last two decades, however, advances in deep learning have revolutionized ASR, with neural network approaches that emerged from the same framework as those used in cognitive models. These deep ASR models offer little direct guidance for theories of human speech recognition because of their complexity. Our team asked whether we could borrow minimal elements from ASR to construct a simple cognitive model that works on real speech. The result is EARSHOT (Emulation of Auditory Recognition of Speech by Humans Over Time), a neural network trained on 1000 words produced by 10 talkers. It learns to map spectral slice inputs to sparse "pseudo-semantic" vectors via recurrent hidden units. The key element borrowed from ASR is the use of "long short-term memory" (LSTM) nodes, which have a memory cell and internal "gates" that allow individual nodes to become differentially sensitive to different time scales.

EARSHOT achieves high accuracy and moderate generalization, and it exhibits human-like phonological competition over time. Analyses of its hidden units, based on approaches used in human electrocorticography, reveal that the model learns a distributed phonological code for mapping speech to semantics that resembles responses to speech observed in human superior temporal gyrus. I will discuss the implications for cognitive and neural theories of human speech learning and processing.
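
To make the architecture described in the abstract concrete, the following is a minimal sketch of the spectral-slice-to-pseudo-semantic mapping: a single recurrent LSTM layer feeding a sparse output readout. The sketch assumes PyTorch, and the layer sizes, sparsity level, loss, and training targets are illustrative placeholders rather than the authors' actual EARSHOT configuration.

```python
# Minimal sketch of an EARSHOT-style architecture, as described in the abstract:
# spectral slices -> recurrent LSTM hidden layer -> sparse "pseudo-semantic" output.
# Dimensions, loss, and targets below are illustrative assumptions, not the authors' settings.
import torch
import torch.nn as nn

class EarshotSketch(nn.Module):
    def __init__(self, n_spectral=256, n_hidden=512, n_semantic=300):
        super().__init__()
        # LSTM nodes: a memory cell plus input/forget/output gates let units
        # become differentially sensitive to different time scales.
        self.lstm = nn.LSTM(input_size=n_spectral, hidden_size=n_hidden, batch_first=True)
        self.readout = nn.Linear(n_hidden, n_semantic)

    def forward(self, spectral_slices):
        # spectral_slices: (batch, time, n_spectral), one spectral slice per time step
        hidden, _ = self.lstm(spectral_slices)
        # Sigmoid readout approximates a sparse binary semantic pattern at every time step,
        # so word activation can be tracked as the input unfolds over time.
        return torch.sigmoid(self.readout(hidden))

model = EarshotSketch()
speech = torch.randn(8, 100, 256)                     # 8 utterances, 100 spectral slices each
targets = (torch.rand(8, 100, 300) < 0.05).float()    # sparse pseudo-semantic targets (illustrative)
loss = nn.BCELoss()(model(speech), targets)
loss.backward()
```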

Research Area: 

Cognition and Cognitive Neuroscience