How NaturallySpeaking Learns to Recognize Speech
Computers are very smart when it comes to brain-straining things like playing chess and filling out tax returns, so you may think they’d be whizzes at “simple” activities like recognizing faces or understanding speech.
But after about 50 years of trying to make computers do these simple things, programmers have come to the conclusion that a skill isn’t simple just because humans master it easily. In fact, our brains and eyes and ears are chock-full of sophisticated sensing and processing equipment that still runs rings around anything we can design in silicon and metal.
We humans think it’s simple to understand speech because all the really hard work is done before we become conscious of it. To us, it seems as if English words just pop into our heads as soon as people open their mouths. The unconscious (or preconscious) nature of the process makes it doubly hard for computer programmers to mimic.
To get an idea of why computers have such trouble with speech, think about something they’re very good at recognizing and understanding: touch-tone phone numbers. Those blips and bloops on the phone lines are much more meaningful to computers than they are to people. Several important features, listed below, make the phone tones an easy language for computers. English, on the other hand, is different on every count.
The touch-tone “vocabulary” has only 12 “words” in it. After you know the tones for the ten digits plus * and #, you’re in. English, on the other hand, has hundreds of thousands of words.
None of the words sound the same. On the touch-tone phone, the “1” tone is distinctly different from the “7” tone. But English has homophones, such as new and gnu, and near homophones, like merrier and marry her. Sometimes entire sentences sound alike: “The sons raise meat” and “The sun’s rays meet,” for example.
All “speakers” of the language say the words the same way. Push the 5 button on any phone, and you get exactly the same tone. But an elderly man and a 10-year-old girl use very different tones when they speak; and people from Great Britain, Canada, and the United States pronounce the same English words in very different ways.
Context is meaningless. To the phone, a 1 is a 1 is a 1. How you interpret the tone doesn’t depend on the preceding number or the next number. But in spoken English, context is everything. The words to, two, and too all sound alike, so only the surrounding words tell you which one the speaker meant. It makes sense to “go to New York.” But it makes much less sense to “go two New York” or “go too New York.”
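To see just how small the touch-tone “language” is, here is a rough Python sketch of a complete touch-tone “dictionary.” It relies on the standard touch-tone (DTMF) layout, in which each key sounds one low tone for its row and one high tone for its column. The names TONE_TO_KEY and decode are made up for this illustration, and the sketch assumes the two frequencies in each tone have already been measured from the audio, which is a separate job.

```python
# Standard DTMF keypad: each key is one row tone plus one column tone (in Hz).
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477)
KEYPAD = ("123", "456", "789", "*0#")

# The entire 12-"word" vocabulary, built as a lookup table.
# (TONE_TO_KEY is an illustrative name, not part of any product.)
TONE_TO_KEY = {
    (row_hz, col_hz): key
    for row_hz, keys in zip(ROW_HZ, KEYPAD)
    for col_hz, key in zip(COL_HZ, keys)
}


def decode(tone_pairs):
    """Turn already-measured (row_hz, col_hz) pairs into keys.

    Each pair is looked up by itself; neighboring tones never matter.
    """
    return "".join(TONE_TO_KEY[pair] for pair in tone_pairs)


# The "5" key is the second row tone plus the second column tone.
print(decode([(770, 1336)]))
```

Every tone pair maps to exactly one key, every phone produces the same pairs, and each lookup ignores its neighbors. A table like this can’t be written for English: the entry for the sound of “to” alone would need three possible answers, and choosing among them would mean looking at the rest of the sentence.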