10 Cutting-Edge Phonetics Trends of the Future
Phonetics has come a long way since the good ol’ days of Daniel Jones and his colleagues in London at the turn of the twentieth century. Technology and mass communication have revolutionized the field of phonetics, allowing breakthroughs the founders would never have imagined. The following previews some of these amazing new directions.
Training computers to recognize human emotions in speech
Clearly, many situations exist where recognizing emotion in speech can be important. Think of how your voice may become increasingly tense as you wait on the phone for a computer operator to (finally) hand you over to a real person. Or more seriously, consider people working in emergency situations such as a 911 operator. Major, potentially life-threatening problems can occur if a 911 operator can’t understand what you’re saying.
Working with emotion in speech is a cutting-edge research topic in many laboratories worldwide. For instance, Dr. Carlos Busso at the University of Texas at Dallas has experimented with pairing computerized voices and visual heads expressing the emotions of anger, joy, and sadness. This work has compared the speech of actors with that of ordinary individuals in more naturalistic situations. From the audio recordings, Busso uses pitch features to classify emotions. He then uses motion-tracking technology to record speakers’ facial movements during speech. The findings show that certain regions of the face are more critical for expressing certain emotions than others.
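The pitch-based classification idea can be sketched in a few lines. The feature set below (mean, range, and variability of F0) and the class centroids are invented for illustration; they are not Busso’s actual features or measured data.

```python
# Toy sketch: classifying emotion from pitch (F0) statistics.
# Feature set and centroid values are hypothetical illustrations.
from statistics import mean, stdev

def pitch_features(f0_contour):
    """Summarize an F0 contour (in Hz) as (mean, range, variability)."""
    return (mean(f0_contour), max(f0_contour) - min(f0_contour), stdev(f0_contour))

# Made-up centroids: angry/joyful speech tends toward higher, more
# variable pitch; sad speech toward lower, flatter pitch.
CENTROIDS = {
    "anger":   (220.0, 120.0, 40.0),
    "joy":     (240.0, 100.0, 35.0),
    "sadness": (140.0,  30.0, 10.0),
}

def classify_emotion(f0_contour):
    """Pick the emotion whose centroid is nearest the contour's features."""
    feats = pitch_features(f0_contour)
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(CENTROIDS, key=lambda label: dist(CENTROIDS[label]))

# A low, flat contour lands on "sadness"; a high, variable one on "joy".
print(classify_emotion([135, 140, 138, 142, 136]))  # sadness
print(classify_emotion([200, 280, 220, 300, 240]))  # joy
```

A real system would extract F0 from recorded audio and learn the class centroids from labeled speech rather than hard-coding them.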
Linguists and scientists can now use the results from these studies to create more believable avatars (computerized human-like characters) and to better understand disorders such as Parkinson’s disease (in which disintegration of the nervous system causes a loss of facial expression) and autism (in which attending to facial cues appears to be a problem).
Animating silicon vocal tracts
You can understand the human vocal tract in different ways. One way is to study the human body through anatomy and physiology. Another is to construct models of the system and study the biomechanical properties of these creations. Silicon vocal tracts are a new type of model that can be used for speech synthesis, the manmade creation of speech by machine.
The beginning of speech synthesis actually goes back to the 1700s with a bagpipe-like talking machine consisting of leather bellows (to serve as the lungs) and a reed (to serve as the vocal folds). Although this system squeaked its way through speech, it wasn’t possible to decipher much about the speech source or filter by studying its components.
Today people remain fascinated by talking machines, including robots and humanoid creations. Such robots serve animation and other artistic purposes, and they also help researchers better understand anatomical systems.
Producing a human-like articulatory system isn’t simple. The human body has very specific density, damping, elasticity, and inertial properties that aren’t easy to replicate. The changing physical shapes of the vocal tract are also difficult to mechanically reproduce. For instance, the tongue is a muscular hydrostat that preserves its volume when changing shape. The tongue elongates when protruded and humps when retracted.
Dr. Atsuo Takanishi at Waseda University in Japan has spent decades perfecting a silicon head that can produce the vowels and consonants (including fricatives) of Japanese. You can watch movies of his various contraptions, including silicon vocal folds, motorized tongues, and gear-driven lips and face.
Getting tubular and synthetic
A more cerebral method of synthesizing speech than building robots involves making electronic or mathematical models of the speech production system. After researchers understand these complex systems, they can recreate and manipulate them in a computer to simulate the human system (albeit electronically). Gunnar Fant, who developed models of the relation between the human speech anatomy and formant frequencies, spearheaded this type of work in the 1950s. This enterprise also draws on the physical models of Hermann von Helmholtz, who described how single and coupled resonators shape input sound.
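The core idea behind such resonator models can be illustrated with the simplest textbook case: a uniform tube, closed at the glottis and open at the lips, resonates at odd quarter-wavelength frequencies, F_n = (2n − 1)·c / (4L). This is a standard approximation, not Fant’s full model.

```python
# Quarter-wavelength resonances of a uniform tube closed at one end
# (the glottis) and open at the other (the lips).
def tube_resonances(length_cm, n_formants=3, speed_of_sound_cm_s=35000):
    """Return the first few resonance (formant) frequencies in Hz,
    using F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * speed_of_sound_cm_s / (4 * length_cm)
            for n in range(1, n_formants + 1)]

# A 17.5 cm tube (a typical adult male vocal tract length) yields
# formants near 500, 1500, and 2500 Hz -- close to a neutral schwa.
print(tube_resonances(17.5))  # [500.0, 1500.0, 2500.0]
```

Real vocal tracts aren’t uniform tubes, of course; varying the cross-sectional area along the tube is what shifts these resonances into the formant patterns of different vowels.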
More recent versions of tube models are making breakthroughs with difficult problems, such as replicating the voices of women and children, as well as giving computers the illusion that they’re successfully singing. Brad Story, a professor at the University of Arizona, is working on a prototype called the tube talker. This system is based on modeled physiology of the vocal folds and the upper airway system. Its design incorporates video images of the vocal folds and MRI images of the vocal tract taken during speech. By using both articulatory and acoustic constraints, Story and his team can model and move virtual articulators to create smooth, speech-like movements. The result is a sound wave that can be listened to, analyzed, and compared to real speech.
The tube talker has been modified in some strange and interesting ways. For example, traditional models of speech treat the voice source and the filter as separate components. However, for some types of sung voice (and perhaps for children’s voices), this may not be the case. Recent versions of the tube talker have tested nonlinear interactions between source and filter to better model such types of voice and song.
Another model using tube-like designs has won a recent European speech synthesis song contest for not only making plausible spoken speech, but also for singing (you can witness the eerie spectacle of transparent 3D computerized vocal tracts, developed by Dr. Peter Birkholz, singing a duet).
Training with Baldi and other avatars
Instructional agents, such as avatars designed to be expert speakers of various languages, are another interesting trend in phonetics. Such systems can help instructors by providing additional practice with lesson plans, assisting in second-language training, working with the hard of hearing, or supporting individuals who have particular difficulty interacting with live speech partners (such as persons with autism).
Under the direction of Professor Dominic Massaro at the University of California at Santa Cruz, researchers have come up with a 3D talking head named Baldi, capable of doing many tasks. For instance, Baldi has helped Japanese students develop their English accent and has assisted in deaf education. In more recent versions, Baldi’s head has become transparent in order to better show his vocal tract so that learners of languages in which special tongue and pharynx positions are important (such as Arabic) can see what’s going on. Baldi has even sprouted legs, arms, and a body because an avatar’s gestures can in some situations add to a more effective language-learning situation. This type of research suggests that work with avatars can hold a bold and promising future for phonetics.
Helping the mute talk with silent speech interfaces
A silent speech interface (SSI) can be especially useful in military applications, such as for personnel in loud cockpits or vehicles, where noise prevents them from hearing themselves speak or from being picked up clearly by a microphone.
Furthermore, an SSI can help people who can’t produce audible sound from their vocal folds but whose articulators (tongue, lips, and jaw) still work. An artificial vocal source would alleviate this problem. If the position of the person’s tongue can be tracked in real time and this information fed to a computer, the two can be coupled with a voicing source and, presto, speech.
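A toy sketch of this coupling idea, assuming a tracker that reports a normalized tongue position: the mapping below from tongue position to formant targets follows the standard articulatory-acoustic relation (higher tongue, lower F1; fronter tongue, higher F2), but the numbers are rough illustrations, not any prototype’s actual values.

```python
# Hypothetical SSI step: turn a tracked tongue position into formant
# targets that a synthesizer could pair with an artificial voicing source.
def tongue_to_formants(frontness, height):
    """Map normalized tongue position (0 = back/low, 1 = front/high)
    to rough F1/F2 targets in Hz. Higher tongue -> lower F1;
    fronter tongue -> higher F2."""
    f1 = 800 - 500 * height      # roughly 800 Hz down to 300 Hz
    f2 = 900 + 1300 * frontness  # roughly 900 Hz up to 2200 Hz
    return f1, f2

# A high front tongue position yields /i/-like formant targets:
print(tongue_to_formants(1.0, 1.0))  # (300.0, 2200.0)
```

A full SSI would then filter a glottal pulse train through resonators at these frequencies, updated in real time as the tracker streams new tongue positions.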
Several exciting working prototypes for SSIs are currently under development. The following focus on articulatory acoustic principles and flesh-point articulator tracking technologies:
Researchers in South Africa are working on a system using electropalatography (EPG).
Scientists at the University of Georgia are exploring the use of a permanent magnet tracking system.
Other researchers are working on lip and tongue tracking systems.
The ultimate goal is for individuals who can’t speak because of the loss of the larynx to one day simply pull out their phone (or a device roughly that size), push a button, and have a high-quality synthesized voice speak for them as they articulate.
Visualizing tongue movement for stroke patients
Many individuals with left cortical brain damage have apraxia of speech (AOS), a problem controlling the production of speech sounds. Although these patients generally understand language fairly well, if they want to pronounce a certain sound, say “s” in the word “see,” the sound may come out wrong, such as “she.” AOS is very frustrating to patients because they typically know they’ve produced a sound in error. They commonly feel like they know what to say, but they just can’t get it out.
One principle proven to help these patients is practice (practice makes perfect), particularly because such individuals tend to stop speaking due to frustration, depression, and having other family members take over and speak for them. Another important therapeutic principle is articulatory training. The University of Texas at Dallas laboratory (in conjunction with colleagues at the University of Pittsburgh) is giving individuals with AOS visual feedback concerning the position of their tongue during speech. This intervention is based on the premise that individuals with AOS have a breakdown in sound sequencing and sound implementation, but their eye-to-tongue feedback monitoring systems are intact.
A number of studies have found that this method can help individuals with AOS increase the accuracy of their sound production after stroke. The work to date has relied on information from a single articulatory data point (such as the tongue tip). Future work will give patients a 3D avatar that shows them the online movement of their tongue while they speak. Doing so will permit treatment of a broader range of speech sounds and will allow clinicians to treat manner of articulation, as well as place.
Sorting more masculine voice from less masculine voice
A number of properties in the voice can actually indicate masculinity. Phoneticians have terms for this:
More masculine speech (MMS)
Less masculine speech (LMS)
MMS is lower in fundamental frequency (the pitch a person hears). The two also seem to differ in the spectral quality (how high pitched the hissiness is) of the fricatives. Also, MMS individuals have a less pronounced vowel space than individuals judged to be LMS (meaning LMS talkers use greater tongue excursions while talking).
Companies or governments may be able to use this information to design a male-versus-female voice detector for simple kinds of judgments, and perhaps an even more detailed detector (straight versus gay). However, conveying gender through speech is more complicated than a general approximation of the biological properties of the opposite sex. That is, despite what popular culture often implies, the speech of gay men doesn’t seem to be merely a feminized version of the speech of straight men (or the speech of lesbians a masculinized version of the speech of straight women).
Ron Smyth, a professor at the University of Toronto, has studied the differences between more and less gay-sounding male speech. His work reveals that the following complex mix of acoustic properties characterizes “gay-sounding speech”:
Vowels produced closer to the edges of the vowel space
Stop consonants with longer voice onset times (VOTs)
Longer /s/ and /ʃ/ fricatives with higher peak frequencies
More light “l” allophones
Smyth’s work also shows that many of these judgments depend on assumptions made by the listeners, the types of speech samples provided, and the gender and sexual orientation of the listeners themselves. Sexual orientation and speech is an ongoing research topic: researchers want to determine whether popular-cultural stereotypes are based on anything tangible and whether people’s perception of sexual orientation (gay people’s self-proclaimed gaydar) is what it claims to be. (Smyth’s work has shown that gaydar based on speech alone usually isn’t reliable.)
These issues relate to the field of sociolinguistics, the study of the relationship between language and society. Studies have shown, for instance, that young (heterosexual) men will lower their fundamental frequency when a young female questioner, rather than a male, walks into the room. These men are presumably making themselves attractive through a lower voice. If these findings are accurate, a researcher could assume that under the same experimental conditions, women would increase the breathiness of their voice, a characteristic known to increase the percept of more attractive female speech.
Figuring out the foreign accent syndrome (FAS)
Foreign Accent Syndrome (FAS) is a speech motor disorder in which adults present with foreign-sounding speech as the result of mistiming and prosodic abnormalities caused by a brain disorder. It continues to fascinate the public and scientists alike. Studying individuals with this disorder can potentially give a better picture of which brain systems are involved in producing and understanding accent.
So far, most FAS cases have involved native English-speaking individuals, although cases in other European languages are increasingly being recorded, and several non-Indo-European cases (Hebrew, Japanese, and Arabic) have now been documented. Researchers are interested in which varieties of languages are affected, and they question the extent to which stress- and/or syllable-based prosodic factors (commonly quantified with the Pairwise Variability Index (PVI)) play a role in whether such patients are perceived as foreign, and whether there are high-PVI and low-PVI FAS subtypes.
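The normalized version of the PVI mentioned above has a standard formula: the average of the absolute differences between successive durations, each normalized by the pair’s mean, times 100. A minimal sketch:

```python
# Normalized Pairwise Variability Index (nPVI): how much successive
# durations (e.g., vowel intervals in ms) differ, normalized locally.
def npvi(durations):
    """nPVI = 100/(m-1) * sum over successive pairs of
    |d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2)."""
    pairs = zip(durations, durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)

# Alternating long/short intervals (a stress-timed-like rhythm)
# score much higher than near-equal intervals (syllable-timed-like):
print(npvi([120, 60, 130, 50]))  # roughly 76
print(npvi([90, 95, 92, 94]))    # roughly 4
```

Stress-timed languages such as English, with alternating long and short intervals, tend toward higher nPVI values than syllable-timed languages such as Spanish.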
Another puzzle in the FAS picture is how cases that result from frank focal lesions (such as from stroke or tumor) can be related to those of less specific or unknown etiologies (such as migraine, allergy, or possibly psychogenic causes). An individual with a lesion in a well-established brain region known to correspond to speech function (like the perisylvian language zone) may be assumed to have a plausible cause for FAS. The situation for individuals with no known physiological cause is less clear.
Many patients referred to the clinic at the University of Texas at Dallas for suspected FAS have been diagnosed with conversion disorder, a condition in which patients experience neurological symptoms that medical evaluation can’t explain. Conversion disorder isn’t malingering (faking illness), and although it can affect speech, it isn’t the same thing as FAS. To best evaluate FAS, professionals should work closely in a team that ideally includes a psychologist and psychiatrist. Including phonetic tests to rule out intentional, inadvertent, or mimicked accent modification is also important.
Discovering the genetics of speech
Phoneticians have become more interested in the fast-moving and exciting field of genetics to find the basis of speech and language. A tumult started in the 1980s with the discovery of a family in West London that had a series of inherited speech and language problems. Among the members of this family (known as KE) were nine siblings. Four of these siblings had pronounced problems with comprehension, understanding sentences such as “The boy is being chased by the tiger” to mean “The boy is chasing the tiger.” They also dropped sounds at the beginning of words, such as saying “art” when intending to say “tart.” From such behavior, it became clear that something heritable was particularly affecting their speech and language.
In the mid-1990s, a group of Oxford University geneticists began to search for the damaged gene in this family. They found that the disorder resulted when a single copy of the gene was passed from one generation to the next (autosomal dominant) and that it wasn’t sex-linked. Further investigation pinned the gene to an area on chromosome 7, which was called Speech and Language Disorder 1 (SPCH1). The geneticists then pinpointed the precise location of the chromosome 7 breakage in the case of another child with a genetic speech and language disorder. It turned out to relate to the KE cases in an amazing way: Both involved something called Forkhead Box Protein 2 (FOXP2), a transcription factor that regulates other genes needed for neurological, gut, and lung development.
FOXP2 is associated with vocal learning in young songbirds, echolocation in bats, and possibly in other vocal-learning species, such as whales and elephants. Mice with human-FOXP2 genes spliced into their DNA emitted low funky squeaks and grew different neural patterns in their brains in regions involved with learning.
Like all exciting scientific stories, the FOXP2 story isn’t without controversy. Many popular reports of these discoveries make simplified claims, overlooking the multifactorial genetic basis for speech and language. For example, the descent of the human larynx (in comparison to the vocal tract of chimpanzees) was undoubtedly important in making speech physically possible. Yet this process doesn’t seem to be tied to FOXP2, suggesting that other gene loci are involved. Indeed, other genes are already emerging. FOXP2 switches off a gene called Contactin-associated protein-like 2 (CNTNAP2), which has been implicated in both specific language impairment (SLI) and autism. The protein CNTNAP2 encodes is deployed by nerve cells in the developing brain, particularly in circuits associated with language.
Matching dialects for fun and profit
Many people change their spoken accent through the course of a day to match the accent of the people to whom they’re talking. You can call this being an accent sponge, although it’s more technically referred to as dialect matching or register matching.
Dialect matching is quite natural for people. In fact, it has become one of the hot areas in computer speech recognition for the potential of matching a call-in telephone request with an online response matched in dialect. Because people seem to appreciate group membership, the idea is to have the computer quickly recognize your dialect and match you up with a phone buddy or computerized voice that matches you.
Researchers are designing computer systems with phone unit recognition and phone unit adaptation modules. Telephone systems using such technologies can determine the accent of the person calling, extract the features of that accent, and modify the synthesized voice responding to the caller to best match that person’s accent. If done correctly, this can lead to greater intelligibility and perhaps a better subjective feeling in the conversation. On the other hand, if it’s not done well, people may feel mimicked or mocked. You can just imagine how this sort of thing could be used in computerized dating systems.
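The matching step itself can be sketched as a nearest-neighbor search over stored dialect profiles. The profiles and feature names below (mean F0, speech rate, r-fullness) are invented for illustration, not taken from any deployed system:

```python
# Toy sketch of dialect matching: given accent features extracted
# from a caller, pick the closest stored dialect profile.
import math

DIALECT_PROFILES = {
    # Hypothetical features: (mean F0 in Hz, rate in syll/sec, r-fullness 0-1)
    "dialect_a": (120.0, 4.5, 0.9),
    "dialect_b": (135.0, 5.5, 0.2),
}

def closest_dialect(caller_features):
    """Return the profile with the smallest Euclidean distance
    to the caller's feature vector."""
    return min(DIALECT_PROFILES,
               key=lambda d: math.dist(caller_features, DIALECT_PROFILES[d]))

print(closest_dialect((132.0, 5.2, 0.3)))  # dialect_b
```

A production system would normalize each feature before computing distances, so that the F0 dimension (in the hundreds of Hz) doesn’t swamp the others.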
Dialect matching is natural for orca whales, bottlenose dolphins, and spear-nosed bats, too. Orcas and dolphins use coordinated squeaks and whistles to decide whom they will hunt and travel with. Study of spear-nosed bats has shown that females match their calls to recruit other members of their roost when they find a rich food source and to collectively defend their food from other bats. According to biologists, these animal sounds are all cases of signaling for group membership.