
Synthesized Voices Just Got More Realistic

 (Ole_CNX via Getty Images)

When ChatGPT launched advanced voice mode last month, adding accents and ‘umms’ and even taking audible breaths, some users called it surreal. Humans have been trying to make synthesized voices sound more natural for nearly a century. We talk about why, and we trace the history and evolution of synthesized voices, from film robots like HAL and C-3PO to digital assistants like Alexa.

Guests:

Kyle Orland, senior gaming editor, Ars Technica

Sarah A. Bell, associate professor of digital media, Michigan Technological University; author, “Vox ex Machina: A Cultural History of Talking Machines”


The Rapid Evolution of Artificial Voices

Kyle Orland, senior gaming editor for Ars Technica, tried ChatGPT’s new advanced voice mode and says it’s disconcertingly realistic. “It was the first time I’d really heard something where the intonation, the pauses, the little laughs that it put in or the ability to mimic accents there, it was just a new plateau…in mimicking human voices.” Orland believes this level of vocal realism can foster an artificial “emotional relationship” with AI, making us forget “it has no more thoughts behind it than it did a year ago when it was just text.”

Similarly, Google’s NotebookLM, which generates “podcasts” from uploaded text, surprised Orland with its naturalistic back-and-forth dialogue: “It’s engaging enough that I could see using this as like a CliffsNotes way to summarize complex documents or books.”

Early Talking Machines and Innovations

One of the earliest artificial voices, according to Sarah Bell, author of “Vox ex Machina: A Cultural History of Talking Machines,” was created by Wolfgang von Kempelen in 1791. Kempelen’s contraption relied on bellows to push air across bagpipe reeds and through a rubber funnel that could be manipulated like a mouth. It later inspired Alexander Graham Bell’s work on speech synthesis and the telephone. The 1939 World’s Fair introduced the Voder (Voice Operating Demonstrator), an early electronic voice built by Bell Labs to showcase its telephone research. The Voder required a human operator at a keyboard to make the machine talk. Despite sounding flat, it garnered fame through radio broadcasts. Even then, some warned of its potential for manipulation, fears echoed today in concerns about deepfakes of political voices.

Design Choices and Consequences

Early consumer products like the 1970s Speak & Spell fostered acceptance through gamification: “We learned that these talking computers were really sort of companionable,” Bell noted. Integrating voices into everyday life normalized the technology despite some parents’ initial misgivings about the cost and the longevity of their children’s interest. She also emphasized that synthesized voices reflect intentional design choices by their creators, as when Speak & Spell designers opted against having the machine “blow raspberries” at wrong answers to avoid reinforcing mistakes.

Voice assistants like Siri were deliberately given female-sounding voices and personalities to enhance engagement, though this has reinforced gender stereotypes about subservient roles. Bell said we need to keep that in mind and ask, “Is that benefiting us or not?”

Ethical Considerations

While synthetic voices offer conveniences, their increasing sophistication sparks worries about privacy violations, social isolation, and the erosion of human authenticity. One caller described the secrets shared with an AI “girlfriend” as “data for the AI to use and for the company to use for profit. I find it all very disturbing.”

Regulation may be needed to “watermark” AI-generated content, but Bell cautioned that not every ramification can be forecast: “We don’t know, and can’t exactly anticipate all of the…consequences… we just need to have these conversations about where are our limits.” Still, she expressed optimism that we’ll see legislation to protect consumer privacy.

This content was edited by the Forum production team but was generated with the help of AI.
