A mere 5 seconds: that is all it takes for an AI to clone your voice. The clone wars have begun, and soon we may not know whether your favorite tune is being sung by a real person or an AI. Check this out:
According to John P Shea, there are examples of entire synthetic voice characters built from just a 5-second sample of human speech and then used to speak written text in the speaker’s native language and in other languages.
This work is described in a paper on arXiv, the preprint server hosted by Cornell University (https://arxiv.org/abs/1806.04558), filed under Computation and Language, by a team of smart folks at Google:
“We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples”
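To make the three-stage flow in that abstract a bit more concrete, here is a toy sketch in Python. It is not the paper's code and contains no neural networks: each function is a hypothetical stand-in that only mimics the shape of the data passed between stages (a fixed-size speaker embedding, a mel-spectrogram-like list of frames, and a waveform). The names `speaker_encoder`, `synthesizer`, `vocoder`, `clone_voice` and all the sizes are assumptions made up for illustration.

```python
import math

EMBED_DIM = 8  # illustrative embedding size (assumption, not the paper's value)
N_MELS = 4     # illustrative number of mel channels (assumption)
HOP = 5        # illustrative waveform samples produced per frame (assumption)


def speaker_encoder(reference_audio):
    """Stand-in for stage 1: variable-length audio in, fixed-size unit vector out."""
    sums = [0.0] * EMBED_DIM
    for i, sample in enumerate(reference_audio):
        sums[i % EMBED_DIM] += sample
    norm = math.sqrt(sum(v * v for v in sums)) or 1.0
    return [v / norm for v in sums]


def synthesizer(text, speaker_embedding):
    """Stand-in for stage 2: one fake mel frame per character, shifted by the embedding."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        frames.append([base + speaker_embedding[m % EMBED_DIM] for m in range(N_MELS)])
    return frames


def vocoder(mel_frames):
    """Stand-in for stage 3: expand every frame into HOP time-domain samples."""
    wave = []
    for frame in mel_frames:
        level = sum(frame) / len(frame)
        wave.extend(level * math.sin(2 * math.pi * k / HOP) for k in range(HOP))
    return wave


def clone_voice(reference_audio, text):
    """Chain the three stages: reference audio + text -> embedding, frames, waveform."""
    embedding = speaker_encoder(reference_audio)
    mel_frames = synthesizer(text, embedding)
    return embedding, mel_frames, vocoder(mel_frames)


# A fake "reference clip" and a short text, just to show the sizes at each stage.
embedding, mel_frames, waveform = clone_voice([0.1, -0.2, 0.3] * 50, "hello")
print(len(embedding), len(mel_frames), len(waveform))  # 8 5 25
```

The point to notice is that the embedding has the same size no matter how long the reference clip is, which is what lets a few seconds of speech steer the rest of the pipeline.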
The core technology is Google’s Tacotron 2 end-to-end speech synthesis. The intonation and nuance are remarkable, thanks in part to something called a neural vocoder. The speaker encoder was trained on speech from thousands of speakers, but it is unclear how long it takes to generate the synthesized voices, e.g. whether it is close to real time or requires significant computational resources and time. Still, the results are impressive, and you will be amazed when you hear the voices speaking languages other than the speaker’s own, something called Cross Language Voice Cloning, with varying degrees of accent control.
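For what "close to real time" would mean in practice, a common yardstick is the real-time factor: synthesis time divided by the duration of the audio produced, where a value at or below 1.0 means the system keeps up with playback. A minimal Python sketch follows; the function name and the timings are hypothetical examples, not measurements of this system.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Ratio of compute time to audio length; <= 1.0 means real time or faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical numbers: 2 seconds of compute to produce 4 seconds of speech.
print(real_time_factor(2.0, 4.0))  # 0.5
```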
It’s worth checking out the links below to explore the nuance and audio examples of this stuff.