Most people have probably heard that images and videos can be manipulated. The spectrum ranges from airbrushed-away fat pads or wrinkles, to people or subjects spliced into other images, to the wholesale replacement of faces and facial expressions. What is not yet so well known is that "real" voices can now also be generated artificially. This happened in May, for example, when an Angela Merkel lookalike in a commercial delivered sentences, in an almost perfect imitation of the Chancellor's voice, that the head of government would hardly say in public.
The basis for this is an enormous advance in text-to-speech synthesis. On the one hand, it makes it possible to develop new products or improve existing ones, such as voice assistants, navigation systems or accessibility systems for the visually impaired. On the other hand, a person's voice can also be generated artificially in this way, provided enough speech material from that person is available to train a neural network. This opens the door to criminals using synthetic voices for fraud or for political manipulation. In the latter case, such so-called deepfakes, media content that has been manipulated in a targeted and fully automated way using artificial intelligence, could influence election outcomes or even trigger wars.
"The data it takes to train AI appropriately on the voice can be extracted anywhere people communicate digitally," explains Prof. Dr. Andreas Schaad from Offenburg University. In the master's program Enterprise and IT Security, he and students Vanessa Barnekow, Dominik Binder, Niclas Kromrey, Pascal Munaretto and Felix Schmieder have therefore tested in a project how much, or rather how little, a computer- or informatics-savvy person needs to generate an audio clone with a reasonable amount of effort, limited computing resources and no prior knowledge in the field of speech synthesis. The prepint for this project is available at the following link arxiv.org/abs/2108.01469, with the professor himself volunteering as the test subject. "Even less than three hours of high-quality audio material from my online lectures was enough to train the AI," said Andreas Schaad himself, amazed at how sophisticated the technology has become. In a subsequent study with 102 test subjects, only just under 40 percent were able to distinguish his real voice from the fake one.
The project team first obtained audio clips with a minimum length of half a second and a maximum length of 30 to 40 seconds. They converted these into written text or used the transcriptions already attached to the clips. From these transcripts, the participants removed unwanted characters, converted everything to lowercase, wrote out all numbers, replaced abbreviations with the full words, and, where necessary, inserted phonemic spellings in which a written symbol corresponds to the sound actually spoken. In addition, they inserted some sentences that were never actually said, for example "send all exam papers to ..." or "Please enter an A for Mr. Müller". The neural network was then trained on the audio clips to capture the voice characteristics and on the transcriptions, including the inserted sentences, to learn the text to be spoken; both were combined into new audio clips. Afterwards, both the real and the fake audio clips were played to the test subjects, with the result already mentioned.
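The transcript cleaning described above can be illustrated in a few lines of code. The following Python sketch is only an illustration under assumptions made for this article: the abbreviation list, the digit table and the sample sentence are invented, and the project's actual pipeline, which would also need a full number-to-words conversion and the subsequent speech-synthesis training, is not reproduced here.

```python
import re

# Invented examples for illustration; the project's actual lists are not published.
ABBREVIATIONS = {
    "prof.": "professor",
    "dr.": "doktor",
    "z.b.": "zum beispiel",
}

# Minimal digit table (German); a real pipeline would spell out whole numbers,
# for example with a number-to-words library.
DIGITS = {
    "0": "null", "1": "eins", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}


def normalize_transcript(text: str) -> str:
    """Apply the cleaning steps described in the article to one transcript line."""
    text = text.lower()  # convert everything to lowercase

    # Replace abbreviations with the full word before stripping punctuation,
    # so that "dr." still matches.
    for abbreviation, full_word in ABBREVIATIONS.items():
        text = text.replace(abbreviation, full_word)

    # Write out numbers (here digit by digit, for simplicity).
    text = re.sub(r"\d", lambda m: f" {DIGITS[m.group(0)]} ", text)

    # Remove unwanted characters, keeping letters, umlauts and basic punctuation;
    # phonemic respellings would be inserted manually where needed.
    text = re.sub(r"[^a-zäöüß .,?!']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_transcript("Prof. Dr. Schaad hält am 3.5. eine Vorlesung über IT-Sicherheit!"))
```

In the project, transcripts cleaned in this manner were then paired with the corresponding audio clips to train the speech model, the step deliberately left out of this sketch.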
The task now is to find suitable means of detecting such deepfakes, a field in which Prof. Dr. Janis Keuper at the Institute for Machine Learning and Analytics (IMLA) at Offenburg University of Applied Sciences has already done a great deal of work on image and video material. Prof. Dr. Andreas Schaad would now like to achieve the same for audio material and has therefore submitted a corresponding project application together with, among others, the Deutsche Presse-Agentur (dpa) and New Work SE, the operator of the social network Xing.