The president was seething.
His problem was with the press, yes, but also with the technology they used. Electronic media had changed everything. People were glued to their screens. “I have never heard or seen such outrageous, vicious, distorted reporting,” he said in a news conference.
The age of television news, Richard Nixon told reporters gathered that day in October 1973, was shaking the confidence of the American people. He didn’t yet know his presidency would reach a calamitous end. When Nixon announced he would resign, in August 1974, he spoke directly into a television camera. The recording remains stunning half a century later, mostly because of the historic nature of the moment, but also because of the power of broadcast.
Even in an informational era transformed by the web, video is a gripping format. In the chaos of real-time news, especially, there’s an advantage to being able to see something with your own eyes.
Or, there used to be.
Computer scientists can now make realistic lip-synched videos, essentially putting anyone’s words into another person’s mouth.
The animated GIF that you see above? That’s not actually Barack Obama speaking. It’s a synthesized video of Obama, made to appear as though he’s speaking words that were actually fed in from an audio file.
Obama was a natural subject for this kind of experiment because there are so many readily available, high-quality video clips of him speaking. To make a photo-realistic mouth texture, researchers had to feed in many examples of Obama speaking, layering that data atop a more basic mouth shape. The researchers used what’s called a recurrent neural network to synthesize the mouth shape from the audio. (This kind of system, loosely modeled on the human brain, can take in huge piles of data and find patterns. Recurrent neural networks are also used for facial recognition and speech recognition.) They trained their system using millions of existing video frames. Finally, they smoothed out the footage using compositing techniques applied to real footage of Obama’s head and torso.
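For readers curious what "recurrent" means in practice, here is a minimal sketch in Python. It is not the researchers' model: the weights are random and untrained, the feature sizes are invented, and the function name `rnn_mouth_shapes` is hypothetical. It only illustrates the general idea that each output frame depends on the current slice of audio plus a running memory of the audio that came before.

```python
import numpy as np

def rnn_mouth_shapes(audio_features, w_in, w_rec, w_out):
    """Map a sequence of audio feature vectors to mouth-shape vectors.

    A plain recurrent cell: the hidden state is updated once per frame,
    so each output depends on the current audio and on earlier frames.
    """
    hidden = np.zeros(w_rec.shape[0])
    shapes = []
    for x in audio_features:
        # Mix the new audio frame with the memory of previous frames.
        hidden = np.tanh(w_in @ x + w_rec @ hidden)
        # Read a mouth-shape vector out of the hidden state.
        shapes.append(w_out @ hidden)
    return np.stack(shapes)

# Illustrative sizes only (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n_audio, n_hidden, n_shape = 13, 32, 20
w_in = rng.normal(scale=0.1, size=(n_hidden, n_audio))
w_rec = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
w_out = rng.normal(scale=0.1, size=(n_shape, n_hidden))

audio = rng.normal(size=(100, n_audio))   # 100 frames of stand-in audio features
mouth = rnn_mouth_shapes(audio, w_in, w_rec, w_out)
print(mouth.shape)
```

In a real system the three weight matrices would be learned from footage like the video frames described above, rather than drawn at random.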
The researchers wrote a paper about their technique, and they plan to present their findings at a computer graphics and interactive techniques conference next month.
“The idea is to use the technology for better communication between people,” says Ira Kemelmacher-Shlizerman, a co-author of the paper and an assistant professor in the department of computer science and engineering at the University of Washington. She thinks this technology could be useful for video conferencing—one could generate a realistic video from audio, even when a system’s bandwidth is too low to support video transmission, for example. Eventually, the technique could be used as a form of teleportation in virtual reality and augmented reality, making a convincing avatar of a person appear to be in the same room as a real person, across any distance in space and time.
“We’re not learning just how to give a talking face to Siri, or to use Obama as your GPS navigation, but we’re learning how to capture human personas,” says Supasorn Suwajanakorn, a co-author of the paper. Not surprisingly, several major technology companies have taken notice: Samsung, Google, Facebook, and Intel all chipped in funding for this research. Their interest likely spans the realms of artificial intelligence, augmented reality, and robotics. “I hope we can study and transfer these human qualities to robots and make them more like a person,” Suwajanakorn told me.
Quite clearly, though, the technique could be used to deceive. People are already fooled by doctored photos, impostor accounts on social media, and other sorts of digital mimicry all the time.
Imagine the confusion that might surround a convincing video of the president being made to say something he never actually said. “I do worry,” Kemelmacher-Shlizerman acknowledged. But the good outweighs the bad, she insists. “I believe it’s a breakthrough.”
There are ways for experts to determine whether a video has been faked using this technique. Since researchers still rely on legitimate footage to produce portions of a lip-synched video, like the speaker’s head, it’s possible to identify the original video that was used to create the made-up one.
“So, by creating a database of internet videos, we can detect fake videos by searching through the database and see whether there exists a video with the same head and background,” Suwajanakorn told me. “Another artifact that can be an indication is the blurry mouth [and] teeth region. This may be not noticeable by human eyes, but a program that compares the blurriness of the mouth region to the rest of the video can easily be developed and will work quite reliably.”
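The blurriness check Suwajanakorn describes might look something like the sketch below. It is one possible approach under stated assumptions, not the researchers' program: sharpness is measured as the variance of a simple Laplacian filter, the mouth's location is assumed to be already known, and the function names and the 0.5 threshold are hypothetical.

```python
import numpy as np

def laplacian(gray):
    """4-neighbour Laplacian of a grayscale image; border pixels are left at zero."""
    g = np.asarray(gray, dtype=np.float64)
    lap = np.zeros_like(g)
    lap[1:-1, 1:-1] = (g[:-2, 1:-1] + g[2:, 1:-1] +
                       g[1:-1, :-2] + g[1:-1, 2:] - 4.0 * g[1:-1, 1:-1])
    return lap

def mouth_sharpness_ratio(frame, mouth_box):
    """Sharpness inside the mouth box divided by sharpness in the rest of the frame.

    mouth_box is (top, bottom, left, right) in pixels. Values well below 1
    mean the mouth region is blurrier than its surroundings.
    """
    top, bottom, left, right = mouth_box
    lap = laplacian(frame)
    mouth = np.zeros(lap.shape, dtype=bool)
    mouth[top:bottom, left:right] = True
    valid = np.zeros(lap.shape, dtype=bool)
    valid[1:-1, 1:-1] = True              # skip the border, where lap is zero
    mouth_var = lap[mouth & valid].var()
    rest_var = lap[~mouth & valid].var()
    return float(mouth_var / (rest_var + 1e-12))

def looks_suspicious(frame, mouth_box, threshold=0.5):
    """Flag a frame whose mouth region is much blurrier than the rest (assumed threshold)."""
    return mouth_sharpness_ratio(frame, mouth_box) < threshold
```

A real detector would also need to locate the mouth automatically and to tune the threshold on known genuine and synthesized footage; both steps are left out here.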
It also helps if you have two or more recordings of a person from different views, Suwajanakorn said. That’s much harder to fake. These are useful safeguards, but the technology will still pose challenges as people realize its potential. Not everyone will know how to seek out the databases and programs that allow for careful vetting, or even think to question a realistic-looking video in the first place. And those who share misinformation unintentionally will likely exacerbate the increasing distrust in experts who can help make sense of things.
“My thought is that people will not believe videos, just like how we do not believe photos once we’re aware that tools like Photoshop exist,” Suwajanakorn told me. “This could be both good and bad, and we have to move on to a more reliable source of evidence.”
But what does reliability mean when you cannot believe your own eyes? With enough convincing distortions of reality, it becomes very difficult to know what’s real.