Google is facing controversy among AI experts for a deceptive Gemini promotional video released Wednesday that appears to show its new AI model recognizing visual cues and interacting vocally with a person in real time. As reported by Parmy Olson for Bloomberg, Google has admitted that was not the case. Instead, the researchers fed still images to the model and edited together successful responses, partially misrepresenting the model’s capabilities.
“We created the demo by capturing footage in order to test Gemini’s capabilities on a wide range of challenges,” a spokesperson said. “Then we prompted Gemini using still image frames from the footage, & prompting via text,” a Google spokesperson told Olson. As Olson points out, Google filmed a pair of human hands doing activities, then showed still images to Gemini Ultra, one by one. Google researchers interacted with the model through text, not voice, then picked the best interactions and edited them together with voice synthesis to make the video.
Right now, running still images and text through massive large language models is computationally intensive, which makes real-time video interpretation largely impractical. That was one of the clues that first led AI experts to believe the video was misleading.