Recreating Audio With Video Clips to Produce a Video Collage
First, let me explain the concept. Say "hello" out loud. Go ahead, people around you won't think you're weird. Maybe. An audio clip of your "hello" can be recorded digitally. It can then be loosely matched to an audio clip of me saying "hello". You can experience this kind of audio recognition in technology like an automated telephone response system. Call UPS if you've never interacted with one. This "loose" similarity is useful, but my ideas are rarely useful. Only neat. What about "tight" similarity? What if your "hello" clip was digitally altered to sound like mine? What if I took 1,000 audio clips of people saying "hello" and cut tiny bits and pieces out of each one to build a best-fit recreation of your clip? This is the basis of my concept.
Here is a clip from Live Free or Die Hard that's an example of a video collage cut together to recreate a written speech.
Cool, eh? The writers made a script for the message to be broadcast, then cut bits and pieces of audio from presidential speeches and edited them together to create a video collage to tell the message. This is a perfect example of where I'm coming from. Only I want to take it much further.
In the movie, the process started with text, and an editing crew manually searched for video clips that used the words from the speech. If this were to be automated, a computer would either:
- have to know how to sound out the original text in order to search audio clips for matches
- or have a transcript of every word spoken in the library of audio clips, against which to match the search text.
Forget all that, let's do something easier. Let's just take in audio as a source and match it to audio clips. Take text out of the equation. As input, say our sample audio is you saying the words "don't fly through the air on a banana". Then, we give a computer access to 1,000 feature-length movies. The algorithm could search for each complete word in the audio of its library and put together clips that match at some forgiving percentage. An unoptimized algorithm would probably take a few days of computing time, but that ain't too bad. At the end of it, you would have a totally cool video of compiled clips where the audio is the words (or close to them) that you originally recorded. Much like the clip from Live Free or Die Hard.
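To make the loose fit idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under some big assumptions: every spoken word (yours and the ones in the movies) has already been cut out and boiled down to a fixed-length list of numbers, and "matching at some forgiving percentage" is stood in for by cosine similarity with a threshold of 0.8. The names `loose_match`, `library`, and `threshold` are made up for this example, and the hard parts (finding word boundaries, extracting features) are left out entirely.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two equal-length feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def loose_match(input_words, library, threshold=0.8):
    """For each input word, pick the best library clip scoring at least `threshold`.

    input_words: list of feature vectors, one per spoken word (assumed precomputed).
    library:     list of (clip_id, feature_vector) pairs (assumed precomputed).
    Returns a list of clip ids, with None where nothing was close enough.
    """
    collage = []
    for word_features in input_words:
        best_clip, best_score = None, threshold
        for clip_id, clip_features in library:
            score = cosine_similarity(word_features, clip_features)
            if score >= best_score:
                best_clip, best_score = clip_id, score
        collage.append(best_clip)
    return collage


# Toy data: two library clips and three input "words".
library = [("movie1@00:10", [1.0, 0.0]), ("movie2@01:05", [0.0, 1.0])]
words = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
print(loose_match(words, library))
```

The third toy word sits exactly between the two clips, so it scores below the threshold and comes back as `None`. A real version would have to decide what to show for words it can't find.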
I'm not done yet.
What about exact (well, >90%) replication of your audio input? What if we even wanted the output to SOUND like you? Instead of returning entire words that are loosely similar to what you recorded, what if the algorithm took tiny snippets of length 1/30 of a second and did a best fit? It would take far, far longer to complete, but the result would be a visual assault of video collage. Scenes of movies would be absolutely whizzing past your head, but the audio you hear would be scarily similar to your own voice! This would be very exciting to experience.
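Here is a rough sketch of what that best fit could look like, again in plain Python and again under assumptions of my own: audio is a flat list of sample values, the library is one long list of samples, and "best fit" means the library snippet with the smallest sum of squared differences from the input snippet. The function names are invented for this example. Comparing raw samples like this is the crudest possible measure, so treat it as a picture of the search loop, not of what would actually sound good.

```python
def split_frames(samples, sample_rate, fps=30):
    """Cut a list of samples into back-to-back snippets of 1/fps seconds each."""
    size = sample_rate // fps
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def squared_error(a, b):
    """Sum of squared differences between two equal-length snippets."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_fit(input_samples, library_samples, sample_rate, fps=30):
    """For each input snippet, return the index of the closest library snippet."""
    targets = split_frames(input_samples, sample_rate, fps)
    candidates = split_frames(library_samples, sample_rate, fps)
    if not candidates:
        raise ValueError("library is too short to contain a single snippet")
    return [
        min(range(len(candidates)), key=lambda i: squared_error(t, candidates[i]))
        for t in targets
    ]


# Toy data: a sample rate of 60 makes each 1/30 s snippet two samples long.
library_audio = [0, 0, 1, 1, 5, 5]
my_voice = [0.9, 1.1, 4, 6, 0.1, 0]
print(best_fit(my_voice, library_audio, sample_rate=60))
```

Each returned index points at a 1/30 second slice of the library, which is also one video frame at 30 frames per second, so the list doubles as the cut list for the collage. The brute-force comparison of every input snippet against every library snippet is exactly why the processing time gets scary.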
Another neat example would be to pass in audio of an explosion. The algorithm would likely find millions of tiny clips of explosions from movies and string them all together to make a visual representation of the input.
I don't even want to think about the processing time for the precise reproduction algorithm, but what do you guys think about the loose fit algorithm? An application that could accept audio as input and produce a collage of video clips matching the words?