13 May 2008

Cool Idea #9764 - Recreating Audio With Video Clips to Produce a Video Collage

First let me explain the concept. Say "hello" out loud. Go ahead, people around you won't think you're weird. Maybe. An audio clip of your "hello" can be recorded digitally, and it can be loosely matched to an audio clip of me saying "hello". You can experience this kind of audio recognition in technology like an automated telephone response system. Call UPS if you've never interacted with one. This "loose" matching is useful, but my ideas are rarely useful. Only neat. What about "tight" matching? What if your "hello" clip was digitally altered to sound like mine? What if I took 1,000 audio clips of people saying "hello" and cut tiny bits and pieces out of each one to build a best-fit recreation of your clip? This is the basis of my concept.

Here is a clip from Live Free or Die Hard that's an example of a video collage cut together to recreate a written speech.

Cool, eh? The writers made a script for the message to be broadcast, then cut bits and pieces out of presidential speeches and edited them together into a video collage that delivers the message. This is a perfect example of where I'm coming from. Only I want to take it much further.

In the movie, the process started with text, and some editing crew manually searched for video clips that used the words from the speech. If this were to be automated, a computer would have to either:
- know how to sound out the original text in order to search audio clips for matches
- or have a text transcript of all the words spoken in the library of audio clips, so the search terms could be matched against it.

Forget all that, let's do something easier. Let's just take in audio as a source and match it to audio clips. Take text out of the equation. As input, say our sample audio is you saying the words "don't fly through the air on a banana". Then, we give a computer access to 1,000 feature-length movies. The algorithm could search for each complete word in the audio of its library and put together clips that match at some forgiving percentage. An unoptimized algorithm would probably take a few days of computing time, but that ain't too bad. At the end of it, you would have a totally cool video of compiled clips where the audio is the words (or close to them) that you originally recorded. Much like the clip from Live Free or Die Hard.
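
Here's a rough sketch of what that loose fit could look like. It assumes a lot: that some earlier step has already chopped both your recording and the movie audio into words, and that each word has been boiled down to a fixed-length list of numbers (a "fingerprint"). The `WordClip` class, the `features` field, and the `threshold` value are all made up for illustration. The matching itself is just cosine similarity with a forgiving cutoff.

```python
import math
from dataclasses import dataclass


@dataclass
class WordClip:
    movie: str      # hypothetical source title
    start: float    # seconds into the movie where the word begins
    end: float      # seconds into the movie where the word ends
    features: list  # hypothetical fixed-length audio fingerprint of the word


def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def loose_fit(input_words, library, threshold=0.8):
    """For each spoken word, pick the closest library clip, or None if nothing is close enough."""
    edit_list = []
    for features in input_words:
        best = max(library,
                   key=lambda clip: similarity(features, clip.features),
                   default=None)
        if best is not None and similarity(features, best.features) >= threshold:
            edit_list.append(best)
        else:
            edit_list.append(None)  # no clip met the forgiving percentage
    return edit_list
```

The returned list is basically an edit decision list: one movie, start time, and end time per word, in the order you said them. Any `None` is a word the library couldn't cover, so you'd either lower the threshold or add more movies.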

I'm not done yet.

What about exact (well, >90%) replication of your audio input? What if we even wanted the output to SOUND like you? Instead of returning entire words that are loosely similar to what you recorded, what if the algorithm took tiny snippets of length 1/30 of a second and did a best fit? It would take far, far longer to complete, but the result would be a visual assault of video collage. Scenes of movies would be absolutely whizzing past your head, but the audio you hear would be scarily similar to your own voice! This would be very exciting to experience.
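
A minimal sketch of the 1/30-of-a-second idea, under some loud assumptions: the audio is already a plain list of sample values, your recording and the library share one sample rate, and "best fit" means smallest squared error on the raw waveform. A real attempt would probably compare something smarter than raw samples. The function names here are made up for illustration.

```python
def split_frames(samples, sample_rate, fps=30):
    """Cut audio into back-to-back snippets of 1/fps seconds, dropping any partial tail."""
    size = sample_rate // fps
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def squared_error(a, b):
    """Sum of squared differences between two equal-length snippets."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_fit(input_samples, library_samples, sample_rate, fps=30):
    """For each input snippet, return the index of the closest library snippet.

    Brute force: every input snippet is compared against every library snippet.
    Assumes the library holds at least one full snippet.
    """
    target = split_frames(input_samples, sample_rate, fps)
    candidates = split_frames(library_samples, sample_rate, fps)
    return [
        min(range(len(candidates)),
            key=lambda j: squared_error(frame, candidates[j]))
        for frame in target
    ]
```

Library snippet number `j` starts at `j / fps` seconds, which is exactly one video frame at 30 frames per second, so the index list tells you which movie frame to show for each 1/30 of a second of output. The brute force is also why this takes so long: if you assume two hours per movie, 1,000 movies is 216 million library snippets, and every single input snippet gets compared against all of them.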

Another neat example would be to pass in audio of an explosion. The algorithm would likely pull millions of tiny clips of explosions from movies and string them all together to make a visual representation of the input.

I don't even want to think about the processing time for the precise reproduction algorithm, but what do you guys think about the loose fit algorithm? An application that could accept audio as an input and produce a collage of video clips matching the words?


  1. i should clarify, i didn't really read this post yet, i have a final in about 20 minutes and i'm totally "locked-inside-a-room-with-no-windows-for-too-many-days" punchdrunk but the clip made me think of this:
    i promise an intelligent response will eventually come...

  2. Would you have to take into account the quality of the sound? A guy saying "hello" in a movie made in 1960 will sound significantly different from a guy in a very recent movie, due to the quality of the recording equipment. Your computer might then try to match the snippets of static or other extraneous noises of the old movie to other pieces of audio and find nothing.


  3. @Kyle - that's why even the precise algorithm can only go up to ~90%. Even so, a clip of 1960s audio is still feasibly comparable.

  4. This type of thing should be pretty trivial pretty soon. I've heard that every minute there are 10 hours of content uploaded to YouTube!

    Still, it would take forever to search across ALL VIDEO IN THE KNOWN INTERTRUCKS, but there's no reason to do that. Much easier (and probably more visually pleasing) would be to specify some subset of video (presidential speeches, or Jerry Seinfeld standup routines, or Humphrey Bogart movies for instance), and force the program to work with only those 100 hours (or whatever) of content.

    Think about when all content is deeply tagged (probably by a machine process instead of a human, at that point.) I've often wanted to be able to just watch music montages from random sources. So now we're talking about only using portions of videos as the source to cut from...

    So what about a sentence i say that i want said by a specific person? (Bogart, baby!) Now we're on to a different approach, which i think actually already exists.

    I've heard that there is software that can listen to a person talk for 3 minutes and then reliably be able to generate speech that sounds like that person saying any arbitrary thing you want. Talk about scary! I wonder what is in store for "on-the-fly" video?

    Btw, do you ever sit down and do programming exercises when you have looney ideas, or do you just bother your friends to comment on them? :)

  5. Triangletom @ GameTap shared this link with me: http://www.sr.se/P1/src/sing/#