Audio/Video Processing

It’s time to talk about the last topic from my list of possible solutions. This is by far the most difficult and is not yet a feasible option. However, it is definitely worth talking about.

Audio/Video processing. In particular, I’m talking about looking at video or listening to audio and characterizing it as… well, anything. Is the video a concert? a video game? animation? home video? Is the audio pop music? rock? rap? country? folk? Is it just talking? If you know who might be talking, is it possible to determine the speaker? For a human watching a video and listening to its audio, this is a simple task. You can tell just from looking at it whether it’s animated or real. You can easily tell a concert from a video game. You can differentiate different styles of music and tell apart different speakers. But for computers, this is a near-impossible task.

Images are extremely difficult to process. The computer is shown a matrix of colors representing the pixels of the image. It then has to use that information to figure out what each individual section represents, even though many things have similar shapes and colors, sizes vary from picture to picture, and images might not be entirely clear. A red bouncy ball may be indistinguishable from an apple. Even if the computer is looking for something specific, it has to account for the fact that the target object could appear at any size, position, or resolution and may not be shown in its entirety. So, if a single image is extremely difficult to process, consider how difficult a video (many, many images coming one after another) is to process…
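To make this concrete, here is a minimal sketch (using NumPy, with a tiny made-up 2×2 “image”) of what the computer actually “sees”: just a grid of numbers, with no inherent meaning attached to any of them.

```python
import numpy as np

# A tiny hypothetical 2x2 RGB "image": to the computer, it is only numbers.
# Each inner triple is one pixel's (red, green, blue) value from 0 to 255.
image = np.array([
    [[255, 0, 0], [254, 2, 1]],    # two nearly identical red pixels
    [[0, 0, 255], [10, 10, 240]],  # two nearly identical blue pixels
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): height, width, color channels

# Nothing here says "apple" or "bouncy ball" -- the computer only has raw
# values. Even telling "red" from "blue" means comparing numbers channel
# by channel; recognizing objects is many layers harder than that.
top_left = image[0, 0]
print(top_left)
```

A real photo is the same thing, just millions of these triples, and a video is thirty or more such grids every second.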

Similarly, audio is presented to the computer as a waveform: a long sequence of sample values. If the recording has multiple sources, their waveforms are mixed together into a single signal, and it is still very difficult to categorize anything unless the computer is told what patterns are characteristic of different categories (e.g. pop, rock, rap, etc.). Simply put, we don’t have precise definitions of those characteristics, and whatever estimations we have are just that: estimations.
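As an illustration, here is a minimal sketch (assuming NumPy) of how a computer “hears”: a waveform is just a list of sample values, and even something as basic as finding the dominant pitch of a clean, single-source tone requires a transform into the frequency domain.

```python
import numpy as np

# One second of a hypothetical pure 440 Hz tone (an A note), sampled at
# 8 kHz. To the computer, this is nothing but 8000 floating-point numbers.
rate = 8000
t = np.arange(rate) / rate
waveform = np.sin(2 * np.pi * 440 * t)

# Even recovering the pitch takes work: transform to the frequency domain
# with an FFT and find the strongest component.
spectrum = np.abs(np.fft.rfft(waveform))
freqs = np.fft.rfftfreq(len(waveform), d=1 / rate)
dominant = freqs[np.argmax(spectrum)]
print(dominant)  # 440.0
```

And this only works for a single clean tone; real recordings mix speech, music, and noise into one waveform, which is exactly why categorizing them is so hard.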

Why does this matter for automated copyright infringement detection systems like Content ID? Let me show you some examples of what we might want to do for the system.

  • Recognize a concert venue in a video. Just this alone can help categorize a video as music and can allow the system to be more strict with its flagging.
  • Recognize a video game based on screenshots of gameplay. This can help identify the game being played and can allow Let’s Players to play their games without worrying about the in-game music getting the video flagged by the system.
  • Separate talking from music. Even if people play music in the background, it would be great to know what part of the audio is just speech, so that can be left unmuted.
  • Identify voices from games based on samples. The more the audio matches in-game dialogue, the more obvious it is that a game is being played. Identifying the game can make it clear whether or not there is copyright infringement.

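To give a feel for how matching audio against known recordings might work, here is a toy sketch of the general fingerprint-and-match idea (this is my illustration, not YouTube’s actual algorithm): summarize each short window of audio by its strongest frequency bin, then measure how many windows agree with a reference clip.

```python
import numpy as np

def fingerprint(samples, window=1024):
    """Toy fingerprint: the peak FFT bin of each non-overlapping window."""
    peaks = []
    for start in range(0, len(samples) - window + 1, window):
        spectrum = np.abs(np.fft.rfft(samples[start:start + window]))
        peaks.append(int(np.argmax(spectrum)))
    return peaks

def similarity(a, b):
    """Fraction of aligned windows whose peak bins agree."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

# Two seconds of synthetic audio standing in for real recordings.
rate = 8000
t = np.arange(2 * rate) / rate
reference = np.sin(2 * np.pi * 440 * t)   # hypothetical "known" recording
same_clip = np.sin(2 * np.pi * 440 * t)   # the same recording again
other_clip = np.sin(2 * np.pi * 880 * t)  # a different recording

print(similarity(fingerprint(reference), fingerprint(same_clip)))   # 1.0
print(similarity(fingerprint(reference), fingerprint(other_clip)))
```

Real fingerprinting systems are vastly more robust than this (many spectral peaks, time offsets, tolerance to noise and speed changes), but the underlying idea of reducing audio to compact, comparable features is the same, and it shows why matching known dialogue or music is much more tractable than open-ended categorization.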
These are just a few examples of what audio/video processing could do. Most importantly, it can differentiate music from video games, which is where the big problem of copyright comes into play. There is a conflict of interests, as the music industry is much more strict with copyright than the video game industry. Being able to categorize videos as music videos/concerts/lyric videos, walkthroughs/reviews/Let’s Plays, and “other” would be a great step toward being able to enforce copyright without being overzealous and flagging game channels. That said, it may be too much effort for too little improvement over other alternatives.

So now I’d like to ask: Which of the options that I’ve presented seems the most feasible? What seems like it would work well? Would not work well? Is there anything I’ve missed?
