The internet is filled with instructional videos that cover a wide range of topics, from cooking to life-saving techniques like the Heimlich maneuver.
However, finding a specific action in a long video can be time-consuming. Researchers are working to train computers to do this efficiently: a user describes the action they want to find, and an AI model locates it in the video.
Traditionally, teaching machine-learning models to do this required expensive, hand-labeled video data. A new, more efficient method developed by researchers at MIT and the MIT-IBM Watson AI Lab trains a model to perform this task, known as spatio-temporal grounding, using only videos and their automatically generated transcripts.
The researchers teach the model to understand an unlabeled video in two complementary ways: by examining fine-grained details to determine where objects are located (spatial information), and by considering the video as a whole to determine when an action occurs (temporal information).
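As a rough illustration of that two-pathway idea, and not the researchers' actual architecture, the PyTorch sketch below pairs a spatial branch, which keeps a per-frame grid of region features, with a temporal branch, which collapses each frame and models the sequence over time. The module names, layer choices, and tensor shapes here are assumptions made for the example.

```python
# A minimal sketch (not the paper's architecture): two pathways over the same
# unlabeled video, one spatial ("where") and one temporal ("when").
# Video tensors are assumed to have shape (batch, time, channels, height, width).
import torch
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Zooms in on fine-grained detail within each frame to localize objects (where)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)

    def forward(self, video):                   # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)  # treat every frame independently
        feats = self.conv(frames)               # (B*T, dim, H', W') grid of region features
        return feats.reshape(b, t, *feats.shape[1:])

class TemporalBranch(nn.Module):
    """Looks at the whole sequence to reason about when an action happens."""
    def __init__(self, dim=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)     # collapse space, keep the timeline
        self.proj = nn.Linear(3, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, video):                   # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        pooled = self.pool(video.reshape(b * t, c, h, w)).reshape(b, t, c)
        seq, _ = self.gru(self.proj(pooled))    # per-timestep features along the video
        return seq                              # (B, T, dim)

# Toy usage: two 16-frame clips.
video = torch.randn(2, 16, 3, 112, 112)
where = SpatialBranch()(video)                  # region-level (spatial) features
when = TemporalBranch()(video)                  # timeline-level (temporal) features
```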
Compared with other AI methods, their approach more accurately identifies actions in longer videos containing multiple activities. The researchers also found that training on spatial and temporal information simultaneously improves the model's performance.
Aside from enhancing online learning and virtual training processes, this technique could also be valuable in healthcare settings for quickly identifying key moments in diagnostic procedure videos.
Lead author Brian Chen, a recent graduate of Columbia University, conducted the research at the MIT-IBM Watson AI Lab with colleagues from MIT, Goethe University, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.
The researchers’ approach eliminates the need for expensive annotated video data by training on unlabeled instructional videos and their text transcripts. They split the learning into two parts: a global representation, which captures the overall actions in the video, and a local representation, which zooms in on the specific regions where those actions take place.
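To make the global/local split concrete, here is a hedged sketch of how such a model might be trained from video-transcript pairs; it is not the authors' published objective. It combines a contrastive loss over pooled whole-video and whole-transcript embeddings (global) with a second loss over per-segment and per-sentence features (local). The feature shapes, the pooling, and the InfoNCE-style loss are illustrative assumptions, and the features are assumed to come from upstream video and text encoders.

```python
# A minimal training-objective sketch assuming precomputed video and text features;
# the global/local split mirrors the description above but is not the exact method.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss: matching (video, text) pairs should score highest."""
    logits = a @ b.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(video_segments, transcript_sentences):
    """video_segments: (B, S, D) features for S segments of each of B videos.
    transcript_sentences: (B, S, D) features for the S narrated sentences of each video."""
    video_segments = F.normalize(video_segments, dim=-1)
    transcript_sentences = F.normalize(transcript_sentences, dim=-1)

    # Global representation: pool each whole video and whole transcript, then
    # require every video to match its own narration (overall actions).
    global_loss = info_nce(video_segments.mean(dim=1),
                           transcript_sentences.mean(dim=1))

    # Local representation: align each narrated sentence with the corresponding
    # video segment, zooming in on where and when specific actions take place.
    b, s, d = video_segments.shape
    local_loss = info_nce(video_segments.reshape(b * s, d),
                          transcript_sentences.reshape(b * s, d))

    return global_loss + local_loss             # the two objectives are trained jointly

# Toy usage with random tensors standing in for real video and text encoders.
video_feats = torch.randn(4, 8, 256, requires_grad=True)
text_feats = torch.randn(4, 8, 256)
loss = training_step(video_feats, text_feats)
loss.backward()
```

The transcript acts as free, automatically generated supervision here, which is why no hand-labeled annotations are needed in this setup.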
After creating a new benchmark for testing models on longer, uncut videos, the researchers found that their method outperformed other AI techniques at pinpointing actions. In the future, they aim to further develop their approach to automatically detect misalignments between text and narration, and to extend it to audio data.
This research, funded in part by the MIT-IBM Watson AI Lab, represents a significant advancement in AI models’ ability to understand video content.