A new approach developed by researchers from MIT and elsewhere improves an AI model's ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help curate multimodal content through automatic video and audio retrieval.

Improving upon prior work from their group, the researchers created a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels. They adjusted how their original model is trained so it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made architectural tweaks that help the system balance two distinct learning objectives, which improves performance.

Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes.
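To make the idea of frame-level alignment concrete, here is a minimal sketch of a fine-grained audio-visual contrastive objective of the kind the article describes, paired with a weighted combination of two training objectives. All function names, tensor shapes, and the weighting coefficient are illustrative assumptions, not the researchers' actual implementation.

```python
# Illustrative sketch only: pairs each video frame with the audio
# window occurring at the same moment, treating all other frame/audio
# pairs in the batch as negatives (standard InfoNCE formulation).
import torch
import torch.nn.functional as F


def fine_grained_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Contrastive loss over per-frame visual and per-window audio embeddings.

    video_emb: (batch, time, dim) embeddings, one per video frame
    audio_emb: (batch, time, dim) embeddings, one per audio window
    """
    b, t, d = video_emb.shape
    v = F.normalize(video_emb.reshape(b * t, d), dim=-1)
    a = F.normalize(audio_emb.reshape(b * t, d), dim=-1)
    logits = v @ a.T / temperature      # similarity of every frame/audio pair
    targets = torch.arange(b * t)       # matching time steps are the positives
    # Symmetric loss covers both video-to-audio and audio-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))


def total_loss(video_emb, audio_emb, reconstruction_loss, contrastive_weight=0.01):
    """Balance two objectives: contrastive alignment plus a second term
    (e.g. a masked-reconstruction loss); the weight is a hypothetical value."""
    return (contrastive_weight
            * fine_grained_contrastive_loss(video_emb, audio_emb)
            + reconstruction_loss)


# Example usage with random embeddings: 4 clips, 8 time steps, 256 dims.
v = torch.randn(4, 8, 256)
a = torch.randn(4, 8, 256)
print(fine_grained_contrastive_loss(v, a))
```

In this sketch, the finer-grained correspondence comes from matching at the level of individual time steps rather than whole clips, and the explicit weighting coefficient is one simple way to keep the two objectives from dominating each other during training.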