Exploring videos with multimodal AI

Team RTL @ ICT with Industry workshop

Maurits van der Goes
RTL Tech

--

What happens when academia and industry meet for a hackathon? Next week, a new edition of the ICT with Industry workshop will take place. We thought this would be a good moment to look back at last year’s workshop. During the 2019 workshop, team RTL developed a prototype that improves the searchability of news videos. A talented team of computer vision researchers extracted multimodal semantics with deep learning, created a search engine, and enabled interactive exploration. It was a week filled with die-hard coding and fun side activities.

ICT with Industry is an NWO-funded initiative to increase collaboration between science and industry. Since 2014, it has brought PhD students, postdocs, and industry professionals together in multidisciplinary teams to work on interesting cases. Think of it as a week-long hackathon to develop creative solutions for business problems and accelerate innovation.

Our case

RTL Nieuws is the largest Dutch news service on commercial television. As the pace and volume of news publishing increase, the video archive keeps expanding, and the number of editors who rely on that archive is growing. Therefore, Marc Schreuder (deputy editor-in-chief) asked us to research automated annotation of news videos:

“Create a little robot to support the documentation department.”

Our head of documentation, Karen Kamperman, shared that the editors search in various ways. A list of query examples showed simple terms and even short sentences: “yellow vests” or “diaper with poo”. In a nutshell, our assignment for the week was to make a subset of 1,000 short videos searchable with freeform text.

The team

Company visit to the studio of RTL Nieuws

Team RTL working on this case consisted of diverse members with strong backgrounds in computer vision or information retrieval. The members were Devanshu Arya, Shuo Chen, Yunlu Chen, Sarah Ibrahimi, Thomas Mensink, Pascal Mettes, William Thong, Jiaojiao Zhao (University of Amsterdam), Arthur Barbosa Câmara (Delft University of Technology), Emiel van Miltenburg (Tilburg University), Daan Odijk, Tanja Crijns and Maurits van der Goes (RTL). Pascal acted as academic lead and Daan as case owner.

We had two main goals: 1) a multimodal semantic representation of videos, and 2) a matching approach for search. The latter was originally not in scope, as Marc had given us a semantic challenge. However, our discussion with Karen prompted us to bridge the vocabulary gap between the generated metadata and the editors’ queries.

Overview of the framework (Ibrahimi et al., 2019)

Research

Everyone quickly agreed that training our own deep learning models was not feasible in the available time. Instead, we relied on various pre-trained models. Four different models ran on the videos, covering objects, scenes, OCR, and actions. Detecting objects turned out to be relatively simple; object classification in the wild, however, is much harder. It was also quite straightforward to detect one person or multiple people in a single frame, but for the sake of time we didn’t pursue face detection. For scenes, we determined whether shots were indoors or outdoors, and detected different scene types and scene attributes. The videos contained three primary text elements: title boxes, subtitles, and person titles. Despite our efforts, we weren’t happy with the quality of the OCR. “Should we use OCR for elements the editors are adding themselves?” was a fair question from the audience; this metadata is preferably added by the editors or exported from the system they work with. The hardest part of the feature extraction was action detection, but we managed to get very good action classification results.
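To give an impression of what such a feature-extraction pass can look like, here is a minimal sketch that samples frames from a video with OpenCV and runs a pre-trained object detector from torchvision on them. This is a hypothetical example, not the workshop code: the choice of detector, the frame-sampling rate, the score threshold, and the abbreviated label mapping are all assumptions.

```python
# Minimal sketch: collect object labels from a video with a pre-trained detector.
# Not the workshop implementation; model choice, sampling rate, and threshold are assumptions.
import cv2
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

COCO_LABELS = {1: "person", 3: "car", 17: "cat", 18: "dog"}  # abbreviated mapping for illustration

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def detect_objects(video_path, every_n_frames=25, score_threshold=0.7):
    """Sample frames and collect high-confidence object labels for one video."""
    labels = set()
    capture = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                prediction = model([to_tensor(rgb)])[0]
            for label_id, score in zip(prediction["labels"], prediction["scores"]):
                if score >= score_threshold:
                    labels.add(COCO_LABELS.get(int(label_id), str(int(label_id))))
        frame_idx += 1
    capture.release()
    return sorted(labels)
```

The same pattern (sample frames, run a pre-trained network, keep high-confidence predictions) applies to the scene and action models; only the network and the label set change.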

Brainstorm and architecture sketching

It was a challenge to go from these rich feature sets to freeform search queries. The search function combines textual (OCR) and semantic (objects, scenes, and actions) matching. All information was stored in Elasticsearch. Using the Python microframework Flask and the borrowed CSS styling of RTL Nieuws, we made the results accessible. The visual features were ranked against the query, enabling exploration of the videos. It required some last-minute work, but it worked well. During the final presentation, Emiel showed that for the query “vuur” (fire) our prototype retrieved relevant videos, adjusting the results to his preferences with sliders. The icing on the cake of a great week.
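As an illustration of how such a combined, weighted search could be wired up, the sketch below exposes a Flask endpoint that queries an Elasticsearch index across several metadata fields, with per-modality boosts driven by slider values. The index name, field names, and default weights are assumptions for the example, not the workshop implementation.

```python
# Minimal sketch of a weighted multimodal search endpoint.
# Index and field names ("videos", "ocr_text", "objects", ...) and weights are assumptions.
from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

es = Elasticsearch("http://localhost:9200")
app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    # Slider values from the UI control how much each modality contributes to the ranking.
    weights = {
        "ocr_text": float(request.args.get("w_ocr", 1.0)),
        "objects": float(request.args.get("w_objects", 1.0)),
        "scenes": float(request.args.get("w_scenes", 1.0)),
        "actions": float(request.args.get("w_actions", 1.0)),
    }
    body = {
        "query": {
            "multi_match": {
                "query": query,
                # Elasticsearch field boosting: "field^weight".
                "fields": [f"{field}^{weight}" for field, weight in weights.items()],
            }
        }
    }
    hits = es.search(index="videos", body=body)["hits"]["hits"]
    return jsonify([{"id": hit["_id"], "score": hit["_score"]} for hit in hits])

if __name__ == "__main__":
    app.run(debug=True)
```

Moving the sliders simply changes the field boosts, so the same query can lean more on the OCR text or more on the visual semantics without re-indexing anything.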

Prototype demo

The technical details of this research are described in the corresponding paper ‘Interactive Exploration of Journalistic Video Footage through Multimodal Semantic Matching’ (Ibrahimi et al., 2019). This paper was accepted at ACM Multimedia 2019 for a demo session. The code of the interface is available on GitHub.

Closing

It was an intense week: working all day on a case to reach your goals and participating in social activities, all with people you just met. The multidisciplinary aspect worked really well. It was great to work with participants from different backgrounds: academia and industry, young and old, international and Dutch, PhD students, postdocs and professors, and different companies. The goal of the workshop is to transfer knowledge between industry and academia. We’ve succeeded in that, without a doubt.

Video has always been the core business of RTL. Besides news videos, RTL has a wide variety of video content: ranging from big studio shows to influencer clips. With video-on-demand services, this variety is increasing. Accordingly, understanding our content on a deeper level using new methods is vital. Open source solutions, knowledge, and experimentation can help you achieve great results. RTL continues to invest in this domain with VideoPipe. More on this project in a future RTL Tech blog.

With these results, we’re delighted to return to ICT with Industry. This year, a new team RTL is working on a multimodal emotion recognition case. What is the best way to detect emotions in video, audio, and text? A dataset with Goede Tijden, Slechte Tijden content is ready to be studied. Follow #ICTwithIndustry on Twitter from January 20 to 24 to see the progress on our case, as well as the cases of Beeld & Geluid, ATOS Homie, Koninklijke Bibliotheek, and TNO. Another exciting research week is coming up!
