Multi-level, Multi-aspect, Multi-modal

We live in a multi-modal world: we learn, think, and express ourselves through multiple modalities. AI systems should therefore be able to understand this multi-modal world.
Our research on building AI systems focuses on multi-level, multi-aspect, and multi-modal understanding.

AIM3 Research

Vision & Language
Building deep learning models that can reason about images, videos, and text.
3D & Embodied
Understanding and interacting with 3D environments and physical entities.
Affective Computing
Recognizing, interpreting, and responding to human emotions.
Music Intelligence
Generating lyrics, melodies, and synthesizing music and singing.


  • 2017-2023 TRECVID (Video to Text Description) Grand Challenge (Rank 1st)
  • 3rd-5th CVPR/ECCV Affective Behavior Analysis in-the-wild (Rank 1st)
  • CVPR 2021 ActivityNet Entities Object Localization (Rank 1st)
  • 2018-2020 CVPR “ActivityNet Dense Captioning Events in Videos” (Rank 1st)
  • CVPR 2020 The End-of-End-to-End: A Video Understanding Pentathlon (Rank 2nd)
  • 2017-2019 Audio-Visual Emotion Challenge (Rank 1st)
  • ICCV 2019 Outstanding Method Award in VATEX Video Captioning Challenge
  • 2019 Zhijiang Cup Global Artificial Intelligence Competition, Video Content Description Generation (Rank 1st, 300,000 RMB prize)
  • CVPR 2019 ActivityNet Large Scale Activity Recognition Challenge (ANET) Temporal Captioning Task (Winner)
  • ACM Multimedia 2019 Audio-Visual Emotion Challenge (Winner)
  • CVPR 2018 ActivityNet Large Scale Activity Recognition Challenge (ANET) Temporal Captioning Task (Winner)
  • ACM Multimedia 2017 Best Grand Challenge Paper Award
  • 2017 ACM Multimedia (Video to Language) Grand Challenge (Rank 1st)
  • 2016 ACM Multimedia (Video to Language) Grand Challenge (Rank 1st)
  • 2016 Audio-Visual Emotion Challenge (AVEC) (Rank 2nd)
  • 2016 MediaEval Movie Emotion Impact Challenge (Rank 1st)
  • 2016 Chinese Multimodal Emotion Challenge (MEC) (Rank 2nd)
  • 2016 NLPCC Chinese Weibo Stance Detection (Rank 1st)
  • "Spoken English Assistant" system in IBM Bluemix computing contest (2nd Place Prize)
  • 2015 ImageCLEF (Image Sentence Generation) Evaluation (Rank 1st)