Journal and Conference Publications

Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin
[pdf]   [code]
NeurIPS, 2023.
Explore and Tell: Embodied Visual Captioning in 3D Environments
Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin
[pdf]
ICCV, 2023.
Prompt-Oriented View-agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World
Boshen Xu, Sipeng Zheng, Qin Jin
[pdf]
ACM Multimedia, 2023.
Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences
Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin
[pdf]
ACM Multimedia, 2023.
Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation
Yuchen Liu, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma, Qin Jin
[pdf]
ACM Multimedia, 2023.
InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation
Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin
[pdf]
ACL, 2023.
UniLG: A Unified Structure-aware Framework for Lyrics Generation
Tao Qian, Zhong Tian, Jiatong Shi, Yuning Wu, Shuai Guo, Xiang Yin, Qin Jin
[pdf]
ACL, 2023.
Attractive Storyteller: Stylized Visual Storytelling with Unpaired Text
Dingyi Yang, Qin Jin
[pdf]
ACL, 2023.
Movie101: A New Movie Understanding Benchmark
Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin
[pdf]   [code]
ACL, 2023.
Rethinking Benchmarks for Cross-modal Image-Text Retrieval
Weijing Chen, Linli Yao, Qin Jin
[pdf]   [code]
SIGIR, 2023.
Knowledge Enhanced Model for Live Video Comment Generation
Jieting Chen, Junkai Ding, Wenping Chen, Qin Jin
[pdf]   [code]
ICME, 2023.
Open-Category Human-Object Interaction Pre-training via Language Modeling Framework
Sipeng Zheng, Boshen Xu, Qin Jin
[pdf]
CVPR, 2023.
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Nicholas Jing Yuan, Qin Jin, Jianlong Fu, Baining Guo
[pdf]   [code]
CVPR, 2023.
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
Linli Yao, Weijing Chen, Qin Jin
[pdf]   [code]
WWW, 2023.
PHONEIX: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation with Phoneme Distribution Predictor
Yuning Wu, Jiatong Shi, Tao Qian, Dongji Gao, Qin Jin
[pdf]   [code]
ICASSP, 2023.
Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language
Yuqi Liu, Luhui Xu, Pengfei Xiong, Qin Jin
[pdf]   [code]
AAAI, 2023.
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
[pdf]   [code]
AAAI, 2023.
MPMQA: Multimodal Question Answering on Product Manuals
Liang Zhang, Anwen Hu, Jing Zhang, Shuo Hu, Qin Jin
[pdf]   [code]
AAAI, 2023.
Multi-Modal Knowledge Hypergraph for Diverse Image Retrieval
Yawen Zeng, Qin Jin, Tengfei Bao, Wenfeng Li
[pdf]
AAAI, 2023.
Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval
Liang Zhang, Anwen Hu, Qin Jin
[pdf]
NeurIPS, 2022.
DialogueEIN: Emotion Interaction Network for Dialogue Affective Analysis
Yuchen Liu, Jinming Zhao, Jingwen Hu, Ruichen Li, Qin Jin
[pdf]
COLING, 2022.
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin
[pdf]
ECCV, 2022.
Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning
Sipeng Zheng, Shizhe Chen, Qin Jin
[pdf]
ECCV, 2022.
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Qi Zhang, Yuqing Song, Qin Jin
[pdf]
ECCV, 2022.
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
Yida Zhao, Yuqing Song, Qin Jin
[pdf]
SIGIR, 2022.
VRDFormer: End-to-End Video Visual Relation Detection with Transformers
Sipeng Zheng, Shizhe Chen, Qin Jin
[pdf]
CVPR, 2022.
M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database
Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, Haizhou Li
[code]
ACL, 2022.
Image Difference Captioning with Pre-Training and Contrastive Learning
Linli Yao, Weiying Wang, Qin Jin
[pdf]   [code]
AAAI, 2022.
Training Strategies for Automatic Song Writing: A Unified Framework Perspective
Tao Qian, Jiatong Shi, Shuai Guo, Peter Wu, Qin Jin
[pdf]
ICASSP, 2022.
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition
Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li
[pdf]
ICASSP, 2022.
Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training
Yuqing Song, Shizhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
[pdf]   [code]
ACM Multimedia, 2021.
Question-controlled Text-aware Image Captioning
Anwen Hu, Shizhe Chen, Qin Jin
[pdf]   [code]
ACM Multimedia, 2021.
Enhancing Neural Machine Translation with Dual-Side Multimodal Awareness
Yuqing Song, Shizhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
[pdf]
IEEE Transactions on Multimedia, 2021.
Speech Emotion Recognition via Multi-Level Cross-Modal Distillation
Ruichen Li, Jinming Zhao, Qin Jin
[pdf]
Interspeech, 2021.
Missing Modality Imagination Network for Emotion Recognition with Uncertain Missing Modalities
Jinming Zhao, Ruichen Li, Qin Jin
[pdf]
ACL, 2021.
MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation
Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin
[pdf]
ACL, 2021.
Towards Diverse Paragraph Captioning for Untrimmed Videos
Yuqing Song, Shizhe Chen, Qin Jin
[pdf]   [code]
CVPR, 2021.
Sequence-To-Sequence Singing Voice Synthesis With Perceptual Entropy Loss
Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin
[pdf]
ICASSP, 2021.
Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training
Jiatong Shi, Nan Huo, Qin Jin
[pdf]
Interspeech, 2020.
VideoIC: A Video Interactive Comments Dataset and Multimodal Multitask Learning for Comments Generation
Weiying Wang, Jieting Chen, Qin Jin
[pdf]   [code]
ACM Multimedia, 2020.
ICECAP: Information Concentrated Entity-aware Image Captioning
Anwen Hu, Shizhe Chen, Qin Jin
[pdf]   [code]
ACM Multimedia, 2020.
Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
Jingjun Liang, Ruichen Li, Qin Jin
[pdf]
ACM Multimedia, 2020.
Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs
Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
[pdf]
CVPR, 2020.
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu
[pdf]
CVPR, 2020.
Better Captioning With Sequence-Level Exploration
Jia Chen, Qin Jin
[pdf]
CVPR, 2020.
Skeleton-based Interactive Graph Network for Human Object Interaction Detection
Sipeng Zheng, Shizhe Chen, Qin Jin
[pdf]
ICME, 2020.
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Shizhe Chen, Qin Jin, Alexander Hauptmann
[pdf]
AAAI, 2019.
Cross-culture Multimodal Emotion Recognition with Adversarial Learning
Jingjun Liang, Shizhe Chen, Jinming Zhao, Qin Jin, Haibo Liu, Li Lu
[pdf]
ICASSP, 2019.
ActivityNet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Video
Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann
[pdf]
CVPR ActivityNet Large Scale Activity Recognition Challenge, 2019.
From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots
Shizhe Chen, Qin Jin, Jianlong Fu
[pdf]
IJCAI, 2019.
Generating Video Descriptions With Latent Topic Guidance
Shizhe Chen, Qin Jin, Jia Chen, Alexander G. Hauptmann
[pdf]
IEEE Transactions on Multimedia, 2019.
Speech Emotion Recognition in Dyadic Dialogues
Jinming Zhao, Shizhe Chen, Jingjun Liang, Qin Jin
[pdf]
Interspeech, 2019.
Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards
Yuqing Song, Shizhe Chen, Qin Jin
[pdf]
ACM Multimedia, 2019.
Visual Relation Detection with Multi-Level Attention
Sipeng Zheng, Shizhe Chen, Qin Jin
[pdf]
ACM Multimedia, 2019.
Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences
Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou
[pdf]
ACM Multimedia, 2019.
Relation Understanding in Videos
Sipeng Zheng, Xiangyu Chen, Shizhe Chen, Qin Jin
[pdf]
ACM Multimedia, Grand Challenge: Relation Understanding in Videos, 2019.
Adversarial Domain Adaption for Multi-Cultural Dimensional Emotion Recognition in Dyadic Interactions
Jinming Zhao, Ruichen Li, Jingjun Liang, Qin Jin
[pdf]
ACM Multimedia Audio-Visual Emotion Challenge (AVEC) Workshop, 2019.
Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019
Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu
[pdf]
ICCV VATEX Video Captioning Challenge, 2019.
YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension
Weiying Wang, Yongcheng Wang, Shizhe Chen, Qin Jin
[pdf]
EMNLP, 2019.
RUC_AIM3 at TRECVID 2019: Video to Text
Yuqing Song, Yida Zhao, Shizhe Chen, Qin Jin
[pdf]
NIST TRECVID, 2019.
Semi-supervised Multimodal Emotion Recognition With Improved Wasserstein GANs
Jingjun Liang, Shizhe Chen, Qin Jin
[pdf]
APSIPA ASC, 2019.
RUC+CMU: System Report for Dense Captioning Events in Videos
Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Alexander Hauptmann
[pdf]
CVPR ActivityNet Large Scale Activity Recognition Challenge, 2018.
Class-aware Self-Attention for Audio Event Recognition
Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
[pdf]
ICMR, 2018. (Best Paper Runner-up)
Multimodal Dimensional and Continuous Emotion Recognition in Dyadic Video Interactions
Jinming Zhao, Shizhe Chen, Qin Jin
[pdf]
Pacific-Rim Conference on Multimedia (PCM), 2018.
iMakeup: Makeup Instructional Video Dataset for Fine-grained Dense Video Captioning
Xiaozhu Lin, Qin Jin, Shizhe Chen, Yuqing Song, Yida Zhao
[pdf]
Pacific-Rim Conference on Multimedia (PCM), 2018.
Multi-modal Multi-cultural Dimensional Continuous Emotion Recognition in Dyadic Interactions
Jinming Zhao, Ruichen Li, Shizhe Chen, Qin Jin
[pdf]
ACM Multimedia Audio-Visual Emotion Challenge (AVEC) Workshop, 2018.
Video Captioning with Guidance of Multimodal Latent Topics
Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
[pdf]
ACM Multimedia, 2017.
Knowing Yourself: Improving Video Caption via In-depth Recap
Qin Jin, Shizhe Chen, Jia Chen, Alexander Hauptmann
[pdf]
ACM Multimedia, 2017. (Best Grand Challenge Paper)
Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition
Shizhe Chen, Qin Jin, Jinming Zhao, Shuai Wang
[pdf]
ACM Multimedia Audio-Visual Emotion Challenge (AVEC) Workshop, 2017.
Generating Video Descriptions with Topic Guidance
Shizhe Chen, Jia Chen, Qin Jin
[pdf]
ICMR, 2017.
Emotion Recognition with Multimodal Features and Temporal Models
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, Yong Qin
[pdf]
ICMI, 2017.
Facial Action Units Detection with Multi-Features and AUs Fusion
Xinrui Li, Shizhe Chen, Qin Jin
[pdf]
Automatic Face & Gesture Recognition (FG), 2017.
Boosting Recommendation in Unexplored Categories by User Price Preference
Jia Chen, Qin Jin, Shiwan Zhao, Shenghua Bao, Li Zhang, Zhong Su, Yong Yu
[pdf]
ACM Transactions on Information Systems (TOIS), 2016.
Video Emotion Recognition in the Wild Based on Fusion of Multimodal Features
Shizhe Chen, Xinrui Li, Qin Jin, Shilei Zhang, Yong Qin
[pdf]
ICMI, 2016.
Describing Videos using Multi-modal Fusion
Qin Jin, Jia Chen, Shizhe Chen, Yifan Xiong
[pdf]
ACM Multimedia, 2016.
Semantic Image Profiling for Historic Events: Linking Images to Phrases
Jia Chen, Qin Jin, Yifan Xiong
[pdf]
ACM Multimedia, 2016.
Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction
Shizhe Chen, Qin Jin
[pdf]
ACM Multimedia, 2016.
History Rhyme: Searching Historic Events by Multimedia Knowledge
Yifan Xiong, Jia Chen, Qin Jin, Chao Zhang
[pdf]
ACM Multimedia, 2016.
Detecting Violence in Video using Subclasses
Xirong Li, Yujia Huo, Qin Jin, Jieping Xu
[pdf]
ACM Multimedia, 2016.
Generating Natural Video Descriptions via Multimodal Processing
Qin Jin, Junwei Liang, Xiaozhu Lin
[pdf]
Interspeech, 2016.
Improving Image Captioning by Concept-based Sentence Reranking
Xirong Li, Qin Jin
[pdf]
Pacific-Rim Conference on Multimedia (PCM), 2016. (Best Paper Runner-up)
Video Description Generation using Audio and Visual Cues
Qin Jin, Junwei Liang
[pdf]
ICMR, 2016.
Exploitation and Exploration Balanced Hierarchical Summary for Landmark Images
Jia Chen, Qin Jin, Shenghua Bao, Junfeng Ye, Zhong Su, Shimin Chen, Yong Yu
[pdf]
IEEE Transactions on Multimedia (TMM), 2015.
Lead Curve Detection in Drawings with Complex Cross-Points
Jia Chen, Min Li, Qin Jin, Yongzhe Zhang, Shenghua Bao, Zhong Su, Yong Yu
[pdf]
Neurocomputing, 2015, 168: 35-46.
Image Profiling for History Events on the Fly
Jia Chen, Qin Jin, Yong Yu, Alexander G. Hauptmann
[pdf]
ACM Multimedia, 2015.
Persistent B+-Trees in Non-Volatile Main Memory
Shimin Chen and Qin Jin
[pdf]
VLDB, 2015.
Semantic Concept Annotation for User Generated Videos Using Soundtracks
Qin Jin, Junwei Liang, Xixi He, Gang Yang, Jieping Xu, Xirong Li
[pdf]
ICMR, 2015.
Speech Emotion Recognition With Acoustic And Lexical Features
Qin Jin, Chengxin Li, Shizhe Chen, Huimin Wu
[pdf]
ICASSP, 2015.
Detecting Semantic Concepts In Consumer Videos Using Audio
Junwei Liang, Qin Jin, Xixi He, Gang Yang, Jieping Xu, Xirong Li
[pdf]
ICASSP, 2015.
Does Product Recommendation Meet its Waterloo in Unexplored Categories? No, Price Comes to Help
Jia Chen, Qin Jin, Shiwan Zhao, Shenghua Bao, Li Zhang, Zhong Su, Yong Yu
[pdf]
SIGIR, 2014.
Semantic Concept Annotation of Consumer Videos at Frame-level Using Audio
Junwei Liang, Qin Jin, Xixi He, Xirong Li, Gang Yang, Jieping Xu
[pdf]
Pacific-Rim Conference on Multimedia (PCM), 2014.
Speech Emotion Classification using Acoustic Features
Shizhe Chen, Qin Jin, Xirong Li, Gang Yang, Jieping Xu
[pdf]
ISCSLP, 2014.