Research Overview
Vision & Language
Building deep learning models that can reason about images, videos, and text.
3D & Embodied
Understanding and interacting with 3D environments and physical entities.
Affective Computing
Recognizing, interpreting, and responding to human emotions.
Music Intelligence
Generating lyrics and melodies, and synthesizing music and singing voices.
Vision & Language
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective
Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline
Movie101v2: Improved Movie Narration Benchmark
POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World
Respond in my Language: Mitigating Language Inconsistency in Response Generation based on Large Language Models
Affective Computing
ESCoT: Towards Interpretable Emotional Support Dialogue Systems
ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains
DialogueEIN: Emotion Interaction Network for Dialogue Affective Analysis
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition
3D & Embodied
SPAFormer: Sequential 3D Part Assembly with Transformers
Think-Program-reCtify: 3D Situated Reasoning with Large Language Models
Music Intelligence
Be with you (与你同在)
AI Song Contest 2023
Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis
UniLG: A Unified Structure-aware Framework for Lyrics Generation
PHONEIX: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation with Phoneme Distribution Predictor