Abstract:
Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, pre-trained Vision-Language Models (VLMs) have shown strong recognition ability in the HOI detection task. However, these VLM-based methods struggle to transfer knowledge well enough to achieve the desired performance. To this end, we propose a Dual-Branch Knowledge Enhancement Network with VLM (DBKEN-VLM) within the two-stage paradigm to enhance the effectiveness of the VLM. Specifically, we propose a semantic mining decoder to supplement our model with contextual and action-related semantic information. Together with a spatial guided decoder, it forms a dual-branch knowledge enhancement network. Furthermore, we propose a two-level fusion strategy for the dual-branch network to facilitate better knowledge transfer from the VLM. One level is feature-level fusion, which produces more instructive interaction features; the other is decision-level fusion, which further enhances the capability of the VLM for HOI detection. The proposed method achieves competitive performance compared to recent methods on two benchmark datasets, HICO-DET and V-COCO. © 2024 IEEE.
Year: 2024
Language: English
ESI Highly Cited Papers on the List: 0