Abstract:
Scene graph generation provides high-order semantic information by understanding the objects in an image and the relations between them. To improve performance, context fusion is widely used in scene graph generation, with LSTM and Vision-Transformer as the most common fusion modules. Both realize context fusion by stacking multiple basic units, which requires learning a large number of parameters. However, computational efficiency is crucial for scene graph generation, a mid-level semantic understanding task that supports downstream tasks. To simplify context fusion, this paper proposes ASCF-Net (Augmented Spatial Context Fusion Network), which computes the spatial context of a designated object by searching for its nearest neighbor objects with high relevance and strengthens that context with random noise. Without learned parameters, this computation essentially simulates the attention mechanism. Experiments on the VG dataset show that, on the same baseline, ASCF-Net uses 15.26% of the parameters of Bi-LSTM and 13.34% of the parameters of Vision-Transformer for context fusion while achieving higher performance than either fusion module. At the same time, with this simple fusion module ASCF-Net obtains competitive results on the VG dataset compared with mainstream scene graph generation models. © 2023 IEEE.
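The abstract describes a parameter-free fusion step: gather each object's nearest neighbors, weight them by relevance, and augment the fused context with random noise. The sketch below illustrates that idea only; the function name, the k and noise_std parameters, the use of box-center distance as the relevance measure, and the Gaussian noise are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_spatial_context(features, centers, k=3, noise_std=0.1):
    """features: (N, D) object features; centers: (N, 2) box centers."""
    n = features.shape[0]
    contexts = np.empty_like(features)
    for i in range(n):
        # Distance from object i to every other object (exclude itself).
        dists = np.linalg.norm(centers - centers[i], axis=1)
        dists[i] = np.inf
        neighbors = np.argsort(dists)[:min(k, n - 1)]
        # Parameter-free "attention": closer neighbors get larger weights.
        weights = 1.0 / (dists[neighbors] + 1e-6)
        weights = weights / weights.sum()
        context = (weights[:, None] * features[neighbors]).sum(axis=0)
        # Strengthen the fused context with random noise, as described.
        contexts[i] = context + np.random.normal(0.0, noise_std, context.shape)
    return contexts

# Example: 5 objects with 8-dim features and 2-D box centers.
feats = np.random.rand(5, 8)
ctrs = np.random.rand(5, 2)
fused = fuse_spatial_context(feats, ctrs)
```

Because the weights come directly from spatial proximity rather than from learned projections, no fusion parameters need to be trained, which is the efficiency argument the abstract makes against stacked LSTM or Vision-Transformer units.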
Year: 2023
Volume: 2023-June
Language: English