Multi-modal Dialogue State Tracking for Playing GuessWhich Game

Author：

Pang, Wei | Duan, Ruixue | Yang, Jinfu | Li, Ning

Indexed by：

CPCI-S EI Scopus

Abstract：

GuessWhich is an engaging visual dialogue game that involves interaction between a Questioner Bot (QBot) and an Answer Bot (ABot) in the context of image-guessing. In this game, QBot’s objective is to locate a concealed image solely through a series of visually related questions posed to ABot. However, effectively modeling visually related reasoning in QBot’s decision-making process poses a significant challenge. Current approaches either lack visual information or rely on a single real image sampled at each round as decoding context, both of which are inadequate for visual reasoning. To address this limitation, we propose a novel approach that focuses on visually related reasoning through the use of a mental model of the undisclosed image. Within this framework, QBot learns to represent mental imagery, enabling robust visual reasoning by tracking the dialogue state. The dialogue state comprises a collection of representations of mental imagery, as well as representations of the entities involved in the conversation. At each round, QBot engages in visually related reasoning using the dialogue state to construct an internal representation, generate relevant questions, and update both the dialogue state and internal representation upon receiving an answer. Our experimental results on the VisDial datasets (v0.5, 0.9, and 1.0) demonstrate the effectiveness of our proposed model, as it achieves new state-of-the-art performance across all metrics and datasets, surpassing previous state-of-the-art models. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keyword：

Decision making | Speech processing

Author Affiliations：

[1] [Pang, Wei] Beijing Information Science and Technology University, Beijing, China
[2] [Duan, Ruixue] Beijing Information Science and Technology University, Beijing, China
[3] [Yang, Jinfu] Beijing University of Technology, Beijing, China
[4] [Li, Ning] Beijing Information Science and Technology University, Beijing, China



Source ：

ISSN： 0302-9743

Year： 2024

Volume： 14473 LNAI

Page： 554-565

Language： English

Cited Count：

WoS CC Cited Count： 0


ESI Highly Cited Papers on the List： 0
