Visual question answering (VQA) requires joint comprehension of images and natural language questions. Many questions cannot be answered directly or clearly from the visual content alone; they require reasoning over structured human knowledge, confirmed against the visual content. This paper proposes the visual knowledge memory network (VKMN) to address this issue. VKMN seamlessly incorporates structured human knowledge and deep visual features into memory networks in an end-to-end learning framework. Compared with existing methods that leverage external knowledge to support VQA, this paper places more emphasis on two missing mechanisms. The first is a mechanism for integrating visual content with knowledge facts. VKMN handles this by embedding knowledge triples (subject, relation, target) and deep visual features jointly into visual knowledge features. The second is a mechanism for handling multiple knowledge facts expanded from question-answer pairs. VKMN stores the joint embeddings in the memory network using a key-value pair structure, which makes it easy to handle multiple facts. Experiments show that the proposed method achieves promising results on both the VQA v1.0 and v2.0 benchmarks, while outperforming state-of-the-art methods on knowledge-reasoning related questions.
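The key-value memory idea described above can be illustrated with a small sketch. This is not the authors' implementation: the element-wise product used for fusion, the split of each fact into a key part and a value part, the dot-product attention, and all names (`joint_embed`, `read_memory`, the toy vectors) are assumptions chosen only to show how a query might attend over several stored visual knowledge features at once.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def joint_embed(visual, fact_part):
    """Hypothetical fusion: element-wise product of a visual feature
    vector and a knowledge-fact embedding (an assumption, not the
    fusion used in the paper)."""
    return [v * f for v, f in zip(visual, fact_part)]


def read_memory(query, keys, values):
    """Key-value memory read: attend over keys, return weighted values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy data: one visual feature vector and two knowledge facts.
# In each fact, "key" stands in for an embedded part of the triple
# and "value" for the remaining part (illustrative split only).
visual = [1.0, 0.5, 2.0]
facts = [
    {"key": [1.0, 0.0, 0.0], "value": [1.0, 0.0]},
    {"key": [0.0, 0.0, 1.0], "value": [0.0, 1.0]},
]

keys = [joint_embed(visual, fact["key"]) for fact in facts]
values = [fact["value"] for fact in facts]

query = [0.0, 0.0, 1.0]  # toy stand-in for a question embedding
weights, answer_vec = read_memory(query, keys, values)
```

Because every fact occupies its own memory slot, adding more facts only lengthens `keys` and `values`; the read step stays the same, which is the practical appeal of the key-value layout.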
Authors
Yurong Chen
Senior Research Director & Principal Research Scientist, Cognitive Computing Lab, Intel Labs China
Zhou Su
Chen Zhu
Yinpeng Dong