With recent advances in AI, Intelligent Virtual Assistants (IVAs) have become a ubiquitous part of the home. We are now witnessing a confluence of vision, speech and dialog system technologies that enables IVAs to learn audio-visual groundings of utterances and to converse with users about the objects, activities and events surrounding them. As part of the 7th Dialog System Technology Challenges (DSTC7), for the Audio Visual Scene-Aware Dialog (AVSD) track, we explore the "topics" of the dialog as an important contextual feature to incorporate into the architecture, and we experiment with multimodal attention. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We present a detailed analysis of the experiments and show that some of our model variations outperform the baseline system presented for this task.
Authors
Saurav Sahay
Juan Jose Alvarado Leanos