Scene graphs have become an important form of structured knowledge for tasks such as image generation, visual relation detection, visual question answering, and image retrieval. While visualizing and interpreting word embeddings is well understood, scene graph embeddings have not been fully explored. In this work, we train scene graph embeddings in a layout generation task with different forms of supervision, specifically introducing triplet supervision and data augmentation. We see a significant performance increase in both metrics that measure the goodness of layout prediction, mean intersection-over-union (mIoU) (52.3% vs. 49.2%) and relation score (61.7% vs. 54.1%), after the addition of triplet supervision and data augmentation. To understand how these different methods affect the scene graph representation, we apply several new visualization and evaluation methods to explore the evolution of the scene graph embedding. We find that triplet supervision significantly improves the separability of the embedding, which is highly correlated with the performance of the layout prediction model.
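To make the two ingredients named above concrete, the sketch below illustrates (in PyTorch) a standard triplet margin loss of the kind that "triplet supervision" typically denotes, alongside the box-level mIoU metric used to score layout predictions. This is a minimal illustrative example, not the authors' implementation; the function names and the (x1, y1, x2, y2) box format are assumptions made for the sketch.

```python
# Illustrative sketch (not the paper's released code): triplet supervision
# on scene graph embeddings and the mIoU metric for predicted layout boxes.
import torch
import torch.nn.functional as F


def triplet_supervision_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the embedding of a scene graph
    (anchor) toward a semantically similar graph (positive) and push it
    away from a dissimilar one (negative) by at least `margin`."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()


def mean_iou(pred_boxes, gt_boxes):
    """Mean intersection-over-union between predicted and ground-truth
    layout boxes, each a [N, 4] tensor of (x1, y1, x2, y2) coordinates."""
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_gt = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = area_pred + area_gt - inter
    return (inter / union.clamp(min=1e-6)).mean()
```

In this formulation, the triplet term is added to the layout regression objective, so the embedding is shaped both by how well it predicts boxes and by how well it separates dissimilar scene graphs, which is consistent with the separability improvement reported in the abstract.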
Authors
Subarna Tripathi
Deep Learning Data Scientist; Artificial Intelligence Product Group
Brigit Schroeder