A Deep Multi-modal Explanation Model for Zero-shot Learning

IEEE Trans Image Process. 2020 Feb 28. doi: 10.1109/TIP.2020.2975980. Online ahead of print.

Abstract

Zero-shot learning (ZSL) has attracted significant attention due to its ability to classify images from unseen classes. To perform the ZSL classification task, learning visual and semantic embeddings has been the main research approach in the existing literature, whereas generating complementary explanations to justify the classification decision has remained largely unexplored. In this paper, we address a new and challenging task, explainable zero-shot learning (XZSL), which aims to generate visual and textual explanations that support the classification decision. To accomplish this task, we build a novel Deep Multi-modal Explanation (DME) model that incorporates a joint visual-attribute embedding module and a multi-channel explanation module trained in an end-to-end fashion. In contrast to existing ZSL approaches, our visual-attribute embedding is associated not only with the decision but also with new visual and textual explanations. For visual explanations, we first capture several attribute activation maps (AAMs) and then merge them into a class activation map (CAM) that indicates which regions of an image are relevant to the predicted class. Textual explanations are generated by the multi-channel explanation module, which jointly integrates three long short-term memory networks (LSTMs), each conditioned on a different feature representation. Additionally, we suggest that the DME model can retain explanatory consistency for similar instances and explanatory diversity for dissimilar instances. We conduct qualitative and quantitative experiments to assess the model on both ZSL classification and explanation. Specifically, ablation studies verify the effectiveness of the components of our model. Our results on three well-known datasets are competitive with prior approaches. More importantly, jointly training the embedding and explanation modules yields mutual performance improvements between ZSL classification and explanation. Finally, we analyze DME in depth to diagnose its advantages and limitations.
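
As a rough illustration of the visual-explanation step described in the abstract, the sketch below merges per-attribute activation maps into a single class activation map by weighting each map with the class's attribute signature. The function name, array shapes, and the use of the attribute vector as merging weights are assumptions made for illustration only, not the paper's actual implementation.

```python
import numpy as np

def merge_attribute_maps(attr_maps, class_signature):
    """Combine per-attribute activation maps (AAMs) into one class activation map (CAM).

    attr_maps:       array of shape (K, H, W), one spatial map per attribute
                     (hypothetical output of the embedding module).
    class_signature: array of shape (K,), the class's attribute vector used
                     here as merging weights (an assumption for this sketch).
    Returns an (H, W) map normalized to [0, 1].
    """
    cam = np.tensordot(class_signature, attr_maps, axes=1)  # weighted sum over attributes
    cam = np.maximum(cam, 0.0)                              # keep positive evidence only
    denom = cam.max()
    return cam / denom if denom > 0 else cam

# Toy usage: 5 attributes, 7x7 spatial maps.
rng = np.random.default_rng(0)
maps = rng.random((5, 7, 7))
signature = np.array([1.0, 0.0, 0.5, 0.2, 0.0])
cam = merge_attribute_maps(maps, signature)
print(cam.shape)  # (7, 7)
```

In practice the resulting map would be upsampled to the input image resolution and overlaid on the image to show which regions support the predicted class.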