Relation-Aware Heterogeneous Graph Network for Learning Intermodal Semantics in Textbook Question Answering

IEEE Trans Neural Netw Learn Syst. 2024 Apr 23:PP. doi: 10.1109/TNNLS.2024.3385436. Online ahead of print.

Abstract

The textbook question answering (TQA) task aims to infer answers to given questions from a multimodal context consisting of text and diagrams. Existing studies aggregate intramodal semantics extracted from a single modality but have yet to capture the intermodal semantics between different modalities. A major challenge in learning intermodal semantics is preserving intramodal semantics without loss while bridging the semantic gap caused by heterogeneity. In this article, we propose an intermodal relation-aware heterogeneous graph network (IMR-HGN) to extract intermodal semantics for TQA, which aggregates different modalities during feature learning rather than representing them independently. First, we design a multidomain consistent representation (MDCR) to eliminate semantic gaps by capturing intermodal features while preserving lossless intramodal semantics across multiple domains. Furthermore, we present neighbor-based relation inpainting (NRI) to reduce semantic ambiguity by repairing fuzzy relations using correlations among relations. Finally, we propose hierarchical multisemantics aggregation (HMSA) to guarantee the completeness of semantics by aggregating node and relation features with a reconstruction network (RN). Experimental results show that IMR-HGN effectively extracts the intermodal semantics of answers, achieving a 2.16% improvement on the validation set of the TQA dataset and a 3.04% improvement on the test set of the AI2D dataset.
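For illustration only, the sketch below shows a generic relation-aware message-passing layer over a heterogeneous graph whose nodes may come from text or diagrams; the abstract does not give implementation details, so this is not the authors' IMR-HGN, and the class and parameter names (RelationAwareHGNLayer, hidden_dim, num_relations) are hypothetical. It only conveys the general idea of transforming messages with relation-specific weights before aggregation.

```python
import torch
import torch.nn as nn


class RelationAwareHGNLayer(nn.Module):
    """Hypothetical relation-aware message passing over a heterogeneous graph.

    Each edge carries a relation type (e.g., text-text, text-diagram,
    diagram-diagram); messages are transformed by a relation-specific
    projection before being summed into the target node.
    """

    def __init__(self, hidden_dim: int, num_relations: int):
        super().__init__()
        # one projection per relation type, plus a self-loop projection
        self.rel_proj = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_relations)]
        )
        self.self_proj = nn.Linear(hidden_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x, edge_index, edge_type):
        # x:          (num_nodes, hidden_dim) node features (text or diagram)
        # edge_index: (2, num_edges) source/target node indices
        # edge_type:  (num_edges,) relation id of each edge
        src, dst = edge_index
        out = self.self_proj(x)
        for r, proj in enumerate(self.rel_proj):
            mask = edge_type == r
            if mask.any():
                msg = proj(x[src[mask]])                # relation-specific transform
                out = out.index_add(0, dst[mask], msg)  # sum messages into targets
        return self.act(out)


# toy usage: 4 nodes, 2 relation types
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
edge_type = torch.tensor([0, 1, 0])
layer = RelationAwareHGNLayer(hidden_dim=16, num_relations=2)
print(layer(x, edge_index, edge_type).shape)  # torch.Size([4, 16])
```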