Neural Module Networks for Compositional Visual Reasoning
Published in PhD Thesis, Conservatoire National des Arts et Métiers CNAM Paris, HESAM Université, 2023
The context of this PhD thesis is compositional visual reasoning. When presented with an image and a question pair, our objective is to have neural networks models answer the question by following a reasoning chain defined by a program. We assess the model’s reasoning ability through a Visual Question Answering (VQA) setup.Compositional VQA breaks down complex questions into modular easier sub-problems.These sub-problems include reasoning skills such as object and attribute detection, relation detection, logical operations, counting, and comparisons. Each sub-problem is assigned to a different module. This approach discourages shortcuts, demanding an explicit understanding of the problem. It also promotes transparency and explainability.Neural module networks (NMN) are used to enable compositional reasoning. The framework is based on a generator-executor framework, the generator learns the translation of the question to its function program. The executor instantiates a neural module network where each function is assigned to a specific module. We also design a neural modules catalog and define the function and the structure of each module. The training and evaluations are conducted using the pre-processed GQA dataset [3], which includes natural language questions, functional programs representing the reasoning chain, images, and corresponding answers.The research contributions revolve around the establishment of an NMN framework for the VQA task.One primary contribution involves the integration of vision and language pre-trained (VLP) representations into modular VQA. This integration serves as a ``warm-start” mechanism for initializing the reasoning process.The experiments demonstrate that cross-modal vision and language representations outperform uni-modal ones. This utilization enables the capture of intricate relationships within each individual modality while also facilitating alignment between different modalities, consequently enhancing overall accuracy of our NMN.Moreover, we explore various training techniques to enhance the learning process and improve cost-efficiency. In addition to optimizing the modules within the reasoning chain to collaboratively produce accurate answers, we introduce a teacher-guidance approach to optimize the intermediate modules in the reasoning chain. This ensures that these modules perform their specific reasoning sub-tasks without taking shortcuts or compromising the reasoning process’s integrity. We propose and implement several teacher-guidance techniques, one of which draws inspiration from the teacher-forcing method commonly used in sequential models. Comparative analyses demonstrate the advantages of our teacher-guidance approach for NMNs, as detailed in our paper [1].We also introduce a novel Curriculum Learning (CL) strategy tailored for NMNs to reorganize the training examples and define a start-small training strategy. We begin by learning simpler programs and progressively increase the complexity of the training programs. We use several difficulty criteria to define the CL approach. Our findings demonstrate that by selecting the appropriate CL method, we can significantly reduce the training cost and required training data, with only a limited impact on the final VQA accuracy. This significant contribution forms the core of our paper [2].[1] W. Aissa, M. Ferecatu, and M. Crucianu. Curriculum learning for compositional visual reasoning. In Proceedings of VISIGRAPP 2023, Volume 5: VISAPP, 2023.[2] W. Aissa, M. Ferecatu, and M. Crucianu. Multimodal representations for teacher-guidedcompositional visual reasoning. In Advanced Concepts for Intelligent Vision Systems, 21st International Conference (ACIVS 2023). Springer International Publishing, 2023.[3] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019.
