Event Details

Reducing Training Time in Text Visual Question Answering

Presenter: Ghazale Behboud
Supervisor:

Date: Tue, May 31, 2022
Time: 15:00
Place: ZOOM - Please see below.

Zoom link: https://uvic.zoom.us/j/5855448370?pwd=WVR6ZVdPOFpHUXFQRWZNUjFkVk9DQT09

Meeting ID: 585 544 8370

Password: 570742

Abstract:

Artificial Intelligence (AI) and Computer Vision (CV) have brought the promise of many applications, along with many challenges to solve. The majority of current AI research has been dedicated to single-modal data processing, meaning it uses only one modality, such as visual recognition or text recognition. However, real-world challenges often combine different modalities of data, such as text, audio, and images. This thesis focuses on the Visual Question Answering (VQA) problem, a significant multi-modal challenge. VQA is defined as a computer vision task in which a system, given a question about an image, answers based on an understanding of both the question and the image. The main focus of this thesis is improving the training time of VQA models. Look, Read, Reason and Answer (LoRRA), a state-of-the-art architecture, is used as the base model. Then, Reducing Unimodal Biases (RUBi) is applied to this model to reduce the importance of uni-modal biases during training. Finally, an early-stopping strategy is employed to stop the training process once the model's accuracy has converged, preventing the model from overfitting. Numerical results show that when LoRRA is trained with RUBi and early stopping, accuracy can converge in less than 5 hours. The impact of the batch size, learning rate, and warm-up hyperparameters is also investigated, and experimental results are presented.
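
For illustration, a minimal sketch of the early-stopping strategy described above (Python; the function names train_epoch and evaluate and the parameter values are hypothetical placeholders, not the thesis's actual implementation): training stops once validation accuracy has not improved meaningfully for a fixed number of epochs.

    # Illustrative early-stopping loop (hypothetical names, not the thesis code).
    # Stops once validation accuracy fails to improve by more than `min_delta`
    # for `patience` consecutive epochs.
    def train_with_early_stopping(train_epoch, evaluate, max_epochs=100,
                                  patience=5, min_delta=1e-3):
        best_acc = float("-inf")
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            train_epoch()                    # one pass over the training set
            acc = evaluate()                 # validation accuracy for this epoch
            if acc > best_acc + min_delta:   # meaningful improvement: reset counter
                best_acc = acc
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}: accuracy converged at {best_acc:.4f}")
                break
        return best_acc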