From Visual Question Answering to multimodal learning: an interview with Aishwarya Agrawal

2026-02-11 · Source: AIhub · Field: Technology & Digital - Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

Aishwarya Agrawal, an Assistant Professor at the University of Montreal and a Canada CIFAR AI Chair, discusses her research journey from Visual Question Answering (VQA) to broader multimodal learning. Her PhD dissertation proposed open-ended VQA as a new benchmark for computer vision models and curated a large-scale dataset for training and testing them. The aim was to move beyond classification over a limited set of categories toward free-form natural language interaction, improving model understanding and addressing language biases. Over the past decade, VQA model performance has improved by approximately 25%, and research focus has shifted to challenges such as cross-cultural alignment, data efficiency for large models, and learning compatible visual and language representations. Agrawal is also exploring how knowledge from large language models (LLMs) and vision-language models (VLMs) can inform low-level control tasks for embodied AI.

Key takeaway

If you are an AI scientist developing multimodal models, consider shifting focus from basic perception to harder challenges such as cross-cultural alignment and data efficiency. Research impact depends on rigorous execution and insightful contributions, not just on choosing "hot" topics. Also invest in your communication and presentation skills so that you can convey complex ideas to broader audiences and secure funding.

Key insights

VQA research has evolved from basic perception to complex multimodal challenges like cultural alignment and embodied AI.

Principles

Rigorous execution enhances research impact.
Communication skills are crucial for career advancement.

Method

VQA benchmarks computer vision models by asking open-ended, free-form questions about images, using large datasets to train and test understanding beyond simple classification.
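
Because answers are free-form rather than drawn from a fixed label set, scoring needs some notion of agreement with human answers. The sketch below illustrates one way to do this, using the simplified consensus rule commonly cited for VQA-style evaluation (full credit when at least three annotators gave the same answer). The interview summary does not describe a scoring rule, so the function names, the normalization step, and the sample data are all assumptions made for illustration.

```python
from __future__ import annotations


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (a simplified, assumed normalization)."""
    return " ".join(answer.lower().split())


def consensus_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one free-form answer against several human answers.

    Simplified consensus rule: min(matches / 3, 1), so an answer gets
    full credit once at least three annotators gave it.
    """
    pred = normalize(predicted)
    matches = sum(1 for a in human_answers if normalize(a) == pred)
    return min(matches / 3.0, 1.0)


def benchmark_accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Average the per-question scores over a (hypothetical) test split."""
    scores = [consensus_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: two questions, ten human answers each.
predictions = ["Red", "2"]
references = [
    ["red"] * 10,             # all annotators agree
    ["2"] * 2 + ["3"] * 8,    # only two annotators said "2"
]
overall = benchmark_accuracy(predictions, references)
```

In this example the first answer earns full credit and the second earns partial credit, which is what lets an open-ended benchmark tolerate legitimate disagreement among annotators rather than demanding a single exact label.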

In practice

Characterize model behavior to identify weaknesses.
Develop unified models for both understanding and generation.

Topics

Visual Question Answering
Multimodal Learning
Embodied AI
Generative Models
Vision-Language Models

Best for: AI Scientist, AI Researcher, AI Student, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by AIhub.