From Visual Question Answering to multimodal learning: an interview with Aishwarya Agrawal
Summary
Aishwarya Agrawal, an Assistant Professor at the University of Montreal and Canada CIFAR AI Chair, discusses her research journey from Visual Question Answering (VQA) to broader multimodal learning. Her PhD dissertation proposed open-ended VQA as a new benchmark for computer vision models and curated a large-scale dataset for training and testing them. The aim was to move beyond classification over a limited set of categories toward free-form natural language interaction, improving model understanding and addressing language biases. Over the past decade, VQA model performance has improved by approximately 25%, and research focus has shifted to challenges such as cross-cultural alignment, data efficiency for large models, and learning compatible visual and language representations. Agrawal is also exploring how knowledge from large language models (LLMs) and vision-language models (VLMs) can inform low-level control tasks for embodied AI.
Key takeaway
If you are an AI scientist developing multimodal models, consider shifting your focus from basic perception to harder challenges such as cross-cultural alignment and data efficiency. Research impact depends largely on rigorous execution and insightful contributions, not just on working on "hot" topics. Invest as well in your communication and presentation skills, so that you can convey complex ideas to broader audiences and secure funding.
Key insights
VQA research has evolved from basic perception to complex multimodal challenges like cultural alignment and embodied AI.
Principles
- Rigorous execution enhances research impact.
- Communication skills are crucial for career advancement.
Method
VQA benchmarks computer vision models by asking open-ended, free-form questions about images, using large datasets to train and test understanding beyond simple classification.
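To make the idea of scoring free-form answers concrete, here is a minimal sketch of a consensus-style accuracy, in the spirit of the metric commonly associated with the VQA benchmark: an answer earns full credit when at least three human annotators gave it, and partial credit otherwise. The interview summary does not describe the scoring rule, so treat this as an illustrative assumption; the `normalize` helper, the function names, and the example answers are all hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, trim whitespace, and drop trailing periods (a simplifying assumption)."""
    return answer.strip().lower().rstrip(".")


def consensus_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one free-form answer against several human answers.

    Assumed rule: min(number of matching human answers / 3, 1).
    """
    counts = Counter(normalize(a) for a in human_answers)
    return min(counts[normalize(predicted)] / 3.0, 1.0)


# Hypothetical data: one question about one image, answered by ten annotators.
human = ["yellow"] * 6 + ["Yellow.", "gold", "gold", "orange"]

print(consensus_accuracy("yellow", human))  # 7 matching annotators: full credit
print(consensus_accuracy("gold", human))    # 2 matching annotators: partial credit
print(consensus_accuracy("red", human))     # no matching annotators: no credit
```

Unlike classification accuracy over a fixed label set, a rule of this kind tolerates disagreement among annotators, which matters when both questions and answers are open-ended.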
In practice
- Characterize model behavior to identify weaknesses.
- Develop unified models for both understanding and generation.
Topics
- Visual Question Answering
- Multimodal Learning
- Embodied AI
- Generative Models
- Vision-Language Models
Best for: AI Scientist, AI Researcher, AI Student, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by AIhub.