Shantanu Jaiswal

I'm an MSR student at the CMU Robotics Institute, advised by Deepak Pathak. Previously, I was a researcher at the Centre for Frontier AI Research at A*STAR Singapore, where I was advised by Cheston Tan and Basura Fernando.

Email  /  CV  /  Scholar  /  Github


Research

I'm broadly interested in deep learning, computer vision, and cognitive science. My recent research focuses on architectural refinements, learning methods, and inference-time strategies for more capable multimodal generative models.

Selected Publications

Iterative Refinement Improves Compositional Image Generation
Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
Under review, 2026
preprint / code (soon)

A training-free iterative refinement mechanism to improve compositional image generation capabilities of text-to-image models (incl. NanoBanana, GPT-Image-1 and Qwen-Image).

Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal, Debaditya Roy, Basura Fernando, Cheston Tan
Conference on Neural Information Processing Systems (NeurIPS), 2024
arXiv / code

A hybrid recurrent-transformer module to improve the compositional visual reasoning capabilities of vision-language backbones. Achieves state-of-the-art results on multiple image and video question answering benchmarks (incl. STAR, AGQA, CLEVR-/CLEVRER-Humans).

Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
Ishaan Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, Cheston Tan
International Conference on Machine Learning (ICML), 2024
arXiv / code

An analysis of multimodal biases in VideoQA transformer models using a non-parametric probe and a stress-test dataset; the results suggest strong modality-specific biases in existing models.

TDAM: Top-Down Attention Module for Contextually-Guided Feature Selection in CNNs
Shantanu Jaiswal, Basura Fernando, Cheston Tan
European Conference on Computer Vision (ECCV), 2022
arXiv / code

A lightweight top-down attention module that iteratively uses higher-level features to attend to lower-level features across the model hierarchy. Improves the fine-grained object recognition and localization performance of CNN vision backbones.


Website template adapted from Jon Barron.
Last updated: November 2025