About me

Hello! I am a researcher at A*STAR Singapore’s Centre for Frontier AI Research (CFAR).

Broadly, my research interests lie at the intersection of computer vision, machine learning, and cognitive science, with the goal of developing more intelligent and reliable AI that can continually learn from real-world, embodied experience and exhibit human-like reasoning.

At A*STAR, I have specifically worked on designing more effective scene understanding and reasoning methods by combining insights from relevant cognitive phenomena with machine learning techniques.

My CV is available here.

Selected Papers

Please see my Google Scholar profile for a complete list.

TDAM: Top-Down Attention Module for Contextually-Guided Feature Selection in CNNs

Shantanu Jaiswal, Basura Fernando, Cheston Tan

In European Conference on Computer Vision (ECCV), 2022.

[Paper] [Suppl.] [Code]

Description
  1. Introduces a top-down attention formulation that enables models to 'look again' and attend to salient objects or features based on higher-level activations, which carry semantically richer task or contextual information (see the illustrative sketch below).
  2. Improves performance on multiple tasks (object recognition and localization) across various architecture types with little additional parameter and memory overhead (e.g., 5% higher top-1 weakly supervised object localization accuracy for ResNet50 on ImageNet).
  3. Also makes models more robust to changes in input resolution and reduces the entropy of output activations, suggesting enhanced feature selectivity (entropy and top-bottom feature co-activation analyses are in the supplemental material).
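
To make the idea concrete, below is a minimal, self-contained sketch of top-down channel re-weighting in PyTorch. It is only illustrative: the module name TopDownReweight, its structure, and the hyperparameters are simplifications chosen for exposition, not the actual TDAM implementation (see the [Code] link above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownReweight(nn.Module):
    """Illustrative top-down attention: higher-level (semantically richer)
    activations generate channel-wise gates that modulate lower-level features
    on a second 'look'. A simplified sketch, not the official TDAM module."""

    def __init__(self, low_channels: int, high_channels: int):
        super().__init__()
        # Map high-level context to a multiplicative gate over low-level channels.
        self.gate = nn.Sequential(
            nn.Conv2d(high_channels, low_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # Upsample high-level activations to the low-level spatial size,
        # turn them into attention weights, and re-weight the lower features.
        context = F.interpolate(high_feat, size=low_feat.shape[-2:], mode="nearest")
        attn = self.gate(context)
        return low_feat * attn


if __name__ == "__main__":
    low = torch.randn(1, 64, 56, 56)    # lower-layer feature map
    high = torch.randn(1, 256, 14, 14)  # higher-layer (contextual) feature map
    module = TopDownReweight(low_channels=64, high_channels=256)
    print(module(low, high).shape)      # torch.Size([1, 64, 56, 56])
```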

Revealing the Illusion of Joint Multimodal Understanding in VideoQA Models

Ishaan Rawal, Shantanu Jaiswal, Basura Fernando, Cheston Tan

arXiv preprint, 2023. Short version in NeurIPS 2023 XAI in Action Workshop.

[Preprint] [Short workshop paper]

Description
  1. Analyzes multimodal biases in videoQA models through a quadrant-averaging (QUAG) attentional probing method and a diagnostic benchmark (CLAVI) built around temporal counterfactuals (see the illustrative sketch below).
  2. Model performance drops drastically on counterfactual temporal scenarios (i.e., when event sequences in the video are swapped or questions are inverted), suggesting a lack of joint multimodal (video-language) reasoning and temporal understanding in current videoQA models.
  3. Model performance on existing benchmarks is maintained even when attention for the text and visual modalities is impaired (averaged), suggesting that current models may exploit 'shortcuts' to achieve competitive performance on these benchmarks.
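
The quadrant-averaging probe can be summarized with a short sketch. The function below captures the spirit of QUAG rather than the paper's exact implementation: the self-attention matrix over concatenated [video | text] tokens is split into four quadrants, and the weights inside chosen quadrants are replaced by their row-wise means. If downstream accuracy barely changes after this impairment, the model was not relying on that intra- or cross-modal pathway. Function and argument names here are hypothetical.

```python
import torch

def quadrant_average(attn: torch.Tensor, n_video: int, quadrants=("vt", "tv")) -> torch.Tensor:
    """Illustrative quadrant-averaging probe over a (num_tokens, num_tokens)
    attention matrix whose first n_video rows/columns are video tokens.
    quadrants: subset of {"vv", "vt", "tv", "tt"} (query-modality, key-modality)."""
    out = attn.clone()
    v = slice(0, n_video)                 # video token indices
    t = slice(n_video, attn.shape[-1])    # text token indices
    blocks = {"vv": (v, v), "vt": (v, t), "tv": (t, v), "tt": (t, t)}
    for name in quadrants:
        rows, cols = blocks[name]
        block = out[rows, cols]
        # Replace each row of the quadrant by its mean, preserving the total
        # attention mass that the query tokens assign to that quadrant.
        out[rows, cols] = block.mean(dim=-1, keepdim=True).expand_as(block)
    return out


if __name__ == "__main__":
    # Toy attention over 5 video + 3 text tokens; rows sum to 1.
    attn = torch.softmax(torch.randn(8, 8), dim=-1)
    probed = quadrant_average(attn, n_video=5, quadrants=("vt", "tv"))
    print(torch.allclose(probed.sum(-1), attn.sum(-1)))  # row sums preserved -> True
```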

What do CNNs gain by imitating the visual development of primate infants?

Shantanu Jaiswal, Dongkyu Choi, Basura Fernando

In British Machine Vision Conference (BMVC), 2020.

[Paper] [Suppl.] [Code] [Abstract (Cogsci 2020)]

Description
  1. Analyzes a growth-based training strategy wherein models are grown and inputs are gradually refined over the course of training (see the illustrative sketch below).
  2. Growth-based training results in higher performance (on object recognition) and faster training compared to traditional static training approaches.
  3. The growth strategy potentially allows 'coarse global' patterns to be identified first and 'finer' ones later, in a more hierarchical manner (visualizations of the induced filters during training are in the supplemental material).
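
A toy sketch of the overall recipe is below. The growth schedule, the GrowingCNN architecture, and all hyperparameters are hypothetical simplifications for illustration only; the actual setup used in the paper is in the [Code] link above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical growth schedule: at each stage the input is refined (higher
# resolution) and the model is grown (an extra conv block is appended).
STAGES = [
    {"resolution": 32, "epochs": 5},
    {"resolution": 64, "epochs": 5},
    {"resolution": 128, "epochs": 10},
]

class GrowingCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())])
        self.head = nn.Linear(32, num_classes)
        self._channels = 32

    def grow(self):
        """Append a new conv block and re-initialize the classifier head."""
        new_channels = self._channels * 2
        self.blocks.append(nn.Sequential(
            nn.Conv2d(self._channels, new_channels, 3, stride=2, padding=1), nn.ReLU()))
        self._channels = new_channels
        self.head = nn.Linear(new_channels, self.head.out_features)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.head(x)


def train_with_growth(model, images, labels):
    """Toy loop: refine inputs and grow the model stage by stage."""
    for i, stage in enumerate(STAGES):
        if i > 0:
            model.grow()  # grow capacity before moving to finer inputs
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(stage["epochs"]):
            x = F.interpolate(images, size=stage["resolution"],
                              mode="bilinear", align_corners=False)
            loss = F.cross_entropy(model(x), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


if __name__ == "__main__":
    model = GrowingCNN()
    images = torch.randn(8, 3, 128, 128)   # toy batch at full resolution
    labels = torch.randint(0, 10, (8,))
    train_with_growth(model, images, labels)
```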