Poster No:
1483
Submission Type:
Abstract Submission
Authors:
Yuduo Zhang1,2, Zhang Jing1, Siyang Li1, Weiyang Shi1, Yongfu Hao1, Zhang Yu1, Tianzi Jiang3
Institutions:
1Zhejiang Lab, Hangzhou, China, 2Shanxi University, Taiyuan, China, 3Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
First Author:
Yuduo Zhang
Zhejiang Lab|Shanxi University
Hangzhou, China|Taiyuan, China
Co-Author(s):
Tianzi Jiang
Brainnetome Center, Institute of Automation, Chinese Academy of Sciences
Beijing, China
Introduction:
The integration of brain responses with artificial neural network (ANN) models has opened new avenues for understanding the neural basis of cognitive functions. This study introduces an encoding approach based on the DreamSim-OpenCLIP model, leveraging the joint mechanism of image-text alignment and perceptual-similarity training. By employing spatial and layer attention mechanisms, the proposed model not only achieves state-of-the-art encoding performance in the cortex but also provides novel insights into the receptive field mapping and hierarchical structure of visual coding. This work marks a substantial advance in bridging artificial intelligence and neuroscience, particularly in the realms of visual processing and scene comprehension.
Methods:
We used the 7T fMRI data of the Natural Scenes Dataset (Allen 2022), in which participants viewed 10,000 distinct natural scene images drawn from COCO (Lin 2014). Image-wise brain responses were estimated using fMRIPrep and GLMsingle (Prince 2022) and then projected onto the fsaverage cortical surface.
We utilized the pretrained DreamSim-OpenCLIP model (Fu 2023) to extract latent features of each scene image. This model is adept at encoding multimodal information, encompassing both objects and captions through text-image contrastive learning, and capturing the semantic relationships between objects through human perceptual-similarity judgments on image triplets. To predict the neural response of each vertex in the cerebral cortex, we integrated latent features from multiple layers of OpenCLIP using a 1x1 convolution, followed by spatial (196-dimensional) and layer (12-dimensional) attention mechanisms. The spatial attention is designed to capture the retinotopic mapping of the visual cortex, while the layer attention reflects the hierarchy of visual coding. From the spatial attention map, the polar angle and eccentricity of each cortical vertex can be estimated and then compared with the actual population receptive field (pRF) parameters provided by the dataset.
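The readout described above can be sketched roughly as follows. This is a hypothetical PyTorch illustration, not the authors' code: the 196 spatial tokens (a 14x14 ViT patch grid) and 12 layers follow the text, while `feat_dim`, `hidden`, the parameter initializations, and the center-of-mass recipe for angle/eccentricity are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    """Predict per-vertex responses from multi-layer ViT features.

    Assumes 12 transformer layers, each yielding 196 patch tokens
    (a 14x14 grid) of dimension `feat_dim`. All shapes are assumptions
    for illustration.
    """
    def __init__(self, n_vertices, feat_dim=768, n_layers=12,
                 n_tokens=196, hidden=256):
        super().__init__()
        # 1x1 convolution mixes feature channels within each (layer, token) cell
        self.conv1x1 = nn.Conv2d(feat_dim, hidden, kernel_size=1)
        # one spatial attention map (over 196 tokens) per cortical vertex
        self.spatial_logits = nn.Parameter(torch.zeros(n_vertices, n_tokens))
        # one layer attention vector (over 12 layers) per cortical vertex
        self.layer_logits = nn.Parameter(torch.zeros(n_vertices, n_layers))
        # per-vertex linear readout weights
        self.readout = nn.Parameter(torch.randn(n_vertices, hidden) * 0.01)
        self.bias = nn.Parameter(torch.zeros(n_vertices))

    def forward(self, feats):
        # feats: (batch, n_layers, n_tokens, feat_dim)
        x = feats.permute(0, 3, 1, 2)                 # (b, C, L, T)
        x = self.conv1x1(x)                           # (b, hidden, L, T)
        s = torch.softmax(self.spatial_logits, -1)    # (V, T)
        l = torch.softmax(self.layer_logits, -1)      # (V, L)
        x = torch.einsum('bhlt,vt->bhlv', x, s)       # pool tokens spatially
        x = torch.einsum('bhlv,vl->bhv', x, l)        # pool layers
        return torch.einsum('bhv,vh->bv', x, self.readout) + self.bias

def attention_prf(spatial_logits, grid=14):
    """Estimate pRF-like angle/eccentricity per vertex as the
    attention-weighted center of mass of the token grid (a hypothetical
    recipe; the paper's exact estimator may differ)."""
    w = torch.softmax(spatial_logits, -1).view(-1, grid, grid)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid),
                            torch.linspace(-1, 1, grid), indexing='ij')
    cx = (w * xs).sum((-2, -1))
    cy = (w * ys).sum((-2, -1))
    return torch.atan2(cy, cx), torch.sqrt(cx**2 + cy**2)
```

In this sketch the softmax over tokens makes each vertex's spatial weights a proper distribution over the 14x14 grid, so the weighted centroid gives a direct, differentiable estimate of that vertex's preferred visual-field location for comparison with measured pRF parameters.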
Results:
State-of-the-art encoding performance in the visual cortex was achieved using the DreamSim-OpenCLIP model (Fig 1B), with a maximum prediction accuracy of 0.86. Moderate encoding accuracy was observed in the prefrontal and parietal regions. Additionally, the receptive field mapping derived from the spatial attention map of each vertex closely resembled the actual pRF map (Fig 1C). The layer attention map aligns with the hierarchical structure of visual coding (Fig 1E), demonstrating low-layer representations predominantly for early visual areas (e.g., V1), middle-layer representations for higher visual areas including V4, and high-layer representations for higher-order areas in the dorsal and ventral streams.
We observed that the proposed encoding model outperformed alternatives such as DreamSim-DINO, particularly in high-order cognitive areas beyond the visual cortex (Fig 2A). Notably, this enhanced performance was evident in the lateral and medial prefrontal cortex, as well as the superior parietal and temporal cortex. However, no significant improvement was detected in the early visual cortex (areas V1 to V3).


Conclusions:
Our study employed the pretrained DreamSim-OpenCLIP model to encode neural responses to natural scenes across the entire cortex. Leveraging a joint mechanism of image-text alignment and perceptual-similarity training, the proposed model achieved state-of-the-art encoding performance in the visual cortex and demonstrated enhanced performance in high-order cognitive areas, particularly within prefrontal regions. Furthermore, the model effectively captured the receptive field mapping and the hierarchical structure of visual coding through its spatial and layer attention mechanisms.
Modeling and Analysis Methods:
Activation (e.g., BOLD task-fMRI)
Classification and Predictive Modeling 1
Neuroanatomy, Physiology, Metabolism and Neurotransmission:
Cortical Anatomy and Brain Mapping
Perception, Attention and Motor Behavior:
Perception: Visual 2
Keywords:
Machine Learning
Other - Brain encoding, fMRI
1|2Indicates the priority used for review
References:
Allen, E. J., St-Yves, G., Wu, Y., Breedlove, J. L., Prince, J. S., Dowdle, L. T., ... & Kay, K. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116-126.
Prince, J. S., Charest, I., Kurzawski, J. W., Pyles, J. A., Tarr, M. J., & Kay, K. N. (2022). Improving the accuracy of single-trial fMRI response estimates using GLMsingle. eLife, 11, e77599.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V (pp. 740-755). Springer International Publishing.
Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., & Isola, P. (2023). DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344.