Poster No:
1483
Submission Type:
Abstract Submission
Authors:
Yuduo Zhang1,2, Zhang Jing1, Siyang Li1, Weiyang Shi1, Yongfu Hao1, Zhang Yu1, Tianzi Jiang3
Institutions:
1Zhejiang Lab, Hangzhou, China, 2Shanxi University, Taiyuan, China, 3Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
First Author:
Yuduo Zhang
Zhejiang Lab|Shanxi University
Hangzhou, China|Taiyuan, China
Co-Author(s):
Tianzi Jiang
Brainnetome Center, Institute of Automation, Chinese Academy of Sciences
Beijing, China
Introduction:
The integration of brain responses with artificial neural network (ANN) models has opened new avenues for understanding the neural basis of cognitive functions. This study introduces an encoding approach based on the DreamSim-OpenCLIP model, leveraging the joint mechanism of image-text alignment and perceptual-similarity training. By employing spatial and layer attention mechanisms, the proposed model not only achieves state-of-the-art encoding performance in the cortex but also provides novel insights into the receptive field mapping and hierarchical structure of visual coding. This work marks a substantial advance in bridging artificial intelligence and neuroscience, particularly in the realms of visual processing and scene comprehension.
Methods:
We used the 7T fMRI data of the Natural Scenes Dataset (Allen 2022), in which participants viewed 10,000 distinct natural scene images drawn from COCO (Lin 2014). Image-wise brain responses were estimated using fMRIPrep and GLMsingle (Prince 2022) and then projected onto the fsaverage cortical surface.
We utilized the pretrained DreamSim-OpenCLIP model (Fu 2023) to extract latent features of each scene image. This model is adept at encoding multimodal information, encompassing both objects and captions through text-image contrastive learning, and capturing the semantic relationships between objects through human perceptual-similarity judgments on image triplets. To predict the neural response of each vertex in the cerebral cortex, we integrated latent features from multiple layers of OpenCLIP using a 1x1 convolution, followed by spatial (196-dimensional) and layer (12-dimensional) attention mechanisms. The spatial attention is designed to capture the retinotopic mapping of the visual cortex, while the layer attention reflects the hierarchy of visual coding. From the spatial attention map, the polar angle and eccentricity of each cortical vertex can be estimated and then compared with the actual population receptive field (pRF) parameters provided by the dataset.
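The readout described above can be sketched roughly as follows. This is a hypothetical PyTorch illustration, not the authors' code: the 196 spatial tokens (a 14x14 ViT patch grid) and 12 layers follow the text, while `feat_dim`, `hidden`, the parameter initializations, and the center-of-mass recipe for angle/eccentricity are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    """Predict per-vertex responses from multi-layer ViT features.

    Assumes 12 transformer layers, each yielding 196 patch tokens
    (a 14x14 grid) of dimension `feat_dim`. All shapes are assumptions
    for illustration.
    """
    def __init__(self, n_vertices, feat_dim=768, n_layers=12,
                 n_tokens=196, hidden=256):
        super().__init__()
        # 1x1 convolution mixes feature channels within each (layer, token) cell
        self.conv1x1 = nn.Conv2d(feat_dim, hidden, kernel_size=1)
        # one spatial attention map (over 196 tokens) per cortical vertex
        self.spatial_logits = nn.Parameter(torch.zeros(n_vertices, n_tokens))
        # one layer attention vector (over 12 layers) per cortical vertex
        self.layer_logits = nn.Parameter(torch.zeros(n_vertices, n_layers))
        # per-vertex linear readout weights
        self.readout = nn.Parameter(torch.randn(n_vertices, hidden) * 0.01)
        self.bias = nn.Parameter(torch.zeros(n_vertices))

    def forward(self, feats):
        # feats: (batch, n_layers, n_tokens, feat_dim)
        x = feats.permute(0, 3, 1, 2)                 # (b, C, L, T)
        x = self.conv1x1(x)                           # (b, hidden, L, T)
        s = torch.softmax(self.spatial_logits, -1)    # (V, T)
        l = torch.softmax(self.layer_logits, -1)      # (V, L)
        x = torch.einsum('bhlt,vt->bhlv', x, s)       # pool tokens spatially
        x = torch.einsum('bhlv,vl->bhv', x, l)        # pool layers
        return torch.einsum('bhv,vh->bv', x, self.readout) + self.bias

def attention_prf(spatial_logits, grid=14):
    """Estimate pRF-like angle/eccentricity per vertex as the
    attention-weighted center of mass of the token grid (a hypothetical
    recipe; the paper's exact estimator may differ)."""
    w = torch.softmax(spatial_logits, -1).view(-1, grid, grid)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, grid),
                            torch.linspace(-1, 1, grid), indexing='ij')
    cx = (w * xs).sum((-2, -1))
    cy = (w * ys).sum((-2, -1))
    return torch.atan2(cy, cx), torch.sqrt(cx**2 + cy**2)
```

In this sketch the softmax over tokens makes each vertex's spatial weights a proper distribution over the 14x14 grid, so the weighted centroid gives a direct, differentiable estimate of that vertex's preferred visual-field location for comparison with measured pRF parameters.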
Results:
State-of-the-art encoding performance in the visual cortex was achieved using the DreamSim-OpenCLIP model (Fig 1B), with a maximum prediction accuracy of 0.86. Moderate encoding accuracy was observed in the prefrontal and parietal regions. Additionally, the receptive field mapping derived from the spatial attention map of each vertex closely resembled the actual pRF map (Fig 1C). The layer attention map aligns with the hierarchical structure of visual coding (Fig 1E), demonstrating low-layer representations predominantly for early visual areas (e.g., V1), middle-layer representations for higher visual areas including V4, and high-layer representations for higher-order areas in the dorsal and ventral streams.
We observed that the proposed encoding model outperformed alternatives such as DreamSim-DINO, particularly in high-order cognitive areas beyond the visual cortex (Fig 2A). Notably, this enhanced performance was evident in the lateral and medial prefrontal cortex, as well as the superior parietal and temporal cortex. However, no significant improvement was detected in the early visual cortex (areas V1 to V3).


Conclusions:
Our study employed the pretrained DreamSim-OpenCLIP model to encode neural responses to natural scenes across the entire cortex. Leveraging a joint mechanism of image-text alignment and perceptual-similarity training, the proposed model achieved state-of-the-art encoding performance in the visual cortex and demonstrated enhanced performance in high-order cognitive areas, particularly within prefrontal regions. Furthermore, the model effectively captured the receptive field mapping and the hierarchical structure of visual coding through its spatial and layer attention mechanisms.
Modeling and Analysis Methods:
Activation (e.g., BOLD task-fMRI)
Classification and Predictive Modeling 1
Neuroanatomy, Physiology, Metabolism and Neurotransmission:
Cortical Anatomy and Brain Mapping
Perception, Attention and Motor Behavior:
Perception: Visual 2
Keywords:
Machine Learning
Other - Brain encoding, fMRI
1|2Indicates the priority used for review
References:
Allen, E. J., St-Yves, G., Wu, Y., Breedlove, J. L., Prince, J. S., Dowdle, L. T., ... & Kay, K. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116-126.
Prince, J. S., Charest, I., Kurzawski, J. W., Pyles, J. A., Tarr, M. J., & Kay, K. N. (2022). Improving the accuracy of single-trial fMRI response estimates using GLMsingle. eLife, 11, e77599.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V (pp. 740-755). Springer International Publishing.
Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., & Isola, P. (2023). DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344.