Aubret, A., Schaumlöffel, T., Roig, G., & Triesch, J. (2024). Learning Object Semantic Similarity with Self-Supervision. Proceedings of the 2024 IEEE International Conference on Development and Learning (ICDL). https://doi.org/10.48550/arXiv.2405.05143
@inproceedings{aubretLearningObjectSemantic2024,
title = {Learning {{Object Semantic Similarity}} with {{Self-Supervision}}},
booktitle = {Proceedings of the 2024 {{IEEE International Conference}} on {{Development}} and {{Learning}} ({{ICDL}})},
author = {Aubret, Arthur and Schaumlöffel, Timothy and Roig, Gemma and Triesch, Jochen},
date = {2024},
eprint = {2405.05143},
eprinttype = {arXiv},
eprintclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2405.05143},
url = {http://arxiv.org/abs/2405.05143},
urldate = {2024-07-01},
keywords = {Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning,Computer Science - Neural and Evolutionary Computing},
}
Abstract
Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceived as similar, at least in part, because they are often experienced together in a “kitchen” or “eating” context. Here, we investigate whether a bio-inspired learning principle exploiting such co-occurrence statistics suffices to learn a semantically structured object representation de novo from raw visual or combined visual and linguistic input. To this end, we simulate temporal sequences of visual experience by binding together short video clips of real-world scenes showing objects in different contexts. A bio-inspired neural network model aligns close-in-time visual representations while also aligning visual and category label representations to simulate visuo-language alignment. Our results show that our model clusters object representations based on their context, e.g. kitchen or bedroom, in particular in high-level layers of the network, akin to humans. In contrast, lower-level layers tend to better reflect object identity or category. To achieve this, the model exploits two distinct strategies: the visuo-language alignment ensures that different objects of the same category are represented similarly, whereas the temporal alignment leverages that objects from the same context are frequently seen in succession to make their representations more similar. Overall, our work suggests temporal and visuo-language alignment as plausible computational principles for explaining the origins of certain forms of semantic knowledge in humans.
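The two learning objectives can be made concrete with a short sketch. The following is a minimal illustration assuming SimCLR/CLIP-style symmetric InfoNCE losses, generic image and label encoders, and an assumed weighting term w; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): temporal alignment of
# close-in-time frames plus visuo-language alignment of frames and category
# labels, both expressed as symmetric InfoNCE losses over a batch.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE; matching rows of z_a and z_b are the positive pairs."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_alignment_loss(img_enc, txt_enc, frame_t, frame_t1, labels, w=1.0):
    """frame_t / frame_t1: two close-in-time video frames; labels: category
    label inputs for the text encoder. The weighting w is an assumption."""
    z_t, z_t1 = img_enc(frame_t), img_enc(frame_t1)
    z_txt = txt_enc(labels)
    return info_nce(z_t, z_t1) + w * info_nce(z_t, z_txt)
```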
Ernst, M. R., López, F. M., Aubret, A., Fleming, R. W., & Triesch, J. (2024). Self-Supervised Learning of Color Constancy. Proceedings of the 2024 IEEE International Conference on Development and Learning (ICDL). http://arxiv.org/abs/2404.08127
@inproceedings{ernstSelfSupervisedLearningColor2024,
title = {Self-{{Supervised Learning}} of {{Color Constancy}}},
booktitle = {Proceedings of the 2024 {{IEEE International Conference}} on {{Development}} and {{Learning}} ({{ICDL}})},
author = {Ernst, Markus R. and López, Francisco M. and Aubret, Arthur and Fleming, Roland W. and Triesch, Jochen},
date = {2024},
eprint = {2404.08127},
eprinttype = {arXiv},
eprintclass = {cs},
publisher = {arXiv},
url = {http://arxiv.org/abs/2404.08127},
urldate = {2024-07-01},
langid = {english},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning},
}
Abstract
Color constancy (CC) describes the ability of the visual system to perceive an object as having a relatively constant color despite changes in lighting conditions. While CC and its limitations have been carefully characterized in humans, it is still unclear how the visual system acquires this ability during development. Here, we present a first study showing that CC develops in a neural network trained in a self-supervised manner through an invariance learning objective. During learning, objects are presented under changing illuminations, while the network aims to map subsequent views of the same object onto close-by latent representations. This gives rise to representations that are largely invariant to the illumination conditions, offering a plausible example of how CC could emerge during human cognitive development via a form of self-supervised learning.
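The invariance objective can be sketched in a few lines. Below, the illumination change is approximated by a random per-channel gain (a von Kries-style diagonal transform) and the objective maximizes the cosine similarity between the representations of the two views; the gain range, the clamping, and the loss form are assumptions, not the paper's training setup.

```python
# Illustrative sketch only (not the paper's training setup): an illumination
# change is approximated by a random per-channel gain, and the invariance
# objective pulls the representations of the original and re-lit views together.
import torch
import torch.nn.functional as F

def relight(images, gains):
    """images: (B, 3, H, W) in [0, 1]; gains: (B, 3) per-channel multipliers."""
    return (images * gains[:, :, None, None]).clamp(0.0, 1.0)

def invariance_loss(encoder, images):
    gains = 0.5 + torch.rand(images.size(0), 3, device=images.device)  # assumed range
    z_ref = F.normalize(encoder(images), dim=-1)
    z_lit = F.normalize(encoder(relight(images, gains)), dim=-1)
    return 1.0 - (z_ref * z_lit).sum(dim=-1).mean()   # 1 - mean cosine similarity
```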
Lahner, B., Dwivedi, K., Iamshchinina, P., Graumann, M., Lascelles, A., Roig, G., Gifford, A. T., Pan, B., Jin, S. Y., Ratan Murty, N. A., Kay, K., Oliva, A., & Cichy, R. M. (2024). Modeling Short Visual Events through the BOLD Moments Video fMRI Dataset and Metadata. Nature Communications, 15(1), 6241. https://doi.org/10.1038/s41467-024-50310-3
@article{lahnerModelingShortVisual2024,
title = {Modeling Short Visual Events through the {{BOLD}} Moments Video {{fMRI}} Dataset and Metadata},
author = {Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N. Apurva and Kay, Kendrick and Oliva, Aude and Cichy, Radoslaw},
date = {2024-07-24},
journaltitle = {Nature Communications},
shortjournal = {Nat Commun},
volume = {15},
number = {1},
pages = {6241},
publisher = {Nature Publishing Group},
issn = {2041-1723},
doi = {10.1038/s41467-024-50310-3},
url = {https://www.nature.com/articles/s41467-024-50310-3},
urldate = {2024-07-30},
langid = {english},
keywords = {Neural encoding,Perception,Visual system},
}
Abstract
Studying the neural basis of human dynamic visual perception requires extensive experimental data to evaluate the large swathes of functionally diverse brain neural networks driven by perceiving visual events. Here, we introduce the BOLD Moments Dataset (BMD), a repository of whole-brain fMRI responses to over 1000 short (3 s) naturalistic video clips of visual events across ten human subjects. We use the videos’ extensive metadata to show how the brain represents word- and sentence-level descriptions of visual events and identify correlates of video memorability scores extending into the parietal cortex. Furthermore, we reveal a match in hierarchical processing between cortical regions of interest and video-computable deep neural networks, and we showcase that BMD successfully captures temporal dynamics of visual events at second resolution. With its rich metadata, BMD offers new perspectives and accelerates research on the human brain basis of visual event perception.
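A typical way to use such a dataset is a voxel-wise encoding model: cross-validated ridge regression from per-video stimulus features (e.g., DNN activations) to fMRI responses. The sketch below illustrates that generic analysis; it is not the released BMD pipeline, and the array shapes, alpha grid, and correlation scoring are assumptions.

```python
# Generic voxel-wise encoding-model sketch (not the BMD release code): predict
# fMRI responses from per-video features with cross-validated ridge regression
# and score each voxel by the correlation between held-out predictions and data.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def encoding_scores(features, voxels, alphas=(1.0, 10.0, 100.0, 1000.0)):
    """features: (n_videos, n_features); voxels: (n_videos, n_voxels)."""
    preds = np.zeros_like(voxels, dtype=float)
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
        model = RidgeCV(alphas=alphas).fit(features[train], voxels[train])
        preds[test] = model.predict(features[test])
    zp = (preds - preds.mean(0)) / (preds.std(0) + 1e-12)
    zv = (voxels - voxels.mean(0)) / (voxels.std(0) + 1e-12)
    return (zp * zv).mean(0)   # Pearson correlation per voxel
```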
Oota, S., Gupta, M., & Toneva, M. (2023). Joint Processing of Linguistic Properties in Brains and Language Models. Advances in Neural Information Processing Systems, 36, 18001–18014. https://proceedings.neurips.cc/paper_files/paper/2023/hash/3a0e2de215bd17c39ad08ba1d16c1b12-Abstract-Conference.html
@article{ootaJointProcessingLinguistic2023a,
title = {Joint Processing of Linguistic Properties in Brains and Language Models},
author = {Oota, Subbareddy and Gupta, Manish and Toneva, Mariya},
date = {2023},
journaltitle = {Advances in Neural Information Processing Systems},
volume = {36},
pages = {18001--18014},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/hash/3a0e2de215bd17c39ad08ba1d16c1b12-Abstract-Conference.html},
urldate = {2024-07-30},
langid = {english},
}
Oota, S. R., Çelik, E., Deniz, F., & Toneva, M. (2024, June 16). Speech Language Models Lack Important Brain-Relevant Semantics. https://doi.org/10.48550/arXiv.2311.04664
@online{ootaSpeechLanguageModels2024,
title = {Speech Language Models Lack Important Brain-Relevant Semantics},
author = {Oota, Subba Reddy and Çelik, Emin and Deniz, Fatma and Toneva, Mariya},
date = {2024-06-16},
eprint = {2311.04664},
eprinttype = {arXiv},
eprintclass = {cs, eess, q-bio},
publisher = {arXiv},
doi = {10.48550/arXiv.2311.04664},
url = {http://arxiv.org/abs/2311.04664},
urldate = {2024-07-01},
keywords = {Computer Science - Computation and Language,Computer Science - Machine Learning,Electrical Engineering and Systems Science - Audio and Speech Processing,Quantitative Biology - Neurons and Cognition},
}
Abstract
Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we systematically remove specific low-level stimulus features (textual, speech, and visual) from language model representations to assess their impact on alignment with fMRI brain recordings during reading and listening. Comparing these findings with speech-based language models reveals starkly different effects of low-level features on brain alignment. While text-based models show reduced alignment in early sensory regions post-removal, they retain significant predictive power in late language regions. In contrast, speech-based models maintain strong alignment in early auditory regions even after feature removal but lose all predictive power in late language regions. These results suggest that speech-based models provide insights into additional information processed by early auditory regions, but caution is needed when using them to model processing in late language regions. We make our code publicly available. [https://github.com/subbareddy248/speech-llm-brain]
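The removal step can be approximated by residualization: fit a regularized linear map from the low-level features to the language-model representations and keep what that map cannot explain. The sketch below is illustrative; the paper's exact removal and cross-validation procedure may differ.

```python
# Sketch of the removal idea (illustrative; the paper's exact procedure may
# differ): regress low-level stimulus features out of the language-model
# representations and keep the residuals for the brain-alignment analysis.
import numpy as np
from sklearn.linear_model import Ridge

def remove_low_level(lm_reps, low_level, alpha=1.0):
    """lm_reps: (n_stimuli, d_model); low_level: (n_stimuli, d_low).
    Returns the part of lm_reps not linearly predictable from low_level."""
    model = Ridge(alpha=alpha).fit(low_level, lm_reps)
    return lm_reps - model.predict(low_level)
```

The residualized representations would then enter the same encoding analysis as the original ones, and the drop in brain alignment per region is compared.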
Schaumlöffel, T., Aubret, A., Roig, G., & Triesch, J. (2023). Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play. 2023 IEEE International Conference on Development and Learning (ICDL), 67–72. https://doi.org/10.1109/ICDL55364.2023.10364409
@inproceedings{schaumloffelCaregiverTalkShapes2023,
title = {Caregiver {{Talk Shapes Toddler Vision}}: {{A Computational Study}} of {{Dyadic Play}}},
shorttitle = {Caregiver {{Talk Shapes Toddler Vision}}},
booktitle = {2023 {{IEEE International Conference}} on {{Development}} and {{Learning}} ({{ICDL}})},
author = {Schaumlöffel, Timothy and Aubret, Arthur and Roig, Gemma and Triesch, Jochen},
date = {2023-11-09},
eprint = {2312.04118},
eprinttype = {arXiv},
eprintclass = {cs},
pages = {67--72},
doi = {10.1109/ICDL55364.2023.10364409},
url = {http://arxiv.org/abs/2312.04118},
urldate = {2024-07-01},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning},
}
Abstract
Infants’ ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers' utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers’ utterances can nevertheless enhance visual representations. For this, we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers’ utterances, modeled as captions. We propose to model toddlers’ learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers’ naming utterances can improve toddlers’ visual representations.
Schaumlöffel, T., Vilas, M. G., & Roig, G. (2023). PEACS: Prefix Encoding for Auditory Caption Synthesis (Technical report, DCASE 2023 Challenge, Task 6a). https://dcase.community/documents/challenge2023/technical_reports/DCASE2023_Schaumloeffel_107_t6a.pdf
@report{schaumloffelPEACSPREFIXENCODING2023,
title = {{{PEACS}}: Prefix Encoding for Auditory Caption Synthesis},
author = {Schaumlöffel, Timothy and Vilas, Martina G. and Roig, Gemma},
date = {2023},
type = {Technical report},
institution = {{DCASE 2023 Challenge}},
url = {https://dcase.community/documents/challenge2023/technical_reports/DCASE2023_Schaumloeffel_107_t6a.pdf},
langid = {english}
}
Abstract
This technical report describes an Automated Audio Captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6a (automated audio captioning). Our approach employs an encoder-decoder architecture, with the encoder utilizing a large contrastive pre-trained HTS-AT capable of handling variable-length audio segments. The decoder is based on the GPT2 model. To incorporate audio into the decoding process, we employ a light mapping network that translates audio representations into a prefix, effectively guiding the decoder’s generation process. Given the limited data availability, we pre-train our model on various audio captioning datasets and fine-tune it on Clotho. We reach a SPIDEr-FL score of 29.3 on the evaluation split of the Clotho-v2 dataset.
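The prefix mechanism can be illustrated with a small mapping module: a fixed-size audio embedding is expanded into a short sequence of prefix embeddings that is concatenated in front of the caption token embeddings before decoding. The module below is a sketch under assumed dimensions (768-d embeddings, 10 prefix tokens) and a simple two-layer MLP; the actual system uses an HTS-AT encoder and a GPT-2 decoder.

```python
# Minimal sketch of the prefix idea (not the submitted system): a mapping
# network expands a fixed-size audio embedding into a short sequence of prefix
# embeddings, which is prepended to the caption token embeddings before being
# fed to the language-model decoder. Dimensions and the MLP are assumptions.
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    def __init__(self, audio_dim=768, prefix_len=10, d_model=768):
        super().__init__()
        self.prefix_len, self.d_model = prefix_len, d_model
        self.net = nn.Sequential(
            nn.Linear(audio_dim, prefix_len * d_model),
            nn.Tanh(),
            nn.Linear(prefix_len * d_model, prefix_len * d_model),
        )

    def forward(self, audio_emb):                      # (B, audio_dim)
        prefix = self.net(audio_emb)                   # (B, prefix_len * d_model)
        return prefix.view(-1, self.prefix_len, self.d_model)

def decoder_inputs(prefix, token_embeddings):
    """Concatenate the audio prefix with caption token embeddings along the
    sequence dimension; in the real system this goes into a GPT-2 decoder."""
    return torch.cat([prefix, token_embeddings], dim=1)
```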
Vilas, M. G., Schaumlöffel, T., & Roig, G. (2023). Analyzing Vision Transformers for Image Classification in Class Embedding Space. Advances in Neural Information Processing Systems, 36, 40030–40041. https://proceedings.neurips.cc/paper_files/paper/2023/hash/7dd309df03d37643b96f5048b44da798-Abstract-Conference.html
@article{vilasAnalyzingVisionTransformers2023a,
title = {Analyzing {{Vision Transformers}} for {{Image Classification}} in {{Class Embedding Space}}},
author = {Vilas, Martina G. and Schaumlöffel, Timothy and Roig, Gemma},
date = {2023-12-15},
journaltitle = {Advances in Neural Information Processing Systems},
volume = {36},
pages = {40030--40041},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/hash/7dd309df03d37643b96f5048b44da798-Abstract-Conference.html},
urldate = {2024-07-30},
langid = {english},
}
Vilas, M. G., Adolfi, F., Poeppel, D., & Roig, G. (2024, June 6). Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience. Forty-First International Conference on Machine Learning. https://openreview.net/forum?id=66KmnMhGU5
@inproceedings{vilasPositionInnerInterpretability2024,
title = {Position: {{An Inner Interpretability Framework}} for {{AI Inspired}} by {{Lessons}} from {{Cognitive Neuroscience}}},
shorttitle = {Position},
booktitle = {Forty-First {{International Conference}} on {{Machine Learning}}},
author = {Vilas, Martina G. and Adolfi, Federico and Poeppel, David and Roig, Gemma},
date = {2024-06-06},
url = {https://openreview.net/forum?id=66KmnMhGU5},
urldate = {2024-07-30},
langid = {english},
}
Abstract
Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field: Cognitive Neuroscience. Here we draw the relevant connections and highlight lessons that can be transferred productively between fields. Based on these, we propose a general conceptual framework and give concrete methodological strategies for building mechanistic explanations in AI inner interpretability research. With this conceptual framework, Inner Interpretability can fend off critiques and position itself on a productive path to explain AI systems.
Background Publications
Bersch, D., Dwivedi, K., Vilas, M., Cichy, R. M., & Roig, G. (2022). Net2Brain: A Toolbox to Compare Artificial Vision Models with Human Brain Responses. https://doi.org/10.48550/arXiv.2208.09677
@online{berschNet2BrainToolboxCompare2022,
title = {{{Net2Brain}}: {{A Toolbox}} to Compare Artificial Vision Models with Human Brain Responses},
shorttitle = {{{Net2Brain}}},
author = {Bersch, Domenic and Dwivedi, Kshitij and Vilas, Martina and Cichy, Radoslaw M. and Roig, Gemma},
date = {2022-08-25},
eprint = {2208.09677},
eprinttype = {arXiv},
eprintclass = {cs, q-bio},
doi = {10.48550/arXiv.2208.09677},
url = {http://arxiv.org/abs/2208.09677},
urldate = {2024-07-01},
pubstate = {prepublished},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition,Quantitative Biology - Neurons and Cognition},
}
Abstract
We introduce Net2Brain, a graphical and command-line user interface toolbox for comparing the representational spaces of artificial deep neural networks (DNNs) and human brain recordings. While different toolboxes facilitate only single functionalities or only focus on a small subset of supervised image classification models, Net2Brain allows the extraction of activations of more than 600 DNNs trained to perform a diverse range of vision-related tasks (e.g., semantic segmentation, depth estimation, action recognition, etc.), over both image and video datasets. The toolbox computes the representational dissimilarity matrices (RDMs) over those activations and compares them to brain recordings using representational similarity analysis (RSA) and weighted RSA, both in specific ROIs and with searchlight analysis. In addition, it is possible to add a new data set of stimuli and brain recordings to the toolbox for evaluation. We demonstrate the functionality and advantages of Net2Brain with an example showcasing how it can be used to test hypotheses of cognitive computational neuroscience.
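The core computation the toolbox automates, building RDMs from activations and comparing them to brain RDMs with RSA, can be written generically as below. This is not the Net2Brain API, only an illustration of the underlying analysis with assumed correlation-distance RDMs and a Spearman comparison.

```python
# Generic RSA sketch (not the Net2Brain API): build representational
# dissimilarity matrices (RDMs) from activations and compare model and brain
# RDMs with Spearman's rank correlation over their upper triangles.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rdm(activations):
    """activations: (n_stimuli, n_units) -> (n_stimuli, n_stimuli) RDM using
    correlation distance (1 - Pearson r) between stimulus patterns."""
    return squareform(pdist(activations, metric="correlation"))

def rsa_score(rdm_model, rdm_brain):
    """Spearman correlation between the upper triangles of the two RDMs."""
    iu = np.triu_indices_from(rdm_model, k=1)
    return spearmanr(rdm_model[iu], rdm_brain[iu]).correlation
```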
Dwivedi, K., Cichy, R. M., & Roig, G. (2021). Unraveling Representations in Scene-selective Brain Regions Using Scene-Parsing Deep Neural Networks. Journal of Cognitive Neuroscience, 33(10), 2032–2043. https://doi.org/10.1162/jocn_a_01624
@article{dwivediUnravelingRepresentationsSceneselective2021,
title = {Unraveling {{Representations}} in {{Scene-selective Brain Regions Using Scene-Parsing Deep Neural Networks}}},
author = {Dwivedi, Kshitij and Cichy, Radoslaw Martin and Roig, Gemma},
date = {2021-09-01},
journaltitle = {Journal of Cognitive Neuroscience},
shortjournal = {Journal of Cognitive Neuroscience},
volume = {33},
number = {10},
pages = {2032--2043},
issn = {0898-929X},
doi = {10.1162/jocn_a_01624},
url = {https://doi.org/10.1162/jocn_a_01624},
urldate = {2024-07-01},
}
Abstract
Visual scene perception is mediated by a set of cortical regions that respond preferentially to images of scenes, including the occipital place area (OPA) and parahippocampal place area (PPA). However, the differential contribution of OPA and PPA to scene perception remains an open research question. In this study, we take a deep neural network (DNN)-based computational approach to investigate the differences in OPA and PPA function. In a first step, we search for a computational model that predicts fMRI responses to scenes in OPA and PPA well. We find that DNNs trained to predict scene components (e.g., wall, ceiling, floor) explain higher variance uniquely in OPA and PPA than a DNN trained to predict scene category (e.g., bathroom, kitchen, office). This result is robust across several DNN architectures. On this basis, we then determine whether particular scene components predicted by DNNs differentially account for unique variance in OPA and PPA. We find that variance in OPA responses uniquely explained by the navigation-related floor component is higher compared to the variance explained by the wall and ceiling components. In contrast, PPA responses are better explained by the combination of wall and floor, that is, scene components that together contain the structure and texture of the scene. This differential sensitivity to scene components suggests differential functions of OPA and PPA in scene processing. Moreover, our results further highlight the potential of the proposed computational approach as a general tool in the investigation of the neural basis of human scene perception.
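The unique-variance logic used here can be stated compactly: the variance a feature set A explains uniquely is the cross-validated R² of the combined model (A and B together) minus the R² of the model with B alone. A minimal sketch follows, with ridge regularization and a single ROI response vector assumed; it is not the study's pipeline.

```python
# Variance-partitioning sketch (illustrative, not the study's pipeline): the
# variance a feature set A explains uniquely is the cross-validated R^2 of the
# combined model minus the R^2 of the model containing only B.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def cv_r2(X, y, alphas=(1.0, 10.0, 100.0)):
    """Cross-validated R^2 of a ridge model predicting an ROI response vector y."""
    return cross_val_score(RidgeCV(alphas=alphas), X, y, scoring="r2", cv=5).mean()

def unique_variance(X_a, X_b, y):
    """Unique variance of feature set X_a beyond X_b (e.g., a floor-component
    feature set vs. wall/ceiling features predicting OPA or PPA responses)."""
    return cv_r2(np.hstack([X_a, X_b]), y) - cv_r2(X_b, y)
```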
Dwivedi, K., Bonner, M. F., Cichy, R. M., & Roig, G. (2021). Unveiling Functions of the Visual Cortex Using Task-Specific Deep Neural Networks. PLOS Computational Biology, 17(8), e1009267. https://doi.org/10.1371/journal.pcbi.1009267
@article{dwivediUnveilingFunctionsVisual2021,
title = {Unveiling Functions of the Visual Cortex Using Task-Specific Deep Neural Networks},
author = {Dwivedi, Kshitij and Bonner, Michael F. and Cichy, Radoslaw Martin and Roig, Gemma},
date = {2021-08-13},
journaltitle = {PLOS Computational Biology},
shortjournal = {PLOS Computational Biology},
volume = {17},
number = {8},
pages = {e1009267},
publisher = {Public Library of Science},
issn = {1553-7358},
doi = {10.1371/journal.pcbi.1009267},
url = {https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009267},
urldate = {2024-07-01},
langid = {english},
keywords = {Functional magnetic resonance imaging,Linear regression analysis,Neural networks,Permutation,Semantics,Sensory perception,Vision,Visual cortex},
}
Abstract
The human visual cortex enables visual perception through a cascade of hierarchical computations in cortical regions with distinct functionalities. Here, we introduce an AI-driven approach to discover the functional mapping of the visual cortex. We related human brain responses to scene images measured with functional MRI (fMRI) systematically to a diverse set of deep neural networks (DNNs) optimized to perform different scene perception tasks. We found a structured mapping between DNN tasks and brain regions along the ventral and dorsal visual streams. Low-level visual tasks mapped onto early brain regions, 3-dimensional scene perception tasks mapped onto the dorsal stream, and semantic tasks mapped onto the ventral stream. This mapping was of high fidelity, with more than 60% of the explainable variance in nine key regions being explained. Together, our results provide a novel functional mapping of the human visual cortex and demonstrate the power of the computational approach.
Nicholls, V. I., Krugliak, A., Alsbury-Nealy, B., Gramann, K., & Clarke, A. (2024). Congruency Effects on Object Recognition Persist When Objects Are Placed in the Wild: An AR and Mobile EEG Study. bioRxiv. https://doi.org/10.1101/2024.05.30.596613
@online{nichollsCongruencyEffectsObject2024,
title = {Congruency Effects on Object Recognition Persist When Objects Are Placed in the Wild: {{An AR}} and Mobile {{EEG}} Study},
shorttitle = {Congruency Effects on Object Recognition Persist When Objects Are Placed in the Wild},
author = {Nicholls, Victoria I. and Krugliak, Alexandra and Alsbury-Nealy, Benjamin and Gramann, Klaus and Clarke, Alex},
date = {2024-05-31},
eprinttype = {bioRxiv},
eprintclass = {New Results},
pages = {2024.05.30.596613},
doi = {10.1101/2024.05.30.596613},
url = {https://www.biorxiv.org/content/10.1101/2024.05.30.596613v1},
urldate = {2024-07-01},
langid = {english},
pubstate = {prepublished},
}
Abstract
Objects in expected locations are recognised faster and more accurately than objects in incongruent environments. This congruency effect has a neural component, with increased activity for objects in incongruent environments. Studies have increasingly shown differences between neural processes in realistic environments and tasks, and neural processes in the laboratory. To what extent do findings obtained from a laboratory setting translate to neural processes elicited in real-world environments? We investigated how object recognition is modulated when objects are placed in real environments using augmented reality while recording mobile EEG. Participants approached, viewed, and rated how congruent they found the objects with the environment. We found significantly higher theta-band power for objects in incongruent contexts than objects in congruent contexts. This demonstrates that real-world contexts impact how we recognize objects, and that mobile brain imaging and augmented reality are effective tools to study cognition in the wild. Teaser: Combining augmented reality with mobile brain imaging to show that real-world contexts modulate object recognition processes.
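The central contrast, theta-band power for incongruent versus congruent objects, can be sketched with Welch power spectra averaged over 4-7 Hz; the sampling rate, band limits, and simple averaging below are assumptions rather than the study's full time-frequency pipeline.

```python
# Illustrative sketch (not the study's full time-frequency pipeline): Welch
# power spectra per EEG epoch, averaged over the 4-7 Hz theta band and over
# channels, contrasted between incongruent and congruent object presentations.
import numpy as np
from scipy.signal import welch

def theta_power(epochs, sfreq, band=(4.0, 7.0)):
    """epochs: (n_epochs, n_channels, n_times) -> mean theta power per epoch."""
    freqs, psd = welch(epochs, fs=sfreq, nperseg=min(256, epochs.shape[-1]), axis=-1)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[..., mask].mean(axis=(-1, -2))   # average over band and channels

def condition_means(incongruent, congruent, sfreq=500.0):
    """Mean theta power per condition; statistical inference (e.g., a paired
    test across participants) would operate on per-participant averages."""
    return theta_power(incongruent, sfreq).mean(), theta_power(congruent, sfreq).mean()
```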
Sassenhagen, J., & Fiebach, C. J. (2020). Traces of Meaning Itself: Encoding Distributional Word Vectors in Brain Activity. Neurobiology of Language, 1(1), 54–76. https://doi.org/10.1162/nol_a_00003
@article{sassenhagenTracesMeaningItself2020,
title = {Traces of {{Meaning Itself}}: {{Encoding Distributional Word Vectors}} in {{Brain Activity}}},
shorttitle = {Traces of {{Meaning Itself}}},
author = {Sassenhagen, Jona and Fiebach, Christian J.},
date = {2020-03-01},
journaltitle = {Neurobiology of Language},
shortjournal = {Neurobiology of Language},
volume = {1},
number = {1},
pages = {54--76},
issn = {2641-4368},
doi = {10.1162/nol_a_00003},
url = {https://doi.org/10.1162/nol_a_00003},
urldate = {2024-07-01},
}
Abstract
How is semantic information stored in the human mind and brain? Some philosophers and cognitive scientists argue for vectorial representations of concepts, where the meaning of a word is represented as its position in a high-dimensional neural state space. At the intersection of natural language processing and artificial intelligence, a class of very successful distributional word vector models has developed that can account for classic EEG findings of language, that is, the ease versus difficulty of integrating a word with its sentence context. However, models of semantics have to account not only for context-based word processing, but should also describe how word meaning is represented. Here, we investigate whether distributional vector representations of word meaning can model brain activity induced by words presented without context. Using EEG activity (event-related brain potentials) collected while participants in two experiments (English and German) read isolated words, we encoded and decoded word vectors taken from the family of prediction-based Word2vec algorithms. We found that, first, the position of a word in vector space allows the prediction of the pattern of corresponding neural activity over time, in particular during a time window of 300 to 500 ms after word onset. Second, distributional models perform better than a human-created taxonomic baseline model (WordNet), and this holds for several distinct vector-based models. Third, multiple latent semantic dimensions of word meaning can be decoded from brain activity. Combined, these results suggest that empiricist, prediction-based vectorial representations of meaning are a viable candidate for the representational architecture of human semantic knowledge.
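The encoding analysis can be sketched as a separate cross-validated ridge map from word vectors to the EEG topography at each time point, scored by the correlation between predicted and observed activity; the implementation below is illustrative, and its regularization and scoring choices are assumptions.

```python
# Time-resolved encoding sketch (illustrative; regularization and scoring
# choices are assumptions): predict the EEG topography at each time point from
# word vectors and score by cross-validated prediction-data correlation.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def timewise_encoding(word_vecs, erps, alphas=(1.0, 10.0, 100.0)):
    """word_vecs: (n_words, d); erps: (n_words, n_channels, n_times).
    Returns the mean prediction-data correlation at each time point."""
    n_words, n_channels, n_times = erps.shape
    scores = np.zeros(n_times)
    for t in range(n_times):
        y = erps[:, :, t]
        preds = np.zeros_like(y)
        for train, test in KFold(5, shuffle=True, random_state=0).split(word_vecs):
            model = RidgeCV(alphas=alphas).fit(word_vecs[train], y[train])
            preds[test] = model.predict(word_vecs[test])
        zp = (preds - preds.mean(0)) / (preds.std(0) + 1e-12)
        zy = (y - y.mean(0)) / (y.std(0) + 1e-12)
        scores[t] = (zp * zy).mean()
    return scores
```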
Schwartz, D., Toneva, M., & Wehbe, L. (2019). Inducing Brain-Relevant Bias in Natural Language Processing Models. https://doi.org/10.48550/arXiv.1911.03268
@online{schwartzInducingBrainrelevantBias2019,
title = {Inducing Brain-Relevant Bias in Natural Language Processing Models},
author = {Schwartz, Dan and Toneva, Mariya and Wehbe, Leila},
date = {2019-10-29},
eprint = {1911.03268},
eprinttype = {arXiv},
eprintclass = {cs, q-bio},
doi = {10.48550/arXiv.1911.03268},
url = {http://arxiv.org/abs/1911.03268},
urldate = {2024-07-01},
pubstate = {prepublished},
keywords = {Computer Science - Computation and Language,Computer Science - Machine Learning,Quantitative Biology - Neurons and Cognition},
}
Abstract
Progress in natural language processing (NLP) models that estimate representations of word sequences has recently been leveraged to improve the understanding of language processing in the brain. However, these models have not been specifically designed to capture the way the brain represents language meaning. We hypothesize that fine-tuning these models to predict recordings of brain activity of people reading text will lead to representations that encode more brain-activity-relevant language information. We demonstrate that a version of BERT, a recently introduced and powerful language model, can improve the prediction of brain activity after fine-tuning. We show that the relationship between language and brain activity learned by BERT during this fine-tuning transfers across multiple participants. We also show that, for some participants, the fine-tuned representations learned from both magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) are better for predicting fMRI than the representations learned from fMRI alone, indicating that the learned representations capture brain-activity-relevant information that is not simply an artifact of the modality. While changes to language representations help the model predict brain activity, they also do not harm the model’s ability to perform downstream NLP tasks. Our findings are notable for research on language understanding in the brain.
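The fine-tuning idea can be sketched as a pretrained BERT encoder with a linear head that predicts voxel responses for a text window, trained end to end so the encoder's representations shift toward brain-relevant information. The model name, mean pooling, and regression loss in the sketch are assumptions, not the paper's exact setup.

```python
# Sketch of the fine-tuning idea (not the paper's code): a pretrained BERT
# encoder with a linear head predicting fMRI voxel responses for a text window,
# trained end to end so the encoder drifts toward brain-relevant representations.
# Model name, mean pooling, and the regression loss are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class BrainTunedBert(nn.Module):
    def __init__(self, n_voxels, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_voxels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state.mean(dim=1)     # mean-pool token states
        return self.head(pooled)

# Training would minimize a regression loss (e.g., nn.MSELoss()) between the
# predicted and measured voxel responses for each presented text window.
```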
Toneva, M., Mitchell, T. M., & Wehbe, L. (2022). Combining Computational Controls with Natural Text Reveals Aspects of Meaning Composition. Nature Computational Science, 2(11), 745–757. https://doi.org/10.1038/s43588-022-00354-6
@article{tonevaCombiningComputationalControls2022,
title = {Combining Computational Controls with Natural Text Reveals Aspects of Meaning Composition},
author = {Toneva, Mariya and Mitchell, Tom M. and Wehbe, Leila},
date = {2022-11},
journaltitle = {Nature Computational Science},
shortjournal = {Nat Comput Sci},
volume = {2},
number = {11},
pages = {745--757},
publisher = {Nature Publishing Group},
issn = {2662-8457},
doi = {10.1038/s43588-022-00354-6},
url = {https://www.nature.com/articles/s43588-022-00354-6},
urldate = {2024-07-01},
langid = {english},
keywords = {Computer science,Language,Neural encoding},
}
Abstract
To study a core component of human intelligence—our ability to combine the meaning of words—neuroscientists have looked to linguistics. However, linguistic theories are insufficient to account for all brain responses reflecting linguistic composition. In contrast, we adopt a data-driven approach to study the composed meaning of words beyond their individual meaning, which we term ‘supra-word meaning’. We construct a computational representation for supra-word meaning and study its brain basis through brain recordings from two complementary imaging modalities. Using functional magnetic resonance imaging, we reveal that hubs that are thought to process lexical meaning also maintain supra-word meaning, suggesting a common substrate for lexical and combinatorial semantics. Surprisingly, we cannot detect supra-word meaning in magnetoencephalography, which suggests that composed meaning might be maintained through a different neural mechanism than the synchronized firing of pyramidal cells. This sensitivity difference has implications for past neuroimaging results and future wearable neurotechnology.
Toneva, M., & Wehbe, L. (2019). Interpreting and Improving Natural-Language Processing (in Machines) with Natural Language-Processing (in the Brain). arXiv.org. https://arxiv.org/abs/1905.11833v4
@online{tonevaInterpretingImprovingNaturallanguage2019,
title = {Interpreting and Improving Natural-Language Processing (in Machines) with Natural Language-Processing (in the Brain)},
author = {Toneva, Mariya and Wehbe, Leila},
date = {2019-05-28},
url = {https://arxiv.org/abs/1905.11833v4},
urldate = {2024-07-01},
langid = {english},
organization = {arXiv.org},
}
Abstract
Neural network models for NLP are typically implemented without the explicit encoding of language rules and yet they are able to break one performance record after another. This has generated a lot of research interest in interpreting the representations learned by these networks. We propose here a novel interpretation approach that relies on the only processing system we have that does understand language: the human brain. We use brain imaging recordings of subjects reading complex natural text to interpret word and sequence embeddings from 4 recent NLP models - ELMo, USE, BERT and Transformer-XL. We study how their representations differ across layer depth, context length, and attention type. Our results reveal differences in the context-related representations across these models. Further, in the transformer models, we find an interaction between layer depth and context length, and between layer depth and attention type. We finally hypothesize that altering BERT to better align with brain recordings would enable it to also better understand language. Probing the altered BERT using syntactic NLP tasks reveals that the model with increased brain-alignment outperforms the original model. Cognitive neuroscientists have already begun using NLP networks to study the brain, and this work closes the loop to allow the interaction between NLP and cognitive neuroscience to be a true cross-pollination.