Evkaya, O., & de Carvalho, M. (2024). Decoding AI: The inside story of data analysis in ChatGPT. arXiv. http://arxiv.org/abs/2404.08480
@misc{evkaya_decoding_2024,
title = {Decoding {AI}: {The} inside story of data analysis in {ChatGPT}},
shorttitle = {Decoding {AI}},
url = {http://arxiv.org/abs/2404.08480},
language = {en},
urldate = {2024-07-31},
publisher = {arXiv},
author = {Evkaya, Ozan and de Carvalho, Miguel},
month = apr,
year = {2024},
note = {arXiv:2404.08480 [cs, stat]},
keywords = {Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Computation},
file = {Evkaya und de Carvalho - 2024 - Decoding AI The inside story of data analysis in .pdf:C\:\\Users\\felix\\Zotero\\storage\\EIXA24HF\\Evkaya und de Carvalho - 2024 - Decoding AI The inside story of data analysis in .pdf:application/pdf}
}
Abstract
As a result of recent advancements in generative AI, the field of Data Science is prone to various changes. This review critically examines the Data Analysis (DA) capabilities of ChatGPT assessing its performance across a wide range of tasks. While DA provides researchers and practitioners with unprecedented analytical capabilities, it is far from being perfect, and it is important to recognize and address its limitations.
Aubret, A., Schaumlöffel, T., Roig, G., & Triesch, J. (2024, April). Learning Object Semantic Similarity with Self-Supervision. Proceedings of the 2024 IEEE International Conference on Development and Learning (ICDL). https://doi.org/10.48550/arXiv.2405.05143
@inproceedings{aubret_learning_2024,
title = {Learning {Object} {Semantic} {Similarity} with {Self}-{Supervision}},
url = {http://arxiv.org/abs/2405.05143},
doi = {10.48550/arXiv.2405.05143},
urldate = {2024-07-01},
booktitle = {Proceedings of the 2024 {IEEE} {International} {Conference} on {Development} and {Learning} ({ICDL})},
publisher = {arXiv},
author = {Aubret, Arthur and Schaumlöffel, Timothy and Roig, Gemma and Triesch, Jochen},
month = apr,
year = {2024},
note = {arXiv:2405.05143 [cs]},
keywords = {Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Computer Vision and Pattern Recognition},
file = {arXiv Fulltext PDF:C\:\\Users\\felix\\Zotero\\storage\\X3EFC5TM\\Aubret et al. - 2024 - Learning Object Semantic Similarity with Self-Supe.pdf:application/pdf;arXiv.org Snapshot:C\:\\Users\\felix\\Zotero\\storage\\C43SMCQ9\\2405.html:text/html}
}
Abstract
Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceived as similar, at least in part, because they are often experienced together in a “kitchen” or “eating” context. Here, we investigate whether a bio-inspired learning principle exploiting such co-occurrence statistics suffices to learn a semantically structured object representation de novo from raw visual or combined visual and linguistic input. To this end, we simulate temporal sequences of visual experience by binding together short video clips of real-world scenes showing objects in different contexts. A bio-inspired neural network model aligns close-in-time visual representations while also aligning visual and category label representations to simulate visuo-language alignment. Our results show that our model clusters object representations based on their context, e.g. kitchen or bedroom, in particular in high-level layers of the network, akin to humans. In contrast, lower-level layers tend to better reflect object identity or category. To achieve this, the model exploits two distinct strategies: the visuo-language alignment ensures that different objects of the same category are represented similarly, whereas the temporal alignment leverages that objects from the same context are frequently seen in succession to make their representations more similar. Overall, our work suggests temporal and visuo-language alignment as plausible computational principles for explaining the origins of certain forms of semantic knowledge in humans.
Ernst, M. R., López, F. M., Aubret, A., Fleming, R. W., & Triesch, J. (2024, April). Self-Supervised Learning of Color Constancy. Proceedings of the 2024 IEEE International Conference on Development and Learning (ICDL). http://arxiv.org/abs/2404.08127
@inproceedings{ernst_self-supervised_2024,
title = {Self-{Supervised} {Learning} of {Color} {Constancy}},
url = {http://arxiv.org/abs/2404.08127},
language = {en},
urldate = {2024-07-01},
booktitle = {Proceedings of the 2024 {IEEE} {International} {Conference} on {Development} and {Learning} ({ICDL})},
publisher = {arXiv},
author = {Ernst, Markus R. and López, Francisco M. and Aubret, Arthur and Fleming, Roland W. and Triesch, Jochen},
month = apr,
year = {2024},
note = {arXiv:2404.08127 [cs]},
keywords = {Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence},
file = {2404.pdf:C\:\\Users\\felix\\Zotero\\storage\\TSYSWQ9R\\2404.pdf:application/pdf}
}
Abstract
Color constancy (CC) describes the ability of the visual system to perceive an object as having a relatively constant color despite changes in lighting conditions. While CC and its limitations have been carefully characterized in humans, it is still unclear how the visual system acquires this ability during development. Here, we present a first study showing that CC develops in a neural network trained in a self-supervised manner through an invariance learning objective. During learning, objects are presented under changing illuminations, while the network aims to map subsequent views of the same object onto close-by latent representations. This gives rise to representations that are largely invariant to the illumination conditions, offering a plausible example of how CC could emerge during human cognitive development via a form of self-supervised learning.
Vilas, M. G., Adolfi, F., Poeppel, D., & Roig, G. (2024, June). Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience. Forty-First International Conference on Machine Learning. https://openreview.net/forum?id=66KmnMhGU5
@inproceedings{vilas_position_2024,
title = {Position: {An} {Inner} {Interpretability} {Framework} for {AI} {Inspired} by {Lessons} from {Cognitive} {Neuroscience}},
shorttitle = {Position},
url = {https://openreview.net/forum?id=66KmnMhGU5},
language = {en},
urldate = {2024-07-30},
booktitle = {Forty-first {International} {Conference} on {Machine} {Learning}},
author = {Vilas, Martina G. and Adolfi, Federico and Poeppel, David and Roig, Gemma},
month = jun,
year = {2024},
file = {Full Text PDF:C\:\\Users\\felix\\Zotero\\storage\\TT78LFKT\\Vilas et al. - 2024 - Position An Inner Interpretability Framework for .pdf:application/pdf}
}
Abstract
Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field: Cognitive Neuroscience. Here we draw the relevant connections and highlight lessons that can be transferred productively between fields. Based on these, we propose a general conceptual framework and give concrete methodological strategies for building mechanistic explanations in AI inner interpretability research. With this conceptual framework, Inner Interpretability can fend off critiques and position itself on a productive path to explain AI systems.
Oota, S. R., Çelik, E., Deniz, F., & Toneva, M. (2024, June). Speech language models lack important brain-relevant semantics. https://doi.org/10.48550/arXiv.2311.04664
@inproceedings{oota_speech_2024,
title = {Speech language models lack important brain-relevant semantics},
url = {http://arxiv.org/abs/2311.04664},
doi = {10.48550/arXiv.2311.04664},
urldate = {2024-07-01},
publisher = {arXiv},
author = {Oota, Subba Reddy and Çelik, Emin and Deniz, Fatma and Toneva, Mariya},
month = jun,
year = {2024},
note = {arXiv:2311.04664 [cs, eess, q-bio]},
keywords = {Computer Science - Machine Learning, Computer Science - Computation and Language, Quantitative Biology - Neurons and Cognition, Electrical Engineering and Systems Science - Audio and Speech Processing},
file = {arXiv Fulltext PDF:C\:\\Users\\felix\\Zotero\\storage\\2W5MRDIE\\Oota et al. - 2024 - Speech language models lack important brain-releva.pdf:application/pdf;arXiv.org Snapshot:C\:\\Users\\felix\\Zotero\\storage\\SCRZEMZV\\2311.html:text/html}
}
Abstract
Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we systematically remove specific low-level stimulus features (textual, speech, and visual) from language model representations to assess their impact on alignment with fMRI brain recordings during reading and listening. Comparing these findings with speech-based language models reveals starkly different effects of low-level features on brain alignment. While text-based models show reduced alignment in early sensory regions post-removal, they retain significant predictive power in late language regions. In contrast, speech-based models maintain strong alignment in early auditory regions even after feature removal but lose all predictive power in late language regions. These results suggest that speech-based models provide insights into additional information processed by early auditory regions, but caution is needed when using them to model processing in late language regions. We make our code publicly available. [https://github.com/subbareddy248/speech-llm-brain]
Lahner, B., Dwivedi, K., Iamshchinina, P., Graumann, M., Lascelles, A., Roig, G., Gifford, A. T., Pan, B., Jin, S. Y., Ratan Murty, N. A., Kay, K., Oliva, A., & Cichy, R. (2024). Modeling short visual events through the BOLD moments video fMRI dataset and metadata. Nature Communications, 15(1), 6241. https://doi.org/10.1038/s41467-024-50310-3
@article{lahner_modeling_2024,
title = {Modeling short visual events through the {BOLD} moments video {fMRI} dataset and metadata},
volume = {15},
copyright = {2024 The Author(s)},
issn = {2041-1723},
url = {https://www.nature.com/articles/s41467-024-50310-3},
doi = {10.1038/s41467-024-50310-3},
language = {en},
number = {1},
urldate = {2024-07-30},
journal = {Nature Communications},
author = {Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N. Apurva and Kay, Kendrick and Oliva, Aude and Cichy, Radoslaw},
month = jul,
year = {2024},
note = {Publisher: Nature Publishing Group},
keywords = {Perception, Visual system, Neural encoding},
pages = {6241},
file = {Full Text PDF:C\:\\Users\\felix\\Zotero\\storage\\MLQBFTHX\\Lahner et al. - 2024 - Modeling short visual events through the BOLD mome.pdf:application/pdf}
}
Abstract
Studying the neural basis of human dynamic visual perception requires extensive experimental data to evaluate the large swathes of functionally diverse brain neural networks driven by perceiving visual events. Here, we introduce the BOLD Moments Dataset (BMD), a repository of whole-brain fMRI responses to over 1000 short (3 s) naturalistic video clips of visual events across ten human subjects. We use the videos’ extensive metadata to show how the brain represents word- and sentence-level descriptions of visual events and identify correlates of video memorability scores extending into the parietal cortex. Furthermore, we reveal a match in hierarchical processing between cortical regions of interest and video-computable deep neural networks, and we showcase that BMD successfully captures temporal dynamics of visual events at second resolution. With its rich metadata, BMD offers new perspectives and accelerates research on the human brain basis of visual event perception.
Yu, Z., Aubret, A., Raabe, M. C., Yang, J., Yu, C., & Triesch, J. (2024). Active Gaze Behavior Boosts Self-Supervised Object Learning. arXiv. https://doi.org/10.48550/arXiv.2411.01969
@misc{yu_active_2024,
title = {Active {Gaze} {Behavior} {Boosts} {Self}-{Supervised} {Object} {Learning}},
url = {http://arxiv.org/abs/2411.01969},
doi = {10.48550/arXiv.2411.01969},
urldate = {2024-12-27},
publisher = {arXiv},
author = {Yu, Zhengyang and Aubret, Arthur and Raabe, Marcel C. and Yang, Jane and Yu, Chen and Triesch, Jochen},
month = nov,
year = {2024},
note = {arXiv:2411.01969 [cs]},
keywords = {Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition},
file = {Preprint PDF:C\:\\Users\\felix\\Zotero\\storage\\G8XWI7GL\\Yu et al. - 2024 - Active Gaze Behavior Boosts Self-Supervised Object Learning.pdf:application/pdf;Snapshot:C\:\\Users\\felix\\Zotero\\storage\\6634GLNC\\2411.html:text/html}
}
Abstract
Due to significant variations in the projection of the same object from different viewpoints, machine learning algorithms struggle to recognize the same object across various perspectives. In contrast, toddlers quickly learn to recognize objects from different viewpoints with almost no supervision. Recent works argue that toddlers develop this ability by mapping close-in-time visual inputs to similar representations while interacting with objects. High acuity vision is only available in the central visual field, which may explain why toddlers (much like adults) constantly move their gaze around during such interactions. It is unclear whether/how much toddlers curate their visual experience through these eye movements to support learning object representations. In this work, we explore whether a bio-inspired visual learning model can harness toddlers’ gaze behavior during a play session to develop view-invariant object recognition. Exploiting head-mounted eye tracking during dyadic play, we simulate toddlers’ central visual field experience by cropping image regions centered on the gaze location. This visual stream feeds a time-based self-supervised learning algorithm. Our experiments demonstrate that toddlers’ gaze strategy supports the learning of invariant object representations. Our analysis also reveals that the limited size of the central visual field where acuity is high is crucial for this. We further find that toddlers’ visual experience elicits more robust representations compared to adults’ mostly because toddlers look at objects they hold themselves for longer bouts. Overall, our work reveals how toddlers’ gaze behavior supports self-supervised learning of view-invariant object recognition.
Neamaalkassis, H., Boubenec, Y., Muralikrishnan, R., Fiebach, C., & Tavano, A. (2024). The fundamental frequencies of our own voice. OSF. https://doi.org/10.31234/osf.io/fm9ed
@misc{neamaalkassis_fundamental_2024,
title = {The fundamental frequencies of our own voice},
url = {https://osf.io/fm9ed},
doi = {10.31234/osf.io/fm9ed},
language = {en-us},
urldate = {2024-12-27},
publisher = {OSF},
author = {Neamaalkassis, Hakam and Boubenec, Yves and Muralikrishnan, R. and Fiebach, Christian and Tavano, Alessandro},
month = feb,
year = {2024},
file = {OSF Preprint:C\:\\Users\\felix\\Zotero\\storage\\T4XNRFQW\\Neamaalkassis et al. - 2024 - The fundamental frequencies of our own voice.pdf:application/pdf}
}
Abstract
Own actions send a corollary discharge (CD) signal, that is a copy of the planned motor program, to sensory-specific brain areas to suppress the anticipated sensory response, providing a neural basis for the sense of self. When we speak, the sensory consequences of the fundamental frequency (f0) of our own voice, generated by vocal fold vibrations, are suppressed. However, due to bone/air conduction filtering effects, the f0 we self-generate is measurably different from the f0 we subjectively perceive as defining our own voice. Using an auditory change deafness paradigm, we parametrically tested the sensitivity to auditory change in the frequency neighbourhoods of individual objective and subjective voice f0, and found that participants experience change deafness for both to a similar extent, relative to a control pitch condition. We conclude that when we listen attentively, we are likely to filter out voice pitches in the vicinity of our own objective and subjective voice f0, possibly as a long-term consequence of speaking-induced suppression mechanisms integrated with individual, perceptual bodily priors.
Taylor, J. E., Sinn, R., Iaia, C., & Fiebach, C. J. (2024). Beyond Letters: Optimal Transport as a Model for Sub-Letter Orthographic Processing. bioRxiv. https://doi.org/10.1101/2024.11.11.622929
Abstract
Letter processing plays a key role in visual word recognition. However, word recognition models typically overlook or greatly simplify early perceptual processes of letter recognition. We suggest that optimal transport theory may provide a computational framework for describing letter shape processing. We use representational similarity analysis to show that optimal transport cost (Wasserstein distance) between pairs of letters aligns with neural activity elicited by visually presented letters <225 ms after stimulus onset, outperforming an existing approach based on shape overlap. We additionally show that optimal transport can capture the emergence of geometric invariances (e.g., to position or size) observed in letter perception. Finally, we demonstrate that Wasserstein distance predicts neural activity similarly well to features from artificial networks trained to classify images and letters. However, whereas representations in artificial neural networks emerge in a computationally unconstrained manner, our proposal provides a computationally explicit route to modeling the earliest orthographic processes.
Gagl, B., Weyers, I., Eisenhauer, S., Fiebach, C. J., Colombo, M., Scarf, D., Ziegler, J. C., Grainger, J., Güntürkün, O., & Mueller, J. L. (2024). Non-Human Recognition of Orthography: How is it implemented and how does it differ from Human orthographic processing. bioRxiv. https://doi.org/10.1101/2024.06.25.600635
Abstract
The ability to robustly recognize strings of letters, a cornerstone of reading, was observed in Baboons and Pigeons despite their lack of phonological and semantic knowledge. Here, we apply a comparative modeling approach to investigate the neuro-cognitive basis of Human, Baboon, and Pigeon orthographic decision behavior, addressing whether phylogenetic relatedness entails similar underlying neuro-cognitive phenotypes. We use the highly transparent Speechless Reader Model (SLR), which assumes letter string recognition based on widely accepted computational principles of predictive coding so that orthographic decisions rely on a prediction error signal emerging from multiple, hierarchically ordered representational levels, i.e., low-level visual, letter, or letter sequence representations. We investigate which representations species use during successful orthographic decision-making. We introduce multiple SLR variants, each including one or multiple prediction error representations, and compare the simulations of each SLR variant to the orthographic decisions from individuals of three species after learning letter strings without meaning. Humans predominantly relied on letter-sequence-level representations, resulting in the highest task performance in behavior and model simulations. Baboons also relied on sequence-based representations but in combination with pixel- and letter-level representations. In contrast, all Pigeons relied on pixel-level representations, partly in combination with letter- and letter-sequence-level representations. These findings suggest that orthographic representations utilized in orthographic decisions reflect the phylogenetic distance between species: Humans and Baboons use more similar representations compared to Pigeons. Overall, the description of orthographic decisions based on a small set of representations and computations was highly successful in describing behavior, even for Humans who mastered reading in its entirety.
Significance Statement
Imagine being able to read without ever learning the alphabet. Research has shown that baboons and pigeons can exhibit reading-like behavior, suggesting shared processes across the species involved. To increase our understanding of the similarities and differences between humans and animals in reading-like behavior, we use a computational model to uncover the underlying processes that enable humans, baboons, and pigeons to perform these tasks. We found that humans and baboons rely on similar processes, focusing on information related to letters and letter sequences. In contrast, pigeons rely more heavily on visual cues. This discovery sheds light on the evolution of processes underlying reading and reading-like behavior, indicating that the lower the evolutionary distance between species, the more similar processes are involved.
Aubret, A., Teulière, C., & Triesch, J. (2024). Self-supervised visual learning from interactions with objects. arXiv. https://doi.org/10.48550/arXiv.2407.06704
@misc{aubret_self-supervised_2024,
title = {Self-supervised visual learning from interactions with objects},
url = {http://arxiv.org/abs/2407.06704},
doi = {10.48550/arXiv.2407.06704},
urldate = {2024-12-27},
publisher = {arXiv},
author = {Aubret, Arthur and Teulière, Céline and Triesch, Jochen},
month = aug,
year = {2024},
note = {arXiv:2407.06704 [cs]},
keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning},
file = {Preprint PDF:C\:\\Users\\felix\\Zotero\\storage\\IGFD4Q5M\\Aubret et al. - 2024 - Self-supervised visual learning from interactions with objects.pdf:application/pdf;Snapshot:C\:\\Users\\felix\\Zotero\\storage\\GNVXGGY3\\2407.html:text/html}
}
Abstract
Self-supervised learning (SSL) has revolutionized visual representation learning, but has not achieved the robustness of human vision. A reason for this could be that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn or move around objects and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. For this, we extract the actions performed to change from one ego-centric view of an object to another in four video datasets. We then introduce a new loss function to learn visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This permits the performed actions to structure the latent visual representation. Our experiments show that our method consistently outperforms previous methods on downstream category recognition. In our analysis, we find that the observed improvement is associated with a better viewpoint-wise alignment of different objects from the same category. Overall, our work demonstrates that embodied interactions with objects can improve SSL of object categories.
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv. https://doi.org/10.48550/arXiv.2203.11171
@misc{wang_self-consistency_2023,
title = {Self-{Consistency} {Improves} {Chain} of {Thought} {Reasoning} in {Language} {Models}},
url = {http://arxiv.org/abs/2203.11171},
doi = {10.48550/arXiv.2203.11171},
urldate = {2024-07-30},
publisher = {arXiv},
author = {Wang, Xuezhi and Wei, Jason and Schuurmans, Dale and Le, Quoc and Chi, Ed and Narang, Sharan and Chowdhery, Aakanksha and Zhou, Denny},
month = mar,
year = {2023},
note = {arXiv:2203.11171 [cs]},
keywords = {Computer Science - Computation and Language, Computer Science - Artificial Intelligence},
file = {arXiv Fulltext PDF:C\:\\Users\\felix\\Zotero\\storage\\RXABRRVP\\Wang et al. - 2023 - Self-Consistency Improves Chain of Thought Reasoni.pdf:application/pdf;arXiv.org Snapshot:C\:\\Users\\felix\\Zotero\\storage\\8NGA4CCN\\2203.html:text/html}
}
Abstract
Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
Vilas, M. G., Schaumlöffel, T., & Roig, G. (2023). Analyzing Vision Transformers for Image Classification in Class Embedding Space. Advances in Neural Information Processing Systems, 36, 40030–40041. https://proceedings.neurips.cc/paper_files/paper/2023/hash/7dd309df03d37643b96f5048b44da798-Abstract-Conference.html
@article{vilas_analyzing_2023,
title = {Analyzing {Vision} {Transformers} for {Image} {Classification} in {Class} {Embedding} {Space}},
volume = {36},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/hash/7dd309df03d37643b96f5048b44da798-Abstract-Conference.html},
language = {en},
urldate = {2024-07-30},
journal = {Advances in Neural Information Processing Systems},
author = {Vilas, Martina G. and Schaumlöffel, Timothy and Roig, Gemma},
month = dec,
year = {2023},
pages = {40030--40041},
file = {Full Text PDF:C\:\\Users\\felix\\Zotero\\storage\\KML6KDAY\\Vilas et al. - 2023 - Analyzing Vision Transformers for Image Classifica.pdf:application/pdf}
}
Oota, S., Gupta, M., & Toneva, M. (2023). Joint processing of linguistic properties in brains and language models. Advances in Neural Information Processing Systems, 36, 18001–18014. https://proceedings.neurips.cc/paper_files/paper/2023/hash/3a0e2de215bd17c39ad08ba1d16c1b12-Abstract-Conference.html
@article{oota_joint_2023,
title = {Joint processing of linguistic properties in brains and language models},
volume = {36},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/hash/3a0e2de215bd17c39ad08ba1d16c1b12-Abstract-Conference.html},
language = {en},
urldate = {2024-07-30},
journal = {Advances in Neural Information Processing Systems},
author = {Oota, Subbareddy and Gupta, Manish and Toneva, Mariya},
month = dec,
year = {2023},
pages = {18001--18014},
file = {Full Text PDF:C\:\\Users\\felix\\Zotero\\storage\\QGZCWPRS\\Oota et al. - 2023 - Joint processing of linguistic properties in brain.pdf:application/pdf}
}
Schaumlöffel, T., Vilas, M. G., & Roig, G. (2023). PEACS: Prefix Encoding for Auditory Caption Synthesis [Technical report]. DCASE2023 Challenge. https://dcase.community/documents/challenge2023/technical_reports/DCASE2023_Schaumloeffel_107_t6a.pdf
@techreport{schaumloffel_peacs_2023,
title = {{PEACS}: {Prefix} {Encoding} for {Auditory} {Caption} {Synthesis}},
url = {https://dcase.community/documents/challenge2023/technical_reports/DCASE2023_Schaumloeffel_107_t6a.pdf},
language = {en},
urldate = {2024-07-01},
institution = {DCASE2023 Challenge},
author = {Schaumlöffel, Timothy and Vilas, Martina G. and Roig, Gemma},
year = {2023}
}
Abstract
This technical report describes an Automated Audio Captioning system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge, Task 6a (automated audio captioning). Our approach employs an encoder-decoder architecture, with the encoder utilizing a large contrastive pre-trained HTS-AT capable of handling variable-length audio segments. The decoder is based on the GPT2 model. To incorporate audio into the decoding process, we employ a light mapping network that translates audio representations into a prefix, effectively guiding the decoder’s generation process. Given the limited data availability, we pre-train our model on various audio captioning datasets and fine-tune it on Clotho. We reach a SPIDERr-FL score of 29.3 on the evaluation split of the Clotho-v2 dataset.
Schaumlöffel, T., Aubret, A., Roig, G., & Triesch, J. (2023). Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play. 2023 IEEE International Conference on Development and Learning (ICDL), 67–72. https://doi.org/10.1109/ICDL55364.2023.10364409
@inproceedings{schaumloffel_caregiver_2023,
title = {Caregiver {Talk} {Shapes} {Toddler} {Vision}: {A} {Computational} {Study} of {Dyadic} {Play}},
shorttitle = {Caregiver {Talk} {Shapes} {Toddler} {Vision}},
url = {http://arxiv.org/abs/2312.04118},
doi = {10.1109/ICDL55364.2023.10364409},
urldate = {2024-07-01},
booktitle = {2023 {IEEE} {International} {Conference} on {Development} and {Learning} ({ICDL})},
author = {Schaumlöffel, Timothy and Aubret, Arthur and Roig, Gemma and Triesch, Jochen},
month = nov,
year = {2023},
note = {arXiv:2312.04118 [cs]},
keywords = {Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence},
pages = {67--72},
file = {arXiv Fulltext PDF:C\:\\Users\\felix\\Zotero\\storage\\AV6UR8J8\\Schaumlöffel et al. - 2023 - Caregiver Talk Shapes Toddler Vision A Computatio.pdf:application/pdf;arXiv.org Snapshot:C\:\\Users\\felix\\Zotero\\storage\\WDLDI6PZ\\2312.html:text/html}
}
Abstract
Infants’ ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers’ utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers’ utterances can nevertheless enhance visual representations. For this we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers’ utterances, modeled as captions. We propose to model toddlers’ learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers’ naming utterances can improve toddlers’ visual representations.
Xu, X., & Triesch, J. (2023). CIPER: Combining Invariant and Equivariant Representations Using Contrastive and Predictive Learning. In L. Iliadis, A. Papaleonidas, P. Angelov, & C. Jayne (Eds.), Artificial Neural Networks and Machine Learning – ICANN 2023 (pp. 320–331). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-44213-1_27
@inproceedings{xu_ciper_2023,
address = {Cham},
title = {{CIPER}: {Combining} {Invariant} and {Equivariant} {Representations} {Using} {Contrastive} and {Predictive} {Learning}},
isbn = {978-3-031-44213-1},
shorttitle = {{CIPER}},
doi = {10.1007/978-3-031-44213-1_27},
language = {en},
booktitle = {Artificial {Neural} {Networks} and {Machine} {Learning} – {ICANN} 2023},
publisher = {Springer Nature Switzerland},
author = {Xu, Xia and Triesch, Jochen},
editor = {Iliadis, Lazaros and Papaleonidas, Antonios and Angelov, Plamen and Jayne, Chrisina},
year = {2023},
pages = {320--331},
file = {Eingereichte Version:C\:\\Users\\felix\\Zotero\\storage\\C8HQU98R\\Xu und Triesch - 2023 - CIPER Combining Invariant and Equivariant Representations Using Contrastive and Predictive Learning.pdf:application/pdf}
}
Abstract
Self-supervised representation learning (SSRL) methods have shown great success in computer vision. In recent studies, augmentation-based contrastive learning methods have been proposed for learning representations that are invariant or equivariant to pre-defined data augmentation operations. However, invariant or equivariant features favor only specific downstream tasks depending on the augmentations chosen. They may result in poor performance when the learned representation does not match task requirements. Here, we consider an active observer that can manipulate views of an object and has knowledge of the action(s) that generated each view. We introduce Contrastive Invariant and Predictive Equivariant Representation learning (CIPER). CIPER comprises both invariant and equivariant learning objectives using one shared encoder and two different output heads on top of the encoder. One output head is a projection head with a state-of-the-art contrastive objective to encourage invariance to augmentations. The other is a prediction head estimating the augmentation parameters, capturing equivariant features. Both heads are discarded after training and only the encoder is used for downstream tasks. We evaluate our method on static image tasks and time-augmented image datasets. Our results show that CIPER outperforms a baseline contrastive method on various tasks. Interestingly, CIPER encourages the formation of hierarchically structured representations where different views of an object become systematically organized in the latent representation space.
Aubret, A., Ernst, M., Teulière, C., & Triesch, J. (2022). Time to augment self-supervised visual representation learning. arXiv. https://doi.org/10.48550/arXiv.2207.13492
@misc{aubret_time_2022,
title = {Time to augment self-supervised visual representation learning},
url = {http://arxiv.org/abs/2207.13492},
doi = {10.48550/arXiv.2207.13492},
language = {en},
urldate = {2024-12-27},
publisher = {arXiv},
author = {Aubret, Arthur and Ernst, Markus and Teulière, Céline and Triesch, Jochen},
month = dec,
year = {2022},
note = {arXiv:2207.13492 [cs]},
keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning},
file = {PDF:C\:\\Users\\felix\\Zotero\\storage\\KH2TTQDM\\Aubret et al. - 2022 - Time to augment self-supervised visual representation learning.pdf:application/pdf}
}
Abstract
Biological vision systems are unparalleled in their ability to learn visual representations without supervision. In machine learning, self-supervised learning (SSL) has led to major advances in forming object representations in an unsupervised fashion. Such systems learn representations invariant to augmentation operations over images, like cropping or flipping. In contrast, biological vision systems exploit the temporal structure of the visual experience during natural interactions with objects. This gives access to “augmentations” not commonly used in SSL, like watching the same object from multiple viewpoints or against different backgrounds. Here, we systematically investigate and compare the potential benefits of such time-based augmentations during natural interactions for learning object categories. Our results show that time-based augmentations achieve large performance gains over state-of-the-art image augmentations. Specifically, our analyses reveal that: 1) 3-D object manipulations drastically improve the learning of object categories; 2) viewing objects against changing backgrounds is important for learning to discard background-related information from the latent representation. Overall, we conclude that time-based augmentations during natural interactions with objects can substantially improve self-supervised learning, narrowing the gap between artificial and biological vision systems.
Background Publications
Bersch, D., Dwivedi, K., Vilas, M., Cichy, R. M., & Roig, G. (2022). Net2Brain: A Toolbox to Compare Artificial Vision Models with Human Brain Responses. https://doi.org/10.48550/arXiv.2208.09677
@online{berschNet2BrainToolboxCompare2022,
title = {{{Net2Brain}}: {{A Toolbox}} to Compare Artificial Vision Models with Human Brain Responses},
shorttitle = {{{Net2Brain}}},
author = {Bersch, Domenic and Dwivedi, Kshitij and Vilas, Martina and Cichy, Radoslaw M. and Roig, Gemma},
date = {2022-08-25},
eprint = {2208.09677},
eprinttype = {arXiv},
eprintclass = {cs, q-bio},
doi = {10.48550/arXiv.2208.09677},
url = {http://arxiv.org/abs/2208.09677},
urldate = {2024-07-01},
pubstate = {prepublished},
keywords = {Computer Science - Artificial Intelligence,Computer Science - Computer Vision and Pattern Recognition,Quantitative Biology - Neurons and Cognition},
file = {C\:\\Users\\felix\\Zotero\\storage\\XJ6BRPEW\\Bersch et al. - 2022 - Net2Brain A Toolbox to compare artificial vision .pdf;C\:\\Users\\felix\\Zotero\\storage\\42KCDVPV\\2208.html}
}
Abstract
We introduce Net2Brain, a graphical and command-line user interface toolbox for comparing the representational spaces of artificial deep neural networks (DNNs) and human brain recordings. While different toolboxes facilitate only single functionalities or only focus on a small subset of supervised image classification models, Net2Brain allows the extraction of activations of more than 600 DNNs trained to perform a diverse range of vision-related tasks (e.g., semantic segmentation, depth estimation, action recognition, etc.), over both image and video datasets. The toolbox computes the representational dissimilarity matrices (RDMs) over those activations and compares them to brain recordings using representational similarity analysis (RSA), weighted RSA, both in specific ROIs and with searchlight search. In addition, it is possible to add a new data set of stimuli and brain recordings to the toolbox for evaluation. We demonstrate the functionality and advantages of Net2Brain with an example showcasing how it can be used to test hypotheses of cognitive computational neuroscience.
Dwivedi, K., Cichy, R. M., & Roig, G. (2021). Unraveling Representations in Scene-selective Brain Regions Using Scene-Parsing Deep Neural Networks. Journal of Cognitive Neuroscience, 33(10), 2032–2043. https://doi.org/10.1162/jocn_a_01624
@article{dwivediUnravelingRepresentationsSceneselective2021,
title = {Unraveling {{Representations}} in {{Scene-selective Brain Regions Using Scene-Parsing Deep Neural Networks}}},
author = {Dwivedi, Kshitij and Cichy, Radoslaw Martin and Roig, Gemma},
date = {2021-09-01},
journaltitle = {Journal of Cognitive Neuroscience},
shortjournal = {Journal of Cognitive Neuroscience},
volume = {33},
number = {10},
pages = {2032--2043},
issn = {0898-929X},
doi = {10.1162/jocn_a_01624},
url = {https://doi.org/10.1162/jocn_a_01624},
urldate = {2024-07-01},
file = {C\:\\Users\\felix\\Zotero\\storage\\Q7NNH46Z\\Dwivedi et al. - 2021 - Unraveling Representations in Scene-selective Brai.pdf;C\:\\Users\\felix\\Zotero\\storage\\UK7HI79F\\Unraveling-Representations-in-Scene-selective.html}
}
Abstract
Visual scene perception is mediated by a set of cortical regions that respond preferentially to images of scenes, including the occipital place area (OPA) and parahippocampal place area (PPA). However, the differential contribution of OPA and PPA to scene perception remains an open research question. In this study, we take a deep neural network (DNN)-based computational approach to investigate the differences in OPA and PPA function. In a first step, we search for a computational model that predicts fMRI responses to scenes in OPA and PPA well. We find that DNNs trained to predict scene components (e.g., wall, ceiling, floor) explain higher variance uniquely in OPA and PPA than a DNN trained to predict scene category (e.g., bathroom, kitchen, office). This result is robust across several DNN architectures. On this basis, we then determine whether particular scene components predicted by DNNs differentially account for unique variance in OPA and PPA. We find that variance in OPA responses uniquely explained by the navigation-related floor component is higher compared to the variance explained by the wall and ceiling components. In contrast, PPA responses are better explained by the combination of wall and floor, that is, scene components that together contain the structure and texture of the scene. This differential sensitivity to scene components suggests differential functions of OPA and PPA in scene processing. Moreover, our results further highlight the potential of the proposed computational approach as a general tool in the investigation of the neural basis of human scene perception.
Dwivedi, K., Bonner, M. F., Cichy, R. M., & Roig, G. (2021). Unveiling Functions of the Visual Cortex Using Task-Specific Deep Neural Networks. PLOS Computational Biology, 17(8), e1009267. https://doi.org/10.1371/journal.pcbi.1009267
@article{dwivediUnveilingFunctionsVisual2021,
title = {Unveiling Functions of the Visual Cortex Using Task-Specific Deep Neural Networks},
author = {Dwivedi, Kshitij and Bonner, Michael F. and Cichy, Radoslaw Martin and Roig, Gemma},
date = {2021-08-13},
journaltitle = {PLOS Computational Biology},
shortjournal = {PLOS Computational Biology},
volume = {17},
number = {8},
pages = {e1009267},
publisher = {Public Library of Science},
issn = {1553-7358},
doi = {10.1371/journal.pcbi.1009267},
url = {https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009267},
urldate = {2024-07-01},
langid = {english},
keywords = {Functional magnetic resonance imaging,Linear regression analysis,Neural networks,Permutation,Semantics,Sensory perception,Vision,Visual cortex},
file = {C\:\\Users\\felix\\Zotero\\storage\\NX3RJWGC\\Dwivedi et al. - 2021 - Unveiling functions of the visual cortex using tas.pdf}
}
Abstract
The human visual cortex enables visual perception through a cascade of hierarchical computations in cortical regions with distinct functionalities. Here, we introduce an AI-driven approach to discover the functional mapping of the visual cortex. We related human brain responses to scene images measured with functional MRI (fMRI) systematically to a diverse set of deep neural networks (DNNs) optimized to perform different scene perception tasks. We found a structured mapping between DNN tasks and brain regions along the ventral and dorsal visual streams. Low-level visual tasks mapped onto early brain regions, 3-dimensional scene perception tasks mapped onto the dorsal stream, and semantic tasks mapped onto the ventral stream. This mapping was of high fidelity, with more than 60% of the explainable variance in nine key regions being explained. Together, our results provide a novel functional mapping of the human visual cortex and demonstrate the power of the computational approach.
Nicholls, V. I., Krugliak, A., Alsbury-Nealy, B., Gramann, K., & Clarke, A. (2024). Congruency Effects on Object Recognition Persist When Objects Are Placed in the Wild: An AR and Mobile EEG Study (p. 2024.05.30.596613). https://doi.org/10.1101/2024.05.30.596613
@online{nichollsCongruencyEffectsObject2024,
title = {Congruency Effects on Object Recognition Persist When Objects Are Placed in the Wild: {{An AR}} and Mobile {{EEG}} Study},
shorttitle = {Congruency Effects on Object Recognition Persist When Objects Are Placed in the Wild},
author = {Nicholls, Victoria I. and Krugliak, Alexandra and Alsbury-Nealy, Benjamin and Gramann, Klaus and Clarke, Alex},
date = {2024-05-31},
eprinttype = {bioRxiv},
eprintclass = {New Results},
pages = {2024.05.30.596613},
doi = {10.1101/2024.05.30.596613},
url = {https://www.biorxiv.org/content/10.1101/2024.05.30.596613v1},
urldate = {2024-07-01},
langid = {english},
pubstate = {prepublished},
file = {C\:\\Users\\felix\\Zotero\\storage\\B3TSX4P9\\Nicholls et al. - 2024 - Congruency effects on object recognition persist w.pdf}
}
Abstract
Objects in expected locations are recognised faster and more accurately than objects in incongruent environments. This congruency effect has a neural component, with increased activity for objects in incongruent environments. Studies have increasingly shown differences between neural processes in realistic environments and tasks, and neural processes in the laboratory. To what extent do findings obtained from a laboratory setting translate to neural processes elicited in real-world environments? We investigated how object recognition is modulated when objects are placed in real environments using augmented reality while recording mobile EEG. Participants approached, viewed, and rated how congruent they found the objects with the environment. We found significantly higher theta-band power for objects in incongruent contexts than objects in congruent contexts. This demonstrates that real-world contexts impact on how we recognize objects, and that mobile brain imaging and augmented reality are effective tools to study cognition in the wild. Teaser: Combining augmented reality with mobile brain imaging to show that real-world contexts modulate object recognition processes.
Sassenhagen, J., & Fiebach, C. J. (2020). Traces of Meaning Itself: Encoding Distributional Word Vectors in Brain Activity. Neurobiology of Language, 1(1), 54–76. https://doi.org/10.1162/nol_a_00003
@article{sassenhagenTracesMeaningItself2020,
title = {Traces of {{Meaning Itself}}: {{Encoding Distributional Word Vectors}} in {{Brain Activity}}},
shorttitle = {Traces of {{Meaning Itself}}},
author = {Sassenhagen, Jona and Fiebach, Christian J.},
date = {2020-03-01},
journaltitle = {Neurobiology of Language},
shortjournal = {Neurobiology of Language},
volume = {1},
number = {1},
pages = {54--76},
issn = {2641-4368},
doi = {10.1162/nol_a_00003},
url = {https://doi.org/10.1162/nol_a_00003},
urldate = {2024-07-01},
file = {C\:\\Users\\felix\\Zotero\\storage\\PVMI7PXW\\Sassenhagen und Fiebach - 2020 - Traces of Meaning Itself Encoding Distributional .pdf;C\:\\Users\\felix\\Zotero\\storage\\CSAP8USW\\Traces-of-Meaning-Itself-Encoding-Distributional.html}
}
Abstract
How is semantic information stored in the human mind and brain? Some philosophers and cognitive scientists argue for vectorial representations of concepts, where the meaning of a word is represented as its position in a high-dimensional neural state space. At the intersection of natural language processing and artificial intelligence, a class of very successful distributional word vector models has developed that can account for classic EEG findings of language, that is, the ease versus difficulty of integrating a word with its sentence context. However, models of semantics have to account not only for context-based word processing, but should also describe how word meaning is represented. Here, we investigate whether distributional vector representations of word meaning can model brain activity induced by words presented without context. Using EEG activity (event-related brain potentials) collected while participants in two experiments (English and German) read isolated words, we encoded and decoded word vectors taken from the family of prediction-based Word2vec algorithms. We found that, first, the position of a word in vector space allows the prediction of the pattern of corresponding neural activity over time, in particular during a time window of 300 to 500 ms after word onset. Second, distributional models perform better than a human-created taxonomic baseline model (WordNet), and this holds for several distinct vector-based models. Third, multiple latent semantic dimensions of word meaning can be decoded from brain activity. Combined, these results suggest that empiricist, prediction-based vectorial representations of meaning are a viable candidate for the representational architecture of human semantic knowledge.
Schwartz, D., Toneva, M., & Wehbe, L. (2019). Inducing Brain-Relevant Bias in Natural Language Processing Models. https://doi.org/10.48550/arXiv.1911.03268
@online{schwartzInducingBrainrelevantBias2019,
title = {Inducing Brain-Relevant Bias in Natural Language Processing Models},
author = {Schwartz, Dan and Toneva, Mariya and Wehbe, Leila},
date = {2019-10-29},
eprint = {1911.03268},
eprinttype = {arXiv},
eprintclass = {cs, q-bio},
doi = {10.48550/arXiv.1911.03268},
url = {http://arxiv.org/abs/1911.03268},
urldate = {2024-07-01},
pubstate = {prepublished},
keywords = {Computer Science - Computation and Language,Computer Science - Machine Learning,Quantitative Biology - Neurons and Cognition},
file = {C\:\\Users\\felix\\Zotero\\storage\\WL2AXIT9\\Schwartz et al. - 2019 - Inducing brain-relevant bias in natural language p.pdf;C\:\\Users\\felix\\Zotero\\storage\\S57P6AF5\\1911.html}
}
Abstract
Progress in natural language processing (NLP) models that estimate representations of word sequences has recently been leveraged to improve the understanding of language processing in the brain. However, these models have not been specifically designed to capture the way the brain represents language meaning. We hypothesize that fine-tuning these models to predict recordings of brain activity of people reading text will lead to representations that encode more brain-activity-relevant language information. We demonstrate that a version of BERT, a recently introduced and powerful language model, can improve the prediction of brain activity after fine-tuning. We show that the relationship between language and brain activity learned by BERT during this fine-tuning transfers across multiple participants. We also show that, for some participants, the fine-tuned representations learned from both magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) are better for predicting fMRI than the representations learned from fMRI alone, indicating that the learned representations capture brain-activity-relevant information that is not simply an artifact of the modality. While changes to language representations help the model predict brain activity, they also do not harm the model’s ability to perform downstream NLP tasks. Our findings are notable for research on language understanding in the brain.
Toneva, M., Mitchell, T. M., & Wehbe, L. (2022). Combining Computational Controls with Natural Text Reveals Aspects of Meaning Composition. Nature Computational Science, 2(11), 745–757. https://doi.org/10.1038/s43588-022-00354-6
@article{tonevaCombiningComputationalControls2022,
title = {Combining Computational Controls with Natural Text Reveals Aspects of Meaning Composition},
author = {Toneva, Mariya and Mitchell, Tom M. and Wehbe, Leila},
date = {2022-11},
journaltitle = {Nature Computational Science},
shortjournal = {Nat Comput Sci},
volume = {2},
number = {11},
pages = {745--757},
publisher = {Nature Publishing Group},
issn = {2662-8457},
doi = {10.1038/s43588-022-00354-6},
url = {https://www.nature.com/articles/s43588-022-00354-6},
urldate = {2024-07-01},
langid = {english},
keywords = {Computer science,Language,Neural encoding},
file = {C\:\\Users\\felix\\Zotero\\storage\\L8I4366D\\Toneva et al. - 2022 - Combining computational controls with natural text.pdf}
}
Abstract
To study a core component of human intelligence—our ability to combine the meaning of words—neuroscientists have looked to linguistics. However, linguistic theories are insufficient to account for all brain responses reflecting linguistic composition. In contrast, we adopt a data-driven approach to study the composed meaning of words beyond their individual meaning, which we term ‘supra-word meaning’. We construct a computational representation for supra-word meaning and study its brain basis through brain recordings from two complementary imaging modalities. Using functional magnetic resonance imaging, we reveal that hubs that are thought to process lexical meaning also maintain supra-word meaning, suggesting a common substrate for lexical and combinatorial semantics. Surprisingly, we cannot detect supra-word meaning in magnetoencephalography, which suggests that composed meaning might be maintained through a different neural mechanism than the synchronized firing of pyramidal cells. This sensitivity difference has implications for past neuroimaging results and future wearable neurotechnology.
Toneva, M., & Wehbe, L. (2019). Interpreting and Improving Natural-Language Processing (in Machines) with Natural Language-Processing (in the Brain). arXiv.org. https://arxiv.org/abs/1905.11833v4
@online{tonevaInterpretingImprovingNaturallanguage2019,
title = {Interpreting and Improving Natural-Language Processing (in Machines) with Natural Language-Processing (in the Brain)},
author = {Toneva, Mariya and Wehbe, Leila},
date = {2019-05-28},
url = {https://arxiv.org/abs/1905.11833v4},
urldate = {2024-07-01},
langid = {english},
organization = {arXiv.org},
file = {C\:\\Users\\felix\\Zotero\\storage\\V2XIL34E\\Toneva und Wehbe - 2019 - Interpreting and improving natural-language proces.pdf}
}
Abstract
Neural network models for NLP are typically implemented without the explicit encoding of language rules and yet they are able to break one performance record after another. This has generated a lot of research interest in interpreting the representations learned by these networks. We propose here a novel interpretation approach that relies on the only processing system we have that does understand language: the human brain. We use brain imaging recordings of subjects reading complex natural text to interpret word and sequence embeddings from 4 recent NLP models - ELMo, USE, BERT and Transformer-XL. We study how their representations differ across layer depth, context length, and attention type. Our results reveal differences in the context-related representations across these models. Further, in the transformer models, we find an interaction between layer depth and context length, and between layer depth and attention type. We finally hypothesize that altering BERT to better align with brain recordings would enable it to also better understand language. Probing the altered BERT using syntactic NLP tasks reveals that the model with increased brain-alignment outperforms the original model. Cognitive neuroscientists have already begun using NLP networks to study the brain, and this work closes the loop to allow the interaction between NLP and cognitive neuroscience to be a true cross-pollination.