The headline development is that large language models (LLMs) have entered the game: the same GPT-style systems, only now working under the hood for researchers. LLMs accelerate analytical and computational tasks: we automate corpus annotation, compute complex syntactic parameters, and generate experimental stimuli. But it is precisely in psycholinguistics that LLMs are becoming an exceptionally convenient "mirror" for science: since a model has no innate sense of language, comparing its behavior with human behavior lets us see where the laws of information transfer are at work and where the particularities of the brain come into play.
GR: Impressive. Is any of this confirmed experimentally?
ARTEM: Yes, absolutely. For instance, researchers take short texts and present them to people word by word, while simultaneously "feeding" the same material to a large language model. For the model, surprisal is immediately calculated: a number that indicates how unexpected a given word is for it. Common words yield low surprisal; rare or anomalous ones yield high surprisal. In humans, unexpectedness is captured by EEG sensors, as well as by longer pauses, slower reading times, and gaze fixations on the word. If a sentence is "broken," two characteristic peaks light up on the graph of the brain's electrical activity: N400 and P600. N400 appears approximately 0.4 seconds after a word that is grammatically correct but semantically absurd ("I spread socks on the bread"); P600 emerges another two-tenths of a second later, when a word violates grammar ("He spread-plural jam on the bread").
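As a toy illustration of what surprisal is, here is a minimal sketch: the negative log probability of a word given its context. The bigram probabilities below are invented for the example; real studies query a large pretrained language model instead.

```python
import math

# Hypothetical bigram probabilities P(word | previous word), made up for illustration.
BIGRAM_PROBS = {
    ("i", "spread"): 0.20,
    ("spread", "jam"): 0.30,     # common continuation -> low surprisal
    ("spread", "socks"): 0.001,  # anomalous continuation -> high surprisal
}

def surprisal(prev_word: str, word: str) -> float:
    """Surprisal in bits: -log2 P(word | prev_word)."""
    p = BIGRAM_PROBS.get((prev_word, word), 1e-6)  # small floor for unseen pairs
    return -math.log2(p)

print(round(surprisal("spread", "jam"), 2))    # -> 1.74 bits
print(round(surprisal("spread", "socks"), 2))  # -> 9.97 bits
```

The gap between the two numbers is exactly what gets correlated with EEG amplitudes and reading times in the experiments described above.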
The results are intriguing. Model surprisal aligns well with N400: when the algorithm and the brain encounter a semantically unexpected word, both "stumble" at the same point. An analogue of the P600 response is also present in the model, just not as a discrete signal but as the cost of reassembling its predictions (the difference between surprisal on the final and penultimate words, or a KL divergence between the model's prediction distributions). This allows the computational and neurophysiological levels of describing language processing to be linked more precisely.
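The KL-divergence variant of that reanalysis cost can be sketched numerically. The two next-word distributions below are invented for illustration: "before" is what the model expected, "after" is what it believes once a disambiguating word forces a reanalysis.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits, over a shared vocabulary (dicts: word -> probability)."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

# Hypothetical next-word distributions before and after a disambiguating word.
before = {"jam": 0.7, "butter": 0.25, "socks": 0.05}
after  = {"jam": 0.1, "butter": 0.1,  "socks": 0.8}

cost = kl_divergence(after, before)
print(round(cost, 2))  # -> 2.79 bits of "prediction reassembly"
```

A large value means the model had to overhaul its expectations substantially, which is the proposed computational counterpart of the P600 reanalysis signal.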
The second line of research uses large language models as a "proving ground" for probing the limits of human cognition. Here, scientists deliberately place models and humans under identical conditions and look for divergences. For example, a model can be artificially constrained to "hold" no more than two candidate analyses simultaneously; once this limitation is introduced, its predictions of reading times and error rates begin to resemble human data, confirming the hypothesis that humans entertain only a small number of predictions in parallel. In a similar fashion, researchers test the so-called frequency indistinguishability threshold: humans barely notice the difference between very rare words, whereas a model distinguishes them effortlessly. When a simplified "threshold" is imposed on the model, its numerical results once again approximate reader behavior.
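The "hold no more than two analyses" constraint can be sketched as a beam search with beam width 2. Everything here is hypothetical: a made-up bigram grammar, toy probabilities, and a deliberately tiny beam standing in for limited human parallel processing.

```python
import math

# Hypothetical bigram continuation probabilities for a tiny toy grammar.
NEXT = {
    "the":  {"old": 0.4, "man": 0.3, "boat": 0.3},
    "old":  {"man": 0.6, "boat": 0.4},
    "man":  {"boat": 0.5, "sailed": 0.5},
    "boat": {"sailed": 1.0},
}

def beam_step(hypotheses, beam_width):
    """Extend each (words, log-prob) hypothesis by one word; keep only the
    top `beam_width`. A small beam mimics a reader who can entertain only a
    couple of parses at once."""
    expanded = []
    for words, lp in hypotheses:
        for nxt, p in NEXT.get(words[-1], {}).items():
            expanded.append((words + [nxt], lp + math.log(p)))
    expanded.sort(key=lambda h: h[1], reverse=True)
    return expanded[:beam_width]

hyps = [(["the"], 0.0)]
for _ in range(2):
    hyps = beam_step(hyps, beam_width=2)  # humans: only ~2 analyses in parallel

print([words for words, _ in hyps])  # -> [['the', 'old', 'man'], ['the', 'old', 'boat']]
```

Dropping the beam to 2 discards low-probability continuations early, which is exactly what produces human-like "garden path" errors when the discarded analysis later turns out to be the right one.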
The third line of research is the search for universal “default language rules,” or biases. Linguists have long observed that the same prohibitions recur across all the world’s languages: for example, a component of a complex question cannot be moved too far from the verb (the so-called “syntactic islands”).
We also study word order: for instance, languages in which the basic word order is object–verb–subject (OVS) are virtually nonexistent; only eleven such languages are known worldwide, and those with object–subject–verb (OSV) order number just four.
It was previously thought that these were innate properties of the human brain. Now, scientists run large language models and discover that without any built-in settings, models rarely employ these constructions. This lends support to the idea that these constraints may arise from the very logic of information transfer — short, easily predictable phrases are transmitted more reliably and therefore become entrenched in language.
To test this hypothesis, models of varying size and architecture are compared. Smaller bidirectional networks (which read a sentence simultaneously from left to right and right to left) sometimes predict reader behavior better than enormous unidirectional GPT models — demonstrating that scale alone is not decisive; architecture and how the model perceives context matter as well. Researchers also account for technical details, such as the way a model segments words into “token chunks”: this affects its ability to capture rare forms and can introduce its own artificial “biases.”
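The way segmentation into "token chunks" works can be sketched with a toy greedy longest-match subword tokenizer. The vocabulary below is invented; real subword schemes (BPE, WordPiece) learn theirs from data, but the effect is the same: frequent strings survive as whole pieces, while rare forms get fragmented.

```python
# Hypothetical subword vocabulary; a real model learns this from its training corpus.
VOCAB = {"un", "break", "able", "breakable",
         "b", "r", "e", "a", "k", "l", "u", "n"}

def tokenize(word):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # unknown-character fallback
            i += 1
    return pieces

print(tokenize("unbreakable"))  # -> ['un', 'breakable']
```

A rare word that is not in the vocabulary as a whole ends up split into several pieces, so the model "sees" it differently from a frequent word, and that alone can create artificial biases in its predictions.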
Ultimately, by comparing the behavior of different LLMs with real-world languages, scientists are able to separate constraints dictated by human perception from those imposed by pure informational economy — and in doing so, refine the theory of why language is structured the way it is.
GR: Can we say that LLMs primarily help us understand how our brain works?
ARTEM: Yes, large language models have genuinely become a new window into the workings of the human brain. When we compare their statistical predictions with EEG bursts, eye movements, or reading speed, we obtain a (relatively) simple, numerically measurable "model" of the very same processes that previously had to be studied through lengthy behavioral tests.
At the same time, models possess nothing "innate": they learn from text alone. This is precisely why their convergences with brain data reveal which constraints are dictated by the pure informatics of language, while their divergences reveal where the particularities of our memory, attention, or embodied experience come into play. As a result, a language model becomes a universal proving ground where one can not only safely "probe" the boundaries of human cognition, but also test hypotheses about languages that are possible and impossible for humans, and, step by step, make them more comprehensible both to ourselves and to the machines we create.
GR: And finally, please tell us about your latest work.
ARTEM: Recently, I’ve published two new papers, with two more awaiting publication. We are studying how native speakers of Russian and Serbo-Croatian assess the acceptability (naturalness or correctness) of sentences.
To do this, we use specialized metrics that help describe and predict such judgments. For instance, Mean Dependency Distance indicates how far apart syntactically linked words sit within a sentence: the greater the distance, the harder the phrase is for the brain to process, and the less natural it sounds. Another metric, projectivity, checks whether dependency links between words cross one another. If they do, the structure is considered more complex, and such sentences are typically rated lower. This knowledge helps us build computationally precise, cognitively grounded models of acceptability for free-word-order languages, and also lets us disentangle hard grammatical constraints from processing constraints (cognitive load).
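Both metrics are straightforward to compute from a dependency parse. A minimal sketch, using a hypothetical four-word English example ("She read the book") with 0-based indices and `None` marking the root:

```python
def mean_dependency_distance(heads):
    """heads[i] = index of word i's syntactic head (0-based), or None for the root.
    Returns the average linear distance between each word and its head."""
    dists = [abs(i - h) for i, h in enumerate(heads) if h is not None]
    return sum(dists) / len(dists)

def is_projective(heads):
    """A parse is projective if no two dependency arcs cross."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h is not None]
    for a1, b1 in arcs:
        for a2, b2 in arcs:
            # Arcs cross when exactly one endpoint lies strictly inside the other arc.
            if a1 < a2 < b1 < b2:
                return False
    return True

# "She read the book": She->read, read = root, the->book, book->read
heads = [1, None, 3, 1]
print(mean_dependency_distance(heads))  # (1 + 1 + 2) / 3 ≈ 1.33
print(is_projective(heads))             # -> True
```

A free-word-order language can produce parses with long arcs or crossing arcs; feeding these two numbers into a regression against acceptability ratings is the kind of analysis described above.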