So far, so BERT, but here comes the twist: Instead of submitting yet another "we have SOTA, accept please" type paper, the authors were suspicious of this seemingly great success.
Argument comprehension is a rather difficult task that requires world knowledge and commonsense reasoning (see figure below), and while no one doubts that BERT is one of the best language models created yet and that transfer learning is "NLP's ImageNet moment", there is little evidence that language models are capable of such feats of high-level natural language understanding. The authors perform three analyses.
First, they count unigrams and bigrams in the possible answers (i.e., the two candidate warrants) and find that cue words such as "not" are highly predictive of the correct answer. Then, to check whether the model indeed exploits such cues, the authors provide the model with only partial input, which makes reasoning about the correct answer impossible: for example, it should not be possible to decide whether "other search engines don't redirect to Google" or "all other search engines redirect to Google" is the correct warrant if no claim or reason is given.
However, the model doesn't care about this impossibility and identifies the correct warrant with 71 percent accuracy. After running similar experiments for the other two task-breaking settings (claim and warrant only; reason and warrant only), the authors conclude that the dataset contains statistical cues and that BERT's performance on this task can be entirely explained by its ability to exploit them. To drive the point home, in their third experiment the authors construct a version of the dataset in which the cues are no longer informative, and find that performance drops to chance level.
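The cue-counting idea behind the first analysis can be sketched in a few lines. The snippet below is a toy illustration, not the paper's actual code: the mini-dataset is invented, tokenization is simplified to whitespace splitting, and the metric is a rough analogue of the paper's "productivity" of a cue (among items where the cue appears in exactly one of the two candidate warrants, how often it appears in the correct one).

```python
# Toy dataset: each item has two candidate warrants and the index of the
# correct one. The example strings are invented for illustration.
data = [
    {"warrants": ["they do not redirect to Google", "they redirect to Google"], "label": 0},
    {"warrants": ["it is not a monopoly", "it is a monopoly"], "label": 0},
    {"warrants": ["ads are helpful", "ads are not helpful"], "label": 1},
]

def productivity(data, cue):
    """Among items where `cue` occurs in exactly one candidate warrant,
    return the fraction where it occurs in the *correct* one."""
    hits = applicable = 0
    for item in data:
        correct = item["warrants"][item["label"]].split()
        incorrect = item["warrants"][1 - item["label"]].split()
        in_correct, in_incorrect = cue in correct, cue in incorrect
        if in_correct != in_incorrect:   # cue distinguishes the two warrants
            applicable += 1
            hits += in_correct           # ...and points at the correct one
    return hits / applicable if applicable else 0.0

print(productivity(data, "not"))  # → 1.0: in this toy set, "not" always marks the correct warrant
```

A cue with productivity far from 0.5 and high coverage is exactly the kind of shallow signal a statistical learner can latch onto without ever looking at the claim or reason.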
Without getting into a Chinese Room argument about what it means to understand something, most people would probably agree that a model making predictions based on the presence or absence of a handful of words like "not", "is", or "do" does not understand anything about argumentation. The authors declare that their SOTA result is meaningless. Of course, the problem of learners solving a task by learning the "wrong" thing has been known for a long time and is known as the Clever Hans effect, after the eponymous horse which appeared to be able to perform simple intellectual tasks, but in reality relied on involuntary cues given by its handler.
Versions of the tank anecdote, circulating for decades, tell of a neural network trained by the military to recognize tanks in images, but actually learning to recognize different levels of brightness, because one type of tank appeared only in bright photos and another type only in darker ones. Less anecdotally, Viktoria Krakovna has collected a depressingly long list of agents following the letter, but not the spirit, of their reward function, with such gems as a video-game agent learning to die at the end of the first level, since repeating that easy level gives a higher score than dying early in the harder second level.
Two more recent, but already infamous, cases are an image classifier claimed to be able to distinguish the faces of criminals from those of law-abiding citizens, but actually recognizing smiles, and a supposed "sexual orientation detector" that is better explained as a detector of glasses, beards, and eyeshadow. If NLP is following in the footsteps of computer vision, it seems doomed to repeat its failures, too. Coming back to the paper, the authors point to an (again, depressingly) large body of recent work reporting Clever Hans effects in NLP datasets. Statistical learners such as standard neural network architectures are prone to adopting shallow heuristics that succeed on the majority of training examples, instead of learning the underlying generalizations they are intended to capture.
To be clear, no one is claiming that large models like BERT, or deep learning in general, are useless (I've found quite the opposite in my own work), but failures like the ones demonstrated in the paper and related work should make us skeptical about reports of near-human performance on high-level natural language understanding tasks. Two minor gripes I have about the paper concern terminology.
The authors call their second analysis "probing experiments", but probing usually refers to training shallow classifiers on top of a deep neural network in order to determine what kind of information its representations contain, which is not done here. Similarly, the authors call the dataset they construct in their third analysis an "adversarial dataset", but adversarial usually refers to instances crafted to mislead a model into making a wrong decision, which, again, is not done here.
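For contrast, here is what probing in the usual sense looks like, as a minimal sketch: freeze the representations, train only a shallow classifier on top, and read off how linearly decodable a property is. The "frozen embeddings" below are synthetic stand-ins (random vectors whose fourth dimension encodes an invented binary property), not outputs of any real encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen representations from a pretrained encoder: 200
# fabricated 16-d vectors. The probed property (say, "sentence contains a
# negation") is planted in dimension 3 for the sake of the demo.
X = rng.normal(size=(200, 16))
y = (X[:, 3] > 0).astype(float)

# The probe: a single logistic-regression layer trained by gradient
# descent on top of the frozen vectors. High probe accuracy means the
# property is linearly decodable from the representations.
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid predictions
    w -= 0.5 * X.T @ (p - y) / len(y)    # gradient step on log-loss
    b -= 0.5 * (p - y).mean()

acc = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

The key point is that the encoder's weights are never updated; only the probe learns. Feeding a model truncated inputs, as the paper does, is a different (and also useful) diagnostic, but it isn't probing in this sense.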
But these wording issues do not diminish the paper's main findings. In 1904, after extensive testing, the Hans Commission, which included two zoologists, a psychologist, a horse trainer, several schoolteachers, and a circus manager, concluded that no trickery was involved.
The Commission then passed the investigation to a young psychologist, Oskar Pfungst. Pfungst designed a careful set of experiments and began testing Hans. He soon noticed that Hans performed well when questioned under his normal conditions, but failed whenever he could not see the questioner or the questioner did not know the answer himself. With that in mind, Pfungst began watching the questioners, and he noticed that as Hans tapped his hoof in response to a question, their breathing, posture, and expression showed subtle signs of increasing tension, tension which disappeared when Hans made the correct tap. Innocently and without realizing they were doing so, Pfungst concluded, the questioners were giving Hans a cue when to stop tapping.
Unconscious cues introduce a form of bias into experiments, leading subjects to give the answers that seem right to the researchers. Blinding and double-blinding are one response: in a blind trial of a new drug, for instance, one group of participants receives the drug while a second group receives a placebo, and neither group knows which it got.

Hans's trainer, von Osten, had spent a lot of time and energy trying to teach various animals, including a cat and a bear, how to do arithmetic. He failed.
Undaunted, he tried teaching the same thing to his horse. That seemed to work.
He claimed Hans could answer questions by tapping his hoof. Von Osten would write an arithmetic problem on the chalkboard and Hans would tap out the answer. It amazed crowds; one observer even credited Hans with human-level arithmetical knowledge. There were still skeptics. Psychologist Carl Stumpf convened a group of experts to investigate.
They were sure that von Osten was performing a sophisticated trick. When they couldn't figure out how, they concluded that Clever Hans really was solving arithmetic problems. The commission then passed their findings to the psychologist Oskar Pfungst.