Scientific fact checking using AI language models: COVID-19 research and beyond

If you think fact checking is hard, which it is, then what would you say about verifying scientific claims, on COVID-19 no less? Hint: it's also hard, different in some ways and similar in others.

Fact or Fiction: Verifying Scientific Claims is the title of a research paper published on the arXiv preprint server by a team of researchers from the Allen Institute for Artificial Intelligence (AI2), with data and code available on GitHub. ZDNet connected with David Wadden, lead author of the paper and a visiting researcher at AI2, to discuss the rationale, details, and directions for this work.

What is scientific fact checking?

Although the authors of the paper refer to their work as scientific fact checking, we believe it is important to clarify semantics before going any further. Verification of scientific claims refers to the process of proving or disproving (with some degree of certainty) claims made in scientific research papers. It does not refer to a scientific method of doing “regular” fact checking.

Fact checking, as defined by the authors, is a task in which the veracity of an input claim is verified against a corpus of documents that support or refute the claim. A claim is defined as an atomic factual statement expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source. This area of research has received increased attention, motivated by the spread of misinformation in political news, social media, and on the web.

In turn, interest in fact checking has spurred the creation of many datasets across different domains to support the research and development of automated fact-checking systems. Yet, so far, no such dataset appears to exist to facilitate research on another important domain for fact checking: scientific literature.


Regular fact checking is hard, and most people don’t do it. If you think scientific fact checking would be any easier, think again

The ability to verify claims about scientific concepts, especially those related to biomedicine, is an important application area for fact checking. Furthermore, this line of research also offers a unique opportunity to explore the capabilities of modern neural models, as successfully verifying most scientific claims requires expert background knowledge, complex language understanding, and reasoning capability.

The AI2 researchers introduce the task of scientific fact checking. To facilitate research on this task, they constructed SCIFACT, a dataset of 1,409 scientific claims fact-checked against a corpus of 5,183 abstracts that support or refute each claim, and annotated with rationales justifying each support / refute decision.

To curate this dataset, a new annotation protocol was used, taking advantage of a plentiful source of naturally occurring claims in the scientific literature: citation sentences, or “citances”.

Why and how do you do scientific fact checking?

Wadden, a graduate student at the University of Washington with a background in physics, computational biology, and natural language processing (NLP), shared an interesting story about what motivated him to start this work. Besides the well-known problem of navigating vast bodies of scientific knowledge, personal experience also played its part.

Wadden was briefly considering a career as an opera singer when he suffered a vocal injury. He visited a number of doctors for consultations and received a number of recommendations for potential treatments. Although they were all good doctors, Wadden observed that none of them could provide data such as the percentage of patients for whom a given procedure works.

Wadden’s situation was not dire, but he could not help wondering what would happen if it were. He felt that the information he was given was not complete enough to make an informed decision, and he thought this had to do with the fact that such information is not easy for doctors to find.


Scientific fact checking is about checking claims made in scientific papers in a scientific way, and automating the process as much as possible. Image: Allen Institute for AI

The work uses a dataset specifically aimed at fact checking COVID-19 related research. Wadden explained that the team planned this work in October 2019, before COVID-19 was a thing. However, they soon realized what was going on and decided to make COVID-19 their focus.

In addition to the SCIFACT dataset, the research also includes the SCIFACT task and the VERISCI baseline model. In a nutshell, they can be summarized as creating a dataset by manually annotating scientific papers and generating claims, evaluating those claims, and creating a baseline AI language model for claim evaluation.

The annotation process, described in detail in the paper, is both a necessity and a limiting factor. It is a necessity because expert knowledge is required to process citations, ask the right questions, and find the right answers. It is a limiting factor because relying on manual work makes the process hard to scale, and it introduces bias.

Can there be bias in science?

Today NLP is largely driven by machine learning. The SCIFACT team developed VERISCI based on BERT, Google’s deep learning language model. Machine learning algorithms need training data, and training data needs to be processed and annotated by people. This is a labor-intensive task. Relying on people to process large datasets means the process is slow and expensive, and the results can be biased.

Large annotated datasets exist for NLP, and specifically for fact checking, but scientific fact checking is special. When dealing with common sense reasoning, Mechanical Turk workers are typically asked to annotate datasets. In scientific fact checking, however, expert knowledge is needed to understand, evaluate, and process the claims contained in research papers.

The SCIFACT team hired biology undergraduate and graduate students for this job. Wadden is fully aware of the limitations this entails for scaling up the approach, and is considering crowdsourcing, hiring medical professionals through a recruiting platform, or assigning many Mechanical Turk workers to annotate the same item and then averaging their answers, knowing that each one will be imperfect.
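To make the last option concrete, here is a minimal sketch of combining several imperfect annotations of the same claim. Because the labels are categories rather than numbers, “averaging” is read here as a majority vote; that reading, the label names, and the `aggregate_labels` helper are assumptions for illustration, not something described in the paper.

```python
from collections import Counter


def aggregate_labels(votes):
    """Return the most common label and the share of annotators who chose it.

    Ties are resolved in favor of the label that appears first in `votes`.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


# Hypothetical votes from five crowd workers on one claim-abstract pair.
votes = ["SUPPORTS", "SUPPORTS", "REFUTES", "SUPPORTS", "NOT_ENOUGH_INFO"]
label, agreement = aggregate_labels(votes)
print(label, agreement)
```

The agreement share is worth keeping alongside the label: a low value flags items on which the workers disagreed and which may need an expert's attention.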


Science is not infallible. It can also introduce bias via imperfect data and methods. And even scientists with the best intentions do not always agree on everything – this is part of the process

Bias can be introduced in all moving parts of the process: which papers are selected, which claims are checked for each paper, which citations are checked for each claim, and how each citation is ranked. In other words: if research X supports claim A while research Y contradicts it, what should we believe? Not to mention that if research Y is not in the dataset, we will never know about its findings.

In COVID-19 times, when many people have turned into armchair epidemiologists, this is something to keep in mind: science and computer science are not always straightforward processes that produce final, undisputed results. Wadden, for one, is very aware of the limitations of this research. Although the team has tried to mitigate those limitations, Wadden acknowledges that this is only a first step on a long and winding road.

One way the SCIFACT team tried to address bias in selecting claims was to extract them from citations: they only dealt with claims for which a paper was cited. In addition, they applied a series of techniques to get the highest quality results possible.

The paper selection process is driven by an initial collection of seed papers: citations referring to these papers are examined. Only papers cited at least 10 times can be part of the seed set, in an attempt to select the most important papers. A technique called citation intent classification is then used, which attempts to identify the reason why a paper is cited. Only citations referring to findings were processed.
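The two filters just described can be sketched in a few lines. The threshold of 10 citations comes from the text above; everything else (the `Citance` record, the intent names, the sample data) is hypothetical, and the intent classifier is assumed to have already run.

```python
from dataclasses import dataclass

MIN_CITATIONS = 10  # threshold stated in the article


@dataclass
class Citance:
    text: str
    cited_paper: str
    intent: str  # hypothetical classifier output: "background", "method" or "result"


def select_seed_papers(citation_counts, min_citations=MIN_CITATIONS):
    """Keep only papers that are cited at least `min_citations` times."""
    return {pid for pid, n in citation_counts.items() if n >= min_citations}


def select_citances(citances, seeds):
    """Keep citation sentences that cite a seed paper for its findings."""
    return [c for c in citances if c.cited_paper in seeds and c.intent == "result"]


# Made-up data for illustration only.
citation_counts = {"p1": 42, "p2": 3}
citances = [
    Citance("X lowers Y in mice [p1].", "p1", "result"),
    Citance("We follow the protocol of [p1].", "p1", "method"),
    Citance("Z was shown to raise Y [p2].", "p2", "result"),
]

seeds = select_seed_papers(citation_counts)
for c in select_citances(citances, seeds):
    print(c.text)
```

Only the first citance survives: the second cites a seed paper for its method rather than a finding, and the third cites a paper below the citation threshold.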

Promising results

Another important thing to note is that claims are evaluated against the abstract of the paper they cite. This is done for simplicity, as the underlying assumption seems to be that if a finding is key to a paper, it will be mentioned in the paper’s abstract. It would be difficult for a language model to evaluate a claim based on the full text of a scientific paper.

Claims found in papers may have multiple citations. For example, the claim “The R0 of the novel coronavirus is 2.5” may cite several papers with supporting evidence. In these cases, each citation is processed independently, and for each one a result is produced: it supports the claim, it refutes the claim, or no decision can be made.

Wadden’s team used the SCIFACT dataset and annotation process to develop and train the VERISCI model. VERISCI is a pipeline of three components: abstract retrieval, which retrieves the abstracts most similar to the claim; rationale selection, which identifies rationale sentences in each candidate abstract; and label prediction, which makes the final label prediction.

Given a claim and a corpus of papers, VERISCI must predict a set of evidence abstracts. For each of those abstracts, it must predict a label and a collection of rationale sentences. Although the annotations provided by the annotators may contain several separate rationales, the model simply needs to predict a single collection of rationale sentences; these sentences may come from several annotated rationales.
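The shape of such a pipeline can be sketched as follows. This is not the VERISCI code: simple word-overlap heuristics stand in for the trained models, and the function names, label names, thresholds, and toy corpus are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Prediction:
    abstract_id: str
    label: str       # illustrative names: "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"
    rationale: list  # indices of the selected sentences in the abstract


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve_abstracts(claim, corpus, k=3):
    """Stage 1: ids of the k abstracts most similar to the claim."""
    ranked = sorted(corpus, key=lambda pid: overlap(claim, " ".join(corpus[pid])),
                    reverse=True)
    return ranked[:k]


def select_rationale(claim, sentences, threshold=0.2):
    """Stage 2: indices of sentences that look relevant to the claim."""
    return [i for i, s in enumerate(sentences) if overlap(claim, s) >= threshold]


def predict_label(claim, rationale_sentences):
    """Stage 3: placeholder rule standing in for a trained classifier."""
    if not rationale_sentences:
        return "NOT_ENOUGH_INFO"
    negated = any(tokens(s) & {"no", "not", "never"} for s in rationale_sentences)
    return "REFUTES" if negated else "SUPPORTS"


def verify(claim, corpus, k=3):
    """Run the three stages and return one prediction per evidence abstract."""
    predictions = []
    for pid in retrieve_abstracts(claim, corpus, k):
        idx = select_rationale(claim, corpus[pid])
        label = predict_label(claim, [corpus[pid][i] for i in idx])
        if label != "NOT_ENOUGH_INFO":  # abstracts without evidence are dropped
            predictions.append(Prediction(pid, label, idx))
    return predictions


# Toy corpus: abstract id -> list of sentences (made up for this sketch).
corpus = {
    "a1": ["Drug X reduced viral load in mice.", "Funding came from a public grant."],
    "a2": ["Drug X did not reduce viral load in patients."],
    "a3": ["We survey opera singers with vocal injuries."],
}
for p in verify("Drug X reduces viral load.", corpus):
    print(p.abstract_id, p.label, p.rationale)
```

Each prediction carries a single collection of rationale sentence indices per abstract, matching the output format described above. The negation rule in the third stage is deliberately crude; it only marks where a trained label predictor would plug in.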


Where there are gray areas, they have to be identified, and measures have to be taken, such as rewriting the original claims for clarity. Image: Allen Institute for AI

The team experimented to establish a performance baseline on SCIFACT using VERISCI, analyzed the performance of the three components of VERISCI and demonstrated the importance of in-domain training data. Qualitative results for verification of claims about COVID-19 using VERISCI were promising.

For about half of the claim-abstract pairs, VERISCI correctly identifies whether an abstract supports or refutes a claim, and provides reasonable evidence to justify the decision. Given the difficulty of the task and the limited in-domain training data, the team considers this a promising result that still leaves plenty of room for improvement.

Some exploratory experiments on fact checking claims about COVID-19 were also conducted. A medical student was tasked with writing 36 COVID-19 related claims. VERISCI was used to predict evidence abstracts. The same medical student then assigned a label to each claim-abstract pair.

For the majority of these COVID-related claims (23 out of 36), the rationales produced by VERISCI were considered plausible by the annotator. The sample is really small, but the team believes VERISCI is able to successfully retrieve and classify evidence in many cases.

Complicated process, educational work

There are a number of future directions for this work. In addition to expanding the dataset and generating more annotations, adding support for partial evidence, modeling contextual information, and evidence synthesis are important areas for future research.

Expanding the system to include partial support is an interesting topic. Not all decisions can be clear-cut. A typical example is a claim about the effectiveness of drug X. If a paper reports the efficacy of the drug in mice, or in limited clinical trials in humans, this may provide only partial support for the claim.

Initial experiments showed a high degree of disagreement among expert annotators as to whether certain claims were fully, partially, or not at all supported by certain research findings. Sound familiar? In these gray-zone scenarios, the goal is to be able to better identify the situation. What the team wants to do is edit the claim to reflect the ambiguity.

Modeling contextual information has to do with identifying implicit references. Initially, the annotators were instructed to identify primary and supplementary rationale sentences for each rationale. Primary sentences are those needed to verify the claim, while supplementary sentences provide important context that is missing from the primary sentences but is still needed to determine whether a claim is supported or refuted.

For example, if a claim mentions “experimental animals” and a rationale sentence mentions “test group”, it is not always clear whether they refer to the same thing. Again, a high degree of disagreement was observed among human experts in such scenarios. Thus, supplementary rationale sentences were removed from the dataset, and the team continues to work with annotators to improve agreement.

Last but not least: evidence synthesis basically means that not all evidence is created equal, and that should probably be reflected in the decision-making process in some way. To use an extreme example: at the moment, a preprint that has not undergone peer review and a paper with 1,000 citations are treated equally. They probably shouldn’t be.

One obvious thing to do here would be to use some kind of PageRank for research papers, i.e., an algorithm that does for research what Google does for the web: select the relevant stuff. Such algorithms already exist, for example for calculating impact factors. But again, this is another gray area.
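As a rough illustration of what a PageRank for research papers could look like, here is a minimal power-iteration sketch over a citation graph. The paper ids and the tiny graph are made up, the damping factor of 0.85 is the conventional default rather than anything proposed by the SCIFACT team, and this is not how impact factors are computed.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Score papers by incoming citations, weighted by the score of the citing paper.

    `citations` maps each paper id to the list of paper ids it cites.
    """
    papers = set(citations) | {p for cited in citations.values() for p in cited}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                # A paper passes its score on to the papers it cites.
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # A paper that cites nothing spreads its score evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: who cites whom.
citations = {
    "preprint": ["landmark"],
    "review": ["landmark", "preprint"],
    "landmark": [],
}
scores = pagerank(citations)
for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, round(scores[paper], 3))
```

In this toy graph the paper cited by both others comes out on top. A score like this could serve as one input when weighing evidence, although, as noted above, what makes a source trustworthy is itself a gray area.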

This work is not the only example of what we will call meta-research triggered by COVID-19: research on how to facilitate research, in an attempt to speed up the process of understanding and combating COVID-19. For example, we have seen how other researchers use knowledge graphs for the same purpose.

Wadden argues that these approaches could complement each other. For example, where knowledge graphs have an edge between two nodes claiming a type of relationship, SCIFACT could provide the text on the basis of which the claim was made.

Currently, the work is being submitted for peer review. It is instructive because it highlights the strengths and weaknesses of the scientific process. And despite its shortcomings, it reminds us of the basic premises of science: peer review and intellectual honesty.
