ClaimDetective

`ClaimDetective` is a Python class that lets the user rank a list of sentences (i.e., potential claims) from most check-worthy to least check-worthy, i.e., the priority with which they should be fact-checked. `ClaimDetective` was built with a deep-learning model that fine-tunes RoBERTa under the hood to identify and rank claims that are worth fact-checking. To see the code used to train the `ClaimDetective` models, click here. For `ClaimDetective` documentation, click here.
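The basic workflow looks roughly like the sketch below. The import path, constructor argument, and `rank` method are assumptions for illustration only; the real API is defined in `claim_detective.py` and demonstrated in `example_small.py`.

```python
# Hypothetical usage sketch; the actual class interface is defined in
# claim_detective.py and demonstrated in example_small.py.
from claim_detective import ClaimDetective  # assumed import path

sentences = [
    "The unemployment rate fell to 3.5% last year.",   # checkable factual claim
    "I think the weather has been lovely this week.",  # opinion, not check-worthy
]

detective = ClaimDetective("models/claimbuster/model.pth")  # assumed constructor
ranked = detective.rank(sentences)                          # assumed ranking method

# Sentences should come back ordered from most to least check-worthy,
# each paired with a check-worthiness score.
for sentence, score in ranked:
    print(f"{score:.3f}\t{sentence}")
```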
- `claim_detective.py` contains all the necessary source code to use the check-worthiness detection models located in the `models` directory.
- `models` is a directory containing the latest trained models. See below for details.
- `requirements.txt` contains the packages and versions used to write `claim_detective.py`.
- `example_small.py` contains a very brief example of loading and using one of the models. Read this file before using! It provides essentially all the documentation needed. The output from this file can be found in the `example_outputs` directory, here: `small_output.csv`.
- `example_big.py` is another example of how to load and use a model in a more realistic setting. Note: to run this you will need more packages than those listed in `requirements.txt` (e.g., `nltk` and `BeautifulSoup`). The output from this file can be found in the `example_outputs` directory, in the files called `big_output_[model].csv`, where `[model]` is the model used to generate the file.
- `example_outputs` contains the output `.csv` files from the two example scripts.
- `misclassified.py` is another example of how to load and use the model. The output of this file can be seen in the `incorrect_preds` directory.
Each model is located in its own subdirectory. Each model subdirectory contains two files:

- `logfile.txt`, which contains a log of all the training and testing that model has been through, as well as the architecture of the model.
- `model.pth`, which is a PyTorch checkpoint file containing the model weights in the form of a `state_dict` object.
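As a rough illustration of what loading one of these checkpoints involves, here is a hedged sketch that uses `RobertaForSequenceClassification` from Hugging Face's `transformers` as a stand-in architecture; the actual architecture of each model is recorded in its `logfile.txt`.

```python
# Hedged sketch of loading a model.pth checkpoint with PyTorch.
# RobertaForSequenceClassification is only a plausible stand-in; the real
# architecture is logged in each model's logfile.txt.
import torch
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("roberta-base")
state_dict = torch.load("models/claimbuster/model.pth", map_location="cpu")
model.load_state_dict(state_dict)  # only succeeds if the architectures match
model.eval()  # switch to inference mode before scoring sentences
```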
Because the models are so large, you must download their respective `.zip` files from Google Drive, then unzip each model inside the `models` directory.
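For example, extraction can be done with Python's standard library; `claimbuster.zip` below is an assumed filename, so substitute whatever the Drive download is actually called.

```python
# Unzip a downloaded model archive into the models/ directory.
# "claimbuster.zip" is an assumed filename for illustration.
import zipfile

with zipfile.ZipFile("claimbuster.zip") as archive:
    archive.extractall("models/")
```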
At the time of writing, I have made the following models available on Google Drive:
- `claimbuster` was trained on the ClaimBuster dataset described in Arslan et al. Briefly, the ClaimBuster dataset consists of 23,533 statements extracted from all U.S. general election presidential debates (1960-2016), which were then annotated by human coders.
- `clef19` was trained first on the ClaimBuster dataset described above, and then on the CLEF-2019 CheckThat! dataset (CT19-T1 corpus) described in Atanasova et al. Briefly, the CT19-T1 corpus contains 23,500 human-annotated sentences from political speeches and debates during the 2016 U.S. presidential election.
- `clef20` was trained solely on the CLEF-2020 CheckThat! dataset (CT20-T1(en) corpus) described in Barron-Cedeno et al. Briefly, the CT20-T1(en) corpus contains 962 human-annotated tweets about COVID-19, the disease caused by the novel coronavirus SARS-CoV-2.
Note that the very first run of a model will take a few minutes while everything loads. After that first run, using the model to identify claims is very fast.