ClaimDetective

`ClaimDetective` is a Python class that lets the user rank a list of sentences (i.e., potential claims) from most check-worthy to least check-worthy, i.e., the priority with which they should be fact-checked. `ClaimDetective` was built with a deep-learning model that fine-tunes RoBERTa under the hood to identify and rank claims that are worth fact-checking. To see the code used to train the `ClaimDetective` models, click here. For `ClaimDetective` documentation, click here.
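The basic workflow looks roughly like the sketch below. The import path, constructor argument, and `rank` method are assumptions for illustration only; the real API is defined in `claim_detective.py` and demonstrated in `example_small.py`.

```python
# Hypothetical usage sketch; the actual class interface is defined in
# claim_detective.py and demonstrated in example_small.py.
from claim_detective import ClaimDetective  # assumed import path

sentences = [
    "The unemployment rate fell to 3.5% last year.",   # checkable factual claim
    "I think the weather has been lovely this week.",  # opinion, not check-worthy
]

detective = ClaimDetective("models/claimbuster/model.pth")  # assumed constructor
ranked = detective.rank(sentences)                          # assumed ranking method

# Sentences should come back ordered from most to least check-worthy,
# each paired with a check-worthiness score.
for sentence, score in ranked:
    print(f"{score:.3f}\t{sentence}")
```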
- `claim_detective.py` contains all the necessary source code to use the check-worthiness detection models located in the `models` directory.
- `models` is a directory containing the latest trained models. See below for details.
- `requirements.txt` contains the packages and versions used to write `claim_detective.py`.
- `example_small.py` contains a very brief example of loading and using one of the models. Read this file before using! It provides essentially all the documentation needed. The output from this file can be found in the `example_outputs` directory, here: `small_output.csv`.
- `example_big.py` is another example of how to load and use a model in a more realistic setting. Note: to run this you will need more packages than those listed in `requirements.txt` (e.g., `nltk` and `BeautifulSoup`). The output from this file can be found in the `example_outputs` directory, in the files called `big_output_[model].csv`, where `[model]` is the model used to generate the file.
- `example_outputs` contains the output `.csv` files from the two example scripts.
- `misclassified.py` is another example of how to load and use the model. The output of this file can be seen in the `incorrect_preds` directory.
Each model is located in its own subdirectory. Each model subdirectory contains two files:

- `logfile.txt`, which contains a log of all the training and testing that model has been through, as well as the architecture of the model.
- `model.pth`, which is a PyTorch checkpoint file containing the model weights in the form of a `state_dict` object.
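As a rough illustration of what loading one of these checkpoints involves, here is a hedged sketch that uses `RobertaForSequenceClassification` from Hugging Face's `transformers` as a stand-in architecture; the actual architecture of each model is recorded in its `logfile.txt`.

```python
# Hedged sketch of loading a model.pth checkpoint with PyTorch.
# RobertaForSequenceClassification is only a plausible stand-in; the real
# architecture is logged in each model's logfile.txt.
import torch
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("roberta-base")
state_dict = torch.load("models/claimbuster/model.pth", map_location="cpu")
model.load_state_dict(state_dict)  # only succeeds if the architectures match
model.eval()  # switch to inference mode before scoring sentences
```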
Because the models are so large, you must download their respective `.zip` files from Google Drive, then unzip each model inside the `models` directory.
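For example, extraction can be done with Python's standard library; `claimbuster.zip` below is an assumed filename, so substitute whatever the Drive download is actually called.

```python
# Unzip a downloaded model archive into the models/ directory.
# "claimbuster.zip" is an assumed filename for illustration.
import zipfile

with zipfile.ZipFile("claimbuster.zip") as archive:
    archive.extractall("models/")
```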
At the time of writing, I have made the following models available on Google Drive:
- `claimbuster` was trained on the ClaimBuster dataset described in Arslan et al. Briefly, the ClaimBuster dataset consists of 23,533 statements extracted from all U.S. general election presidential debates (1960-2016), which were then annotated by human coders.
- `clef19` was trained first on the ClaimBuster dataset described above, and then on the CLEF-2019 CheckThat! dataset (CT19-T1 corpus) described in Atanasova et al. Briefly, the CT19-T1 corpus contains 23,500 human-annotated sentences from political speeches and debates during the 2016 U.S. presidential election.
- `clef20` was trained solely on the CLEF-2020 CheckThat! dataset (CT20-T1(en) corpus) described in Barron-Cedeno et al. Briefly, the CT20-T1(en) corpus contains 962 human-annotated tweets about COVID-19, the disease caused by the novel coronavirus SARS-CoV-2.
Note that the very first run of a model will take a few minutes while everything loads. After that first run, using the model to identify claims is very fast.