This repository contains the code for reproducing results of the paper:
Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. A Stylometric Inquiry into Hyperpartisan and Fake News. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 18), July 2018.
- Download the dataset, place it under data, and extract it there.
- Get the required libraries, aitools4-ie-uima.jar and jsoup-1.6.1.jar, from the resources page and place them under lib.
- Download the TreeTagger binaries that match your operating system and add them to the directory structure as detailed below (naming must be exact). Please visit the TreeTagger homepage beforehand to view the license terms (and instructions for the Windows installation).
- In all cases, there should be a bin directory directly within the operating-system-specific directory. Then add a lib directory next to this bin directory and place the parameters file you extract from this archive into this lib directory as english.par.
- Get the TeX hyphenation patterns ZIP, place it next to the ACL-18 directory, and extract it there. This should create a directory called thirdparty next to the ACL-18 directory of this project.
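Before building, it can help to verify that the files from the steps above are in place. The following is a minimal sketch, assuming the project root is the ACL-18 directory. The function name check_layout is hypothetical and not part of this repository, and the TreeTagger directory is not checked because its exact name depends on your operating system.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository: report which of the
# expected files and directories are missing below the given project root.
check_layout() {
    root="$1"
    missing=0
    for path in data lib/aitools4-ie-uima.jar lib/jsoup-1.6.1.jar ../thirdparty; do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    # Exit status 0 means everything was found, 1 means something is missing.
    return "$missing"
}

# Check the current directory; the result is informational only.
check_layout . || echo "Some prerequisites are missing, see the list above."
```

If the script prints nothing, all checked paths exist relative to the directory it was run from.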
Just run ant in this directory. This will create a single JAR file, acl18-bundle.jar, that contains everything you need.
Split the data into three folds (by portal/publisher) and convert to UIMA XMI.
java -cp acl18-bundle.jar de.aitools.ie.articles.DataPreprocessor data/articles data/xmi
Then extract the features using UIMA and generate WEKA ARFF files for each task. Note that this step extracts all features; the feature set actually used is selected in the next step.
java -cp acl18-bundle.jar de.aitools.ie.articles.FeatureExtractor VERACITY data/xmi data/veracity
java -cp acl18-bundle.jar de.aitools.ie.articles.FeatureExtractor ORIENTATION data/xmi data/orientation
java -cp acl18-bundle.jar de.aitools.ie.articles.FeatureExtractor HYPERPARTISANSHIP data/xmi data/hyperpartisanship
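The three extraction commands differ only in the task name and the matching lowercase output directory, so they can be generated in a loop. This is a minimal sketch, not a script shipped with the repository; the function name extract_cmd is hypothetical, and the commands are only printed so they can be inspected before anything runs.

```shell
#!/bin/sh
# Hypothetical helper: print the FeatureExtractor command for one task.
# The output directory is the lowercase task name, as in the commands above.
extract_cmd() {
    task="$1"
    dir=$(printf '%s' "$task" | tr '[:upper:]' '[:lower:]')
    printf 'java -cp acl18-bundle.jar de.aitools.ie.articles.FeatureExtractor %s data/xmi data/%s\n' \
        "$task" "$dir"
}

# Print one command per task.
for task in VERACITY ORIENTATION HYPERPARTISANSHIP; do
    extract_cmd "$task"
done
```

Piping the output of this script into sh would run the three extractions one after another.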
You can then train and test the classifier. Available feature sets are: TOPIC, TEXT_STYLE, HYPERTEXT_STYLE, STYLE (= TEXT_STYLE + HYPERTEXT_STYLE), ALL (= TOPIC + STYLE). The following command will build the TOPIC classifier for VERACITY on the first fold training set and evaluate it on the first fold test set.
java -cp acl18-bundle.jar de.aitools.ie.articles.RandomForestClassifier TOPIC data/veracity/*-fold1-training.arff data/veracity/*-fold1-test.arff
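To evaluate on all three folds rather than only the first, the command above can be generated once per fold. The sketch below assumes that the files for folds 2 and 3 follow the same naming pattern as those for fold 1, which this README does not state explicitly. The function name classify_cmd is hypothetical, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print the train/test command for one feature set,
# task directory, and fold number. The globs are printed literally and are
# expanded only when the printed command is later run by a shell.
classify_cmd() {
    features="$1"
    task_dir="$2"
    fold="$3"
    printf 'java -cp acl18-bundle.jar de.aitools.ie.articles.RandomForestClassifier %s data/%s/*-fold%s-training.arff data/%s/*-fold%s-test.arff\n' \
        "$features" "$task_dir" "$fold" "$task_dir" "$fold"
}

# Assumption: folds 2 and 3 use the same file naming scheme as fold 1.
for fold in 1 2 3; do
    classify_cmd TOPIC veracity "$fold"
done
```

Replacing TOPIC with another feature set (for example STYLE or ALL) or veracity with another task directory yields the corresponding commands.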