Paraphrase Detection: Human vs. Machine Content

This is the official repository for the paper Paraphrase Detection: Human vs. Machine Content.

Setup

We recommend using Python 3.10 for this project.

First install the requirements: pip install -r requirements.txt

To use GloVe and Fasttext, place their pre-trained word vectors in the models directory.

GloVe: Get the glove.6B.11d.txt from here.
Fasttext: Get the cc.en.300.bin from here.
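
Before running the experiments, it can help to confirm that both files are in place. The following is a minimal sketch, assuming the file names listed above and a models directory relative to the current working directory; the helper name missing_vector_files is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names as listed in this README; "models" is the directory named above.
EXPECTED_VECTOR_FILES = {
    "GloVe": "glove.6B.11d.txt",
    "Fasttext": "cc.en.300.bin",
}


def missing_vector_files(models_dir="models"):
    """Return {method: file name} for every expected file not found in models_dir."""
    base = Path(models_dir)
    return {
        method: name
        for method, name in EXPECTED_VECTOR_FILES.items()
        if not (base / name).is_file()
    }


def report(models_dir="models"):
    """Print one line per missing file, or a confirmation if none is missing."""
    missing = missing_vector_files(models_dir)
    if not missing:
        print(f"All pre-trained word vectors found in {models_dir}/")
    for method, name in missing.items():
        print(f"{method}: {name} not found in {models_dir}/")
```

Calling report() from the repository root prints which of the two files still needs to be downloaded.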

Experiments

The project includes several scripts, each covering a separate part of the experiments.

Parse datasets from the datasets folder to a unified json format: parse.py
Create the BERT embeddings for text pairs in true_data.json and visualize them with t-SNE: embedding_handler.py
Apply detection methods (training & testing): detect_paraphrases.py
Evaluate the detection results: evaluate.py
Get examples sorted by best / worst / random performance: get_examples.py
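
The scripts above can be chained in the order they are listed. The sketch below is only an illustration: it assumes that each script runs without command-line arguments and that the listed order reflects their dependencies, neither of which is confirmed here. The names PIPELINE, pipeline_commands, and run_pipeline are hypothetical.

```python
import subprocess
import sys

# Script names are taken from the list above. Running them without
# arguments is an assumption of this sketch.
PIPELINE = [
    ("parse datasets to unified JSON", "parse.py"),
    ("create and visualize BERT embeddings", "embedding_handler.py"),
    ("apply detection methods", "detect_paraphrases.py"),
    ("evaluate detection results", "evaluate.py"),
    ("collect best / worst / random examples", "get_examples.py"),
]


def pipeline_commands(python=sys.executable):
    """Build one command (as an argument list) per script, in listed order."""
    return [[python, script] for _, script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    commands = pipeline_commands()
    for (step, _), cmd in zip(PIPELINE, commands):
        print(f"{step}: {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True)
    return commands
```

With the default dry_run=True, run_pipeline() only prints the commands; pass dry_run=False to actually execute them from the repository root.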

Datasets

Not all datasets used in the paper are publicly available, which is why we do not offer the prediction results on text pairs from these datasets for download. However, you are free to reproduce the experiments using all datasets from the paper once you have obtained access.

This study includes twelve datasets (seven human-generated and five machine-generated). For further information, please refer to the paper.

Human-generated datasets: ETPC, QQP, TURL, SaR, MSCOCO, ParaSCI, APH

Machine-generated datasets: MPC, SAv2, ParaNMT-50M, PAWS-Wiki, APT

Results

The results of our experiments are evaluated in the paper linked above. Here, we provide additional material that was not included in the final version of the paper.

t-SNE visualizations of each dataset's BERT embeddings

Dataset	Acquisition Type	Mixed	Paraphrases Only
APH	Human	Live View	Live View
APT	Machine	Live View	Live View
ETPC	Human	Live View	Live View
MPC	Machine	Live View	Live View
MSCOCO	Human	Live View	Live View
PAWS-Wiki	Machine	Live View	Live View
ParaNMT-50M	Machine	Live View	Live View
ParaSCI	Human	Live View	Live View
QQP	Human	Live View	Live View
SAv2	Machine	Live View	Live View
SaR	Human	Live View	Live View
TURL	Human	Live View	Live View
All Datasets	Mixed	Live View	Live View

Grid Search Results

We ran a 2-fold randomized grid search with 25 iterations once for each detection method. The grid search results are available in this directory.
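
One way to read this procedure is as a randomized search that samples 25 parameter settings and scores each one with 2-fold cross-validation. The sketch below illustrates that idea using only the standard library. The threshold parameter, the toy similarity scores, and all function names are hypothetical and not taken from the repository, which may instead rely on a library routine such as scikit-learn's RandomizedSearchCV.

```python
import random
from itertools import product


def two_fold_splits(n_samples, rng):
    """Shuffle the indices and return two (train, test) splits with swapped halves."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    half = n_samples // 2
    first, second = indices[:half], indices[half:]
    return [(first, second), (second, first)]


def randomized_search(param_grid, score_fn, n_samples, n_iter=25, seed=0):
    """Sample up to n_iter settings from the grid; keep the best mean 2-fold score."""
    rng = random.Random(seed)
    keys = list(param_grid)
    combos = [dict(zip(keys, values))
              for values in product(*(param_grid[k] for k in keys))]
    candidates = rng.sample(combos, min(n_iter, len(combos)))
    folds = two_fold_splits(n_samples, rng)
    best_params, best_score = None, float("-inf")
    for params in candidates:
        mean_score = sum(score_fn(params, train, test)
                         for train, test in folds) / len(folds)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy data (hypothetical): similarity scores with a true decision boundary at 0.5.
SCORES = [i / 40 for i in range(40)]
LABELS = [s >= 0.5 for s in SCORES]


def threshold_accuracy(params, train_idx, test_idx):
    """Accuracy of a fixed-threshold detector on the test fold.

    This toy detector has nothing to train, so train_idx is unused.
    """
    hits = sum((SCORES[i] >= params["threshold"]) == LABELS[i] for i in test_idx)
    return hits / len(test_idx)


GRID = {"threshold": [i / 50 for i in range(50)]}  # 50 candidate thresholds
best_params, best_score = randomized_search(GRID, threshold_accuracy, len(SCORES))
print(best_params, best_score)
```

Because only 25 of the 50 candidate thresholds are sampled, the search trades exhaustiveness for speed; a different seed may select a different best setting.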

One-on-one correlation graphs of detection methods

For a detailed view of each one-on-one correlation, please refer to this directory.

Citation

If you use this repository or our paper in your research, please cite us as follows:

@misc{becker2023paraphrase,
      title={Paraphrase Detection: Human vs. Machine Content}, 
      author={Jonas Becker and Jan Philip Wahle and Terry Ruas and Bela Gipp},
      year={2023},
      eprint={2303.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
