jonas-becker / pd-human-vs-machine-content

The official repository for the paper "Paraphrase Detection: Human vs. Machine Content".

License: Apache License 2.0

Paraphrase Detection: Human vs. Machine Content

This is the official repository for the paper Paraphrase Detection: Human vs. Machine Content.

Setup

We recommend using Python 3.10 for this project.

First install the requirements: pip install -r requirements.txt


To use GloVe and fastText, place their pre-trained word vectors in the models directory.

  • GloVe: Get the glove.6B.11d.txt from here.
  • fastText: Get the cc.en.300.bin from here.
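
As a rough illustration of what the GloVe file contains, the sketch below reads GloVe's plain-text format (one token per line, followed by its vector components separated by spaces). The function name `load_glove`, the `limit` parameter, and the exact path inside models are assumptions made for this example, not part of the repository. The fastText .bin file is a binary model and is usually loaded with `fasttext.load_model` from the fasttext package instead.

```python
from pathlib import Path


def load_glove(path, limit=None):
    """Read GloVe's text format: a token followed by its vector on each line."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if limit is not None and line_number >= limit:
                break
            token, *values = line.rstrip().split(" ")
            vectors[token] = [float(value) for value in values]
    return vectors


# Assumed location: the README only says the file belongs in "models".
GLOVE_PATH = Path("models") / "glove.6B.11d.txt"

if GLOVE_PATH.exists():
    glove = load_glove(GLOVE_PATH, limit=1000)  # read only the first 1000 tokens
    print(f"Loaded {len(glove)} vectors")
else:
    print(f"Missing {GLOVE_PATH}; download it before running the detection step")
```

Checking for the file up front like this can surface a missing download earlier than a failure inside the detection script would.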

Experiments

The project includes several scripts, each covering a separate part of the experiment.

  1. Parse datasets from the datasets folder to a unified JSON format: parse.py
  2. Create the BERT embeddings for text pairs in true_data.json and visualize them with t-SNE: embedding_handler.py
  3. Apply detection methods (training & testing): detect_paraphrases.py
  4. Evaluate the detection results: evaluate.py
  5. Get examples sorted by best / worst / random performance: get_examples.py
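
The order of the steps above can be sketched as a small driver. The script names are taken from the list; running each script without arguments from the repository root, and the helper names `build_commands` and `run_pipeline`, are assumptions for illustration only.

```python
import subprocess
import sys

# Script names and their order come from the list above.
# Assumption: each script runs without command-line arguments.
PIPELINE = [
    "parse.py",
    "embedding_handler.py",
    "detect_paraphrases.py",
    "evaluate.py",
    "get_examples.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one command per script, in pipeline order."""
    return [[python, script] for script in scripts]


def run_pipeline(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        print("Running", " ".join(command))
        subprocess.run(command, check=True)


# To execute the full pipeline from the repository root:
# run_pipeline(build_commands())
```

Using `check=True` stops the run as soon as one step fails, so later steps never work on incomplete output from an earlier one.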

Datasets

Not all datasets used in the paper are freely available to the public, which is why we do not offer the prediction results on text pairs from these datasets for download. However, you are free to rerun the experiments on all datasets from the paper once you have obtained access.

This study includes twelve datasets (seven human-generated and five machine-generated). For further information, please refer to the paper.

Human-generated datasets: ETPC, QQP, TURL, SaR, MSCOCO, ParaSCI, APH

Machine-generated datasets: MPC, SAv2, ParaNMT-50M, PAWS-Wiki, APT
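
When working with the unified data, it can be handy to keep the grouping above in one lookup table. The following is a minimal sketch. The dictionary name and the labels "human" and "machine" are chosen for this example and are not necessarily the values used in the repository's JSON files.

```python
from collections import Counter

# Dataset names as listed above; the origin labels are illustrative.
ORIGIN_BY_DATASET = {
    "ETPC": "human",
    "QQP": "human",
    "TURL": "human",
    "SaR": "human",
    "MSCOCO": "human",
    "ParaSCI": "human",
    "APH": "human",
    "MPC": "machine",
    "SAv2": "machine",
    "ParaNMT-50M": "machine",
    "PAWS-Wiki": "machine",
    "APT": "machine",
}

# Count how many datasets fall into each group.
counts = Counter(ORIGIN_BY_DATASET.values())
print(counts["human"], "human-generated,", counts["machine"], "machine-generated")
```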

Results

The results of our experiments are evaluated in the paper linked above. Here we provide additional material that was not used in the final version of the paper.

t-SNE visualizations of each dataset's BERT embeddings

Dataset         Acquisition Type   Mixed       Paraphrases Only
APH             Human              Live View   Live View
APT             Machine            Live View   Live View
ETPC            Human              Live View   Live View
MPC             Machine            Live View   Live View
MSCOCO          Human              Live View   Live View
PAWS-Wiki       Machine            Live View   Live View
ParaNMT-50M     Machine            Live View   Live View
ParaSCI         Human              Live View   Live View
QQP             Human              Live View   Live View
SAv2            Machine            Live View   Live View
SaR             Human              Live View   Live View
TURL            Human              Live View   Live View
*All Datasets*  Mixed              Live View   Live View

Grid Search Results

We performed a 2-fold randomized grid search of 25 iterations once per detection method. The grid search results can be seen in this directory.

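
To make the search procedure concrete, here is a standard-library sketch of a randomized search that draws 25 grid points and scores each with 2-fold cross-validation. It is not the repository's implementation: the toy Jaccard threshold detector, the parameter grid, and the example pairs are all hypothetical. In practice, scikit-learn's `RandomizedSearchCV` with `n_iter=25` and `cv=2` is a common way to set up the same kind of search.

```python
import itertools
import random


def randomized_search(param_grid, evaluate, data, n_iter=25, n_folds=2, seed=0):
    """Score n_iter randomly chosen grid points with n_folds-fold cross-validation."""
    rng = random.Random(seed)
    keys = sorted(param_grid)
    grid = [dict(zip(keys, values))
            for values in itertools.product(*(param_grid[key] for key in keys))]
    candidates = rng.sample(grid, min(n_iter, len(grid)))

    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]

    results = []
    for params in candidates:
        fold_scores = []
        for i, test_fold in enumerate(folds):
            train = [row for j, fold in enumerate(folds) if j != i for row in fold]
            fold_scores.append(evaluate(params, train, test_fold))
        results.append((sum(fold_scores) / n_folds, params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results


def jaccard(text1, text2):
    """Token overlap between two texts, between 0 and 1."""
    a, b = set(text1.lower().split()), set(text2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def evaluate_threshold(params, train, test):
    # A fixed threshold rule has nothing to fit, so `train` is unused here;
    # a trainable detector would be fitted on it first.
    correct = sum((jaccard(t1, t2) >= params["threshold"]) == label
                  for t1, t2, label in test)
    return correct / len(test)


# Hypothetical grid and toy pairs: (text1, text2, is_paraphrase).
PARAM_GRID = {"threshold": [i / 40 for i in range(1, 40)]}
PAIRS = [
    ("a man rides a horse", "a man is riding a horse", True),
    ("the cat sleeps on the sofa", "the cat is sleeping on the sofa", True),
    ("how do I learn python", "how can I learn python", True),
    ("the sky is blue", "stocks fell sharply today", False),
    ("she bought a new car", "he sold an old bike", False),
    ("where is the station", "what time is dinner", False),
]

best_params, best_score, results = randomized_search(
    PARAM_GRID, evaluate_threshold, PAIRS)
print(best_params, round(best_score, 2), len(results))
```

Sampling a fixed number of grid points keeps the cost predictable when the full grid is large, at the price of possibly missing the best combination.
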
One-on-one correlation graphs of detection methods

For a detailed view of each one-on-one correlation, please refer to this directory.
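
As an illustration of what a one-on-one comparison involves, the sketch below computes a correlation coefficient for every pair of detection methods. It assumes Pearson correlation over per-pair prediction scores, which may differ from the measure used for the graphs. The method names and scores are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical paraphrase scores from three methods on the same five text pairs.
PREDICTIONS = {
    "method_a": [0.9, 0.8, 0.2, 0.1, 0.6],
    "method_b": [0.8, 0.9, 0.3, 0.2, 0.5],
    "method_c": [0.1, 0.2, 0.8, 0.9, 0.4],
}

# One coefficient per unordered pair of methods.
correlations = {
    (a, b): pearson(PREDICTIONS[a], PREDICTIONS[b])
    for a, b in combinations(PREDICTIONS, 2)
}
for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```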

Citation

If you use this repository or our paper in your research, please cite us as follows.

@misc{becker2023paraphrase,
      title={Paraphrase Detection: Human vs. Machine Content}, 
      author={Jonas Becker and Jan Philip Wahle and Terry Ruas and Bela Gipp},
      year={2023},
      eprint={2303.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

pd-human-vs-machine-content's Issues

ValueError: Only callable can be used as callback in DataFrame Creation

I encountered a "ValueError: Only callable can be used as a callback" error when trying to run the code. It is raised in "parser.py" at line 28, where a pandas DataFrame is created with the provided column names. This is the line that triggers the error:

df = pd.DataFrame(columns=[DATASET, ORIGIN, PAIR_ID, ID1, ID2, TEXT1, TEXT2, PARAPHRASE, PARAPHRASE_TYPE, SPLIT])

I kindly request your assistance in resolving this issue.
