Giter VIP home page Giter VIP logo

plagiarism_detection's Introduction

Plagarism Detection

Code for detecting extrinsic and intrinsic plagiarism

Dataset

Dataset used can be downloaded from - https://webis.de/data/pan-pc-09.html. The ground truth for extrinsic is available here and for intrinsic it's available here

Script to generate ground truth data for extrinsic can be downloaded from here

Requirements

pip install -r requirements.txt

If you encounter ImportError: cannot import name 'complexity' from 'cophi' then run pip install cophi==1.2.3

Config

extrinsic:
  source:
    # dir where .txt files are stored for source
    dir:
      - dataset/subset/subset_1/sou_1
      - dataset/subset/subset_2/sou_2
      - dataset/subset/subset_3/sou_3
      
    # If pth is not present it will compute do the pre-proceissing 
    # and save them so for next run its will skip the processing and used data from csv
    pth: dataset/source_sent_all_three_subset.csv

  suspicious:
    # dir where .txt files are stored for source
    dir:
      - dataset/subset/subset_1/sus_1
    
    # If pth is not present it will compute do the pre-proceissing 
    # and save them so for next run its will skip the processing and used data from csv
    pth: dataset/suspicious_sent.csv

  index: dataset/output/se_index_subset_1_2_3.index
  save: dataset/output/set1/SE/se_output_subset_1_with_all_three_source.csv

intrinsic:
  suspicious:
    # dir where .txt files are stored for source
    dir:
      - dataset/pan-plagiarism-corpus-2009.part3/pan-plagiarism-corpus-2009/intrinsic-analysis-corpus/suspicious-documents
      - dataset/pan-plagiarism-corpus-2009.part2/pan-plagiarism-corpus-2009/intrinsic-analysis-corpus/suspicious-documents
      - dataset/pan-plagiarism-corpus-2009/intrinsic-analysis-corpus/suspicious-documents
    
    # If pth is not present it will compute do the pre-proceissing 
    # and save them so for next run its will skip the processing and used data from csv
    pth: path/to/suspicious_sent_intrinsic.csv

  save: path/to/save/intrinsic_output.csv

  features:
    - automated_readability_index
    - average_sentence_length_chars
    - average_sentence_length_words
    - average_syllables_per_word
    - average_word_frequency_class
    - average_word_length
    - coleman_liau_index
    - flesch_reading_ease
    - functionword_frequency
    - linsear_write_formula
    - most_common_words_without_stopwords
    - number_frequency
    - punctuation_frequency
    - sentence_length_distribution
    - special_character_frequency
    - stopword_ratio
    - top_3_gram_frequency
    - top_bigram_frequency
    - top_word_bigram_frequency
    - uppercase_frequency
    - word_length_distribution
    - yule_k_metric

evaluation:
  results: path/where/results.csv
  ground_truth: path/where/ground_truth.csv

Run Extrinsic

# USING TFIDF FOR FEATURES
python extrinsic_tfidf --config path/to/config.yaml

# USING DISTILL_BERT FOR FEATURES
python extrinsic_se --config path/to/config.yaml

Run Intrinsic

python intrinsic --config path/to/config.yaml

Evaluate

python evaluation --config path/to/config.yaml

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.