Giter VIP home page Giter VIP logo

plagiarism-detection-software's Introduction

Plagiarism detection with Python 3 and Django

In 2014-2015 I developed a side-project called Plagiarism Guard which was a plagiarism detection service. I mainly did this to teach myself Python (kudos to Learn Python the Hard Way and also implement some (very!) basic NLP. This was developed in Python 3 and the Django framework.

The basic premise for this plagiarism detection is to accept resources in a few different formats (URL being the most popular, but also text files and Office-type documents). The resources can then be periodically scanned (via custom management commands triggered via a crontab), and a few fairly unique phrases are pulled out of them. These phrases are then searched online (using Bing's search engine API), and the results are the 'plagiarism-detected' candidates. Finally these candidates are scanned to rule out any false positives, and discover an approximate duplication score.

The files for this project are organised into three main folders:

  • /plag/ - this is the bulk of the Django application, hence it contains the models, forms, routes, services etc.
    • The /plag/templates/ folder contains the HTML pages covering both the public (unauthenticated) website pages such as the order form and legal documents (under static/), and the account (authenticated) pages (under dynamic/)
    • /plag/templatetags/custom_tags.py contains the Django custom template tags used in various parts of the HTML frontend
    • /plag/management/commands contains the custom management commands:
      • scan_resources.py chooses ProtectedResource entries which are due to be scanned, and then calls the relevant 'utility' methods in /util/getqueriespertype/ to get a few (hopefully distinct) queries from the document/resource. Bing's search engine API is then called in /util/handlequeries.py to get any potential plagiarism matches for each query. These results are saved back to the DB.
      • post_processing.py then looks at each potential plagiarism match URL, loads up the URL and parses the text content to see whether this is a false positive or not. If it's a real match, a duplication percentage score is calculcated. This then appears on the user's account.
      • recent_blog_posts.py this parses a blog's RSS feed and saves the latest results to the database, so that the blog results can be shown in a cached/efficient way.
  • /PlagiarismGuard/ - these are the standard Django files used to configure and power the application.
  • /util/ - as covered a little above, these are a set of 'utilities' which perform the bulk of the plagiarism detection work.

A further write-up of this project is available on my personal site.

plagiarism-detection-software's People

Contributors

tristanperry avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.