
gsoc2019-anonymization's Introduction

Google Summer Of Code 2019 ☀️

Anonymisation Through Data Encryption of Sensitive Data in ODT and Text Files in Greek Language

Problem Statement

Over the past year, governments around the world have attached great importance to the anonymisation of information. The GDPR defines pseudonymization as the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information. Although the GDPR has been in force since 2018, no reliable infrastructure exists in Greece to encrypt sensitive documents. It is therefore necessary to develop a product specifically for users of the Greek language that can safely and promptly anonymize their data so that it complies with the GDPR.

Abstract

I propose the creation of a LibreOffice extension and a web GUI that will anonymize information in any given legal document. All sensitive information should be easy to anonymize with this open-source tool.

Regarding the anonymizer itself, I suggest the following requirements. First, given any document, the anonymizer should encrypt every Greek entity in the file that matches a standard token vocabulary set. Users will be able to specify additional entities to be anonymized (beyond the standard ones) and will be given the option of an additional layer of encryption. Both the LibreOffice extension and the web GUI should be user-friendly, so customizable technologies should be used.

Wiki

Extended documentation is available in the wiki pages to keep the service understandable and maintainable.

Technologies used

Anonymizer Service

The anonymizer service uses the following libraries: argparse, json, termcolor.

Web GUI

The web GUI uses the following libraries: django, bootstrap, requests, crispy-forms, django-form-utils.

LibreOffice Extension

The LibreOffice extension uses the following libraries: uno, json, pynput.

Future work

  • Improvements in user interface.

  • Extending the web GUI so that it can be hosted in a VM and serve multiple clients at the same time.

  • Creating an API.

  • Machine learning techniques to identify sensitive information in text.

  • Resolving any open issues.

For more information, see the future work page in the wiki.

Final Report Gist

You can find the final report here.

Contributors

  • Google Summer of Code participant: Dimitrios Katsiros

  • Mentor: Kostas Papadimas

  • Mentor: Panos Louridas

  • Mentor: Iraklis Varlamis

  • Organization: GFOSS

License

This project is open source as part of the Google Summer of Code program and is released under the MIT license. For more information, see LICENSE.


gsoc2019-anonymization's Issues

GUI - Import anonymizer as package

The web GUI uses the anonymizer as a module inside the src folder.

The anonymizer module inside src is called anonymizer_service.

Right now the service is executed through os.system().

The GUI should be updated so that every call to the service uses find_entities() instead.
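The change requested above can be sketched as follows. This is only an illustration of calling the service in-process instead of shelling out: the find_entities() defined here is a hypothetical stand-in with an assumed signature and return type (the real one in anonymizer_service is not shown in this issue), and anonymize_text() and the "[ENTITY]" mask are invented for the example.

```python
import re


def find_entities(text):
    """Hypothetical stand-in for anonymizer_service.find_entities().

    Toy rule: every word starting with an uppercase letter is an entity.
    Returns a list of (start, end, value) tuples. The real function's
    signature and return type are assumptions here and may differ.
    """
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(r"\b[^\W\d_]\w*", text)
        if m.group()[0].isupper()
    ]


def anonymize_text(text, mask="[ENTITY]"):
    """GUI-side helper (hypothetical) that calls the service in-process.

    Before the change, the GUI would have built a command line and run it
    with os.system(), then read the result back from a file.
    """
    entities = find_entities(text)
    # Replace from right to left so that earlier offsets stay valid.
    for start, end, _value in sorted(entities, reverse=True):
        text = text[:start] + mask + text[end:]
    return text
```

Calling the function directly avoids spawning a process per request and lets the GUI handle exceptions and return values instead of exit codes.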

Syntax error and possible logical bug

'αιου', 'λα', 'νου',

In this line of code, the surnames_postfixes list has a syntax error: the last item of the list is a comma character. However, the error does not show up when the code runs, which means that either this part of the code is never reached, or the new list initialized on line 406 under exactly the same name shadows the first surnames_postfixes. In either case, the declaration of the first list has no effect on how the program works.
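For reference, a small sketch of the two behaviours discussed above, assuming the quoted line sits inside an ordinary Python list literal. If the comma is merely a trailing comma, Python accepts it, which would explain why no error is raised; and a later assignment to the same name simply rebinds it. The contents of the second list are hypothetical, since the list on line 406 is not quoted in this issue.

```python
# First definition: a trailing comma after the last item is valid Python.
surnames_postfixes = [
    'αιου', 'λα', 'νου',
]
first_definition = surnames_postfixes

# Later definition under the same name (hypothetical contents): the name is
# rebound, so code that runs after this point never sees the first list.
surnames_postfixes = ['ιδης', 'οπουλος']

print(first_definition)     # the first list still exists, but only via this alias
print(surnames_postfixes)   # every later lookup of the name gets the second list
```

If both lists are meant to be used, merging them into a single definition (or giving the second one a different name) removes the ambiguity.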

GUI - Make use of quick mode in Web GUI

Because the user may anonymize words manually, the service does not need to search with the standard patterns every time, only the first time. Therefore, after the first text analysis, the GUI should use the anonymizer package in quick mode.
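One possible reading of this request is sketched below: the full pattern search runs once per document, and later manual additions only look for the new word. Everything here is an assumption made for illustration, including the quick flag, the analyze() function, the Session class, and the single 10-digit-number pattern; the real quick mode of the anonymizer package may work differently.

```python
import re

# Hypothetical standard pattern: a standalone 10-digit number.
STANDARD_PATTERNS = [re.compile(r"\b\d{10}\b")]


def analyze(text, user_words=(), quick=False):
    """Return sorted (start, end) spans that should be masked.

    In quick mode the standard patterns are skipped and only the
    user-supplied words are searched for.
    """
    spans = set()
    if not quick:
        for pattern in STANDARD_PATTERNS:
            spans.update(m.span() for m in pattern.finditer(text))
    for word in user_words:
        spans.update(m.span() for m in re.finditer(re.escape(word), text))
    return sorted(spans)


class Session:
    """Keeps the spans found so far for one document in the GUI."""

    def __init__(self, text):
        self.text = text
        # First analysis: full search with the standard patterns.
        self.spans = set(analyze(text))

    def add_word(self, word):
        # Later edits: quick mode, only the newly added word is searched.
        self.spans.update(analyze(self.text, [word], quick=True))
        return sorted(self.spans)
```

Keeping the spans from the first pass means the result after each manual addition is the same as a full re-analysis, without repeating the pattern search.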

ODT files alignment

How can the alignment of the original .odt file be preserved?

.odt files are converted to .txt through odt2txt.
After entities are identified in the converted .txt file, the output should be converted back to the original file format (.odt).
Although odt2txt can export the file in raw format, I still can't find a way to keep the alignment of the original .odt file.
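A possible alternative, which is not the odt2txt approach described above, is to skip the round trip through .txt altogether. An .odt file is a ZIP archive whose text lives in content.xml, so the entities can be masked there while styles.xml and all formatting markup stay untouched. The sketch below uses only the Python standard library; mask_odt(), its parameters, and the "[ENTITY]" mask are hypothetical names. It assumes each entity sits inside a single text run, that the words and the mask contain no XML special characters, and that no attribute value contains a literal ">".

```python
import re
import zipfile


def mask_odt(src, dst, words, mask="[ENTITY]"):
    """Copy the .odt at src to dst, masking words in content.xml text."""
    if not words:
        raise ValueError("words must not be empty")
    entity = re.compile("|".join(re.escape(w) for w in words))

    def mask_run(match):
        # match is one run of character data between two tags.
        return entity.sub(mask, match.group(0))

    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        # infolist() keeps the archive order, so "mimetype" stays first,
        # and passing each ZipInfo back keeps its compression setting.
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "content.xml":
                xml = data.decode("utf-8")
                # Only rewrite text between ">" and "<", never tag names
                # or attribute values.
                xml = re.sub(r"(?<=>)[^<]+(?=<)", mask_run, xml)
                data = xml.encode("utf-8")
            zout.writestr(item, data)
```

Because only character data is rewritten, paragraph styles and alignment are carried over as they were. An entity split across several formatting spans would not be matched by this sketch and would need a proper XML traversal.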
