anonipy's Introduction

Hi, I'm Erik 👋

  • Position: Researcher @ Department for Artificial Intelligence, Jožef Stefan Institute, Slovenia
  • Focus: AI, NLP, cross-lingual language models, (semi-)automatic text processing, data visualization
  • Connect: LinkedIn

See my Homepage for more information.

Open Source Packages

Name       Description
anonipy    Data anonymization package supporting different anonymization strategies.
datachart  Data visualization package, simple to use, highly customizable.

Data Sets

Name            Description                                            Source
OG2021          The 2021 Tokyo Olympics data set                       Clarin.si
SloATOMIC 2020  The Slovene translation of the ATOMIC 2020 data set    Clarin.si

anonipy's People

Contributors

eriknovak

anonipy's Issues

Add support for Python v3.12

Connected to a problem?

The anonipy package cannot be installed on the latest Python v3.12. The problem lies with the dependent packages (specifically gensim), which currently do not support v3.12.

Solution?

Since anonipy does not list gensim as a dependency, find a way to configure the package so that it supports Python v3.12.

Alternatives?

No response

The method `read_file` does not work on Windows

Contact Details

No response

What happened?

The method read_file raises an error when reading a PDF. The output shows that the method is using pdftotext.

The cause is the textract package that anonipy relies on, which does not support Windows, as presented in these open issues:

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

Additional context

No response

Update the GLiNER & GLiNER-Spacy dependency

Connected to a problem?

GLiNER and GLiNER-Spacy have recently implemented new features and customizations that would be useful to include in the extractors. These include running extraction on the GPU and (in the future) confidence scores for predictions.

Solution?

Track when these features are published in GLiNER and GLiNER-Spacy, then implement them in anonipy.

Alternatives?

No response

Add acknowledgements

Describe the missing documentation

Add an acknowledgement section to the readme and the documentation to acknowledge the support received in developing the package.

Improve documentation

Describe the missing documentation

The current documentation might be hard to follow due to the number of components. The project would benefit from documentation in which each component and function of the package has a clear explanation and examples.

Related issues: #3

Add post-anonymization function to help manual replacement

Connected to a problem?

The current package supports automatic data anonymization. However, manually modifying the anonymized text can be cumbersome, especially if the original text is long.

Solution?

Add a post-anonymization function that takes the original text and the suggested replacements and returns the modified anonymized text.

The function should work with replacements returned by the strategy's anonymize method.
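A minimal sketch of what such a function could look like is given below. The function name apply_replacements and the replacement format (dictionaries with start_index, end_index and anonymized_text keys) are assumptions made for illustration; the actual structure returned by the strategy's anonymize method may differ.

```python
from typing import Dict, List


def apply_replacements(text: str, replacements: List[Dict]) -> str:
    """Return `text` with every replacement span substituted.

    Each replacement is assumed (hypothetically) to look like
    {"start_index": int, "end_index": int, "anonymized_text": str}
    and the spans are assumed not to overlap.
    """
    # Work from the end of the text backwards so that the character
    # offsets of earlier spans stay valid after each substitution.
    ordered = sorted(replacements, key=lambda r: r["start_index"], reverse=True)
    result = text
    for rep in ordered:
        start, end = rep["start_index"], rep["end_index"]
        result = result[:start] + rep["anonymized_text"] + result[end:]
    return result
```

With this shape, a user could edit the anonymized_text values by hand and call the function again on the original text, instead of editing the long anonymized text directly.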

Alternatives?

No response

Add new models for NER information extraction

Connected to a problem?

The current NER information extraction relies on the GLiNER model, specifically urchade/gliner_multi_pii-v1. While this model supports several languages, we need models that cover a more extensive list of languages. Furthermore, the NER models should support various domains as well.

Solution?

Find NER datasets or create synthetic datasets that support different languages and domains. For this, we could use the scripts provided by the GLiNER package and publish the trained models on the Hugging Face Hub.

An additional bonus would be to evaluate these models in different languages and domains. However, this could be difficult due to the lack of open datasets for these use cases.

Alternatives?

No response

A regex pattern matching extractor

Connected to a problem?

The current EntityExtractor searches for relevant entities using Named Entity Extraction models. However, sometimes, when the documents are in the same format, we can easily extract the values using regex. Having an easy way of extracting such values with regex would be useful.

Solution?

Develop a new extractor called PatternExtractor or RegexExtractor, which would receive a list of labels and regular expressions and extract the parts of the document that match them.
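The core of such an extractor could be sketched as follows. This is only an illustration under stated assumptions: the function name extract_patterns, the label-to-pattern dictionary and the shape of the returned entities are hypothetical and not part of anonipy's existing API.

```python
import re
from typing import Dict, List


def extract_patterns(text: str, patterns: Dict[str, str]) -> List[Dict]:
    """Find every match of each labelled regular expression in `text`.

    `patterns` maps a label (e.g. "date") to a regex string. The
    returned dictionaries are a hypothetical entity format.
    """
    entities = []
    for label, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            entities.append(
                {
                    "label": label,
                    "text": match.group(0),
                    "start_index": match.start(),
                    "end_index": match.end(),
                }
            )
    # Report the matches in document order, regardless of label order.
    return sorted(entities, key=lambda e: e["start_index"])
```

Returning character offsets alongside the matched text would let the results be passed to the anonymization strategies in the same way as entities found by a model-based extractor.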

Alternatives?

No response

Add an automatic date format detection to DateGenerator

Connected to a problem?

A document can contain dates in different formats. Because of this, we would need to initialize a separate DateGenerator for each date format in the document.

Solution?

Add support for date_format="auto", which would automatically select the appropriate date format for the given input.
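One simple way to approach the detection, sketched here as an assumption rather than as the package's design, is to try a list of candidate formats with datetime.strptime and keep the first one that parses. The name detect_date_format and the candidate list are hypothetical.

```python
from datetime import datetime
from typing import Optional, Sequence

# Hypothetical candidate list; a real implementation would likely
# cover more formats and possibly depend on the document language.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y", "%d.%m.%Y")


def detect_date_format(
    value: str, formats: Sequence[str] = CANDIDATE_FORMATS
) -> Optional[str]:
    """Return the first format that parses `value`, or None if none do."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            # This format does not fit the input; try the next one.
            continue
        return fmt
    return None
```

Note that this approach cannot resolve genuinely ambiguous inputs: a value such as 01/02/2024 parses as both day-first and month-first, so the order of the candidate list decides the outcome.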

Alternatives?

No response

Implement unit tests

Connected to a problem?

The package currently lacks unit tests. Having them would improve its stability and enable faster bug discovery in the code.

Solution?

Implement unit tests in the /test folder, specifically for:

  • extractors
  • generators
  • strategies
  • language detector
  • regex

The easiest way would be to use the unittest package.
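As a rough illustration of the unittest style, the sketch below tests a stand-in regex. DATE_PATTERN is a hypothetical pattern defined here for the example, not one taken from the package, and run_tests is only a helper for running the case programmatically.

```python
import re
import unittest

# Hypothetical pattern standing in for one of the package's regex definitions.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


class TestDatePattern(unittest.TestCase):
    def test_matches_iso_date(self):
        # The pattern should pick out the single ISO-style date.
        self.assertEqual(
            DATE_PATTERN.findall("Born on 1990-01-01."), ["1990-01-01"]
        )

    def test_ignores_plain_text(self):
        # Text without digits in the expected layout should not match.
        self.assertIsNone(DATE_PATTERN.search("No dates here."))


def run_tests() -> unittest.TestResult:
    """Load and run the test case, returning the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDatePattern)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real test folder, the helper would usually be unnecessary, since running python -m unittest discovers and runs test cases placed in files named test_*.py.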

Alternatives?

No response
