eriknovak / anonipy

Data anonymization package, supporting different anonymization strategies

Home Page: https://eriknovak.github.io/anonipy/

License: BSD 2-Clause "Simplified" License

Python 100.00%

data-anonymization data-generation package python term-extraction

anonipy's Introduction

Hi, I'm Erik 👋🏼

Position: Researcher @ Department for Artificial Intelligence, Jožef Stefan Institute, Slovenia
Focus: 🤖 AI, 📝 NLP, 🌍 cross-lingual language models, 🔄 (semi-)automatic text processing, 📊 data visualization
Connect: LinkedIn

See my Homepage for more information.

Open Source Packages

Name	Description
anonipy	Data anonymization package supporting different anonymization strategies.
datachart	Data visualization package, simple to use, highly customizable.

Data Sets

Name	Description	Source
OG2021	The 2021 Tokyo Olympics data set	Clarin.si
SloATOMIC 2020	The Slovene translation of the ATOMIC 2020 data set	Clarin.si

anonipy's Issues

Add support for Python v3.12

Connected to a problem?

The anonipy package cannot be installed on the latest Python v3.12. The problem lies with its dependent packages (specifically gensim), which currently do not support v3.12.

Solution?

Since anonipy does not list gensim as a dependency, find a way to configure the package so that it supports Python v3.12.

Alternatives?

No response

The method `read_file` does not work on Windows

Contact Details

No response

What happened?

The method read_file returns an error when reading a PDF. The output shows that the method is using pdftotext.

As identified, this is an issue with the textract package that anonipy uses, which does not support Windows, as presented in these open issues:

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

Additional context

No response

Update the GLiNER & GLiNER-Spacy dependency

Connected to a problem?

GLiNER and GLiNER-Spacy have recently implemented new features and customizations that would be useful to include in the extractors. These include running extraction on a GPU and (in the future) confidence scores for predictions.

Solution?

Track when these features are released in GLiNER and GLiNER-Spacy, then integrate them into anonipy.

Alternatives?

No response

Add acknowledgements

Describe the missing documentation

Add an acknowledgement section to the readme and the documentation to acknowledge the support received in developing the package.

Improve documentation

Describe the missing documentation

The current documentation might be hard to follow due to the number of components. Improving documentation, where each component and function of the package has clear explanations and examples, would benefit the project.

Related issues: #3

Add post-anonymization function to help manual replacement

Connected to a problem?

The current package supports automatic data anonymization. However, manually modifying the anonymized text can be cumbersome, especially if the original text is long.

Solution?

Add a post-anonymization function that takes the original text and the suggested replacements and returns the modified anonymized text.

The function should work with replacements returned by the strategy’s anonymize method.
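
A minimal sketch of such a function is shown here. The name `apply_replacements` and the replacement keys (`start_index`, `end_index`, `anonymized_text`) are assumptions made for illustration; the actual structure returned by a strategy's anonymize method may differ.

```python
from typing import Dict, List


def apply_replacements(text: str, replacements: List[Dict]) -> str:
    """Return `text` with every replacement span substituted.

    Each replacement is assumed (hypothetically) to be a dict with the keys
    `start_index`, `end_index`, and `anonymized_text`.
    """
    # Apply from the end of the text so that earlier indices stay valid
    # even when a substitute has a different length than the original span.
    ordered = sorted(replacements, key=lambda r: r["start_index"], reverse=True)
    for r in ordered:
        text = text[: r["start_index"]] + r["anonymized_text"] + text[r["end_index"] :]
    return text


original = "John Doe was born on 1990-05-01."
replacements = [
    {"start_index": 0, "end_index": 8, "anonymized_text": "[NAME]"},
    {"start_index": 21, "end_index": 31, "anonymized_text": "[DATE]"},
]
print(apply_replacements(original, replacements))  # [NAME] was born on [DATE].
```

Because the function only depends on the replacement list, a user could edit individual entries by hand and re-apply them without re-running the extraction step.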

Alternatives?

No response

Code documentation

Describe the missing documentation

The current code is mostly undocumented. It would be good to have all components documented, showing their input parameters and what they return. Furthermore, adding type hints would help with IntelliSense, i.e. autocompletion of the code.

Possible doc style: https://peps.python.org/pep-0257/

Example of use: https://www.programiz.com/python-programming/docstrings
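
As an illustration of the style being proposed, here is a small documented and type-hinted function. `mask_text` is a hypothetical helper, not part of anonipy, and the `Args`/`Returns` layout is one common convention layered on top of PEP 257, not something the PEP itself prescribes.

```python
def mask_text(text: str, mask_char: str = "*") -> str:
    """Replace every non-whitespace character in the text with a mask character.

    Args:
        text: The text to mask.
        mask_char: The single character used as the mask.

    Returns:
        The masked text, with whitespace preserved.
    """
    return "".join(c if c.isspace() else mask_char for c in text)


print(mask_text("John Doe"))  # **** ***
```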

Add new models for NER information extraction

Connected to a problem?

The current NER information extraction focuses on using the GLiNER model, specifically urchade/gliner_multi_pii-v1. While this model does support some different languages, we would need models that would cover a more extensive list of languages. Furthermore, the NER model should support various domains as well.

Solution?

Find NER datasets or create synthetic datasets that support different languages and domains. For this, we could use the scripts provided by the GLiNER package and publish the trained models on the huggingface hub.

An additional bonus would be to evaluate these models in different languages and domains. However, this could be difficult due to the lack of open datasets for these use cases.

Alternatives?

No response

A regex pattern matching extractor

Connected to a problem?

The current EntityExtractor searches for relevant entities using Named Entity Extraction models. However, sometimes, when the documents are in the same format, we can easily extract the values using regex. Having an easy way of extracting such values with regex would be useful.

Solution?

Develop a new extractor called PatternExtractor or RegexExtractor, which would receive a list of labels and regular expressions and extract the parts of the document that match them.
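
One possible shape for this extractor, as a rough sketch only: the constructor argument format and the keys of the returned entities are assumptions chosen for the example, not the interface of the existing EntityExtractor.

```python
import re
from typing import Dict, List


class PatternExtractor:
    """Hypothetical extractor that finds entities using labeled regex patterns."""

    def __init__(self, patterns: List[Dict[str, str]]):
        # Each pattern is assumed to be a dict: {"label": ..., "regex": ...}.
        self.patterns = [(p["label"], re.compile(p["regex"])) for p in patterns]

    def extract(self, text: str) -> List[Dict]:
        """Return all matches, ordered by their position in the text."""
        entities = []
        for label, pattern in self.patterns:
            for match in pattern.finditer(text):
                entities.append(
                    {
                        "text": match.group(0),
                        "label": label,
                        "start_index": match.start(),
                        "end_index": match.end(),
                    }
                )
        return sorted(entities, key=lambda e: e["start_index"])


extractor = PatternExtractor(
    [
        {"label": "date", "regex": r"\d{4}-\d{2}-\d{2}"},
        {"label": "email", "regex": r"[\w.+-]+@[\w-]+\.[\w.-]+"},
    ]
)
entities = extractor.extract("Invoice date: 2024-03-15, contact: jane@example.com")
for entity in entities:
    print(entity["label"], entity["text"])
```

Returning character offsets alongside the matched text keeps the output usable by any later step that replaces spans in the original document.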

Alternatives?

No response

Add an automatic date format detection to DateGenerator

Connected to a problem?

A document can contain dates in several different formats. Because of this, we currently need to initialize a separate DateGenerator for each date format in the document.

Solution?

Add support for date_format="auto", which would automatically select the appropriate date format for the given input.
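
One simple way to approach the detection is to try a list of candidate formats and keep the first one that parses. The sketch below uses only the standard library; `detect_date_format` and `CANDIDATE_FORMATS` are hypothetical names, and the list of formats is an assumption that would need to be extended for real documents.

```python
from datetime import datetime
from typing import Optional, Sequence

# Candidate formats, tried in order (an illustrative, incomplete list).
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y", "%d.%m.%Y")


def detect_date_format(
    date_text: str, formats: Sequence[str] = CANDIDATE_FORMATS
) -> Optional[str]:
    """Return the first format that parses `date_text`, or None if none do."""
    for fmt in formats:
        try:
            datetime.strptime(date_text, fmt)
            return fmt
        except ValueError:
            continue
    return None


print(detect_date_format("2024-03-15"))  # %Y-%m-%d
print(detect_date_format("15/03/2024"))  # %d/%m/%Y
print(detect_date_format("not a date"))  # None
```

Note that the first matching format wins, so the order of the candidates matters: adding a month-first format such as `%m/%d/%Y` would make inputs like 03/04/2024 ambiguous.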

Alternatives?

No response

Implement unit tests

Connected to a problem?

The package currently lacks unit tests. Having them would improve its stability and enable faster bug discovery in the code.

Solution?

Implement unit tests in the /test folder, specifically for:

The easiest way would be to use the unittest package.
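
For reference, a unittest-based test file could look roughly like the sketch below. The function under test, `redact_digits`, is a stand-in defined inline so that the example is self-contained; real tests would import the anonipy components instead.

```python
import re
import unittest


def redact_digits(text: str) -> str:
    """Stand-in function under test: replace every digit with '#'."""
    return re.sub(r"\d", "#", text)


class TestRedactDigits(unittest.TestCase):
    def test_replaces_digits(self):
        self.assertEqual(redact_digits("Call 555-1234"), "Call ###-####")

    def test_text_without_digits_is_unchanged(self):
        self.assertEqual(redact_digits("no digits here"), "no digits here")

    def test_empty_string(self):
        self.assertEqual(redact_digits(""), "")
```

Saved under the test folder with a `test_` filename prefix, such a file can be discovered and run with `python -m unittest discover`.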

Alternatives?

No response
