dataqa / nlp-labelling

Labelling platform for text using weak supervision.

License: GNU General Public License v3.0

HTML 0.14% JavaScript 51.62% Python 48.16% Dockerfile 0.07%

data-labeling data-science text-annotation-tool annotation-tool text-mining search-engine nlp natural-language-processing nlp-machine-learning pseudo-labeling

nlp-labelling's Introduction

DataQA

DataQA is a tool to label and explore unstructured documents. It uses rules-based weak supervision to significantly reduce the number of labels needed compared to other tools. Here are a few things you can do with it:

Search your documents using Elasticsearch's powerful text search engine,
Classify your documents,
Extract entities from your own data or from Wikipedia,
Link mentions of entities to your own ontology.

... and it's all available with a simple pip command!

Installation
Usage
What is weak supervision and why does it work?
Tutorials
Contact

Installation

Pre-requisites:

Python 3.6, 3.7, 3.8 and 3.9
(Recommended) start a new python virtual environment
Update pip: pip install -U pip
Tested on backend: MacOSX, Ubuntu. Tested on browser: Chrome, Firefox.

Installing from pypi

pip install dataqa

To run with Docker

The first time it is run: docker run -d -p 5000:5000 dataqa/dataqa
To keep the data between runs, stop and restart the same container with docker stop [container-id] and docker start [container-id]

Usage

In the terminal, type dataqa run. The first start-up can take a few minutes.

Doing this will run a server locally and open a browser window at port 5000. If the application does not open the browser automatically, open localhost:5000 in your browser. You need to keep the terminal open.

To quit the application, press Ctrl-C in the terminal. To resume it, type dataqa run again. Running the application creates a folder at $HOME/.dataqa_data.

Uploading data

The uploaded file needs to be a CSV file in UTF-8 encoding, up to 30MB in size, with a column named "text" that contains the main text. The other columns are ignored.

This step runs some analysis on your text and might take up to 5 minutes.
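The upload requirements above can be checked locally before opening the browser. The following is a minimal sketch using only the Python standard library; the function name check_upload_file is hypothetical and not part of DataQA, and reading "30MB" as 30 * 1024 * 1024 bytes is an assumption.

```python
import csv
import os

# "up to 30MB" comes from the README; the exact byte count is an assumption.
MAX_BYTES = 30 * 1024 * 1024


def check_upload_file(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 30MB")
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            header = reader.fieldnames or []
            if "text" not in header:
                problems.append('no column named "text"')
            for _ in reader:
                pass  # reading every row forces a full UTF-8 decode
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems
```

Calling check_upload_file("my_documents.csv") (a hypothetical file name) returns an empty list when all three conditions hold.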

Uninstall

In the terminal:

dataqa uninstall: this deletes your local application data in the home directory in the folder .dataqa_data. It will prompt the user before deleting.
pip uninstall dataqa

Does this tool need an internet connection?

Nope. No data will ever leave your local machine.

Troubleshooting

If the project data does not load, try going to the homepage at http://localhost:5000 and navigating to the project from there.

Try running dataqa test to get more information about the error. Bug reports are very welcome!

To test the application, you can upload a file that contains a column named "__LABEL__". The ground-truth labels are then displayed during labelling, and the real performance is shown in brackets in the performance table.
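A small test file with ground-truth labels can be produced with the standard csv module. This is only a sketch: the helper name build_labelled_csv, the example sentences and the label values "complaint" and "praise" are made up for illustration, and the README does not say which label values the "__LABEL__" column accepts.

```python
import csv
import io


def build_labelled_csv(rows):
    """Serialise (text, label) pairs as CSV text with "text" and "__LABEL__" columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "__LABEL__"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example rows; replace them with your own documents and labels.
sample = build_labelled_csv([
    ("The delivery arrived two days late.", "complaint"),
    ("Great service, thank you!", "praise"),
])
```

Writing sample to a file opened with encoding="utf-8" gives a CSV that also satisfies the upload requirements described earlier.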

Documentation

Documentation at: https://dataqa.ai/docs/.

To get started with a multi-class classification problem, go here.
To get started with a named entity recognition problem, go here.
To get started with a named entity linking problem, go here.

What is weak supervision and why does it work?

Weak supervision is a set of techniques to produce noisy labels for large quantities of data. It has gained popularity in recent years due to the large amounts of data typically needed for ML systems. Annotators are able to encode any prior domain knowledge they have in the form of rules. Even though these rules can be noisy, the algorithm learns how to weigh them accordingly and uses them as signals to extract patterns from the data.
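To make the idea of rules as noisy labellers concrete, here is a minimal sketch in plain Python. It is not DataQA's algorithm: the learned weighting described above is replaced by a simple unweighted majority vote, and the rule functions, keywords and label names are all hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion about a document


def rule_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN


def rule_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN


def rule_late(text):
    return "complaint" if "late" in text.lower() else ABSTAIN


RULES = [rule_refund, rule_thanks, rule_late]


def weak_label(text, rules=RULES):
    """Combine the rule votes by unweighted majority; abstain if no rule fires."""
    votes = [vote for vote in (rule(text) for rule in rules) if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    # On a tie, Counter.most_common keeps the label that was voted for first.
    return Counter(votes).most_common(1)[0][0]
```

Each rule only covers part of the data and may be wrong on some of it. A real weak supervision system would estimate how reliable each rule is instead of counting every vote equally.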

Creating a rule for classification

Creating a rule for NER

Contact

For any feedback, please contact us at [email protected]. Also follow me on for more updates and content around ML and labelling.

nlp-labelling's Issues

Make Ports configurable

Please make the Flask port configurable. I need to run it on a port other than 5000.

Onboarding Suggestion: Make docs easier to find

I found the documentation on the organization page, and it is thorough and helpful, especially the tutorial section.

However, it is hard to find, with only a passing mention on the repo page.
I'd suggest:

Add a link to the docs in the repo's About section
In the README, add a Docs section with links to each section in the docs

This would make it easier for a potential user to get a sense of what the platform is before deciding to install.

During onboarding, the upload button should be disabled until a project name is set

I uploaded a file by pressing upload, but got an error message because I didn't set a project name.

Instead consider either:

Disable the upload button until a project name is set (use the disabled prop on the MUI Button)
Or, add a second button, "select file", and only show the upload button when a name is set and a file is selected
Best: use the Material UI Stepper to let the user know what needs to be done

I uploaded a file, now what?

I uploaded a file, then I was sent here:

I've just been here. What should I do now?

As a new user, I'd like you to:

Give me a confirmation that my data was uploaded successfully.
Now that I'm in a new state (I have data), tell me what to do with it.

BTW, the terminal shows this:

So I guess things went well.

dataqa: command not found

Greetings! When trying to run the application in Python 3.8.10 with dataqa run I get dataqa: command not found (installation with pip was successful). Do you know what could be the issue?
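One common cause of a "command not found" error after a successful pip install is that the directory where pip places console scripts is not on PATH. Whether that is the cause in this report is not known. The sketch below uses only standard-library calls, and the function name diagnose_console_script is hypothetical.

```python
import os
import shutil
import sysconfig


def diagnose_console_script(name="dataqa"):
    """Report where the current interpreter installs scripts and whether PATH covers it."""
    scripts_dir = sysconfig.get_path("scripts")
    path_entries = os.environ.get("PATH", "").split(os.pathsep)
    return {
        "found_on_path": shutil.which(name),  # None if the command cannot be found
        "scripts_dir": scripts_dir,
        "scripts_dir_on_path": scripts_dir in path_entries,
    }
```

Run it with the same interpreter that pip used. If the package was installed with pip install --user, the scripts may be in a per-user directory instead, which this sketch does not check.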

The screenshots in the README don't help

The screenshots in the README don't look "nice" and don't help me understand what the tool does.

I'd consider replacing them with one video or animated gif that highlights the core differentiator of the tool, which is (rule based) search.

Show me one quick video that makes me say wow instead of enumerating capabilities with screenshots

Onboarding Suggestion: Tell me why I should use this

The README opens with a welcome message and a logo, but it doesn't tell me why I might want to use this.

I happen to be in the industry and think about this stuff a lot, so it clicks for me quickly. But at this stage of the project, you're probably looking for early adopters who would try it out, and you need to tell them what's the best thing that could happen to them if they do, e.g. "Instead of spending days labeling data you can write a few rules with dataqa and get X, Y, Z in minutes".

Allow selection of column name in CSV upload

I have a csv but the text column is named content, not text.
As a new user, I have a limited "grace budget" for onboarding, and it doesn't cover opening the csv and renaming the column. I don't yet know what dataqa is, so I can't be bothered (hypothetically).
Thus when I get the following, I will churn:

I'd suggest detecting column names and letting the user select the relevant column. You can use PapaParse to parse the CSV client-side and then let the user select the proper column.
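Until column selection exists, a user-side workaround is to rewrite the CSV so that the chosen column is named "text". The following is a minimal sketch with the standard csv module; the function name rename_text_column is hypothetical, and it writes only the text column because the README states that other columns are ignored.

```python
import csv


def rename_text_column(src_path, dst_path, source_column):
    """Copy one column of a CSV into a new CSV whose only column is named "text"."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        if source_column not in (reader.fieldnames or []):
            raise ValueError(f"column {source_column!r} not found in {src_path}")
        with open(dst_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=["text"])
            writer.writeheader()
            for row in reader:
                writer.writerow({"text": row[source_column]})
```

For the case described in this issue, rename_text_column("data.csv", "data_text.csv", "content") would produce an uploadable file (both file names are placeholders).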

Exception when Running Docker

When I run the docker container I get:

Exception in thread "main" java.io.IOException: Cannot run program "/usr/local/lib/python3.9/dist-packages/dataqa_es/server/elasticsearch-7.9.2/jre-15/bin/java": error=0, Failed to exec spawn helper: pid: 214, exit value: 1
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1142)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1073)
    at org.elasticsearch.tools.launchers.JvmErgonomics.flagsFinal(JvmErgonomics.java:114)
    at org.elasticsearch.tools.launchers.JvmErgonomics.finalJvmOptions(JvmErgonomics.java:88)
    at org.elasticsearch.tools.launchers.JvmErgonomics.choose(JvmErgonomics.java:59)
    at org.elasticsearch.tools.launchers.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:137)
    at org.elasticsearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:95)
Caused by: java.io.IOException: error=0, Failed to exec spawn helper: pid: 214, exit value: 1
    at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
    at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:313)
    at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
    at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1109)
    ... 6 more

Please mention when applicable:

Docker version 20.10.12, build e91ed5
Latest dataqa/dataqa
macOS Monterey Version 12.1.

Resolve the Python 3.8+ installation issue

The README specifies extra work if I'm using Python >3.7.
I don't want to do extra work or even check what version of Python I'm on. For potential users, having to deal with installation hassle is a turn-off.

You, as package maintainer, can make my life easier and get more users by overriding snorkel's dependency on networkx 2.5 in the setup.py file.

Installation issue: Dependency conflict

Thank you for the library.

I am unable to install it locally:

Environment:
MacOSX (Big Sur 11.5.2)
Used python virtualenv (python v3.9.6)

(vtest) dev@npl:(~/Downloads) % pip install dataqa

Collecting dataqa
  Downloading dataqa-1.0.2-py3-none-any.whl (16.1 MB)
     |████████████████████████████████| 16.1 MB 5.4 MB/s
Collecting pandas==0.25.0
  Downloading pandas-0.25.0.tar.gz (12.6 MB)
     |████████████████████████████████| 12.6 MB 1.2 MB/s
Collecting requests==2.23.0
  Downloading requests-2.23.0-py2.py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 6.3 MB/s
Collecting scikit-learn==0.21.3
  Downloading scikit-learn-0.21.3.tar.gz (12.2 MB)
     |████████████████████████████████| 12.2 MB 6.0 MB/s
Collecting dataqa
  Downloading dataqa-1.0.1-py3-none-any.whl (16.1 MB)
     |████████████████████████████████| 16.1 MB 5.6 MB/s
ERROR: Cannot install dataqa==1.0.1 and dataqa==1.0.2 because these package versions have conflicting dependencies.

The conflict is caused by:
    dataqa 1.0.2 depends on dataqa-es==0.0.2
    dataqa 1.0.1 depends on dataqa-es==0.0.1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
WARNING: You are using pip version 21.1.2; however, version 21.2.4 is available.
You should consider upgrading via the '/Users/nabin/Downloads/vtest/bin/python -m pip install --upgrade pip' command.

Installation

Tried to install and got an error message

Please mention when applicable:

python version 3.6.9
OS version Ubuntu 18
browser version NA
version of dataqa 1.0.6

Where do I set class names?

So I uploaded data and made a project.
I got excited when I could add a rule, but I don't have any classes defined and I don't know where to add them.

I went to projects but I couldn't find a way to add classes

As a new user, I have churned now
