
Open source no-code system for text annotation and building of text classifiers

Home Page: https://www.label-sleuth.org

License: Apache License 2.0

Python 53.26% Dockerfile 0.06% HTML 0.20% CSS 3.30% TypeScript 43.18%
active-learning annotation-tool labeling-tool nlp no-code python pytorch react text-annotation text-classification transformers

label-sleuth's Introduction

Quick Start   |   Documentation   |   Join Slack

Label Sleuth


Label Sleuth is an open source no-code system for text annotation and building text classifiers. With Label Sleuth, domain experts (e.g., physicians, lawyers, psychologists) can quickly create custom NLP models by themselves, with no dependency on NLP experts.

Creating real-world NLP models typically requires a combination of two types of expertise: deep knowledge of the target domain, provided by domain experts, and machine learning knowledge, provided by NLP experts. This makes domain experts dependent on NLP experts. Label Sleuth eliminates this dependency. With an intuitive UX, it guides domain experts through the process of labeling the data and building NLP models tailored to their specific needs. As domain experts label examples within the system, machine learning models are automatically trained in the background, make predictions on new examples, and suggest to users which examples they should label next.

Label Sleuth is a no-code system: no knowledge of machine learning is needed. And it is fast to obtain a model: from task definition to a working model in just a few hours!

Table of contents

Installation for end users

Setting up a development environment

Project structure

Using the system

Customizing the system

Reference

Installation for end users (non-developers)

Follow the instructions on our website.

Setting up a development environment

The system requires Python 3.8 or 3.9 (other versions are currently not supported and may cause issues).

  1. Clone the repository:

    git clone git@github.com:label-sleuth/label-sleuth.git

  2. cd to the cloned directory: cd label-sleuth

  3. Install the project dependencies using conda (recommended) or pip:

Installing with conda

# Create and activate a virtual environment:
conda create --yes -n label-sleuth python=3.9
conda activate label-sleuth
# Install requirements
pip install -r requirements.txt

Installing with pip

Assuming Python 3.8/3.9 is already installed.

pip install -r requirements.txt

  4. Start the Label Sleuth server: run python -m label_sleuth.start_label_sleuth.

    By default, all project files are written to <home_directory>/label-sleuth; to change the directory, add --output_path <your_output_path>.

    You can add --load_sample_corpus wiki_animals_2000_pages to load a sample corpus into the system at startup. This fetches a collection of Wikipedia documents from the data-examples repository.

    By default, the host is localhost, so the server is exposed only on the host machine. If you wish to expose the server to external communication, add --host <IP>; for example, --host 0.0.0.0 to listen on all IPs.

    The default port is 8000; to change it, add --port <port_number> to the command.

    The system can then be accessed by browsing to http://localhost:8000 (or http://localhost:<port_number>).

Project Structure

The repository consists of a backend library, written in Python, and a frontend that uses React. A compiled version of the frontend can be found under label_sleuth/build.

Using the system

See our website for a simple tutorial that illustrates how to use the system with a sample dataset of Wikipedia pages. Before starting the tutorial, make sure you pre-load the sample dataset by running:

python -m label_sleuth.start_label_sleuth --load_sample_corpus wiki_animals_2000_pages.

Customizing the system

System configuration

The configurable parameters of the system are specified in a json file. The default configuration file is label_sleuth/config.json.

A custom configuration can be applied by passing the --config_path parameter to the "start_label_sleuth" command, e.g., python -m label_sleuth.start_label_sleuth --config_path <path_to_my_configuration_json>

Alternatively, it is possible to override specific configuration parameters at startup by adding them to the run command, e.g., python -m label_sleuth.start_label_sleuth --changed_element_threshold 100

Configurable parameters:

• first_model_positive_threshold: Number of elements that must be assigned a positive label for the category in order to trigger the training of a classification model.
  See also: The training invocation documentation.

• first_model_negative_threshold: Number of elements that must be assigned a negative label for the category in order to trigger the training of a classification model.
  See also: The training invocation documentation.

• changed_element_threshold: Number of changes in user labels for the category -- relative to the last trained model -- that are required to trigger the training of a new model. A change can be assigning a label (positive or negative) to an element, or changing an existing label. Note that first_model_positive_threshold must also be met for the training to be triggered.
  See also: The training invocation documentation.

• training_set_selection_strategy: Strategy to be used from TrainingSetSelectionStrategy. A TrainingSetSelectionStrategy determines which examples will be sent in practice to the classification models at training time - these will not necessarily be identical to the set of elements labeled by the user. For currently supported implementations see get_training_set_selector().
  See also: The training set selection documentation.

• model_policy: Policy to be used from ModelPolicies. A ModelPolicy determines which type of classification model(s) will be used, and when (e.g., always / only after a specific number of iterations / etc.).
  See also: The model selection documentation.

• active_learning_strategy: Strategy to be used from ActiveLearningCatalog. An ActiveLearner module implements the strategy for recommending the next elements to be labeled by the user, aiming to increase the efficiency of the annotation process. For currently supported implementations see the ActiveLearningCatalog.
  See also: The active learning documentation.

• precision_evaluation_size: Sample size to be used for estimating the precision of the current model. To be used in future versions of the system, which will provide built-in evaluation capabilities.

• apply_labels_to_duplicate_texts: Specifies how to treat elements with identical texts. If true, assigning a label to an element will also assign the same label to other elements which share the exact same text; if false, the label will only be assigned to the specific element labeled by the user.

• language: Specifies the chosen system-wide language. This determines some language-specific resources that will be used by models and helper functions (e.g., stop words). The list of supported languages can be found in Languages. We welcome contributions of additional languages.

• login_required: Specifies whether or not using the system will require user authentication. If true, the configuration file must also include a users parameter.

• users: Only relevant if login_required is true. Specifies the pre-defined login information in the following format:
"users":[
 {
   "username": "<predefined_username1>",
   "token":"<randomly_generated_token1>",
   "password":"<predefined_user1_password>"
 }
]
* The list of usernames is static and currently all users have access to all the workspaces in the system.

Implementing new components

Label Sleuth is a modular system. We welcome the contribution of additional implementations for the various modules, aiming to support a wider range of user needs and to harness efficient and innovative machine learning algorithms.

Below are instructions for implementing new models and active learning strategies:

Implementing a new machine learning model

These are the steps for integrating a new classification model:

  1. Implement a new ModelAPI

Machine learning models are integrated by adding a new implementation of the ModelAPI.

The main functions are _train(), load_model() and infer():

def _train(self, model_id: str, train_data: Sequence[Mapping], model_params: Mapping):
  • model_id
  • train_data - a list of dictionaries with at least the "text" and "label" fields. Additional fields can be passed e.g. [{'text': 'text1', 'label': 1, 'additional_field': 'value1'}, {'text': 'text2', 'label': 0, 'additional_field': 'value2'}]
  • model_params - dictionary for additional model parameters (can be None)
def load_model(self, model_path: str):
  • model_path: path to a directory containing all model components

Returns an object that contains all the components that are necessary to perform inference (e.g., the trained model itself, the language recognized by the model, a trained vectorizer/tokenizer etc.).

def infer(self, model_components, items_to_infer) -> Sequence[Prediction]:
  • model_components: the return value of load_model(), i.e. an object containing all the components that are necessary to perform inference
  • items_to_infer: a list of dictionaries with at least the "text" field. Additional fields can be passed, e.g. [{'text': 'text1', 'additional_field': 'value1'}, {'text': 'text2', 'additional_field': 'value2'}]

Returns a list of Prediction objects - one for each item in items_to_infer - where Prediction.label is a boolean and Prediction.score is a float in the range [0-1]. Additional outputs can be passed by inheriting from the base Prediction class and overriding the get_predictions_class() method.
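As a rough illustration, a scikit-learn based implementation might look like the sketch below. This is a minimal sketch under assumptions: the import path of ModelAPI and Prediction, the get_model_dir_by_id() helper, the Prediction constructor arguments, and the pickle-based persistence are guesses for illustration only; consult the existing implementations in the repository for the actual conventions.

import math
import os
import pickle
from typing import Mapping, Sequence

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# assumed import path; see the models package in the repository
from label_sleuth.models.core.model_api import ModelAPI, Prediction


class TfidfSvmModel(ModelAPI):
    def _train(self, model_id: str, train_data: Sequence[Mapping], model_params: Mapping):
        texts = [item["text"] for item in train_data]
        labels = [item["label"] for item in train_data]
        vectorizer = TfidfVectorizer()
        classifier = LinearSVC()
        classifier.fit(vectorizer.fit_transform(texts), labels)
        # persist everything needed for inference under the model's directory
        # (get_model_dir_by_id is an assumed helper for locating that directory)
        model_dir = self.get_model_dir_by_id(model_id)
        with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
            pickle.dump({"vectorizer": vectorizer, "classifier": classifier}, f)

    def load_model(self, model_path: str):
        # return an object holding all components needed for inference
        with open(os.path.join(model_path, "model.pkl"), "rb") as f:
            return pickle.load(f)

    def infer(self, model_components, items_to_infer) -> Sequence[Prediction]:
        features = model_components["vectorizer"].transform(
            [item["text"] for item in items_to_infer])
        # squash the SVM margin into a [0, 1] score with a sigmoid
        margins = model_components["classifier"].decision_function(features)
        scores = [1 / (1 + math.exp(-margin)) for margin in margins]
        return [Prediction(label=score > 0.5, score=score) for score in scores]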

  2. Add the newly implemented ModelAPI to ModelsCatalog

  3. Add one or more policies that use the new model to ModelPolicies

Implementing a new active learning strategy

These are the steps for integrating a new active learning approach:

  1. Implement a new ActiveLearner

Active learning modules are integrated by adding a new implementation of the ActiveLearner API. The function to implement is get_per_element_score:

 def get_per_element_score(self, candidate_text_elements: Sequence[TextElement],
                           candidate_text_element_predictions: Sequence[Prediction], workspace_id: str,
                           dataset_name: str, category_name: str) -> Sequence[float]:    

Given sequences of text elements and the model predictions for these elements, this function returns an active learning score for each element. The elements with the highest scores will be recommended for the user to label next.
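For example, a simple uncertainty-sampling strategy could be sketched roughly as follows. The import path of the ActiveLearner base class is an assumption, and only the scoring method is shown:

from typing import Sequence

# assumed import path; see the active learning package in the repository
from label_sleuth.active_learning.core.active_learning_api import ActiveLearner


class UncertaintyActiveLearner(ActiveLearner):
    def get_per_element_score(self, candidate_text_elements,
                              candidate_text_element_predictions,
                              workspace_id: str, dataset_name: str,
                              category_name: str) -> Sequence[float]:
        # The closer a prediction score is to the 0.5 decision boundary, the
        # less certain the model is, and the higher the score we return.
        return [1 - abs(prediction.score - 0.5) * 2
                for prediction in candidate_text_element_predictions]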

  2. Add the newly implemented ActiveLearner to the ActiveLearningCatalog

Reference

Eyal Shnarch, Alon Halfon, Ariel Gera, Marina Danilevsky, Yannis Katsis, Leshem Choshen, Martin Santillan Cooper, Dina Epelboim, Zheng Zhang, Dakuo Wang, Lucy Yip, Liat Ein-Dor, Lena Dankin, Ilya Shnayderman, Ranit Aharonov, Yunyao Li, Naftali Liberman, Philip Levin Slesarev, Gwilym Newton, Shila Ofek-Koifman, Noam Slonim and Yoav Katz (EMNLP 2022). Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours.

Please cite:

@inproceedings{shnarch2022labelsleuth,
	title={{L}abel {S}leuth: From Unlabeled Text to a Classifier in a Few Hours},
	author={Shnarch, Eyal and Halfon, Alon and Gera, Ariel and Danilevsky, Marina and Katsis, Yannis and Choshen, Leshem and Cooper, Martin Santillan and Epelboim, Dina and Zhang, Zheng and Wang, Dakuo and Yip, Lucy and Ein-Dor, Liat and Dankin, Lena and Shnayderman, Ilya and Aharonov, Ranit and Li, Yunyao and Liberman, Naftali and Slesarev, Philip Levin and Newton, Gwilym and Ofek-Koifman, Shila and Slonim, Noam and Katz, Yoav},
	booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing ({EMNLP}): System Demonstrations},
	month={dec},
	year={2022},
	address={Abu Dhabi, UAE},
	publisher={Association for Computational Linguistics},
	url={https://aclanthology.org/2022.emnlp-demos.16},
	pages={159--168}
}

label-sleuth's People

Contributors

alonh, arielge, avishaig-1, christina-ml, credo99, dina-ibm, eeelnico, eford36, farnazj, ilyashnil, imolina212, kelp710, lenadankin, lyip12, martinscooper, roryzhengzhang, shaigrt, shnarch, stefanoscotta, werner33, williamtaggart97, yannisk2


label-sleuth's Issues

Models are using the GPU thread pool even if GPU is not available

What is the current behavior?
Models with GPU support are using the GPU thread pool even if GPU is not available

What is the expected behavior?
Models with GPU support will use the GPU thread pool only if a GPU is available

Steps to reproduce the problem:

  1. train a model with GPU support on a device without GPU
  2. check the logs and see Adding training for model id HFTransformers_0401a12a-2476-11ed-9df5-0a94ef3e9940 into the GPU_1_threadpool

Environment:

  • OS:
  • Web browser:
  • Python version:

Suggested solution (if any):
In ModelsBackgroundJobsManager.get_executor(), check, in addition to the current checks, whether definitions.GPU_AVAILABLE is True.
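A minimal sketch of the proposed check; apart from get_executor() and the GPU_AVAILABLE flag mentioned above, all names are assumptions rather than the actual implementation:

import concurrent.futures

GPU_AVAILABLE = False  # in the real system this would come from the definitions module


class ModelsBackgroundJobsManager:
    def __init__(self):
        self.cpu_executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)
        self.gpu_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def get_executor(self, model_supports_gpu: bool):
        # use the GPU thread pool only when the model supports GPU *and* a GPU is available
        if model_supports_gpu and GPU_AVAILABLE:
            return self.gpu_executor
        return self.cpu_executor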

Add a blue frame to the elements on the right panel

Is your feature request related to a problem?

To denote that an element is predicted as positive by the last model, main panel elements have a blue frame around them, like in the following image.
image
The style for predicted elements on the sidebar panels is different: they don't have that blue frame.

What is the expected behavior?

To be consistent, add a blue frame to the sidebar predicted elements as well

underscore added to category name in the upload labeled data feature

What is the current behavior?
I’ve downloaded the labeled data of a workspace with a category called “Employee Count”.
When I upload this file to a new workspace, the category that is created is called “Employee_Count” (note the underscore that was added).

What is the expected behavior?
the underscore should not be added to the category name.

Steps to reproduce the problem:

  1. in workspace A, create a category name which includes spaces
  2. label a few examples
  3. export the labeled data
  4. open a new workspace B
  5. in workspace B import the labeled data from step 3
  6. look at the category name in workspace B

Environment:

  • OS: macOS Monterey V 12.6
  • Web browser: Chrome
  • Python version: 3.9.13

Suggested solution (if any):
I guess there is some escaping code which doesn't like spaces and replaces them with underscores.

Focus next sidebar element when labeling using shortcuts

Is your feature request related to a problem?

Currently, labeling a sidebar element on the Label Next or Evaluation sidebar panels automatically focuses the next element. This automatic focus doesn't happen in the other panels.
What is the expected behavior?

The automatic focus of the next sidebar element should be consistent throughout all the panels. Thus, add this behavior to the other panels as well.

Add icons for accessing Github, Slack and Webpage on the UI

Currently the UI displays a link to easily access the webpage at the bottom left corner:
image

Update this by replacing that link with three new icons that work as links to the Github page of Label Sleuth, an invitation to the Label Sleuth Slack workspace and the webpage.

Change labeling button order

Is your feature request related to a problem?
From a UX design perspective: "It is easier to aim and click a button positioned in the corner than aim for a button that is placed 2nd from the corner." Currently the X (negative label action) is the one closer to the corner and the V (positive label action) is the second one.
image

What is the expected behavior?
Place the V button on the right side (closer to the corner). The sidebar shortcuts for labeling actions should also be switched. Arrow left would label the focused element as negative and arrow right would label the focused element as positive.

Pagination in right panel element lists

Is your feature request related to a problem?
Currently the element lists on the right panel are static and have a predefined size. This means the user cannot view more than X results, and/or must scroll through a very long list.

What is the expected behavior?
Pagination

Additional context (if any):
Some of the relevant endpoints in the backend already support pagination, but this is not reflected in the frontend.

Error when deleting workspace

What is the current behavior?
Sometimes, deleting a workspace fails.

Request:
image

Error stacktrace:

2022-10-06 12:15:01,574 ERROR    [orchestrator_api.py:101] error deleting workspace 'www'
Traceback (most recent call last):
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 98, in delete_workspace
    self._delete_category_models(workspace_id, category_id)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 341, in _delete_category_models
    for idx in range(len(workspace.categories[category_id].iterations)):
AttributeError: 'NoneType' object has no attribute 'iterations'
2022-10-06 12:15:01,575 ERROR    [app.py:1455] Exception on /workspace/www [DELETE]
Traceback (most recent call last):
  File "/Users/martin/opt/miniconda3/envs/label-sleuth-test/lib/python3.8/site-packages/flask/app.py", line 2077, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/martin/opt/miniconda3/envs/label-sleuth-test/lib/python3.8/site-packages/flask/app.py", line 1525, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth-test/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/Users/martin/opt/miniconda3/envs/label-sleuth-test/lib/python3.8/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/martin/opt/miniconda3/envs/label-sleuth-test/lib/python3.8/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/authentication.py", line 32, in wrapper
    return function(*args, **kwargs)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/app_utils.py", line 36, in wrapper
    return function(workspace_id, *args, **kwargs)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/app.py", line 246, in delete_workspace
    curr_app.orchestrator_api.delete_workspace(workspace_id)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 102, in delete_workspace
    raise e
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 98, in delete_workspace
    self._delete_category_models(workspace_id, category_id)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 341, in _delete_category_models
    for idx in range(len(workspace.categories[category_id].iterations)):
AttributeError: 'NoneType' object has no attribute 'iterations'

What is the expected behavior?
The workspace should be deleted.

Steps to reproduce the problem:

Call the /delete_workspace endpoint.

Scroll to top of Label Next when a new model becomes available

Is your feature request related to a problem?
When I am labeling in the Label Next panel and a new model becomes available, the elements are automatically reloaded and so the element I was on is no longer available. That is fine, but now I am at some arbitrary place in the list. Maybe the element ordering is arbitrary, but I always find myself manually scrolling back up to the top when this happens.

What is the expected behavior?
When a new model is loaded, the Label Next panel should scroll up to the top.

Additional context (if any):
N/A

[Discussion] Frontend progressive migration to Typescript

I would like to propose starting to migrate the frontend codebase from JavaScript to TypeScript. TypeScript is a superset of JavaScript that provides optional static typing, classes, and interfaces. Thus, any JavaScript code can be run as TypeScript code.

Advantages

Some advantages of using Typescript over Javascript are:

  • It's considered a form of testing: although it is not a replacement for testing, it can help find bugs thanks to type checking. E.g., if a component expects a number as a prop and receives a string or null, it fails at compile time, while JavaScript would not only not fail but also introduce unpredictable behavior. Generally, it allows finding errors earlier.
  • Intelligent code completion: both for writing general code and for writing React components in tsx files, as component properties are defined statically.
  • Code is easier to understand: especially in an open source environment, where we want developers to easily understand what the code does in order to start contributing. There would be a contract between the parts of the application regarding how they communicate with each other, from what the frontend expects to receive from an API call to the props that a functional component should receive.
  • As TypeScript is widely used nowadays, major libraries like the ones we currently use in the frontend have support for TypeScript (i.e., types are published): React itself, Redux, MUI, React Testing Library, etc.
  • Migrating to TypeScript could also be an opportunity for developers who want to contribute to an open source project (lots of good first issues). Other open source projects have done this while gathering feedback from the community, especially for deciding whether to actually migrate to TypeScript or not: see this PR from the Jest repo and this page for a long list of examples.
  • Refactoring is much easier, as it can be done with confidence that other parts of the code will not break.

Migration

Migration can be done gradually. There are official migration guides and several open source projects that successfully did such a migration. A step-by-step plan for the migration could be as follows:

  • Start by adding the TypeScript compiler: as TS is a superset of JS, the compiler can compile the current source code and already give some advantages, like ensuring that the number of parameters passed to a function matches the number of parameters the function expects.
  • Start migrating the codebase. I would suggest starting with the Redux state, as it is where the majority of types have to be defined. Migrating React components should be straightforward.
  • Select a convention for new code: new code should be written in TypeScript; if a new feature has to make changes in pre-existing source files, migrate them as well.

Frontend issues with parentheses in the search bar

What is the current behavior?
Typing in parentheses or brackets in the search bar can cause a blank page to appear.

Steps to reproduce the problem:

  1. Search using the search bar, for any expression that returns some results in the dataset.
  2. After the results have returned, modify the contents of the search bar, and type in any one of the following characters: ][)(

Environment:

  • OS: macOS
  • Web browser: Chrome / Safari

Expose shortcuts so users are aware of them

Is your feature request related to a problem?

As part of issue #253, shortcuts have been added to perform actions on the sidebar panel. Currently, we don't expose information about which shortcuts are available anywhere. This means we are in an inconsistent state: because of shortcuts, a sidebar element is focused by default (i.e., has a yellow background) and the user has no clue why it has that style.

Based on a discussion with @yannisk2, we came up with four different options we could implement in order to let users know about the available shortcuts. Those are:

  • Tooltips: when a user hovers over a button that has an associated shortcut, show a tooltip with the action and the shortcut key. For example: the positive and negative labeling buttons. Not all shortcut actions have an associated button (e.g., going to the next or previous element).

Example from Microsoft Word:

image

  • Modal: display a modal that lists all the available shortcuts and explain them.

Jupyter follows this approach:
image

  • Notifications: we could display notifications with a message that reminds the user that there are shortcuts available for the action they are performing. Example: the user clicks an element to open its document on the main view -> a message informs them that pressing Enter would do the trick if the element is focused, or something like "You can press ? to see the available shortcuts", where pressing ? opens the modal described in the previous item.

  • Tutorial: expose the shortcuts on the tutorial. This could be done a) when showing the user how to use the sidebar panels or b) having a whole tutorial step dedicated to shortcuts.

Avoid scrolling into view a sidebar element when labeling it

What is the current behavior?
Labeling an element on the sidebar panel scrolls that element into view. This is not the expected behaviour.

What is the expected behavior?
Clicking a sidebar element is expected to scroll that element into view, but labeling it shouldn't.

Scrolling elements into view is not working properly

What is the current behavior?
Scrolling a DOM element into view is used extensively in the UI. The native functionality of the browsers is used for this; more specifically, the element.scrollIntoView() method. It receives several parameters: one is the position where the element should be located relative to its parent after the scrolling is applied (we use center), and another is the scrolling behavior, which can be instant or smooth. We use smooth, which is better for UX (but not supported in Safari).

The scrolling behavior is not working in some cases. It is used both on the main document view and on the sidebar panels. On the main panel, it is used to focus an element when a user clicks a sidebar element, to show its context; it works fine there. On the sidebar panels, users can navigate through the element list using shortcuts or the mouse (see shortcuts in #253). Scrolling into view is used when:

  1. The user presses arrow down or up.
  2. The user presses arrow left or right to label an element and the panel is configured so that the next element is focused when an element is labeled.
  3. The user clicks an element. That element gets focused and thus scrolled into view.
  • Case 1: works fine.
  • Case 2: doesn't work. I've already found the bug here. We have to use event.preventDefault() for the ArrowRight and ArrowLeft events.
  • Case 3: works on Safari, doesn't work on Chrome when the internet connection is fast. Why? To understand it, we have to be aware that two consecutive and concurrent scroll-into-view processes are happening. When a user clicks a sidebar element, that element is scrolled into view in the sidebar panel. At the same time, the focused element's document is requested and then scrolled into view on the main panel. If fetching the element finishes after the scrolling process on the sidebar panel, the bug doesn't occur. Instead, if the document is quickly fetched, the sidebar element's scroll into view gets cancelled. Chrome doesn't currently support two of these processes happening at the same time (for more info, see this SO question and this Chromium ticket).

Suggested solution (if any):
I will implement the Case 2 fix. For Case 3, a potential fix can be implemented using scrollTo() (link).

Add shortcuts to sidebar actions

Is your feature request related to a problem?

We want users to spend most of their time on the sidebar panels, e.g. the Label Next panel. Sometimes it is very easy and quick to decide whether an element has to be labeled as positive or negative. In such cases, labeling the elements by clicking the labeling buttons can take a long time.

What is the expected behavior?

Add shortcuts so the user can traverse the sidebar elements.
Proposed shortcuts are:

  • Arrow up: go to the previous element.
  • Arrow down: go to the next element.
  • Arrow left: label focused element as positive.
  • Arrow right: label focused element as negative.
  • Enter: focus the element on the main panel.

Behaviour description:

  • If the currently focused element is the last element of the page and the user focuses the next element, the first element of the next page (if any) will be focused.
  • If the currently focused element is the first element of the page and the user focuses the previous element, the last element of the previous page (if any) will be focused.
  • In the Label next panel, labeling an element will automatically focus the next element and scroll it into view (the need for this behavior is described in #240).
  • The following events will cause sidebar elements to be focused: labeling them and clicking them.

Add more sections to the contribution guidelines

I suggest to add the following items to the contribution guidelines:

  • Commit messages: we should at least point out that commit messages have to be descriptive. In addition, we could suggest using the commit convention that the carbon project uses and that I also use: https://github.com/carbon-design-system/carbon/blob/main/docs/developer-handbook.md#commit-conventions
  • How to work locally:
  • Git merge strategy: we should decide on what git merge strategy/ies to use when merging PR's.
    • carbon, for example, uses squash and merge. This has the advantage that the to-merge branch can have any number of commits, and they will be squashed into a single one when merged into main. Another advantage is that contributors can use the "update with main" functionality that GitHub provides, so this doesn't have to be done locally.
    • However, sometimes we want several of a PR's commits to be in main when merged because they make important changes. In this case we will have to use the Rebase and merge strategy. As in the previous strategy, a fast-forward merge is done. However, this implies that the branch has to be updated locally using git pull --rebase origin main, which changes the commit history, and the updated branch then has to be force-pushed. This will require contributors to have at least intermediate git skills. So I suggest not enforcing this option, but informing users that this can be done.
    • To enforce a linear history, we could add the rule Require linear history to the main branch. Both previous strategies maintain a linear history.
  • Developer's Certificate of Origin 1.1: clarify that it has to be added to the description of the PR instead of along with your Pull Request.
  • Change WIP PRs: suggest to create a draft PR instead of prefixing [WIP]

Mixed workspace information when using multiple tabs

What is the current behavior?
When a user works on several workspaces at the same time by selecting them in different tabs of the same browser, the information of the last selected workspace starts to be displayed in the others.

What is the expected behavior?
User should be able to work on multiple workspaces at the same time without any problems.

Steps to reproduce the problem:

  1. Create workspace w1
  2. Go to menu and create workspace w2
  3. Create a category and label some elements.
  4. Go to menu, select 'w1' and go to the Positive labels panel on the sidebar. You will see w2's positive labels.

Like this bug, many others can be found due to this shared memory issue.

Suggested solution (if any):
This is happening because localStorage is used to store the selected workspace so that it is maintained across page refreshes. To solve this, we just have to change from localStorage to sessionStorage, as sessionStorage is bound to the tab rather than to the domain.

[backend] error when generating the contradicting elements report if there are no positive or no negative elements

What is the current behavior?
The frontend gets stuck and a 500 error is returned from the backend

What is the expected behavior?
Get the contradicting elements if any exist

Steps to reproduce the problem:

  1. Create a workspace
  2. Label a few positive elements
  3. Click the contradicting report button

Environment:

  • OS: MacOS
  • Web browser: Chrome
  • Python version: 3.9

Suggested solution (if any):
Return an empty list if there are no positive or no negative elements
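A hypothetical sketch of such a guard; the function and parameter names are illustrative only, not the actual backend code:

def get_contradiction_pairs(positive_elements, negative_elements):
    # Without both positive and negative labels there is nothing to compare,
    # so return an empty report instead of failing with a 500 error.
    if not positive_elements or not negative_elements:
        return []
    pairs = []
    # ... the existing similarity-based pairing logic would populate `pairs` ...
    return pairs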

Model predictions in the "Positive labels" panel do not update

What is the current behavior?
After a new model is ready, the blue frames around the elements in the "Positive labels" panel still reflect the predictions of the previous model. After reopening the category / refreshing the page the correct predictions are shown.

What is the expected behavior?
The positive labels panel contents should be updated whenever a new model is ready.

Add optimistic updates on labeling actions

Is your feature request related to a problem?

Especially on slow network connections, when an element is labeled it takes some time for the frontend to receive the OK response from the backend. Moreover, labeling an element is not a request that is prone to fail. It can be a bit confusing to experience a delay in the element's style after the element has been labeled. The following video shows this issue.

Screen.Recording.2022-10-19.at.10.05.58.mov

What is the expected behavior?

An optimistic update would instantaneously update the style of the element based on what the user action was (positive or negative label). A mechanism that reverts the changes if the request to the backend failed should be implemented.

Feature flag management

Is your feature request related to a problem?

Currently, we manage system configuration separately in the frontend and in the backend. This implies that we maintain two configuration files: the frontend has env files and the backend a config file. This is actually required, and this issue doesn't aim to remove one of them. Frontend env files can't be removed because some of the configuration has to be set that way. For example, we necessarily need two env files, one for frontend local development and one for production, to have different values of the REACT_APP_API_URL environment variable. In addition, changing any of the frontend environment variables implies re-compiling the frontend source code. Managing feature flags will make re-compiling the frontend unnecessary.

However, environment variables that manage feature enablement or customization should be stored in a single source of truth component that other components can use to retrieve the value of those variables. I will refer to these variables as feature flags from now on.

At this stage, the backend uses several feature flags (see the system configuration page), while the frontend uses only one, called REACT_APP_AUTH_ENABLED, which enables or disables authentication. The corresponding feature flag is called login_required on the backend side. This is a case where both components, the frontend and the backend, depend on the same feature flag. Handling its value separately leads to inconsistencies when the feature flag has different values in the different components.

What is the expected behavior?

The backend should have a service that manages feature flags. This service would retrieve the feature flag values from a configuration file (or possibly from other sources like a database). The backend's REST API would expose an endpoint that allows the frontend to retrieve the value of (some of) the feature flags. This way: the backend services will consume the feature flag service and the frontend will consume the endpoint that consumes the feature flag service.

Possible bug regarding predictions when using BERT model

I'm experiencing the following bug. It seems quite strange to me, so it would be nice if anyone could verify it as well.

What is the current behavior?
Using the model implemented in hf_transformers.py, I see that the scores returned in the predictions are all >0.5. Hence, all the sentences of the dataset end up labelled as "True". I think this is due to the fact that the pipeline used to interpret the results of the BERT model returns the inferences as {'label': LABEL_x, 'score': d}, where d will always be >0.5, because it is the reason for which the label is x and not y.
image

What is the expected behavior?
It should assign to each element a score corresponding to the "probability" of that sentence being True, not the probability of the LABEL assigned by the model.

Steps to reproduce the problem:

  1. Perform a training run with the model policy set to "STATIC_HF_BERT".
  2. Then you will see that all the sentences are labelled as positive.

Environment:

  • Web browser: Chrome
  • Python version: 3.9

Suggested solution (if any):
I found that to solve this bug it is enough to replace the line with something like:

        # Convert the pipeline output to the probability of the positive class
        # (LABEL_1), regardless of which label the pipeline reports:
        scores = []
        for p in preds:
            if p['label'] == 'LABEL_0':
                scores.append(1 - p['score'])
            elif p['label'] == 'LABEL_1':
                scores.append(p['score'])

More consistent handling of exceptions in backend

There are currently some inconsistencies with how exceptions are logged and handled in the backend.

We should aim for the following:
A. Every exception that is explicitly raised within the backend modules should be caught upstream (usually by the app or the orchestrator). For instance, we do not currently catch exceptions raised by orchestrator_api.delete_workspace().
B. Each such raised exception should have an informative message, and this message should be logged where the exception is caught.
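A rough sketch of catching and logging such an exception at the app layer, assuming a Flask route (the route shape, error format, and message are illustrative only; current_app.orchestrator_api follows the usage visible in the stack traces elsewhere on this page):

import logging

from flask import Flask, current_app, jsonify

app = Flask(__name__)


@app.route("/workspace/<workspace_id>", methods=["DELETE"])
def delete_workspace(workspace_id):
    try:
        # the orchestrator call that may raise
        current_app.orchestrator_api.delete_workspace(workspace_id)
    except Exception:
        # log an informative message at the place where the exception is caught
        logging.exception("Failed to delete workspace '%s'", workspace_id)
        return jsonify({"error": f"could not delete workspace '{workspace_id}'"}), 500
    return jsonify({"workspace_id": workspace_id})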

Model training encounters error nondeterministically

What is the current behavior?
When training a model, sometimes the following error is encountered.
What is strange is that I am training two identical models, on identical data (differentiated using 2 different workspace names), and the error is encountered for one of them only (although on some rare occasions, the error is encountered for both).

[orchestrator_api.py:390] starting iteration 0 in background for workspace 'WS1' category id '0' using 7 items
[orchestrator_api.py:602] Workspace 'WS2' category id '0' 4 positive elements (>=4) 7 elements changed since last model (>=4). Training a new model
[orchestrator_api.py:398] workspace 'WS1' training a model for category id '0', train_statistics: {'train_counts': {'False': 3, 'True': 4}}
[train_set_selectors.py:38] using 7 for train using dataset DATASET
[orchestrator_api.py:390] starting iteration 0 in background for workspace 'WS2' category id '0' using 7 items
[orchestrator_api.py:398] workspace 'WS2' training a model for category id '0', train_statistics: {'train_counts': {'False': 3, 'True': 4}}
[models_background_jobs_manager.py:37] Adding training for model id SVM_BOW_289e6de2-1a94-11ed-add3-296780c3e733 into the CPU_10_threadpool
[models_background_jobs_manager.py:37] Adding training for model id SVM_BOW_289e8d0e-1a94-11ed-add3-296780c3e733 into the CPU_10_threadpool
[models_background_jobs_manager.py:37] Adding training for model id SVM_GloVe_289e6de3-1a94-11ed-add3-296780c3e733 into the CPU_10_threadpool
[models_background_jobs_manager.py:37] Adding training for model id SVM_GloVe_289e8d0f-1a94-11ed-add3-296780c3e733 into the CPU_10_threadpool
[tools.py:67] Done getting GloVe representations for 7 sentences
[tools.py:67] Done getting GloVe representations for 7 sentences
Adding training for model id SVM_BOW_289e6de2-1a94-11ed-add3-296780c3e733,SVM_GloVe_289e6de3-1a94-11ed-add3-296780c3e733 into the CPU_10_threadpool
training an ensemble model id SVM_BOW_289e6de2-1a94-11ed-add3-296780c3e733,SVM_GloVe_289e6de3-1a94-11ed-add3-296780c3e733 using 7 elements
[models_background_jobs_manager.py:37] Adding training for model id SVM_BOW_289e8d0e-1a94-11ed-add3-296780c3e733,SVM_GloVe_289e8d0f-1a94-11ed-add3-296780c3e733 into the CPU_10_threadpool
[ensemble.py:77] training an ensemble model id SVM_BOW_289e8d0e-1a94-11ed-add3-296780c3e733,SVM_GloVe_289e8d0f-1a94-11ed-add3-296780c3e733 using 7 elements
ERROR    [_base.py:332] exception calling callback for <Future at 0x7f0195972ca0 state=finished returned str>
Traceback (most recent call last):
  File "/home/farnaz/anaconda3/envs/label-sleuth/lib/python3.9/concurrent/futures/_base.py", line 330, in _invoke_callbacks
    callback(self)
  File "/root/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 433, in _train_done_callback
    self.orchestrator_state.update_model_status(workspace_id=workspace_id, category_id=category_id,
  File "/root/label-sleuth/label_sleuth/orchestrator/core/state_api/orchestrator_state_api.py", line 278, in update_model_status
    assert len(iterations) > iteration_index,\
AssertionError: Iteration '0' doesn't exist in workspace 'WS2'

Note that iteration '0' is not the only iteration where this can happen. In one case I encountered this error in iteration '2', with 15 labeled datapoints:
train_statistics: {'train_counts': {'True': 9, 'False': 6}}

From then on, the logs show the following message for the workspace:
workspace 'WS2' category id '0' new elements criterion was met but previous AL not yet ready, not initiating a new training

Steps to reproduce the problem:
These settings are changed from the default config:

"first_model_positive_threshold": 4,
"changed_element_threshold": 4,
"training_set_selection_strategy": "ALL_LABELED"

This error is encountered non-deterministically (not always encountered).

Environment:

  • OS: Ubuntu 20.04.4 LTS
  • Python version: 3.9.12

Make tutorial modal appear only once

Is your feature request related to a problem?

The modal that informs the user that there is a tutorial available is opened each time the user enters a workspace. Because this is implemented using Redux state, if a user refreshes the page while on a workspace, the modal is opened again.

image

What is the expected behavior?

Prevent this from happening by using the browser's local storage to store this information, so that the modal is opened only once, i.e., the first time a user enters the workspace.

Performance improvement: preload the dataset when a user selects a workspace

Is your feature request related to a problem?

When a user enters a workspace that uses a big dataset, it can take a while to load.

What is the expected behavior?

When the user enters a workspace, start preloading the dataset so that by the time the user selects a category, the dataset has already started loading.

The performance improvement is the following: if the total time to put the dataset into the cache is x and the user takes y to select a category after entering the workspace, then the wait the user experiences drops from x to max(x - y, 0).

Improve error notifications of 500 Server errors

Is your feature request related to a problem?

Currently, 500 error notifications are not user friendly. The toast message directly displays some HTML text. Although in the long term 500 errors should first be handled on the backend side, to identify the issue and display an explainable error message to the users, a default user-friendly error message should be displayed for 500 server errors.

What is the expected behavior?

Display a user-friendly error when requests fail with status code 500. The message could be something like: "Something went wrong. Please ask your system administrator to share the logs by creating an issue on GitHub or sending a message via Slack."

Implementation of specific BERT models

Is your feature request related to a problem?
It is not really a problem, but it would make the code work more efficiently, in particular in other languages.

What is the expected behavior?
In the configuration file it seems that I can only choose a generic "HF BERT". I was wondering whether other specific BERT models are already implemented and how it would be possible to set them up.

Additional context (if any):

Update frontend readme

Is your feature request related to a problem?
Yes, the current frontend readme is outdated.

It should be updated to cover the following topics:

  • Local development.
  • Building for production.
  • Environment variables and feature flags.
  • Application structure.
  • Redux state description.
  • How does the labeling mechanism work?
  • How does interaction with the backend API happen?

When we get to a point where this readme's information is complete, we can add it to the webpage.

Make Search/Label Next panel wider or adjustable

Is your feature request related to a problem?
The sentences and paragraphs I am labeling are longer than those in the wiki_animals_2000 example dataset. Consequently, only a few elements are shown at a time in the right panel (when searching or using the Label Next feature). This also makes it frustrating to have to scroll down to the next element after every labeling action.

What is the expected behavior?
Since 90% of my time is spent in the right panel (searching or labeling next), the right panel should not use only 20% of the screen real estate. Ideally it would default to something more like 40% and/or be adjustable.

Additional context (if any):
Automatically scrolling down to the next element after labeling could also improve the UX.

Improve the visibility of the Webpage link

Is your feature request related to a problem?

The Webpage link at the bottom left of the workspace page is difficult to read on some screens.

image

What is the expected behavior?
It should be easy to read and to notice. Increasing the font weight should solve the issue.

Refactor elements state design and labeling mechanism in the frontend

Current status

The way elements are stored in the frontend and how the labeling status is managed are currently overly complicated, and thus difficult to understand, as well as suboptimal, so performance can be improved.

Regarding performance of the labeling mechanism: we aim to keep the frontend state updated while making the minimal number of endpoint calls to the backend, especially when a call is a heavy GET request. To achieve this, the elements have to stay updated between the different views. For example, if an element A is present both in the main panel and in a sidebar panel, a label action on one of them has to be reflected in the other without having to retrieve all the elements in the panel again. The labeling mechanism is responsible for synchronizing the panels. Regarding the state of the application, it should be as simple as possible, to allow developers to easily understand the code.

Currently, the element-related information for each panel is the following:

  • an elements array: has an entry for each element with the fields begin, end, doc_id, id, model_predictions and user_labels. Both model_predictions and user_labels are a dictionary of the form category_id: label.
  • a labelState dictionary: with keys equal to L[index of element in the array] (e.g. L10) if it is a main panel element or L[index of element in the array]-[elementId] (e.g. L4-medium_wiki-Yellow_fronted tinkerbird-10).

Disadvantages of this approach:

  • the elements and the labelState are not synchronized. elements is only used to retrieve the text entry and the model prediction, while labelState is used to retrieve the user labels.
  • elements are stored without being parsed, thus the names of the fields use the Python naming convention. In addition, there are fields that are never used.
  • When a labeling action is performed, synchronization is only done between the main panel and the currently active sidebar panel. Thus, other non-active sidebar panels remain unsynchronized.
  • The labeling mechanisms for a label action in the main panel and one in a sidebar panel are implemented in different custom hooks, so there is code duplication.
  • When a labeling action is performed in the main panel, the labeling mechanism searches for the current element's id in the sidebar panel's labelState dictionary. As the key of this dictionary is not the labeled elementId, potentially all keys have to be searched in order to determine if the labeled element is present in the sidebar panel. This operation has O(N) time complexity.
  • When a labeling action happens in the sidebar panel, the key that has to be updated in the labelState dictionary of the main panel is calculated from the elementId. This calculation shouldn't be necessary if the elementId is used as the key.
  • The frontend uses a different terminology than the one the backend uses for labeling values. The frontend uses "pos", "neg" and "none", while the backend uses "true", "false" and "none". This can be confusing.

Proposed changes:

  • Merge the elements and the labelState objects into a single elements dictionary for each panel, with its keys being [elementId] and its values objects with the fields docId, id, modelPrediction and userLabel, where modelPrediction and userLabel are a label value instead of a dictionary, as we are, at any point in time, using a single category.
    Example:
"elements": {
  "medium_wiki-Yellow_fronted tinkerbird-10": {
    "id": "medium_wiki-Yellow_fronted tinkerbird-10",
    "docId": "medium_wiki-Yellow_fronted tinkerbird",
    "modelPrediction": "false",
    "userLabel": "true"
  },
  // other elements
}

Storing the information of the element like this allows the synchronization mechanism to have complexity O(1) as searching for a key in a dictionary should be O(1) (this may vary from browser to browser).

  • Use a single custom hook to synchronize the panels. This could potentially be done in the Redux reducer of the endpoint that labels an element.
  • Synchronize all the panels (except the active panel if the label action happened in a sidebar panel, and except the main panel if the labeling action happened in the main panel).
  • Unify the labeling values terminology.
  • Some of the panels have to store more information than just elements.
    • The search panel has to store the searched text.
    • The contradicting labels panel has to store a list of pairs. The pairs can be identified with their element IDs.
    • The precision evaluation panel has to store information about the evaluation progress, the last evaluation result, etc.

[backend] Add a flag for the search endpoint to enable/disable regex search

Is your feature request related to a problem?
Currently, the search endpoint uses regex by default which might be confusing for some users.

What is the expected behavior?
The search endpoint will not use regex by default. A flag will be passed to the endpoint to turn the regex search on.

Additional context (if any):
Later, the frontend could add a checkbox to enable/disable regex search
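A minimal sketch of the flag behavior described above, with hypothetical function and parameter names (the actual endpoint signature may differ):

import re

def search_texts(texts, query, use_regex=False):
    if not use_regex:
        # treat the query as a literal string unless regex search was requested
        query = re.escape(query)
    pattern = re.compile(query)
    return [text for text in texts if pattern.search(text)]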

Training model does not get created due to FileNotFoundError

What is the current behavior?
The training model does not get created. The UI keeps displaying the message "Training in progress" in spite of waiting for several hours. Ultimately, the Current Model shows as "Not Available yet".
image

What is the expected behavior?
Training should complete in a few minutes at most, and the model should be available. If there is an issue with creating the training model, the user should be notified that there was a problem and that the training model could not be created, along with guidance on what to do next / how to get help (e.g., create an issue in this GitHub repo).

Steps to reproduce the problem:

  1. Set the home/default directory to a longish name, or install label-sleuth in a directory with a long name, e.g. "C:\Users\SheshadriChandrashek"
  2. Follow the installation steps as documented on label-sleuth getting started guide, install on Windows OS
  3. Upload data, begin labeling until 100%, wait for training model to be created.

Environment:

  • OS: Windows
  • Web browser: Chrome
  • Python version: 3.8

Suggested solution (if any):

Add Contributor License Agreement Github action

Is your feature request related to a problem?

For contributing code, contributors have to first sign a Contributor License Agreement (CLA). We instruct them to do so in the Contribution Guidelines by adding a copy of the Developer's Certificate of Origin 1.1 in a comment of the Pull Request. However, so far we have had to remind contributors to add this in almost all cases.

What is the expected behavior?

We can help contributors remember to add the CLA PR comment by adding a note reminding them to do so in the Pull Request template, but they may not read that carefully either. I propose adding the CLA Assistant GitHub action to the actions that listen for changes in Pull Requests. This action maintains a file inside the repository where licenses signed by users are stored. When a user who has not signed the CLA before creates a PR without including the license copy, the action will make a comment asking the user to add a copy of the CLA and will block the PR. The PR will be unblocked once the contributor adds the copy of the license.

Include weakly labeled elements when exporting labeled data

Is your feature request related to a problem?
When exporting labeled data, weak labels that were used for training are not included in the file.

What is the expected behavior?
Include weak labels when exporting labeled data. Mark them so the user can decide whether to use them or not.

Additional context (if any):

Improve performance in fetching positive predictions

Is your feature request related to a problem?
When using Label Sleuth with a very large corpus, there is a significant performance bottleneck in the Positive predictions panel. This is because currently in order to calculate the contents in this panel we need to create TextElement objects for all the texts in the corpus and then collect the inference results for all of them, which can be quite expensive.

What is the expected behavior?
Assuming the user usually does not need to see all the positive predictions, this can be implemented in a lazy manner - perform object creation and inference on batches of elements rather than the entire corpus, and stop once the requested amount of positively predicted examples has been accumulated.

Additional context (if any):
Switching to lazy load would also mean modifying the frontend pagination behavior, as in this scenario we do not know in advance the total number of positive predictions and thus the total number of pages of results.
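A rough sketch of the lazy, batched accumulation described above, assuming a model object that exposes an infer() call like the one described in the "Implementing a new machine learning model" section (all other names are illustrative only):

def collect_positive_predictions(corpus_texts, model, model_components, batch_size=1000, required=50):
    positives = []
    for start in range(0, len(corpus_texts), batch_size):
        # create objects and run inference for one batch at a time
        batch = [{"text": text} for text in corpus_texts[start:start + batch_size]]
        predictions = model.infer(model_components, batch)
        positives.extend(item for item, prediction in zip(batch, predictions) if prediction.label)
        if len(positives) >= required:
            break  # stop once enough positively predicted examples were accumulated
    return positives[:required]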

[Frontend] Clean scripts and non-used dependencies package.json

Is your feature request related to a problem?

The package.json file is used by npm to know which are the dependencies of the project, and it is where scripts are defined. A few of the direct dependencies were identified as not being used anywhere in the code. Also, most of the defined scripts don't have a use case that justifies keeping them.

What is the expected behavior?

Remove the unused dependencies and scripts.

Import List of Classes/Labels when starting out

Is your feature request related to a problem?
I'm trying to import a dataset that already has a set number of 10 classes. However, I need to create each class manually instead of being able to import a list of classes quickly. This can become a problem, especially if there are even more classes and/or more datasets.

What is the expected behavior?
I would like to be able to import a list of class names instead of typing in each class name individually by hand.

Add version number in the frontend

Is your feature request related to a problem?

We don't have an easy way of identifying which version of the system we are using. Having this information would be useful to know which features/fixes should be present in the system.

What is the expected behavior?

The frontend should display the system's version. As part of this issue, I will research how other projects implement this. Note that it is not enough to retrieve the PyPI version, as we also want to distinguish in-between versions that are not published on PyPI. This is the case for commits that are on the main branch but aren't referenced by a tag.
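One possible direction (a sketch only, not a decided implementation): have the backend report the installed package version, appending the current git commit when running from a source checkout, and let the frontend display that string.

import subprocess
from importlib import metadata

def get_system_version():
    """Return the installed package version, with the current git commit
    appended so that unreleased in-between versions can be told apart."""
    try:
        version = metadata.version("label-sleuth")  # assumes the PyPI package name
    except metadata.PackageNotFoundError:
        version = "unknown"
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True).strip()
        return f"{version}+{commit}"
    except (subprocess.CalledProcessError, FileNotFoundError):
        return version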

500 Internal Server Error when deleting a category

What is the current behavior?
Deleting a category fails. The following is the error stack trace:

2022-07-18 16:45:48,619 ERROR    [app.py:1455] Exception on /workspace/w6/category/0 [DELETE]
Traceback (most recent call last):
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 2077, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 1525, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/authentication.py", line 32, in wrapper
    return function(*args, **kwargs)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/app_utils.py", line 36, in wrapper
    return function(workspace_id, *args, **kwargs)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/app.py", line 341, in delete_category
    current_app.orchestrator_api.delete_category(workspace_id, category_id)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 148, in delete_category
    self._delete_category_models(workspace_id, category_id)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 336, in _delete_category_models
    if workspace.categories[category_id].iterations[idx].model.model_status != ModelStatus.DELETED:
KeyError: '0'

What is the expected behavior?
The category should be deleted.

Steps to reproduce the problem:

  1. Create a workspace
  2. Create a category on the workspace
  3. Call the DELETE /workspace/<workspace_id>/category/<category_id> endpoint to delete the category.

Environment:

  • OS: macOS
  • Web browser: Chrome
  • Python version: 3.8.13

Suggested solution (if any):
Casting the category_id to int in _delete_category_models() makes it work. It seems that the category is stored under an int key in the workspace.categories dict, but it is looked up with a string key.
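An illustrative sketch of the suggested cast at the top of _delete_category_models() (the rest of the method body is omitted, and the fix could equally be applied where the id is read from the request):

def _delete_category_models(self, workspace_id, category_id):
    # The id arrives as a string (it comes from the URL path), while
    # workspace.categories is keyed by int, so cast it before the lookup.
    category_id = int(category_id)
    # ... rest of the existing method, now using the int category_id ...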

Malformed regex errors when searching for keywords with unbalanced parenthesis

What is the current behavior?
Using the search panel with a malformed regex fails. For example: querying for a string that contains ) returns the following trace:

Traceback (most recent call last):
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 2077, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 1525, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/authentication.py", line 32, in wrapper
    return function(*args, **kwargs)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/app_utils.py", line 36, in wrapper
    return function(workspace_id, *args, **kwargs)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/app.py", line 539, in query
    resp = curr_app.orchestrator_api.query(workspace_id, dataset_name, category_id=None, query_regex=query_string,
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 246, in query
    return self.data_access.get_text_elements(workspace_id=workspace_id, dataset_name=dataset_name,
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/data_access/file_based/file_based_data_access.py", line 249, in get_text_elements
    self._get_text_elements(
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/data_access/file_based/file_based_data_access.py", line 482, in _get_text_elements
    corpus_df = filter_func(corpus_df, labels_series)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/data_access/file_based/file_based_data_access.py", line 251, in <lambda>
    filter_func=lambda df, _: utils.filter_by_query_and_document_uri(df, query_regex, document_uri),
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/data_access/file_based/utils.py", line 75, in filter_by_query_and_document_uri
    return df[df.text.str.contains(query, flags=re.IGNORECASE, na=False)]
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/pandas/core/strings/accessor.py", line 101, in wrapper
    return func(self, *args, **kwargs)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/pandas/core/strings/accessor.py", line 1110, in contains
    result = self._data.array._str_contains(pat, case, flags, na, regex)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/site-packages/pandas/core/strings/object_array.py", line 110, in _str_contains
    regex = re.compile(pat, flags=flags)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/Users/martin/opt/miniconda3/envs/label-sleuth/lib/python3.8/sre_parse.py", line 962, in parse
    raise source.error("unbalanced parenthesis")
re.error: unbalanced parenthesis at position 5

What is the expected behavior?
The backend should treat the query string as plain text if parsing the input as a regex fails. However, if the user was indeed trying to search with a regex, silently searching for the literal text instead may be confusing. We could add a checkbox that lets the user specify that they are inputting a regex.
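A minimal sketch of the fallback, based on the filter shown in the trace (filter_by_query_and_document_uri() in data_access/file_based/utils.py), simplified here to the query part only:

import re
import pandas as pd

def filter_by_query(df: pd.DataFrame, query: str) -> pd.DataFrame:
    """Filter rows whose text matches the query, falling back to a literal
    (escaped) search when the query is not a valid regular expression."""
    try:
        re.compile(query)
    except re.error:
        query = re.escape(query)  # treat the query as plain text
    return df[df.text.str.contains(query, flags=re.IGNORECASE, na=False)]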

The backend fails to calculate the contradicting elements when there is only one negative label

What is the current behavior?
The backend fails to calculate the contradicting elements when there is only one negative label.

2022-10-26 17:30:21,293 ERROR    [app.py:1013] Failed to generate contradiction report for workspace w_wiki_mediun category_id 5
Traceback (most recent call last):
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/app.py", line 1003, in get_contradicting_elements
    contradiction_elements_dict = curr_app.orchestrator_api.get_contradiction_report(workspace_id, category_id)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/orchestrator/orchestrator_api.py", line 695, in get_contradiction_report
    return get_suspected_labeling_contradictions_by_distance_with_diffs(
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/analysis_utils/labeling_reports.py", line 115, in get_suspected_labeling_contradictions_by_distance_with_diffs
    pairs = get_suspected_labeling_contradictions_by_distance(category_id, labeled_elements, embedding_func, language)
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/analysis_utils/labeling_reports.py", line 149, in get_suspected_labeling_contradictions_by_distance
    distances_and_pairs = _get_nearest_neighbors_with_opposite_label(labeled_elements, embedding_vectors,
  File "/Users/martin/Documents/GitHub/label-sleuth/label_sleuth/analysis_utils/labeling_reports.py", line 181, in _get_nearest_neighbors_with_opposite_label
    for i, (distance, opposite_neighbor_idx) in enumerate(zip(np.squeeze(distances_to_closest_opposite),
TypeError: iteration over a 0-d array

What is the expected behavior?
The backend should calculate the contradicting elements.

Steps to reproduce the problem:

  1. Create a category
  2. Label one positive element and one negative element.
  3. Go to the Contradicting labels panel
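Suggested solution (if any):
np.squeeze() collapses a single-element array into a 0-d array, which cannot be iterated, hence the TypeError in _get_nearest_neighbors_with_opposite_label(). One possible fix (a sketch, not the exact change) is to wrap the squeezed arrays with np.atleast_1d() before zipping over them, which keeps the single-pair case iterable:

import numpy as np

# With a single labeled pair, squeeze() produces a 0-d array that cannot be iterated.
distances_to_closest_opposite = np.array([[0.42]])

# np.atleast_1d keeps the result iterable even when there is only one element.
distances = np.atleast_1d(np.squeeze(distances_to_closest_opposite))
for i, distance in enumerate(distances):
    print(i, distance)  # works for both the single-element and the general case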
