lfoppiano / supercon2

Staging-area for automatically collected experimental data for the SuperCon database with a curation interface with enhanced-document viewer and curation-ready interface

Home Page: https://supercon2.readthedocs.io

Dockerfile 0.04% Python 2.14% JavaScript 69.83% CSS 16.37% HTML 8.71% XSLT 0.25% Shell 0.01% SCSS 2.66%

tdm feedback grobid superconductors training training-data

supercon2's Introduction

SuperCon 2

SuperCon 2 is the staging area of SuperCon, the 'de-facto standard' database of superconductor materials. SuperCon 2 collects experimental data automatically extracted from scientific documents and provides a workflow to correct them efficiently and with high quality.

The SuperCon 2 interface is described in detail here.

This repository contains:

The process to create the SuperCon 2 database from scratch, using Grobid Superconductor to extract materials information from large quantities of PDFs.
The SuperCon 2 curation interface and workflow application for visualising and editing materials and properties extracted from superconductors-related papers.
The documentation related to the usage of SuperCon 2, with notes, experiments and other information, accessible here.

Reference:

@article{doi:10.1080/27660400.2023.2286219,
	title        = {Semi-automatic staging area for high-quality structured data extraction from scientific literature},
	author       = {Luca Foppiano and Tomoya Mato and Kensei Terashima and Pedro Ortiz Suarez and Taku Tou and Chikako Sakai and Wei-Sheng Wang and Toshiyuki Amagasa and Yoshihiko Takano and Masashi Ishii},
	year         = 2023,
	journal      = {Science and Technology of Advanced Materials: Methods},
	publisher    = {Taylor & Francis},
	volume       = {0},
	number       = {ja},
	pages        = 2286219,
	doi          = {10.1080/27660400.2023.2286219},
	url          = {https://doi.org/10.1080/27660400.2023.2286219},
	eprint       = {https://doi.org/10.1080/27660400.2023.2286219}
}

supercon2's Issues

Creation of environment

Setup of the current environment (download code, setup mongodb, load processed documents with grobid-superconductors....)
Check how to use vue.js with the current setup
Update documentation (if needed)

Reduce space by text using icons

Replace this text with icons

add filtering function on /automatic_database

We should have these filters:

filter by value in each column (e.g. filter raw material by words, or years, etc...) - note we should have a way not only to filter but to show only records for which a certain column contains a value or not
show data by type/status combination:
- automatic/valid (standard)
- manual/valid (corrected)
- manual/invalid (flagged)

filter for empty value does not work for document

It seems that there is a problem with the filter for empty values in the "Document" column:

We could just remove the filter for that column if it's not too complicated

Add multi-user support

We should add the support to have multiple users working on the data without having to agree in advance who's working on what.
For example, when one user clicks on "edit", the system might lock the record for a set amount of time. If the record is not saved within 5 minutes, the lock is released.
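
The locking idea above could be sketched roughly as follows. This is only an in-memory illustration under stated assumptions: the class name `RecordLocks`, its methods, and the injectable clock are hypothetical and not part of supercon2; a real deployment would probably keep the lock in the database.

```python
import time

LOCK_TIMEOUT_SECONDS = 5 * 60  # the 5 minutes mentioned in the issue


class RecordLocks:
    """In-memory record locks that expire after a timeout (hypothetical helper)."""

    def __init__(self, timeout=LOCK_TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._locks = {}  # record_id -> (user, expiry time)

    def acquire(self, record_id, user):
        """Return True if `user` now holds the lock on `record_id`."""
        now = self._clock()
        holder = self._locks.get(record_id)
        if holder is not None:
            owner, expires_at = holder
            # Another user holds a lock that has not expired yet.
            if owner != user and expires_at > now:
                return False
        # Free, expired, or already ours: (re)take the lock.
        self._locks[record_id] = (user, now + self._timeout)
        return True

    def release(self, record_id, user):
        """Release the lock if it is held by `user` (e.g. after saving)."""
        holder = self._locks.get(record_id)
        if holder is not None and holder[0] == user:
            del self._locks[record_id]
```

With this sketch, clicking "edit" would call `acquire`, saving would call `release`, and an abandoned edit simply expires.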

Add measurement method visible in PDF viewer

In the viewer, it seems that not all features are shown. I would like at least me-method to be visible in the viewer, in order to distinguish whether the Tc in question is experimental or theoretical.

Backend technical investigation / fixes

Investigate the authentication with flask (as compared with django)
Get the supercon2 Docker environment working (bonus)

Error handling in interface

Show error message in interface in case the result is not a 200:

show error in edit mode
show error when adding new record
when the error pop up appears, and the user presses OK, don't close the modal window

remove blank characters when updating or inserting a new record

Feedback data from a corrected Excel

Main tasks:

collect the PDF documents corresponding to the corrected data
process these PDFs and write them in a separate database
prepare a script to match (with fuzzy matching) records from the Excel file to the new database
write the logic for the update (draft #14)
find the passage relative to the corrected record in a separate collection (training data generation)
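
The fuzzy-matching step in the task list above might look like the following sketch. It uses the standard library `difflib`; the field name `material`, the threshold of 0.8, and both function names are assumptions made for illustration, not the project's actual matching logic.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1], ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_match(excel_row, db_records, field="material", threshold=0.8):
    """Return the database record closest to the Excel row, or None.

    `field` and `threshold` are illustrative assumptions.
    """
    best, best_score = None, 0.0
    for record in db_records:
        score = similarity(excel_row.get(field, ""), record.get(field, ""))
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None
```

In practice several fields (material, Tc value, DOI) would likely be combined into one score; a single field keeps the sketch short.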

Add swagger on the API

quite nice to have

When adding a new item, both DOI and hash are required, but only one of them should be mandatory

We would need to modify this behaviour as a user cannot copy more than one item before clicking on "new item":

add "new item" button in each row to pre-fill bibliographic information

When adding a new ("missing") item, is it possible to copy the remaining fields from one of the existing items? It would help us fill in all of them.

Add flagging with multiple selection

Allow flagging multiple rows
Possible problems:

users can flag by mistake
...

We should not flag records that have been manually corrected

Right now it's possible to flag records that have been corrected. This sets them as "manual"/"invalid", and unflagging them sets them as "automatic"/"valid", which is not correct.

We need to extend the workflow at #14

Docker image

docker image
docker compose
automatic docker build on falcon
semi-automatic deployment (need to log in on falcon and restart the application via service supercon2 restart)

Add span identifier in each columns' record

When we build the record table, we should be able to link the columns materials, tcValue and pressure to the originally extracted span identifier.

Should manually corrected records be flaggable?

As discussed in #58, the question is whether we consider manually corrected records unflaggable or not.

fix error messages

supercon2/supercon2/service.py

Lines 197 to 201 in ac3d396

    if 'hash' not in record or record['hash'] == "":
        abort(400, "Missing document hash or doi")
    if 'doi' not in record or record['doi'] == "":
        abort(400, "Missing document hash or doi")

I don't think it is necessary to write "or" in the error message.
The code knows which one is missing.
I noticed it because I was confused by the error message.
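
One possible way to make the messages specific is sketched below. The helper name `missing_identifier_errors` is hypothetical and is not in `service.py`; the caller would pass the resulting message to `abort(400, ...)` as the existing code does.

```python
def missing_identifier_errors(record):
    """Return one specific message per missing identifier (hypothetical helper)."""
    errors = []
    # `not record.get(...)` covers a missing key, None and the empty string.
    if not record.get("hash"):
        errors.append("Missing document hash")
    if not record.get("doi"):
        errors.append("Missing document doi")
    return errors
```

If only one of the two identifiers ends up being mandatory, the caller could reject the record only when both messages are returned.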

Flagging workflow API

Add new API calls for the flagging workflow:

pagination, GET with parameters from=4&to=5 or from=4&length=10 (perhaps the second is better for mongodb)
flag/unflag a record, PUT/PATCH with parameter record Id
filtering by record type (all/valid/invalid)

~~filtering by columns with dynamic parameters, since the data might change. If a parameter does not correspond to any property, it's ignored.~~
(to be decided if we implement it. If so we will add a new task)
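
As a rough sketch of the second pagination form (an offset plus a length parameter, spelled `length` here), the query parameters can be turned into a skip/limit pair, which is also what a MongoDB cursor expects through its `skip()` and `limit()` methods. The function names, the default page size and the maximum page size are assumptions for illustration.

```python
def parse_pagination(args, default_length=10, max_length=100):
    """Turn query parameters into (skip, limit); defaults are assumptions."""
    skip = max(int(args.get("from", 0)), 0)
    limit = int(args.get("length", default_length))
    limit = min(max(limit, 1), max_length)
    return skip, limit


def paginate(records, args):
    """Apply the pagination to an in-memory list of records."""
    skip, limit = parse_pagination(args)
    return records[skip:skip + limit]
```

For example, `from=4&length=10` yields the ten records starting at offset 4.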

Regarding blank: distinguish failure and non-relevant

Right now "blank" seems to mean both failure in extraction and non-relevant items.
Can we put ----- for one of them?

View the ID of each record in the table view

Can we see the ID of each extracted item in the curation table as well as in the PDF viewer (#108)? It will help us to identify which is where.

Add git rev in the version

In order to know the exact version we should add the revision string in the version
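
A minimal sketch of how the revision could be appended, assuming the application runs from a git checkout and that `git` is on the PATH. The function names and the `+` separator are assumptions; the fallback value is used when git is unavailable.

```python
import subprocess


def git_revision(default="unknown"):
    """Return the short hash of HEAD, or `default` outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return default
    return result.stdout.strip() or default


def full_version(base_version, revision):
    """Combine the release version and the revision, e.g. for an info page."""
    return f"{base_version}+{revision}"
```

Inside a Docker image there is usually no `.git` directory, so the revision would more likely be computed at build time and passed in.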

Flagging interface

The new interface could use requests to paginate and filter. Decision to be made on the spot when implementing.

Tasks:

write the interface from scratch with vue.js with pagination and filtering as now
add button to flag/unflag a record
add keystrokes for selecting the record and flag/unflag. E.g. arrow-up, arrow-down and f for flag/unflag of a record
add button to move to next/previous page
add keystrokes for going to the next/previous page
add button to filter record / only valid / only invalid / all

add logger instead of print()

As per the subject, before production :-)
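
A small sketch of what the change could look like with the standard `logging` module. The logger name `supercon2` and the function `process_document` are hypothetical and only show a `print()` call being replaced.

```python
import logging

# The logger name is an assumption for illustration.
logger = logging.getLogger("supercon2")


def configure_logging(level=logging.INFO):
    """Configure the root handler once, at application start-up."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def process_document(doc_hash):
    """Hypothetical function showing a print() call replaced by a log call."""
    # Before: print("Processing", doc_hash)
    logger.info("Processing document %s", doc_hash)
    return doc_hash
```

The log level can then be raised in production without touching the call sites.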

Expand filter button div/icon expands to the whole page

This is minor, but currently the expand filter button expands to the whole page, meaning it's clickable/selectable from the beginning to the end of the page.

Treat numeric values as numbers

This one is more challenging, but for sorting / searching I think it's necessary to treat numeric text, such as temperature or applied pressure, as numeric values.

In this case I would suggest the creation of two new columns for example where we would have a 'parsed' numeric value for a given text,
similar to what was done in the MagneticMaterials (RIP) database.

If a material has more than one value, I would suggest keeping the highest value in the parsed column, or maybe showing a list of sorted(?) values, whichever is easier later on for exploring (maybe only one value might be easier?)

I think this allows for easier searching / sorting and the raw texts make it easier to compare what was extracted etc.
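
The "parsed" column suggested above could be filled with something like the sketch below. It only extracts plain decimal numbers with a regular expression and keeps the highest one; it does not handle units, scientific notation or negative values, and the function names are assumptions.

```python
import re

# Plain decimals only; signs are ignored so that "10-20 K" yields 10 and 20.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_numeric(raw_text):
    """Return every number found in the raw text as a float."""
    return [float(token) for token in _NUMBER.findall(raw_text or "")]


def parsed_value(raw_text):
    """Return the highest number in the raw text, or None if there is none."""
    values = parse_numeric(raw_text)
    return max(values) if values else None
```

The raw text stays untouched in its own column, so the extracted value can still be compared with what was originally extracted.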

Training data creation after editing

How to trigger this job? Manually or automatically? Ideally it should be triggered after a document has been corrected.

The data will end up in the annotation tool anyway so might be fine to just have the sentence, without annotations.

Tasks:

take the new record (from tabular), take the JSON representation (from document collection) and filter the correct sentence (for this we might use the span id or we need to add the sentence id in the document JSON and tabular record)
compose the annotated text using: a) the sentence, b) the original spans and c) the new spans that have been added (in case of new data) or the spans that have been edited (in case of editing of existing data)
save the data in the database

Correction interface

...following #3
Implement the editing of records:

enable editing of the table with record selectable via keyboard or mouse click (up-down keystrokes can be recycled from #3)
keystrokes: tab moves to the next column, shift/ctrl+tab moves to the previous column, enter (anywhere) saves the record
show button for editing and saving

Filter empty records

One useful feature is to show only records for which a certain column is not empty.
For example one use case is to show only record of materials with applied pressure.

This could be useful also for sorting #31

Interface suggestions

I was wondering if, in the database page, there is a standard way in which the table appears every time you open the page...
And then, if the user manipulates the columns a lot, is there a way to reset the table to have the starting columns enabled somehow, in just one click? Like a push button maybe? It could be more comfortable for the user.

I was playing around with the table and, since there is the option to create a new item, I created a new one (SORRY!!)
but now somehow I think, I cannot delete that item...... what is the best course of action there?? (I just realized you already have an issue related to deletion)
Also, it might be an error on my inputs, but the "flag" button in that new edit is not enabled somehow.

Ideas about the filtering. It works, by the way!
It seems that the filtering works correctly for the most part, but it is case-sensitive.
So, for example, if I type "oxides" it will not give me any results, since it has to be "Oxides"... so making the filter case-insensitive would be more flexible for the user.
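
A case-insensitive comparison could be as small as the sketch below; the function name `matches_filter` is hypothetical, and the real filtering may well happen in the JavaScript table component instead.

```python
def matches_filter(cell_value, query):
    """Return True if `query` occurs in the cell, ignoring letter case."""
    return query.casefold() in str(cell_value or "").casefold()
```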

About applied pressure: right now we filter on values that contain a certain number, or the exact number. I wonder if it is possible to set up a range of values instead? But it might be too difficult because of the format.

It would be a good idea, if possible, to also put a pagination menu on the upper side of the table, and not only in the lower side.

Validate if date is a timestamp type

If I mistakenly send the date as a String type in a test, I need to delete it from MongoDB every time,
so I would like it to be validated on the back end.
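
A back-end check along these lines might be enough. This is a sketch assuming the record is a Python dict and that dates are stored as `datetime` objects; the field name `date` and the function name are assumptions.

```python
from datetime import datetime


def validate_date(record, field="date"):
    """Raise ValueError unless record[field] is a datetime (field name assumed)."""
    value = record.get(field)
    if not isinstance(value, datetime):
        raise ValueError(
            f"'{field}' must be a timestamp, got {type(value).__name__}"
        )
    return record
```

The API layer could translate the `ValueError` into a 400 response before anything is written to the database.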

Correction workflow API

Add the following API calls:

PATCH/PUT for updating existing record
POST for adding a new record

Avoid updating obsolete records

When we try to update a record that is marked as obsolete, the process should stop and return an error saying that the only record that can be updated is the latest in the chain of updated records.
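
The guard could be sketched as follows, assuming each record carries a `status` field whose value may be `obsolete`. The exception class and function name are hypothetical.

```python
class ObsoleteRecordError(Exception):
    """Raised when an update targets a record that has been superseded."""


def ensure_updatable(record):
    """Stop the update if the record is marked as obsolete."""
    if record.get("status") == "obsolete":
        raise ObsoleteRecordError(
            "Only the latest record in the chain of updated records can be updated"
        )
    return record
```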

Add text area for notes while correcting

As suggested by @ChikakoSakai, it would be useful to have an area where the curator can add some additional information in the correction form:

After editing a record the table should not refresh

When I edit a record and press "save", the modal closes and the record disappears further down in the table.

In order to leave it visible to the users we should freeze the update of the table and leave the record where it is.

What are the keyboard shortcuts?

Keyboard shortcuts are not documented. 😸

We should add them in:

readme.md documentation
somewhere in the interface (a link to the documentation page, opened in a new tab/window, would be enough)

Empty records being treated as lowest value when records are sorted ascending

Currently, if you sort in ascending order (on any column), blank/empty records are treated as the ones with the lowest value. I don't know how trivial this is to solve, but I think it would be more useful if the empty records were always shown last.

For example, if you sort by applied pressure ascending, you will get a lot of pages of blank records.
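
One way to keep empty records at the end regardless of sort direction is to sort only the filled records and append the empty ones afterwards, as in this sketch. The function name is hypothetical and records are assumed to be dicts.

```python
def sort_with_empty_last(records, column, descending=False):
    """Sort by `column`, always placing missing or blank values at the end."""
    def is_empty(record):
        return record.get(column) in (None, "")

    filled = [r for r in records if not is_empty(r)]
    empty = [r for r in records if is_empty(r)]
    return sorted(filled, key=lambda r: r[column], reverse=descending) + empty
```

This assumes the values in the column are mutually comparable, for example already parsed as numbers.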

Record flags summary

I'm opening a discussion issue to summarise all the flags that a record can use.
For the moment we have just:

type: manual, automatic
status: valid, invalid, obsolete (if a correction will not modify the original record), empty

here a draft schema:

How can I call the API when I want to retrieve all records?

I thought that I could use /records, but the get_records function always filters a type and a status.

Feed training data into label studio (exploration task)

The objective of this task is to explore the possibility of loading the pre-annotated training data directly from label studio (local instance running at http://falcon.nims.go.jp/label-studio).

Ideally the database (collection training-data) contains records with passage, hash (document id), span_id (identifier of the span involved in the correction), and correction_span (whole span involved in the correction).

We want to pull this data out in the format described here.

The user has to call something like:
GET /training/format=labelstudio/grobid

which will:

pull out the training data that is flagged as "new"
transform the raw data into the labelstudio or grobid format
flag each record as 'in_progress'
add timestamp
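
The selection and flagging steps above might be sketched like this. The format transformation is deliberately left out, and the field names `status` and `timestamp`, as well as the function name, are assumptions about how the training-data records are stored.

```python
from datetime import datetime, timezone


def export_new_training_data(records, now=None):
    """Pick records flagged as "new", mark them in progress and timestamp them."""
    now = now or datetime.now(timezone.utc)
    exported = []
    for record in records:
        if record.get("status") == "new":
            record["status"] = "in_progress"
            record["timestamp"] = now
            exported.append(record)
    # The conversion to the labelstudio or grobid format would happen here.
    return exported
```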

Add "remove" button in the training data management page

Add "remove" button in the training data management page in case we want to remove training data

Update old page (view only) with vue.js

(not high priority)
Basically to rewrite the view interface using vue.js

Comments about interface

These are my comments on the SuperCon 2 "automatically extracted database" tabular view.
・It is very good to have a choice of columns.

Add documentation in the interface or a link to a separate page

As you asked for my comments on the interface... although I am more comfortable saying this kind of thing face-to-face

Can you add an "explanation" for each item? It can be either a link to a separate page or placed at the bottom or top of the page

kensei

Add "back" button

I don't know where to place it.. for now the back/forward of the browser was enough...

Add new record

There is a button somewhere, then two things can happen:

a new row is visualised and the user can edit the row by adding information. In this case what should we do if not all columns are shown?
a new modal is used with all the fields

Points of attention:

the user needs to select the sentence which is pulled from the JSON documents (simple idea is to just let the user select a sentence from a dropdown, however there might be tons of sentences... so perhaps a filter while typing could be more usable)

add export function (e.g. export JSON, XML so on) on /automatic_database

The export should be revised.

The previous export could be maintained as a "download the current table" functionality. I'm wondering whether we should remove Excel to avoid people using it.

Two functionalities we need are as follows:

download all records of all papers containing at least one record with applied pressure
download all records of all papers containing space groups
TBD

"Remove record" functionality

We should add also a functionality to remove records. The flag/unflag might not be enough for this functionality.

add functionality in the API (removed records should have status:"removed")
implement functionality in the interface

Group by document

We would like to show only records that belong to the same document.

This will make the management of the records simpler than having everything together, especially as the number of records increases.
