
lfoppiano / supercon2


Staging area for automatically collected experimental data for the SuperCon database, with a curation interface and an enhanced document viewer

Home Page: https://supercon2.readthedocs.io

Dockerfile 0.04% Python 2.14% JavaScript 69.83% CSS 16.37% HTML 8.71% XSLT 0.25% Shell 0.01% SCSS 2.66%
tdm feedback grobid superconductors training training-data

supercon2's Introduction

SuperCon 2

SuperCon 2 is the staging area of SuperCon, the 'de-facto standard' database of superconducting materials. SuperCon 2 collects experimental data automatically extracted from scientific documents and provides a workflow to correct them efficiently and with high quality.

The SuperCon 2 interface is described in detail here.

This repository contains:

  • The process to create the SuperCon 2 database from scratch, using Grobid Superconductor to extract materials information from large quantities of PDFs.
  • The SuperCon 2 curation interface and workflow application for visualising and editing materials and properties extracted from superconductor-related papers.
  • The documentation related to the usage of SuperCon 2, with notes, experiments and other information, accessible here.

Reference:

@article{doi:10.1080/27660400.2023.2286219,
	title        = {Semi-automatic staging area for high-quality structured data extraction from scientific literature},
	author       = {Luca Foppiano and Tomoya Mato and Kensei Terashima and Pedro Ortiz Suarez and Taku Tou and Chikako Sakai and Wei-Sheng Wang and Toshiyuki Amagasa and Yoshihiko Takano and Masashi Ishii},
	year         = 2023,
	journal      = {Science and Technology of Advanced Materials: Methods},
	publisher    = {Taylor \& Francis},
	volume       = {0},
	number       = {ja},
	pages        = 2286219,
	doi          = {10.1080/27660400.2023.2286219},
	url          = {https://doi.org/10.1080/27660400.2023.2286219},
	eprint       = {https://doi.org/10.1080/27660400.2023.2286219}
}

supercon2's People

Contributors

esparzamiren, kensei-te, lfoppiano, t29mato, taku641

supercon2's Issues

Creation of environment

  • Set up the current environment (download the code, set up MongoDB, load the documents processed with grobid-superconductors, ...)
  • Check how to use vue.js with the current setup
  • Update documentation (if needed)

add filtering function on /automatic_database

We should have these filters:

  • filter by value in each column (e.g. filter raw material by words, years, etc.). Besides filtering by value, we should be able to show only the records for which a given column is, or is not, empty
  • show data by type/status combination:
    • automatic/valid (standard)
    • manual/valid (corrected)
    • manual/invalid (flagged)

Add multi-user support

We should add support for multiple users working on the data without having to agree in advance on who is working on what.
For example, when one user clicks on "edit", the system might lock the record for a fixed amount of time. If the record is not saved within 5 minutes, the lock is released.
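
The locking idea above could be sketched roughly as follows. This is a minimal in-memory illustration, not the project's implementation: the class name `RecordLocks`, its methods and the injectable `clock` are all hypothetical, and a real deployment would more likely store the lock in the database.

```python
import time

LOCK_TIMEOUT_SECONDS = 5 * 60  # the 5 minutes mentioned in the issue


class RecordLocks:
    """Hypothetical in-memory lock table: record id -> (user, time taken)."""

    def __init__(self, timeout=LOCK_TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # injectable so the expiry can be tested
        self._locks = {}

    def acquire(self, record_id, user):
        """Return True if `user` now holds the lock on `record_id`."""
        now = self._clock()
        holder = self._locks.get(record_id)
        if holder is not None:
            owner, taken_at = holder
            expired = now - taken_at >= self._timeout
            if owner != user and not expired:
                return False  # somebody else is still editing
        self._locks[record_id] = (user, now)
        return True

    def release(self, record_id, user):
        """Release the lock if `user` holds it (e.g. after saving)."""
        holder = self._locks.get(record_id)
        if holder is not None and holder[0] == user:
            del self._locks[record_id]
```

With this shape, the "edit" action would call `acquire` and refuse to open the editor when it returns False, while "save" would call `release`.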

Add measurement method visible in PDF viewer

In the viewer, it seems that not all features are shown. I would like to have at least me-method visible in the viewer, to distinguish whether the Tc in question is experimental or theoretical.

Error handling in interface

Show an error message in the interface when the response status is not 200:

  • show the error in edit mode
  • show the error when adding a new record
  • when the error pop-up appears and the user presses OK, don't close the modal window

Feedback data from a corrected Excel

Main tasks:

  • collect the PDF documents corresponding to the corrected data
  • process these PDFs and write them in a separate database
  • prepare a script to match (with fuzzy matching) records from the Excel file to the new database
  • write the logic for the update (draft #14)
  • find the passage relative to the corrected record in a separate collection (training data generation)

Docker image

  • docker image
  • docker compose
  • automatic docker build on falcon
  • semi-automatic deployment (need to log in on falcon and restart the application via service supercon2 restart)

fix error messages

    if 'hash' not in record or record['hash'] == "":
        abort(400, "Missing document hash or doi")
    if 'doi' not in record or record['doi'] == "":
        abort(400, "Missing document hash or doi")

I don't think the error message needs to say "or": the code knows which field is missing.
I noticed this because the error message confused me.
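
One possible shape for the fix is sketched below. The helper name `missing_field_message` is hypothetical; it only builds the message, which the existing code could then pass to `abort(400, ...)`. Note one small difference from the snippet above: this version also treats a `None` value as missing.

```python
def missing_field_message(record):
    """Return an error message naming the missing fields, or None if complete.

    A field counts as missing when it is absent, empty or None.
    """
    missing = [name for name in ("hash", "doi") if not record.get(name)]
    if not missing:
        return None
    return "Missing document " + " and ".join(missing)
```

For a record without a DOI this yields "Missing document doi", so the caller can see exactly what to fix.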

Flagging workflow API

Add new API calls for the flagging workflow:

  • pagination, GET with parameters from=4&to=5 or from=4&length=10 (perhaps the second is better for MongoDB)
  • flag/unflag a record, PUT/PATCH with parameter record Id
  • filtering by record type (all/valid/invalid)

Filtering by columns with dynamic parameters, since the data might change: a parameter that does not correspond to any property is ignored.
(Whether to implement this is still to be decided. If so, we will add a new task.)
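
As a rough illustration of the two pagination styles listed above, the sketch below converts either form into a (skip, limit) pair. The function name `parse_pagination`, the default page length and the choice to treat `to` as inclusive are assumptions; the issue does not specify them.

```python
def parse_pagination(params, default_length=10):
    """Turn query parameters into (skip, limit) values.

    Accepts either from+length or from+to. Treating `to` as inclusive is an
    assumption made for this sketch.
    """
    start = int(params.get("from", 0))
    if "length" in params:
        length = int(params["length"])
    elif "to" in params:
        length = int(params["to"]) - start + 1
    else:
        length = default_length
    if start < 0 or length < 1:
        raise ValueError("'from' must be >= 0 and a page needs at least one record")
    return start, length
```

The resulting pair maps directly onto a MongoDB query, for example `collection.find(query).skip(skip).limit(limit)` with pymongo, which is probably why the from/length form feels more natural there.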

Flagging interface

The new interface could use requests to paginate and filter. Decision to be made on the spot when implementing.

Tasks:

  • write the interface from scratch in vue.js, with pagination and filtering as in the current interface
  • add a button to flag/unflag a record
  • add keystrokes for selecting a record and flagging/unflagging it, e.g. arrow-up, arrow-down and f for flag/unflag
  • add buttons to move to the next/previous page
  • add keystrokes for going to the next/previous page
  • add buttons to filter records: only valid / only invalid / all

Treat numeric values as numbers

This one is more challenging, but for sorting and searching I think we need to treat numeric text as numeric values, for example temperature, applied pressure, etc.

In this case I would suggest creating new columns that hold a 'parsed' numeric value for a given text,
similar to what was done in the MagneticMaterials (RIP) database.

When a material has more than one value, I would suggest keeping the highest value in the parsed column, or maybe showing a list of sorted(?) values, whichever is easier for exploring later on (maybe a single value is easier?)

I think this allows for easier searching and sorting, while the raw text makes it easier to check what was extracted.
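
A minimal sketch of such a 'parsed' column is shown below. The function names are hypothetical and the regular expression is deliberately naive: it ignores units, and a range written with a hyphen such as "1-2 GPa" would be read as 1 and -2, so real data would need a more careful parser.

```python
import re

# Naive number pattern: optional sign, digits, optional decimals and exponent.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_numeric(raw_text):
    """Return all numbers found in `raw_text`, sorted ascending."""
    return sorted(float(match) for match in _NUMBER.findall(raw_text or ""))


def highest_value(raw_text):
    """Return the highest number in `raw_text`, or None if there is none."""
    values = parse_numeric(raw_text)
    return values[-1] if values else None
```

Storing `highest_value(raw)` next to the raw text keeps the original extraction visible while giving the table something numeric to sort on.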

Training data creation after editing

How to trigger this job? Manually or automatically? Ideally it should be triggered after a document has been corrected.

The data will end up in the annotation tool anyway, so it might be fine to store just the sentence, without annotations.

Tasks:

  • take the new record (from tabular), take the JSON representation (from the document collection) and select the correct sentence (for this we might use the span id, or we need to add the sentence id to the document JSON and the tabular record)
  • compose the annotated text using: a) the sentence, b) the original spans and c) the new spans that have been added (in case of new data) or the spans that have been edited (in case of editing of existing data)
  • save the data in the database

Correction interface

...following #3
Implement the editing of records:

  • enable editing of the table with record selectable via keyboard or mouse click (up-down keystrokes can be recycled from #3)
  • keystrokes: tab moves to the next column, shift/ctrl+tab moves to the previous column, enter (anywhere) saves the record
  • show button for editing and saving

Filter empty records

One useful feature is to show only the records for which a certain column is not empty.
For example, one use case is to show only the records of materials with applied pressure.

This could also be useful for sorting (#31).

Interface suggestions

I was wondering if, in the database page, there is a standard way in which the table appears every time you open the page...
And then, if the user manipulates the columns a lot, is there a way to reset the table to have the starting columns enabled somehow, in just one click? Like a push button maybe? It could be more comfortable for the user.

I was playing around with the table and, since there is the option to create a new item, I created a new one (SORRY!!)
but now somehow I think, I cannot delete that item...... what is the best course of action there?? (I just realized you already have an issue related to deletion)
Also, it might be an error on my inputs, but the "flag" button in that new edit is not enabled somehow.

Ideas about the filtering. It works, by the way!
It seems that the filtering works correctly for the most part, but it distinguishes between upper-case and lower-case letters.
So, for example, if I type "oxides" it will not give me any results, since it has to be "Oxides"... so a case-insensitive filter would be more flexible for the user.

About applied pressure: since we currently filter for values that contain a certain number, or for the exact number, I wonder if it is possible to set up a range of values instead? But it might be too difficult because of the format.
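
Both suggestions could look roughly like the sketch below. The function names are hypothetical, and the range check is simplistic on purpose: it only looks at the first number in the raw text and ignores units, which is exactly the format difficulty raised above.

```python
import re

_FIRST_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def matches_text(value, query):
    """Case-insensitive 'contains' match, so "oxides" finds "Oxides"."""
    return query.casefold() in (value or "").casefold()


def in_range(value, low, high):
    """True if the first number found in `value` lies within [low, high]."""
    match = _FIRST_NUMBER.search(value or "")
    return match is not None and low <= float(match.group()) <= high
```

A filter on applied pressure between 10 and 20 would then keep a cell such as "12 GPa" and drop empty cells.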

It would be a good idea, if possible, to also put a pagination menu above the table, and not only below it.

Validate if date is a timestamp type

If I mistakenly send the date as a String in a test, I have to delete it from MongoDB every time,
so I would like it to be validated on the back end.
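
A back-end check along these lines might be enough. It relies on the fact that pymongo stores Python `datetime` objects as BSON dates; the helper name `require_datetime` and the field name `date` are assumptions made for this sketch.

```python
from datetime import datetime


def require_datetime(value, field="date"):
    """Return `value` unchanged if it is a datetime, otherwise raise TypeError."""
    if not isinstance(value, datetime):
        raise TypeError(
            f"'{field}' must be a datetime, got {type(value).__name__}"
        )
    return value
```

Calling this before the insert would reject a string date with a clear error instead of writing it to the database.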

Correction workflow API

Add the following API calls:

  • PATCH/PUT for updating existing record
  • POST for adding a new record

Avoid updating obsolete records

When we try to update a record that is marked as obsolete, the process should stop and return an error saying that the only record that can be updated is the latest in the chain of updated records.
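
The guard could be as small as the sketch below. The exception class and function name are hypothetical; the `status` value `obsolete` is the one listed in the record flags summary.

```python
class ObsoleteRecordError(Exception):
    """Raised when an update targets a record that has been superseded."""


def ensure_updatable(record):
    """Raise ObsoleteRecordError if `record` is marked as obsolete."""
    if record.get("status") == "obsolete":
        raise ObsoleteRecordError(
            "Only the latest record in the chain of updated records can be updated"
        )
    return record
```

The API layer could catch this exception and turn it into an error response for the client.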

After editing a record the table should not refresh

When I edit a record and press "save", the modal closes and the record disappears further down in the table.

To keep it visible to the user, we should freeze the update of the table and leave the record where it is.

What are the keyboard shortcuts?

Keyboard shortcuts are not documented. 😸

We should add them in:

  • readme.md documentation
  • somewhere in the interface (a link to the documentation page, opened in a new tab/window, would be enough)

Empty records being treated as lowest value when records are sorted ascending

Currently, if you sort ascending (on any column), blank/empty records are treated as the ones with the lowest value. I don't know how trivial this is to solve, but I think it would be more useful if the empty records were always shown last.

For example, if you sort by applied pressure ascending, you get many pages of blank records.
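
One way to get this behaviour is to sort only the non-empty records and append the empty ones afterwards, as in the sketch below. The function name is hypothetical, and it assumes that the non-empty values in the column are mutually comparable (all numbers or all strings).

```python
def sort_empty_last(records, column, descending=False):
    """Sort `records` by `column`, keeping empty values at the end."""

    def is_empty(record):
        value = record.get(column)
        return value is None or value == ""

    filled = [r for r in records if not is_empty(r)]
    empty = [r for r in records if is_empty(r)]
    filled.sort(key=lambda r: r[column], reverse=descending)
    return filled + empty
```

Because the empty records are appended after sorting, they stay last in both ascending and descending order.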

Record flags summary

I'm opening a discussion issue to summarise all the flags that a record can use.
For the moment we have just:

  • type: manual, automatic
  • status: valid, invalid, obsolete (if a correction will not modify the original record), empty

Here is a draft schema:

image

Feed training data into label studio (exploration task)

The objective of this task is to explore the possibility of loading the pre-annotated training data directly from label studio (local instance running at http://falcon.nims.go.jp/label-studio).

Ideally the database (collection training-data) contains records with passage, hash (document id), span_id (identifier of the span involved in the correction), correction_span (whole span involved in the correction).

We want to pull this data out in the format described here.

The user has to call something like:
GET /training/format=labelstudio/grobid

which will:

  • pull out the training data that is flagged as "new"
  • transform the raw data into the labelstudio or grobid format
  • flag each record as 'in_progress'
  • add timestamp
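
The four steps above can be sketched as a single function. Everything named here is an assumption: the function name, the `status` and `exported_at` field names, and the `transform` callback, which stands in for the Label Studio or Grobid conversion that this sketch does not attempt.

```python
from datetime import datetime, timezone


def export_training_data(records, transform, now=None):
    """Export records flagged as "new" and mark them as in progress.

    `transform` converts one raw record into the target format; the field
    names `status` and `exported_at` are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    exported = []
    for record in records:
        if record.get("status") != "new":
            continue  # skip records that were already exported
        exported.append(transform(record))
        record["status"] = "in_progress"
        record["exported_at"] = now
    return exported
```

Passing the conversion as a callback keeps the selection and flagging logic identical for both output formats.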

Comments about interface

Here are my comments on the SuperCon 2 "automatically extracted database" tabular view.

  • It is very good to have a choice of columns.

Add "back" button

I don't know where to place it.. for now the back/forward of the browser was enough...

Add new record

There is a button somewhere; then two things can happen:

  • a new row is shown and the user can edit it by adding information. In this case, what should we do if not all columns are shown?
  • a new modal is used with all the fields

Points of attention:

  • the user needs to select the sentence, which is pulled from the JSON documents (a simple idea is to let the user select a sentence from a dropdown; however, there might be tons of sentences, so a filter-while-typing could be more usable)

add export function (e.g. export JSON, XML so on) on /automatic_database

The export should be revised.

The previous export could be kept as a "download the current table" functionality. I'm wondering whether we should remove Excel to discourage people from using it.

The functionalities we need are as follows:

  • download all records of all papers containing at least one record with applied pressure
  • download all records of all papers containing space groups
  • TBD

"Remove record" functionality

We should also add functionality to remove records. Flag/unflag might not be enough for this.

  • add functionality in the API (removed records should have status:"removed")
  • implement functionality in the interface

Group by document

We would like to show only the records that belong to the same document.

This will make managing the records simpler than having everything together, especially as the number of records increases.
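
On the data side, the grouping could be sketched as below. The function name is hypothetical; it assumes each record carries the document identifier in a `hash` field, as the training-data records described elsewhere in these issues do.

```python
from collections import defaultdict


def group_by_document(records, key="hash"):
    """Group records by document id, keeping the order within each document."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(key)].append(record)
    return dict(groups)
```

The interface could then render one group at a time, or filter the table by a single document id.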
