
roadmap: preparing data model (inspirehep/inspire-next)

Comments (17)

cleggm1 commented on August 20, 2024

Experiments use case - displaying current and previous spokespeople:

A user supplies information for us to update the current spokesperson for an experiment. The current display shows all spokespeople with no differentiation. The user comments that one of the spokespeople is no longer a spokesperson. While we store the information on who is the current spokesperson and the dates of this status, the display is ambiguous.

Current display:
Spokesperson: Shutt, Thomas Alan; Nelson, Harry N.

Current MARC:
001262631 702__ $$aShutt, Thomas Alan$$e2014$$iINSPIRE-00261399
001262631 702__ $$aNelson, Harry N.$$d2014$$iINSPIRE-00110832$$zCurrent

from inspire-next.

kaplun commented on August 20, 2024

Hi Melissa, this issue is specifically about modelling what we store (and hence curate), not how we display it. Could you add it as a new issue?


kaplun commented on August 20, 2024

For those wishing to help me with the data model, please have a look at:

To see statistical information about current MARC usage for each record type see the corresponding files in https://github.com/inspirehep/inspire-next/tree/master-elasticsearch/inspire/dojson/current_marcxml_usage

To contribute, you can create a dedicated GitHub issue and refer to this issue, #265.


fschwenn commented on August 20, 2024

I have a few (first) comments on the HEP model:

Should the enums really be inside the schema? For ISBNs it might be OK for
the next few years, but for publishers it's rather a dynamic knowledge base.
Many of the enums might look as if they are sufficient, but panta rhei.

It's a bit dangerous to start from the existing data model. We might end up
with the same problems we have now. We definitely should have a look at all
the cases which drove us to despair within the existing data model ;-)

"funding_info"
FIXME: Do we care about this? So far only 349 records were tagged and all
for a single EU project.

This is a political decision for DIR. From the curation point of view, funds
are a nightmare because in most cases they are just somewhere in the
fulltext but not in the "usual metadata".

"isbn"
FIXME: this really needs to be an enum and cleaned up. What is Print?!

Print is the generic term for hardcover and paperback.

"abstract"
FIXME: is there an enumerable list of sources?

No. New sources can pop up at any time.

"abstract"

Do we want to have all the abstracts of different arXiv-versions? If yes, we have
to know which is the most recent.

"imprint"
FIXME: an enum?

No. See above.

"titles"

Maybe we should also have "language" there.

"thesis_supervisor",

I think it should use the same object as "authors".

"thesis"
FIXME: shall we match these with the institution database? I guess so.

Yes! "university" should be the same object as "affiliation".
There will also be special cases where degrees are not bestowed by a university.

"publication_info"

"journal_series" could be added (e.g. for Nuovo Cimento)

"publication_info", "conference_paper_info"
FIXME: This is currently the CNUM

I guess we should keep two: the cnum and a free text field.

"publication_info", "page_range"
FIXME: for ejournals this could be the page index, but there is no
reliable way to know whether something is a page index or a first
page, is there?

We need two different entries: "page_range" AND "article ID". There are
cases where both exist at the same time! Of course it is not
possible to make this distinction retroactively for all records. But with
little effort one could do it for a very large fraction.

"publication_info"
FIXME: Shall we split conference information away?

No. I would very much like to have all publication infos together.

"publication_info"
FIXME: shall we move the DOI and ISBN next to where it belongs? So that we can also
align erratum and friends?

+1. I would very much like to have all publication infos together.

"publication_info", "year"

Can in fact be more than one year:
http://cis01.central.ucv.ro/pauc/vol/1994_1995_4_5/1994-1995_92-99.pdf

"reference"

I would like to keep several lists of references, typically arXiv
and publisher, as for the abstract.

"reference", "report_number"

Should we make an extra entry for arXiv?

"copyright"
FIXME: should we restrict this to an enum, or not?

Again, an enum is in principle a good way to enforce a unique spelling of a publisher,
but the list should be easily extensible.

"thesaurus_terms"
FIXME: What... is... that!?

"energy_range": { "maximum": 8, "description" : "It encodes the energy of the
experiment or reaction; the energy is below "10**(energy_range / 2) GeV" for
energy_range < 7, below 10 TeV for energy range 7, or above 10 TeV for 8" }
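
Read literally, the description above maps each code to an upper energy bound. The following is a minimal Python sketch of that reading, not code from the schema: the function name `energy_upper_bound_gev` is hypothetical, and the lower end of the code range (assumed here to be 0) is not stated in the thread.

```python
from typing import Optional


def energy_upper_bound_gev(energy_range: int) -> Optional[float]:
    """Return the upper energy bound in GeV for an energy_range code.

    Returns None for code 8, which the description defines as
    "above 10 TeV" and which therefore has no upper bound.
    """
    # Assumption: valid codes run from 0 to the schema's "maximum" of 8.
    if not 0 <= energy_range <= 8:
        raise ValueError("energy_range must be between 0 and 8")
    if energy_range < 7:
        # Codes below 7 follow the formula 10**(energy_range / 2) GeV.
        return 10 ** (energy_range / 2)
    if energy_range == 7:
        # Code 7 is special-cased as "below 10 TeV" (10 000 GeV).
        return 10_000.0
    return None
```

Code 7 has to be special-cased because the formula alone would give roughly 3162 GeV rather than 10 TeV.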

"thesaurus_terms"

It might be useful to distinguish INSPIRE keywords from INSPIRE reactions.
Should authors' or publishers' keywords also be stored here? I could not find
a replacement for 6531.

"experiment"
Was the experiment actually proofchecked by a cataloguer?

Yes.

"arxiv_eprints"

"pattern": "\d{4}.\d{4}{5}|\w+-+/\d+" could be even
"pattern": "\d{4}.\d{4}{5}|\w+-+/\d{7}", right?

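The two identifier shapes discussed for "arxiv_eprints" can be checked with a regular expression along these lines. This is a sketch, not the schema's actual pattern: it assumes new-style identifiers of the form YYMM.NNNN or YYMM.NNNNN and old-style identifiers of the form archive/YYMMNNN with an optional two-letter subject class (for example hep-th/9901001 or math.GT/0309136). The names `ARXIV_ID` and `is_arxiv_id` are hypothetical.

```python
import re

# Assumed identifier shapes; the real schema pattern may differ.
ARXIV_ID = re.compile(
    r"""
    \d{4}\.\d{4,5}                            # new style: YYMM.NNNN(N)
    |
    [a-z]+(?:-[a-z]+)*(?:\.[A-Z]{2})?/\d{7}   # old style: archive[.SC]/YYMMNNN
    """,
    re.VERBOSE,
)


def is_arxiv_id(value: str) -> bool:
    """Return True if the whole string looks like an arXiv identifier."""
    return ARXIV_ID.fullmatch(value) is not None
```

Note that the dot is escaped here: an unescaped `.` in a pattern matches any character, so it would also accept strings such as 1501x00001.
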
"authors", "name"

"format": ".+, .+" is too restrictive:
http://www.ihep.ac.cn/english/conference/icrc2011/paper/proc/v9/v9_1348.pdf

For "email" the format could be a bit more restrictive, something
like '.@..[a-z]+'

"citeable"
FIXME: can this be derived from other properties?

"url"

"size" in which units?


aw-bib commented on August 20, 2024

Hi!

Should the enums really be inside the schema?

I came to this question as well. My gut feeling is usually that an authority link scales better and is easier to maintain. I also understood (from the discussion in inveniosoftware/dojson#23) that changes in the schema are technically a database conversion. However, I understand that this is how dojson works. I'm not sure that I like this part yet.

For ISBNs it might be OK for
the next few years, but for publishers it's rather a dynamic knowledge base.
Many of the enums might look as if they are sufficient, but panta rhei.

Publishers will be difficult, indeed, I think.

It's a bit dangerous to start from the existing data model. We might end up
with the same problems we have now. We definitely should have a look at all
the cases which drove us to despair within the existing data model ;-)

"funding_info"
FIXME: Do we care about this? So far only 349 records were tagged and all
for a single EU project.

This is a political decision for DIR. From the curation point of view, funds
are a nightmare because in most cases they are just somewhere in the
fulltext but not in the "usual metadata".

+1 for the curators' point of view. However, as the EU is mentioned, given the OpenAIRE context etc...

"isbn"
FIXME: this really needs to be an enum and cleaned up. What is Print?!

Print is the generic term for hardcover and paperback.

For ISBN there should probably be something like "formally known to be wrong". There are quite a few ISBNs with broken checksums out there. OK, this makes the checksum senseless for those, but it might be a good idea to check ISBNs based on the checksum unless one explicitly knows that it is wrong. This could also help cataloguers, if it's checked upon input. For the checksum, see https://en.wikipedia.org/wiki/International_Standard_Book_Number#ISBN-10_check_digits
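
The checksum test suggested above, combined with an explicit "known to be wrong" override, might look like the following sketch. The function names and the `known_invalid` flag are hypothetical; the check-digit rules are the standard ISBN-10 (weighted sum modulo 11) and ISBN-13 (alternating weights 1 and 3, modulo 10) ones.

```python
ASCII_DIGITS = "0123456789"


def isbn_checksum_ok(isbn: str) -> bool:
    """Return True if the ISBN-10 or ISBN-13 check digit is consistent."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) == 10:
        body, check = chars[:9], chars[9]
        if not all(c in ASCII_DIGITS for c in body):
            return False
        if check not in ASCII_DIGITS and check != "X":
            return False
        # "X" stands for the value 10 and is only valid as the check digit.
        values = [int(c) for c in body] + [10 if check == "X" else int(check)]
        # Weights run from 10 down to 1; the weighted sum must divide by 11.
        total = sum(w * v for w, v in zip(range(10, 0, -1), values))
        return total % 11 == 0
    if len(chars) == 13:
        if not all(c in ASCII_DIGITS for c in chars):
            return False
        # Weights alternate 1, 3, 1, 3, ...; the sum must divide by 10.
        total = sum((1 if i % 2 == 0 else 3) * int(c) for i, c in enumerate(chars))
        return total % 10 == 0
    return False


def accept_isbn(isbn: str, known_invalid: bool = False) -> bool:
    """Accept an ISBN if its checksum holds, or if a cataloguer has
    explicitly marked it as printed with a wrong checksum."""
    return known_invalid or isbn_checksum_ok(isbn)
```

With this shape, input validation rejects mistyped ISBNs by default, while a cataloguer can still record an ISBN that is wrong as printed.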

"abstract"

Do we want to have all the abstracts of different arXiv-versions? If yes, we
have to know which is the most recent.

Sounds like a version field.

"imprint"
FIXME: an enum?

No. See above.

+1. Enum would not work IRL

"thesis_supervisor",

I think it should use the same object as "authors".

Probably it is "personal name" + a role field (1001_ $a + $e). Probably $e as enum. (Though we also tend to move from enum to authorities @join2.)

"thesis"
FIXME: shall we match these with the institution database? I guess so.

Yes! "university" should be the same object as "affiliation".
There will also be special cases where degrees are not bestowed by a
university.

In authors the description for affiliation reads "as it appears on the paper". This would be a non-normalized string. Probably something to rethink.

"publication_info"

"journal_series" could be added (e.g. for Nuovo Cimento)

Depends on whether you treat this as part of the title. I.e., is "Physical Review / D" the title or "Physical Review" series: "D". It's a decision.

"publication_info", "page_range"
FIXME: for ejournals this could be the page index, but there is no
reliable way to know whether something is a page index or a first
page, is there?

We need two different entries: "page_range" AND "article ID". There are
cases where both exist at the same time! Of course it is not
possible to make this distinction retroactively for all records. But with
little effort one could do it for a very large fraction.

Is it worthwhile to consider "start page" / "end page"? I admit that I tend to treat article numbers as "start page" for practical purposes.

"publication_info"
FIXME: shall we move the DOI and ISBN next to where it belongs? So that we can also
align erratum and friends?

+1. I would very much like to have all publication infos together.

This could get complex. Book series, journals, conferences, multivolumes, publishers and places... Multivolume books in a book series being the special issue of a journal. My gut feeling is to split it into logical chunks.

"publication_info", "year"

Can in fact be more than one year:
http://cis01.central.ucv.ro/pauc/vol/1994_1995_4_5/1994-1995_92-99.pdf

Also quite common for theses published as books later on (if those records are merged on inspire, not sure).

"reference"

I would like to keep several lists of references, typically arXiv
and publisher, as for the abstract.

Sounds like source subfield.

For licence one could consider offering some common ones as suggestions (CC licences come to mind). Is something like an enum with a free-form value possible?

Kind regards,

Alexander Wagner

Deutsches Elektronen-Synchrotron DESY
Library and Documentation

Building 01d Room OG1.444
Notkestr. 85
22607 Hamburg



fschwenn commented on August 20, 2024

"publication_info"

"journal_series" could be added (e.g. for Nuovo Cimento)

Depends on whether you treat this as part of the title. I.e., is "Physical Review
/ D" the title or "Physical Review" series: "D". It's a decision.

Phys.Rev.D for me is the journal, but for Nuovo Cimento A you have something like "Series 10" and "Series 11" with the same volume numbers within the series.


kaplun commented on August 20, 2024

Removing milestone since this is no longer a blocker for Enabling search. It just needs to be polished little by little.


jalavik commented on August 20, 2024

@annetteholtkamp mentioned to me that it could be a good idea to have a "raw" affiliations field in the data model and use "value" for the transformed value. We seem to have both raw and treated affiliations in the same field now.

I cannot see any "raw" field in the author either, but there is raw_reference in references. Shall we decide on a general direction for this? E.g., shall we add raw fields like this?

"affiliations": {
    "uniqueItems": true,
    "items": {
        "type": "object",
        "properties": {
            "curated_relation": {
                "type": "boolean",
                "description": "Did a cataloguer proof-check the recid?",
                "title": "The affiliation is curated?"
            },
            "recid": {
                "type": "integer",
                "description": "Record ID in the Institution collection",
                "title": "Record ID of institution"
            },
            "value": {
                "type": "string",
                "description": "The transformed affiliation",
                "title": "Name of institution"
            },
            "raw": {
                "type": "string",
                "description": "The affiliation as it appears on the paper or original import",
                "title": "Name of institution"
            }
        },
        "title": "Affiliation"
    },
    "type": "array",
    "title": "Affiliations"
}


aw-bib commented on August 20, 2024

@jalavik there was some discussion under Sam's preliminary name of "gigantic workflow". I think the decision on @annetteholtkamp's point depends on the decision for this workflow.


bing13 commented on August 20, 2024

Retaining the original strings in an easily accessible form is a good safeguard against unforeseen future needs. Storage is cheap, labor is scarce.


kaplun commented on August 20, 2024

👍 (of course case by case). I like the idea of standardizing on a value that is supposedly normalized against an external reference (e.g. affiliation against the institution DB, conference ID against the conf DB) vs. raw. This way the model can be predictable:

raw: raw original string
value: normalized value against reference DB
recid: recid of the corresponding DB (1 to 1 with value)
corresponding linked record.

"record": {
    "$ref": "http://inspirehep.net/foo/123"
}
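
The predictable raw/value/recid/record shape described above can be sketched as a small Python structure. The class name `NormalizedField`, its `to_json` helper, and the example strings are hypothetical; the `$ref` URL is the placeholder from the comment, not a real endpoint.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class NormalizedField:
    """A value kept both as received and as normalized against a reference DB."""

    raw: str                          # original string, always retained
    value: Optional[str] = None       # normalized value, if matching succeeded
    recid: Optional[int] = None       # record ID in the reference DB
    record_ref: Optional[str] = None  # URL of the linked record

    def to_json(self) -> Dict[str, Any]:
        """Serialize, omitting the normalized parts that are not yet known."""
        doc: Dict[str, Any] = {"raw": self.raw}
        if self.value is not None:
            doc["value"] = self.value
        if self.recid is not None:
            doc["recid"] = self.recid
        if self.record_ref is not None:
            doc["record"] = {"$ref": self.record_ref}
        return doc


# An entry that has been matched, and one that still only has its raw string.
matched = NormalizedField(
    raw="Example Inst., Somewhere",
    value="Example Institute",
    recid=123,
    record_ref="http://inspirehep.net/foo/123",
)
unmatched = NormalizedField(raw="Unknown Lab")
```

Keeping `raw` mandatory and everything else optional means an unmatched entry is still a valid entry, which leaves room for a later matching attempt.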


salmele commented on August 20, 2024

An additional way to look at this would be to allow normalization against more than one source, retaining a pointer to each. An example would be normalizing an institution against THREE sources: ISNI, recording that external ID; GRID.ac, recording that external ID; and whatever we have at that moment in time as the INSPIRE institution DB, retaining the recid.

In addition, for this particular example, we'd keep the raw string so that we can try at a later stage, programmatically, to normalize those we failed to get right upon ingestion against some of those external sources, as they increasingly add more institutes.


annetteholtkamp commented on August 20, 2024

Is it worthwhile to multiply the IDs in the HEP records? I'd think one would be sufficient; the others you may get via lookup in the inst collection. Or are you thinking of those cases where mapping to different standards may return different results?

Annette



salmele commented on August 20, 2024

The latter.

It is a bit like today, where for an author we'd store INSPIRE ID, BAI, ORCID, GoogleScholar and whatnot. It might be that for an affiliation in a paper we'd have a hit which resolves in one service (e.g. ISNI) but not in another (e.g. GRID.ac), and we ourselves would have yet another way to say things in our own DB.

Mind that I'm not advocating we'd do it this way, but I'm advocating that it might be appropriate to have this at the individual record level, as look-ups might fail.


aw-bib commented on August 20, 2024

Mind that I'm not advocating we'd do it this way, but I'm advocating that it might be appropriate to have this at the individual record level, as look-ups might fail.

Wouldn't some kind of authority record that lives locally and serves these lookups be better than storing n+1 IDs at the bibliographic level? For search, one should be able to expand IDs from this authority record. Especially considering that a new ID might come along as time passes, one would then just need to update one record and not all bibliographic ones.
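
A sketch of that idea: the bibliographic record stores only the local recid, and external identifiers are expanded from the authority record at lookup or search time. All names, the recid, and the identifier values below are hypothetical placeholders, not real ISNI or GRID identifiers.

```python
from typing import Dict

# Hypothetical local authority records, keyed by recid.
AUTHORITY_RECORDS: Dict[int, Dict[str, Dict[str, str]]] = {
    123: {
        "external_ids": {
            "ISNI": "isni-placeholder",
            "GRID": "grid-placeholder",
        }
    }
}


def expand_ids(recid: int, authorities=AUTHORITY_RECORDS) -> Dict[str, str]:
    """Return all external IDs known for the authority record `recid`."""
    record = authorities.get(recid)
    if record is None:
        return {}
    # Copy so that callers cannot modify the authority record by accident.
    return dict(record["external_ids"])


# A bibliographic record only needs to carry the local recid.
bibliographic_affiliation = {"raw": "Example Inst., Somewhere", "recid": 123}

# When a new identifier scheme comes along, only the authority record changes.
AUTHORITY_RECORDS[123]["external_ids"]["NEW_SCHEME"] = "new-placeholder"
ids_for_search = expand_ids(bibliographic_affiliation["recid"])
```

The trade-off raised in the thread remains: this only works for entries whose recid was matched, so the raw string is still needed for the ones where the lookup failed.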


kaplun commented on August 20, 2024

👍

If lookups are done using the raw string, then we should simply work on improving our tools that perform the automatic matching against the authority records (e.g. for journal records we store all the name variations, so if a HEP record is published in a journal that we can't match, this should raise a flag for a cataloguer to inspect the issue).

If we perform lookup via IDs (because the publisher has provided them), then this should work because we should maintain our authority records aligned with external DBs such as GRID.ac etc.
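
The raw-string path could be sketched as follows: match the raw string against every name variant stored in the authority records, and flag anything unmatched for a cataloguer. The journal data, the recid, the normalization rule (case-folding and collapsing whitespace), and the function names are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical journal authority records: recid -> known name variants.
JOURNAL_VARIANTS: Dict[int, List[str]] = {
    1: ["Phys.Rev.D", "Physical Review D", "Phys. Rev. D"],
}


def _normalize(text: str) -> str:
    """Case-fold and collapse runs of whitespace to a single space."""
    return " ".join(text.casefold().split())


def build_index(authorities: Dict[int, List[str]]) -> Dict[str, int]:
    """Map every normalized name variant to its authority recid."""
    index: Dict[str, int] = {}
    for recid, variants in authorities.items():
        for variant in variants:
            index[_normalize(variant)] = recid
    return index


def match_raw(raw: str, index: Dict[str, int]) -> Tuple[Optional[int], bool]:
    """Return (recid, needs_cataloguer) for a raw journal string."""
    recid = index.get(_normalize(raw))
    return recid, recid is None


index = build_index(JOURNAL_VARIANTS)
known = match_raw("physical  review d", index)
unknown = match_raw("Some Unlisted Journal", index)
```

Here `known` resolves to the authority record, while `unknown` comes back flagged; once a cataloguer adds the missing variant to the authority record, the same raw string matches on the next run.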


kaplun commented on August 20, 2024

Closing this as it is nowadays superseded by several dedicated issues.

