pastaplus / pasta Goto Github PK

Repository for the Provenance Aware Synthesis Tracking Architecture (PASTA) project.

HTML 1.95% CSS 13.47% JavaScript 5.54% Shell 0.20% Batchfile 0.43% Java 68.68% XSLT 9.63% Perl 0.11%

pasta's Introduction

PASTA

Repository for the Provenance Aware Synthesis Tracking Architecture (PASTA+) project. This is a project of the Environmental Data Initiative and the University of New Mexico.

pasta's People

Contributors

Stargazers

Watchers

Forkers

asu-ke-web-services jon-ide clnsmth

pasta's Issues

PASTA does not add "function='download'" attribute to data entity distribution/online element

From Corinna Gries on 20211016:

Hi @marco, I went back to check where this idea came from that the attribute is missing from our metadata files. In this very recently uploaded EML file they are missing: https://portal.edirepository.org/nis/metadataviewer?packageid=edi.1008.1&contentType=application/xml , So, this may need some investigations, and maybe there are some communications problems between the EML coming out of EAL/ezEML and PASTA. This last one was generated in ezEML.

PASTA should ensure that the function="download" attribute is always present for every data entity in the Level-1 EML document regardless of whether it does or does not exist in the Level-0 EML.

Determine if the ECC checks for case matching between EML attribute names and table column names

Determine if the ECC checks for case matching between EML attribute names and table column names. If not, pursue adding a new quality check that performs a EML attribute name/table column name parity check.

See table BisleywklyRain-Throughfall1988-2015.csv in https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-luq&identifier=148&revision=1213903 where the EML attributeName is Date and the table column name is DATE.

Inject publication date and publisher into Level-1 EML

PASTA should inject the date of upload into the publication date element and "Environmental Data Initiative" as the publisher into the corresponding Level-1 EML document of all data packages.

Linking to data in PASTA from Google Sheets fails

Mary Martin reports that linking to data in PASTA from Google Sheets fails when executing the IMPORTDATA function with a PASTA data URL:

On Tue, Mar 9, 2021 at 10:53 AM Mary Martin [email protected] wrote:

Mark,

In a meeting with teachers, they mentioned that in some classrooms, the only access to spreadsheets is via chromebook/googlesheets.
I was playing around with this, and using the command below to read in a file via url. This works fine for a file on my local website, but when I use the url to load a datafile from EDI, I get the error message below. Is there something that blocks access from google sheets?

The EDI datafile url works in a browser and with curl, but not wget. so maybe some restrictions on the source of the request?

-Mary

=IMPORTDATA("http://hbrsensor.sr.unh.edu/data/hbef_shiny_soilT.csv")
datafile reads into google sheet successfully

=IMPORTDATA("https://pasta.lternet.edu/package/data/eml/knb-lter-hbr/2/11/1254d17cbd381556c05afa740d380e78")
Could not fetch url: https://pasta.lternet.edu/package/data/eml/knb-lter-hbr/2/11/1254d17cbd381556c05afa740d380e78

handle keywords that include colons?

C:N (carbon to nitrogen ratio) is a variable measured by biologists/ecologists. Querying for that yields "Undefined field C" message, since the colon separates field from value in the Lucene engine.
https://pasta.lternet.edu/package/search/eml?q=C:N&fl=*

Note that eDisMax handles this more gracefully and returns results.
https://pasta.lternet.edu/package/search/eml?q=C:N&fl=*&defType=edismax

Full metadata page should include link back to the data package summary landing page

The full metadata display page should include a link back to the data package summary landing page. This is feature would be most useful for those who send the full metadata display page as a URL link, which makes the browser "back" functionality inoperable.

Update LevelOneEMLFactory.addDefaultIntellectualRights for EML 2.2.0

The addDefaultIntellectualRights method in the LevelOneEMLFactory class inserts a new intellectual rights element if one does not already exist. To do so, it must determine the correct location for insertion by testing the presence of surrounding optional elements. The element presently tested do not include new elements from EML 2.2.0 - these need to be added to prevent the insertion of the intellectual rights element in an incorrect schema location.

Ensure all data entity data distribution URLs have the function attribute set to download

Ensure all data entity data distribution URLs have the function attribute set to download when creating the Level-1-EML metadata for the data package. This will improve FAIR analysis of the data package when evaluated by EDI and other 3rd party systems.

Design and implement a data package change log for data packages with more than one revision

On Fri, Mar 19, 2021 at 9:57 AM Bashevkin, Sam@DeltaCouncil [email protected] wrote:

Hi Sam,

We don't currently support a data package change log in the EDI Data Portal, but that is a great idea! We'll have to ponder a bit to get the details just right.

As a data publisher, I would love a mechanism to highlight the changes from the last version, whether it’s just adding additional rows with the most recent data, adding additional variables, or changing the structure in some way.

For an effective data package change log, it seems like we would have to (as you say above) rely on both the metadata provider to clearly indicate what has changed between versions, and then perhaps perform a diff on both the data and metadata to obtain the literal changes of each object. Unfortunately, there is not a dedicated change log section of EML that I am aware of that a data publisher could record this type of information (maybe either or a method step for unstructured text or a custom structure in ). I will add the general idea of a data package change log as an enhancement ticket in our git repository.

Thank you for the suggestion, and keep them coming!

Sincerely,
Mark

Mark Servilla
[email protected]

On Fri, Mar 19, 2021 at 9:57 AM Bashevkin, Sam@DeltaCouncil [email protected] wrote:
Hello,

I was wondering if EDI had support for a changelog for updated datasets to record all changes to the dataset from the last version. I’m thinking of something analogous to those used in software updates, like this: https://cran.r-project.org/web/packages/dplyr/news/news.html. If this does not yet exist (and I’ve been unable to find it), I would like to request that it be considered as a new feature.

As a data user, integrator, and publisher I think this would be immensely useful and improve the safe use of published data. The biggest problem with maintaining integrated datasets are changes to the underlying datasets which are almost never documented. Each time a dataset is updated, it is a very time-consuming process to ensure the formatting has not changed and my integration code is accurately representing the data structure. For example, I just learned of one EDI dataset that changed units from meters to feet from one update to the next. This is something that would be so much easier to deal with if all changes were documented. As a data publisher, I would love a mechanism to highlight the changes from the last version, whether it’s just adding additional rows with the most recent data, adding additional variables, or changing the structure in some way.

Thank you,

Sam

Usage citation in EML 2.2.0 should also update the DPM citation database table

EML 2.2.0 now supports the <usageCitation> element to declare "A citation to articles or products in which the dataset is used or referenced." This element has parity with the Data Portal citation entry UI and should populate the datapackagemanager.journal_citation table in the Data Package Manager service just as the interactive UI. Population of this table would occur during the normal data package upload archive process.

API Documentation: Get Audit Count returns an integer, not application/xml

A very minor issue in the documentation with the Get Audit Count operation. The docs say a successful response will return application/xml, but it returns an integer value.

Data downloads from services like "Box" fail

Kyle Zollo reported that data links within the distribution/online/url element fail when pointing toward services like "Box":

Kyle Zollo 2:53 PM
@marco I'm having trouble uploading a large-data data package. I'm attempting to use the same methods we did last Thursday with the CCE package, but I am getting an error on urlReturnsData telling me that the url returns html (attached). Here is one of the download links I am using from Box https://uwmadison.box.com/shared/static/gq6kc3dwsrmg7qdzht19yh62jmbr1c6k.csv . It works fine on my machine.

ECC behavior differs between evaluation and upload

According to Sven Bohm:

I did notice another difference on portal-s. It allowed me to evaluate the package, but then when I tried to upload it (correctly) failed, because my temporal coverage started in the year 0000 ;) I would have expected it to fail both on the evaluate and upload.

Support "licensed" element in EML 2.2.0 to work with "intelectualRights"

EML 2.2.0 introduced the licensed element, which supplants the use of intellectualRights for declaring a data license agreement for the data package. PASTA+, however, relies on the presence of an intellectualRights element to either not add or add a default and standard license to the document - add if intellectualRights is not present. Conditional logic should be added into the automatic addition of an intellectualRights so that a default license statement is added through an intellectualRights element if and only if both the licensed and intellectualRights elements do not exist.

end_date values in working_on database appear to overwrite previously completed values for same PID

The most recent completion date/timestamp of a data package evaluation or upload appears to overwrite all other values in the end_date column of the datapackagemanager.working_on table for the same PID. See below:

The update query should confirm that the update only occur on table entries where the end_date is null.

Add query by data entity hash value

Each data entity has both an MD-5 and SHA-1 hash value generated at the time of upload - this hash value is recorded in the resource registry. In addition, the corresponding EML metadata may or may not contain the hash value assigned to the //physical//authnetication element.

Carl Boettiger has requested that a new API method be added to query on the hash values that would return a tuple containing a URI to the data entity and a URI to the corresponding EML metadata describing the data entity.

See below for the copy of the slack discussion:

Carl Boettiger  10:44 AM
Ideally it returns both something that lets me find the parent data package (with all the metadata, etc), and something that lets me download the actual data that matches the hash.

Mark Servilla  10:45 AM
as a simple text tuple, an XML data structure, or a JSON data structure? sorry for the inquisition

Carl Boettiger  10:46 AM
no worries. anything is okay for me, slight preference for JSON.   digging for some examples...
10:46
e.g. I'd say this is the 'equivalent' in Zenodo: https://zenodo.org/api/records/?q=_files.checksum:%22md5:eb5e8f37583644943b86d1d9ebd4ded5%22
10:47
but I'd be fine a text tuple, could have the request for the download URL and the request for the package id be different requests.

Mark Servilla  10:48 AM
let me see what i can come up with

Carl Boettiger  10:48 AM
cool.  meantime I think we can get all the same info via DataONE CDN though now, https://cn.dataone.org/cn/v2/query/solr/?q=checksum:b5bd61c3cb614c3486c81b88c6f6f29c[…],checksum,checksumAlgorithm,replicaMN,dataUrl&rows=10&wt=json

Mark Servilla  10:49 AM
our current API is from before JSON became popular, but should be able to add the JSON accept type

DataPackageManager: Entity names with & encoding causes database create exception

Entity names that contain an "&" XML encoded "&" results in a database create exception during the congruency check. For example:

<entityName>Physical & Chemical Limnology of Lake Kegona and Lake Waubesa</entityName>

results in the following PostgreSQL exception:

2021-10-15 14:45:17.490 MDT [29682] pasta@pasta ERROR:  syntax error at or near "&" at character 23
2021-10-15 14:45:17.490 MDT [29682] pasta@pasta STATEMENT:  CREATE TABLE Physical_&_Chemical_Limnology_("lakeid" TEXT,"year4" FLOAT,"daynum" FLOAT,"sampledate" TIMESTAMP,"depth" FLOAT,
"rep" TEXT,"sta" TEXT,"wtemp" FLOAT,"o2" FLOAT,"o2sat" FLOAT,"ph" FLOAT,"phair" FLOAT,"alk" FLOAT,"totnuf_sloh" FLOAT,"no3no2_sloh" FLOAT,"nh4_sloh" FLOAT,"kjdl_n_sloh" FLOAT,"totpuf_s
loh" FLOAT,"drp_sloh" FLOAT,"drsif_sloh" FLOAT,"cl" FLOAT,"so4" FLOAT,"flagdepth" TEXT,"flagwtemp" TEXT,"flago2" TEXT,"flago2sat" TEXT,"flagph" TEXT,"flagphair" TEXT,"flagalk" TEXT,"fl
agtotnuf_sloh" TEXT,"flagno3no2_sloh" TEXT,"flagnh4_sloh" TEXT,"flagkjdl_n_sloh" TEXT,"flagtotpuf_sloh" TEXT,"flagdrp_sloh" TEXT,"flagdrsif_sloh" TEXT,"flagcl" TEXT,"flagso4" TEXT)

Support Boolean operators in Data Portal searches

Consider supporting Boolean operators (i.e. AND, OR, NOT, "", ()) in the Data Portal search interface as these methods are more commonly understood among potential data users than the manual construction of Solr search queries.

User data upload scenario

3-step instructions for data submission to the EDI data repository, using EDI data publishing services

SUBMIT DATA

Follow these 3 steps for each dataset that you intent to publish:

Enter metadata into EDI’s Metadata Template or use the ezEML web interface.
Upload metadata and data (up to … GB) from local computer to EDI after logging in to your account. For files larger than … GB, please contact EDI.
Provide contact name and email address.

Response:

EDI will publish your data within 48 hours and/or get back to you with any questions. A confirmation email with the data
package DOI and URL will be sent to the contact email address.
If you have any questions or need help getting started, contact EDI: [email protected]

Add EDI ROR to "publisher/userId" element to the Level-1 EML

Add ROR (Research Organization Registry) identifier into the <publisher>/<userId> responsibleParty field of the Level-1 EML. The field should use the following pattern:

<userId directory=" https://ror.org">0330j0z60</userId>

This is related to #22.

Create datetime mappings from EML accepted formats to those compatible with PostgreSQL

Create datetime mappings from EML accepted formats to those compatible with PostgreSQL's TO_TIMESTAMP() function. See here for relevant PostgreSQL datetime format fields:

Remove LTER LDAP from authentication workflow

The LTER LDAP will be deprecated in Fall 2020 by the NCEAS LNO. As such, Gatekeeper authentication should no longer support LDAP binds to the LTER LDAP.

Enable user-applied data entity embargoes

Enable user applied data entity embargoes such that a user may enable a temporary embargo on data entities at the time of a data package upload (as opposed to a manually invoked embargo from an EDI administrator). See relevant suggestions from Sarah Elmendorf per the LTER IM Virtual Water Cooler held on 8 March 2021:

It would be nice if there were a button in the EDI portal interface to embargo data so it's not such a manual process and that data don't have to be released (however briefly) before being embargoed. Sometimes people are really paranoid about their data!

Nice feature would be to submit data under embargo with a self-expiring embargo date. That was the reward for laziness/forgetfulness is open data not permanently embargoed data. (probably would need the feature that you can change that expiry date to later or earlier as needed)

Sometimes reviewers need access to the data for the review. Would it be possible to set up a system where this would be possible for currently embargoed data? We think this feature may be available at the ORNL DAAC based on some review experience but can't remember the details. One catch is that you don't want to make reviewers log in with their real credentials (username) or that blows the anonymous review process as well.

Provide better indications of ERROR and WARNING occurrences in the quality report

From Renée F. Brown 5:20 PM (Slack)

Hi all, I’m currently working with some large and multi-table longterm datasets and the evaluation reports are fairly long as a result. In some cases, the eval takes so long that I’m provided a link to check the status later. It would help me if I could search on the word “error” to see if there were any Status errors in the report (particularly in cases where the processing is linked out because I can’t see right away whether the eval failed or not). However, searching on “error” doesn’t work because every quality check has the line “On failure: error.” The red color code is nice, but can be hard to spot in these complex datasets with hundreds of checks. I’m wondering if it might be possible to tweak the wording somehow re: “on failure”? I suppose I will likely run into similar issues with trying to search the “warn” statements once I move out of DEIMS… Finally, it would also be nice to see whether or not the eval failed at the very top of the report. If that statement were there (or something similar to the table that is shown in the view eval/update results page), it would be easier to know if I need to look for errors (or warnings) in the report.

Non-authenticated evaluation returns 500 Internal Server error instead of 401 Unauthorized error

Due to a change in the access control of the "evaluate" method, non-authenticated users (i.e., public/anonymous) are seeing a 500 Internal Server error message, although the message text clearly states User public is not authorized to execute service method evaluateDataPackage. Non-authenticated users should see a 401 Unauthorized error.

Quoted newline in table record causes exception during ECC database load

From Sven Bohm:

On Tue, Jun 16, 2020 at 9:02 AM sven bohm [email protected] wrote:
Hi Mark,

Hope you are doing well. I'm not sure who to send this to, but I noticed that the congruency checker seems to prioritize line breaks over quotes. That is if a quoted string includes a line break it complains "There is a un-closed quote in data file". Here is an example:

https://portal-s.edirepository.org/nis/reportviewer?packageid=knb-lter-kbs.195.20&localPath=%2Fhome%2Fpasta%2Flocal%2Fharvester%2FLTER-ecoinformatics-org%2FKBS-evaluate-2020-06-16-1592317526799%2Fknb-lter-kbs.195.20%2FqualityReport.xml on entity: /datatables/640

I can file a issue if you'd like.

Thanks

Sven Bohm -.- ..-. ---.. .-

Remove public write permission to API method evaluateDataPackage

Anonymous users may execute the evaluateDataPackage API method, which ultimately could cause a excess use of both CPU and filesystem resources. This privilege should be removed permanently.

The datapackagemanager.journal_citation should update principal_id content to new EDI LDAP DNs

The datapackagemanager.journal_citation package_id field contains LTER LDAP DNs. These DNs are filtered out when an EDI LDAP logged-in user attempts to review their list of journal citations. All LTER LDAP DNs should be modified to the correct EDI LDAP DN values.

Investigate the addition of data table attributes to Solr

Investigate the addition of data table attributes to Solr. This will affect:

Solr
Data Package manager
Data Portal

Add an update capability for journal citation entries

Provide an update API method to update journal citation entries.

A newline within a quoted string resulted in a warning during quality checking.

From Sven Bohm:

I noticed that the congruency checker seems to prioritize line breaks over quotes. That is if a quoted string includes a line break it complains "There is a un-closed quote in data file".

Here is an example: https://portal-s.edirepository.org/nis/reportviewer?packageid=knb-lter-kbs.195.20&localPath=%2Fhome%2Fpasta%2Flocal%2Fharvester%2FLTER-ecoinformatics-org%2FKBS-evaluate-2020-06-16-1592317526799%2Fknb-lter-kbs.195.20%2FqualityReport.xml on entity: /datatables/640

Apparently, the ECC issues a warning when a table element contains a line-break (new-line) even when within quotes.

Enforce data citation DOIs to use canonical form of "shoulder/identifier" when entering into citation DB

DOIs that are entered through the Data Portal data citation UI are not inspected for consistency or correctness. These DOIs are entered into an updated version of the DataCite metadata to inform DataCite of a manuscript/data package relationship but are only recognized as such when structured so that only the "shoulder/identifier" is present. DOIs should be structured to use only the "shoulder/identifier" canonical form when entered into citation DB.

There are two possible solutions:

Enforce the canonical form at the Data Portal UI input form, or
Clean-up the DOI just before entering into the DB

The second option also catches entries written via the REST API and is therefore a more comprehensive solution.

The default intellectualRights statement uses an older CC-BY license - should be CC0

The default intellectualRights statement that is provided by PASTA+ if the intellectualRights field is not present in the new EML document uses a CC-BY license declaration. According to the EDI Data Policy, this should be CC0.

Revise the existing default license to use CC0.

Rethink the portal-to-pasta data download sequence to eliminate non-essential network nodes

Currently, data downloads that are initiated by the Data Portal result in network traffic that flows from the "package", through "pasta", and onto "portal". This process places an excess burden on all three servers to route data bytes through their respective web servers. We should investigate whether a downloads could effectively be re-routed through a download service which run separately from all other servers. Perhaps using a redirect and download token?

Add new "YYYYMMDDThhmmssZ" datetime mapping to PostgresAdapter datetimeTransformationTable

Although the YYYYMMDDThhmmssZ datetime format is EML compliant, it is not compatible with the PostgreSQL datetime format strings, which are used in the to_timestamp function. An appropriate transformation must be added to the pasta.dml.database.PostgresAdapter class datetimeTransformationTable.

Remove "libcurl" from list of robot patterns

The libcurl library is used by the R runtime environment for executing HTTP requests, including those to PASTA. libcurl is also one of the robot patterns used for bot detection in the Gatekeeper, resulting in failed data requests when issued from R.

Remove libcurl from NIS/Gatekeeper/WebRoot/WEB-INF/conf/robotPatterns.txt. Doing so may open up PASTA for robot abuse - we'll need to monitor to see if this issue materializes.

Support use case for fully public data packages that are flagged to not receive a DOI

By default, PASTA assigns all data packages (with at least publicly readable metadata) a Digital Object Identifier (DOI) at either the point of upload or, if this fails, at a later time during a scheduled scanning process of the resource registry database table. There may be instances where a data package may not be a candidate for a DOI, even though it may meet the minimum requirement of having publicly readable metadata.

PASTA should support a mechanism to flag a data package so that it is not assigned a DOI at either the upload event (create/update) or by the DOI scanner process.

Certain unicode characters cause quality checking to fail during schemaValidDereferenced verification

Certain unicode characters cause quality checking to fail during the schemaValidDereferenced verification check, even though there is no "id/references" dereferencing that occurs.

EML supports a mechanism to identify a block of XML code so that it may be reused at a different location within the same document without having to repeat the same block content. The block source is identified with an id attribute. The reuse location is performed with the a <references> element where the element content is the id string literal. A common use of this the "id/references" pattern is with the responsibleParty element - the first time a responsibleParty element is define an id attribute would be declared. Then subsequent responsibleParty elements would be able to reference the id without having to redefine the entire responsibleParty block. For example:

<creator id="chase_gaucho">
    <individualName>
        <givenName>Chase</givenName>
        <surName>Gaucho</surName>
    </individualName>
</creator>
.
.
.
<contact>
    <references>chase_gaucho</references>
</contact>

To ensure that the reused content is EML schema valid, the quality check expands all references into a new EML document and then re-applies the schema validation. The issue occurs with the expansion phase of the original EML XML where certain unicode characters (set unknown) are converted into the corresponding HTML entity references for either UTF-8 or UTF-16.

For example, the unicode character "small italicized delta" (𝛿, U+1D6FF) is first converted from "𝛿" to the UTF-8 entity reference &#x1d6ff in the original source EML XML, and then the dereferencing expansion process converts the &#x1d6ff into the to the UTF-16BE entity reference &#55349;&#57087; (the decimal values of the UTF-16 byte sequence). It is the subsequent schema validation of this second conversion that results in an invalid EML XML document (although the first entity reference would also cause an error due to the UTF-8 entity reference replacement). This exception is completely non-related to the dereferencing validation check.

*Note that the exception message that is added to the schemaValidDereferenced quality check contains the raw encoding value "&#55349;" - it is the ampersand in this message that cause the quality report to fail XSLT conversion from XML to HTML.

Class references:

Initial call for the schemaValidDereferenced check: edu/lternet/pasta/dml/parser/generic/GenericDataPackageParser.java (see ~line 264: emlDataPackage.checkSchemaValidDereferenced(doc, emlNamespace);)
Dereferencing pipe-line: edu/lternet/pasta/dml/parser/DataPackage.java (see methods: checkSchemaValidDereferenced and dereferenceEML)

Add support for SHA-256 hash values at upload

Carl Boettiger has requested that we support the SHA-256 algorithm for computing the hash value of a data entity.

Carl Boettiger  10:49 AM
(The real trick would be if everyone adopted sha256, but I know that's not gonna happen anytime soon.... 
I recently discovered that sha256 is actually faster to compute than sha1 or md5 on most recent CPUs...)

Mark Servilla  10:50 AM
nice, i didn't know that

Carl Boettiger  10:51 AM
yeah, sha-2 family and sha-3 family of hashes are so fundamental that chipmakers have put hardware
acceleration for them right into the chips, and openssl libs have the code for these algos in Assembly!
blew my mind.  https://twitter.com/cboettig/status/1369454195493900288

the cpu savings are particularly excellent if you're hashing really large objects :slightly_smiling_face:

Mark Servilla  10:54 AM
wow! i'll add this to our ticket list for an enhancement :thumbsup:

Enforce data file size limit for normal processing

Enforce data file size limit within PASTA's data download pipeline during the normal processing of a data package.

Although a data upload limit of 500MB is enforced on the Data Portal through NGINX 's client_max_body_size 500m directive, this limit can be effectively circumvented by uploading the EML to PASTA, and then having PASTA perform the data load as part of its regular data download process.

PASTA should:

Abort on a data file exceeding some limit or
Abort on metadata record that indicates a data file will exceed some limit or
Make a data file size limit as part of the quality checking or
All of above

Congruence check should accept funding information in both "funding" and "award" elements of "project".

The congruence checker should accept as compliant funding information in both the funding and award element of the project parent. Currently, if funding information is under the award element, a warning is issued.

Sarah Elmendorf 12:46 PM

@marco I noticed that the congruence checker complains (warns) if you have funding under project -> award -> funderName/title etc instead of directly under funding. Is this intentional? Would it be possible for it to check in one of those two places for funding and only complain if it finds it neither?

Sarah Elmendorf 2 hours ago

sorry should have said "instead of directly under project -> funding"

Mark Servilla 2:17 PM

hi @sarah Elmendorf. no, this is not intentional, but an artifact of latency. the award element is new as of EML 2.2.0, and we have not yet had the time to modify the check for both cases. i'll add a GitHub issue enhancement for this specifically so that we do not lose track of it. thanks for pointing it out 🙂

XML entity encoded characters in provenance metadata are reverted to non-encoded characters in database

XML entity encoded characters in provenance metadata are reverted to non-encoded characters when being stored in the datapackagemanager.prov_matrix table.

XML entity encoding of URL strings in provenance metadata is required for valid XML. During parsing of the EML XML and recognition of the provenance metadata, the encoded URL is stored with entity encodings reverted back to their non-encoded format, thus subsequent use of the provenance metadata by the Data Portal fails due to the lack of XML entity encoding.

Two solutions:

Properly encode the URL stored in the prov_matrix database table - this could pose side-effects to other PASTA classes that expect a non-encoded URL; or
Properly encode the URL as it is being used as part of the data sources XML.

Support UTC offset datetime formats in the DML

The Data Manager Library congruence checker does not support datetime formats that contain a UTC offset value (e.g., YYYY-MM-DDThh:mm:ss-hh). When added to the metadata and checked by the DML, a "warning" is issued and the data table integrity check is skipped. The UTC offset format should be supported by the DML.

Add PASTA data entity unique identifier as "id" attribute in data entity element

PASTA should add the data entity MD5 checksum identifier value to the id attribute of each data entity, including the system="https://pasta.lternet.edu" and system="document" attributes. An example would be the following

<dataTable id="63ef86a0d17acbe76f754d4d6f20dad1" system="https://pasta.lternet.edu" scope="document">

These proposed attribute values should supersede any existing values that were provided in the Level-0 metadata.

PASTA EML parsing results in addition of whitespace when <superscript> and <subscript> are adjacent

An Nguyen (BLE) reported that the rendering of two adjacent <superscript> and <subscript> elements contains extra whitespace when viewed on the portal-s when the original EML XML does not have whitespace between the elements:

On Thu, Oct 22, 2020 at 2:20 PM An T. Nguyen [email protected] wrote:
Hello EDI,

We noticed that in this dataset on staging, the abstract and method
sections have a minor formatting error that we think occurs on EDI end.
E.g. if there are subscripts and superscript text elements next to each
other, the HTML display inserts a space between. See screenshot:

--
An T. Nguyen
Beaufort Lagoon Ecosystems Long Term Ecological Research Network
The University of Texas at Austin

This can also be seen by reviewing the data package summary page (aka landing page) at: https://portal-s.edirepository.org/nis/metadataviewer?packageid=knb-lter-ble.17.1

As an example of the raw XML element of the EML <abstract>

<abstract>
      <para>
Permafrost cores (4.5-7.5 m long) were collected along a geomorphic gradient near Drew Point, Alaska to characterize active layer and permafrost geochemistry and material properties. Cores were collected from a young drained lake basin, an ancient drained lake basin, and primary surface that has not been reworked by thaw lake cycles. Measurements of total organic carbon (TOC) and total nitrogen (TN) content, stable carbon isotope ratios (δ<superscript>13</superscript>C) and radiocarbon (<superscript>14</superscript>C) analyses of bulk soils/sediments were conducted on 45 samples from 3 permafrost cores. Porewaters were extracted from these same core sections and used to measure salinity, dissolved organic carbon (DOC), total dissolved nitrogen (TDN), anion (Cl<superscript>-</superscript>, Br<superscript>-</superscript>, SO<subscript>4</subscript><superscript>2-</superscript>, NO<subscript>3</subscript><superscript>-</superscript>), and trace metal (Ca, Mn, Al, Ba, Sr, Si, and Fe) concentrations. Radiogenic strontium (<superscript>87</superscript>Sr/<superscript>86</superscript>Sr) was measured on a subset of porewater samples. Cores were also sampled for material property measurements such as dry bulk density, water content, and grain size fractions.
</para>
    </abstract>

Journal citations should include options for different relation types

The journal citation service provided by PASTA allows a user to contribute a journal citation for a data package that is archived in PASTA. When submitted, this information updates the DataCite metadata for the data package and includes a section for related identifiers. This information forms the basis of an RDF triple with the subject being the data package, the object being the journal article, and the predicate is the relationship between the data package and the journal article.

<relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCitedBy">
         https://doi.org/10.1371/journal.pone.0205211
    </relatedIdentifier>
</relatedIdentifiers>

The current implementation fixes the relation type to "IsCitedBy". There are, however, instances where the data package is not expressly cited in the journal article but is still a source of information for the article's purpose. In these cases, a better relation type may be selected. Options for relation types to be used in the EDI/PASTA context are (in decreasing detail):

IsCitedBy
IsDescribedBy
IsReferencedBy

See here for a complete list of supported relation types: https://support.datacite.org/docs/relationtype_for_citation

This issue is related to #48.

API Documentation: Get Audit Record returns application/xml, not text/plain

An issue in the documentation with the Get Audit Record operation. The docs say a successful response will return text/plain, but it returns application/xml. Furthermore, the application/xml content-type is listed as application/json.

Add a "dataset" distribution element with the DOI URL as the target URL with function attribute set to information

Add a second data entity distribution element for each data entity with the data package DOI with function attribute set to information. This will improve the overall FAIR assessment by EDI and other 3rd party evaluators.

Zip archive incomplete with embargoed data even when permissions are set correctly

The zip archive does not include resources that are embargoed by using the "authenticated" principal for read access, even though the user is authenticated. These same resources are not accessible to the general public on the data package summary page, but are accessible when the user authenticates.

See the following package as an example: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-vcr&identifier=70&revision=24

Link to dataset published in another repository

The use case is that a scientist has published data in the KNB, but it is an LTER dataset and I need to be able to link to the data from EDI without giving it a new DOI. FCE generates its data catalog by harvesting metadata from EDI.

See here for PEP: https://docs.google.com/document/d/1kcTn9y18KaTa9V_NVlwvdvX6gZyEzU9knmfOoLy1h34/edit?usp=sharing