
datamgmt's People

Contributors

dmullen17, drkrynstrng, emilyodean, isteves, jeanetteclark, maier-m, mbjones, sharisochs, smfreund

datamgmt's Issues

Add documentation about installing udunits2

The udunits2 library is required for datamgmt, and installing it takes steps beyond install.packages("udunits2"). It may be worth adding a prompt (for example: "The udunits2 library is required for the datamgmt package. Would you like to install the udunits2 library?").

update_eml_with_new_pids function.

Another idea for a fellows first project / function:

The idea here is that we sometimes use datamgmt::clone_package or datamgmt::copy_package to clone or copy a package to another member node, or to restore a previous version. Neither function updates the PIDs in the EML (for clone_package, this matters when new_pids = TRUE). It would be useful to write a helper function that does this efficiently, then call it from copy_package and, conditionally, from clone_package when new_pids = TRUE.

Potential workflow:

  1. Copy a package with ~10 data objects from production to test
  2. Match the new data pids to their old pids in the EML (this could be done multiple ways, but using the checksum might be the most efficient)
  3. Update the EML with the new pids (physical section)
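
Steps 2 and 3 of the workflow above could look roughly like the following. This is only a sketch in base R: the two data frames, their pid and checksum columns, and the helper name match_pids_by_checksum are hypothetical stand-ins for whatever the system metadata of the old and new objects provides, and the URL pattern is the resolve URL that pid_to_eml_physical writes.

```r
# Hypothetical inputs: one row per data object, as might be assembled from
# the system metadata on the source and destination nodes.
old_objects <- data.frame(
  pid = c("urn:uuid:old-1", "urn:uuid:old-2", "urn:uuid:old-3"),
  checksum = c("aaa111", "bbb222", "ccc333"),
  stringsAsFactors = FALSE
)
new_objects <- data.frame(
  pid = c("urn:uuid:new-b", "urn:uuid:new-c", "urn:uuid:new-a"),
  checksum = c("bbb222", "ccc333", "aaa111"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair each old pid with the new pid that has the same checksum
match_pids_by_checksum <- function(old_objects, new_objects) {
  stopifnot(is.data.frame(old_objects), is.data.frame(new_objects))
  stopifnot(all(c("pid", "checksum") %in% names(old_objects)))
  stopifnot(all(c("pid", "checksum") %in% names(new_objects)))
  # Identical files share a checksum, which would make the match ambiguous
  stopifnot(!anyDuplicated(old_objects$checksum),
            !anyDuplicated(new_objects$checksum))

  idx <- match(old_objects$checksum, new_objects$checksum)
  data.frame(
    old_pid = old_objects$pid,
    new_pid = new_objects$pid[idx],
    stringsAsFactors = FALSE
  )
}

pid_map <- match_pids_by_checksum(old_objects, new_objects)

# Step 3, on a plain character vector standing in for the physical URLs
urls <- "https://cn.dataone.org/cn/v2/resolve/urn:uuid:old-2"
for (i in seq_len(nrow(pid_map))) {
  urls <- sub(pid_map$old_pid[i], pid_map$new_pid[i], urls, fixed = TRUE)
}
```

Packages that contain identical files would need a second key (file name, for example), since the checksum alone cannot tell those objects apart.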

Move which_in_eml to ropensci/EML

@maier-m I'm cleaning up datamgmt and this function makes way more sense in EML. Did you ever ask if Karl wanted to add it to the ropensci/EML package?

If you're interested in the longer story: I used which_in_eml as a helper in this function NCEAS/arcticdatautils#97, but because datamgmt already imports arcticdatautils then arcticdatautils cannot import any datamgmt functions because this creates a cyclic package dependency. My hacky solution was to add which_in_eml to arcticdatautils so I could use it 🙄, but it makes way more sense in EML.

write check_attributes function

High level function for checking whether attribute metadata is correct.
Should be able to input an eml, a single eml attribute list, or a data.frame.

arcticdatautils function pid_to_eml_physical only works for ADC

arcticdatautils::pid_to_eml_physical is one of the most useful functions, but specific fields only work for uploads to the Arctic Data Center.
For instance the following creates the online distribution info link:

phys@distribution[[1]]@online@url <- new("url", paste0("https://cn.dataone.org/cn/v2/resolve/", 
            x@identifier))

We should generalize this function to work across all (or a lot) of the DataONE member nodes - definitely test ADC to start.
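
One way to start generalizing the URL is to make the base a parameter instead of hard-coding it. The helper below is a sketch only: build_resolve_url and its base_url argument are hypothetical names, the default is the resolve URL from the snippet above, and the second base URL in the usage lines is a made-up placeholder, not a real member node.

```r
# Hypothetical helper: build the online distribution URL for a pid
build_resolve_url <- function(pid,
                              base_url = "https://cn.dataone.org/cn/v2/resolve/") {
  stopifnot(is.character(pid), length(pid) == 1, nzchar(pid))
  stopifnot(is.character(base_url), length(base_url) == 1, nzchar(base_url))
  # Normalize the base so it ends in exactly one slash
  base_url <- sub("/*$", "/", base_url)
  paste0(base_url, pid)
}

default_url <- build_resolve_url("urn:uuid:1234")
# Placeholder base URL, for illustration only
custom_url <- build_resolve_url("urn:uuid:1234", base_url = "https://example.org/resolve")
```

Like the original snippet, this does not URL-encode the pid.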

Combining object updates/resource map

Currently, any time you update an object with a new version (via update_object), you have to save the new PID and then update the resource map.

I'm guessing it's possible to bundle the two together into one function, which updates the resource map whenever an object is updated.

Remove bulk of imports

Look into how quickly replacing the bulk of our imported packages with importFrom tags could be finished

Add datamgmt style guide

We're generally following tidyverse standards, but we have a few package-specific style prefs so far:

  • underscore for all variable names (unless referring to EML)
  • functions should include checks (stopifnot()) at the beginning to avoid messy if-else's later on
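
As an illustration of the second preference, a function following this style might open with its checks and keep the body free of defensive branching. The function count_missing is a made-up example, not part of datamgmt.

```r
# Made-up example function: count NA values in one column of a data.frame
count_missing <- function(data, column_name) {
  # Checks up front, so the body can assume valid input
  stopifnot(is.data.frame(data))
  stopifnot(is.character(column_name), length(column_name) == 1)
  stopifnot(column_name %in% names(data))

  sum(is.na(data[[column_name]]))
}

n_missing <- count_missing(data.frame(temp_c = c(1.5, NA, 3)), "temp_c")
```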

Look into RT alternatives/supplements

An alternative to RT would need to meet the following requirements:

  • be able to initiate submissions with an email to support
  • be able to communicate with PIs via email (act as an email gateway)

So far, we've discussed:

  • GitHub
    • using GitHub issues to track complex submissions
    • looking into a GitHub plugin that can support (email) conversations with non-GitHub-users (currently, GitHub can do some of the above tasks, but requires that all users have GitHub accounts)
  • RedMine
  • ZenDesk ($$$)

Create a QA function to make sure EML matches data

@params – resource map PID

Output will be written to a file and should include resource map PID, and any identifiers (filenames, PIDs of datasets, PIDs of EML, datatable#, etc) that are applicable to errors that occurred.

The goal of this is to get packages ready for consumption by erddap. This will likely be integrated with the metadata quality engine.

QA Checks

Correct data/EML/resource map linkage (given a resource map PID – we need to be able to get current EML & data PIDS)

  • Physical has incorrect URL to data
  • There is no data table in EML
  • PIDs in EML match PIDs in resource map
  • Filename not set in sysmeta
  • Filename not set in EML
  • formatId not set in sysmeta
  • formatId matches file extension based on registered DataONE format types

Data/EML compatibility

  • Different number of columns in EML vs data
  • Different number of rows in EML vs data (does EML include rows?)
  • Units or missing value codes absent or incomplete

Check that all column names in attributes match the column names in the csv

  • Possible conditions to account for:
    • dataTable does not exist for a csv
    • Physical has not been set and so URL id in dataTable is incorrect
    • Some of the attributes that exist in the data don't exist in the attribute table
    • Some of the attributes that exist in the attribute table don't exist in the data
    • There is a typo in one of the attributes or column names so they don't match (maybe covered by above)
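
The attribute/column comparison can be sketched with set differences. In this sketch compare_column_names is a hypothetical helper, and attribute_names is assumed to be a character vector of attributeName values already pulled out of the EML.

```r
# Hypothetical helper: compare csv column names with EML attribute names
compare_column_names <- function(data, attribute_names) {
  stopifnot(is.data.frame(data), is.character(attribute_names))
  data_names <- names(data)
  list(
    # Columns in the data with no matching attribute in the EML
    missing_from_eml = setdiff(data_names, attribute_names),
    # Attributes in the EML with no matching column in the data
    missing_from_data = setdiff(attribute_names, data_names),
    # TRUE only if the names agree and appear in the same order
    order_matches = identical(data_names, attribute_names)
  )
}

data <- data.frame(site = c("A", "B"), temp_c = c(1.5, 2.0), depth = c(10, 20))
result <- compare_column_names(data, c("site", "temp", "depth"))
```

A typo appears in both lists at once (here temp_c on the data side and temp on the EML side), so it is covered by the two set differences rather than needing a separate check.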

Domains: Check that all attribute types match attribute types in the csv

  • Possible conditions to account for:
    • nominal, ordinal, integer, ratio, dateTime reflected in data
    • If domain is enumerated domain, not all enumerated values in the data are accounted for in the enumerated definition
    • If domain is enumerated domain, not all enumerated values in the enumerated definition are actually represented in the data
    • The actual type of data does not match type outlined in EML
    • Values: Check for accidental characters in the csv (one char in a column of ints)
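
Two of the conditions above, the enumerated-domain checks and the stray-character check, can be sketched in base R. Both helper names are hypothetical, and codes is assumed to be the character vector of codes from the enumerated domain definition.

```r
# Hypothetical helper: compare observed values with the enumerated codes
check_enumerated <- function(values, codes) {
  stopifnot(is.character(codes))
  observed <- unique(as.character(values[!is.na(values)]))
  list(
    undefined_in_eml = setdiff(observed, codes),  # in the data, not defined
    unused_in_data = setdiff(codes, observed)     # defined, never observed
  )
}

# Hypothetical helper: positions of entries that are not whole numbers
find_non_integer <- function(values) {
  values <- as.character(values)
  is_integer_like <- grepl("^[[:space:]]*-?[0-9]+[[:space:]]*$", values)
  which(!is.na(values) & !is_integer_like)
}

enum_result <- check_enumerated(c("A", "B", "B", "X", NA), c("A", "B", "C"))
bad_rows <- find_non_integer(c("1", "2", "3o", "-4", NA))
```

Missing values are skipped in both helpers, on the assumption that missing value codes are checked separately.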

create_attributes_table textDomain bug

For textDomain, definition is a required field. The function should copy attributeDefinition into definition by default, or make definition a red box when using the shiny app.

This bug fix should also be applied to eml2 version of this function.

Bug fixes and improvements for qa_package and qa_attributes

I had several open tickets for issues here, so I am going to consolidate them all into one. I think other issues arose as I was out, too.

Bugs

  • reading in public xlsx and xls fails because it attempts to do it by url
  • qa_attributes fails on private xlsx reads because dataone::getObject() fails. One option is to just skip over trying to download the object if it's a private xlsx.
  • qa_attributes will process attributes incorrectly if csv has multiple header lines. Should check physical > dataFormat > textFormat > numHeaderLines to see if there are multiple header lines.
  • qa_attributes tries and fails to read .Rmds and .ipynb
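
For the multiple-header-line bug, one possible approach is to pass the numHeaderLines value through to the reader. The sketch below assumes the column names sit on the last header line, which will not hold for every file; read_with_header_lines is a hypothetical helper and the csv is written to a temporary file purely for the demonstration.

```r
# Hypothetical helper: read a csv whose first num_header_lines lines are header
# lines, assuming the column names are on the last of them
read_with_header_lines <- function(path, num_header_lines = 1) {
  stopifnot(file.exists(path))
  stopifnot(is.numeric(num_header_lines), num_header_lines >= 1)
  # Skip every header line except the last, which read.csv uses for names
  read.csv(path, skip = num_header_lines - 1, header = TRUE,
           stringsAsFactors = FALSE)
}

# Demonstration file with a free-text line above the column names
tmp <- tempfile(fileext = ".csv")
writeLines(c("Station data collected in 2017",
             "site,temp_c",
             "A,1.5",
             "B,2.0"), tmp)
dat <- read_with_header_lines(tmp, num_header_lines = 2)
```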

Improvements

  • Add unit tests for qa_attributes and qa_package
  • This function is pretty SASAP specific right now and apparently breaks on ADC packages. Future iterations should be more flexible.
  • Add support for spatial data

I am closing #90, #91, #92, and #119 since this captures all of them.

Update edit_attribute to modify otherEntities

The edit_attribute function only modifies dataTable objects in EML, when it should be able to modify dataTable or otherEntity objects as both can contain attribute metadata. One way to go about this would be replacing the dataTableNumber with something like object, and having the user specify the path as an input (for instance object = eml@dataset@dataTable[[2]])

Import issue with 'arcticdatautils' when 'datamgmt' is installed

This comes up when trying to install a new version of 'arcticdatautils' when 'datamgmt' is already loaded.

Reloading installed arcticdatautils
unloadNamespace("arcticdatautils") not successful, probably because another loaded package depends on it.Forcing unload. If you encounter problems, please restart R.

We might need to remove arcticdatautils as an import from datamgmt and switch to importFrom calls for the specific functions we need.

extend query_all_versions to resource map and data objects

query_all_versions as currently constructed only queries metadata. For instance this query returns a blank data frame:

datamgmt::query_all_versions(adc_prod, "resource_map_urn:uuid:a3784690-300e-4041-bd30-5d9b47759903", fields = "submitter")

It would be great to extend this to query resource map and data objects, along with some error handling for blank fields (for instance a resource map will not return a northBoundingCoordinate).
Some unit tests for these cases would be great too (resource map, data, and catching blank fields)

clone_package feature request

clone_package copies a parent + its (optional) children. It does not update the package metadata to reflect the new pids yet.

Filing this issue as a feature request so I don't forget.

R CMD check warnings - Mitchell

shiny_attributes_table: no visible global function definition for
  ‘packageVersion’
shiny_attributes_table: no visible global function definition for
  ‘get_unitList’
shiny_attributes_table: no visible global function definition for
  ‘fluidPage’
shiny_attributes_table: no visible global function definition for ‘br’
shiny_attributes_table: no visible binding for global variable ‘tags’
shiny_attributes_table: no visible global function definition for
  ‘actionButton’
shiny_attributes_table: no visible global function definition for ‘h5’
shiny_attributes_table: no visible global function definition for
  ‘rHandsontableOutput’
shiny_attributes_table: no visible global function definition for
  ‘fluidRow’
shiny_attributes_table: no visible global function definition for
  ‘column’
shiny_attributes_table : server: no visible global function definition
  for ‘reactive’
shiny_attributes_table : server: no visible global function definition
  for ‘hot_to_r’
shiny_attributes_table : server: no visible global function definition
  for ‘renderRHandsontable’
shiny_attributes_table : server: no visible global function definition
  for ‘%>%’
shiny_attributes_table : server: no visible global function definition
  for ‘rhandsontable’
shiny_attributes_table : server: no visible global function definition
  for ‘hot_table’
shiny_attributes_table : server: no visible global function definition
  for ‘hot_col’
shiny_attributes_table : server: no visible global function definition
  for ‘hot_cols’
shiny_attributes_table : server: no visible global function definition
  for ‘observeEvent’
shiny_attributes_table : server: no visible global function definition
  for ‘stopApp’
shiny_attributes_table: no visible global function definition for
  ‘shinyApp’
Undefined global functions or variables:
  %>% actionButton br column fluidPage fluidRow get_unitList green h5
  hot_col hot_cols hot_table hot_to_r is new observeEvent
  packageVersion pid_to_eml_physical query rHandsontableOutput reactive
  red renderRHandsontable rhandsontable shinyApp stopApp tags tail
  write.csv write_eml
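
Notes like these usually go away once each function is imported explicitly (or called as pkg::fun). A sketch of roxygen2 importFrom tags for the names above whose home package is clear is given below; where the tags would live in the package is an assumption, and green, red, get_unitList, query and write_eml are left out because their source package should be confirmed first.

```r
# Package-level import tags (roxygen2); running devtools::document()
# would turn these into importFrom() lines in NAMESPACE.
#' @importFrom methods is new
#' @importFrom utils packageVersion tail write.csv
#' @importFrom magrittr %>%
#' @importFrom shiny fluidPage fluidRow column br h5 tags actionButton
#' @importFrom shiny reactive observeEvent stopApp shinyApp
#' @importFrom rhandsontable rhandsontable rHandsontableOutput renderRHandsontable
#' @importFrom rhandsontable hot_to_r hot_table hot_col hot_cols
#' @importFrom arcticdatautils pid_to_eml_physical
NULL
```

Each package named in a tag also needs to be listed under Imports in DESCRIPTION.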

Update `eml_validate` to match Metacat

The EML package eml_validate function currently does not run some of the checks used in Metacat. Ideally, they would show the same result.


From Slack:
"it would be nice to add the extra schema-validity rules to the eml_validate function so it will show the same results as Metacat" (Matt)

"Once you send it across the network to the Member Node, Metacat runs a suite of custom validation rules on the EML.
It's stuff that XML schema's can't help us enforce, such as the match between a custom unit and its definition (as here) or ids and references." (Bryce)

open question:
"Im not sure what additional schema rules would be needed? right now EML::eml_validate is basically a wrapper for xml2::xml_validate which feeds that function the eml schema (EML/xsd/eml-2.1.1/eml.xsd). what additional checks are needed?" (Mitchell)

Create initial review checklist

Easy checks for reviewing methods/abstracts. Submissions should:

  • provide instrument names
  • specify how sampling locations were chosen
  • provide citations for sampling methods that are not explained in detail

Add function summary

Add summary of all functions in datamgmt/master, possibly to README?

This will help:

  • avoid duplicating work
  • make it easier to find functions if you're not sure about the name

Add package dependencies

Declare the package's dependencies so that they are installed along with datamgmt.
We currently need: arcticdatautils, dataone, shiny, rhandsontable, any others?

Excel date to standard date

Right now, the easy (but tedious) way of dealing with Excel messing up dates is to open a file in Excel, save it as a csv, and import into R.

If a function exists on the internet, then please comment here! Otherwise, I solved part of this problem with this frac_to_hm function:

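library(lubridate)  # hm()
library(stringr)    # str_pad()

frac_to_hm <- function(fraction, format = c("character", "lubridate")) {
  # Changes a time stored as a fraction of a day to hh:mm format.
  # Can either output a character string or a lubridate Period.
  format <- match.arg(format)
  # Round to whole minutes before splitting into hours and minutes;
  # truncating with as.integer() can drop a minute to floating-point error
  total_minutes <- round(fraction * 24 * 60)
  h <- total_minutes %/% 60
  m <- total_minutes %% 60
  if (format == "lubridate") {
    time <- hm(paste(h, m))
  } else {
    time <- paste(h, str_pad(m, 2, pad = "0"), sep = ":")
  }
  return(time)
}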

In this case, when Excel converts hh:mm to a fraction (e.g. 0.5), I can re-convert it to the appropriate character string. (It doesn't deal with dates, though.)
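
For the date part, Excel stores dates as serial day counts, which base R can convert directly. The sketch below assumes the workbook uses the default 1900 date system (origin 1899-12-30 in R terms); workbooks using the 1904 date system need origin = "1904-01-01" instead. excel_to_date is a hypothetical helper name.

```r
# Hypothetical helper: convert Excel serial dates (1900 date system) to Date
excel_to_date <- function(serial, origin = "1899-12-30") {
  stopifnot(is.numeric(serial))
  # The integer part counts days; the fractional part is the time of day
  as.Date(floor(serial), origin = origin)
}

d <- excel_to_date(43101.5)  # serial 43101 is 2018-01-01; .5 is noon
```

The fractional part that is dropped here is exactly what frac_to_hm takes as input, so the two can be combined for date-time columns.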

The solution to this problem should also be written out in Rmd and get added to our reference guide.

R CMD check warnings

add_creator_id: no visible global function definition for ‘green’
add_creator_id: no visible global function definition for ‘red’
add_creator_id: no visible global function definition for ‘new’

eml_diff function idea

This idea is from @jagoldstein. It would be useful to write an eml_diff function that compares two EML documents and returns the differences (similar to a GitHub diff). The trick here is not to compare them line by line, but to detect which sections exist dynamically, and then compare those.
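
The section-by-section comparison can be sketched on plain nested lists. This is not an EML implementation: list_diff is a hypothetical helper, the two documents are toy named lists standing in for parsed EML, and every element is assumed to be named, so repeated elements such as multiple dataTables would need extra handling.

```r
# Hypothetical helper: return the paths at which two nested named lists differ
list_diff <- function(a, b, path = "eml") {
  # Leaf (or a section missing on one side): report the path if values differ
  if (!is.list(a) || !is.list(b)) {
    if (identical(a, b)) return(character(0))
    return(path)
  }
  # Discover sections dynamically from both documents, then recurse into each
  sections <- union(names(a), names(b))
  diffs <- lapply(sections, function(s) {
    list_diff(a[[s]], b[[s]], paste(path, s, sep = "$"))
  })
  as.character(unlist(diffs))
}

old_doc <- list(dataset = list(title = "Sea ice",
                               abstract = "Old text",
                               creator = list(surName = "Smith")))
new_doc <- list(dataset = list(title = "Sea ice",
                               abstract = "New text",
                               creator = list(surName = "Smith"),
                               pubDate = "2018"))

changed <- list_diff(old_doc, new_doc)
```

Here changed holds the path of the edited abstract and of the pubDate section that exists in only one document.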

This could be a good starting project for a fellow that wants to get more familiar with EML and writing functions.

create_attributes_table doesn't show table

I just get a blank screen when using create_attributes_table. Sometimes this error pops up in the console:

  [No stack trace available]
Error in safeFromJSON(charData, simplifyVector = FALSE) : 
  Argument 'txt' is not a valid JSON string.

Function for adding new attribute to existing table?

I received a ticket today that necessitates adding a new attribute to an existing attribute table (the csv file for this package was given an additional column). The options for doing this appear to be 1) rebuild the whole attribute table or 2) edit the xml file in-line. These are both pretty simple (especially 2) for a small table like the one I am working on, but in future cases I think it would be very helpful to have a way to set a new attribute in the eml, e.g.
attribute19<-new(attribute, arguments yada yada yada)
eml@dataset@dataTable@attributeList@attribute[[19]]<-attribute19

`obsolete_package`: no error checking for token

I forgot to set a token, and the result of calling obsolete_package without one is:

obsolete_package(mn, "urn:uuid:14d91309-0ecb-4c78-88a4-ff6e691bcc9a", "urn:uuid:17d733ef-cb2b-48e0-8ca7-07aa162c6b70")
Error: arcticdatautils::object_exists(mn, metadata_obs) is not TRUE

which makes it look like there is no object there, but the issue was really the token.
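
A check at the top of obsolete_package could make the real cause visible. The sketch below assumes the token is stored in an R option, and the option name dataone_token is an assumption to confirm against how tokens are actually set for the node in use; check_token_set is a hypothetical helper.

```r
# Hypothetical helper: stop with a clear message when no token is set.
# The option name is an assumption and may differ by environment.
check_token_set <- function(option_name = "dataone_token") {
  token <- getOption(option_name)
  if (is.null(token) || !nzchar(token)) {
    stop("No token found in option '", option_name,
         "'. Set a token before calling obsolete_package().", call. = FALSE)
  }
  invisible(TRUE)
}

# Show the message without aborting the script
token_status <- tryCatch(
  {
    check_token_set()
    "token is set"
  },
  error = function(e) conditionMessage(e)
)
```

This only confirms that a token is present; an expired or wrong-environment token would still surface as a failed object_exists check.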

Create a QA function to check physical of object

When using qa_package an object may fail to process because the distribution URL in the physical is missing or differs from the object pid. It would be nice to have a helper function to pinpoint exactly what is happening with the physical. The main possibilities are that the physical is simply missing, the physical was not updated after the object was updated, or the wrong physical was assigned to the object. A qa_physical function would first check if the physical exists and if so check for these other possibilities to narrow down the issue, producing the appropriate messages to guide action. I'd be happy to work on this if it sounds like a good idea.
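
The first two checks could be sketched on plain strings, before any EML handling is added. Here check_physical_url is a hypothetical helper, url is assumed to be the distribution URL already extracted from the physical (NULL or empty when there is none), and pid is the PID of the object.

```r
# Hypothetical helper: classify the state of an entity's distribution URL
check_physical_url <- function(url, pid) {
  stopifnot(is.character(pid), length(pid) == 1, nzchar(pid))
  if (is.null(url) || length(url) == 0 || is.na(url) || !nzchar(url)) {
    return("missing: no distribution URL found in the physical")
  }
  if (endsWith(url, pid)) {
    return("ok: distribution URL ends with the object pid")
  }
  "mismatch: distribution URL does not end with the object pid"
}

ok <- check_physical_url(
  "https://cn.dataone.org/cn/v2/resolve/urn:uuid:1234", "urn:uuid:1234")
stale <- check_physical_url(
  "https://cn.dataone.org/cn/v2/resolve/urn:uuid:0000", "urn:uuid:1234")
absent <- check_physical_url(NULL, "urn:uuid:1234")
```

Telling a physical that was not updated apart from a wrong physical would additionally need the version chain of the object, which this sketch does not look at.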

Link `get_awards` to arcticbot/email updates

Related to @csjx’s idea of giving new awardees the option to populate some fields of an EML template, so they get an early idea of what’s expected (at least at the Project level, which avoids PIs copying and pasting the NSF abstract into the dataset).

Update pkgdown

The datamgmt pkgdown site is a good place for newbies to datamgmt to explore a curated list of functions, but is currently very out of date

@dmullen17 perhaps you're interested?
