datamashr's Issues

Function to check setup of package

Want a function that checks each directory in data and returns TRUE if it passes all the tests, i.e. all the right parts are there.

Then embed this within getStudyNames, so that it returns a list of directories that pass all the tests. These dirs are then loaded and built. Add an option test to getStudyNames (default test=FALSE).
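
A minimal sketch of how this could look, assuming each study directory must contain a fixed set of files. The names checkStudyDir, getStudyNamesSketch and the requiredFiles list are hypothetical, not the package's actual API.

```r
# Hypothetical list of files every study directory is assumed to need.
requiredFiles <- c("data.csv", "dataImportOptions.csv", "dataMatchColumns.csv")

# Returns TRUE if every required file is present in `dir`, otherwise FALSE.
checkStudyDir <- function(dir, required = requiredFiles) {
  all(file.exists(file.path(dir, required)))
}

# Sketch of getStudyNames: list study directories and, when test = TRUE,
# keep only those that pass checkStudyDir.
getStudyNamesSketch <- function(dataDir = "data", test = FALSE) {
  dirs <- list.dirs(dataDir, full.names = TRUE, recursive = FALSE)
  if (test) {
    dirs <- dirs[vapply(dirs, checkStudyDir, logical(1))]
  }
  basename(dirs)
}
```

With test=FALSE the sketch returns every directory, which matches the proposed default.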

Config files

  • When loading dataMashR, check that config files exist and have the right columns
  • do all files exist? we need to define if methodsDefinitions is among them;
  • does startDataMashR load correctly?

Variable definitions

  • does file exist?
  • do all columns exist?
  • do all numeric variables contain a valid range of values? if not, throw warning
  • are specified ranges valid? (TEST CHECKS IF COLUMNS ARE PURELY NUMERIC)
  • are types of variables (e.g. numeric, character) valid modes in R?
  • do variable names contain special characters (/, *, ^) or any non-ASCII characters?
  • all numeric vars have methods units specified, i.e. can't be blank
  • check for trailing white spaces after variable names
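
The name checks in the list above could be sketched roughly as follows. The function badVariableNames is hypothetical, and the set of special characters (/, *, ^) is taken from the list as an assumption about what should be rejected.

```r
# Returns the variable names that have leading/trailing whitespace,
# one of the special characters / * ^, or any non-ASCII character.
badVariableNames <- function(varNames) {
  hasSpace   <- varNames != trimws(varNames)
  hasSpecial <- grepl("[/*^]", varNames)
  # A code point above 127 means the character is not ASCII.
  hasNonAscii <- vapply(varNames, function(v) {
    isTRUE(any(utf8ToInt(enc2utf8(v)) > 127L))
  }, logical(1), USE.NAMES = FALSE)
  varNames[hasSpace | hasSpecial | hasNonAscii]
}
```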

Variable conversions

  • does file exist?
  • can it be empty? What if people are dealing with non-numeric datasets?
  • are columns in the right order?
  • are conversion functions valid functions?
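
For the last point, a possible check (a sketch only; isValidFunction is a hypothetical helper) is to ask whether each listed name resolves to a function:

```r
# TRUE for each name that resolves to a function on the search path;
# blank or missing names count as invalid.
isValidFunction <- function(funNames) {
  vapply(funNames, function(f) {
    !is.na(f) && nzchar(f) && exists(f, mode = "function")
  }, logical(1), USE.NAMES = FALSE)
}
```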

For each study

  • modify tests to draw var.def using mashRdetail function

data.csv

  • does file exist?
  • no duplicate column names
  • values outside range --> MAYBE THIS SHOULD BE A POST-HOC TEST AFTER PROCESSING THE DATA? OTHERWISE WE'LL HAVE TO PROCESS ALL THE DATA TWICE, WHICH IN TOTAL WOULD TAKE ~30 SEC (IN THE CASE OF BAAD), AND COULD POTENTIALLY CAUSE A MEMORY CRASH DUE TO 'TOO MANY OPEN FILES'
  • contains special characters?
  • file loads without problems
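
The duplicate column name test needs to look at the raw header, because read.csv with default settings silently renames duplicates. A rough sketch, with duplicatedColumns as a hypothetical helper name:

```r
# Read only the header line as plain text values and report any column
# names that occur more than once.
duplicatedColumns <- function(file, sep = ",") {
  header <- read.csv(file, nrows = 1, header = FALSE, sep = sep,
                     colClasses = "character", strip.white = TRUE)
  cols <- as.character(unlist(header[1, ]))
  unique(cols[duplicated(cols)])
}
```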

dataImportOptions

  • does it have the right structure? i.e., colnames
  • are entry values acceptable? (e.g. TRUE or FALSE for header, and a numeric value for skip?)

dataManipulate

  • is it a valid function?
  • is it a function of 'raw'?
  • does it return 'raw'?
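
These three checks could be combined along these lines. This is a sketch that assumes dataManipulate yields a function object and that "returns 'raw'" means "returns a data frame"; checkManipulate is a hypothetical name.

```r
# Checks that `f` is a function whose first argument is `raw` and that
# calling it on a data frame returns a data frame.
checkManipulate <- function(f, raw) {
  is.function(f) &&
    identical(names(formals(f))[1], "raw") &&
    is.data.frame(f(raw))
}
```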

dataMatchColumns

  • order of column names should correspond to data header
  • no NA values in var_in (e.g., was in O'Hara0000)
  • are the listed methods contained within the config table methodsDefinitions?
  • [x] variable names present in var.def
  • suitable unit conversion exists, given input units and desired output

dataNew

  • columns in the right order?
  • if a variable is listed in newVariable only (no case to match, i.e., could be either a new variable to add or a column with unique entry to be completely replaced), but it already exists, then throw a warning that it existed already in data but was completely replaced.
  • do all values listed in lookupVariable exist in the processed data post dataMatchColumns?
  • is the file empty? If so, throw a warning letting the person know that no additions will be made?
  • are the newVariable listed contained within the variableDefinitions table?
  • are new values in accordance with variable type? If so, are they within the correct range?
  • when addNewData adds a new variable conditionally on a lookupValue, it fails when the lookupVariable contains missing values. For example,

        lookupVariable,lookupValue,newVariable,newValue,source
        species,Cedrela odorata,family,Meliaceae,
        species,Tabebuia rosea,family,Bignonaceae,

    would fail if species contains NA in the data

studyContact

  • does it contain special characters?
  • does it contain the right columns?

studyMetadata

  • this should be a relatively free file, is it a true requirement?

studyRef

  • same as above, should be optional as not all datasets are published.
  • if it exists, does it contain a valid bibtex format/template (check if throws error when loading)

all files

  • Non-unicode chars, as per #104
  • check for trailing white spaces (before ,)
  • find replace on known strange characters? ’ -> ', – -> -, tabs, other?
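
A find/replace pass of this kind could look like the sketch below. cleanLine is a hypothetical helper, and the replacement table only covers the characters named above; the \u escapes keep the source file itself ASCII.

```r
# Replace some known "strange" characters with ASCII equivalents and
# strip spaces before commas and at the end of the line.
cleanLine <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)   # right single quote -> '
  x <- gsub("\u2013", "-", x, fixed = TRUE)   # en dash -> -
  x <- gsub("\t", " ", x, fixed = TRUE)       # tab -> space
  x <- gsub("[ ]+,", ",", x)                  # spaces before a comma
  sub("[ ]+$", "", x)                         # spaces at end of line
}
```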

Errors from R CMD check

  1. Non-ASCII characters on line 93 of import.R.
  2. missing documentation entries ... WARNING ...Undocumented code objects: ‘data.path’ ‘mashrDetail’ ‘validateConfig’ ‘validateSetUp’

To fix 1, replace each non-ASCII character (e.g. a smart quote) with its \uXXXX escape. Use tools::showNonASCIIfile("R/import.R") to find the non-ASCII characters, then look up their codes.

argument 'data' in loadStudies?

Currently we have a parameter data in loadStudies but I can't see any application for it at the moment?

Is it something old that we should remove or is it something yet to be implemented?

Amend travis.yml

It looks like the different lines in the script all run, when we might want them to stop if one thing fails. Consider replacing

    - make install
    - make test

with

    - make install test

(or make install && make test).

Directory names not known in dataMashR

For example, getStudyNames() needs global variable 'dir.rawData'. We need a function 'setFolders' or something, which makes global hidden variables, '.dir_rawData' etc. The user would run that the first time, to 'init' a project. dataMashR_init() ?
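
One way to do this without true globals is a hidden environment that an init function fills in. This is a sketch under assumptions: .mashrEnv, mashrDir and the argument names are hypothetical; only the name dataMashR_init comes from the suggestion above.

```r
# Hidden environment holding project settings.
.mashrEnv <- new.env(parent = emptyenv())

# Hypothetical initialiser: records where the raw data and config live.
dataMashR_init <- function(dir.rawData = "data", dir.config = "config") {
  assign("dir.rawData", dir.rawData, envir = .mashrEnv)
  assign("dir.config", dir.config, envir = .mashrEnv)
  invisible(TRUE)
}

# Accessor for other functions to use instead of a global variable.
mashrDir <- function(name) {
  if (!exists(name, envir = .mashrEnv, inherits = FALSE)) {
    stop("'", name, "' is not set; run dataMashR_init() first")
  }
  get(name, envir = .mashrEnv, inherits = FALSE)
}
```

getStudyNames() would then call mashrDir("dir.rawData") and fail with a clear message if the project was never initialised.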

character numeric variables

The final step in loadStudies (or processStudy?) should be to convert to numeric all variables that are listed as numeric in the variableDefinitions.csv file. Currently, in baad, h.c (and many others) are character even though they contain no values that cannot be converted to numeric.
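
A possible shape for that final step, as a sketch only. convertNumeric is a hypothetical helper, and numericVars stands for the names taken from variableDefinitions.csv.

```r
# Convert columns listed as numeric to numeric, but only when every
# non-missing value converts cleanly; otherwise warn and leave unchanged.
convertNumeric <- function(data, numericVars) {
  for (v in intersect(numericVars, names(data))) {
    x <- as.character(data[[v]])
    x[trimws(x) %in% c("", "NA")] <- NA      # treat blanks as missing
    num <- suppressWarnings(as.numeric(x))
    bad <- !is.na(x) & is.na(num)
    if (any(bad)) {
      warning("Variable '", v, "' has non-numeric values; left unchanged")
    } else {
      data[[v]] <- num
    }
  }
  data
}
```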

Better regression tests

Skeleton added, but more tests are needed or this is just going to tell you something has happened and give you no good clues.

addNewData fails when NA present

When addNewData adds a new variable conditionally on a lookupValue, it fails when the lookupVariable contains missing values. For example,

    lookupVariable,lookupValue,newVariable,newValue,source
    species,Cedrela odorata,family,Meliaceae,
    species,Tabebuia rosea,family,Bignonaceae,

would fail if species contains NA in the data
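
One plausible cause (an assumption, not confirmed against the package code) is that comparing with == yields NA for missing entries, and data frame subset-assignment rejects NA in a logical index. A sketch of an NA-safe replacement step, with applyNewValue as a hypothetical helper:

```r
# Apply one row of the dataNew table: set newVariable to newValue wherever
# lookupVariable equals lookupValue. %in% never returns NA, so rows where
# the lookup variable is missing are simply not matched.
applyNewValue <- function(data, lookupVariable, lookupValue,
                          newVariable, newValue) {
  if (!newVariable %in% names(data)) {
    data[[newVariable]] <- rep(NA, nrow(data))
  }
  rows <- data[[lookupVariable]] %in% lookupValue
  data[[newVariable]][rows] <- newValue
  data
}
```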

Too many open files Error

I get this error after a while (when I have baad open, and have run loadStudies a few times)

    Error in file(file, "rt") : cannot open the connection
    In addition: Warning message:
    In file(file, "rt") :
    cannot open file 'data/Aiba2005/dataImportOptions.csv': Too many open files

Somewhere in dataMashR, I assume some file is opened and not closed.
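
If the cause is an explicit connection that is never closed (an assumption), the usual pattern is to register the close with on.exit so it runs even when reading fails. readFirstLine is a hypothetical example, not a dataMashR function.

```r
# Read the first line of a file through an explicit connection and make
# sure the connection is closed even if reading fails.
readFirstLine <- function(path) {
  con <- file(path, open = "rt")
  on.exit(close(con), add = TRUE)
  readLines(con, n = 1)
}
```

showConnections() lists the connections that are currently open, which may help to find the function that leaks them.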

Create a function that recognises new studies and sets them up within progress.

Currently we have functions that are used when we import new studies. It would be cool to have something that recognises new studies automatically and adds them to the progress file (by the way, we also need a function that sets up this file initially).
Maybe add a warning message when somebody tries to process studies but one of them has not yet been set up.

processStudy error messages

For example,

    dat <- processStudy("Epron2011")
    Error in if (unit.from != unit.to) x <- match.fun(paste(unit.from, unit.to, :
    missing value where TRUE/FALSE needed

The function (or the underlying one) should print a message stating which variable did not have units (because units are not expected for all variables).
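
A sketch of such a pre-check, to be run on the variables that need conversion before the if comparison is reached. checkUnits and its arguments are hypothetical names.

```r
# Stop with the names of the variables whose input or output units are
# missing, instead of failing inside `if` with an uninformative message.
checkUnits <- function(variable, unit.from, unit.to) {
  missingUnits <- is.na(unit.from) | is.na(unit.to)
  if (any(missingUnits)) {
    stop("Missing units for variable(s): ",
         paste(variable[missingUnits], collapse = ", "))
  }
  invisible(TRUE)
}
```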

Function to change variable name throughout study

Needs to modify

  • variable definitions file
  • dataMatchColumns in every study
  • check dataNew to see if var used
  • check dataManipulate to see if variable name used (actually, this may not be necessary because variable names should then get modified in dataMatchColumns.csv)

Process for managing "stage" of import for each data folder

Want to manage which data folders are complete or incomplete within data folder.

  • Add stage file to config folder
  • function to create stage file if it does not already exist
  • Define stages
  • Add option to set stage in getStudyNames, also default value
  • Function to update stage
  • Function to add new studies with stage 0, when they appear in study
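
The create and add-new-studies steps could be sketched together as below. The file layout (columns study and stage), the default path and the name updateStageFile are all assumptions.

```r
# Create or update a stage file: studies found in `dataDir` that are not
# yet listed are appended with stage 0; existing stages are kept.
updateStageFile <- function(dataDir = "data",
                            stageFile = file.path("config", "stage.csv")) {
  studies <- basename(list.dirs(dataDir, full.names = TRUE, recursive = FALSE))
  if (file.exists(stageFile)) {
    stage <- read.csv(stageFile, stringsAsFactors = FALSE)
  } else {
    stage <- data.frame(study = character(0), stage = integer(0),
                        stringsAsFactors = FALSE)
  }
  new <- setdiff(studies, stage$study)
  if (length(new) > 0) {
    stage <- rbind(stage, data.frame(study = new, stage = 0L,
                                     stringsAsFactors = FALSE))
  }
  write.csv(stage, stageFile, row.names = FALSE)
  stage
}
```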
