
dataexpks's Introduction

dataexpks

This package creates a kickstart rmarkdown document that automatically performs some basic data exploration for the supplied dataset.

The rmarkdown template is the file skeleton.Rmd in the inst/template folder.

dataexpks's People

Contributors

collinspp, fleurlec, kaybenleroll

Forkers

fleurlec, gtm19

dataexpks's Issues

Improve the logic around writing data to disk

Currently the data is written to disk right at the end of the template, which means it is often not captured properly. We may need some intermediate chunks of code to write it earlier in the workflow.

We are also likely to create multiple outputs that use the additional columns created by the outlier detection piece.
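One possible shape for such intermediate writes is a small helper called at each checkpoint. This is only a sketch: the helper name write_checkpoint, the RDS format, the output location and the is_outlier column are all assumptions for illustration, not part of the template.

```r
# Hypothetical helper: write an intermediate table to disk as soon as it exists.
write_checkpoint <- function(tbl, name, out_dir = "data") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  out_file <- file.path(out_dir, paste0(name, ".rds"))
  saveRDS(tbl, out_file)
  invisible(out_file)
}

# Example: save the working table straight after cleaning ...
data_tbl <- head(mtcars)
write_checkpoint(data_tbl, "data_tbl", out_dir = tempdir())

# ... and again once an (illustrative) outlier column has been added.
data_tbl$is_outlier <- data_tbl$mpg > 22
write_checkpoint(data_tbl, "data_outlier_tbl", out_dir = tempdir())
```

Writing each output under its own name keeps the version with outlier columns separate from the plain cleaned table.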

Make it more obvious and easier to add custom exploratory code

Having used dataexpks quite a bit over the last six months, I have started to realise that the template is not very good at encouraging the additional visualisations and work that people would probably want to add.

I think part of the problem is that there are no obvious places to put this additional material, so we need to think about ways to add them.

I think we also need a better way to add derived values and some data cleaning, but I'm not quite sure where that would go.

Issue in lines 773-4 and 789-90

There is an issue with the relative table sizes in these lines:

tsne_tbl$tsne_d1 <- data_tsne$Y[,1]
tsne_tbl$tsne_d2 <- data_tsne$Y[,2]

tsne_tbl is a copy of sample_tbl, while data_tsne is built from the unique values of sample_tbl, so the two can have different numbers of rows. Whether this is an issue depends on my suggested fix for Issue #8 being correct.
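A minimal sketch of one way around the size mismatch, assuming the embedding has one row per unique input row: attach the coordinates to the unique rows first, then join back to the full sample. The column names x and y and the embedding values are made up, and a plain matrix stands in for data_tsne$Y so that the example runs without a t-SNE package.

```r
# Stand-in for sample_tbl: four rows, one of them a duplicate.
sample_tbl <- data.frame(x = c(1, 2, 2, 3), y = c(10, 20, 20, 30))

# The embedding is computed on the unique rows only.
unique_tbl <- unique(sample_tbl)

# Stand-in for data_tsne$Y: one row of coordinates per unique input row.
# (A real run would call the t-SNE routine here; these values are made up.)
embedding <- cbind(seq_len(nrow(unique_tbl)), -seq_len(nrow(unique_tbl)))

# Attach coordinates to the unique rows, where the lengths agree ...
unique_tbl$tsne_d1 <- embedding[, 1]
unique_tbl$tsne_d2 <- embedding[, 2]

# ... then join back so duplicated rows share the same coordinates.
# Note that merge() may reorder the rows.
tsne_tbl <- merge(sample_tbl, unique_tbl, by = c("x", "y"), all.x = TRUE)
```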

Flag for block estimate_data_mean_covariance and below

Because there is a very real possibility that the data table in the estimate_data_mean_covariance block has zero rows, should there be a flag to indicate whether these sections should be run? This is at roughly line 880 (it might differ in my copy due to alterations).
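Such a flag could look roughly like the sketch below. The flag name run_covariance and the stand-in table are assumptions; knitr's eval chunk option accepts an R expression, so the same flag could gate the later chunks.

```r
# Stand-in for the numeric table built earlier in the template.
num_data_tbl <- mtcars[0, c("mpg", "wt")]   # zero rows, for illustration

# Hypothetical flag: TRUE only when there is something to estimate from.
run_covariance <- nrow(num_data_tbl) > 1 && ncol(num_data_tbl) > 0

# In the Rmd, later chunks could then be gated on the flag, e.g.
#   {r estimate_data_mean_covariance, eval = run_covariance}
if (run_covariance) {
  data_mean <- colMeans(num_data_tbl)
  data_cov  <- cov(num_data_tbl)
} else {
  message("Skipping mean/covariance estimation: no usable rows.")
}
```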

Switch plots to make use of 'aes_string'

When I first started working on producing these plots, I was unaware of the aes_string() function in ggplot2, which allows for programmatic choice of variables.

Now that I know about it, I should be able to make the plotting functions much neater in terms of code consistency.
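A sketch of what a programmatic plotting helper might look like. The issue names aes_string(), which would be called as aes_string(x = var_name); newer ggplot2 releases deprecate it in favour of the .data pronoun, so the sketch uses that form instead. The helper name plot_numeric_hist is hypothetical.

```r
library(ggplot2)

# Hypothetical helper: histogram of one column chosen by name.
plot_numeric_hist <- function(data_tbl, var_name, bins = 30) {
  ggplot(data_tbl) +
    geom_histogram(aes(x = .data[[var_name]]), bins = bins) +
    xlab(var_name)
}

# One plot per numeric column, without hard-coding any column name.
numeric_vars <- names(mtcars)[vapply(mtcars, is.numeric, logical(1))]
hist_plots <- lapply(numeric_vars, function(v) plot_numeric_hist(mtcars, v))
```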

Draw distinction between rawdata_tbl and data_tbl

At the moment the template does not do a good job of explaining why rawdata_tbl and data_tbl need to be distinct.

We probably need to add a few bits and pieces to the code commentary and also to the markdown text itself to make it clear why they are kept distinct.

Perhaps I could put some sample code in there?
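As a rough idea of what such sample code might show, assuming rawdata_tbl holds the data exactly as loaded and data_tbl is the cleaned working copy (the column names and cleaning steps here are invented for illustration):

```r
# rawdata_tbl: the data exactly as loaded, never modified afterwards.
rawdata_tbl <- data.frame(
  `Customer ID` = c("a1", "a2", "a3"),
  `Signup Date` = c("2020-01-05", "2020-02-11", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# data_tbl: the working copy, with cleaned names and proper column types.
data_tbl <- rawdata_tbl
names(data_tbl) <- gsub(" ", "_", tolower(names(data_tbl)))
data_tbl$signup_date <- as.Date(data_tbl$signup_date)
```

Keeping rawdata_tbl untouched means any cleaning step can be checked against the original values later.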

Create code to automatically generate Rmd files

We want the main version of the package to work via a single function that will take a data file as input and generate all the markdown and code to create a very basic version of the rmarkdown doc that will serve as a kickstarter for the data exploration.
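A very rough sketch of such a function, under stated assumptions: the function name create_exploration_rmd and the DATA_FILE_PLACEHOLDER marker are hypothetical, and a throwaway template is used so the example runs without the package installed.

```r
# Hypothetical generator: copy a template and point it at a data file.
create_exploration_rmd <- function(data_file, output_file, template_file) {
  template_lines <- readLines(template_file, warn = FALSE)
  # "DATA_FILE_PLACEHOLDER" is an assumed marker, not part of skeleton.Rmd.
  rmd_lines <- gsub("DATA_FILE_PLACEHOLDER", data_file, template_lines,
                    fixed = TRUE)
  writeLines(rmd_lines, output_file)
  invisible(output_file)
}

# Demonstration with a throwaway template so the sketch runs on its own.
demo_template <- tempfile(fileext = ".Rmd")
writeLines(c("---", "title: Data Exploration", "---", "",
             "rawdata_tbl <- read.csv('DATA_FILE_PLACEHOLDER')"),
           demo_template)
demo_output <- tempfile(fileext = ".Rmd")
create_exploration_rmd("my_data.csv", demo_output, demo_template)
```

With the package installed, the real template could presumably be located with system.file("template", "skeleton.Rmd", package = "dataexpks"), given that it lives in inst/template.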

Add package link

It would be useful to put a link at the top of the template to dataexpks so that if people see it being used there is a link to the template itself.

find_highlevelcount_categorical_variables

Action: Using mtcars as a test dataset, which has no discrete variables, run the chunk find_highlevelcount_categorical_variables of skeleton.Rmd.

Expected: Chunk outputs a comment that there are no discrete variables, and moves on to the next chunk.
Actual: Throws error "Error: .cols should be a character/numeric vector or a columns object"
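The expected behaviour could be sketched as a guard like the one below. The helper name find_categorical_vars and its definition of "discrete" (character, factor or logical columns) are assumptions, not the template's own logic.

```r
# Hypothetical guard: find discrete columns, and stop early if there are none.
find_categorical_vars <- function(data_tbl) {
  is_discrete <- vapply(
    data_tbl,
    function(x) is.character(x) || is.factor(x) || is.logical(x),
    logical(1)
  )
  names(data_tbl)[is_discrete]
}

categorical_vars <- find_categorical_vars(mtcars)

if (length(categorical_vars) == 0) {
  message("No discrete variables found; skipping categorical summaries.")
} else {
  level_counts <- vapply(mtcars[categorical_vars],
                         function(x) length(unique(x)), integer(1))
  print(level_counts)
}
```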

Select on line 887 relying on static column names & Section 6.2

The select uses static variable names; should it be changed to select_if(is.numeric)?

There is a commented-out line near the top; is it removable?

Section 6.2: in this instance num_data_tbl has no rows. I don't know if this is because of the data or because I've selected too many columns using the select_if. If there are no rows, then that block and everything below it doesn't work.
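A sketch of type-based selection, using select(where(is.numeric)), which is the current dplyr form of select_if(is.numeric). The second half illustrates one possible way a table could end up with no rows: if incomplete rows are dropped and one selected column is entirely missing. Whether the template actually drops rows this way is an assumption, and the data are invented.

```r
library(dplyr)

# Stand-in data with a missing value in every row of one numeric column.
data_tbl <- tibble(
  id    = c("a", "b", "c"),
  price = c(1.5, 2.5, 3.5),
  size  = c(NA_real_, NA_real_, NA_real_)
)

# Select numeric columns by type instead of by fixed names.
num_data_tbl <- data_tbl %>% select(where(is.numeric))

# Dropping incomplete rows can empty the table if any column is mostly NA.
complete_tbl <- num_data_tbl[complete.cases(num_data_tbl), ]
nrow(complete_tbl)
```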

Logic does not handle missing data types well

The template does not do a good job of handling data where whole types of data are missing, for example data frames with no 'categorical' columns.

I need to find some test cases and work with them to fix this.
