dataexpks
This package creates a kickstart rmarkdown document that automatically performs some basic data exploration for the supplied dataset.
The rmarkdown template is in the inst/template folder, as the file skeleton.Rmd
License: MIT
Around line 530, the user needs to input a low-cardinality / boolean variable name to facet the graphs on.
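A minimal sketch of what that faceting step could look like, assuming the user supplies the variable name as a string; `mtcars` and `"am"` are illustrative stand-ins, not the template's actual data or parameter:

```r
library(ggplot2)

# 'facet_var' is the user-supplied low-cardinality / boolean variable name.
facet_var <- "am"   # illustrative choice for mtcars

# facet_wrap() accepts a character vector of variable names, so the
# user-supplied string can be passed straight through.
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 30) +
  facet_wrap(facet_var)
```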
The multivariate missing data visualisation is useful, but it might be an idea to also look at the top 5 or 10 combinations of missing data in the event that there are a lot of combinations.
It may also be worth increasing the physical size of the first plot we produce.
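One hedged way to get the top missingness combinations, using the scoped dplyr verbs; the data here is illustrative and `data_tbl` stands in for the template's working dataset:

```r
library(dplyr)

# Illustrative data with missing values in several combinations.
data_tbl <- tibble(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, NA, 3, 4, 5),
  z = c(1, 2, NA, 4, 5)
)

# Convert every column to its missingness indicator, count each distinct
# pattern, and keep the 10 most common.
missing_pattern_tbl <- data_tbl %>%
  mutate_all(is.na) %>%
  group_by_all() %>%
  tally(sort = TRUE) %>%
  ungroup() %>%
  head(10)
```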
The code that produces the MDS and t-SNE visualisation is clunky and needs to be cleaned up.
Need to think of a much more elegant and clean way to implement those ideas.
It was

    data_tsne <- Rtsne(tsne_tbl %>% select(one_of(numeric_vars)))

I suggest deduplicating the rows first, since Rtsne() stops with an error when the input contains duplicate rows:

    data_tsne <- Rtsne(unique(tsne_tbl %>% select(one_of(numeric_vars))))
There is a stray '<' character before the email address; should it be removed, or its sibling put in? Just for the look of it.
To make finding parameters for the data exploration much more findable, I want to ensure that all the common parameters (such as the number of variables to include in the pairs plots for example) have a common prefix so that we can find them quickly within the dataset.
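A small sketch of the prefix idea; the `explore_` prefix and the parameter names are assumptions for illustration, not the template's current names:

```r
# Illustrative parameter names sharing a common 'explore_' prefix.
explore_pairs_var_count <- 8      # variables to include in the pairs plots
explore_hist_bins       <- 30     # bins for the univariate histograms
explore_sample_size     <- 10000  # sample size for the heavier plots

# With a shared prefix, all exploration parameters can be listed at once.
ls(pattern = "^explore_")
```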
Currently the data is written to disk right at the end of the template, which means this data is often not captured properly, so we may need some intermediate chunks of code to do this earlier in the workflow.
It is likely we will create multiple outputs that also use the additional columns created by the outlier detection piece.
It is not always very clear where the user of the template needs to add code, so I need to think of a way to telegraph this to the user in a consistent way.
Having used dataexpks quite a bit over the last six months I have started to realise that the template is not very good at encouraging the use of additional visualisations and work that people would probably want to add.
I think part of this problem is that there are no obvious places to put this additional stuff so we need to think about ways in which we can add those.
I think we also need a better way to add derived values and for some data cleaning, but I'm not quite sure where that would go.
When running the bivariate explorations the code tries to delete 'outlier_point' from the workspace but it has not been created yet.
Need to change that in the code.
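A minimal sketch of the guard; the variable name comes from the issue above, the `exists()` wrapper is the suggested change:

```r
# Guard the cleanup so the bivariate chunk does not error when
# 'outlier_point' was never created by an earlier chunk.
if (exists("outlier_point")) {
  rm(outlier_point)
}
```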
It's really confusing to read chunks after redefine_numeric_vars, where numeric_vars is (silently) redefined.
There is an issue with the relative table sizes in these lines:

    tsne_tbl$tsne_d1 <- data_tsne$Y[,1]
    tsne_tbl$tsne_d2 <- data_tsne$Y[,2]

tsne_tbl is a copy of sample_tbl, while data_tsne is built from the unique rows of sample_tbl, so the two objects can have different row counts. This being an issue is contingent on my suggested fix for Issue #8 being correct.
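One possible reshaping, assuming the Issue #8 fix is adopted: keep the deduplicated table and attach the embedding to it, so the row counts always agree. The object names mirror the template, but `mtcars` and the variable list are illustrative stand-ins:

```r
library(dplyr)
library(Rtsne)

# Illustrative stand-ins for the template's objects.
tsne_tbl     <- mtcars
numeric_vars <- c("mpg", "disp", "hp", "wt")

# Build the t-SNE input once from the unique numeric rows, then attach
# the embedding to that same table so the dimensions line up.
tsne_input_tbl <- tsne_tbl %>%
  select(one_of(numeric_vars)) %>%
  unique()

data_tsne <- Rtsne(as.matrix(tsne_input_tbl), perplexity = 5)

tsne_input_tbl$tsne_d1 <- data_tsne$Y[, 1]
tsne_input_tbl$tsne_d2 <- data_tsne$Y[, 2]
```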
As the template progresses through the data, we need to figure out a good progression of variable names and how they are used.
Currently we are working on rawdata_tbl and data_tbl, but this gets a bit confusing and needs to be improved.
With the new 'tidyeval' changes that came in dplyr 0.7.0 it is worth going through the various pieces of code in this and fixing them all up so they work better.
Because there is a very real possibility that the data table in the estimate_data_mean_covariance block produces a zero-length table, should there be a flag to indicate whether those sections should be run? Roughly line 880 (might differ on mine due to alterations).
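knitr chunk options accept R expressions, so an earlier chunk could compute the flag and the estimation chunks could then use `eval = run_cov_section` in their headers. A sketch, where `num_data_tbl` is an assumed name for the table built in that block:

```r
# Assumed name: 'num_data_tbl' stands for the table built in the
# estimate_data_mean_covariance block. Zero-row example for illustration.
num_data_tbl <- mtcars[0, ]

# Compute the flag before the estimation chunks run; downstream chunks
# can then declare eval = run_cov_section in their options.
run_cov_section <- nrow(num_data_tbl) > 0
```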
While the default ordering of levels in the univariate plots is to do it from left to right by the count, if the categorical has a natural ordering it is better to do it that way.
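One hedged approach using forcats (an assumption; the template may not currently load it): reorder by frequency only when the factor carries no natural ordering.

```r
library(forcats)

# An ordered factor keeps its natural level order; unordered factors are
# reordered by descending count for plotting. 'x' is illustrative data.
x <- factor(c("low", "high", "mid", "low", "low"),
            levels  = c("low", "mid", "high"),
            ordered = TRUE)

plot_levels <- if (is.ordered(x)) levels(x) else levels(fct_infreq(x))
```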
When I first started working on producing these plots I was unaware of the aes_string() function in R, which allows for programmatic choice of variables.
Now that I know about it, I should be able to make the plotting functions much neater in terms of code consistency.
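A sketch of the pattern; the function name `plot_univariate` is illustrative. (Worth noting: newer ggplot2 releases soft-deprecate aes_string() in favour of the `.data[[var]]` tidy-evaluation form, so that may be the longer-term target.)

```r
library(ggplot2)

# Programmatic choice of the plotted variable via aes_string().
plot_univariate <- function(data_tbl, var_name) {
  ggplot(data_tbl, aes_string(x = var_name)) +
    geom_histogram(bins = 30)
}

p <- plot_univariate(mtcars, "mpg")
```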
At the moment the template does not do a good job of explaining why there needs to be a distinction between the rawdata_tbl and data_tbl.
We probably need to add a few bits and pieces around the code commentary, and also in the markdown text itself, to make it clear why they are kept distinct.
It is possible I could put some sample code in there, perhaps?
    if(nrow(data_tbl) > mds_sample_count) {

needs to be extended to be

    if(nrow(data_tbl %>% filter(row_ids)) > mds_sample_count) {
We want the main version of the package to work via a single function that takes a data file as input and generates all the markdown and code for a very basic version of the rmarkdown doc, serving as a kickstarter for the data exploration.
It would be useful to put a link at the top of the template to dataexpks so that if people see it being used there is a link to the template itself.
The function

    create_coltype_list()

would be better served by a 'tidyverse'-style solution for applying functions across columns.
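A hedged sketch of what that could look like with purrr; this is an assumed reimplementation, not the package's actual create_coltype_list():

```r
library(dplyr)
library(purrr)

# Map each column to its (first) class, then group column names by type.
coltype_tbl <- tibble(
  column  = names(mtcars),
  coltype = map_chr(mtcars, ~ class(.x)[1])
)

coltype_lst <- split(coltype_tbl$column, coltype_tbl$coltype)
```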
The regular expression substitutions are adding underscores for ordinal abbreviations like '1st' etc.
Need to put in some code to suppress this or fix the regular expression.
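A sketch of the problem and one possible fix; the substitution pattern here is illustrative, not the template's actual regular expression:

```r
x <- c("1st", "22nd", "height3m")

# A naive clean-names rule inserts an underscore between every digit and
# letter, mangling the ordinal abbreviations:
gsub("([0-9])([a-z])", "\\1_\\2", x)
#> "1_st" "22_nd" "height3_m"

# Guarding with a negative lookahead (perl = TRUE) skips the ordinal
# suffixes while keeping the substitution elsewhere:
gsub("([0-9])(?!st|nd|rd|th)([a-z])", "\\1_\\2", x, perl = TRUE)
#> "1st" "22nd" "height3_m"
```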
It might be useful to have a timing piece for each section to help with benchmarking the code.
This should be easy enough to implement.
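knitr can also time every chunk generically via knit_hooks, but a minimal manual sketch per section could be as simple as:

```r
# Record a timestamp when a section starts ...
section_start_time <- Sys.time()

# ... the section's exploration code runs here ...

# ... and report the elapsed time at the end of the section.
section_elapsed <- as.numeric(difftime(Sys.time(), section_start_time,
                                       units = "secs"))
message("Section completed in ", round(section_elapsed, 2), " seconds")
```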
We can use histograms directly to explore date/time variables in the data, so I need to change the template to do that.
Action: using mtcars as a test dataset (which has no discrete variables), run the chunk find_highlevelcount_categorical_variables of skeleton.Rmd.
Expected: the chunk outputs a comment that there are no discrete variables and moves on to the next chunk.
Actual: throws the error "Error: .cols should be a character/numeric vector or a columns object".
There are static variable names; should this be changed to a select_if(is.numeric)?
There is a commented-out line near the top; is this removable?
Section 6.2: in this instance the num_data_tbl has no rows. I don't know if this is because of the data or because I've selected too many columns using select_if. If there are no rows then that block and everything below it doesn't work.
The template does not do a good job of handling data where whole types of data are missing i.e. data frames where there are no 'categorical' columns as an example.
I need to find some test cases and work with them to fix those.
The formatting of the proportions on the y-axis of the multivariate missing data plot should be fixed so that it doesn't use scientific notation.
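A sketch of the fix using the scales package (an assumption that the template already depends on it via ggplot2); the data here is illustrative:

```r
library(ggplot2)

# Illustrative proportions small enough to trigger scientific notation
# under the default axis labelling.
miss_prop_tbl <- data.frame(
  pattern = c("none", "x only", "x and y"),
  prop    = c(0.9985, 0.0010, 0.0005)
)

# scales::percent formats the axis as percentages instead of 5e-04 etc.
p <- ggplot(miss_prop_tbl, aes(x = pattern, y = prop)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent)
```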