dataexpks
This package creates a kickstart rmarkdown document that automatically performs some basic data exploration for the supplied dataset.
The rmarkdown template is in the inst/template folder, as the file skeleton.Rmd
License: MIT
Around line 530, the user needs to input a low-cardinality / boolean variable name to facet the graphs on.
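A minimal sketch of what that faceting step could look like, assuming the user supplies the variable name as a string; `mtcars` and `"am"` are illustrative stand-ins, not the template's actual data or parameter:

```r
library(ggplot2)

# 'facet_var' is the user-supplied low-cardinality / boolean variable name.
facet_var <- "am"   # illustrative choice for mtcars

# facet_wrap() accepts a character vector of variable names, so the
# user-supplied string can be passed straight through.
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 30) +
  facet_wrap(facet_var)
```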
The multivariate missing data visualisation is useful, but it might be an idea to also look at the top 5 or 10 combinations of missing data in the event that there are a lot of combinations.
It may also be worth increasing the physical size of the first plot we produce.
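One hedged way to get the top missingness combinations, using the scoped dplyr verbs; the data here is illustrative and `data_tbl` stands in for the template's working dataset:

```r
library(dplyr)

# Illustrative data with missing values in several combinations.
data_tbl <- tibble(
  x = c(1, NA, 3, NA, 5),
  y = c(NA, NA, 3, 4, 5),
  z = c(1, 2, NA, 4, 5)
)

# Convert every column to its missingness indicator, count each distinct
# pattern, and keep the 10 most common.
missing_pattern_tbl <- data_tbl %>%
  mutate_all(is.na) %>%
  group_by_all() %>%
  tally(sort = TRUE) %>%
  ungroup() %>%
  head(10)
```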
The code that produces the MDS and t-SNE visualisation is clunky and needs to be cleaned up.
Need to think of a much more elegant and clean way to implement those ideas.
It was

    data_tsne <- Rtsne(tsne_tbl %>% select(one_of(numeric_vars)))

I suggest deduplicating the rows first, since Rtsne() stops with an error when the input contains duplicate rows:

    data_tsne <- Rtsne(unique(tsne_tbl %>% select(one_of(numeric_vars))))
There is a stray '<' character before the email address; should it be removed, or its sibling put in? Just for the look of it.
To make finding parameters for the data exploration much more findable, I want to ensure that all the common parameters (such as the number of variables to include in the pairs plots for example) have a common prefix so that we can find them quickly within the dataset.
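A small sketch of the prefix idea; the `explore_` prefix and the parameter names are assumptions for illustration, not the template's current names:

```r
# Illustrative parameter names sharing a common 'explore_' prefix.
explore_pairs_var_count <- 8      # variables to include in the pairs plots
explore_hist_bins       <- 30     # bins for the univariate histograms
explore_sample_size     <- 10000  # sample size for the heavier plots

# With a shared prefix, all exploration parameters can be listed at once.
ls(pattern = "^explore_")
```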
Currently the data is written to disk right at the end of the template, which means this data is often not captured properly, so we may need some intermediate chunks of code to do this earlier in the workflow.
It is likely we will create multiple outputs that also use the additional columns created by the outlier detection piece.
It is not always very clear where the user of the template needs to add code, so I need to think of a way to telegraph this to the user in a consistent way.
Having used dataexpks quite a bit over the last six months I have started to realise that the template is not very good at encouraging the use of additional visualisations and work that people would probably want to add.
I think part of this problem is that there are no obvious places to put this additional stuff so we need to think about ways in which we can add those.
I think we also need a better way to add derived values and for some data cleaning, but I'm not quite sure where that would go.
When running the bivariate explorations the code tries to delete 'outlier_point' from the workspace but it has not been created yet.
Need to change that in the code.
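A minimal sketch of the guard; the variable name comes from the issue above, the `exists()` wrapper is the suggested change:

```r
# Guard the cleanup so the bivariate chunk does not error when
# 'outlier_point' was never created by an earlier chunk.
if (exists("outlier_point")) {
  rm(outlier_point)
}
```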
It's really confusing to read chunks after redefine_numeric_vars, where numeric_vars is (silently) redefined.
There is an issue with the relative table sizes in these lines:

    tsne_tbl$tsne_d1 <- data_tsne$Y[,1]
    tsne_tbl$tsne_d2 <- data_tsne$Y[,2]

tsne_tbl is a copy of sample_tbl, while data_tsne is built from the unique rows of sample_tbl, so the two objects can have different row counts. This being an issue is contingent on my suggested fix for Issue #8 being correct.
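One possible reshaping, assuming the Issue #8 fix is adopted: keep the deduplicated table and attach the embedding to it, so the row counts always agree. The object names mirror the template, but `mtcars` and the variable list are illustrative stand-ins:

```r
library(dplyr)
library(Rtsne)

# Illustrative stand-ins for the template's objects.
tsne_tbl     <- mtcars
numeric_vars <- c("mpg", "disp", "hp", "wt")

# Build the t-SNE input once from the unique numeric rows, then attach
# the embedding to that same table so the dimensions line up.
tsne_input_tbl <- tsne_tbl %>%
  select(one_of(numeric_vars)) %>%
  unique()

data_tsne <- Rtsne(as.matrix(tsne_input_tbl), perplexity = 5)

tsne_input_tbl$tsne_d1 <- data_tsne$Y[, 1]
tsne_input_tbl$tsne_d2 <- data_tsne$Y[, 2]
```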
As the template progresses through the data, we need to figure out a good progression of variable names and how they are used.
Currently we are working on rawdata_tbl and data_tbl, but this gets a bit confusing and needs to be improved.
With the new 'tidyeval' changes that came in dplyr 0.7.0 it is worth going through the various pieces of code in this and fixing them all up so they work better.
Because there is a very real possibility that the data table in the estimate_data_mean_covariance block produces a zero-length table, should there be a flag to indicate whether those sections should be run? Roughly line 880 (might differ on mine due to alterations).
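knitr chunk options accept R expressions, so an earlier chunk could compute the flag and the estimation chunks could then use `eval = run_cov_section` in their headers. A sketch, where `num_data_tbl` is an assumed name for the table built in that block:

```r
# Assumed name: 'num_data_tbl' stands for the table built in the
# estimate_data_mean_covariance block. Zero-row example for illustration.
num_data_tbl <- mtcars[0, ]

# Compute the flag before the estimation chunks run; downstream chunks
# can then declare eval = run_cov_section in their options.
run_cov_section <- nrow(num_data_tbl) > 0
```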
While the default ordering of levels in the univariate plots is to do it from left to right by the count, if the categorical has a natural ordering it is better to do it that way.
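One hedged approach using forcats (an assumption; the template may not currently load it): reorder by frequency only when the factor carries no natural ordering.

```r
library(forcats)

# An ordered factor keeps its natural level order; unordered factors are
# reordered by descending count for plotting. 'x' is illustrative data.
x <- factor(c("low", "high", "mid", "low", "low"),
            levels  = c("low", "mid", "high"),
            ordered = TRUE)

plot_levels <- if (is.ordered(x)) levels(x) else levels(fct_infreq(x))
```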
When I first started working on producing these plots I was unaware of the aes_string() function in R, which allows for programmatic choice of variables.
Now that I know about it, I should be able to make the plotting functions much neater in terms of code consistency.
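A sketch of the pattern; the function name `plot_univariate` is illustrative. (Worth noting: newer ggplot2 releases soft-deprecate aes_string() in favour of the `.data[[var]]` tidy-evaluation form, so that may be the longer-term target.)

```r
library(ggplot2)

# Programmatic choice of the plotted variable via aes_string().
plot_univariate <- function(data_tbl, var_name) {
  ggplot(data_tbl, aes_string(x = var_name)) +
    geom_histogram(bins = 30)
}

p <- plot_univariate(mtcars, "mpg")
```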
At the moment the template does not do a good job of explaining why there needs to be a distinction between the rawdata_tbl and data_tbl.
We probably need to add a few bits and pieces around the code commentary, and also in the markdown text itself, to make it clear why they are kept distinct.
It is possible I could put some sample code in there, perhaps?
    if(nrow(data_tbl) > mds_sample_count) {

needs to be extended to be

    if(nrow(data_tbl %>% filter(row_ids)) > mds_sample_count) {
We want the main version of the package to work via a single function that takes a data file as input and generates all the markdown and code for a very basic version of the rmarkdown doc, serving as a kickstarter for the data exploration.
It would be useful to put a link at the top of the template to dataexpks so that if people see it being used there is a link to the template itself.
The function

    create_coltype_list()

would be better served by a 'tidyverse'-style solution for applying functions across columns.
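A hedged sketch of what that could look like with purrr; this is an assumed reimplementation, not the package's actual create_coltype_list():

```r
library(dplyr)
library(purrr)

# Map each column to its (first) class, then group column names by type.
coltype_tbl <- tibble(
  column  = names(mtcars),
  coltype = map_chr(mtcars, ~ class(.x)[1])
)

coltype_lst <- split(coltype_tbl$column, coltype_tbl$coltype)
```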
The regular expression substitutions are adding underscores for ordinal abbreviations like '1st' etc.
Need to put in some code to suppress this or fix the regular expression.
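A sketch of the problem and one possible fix; the substitution pattern here is illustrative, not the template's actual regular expression:

```r
x <- c("1st", "22nd", "height3m")

# A naive clean-names rule inserts an underscore between every digit and
# letter, mangling the ordinal abbreviations:
gsub("([0-9])([a-z])", "\\1_\\2", x)
#> "1_st" "22_nd" "height3_m"

# Guarding with a negative lookahead (perl = TRUE) skips the ordinal
# suffixes while keeping the substitution elsewhere:
gsub("([0-9])(?!st|nd|rd|th)([a-z])", "\\1_\\2", x, perl = TRUE)
#> "1st" "22nd" "height3_m"
```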
It might be useful to have a timing piece for each section to help with benchmarking the code.
This should be easy enough to implement.
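knitr can also time every chunk generically via knit_hooks, but a minimal manual sketch per section could be as simple as:

```r
# Record a timestamp when a section starts ...
section_start_time <- Sys.time()

# ... the section's exploration code runs here ...

# ... and report the elapsed time at the end of the section.
section_elapsed <- as.numeric(difftime(Sys.time(), section_start_time,
                                       units = "secs"))
message("Section completed in ", round(section_elapsed, 2), " seconds")
```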
We can use histograms directly to explore date/time variables in the data, so I need to change the template to do that.
Action: using mtcars as a test dataset (which has no discrete variables), run the chunk find_highlevelcount_categorical_variables of skeleton.Rmd.
Expected: the chunk outputs a comment that there are no discrete variables and moves on to the next chunk.
Actual: throws the error "Error: .cols should be a character/numeric vector or a columns object".
There are static variable names; should this be changed to a select_if(is.numeric)?
There is a commented-out line near the top; is this removable?
Section 6.2: in this instance the num_data_tbl has no rows. I don't know if this is because of the data or because I've selected too many columns using select_if. If there are no rows then that block and everything below it doesn't work.
The template does not do a good job of handling data where whole types of data are missing i.e. data frames where there are no 'categorical' columns as an example.
I need to find some test cases and work with them to fix those.
The formatting of the proportions on the y-axis of the multivariate missing data plot should be fixed so that it doesn't use scientific notation.
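A sketch of the fix using the scales package (an assumption that the template already depends on it via ggplot2); the data here is illustrative:

```r
library(ggplot2)

# Illustrative proportions small enough to trigger scientific notation
# under the default axis labelling.
miss_prop_tbl <- data.frame(
  pattern = c("none", "x only", "x and y"),
  prop    = c(0.9985, 0.0010, 0.0005)
)

# scales::percent formats the axis as percentages instead of 5e-04 etc.
p <- ggplot(miss_prop_tbl, aes(x = pattern, y = prop)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent)
```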