deleetdk / kirkegaard

Personal R package. No particular theme to the content.

Home Page: http://emilkirkegaard.dk

License: MIT License

R 5.56% HTML 94.44%

kirkegaard's Issues

MOD_summary: add runs parameter

This parameter is currently being passed via the cryptic ... parameter, but this means that one cannot autocomplete it, which is annoying.

Make function to make deprecated functions

Instead of rewriting these every time, make a generator function.
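
A minimal sketch of what such a generator could look like, using base R's .Deprecated(). The name make_deprecated and the old_mean example are made up for illustration; nothing here is taken from the package.

```r
# Hypothetical generator: returns a wrapper that warns about the old name
# and then forwards all arguments to the replacement function.
make_deprecated <- function(new_fun, old_name, new_name) {
  force(new_fun); force(old_name); force(new_name)
  function(...) {
    .Deprecated(new = new_name, old = old_name)
    new_fun(...)
  }
}

# Illustration with a base function; `old_mean` is a made-up name.
old_mean <- make_deprecated(mean, old_name = "old_mean", new_name = "mean")
result <- suppressWarnings(old_mean(c(1, 2, 3)))
```

Each deprecated function then becomes a one-line definition instead of a hand-written wrapper.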

MOD_summary: pretty names for logical variables

Sometimes names are not properly made pretty. Seems to be related to using logicals.

http://rpubs.com/EmilOWK/228495

Could not easily reproduce:

> iiris = iris
> iiris$lgl = sample(c(T, F), 150, T)
> lm(Sepal.Length ~ ., data = iiris) %>% MOD_summary()
  |=============================================================================================================| 100%
$coefs
                     Beta   SE CI.lower CI.upper
Sepal.Width          0.27 0.05     0.18     0.35
Petal.Length         1.75 0.15     1.46     2.04
Petal.Width         -0.29 0.14    -0.56    -0.01
Species: setosa      0.00   NA       NA       NA
Species: versicolor -0.85 0.29    -1.42    -0.27
Species: virginica  -1.19 0.40    -1.99    -0.39
lgl                 -0.04 0.03    -0.10     0.02

$meta
            N            R2       R2 adj. R2 10-fold cv 
       150.00          0.87          0.86          0.84

For logicals, we basically just want to leave out the level, because it can be assumed that the reported level is TRUE.

plot_loadings_multi: add support for non-overlapping loadings

#non-overlapping indicators
fa_list2 = list(part1 = fa(iris[1:50, -c(1, 5)]),
                part2 = fa(iris[51:100, -c(2, 5)]),
                part3 = fa(iris[101:150, -c(3, 5)]))
plot_loadings_multi(fa_list2)

I'm pretty sure I wrote a function for handling this before, but apparently it got lost somewhere...

Migrate testing to testthat

Currently, testing uses one very long .R file with lots of stopifnot calls.

Should use a proper testing package. I had only heard of testthat (Hadley's package), but apparently there are two more: http://yihui.name/en/2013/09/testing-r-packages/

Hadley is the master of R packages and rarely makes bad decisions IMO, so I will go with his package.

Moving over all the tests may take multiple hours. But it should be worth it.

Find out why glm %>% aov does not work with fct or ord outcome, but with lgl

Need this to get anova to work, which I need to calculate etas.

Very strange that the functionality for lgl, fct and ord outcomes is not the same. The math is the same when they have 2 possible values.

> glm("y_fct2 ~ x_num + x_lgl + x_fct + x_ord", data = tmp_data, family = binomial) %>% aov
Call:
   aov(formula = .)
Error in levels(x)[x] : only 0's may be mixed with negative subscripts
In addition: Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors

> glm("y_ord2 ~ x_num + x_lgl + x_fct + x_ord", data = tmp_data, family = binomial) %>% aov
Call:
   aov(formula = .)
Error in levels(x)[x] : only 0's may be mixed with negative subscripts
In addition: Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors

miss_analyze: error with mixed data

> test_data %>% miss_analyze()
Error in pool_sd(d_x$x, d_x$group) : x must be a vector!
> test_data %>% str
'data.frame':	150 obs. of  6 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 NA 3.6 3.9 3.4 3.4 2.9 NA ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ type        : chr  "C" NA "A" "C" ...

geom_denhist

Make the y-axis meaningful by having it use percent. This means that the y-values of the density fit must be adjusted.
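
A rough sketch of the adjustment in base R, assuming equal-width bins. A density curve integrates to 1, so multiplying it by the bin width gives the approximate proportion of observations per bin, and multiplying by 100 puts it on the same percent scale as the histogram. The variable names are illustrative only.

```r
set.seed(1)
x <- rnorm(200)

# Histogram with equal-width bins; convert counts to percent of observations.
h <- hist(x, breaks = 20, plot = FALSE)
bin_width <- diff(h$breaks)[1]
pct <- h$counts / sum(h$counts) * 100

# Rescale the density fit to the same percent-per-bin scale.
d <- density(x)
d_pct <- d$y * bin_width * 100
```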

MOD_summary: add cv for glm

Cross validation is currently done only with lm, i.e. where outcome is numeric. But one can do cross validation of any statistic, including the pseudo R2's.

But before doing that, look into rms package to see whether it already has some better function.

miss_analyze error

Using LAPOP Argentina data:

> d_lapop_s_con %>% miss_analyze()
Error in extract_.data.frame(data, col, into, regex = regex, remove = remove,  : 
  argument "into" is missing, with no default

Package installs ggplot2 every time on install

> install("kirkegaard")
Installing kirkegaard
Installing 1 package: ggplot2
Installing package into ‘C:/Users/Emil/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
Warning: package ‘ggplot2’ is in use and will not be installed
"C:/PROGRA~1/R/R-3.3.1/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL  \
  "Z:/Code/R/kirkegaard" --library="C:/Users/Emil/Documents/R/win-library/3.3" --install-tests 

* installing *source* package 'kirkegaard' ...
** R
** data
*** moving datasets to lazyload DB
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (kirkegaard)
Reloading installed kirkegaard

Attaching package: ‘kirkegaard’

The following object is masked from ‘package:base’:

    +

Replace handling of missing variables with NULL

I recall wasting a lot of time on this anti-pattern before realizing the problem.

http://stackoverflow.com/a/41906180/3980197

Nested functions and missing() do not play well together!
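
A small sketch of the NULL-default pattern. The functions inner and wrapper and the seed argument are hypothetical. The point is that is.null() looks at the value, so it gives the same answer however many wrappers the argument has passed through.

```r
# Hypothetical functions for illustration; not from the package.
inner <- function(x, seed = NULL) {
  # is.null() checks the value itself, not how the argument was supplied.
  if (!is.null(seed)) set.seed(seed)
  sample(x, 1)
}

wrapper <- function(x, seed = NULL) {
  # Simply pass the argument on; NULL means "not given" at every level.
  inner(x, seed = seed)
}

a <- wrapper(1:100, seed = 1)
b <- wrapper(1:100, seed = 1)
no_seed <- wrapper(1:100)
```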

miss_plot additions

Add:

Cumulative missing % by case. Very useful when deciding how much data to keep before eventual imputation.
Choose 4 ways to plot missingness and include them all using faceting.
Reverse counting: show datapoints per person instead of NAs per person.

MOD_LASSO: error with only 1 predictor

> MOD_LASSO(iris, dependent = "Petal.Length", predictors = "Petal.Width")
Data standardized.
Run 1 of 100
Error in glmnet(x, y, weights = weights, offset = offset, lambda = lambda,  : 
  x should be a matrix with 2 or more columns
> MOD_LASSO(iris, dependent = "Petal.Length", predictors = c("Petal.Width", "Sepal.Length"))
Data standardized.
Run 1 of 100
Run 2 of 100
Run 3 of 100
Run 4 of 100
Run 5 of 100

When there is only one predictor, the matrix gets converted to a vector which then fails the glmnet input requirements.
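
A sketch of the dimension drop and one possible guard. Note that the glmnet message asks for 2 or more columns, so keeping a one-column matrix would presumably still be rejected; an explicit early check with a clear message may be the more useful fix. check_n_predictors is a hypothetical helper.

```r
predictors <- "Petal.Width"

# Single-column extraction with `[, ]` silently drops to a vector ...
x_vec <- iris[, predictors]
# ... while drop = FALSE keeps the two-dimensional structure.
x_mat <- as.matrix(iris[, predictors, drop = FALSE])

# Hypothetical guard: fail early with a readable message.
check_n_predictors <- function(predictors) {
  if (length(predictors) < 2) {
    stop("At least 2 predictors are required, got ", length(predictors), ".",
         call. = FALSE)
  }
  invisible(TRUE)
}
```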

Move spatial functions to their own package.

This collection of functions should not be in this package. They are rarely used and have large dependencies not otherwise used by the package. They also have a bad prefix. SAC is spatial autocorrelation, but these functions are just general functions for spatial statistics. So a prefix of spa_ or ss_ would make more sense. I could not find any package that uses this prefix.

Brainstorming some names for the package. Should try to find a name with r, so I can be cool like the other kids. Given the very large number of seemingly very complicated spatial analysis packages already on CRAN (say, 100), these functions are very simple, so maybe pick a name along those lines.

simple spatial tools
simple spatial statistics

pu_translate: only exact matching for numbers

Otherwise, a number can match any other number in a few steps, making a lot of false matches.

df_colFunc: nonsensical error messages when indices get wrong input

Example:

  #convert to ordered if necessary
  d %<>% df_colFunc2(d, func = function(x) {
    browser()
    if_else(nlevels(x)<10, ordered(x), x)
  })

Error in .subset2(x, i, exact = exact) : 
  attempt to select more than one element in get1index <real>

This subtle bug is because the pipe already passes the data.frame, and d gets passed as the indices argument. But there is no input control for it, so it tries to subset using a data.frame. Nonsense!
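
A sketch of the kind of input control that would catch this. check_indices is a hypothetical helper, and the accepted types are an assumption about what the indices argument should allow.

```r
# Hypothetical validator for an `indices` argument: accept only types that
# can sensibly select columns, and say clearly what was received otherwise.
check_indices <- function(df, indices) {
  if (!(is.numeric(indices) || is.character(indices) || is.logical(indices))) {
    stop("`indices` must be a numeric, character or logical vector, not an ",
         "object of class ", paste(class(indices), collapse = "/"), ".",
         call. = FALSE)
  }
  if (is.character(indices) && !all(indices %in% names(df))) {
    stop("Unknown column names: ",
         paste(setdiff(indices, names(df)), collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Passing a data.frame by mistake now gives a readable message.
msg <- tryCatch(check_indices(iris, iris), error = function(e) conditionMessage(e))
```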

theme_bw()

I've switched to always using this theme. No reason not to add it to all plotting functions. One can always overwrite it if desired.

SMD_matrix improvements

Should add some things:

Option to get CIs on the values like cor_matrix. Standard and bootstrapped (when using non-parametric).
Why is this called _matrix, and why does it give matrix (2x2) output when it only takes one pair of variables? Because the group variable may have more than 2 levels. However, with only 2 levels, it makes sense to simplify the output to a single value. This should be toggled by an argument; otherwise the function's output type would be unpredictable.
Option to get the CIs in separate matrices for easy computational re-use.
Throw error when given length 1 input. Right now this just results in no errors and NA output. Confusing!

log breaks helper function

Often it is desirable to use logarithmic scales. However, this also often results in too few breaks (in ggplot2 terms) on the axes. One can supply breaks manually, but there does not seem to be any function to construct these. For instance, halfway breaks are very convenient. Instead of:

1, 10, 100, 1000, 10000, ...
10^0, 10^1, 10^2, 10^3, 10^4, ...

We can have:

1, 5, 10, 50, 100, 500, 1000, 5000, 10000, ...
10^0, 5*10^0, 10^1, 5*10^1, 10^2, 5*10^2,10^3, 5* 10^3, 10^4, ...

An alternative is to use the midpoint in exponents instead of in linear units. This is easier to generate, but less intuitive:

10^0, 10^0.5, 10^1, 10^1.5, 10^2, 10^2.5, 10^3, 10^3.5, 10^4, ...
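
Both variants are easy to sketch in base R. The function names and the exponent-range interface below are assumptions for illustration; the output could be passed to the breaks argument of a ggplot2 log scale.

```r
# Hypothetical helper: breaks at 1 and 5 times each power of ten.
log_breaks_half <- function(min_exp, max_exp) {
  powers <- 10^(min_exp:max_exp)
  halves <- 5 * head(powers, -1)   # no 5 * 10^max_exp beyond the top break
  sort(c(powers, halves))
}

# Alternative: midpoints in exponent units.
log_breaks_exp <- function(min_exp, max_exp, step = 0.5) {
  10^seq(min_exp, max_exp, by = step)
}

breaks_linear <- log_breaks_half(0, 4)
breaks_exp <- log_breaks_exp(0, 4)
```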

MOD_LASSO: unhelpful error messages when variables are not found

MOD_LASSO does not give helpful error messages.

MOD_LASSO(iris, dependent = "Petal.Length", predictors = c("non_existent_var", "Sepal.Length"))
Error in `[.data.frame`(df, c(dependent, predictors, "weights_")) : 
  undefined columns selected

It would be better if it said something like the following:

Error: the following variables were not found in the data: non_existent_var
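
A minimal sketch of such a check, to be run before any subsetting. check_vars_exist is a hypothetical helper name.

```r
# Hypothetical helper: report exactly which variables are absent.
check_vars_exist <- function(df, vars) {
  missing_vars <- setdiff(vars, names(df))
  if (length(missing_vars) > 0) {
    stop("the following variables were not found in the data: ",
         paste(missing_vars, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

msg <- tryCatch(
  check_vars_exist(iris, c("Petal.Length", "non_existent_var", "Sepal.Length")),
  error = function(e) conditionMessage(e)
)
```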

MOD_summary improvements

Add more useful features to this.

                        Beta   SE CI.lower CI.upper
SIRE: White             0.00   NA       NA       NA
SIRE: African_American  0.27 0.18    -0.08     0.63
SIRE: American_Indian   0.31 0.46    -0.59     1.22
SIRE: Asian             0.30 0.17    -0.03     0.63
SIRE: Hispanic         -0.31 0.09    -0.48    -0.14
SIRE: Multi_ethnic      0.08 0.11    -0.13     0.29
SIRE: No data          -0.12 0.42    -0.94     0.70
SIRE: Other             0.17 0.25    -0.33     0.67
SIRE: Pacific_Islander -0.45 0.27    -0.97     0.07
European                0.15 0.06     0.04     0.26
African                -0.33 0.05    -0.44    -0.23

[[1]]$meta
            N            R2       R2 adj. R2 10-fold cv 
      1369.00          0.17          0.16          0.15

Meta should give the outcome variable.

So we can remember what it is. If meta is a numeric vector, need to change it to a data frame. Might break some code.

Attach lm model object

The original lm fit should be attached. This also attaches the data.

Eta

Etas should be calculated so as to better summarize the categorical variables. Etas are redundant when models have no categorical variables, but for consistent output, should have a parameter that forces them to always be included.

An alternative is to add a class and change the print function to only print eta when model has categorical variables. Seems more fancy.

add: update_package()

Shorthand for devtools::install_github("deleetdk/kirkegaard")
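
A sketch of the shorthand. It assumes devtools is installed and only touches it when the function is actually called.

```r
# Sketch of the proposed shorthand; the devtools check is an added assumption.
update_package <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("devtools is required to update from GitHub.", call. = FALSE)
  }
  devtools::install_github("deleetdk/kirkegaard")
}
```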

FA_splitsample_repeat: add progress bar

See #3

Replace homebaked input checking code with ensurer package

https://cran.r-project.org/web/packages/ensurer/vignettes/ensurer.html

FA_splitsample_repeat: seed error

For some reason, the seed parameter is always treated as missing, even when present.

Called from: FA_splitsample_repeat(iris[-5])
Browse[1]> seed
[1] 1
Browse[1]> missing("seed")
[1] TRUE
Browse[1]> missing(seed)
[1] TRUE

So currently, it is not possible to not pass a seed value. If one wants to use a random seed, then pass a random value.

MOD_penalized

Next iteration of MOD_LASSO.

Output should be a list with custom class to control printing.

Should include the MOD_summarize models output.

Should also fit the resulting model. Use a parameter to decide the cutoff to include predictors with. Remember to use correct cross validation! Read up on bootstrapping too and see how hard it is to implement.

cor_matrix: add latent correlations

Note that this is incompatible with weights when one uses hetcor. But one could swap to the psych equivalents instead.

https://www.personality-project.org/r/html/tetrachor.html

Also add support for CIs and p values.

Use consistent input checking

Right now input checks are chaotically written using a variety of methods. Some use my own functions, others use assertthat functions. It's better to reuse well-tested code, so switch to assertthat functions. Fill in with missing tests.

fa_Jensens_method: incorrectly uses latent

There is something wrong with the detection of which method to use. I notice that it often uses latent when it should use Pearson.

df_as_num: better deal with factors, ordered vs. non-ordered

E.g. (class(x) == "factor") is wrong. Use inherits. But really, we usually want to convert ordered factors, but not the non-ordered ones unless they actually have numeric levels.
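
To illustrate: an ordered factor has a two-element class vector, so the == comparison does not give a single TRUE/FALSE. The conversion rule below is a hypothetical sketch of the behavior described, not the package's code.

```r
x_ord <- ordered(c("low", "high", "low"), levels = c("low", "high"))

class(x_ord) == "factor"    # FALSE TRUE (length 2, not a single answer)
inherits(x_ord, "factor")   # TRUE

# Hypothetical rule: convert ordered factors via their codes, convert
# unordered factors only if every level parses as a number.
as_num_sketch <- function(x) {
  if (is.ordered(x)) return(as.integer(x))
  if (is.factor(x)) {
    lv_num <- suppressWarnings(as.numeric(levels(x)))
    if (!anyNA(lv_num)) return(lv_num[as.integer(x)])  # numeric levels
    return(x)                                          # nominal: leave as is
  }
  x
}
```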

Rewrite code to use purrr functions

purrr is a replacement package for most of the built in R vector/list based functional functions, chiefly sapply, lapply, vapply.

As I see it, there are primarily two reasons to replace the current code with the purrr versions:

They are slightly faster because they are implemented in C.
sapply can return list output if input is improper. This causes hard to understand errors further down a pipeline. Better to fail early using the map_* functions. These are shorter than using vapply.
map_* has some neat typing saving functionality: supplying names or integers as a function skips a step using extract.
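
The second point can be shown with base R alone. vapply stands in here for the typed map_* functions (for example purrr's map_int), which are stricter in the same way; the inputs are made up.

```r
inputs_ok  <- list(a = 1:3, b = 4:6)
inputs_bad <- list(a = 1:3, b = integer(0))

# With well-formed input, sapply() simplifies to an integer vector.
res_ok <- sapply(inputs_ok, head, n = 1)

# With one empty element, sapply() quietly returns a list instead.
res_bad <- sapply(inputs_bad, head, n = 1)

# A typed alternative fails right away, where the problem actually is.
failed_early <- tryCatch({
  vapply(inputs_bad, head, FUN.VALUE = integer(1), n = 1)
  FALSE
}, error = function(e) TRUE)
```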

Details: http://r4ds.had.co.nz/iteration.html#the-map-functions

FA_splitsample_repeat: add seed parameter for reproducible analysis

As with e.g. MOD_LASSO.

df_merge_rows: multi-function approach

Instead of supplying a single function that must work for all columns or all numeric columns, make it so one supplies either a function or a list of functions. If a function, it will be used on all columns. If a list of functions, it will use any function that matches the column name. If there is none, it will default to one for that type. This is a clever approach to handling mixed-type data frames.

The default func should be something like:

list(.numeric = wtd_sum, .default = first_row)

So, given a column, it is tested for numerical status (integer or double). If yes, then the values are summed. If not, it defaults to .default and takes the first row. In this way, every column is processed. One can make it possible for there to be no function that applies, which raises an error that may be useful in some cases.
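
A sketch of the lookup order described above. pick_func is a hypothetical helper, and plain sum and a first-element function stand in for wtd_sum and first_row.

```r
# Hypothetical resolver: choose the merge function for one column.
pick_func <- function(funcs, col_name, col) {
  if (is.function(funcs)) return(funcs)                       # one for all
  if (!is.null(funcs[[col_name]])) return(funcs[[col_name]])  # by name
  if (is.numeric(col) && !is.null(funcs[[".numeric"]])) {
    return(funcs[[".numeric"]])                               # by type
  }
  if (!is.null(funcs[[".default"]])) return(funcs[[".default"]])
  stop("No function applies to column: ", col_name, call. = FALSE)
}

# Stand-ins for the defaults named above.
funcs <- list(.numeric = sum, .default = function(x) x[1])

df <- data.frame(id = c("a", "a"), n = c(1, 2), stringsAsFactors = FALSE)
merged <- lapply(names(df), function(nm) pick_func(funcs, nm, df[[nm]])(df[[nm]]))
names(merged) <- names(df)
```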

miss_filter: fails with ncol=1

> iris[1] %>% miss_filter()
Error in apply(y, 2, sum) : dim(X) must have a positive length
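
The error message suggests that the data was dropped to a vector somewhere before apply was called, though that is a guess about the cause. A sketch of counting NAs per row in a way that also works with a single column:

```r
df1 <- iris[1]                 # a one-column data frame
df1[c(2, 5), 1] <- NA

# is.na() on a data frame returns a matrix, so rowSums() works even with
# one column; drop = FALSE keeps the result a data frame.
na_per_row <- rowSums(is.na(df1))
filtered <- df1[na_per_row == 0, , drop = FALSE]
```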

fa_Jensens_method: supply cor matrix to avoid recalculation

Add argument to supply a correlation matrix (or equivalent). This prevents time-consuming recalculation. Recalculation can also cause problems when there are odd patterns in missing data and depending on whether continuity correction was used or not.

GG_denhist: sometimes refuses to plot 1-column data.frames or vectors

Scalar checks

Add checks for scalar (length 1) to the general is_ functions.

miss_filter: filter by non-NA instead of NA

Very useful at times.

Some abbreviations are used in mega dataset that are no longer valid abbreviations

E.g.

> mega$names_dk = pu_translate(rownames(mega), reverse = T, lang = "da", messages = 0)
No match: Africa
No match: Asia
No match: ASM
No match: BELFL
No match: BELFR
No match: BIF
No match: CHA
No match: Eastern-Europe
No match: MCA
No match: MIC
No match: MIR
No match: North America and Oceania
No match: PRI
No match: SAS
No match: South and Middle America
No match: SRP
No match: VGB
No match: VIR
No match: Western-Europe

E.g. VGB was moved to GBR_VGB because this is actually a subunit of Great Britain which for some reason has its own ISO code. There are some more oddities like this. PRI is Puerto Rico, which is a US territory.

MOD_summary: implement proper tests

The old tests fail now because I removed the rounding functionality. I would rather have users use options(digits=2).

throws_error: support for strings and expressions?

I figured out a way to make it work for expressions, but that made it not work for strings. I can't seem to figure out a way to find out if something is a string or expression, so that the function knows what to do.
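
One possible approach, sketched under assumptions: capture the argument unevaluated with substitute() and check whether it is a character constant. This only recognizes a literal string typed in the call. A variable that holds a string arrives as a symbol and is evaluated as an ordinary expression. throws_error_sketch is a hypothetical name.

```r
throws_error_sketch <- function(expr) {
  env <- parent.frame()        # capture the caller's environment up front
  e <- substitute(expr)
  # A literal string arrives as a character constant; parse it into code.
  if (is.character(e)) e <- parse(text = e)[[1]]
  tryCatch({
    eval(e, envir = env)
    FALSE
  }, error = function(cond) TRUE)
}

r1 <- throws_error_sketch(stop("x"))      # expression that errors
r2 <- throws_error_sketch("stop('x')")    # same thing as a string
r3 <- throws_error_sketch(1 + 1)          # expression that does not error
r4 <- throws_error_sketch("1 + 1")        # string that does not error
```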

A worse alternative is to make a function, throws_error_str, by analogy with aes_string from ggplot2.

MOD_summary: relative importance metrics for predictors

There's a good overview here:

http://stats.stackexchange.com/questions/64010/importance-of-predictors-in-multiple-regression-partial-r2-vs-standardized

Also look into rms package and see what can be used to avoid reinventing stuff.

wtd_mean: add param to not throw error on all NA input

In a few cases, the error from all NA input is undesired and NA output would be better. Add param to make this possible.
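
A sketch of how such a parameter might behave. wtd_mean_sketch and error_on_all_na are hypothetical names, and the weighting logic is a plain stand-in rather than the package's implementation.

```r
wtd_mean_sketch <- function(x, w = rep(1, length(x)), error_on_all_na = TRUE) {
  keep <- !is.na(x) & !is.na(w)
  if (!any(keep)) {
    # All values missing: either complain or return NA, as requested.
    if (error_on_all_na) stop("No non-missing values in x.", call. = FALSE)
    return(NA_real_)
  }
  sum(x[keep] * w[keep]) / sum(w[keep])
}

m <- wtd_mean_sketch(c(1, 3), w = c(1, 3))
na_out <- wtd_mean_sketch(c(NA, NA), error_on_all_na = FALSE)
```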

MOD_LASSO: replace messages with progress bar

plyr's **ply functions have an option to specify a progress bar. Instead MOD_LASSO only offers to write a message for every simulation run. This is inefficient and clumsy. It should be replaced with a progress bar of some sort.

There is a built in function for this, but it's not quite optimal. Perhaps copy the code from plyr?

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/txtProgressBar.html
https://github.com/hadley/plyr/blob/397a4bd0e1c7569316f5d5d014f24933b5cbba1c/R/progress.r

Looks like it's just a wrapper for the utils function.
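
For reference, a minimal sketch of the utils progress bar around a loop. The loop body is a placeholder for one simulation run.

```r
n_runs <- 5
results <- numeric(n_runs)

pb <- utils::txtProgressBar(min = 0, max = n_runs, style = 3)
for (i in seq_len(n_runs)) {
  results[i] <- i^2                    # placeholder for one simulation run
  utils::setTxtProgressBar(pb, i)      # advance the bar
}
close(pb)
```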

fa_Jensens_method: slow tests

The test uses latent correlations due to the data chosen. This is slowing down the unit testing. Find some other data.

GG_denhist and SMD_matrix, error/bad output with groups without data

Error occurs when some groups have no data, and the coloring thus cannot be applied. These cause NAs in the central_tendency vector. This gets passed on to:

g = g + geom_vline(xintercept = central_tendency, linetype = "dashed", 
  size = 1, color = colors)

Which causes the error:

Error: Aesthetics must be either length 1 or the same as the data (6): colour, size, linetype

Solution is to exclude such groups before trying to plot.
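
A sketch of the exclusion step on made-up data. An unused factor level produces an NA in the per-group summary; dropping empty levels first removes it.

```r
d <- data.frame(
  x = c(1, 2, 3, 4),
  group = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)

# Level "c" has no data, so its central tendency is NA.
ct_before <- tapply(d$x, d$group, mean)

# Remove incomplete rows, then drop levels that no longer have any data.
d_ok <- d[!is.na(d$x) & !is.na(d$group), ]
d_ok$group <- droplevels(d_ok$group)
ct_after <- tapply(d_ok$x, d_ok$group, mean)
```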

This problem also occurs with SMD_matrix. It does not error, but gives bad output:

> SMD_matrix(d$g_noage, group = d$sex)
             missing inapplicable refused don't know male female
missing           NA          NaN     NaN        NaN  NaN    NaN
inapplicable     NaN           NA     NaN        NaN  NaN    NaN
refused          NaN          NaN      NA        NaN  NaN    NaN
don't know       NaN          NaN     NaN         NA  NaN    NaN
male             NaN          NaN     NaN        NaN   NA   0.22
female           NaN          NaN     NaN        NaN 0.22     NA