Giter VIP home page Giter VIP logo

standardfunctions's Introduction

title author date output subtitle
Standard Functions for Basic Statistical Analysis
Bruno Fischer Colonimos
17 juin 2016
html_document pdf_document word_document
css number_sections theme toc
customex2.css
true
readable
true
toc
true
default
Readme file

Overview

Motivation

Here is an attempt to build a set of very basic tools for (exploratory) statistical analysis. For each situation, (ie each combination of types of the variables being analyzed) , one would call a standard function, which will provide a set of most frequently useful analysis elements.

Each main function should return a list of the following elements:

  • The name(s) of the variable(s) analyzed (vector)
  • The number of (valid) cases
  • some summaries
  • one or more tables
  • one or more classical tests most often performed in this situation.
  • one or more graphs (ggplot)

The main functions

One variable

cat1 ~ 1 categorical variable (factor)

num1c ~ 1 (continuous) numeric variable

num1d ~ 1 (discrete) numeric variable

Two variables

cat2 : ~ 2 factors

cat1num1d ~ 1 categorical variable (factor) + 1 (discrete) numeric variable

cat1num1c ~ 1 categorical variable (factor) + 1 (continuous) numeric variable

num2 ~ 2 numeric variables


Code Description: Constants and settings

Function sfinitdefaults : Initialize global constants and default values

Usage {-}

sfdefault <- sfinitdefaults() 

Arguments {-}

None

Value {-}

The function sfdefault

Details {-}

sfinitdefaults(name) constructs the initial list of defaults and returns the function to access and modify it

Example {-}

Function sfdefaults() : set global constants and default values

Usage {-}

sfdefaults(name) 
sfdefaults(name, value) 

Arguments {-}

name the name of the default to set/get
value the value to set the default to

Value {-}

The value of the default name

Details {-}

sfdefaults(name) gets the value associated with name sfdefaults(name, value) sets the value associated with name

Example {-}

# set the language 
sfdefault("language","french")

# set the summaries names 
if(sfdefault("language") == "french") { 
        sfdefault("namesum", sfdefault("namesumfrench")) 
} else { sfdefault("namesum", sfdefault("namesumeng")) 
}

# report number of NA's in a variable? 
# sfdefault("reportNA", FALSE) 

Code Description: Return values structure, creation and access

Structure description

The return value of the function is sructured as a named list of the following elements

name ~ the name of the variable or a vector of names

  • some summaries and tables

numcases ~ the number of cases

summaries ~ a vector (or table ?) of numerical summaries

levels ~ the levels of the (potentially reordered) factor

breaks ~ the breaks used (for a continuous variable)

closed ~ the closed argument that goes with breaks (cf cut or geom_histogram)

table ~ a table

tabledf ~ a dataframe table

ptable ~ a printable table

  • some tests

chi2 ~ Chisquare test

anova ~ oneway()

  • some plots

plot ~ an appropriate plot

The list of fields is yet incomplete.

Function make.result(): make a result list.

Usage {-}

make.result <- function(name = NULL, numcases = NULL, summaries = NULL,
                        levels = NULL, breaks = NULL,closed= NULL,
                        table = NULL, tabledf = NULL, ptable = NULL,
                        chi2 = NULL, anova = NULL,
                        plot = NULL ) 

Arguments {-}

<field> the value to assign do the field

Value {-}

A named list

Details {-}

Unsupplied elements are assigned default=NULL and not included in result list
For the time being, there is no functions layer to access the fields. Simply do : list$field

Example {-}

# creation
res <- make.result(name = "gender",
                    numcases = 120,
                    levels = c("man", "women"))
# access
res$numcases

##[1] 120
                  

Functions Description: Helper functions

General utility functions

Utility function assoc.op : associative operator function.

Usage {-}

assoc.op(opname, listargs)

Arguments {-}

opname ~ the operator/function name (string)

listargs ~ the list of arguments to apply the operator to

Value {-}

a single result (of any type)

Details {-}

applies a binary operator or function to a list of (many) arguments

Example {-}

x <- c(T, F, F, T, T, T, F)
y <- c(T, F, T, F, T, F, F)
z <- c(T, T, T, T, T, T, F)

assoc.op("&", list(x,y,z) )
##[1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

assoc.op("|", list(x,y,z) )
##[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Utility function: is.warning() : Is this a warning ?

Usage {-}

is.warning(x)

Arguments {-}

x ~ the value to be tested

Value {-}

TRUE if x is a warning else FALSE

Details {-}

definition:

is.warning <- function(x) {"warning" %in% class(x)}

Example {-}

None

Functions for filtering out NA's

Utility function: nonadf(): filter out NAs from specified dataframe columns

Usage {-}

nonadf(dataf, ..., useNA = "no")

Arguments {-}

dataf ~ a dataframe

... ~ a succession of variable names which we want to filter out the NAs from

Value {-}

Details {-}

Example {-}

nonadf(dtf, "cval1", "cval2", "nam1","dval2")

Utility function: nonavect() : remove NAs from a vector

Usage {-}

nonavect(x)

Arguments {-}

x ==> the vector

Value {-}

the vector with NA's removed

Details {-}

definition:

nonavect <- function(vect) {vect[which(!is.na(vect))]}

Example {-}

None

Simple and multiple summary tables

Function sumvector(): generate a named vector of summaries

Usage {-}

sumvector(var, dnames = sfdefault("namesum"), 
          reportNA = sfdefault("reportNA")) 

Arguments {-}

var the variable (a numeric vector)
dnames The names for the summary vector
reportNA report the number of NA's ?

Value {-}

a vector of summaries

Details {-}

The names of the summaries ("namesum") are set with sfdefaults. 2 sets of names are defined initially:

namesumeng = c("n", "Mean", "St.dev", 
                "Min.", "1st Qu.","Median", "3rd Qu.", "Max.", " NA's") 
namesumfrench  = c("n", "Moyenne", "Ecart-type", 
                "Min.", "Q1","M?diane", "Q3",  "Max.", " NA's"),

Example {-}

sumvector(mpg[["hwy"]])

##          n    Moyenne Ecart-type        Min.        Q1    M?diane        Q3      Max. 
## 234.000000  23.440000   5.954643  12.000000  18.000000  24.000000 27.000000  44.000000

Function cbsummaries() : combined summaries tables for different variables in a dataframe, for all individuals

Usage {-}

cbsummaries(dataf, vnames)

Arguments {-}

dataf
vnames = vector of variable names

Value {-}

the summary vectors, as columns of a dataframe

Details {-}

Example {-}

cbsummaries(mpg, c("hwy", "cty"))

##
                  hwy        cty
n          234.000000 234.000000
Moyenne     23.440000  16.860000
Ecart-type   5.954643   4.255946
Min.        12.000000   9.000000
Q1          18.000000  14.000000
M?diane     24.000000  17.000000
Q3          27.000000  19.000000
Max.        44.000000  35.000000

Function condsummaries() : combined summaries for one variable, conditional to the values of a factor

Usage {-}

condsummaries(dataf, vname, fname)

Arguments {-}

dataf ==> the dataframe
vname ==> the numeric variable name
fname ==> the factor's name

Value {-}

a dataframe combining as columns the summaries of vname, for each level of fname

Details {-}

Example {-}

condsummaries(mpg,"hwy","class")

##             2seater  compact  midsize   minivan   pickup subcompact       suv
## n           5.00000 47.00000 41.00000 11.000000 33.00000  35.000000 62.000000
## Moyenne    24.80000 28.30000 27.29000 22.360000 16.88000  28.140000 18.130000
## Ecart-type  1.30384  3.78162  2.13593  2.062655  2.27428   5.375012  2.977973
## Min.       23.00000 23.00000 23.00000 17.000000 12.00000  20.000000 12.000000
## Q1         24.00000 26.00000 26.00000 22.000000 16.00000  24.500000 17.000000
## M?diane    25.00000 27.00000 27.00000 23.000000 17.00000  26.000000 17.500000
## Q3         26.00000 29.00000 29.00000 24.000000 18.00000  30.500000 19.000000
## Max.       26.00000 44.00000 32.00000 24.000000 22.00000  44.000000 27.000000

Data manipulation : Tables, ordering, error/warning testing

Function condfreqtable(): Make a conditional frequency table for 2 factors

Usage {-}

condfreqtable(dataf, nomfact1, nomfact2, useNA = "no") 

Arguments {-}

dataf the dataframe
nomfact1 first factor name (string)
nomfact2 2nd factor name (string)
useNA = "no" keep or remove missing values ("no" or anything else)

Value {-}

a dataframe: for each value of nomfact1, the conditional frequency for nomfact2 Dataframe columns :

  • nomfact1
  • nomfact2,
  • joint frequency ("num")
  • conditional relative frequency ("perc") for nomfact2 given nomfact1

Details{-}

currently, Only works correctly with data.frames, not tbl_df. tbl_df's should be treated with as.data.frame() before calling the function. working on repairing this. Corrected. Should work ok.

Function orderfact() : reorder the levels of a factor

Usage {-}

orderfact(dataf, nomfact, orderfreq = TRUE, orderdesc = TRUE, 
                ordervar = "c..nt", orderval = NA, orderfun = sum, nlevels = NULL)

Arguments {-}

dataf ==> The dataframe
nomfact ==> The factor's name
orderfreq ==> should the factor's levels be reordered according to frequency or another var?
orderdesc ==> in descending order if TRUE
ordervar ==> the ordering variable (if not supplied or "c..nt", the number of rows)
orderval ==> According to the frequency of a certain value ? (if supplied)
orderfun ==> summary function to use on ordervar
nlevels==> if supplied, directly reorder the levels as given here. Overrides everything else.

Value {-}

The factor, with reordered levels, according to either:

  • the frequency of the levels
  • a function of the values of another (numeric) variable
  • the relative frequency of a value of another variable, within all the rows of a level

Details

Do not reorder with this function an ordered factor ?!!
Currently, only works correctly with data.frames, not tbl_df. tbl_df's shouldbe treated with as.data.frame() before calling the function. Not true anymore. Should work ok.

Example {-}

missing example code

Statistical functions

Function try.chisq.test() : chisquare test with warning detection

Usage {-}

try.chisq.test(..., keep.all = TRUE)

Arguments {-}

... ==> The arguments to chisq.test
keep.all = TRUE ==> return all the results (if TRUE), or only a valid test

Value {-}

if keep_all = TRUE : a named list : * test1 , * test2, * warning = TRUE/FALSE , * warningmsg = ww ) Else,
a named list : * test1 , #(only)

Details {-}

The function calls chisquare.test and checks for a warning. if there is no warning:

  • test1 = classical chi2 test
  • test2 = simulation chi2 test

else (If there is a warning), the reverse:

  • test1 = simulation chi2 test
  • test2 = classical chi2 test

then, if keep-all = true, everything is returned, if not, only test1 is returned
In all cases, test1 is normally the most valid test.
Applies to all the chisquare tests

Example {-}

No example yet

Functions Description: Main functions

Function cat1() : Analyze one categorical variable (factor)

Usage {-}

cat1(dataf, nomfact, useNA = "no", 
        orderfreq = TRUE,orderdesc = TRUE, ordervar = "c..nt", 
        orderval = NA, orderfun = sum, 
        rfreq = TRUE, digits = 2, cfill = "steelblue")

Arguments {-}

dataf The dataframe
nomfact The factor's name
useNA = "no" keep or remove missing values ("no" or "ifany" or "always")
orderfreq = TRUE ....... Arguments for orderfreq .....
orderdesc = TRUE
ordervar = "c..nt" = variable to use for ordering,
orderval = NA = value if the ordering variable is the frequency of ordervar == value
orderfun = sum
rfreq = TRUE resulting yscale of barchart = relative frequency or absolute
digits = 2 rounding of relative frequency (2 = whole percentages)
cfill = "steelblue" fill of barchart

Value {-}

named list:

  • name = factor name
  • levels = factor levels
  • table = frequency table.
  • num = number of observations
  • uchisq = Chi-square test for GoF with the uniform distribution (done with chisq.test)
  • plot = ggplot2 bar chart

Details {-}

Example {-}

# example 1 mc <- cat1(as.data.frame(mpg), "class")

# with plot value labels and cosmetic changes 
mc$plot + 
        geom_text(data=mc$table, 
                  aes(x=class, y = 100 * rfreq - 1.5, label = ifelse(index <= 6, perclabs, ""))) + 
        theme(axis.text.x = element_text(angle=45, hjust=1)) + 
        labs(title = "Class", x ="", y = "percentage")

# 2nd example, with ordering by the frequency of a value in another variable 
mc <- cat1(as.data.frame(mpg), "class", ordervar = "drv", 
           orderval = "4", orderdesc = FALSE) 
mc$plot + geom_text(data=mc$table, aes(x=class, y = 100 * rfreq - 1.5, 
                                       label=ifelse(rfreq >= 0.19, perclabs, ""))) + 
        theme(axis.text.x =element_text(angle=45, hjust=1)) + 
        labs(title = "Class", x = "", y = "percentage")

# checking if differences may be significant mc$uchisq$p.value

Function num1d : Analyze one numerical variable (discrete)

Usage {-}

num1d(dataf, nomvar, digits = 2, sumdigits = 2, useNA ="no", 
      rfreq = TRUE, width = .5, cfill = "steelblue")

Arguments {-}

dataf nomvar name of the numeric variable
digits = 2 number of digits for relative frequency
sumdigits = 2 number of digits for the summaries
useNA ="no", Show missing values
rfreq = TRUE Use relative frequency Y-axis
width = .5 Width of bars
cfill = "steelblue" default fill colour

Value {-}

A named list:

  • name = name of variable
  • summaries = named vector [number of cases (Num), mean (Mean), std dev (St.dev), Min., 1st Qu.,Median,3rd Qu., Max., NA's]
  • table = Frequency table. Table columns:
    • variable name * num = frequency
    • rfreq = relative frequency
    • perclabs = percentage labels * numlabs = frequency labels
    • index = row index
  • num = number of cases
  • uchisq = chisquare test for GoF with a uniform distribution
  • plot = ggplot2 bar chart

Details {-}

Example {-}

# test with mpg 
# cyl <- num1d(as.data.frame(mpg), "cyl") 
cyl$plot + xlab("Cylinders") + ylab("Percentage") 

Function cat2 : Analyze two categorical variable (factors) together

Usage {-}

cat2(dataf, nomfact1, nomfact2,  useNA = "no", 
     orderfreq1 = TRUE, orderdesc1 = TRUE, ordervar1 = "c..nt", 
     orderval1 = NA, orderfun1 = sum, nlevel1 =NULL, 
     orderfreq2 = TRUE, orderdesc2 = TRUE, ordervar2 = "c..nt", 
     orderval2 = NA, orderfun2 = sum, nlevel2 =NULL, 
     rfreq = TRUE, digits = 2, cfill = "steelblue")

Arguments {-}

dataf The dataframe
nomfact The 1st factor's name
nomfact2 The 2nd factor's name
useNA = "no" orderfreq1 = TRUE
orderdesc1 = TRUE
ordervar1 = "c..nt"
orderval1 = NA
orderfun1 = sum
nlevel1 =NULL
orderfreq2 = TRUE
orderdesc2 = TRUE
ordervar2 = "c..nt"
orderval2 = NA
orderfun2 = sum
nlevel2 = NULL
rfreq = TRUE unused but should mean relative frequency (vs absolute, ie position = "fill" or "stack")
digits = 2
cfill = "steelblue" (unused)

Value {-}

Details {-}

Example {-}

# ex 1
mp2 <- cat2(as.data.frame(mpg), "class", "drv", orderfreq1 =TRUE, 
            ordervar1 = "drv" , orderval1 = "f", orderfun1 = mean) 
mp2 
mp2$plot + 
        geom_text(data = mp2$table$tbl1 , aes(x = class, y = -.05, label = numlabs)) + 
        geom_text(data = mp2$table$tbl1 , aes(x = class, 
                                              y = percval - 0.03, 
                                              label = ifelse( index <= 4, perclabs, ""))) +
        theme(axis.text.x = element_text(angle=45,hjust=1)) + 
        labs(title = "Drive per class", x = "Class", y = "percentage")



# ex 2
mp3 <- cat2(as.data.frame(mpg), "class", "drv", orderfreq1 =TRUE, 
            ordervar1 = "drv" , orderval1 = "4", orderfun1 = mean,
            orderfreq2 =TRUE, nlevel2 = c("4","r","f"))
mp3

mp3$plot +
        geom_text(data = mp3$table$tbl1 , 
                  aes(x = class, y = -.05, label = numlabs)) +
        geom_text(data = mp3$table$tbl1 , 
                  aes(x = class,
                      y = percval - 0.03,
                      label = ifelse( index <= 4, perclabs, ""))) +
        theme(axis.text.x = element_text(angle=45, hjust=1))

Templates

Function Namefun : Template ---

Usage {-}

code

Arguments {-}

Value {-}

Details {-}

Example {-}

print(try(log("a"), TRUE)) # dummy code

Function Namefun : Template ---

Usage {-}

code

Arguments {-}

Value {-}

Details {-}

Example {-}

missing example code

standardfunctions's People

Contributors

brufico avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.