sixianghu / dataman Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 0.0 843 KB

R package for data cleaning, preliminary data analysis and modeling assessing with visualisation.

R 92.02% C++ 7.98%

data-analysis model-checking model-predictions plot plotly preliminary-analysis r visualisation

dataman's People

Contributors

Stargazers

Watchers

dataman's Issues

Consistency Test

Consistency test by time and randno
Deal with Nan and 0
Necessary of vector in by parameter
More efficient

"'origin' must be supplied" in PopMiss

When populating the missing value for Date variable, an origin is needed when the replace parameter given is an negative number.

[DataSummary]Weighted Mean Values are not correct when including NA

test = data.frame(a = c(1,NA),b = c(1,1))
DataSummary(test)

VarName VarType Unique Missing Missing_pct Mean Min Max Entropy
1: a numeric 2 1 0.5 0.5 1 1 0
2: b numeric 1 0 0.0 1 1 1 0

That when x include NAs, the sum of weighted is less than sum(weight,na.rm=TRUE) because there is no NA in weight.

Legend on hist plot

Find a way to get rid of the legend in dataPlot's and modePlot's legend in hist plot.

Currently, this is not supported by rbokeh

Need to take `weight` into the `interPlot` function

We may need a weight variable like other plot functions.

Need a model comparison function for all ML methods

Class of Time Variable Cause One More Output to Final Table

If there is a time variable with class of "POSIXct" or "POSIXt", both of time classes will be outputed, hence in the final tables, errors will jump out saying:

Error in .Method(..., row.names = row.names, check.rows = check.rows, :
arguments imply differing number of rows: 217, 218

Summary of statistics about models

axis title of `interPlot` can be specified.

[interPlot] the tooltip needs to reflect the x-axis title

[CramersV]Add option to apply correction.

[dataPlot]Add `offset` paramter the same as `compPlot`

Features

which terms in the model have been affect by new factors and in what way?
coefficient
changes by level (vis)
Is it because of correlation?
What this new factor has explained?

Rate Compare

rateComp <- function(df,fac,peril,reorder=FALSE){
  df_sub <- filter(df,(KEY==fac) & (Peril == peril))

  if(reorder){
    suppressWarnings(min_x <- min(as.numeric(df_sub$Level1),na.rm = TRUE))
    suppressWarnings(max_x <- max(as.numeric(df_sub$Level1),na.rm = TRUE))
    suppressWarnings(oth_lvl <- as.character(df_sub$Level1[is.na(as.numeric(df_sub$Level1))]))
    df_sub$Level1 <- ordered(factor(df_sub$Level1,levels=c(paste(min_x:max_x),oth_lvl)))
  }
  df_sub <- df_sub[order(df_sub$Level1),]
  plot_ly(df_sub,x=Level1,y=Value.x, xaxis = "x1", yaxis = "y1",name="Model1") %>%
          add_trace(df,x=Level1,y=Value.y, xaxis = "x1", yaxis = "y1",name="Model2") %>%
    layout(xaxis=list(title=fac),yaxis=list(title=peril))
}

Function to check the consistency of the data

dataComp <- function(old,new,alpha=0.05){
  strName <- names(old)
  res <- data.frame(Name = NULL, Changes = NULL)
  for (i in strName){
    if (i %in% names(new)){
       x<- chisq.test(table(old[,i]),p=table(new[,i])/sum(table(new[,i])))
       if (x$p.value > alpha) {
         tmp <-data.frame(Name=i,Changes=FALSE)
         res <- rbind(res,tmp)
       }
    }
  }
  res
}

Initial model assessing

This initial model assessing step is to check each factor's ability of prediction based on different metrics in a stepwise analysis:

- F-Score from partial F-test glm
- Variance
- Correlation to Response
- Correlation feature selection
- Gain from quick decision tree / randomForest / xgboost
- Information Value

Support `train` in `modelPlot`

Because caret package uses a wrapped predict function, which requires type to be one of raw and prob.

Hence need to change the code in modelPlot to support caret's train model object.

Unified structure for Machine Learning model comparison

A structure or template is needed to conduct this model comparison.

A PMML structure would be a good starting point, even though not all ML package support it.

Issues relate to v0.4

Depricated

ly_hist and ly_bar cannot plot missing value
Reason: missing value should be dealing with out side of the dataPlot.
onLoad install the package rbokeh from git_hub
Reason: It is not safe for a development package to be installed automatically. Will guide user to install it manually.
write .onLoad function to install rbokeh from github (e.g. https://github.com/trinker/gmailR/blob/master/R/zzz.R
http://stackoverflow.com/questions/24625427/should-onload-r-be-included-in-the-description-file-collate-field)
Reason: the same as abpve.

Min and Max Value Improvement

When there are lots of null or missing values, DataSummary should still populate the min and max from non-missing values.

3D plot for 2-way interactions

This may not be the final solution, but, can use rgl or rbokeh for this purpose.

If it is under rbokeh, then probably will use hexbin plot ly_hexbin or image plot ly_image

Efficiency on CramersV

Need a efficiency on CramersV, which is slow atm.
Possible solution is call C or C++ implementation.

Issue relate to v0.3

Consistency test by time and random number (v0.3)
1. Deal with NaN and 0
2. Necessary of vector in by parameter
CramersV test (v0.3)
Data Exploratory Report Template (RMarkdown) (v0.3)
residual Plot (v0.3)

Need a prediction compare view

library(RColorBrewer)

modelCompPlot <- function(x,act,pred,by=NULL,weights=NULL,newGroupNum=10){
  if (is.null(x)) stop("x provided is blank.")
  if (is.null(act)) stop("act provided is blank.")
  if (is.null(pred)) stop("pred provided is blank.")

  str_pred <- NULL
  num_pred <- NULL
  if (length(x) != length(act)) stop("x and act don't have the same length")
  if (is.vector(pred)){
    str_pred <- "pred"
    num_pred <- 1
    if(length(x) != length(pred)) stop("x and pred don't have the same length")
  }
  else {
    str_pred <- colnames(pred)
    if (is.null(str_pred)) str_pred <- paste("Pred",1:nrow(pred),sep="")
    num_pred <- ncol(pred)
    if(length(x) != nrow(pred)) stop("x and pred don't have the same length")
  }

  if(is.null(weights)) {weights <- rep(1,length(x))}

  #New Group for data which has too much levels.
  if ( (is.numeric(x) || is.integer(x)) && nlevels(as.factor(x))>100 ) {
    if ( is.null(newGroupNum) ) newGroupNum <- 10

    new_band <- seq(min(x, na.rm = TRUE),max(x, na.rm = TRUE),length.out=newGroupNum)
    x <- as.character(cut(x,new_band,include.lowest = TRUE))
  }

  if(!is.null(by)){
    data.plot <- data.table::as.data.table(as.data.frame(cbind(xvar=x,by=by,act,pred,weights),stringsAsFactors=FALSE))
    setkey(data.plot,xvar,by)

    dp_name_str <- c("act",str_pred,"weights")
    data.plot <- data.plot[,lapply(.SD,as.numeric),by=list(xvar,by),.SDcols=dp_name_str]
    data.agg  <- data.plot[,lapply(.SD,weighted.mean,w=weights),by=list(xvar,by),.SDcols=dp_name_str]

    p1 <- rbokeh::figure(xlab="x",ylab="",height=500) %>% 
      rbokeh::ly_lines(xvar,act,color=by,type=list(2),
                      width=2,data=data.agg) 

    for (i in num_pred){
      p1 <- p1 %>% rbokeh::ly_lines(xvar,deparse(substitute(str_pred[i])),color=by,type=list(2),
                                    width=2,data=data.agg) 
    }
  }
  else {
    data.plot <- data.table::as.data.table(as.data.frame(cbind(xvar=x,act,pred,weights),stringsAsFactors=FALSE))
    setkey(data.plot,xvar)

    dp_name_str <- c("act",str_pred,"weights")
    data.plot <- data.plot[,lapply(.SD,as.numeric),by=xvar,.SDcols=dp_name_str]
    data.agg  <- data.plot[,lapply(.SD,weighted.mean,w=weights),by=xvar,.SDcols=dp_name_str]

    p1 <- rbokeh::figure(xlab="x",ylab="",width=700,height=800) %>% 
      rbokeh::ly_lines(xvar,act,color="#CC3399",
                       width=2,data=data.agg,legend = "Actual") 

    str_col <- brewer.pal(max(num_pred,3), "Spectral")
    for (i in 1:num_pred){
      print(str_pred[i])
      p1 <- p1 %>% rbokeh::ly_lines(xvar,str_pred[i],color=str_col[i],width=2,data=data.agg,
                                    legend=str_pred[i]) 
    }
  }

  p1
}

add @useDynLib to the CramersV

Add #' @useDynLib CramersV to the CramersV c++ function to let Roxygen2 to generate the NAMESPACE automatically.

To Do List

boxplot for segmentation outlier
residual to tree methods (check variable missing)
single Tree visualisation on 2 dimension split, which can compare with different tree methods.
Travis.yml update to R template from current c template
EDA report
Quick Xgboost run for important variable
Quick abnormal detection run

Done:

Need a function to visualise the interactions or correlations:
Can use the networkD3 package for the network plot, or use the DiagrammeR , igraph and vivagraph http://www.buildingwidgets.com/blog/?offset=1435611529118. [#26]
Code efficient by either Rcpp or cleaner and elegant coding in R
different residuals can be used on residual Plot (e.g. student, pearson, cooker's, etc.) [#52]
deviance residual vs predicted target value [#52]
different residuals can be used on residual Plot (e.g. student, pearson, cooker's, etc.) [#52]

Not Now:

Correlation matrix calculation for numerical matrix or data frame. Or find a quicker way for correlation calculation on big data. [#24]
Distributed / parallel Calculation
Need a model comparison function for all ML methods. A structure or template is needed to conduct this model comparison. A PMML structure would be a good starting point, even though not all ML packages support it. Or use caret package straight away.
which terms in the model have been affect by new factors and in what way?
1. Coefficient
2. Changes by level (vis)
3. Is it because of correlation?
4. What this new factor has explained?
  This is more like feature selection, and may to specific to glm / lm method.
Trends analysis:
1. Within the model to assess the consistency
2. Between datasets on different times to assess the development
3. There should have a function to give this feature. A starting point is the AbnormalDetection package and changepoint package.
  More timeseries analysis. Cannot find specific / achievable criteria.
historgram of deviance residual (symmetry). (REASON: currentl resiPlot has contour plot for this.)
deviance residual (sqrt(weighted deviance) * sign(a-e)). (REASON: #52 provides an open slot for this.)
deviance plot for outlier visualisation. (REASON: currentl resiPlot has contour plot for this.)
support multinomial analysis. (REASON: not urgent need.)
RcppParalle for CramersV function (REASON: not urgent need. And parallel overheads may have inverse effect.)

change of the base of a variable in a glm model may affect the modelPlot

Current modelPlot ignores the actual base of variables in a glm, instead, it recalcualtes the mean or mode of the variable and treat them as the base.

If modeller specified another base and used in glm, the modePlot may not recognise it, hence may give wrong "model prediction at base level" line.

Need a function to visualise the interactions or correlations

Can use the networkD3 package for the network plot, or use the DiagrammeR , igraph and vivagraph

http://www.buildingwidgets.com/blog/?offset=1435611529118

order of the axis ticktext is not in desired order in interPlot

in interPlot plot (in README file as well), the order of axis ticktext is not in order.

The issue is that then creating the matrix for plotly, the column header is not created in order when given factor is not sorted.

A force sort may be needed for variable used.

A more realistic solution is that user define the order before feading into the interPlot function.

Add model unsimplified line for modelPlot

This function can only apply to a model structure with explicit link function.

`PopMiss` doesn't work when input is a `data.table`

A drop=FALSE or with=FALSE needed in PopMiss when the input is a data.table.

Because data.frame and data.table have different method to drop variables.

view the feature without request of the model

dataPlot <- function(data,xvar,yvar,weights=NULL,byvar=NULL,newGroupNum=NULL,interactive=FALSE,...){

  opts.list<-list(...)
  opts <- names(list(...))
  if("xlim" %in% opts) xlim<-opts.list$xlim
  if("ylim" %in% opts) ylim<-opts.list$ylim
  if("binwidth" %in% opts) binwidth<-opts.list$binwidth else binwidth <- 1

  # Error Trapping
  if( is.null(data) ) stop("data set provided is null.")
  if( is.null(xvar) ) stop("X variable provided is null.") 
  if( is.null(yvar) ) stop("Responce variable provided is null.")

  if (is.character(xvar)) {
    if(!xvar %in% colnames(data)) stop(paste("xvar variable (",xvar,") cannot be found.",""))
    x <- data[,which(names(data)==xvar)]
    xname <- xvar
  }
  else if (is.integer(xvar)) { 
    x <- data[,xvar]
    xname <- names(data)[xvar]
  }
  else stop ("xvar provided is either a character (variable name) or integer (position of the variable).")

  if (is.character(yvar)) {
    if(!yvar %in% colnames(data)) stop(paste("yvar variable (",yvar,") cannot be found.",""))
    yname <- yvar
    y <- data[,which(names(data)==yvar)]
  }
  else if (is.integer(yvar)) {
    y <- data[,yvar]
    yname <- names(data)[yvar]
  }
  else stop ("yvar provided is either a character (variable name) or integer (position of the variable).")

  if( !is.null(byvar) ){
    if (is.character(byvar)){
      if(!byvar %in% colnames(data) ) stop(paste("xvar variable (",byvar,") cannot be found.",""))
      by <- data[,which(names(data)==byvar)]
      byname <- byvar
    }
    else if (is.integer(byvar)) {
      xname <- names(data)[byvar]
      by <- as.character(data[,byvar])
    }
    else by <- NULL
  }
  else by <- NULL

  if( !is.null(weights) ){
    wname = "w"
    if (is.character(weights)){
      if(!weights %in% colnames(data) ) stop(paste("xvar variable (",weights,") cannot be found.",""))
      w <- data[,which(names(data)==weights)]
    }
    else if (is.integer(weights) && length(weights)==1) {
      w <- data[,weights]
    }
    else if (is.integer(weights) && length(weights)>1){
      if ( dim(data)[1] != length(weights) ) stop ("Length of weights is not the same as dimension of the data provided.")
      w <- weights
    }
  }
  else w <- rep(1,dim(data)[1])

  #New Group for xvar which has too much levels.
  if ( (is.numeric(x) || is.integer(x) ) && nlevels(as.factor(x))>100 ) {
    if ( is.null(newGroupNum) ) newGroupNum <- 10

    new_band <- seq(min(x),max(x),length.out=newGroupNum)
    x <- cut(x,new_band,include.lowest = TRUE)
  }

  #New Group for byvar which has too much levels.
  if(!is.null(by)){
    if ( (is.numeric(by) || is.integer(by)) && nlevels(as.factor(by))>100 ) {
      if ( is.null(newGroupNum) ) newGroupNum <- 10

      new_band <- seq(min(by),max(by),length.out=newGroupNum)
      by <- cut(by,new_band,include.lowest = TRUE)
    }
  }

  #Data for plot
  if (is.null(by)) {
    data.plot <- data.table::as.data.table(as.data.frame(cbind(x=x,y=y,w=w),stringsAsFactors=FALSE))
    data.table::setkey(data.plot,x)

    data.plot <- data.plot[,lapply(.SD,as.numeric),by=x,.SDcols=c("y","w")]
    data.agg <- as.data.frame(data.plot[,lapply(.SD,weighted.mean,w=w),by=x,.SDcols=c("y","w")],row.names=c("xvar","weights","observed"))
    data.freq <- as.data.frame(data.plot[,sum(w),by=x][,freq:=V1/sum(V1)])

    data.melt <- reshape2::melt(data.agg[,-3],id=c("x"))

    #line graph

    strV1 <- paste("Observation Analysis on: ",xname)

    gLine <- ggplot2::ggplot(data=data.melt,aes(x=x,y=value)) + 
      ggplot2::geom_line(size=1,colour= "magenta3") + ggplot2::geom_point(size=4,fill="white",shape=22)
    if(("xlim" %in% opts) && is.numeric(data.melt$x)) gLine <-gLine + ggplot2::scale_x_continuous(limits=xlim)
    else if(("xlim" %in% opts) && !is.numeric(data.melt$x)) gLine <-gLine + ggplot2::scale_x_discrete(limits=xlim)
    else if(("xlim" %in% opts) && is(data.melt[,"x"],"Date")) gLine <-gLine + ggplot2::scale_x_date(label=date_format("%y%m"),limits=xlim)
    if("ylim" %in% opts) gLine <-gLine + ggplot2::ylim(ylim)
    gLine <- gLine + ggplot2::xlab("") + ggplot2::ylab(yname)+ ggplot2::ggtitle(strV1)+ theme_mp_line
    if(nlevels(as.factor(data.melt$x))>25) gLine <- gLine + ggplot2::theme(axis.text.x = element_text(angle = 90,hjust=0.5,vjust=0.5))

    #histogram graph
    ghist <- ggplot2::ggplot(data=data.freq,aes(x=x,y=freq))+
      ggplot2::geom_histogram(stat="identity",colour="black",fill="yellow")
    if(("xlim" %in% opts) && is.numeric(data.melt$x)) ghist <-ghist + ggplot2::scale_x_continuous(limits=xlim)
    else if(("xlim" %in% opts) && !is.numeric(data.melt$x)) ghist <-ghist + ggplot2::scale_x_discrete(limits=xlim)
    ghist <- ghist + ggplot2::ylab("percent (%)")+
      ggplot2::scale_y_continuous(labels = percent)+
      ggplot2::xlab("")+ theme_mp_hist

    gridExtra::grid.arrange(gLine,ghist,ncol=1,nrow=2,heights=c(4,1))

    if (interactive) {
      df <- data.frame(data.agg,freq=data.freq$freq)
      gvisSingleOptionList <- list(pointSize=8,
                                   series="[
                                   {targetAxisIndex:0, type:'line',color:'magenta',pointShape: 'square'},
                                   {targetAxisIndex:1, type:'bars',color:'yellow'}]",
                                   crosshair="{trigger:'both'}",
                                   hAxis.title=xname,
                                   theme="maximized",
                                   title=paste0("Observation analysis on ",xname, " Observed"),
                                   vAxes="{1:{format:'##.#%',maxValue:1}}",
                                   explorer="{ actions: ['dragToZoom', 'rightClickToReset'],keepInBounds: true }",
                                   chartArea="{width:'90%',height:'90%'}",
                                   height=750)

      plot(googleVis::gvisComboChart(df,xvar="x",yvar="y",options=gvisSingleOptionList))

    }
  }
  else{
    data.plot <- data.table::as.data.table(as.data.frame(cbind(x=x,y=y,w=w,by=by),stringsAsFactors=FALSE))
    data.table::setkey(data.plot,x,by)

    data.plot <- data.plot[,lapply(.SD,as.numeric),by=list(x,by),.SDcols=c("y","w")]

    data.agg <- as.data.frame(data.plot[,lapply(.SD,weighted.mean,w=w),by=list(x,by),.SDcols=c("y","w")],row.names=c("xvar","by","weights","observed"))

    data.freq <- as.data.frame(data.plot[,sum(w),by=list(x,by)][,freq:=V1/sum(V1)])

    #line graph
    gLine1 <- ggplot2::ggplot(data=data.agg,aes(x=x,y=y,group=factor(by),colour=factor(by)))+
      ggplot2::geom_line(size=1) + ggplot2::geom_point(size=4,fill="white")
    if("xlim" %in% opts) gLine1 <-gLine1 + xlim(xlim)
    if("ylim" %in% opts) gLine1 <-gLine1 + ylim(ylim)
    gLine1 <- gLine1+ggplot2::xlab("")+ggplot2::ylab(yname)+ ggplot2::ggtitle(paste("Observation Analysis on: ",xname," by ",byname))+
      theme_mp_line
    if(nlevels(as.factor(data.agg$x))>25) gLine1 <- gLine1 + ggplot2::theme(axis.text.x = element_text(angle = 90,hjust=0.5,vjust=0.5))

    #histogram graph
    ghist <- ggplot2::ggplot(data=data.freq,aes(x=x,y=freq,fill=factor(by)))+ ggplot2::geom_histogram(stat="identity",binwidth=1)
    if("xlim" %in% opts) ghist <-ghist + ggplot2::xlim(xlim)
    ghist <- ghist + ggplot2::xlab("")+ ggplot2::ylab("percent (%)")+ ggplot2::scale_y_continuous(labels = percent) + 
      theme_mp_hist+ theme(legend.position="none")

    gridExtra::grid.arrange(gLine1,ghist,ncol=1,nrow=2,heights=c(4,1))
  }
}

axis not in order when numGroupNum used in dataPlot

correlation calculation on numerical dataset

Tried to have an improvement about the speed of cor function.

Currently, the following code works (with inline and RcppArmadillo)

src <- '
  mat m1 = as<mat>(x);
  return(wrap(cor(m1)));
  '

  fx <- inline::cxxfunction( signature(x = "numeric") , 
                     body = src, 
                     plugin='RcppArmadillo', 
                     includes = 'using namespace Rcpp;
                                 using namespace arma;')

It is slightly quicker than cor:

library(microbenchmark)
x <- as.matrix(mtcars)
microbenchmark(
  fx(x),
  cor(x)
)

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
  fx(x) 11.50953 11.67405 11.96615 11.82167 12.21154 13.20968   100  a 
 cor(x) 13.92392 14.22428 14.46255 14.37597 14.72041 15.27963   100   b

But on a 300 x 300 matrix, the time spending is the same. In this case, there is no benefit to continue on this?

No `max` or `mean.or.mode` for characters in a `data.table`

No max or mean.or.mode for characters in a data.table

CramersV gives correlation value >1

CramersV forgets to convert non-integer vector to integer

CramersV forgets to convert non-integer vector to integer.

This issue is in the CramersV.data.frame.

There are 2 solutions:
1: Use template in CramersV.cpp
2: Convert all non-integer vector to integer in CramersV.r in R

Model tests

Some tests can be used from RegressionTestsInterface in fRegression package:

"bg" Breusch–Godfrey test for higher order serial correlation,
"dw" Durbin–Watson test for autocorrelation of disturbances,
"harv" Harvey–Collier test for linearity,
"rain" Rainbow test for linearity, and
"reset" Ramsey’s RESET test for functional relation.

For regression (linear) residual test (because of the assumption that the variance must be constant)

"bp" Breusch–Pagan test for heteroskedasticity,
"gq" Goldfeld–Quandt test for heteroskedasticity,
"hmc" Harrison–McCabe test for heteroskedasticity,
"VIF" variance inflation factor https://en.wikipedia.org/wiki/Variance_inflation_factor

https://cran.r-project.org/web/packages/fRegression/fRegression.pdf

[rocPlot]In some cases, the ROC is 1-ROC

When the class of response variable is character or factor, the positive response may be recognized as negative label.

Please check roc function in pROC package for more information (the levels parameter).

Efficiency of the coding

Need more efficiency on DataSummary, DetMiss and PopMiss.

Data Report Markdown

Enhance the data report marrkdown:

using formattable
plotly in rmarkdown
parametise the rmarkdown

[interPlot] Legend Name Needs to be changed from `dt2`

Detecting Trends

Sometimes we may want to monitoring the trends:

within the model to assess the consistency
between datasets on different times to assess the development

There should have a function to give this feature. A starting point is the AbnormlDetection package and changepoint package.

Suggestions for improvements to the DataPlot function

It would be good to have control over the location of the legend and the max number of bins on the x axis.

I think this is easy to do with the following amendments to the code (I'm only showing lines that need to change with the changes in bold font)

dataPlot <- function(data,xvar,yvar,byvar=NULL,weights=NULL,
newGroupNum=10, legend_x= 0.5, legend_y = 1.0, x_maxBins = 100, y_maxBins = 20)

if ( (is.numeric(x) || is.integer(x) ) && nlevels(as.factor(x))>=x_maxBins ) {

#New Group for byvar if it has too many levels.
if(!is.null(by)){
if ( (is.numeric(by) || is.integer(by)) && nlevels(as.factor(by))>y_maxBins) {

l <- list(bordercolor = "#000000",borderwidth=1,x = legend_x, y = legend_y)

Regards

Nigel

[DataSummary] Option to include or exclude Entrophy stats calculation.

A data partition function to split data into multi parts

A partition function to split the data into different proportion.

multiParts <- function(y,p){
  if(is.null(y)) stop("no inputs")

  if(sum(p)<1) warning("Sum of proportion is less than 1, 
                       rest of the data will be dropped.")

  if(sum(p)>1) stop("Sum of the proportion is larger than 1.")

  rn <- 1:length(y)
  np <- length(p)

  res <- list()

#   Use the createDataPartition from caret package: 
#   for (i in 1:np){
#     ind <- caret::createDataPartition(y=rn,p=p[i])
#     res[[i]] <- rn[ind[[1]]]
#     rn <- rn[-ind[[1]]]
#     p <- p/(sum(p[-c(1:i)]))
#   }


#   Use sample function instead
  for (i in 1:np){
    ind <- sample(1:length(rn),size=floor(p[i]*length(rn)))
    res[[i]] <- rn[ind]
    rn <- rn[-ind]
    p <- p/(sum(p[-c(1:i)]))
  }  
  res
}

A nice cutter to set a proper range for plotting

When there a variable has many unique levels, modelPlot tries to banded the variable into a default 20 groups

Will this method can be improved?
interPlot method may need a separate solution because of the melting process.

PopMiss replaces only one column when there are columns have the same names

x <- c(1,NA,2)
df <- as.data.frame(cbind(x,x,x))
names(df)
PopMiss(df,na.treatment = "replace",0)

[1] "x" "x" "x"

  x  x  x
1 1  1  1
2 0 NA NA
3 2  2  2

CramersV on user defined class

CramersV function failed on diamonds dataset due to variable type Ord.factor.

Needs to check the second class if there is any. Other wise, drop the variable to make sure the other variables will be analyised.

[DataSummary]NA is counted as unique value when specified in "missing"

test = data.frame(a = c("A","NA","NULL","NaN",""),b = c(1,NA,NaN,Inf,-Inf))
DataSummary(test)

VarName VarType Unique Missing Missing_pct Mean Min Max Entropy
1: a factor 5 1 0.2 2 1 2 0
2: b numeric 5 2 0.4 NaN -Inf Inf 0

sixianghu / dataman Goto Github PK

dataman's People

Contributors

Stargazers

Watchers

dataman's Issues

Recommend Projects

Recommend Topics

Recommend Org