Giter VIP home page Giter VIP logo

kaufman-lab / intervalaverage Goto Github PK

View Code? Open in Web Editor NEW
3.0 4.0 1.0 286 KB

time-weighted averaging function for values measured over intervals. within groups. arbitrary number of value variables simultaneously. fast and memory efficient. values measured over intervals can be "downsampled" to shorter periods or averaged up to longer periods. measurement intervals don't need to fit cleanly into averaging periods.

Home Page: https://kaufman-lab.github.io/intervalaverage/

License: GNU General Public License v3.0

R 91.14% C++ 8.86%

intervalaverage's Introduction

intervalaverage

Travis build status

Perform fast and memory efficient time-weighted averaging of values measured over intervals into new arbitrary intervals.

This package is useful in the context of data measured or represented as constant values over intervals on a one-dimensional discrete axis (e.g. time-integrated averages of a curve over defined periods).

The intervalaverage package was written specifically to deal with air pollution data recorded or predicted as averages over sampling periods. Data in this format often needs to be shifted to non-aligned periods or averaged up to longer durations (e.g. averaging data measured over sequential non-overlapping periods to calendar years).

This package makes careful use of optimized data.table syntax to reduce copying so it is well suited to deal with large datasets. The function inputs and return values are always data.tables (with values, interval starts and interval ends stored as columns) which provide a straightforward representation of the data facilitating direct inspection and manipulation via data.table’s standard interface. Functions also accept optional grouping variables as arguments; these group-by operations get passed to data.table which avoids the overhead of looping directly in R.

Installation

You can install the package from CRAN:

install.packages("intervalaverage")

Or you can install the current development version of the from GitHub:

# install.packages("devtools")
library(devtools)
install_github("kaufman-lab/intervalaverage", "main", build_vignettes=TRUE)

or you can manually clone this repository from github and install from source.

Example

library(intervalaverage)
#> Loading required package: data.table

Consider some PM2.5 data measured over periods occurring over a repeating 7-day schedule:

(x <- data.table(start=seq(1L,by=7L,length=6),
                 end=seq(7L,by=7L,length=6),
                 pm25=c(10,12,8,14,22,18)))
#>    start end pm25
#> 1:     1   7   10
#> 2:     8  14   12
#> 3:    15  21    8
#> 4:    22  28   14
#> 5:    29  35   22
#> 6:    36  42   18

Note that the start and end columns define closed intervals (i.e. inclusive).

Imagine that we would like these PM2.5 measurements to be represented over a different set of intervals, such as the following 7-day schedule which does not align cleanly with the intervals in x:

(y <- data.table(start=seq(3L,by=7L,length=6),
                 end=seq(9L,by=7L,length=6)))
#>    start end
#> 1:     3   9
#> 2:    10  16
#> 3:    17  23
#> 4:    24  30
#> 5:    31  37
#> 6:    38  44

Representing the PM25 measurements over the exact intervals in y is not strictly possible, since each y interval contains two x intervals which only partially overlap.

While there are several ways to compromise and approximate a representation of x over the periods in y, one reasonable approach is to average all the values of PM2.5 in x occuring over the interval of y. But since the two overlap durations of the two x periods into a single y period are unequal, the PM2.5 values should be averaged in a way that takes the duration of overlap into account. That is to say, we need a time-weighted average of values of x into intervals in y.

This is the exact purpose of the intervalaverage function:

(z <- intervalaverage(x,y,interval_vars=c("start","end"),
                value_vars=c("pm25")))
#>    start end      pm25 yduration xduration nobs_pm25 xminstart xmaxend
#> 1:     3   9 10.571429         7         7         7         3       9
#> 2:    10  16 10.857143         7         7         7        10      16
#> 3:    17  23  9.714286         7         7         7        17      23
#> 4:    24  30 16.285714         7         7         7        24      30
#> 5:    31  37 20.857143         7         7         7        31      37
#> 6:    38  44        NA         7         5         5        38      42
#>    maxgap_pm25
#> 1:           0
#> 2:           0
#> 3:           0
#> 4:           0
#> 5:           0
#> 6:           2

(Note that the interval average package always assumes specified intervals are closed aka inclusive)

The pm25 value of 10.571429 is equal to (5/7)*10 + (2/7)*12, where 5/7 and 2/7 are the respective time-weights of the the PM2.5 values based on their respective durations of overlap with the period [3,9].

identical(z[1,pm25], (5/7)*10 + (2/7)*12)
#> [1] TRUE

Note that in z the average value of PM2.5 is missing (NA) for the final period in y. This is because there is insufficient PM25 to calculate an average. By using a looser completeness requirements we can get an average for this period in y (in general, this will just be the mean of non-missing values in x):

(z <- intervalaverage(x,y,interval_vars=c("start","end"),
                value_vars=c("pm25"),required_percentage = 70))
#>    start end      pm25 yduration xduration nobs_pm25 xminstart xmaxend
#> 1:     3   9 10.571429         7         7         7         3       9
#> 2:    10  16 10.857143         7         7         7        10      16
#> 3:    17  23  9.714286         7         7         7        17      23
#> 4:    24  30 16.285714         7         7         7        24      30
#> 5:    31  37 20.857143         7         7         7        31      37
#> 6:    38  44 18.000000         7         5         5        38      42
#>    maxgap_pm25
#> 1:           0
#> 2:           0
#> 3:           0
#> 4:           0
#> 5:           0
#> 6:           2

There is much more described in the vignette. For example, it is possible to calculate averages within a grouping variable. For example if PM2.5 were measured at multiple locations, you could take the average separately at each location in one call to the intervalaverage function. Averaging intervals need not be the same for each locations. See the vignette for examples.

Vignette

Read the intro vignette for a demonstration of the core package functions.

library(intervalaverage)
vignette("intervalaverage-intro")

More detail is on using the package is available in the advanced vignette.

vignette("intervalaverage-advanced")

For an overview of the internals of the intervalaverage function, see the technical vignette:

vignette("intervalaverage-technicaloverview")

intervalaverage's People

Contributors

lpiep avatar myoung3 avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

lpiep

intervalaverage's Issues

Unit tests and better comments for c++ code

Implementing unit tests directly for c++ to clarify specifications. Add more comments in the cpp code particularly around the (collated) order of the names and the nested loops.

data.table loaded from disk using load() causes unexpected error message when immediately used in intervalaverage function

This was fixed in is.overlapping by replacing the walrus operator (:=) with a set function in savestate (the set function returns an intelligible error message whereas the walrus operator when applied to a data.table loaded from disk may silently fail to add the column).

there are more walrus operators in the intervalaverage function that either need to be replaced with calls to the set function or else you need to ensure that memory is allocated for the data.table before proceeding with the intervalaverage function.

It would be great to write a unit test for this but it's slightly involved since you need to write a file to disk and include it with the package.

testing fails on windows i386

Likely traced to a bug in data.table::foverlaps reported here
Rdatatable/data.table#4580

I could switch all the code to equivalent non-equijoins but previous timings I've done suggest that foverlaps is faster than an equivalent non-equijoin. Plus, the non-equijoin doesn't retain both sets of the interval join columns so extra work would need to be done around column naming to reproduce the exact functionality of foverlaps. It's probably better to leave the code as is and wait for a fix in data.table since all intervalaverage checks are passing on windows 64 bit and linux.

intervalaverage::intervalaverage optimization

It may be possible to substantially reduce the intermediate large allocation that occurs in intervalaverage::intervalaverage but this depends on some unimplemented features in data.table.

If #11 gets implemented AND #13 gets implemented AND .EACHI becomes possible for foverlaps (but grouping would need to be within groups of x=y in the intervalaverage::intervalaverage foverlaps(x=y,y=x) call), weighted-averages within groups could happen by group returning the result rather than storing the whole intermediate table (this is what EACHI does for normal [.data.table calls). GForce would probably not work for this (I'm pretty sure EACHI and GForce don't work together), so timing would need to be done to see if there's a tradeoff between memory usage and timing because not using Gforce optimization may be slower.

implement benchmarks??

I'm not aware of any packages which have implemented this specific functionality of averaging from values recorded over specific intervals to arbitrary intervals. If I find something similar it's worth benchmarking for speed and ideally memory usage too (although this is harder).

diagrams for vignette

It would be useful to create a diagram in the vignette demonstrating how the averaging works for visual learners.

Request: Clarify error message when NAs present in intervalaverage args

NAs in interval_vars leads to this error: columns corresponding to interval_vars cannot be missing in x (or y)

What would you like to see happen?

This should be clarified to say NAs are not allowed in these columns. It is otherwise confusingly similar to the error where the specified interval_vars columns just don't exist in the input.

calculate maximum gaps in measurement for a given y-interval

Internal kaufman-lab processes use this as a criteria to determine if an average meets "completeness" requirements.

maximum gap is defined as the maximum length in a y interval of any concurrent run of time-points for which a specific x value is missing. Missingness could be either structural (ie part of a y interval isn't in x at all) or explicit (a row in x/specific x interval has NA for a value)..

intervalaverage doesn't check whether intervals in y are negative length

add a similar check to the existing one for x

 #stop if interval starts are before interval ends
  if(x[, sum(.SD[[2]]-.SD[[1]] <0)!=0,.SDcols=interval_vars]){
    stop("there exist values in x[[interval_vars[1] ]] that are
         less than corresponding values in  x[[interval_vars[2] ]].
         interval_vars must specify columns corresponding to increasing intervals")
  }

also add unit tests for this (it should return an error)

GForce weighted.average

If GForce weighted average gets implemented, the code for intervalaverage::intervalaverage could be greatly simplified.

document comparison with python pandas resample

perhaps in a vignette?
https://towardsdatascience.com/using-the-pandas-resample-function-a231144194c4

There are similarities between resample and intervalaverage,

  • resample is quite a bit less flexible in that it only resamples to regular, pre-specified periods. intervalaverage resamples to arbitrary periods and they don't need to be regular (or even non-overlapping).
  • resample doesn't seem to resample to different periods for different subsets. This flexibility is useful, for example, if you want to calculate a different time-average for different individuals. (for example, taking an average of air pollution in the 365 days prior to each individual's death)
  • On the other hand intervalaverage requires specification of the desired periods, so it lacks the convenience of resample for those specific cases.
  • Additionally, resample allows left, right, or completely open intervals. intervalaverage assumes completely closed intervals. This difference makes sense because intervalaverage was designed specifically for data measured over intervals, possibly irregular intervals. Resample is stored via a single column. Intervalaverage data requires start and end times to be stored as separate columns, so the user can add/subtract 1 to the start and ends to specify intervals which are both inclusive and correspond to the desired interval.

drop is.overlappingv from dev

testing shows it's slow relative when repeated over groups relative to doing this manually using foverlaps, probably because a data.table gets created for each group and that has overhead. If .SD kept keys then you could accomplish the task via

setkey(x, g, start,end)
x[, is.overlapping(.SD),by="g"]

but right now the key of .SD is NULL rather than start, end so currently this doesn't work (returns error message saying you can't reorder .SD). I think there's an issue in data.table to keep the key of .SD in this situation which would allow that code to work.

investigate performance regression

intervalaverage v 0.8 and data.table 1.12.8 is faster than v 0.9 and 1.13.x

when running a large intervalaverage (mesa events analysis) . this might be due to the addition of maxgaps, or possibly data.table performance regression.

create a publicly shareable reprex and identify what's causing it.

value_var specific xminstart and xmaxend

right now xminstart and xmaxend are just based on the minimum and maximum overlapping interval, whether or not the value_var for that row in x is missing.

value_var specific xminstart and xmaxend could check the earliest (and latest) date for which that value is nonmissing

Implement is.overlapping in c++

Right now this does a self-join and counts number of rows which is totally unnecessary. If there are many overlapping rows the memory usage could blow up rather than just returning TRUE on the first detection of an overlap.

Implement directly in c++ using a start/stop in/out approach to find overlaps and break at the first sign if the overlap will be faster than foverlaps because foverlaps completes the join even when matches have been detected.

Error when ids in x don't match ids in y

I had a dataset where there were some ID's in my exposure window data that didn't appear in the prediction data.

Error in setstate(x, statex):
all required_columns %in% names(x) is not TRUE
Error in setstate(y, statey):
all required_columns %in% names(x) is not TRUE

then when I checked head(y) I saw that the names of y had been changed to "intervalaverage_end_original_class_copy" and "intervalaverage_start_original_class_copy"

This needs a more informative error and ideally it wouldn't alter the data object that's being passed in.

intervalaverage is altering print of y in some cases

This is second-hand from Altan. After calling intervalaverage, printing is invisible the first time y is printed (possibly x too). Standard data.table stuff where if you make an assignment by reference, the next time that table prints it will do so invisibly.

I just need to do add x[] and y[] in the on.exit call to avoid this.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.