kaufman-lab / intervalaverage Goto Github PK

time-weighted averaging function for values measured over intervals. within groups. arbitrary number of value variables simultaneously. fast and memory efficient. values measured over intervals can be "downsampled" to shorter periods or averaged up to longer periods. measurement intervals don't need to fit cleanly into averaging periods.

Home Page: https://kaufman-lab.github.io/intervalaverage/

License: GNU General Public License v3.0

R 91.14% C++ 8.86%

intervalaverage's Introduction

intervalaverage

Perform fast and memory efficient time-weighted averaging of values measured over intervals into new arbitrary intervals.

This package is useful in the context of data measured or represented as constant values over intervals on a one-dimensional discrete axis (e.g. time-integrated averages of a curve over defined periods).

The intervalaverage package was written specifically to deal with air pollution data recorded or predicted as averages over sampling periods. Data in this format often needs to be shifted to non-aligned periods or averaged up to longer durations (e.g. averaging data measured over sequential non-overlapping periods to calendar years).

This package makes careful use of optimized data.table syntax to reduce copying so it is well suited to deal with large datasets. The function inputs and return values are always data.tables (with values, interval starts and interval ends stored as columns) which provide a straightforward representation of the data facilitating direct inspection and manipulation via data.table’s standard interface. Functions also accept optional grouping variables as arguments; these group-by operations get passed to data.table which avoids the overhead of looping directly in R.

Installation

You can install the package from CRAN:

install.packages("intervalaverage")

Or you can install the current development version of the from GitHub:

# install.packages("devtools")
library(devtools)
install_github("kaufman-lab/intervalaverage", "main", build_vignettes=TRUE)

or you can manually clone this repository from github and install from source.

Example

library(intervalaverage)
#> Loading required package: data.table

Consider some PM2.5 data measured over periods occurring over a repeating 7-day schedule:

(x <- data.table(start=seq(1L,by=7L,length=6),
                 end=seq(7L,by=7L,length=6),
                 pm25=c(10,12,8,14,22,18)))
#>    start end pm25
#> 1:     1   7   10
#> 2:     8  14   12
#> 3:    15  21    8
#> 4:    22  28   14
#> 5:    29  35   22
#> 6:    36  42   18

Note that the start and end columns define closed intervals (i.e. inclusive).

Imagine that we would like these PM2.5 measurements to be represented over a different set of intervals, such as the following 7-day schedule which does not align cleanly with the intervals in x:

(y <- data.table(start=seq(3L,by=7L,length=6),
                 end=seq(9L,by=7L,length=6)))
#>    start end
#> 1:     3   9
#> 2:    10  16
#> 3:    17  23
#> 4:    24  30
#> 5:    31  37
#> 6:    38  44

Representing the PM25 measurements over the exact intervals in y is not strictly possible, since each y interval contains two x intervals which only partially overlap.

While there are several ways to compromise and approximate a representation of x over the periods in y, one reasonable approach is to average all the values of PM2.5 in x occuring over the interval of y. But since the two overlap durations of the two x periods into a single y period are unequal, the PM2.5 values should be averaged in a way that takes the duration of overlap into account. That is to say, we need a time-weighted average of values of x into intervals in y.

This is the exact purpose of the intervalaverage function:

(z <- intervalaverage(x,y,interval_vars=c("start","end"),
                value_vars=c("pm25")))
#>    start end      pm25 yduration xduration nobs_pm25 xminstart xmaxend
#> 1:     3   9 10.571429         7         7         7         3       9
#> 2:    10  16 10.857143         7         7         7        10      16
#> 3:    17  23  9.714286         7         7         7        17      23
#> 4:    24  30 16.285714         7         7         7        24      30
#> 5:    31  37 20.857143         7         7         7        31      37
#> 6:    38  44        NA         7         5         5        38      42
#>    maxgap_pm25
#> 1:           0
#> 2:           0
#> 3:           0
#> 4:           0
#> 5:           0
#> 6:           2

(Note that the interval average package always assumes specified intervals are closed aka inclusive)

The pm25 value of 10.571429 is equal to (5/7)*10 + (2/7)*12, where 5/7 and 2/7 are the respective time-weights of the the PM2.5 values based on their respective durations of overlap with the period [3,9].

identical(z[1,pm25], (5/7)*10 + (2/7)*12)
#> [1] TRUE

Note that in z the average value of PM2.5 is missing (NA) for the final period in y. This is because there is insufficient PM25 to calculate an average. By using a looser completeness requirements we can get an average for this period in y (in general, this will just be the mean of non-missing values in x):

(z <- intervalaverage(x,y,interval_vars=c("start","end"),
                value_vars=c("pm25"),required_percentage = 70))
#>    start end      pm25 yduration xduration nobs_pm25 xminstart xmaxend
#> 1:     3   9 10.571429         7         7         7         3       9
#> 2:    10  16 10.857143         7         7         7        10      16
#> 3:    17  23  9.714286         7         7         7        17      23
#> 4:    24  30 16.285714         7         7         7        24      30
#> 5:    31  37 20.857143         7         7         7        31      37
#> 6:    38  44 18.000000         7         5         5        38      42
#>    maxgap_pm25
#> 1:           0
#> 2:           0
#> 3:           0
#> 4:           0
#> 5:           0
#> 6:           2

There is much more described in the vignette. For example, it is possible to calculate averages within a grouping variable. For example if PM2.5 were measured at multiple locations, you could take the average separately at each location in one call to the intervalaverage function. Averaging intervals need not be the same for each locations. See the vignette for examples.

Vignette

Read the intro vignette for a demonstration of the core package functions.

library(intervalaverage)
vignette("intervalaverage-intro")

More detail is on using the package is available in the advanced vignette.

vignette("intervalaverage-advanced")

For an overview of the internals of the intervalaverage function, see the technical vignette:

vignette("intervalaverage-technicaloverview")

intervalaverage's People

Contributors

Stargazers

Watchers

Forkers

lpiep

intervalaverage's Issues

change data.table from depends to imports

Unit tests and better comments for c++ code

Implementing unit tests directly for c++ to clarify specifications. Add more comments in the cpp code particularly around the (collated) order of the names and the nested loops.

data.table loaded from disk using load() causes unexpected error message when immediately used in intervalaverage function

This was fixed in is.overlapping by replacing the walrus operator (:=) with a set function in savestate (the set function returns an intelligible error message whereas the walrus operator when applied to a data.table loaded from disk may silently fail to add the column).

there are more walrus operators in the intervalaverage function that either need to be replaced with calls to the set function or else you need to ensure that memory is allocated for the data.table before proceeding with the intervalaverage function.

It would be great to write a unit test for this but it's slightly involved since you need to write a file to disk and include it with the package.

testing fails on windows i386

Likely traced to a bug in data.table::foverlaps reported here
Rdatatable/data.table#4580

I could switch all the code to equivalent non-equijoins but previous timings I've done suggest that foverlaps is faster than an equivalent non-equijoin. Plus, the non-equijoin doesn't retain both sets of the interval join columns so extra work would need to be done around column naming to reproduce the exact functionality of foverlaps. It's probably better to leave the code as is and wait for a fix in data.table since all intervalaverage checks are passing on windows 64 bit and linux.

intervalaverage::intervalaverage optimization

It may be possible to substantially reduce the intermediate large allocation that occurs in intervalaverage::intervalaverage but this depends on some unimplemented features in data.table.

If #11 gets implemented AND #13 gets implemented AND .EACHI becomes possible for foverlaps (but grouping would need to be within groups of x=y in the intervalaverage::intervalaverage foverlaps(x=y,y=x) call), weighted-averages within groups could happen by group returning the result rather than storing the whole intermediate table (this is what EACHI does for normal [.data.table calls). GForce would probably not work for this (I'm pretty sure EACHI and GForce don't work together), so timing would need to be done to see if there's a tradeoff between memory usage and timing because not using Gforce optimization may be slower.

replace eval(parse(text=))) macros with data.table programming API

Rdatatable/data.table#4304

add website to description (github page)

implement benchmarks??

I'm not aware of any packages which have implemented this specific functionality of averaging from values recorded over specific intervals to arbitrary intervals. If I find something similar it's worth benchmarking for speed and ideally memory usage too (although this is harder).

diagrams for vignette

It would be useful to create a diagram in the vignette demonstrating how the averaging works for visual learners.

Request: Clarify error message when NAs present in intervalaverage args

NAs in interval_vars leads to this error: columns corresponding to interval_vars cannot be missing in x (or y)

What would you like to see happen?

This should be clarified to say NAs are not allowed in these columns. It is otherwise confusingly similar to the error where the specified interval_vars columns just don't exist in the input.

multiple classes (eg IDate is integer and IDate) results in warning for is.overlapping

in is.overlapping

  if(class(start)!=class(end)){     stop("start and end must be of the same class")   }

will result in this warning

In if (class(start) != class(end)) { ... :
  the condition has length > 1 and only the first element will be used

if object has multiple classes

intervalaverage extra setkey as hotfix

This bug Rdatatable/data.table#4603 necessitates setting the key of the output (q) to NULL then setting it again to the group_vars, interval_vars. Once that bug is fixed, you can increase data.table dependency to the version that fixes it and eliminate the extra setkey

tests for IDate is.overlapping

overlap length in foverlaps result

May eventually be calculated as part of foverlaps, eliminating the need for pmin/pmax calls:
Rdatatable/data.table#1975

allow different names for interval_vars (and group_vars) in x and y for intervalaverage

also allow different names for group_vars in intervalintersect

allow bit64::integer start and end

http://www.stats.bris.ac.uk/R/web/packages/bit64/index.html

allowing start and end values to be stored as bit64::integer would be useful in the context of representing many years of data as seconds. integer is insufficient for this purpose because it can't represent a range of more than ~40ish (?) years.

Simplify state functions

state function are used right now to restore arguments back to their original state (row order, remove additional columns) prior to function exit.

Rdatatable/data.table#4575 suggests row reordering can be greatly simplified implemented once Rdatatable/data.table#4386 is merged

calculate maximum gaps in measurement for a given y-interval

Internal kaufman-lab processes use this as a criteria to determine if an average meets "completeness" requirements.

maximum gap is defined as the maximum length in a y interval of any concurrent run of time-points for which a specific x value is missing. Missingness could be either structural (ie part of a y interval isn't in x at all) or explicit (a row in x/specific x interval has NA for a value)..

intervalaverage doesn't check whether intervals in y are negative length

add a similar check to the existing one for x

 #stop if interval starts are before interval ends
  if(x[, sum(.SD[[2]]-.SD[[1]] <0)!=0,.SDcols=interval_vars]){
    stop("there exist values in x[[interval_vars[1] ]] that are
         less than corresponding values in  x[[interval_vars[2] ]].
         interval_vars must specify columns corresponding to increasing intervals")
  }

also add unit tests for this (it should return an error)

GForce weighted.average

If GForce weighted average gets implemented, the code for intervalaverage::intervalaverage could be greatly simplified.

packagedown site

document comparison with python pandas resample

perhaps in a vignette?
https://towardsdatascience.com/using-the-pandas-resample-function-a231144194c4

There are similarities between resample and intervalaverage,

resample is quite a bit less flexible in that it only resamples to regular, pre-specified periods. intervalaverage resamples to arbitrary periods and they don't need to be regular (or even non-overlapping).
resample doesn't seem to resample to different periods for different subsets. This flexibility is useful, for example, if you want to calculate a different time-average for different individuals. (for example, taking an average of air pollution in the 365 days prior to each individual's death)
On the other hand intervalaverage requires specification of the desired periods, so it lacks the convenience of resample for those specific cases.
Additionally, resample allows left, right, or completely open intervals. intervalaverage assumes completely closed intervals. This difference makes sense because intervalaverage was designed specifically for data measured over intervals, possibly irregular intervals. Resample is stored via a single column. Intervalaverage data requires start and end times to be stored as separate columns, so the user can add/subtract 1 to the start and ends to specify intervals which are both inclusive and correspond to the desired interval.

drop is.overlappingv from dev

testing shows it's slow relative when repeated over groups relative to doing this manually using foverlaps, probably because a data.table gets created for each group and that has overhead. If .SD kept keys then you could accomplish the task via

setkey(x, g, start,end)
x[, is.overlapping(.SD),by="g"]

but right now the key of .SD is NULL rather than start, end so currently this doesn't work (returns error message saying you can't reorder .SD). I think there's an issue in data.table to keep the key of .SD in this situation which would allow that code to work.

examples should have set.seed

Write more tests for intervalintersect

continuous integration on other platforms and other versions of R

https://devguide.ropensci.org/ci.html

investigate performance regression

intervalaverage v 0.8 and data.table 1.12.8 is faster than v 0.9 and 1.13.x

when running a large intervalaverage (mesa events analysis) . this might be due to the addition of maxgaps, or possibly data.table performance regression.

create a publicly shareable reprex and identify what's causing it.

Implement isolateoverlaps directly in C++

for speed

lazydata note on cran

https://stackoverflow.com/questions/66860659/lazydata-is-specified-without-a-data-directory-error-when-submitting-r-packa

https://cran.r-project.org/web/checks/check_results_intervalaverage.html

value_var specific xminstart and xmaxend

right now xminstart and xmaxend are just based on the minimum and maximum overlapping interval, whether or not the value_var for that row in x is missing.

value_var specific xminstart and xmaxend could check the earliest (and latest) date for which that value is nonmissing

Implement is.overlapping in c++

Right now this does a self-join and counts number of rows which is totally unnecessary. If there are many overlapping rows the memory usage could blow up rather than just returning TRUE on the first detection of an overlap.

Implement directly in c++ using a start/stop in/out approach to find overlaps and break at the first sign if the overlap will be faster than foverlaps because foverlaps completes the join even when matches have been detected.

is it possible to include a git hook such merges to main that haven't incremented version number in description and news are rejected?

Error when ids in x don't match ids in y

I had a dataset where there were some ID's in my exposure window data that didn't appear in the prediction data.

Error in setstate(x, statex):
all required_columns %in% names(x) is not TRUE
Error in setstate(y, statey):
all required_columns %in% names(x) is not TRUE

then when I checked head(y) I saw that the names of y had been changed to "intervalaverage_end_original_class_copy" and "intervalaverage_start_original_class_copy"

This needs a more informative error and ideally it wouldn't alter the data object that's being passed in.

get rid of "(pending)" in readme (wrt to CRAN installation)

intervalaverage is altering print of y in some cases

This is second-hand from Altan. After calling intervalaverage, printing is invisible the first time y is printed (possibly x too). Standard data.table stuff where if you make an assignment by reference, the next time that table prints it will do so invisibly.

I just need to do add x[] and y[] in the on.exit call to avoid this.

maxgap needs news item...

or needs to be deleted if it slows things down too much

Better precision when performing sums in intervalaverage

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

also switch tests from expect_equal to expect_identical

and write tests that add small and large numbers