eddelbuettel / digest


R package to create compact hash digests of R objects

Home Page: https://eddelbuettel.github.io/digest

R 13.22% Shell 0.02% C++ 48.09% C 38.50% CSS 0.18%

r cran r-package hash-digest

digest's Introduction

digest: Compact hash representations of arbitrary R objects

Overview

The digest package provides a principal function digest() for the creation of hash digests of arbitrary R objects (using the md5, sha-1, sha-256, crc32, xxhash, murmurhash, spookyhash, blake3, crc32c, xxh3_64, and xxh3_128 algorithms) permitting easy comparison of R language objects.

Extensive documentation is available at the package documentation site.

Examples

As R can serialize any object, we can run digest() on any object:

R> library(digest)
R> digest(trees)
[1] "12412cbfa6629c5c80029209b2717f08"
R> digest(lm(log(Height) ~ log(Girth), data=trees))
[1] "e25b62de327d079b3ccb98f3e96987b1"
R> digest(summary(lm(log(Height) ~ log(Girth), data=trees)))
[1] "86c8c979ee41a09006949e2ad95feb41"
R>

By using the hash sum, which is very likely to be unique, to identify an underlying object or calculation, one can easily implement caching strategies. This is a common use of the digest package.
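A minimal sketch of such a caching strategy follows. The helper `cached_call()` and the environment `cache` are hypothetical names introduced for illustration and are not part of digest; the sketch assumes the cached function is deterministic and keys only on the input object.

```r
library(digest)

# Hypothetical in-memory cache: an environment keyed by hash strings
cache <- new.env(parent = emptyenv())

# Hypothetical helper: compute f(x) once per distinct x, keyed by digest(x)
cached_call <- function(f, x) {
    key <- digest(x)                       # md5 of the serialized object by default
    if (!exists(key, envir = cache, inherits = FALSE)) {
        assign(key, f(x), envir = cache)   # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)
}

r1 <- cached_call(colMeans, trees)   # computed
r2 <- cached_call(colMeans, trees)   # looked up via the same key
```

Because the key here depends only on `x`, a real cache would probably also include an identifier for `f` in the hashed object.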

Other Functions

A small number of additional functions is available:

sha1() for numerically stable hash sums,
hmac() for hashed message authentication codes based on a key,
AES() for Advanced Encryption Standard block ciphers,
getVDigest() as a function generator for vectorised versions.
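
As a rough illustration of two of these functions, the sketch below contrasts digest() and sha1() on two doubles that differ only by floating-point noise, and calls hmac() with placeholder key and message values (both placeholders are assumptions made for this example).

```r
library(digest)

a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: the two doubles differ in their last bits
digest(a) == digest(b)    # FALSE: digest() hashes the exact serialized bytes
sha1(a) == sha1(b)        # sha1() reduces precision first (see its digits and
                          # zapsmall arguments), so these are expected to agree

# hmac() takes a key and a message; "secret" and "message" are placeholders
hmac(key = "secret", object = "message", algo = "sha256")
```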

Note

Please note that this package is not meant to be deployed for cryptographic purposes. More comprehensive and widely tested libraries such as OpenSSL should be used instead.

Installation

The package is on CRAN and can be installed via a standard

install.packages("digest")

Continued Testing

As we rely on the tinytest package, the already-installed package can also be verified via

tinytest::test_package("digest")

at any later point.

Author

Dirk Eddelbuettel, with contributions by Antoine Lucas, Jarek Tuszynski, Henrik Bengtsson, Simon Urbanek, Mario Frasca, Bryan Lewis, Murray Stokely, Hannes Muehleisen, Duncan Murdoch, Jim Hester, Wush Wu, Qiang Kou, Thierry Onkelinx, Michel Lang, Viliam Simko, Kurt Hornik, Radford Neal, Kendon Bell, Matthew de Queljoe, Ion Suruceanu, Bill Denney, Dirk Schumacher, Winston Chang, Dean Attali, and Michael Chirico.

License

GPL (>= 2)

digest's Issues

Please provide release tags

It would be nice if e.g. 9c64bfb was tagged as 0.6.12.

Different hash keys are returned for the same file and algorithm on different platforms.

On Ubuntu 17.04 with digest 0.6.12 and 64-bit R-3.4.0:

writeLines("12345", con = "test.txt")
library(digest)
digest("test.txt", algo = "md5", file = TRUE)

# [1] "d577273ff885c3f84dadb8578bb41399"

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /home/landau/packages/R/R-3.4.0/lib/R/lib/libRblas.so
LAPACK: /home/landau/packages/R/R-3.4.0/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] digest_0.6.12

loaded via a namespace (and not attached):
[1] compiler_3.4.0

On Windows 7, again with digest 0.6.12 and 64-bit R-3.4.0:

library(digest)
writeLines("12345", con = "test.txt")
digest("test.txt", algo = "md5", file = TRUE)

## [1] "e6481c46e064c35e8f6e371d72912507"

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
 
Matrix products: default

locale:

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_United States.1252   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] digest_0.6.12

loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0    yaml_2.1.14

This is not necessarily a problem with digest, but I do want to understand why it happens. I am also looking for a way to leverage digest to obtain reasonably fast file hashes that are consistent across platforms.
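
A likely explanation (an assumption, since the report does not confirm it) is that writeLines() on a text-mode connection writes "\r\n" line endings on Windows and "\n" elsewhere, so the two files differ by one byte. The sketch below writes the file through a binary-mode connection so that the bytes, and therefore the hash, should be the same on every platform.

```r
library(digest)

path <- tempfile(fileext = ".txt")

# Binary mode: "\n" is written as-is, with no platform-specific translation
con <- file(path, open = "wb")
writeLines("12345", con = con, sep = "\n")
close(con)

file.size(path)                          # 6 bytes: "12345" plus "\n"
h <- digest(path, algo = "md5", file = TRUE)
h
#> [1] "d577273ff885c3f84dadb8578bb41399"
```

The value shown is the one reported above for Ubuntu, where the file already contained a bare "\n".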

Updating Error

Dear Sir,
I am trying to update the package and I get the following error. Please advise

install.packages("digest")
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.3/digest_0.6.12.zip'
Content type 'text/html; charset="utf-8"' length 1053 bytes
downloaded 1053 bytes

Warning in install.packages :
error 1 in extracting from zip file
Warning in install.packages :
cannot open compressed file 'digest/DESCRIPTION', probable reason 'No such file or directory'
Error in install.packages : cannot open the connection

vector to digest function

Is there a more efficient way to turn a character vector into a vector of its hashes than the one presented below?

col <- c("asd1","asd2","asd3","asd2","asd1")
masking <- function(col) vapply(col, function(object) digest(object, algo="md5"), FUN.VALUE = "", USE.NAMES = FALSE)
masking(col)
col

Ideally this would work for algorithms other than md5 as well.
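
One possible approach, assuming a version of digest that provides the getVDigest() function generator mentioned in the README, is sketched below. Note that `serialize = FALSE` hashes the bytes of each string directly, so the values differ from those of the masking() function above, which serializes each element first.

```r
library(digest)

col <- c("asd1", "asd2", "asd3", "asd2", "asd1")

# getVDigest() returns a vectorised hashing function for the chosen algorithm
md5_vec    <- getVDigest(algo = "md5")
sha256_vec <- getVDigest(algo = "sha256")

hashes_md5    <- md5_vec(col, serialize = FALSE)      # one hash per element
hashes_sha256 <- sha256_vec(col, serialize = FALSE)
hashes_md5
```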

Consider MetroHash

Nice blog post about it
Github repo

R CMD check on Travis = WARNING

* checking package vignettes in ‘inst/doc’ ... WARNING
Package vignette without corresponding PDF/HTML:
   ‘sha1.Rmd’
* checking running R code from vignettes ...
   ‘sha1.Rmd’ using ‘UTF-8’ ... OK
 NONE
* checking re-building of vignette outputs ... SKIPPED
* DONE
Status: 1 WARNING

Note: This link might help: https://www.r-bloggers.com/continuous-integration-for-r-packages/

Install failure on R < 3.0.0

My attempt to install "digest" on R 2.15.3 failed with the following error message:

** testing if installed package can be loaded
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '<FOO>/lib/R/library/digest/libs/digest.so':
  <FOO>/lib/R/library/digest/libs/digest.so: undefined symbol: XLENGTH
Error: loading failed
Execution halted
ERROR: loading failed

where <FOO> denotes the --prefix path where R was installed.

Installation succeeds on R 3.0.0. I suspect that the Depends field in DESCRIPTION is outdated.

Installation of 0.6.15 is not working from downloaded copy

I downloaded digest_0.6.15.tar.gz, unzipped it in the Downloads directory, and tried to install it from there. Also a failure.

install.packages("~/Users/vicki/Downloads/digest", repos=NULL, type="source", dependencies = c("Depends", "Suggests", "Imports"))
Warning: invalid package ‘/Users/vicki/Users/vicki/Downloads/digest’
Error: ERROR: no packages specified
Warning in install.packages :
installation of package ‘/Users/vicki/Users/vicki/Downloads/digest’ had non-zero exit status

The session information is as follows:

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] sessioninfo_1.0.0

loaded via a namespace (and not attached):
[1] compiler_3.4.2 clisymbols_1.2.0 tools_3.4.2 withr_2.1.1
[5] yaml_2.1.16

digest_0.6.15.tar.gz is advertised in mirrors but isn’t available “Cannot open URL”

Sorry if this isn’t the right place to report this, not sure where else, but I tried several mirrors, all advertise that digest 0.6.15 for Mac OS X El-Capitan is available but give this error message:

cannot open URL 'https:// . . . /bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz'

Changing 15 to 14 lets me download and install it.

Thanks!

Consider sha256() and sha512()

Those could be equivalent to sha1().

Test currently conditioned on < R-3.5.0 actually depends on detailed R version

There's a test in sha1Test.R that is currently done only for < R-3.5.0, which checks the following:

    identical(
        sha1(serialize("e13485e1b995f3e36d43674dcbfedea08ce237bc", NULL)),
        "93ab6a61f1a2ad50d4bf58396dc38cd3821b2eaf"
    )

But actually, this test will fail for all but at most one version of R. This is because "serialize" includes the version of R being run in its output header, so sha1 of that output will be different for every version of R.
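
The header can be inspected directly. The sketch below assumes the default binary XDR format, in which the first 14 bytes form the header and bytes 7 to 10 encode the version of R that wrote the stream; hashing the string with `serialize = FALSE` avoids the header altogether.

```r
x <- "e13485e1b995f3e36d43674dcbfedea08ce237bc"

s <- serialize(x, NULL)
header <- s[1:14]
header                    # starts with 58 0a, i.e. "X\n" for the XDR format

# Assumed layout: bytes 7-10 hold the writing R version as 0, major, minor, patch
writer <- as.integer(header[7:10])
paste(writer[2], writer[3], writer[4], sep = ".")
R.version.string          # for comparison

# Version-independent alternative: hash the characters, not the serialization
h <- digest::digest(x, algo = "sha1", serialize = FALSE)
h
```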

Feature request: Optionally return NA instead of throwing an error

Rationale: Avoiding tryCatch for callers that are aware that digest() might fail. As discussed in r-lib/testthat#299.

Interface: Something along the lines of

digest(..., error = c("fail", "warn", "silent"))
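
Until such an argument exists, the behaviour can be approximated with a thin wrapper. The name `safe_digest()` and its `on_error` argument are hypothetical and not part of digest.

```r
# Hypothetical wrapper: return a fallback value instead of raising an error
safe_digest <- function(..., on_error = NA_character_) {
    tryCatch(digest::digest(...), error = function(e) on_error)
}

missing_file <- tempfile()                 # a path that does not exist
safe_digest(missing_file, file = TRUE)     # NA instead of an error
safe_digest("abc")                         # behaves like digest() otherwise
```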

Download failing for El Capitan

Installing digest is not working for RStudio. This prevents knitting R Markdown files. I am on a Mac running El Capitan (v 10.11.6), R version 3.4.3, RStudio 1.1.383. I think this is because CRAN has the El Capitan version digest_0.6.14.tgz whereas the download is looking for digest_0.6.15.tgz. The call & error code is below.

install.packages("digest")
Installing package into ‘/Users/biostudent/Library/R/3.4/library’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz'
Warning in install.packages :
cannot open URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz'
Warning in install.packages :
download of package ‘digest’ failed

Project website points to R-Forge

Not sure this is the right place to report, but sending an Email seems weird!

Anyway, the project web page http://dirk.eddelbuettel.com/code/digest.html points to the no longer used R-Forge project... "at the R-Forge host."

The project overview page correctly points to GitHub.

Function hashes change after being called

Not sure if I am doing something wrong, but the hash of a function appears to change just by calling it. Self-evident minimal example below:

> rm(list=ls())
> fn <- function(){}
> digest::digest(fn)
[1] "21b49d838ee8307a99ec2e0c88af450d"
> fn()
NULL
> digest::digest(fn)
[1] "2cf2a04b8630ff5abb4f874e06384bbe"

In case the enclosing environment is also a part of the hashed object, it should not have been altered here I believe.
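
One possible explanation (an assumption, not confirmed in the report) is that R's just-in-time byte-code compiler attaches compiled code to the closure once it has been called, which changes its serialized form. A sketch of a workaround is to hash the deparsed source of the function instead; `stable_fn_hash()` is a hypothetical helper, and it deliberately ignores the enclosing environment.

```r
# Hypothetical helper: hash the deparsed code rather than the closure object
stable_fn_hash <- function(f) {
    stopifnot(is.function(f))
    digest::digest(deparse(f))
}

fn <- function(){}
h1 <- stable_fn_hash(fn)
for (i in 1:3) fn()          # call it a few times
h2 <- stable_fn_hash(fn)
identical(h1, h2)
```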

Change default hash function from md5 to xxhash64 for speed & space efficiency

The digest package documentation clearly warns that the cryptographic hash functions provided in digest like MD5, SHA-256 or SHA-512, are not intended for security purposes:

Please note that this package is not meant to be deployed for cryptographic purposes for which more comprehensive (and widely tested) libraries such as 'OpenSSL' should be used.

Presumably they are provided for convenience and compatibility purposes. However, cryptographic hash functions are typically slower than regular hash functions and their hashes are also longer, leading to more CPU & RAM consumption. And as one would expect, some simple microbenchmarking suggests that the slowdown, particularly on large inputs, could be substantial when using SHA or MD5 compared to the fastest lightweight hash, xxhash64 (which I understand is optimized for the now-universal 64-bit platforms R is used on):

library(digest)
library(microbenchmark)
x <- runif(100000)
microbenchmark(times=1000, unit="us", digest(x, algo="md5"), digest(x, algo="sha1"), digest(x, algo="sha256"), digest(x, algo="sha512"), digest(x, algo="crc32"), digest(x, algo="murmur32"), digest(x, algo="xxhash32"), digest(x, algo="xxhash64"))
# Unit: microseconds
#                          expr      min        lq         mean     median         uq        max neval   cld
#       digest(x, algo = "md5") 2485.131 2786.4570  4295.136380  4764.8345  5072.0785  98563.923  1000   c
#      digest(x, algo = "sha1") 3567.551 3883.1065  5958.170993  6745.6465  7060.6915  99762.800  1000    d
#    digest(x, algo = "sha256") 6611.958 7162.4820 10632.041292 12650.1370 12971.2000  79189.573  1000     e
#    digest(x, algo = "sha512") 3572.797 3930.3560  6271.029691  6775.0555  7085.4540 101179.334  1000    d
#     digest(x, algo = "crc32") 1446.463 1661.5915  2656.618527  2728.0760  3029.7340  96977.578  1000  b
#  digest(x, algo = "murmur32") 1068.145 1317.6815  2040.655517  2064.6110  2372.3685  73234.312  1000 a
#  digest(x, algo = "xxhash32")  909.042 1115.3040  1824.721796  1716.8020  2022.8500  76509.794  1000 a
#  digest(x, algo = "xxhash64")  830.653 1066.3930  1627.243898  1614.8140  1916.8650   4083.643  1000 a
x2 <- runif(100)
microbenchmark(times=1000, unit="us", digest(x2, algo="md5"), digest(x2, algo="sha1"), digest(x2, algo="sha256"), digest(x2, algo="sha512"), digest(x2, algo="crc32"), digest(x2, algo="murmur32"), digest(x2, algo="xxhash32"), digest(x2, algo="xxhash64"))
# Unit: microseconds
#                           expr    min      lq      mean  median      uq      max neval cld
#       digest(x2, algo = "md5") 48.368 50.2300 55.479182 50.9320 51.6655 2322.038  1000  b
#      digest(x2, algo = "sha1") 51.628 53.4765 56.321293 54.1390 54.8600 1886.369  1000  bc
#    digest(x2, algo = "sha256") 60.406 62.4860 63.602296 63.1400 63.9570   87.782  1000   c
#    digest(x2, algo = "sha512") 47.967 49.8710 50.851533 50.5745 51.3650   67.580  1000 ab
#     digest(x2, algo = "crc32") 43.503 45.7115 50.358762 46.2845 47.0430 1878.604  1000 ab
#  digest(x2, algo = "murmur32") 41.907 44.3050 47.264461 44.9710 45.8590 1893.419  1000 a
#  digest(x2, algo = "xxhash32") 41.688 43.9555 47.075301 44.6250 45.3780 1895.936  1000 a
#  digest(x2, algo = "xxhash64") 41.629 43.8965 44.948122 44.5920 45.3360   62.534  1000 a
object.size(digest(x, algo="md5")); object.size(digest(x, algo="xxhash64"))
#136 bytes
#120 bytes

Right now, digest() defaults to 'md5'. Any users of digest who want better performance for free must rewrite all calls to change the algo parameter, which is not always possible (for example, the ht library doesn't let the user change it, and I suspect that is part of why it shows worse performance and space efficiency compared to environments even though it is a trivial wrapper around environments which uses digest() to turn arbitrary R objects into keys).

This would require a 2-line change in digest.r to change the order of arguments to start with xxhash64:

digest <- function(object, algo=c("md5", "sha1", "crc32", "sha256", "sha512",
                       "xxhash32", "xxhash64", "murmur32"),

a change to README.md to update the examples' outputs, an update to digest.Rd to change the hash algorithm documented as the default and tweak the examples, and I guess similarly in tests/digestTest.R/tests/digestTest.Rout.save (because they too assume MD5 as the default).

Hopefully, this would result in use of digest just automatically being 5%+ faster for users. A 64-bit hash should avoid any collisions in current users of digest. I'm not sure what implications there might be for people serializing data using digest; are there any use cases where people are using the md5 hashes of objects as permanent identifiers? (Are these hashes even invariant across R versions?) It wouldn't be hard for them to change their digest use to force MD5 use (or maybe just rethink why they want a big slow broken cryptographic hash like MD5 in the first place, which offers neither performance nor security...) So it might require a version bump & warning note.

Long vector support relied on, but not available

Processing a large R object (that otherwise behaves well with most manipulations in R) results in an error with digest. Specifically:

Error in digest(user_action, algo = "md5") : 
  long vectors not supported yet: memory.c:3361

Nearly identical errors (just the call is different) occur for other algorithms. Using debug I can see that the error is triggered at val <- .Call(digest_impl, object, as.integer(algoint), as.integer(length), as.integer(skip), as.integer(raw), as.integer(seed)); which I believe either places the error in .Call or in digest. Memory.c appears to be outside of the scope of digest but belongs to R instead. Nevertheless, because I'm able to work with and manipulate this object in every other context it makes me think that some step along the way in digest is the root of the problem.

Example data:

eg <- data.frame(V1=rep("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",100000000),V2=runif(100000000),V3=rep("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",100000000),V4=rep("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",100000000),V5=rep("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",100000000))
digest(eg)

Session Info:

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] digest_0.6.8

loaded via a namespace (and not attached):
[1] tools_3.2.2

Perhaps something else could be done in the digest internals to make it not rely on memory.c or an overly large object could be detected, truncated, and a hash returned with a warning?

sha1.list() should allow vectors for the digits and zapsmall arguments

For a list(), we could allow digits and zapsmall to be either a single number or a vector with the same length as the list. A vector lets the user specify a different value of digits and zapsmall for each element of the list. The current defaults remain.

Use case: a list with e.g. a data element and a model. For the data we can use a high value for digits (e.g. 14). For the model we will need a smaller value for digits (e.g. 7).
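
The idea can be approximated outside the package. The helper `hash_list()` below is hypothetical: it hashes each element with its own digits and zapsmall values and then hashes the vector of element hashes, so its result is not the same as what sha1() returns for the list itself.

```r
library(digest)

# Hypothetical helper: per-element digits/zapsmall, combined into one hash
hash_list <- function(x, digits, zapsmall = rep(7L, length(x))) {
    stopifnot(is.list(x),
              length(digits) == length(x),
              length(zapsmall) == length(x))
    parts <- mapply(function(el, d, z) sha1(el, digits = d, zapsmall = z),
                    x, digits, zapsmall, USE.NAMES = FALSE)
    sha1(parts)                      # hash of the element hashes
}

# Illustrative data: high precision for the data, lower for the estimates
obj <- list(data = c(1.5, 2.25, 3.125), coefs = c(0.1, 0.2))
res <- hash_list(obj, digits = c(14L, 7L))
res
```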

BUG: digest(..., file=TRUE, skip=n) for n > 0 fails

The skip parameter with file=TRUE is handled just as when file=FALSE resulting in the following bug:

> digest::digest("DESCRIPTION", file=TRUE)
[1] "1c282839cdd654d02ecda6ecbab30cbe"

> digest::digest("DESCRIPTION", file=TRUE, skip=1L)
Error in digest::digest("DESCRIPTION", file = TRUE, skip = 1L) :
  Cannot open input file: ESCRIPTION

Troubleshooting

I spotted this bug by visual inspection of the code. It's because of the following in src/digest.c, which is called unconditionally on file:

    if (skip>0) {
        if (skip>=nChar) {
            nChar=0;
        } else {
            nChar -= skip;
            txt += skip;
        }
    }

I'll try to find time to fix this, but I'll add this issue here so it won't be forgotten.
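
Until the C code is fixed, a workaround can be sketched in R by reading the file as bytes, dropping the first bytes, and hashing the remainder with `serialize = FALSE`. The helper `digest_file_skip()` is hypothetical, and it reads the whole file into memory, which the file interface of digest() avoids.

```r
# Hypothetical workaround: skip bytes of the file contents, not of the file name
digest_file_skip <- function(path, skip = 0L, algo = "md5") {
    bytes <- readBin(path, what = "raw", n = file.size(path))
    if (skip > 0L) {
        bytes <- bytes[-seq_len(min(skip, length(bytes)))]
    }
    digest::digest(bytes, algo = algo, serialize = FALSE)
}

p <- tempfile()
writeBin(charToRaw("DESCRIPTION contents"), p)
digest_file_skip(p)               # same as digest(p, file = TRUE)
digest_file_skip(p, skip = 1L)    # hash of "ESCRIPTION contents"
```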

Consider BLAKE2 or t1ha

See https://blake2.net/

Cannot open input file when filename contains umlaut and file = TRUE

Consider two files which have the same contents and differ only by filename:

> dir("data")
[1] "Gavleborg.txt" "Gävleborg.txt"

digest::digest works on the file without the "ä".

> digest(file.path("data", "Gavleborg.txt"), algo = "md5", file = TRUE)
[1] "85d6d41527468fd765392c6284661e62"

But throws an error when used on the file with the "ä", even when replacing ä with \u00E4. Note that the error message contains the correct character, but not the line after the colon.

> digest(file.path("data", "Gävleborg.txt"), file = TRUE)
Error in digest(file.path("data", "Gävleborg.txt"), file = TRUE) : 
  Cannot open input file: data/GÃ¤vleborg.txt

> digest(file.path("data", "G\u00E4vleborg.txt"), file = TRUE)
Error in digest(file.path("data", "Gävleborg.txt"), file = TRUE) : 
  Cannot open input file: data/GÃ¤vleborg.txt

However, it works when the "file"-argument is removed

> digest("data/Gävleborg.txt")
[1] "a8c6abd369f1b7e60c10fd35df55b123"

The function is passed with file = TRUE when I deploy a Shiny app to shinyapps.io, so I'm not sure whether this should be considered a bug within the shiny deployment context instead.

All primitive functions have the same digest::sha1() hash

Problem: All primitive functions have the same digest::sha1() hash.

Expectation: Different primitive functions should have different digest::sha1() hashes.

e.g.

> digest::sha1(`+`)
[1] "098c0ebd5df77c19684f3f8c9aefc3ca810a159b"
> digest::sha1(`*`)
[1] "098c0ebd5df77c19684f3f8c9aefc3ca810a159b"
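
As a stopgap, primitives can be distinguished by hashing their deparsed form, which includes the primitive's name. The helper `prim_hash()` is hypothetical and only illustrates the idea.

```r
# Hypothetical helper: deparse(`+`) gives '.Primitive("+")', which is hashed
prim_hash <- function(f) {
    stopifnot(is.primitive(f))
    digest::sha1(deparse(f))
}

prim_hash(`+`) == prim_hash(`*`)   # FALSE: the deparsed strings differ
```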

install is failing

I am trying to install version 0.6.15 of digest, but it fails.

I tried install.packages from the command line in the RStudio console and it failed:

install.packages("digest")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz'
Warning in install.packages :
cannot open URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz'
Warning in install.packages :
download of package ‘digest’ failed

I tried to install it from the unzipped tar.gz that I downloaded and it failed:

install.packages("~/Users/vicki/Downloads/digest", repos=NULL, type="source", dependencies = c("Depends", "Suggests", "Imports"))
Warning: invalid package ‘/Users/vicki/Users/vicki/Downloads/digest’
Error: ERROR: no packages specified
Warning in install.packages :
installation of package ‘/Users/vicki/Users/vicki/Downloads/digest’ had non-zero exit status

I am running R version 3.4.2 within RStudio version 1.1.419.

Thanks in advance for any insight you can offer.

vectorize digest

Can you please consider vectorizing digest?

https://stackoverflow.com/questions/42335787/why-isnt-mutate-working
https://stackoverflow.com/questions/28358549/why-does-the-digest-function-return-the-same-value-every-time-when-used-with-dpl
http://stackoverflow.com/questions/32465074/new-dataframe-column-as-function-digest-of-another-one-is-not-working-for-me

problem loading the package

Hi,

This might not be the right place to post this question, but I'm wondering if anyone can help me with an error when loading digest after installation. I used to be able to run install.packages("devtools"), which installs digest. After recently upgrading R (the session info below reports R 3.4.3), that no longer works, and I found that it is because digest cannot be loaded correctly. This also happens when I install digest directly, see below. I've read everything I can find online but have no clue why this is happening.

install.packages("digest", dependencies = TRUE)
trying URL 'https://mirror.aarnet.edu.au/pub/CRAN/bin/windows/contrib/3.4/digest_0.6.15.zip'
Content type 'application/zip' length 175238 bytes (171 KB)
downloaded 171 KB

package ‘digest’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\weizha\AppData\Local\Temp\RtmpwhNFrq\downloaded_packages

library(digest)
Error: package or namespace load failed for ‘digest’ in FUN(X[[i]], ...):
no such symbol digest in package E:/R-3.4.3/library/digest/libs/x64/digest.dll

here is the session info
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3

Thanks!

string -> int hashing

Hi there. I'm wondering whether it makes sense to add a simple hash function that maps an arbitrary string to a 32-bit integer. This seems to be a common need, and everyone reinvents the wheel (I'm opening this issue because I found myself reimplementing it for the third time in different packages).

One solution is to use digest with Rmpfr::mpfr, but that adds another package and a system dependency (libmpfr).

digest itself is widely available, so it would be nice to have a simpler built-in solution such as One-at-a-Time or something from this article.

I found that One-at-a-Time works well: few collisions and very fast. I can send a PR.
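
Without new code in the package, a string can already be mapped to a number by converting the hexadecimal murmur32 digest. This is a sketch, not the One-at-a-Time function proposed above: `string_to_int()` is a hypothetical helper, and it returns doubles reduced modulo `n_buckets` because the full 32-bit range does not fit into R's signed integer type.

```r
# Hypothetical helper: murmur32 hex digest -> number in [0, n_buckets)
string_to_int <- function(x, n_buckets = 2^20) {
    hex <- vapply(x, digest::digest, character(1),
                  algo = "murmur32", serialize = FALSE, USE.NAMES = FALSE)
    as.numeric(paste0("0x", hex)) %% n_buckets   # as.numeric() parses "0x..." strings
}

ids <- string_to_int(c("apple", "banana", "apple"))
ids
```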

nchar(digest::digest(51L, algo = "crc32")) == 7 - not 8

?digest says:

Value
The digest function returns a character string of a fixed length containing the requested digest of the supplied R object. This string is of length 32 for MD5; of length 40 for SHA-1; of length 8 for CRC32 a string; of length 8 for for xxhash32; of length 16 for xxhash64; and of length 8 for murmur32.

However, it looks like zero-padding is missing, e.g.

> x <- digest::digest(51L, algo = "crc32")
> x
[1] "82a699e"
> nchar(x)
[1] 7

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.0 digest_0.6.15
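
Callers that rely on the documented fixed width can pad the result themselves. `pad_hex()` is a hypothetical helper using only base R.

```r
# Hypothetical helper: left-pad a hex string with zeros to a fixed width
pad_hex <- function(h, width = 8L) {
    paste0(strrep("0", pmax(0L, width - nchar(h))), h)
}

pad_hex("82a699e")
#> [1] "082a699e"
```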

Consider SeaHash

Claims 5 to 20% faster than xxHash and MetroHash ... but is in Rust

https://github.com/ticki/tfs/tree/master/seahash

sha1() fails if x contains an empty list

Small example:

sha1(list())
sha1(list(a = 1, b = list()))

Header file names

We currently have pmurhash.h in both src/ and inst/include. That doesn't work, unless we really make sure the package itself never looks in inst/include.

In the past I used a suffix API to designate the file to be called by other packages.

Any thoughts, @wush978?

Possible relicensing of digest from GPL-2 to GPL (>= 2)

digest was started 13 years ago in 2003. At the time GPL-2 was the only game in town; the GPL-3 did AFAIK not arrive until 2007.

It has been pointed out to me that in order to use digest along with software licensed only under the GPL-3, it would be preferable to use either "GPL (>= 2)" or "GPL-2 | GPL-3" (which is what CRAN currently expands the former to).

I am ok with the idea of relicensing from "GPL-2" to "GPL (>= 2)" [1] but I am under the understanding that every copyright holder needs to agree. So based on the DESCRIPTION file, this means we need to hear from everybody below (in chronological order of contributions to digest):

I marked myself as being fine with the proposed change, and I plan to accordingly mark everybody who reports back (preferably below with a simple "I am ok with the license change to GPL (>=2)").

But it is my understanding that we need everybody to report back in order to be operational. So thanks in advance for giving this some thought, and following-up below.
Thanks to everybody for all the support over these thirteen years -- with all your help digest went much further than I ever imagined. Let's see if we can pull this one off too, so help in locating everybody listed above (and of course anybody I may have forgotten) is welcome!

[1] Also please forgive me for very plainly stating that I have zero interest in a license comparison discussion. The digest package has always been under copyleft licensing, and will remain copyleft. This is not the place or time to discuss MIT vs Apache vs BSD vs ...

digest() in base64

I'm porting a piece of code from nodejs to R

Node:

let kDate = crypto.createHmac('sha256', key);
kDate.update(private);
// kDate is a2a786d16d0e23d735ddc931ebe0cb0e4127c86d9c31a280df2c292caaf57385

R:

kDate <- hmac(key = key, object = private, algo = "sha256", raw = FALSE)
# kDate is a2a786d16d0e23d735ddc931ebe0cb0e4127c86d9c31a280df2c292caaf57385

Which is OK, so far so good. The value matches, but then I need to do this in node

kDate_val=kDate.digest('base64');

Is there an easy way to implement this? digest() doesn't support base64, and exporting in other formats and then converting to base64 doesn't work either.

Thanks in advance
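
One approach that may work is to request the raw bytes from hmac() with `raw = TRUE` and base64-encode those, rather than encoding the hexadecimal string. The sketch assumes the separate base64enc package is installed; the key and message values are placeholders.

```r
library(digest)

key     <- "key"        # placeholder values for illustration
private <- "message"

# raw = TRUE returns the 32 bytes of the SHA-256 HMAC instead of a hex string
mac_raw <- hmac(key = key, object = private, algo = "sha256", raw = TRUE)

# base64enc is an assumption here; any encoder that accepts raw vectors will do
kDate_val <- base64enc::base64encode(mac_raw)
kDate_val
```

Encoding the 64-character hex string instead would produce a different, longer base64 value, which may be why the earlier attempts did not match.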

Classes that sha1() should have methods for in digest

Date
complex
array

murmurhash as raw?

Is it possible to get the results from murmur32 as raw?

I mean, when I run:


hash <- digest::digest("foo", "md5", serialize = FALSE, raw = TRUE)
hash
#>  [1] ac bd 18 db 4c c2 f8 5c ed ef 65 4f cc c4 a4 d8

This returns a raw vector. But when I change the algorithm to murmur32, I get:

hash <- digest::digest("foo", "murmur32", serialize = FALSE, raw = TRUE)
class(hash)
#> [1] "character"

I don't understand anything about hashes, and maybe this question doesn't make sense.
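
A hexadecimal digest string can be converted to a raw vector with base R alone. `hex_to_raw()` is a hypothetical helper; it expects an even number of hex digits, so a digest missing a leading zero would need padding first.

```r
# Hypothetical helper: convert a hex string such as "acbd18db" to raw bytes
hex_to_raw <- function(hex) {
    stopifnot(nchar(hex) %% 2L == 0L)
    starts <- seq(1L, nchar(hex), by = 2L)
    as.raw(strtoi(substring(hex, starts, starts + 1L), base = 16L))
}

hex_to_raw("acbd18db")
#> [1] ac bd 18 db
```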

WISH: Streamed/incremental digestion?

Would it be possible to make a version of digest() that calculates the checksum incrementally? Conceptually something like:

x <- 1:10
digester <- new_digest(algo="md5")
digester <- add_to_digest(digester, object=x[1:5])
digester <- add_to_digest(digester, object=x[6:10])
checksum <- close_digest(digester)
stopifnot(identical(checksum, digest(x, algo="md5")))

Background

I've got large (~5 GiB) data files that each contain ~20-100 genomic sequences. My goal is to generate a checksum for each individual sequence, not for the whole file. I know the start and stop file positions of each sequence, so I can read them individually using a file connection (via random access). What complicates matters is that the files contain auxiliary newlines (CR, LF, or CRLF) that (i) are OS dependent and (ii) may occur at, say, every 50th or 80th character (different line-wrap widths depending on the source). I want my checksum to ignore these newlines.

Yes, I could open a connection, skip to the start of the sequence of interest, read all of the sequence into memory, drop all newlines, and call digest(clean_seq). However, ideally I'd like to be able to do this with constant-memory constraints. Hence my wish.
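
In the absence of such an API, a bounded-memory approximation is a chained digest over fixed-size blocks of the cleaned byte stream. This is only a sketch: `chained_digest()` is a hypothetical helper, and its result is not equal to digest() of the whole sequence. It is, however, independent of where the newlines fall.

```r
library(digest)

# Hypothetical helper: read n_bytes from an open binary connection, drop CR/LF,
# and fold each full block of cleaned bytes into a running hash
chained_digest <- function(con, n_bytes, block = 65536L, algo = "md5") {
    state <- ""
    buf <- raw(0)
    remaining <- n_bytes
    while (remaining > 0) {
        chunk <- readBin(con, what = "raw", n = min(block, remaining))
        if (length(chunk) == 0L) break
        remaining <- remaining - length(chunk)
        buf <- c(buf, chunk[!(chunk %in% as.raw(c(0x0d, 0x0a)))])
        while (length(buf) >= block) {
            state <- digest(c(charToRaw(state), buf[seq_len(block)]),
                            algo = algo, serialize = FALSE)
            buf <- buf[-seq_len(block)]
        }
    }
    # fold in whatever is left over after the last full block
    digest(c(charToRaw(state), buf), algo = algo, serialize = FALSE)
}

# Usage: same sequence, different line wrapping and line endings
pa <- tempfile(); writeBin(charToRaw("ACGT\nACGT\n"), pa)
pb <- tempfile(); writeBin(charToRaw("ACGTAC\r\nGT\r\n"), pb)
ca <- file(pa, "rb"); ha <- chained_digest(ca, file.size(pa), block = 4L); close(ca)
cb <- file(pb, "rb"); hb <- chained_digest(cb, file.size(pb), block = 4L); close(cb)
identical(ha, hb)
```

For a sequence inside a larger file, the connection would first be positioned with seek() and `n_bytes` set to the length of that sequence.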

Different values compared to OpenSSL

Hi,

I have often used your package for internal purposes. It has always worked reliably and stably.

By accident, I discovered that OpenSSL gives quite different values for e.g. sha-1 or md5.

It would be great to add a section in the README to point out why this is. I was not aware and had issues merging sha-1 encoded fields with a 3rd party solution. OpenSSL gives the 'correct' hashes.

Many thanks for the clarification!
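
A probable cause (an assumption about this report) is that digest() serializes its argument by default, so it hashes R's internal representation of the string rather than the characters themselves. Setting `serialize = FALSE` hashes the bytes of the string directly, which is what most other tools do.

```r
library(digest)

# Default: hash of the serialized R object, which other tools will not reproduce
digest("foo", algo = "md5")

# serialize = FALSE: hash of the three bytes "foo"
h <- digest("foo", algo = "md5", serialize = FALSE)
h
#> [1] "acbd18db4cc2f85cedef654fccc4a4d8"
```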

What about roxygen-generated documentation?

Currently, the content of man/*.Rd files is created manually.

Many packages, such as dplyr use roxygen to generate the man pages from doc-comments.

It also allows examples to be kept in separate R files instead of being written directly into the Rd files (@example tag).
The NAMESPACE file can be generated this way from @export and @useDynLib tags.

Would something like this be desired in digest package?

UBSAN error when calling digest w/sha 512

Reproduce with the r-devel-ubsan-clang docker image:

docker run --rm -it rocker/r-devel-ubsan-clang /bin/bash

Then, within that shell (note that I can only reproduce this when running R interactively for some reason)

Rdevel

Then run in that R session:

install.packages("digest")
digest::digest(quote({}), algo = "sha512")

I see:

> digest::digest(quote({}), algo = "sha512")
sha2.c:791:3: runtime error: load of misaligned address 0x6160000648b6 for type 'const sha2_word64' (aka 'const unsigned long'), which requires 8 byte alignment
0x6160000648b6: note: pointer points here
 00 02 03 00 00 00  02 06 00 00 04 02 00 00  00 01 00 04 00 09 00 00  00 06 73 72 63 72 65 66  00 00
             ^
SUMMARY: AddressSanitizer: undefined-behavior sha2.c:791:3 in

Not a major issue by any means but just wanted to document it in case anyone else stumbled upon it.

Add some examples on writing sha1() method for classes not handled by digest

I would suggest adding this as a vignette. I'm willing to write it.

Do you prefer Sweave/LaTeX/pdf or knitr/markdown/HTML?
I have some fancy examples for lme4 and INLA. I could work something out for mainstream models like lm or glm.
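As a rough idea of what such an example could look like, here is a minimal sketch of a sha1() method for a made-up class. The class `toy_fit`, its fields, and the choice of which fields to hash are all assumptions for illustration; they are not taken from the package or from lme4/INLA.

```r
library(digest)

# Hypothetical class: a fitted-model-like list with a timestamp that
# should not influence the hash.
new_toy_fit <- function(coefficients, n) {
  structure(
    list(coefficients = coefficients, n = n, fitted_at = Sys.time()),
    class = "toy_fit"
  )
}

# Method: hash only the components that define the result, and delegate
# to the existing sha1() methods for lists, numerics and integers.
sha1.toy_fit <- function(x, digits = 14L, zapsmall = 7L, ...) {
  relevant <- unclass(x)[c("coefficients", "n")]
  digest::sha1(relevant, digits = digits, zapsmall = zapsmall, ...)
}

# Register the method explicitly so dispatch works wherever sha1() is called
registerS3method("sha1", "toy_fit", sha1.toy_fit, envir = asNamespace("digest"))

fit1 <- new_toy_fit(c(a = 1.5, b = -2), n = 10L)
fit2 <- fit1
fit2$fitted_at <- fit1$fitted_at + 3600   # only the timestamp differs
```

Under these assumptions `sha1(fit1)` and `sha1(fit2)` agree, because the timestamp is excluded from what gets hashed.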

Issue with `digest` function with files on network drive

I'm getting an error running the digest() function on a file stored on a network drive from a Windows computer while no error occurs on the same file from the local drive.

So the following works perfectly and C is a local drive:

fname <- "C:/junk/all_sites.prj"
h2 <- digest(fname, algo="md5", file=TRUE)

With an identical file moved to a network drive I get an error:

fname <- "X:/junk/all_sites.prj"
h2 <- digest(fname, algo="md5", file=TRUE)
Error in digest(fname, algo = "md5", file = TRUE) : 
  The specified file is not readable: X:/junk/all_sites.prj

This is causing problems with the deployApp() function in Shiny and the document() function in devtools, both of which report digest errors. I reported this to RStudio (thinking it was a Shiny issue).

It seems this may be due to the file.access function used by digest. The code:

file.access(fname, mode=4)

returns a 0 (success) using the local path and a -1 (failure) with the network path.

R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
digest_0.6.4

Fragility of serialize() for digest() use?

First a little background: I use testthat and its expect_known_hash() (which in turn uses digest::digest()) for a package that I'm developing. I noticed that my laptop (Mac) and our CI (Linux) were returning different hashes for one object, which was causing my tests to fail.

As part of debugging, I used dput() to inspect the objects and, not surprisingly, there were some differences in floating point values. I figured that was probably the cause (there is pull request r-lib/testthat#822 that references your digest() vs. sha1() vignette), but decided to go a bit further anyway.

I saved the objects as rds files, transferred them to the same machine, and loaded into the same R session. To my surprise, they now passed both all.equal() and identical() but still had different hashes with digest::digest().

Looking at the output of dput() for the two objects, they were otherwise identical (also the floating point values), but attributes class and row.names were in a different order (the objects are data.frames). Also the output of serialize() is different, as is to be expected because of the different digest::digest() hashes. I don't know what determines their order in the output of dput(), nor if it's the same underlying reason at play with serialize(). But whatever the cause, doesn't this behavior of serialize() seem a bit fragile for digest::digest()? I know the package has been around for quite some time, so I guess this must be some kind of rare edge case. Anyways, digest::sha1() gives identical hashes.

Below is a simple reprex that shows this behavior:

a <-
  structure(
    list(a = "example"),
    class = "data.frame",
    row.names = c(NA, -1L)
  )

b <-
  structure(
    list(a = "example"),
    row.names = c(NA, -1L),
    class = "data.frame"
  )

all.equal(a, b)
#> [1] TRUE
identical(a, b)
#> [1] TRUE

digest::digest(a)
#> [1] "64026fb88a58c424353ad931698acbb3"
digest::digest(b)
#> [1] "8335977c807d32d87b4c39bdf0c1c6b1"

digest::digest(a, algo = "sha1")
#> [1] "3cc1ed15c94980d4890179401b78e017f499a4c5"
digest::digest(b, algo = "sha1")
#> [1] "8464f91957b9587c3205b4ed888ebfc90abe4d12"

digest::sha1(a)
#> [1] "8a98077f38de43dd1e716e69e6ce1d58712f75af"
digest::sha1(b)
#> [1] "8a98077f38de43dd1e716e69e6ce1d58712f75af"

serialize(a, connection = NULL)
#>   [1] 58 0a 00 00 00 02 00 03 05 02 00 02 03 00 00 00 03 13 00 00 00 01 00
#>  [24] 00 00 10 00 00 00 01 00 04 00 09 00 00 00 07 65 78 61 6d 70 6c 65 00
#>  [47] 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00
#>  [70] 10 00 00 00 01 00 04 00 09 00 00 00 01 61 00 00 04 02 00 00 00 01 00
#>  [93] 04 00 09 00 00 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00
#> [116] 09 00 00 00 0a 64 61 74 61 2e 66 72 61 6d 65 00 00 04 02 00 00 00 01
#> [139] 00 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00
#> [162] 00 02 80 00 00 00 ff ff ff ff 00 00 00 fe
serialize(b, connection = NULL)
#>   [1] 58 0a 00 00 00 02 00 03 05 02 00 02 03 00 00 00 03 13 00 00 00 01 00
#>  [24] 00 00 10 00 00 00 01 00 04 00 09 00 00 00 07 65 78 61 6d 70 6c 65 00
#>  [47] 00 04 02 00 00 00 01 00 04 00 09 00 00 00 05 6e 61 6d 65 73 00 00 00
#>  [70] 10 00 00 00 01 00 04 00 09 00 00 00 01 61 00 00 04 02 00 00 00 01 00
#>  [93] 04 00 09 00 00 00 09 72 6f 77 2e 6e 61 6d 65 73 00 00 00 0d 00 00 00
#> [116] 02 80 00 00 00 ff ff ff ff 00 00 04 02 00 00 00 01 00 04 00 09 00 00
#> [139] 00 05 63 6c 61 73 73 00 00 00 10 00 00 00 01 00 04 00 09 00 00 00 0a
#> [162] 64 61 74 61 2e 66 72 61 6d 65 00 00 00 fe

Created on 2019-01-17 by the reprex package (v0.2.1)
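One possible workaround, besides sha1(), is to put the top-level attributes into a fixed order before hashing. The helper `normalize_attr_order()` below is a hypothetical sketch, not part of digest; it only touches top-level attributes, and the resulting hash is not expected to equal the hash of the untouched object.

```r
library(digest)

# Hypothetical helper: re-set the top-level attributes in sorted name order
# so that two objects differing only in attribute order serialize the same.
normalize_attr_order <- function(x) {
  at <- attributes(x)
  if (is.null(at)) return(x)
  attributes(x) <- at[sort(names(at), method = "radix")]
  x
}

a <- structure(list(a = "example"), class = "data.frame", row.names = c(NA, -1L))
b <- structure(list(a = "example"), row.names = c(NA, -1L), class = "data.frame")

ha <- digest(normalize_attr_order(a))
hb <- digest(normalize_attr_order(b))
```

After normalization both objects carry their attributes in the same order, so `ha` and `hb` should agree. Attributes nested inside columns would need the same treatment applied recursively.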

The argument `raw` of `digest` is ineffective

Hi Dirk,

I just happened to find this:

> (.x <- digest::digest("test", "crc32", FALSE, raw = TRUE))
[1] "d87f7e0c"
> class(.x)
[1] "character"

According to the documentation, it should be a raw vector, right?

Best,
Wush
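For comparison, this is what `raw = TRUE` looks like with md5. Whether crc32 honours the argument is exactly the question raised above, so the sketch deliberately uses a different algorithm.

```r
library(digest)

hex <- digest("test", algo = "md5", serialize = FALSE)
bin <- digest("test", algo = "md5", serialize = FALSE, raw = TRUE)

class(bin)    # a raw vector rather than a character string
length(bin)   # md5 produces 16 bytes

# Converting the raw bytes back to hex gives the usual string form
hex_from_bin <- paste(as.character(bin), collapse = "")
```

Here `hex_from_bin` and `hex` should both be `"098f6bcd4621d373cade4e832627b4f6"`, the standard md5 of the string `test`.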

could you replace `file.access` with `R.utils::fileAccess` ?

Hello,

This is a message for the author of the digest package.

It is known that file.access has some issues with network drives, as we can see in this issue and in this post on stackoverflow.

I am the author of this post on stackoverflow. I get:

> file.access("U:/Data", 4)
U:/Data 
     -1

I have tried the fileAccess function from the R.utils package:

> R.utils::fileAccess("U:/Data", 4)
[1] 0
Warning message:
In fileAccess.default("U:/Data", 4) :
  file.access(..., mode=4) and file(..., open="rb")+readBin() gives different results (-1 != 0). Will use the file()+readBin() results: U:/Data

This time, I get 0.
Thus, my problem (and perhaps other problems) would be solved if you replaced file.access with R.utils::fileAccess.
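The file()+readBin() check mentioned in the warning can also be sketched without the R.utils dependency. The helper `can_read_file()` is hypothetical and only illustrates the idea of testing readability by actually trying to read; it has not been tried on a network drive.

```r
# Hypothetical helper: decide readability by opening the file in binary
# mode and reading one byte, instead of relying on file.access().
can_read_file <- function(path) {
  con <- tryCatch(
    suppressWarnings(file(path, open = "rb")),
    error = function(e) NULL
  )
  if (is.null(con)) return(FALSE)
  on.exit(close(con))
  tryCatch({
    readBin(con, what = "raw", n = 1L)
    TRUE
  }, error = function(e) FALSE)
}

tmp <- tempfile()
writeLines("some content", tmp)
readable <- can_read_file(tmp)
missing_file <- can_read_file(file.path(tempdir(), "no-such-file-here"))
```

In this example `readable` is `TRUE` and `missing_file` is `FALSE`.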

function object hash changed after running parallel function in Shiny

@eddelbuettel
I have some time-consuming code using parallel in my Shiny app, and I want to memoise it. The code worked in a script but failed in Shiny, because my function takes a function object, and the hash value of that function object changes after every run of the parallel process.

I managed to create a minimal example to show the problem.

# test digest in shiny
library(shiny)
library(digest)
library(memoise)
library(parallel)
# just a random function. byte-code compiled system functions don't have this problem
test_fun <- function() {
  cat
}
para_test <- function() {
  mclapply(rep(4, 5), mean)
}
para_test_mem <- memoise(para_test)

ui <- fluidPage(actionButton("test", "Test"))
server <- function(input, output){
  observeEvent(input$test, {
    # para_test()
    para_test_mem()
    print(test_fun)
    cat("test_fun digest: ", digest(test_fun), "\n")
  })
}
shinyApp(ui = ui, server = server)

Clicking the Test button of the app runs a memoised parallel function and also prints the hash value of test_fun. In my app I need to run a function on a list in parallel, so I need a function object as a parameter.
The 1st and 2nd clicks give different hash values for test_fun.
After that the cache works and no parallel code is executed, so test_fun has a consistent hash value.
If you replace para_test_mem() with the original version para_test(), every click gives a consistent hash value.

Listening on http://127.0.0.1:5768
function() {
  cat
}
<environment: 0x129b2b608>
test_fun digest:  7ed2b264b5d080bd0331568ec19dfb4e 
function() {
  cat
}
<environment: 0x129b2b608>
test_fun digest:  3540d38333dbd2181e73294b9f1156ae 
function() {
  cat
}
<environment: 0x129b2b608>
test_fun digest:  3540d38333dbd2181e73294b9f1156ae

The test code above only shows the problem when the parallel function is memoised. I believe this is because the parallel part is too simple. In my app the parallel part takes several seconds, and running the non-memoised parallel function changes the hash of another function. If you can find longer-running parallel test code, it should show the same buggy behavior.

I'm wondering whether this is related to the fact that the parallel code creates multiple forks and then collects the results.
It should not be the random seed, right?
The environment of test_fun is the same. It's defined in the session, so it should be the same within one session.
R system functions are byte-code compiled, and they don't have this problem.
The same code in a script instead of Shiny doesn't have this problem.

xxHash for hmac

Paging @jimhester :

One thing I noticed but forgot to bring up with you is that R/hmac.R supports all the other hashing functions. Adding xxHash seems cheap and simple. Shall we? Or can you think of a reason why we wouldn't?
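For context, this is how hmac() is called with one of the algorithms it already supports. The key and message are the widely used HMAC test inputs, not something specific to this proposal, and nothing here shows xxHash support.

```r
library(digest)

# hmac(key, object, algo): keyed hash of the message bytes
mac <- hmac("key", "The quick brown fox jumps over the lazy dog", algo = "md5")
mac
#> [1] "80070713463e7749b90c2dc24911e275"
```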

Can't install package

I'm having an issue installing digest in RStudio (version 1.0.153). Here's the error:

install.packages("digest", dependencies = T)
trying URL 'https://rweb.crmda.ku.edu/cran/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz'
Warning in install.packages :
cannot open URL 'https://rweb.crmda.ku.edu/cran/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://rweb.crmda.ku.edu/cran/bin/macosx/el-capitan/contrib/3.4/digest_0.6.15.tgz'
Warning in install.packages :
download of package ‘digest’ failed

I have tried updating my CRAN mirror and restarting my RStudio session, and neither of these solutions worked. I discovered this error when trying to load lmerTest, as follows:

library(lmerTest)
Error: package or namespace load failed for ‘lmerTest’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
there is no package called ‘digest’

Any idea why I can't get this package installed?

URL in DESCRIPTION should point to github ?

The URL field in DESCRIPTION file currently points to http://dirk.eddelbuettel.com/code/digest.html
I think it should point to https://github.com/eddelbuettel/digest because then it would be easier for services such as http://depsy.org/package/r/digest to properly compute "research software impact" for all contributors.

sha1() fails on empty matrices

> sha1(matrix(integer(0)))
Error in x[1, 1] : subscript out of bounds

I'll look into this.

bigendian test in murmurHash not quite right

CRAN reports to us that

There is a problem with the new bigendian check:
http://cran.r-project.org/web/checks/check_results_digest.html

Happens with gcc too, which says

pmurhash.c:96:92: error: operator '&&' has no right operand
#elif defined(BIG_ENDIAN) && BIG_ENDIAN==1 ||
defined(_BIG_ENDIAN) && _BIG_ENDIAN==1

_BIG_ENDIAN is defined but empty.

I think we may just use the (existing) endian tests from R itself. Or change the define.

/cc @jimhester [for murmurHash] and @wush978 [for murmurHash use in https://github.com/wush978/FeatureHashing/issues/26]

Cannot install latest version

Looks like my R is unable to parse the description file.

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"

> install_github('eddelbuettel/digest')
Using github PAT from envvar GITHUB_PAT
Downloading github repo eddelbuettel/digest@master
Error: Found continuation line starting ' The md5 algorithm b ...' at begin of record.
> Sys.info()
                                                                                           sysname
                                                                                          "Darwin"
                                                                                           release
                                                                                          "14.3.0"
                                                                                           version
"Darwin Kernel Version 14.3.0: Mon Mar 23 11:59:05 PDT 2015; root:xnu-2782.20.48~5/RELEASE_X86_64"
                                                                                           machine
                                                                                          "x86_64"