zoomerjoin's Introduction

zoomerjoin

DOI Lifecycle: experimental Codecov test coverage

zoomerjoin is an R package that empowers you to fuzzy-join massive datasets rapidly and with little memory consumption. It is powered by high-performance implementations of Locality Sensitive Hashing, an algorithm that finds matching records between two datasets without having to compare all possible pairs of observations. In practice, this means zoomerjoin can fuzzily join datasets days, or even years, faster than other matching packages. zoomerjoin has been used in production to join datasets of hundreds of millions of names or vectors in a matter of hours.

Installation

Installing from CRAN:

You can install from CRAN as you would any other package. Please be aware that you will need Cargo (the Rust toolchain and compiler) installed to build the package from source.

install.packages("zoomerjoin")

Installing from R-Universe:

This package is distributed using r-universe, which provides pre-compiled binaries for common operating systems and recent versions of R. To install with r-universe, you can use the following command in R:

install.packages(
  'zoomerjoin',
  repos = c('https://beniaminogreen.r-universe.dev', getOption("repos"))
)

Installing Rust

If pre-compiled binaries are not available for your operating system or version of R, you must have the Rust compiler installed to compile this package from source. After the package is compiled, Rust is no longer required and can be safely uninstalled.

Installing Rust on Linux or Mac:

To install Rust on Linux or Mac, you can simply run the following snippet in your terminal.

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Installing Rust on Windows:

To install Rust on Windows, you can use the Rust installation wizard, rustup-init.exe, found at this site. Depending on your version of Windows, you may see an error that looks something like this:

error: toolchain 'stable-x86_64-pc-windows-gnu' is not installed

In this case, you should run rustup install stable-x86_64-pc-windows-gnu to install the missing toolchain. If you’re missing another toolchain, simply type its name in place of stable-x86_64-pc-windows-gnu in the command above.

Installing Package from Github:

Once you have Rust installed, you should be able to install the package with the install.packages function as above, with the install_github function from the devtools package, or with the pkg_install function from the pak package.

## Install with devtools
# install.packages("devtools")
devtools::install_github("beniaminogreen/zoomerjoin")

## Install with pak
# install.packages("pak")
pak::pkg_install("beniaminogreen/zoomerjoin")

Loading The Package

Once the package is installed, you can load it into memory as usual by typing:

library(zoomerjoin)

Usage:

The flagship features of zoomerjoin are the jaccard_join and euclidean_join families of functions, which are designed to be near drop-ins for the corresponding dplyr/fuzzyjoin commands:

  • jaccard_left_join()
  • jaccard_right_join()
  • jaccard_inner_join()
  • jaccard_full_join()
  • euclidean_left_join()
  • euclidean_right_join()
  • euclidean_inner_join()
  • euclidean_full_join()

The jaccard_join family of functions provides fast fuzzy-joins for strings using the Jaccard distance, while the euclidean_join family provides fuzzy-joins for points or vectors using the Euclidean distance.
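To make the two distance notions concrete, here is a minimal base-R sketch. The helper names (char_ngrams, jaccard_similarity, euclidean_distance) are hypothetical and are not part of zoomerjoin; the sketch assumes strings are compared as sets of overlapping character n-grams, which may differ in detail from the package's internals. The Jaccard distance is one minus the similarity computed here.

```r
# Split a string into its set of overlapping character n-grams ("shingles").
char_ngrams <- function(x, n = 2) {
  if (nchar(x) < n) return(x)
  starts <- seq_len(nchar(x) - n + 1)
  unique(substring(x, starts, starts + n - 1))
}

# Jaccard similarity: shared n-grams divided by all distinct n-grams.
jaccard_similarity <- function(a, b, n = 2) {
  set_a <- char_ngrams(a, n)
  set_b <- char_ngrams(b, n)
  length(intersect(set_a, set_b)) / length(union(set_a, set_b))
}

# Euclidean distance between two numeric vectors of equal length.
euclidean_distance <- function(p, q) sqrt(sum((p - q)^2))

# Two similar donor names share most of their n-grams.
jaccard_similarity("solarz for congress 82", "solarz for congress")

# The classic 3-4-5 right triangle.
euclidean_distance(c(0, 0), c(3, 4))
```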

Example: Joining rows of the Database on Ideology, Money in Politics, and Elections (DIME)

Here’s a snippet showing how to use jaccard_inner_join() to merge two lists of political donors in the Database on Ideology, Money in Politics, and Elections (DIME). You can see a more detailed version of this example in the introductory vignette.

I start with two corpuses I would like to combine, corpus_1:

library(dplyr) # provides the %>% pipe
corpus_1 <- dime_data %>%
  head(500)
names(corpus_1) <- c("a", "field")
corpus_1
## # A tibble: 500 × 2
##        a field                                                                  
##    <dbl> <chr>                                                                  
##  1     1 ufwa cope committee                                                    
##  2     2 committee to re elect charles e. bennett                               
##  3     3 montana democratic party non federal account                           
##  4     4 mississippi power & light company management political action and educ…
##  5     5 napus pac for postmasters                                              
##  6     6 aminoil good government fund                                           
##  7     7 national women's political caucus of california                        
##  8     8 minnesota gun owners' political victory fund                           
##  9     9 metropolitan detroit afl cio cope committee                            
## 10    10 carpenters legislative improvement committee united brotherhood of car…
## # ℹ 490 more rows

And corpus_2:

corpus_2 <- dime_data %>%
  tail(500)
names(corpus_2) <- c("b", "field")
corpus_2
## # A tibble: 500 × 2
##        b field                                                                  
##    <dbl> <chr>                                                                  
##  1   501 citizens for derwinski                                                 
##  2   502 progressive victory fund greater washington americans for democratic a…
##  3   503 ingham county democratic party federal campaign fund                   
##  4   504 committee for a stronger future                                        
##  5   505 atoka country supper committee                                         
##  6   506 friends of democracy pac inc                                           
##  7   507 baypac                                                                 
##  8   508 international brotherhood of electrical workers local union 278 cope/p…
##  9   509 louisville & jefferson county republican executive committee           
## 10   510 democratic party of virginia                                           
## # ℹ 490 more rows

Both corpuses have an observation ID column, and a donor name column. We would like to join the two datasets on the donor names column, but the two can’t be directly joined because of misspellings. Because of this, we will use the jaccard_inner_join function to fuzzily join the two on the donor name column.

Importantly, Locality Sensitive Hashing is a probabilistic algorithm, so it may fail to identify some matches by random chance. I adjust the hyperparameters n_bands and band_width until the chance of true matches being dropped is negligible. By default, the package will issue a warning if the chance of a true match being discovered is less than 95%. You can use the jaccard_probability and jaccard_hyper_grid_search functions to help understand the probability that any true matches will be discarded by the algorithm.

More details and a more thorough description of how to tune the hyperparameters can be found in the guided tour vignette.
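As a rough guide to how n_bands and band_width interact, the sketch below uses the standard banding formula from the LSH literature (Leskovec, Rajaraman, and Ullman): a pair with Jaccard similarity s is compared if it agrees on all band_width hashes in at least one of n_bands bands. The function name lsh_match_probability is hypothetical, and the sketch assumes the package follows this textbook scheme; use jaccard_probability for the package's own calculation.

```r
# Probability that a pair with Jaccard similarity `similarity` shares at
# least one band, under the standard banding scheme (an assumption here).
lsh_match_probability <- function(similarity, n_bands, band_width) {
  1 - (1 - similarity^band_width)^n_bands
}

# Settings used in the example that follows, evaluated at the 0.7 threshold.
p <- lsh_match_probability(similarity = 0.7, n_bands = 20, band_width = 6)
round(p, 2)
```

Under these assumptions the result is about 0.92, which is consistent with the 92% figure in the warning printed by the example. Adding bands raises this probability, while widening the bands lowers it.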

set.seed(1)
start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width = 6, n_bands = 20, band_width = 6)
## Warning in jaccard_join(a, b, mode = "inner", by = by, salt_by = block_by, : A pair of records at the threshold (0.7) have only a 92% chance of being compared.
## Please consider changing `n_bands` and `band_width`.

## Joining by 'field'
print(Sys.time() - start_time)
## Time difference of 0.03253984 secs
print(join_out)
## # A tibble: 19 × 4
##        a field.x                                                      b field.y 
##    <dbl> <chr>                                                    <dbl> <chr>   
##  1    88 scheuer for congress 1980                                  667 scheuer…
##  2    35 solarz for congress 82                                     671 solarz …
##  3   378 guarini for congress 1982                                  883 guarini…
##  4   163 davies county republican executive committee               852 warren …
##  5    87 kentucky state democratic central executive committee      639 arizona…
##  6   302 americans for good government inc                          910 america…
##  7   216 kent county republican finance committee                   719 harford…
##  8   319 7th congressional district democratic party of wisconsin   792 8th con…
##  9   122 tarrant county republican victory fund                     761 lake co…
## 10   238 4th congressional district democratic party                792 8th con…
## 11   387 committee to re elect congressman staton                   805 committ…
## 12   478 united democrats for better government                     642 democra…
## 13    45 dole for senate committee                                  623 riegle …
## 14   216 kent county republican finance committee                   607 lake co…
## 15   230 pipefitters local union 524                                998 pipefit…
## 16   232 republican county committee of chester county              710 republi…
## 17   292 bill bradley for u s senate '84                            913 bill br…
## 18   378 guarini for congress 1982                                  606 guarini…
## 19   238 4th congressional district democratic party                518 16th co…

Zoomerjoin is able to quickly find the matching records without comparing all pairs of records. This saves more and more time as the size of each list increases, so it can scale to join datasets with millions or hundreds of millions of rows.
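To see why avoiding the all-pairs comparison matters, a quick back-of-the-envelope calculation helps. The helper all_pairs below is hypothetical and simply counts how many pairs an exhaustive fuzzy join would have to score.

```r
# Number of candidate pairs an exhaustive (all-pairs) fuzzy join must score.
all_pairs <- function(n_rows_a, n_rows_b) n_rows_a * n_rows_b

# The two 500-row corpora used above.
all_pairs(500, 500)

# Two tables with one million rows each.
all_pairs(1e6, 1e6)
```

The pair count grows with the product of the table sizes, so doubling both tables quadruples the work of an exhaustive comparison.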

Contributing

Thanks for your interest in contributing to Zoomerjoin!

I am using a GitHub-centric workflow to manage the package. You can file a bug report, request a new feature, or ask a question about the package by filing an issue on the issues page, where you will also find a range of templates to help you out. If you’d like to make changes to the code, you can write and file a pull request on this page. I’ll try to respond to all of these in a timely manner (within a week), although occasionally I may take longer to respond to a complicated question or issue.

Please also be aware of the contributor code of conduct for contributing to the repository.

Acknowledgments:

The Zoomerjoin logo was made using this SQL join illustration by Germanx and this speed limit sign from the Federal Highway Administration - MUTCD.

References:

Bonica, Adam. 2016. Database on Ideology, Money in Politics, and Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.

Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd. ed.). Cambridge University Press, USA.

Broder, Andrei Z. (1997), “On the resemblance and containment of documents”, Compression and Complexity of Sequences: Proceedings. Positano, Salerno, Italy

zoomerjoin's People

Contributors

beniaminogreen, etiennebacher, floriancaro, josiahparry

zoomerjoin's Issues

Make set seed work with join

Annoying task, but the functions should capture the R seed and pass it through to Rust, so that the results are perfectly replicable.

Allow users to block on multiple columns at once.

At present, a user who wants to block on the basis of more than one column has to combine these columns themselves. It should be simple enough to combine these under the hood, via either concatenation or hashing, and pass the result to the Rust core.
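Until that is supported, the manual workaround described above could look something like this base-R sketch. The data frame, its column names, and the "|" separator are all made up for illustration; the separator is assumed not to occur in the data.

```r
# Hypothetical data: `state` and `year` are the columns we want to block on.
donors <- data.frame(
  name  = c("smith for senate", "smith for senate cmte", "jones pac"),
  state = c("NY", "NY", "CA"),
  year  = c(1980, 1982, 1980)
)

# Concatenate the blocking columns into one key. The separator keeps
# ("A", "B1") and ("AB", "1") from collapsing into the same key.
donors$block_key <- paste(donors$state, donors$year, sep = "|")

donors$block_key
```

The combined block_key column can then stand in as the single blocking column.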

Error installing on Windows with Rust > 1.7.0

Sorry to bother you. My system is Windows 11, and my R versions are 4.30 and 4.23. I tried installing the package with both versions, but both failed. For Rust, I used stable-x86-64-windows-msvc version 1.7.00 with the gnu toolchain installed. The error message reads

undefined reference to `NtCreateFile'

which could be the cause of the error. I really appreciate your help.

E:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: ./rust/target/x86_64-pc-windows-gnu/release/libzoomerjoin.a(std-ca5208825e97b4ba.std.687851ba-cgu.0.rcgu.o): in function `std::sys::windows::fs::open_link_no_reparse':
/rustc/90c541806f23a127002de5b4038be731ba1458ca/library\std\src\sys\windows/fs.rs:800: undefined reference to `NtCreateFile'
E:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: /rustc/90c541806f23a127002de5b4038be731ba1458ca/library\std\src\sys\windows/fs.rs:829: undefined reference to `RtlNtStatusToDosError'
E:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: ./rust/target/x86_64-pc-windows-gnu/release/libzoomerjoin.a(std-ca5208825e97b4ba.std.687851ba-cgu.0.rcgu.o): in function `std::sys::windows::handle::Handle::synchronous_read':
/rustc/90c541806f23a127002de5b4038be731ba1458ca/library\std\src\sys\windows/handle.rs:241: undefined reference to `NtReadFile'
E:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: /rustc/90c541806f23a127002de5b4038be731ba1458ca/library\std\src\sys\windows/handle.rs:272: undefined reference to `RtlNtStatusToDosError'
E:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: ./rust/target/x86_64-pc-windows-gnu/release/libzoomerjoin.a(std-ca5208825e97b4ba.std.687851ba-cgu.0.rcgu.o): in function `std::sys::windows::handle::Handle::synchronous_write':
/rustc/90c541806f23a127002de5b4038be731ba1458ca/library\std\src\sys\windows/handle.rs:290: undefined reference to `NtWriteFile'
E:\rtools42\x86_64-w64-mingw32.static.posix\bin/ld.exe: /rustc/90c541806f23a127002de5b4038be731ba1458ca/library\std\src\sys\windows/handle.rs:318: undefined reference to `RtlNtStatusToDosError'
collect2.exe: error: ld returned 1 exit status
無法產生 DLL 檔案 # Unable to produce DLL file
ERROR: compilation failed for package 'zoomerjoin'
* removing 'C:/Users/user/AppData/Local/Temp/RtmpW6qdAm/pkg-lib69c458dd7b3f/zoomerjoin'

Error in installation instructions

Describe the bug
I am running Windows 11 with Rust 1.71.1 and WSL. When I try to install the package, my console shows this error message:

> library(zoomerjoin)
Error in library(zoomerjoin) : there is no package called ‘zoomerjoin’
> ## Install with devtools
> # install.packages("devtools")
> devtools::install_github("beniaminogreen/zoomerjoin")
Downloading GitHub repo beniaminogreen/zoomerjoin@HEAD
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

1: All                          
2: CRAN packages only           
3: None                         
4: vctrs (0.6.2 -> 0.6.3) [CRAN]

Enter one or more numbers, or an empty line to skip updates: 3
── R CMD build ───────────────────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file 'C:\Users\winco\AppData\Local\Temp\RtmpCwBEaE\remotes522c7688750f\beniaminogreen-zoomerjoin-f359f28/DESCRIPTION' ...
   preparing 'zoomerjoin':
   checking DESCRIPTION meta-information ...
   cleaning src
   checking for LF line-endings in source and make files and shell scripts
   checking for empty or unneeded directories
   building 'zoomerjoin_0.0.0.9000.tar.gz'
   
Installing package into ‘C:/Users/winco/AppData/Local/R/win-library/4.3’
(as ‘lib’ is unspecified)
* installing *source* package 'zoomerjoin' ...
** using staged installation
** libs
using C compiler: 'gcc.exe (GCC) 12.2.0'
rm -Rf zoomerjoin.dll ./rust/target/x86_64-pc-windows-gnu/release/libzoomerjoin.a entrypoint.o
gcc  -I"C:/PROGRA~1/R/R-43~1.0/include" -DNDEBUG     -I"C:/RBuildTools/4.3/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign  -c entrypoint.c -o entrypoint.o
mkdir -p ./rust/target/libgcc_mock
cd ./rust/target/libgcc_mock && \
	touch gcc_mock.c && \
	gcc -c gcc_mock.c -o gcc_mock.o && \
	ar -r libgcc_eh.a gcc_mock.o && \
	cp libgcc_eh.a libgcc_s.a
C:\RBuildTools\4.3\x86_64-w64-mingw32.static.posix\bin\ar.exe: creating libgcc_eh.a
# CARGO_LINKER is provided in Makevars.ucrt for R >= 4.2
export CARGO_TARGET_X86_64_PC_WINDOWS_GNU_LINKER="x86_64-w64-mingw32.static.posix-gcc.exe" && \
	export LIBRARY_PATH="${LIBRARY_PATH};/c/Users/winco/AppData/Local/Temp/Rtmpctc8EF/R.INSTALL1c04338dae1/zoomerjoin/src/./rust/target/libgcc_mock" && \
	cargo +stable-gnu build --target=x86_64-pc-windows-gnu --lib --release --manifest-path=./rust/Cargo.toml --target-dir ./rust/target
error: toolchain 'stable-x86_64-pc-windows-gnu' is not installed
make: *** [Makevars.win:19: rust/target/x86_64-pc-windows-gnu/release/libzoomerjoin.a] Error 1
ERROR: compilation failed for package 'zoomerjoin'
* removing 'C:/Users/winco/AppData/Local/R/win-library/4.3/zoomerjoin'
Warning message:
In i.p(...) :
  installation of package ‘C:/Users/winco/AppData/Local/Temp/RtmpCwBEaE/file522c5d5356f/zoomerjoin_0.0.0.9000.tar.gz’ had non-zero exit status

I found this guidance rust-lang/rustup#1793 and tried to follow along but the package is still not installing.

This is the result of rustup show for me:

Default host: x86_64-unknown-linux-gnu
rustup home:  /home/wincowger/.rustup

stable-x86_64-unknown-linux-gnu (default)
rustc 1.71.1 (eb26296b5 2023-08-03)

To Reproduce
Steps to reproduce the behavior:
Follow user guidelines on Windows 11 machine with WSL.

Expected behavior
Installation to proceed.

Desktop (please complete the following information):

  • OS: Windows 11

CRAN release?

Is your feature request related to a problem? Please describe.

This package is a great example of using R and Rust together and serves to fill a performance gap in the R ecosystem. The R community would benefit greatly from the publication of the library on CRAN.

Have you considered publishing to CRAN?

[FR] Allow inequality conditions in `block_by`

Is your feature request related to a problem? Please describe.
It is nice to be able to use block_by to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g., rows must have the same year to be considered for matching). I have a setting in which I don't want to compare rows if year.y is larger than year.x, i.e., I'd like to block matching by the condition year.x > year.y.

I don't know how hard that would be to implement. I could trace the usage of block_by (renamed salt in the internals) as far as the new() method of the Shingleset struct in Rust, but I don't know exactly how that works.

Describe the solution you'd like
Not sure of the syntax this should take. It could be something similar to dplyr::join_by(), so that one could do block_by = block_by(year.x > year.y). Also, that only works if the variable has a different name in each dataset.

Describe alternatives you've considered
Currently I don't use block_by. Instead I match on the full data and filter out by my condition afterwards.
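The filter-afterwards workaround can be sketched in base R as follows. The joined data frame is a made-up stand-in for the output of a fuzzy join on the full data, and the column names year.x and year.y are assumed for illustration.

```r
# Hypothetical stand-in for the output of a fuzzy join on the full data;
# the column names year.x / year.y are assumed for illustration.
joined <- data.frame(
  name.x = c("acme corp", "acme corp", "beta llc"),
  name.y = c("acme corp.", "acme co", "beta l.l.c."),
  year.x = c(2001, 2001, 1999),
  year.y = c(1998, 2005, 1999)
)

# Keep only the pairs that satisfy the inequality condition.
kept <- joined[joined$year.x > joined$year.y, ]
kept
```

This gives the right answer, but the similarity of every discarded pair has already been computed, which is the cost the feature request aims to avoid.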

Additional context
/

[FR] Add support for cosine and hamming distances

Now that the package is getting more mature, it would be nice to add support for other distance metrics (specifically, hamming and cosine distances). These should be relatively easy to implement following these notes, and will also provide the opportunity to refine some of the code. In the future, I would like the hash families to implement an lsh trait, which would allow us to reuse some of the amplifying code.

Instructions for how to contribute

As per JOSS requirements, there should be community guidelines: "Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support". I looked in the README and the manuscript but couldn't find these points.

Release zoomerjoin 0.1.5

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • usethis::use_github_links()
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)

vignettes

Write vignette for record joining + benchmark vignette.

README update

The CRAN installation line in the README (install.packages(zoomerjoin)) should probably have quotes around zoomerjoin, i.e. install.packages('zoomerjoin').

CRAN issue - Non-API entry points into R

The CRAN builder is currently showing the following issues when building zoomerjoin:

Version: 0.1.4
Check: compiled code
Result: NOTE
File ‘zoomerjoin/libs/zoomerjoin.so’:
Found non-API calls to R: ‘ENVFLAGS’, ‘FRAME’, ‘HASHTAB’, ‘PRCODE’,
‘PRENV’, ‘PRSEEN’, ‘PRVALUE’, ‘Rf_findVarInFrame3’, ‘SET_BODY’,
‘SET_CLOENV’, ‘SET_ENCLOS’, ‘SET_ENVFLAGS’, ‘SET_FORMALS’,
‘SET_PRCODE’, ‘SET_PRENV’, ‘SET_PRVALUE’, ‘STRING_PTR’, ‘SYMVALUE’,
‘XLENGTH_EX’
Compiled code should not call non-API entry points in R.
See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.

Unsure why this is the case + why CRAN is suddenly giving NOTEs when it was not before (could this be an rextendr issue?). Opening this issue to track my work as I investigate what's going on + to solicit help if anyone is familiar with this issue.

Support `join_by()`

Is your feature request related to a problem? Please describe.
The join_by() helper function was introduced in dplyr 1.1.0. Given that functions in zoomerjoin are designed to be drop-in replacements of dplyr functions, it would be nice to support this syntax so that we don't have to manually change the syntax to a named vector.

Describe the solution you'd like
Support for join_by() syntax, so that the example below works:

library(babynames)
library(zoomerjoin)
library(dplyr, warn.conflicts = FALSE)

baby_names <- data.frame(name = tolower(unique(babynames$name)))
baby_names_sans_vowels <- data.frame(
  name_wo_vowels = gsub("[aeiouy]", "", baby_names$name)
)

# dplyr
joined_names <- inner_join(
  baby_names,
  baby_names_sans_vowels,
  by = join_by(name == name_wo_vowels)
)

# zoomerjoin
joined_names <- jaccard_inner_join(
  baby_names,
  baby_names_sans_vowels,
  by = join_by(name == name_wo_vowels)
)
#> Warning in jaccard_join(a, b, mode = "inner", by = by, salt_by = block_by, : A pair of records at the threshold (0.7) have only a 93% chance of being compared.
#> Please consider changing `n_bands` and `band_width`.
#> Error in simple_by_validate(a, b, by): by_a %in% names(a) are not all TRUE

Describe alternatives you've considered
This is a minor feature request, only for convenience. Using a named vector works very well, it is simply superseded by the join_by() syntax.

Additional context
/

Thanks for this amazing package, it's so nice to match names that fast

Accuracy of memory usage?

Hi, I just found this package, it looks cool and super useful!

One thing that I noted in the benchmarks is how low the memory usage is. I know that using Rust is more efficient in speed and memory usage, but I also think the numbers reported about memory might be inaccurate. From ?profmem:

[...] nearly all memory allocations done in R are logged. Neither memory deallocations nor garbage collection events are logged. Furthermore, allocations done by non-R native libraries or R packages that use native code Calloc() / Free() for internal objects are also not logged.

I suspect that a lot of memory allocations are not done in R but in Rust and that the memory usage is actually higher than reported. I run into the same thing when I benchmark polars and tidypolars, so I'm interested if you find a workaround 😉

Just to give you an example: in polars, when I take the mean of a column with 100_000, 1_000_000, 10_000_000, or 100_000_000 rows, R reports the same (tiny) memory usage but I clearly see a peak in Windows task manager:

library(polars)

bench::press(
  rows = c(1e5, 1e6, 1e7, 1e8),
  {
    dat <- pl$DataFrame(
      a = rnorm(rows),
      b = rnorm(rows),
      c = rnorm(rows)
    )
    bench::mark(
      dat$with_columns(y = pl$col("a")$mean())
    )
  }
)
#> # A tibble: 4 × 7
#>   expression                  rows     min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <dbl> <bch:t> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "dat$with_columns(y = pl$…   1e5 557.8µs  739.5µs   1218.      3.81KB        0
#> 2 "dat$with_columns(y = pl$…   1e6   3.4ms   3.95ms    123.      3.81KB        0
#> 3 "dat$with_columns(y = pl$…   1e7    38ms  42.06ms     14.7     3.81KB        0
#> 4 "dat$with_columns(y = pl$…   1e8   526ms 526.01ms      1.90    3.81KB        0

I was told there's a linux tool to make more accurate benchmarks when calling other languages from R but I don't remember the name, I'll update this post if I find it.

Consider Adding HNSW for joining on arbitrary distance functions

Currently, the package only supports joining on two distance functions, the Euclidean and Jaccard distances. Implementing LSH schemes for new distances is tough and time-consuming. A future goal for the package could be to provide an implementation of Hierarchical Navigable Small Worlds (HNSW), which would allow users to join datasets in linearithmic time. This is worse than the scaling of LSH, but has the advantage that it works for any distance function.

Figuring out how to create a performant implementation of HNSW in Rust will likely take a lot of time, but could have huge benefits.

Provide R package binaries?

Hello again!

Installing zoomerjoin requires Rust and building the source locally. It would be very convenient to provide R binaries, either via R-universe or via GitHub releases (as polars does, for example), so that one can install the package in a couple of seconds without needing Rust.

[FR] Option to set the number of threads?

Is your feature request related to a problem? Please describe.
If I'm not mistaken, zoomerjoin uses all threads available on the laptop, which explains in part its great performance. It would be nice to be able to configure the number of threads, so that I can use a part of the CPU for other tasks (I'm surprised that CRAN accepted the package given they don't want more than 2 threads to be used by tests and examples).

Describe the solution you'd like
Either a function or an option to specify the number of threads that can be used by zoomerjoin.

Describe alternatives you've considered
/

Additional context
/

Thanks again for this great package!

Suggestion: use `collapse::%!iin%` instead of `!%in%`

Hello @beniaminogreen, the following lines take quite some time to run as the output becomes larger.

not_matched_a <- ! seq(nrow(a)) %in% match_table[,1]
not_matched_b <- ! seq(nrow(b)) %in% match_table[,2]
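To make concrete what these two lines compute, here is a toy base-R sketch. The inputs are made up: a has five rows, and match_table holds one (row of a, row of b) pair per match. The sketch uses seq_len() in place of seq() so that a zero-row input yields an empty index.

```r
# Toy inputs: `a` has 5 rows, and `match_table` holds (row of a, row of b) pairs.
a <- data.frame(id = 1:5)
match_table <- cbind(c(2, 2, 4), c(1, 3, 3))

# TRUE for every row of `a` that never appears in the first column.
not_matched_a <- !seq_len(nrow(a)) %in% match_table[, 1]

# Row numbers of `a` that found no match.
which(not_matched_a)
```

Each row index is looked up in the match table, so the cost of this step grows with both the number of rows and the number of matches.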

I made a quick test with collapse::%!iin% instead of !%in% and found ~10% decrease in time and 15% decrease in memory used when both inputs have 300k rows:

library(bench) 
library(dplyr)

out <- cross::run(
  pkgs = c("beniaminogreen/zoomerjoin", "etiennebacher/zoomerjoin@more-speedup"), 
  ~{
    library(zoomerjoin)
    
    ### Setup -----
    
    corpus_1 <- data.table::rbindlist(rep(list(dime_data), 300))
    names(corpus_1) <- c("a", "field")
    
    corpus_2 <- data.table::rbindlist(rep(list(dime_data), 300))
    names(corpus_2) <- c("b", "field")
    
    ### Benchmark -----
    
    bench::mark(
      jaccard_left_join(corpus_1, corpus_2, 
                        by = "field", n_gram_width=6,
                        n_bands=20, band_width=6, threshold = .8)
    )
  }
)

tidyr::unnest(out, result) |> 
  select(pkg, median, mem_alloc)
#> # A tibble: 2 × 3
#>   pkg                                     median mem_alloc
#>   <chr>                                 <bch:tm> <bch:byt>
#> 1 beniaminogreen/zoomerjoin                3.73m    18.5GB
#> 2 etiennebacher/zoomerjoin@more-speedup    3.28m    15.1GB

I don't know whether you'd be open to using collapse as a dependency (it only has Rcpp as dependency), but I think it'd be worth it as the performance gain should increase with the size of the data. Happy to make a PR if you're fine with that.

Make join keep dplyr groups

When you use zoomerjoin to join grouped tables, the groups are lost. I know this is not how the dplyr joins handle it.

  • look into how dplyr preserves groups when joining

Same " Political Donors " data example, but using the euclidean_inner_join() Fx...

Hi Beniamino,

Thanks for the zoomerjoin PKG.

Tried your very clear "Political Donors" data example, using the jaccard_inner_join() function.

Really, really fast! :-)

Could you please show an R code example with the same "Political Donors" data, but using the euclidean_inner_join() function?

Thanks - very highly appreciated...

SFd99
San Francisco
latest zoomerjoin PKG v0.1.0
latest R, Rstudio, Ubuntu Linux 20.04

include ability to match on regex

Hello,

First, congrats on this amazing package. I am simply stunned by its performance.

I was wondering whether there is any way to expand the set of functions to include regex matches.
I am currently using fuzzyjoin::regex_left_join, but unfortunately, in my use case it's simply too slow.

Unfortunately, I don't know Rust, so I can't contribute anything in this regard. In any case, thanks
again for this powerful package which should be known much more widely.

Roland
