golgotha's Introduction

golgotha - Contextualised Embeddings and Language Modelling using BERT and Friends using R

  • This R package wraps the Python transformers module using reticulate
  • The objective of the package is to make it easy to get sentence embeddings from a BERT-like model in R, for use in downstream modelling (e.g. Support Vector Machines / Sentiment Labelling / Classification / Regression / POS tagging / Lemmatisation / Text Similarities)
  • Golgotha: Hope for lonely AI pilgrims on their way to losing CPU power: http://costes.org/cdbm20.mp3

Installation

  • For installing the development version of this package:
    • Execute in R: devtools::install_github("bnosac/golgotha", INSTALL_opts = "--no-multiarch")
    • Look at the documentation of the functions: help(package = "golgotha")

Example with BERT model architecture

  • Download a model (e.g. bert multilingual lowercased)
library(golgotha)
transformer_download_model("bert-base-multilingual-uncased")
  • Load the model and get the embedding of sentences / subword tokens or just tokenise
model <- transformer("bert-base-multilingual-uncased")
x <- data.frame(doc_id = c("doc_1", "doc_2"),
                text = c("give me back my money or i'll call the police.",
                         "talk to the hand because the face don't want to hear it any more."),
                stringsAsFactors = FALSE)
embedding <- predict(model, x, type = "embed-sentence")
embedding <- predict(model, x, type = "embed-token")
tokens    <- predict(model, x, type = "tokenise")
  • Same example but now on Dutch / French
text <- c("vlieg met me mee naar de horizon want ik hou alleen van jou",
          "l'amour n'est qu'un enfant de pute, il agite le bonheur mais il laisse le malheur",
          "http://costes.org/cdso01.mp3", 
          "http://costes.org/mp3.htm")
text <- setNames(text, c("doc_nl", "doc_fr", "le petit boudin", "thebible"))
embedding <- predict(model, text, type = "embed-sentence")
embedding <- predict(model, text, type = "embed-token")
tokens    <- predict(model, text, type = "tokenise")
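One of the downstream uses mentioned in the introduction is text similarity. The sketch below shows how cosine similarity between sentence embeddings could be computed in base R. It assumes that predict(model, x, type = "embed-sentence") returns a numeric matrix with one row per document; a small random matrix stands in for that output here so the snippet runs without a downloaded model, and cosine_similarity is a hypothetical helper, not a golgotha function.

```r
# Stand-in for the output of predict(model, x, type = "embed-sentence").
# Assumption: a numeric matrix with one row per document.
set.seed(42)
embedding <- matrix(rnorm(3 * 8), nrow = 3,
                    dimnames = list(c("doc_1", "doc_2", "doc_3"), NULL))

# Hypothetical helper: pairwise cosine similarity between the rows of a matrix
cosine_similarity <- function(m) {
  # Scale every row to unit length, then take all pairwise dot products
  unit <- m / sqrt(rowSums(m^2))
  tcrossprod(unit)
}

similarity <- cosine_similarity(embedding)
print(round(similarity, 3))
```

The result is a square documents-by-documents matrix with ones on the diagonal; values closer to 1 indicate embeddings that point in a more similar direction.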

Example with DistilBERT model architecture

For any model architecture other than BERT, you have to provide the architecture argument, which must be one of the 10 supported model architectures

  • Download a model (e.g. distilbert multilingual cased). By default it is stored in the system.file(package = "golgotha", "models") folder. To change this, use the path argument of transformer_download_model
transformer_download_model("distilbert-base-multilingual-cased", architecture = "DistilBERT")
  • Once downloaded, you can just load the model and start embedding your text
model <- transformer("distilbert-base-multilingual-cased", architecture = "DistilBERT")
x <- data.frame(doc_id = c("doc_1", "doc_2"),
                text = c("give me back my money or i'll call the police.",
                         "talk to the hand because the face don't want to hear it any more."),
                stringsAsFactors = FALSE)
embedding <- predict(model, x, type = "embed-sentence")
embedding <- predict(model, x, type = "embed-token")
tokens    <- predict(model, x, type = "tokenise")

Some other models available

The list is not exhaustive. Look at the transformer documentation for an up-to-date model list. Available models also depend on the version of the transformers module you have installed.

model <- transformer("bert-base-multilingual-uncased")
model <- transformer("bert-base-multilingual-cased")
model <- transformer("bert-base-dutch-cased")
model <- transformer("bert-base-uncased")
model <- transformer("bert-base-cased")
model <- transformer("bert-base-chinese")
model <- transformer("distilbert-base-cased", architecture = "DistilBERT")
model <- transformer("distilbert-base-uncased-distilled-squad", architecture = "DistilBERT")
model <- transformer("distilbert-base-german-cased", architecture = "DistilBERT")
model <- transformer("distilbert-base-multilingual-cased", architecture = "DistilBERT")
model <- transformer("distilroberta-base", architecture = "DistilBERT")

Issues

  • This package requires the Python modules transformers and torch to be installed. Normally the R package reticulate automagically gets this done for you.
  • If your installation gets stuck somehow, you can normally install these requirements as follows.
library(reticulate)
install_miniconda()
conda_install(envname = 'r-reticulate', c('torch', 'transformers==2.4.1'), pip = TRUE)
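After installing the requirements, it can help to confirm that reticulate actually sees both modules. The following is a minimal sketch using reticulate::py_module_available; the helper name python_modules_available is hypothetical and not part of golgotha.

```r
# Hypothetical helper: report which of the required Python modules
# reticulate can import from the Python environment it is bound to.
python_modules_available <- function(modules = c("torch", "transformers")) {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    stop("The reticulate package is required for this check.")
  }
  # Returns a named logical vector, one element per module
  vapply(modules, reticulate::py_module_available, logical(1))
}
```

Calling python_modules_available() returns a named logical vector. Note that the check may start Python in the current R session, so it is best run in a fresh session after the installation has finished.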

golgotha's People

Contributors

cregouby, jwijffels

golgotha's Issues

avoid initializing Python in .onLoad()

If I try to install this package using the development version of reticulate, I see:

$ R CMD INSTALL --preclean golgotha
* installing to library ‘/Users/kevinushey/Library/R/3.6/library’
* installing *source* package ‘golgotha’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘golgotha’:
 .onLoad failed in loadNamespace() for 'golgotha', details:
  call: check_forbidden_initialization()
  error: package 'golgotha' attempted to initialize Python in .onLoad(). Packages should not initialize Python themselves; rather, Python should be loaded on-demand as requested by the user of the package. Please see vignette("python_dependencies", package = "reticulate") for more details.
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/Users/kevinushey/Library/R/3.6/library/golgotha’

Note that calling source_python() here will force reticulate to be loaded:

https://github.com/bnosac/golgotha/blob/master/R/AAA.R#L10

Are you able to use the delay_load mechanism for this instead -- e.g. as documented in https://rstudio.github.io/reticulate/articles/package.html?
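For reference, the delay_load pattern described in the reticulate package-development article looks roughly like the sketch below. This is an illustration of the general mechanism, not golgotha's actual code: how the package's own Python helper code (currently loaded via source_python()) would be adapted is left open.

```r
# Placeholder in the package namespace; filled in when the package is loaded
transformers <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE returns a proxy object: Python is only initialised
  # when the module is first used, not when the package is loaded
  transformers <<- reticulate::import("transformers", delay_load = TRUE)
}
```

With this pattern, loading the package no longer starts Python, so the user can still choose a Python environment (for example with reticulate::use_condaenv()) before the first call that touches the module.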

Crashing R

I am unable to execute the example you provide, as the package crashes my R session immediately. I can't generate a reprex to reproduce this issue because the R session is aborted.

This is my session info:

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

This is a MacBook Pro with the following configuration:

Processor - 3.9 GHz 6-Core Intel Core i9
Memory - 32 GB 2400 MHz DDR4

Newly introduced architectures of transformers

Do you happen to have any plans to implement newly introduced architectures in the near future?
You mentioned 10 supported architectures, but Huggingface enlarged their repertoire of supported architectures to be at least 17. I'm personally interested in BART, T5 and ELECTRA...

I'm not complaining about this useful package, it is just a personal wish...

Thanks.

Questions for installation process

I've been trying to install this package in R, without success. I cannot get rid of the following error message.
I'm new to the reticulate package, so there may be something very basic that I'm missing.

Error: package or namespace load failed for 'golgotha':
 .onLoad failed in loadNamespace() for 'golgotha', details:
  call: NULL
  error: could not find a Python environment for C:/Users/Ansel/Anaconda3/python.exe
Error: loading failed
Execution halted
ERROR: loading failed
* removing 'C:/Usr/R-3.6.3/library/golgotha'
Error: Failed to install 'golgotha' from GitHub:
  (converted from warning) installation of package ‘C:/Users/Ansel/AppData/Local/Temp/Rtmpo5d0Zs/file76b85e3e7d52/golgotha_0.2.0.tar.gz’ had non-zero exit status

Functions for similarity and classification

Good morning Jan,

Just curious, do you have plans to add functions to the package that make it easier to do things like finding similar words (or sentences), text classification (a bit like you did in ruimtehol) and prediction of next word in a sentence?

Thanks for making BERT available in R!

python bert package is missing

I would like to use golgotha to run BERT for a classification task,
but I get an error after installing, when running

transformer_download_model("bert-base-multilingual-uncased")
Downloading model to C:/Program Files/R/R-4.0.3/library/golgotha/models/bert-base-multilingual-uncased
Error: Python module BERT was not found.

Perhaps you can help me solve this.

Thank you,
Nicolas

how to use gpu / multicore?

Hi, is it possible to use golgotha on a GPU for faster embedding?
I tried to configure the GPU through the torch library, but it did not work.
I also experimented with parallelisation in R, but did not find an efficient flow for multicore usage.
It looked like I had to reload the BERT model in every forked R process, which is slow.
Is multicore or GPU usage possible with golgotha?

Loading golgotha crashes due to python dependencies

Hi! I've just installed the package successfully, but loading golgotha then returned an error:

> devtools::install_github("bnosac/golgotha", INSTALL_opts = "--no-multiarch")
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo bnosac/golgotha@HEAD
✓  checking for file ‘/private/var/folders/3w/x0bjsdv52v3b3kjbw87y3wfw0000gn/T/RtmpGIic1l/remotes866533484609/bnosac-golgotha-6f9f2bf/DESCRIPTION’ ...
─  preparing ‘golgotha’:
✓  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘golgotha_0.2.0.tar.gz’
   
Installing package into ‘/Users/bernardolares/Library/R/4.0/library’
(as ‘lib’ is unspecified)
* installing *source* package ‘golgotha’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (golgotha)
> library(golgotha)
Loading required package: reticulate
Configuring package 'golgotha': please wait ...
Collecting torch
  Downloading torch-1.9.0-cp39-none-macosx_10_9_x86_64.whl (127.9 MB)
Collecting transformers==2.4.1
  Downloading transformers-2.4.1-py3-none-any.whl (475 kB)
Collecting boto3
  Downloading boto3-1.18.16-py3-none-any.whl (131 kB)
Collecting tqdm>=4.27
  Downloading tqdm-4.62.0-py2.py3-none-any.whl (76 kB)
Requirement already satisfied: numpy in ./Library/r-miniconda/envs/r-reticulate/lib/python3.9/site-packages (from transformers==2.4.1) (1.21.1)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp39-cp39-macosx_10_6_x86_64.whl (1.1 MB)
Collecting tokenizers==0.0.11
  Downloading tokenizers-0.0.11.tar.gz (30 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Collecting filelock
  Downloading filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting requests
  Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting regex!=2019.12.17
  Downloading regex-2021.8.3-cp39-cp39-macosx_10_9_x86_64.whl (285 kB)
Requirement already satisfied: typing-extensions in ./Library/r-miniconda/envs/r-reticulate/lib/python3.9/site-packages (from torch) (3.10.0.0)
Collecting botocore<1.22.0,>=1.21.16
  Downloading botocore-1.21.16-py3-none-any.whl (7.8 MB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.0-py3-none-any.whl (79 kB)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in ./Library/r-miniconda/envs/r-reticulate/lib/python3.9/site-packages (from botocore<1.22.0,>=1.21.16->boto3->transformers==2.4.1) (2.8.2)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.6-py2.py3-none-any.whl (138 kB)
Requirement already satisfied: six>=1.5 in ./Library/r-miniconda/envs/r-reticulate/lib/python3.9/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.22.0,>=1.21.16->boto3->transformers==2.4.1) (1.16.0)
Requirement already satisfied: certifi>=2017.4.17 in ./Library/r-miniconda/envs/r-reticulate/lib/python3.9/site-packages (from requests->transformers==2.4.1) (2021.5.30)
Collecting charset-normalizer~=2.0.0
  Downloading charset_normalizer-2.0.4-py3-none-any.whl (36 kB)
Collecting idna<4,>=2.5
  Downloading idna-3.2-py3-none-any.whl (59 kB)
Collecting click
  Downloading click-8.0.1-py3-none-any.whl (97 kB)
Requirement already satisfied: joblib in ./Library/r-miniconda/envs/r-reticulate/lib/python3.9/site-packages (from sacremoses->transformers==2.4.1) (1.0.1)
Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (PEP 517): started
  Building wheel for tokenizers (PEP 517): finished with status 'error'
  ERROR: Command errored out with exit status 1:
   command: /Users/bernardolares/Library/r-miniconda/envs/r-reticulate/bin/python /Users/bernardolares/Library/r-miniconda/envs/r-reticulate/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /var/folders/3w/x0bjsdv52v3b3kjbw87y3wfw0000gn/T/tmp4_14r7ur
       cwd: /private/var/folders/3w/x0bjsdv52v3b3kjbw87y3wfw0000gn/T/pip-install-hydjbfqr/tokenizers_800ac7fe01f240cf9fe99c8faeb338df
  Complete output (20 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib
  creating build/lib/tokenizers
  copying tokenizers/__init__.py -> build/lib/tokenizers
  running build_ext
  running build_rust
  error: can't find Rust compiler
  
  If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
  
  To update pip, run:
  
      pip install --upgrade pip
  
  and then retry package installation.
  
  If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
  ----------------------------------------
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers which use PEP 517 and cannot be installed directly
Error: package or namespace load failed for ‘golgotha’:
 .onLoad failed in loadNamespace() for 'golgotha', details:
  call: NULL
  error: Error installing package(s): 'torch', 'transformers==2.4.1'

My sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reticulate_1.20 Robyn_3.0.0     ggplot2_3.3.5   dplyr_1.0.7     lares_5.0.2    

loaded via a namespace (and not attached):
 [1] bitops_1.0-7       fs_1.5.0           usethis_2.0.1      lubridate_1.7.10   devtools_2.4.0    
 [6] doParallel_1.0.16  httr_1.4.2         rprojroot_2.0.2    tools_4.0.3        doRNG_1.8.2       
[11] utf8_1.2.2         R6_2.5.0           rpart_4.1-15       lazyeval_0.2.2     DBI_1.1.0         
[16] colorspace_2.0-2   withr_2.4.2        tidyselect_1.1.1   prettyunits_1.1.1  processx_3.5.2    
[21] curl_4.3.2         compiler_4.0.3     glmnet_4.1-2       cli_3.0.1          rvest_1.0.0       
[26] xml2_1.3.2         desc_1.3.0         labeling_0.4.2     scales_1.1.1       callr_3.7.0       
[31] rappdirs_0.3.3     stringr_1.4.0      digest_0.6.27      pkgconfig_2.0.3    parallelly_1.27.0 
[36] sessioninfo_1.1.1  fastmap_1.0.1      rlang_0.4.11       rstudioapi_0.13    rPref_1.3         
[41] shape_1.4.6        prophet_1.0        farver_2.1.0       generics_0.1.0     jsonlite_1.7.2    
[46] zip_2.2.0          RCurl_1.98-1.3     magrittr_2.0.1     patchwork_1.1.1    Matrix_1.2-18     
[51] Rcpp_1.0.7         munsell_0.5.0      fansi_0.5.0        lifecycle_1.0.0    stringi_1.7.3     
[56] pROC_1.17.0.1      yaml_2.2.1         pkgbuild_1.2.0     plyr_1.8.6         grid_4.0.3        
[61] parallel_4.0.3     listenv_0.8.0      crayon_1.4.1       lattice_0.20-41    splines_4.0.3     
[66] ps_1.6.0           pillar_1.6.2       igraph_1.2.6       rngtools_1.5       codetools_0.2-18  
[71] pkgload_1.2.1      glue_1.4.2         doFuture_0.12.0    RcppParallel_5.1.4 rpart.plot_3.0.9  
[76] data.table_1.14.0  remotes_2.3.0      png_0.1-7          vctrs_0.3.8        nloptr_1.2.2.2    
[81] foreach_1.5.1      testthat_3.0.4     gtable_0.3.0       purrr_0.3.4        tidyr_1.1.3       
[86] future_1.21.0      assertthat_0.2.1   cachem_1.0.4       openxlsx_4.2.4     h2o_3.32.1.3      
[91] survival_3.2-7     minpack.lm_1.2-1   tibble_3.1.3       iterators_1.0.13   memoise_2.0.0     
[96] corrplot_0.90      globals_0.14.0     ellipsis_0.3.2  

Error in building wheels for collected packages: torch

First of all, thank you for this exciting new package! I can't wait to try out models like BERT in R.

When I run this code:

devtools::install_github("bnosac/golgotha", INSTALL_opts = "--no-multiarch")
library(golgotha)
transformer_download_model("bert-base-multilingual-uncased")

I run into some errors.
After calling transformer_download_model("bert-base-multilingual-uncased"), I am asked to install R Miniconda. The first error that appears in the installation process is this one:

Building wheel for torch (setup.py): started
 Building wheel for torch (setup.py): finished with status 'error'

Since I am rather new to Python, I am not completely sure whether this is the right place for this issue. Maybe I need to go to the reticulate GitHub instead?
