gesistsa / grafzahl
fine-tuning Transformers for text data from within R
Home Page: https://gesistsa.github.io/grafzahl/
License: GNU General Public License v3.0
At least the R part.
There seems to be an issue with the newly released pyarrow 9.0.0. It is currently installed through pip; maybe it would be better to install it through conda instead.
Like a computer scientist and put it in the paper.
untested
When running grafzahl() on macOS Monterey following the example code I get
Error in py_run_file_impl(path.expand(file), local, convert) :
ModuleNotFoundError: No module named 'torch'
Sounds a lot like:
rstudio/reticulate#909
Any ideas?
paper
in paths
I have tried to install grafzahl (using Transformers within the quanteda infrastructure is a great choice!). Everything works fine with setup_grafzahl(cuda = TRUE) (I have GPUs, by the way) until the last line, which shows an error: "Error in .install_gpu_pytorch(cuda_version = cuda_version) : Cannot set up `pytorch". Unfortunately, this error prevents grafzahl from working correctly. Any suggestion on how to solve it? Thanks! Luigi
Hello there,
thanks for this amazing package. I tried it out yesterday and I was able to train a model. However, once I wanted to use the trained model to predict new data, I received the following error message:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.
The error message appears whether I set use_cuda to TRUE or FALSE.
Do you have any idea why I receive this error message or what I could do?
Thanks in advance
Tobias
Conduct detect_conda and detect_cuda after the current setup_grafzahl procedure.
Hi there,
I'm having a bit of trouble getting grafzahl to look for miniconda in the right place. Since I had to install miniconda manually, it's not in the usual folder (r-miniconda) but in another one (miniconda3). I saw that this was an issue before (#20) and tried to solve it by specifying the right path for both RETICULATE_MINICONDA_PATH and GRAFZAHL_MINICONDA_PATH. However, detect_conda() still returns FALSE.
The problem, I suspect, is that the .gen_conda_path function adds "bin" and "conda" to the path, which then points to a folder/file that doesn't exist within miniconda3. In my case, the right path would be "condabin" and then "conda" (I guess). I don't know whether this is a version- or system-related issue, but any idea on how to fix it, or some workaround, would be greatly appreciated.
Thanks!
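For anyone hitting the same problem, a minimal workaround sketch (assuming grafzahl reads GRAFZAHL_MINICONDA_PATH when it looks for conda, as described above; the path is an example, adjust to your system):

```r
# Hypothetical workaround: point both reticulate and grafzahl at a
# manually installed miniconda before loading the package.
conda_home <- path.expand("~/miniconda3")
Sys.setenv(RETICULATE_MINICONDA_PATH = conda_home)
Sys.setenv(GRAFZAHL_MINICONDA_PATH = conda_home)
# library(grafzahl); detect_conda() should now look under ~/miniconda3
```

This does not fix the "bin" vs "condabin" issue itself, but it at least rules out a wrong base path as the cause.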
Given the discussion about which layer to keep as a token's representation in downstream analysis (Jawahar et al., 2019; Ethayarajh, 2019) when using, for example, a pre-trained BERT model, I was wondering whether you are considering allowing the user to select a specific layer when fine-tuning a Transformer via grafzahl. At the moment, which layer is used when, for example, I specify model_name = "bert-base-uncased"? Thanks for your great package!
Should be the default to nudge users to adopt the best practice.
So that testing is possible if we install miniconda to a writable temp. directory.
The current R scripts are kind of chaotic
simpletransformers 0.7
require(grafzahl)
require(quanteda)
download.file("https://huggingface.co/datasets/israel/Amharic-News-Text-classification-Dataset/resolve/main/train.csv", destfile = "am_train.csv")
input <- read.csv("am_train.csv")
input_corpus <- corpus(input, text_field = "article") %>% corpus_subset(category != "")
model <- grafzahl(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")
Error
Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
Traceback:
1. grafzahl(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")
2. grafzahl.corpus(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")
3. py_train(data = input_data, num_labels = num_labels, output_dir = output_dir,
. best_model_dir = best_model_dir, cache_dir = cache_dir, model_type = model_type,
. model_name = model_name, num_train_epochs = num_train_epochs,
. train_size = train_size, manual_seed = manual_seed, regression = regression,
. verbose = verbose)
4. py_call_impl(callable, call_args$unnamed, call_args$named)
The Description field is intended to be a (one paragraph) description of
what the package does and why it may be useful. Please add more details
about the package functionality and implemented methods in your
Description text.
If there are references describing the methods in your package, please
add these in the description field of your DESCRIPTION file in the form
authors (year) doi:...
authors (year) arXiv:...
authors (year, ISBN:...)
or if those are not available: https:...
with no space after 'doi:', 'arXiv:', 'https:' and angle brackets for
auto-linking. (If you want to add a title as well please put it in
quotes: "Title")
The Description field should start with a capital letter.
Please add \value to .Rd files regarding exported methods and explain
the functions results in the documentation. Please write about the
structure of the output (class) and also what the output means. (If a
function does not return a value, please document that too, e.g.
\value{No return value, called for side effects} or similar)
Missing Rd-tags: hydrate.Rd: \value
\dontrun{} should only be used if the example really cannot be executed
(e.g. because of missing additional software, missing API keys, ...) by
the user. That's why wrapping examples in \dontrun{} adds the comment
("# Not run:") as a warning for the user. Does not seem necessary.
Please replace \dontrun with \donttest.
Please unwrap the examples if they are executable in < 5 sec, or replace
dontrun{} with \donttest{}.
Please add small executable examples in your Rd-files to illustrate the use of the exported function but also enable automatic testing.
Please ensure that your functions do not write by default or in your examples/vignettes/tests in the user's home filespace (including the package directory and getwd()). This is not allowed by CRAN policies. Please omit any default path in writing functions. In your examples/vignettes/tests you can write to tempdir(). -> R/train.R
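A minimal sketch of what the reviewer is asking for; the grafzahl() call is commented out, and its arguments are taken from the examples elsewhere on this page:

```r
# Write all generated files to a session-specific temporary directory,
# never to the user's home filespace or getwd() (CRAN policy).
out_dir <- file.path(tempdir(), "grafzahl_output")
dir.create(out_dir, showWarnings = FALSE)
# model <- grafzahl(x = input_corpus, y = "category",
#                   model_name = "bert-base-uncased",
#                   output_dir = out_dir)
```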
The idea probably would be: Make grafzahl usable with non-Conda Python, e.g. the Python provided by Colab. As of writing, that's 3.8.
Install all the Python dependencies: pandas, emoji, tqdm, simpletransformers
Is it possible?
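A sketch of the manual dependency installation on a non-conda Python such as Colab's; the package list is the one given above, and no version pins are implied:

```shell
# Install grafzahl's Python dependencies into the system Python via pip
pip install pandas emoji tqdm simpletransformers
```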
Hi @chainsawriot
First of all, thank you for your work and another great package! After having read the CCR software announcement, I wanted to check out grafzahl. In particular, I wanted to see whether using it on my notebook without a CUDA GPU would make any sense at all.
The setup process worked smoothly. I then replicated the Theocharis et al. (2020) example. Model training worked fine (although it needed 11.5h, but that was to be expected). However, the predict() step did not work.
Input:
pred_bert <- predict(object = model, newdata = unciviltweets[test], cuda = FALSE)
Error:
"Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False"
I share a reproducible example below, with a nonsensical reduction of the training set to make it finish within a sensible timeframe. The error message remains the same.
Thanks again!
pacman::p_load(grafzahl, quanteda, caret, tictoc, tidyverse)
uncivildfm <- unciviltweets %>%
  tokens(remove_url = TRUE, remove_numbers = TRUE) %>%
  tokens_wordstem() %>%
  dfm() %>%
  dfm_remove(stopwords("english")) %>%
  dfm_trim(min_docfreq = 2)
y <- docvars(unciviltweets)[,1]
seed <- 123
set.seed(seed)
training <- original_training <- sample(seq_along(y), floor(.80 * length(y)))
test <- (seq_along(y))[seq_along(y) %in% training == FALSE]
set.seed(721)
tic()
model <- grafzahl(unciviltweets[original_training[1:20]], model_type = "bertweet", model_name = "vinai/bertweet-base", output_dir = here::here("reprex"))
toc()
pred_bert <- predict(object = model, newdata = unciviltweets[test], cuda = FALSE)
The current textmodel_transformer is not portable across machines due to the hardcoding of output_dir:
https://github.com/chainsawriot/grafzahl/blob/e8b2f81ac47d026c95b3a069a94e075da6cceb21/R/train.R#L52
Until a clever solution is available, a stupid way to serialize the model object is to save the textmodel_transformer object together with the whole directory of output_dir. When deserializing, extract that directory into a temporary directory and then rewrite output_dir to point there.
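A sketch of that stupid-but-workable approach. The function names and the list-style access to output_dir are hypothetical, not grafzahl's actual API:

```r
# Bundle the model object with a tarball of output_dir; on load,
# unpack into a temp directory and rewrite output_dir to point there.
serialize_model <- function(model, path) {
  tarfile <- tempfile(fileext = ".tar")
  old_wd <- setwd(dirname(model$output_dir))
  on.exit(setwd(old_wd))
  utils::tar(tarfile, files = basename(model$output_dir))
  saveRDS(list(model = model,
               payload = readBin(tarfile, "raw", n = file.size(tarfile))),
          path)
}

deserialize_model <- function(path) {
  bundle <- readRDS(path)
  tarfile <- tempfile(fileext = ".tar")
  writeBin(bundle$payload, tarfile)
  exdir <- tempfile("grafzahl_model_")
  dir.create(exdir)
  utils::untar(tarfile, exdir = exdir)
  bundle$model$output_dir <- file.path(exdir, basename(bundle$model$output_dir))
  bundle$model
}
```

The tarball travels inside the .rds file, so a single file can be copied to another machine.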
The label y must be base-0. Therefore, as.factor()-ing a vector will create errors in the downstream Python task, because as.factor() codes are base-1.
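A quick illustration of the pitfall, in plain R and independent of grafzahl:

```r
# as.factor() codes levels starting from 1; Python-side label ids
# must start from 0, so subtract 1 from the integer codes.
y <- c("pos", "neg", "pos")
as.integer(as.factor(y))        # 2 1 2  (base-1, breaks the Python side)
as.integer(as.factor(y)) - 1L   # 1 0 1  (base-0)
```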
What is used to download models? Two tests fail for me. Should be fixable, I guess.
R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: powerpc-apple-darwin10.8.0 (32-bit)
> # This file is part of the standard setup for testthat.
> # It is recommended that you do not modify it.
> #
> # Where should you do additional test configuration?
> # Learn more about the roles of various files in:
> # * https://r-pkgs.org/tests.html
> # * https://testthat.r-lib.org/reference/test_package.html#special-files
>
> library(testthat)
> library(grafzahl)
>
> test_check("grafzahl")
[ FAIL 2 | WARN 0 | SKIP 0 | PASS 10 ]
── Failed tests ────────────────────────────────────────────────────────────────
── Failure ('test_grafzahl.R:14:5'): .infer local ──────────────────────────────
`.infer_model_type("../testdata/fake")` threw an unexpected error.
Message: Fail to download the model `../testdata/fake` from Hugging Face
Class: simpleError/error/condition
Backtrace:
    ▆
 1. ├─testthat::expect_error(...) at test_grafzahl.R:14:4
 2. │ └─testthat:::expect_condition_matching(...)
 3. │   └─testthat:::quasi_capture(...)
 4. │     ├─testthat (local) .capture(...)
 5. │     │ └─base::withCallingHandlers(...)
 6. │     └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
 7. └─grafzahl:::.infer_model_type("../testdata/fake")
 8.   └─grafzahl:::.download_from_huggingface(model_name)
 9.     └─base::tryCatch(...)
10.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
11.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
12.           └─value[[3L]](cond)
── Error ('test_grafzahl.R:15:5'): .infer local ────────────────────────────────
Error: Fail to download the model `../testdata/fake` from Hugging Face
Backtrace:
    ▆
 1. ├─testthat::expect_equal(...) at test_grafzahl.R:15:4
 2. │ └─testthat::quasi_label(enquo(object), label, arg = "object")
 3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
 4. └─grafzahl:::.infer_model_type("../testdata/fake")
 5.   └─grafzahl:::.download_from_huggingface(model_name)
 6.     └─base::tryCatch(...)
 7.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 8.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9.           └─value[[3L]](cond)
[ FAIL 2 | WARN 0 | SKIP 0 | PASS 10 ]
Error: Test failures
Execution halted
The randomization step is not deterministic on the Python side. Both simpletransformers and the train/test split need randomness.
grafzahl_condaenv_cuda should have higher priority.
Is it possible to tune the hyperparameters of the model we decide to employ via grafzahl (such as the number of epochs, batch size, learning rate, etc.)?
Thanks for any possible help!
Luigi