gesistsa / grafzahl
fine-tuning Transformers for text data from within R
Home Page: https://gesistsa.github.io/grafzahl/
License: GNU General Public License v3.0
At least the R part.
There seems to be an issue with the newly released pyarrow 9.0.0. It is currently installed through pip; maybe it would be better to install it through conda instead.
Like a computer scientist and put it in the paper.
untested
When running grafzahl() on macOS Monterey following the example code I get
Error in py_run_file_impl(path.expand(file), local, convert) :
ModuleNotFoundError: No module named 'torch'
Sounds a lot like:
rstudio/reticulate#909
Any ideas?
paper
in paths
I have tried to install grafzahl (using Transformers within the quanteda infrastructure is a great choice!). Everything works fine with setup_grafzahl(cuda = TRUE) (I have GPUs, by the way) until the last line, which shows an error: "Error in .install_gpu_pytorch(cuda_version = cuda_version) : Cannot set up `pytorch". Unfortunately, this error prevents grafzahl from working correctly. Any suggestion on how to solve it? Thanks! Luigi
Hello there,
thanks for this amazing package. I tried it out yesterday and I was able to train a model. However, once I wanted to use the trained model to predict new data, I received the following error message:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.
The error message appears whether I set use_cuda to TRUE or FALSE.
Do you have any idea why I receive this error message or what I could do?
Thanks in advance
Tobias
Conduct detect_conda and detect_cuda after the current setup_grafzahl procedure.
Hi there,
I'm having a bit of trouble getting grafzahl to look for miniconda in the right place. Since I had to install miniconda manually, it's not in the usual folder (r-miniconda) but in another one (miniconda3). I saw that this was an issue before (#20) and tried to solve it by specifying the right path for both RETICULATE_MINICONDA_PATH and GRAFZAHL_MINICONDA_PATH. However, detect_conda() still returns FALSE.
The problem, I suspect, is that the .gen_conda_path function adds "bin" and "conda" to the path, which then points to a folder/file that doesn't exist within miniconda3. In my case, the right path would be "condabin" and then "conda" (I guess). I don't know whether this is a version- or system-related issue, but any idea on how to fix it, or some workaround, would be greatly appreciated.
Thanks!
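For anyone hitting the same problem, a minimal workaround sketch (assuming grafzahl reads GRAFZAHL_MINICONDA_PATH when it looks for conda, as described above; the path is an example, adjust to your system):

```r
# Hypothetical workaround: point both reticulate and grafzahl at a
# manually installed miniconda before loading the package.
conda_home <- path.expand("~/miniconda3")
Sys.setenv(RETICULATE_MINICONDA_PATH = conda_home)
Sys.setenv(GRAFZAHL_MINICONDA_PATH = conda_home)
# library(grafzahl); detect_conda() should now look under ~/miniconda3
```

This does not fix the "bin" vs "condabin" issue itself, but it at least rules out a wrong base path as the cause.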
Given the discussion about which layer to keep as a token's representation in downstream analysis (Jawahar et al., 2019; Ethayarajh, 2019) when using, for example, a pre-trained BERT model, I was wondering whether you are considering allowing the user to select a specific layer when fine-tuning a Transformer via grafzahl. At the moment, which layer is used when, for example, I specify model_name = "bert-base-uncased"? Thanks for your great package!
Should be the default to nudge users to adopt the best practice.
So that testing is possible if we install miniconda to a writable temp. directory.
The current R scripts are kind of chaotic
simpletransformers 0.7
require(grafzahl)
require(quanteda)
download.file("https://huggingface.co/datasets/israel/Amharic-News-Text-classification-Dataset/resolve/main/train.csv", destfile = "am_train.csv")
input <- read.csv("am_train.csv")
input_corpus <- corpus(input, text_field = "article") %>% corpus_subset(category != "")
model <- grafzahl(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")
Error
Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
Traceback:
1. grafzahl(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")
2. grafzahl.corpus(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")
3. py_train(data = input_data, num_labels = num_labels, output_dir = output_dir,
. best_model_dir = best_model_dir, cache_dir = cache_dir, model_type = model_type,
. model_name = model_name, num_train_epochs = num_train_epochs,
. train_size = train_size, manual_seed = manual_seed, regression = regression,
. verbose = verbose)
4. py_call_impl(callable, call_args$unnamed, call_args$named)
The Description field is intended to be a (one paragraph) description of
what the package does and why it may be useful. Please add more details
about the package functionality and implemented methods in your
Description text.
If there are references describing the methods in your package, please
add these in the description field of your DESCRIPTION file in the form
authors (year) doi:...
authors (year) arXiv:...
authors (year, ISBN:...)
or if those are not available: https:...
with no space after 'doi:', 'arXiv:', 'https:' and angle brackets for
auto-linking. (If you want to add a title as well please put it in
quotes: "Title")
The Description field should start with a capital letter.
Please add \value to .Rd files regarding exported methods and explain
the functions results in the documentation. Please write about the
structure of the output (class) and also what the output means. (If a
function does not return a value, please document that too, e.g.
\value{No return value, called for side effects} or similar)
Missing Rd-tags: hydrate.Rd: \value
\dontrun{} should only be used if the example really cannot be executed
(e.g. because of missing additional software, missing API keys, ...) by
the user. That's why wrapping examples in \dontrun{} adds the comment
("# Not run:") as a warning for the user. Does not seem necessary.
Please replace \dontrun with \donttest.
Please unwrap the examples if they are executable in < 5 sec, or replace
dontrun{} with \donttest{}.
Please add small executable examples in your Rd-files to illustrate the use of the exported function but also enable automatic testing.
Please ensure that your functions do not write by default or in your examples/vignettes/tests in the user's home filespace (including the package directory and getwd()). This is not allowed by CRAN policies. Please omit any default path in writing functions. In your examples/vignettes/tests you can write to tempdir(). -> R/train.R
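A minimal sketch of what the reviewer is asking for; the grafzahl() call is commented out, and its arguments are taken from the examples elsewhere on this page:

```r
# Write all generated files to a session-specific temporary directory,
# never to the user's home filespace or getwd() (CRAN policy).
out_dir <- file.path(tempdir(), "grafzahl_output")
dir.create(out_dir, showWarnings = FALSE)
# model <- grafzahl(x = input_corpus, y = "category",
#                   model_name = "bert-base-uncased",
#                   output_dir = out_dir)
```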
The idea probably would be: Make grafzahl usable with non-Conda Python, e.g. the Python provided by Colab. As of writing, that's 3.8.
Install all the Python dependencies: pandas, emoji, tqdm, simpletransformers
Is it possible?
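A sketch of the manual dependency installation on a non-conda Python such as Colab's; the package list is the one given above, and no version pins are implied:

```shell
# Install grafzahl's Python dependencies into the system Python via pip
pip install pandas emoji tqdm simpletransformers
```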
Hi @chainsawriot
First of all, thank you for your work and another great package! After having read the CCR software announcement, I wanted to check out grafzahl. In particular, I wanted to see whether using it on my notebook without a CUDA GPU would make any sense at all.
The setup process worked smoothly. I then replicated the Theocharis et al. (2020) example. Model training worked fine (although it needed 11.5h, but that was to be expected). However, the predict() step did not work.
Input:
pred_bert <- predict(object = model, newdata = unciviltweets[test], cuda = FALSE)
Error:
"Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False"
I share a reproducible example below, with a nonsensical reduction of the training set to make it finish within a sensible timeframe. The error message remains the same.
Thanks again!
pacman::p_load(grafzahl, quanteda, caret, tictoc, tidyverse)
uncivildfm <- unciviltweets %>%
  tokens(remove_url = TRUE, remove_numbers = TRUE) %>%
  tokens_wordstem() %>%
  dfm() %>%
  dfm_remove(stopwords("english")) %>%
  dfm_trim(min_docfreq = 2)
y <- docvars(unciviltweets)[,1]
seed <- 123
set.seed(seed)
training <- original_training <- sample(seq_along(y), floor(.80 * length(y)))
test <- (seq_along(y))[seq_along(y) %in% training == FALSE]
set.seed(721)
tic()
model <- grafzahl(unciviltweets[original_training[1:20]], model_type = "bertweet", model_name = "vinai/bertweet-base", output_dir = here::here("reprex"))
toc()
pred_bert <- predict(object = model, newdata = unciviltweets[test], cuda = FALSE)
The current textmodel_transformer is not portable across machines due to the hardcoding of output_dir:
https://github.com/chainsawriot/grafzahl/blob/e8b2f81ac47d026c95b3a069a94e075da6cceb21/R/train.R#L52
Until a clever solution is available, a stupid way to serialize the model object is to save the textmodel_transformer object together with the whole directory of output_dir. When deserializing, extract that directory into a temporary directory and then rewrite output_dir to point there.
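A sketch of that stupid-but-workable approach. The function names and the list-style access to output_dir are hypothetical, not grafzahl's actual API:

```r
# Bundle the model object with a tarball of output_dir; on load,
# unpack into a temp directory and rewrite output_dir to point there.
serialize_model <- function(model, path) {
  tarfile <- tempfile(fileext = ".tar")
  old_wd <- setwd(dirname(model$output_dir))
  on.exit(setwd(old_wd))
  utils::tar(tarfile, files = basename(model$output_dir))
  saveRDS(list(model = model,
               payload = readBin(tarfile, "raw", n = file.size(tarfile))),
          path)
}

deserialize_model <- function(path) {
  bundle <- readRDS(path)
  tarfile <- tempfile(fileext = ".tar")
  writeBin(bundle$payload, tarfile)
  exdir <- tempfile("grafzahl_model_")
  dir.create(exdir)
  utils::untar(tarfile, exdir = exdir)
  bundle$model$output_dir <- file.path(exdir, basename(bundle$model$output_dir))
  bundle$model
}
```

The tarball travels inside the .rds file, so a single file can be copied to another machine.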
The label y must be base-0. Therefore, as.factor()-ing a vector will create errors in the downstream Python task, because as.factor() codes are base-1.
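A quick illustration of the pitfall, in plain R and independent of grafzahl:

```r
# as.factor() codes levels starting from 1; Python-side label ids
# must start from 0, so subtract 1 from the integer codes.
y <- c("pos", "neg", "pos")
as.integer(as.factor(y))        # 2 1 2  (base-1, breaks the Python side)
as.integer(as.factor(y)) - 1L   # 1 0 1  (base-0)
```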
What is used to download models? Two tests fail for me. Should be fixable, I guess.
R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: powerpc-apple-darwin10.8.0 (32-bit)
> # This file is part of the standard setup for testthat.
> # It is recommended that you do not modify it.
> #
> # Where should you do additional test configuration?
> # Learn more about the roles of various files in:
> # * https://r-pkgs.org/tests.html
> # * https://testthat.r-lib.org/reference/test_package.html#special-files
>
> library(testthat)
> library(grafzahl)
>
> test_check("grafzahl")
[ FAIL 2 | WARN 0 | SKIP 0 | PASS 10 ]
── Failed tests ────────────────────────────────────────────────────────────────
── Failure ('test_grafzahl.R:14:5'): .infer local ──────────────────────────────
`.infer_model_type("../testdata/fake")` threw an unexpected error.
Message: Fail to download the model `../testdata/fake` from Hugging Face
Class: simpleError/error/condition
Backtrace:
    ▆
 1. ├─testthat::expect_error(...) at test_grafzahl.R:14:4
 2. │ └─testthat:::expect_condition_matching(...)
 3. │   └─testthat:::quasi_capture(...)
 4. │     ├─testthat (local) .capture(...)
 5. │     │ └─base::withCallingHandlers(...)
 6. │     └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
 7. └─grafzahl:::.infer_model_type("../testdata/fake")
 8.   └─grafzahl:::.download_from_huggingface(model_name)
 9.     └─base::tryCatch(...)
10.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
11.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
12.           └─value[[3L]](cond)
── Error ('test_grafzahl.R:15:5'): .infer local ────────────────────────────────
Error: Fail to download the model `../testdata/fake` from Hugging Face
Backtrace:
    ▆
 1. ├─testthat::expect_equal(...) at test_grafzahl.R:15:4
 2. │ └─testthat::quasi_label(enquo(object), label, arg = "object")
 3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
 4. └─grafzahl:::.infer_model_type("../testdata/fake")
 5.   └─grafzahl:::.download_from_huggingface(model_name)
 6.     └─base::tryCatch(...)
 7.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 8.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9.           └─value[[3L]](cond)
[ FAIL 2 | WARN 0 | SKIP 0 | PASS 10 ]
Error: Test failures
Execution halted
The randomization step is not deterministic on the Python side. Both simpletransformers and the train/test split need randomness.
grafzahl_condaenv_cuda should have higher priority.
Is it possible to tune the hyperparameters of the model we decide to employ via grafzahl (such as the number of epochs, batch size, learning rate, etc.)?
Thanks for any possible help!
Luigi