
grafzahl's People

Contributors

bachl, chainsawriot


Forkers

jbgruber, bachl

grafzahl's Issues

pyarrow 9 is problematic

There seems to be an issue with the newly released pyarrow 9.0.0, which is currently being installed through pip. It may be better to install it through conda instead.
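A minimal sketch of the conda route, assuming the package's environment is named "grafzahl_condaenv" (the actual environment name created by setup_grafzahl() may differ):

library(reticulate)
# Let conda resolve a pre-9 pyarrow instead of letting pip pull 9.0.0.
conda_install(envname = "grafzahl_condaenv", packages = "pyarrow<9", forge = TRUE)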

No module named 'torch'

When running grafzahl() on macOS Monterey following the example code, I get

Error in py_run_file_impl(path.expand(file), local, convert) :
ModuleNotFoundError: No module named 'torch'

Sounds a lot like:
rstudio/reticulate#909

Any ideas?
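A minimal diagnostic sketch using reticulate; the environment name "grafzahl_condaenv" is an assumption and may differ on your machine:

library(reticulate)
py_config()                   # which Python is reticulate actually using?
py_module_available("torch")  # FALSE means that Python has no PyTorch installed
# If the wrong interpreter is picked up, point reticulate at the right conda env
# before loading grafzahl, e.g.:
# use_condaenv("grafzahl_condaenv", required = TRUE)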

Post acceptance issues

  • EXAMPLES: Remove paper in paths
  • SP: GPU
  • SP: Mention examples
  • SP: ?de-emphasize quanteda
  • SP: " I would make sure that all outcomes are produced by the code snippets"
  • SP: URLs are readable / copy-pastable
  • WIKI: 'conda activate'

Installation problem

I have tried to install grafzahl (using Transformers within the quanteda infrastructure is a great choice!). Everything works fine with setup_grafzahl(cuda = TRUE) (I have GPUs, by the way) until the last line, which shows an error: "Error in .install_gpu_pytorch(cuda_version = cuda_version) : Cannot set up `pytorch". Unfortunately, this error prevents grafzahl from working correctly. Any suggestion on how to solve it? Thanks! Luigi
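A couple of checks worth running; the cuda = FALSE fallback is an assumption about setup_grafzahl()'s signature, inferred only from the cuda = TRUE call above:

# Is the NVIDIA driver visible at all, and which CUDA version does it report?
system("nvidia-smi")
# If GPU setup keeps failing, a CPU-only environment should still install
# (training will be much slower):
# setup_grafzahl(cuda = FALSE)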

Error in py_call_impl

Hello there,

thanks for this amazing package. I tried it out yesterday and I was able to train a model. However, once I wanted to use the trained model to predict new data, I received the following error message:

Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.

The error message appears whether I set use_cuda to TRUE or FALSE.

Do you have any idea why I receive this error message or what I could do?

Thanks in advance
Tobias

Make Grafzahl look for conda in the right place

Hi there,

I'm having a bit of trouble getting grafzahl to look for miniconda in the right place. Since I had to install miniconda manually, it is not in the usual folder (r-miniconda) but in another one (miniconda3). I saw that this was an issue before (#20) and tried to solve it by specifying the right path for both RETICULATE_MINICONDA_PATH and GRAFZAHL_MINICONDA_PATH. However, detect_conda() still returns FALSE.

The problem, I suspect, is that the .gen_conda_path function appends "bin" and "conda" to the path, which then points to a file that doesn't exist within miniconda3. In my case, the right path would be "condabin" and then "conda" (I guess). I don't know whether this is a version- or system-related issue, but any idea on how to fix this, or any workaround, would be greatly appreciated.

Thanks!
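A minimal sketch to confirm where the conda binary actually lives; the symlink at the end is a hypothetical stopgap, not an official fix:

miniconda <- "~/miniconda3"
file.exists(file.path(miniconda, "bin", "conda"))       # the path .gen_conda_path seems to expect
file.exists(file.path(miniconda, "condabin", "conda"))  # the path this installation actually has
# Possible stopgap: make the expected path point at the real binary.
# dir.create(file.path(miniconda, "bin"), showWarnings = FALSE)
# file.symlink(file.path(miniconda, "condabin", "conda"),
#              file.path(miniconda, "bin", "conda"))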

Layer extracted by grafzahl

Given the discussion about which layer to keep as a token's representation in downstream analysis (Jawahar et al., 2019; Ethayarajh, 2019), for example when using a pre-trained BERT model, I was wondering whether you are considering giving the user the option to select a specific layer when fine-tuning a Transformer via grafzahl. At the moment, which layer is used when I specify, for example, model_name = "bert-base-uncased"? Thanks for your great package!

Cross-validation

It should be the default, to nudge users towards best practice.
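A rough sketch of a manual k-fold loop, assuming a corpus x with a docvar "label"; only x, y, model_name and newdata are argument names actually shown in this repository's examples, everything else here is illustrative:

library(quanteda)
library(grafzahl)
k <- 5
y <- docvars(x, "label")
folds <- sample(rep(seq_len(k), length.out = ndoc(x)))
accuracy <- numeric(k)
for (i in seq_len(k)) {
  model <- grafzahl(x = x[folds != i], y = "label", model_name = "bert-base-uncased")
  pred <- predict(model, newdata = x[folds == i])
  accuracy[i] <- mean(pred == y[folds == i])
}
mean(accuracy)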

Multinomial classification does not work

simpletransformers 0.7

require(grafzahl)
require(quanteda)
download.file("https://huggingface.co/datasets/israel/Amharic-News-Text-classification-Dataset/resolve/main/train.csv", destfile = "am_train.csv")
input <- read.csv("am_train.csv")

input_corpus <- corpus(input, text_field = "article") %>% corpus_subset(category != "")
model <- grafzahl(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")

Error

Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

Traceback:

1. grafzahl(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")

2. grafzahl.corpus(x = input_corpus, y = "category", model_name = "castorini/afriberta_base")

3. py_train(data = input_data, num_labels = num_labels, output_dir = output_dir, 
 .     best_model_dir = best_model_dir, cache_dir = cache_dir, model_type = model_type, 
 .     model_name = model_name, num_train_epochs = num_train_epochs, 
 .     train_size = train_size, manual_seed = manual_seed, regression = regression, 
 .     verbose = verbose)

4. py_call_impl(callable, call_args$unnamed, call_args$named)

CRAN issues

  • The Description field is intended to be a (one paragraph) description of
    what the package does and why it may be useful. Please add more details
    about the package functionality and implemented methods in your
    Description text.

  • If there are references describing the methods in your package, please
    add these in the description field of your DESCRIPTION file in the form
    authors (year) <doi:...>
    authors (year) <arXiv:...>
    authors (year, ISBN:...)
    or if those are not available: <https:...>
    with no space after 'doi:', 'arXiv:', 'https:' and angle brackets for
    auto-linking. (If you want to add a title as well please put it in
    quotes: "Title")

  • The Description field should start with a capital letter.

  • Please add \value to .Rd files regarding exported methods and explain
    the functions results in the documentation. Please write about the
    structure of the output (class) and also what the output means. (If a
    function does not return a value, please document that too, e.g.
    \value{No return value, called for side effects} or similar)
    Missing Rd-tags: hydrate.Rd: \value

  • \dontrun{} should only be used if the example really cannot be executed
    (e.g. because of missing additional software, missing API keys, ...) by
    the user. That's why wrapping examples in \dontrun{} adds the comment
    ("# Not run:") as a warning for the user. Does not seem necessary.
    Please replace \dontrun with \donttest.

  • Please unwrap the examples if they are executable in < 5 sec, or replace
    \dontrun{} with \donttest{}.

  • Please add small executable examples in your Rd-files to illustrate the use of the exported function but also enable automatic testing.

  • Please ensure that your functions do not write by default or in your examples/vignettes/tests in the user's home filespace (including the package directory and getwd()). This is not allowed by CRAN policies. Please omit any default path in writing functions. In your examples/vignettes/tests you can write to tempdir(). -> R/train.R (a roxygen sketch addressing this and the missing \value tag follows this list)
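A minimal roxygen2 sketch addressing the \value and tempdir() points; the return class comes from the textmodel_transformer class mentioned elsewhere in these issues, and the rest is illustrative rather than the package's actual documentation:

#' @return A fitted `textmodel_transformer` object that can be passed to `predict()`.
#' @examples
#' \donttest{
#' ## write all training artifacts to a temporary directory, never to getwd()
#' out <- file.path(tempdir(), "grafzahl_example")
#' model <- grafzahl(x = input_corpus, y = "category",
#'                   model_name = "bert-base-uncased", output_dir = out)
#' }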

Make it work on Colab

The idea would probably be: make grafzahl usable with a non-conda Python, e.g. the Python provided by Colab (3.8 as of writing).

Install all the Python dependencies: pandas, emoji, tqdm, simpletransformers
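A minimal sketch of the non-conda route via reticulate; whether grafzahl would then detect and use this environment is exactly the open question of this issue:

library(reticulate)
py_install(c("pandas", "emoji", "tqdm", "simpletransformers"), pip = TRUE)
py_module_available("simpletransformers")  # should return TRUE afterwards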

predict.grafzahl() seems to ignore "cuda = FALSE"

Hi @chainsawriot
First of all, thank you for your work and another great package! After having read the CCR software announcement, I wanted to check out grafzahl. In particular, I wanted to see whether using it on my notebook without a CUDA GPU would make any sense at all.
The setup process worked smoothly. I then replicated the Theocharis et al. (2020) example. Model training worked fine (although it took 11.5 hours, that was to be expected). However, the predict() step did not work.
Input:

pred_bert <- predict(object = model, newdata = unciviltweets[test], cuda = FALSE)

Error:

"Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: 'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False"

I share a reproducible example below, with a nonsensically small training set so that it finishes within a sensible timeframe; the error message remains the same. A small CUDA diagnostic sketch follows the reprex.

Thanks again!

pacman::p_load(grafzahl, quanteda, caret, tictoc, tidyverse)
uncivildfm <- unciviltweets %>%
  tokens(remove_url = TRUE, remove_numbers = TRUE) %>%
  tokens_wordstem() %>%
  dfm() %>%
  dfm_remove(stopwords("english")) %>%
  dfm_trim(min_docfreq = 2)
y <- docvars(unciviltweets)[,1]
seed <- 123
set.seed(seed)
training <- original_training <- sample(seq_along(y), floor(.80 * length(y)))
test <- (seq_along(y))[seq_along(y) %in% training == FALSE]

set.seed(721)
tic()
model <- grafzahl(unciviltweets[original_training[1:20]], model_type = "bertweet", model_name = "vinai/bertweet-base", output_dir = here::here("reprex"))
toc()

pred_bert <- predict(object = model, newdata = unciviltweets[test], cuda = FALSE)
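A small diagnostic sketch to see what PyTorch itself reports from R; if this returns FALSE, that would suggest the use_cuda = True in the error is being set on the R side rather than reflecting the hardware:

library(reticulate)
torch <- import("torch")
torch$cuda$is_available()   # expected to be FALSE on a machine without a CUDA GPU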

Serialize the model object

The current textmodel_transformer is not portable across machines due to the hardcoding of output_dir:

https://github.com/chainsawriot/grafzahl/blob/e8b2f81ac47d026c95b3a069a94e075da6cceb21/R/train.R#L52

Until a cleverer solution is available, a crude way to serialize the model object is to save the textmodel_transformer object together with the whole output_dir directory. When deserializing, extract that directory into a temporary directory and then rewrite output_dir to point to it.
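A rough sketch of that stopgap; save_grafzahl()/load_grafzahl() are hypothetical helper names, and it is assumed the fitted object stores the path in model$output_dir:

save_grafzahl <- function(model, file) {
  # bundle the R object together with a tarball of the whole output_dir
  tarball <- tempfile(fileext = ".tar")
  old <- setwd(dirname(model$output_dir)); on.exit(setwd(old))
  utils::tar(tarball, files = basename(model$output_dir))
  saveRDS(list(model = model,
               bundle = readBin(tarball, "raw", file.size(tarball))), file)
}

load_grafzahl <- function(file) {
  # unpack the tarball into a temporary directory and repoint output_dir
  obj <- readRDS(file)
  tarball <- tempfile(fileext = ".tar")
  writeBin(obj$bundle, tarball)
  exdir <- tempfile("grafzahl_model_")
  utils::untar(tarball, exdir = exdir)
  obj$model$output_dir <- file.path(exdir, basename(obj$model$output_dir))
  obj$model
}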

Base 0 for y

The labels in y must be zero-based. Therefore, converting a vector with as.factor() will create an error in the downstream Python task, because as.factor() codes levels starting from 1.
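A two-line sketch of the conversion the Python side expects; whether grafzahl should do this internally is the point of this issue:

y <- c("neg", "pos", "neg", "neu")
as.integer(as.factor(y))        # 1-based codes: 1 3 1 2
as.integer(as.factor(y)) - 1L   # 0-based codes: 0 2 0 1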

Two tests fail due to: Fail to download the model `../testdata/fake` from Hugging Face

What is used to download models? Two tests fail for me. Should be fixable, I guess.

R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: powerpc-apple-darwin10.8.0 (32-bit)

> # This file is part of the standard setup for testthat.
> # It is recommended that you do not modify it.
> #
> # Where should you do additional test configuration?
> # Learn more about the roles of various files in:
> # * https://r-pkgs.org/tests.html
> # * https://testthat.r-lib.org/reference/test_package.html#special-files
> 
> library(testthat)
> library(grafzahl)
> 
> test_check("grafzahl")
[ FAIL 2 | WARN 0 | SKIP 0 | PASS 10 ]

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test_grafzahl.R:14:5'): .infer local ──────────────────────────────
`.infer_model_type("../testdata/fake")` threw an unexpected error.
Message: Fail to download the model `../testdata/fake` from Hugging Face
Class:   simpleError/error/condition
Backtrace:
     ▆
  1. ├─testthat::expect_error(...) at test_grafzahl.R:14:4
  2. │ └─testthat:::expect_condition_matching(...)
  3. │   └─testthat:::quasi_capture(...)
  4. │     ├─testthat (local) .capture(...)
  5. │     │ └─base::withCallingHandlers(...)
  6. │     └─rlang::eval_bare(quo_get_expr(.quo), quo_get_env(.quo))
  7. └─grafzahl:::.infer_model_type("../testdata/fake")
  8.   └─grafzahl:::.download_from_huggingface(model_name)
  9.     └─base::tryCatch(...)
 10.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 11.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 12.           └─value[[3L]](cond)
── Error ('test_grafzahl.R:15:5'): .infer local ────────────────────────────────
Error: Fail to download the model `../testdata/fake` from Hugging Face
Backtrace:
    ▆
 1. ├─testthat::expect_equal(...) at test_grafzahl.R:15:4
 2. │ └─testthat::quasi_label(enquo(object), label, arg = "object")
 3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
 4. └─grafzahl:::.infer_model_type("../testdata/fake")
 5.   └─grafzahl:::.download_from_huggingface(model_name)
 6.     └─base::tryCatch(...)
 7.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 8.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9.           └─value[[3L]](cond)

[ FAIL 2 | WARN 0 | SKIP 0 | PASS 10 ]
Error: Test failures
Execution halted
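If the failures are purely about network access to Hugging Face, one possible mitigation is to skip these expectations when offline; this is a sketch, not necessarily the intended fix:

# at the top of the affected test_that() block in tests/testthat/test_grafzahl.R
testthat::skip_if_offline(host = "huggingface.co")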

Tuning hyperparameters

Is it possible to tune the hyperparameters of the model we decide to employ via grafzahl (such as the number of epochs, batch size, learning rate, etc.)?
Thanks for any possible help!
Luigi
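A hedged sketch of tuning the number of epochs by hand; num_train_epochs and manual_seed appear in the py_train() traceback above, but check the grafzahl() documentation for which simpletransformers options (batch size, learning rate, ...) are actually exposed:

epoch_grid <- c(2, 4, 8)
models <- lapply(epoch_grid, function(e) {
  grafzahl(x = input_corpus, y = "category",
           model_name = "bert-base-uncased",
           num_train_epochs = e, manual_seed = 721)
})
# evaluate each fitted model on a held-out set with predict() and keep the best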
