Tokenizers Ruby

🙂 Fast state-of-the-art tokenizers for Ruby

Installation

Add this line to your application’s Gemfile:

gem "tokenizers"

Getting Started

Load a pretrained tokenizer

tokenizer = Tokenizers.from_pretrained("bert-base-cased")

Encode

encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.tokens
encoded.ids

Decode

tokenizer.decode(encoded.ids)

Training

Create a tokenizer

tokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: "[UNK]"))

Set the pre-tokenizer

tokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new

Train the tokenizer (example data)

trainer = Tokenizers::Trainers::BpeTrainer.new(special_tokens: ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)

Encode

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
output.tokens

Save the tokenizer to a file

tokenizer.save("tokenizer.json")

Load a tokenizer from a file

tokenizer = Tokenizers.from_file("tokenizer.json")

Check out the Quicktour and equivalent Ruby code for more info

API

This library follows the Tokenizers Python API. You can follow Python tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

To get started with development:

git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test

tokenizers-ruby's People

Contributors

ankane, petergoldstein

tokenizers-ruby's Issues

Version 0.2 relying on an old version of libssl

Hey @ankane !

Thank you for making this gem, it's great for Rubyists like us to still be able to play easily with the new AI trend ;)
I love that you embedded the libs in the 0.2 version of the gem, as compilation was really taking forever and required way too many libs. However, it seems that it now relies on an outdated version of libssl (libssl.so.10), and I haven't figured out how to get it running on Ubuntu in a prod environment.

Do you think it's possible to upgrade it easily?

Thanks!

Caused by:
LoadError: libssl.so.10: cannot open shared object file: No such file or directory - /shared/bundle/ruby/3.1.0/gems/tokenizers-0.2.1-x86_64-linux/lib/tokenizers/3.1/tokenizers.so
/shared/bundle/ruby/3.1.0/gems/tokenizers-0.2.1-x86_64-linux/lib/tokenizers.rb:3:in `require'
/shared/bundle/ruby/3.1.0/gems/tokenizers-0.2.1-x86_64-linux/lib/tokenizers.rb:3:in `<top (required)>'
/releases/20230113084850/config/application.rb:19:in `<top (required)>'
/releases/20230113084850/Rakefile:4:in `require_relative'
/releases/20230113084850/Rakefile:4:in `<top (required)>'
/shared/bundle/ruby/3.1.0/gems/rake-13.0.6/exe/rake:27:in `<top (required)>'
/home/deploy/.asdf/installs/ruby/3.1.3/bin/bundle:25:in `load'
/home/deploy/.asdf/installs/ruby/3.1.3/bin/bundle:25:in `<main>'
(See full trace by running task with --trace)
rails aborted!

Using tiktoken

Hey thanks for writing up this gem! (and all the other ML ones).

This may be a naive question but can tiktoken be invoked as a tokenizer?

I get an error when installing

Hello!
Thanks for making this gem.

But it seems to fail to install in my environment.

gem install tokenizers

I get the following error message

Building native extensions. This could take a while...
ERROR:  Error installing tokenizers:
	ERROR: Failed to build gem native extension.

    current directory: /home/kojix2/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/tokenizers-0.1.1/ext/tokenizers
/home/kojix2/.rbenv/versions/3.1.2/bin/ruby -I /home/kojix2/.rbenv/versions/3.1.2/lib/ruby/3.1.0 -r ./siteconf20220909-19701-2a0rv0.rb extconf.rb

current directory: /home/kojix2/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/tokenizers-0.1.1/ext/tokenizers
make DESTDIR\= clean
make: Nothing to be done for 'clean'.

current directory: /home/kojix2/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/tokenizers-0.1.1/ext/tokenizers
make DESTDIR\=
cargo build --release --target-dir target
   Compiling libc v0.2.121
   Compiling cfg-if v1.0.0
   Compiling autocfg v1.1.0
   Compiling cc v1.0.73
   Compiling pkg-config v0.3.24
   Compiling proc-macro2 v1.0.36
   Compiling unicode-xid v0.2.2
   Compiling syn v1.0.89
   Compiling memchr v2.3.4
   Compiling lazy_static v1.4.0
   Compiling log v0.4.14
   Compiling version_check v0.9.4
   Compiling pin-project-lite v0.2.8
   Compiling bitflags v1.3.2
   Compiling bytes v1.1.0
   Compiling futures-core v0.3.21
   Compiling once_cell v1.10.0
   Compiling itoa v1.0.1
   Compiling futures-task v0.3.21
   Compiling typenum v1.15.0
   Compiling crossbeam-utils v0.8.8
   Compiling serde_derive v1.0.136
   Compiling serde v1.0.136
   Compiling foreign-types-shared v0.1.1
   Compiling fnv v1.0.7
   Compiling futures-util v0.3.21
   Compiling openssl v0.10.38
   Compiling ryu v1.0.9
   Compiling pin-utils v0.1.0
   Compiling unicode-width v0.1.9
   Compiling hashbrown v0.11.2
   Compiling native-tls v0.2.8
   Compiling futures-io v0.3.21
   Compiling slab v0.4.5
   Compiling futures-channel v0.3.21
   Compiling futures-sink v0.3.21
   Compiling tinyvec_macros v0.1.0
   Compiling matches v0.1.9
   Compiling httparse v1.6.0
   Compiling crc32fast v1.3.2
   Compiling radium v0.5.3
   Compiling percent-encoding v2.1.0
   Compiling adler v1.0.2
   Compiling strsim v0.9.3
   Compiling getrandom v0.1.16
   Compiling try-lock v0.2.3
   Compiling ident_case v1.0.1
   Compiling scopeguard v1.1.0
   Compiling openssl-probe v0.1.5
   Compiling ppv-lite86 v0.2.16
   Compiling regex-syntax v0.6.25
   Compiling rayon-core v1.9.1
   Compiling either v1.6.1
   Compiling lexical-core v0.7.6
   Compiling httpdate v1.0.2
   Compiling encoding_rs v0.8.30
   Compiling tower-service v0.3.1
   Compiling unicode-bidi v0.3.7
   Compiling static_assertions v1.1.0
   Compiling wyz v0.2.0
   Compiling tap v1.0.1
   Compiling serde_json v1.0.79
   Compiling funty v1.1.0
   Compiling byteorder v1.4.3
   Compiling arrayvec v0.5.2
   Compiling cpufeatures v0.2.2
   Compiling derive_builder v0.9.0
   Compiling ipnet v2.4.0
   Compiling fastrand v1.7.0
   Compiling remove_dir_all v0.5.3
   Compiling mime v0.3.16
   Compiling number_prefix v0.4.0
   Compiling base64 v0.13.0
   Compiling unicode-segmentation v1.9.0
   Compiling glob v0.3.0
   Compiling base64 v0.12.3
   Compiling number_prefix v0.3.0
   Compiling macro_rules_attribute-proc_macro v0.0.2
   Compiling vec_map v0.8.2
   Compiling strsim v0.8.0
   Compiling rutie v0.8.4
   Compiling ansi_term v0.12.1
   Compiling smallvec v1.8.0
   Compiling unicode_categories v0.1.1
   Compiling paste v1.0.6
   Compiling tracing-core v0.1.23
   Compiling memoffset v0.6.5
   Compiling indexmap v1.8.0
   Compiling miniz_oxide v0.4.4
   Compiling crossbeam-epoch v0.9.8
   Compiling rayon v1.5.1
   Compiling generic-array v0.14.5
   Compiling nom v6.2.1
   Compiling foreign-types v0.3.2
   Compiling http v0.2.6
   Compiling textwrap v0.11.0
   Compiling tinyvec v1.5.1
   Compiling openssl-sys v0.9.72
   Compiling bzip2-sys v0.1.11+1.0.8
   Compiling onig_sys v69.7.1
   Compiling esaxx-rs v0.1.7
   Compiling form_urlencoded v1.0.1
   Compiling itertools v0.8.2
   Compiling itertools v0.9.0
   Compiling macro_rules_attribute v0.0.2
   Compiling unicode-normalization-alignments v0.1.12
   Compiling tracing v0.1.32
   Compiling unicode-normalization v0.1.19
   Compiling aho-corasick v0.7.15
   Compiling num_cpus v1.13.1
   Compiling socket2 v0.4.4
   Compiling getrandom v0.2.5
   Compiling terminal_size v0.1.17
   Compiling time v0.1.43
   Compiling filetime v0.2.15
   Compiling xattr v0.2.2
   Compiling fs2 v0.4.3
   Compiling atty v0.2.14
   Compiling tempfile v3.3.0
   Compiling dirs-sys v0.3.7
   Compiling http-body v0.4.4
   Compiling mio v0.8.2
   Compiling want v0.3.0
   Compiling quote v1.0.16
   Compiling crossbeam-channel v0.5.4
   Compiling bitvec v0.19.6
   Compiling regex v1.4.6
   Compiling idna v0.2.3
   Compiling rand_core v0.6.3
   Compiling rand_core v0.5.1
   Compiling tar v0.4.38
   Compiling clap v2.34.0
   Compiling dirs v3.0.2
   Compiling tokio v1.17.0
   Compiling flate2 v1.0.22
   Compiling block-buffer v0.10.2
   Compiling crypto-common v0.1.3
   Compiling url v2.2.2
   Compiling rand_chacha v0.3.1
   Compiling rand_chacha v0.2.2
   Compiling console v0.15.0
   Compiling bzip2 v0.4.3
   Compiling crossbeam-deque v0.8.1
   Compiling digest v0.10.3
   Compiling rand v0.8.5
   Compiling rand v0.7.3
   Compiling tokio-util v0.6.9
   Compiling indicatif v0.16.2
   Compiling indicatif v0.15.0
   Compiling darling_core v0.10.2
   Compiling onig v6.3.1
   Compiling sha2 v0.10.2
   Compiling tokio-native-tls v0.3.0
   Compiling h2 v0.3.12
   Compiling thiserror-impl v1.0.30
   Compiling darling_macro v0.10.2
   Compiling darling v0.10.2
   Compiling derive_builder_core v0.9.0
   Compiling thiserror v1.0.30
   Compiling zip v0.5.13
   Compiling zip-extensions v0.6.1
   Compiling rayon-cond v0.1.0
   Compiling hyper v0.14.17
   Compiling serde_urlencoded v0.7.1
   Compiling spm_precompiled v0.1.3
   Compiling hyper-tls v0.5.0
   Compiling reqwest v0.11.10
   Compiling cached-path v0.5.3
   Compiling tokenizers v0.11.3
   Compiling tokenizers-ruby v0.1.0 (/home/kojix2/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/tokenizers-0.1.1)
    Finished release [optimized] target(s) in 1m 22s
mv target/release/libtokenizers.so ../../lib/tokenizers/ext.so

current directory: /home/kojix2/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/tokenizers-0.1.1/ext/tokenizers
make DESTDIR\= install
cargo build --release --target-dir target
    Finished release [optimized] target(s) in 0.09s
mv target/release/libtokenizers.so ../../lib/tokenizers/ext.so
mv: 'target/release/libtokenizers.so' and '../../lib/tokenizers/ext.so' are the same file
make: *** [Makefile:3: install] Error 1

make install failed, exit code 2

Gem files will remain installed in /home/kojix2/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/tokenizers-0.1.1 for inspection.
Results logged to /home/kojix2/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/extensions/x86_64-linux/3.1.0/tokenizers-0.1.1/gem_make.out

But I was able to try it using the developer's method.

git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec ruby ext/tokenizers/extconf.rb && make
bundle exec rake download:files
bundle exec rake test

Tried GPT-2 with onnxruntime!
It's working just fine!

require "tokenizers"
require "onnxruntime"
require "numo/narray"

tokenizer = Tokenizers.from_pretrained("gpt2")
model = OnnxRuntime::Model.new("gpt2-lm-head-10.onnx")

s = "Why do cats want to ride on the keyboard?"

ids = tokenizer.encode(s).ids

10.times do
  o = model.predict({ input1: [[ids]] })
  o = Numo::DFloat.cast(o["output1"][0])
  ids << o[true, -1, true].argmax
end

puts tokenizer.decode(ids)

🐈
⌨️

Why do cats want to ride on the keyboard?

The answer is that they do.

Add optional punctuation cleanup during decoding - clean_up_tokenization equivalent

Hi there!

I'm really glad you have made this gem available. It's just the best. I do have one small issue: when you decode a previously encoded string containing punctuation marks, extra spaces are added before them.

E.g.:

tokenizer = Tokenizers.from_pretrained(TOKENIZER_ID)
encoding = tokenizer.encode("Who are you?")
tokenizer.decode(encoding.ids) # => "Who are you ?"

I did a little research into how this is handled in the Python implementation: there is a method called clean_up_tokenization, which is called in the last if statement of the _decode method.

So I have a few questions:

  1. Would you consider adding a similar feature to this gem for optional punctuation cleanup?
  2. If so, would it be better to implement this cleanup logic in Ruby or Rust?

I'm happy to contribute to a solution if this change aligns with your project goals. I'm comfortable working in either Ruby or Rust. Please let me know how you'd like to proceed!
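For illustration, a Ruby-side cleanup could be sketched as below. This is a minimal sketch, not part of the gem: the method name clean_up_tokenization and the CLEANUP_RULES constant are hypothetical, and the replacement pairs are modeled on the ones the Python method applies, so they should be checked against the Python source before relying on them.

```ruby
# Hypothetical helper, not part of the tokenizers gem.
# The replacement pairs are assumed to mirror the Python clean_up_tokenization method.
CLEANUP_RULES = [
  [" .", "."],
  [" ?", "?"],
  [" !", "!"],
  [" ,", ","],
  [" ' ", "'"],
  [" n't", "n't"],
  [" 'm", "'m"],
  [" 's", "'s"],
  [" 've", "'ve"],
  [" 're", "'re"]
].freeze

# Applies each literal replacement in order to the decoded string.
def clean_up_tokenization(text)
  CLEANUP_RULES.reduce(text) { |out, (from, to)| out.gsub(from, to) }
end

puts clean_up_tokenization("Who are you ?")
```

Such a helper would be applied to the string returned by tokenizer.decode, which keeps the cleanup optional and leaves the Rust extension untouched.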

Error when loading tokenizers gem

Using ruby:3.2.2 docker image:

$ bin/bundle
[...]
Installing tokenizers 0.4.0 (x86_64-linux)
Bundle complete! 17 Gemfile dependencies, 77 gems now installed.
Use `bundle info [gemname]` to see where a bundled gem is installed.

Launching a rails console fails with:

$ bin/rails c
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Exception(#<NameError: uninitialized constant Tokenizers>))', ext/tokenizers/src/lib.rs:25:97
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
<internal:/usr/local/lib/ruby/site_ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:38:in `require': called `Result::unwrap()` on an `Err` value: Error(Exception(#<NameError: uninitialized constant Tokenizers>)) (fatal)
        from <internal:/usr/local/lib/ruby/site_ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:38:in `require'
        from /usr/local/bundle/gems/tokenizers-0.4.0-x86_64-linux/lib/tokenizers.rb:3:in `<top (required)>'
        from /usr/local/bundle/gems/bundler-2.4.10/lib/bundler/runtime.rb:60:in `require'
        from /usr/local/bundle/gems/bundler-2.4.10/lib/bundler/runtime.rb:60:in `block (2 levels) in require'
        from /usr/local/bundle/gems/bundler-2.4.10/lib/bundler/runtime.rb:55:in `each'
        from /usr/local/bundle/gems/bundler-2.4.10/lib/bundler/runtime.rb:55:in `block in require'
        from /usr/local/bundle/gems/bundler-2.4.10/lib/bundler/runtime.rb:44:in `each'
        from /usr/local/bundle/gems/bundler-2.4.10/lib/bundler/runtime.rb:44:in `require'
        from /usr/local/bundle/gems/bundler-2.4.10/lib/bundler.rb:196:in `require'
        from /usr/src/app/config/application.rb:7:in `<top (required)>'
        from <internal:/usr/local/lib/ruby/site_ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:38:in `require'
        from <internal:/usr/local/lib/ruby/site_ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:38:in `require'
        from /usr/local/bundle/gems/railties-7.0.5/lib/rails/command/actions.rb:22:in `require_application!'
        from /usr/local/bundle/gems/railties-7.0.5/lib/rails/command/actions.rb:14:in `require_application_and_environment!'
        from /usr/local/bundle/gems/railties-7.0.5/lib/rails/commands/console/console_command.rb:105:in `perform'
        from /usr/local/bundle/gems/thor-1.2.2/lib/thor/command.rb:27:in `run'
        from /usr/local/bundle/gems/thor-1.2.2/lib/thor/invocation.rb:127:in `invoke_command'
        from /usr/local/bundle/gems/thor-1.2.2/lib/thor.rb:392:in `dispatch'
        from /usr/local/bundle/gems/railties-7.0.5/lib/rails/command/base.rb:87:in `perform'
        from /usr/local/bundle/gems/railties-7.0.5/lib/rails/command.rb:48:in `invoke'
        from /usr/local/bundle/gems/railties-7.0.5/lib/rails/commands.rb:18:in `<top (required)>'
        from <internal:/usr/local/lib/ruby/site_ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:38:in `require'
        from <internal:/usr/local/lib/ruby/site_ruby/3.2.0/rubygems/core_ext/kernel_require.rb>:38:in `require'
        from bin/rails:4:in `<main>'

My versions:

$ ruby -v
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
$ gem -v
3.4.20
$ cat /etc/debian_version 
12.1

Any idea what might be wrong?

Issues when deploying to Ubuntu 20.04

I recently installed the tokenizers gem while working with the OpenAI API. When developing locally, everything was running very smoothly. This morning, I went to deploy the Rails application to our staging environment and ran into a problem. The deployment (via Capistrano) failed with this error message:

01 LoadError: cannot load such file -- /var/www/personal-project/shared/bundle/ruby/2.7.0/gems/tokenizers-0.3.2-x86_64-linux-musl/lib/tokenizers/tokenizers

I fully removed the tokenizers gem and was able to deploy successfully. I was also able to pin the dependency in the Gemfile to 0.3.1, and it deployed correctly.
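The pin mentioned above would look roughly like this in the Gemfile. This is a sketch only: the source line is the usual RubyGems default and is an assumption about this project's Gemfile.

```ruby
source "https://rubygems.org"

# Pin to the last release that deployed correctly (0.3.2 failed to load on the server)
gem "tokenizers", "0.3.1"
```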

I've attached the entire stack trace to this issue.

I'm running a 2021 M1 Macbook. The target server is running Ubuntu 20.04.6 LTS hosted on Digital Ocean. Node version on both local and server is v16.20.0.

Please let me know what other information I can provide to be helpful!

tokenizer-capistrano-failed-deployment.txt

Error when using in Docker Alpine

Here are the steps to reproduce the issue:

Dockerfile

FROM ruby:2.7.7-alpine
ENV BUNDLER_VERSION=1.17.3

RUN gem install bundler -v 1.17.3
RUN mkdir /usr/src/app
ADD . /usr/src/app/
WORKDIR /usr/src/app/
RUN bundle install

RUN ruby test.rb

Gemfile

source 'https://rubygems.org'
gem 'tokenizers', '~> 0.3.1'

test.rb

require 'tokenizers'

tokenizer = Tokenizers.from_pretrained("bert-base-cased")

The error when running docker build:

[+] Building 7.3s (11/11) FINISHED
 => [internal] load .dockerignore  0.0s
 => => transferring context: 2B  0.0s
 => [internal] load build definition from Dockerfile  0.0s
 => => transferring dockerfile: 225B  0.0s
 => [internal] load metadata for docker.io/library/ruby:2.7.7-alpine  0.9s
 => [internal] load build context  0.0s
 => => transferring context: 277B  0.0s
 => [1/7] FROM docker.io/library/ruby:2.7.7-alpine@sha256:cdfd6dbfa41f2826dafb8c53c677dbebe28d32b401cd644997b1baa8e3e45c8a  0.0s
 => CACHED [2/7] RUN gem install bundler -v 1.17.3  0.0s
 => [3/7] RUN mkdir /usr/src/app  0.3s
 => [4/7] ADD . /usr/src/app/  0.0s
 => [5/7] WORKDIR /usr/src/app/  0.1s
 => [6/7] RUN bundle install  5.5s
 => ERROR [7/7] RUN ruby test.rb  0.4s
------
 > [7/7] RUN ruby test.rb:
#0 0.406 /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:5:in `require_relative': cannot load such file -- /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers/tokenizers (LoadError)
#0 0.406     from /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:5:in `rescue in <top (required)>'
#0 0.406     from /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:2:in `<top (required)>'
#0 0.406 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `require'
#0 0.406 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `rescue in require'
#0 0.406 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:147:in `require'
#0 0.406 	from test.rb:1:in `<main>'
#0 0.406 /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:3:in `require_relative': Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers/2.7/tokenizers.so) - /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers/2.7/tokenizers.so (LoadError)
#0 0.406 	from /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:3:in `<top (required)>'
#0 0.406 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `require'
#0 0.406 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `rescue in require'
#0 0.406 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:147:in `require'

I tried adding gcompat to the Dockerfile, but then there is a new error:

> [8/8] RUN ruby test.rb:
#0 0.385 /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:5:in `require_relative': cannot load such file -- /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers/tokenizers (LoadError)
#0 0.385     from /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:5:in `rescue in <top (required)>'
#0 0.385     from /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:2:in `<top (required)>'
#0 0.385     from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `require'
#0 0.385 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `rescue in require'
#0 0.385 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:147:in `require'
#0 0.385 	from test.rb:1:in `<main>'
#0 0.385 /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:3:in `require_relative': Error relocating /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers/2.7/tokenizers.so: __register_atfork: symbol not found - /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers/2.7/tokenizers.so (LoadError)
#0 0.385 	from /usr/local/bundle/gems/tokenizers-0.3.1-x86_64-linux/lib/tokenizers.rb:3:in `<top (required)>'
#0 0.385 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `require'
#0 0.385 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:158:in `rescue in require'
#0 0.385 	from /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:147:in `require'
#0 0.385 	from test.rb:1:in `<main>'
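Both traces point at glibc pieces (the ld-linux-x86-64.so.2 loader, then the __register_atfork symbol), which suggests, without confirming it, that the x86_64-linux build of the gem expects glibc while Alpine ships musl. A small check like the following can show which libc the running Ruby reports. It is a sketch: musl_host? is a hypothetical helper, and it assumes that a musl-based Ruby includes "musl" in RbConfig::CONFIG["host_os"].

```ruby
require "rbconfig"

# Hypothetical helper: guesses whether this Ruby was built for a musl-based system.
# Assumption: musl builds report a host_os containing "musl" (e.g. "linux-musl").
def musl_host?(host_os = RbConfig::CONFIG["host_os"])
  host_os.to_s.include?("musl")
end

puts "host_os: #{RbConfig::CONFIG["host_os"]}"
puts "musl:    #{musl_host?}"
```

If this prints true while Bundler has installed the x86_64-linux variant of the gem, the precompiled binary and the system libc probably do not match.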

Support for Ruby 3.2.0 (release 0.2.1)

Hi! First of all, thanks for making and maintaining this gem! 🙌

I was wondering when you expect to release the latest version of the gem with support for Ruby 3.2.0. We're using it in a project, and it seems to be the only missing piece for us to upgrade.

Thanks in advance!

PS: I'm asking because I noticed the commit with the changes was already merged (1f89484).

Issue with CharBPETokenizer and pile_tokenizer.json

First of all, thanks for such a great gem.

I'm trying to use it for my hobby project and haven't had much luck so far.
I have a pile_tokenizer.json file which looks like this:

{
    "addedTokens": {
        "<|endoftext|>": 0,
        "<|padding|>": 1,
        "        ": 50254,
        "    ": 50255,
        "  ": 50256
    },
    "vocab": {
        "<|endoftext|>": 0,
        "<|padding|>": 1,
        "!": 2,
        "\"": 3,
        "#": 4,
        "$": 5,
        "%": 6,
        "&": 7,
...
    "merges": [
        "Ġ Ġ",
        "Ġ t",
        "Ġ a",
        "h e",
        "i n",
        "r e",
        "o n",
        "ĠĠ ĠĠ",
        "Ġt he",
        "e r",
        "a t",
...
    ]
}

I tried to use it like this:

tokenizer = Tokenizers.from_file('pile_tokenizer.json')

But it seems like something is wrong with its format.

I also tried to extract vocab and merges into separate files, vocab.json and merges.txt, and use them as described in the README:

tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")

While it decodes individual tokens well, I struggle to make it encode a string properly.

For example, when I try to encode Hello World, which is represented as HelloĠWorld, I expect to get these tokens: ["Hello", "ĠWorld"], but instead I get ["hell", "og", "worl", "<unk>"].

For longer strings, it returns a lot of <unk> tokens. It reminds me of what I see in this test:

expected_tokens = ["<unk>", "ca", "<unk>", "fee", "<unk>", "th", "<unk>", "m", "agi", "<unk>", "<unk>", "ca", "<unk>", "yo", "<unk>", "<unk>"]

If you could help me understand what I'm doing wrong, I would appreciate it.
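One possible reading, not confirmed in this thread, is that the Ġ entries come from GPT-2 style byte-level encoding, where every byte is mapped to a printable character and the space byte becomes Ġ. A character-level tokenizer would not apply that mapping, which might explain the mismatch. The sketch below rebuilds that mapping in plain Ruby. The helper names bytes_to_unicode and byte_level are hypothetical, and it is an assumption that pile_tokenizer.json uses this scheme.

```ruby
# Sketch of a GPT-2 style byte-to-character table (assumed to match the vocab above).
# Printable bytes map to themselves; the remaining bytes map to code points from 256 up.
def bytes_to_unicode
  printable = (0x21..0x7E).to_a + (0xA1..0xAC).to_a + (0xAE..0xFF).to_a
  table = {}
  printable.each { |b| table[b] = b.chr(Encoding::UTF_8) }
  offset = 0
  (0..255).each do |b|
    next if table.key?(b)
    table[b] = (256 + offset).chr(Encoding::UTF_8)
    offset += 1
  end
  table
end

# Rewrites a string the way a byte-level pre-tokenizer would present it to the BPE model.
def byte_level(text)
  table = bytes_to_unicode
  text.bytes.map { |b| table[b] }.join
end

puts byte_level("Hello World")
```

Under this mapping the space byte (32) becomes U+0120, which is the Ġ seen in the merges list.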

Error compiling

Getting the following compiler output:

rustc --version
rustc 1.61.0 (fe5b13d68 2022-05-18)

gem install tokenizers
Building native extensions. This could take a while...
ERROR:  Error installing tokenizers:
	ERROR: Failed to build gem native extension.

    current directory: /home/ur5us/.rvm/gems/ruby-3.0.4/gems/tokenizers-0.1.0/ext/tokenizers
/home/ur5us/.rvm/rubies/ruby-3.0.4/bin/ruby -I /home/ur5us/.rvm/rubies/ruby-3.0.4/lib/ruby/site_ruby/3.0.0 -r ./siteconf20220623-143041-cdpust.rb extconf.rb

current directory: /home/ur5us/.rvm/gems/ruby-3.0.4/gems/tokenizers-0.1.0/ext/tokenizers
make DESTDIR\= clean
cargo clean

current directory: /home/ur5us/.rvm/gems/ruby-3.0.4/gems/tokenizers-0.1.0/ext/tokenizers
make DESTDIR\=
cargo build --release
   Compiling libc v0.2.121
   Compiling cfg-if v1.0.0
   Compiling autocfg v1.1.0
   Compiling cc v1.0.73
   Compiling pkg-config v0.3.24
   Compiling proc-macro2 v1.0.36
   Compiling unicode-xid v0.2.2
   Compiling syn v1.0.89
   Compiling memchr v2.3.4
   Compiling lazy_static v1.4.0
   Compiling log v0.4.14
   Compiling version_check v0.9.4
   Compiling pin-project-lite v0.2.8
   Compiling bitflags v1.3.2
   Compiling futures-core v0.3.21
   Compiling bytes v1.1.0
   Compiling once_cell v1.10.0
   Compiling itoa v1.0.1
   Compiling serde_derive v1.0.136
   Compiling futures-task v0.3.21
   Compiling crossbeam-utils v0.8.8
   Compiling typenum v1.15.0
   Compiling futures-util v0.3.21
   Compiling foreign-types-shared v0.1.1
   Compiling serde v1.0.136
   Compiling fnv v1.0.7
   Compiling openssl v0.10.38
   Compiling ryu v1.0.9
   Compiling pin-utils v0.1.0
   Compiling hashbrown v0.11.2
   Compiling slab v0.4.5
   Compiling crc32fast v1.3.2
   Compiling unicode-width v0.1.9
   Compiling futures-channel v0.3.21
   Compiling native-tls v0.2.8
   Compiling futures-io v0.3.21
   Compiling tinyvec_macros v0.1.0
   Compiling matches v0.1.9
   Compiling httparse v1.6.0
   Compiling futures-sink v0.3.21
   Compiling scopeguard v1.1.0
   Compiling radium v0.5.3
   Compiling percent-encoding v2.1.0
   Compiling strsim v0.9.3
   Compiling ident_case v1.0.1
   Compiling adler v1.0.2
   Compiling rayon-core v1.9.1
   Compiling getrandom v0.1.16
   Compiling ppv-lite86 v0.2.16
   Compiling try-lock v0.2.3
   Compiling regex-syntax v0.6.25
   Compiling openssl-probe v0.1.5
   Compiling httpdate v1.0.2
   Compiling lexical-core v0.7.6
   Compiling unicode-bidi v0.3.7
   Compiling encoding_rs v0.8.30
   Compiling tower-service v0.3.1
   Compiling either v1.6.1
   Compiling byteorder v1.4.3
   Compiling arrayvec v0.5.2
   Compiling tap v1.0.1
   Compiling wyz v0.2.0
   Compiling static_assertions v1.1.0
   Compiling serde_json v1.0.79
   Compiling funty v1.1.0
   Compiling fastrand v1.7.0
   Compiling cpufeatures v0.2.2
   Compiling mime v0.3.16
   Compiling base64 v0.13.0
   Compiling remove_dir_all v0.5.3
   Compiling derive_builder v0.9.0
   Compiling ipnet v2.4.0
   Compiling number_prefix v0.4.0
   Compiling ansi_term v0.12.1
   Compiling glob v0.3.0
   Compiling unicode-segmentation v1.9.0
   Compiling strsim v0.8.0
   Compiling macro_rules_attribute-proc_macro v0.0.2
   Compiling smallvec v1.8.0
   Compiling number_prefix v0.3.0
   Compiling rutie v0.8.3
   Compiling vec_map v0.8.2
   Compiling base64 v0.12.3
   Compiling unicode_categories v0.1.1
   Compiling paste v1.0.6
   Compiling tracing-core v0.1.23
   Compiling indexmap v1.8.0
   Compiling memoffset v0.6.5
   Compiling crossbeam-epoch v0.9.8
   Compiling miniz_oxide v0.4.4
   Compiling rayon v1.5.1
   Compiling generic-array v0.14.5
   Compiling nom v6.2.1
   Compiling foreign-types v0.3.2
   Compiling http v0.2.6
   Compiling textwrap v0.11.0
   Compiling openssl-sys v0.9.72
   Compiling bzip2-sys v0.1.11+1.0.8
   Compiling onig_sys v69.7.1
   Compiling esaxx-rs v0.1.7
   Compiling tinyvec v1.5.1
   Compiling form_urlencoded v1.0.1
   Compiling itertools v0.8.2
   Compiling itertools v0.9.0
   Compiling unicode-normalization-alignments v0.1.12
   Compiling macro_rules_attribute v0.0.2
   Compiling tracing v0.1.32
   Compiling aho-corasick v0.7.15
   Compiling unicode-normalization v0.1.19
   Compiling http-body v0.4.4
   Compiling num_cpus v1.13.1
   Compiling socket2 v0.4.4
   Compiling terminal_size v0.1.17
   Compiling getrandom v0.2.5
   Compiling time v0.1.43
   Compiling filetime v0.2.15
   Compiling xattr v0.2.2
   Compiling atty v0.2.14
   Compiling tempfile v3.3.0
   Compiling dirs-sys v0.3.7
   Compiling fs2 v0.4.3
   Compiling quote v1.0.16
   Compiling mio v0.8.2
   Compiling want v0.3.0
   Compiling crossbeam-channel v0.5.4
   Compiling bitvec v0.19.6
   Compiling regex v1.4.6
   Compiling idna v0.2.3
   Compiling rand_core v0.6.3
   Compiling rand_core v0.5.1
   Compiling tar v0.4.38
   Compiling clap v2.34.0
   Compiling dirs v3.0.2
   Compiling tokio v1.17.0
   Compiling flate2 v1.0.22
   Compiling block-buffer v0.10.2
   Compiling crypto-common v0.1.3
   Compiling rand_chacha v0.3.1
   Compiling rand_chacha v0.2.2
   Compiling url v2.2.2
   Compiling bzip2 v0.4.3
   Compiling console v0.15.0
   Compiling crossbeam-deque v0.8.1
   Compiling digest v0.10.3
   Compiling rand v0.8.5
   Compiling rand v0.7.3
   Compiling tokio-util v0.6.9
   Compiling indicatif v0.16.2
   Compiling indicatif v0.15.0
   Compiling darling_core v0.10.2
   Compiling onig v6.3.1
   Compiling sha2 v0.10.2
   Compiling h2 v0.3.12
   Compiling tokio-native-tls v0.3.0
   Compiling thiserror-impl v1.0.30
   Compiling darling_macro v0.10.2
   Compiling thiserror v1.0.30
   Compiling zip v0.5.13
   Compiling darling v0.10.2
   Compiling derive_builder_core v0.9.0
   Compiling zip-extensions v0.6.1
   Compiling rayon-cond v0.1.0
   Compiling hyper v0.14.17
   Compiling serde_urlencoded v0.7.1
   Compiling spm_precompiled v0.1.3
   Compiling hyper-tls v0.5.0
   Compiling reqwest v0.11.10
   Compiling cached-path v0.5.3
   Compiling tokenizers v0.11.3
   Compiling tokenizers-ruby v0.1.0 (/home/ur5us/.rvm/gems/ruby-3.0.4/gems/tokenizers-0.1.0)
warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
  --> src/lib.rs:63:101
   |
63 |     fn tokenizers_from_pretrained(identifier: RString, revision: RString, auth_token: AnyObject) -> AnyObject {
   |                                                                                                     ^^^^^^^^^ not FFI-safe
   |
   = note: `#[warn(improper_ctypes_definitions)]` on by default
   = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
   = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
  --> src/lib.rs:88:52
   |
88 |     fn bpe_new(vocab: RString, merges: RString) -> AnyObject {
   |                                                    ^^^^^^^^^ not FFI-safe
   |
   = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
   = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:107:43
    |
107 |     fn tokenizer_new(model: AnyObject) -> AnyObject {
    |                                           ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:137:43
    |
137 |     fn tokenizer_encode(text: RString) -> AnyObject {
    |                                           ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `rutie::RString`, which is not FFI-safe
   --> src/lib.rs:147:40
    |
147 |     fn tokenizer_decode(ids: Array) -> RString {
    |                                        ^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:159:53
    |
159 |     fn tokenizer_decoder_set(decoder: AnyObject) -> AnyObject {
    |                                                     ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:167:65
    |
167 |     fn tokenizer_pre_tokenizer_set(pre_tokenizer: AnyObject) -> AnyObject {
    |                                                                 ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:175:59
    |
175 |     fn tokenizer_normalizer_set(normalizer: AnyObject) -> AnyObject {
    |                                                           ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `rutie::Array`, which is not FFI-safe
   --> src/lib.rs:188:26
    |
188 |     fn encoding_ids() -> Array {
    |                          ^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `rutie::Array`, which is not FFI-safe
   --> src/lib.rs:198:29
    |
198 |     fn encoding_tokens() -> Array {
    |                             ^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:213:29
    |
213 |     fn bpe_decoder_new() -> AnyObject {
    |                             ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:225:36
    |
225 |     fn bert_pre_tokenizer_new() -> AnyObject {
    |                                    ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `extern` fn uses type `AnyObject`, which is not FFI-safe
   --> src/lib.rs:237:33
    |
237 |     fn bert_normalizer_new() -> AnyObject {
    |                                 ^^^^^^^^^ not FFI-safe
    |
    = help: consider adding a `#[repr(C)]` or `#[repr(transparent)]` attribute to this struct
    = note: this struct has unspecified layout

warning: `tokenizers-ruby` (lib) generated 13 warnings
    Finished release [optimized] target(s) in 1m 51s
mv target/release/libtokenizers.so lib/tokenizers/ext.so
mv: cannot stat 'target/release/libtokenizers.so': No such file or directory
make: *** [Makefile:3: install] Error 1

make failed, exit code 2

Gem files will remain installed in /home/ur5us/.rvm/gems/ruby-3.0.4/gems/tokenizers-0.1.0 for inspection.
Results logged to /home/ur5us/.rvm/gems/ruby-3.0.4/extensions/x86_64-linux/3.0.0/tokenizers-0.1.0/gem_make.out

Unable to run Rails when gem is installed

After successfully installing the gem (tokenizers 0.2.2-arm64-darwin), I'm unable to open the Rails console:

➜ rails console
[1]    8415 killed     rails console

When I try to install the HEAD version, I get this error message:

Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
…snip…
thread 'main' panicked at 'rustfmt is required to generate bindings. To install, run `rustup component add rustfmt`: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /Users/marc/.cargo/registry/src/github.com-1ecc6299db9ec823/rb-sys-build-0.9.56/src/bindings.rs:77:10
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
make: *** [target/release/libtokenizers.dylib] Error 101
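
The panic above says the build could not find `rustfmt`, and the message itself suggests `rustup component add rustfmt`. Before building from source, it may help to confirm which Rust tools are actually visible on PATH. The following is a minimal sketch of such a check; `find_executable` and `missing_tools` are hypothetical helper names, not part of the gem, and the tool list is an assumption based on the messages in these logs.

```ruby
# Hypothetical helper: returns the full path of `name` if an executable
# file with that name exists in one of the PATH directories, else nil.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Tools named in the build output (assumed list, adjust as needed).
REQUIRED_TOOLS = %w[cargo rustup rustfmt].freeze

# Returns the subset of `tools` that cannot be found on `path`.
def missing_tools(tools = REQUIRED_TOOLS, path = ENV.fetch("PATH", ""))
  tools.reject { |tool| find_executable(tool, path) }
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_tools
  if missing.empty?
    puts "All Rust build tools found on PATH"
  else
    puts "Missing from PATH: #{missing.join(', ')}"
  end
end
```

If `rustfmt` shows up as missing, running the command from the panic message and retrying the install is the obvious next step.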

Cannot install tokenizers 0.3.2

Hi, I am new to this gem.

I tried to install the gem at release tag 0.3.2 and this error occurs:

Gem::Ext::BuildError: ERROR: Failed to build gem native extension.

current directory: /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/tokenizers-0.3.2/ext/tokenizers
/Users/pribadiridwan/.rbenv/versions/3.0.0/bin/ruby -I /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0 extconf.rb
checking for cargo... no

current directory: /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/tokenizers-0.3.2/ext/tokenizers
make DESTDIR\= sitearchdir\=./.gem.20230307-17603-x8spx4 sitelibdir\=./.gem.20230307-17603-x8spx4 clean

current directory: /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/tokenizers-0.3.2/ext/tokenizers
make DESTDIR\= sitearchdir\=./.gem.20230307-17603-x8spx4 sitelibdir\=./.gem.20230307-17603-x8spx4
info: downloading installer
info: profile set to 'minimal'
info: default host triple is aarch64-apple-darwin
info: skipping toolchain installation


Rust is installed now. Great!

To get started you need Cargo's bin directory (/Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/tokenizers-0.3.2/ext/tokenizers/.rb-sys/stable/cargo/bin) in your PATH
environment variable. This has not been done automatically.

To configure your current shell, run:
source "/Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/tokenizers-0.3.2/ext/tokenizers/.rb-sys/stable/cargo/env"
make: rustup: No such file or directory
make: *** [/Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/tokenizers-0.3.2/ext/tokenizers/.rb-sys/stable/cargo/bin/cargo] Error 1

make failed, exit code 2

Gem files will remain installed in /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/tokenizers-0.3.2 for inspection.
Results logged to /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/extensions/arm64-darwin-21/3.0.0/tokenizers-0.3.2/gem_make.out

  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:102:in `run'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:51:in `block in make'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:43:in `each'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:43:in `make'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/ext_conf_builder.rb:42:in `build'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:170:in `build_extension'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:204:in `block in build_extensions'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:201:in `each'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/ext/builder.rb:201:in `build_extensions'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/site_ruby/3.0.0/rubygems/installer.rb:843:in `build_extensions'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/rubygems_gem_installer.rb:72:in `build_extensions'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/rubygems_gem_installer.rb:28:in `install'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/source/rubygems.rb:200:in `install'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/installer/gem_installer.rb:54:in `install'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/installer/gem_installer.rb:16:in `install_from_spec'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/installer/parallel_installer.rb:167:in `do_install'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/installer/parallel_installer.rb:158:in `block in worker_pool'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/worker.rb:62:in `apply_func'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/worker.rb:57:in `block in process_queue'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/worker.rb:54:in `loop'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/worker.rb:54:in `process_queue'
  /Users/pribadiridwan/.rbenv/versions/3.0.0/lib/ruby/gems/3.0.0/gems/bundler-2.4.7/lib/bundler/worker.rb:90:in `block (2 levels) in create_threads'

An error occurred while installing tokenizers (0.3.2), and Bundler cannot continue.

In Gemfile:
  tokenizers

However, if I downgrade it to 0.3.2, it works.

My environment:
ruby 3.0.0
rails 7.+
rbenv 1.2.0

Do you have any idea why?
Thank you.

Clarify how to load tokenizers from files

In the README it says this is how you can load tokenizers from files:

tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")

Both parameters are required, but I don't have a merges.txt file, nor do I know what it is supposed to contain.

The following worked for me; perhaps the README should be updated with this code example as well:

tokenizer = Tokenizers.from_file("vocab.json")
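
On the merges.txt question: in the BPE file layout used by Hugging Face tokenizers, vocab.json maps each token to an integer id, while merges.txt lists the learned merge rules, one pair of symbols per line in priority order, often preceded by a `#version` header line. The sketch below parses text in that layout to show the shape of the data. The sample content is made up for illustration, and `parse_merges` is a hypothetical helper, not part of the gem.

```ruby
# Parse merges.txt-style text into an ordered list of [left, right] pairs.
# Blank lines and the optional "#version" header are skipped.
def parse_merges(text)
  text.each_line
      .map(&:chomp)
      .reject { |line| line.empty? || line.start_with?("#version") }
      .map { |line| line.split(" ", 2) }
end

# Made-up sample content, for illustration only.
sample = <<~MERGES
  #version: 0.2
  t h
  th e
  i n
MERGES

merges = parse_merges(sample)

# Earlier lines take priority, so the line index serves as the merge rank.
ranks = merges.each_with_index.to_h

puts merges.inspect
puts ranks.inspect
```

A single file loaded with `Tokenizers.from_file` is a different thing: it is the serialized tokenizer written by `tokenizer.save`, so it would be expected to carry the vocabulary and merges together.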

More flexible handling of special tokens

I may be missing something, but it looks like it's not currently possible to encode with special tokens because the special tokens argument is hardcoded to false in the encode method. Ideally I'd like to be able to encode (and decode) with special tokens.

I think the big challenge here is the lack of support for default arguments in Rust (and the corresponding Ruby/Rust interface). Ideally we'd like the special tokens value to have a default for encode, and be able to override that default value.

One option (which might free up other opportunities) is to add a wrapper class - the Rust Tokenizer interface would be implemented on an internal class, while the external class returned from factory methods would be a Ruby wrapper class. That Ruby wrapper could handle things like default arguments. This appears to be (essentially) how the Python library works. It could also make it easier to implement batch encoding, sequences, and other related features.
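
A rough sketch of the wrapper idea, to make the proposal concrete. Everything here is hypothetical: `InternalTokenizer` stands in for the Rust-backed class (faked in plain Ruby so the example runs on its own), `TokenizerWrapper` is the Ruby class callers would see, and the default of `true` for `add_special_tokens` is an assumption, not the gem's current behavior.

```ruby
# Stand-in for the Rust-backed class. It takes every argument
# positionally, with no defaults. Hypothetical, for illustration only.
class InternalTokenizer
  def encode(text, add_special_tokens)
    tokens = text.split
    add_special_tokens ? ["[CLS]", *tokens, "[SEP]"] : tokens
  end
end

# Ruby wrapper returned from factory methods. It owns the defaults
# and forwards fully specified calls to the internal object.
class TokenizerWrapper
  def initialize(internal)
    @internal = internal
  end

  # The keyword argument supplies the default the Rust side cannot express.
  def encode(text, add_special_tokens: true)
    @internal.encode(text, add_special_tokens)
  end
end

tokenizer = TokenizerWrapper.new(InternalTokenizer.new)
with_special = tokenizer.encode("hello world")
without_special = tokenizer.encode("hello world", add_special_tokens: false)

puts with_special.inspect
puts without_special.inspect
```

The same wrapper would be a natural place to add batch encoding or sequence pairs later, since those also depend on optional arguments.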

I'm not sure if there's an appetite for this sort of change, or if there's another solution that @ankane may have in mind. Thoughts?
