
tiktoken's Introduction

⏳ tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")

The open source version of tiktoken can be installed from PyPI:

pip install tiktoken

The tokeniser API is documented in tiktoken/core.py.

Example code using tiktoken can be found in the OpenAI Cookbook.

Performance

tiktoken is 3-6x faster than a comparable open source tokeniser:


Performance measured on 1GB of text using the GPT-2 tokeniser, with GPT2TokenizerFast from tokenizers==0.13.2, transformers==4.24.0, and tiktoken==0.2.0.

Getting help

Please post questions in the issue tracker.

If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.

What is BPE anyway?

Language models don't see text the way you and I do; instead, they see a sequence of numbers (known as tokens). Byte pair encoding (BPE) is a way of converting text into tokens. It has a couple of desirable properties:

  1. It's reversible and lossless, so you can convert tokens back into the original text
  2. It works on arbitrary text, even text that is not in the tokeniser's training data
  3. It compresses the text: the token sequence is shorter than the bytes corresponding to the original text. On average, in practice, each token corresponds to about 4 bytes.
  4. It attempts to let the model see common subwords. For instance, "ing" is a common subword in English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing" (instead of e.g. "enc" and "oding"). Because the model will then see the "ing" token again and again in different contexts, it helps models generalise and better understand grammar.
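
The merge behaviour described above can be illustrated with a toy pure-Python BPE (a simplified sketch for intuition only, not tiktoken's actual implementation):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn BPE merges from text (toy version: greedily merge the most frequent pair)."""
    # Start from the raw bytes of the text as single-byte tokens
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        tokens = _apply_merge(tokens, best)
    return merges

def _apply_merge(tokens: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace every occurrence of `pair` with the concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Apply the learned merges, earliest-learned first, to new text."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = _apply_merge(tokens, pair)
    return tokens
```

Because tokens are just concatenations of the original bytes, joining them back together always recovers the input exactly (property 1), and frequent subwords end up as single tokens (properties 3 and 4).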

tiktoken contains an educational submodule that is friendlier if you want to learn more about the details of BPE, including code that helps visualise the BPE procedure:

from tiktoken._educational import *

# Train a BPE tokeniser on a small amount of text
enc = train_simple_encoding()

# Visualise how the GPT-4 encoder encodes text
enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
enc.encode("hello world aaaaaaaaaaaa")

Extending tiktoken

You may wish to extend tiktoken to support new encodings. There are two ways to do this.

1. Create your Encoding object exactly the way you want and simply pass it around.

cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)

2. Use the tiktoken_ext plugin mechanism to register your Encoding objects with tiktoken.

This is only useful if you need tiktoken.get_encoding to find your encoding, otherwise prefer option 1.

To do this, you'll need to create a namespace package under tiktoken_ext.

Lay out your project like this, making sure to omit the tiktoken_ext/__init__.py file:

my_tiktoken_extension
├── tiktoken_ext
│   └── my_encodings.py
└── setup.py

my_encodings.py should be a module that contains a variable named ENCODING_CONSTRUCTORS. This is a dictionary from an encoding name to a function that takes no arguments and returns arguments that can be passed to tiktoken.Encoding to construct that encoding. For an example, see tiktoken_ext/openai_public.py. For precise details, see tiktoken/registry.py.

Your setup.py should look something like this:

from setuptools import setup, find_namespace_packages

setup(
    name="my_tiktoken_extension",
    packages=find_namespace_packages(include=['tiktoken_ext*']),
    install_requires=["tiktoken"],
    ...
)

Then simply pip install ./my_tiktoken_extension and you should be able to use your custom encodings! Make sure not to use an editable install.

tiktoken's People

Contributors

alvarobartt, arvid220u, fritzo, hauntsaninja, henriktorget, jonathanagustin, logankilpatrick, mariatta, mdwelsh, nistath, paplorinc, praneet460, ted-at-openai, xhluca, youkaichao


tiktoken's Issues

Reproducing Benchmark Results

Hi,

I want to reproduce the results presented in README.md, to the extent my hardware allows. I am aware of the scripts/benchmark.py file and could run tiktoken with different numbers of threads. But I could not find a way to set the number of threads for Hugging Face tokenizers. I tried using the environment variable RAYON_RS_NUM_CPUS, but the number of threads did not change.

Any help is appreciated!

Thread Panic when decoding token id 100256 and others with cl100k_base tokenizer

Code example:

enc = tiktoken.get_encoding("cl100k_base")
enc.decode([100256])

Trace:

thread '<unnamed>' panicked at 'no entry found for key', src/lib.rs:210:37
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/var/folders/m9/s4s3bdq96pn3dk13fbgpw6rm0000gn/T/ipykernel_9548/1299473396.py in
      1 enc = tiktoken.get_encoding("cl100k_base")
----> 2 enc.decode([100256])

/usr/local/lib/python3.9/site-packages/tiktoken/core.py in decode(self, tokens, errors)
    237         ```
    238         """
--> 239         return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)
    240 
    241     def decode_single_token_bytes(self, token: int) -> bytes:

PanicException: no entry found for key

Also reproduces for token ids 100261 through 100275

If tokens are intentionally empty, they should still not cause a panic.

get_encoding error for gpt2, but other encodings fine

The code is like this.

import tiktoken

# runs ok
encoding2 = tiktoken.get_encoding("cl100k_base")

# runs ok
encoding4 = tiktoken.encoding_for_model("gpt-3.5-turbo")

# runs ok
encoding3 = tiktoken.get_encoding("p50k_base")

# runs error !!
encoding3 = tiktoken.get_encoding("gpt2")

The error message is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[11], line 2
      1 # runs error
----> 2 encoding3 = tiktoken.get_encoding("gpt2")

File ~/work/venv310/lib/python3.10/site-packages/tiktoken/registry.py:63, in get_encoding(encoding_name)
     60     raise ValueError(f"Unknown encoding {encoding_name}")
     62 constructor = ENCODING_CONSTRUCTORS[encoding_name]
---> 63 enc = Encoding(**constructor())
     64 ENCODINGS[encoding_name] = enc
     65 return enc

File ~/work/venv310/lib/python3.10/site-packages/tiktoken_ext/openai_public.py:11, in gpt2()
     10 def gpt2():
---> 11     mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
     12         vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
     13         encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
     14     )
     15     return {
     16         "name": "gpt2",
     17         "explicit_n_vocab": 50257,
   (...)
     20         "special_tokens": {"<|endoftext|>": 50256},
     21     }

File ~/work/venv310/lib/python3.10/site-packages/tiktoken/load.py:95, in data_gym_to_mergeable_bpe_ranks(vocab_bpe_file, encoder_json_file)
     93 encoder_json_loaded.pop(b"<|endoftext|>", None)
     94 encoder_json_loaded.pop(b"<|startoftext|>", None)
---> 95 assert bpe_ranks == encoder_json_loaded
     97 return bpe_ranks

AssertionError: 

According to another issue, you suggested running:

python --version
python -c 'import platform; print(platform.platform())'
python -m venv env
source env/bin/activate
env/bin/python -m pip install wheel
env/bin/python -m pip install tiktoken
env/bin/python -c 'import tiktoken; print(tiktoken.get_encoding("gpt2"))'
env/bin/python -c 'import site; import os; print(os.listdir(site.getsitepackages()[0]))'

Since I don't have a python executable, only python3, I ran everything in a venv.

The results are as follows:

Python 3.10.3
macOS-13.2-arm64-arm-64bit

(venv310) ➜  ~ pip install wheel
Requirement already satisfied: wheel in ./work/venv310/lib/python3.10/site-packages (0.40.0)
(venv310) ➜  ~ pip install tiktoken
Requirement already satisfied: tiktoken in ./work/venv310/lib/python3.10/site-packages (0.3.1)
Requirement already satisfied: regex>=2022.1.18 in ./work/venv310/lib/python3.10/site-packages (from tiktoken) (2022.10.31)
Requirement already satisfied: requests>=2.26.0 in ./work/venv310/lib/python3.10/site-packages (from tiktoken) (2.28.2)
Requirement already satisfied: charset-normalizer<4,>=2 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (1.26.9)
Requirement already satisfied: idna<4,>=2.5 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (2022.12.7)
(venv310) ➜  ~ python -c 'import site; import os; print(os.listdir(site.getsitepackages()[0]))'
['shellingham-1.5.0.post1.dist-info', 'fastjsonschema', 'dataclasses_json-0.5.7.dist-info', 'typing_extensions-4.5.0.dist-info', 'commonmark-0.9.1.dist-info', 'talib', 'weibo_spider', 'multidict-6.0.4.dist-info', 'async_timeout', 'marshmallow', 'importlib_metadata-4.10.1.dist-info', 'appnope', 'packaging', 'fonttools-4.31.1.dist-info', 'aiohttp', 'rfc3339_validator-0.1.4.dist-info', 'appnope-0.1.3.dist-info', 'certifi-2022.12.7.dist-info', 'pyrsistent-0.19.3.dist-info', 'altgraph-0.17.3.dist-info', 'wcwidth-0.2.6.dist-info', 'qdarkstyle', 'fqdn-1.5.1.dist-info', 'decorator-5.1.1.dist-info', 'tokenizers', 'ffmpeg', 'jupyter_client-8.0.3.dist-info', 'wcwidth', 'idna-3.3.dist-info', 'Jinja2-3.1.2.dist-info', 'websocket', 'markupsafe', 'integv', 'deap', 'jupyter_core', 'lxml-4.8.0.dist-info', 'vnpy_algotrading-1.0.2.dist-info', 'pandocfilters-1.5.0.dist-info', 'ptyprocess-0.7.0.dist-info', 'widgetsnbextension', 'aiosignal-1.3.1.dist-info', 'pytz-2022.1.dist-info', 'bs4-0.0.1.dist-info', 'isoduration-20.11.0.dist-info', 'webcolors-1.12.dist-info', 'webencodings', 'huggingface_hub', 'tinycss2-1.2.1.dist-info', 'yt_dlp', 
'backcall', 'websocket_client-1.5.1.dist-info', 'bleach-6.0.0.dist-info', 'defusedxml', '.DS_Store', 'nbformat', 'mistune', 'webencodings-0.5.1.dist-info', 'shiboken6', 'attrs', 'colorama-0.4.6.dist-info', 'pyrsistent', 'python_dateutil-2.8.2.dist-info', 'pycryptodomex-3.17.dist-info', 'debugpy-1.6.6.dist-info', 'bleach', 'pygments', 'TA_Lib-0.4.24.dist-info', 'pure_eval', 'aiofiles', 'pyparsing-3.0.7.dist-info', 'gpt_index-0.4.28.dist-info', 'asttokens-2.2.1.dist-info', 'pycparser', 'async_timeout-4.0.2.dist-info', 'more_itertools-9.1.0.dist-info', 'soupsieve-2.3.2.post1.dist-info', 
'nbclient-0.7.2.dist-info', 'python_json_logger-2.0.7.dist-info', 'jupyter_server_terminals-0.4.4.dist-info', 'jupyterlab-3.6.1.dist-info', 'vnpy_ctastrategy', 'pylab.py', 'defusedxml-0.7.1.dist-info', 'ipywidgets', 'typer', 'xvideos_dl', 'marshmallow-3.19.0.dist-info', 'youtube_dl', 'shiboken6-6.2.3.dist-info', 'argon2', 'jupyter_core-5.2.0.dist-info', 'pyparsing', 'debugpy', 'cursor', 'requests-2.28.2.dist-info', 'pickleshare-0.7.5.dist-info', 'vnpy_paperaccount', 'stack_data-0.6.2.dist-info', 'stack_data', 'past', 'langchain', 'QDarkStyle-3.0.3.dist-info', 'jinja2', 'nest_asyncio-1.5.6.dist-info', 'jupyter_events-0.6.3.dist-info', 'arrow', 'IPython', 'soupsieve', 'frozenlist', 'Send2Trash-1.8.0.dist-info', 'jupyter_client', 'parso-0.8.3.dist-info', 'seaborn', 'isoduration', 'executing-1.2.0.dist-info', 'six-1.16.0.dist-info', 'mypy_extensions-1.0.0.dist-info', 'EbookLib-0.18.dist-info', 'peewee-3.14.10.dist-info', 'decorator.py', 'filelock-3.9.0.dist-info', 'jupyterlab_widgets-3.0.5.dist-info', 'jupyterlab_plotly', 'llvmlite', 'ipywidgets-8.0.4.dist-info', '_cffi_backend.cpython-310-darwin.so', 'mutagen', 'jsonpointer.py', 'notebook_shim', 'numba-0.56.4.dist-info', 'future-0.18.3.dist-info', 'xvideos_dl-1.3.0.dist-info', 'colorama', 'cffi', 'vnpy_spreadtrading-1.1.4.dist-info', 'aiofiles-22.1.0.dist-info', 
'executing', 'jsonpointer-2.3.dist-info', 'ipykernel_launcher.py', 'llama_index', 'matplotlib_inline', 
'jupyterlab_server', 'jedi', 'send2trash', 'PySide6-6.2.3.dist-info', 'pip-23.0.1.dist-info', 'tests', 'absl', 'ipython_genutils', 'jupyter_server-2.4.0.dist-info', 'Babel-2.12.1.dist-info', 'fqdn', 'youtube_dl-2021.12.17.dist-info', 'vnpy_sqlite', 'fontTools', 'argon2_cffi-21.3.0.dist-info', 'idna', 'json5-0.9.11.dist-info', 'prometheus_client-0.16.0.dist-info', 'importlib_metadata', 'tqdm-4.64.0.dist-info', 
'_argon2_cffi_bindings', 'wheel', 'bs4', 'click', 'pickleshare.py', 'plotly-5.5.0.dist-info', 'tenacity', 'torch', 'comm', 'websockets', 'ipykernel-6.21.3.dist-info', 'aiosqlite', 'mpl_toolkits', 'pytz', 'jupyter_server_fileid-0.8.0.dist-info', 'filelock', 
'langchain-0.0.109.dist-info', 'pydantic-1.10.6.dist-info', 'tiktoken-0.3.1.dist-info', '__pycache__', 'jupyter_ydoc-0.2.3.dist-info', 'transformers-4.26.1.dist-info', 'nbclassic', 'arrow-1.2.3.dist-info', 'altgraph', 'sqlalchemy', 'pyqtgraph', 'shellingham', 'vnpy_ctastrategy-1.0.8.dist-info', 'regex', 'platformdirs-3.1.1.dist-info', 'Pillow-9.0.1.dist-info', 'jupyter_events', 'nbclient', 'plotly', 'numpy', 'jupyterlab_pygments', 'more_itertools', 'SQLAlchemy-1.4.46.dist-info', 'notebook-6.5.3.dist-info', 'pycparser-2.21.dist-info', 'charset_normalizer', 'PIL', 'requests', 'click-7.1.2.dist-info', 'cursor-1.3.5.dist-info', 'absl_py-1.0.0.dist-info', 'pure_eval-0.2.2.dist-info', 'pwiz.py', 'backcall-0.2.0.dist-info', 
'zipp.py', '_plotly_utils', 'ypy_websocket', 'matplotlib-3.5.1-py3.10-nspkg.pth', 'multidict', 'anyio', 'pip', 'cycler-0.11.0.dist-info', 'babel', 'marshmallow_enum', 'tornado', 'pvectorc.cpython-310-darwin.so', 'tomli', 'dataclasses_json', 'seaborn-0.11.2.dist-info', 'jupyter_server_fileid', 'PySide6', 'matplotlib_inline-0.1.6.dist-info', 'nbformat-5.7.3.dist-info', 'jupyterlab_server-2.20.0.dist-info', 'certifi', 'prompt_toolkit', 'pandocfilters.py', 'terminado-0.17.1.dist-info', 'pyinstaller_hooks_contrib-2023.0.dist-info', 'distutils-precedence.pth', 'pyqtgraph-0.12.3.dist-info', 'ipython_genutils-0.2.0.dist-info', 'vnpy_spreadtrading', 'weibo_spider-0.3.0.dist-info', 'sniffio', 'attr', 'pexpect', 'tiktoken', '_pyinstaller_hooks_contrib', 'transformers', 'jsonschema', 'jupyter_ydoc', 'tqdm', 'tzlocal-2.0.0.dist-info', 'PyYAML-6.0.dist-info', 
'yt_dlp-2023.3.4.dist-info', 'Brotli-1.0.9.dist-info', 'jupyterlab_pygments-0.2.2.dist-info', 'mypy_extensions.py', 'ffmpeg_python-0.2.0.dist-info', 'kiwisolver.cpython-310-darwin.so', 'torch-1.13.1.dist-info', 'tokenizers-0.13.2.dist-info', 'MarkupSafe-2.1.2.dist-info', '_yaml', 'huggingface_hub-0.13.2.dist-info', 'aiosqlite-0.18.0.dist-info', 'ptyprocess', 'six.py', 'jupyter_server_terminals', 'playhouse', 'vnpy_algotrading', 'pandas-1.3.5.dist-info', 'json5', 'tinycss2', 'jupyter_server_ydoc-0.6.1.dist-info', 'pexpect-4.8.0.dist-info', 'rfc3339_validator.py', 'macholib-1.16.2.dist-info', 'brotli.py', 'rich', 'cycler.py', 'cffi-1.15.1.dist-info', 'urllib3-1.26.9.dist-info', 'nbclassic-0.5.3.dist-info', 'regex-2022.10.31.dist-info', 'matplotlib', 'yaml', 'prometheus_client', 'vnpy', 'uri_template-1.2.0.dist-info', 'frozenlist-1.3.3.dist-info', 'attrs-22.2.0.dist-info', 'ebooklib', 'rfc3986_validator.py', 'jupyter_server', 'pythonjsonlogger', 
'tiktoken_ext', 'scipy-1.8.0.dist-info', 'numba', 'torchgen', 'urllib3', 'nbconvert', 'wheel-0.40.0.dist-info', 'comm-0.1.2.dist-info', 'rfc3986_validator-0.1.1.dist-info', 'tomli-2.0.1.dist-info', 'ipython-8.11.0.dist-info', 'integv-1.3.0.dist-info', 'rich-10.16.2.dist-info', 'widgetsnbextension-4.0.5.dist-info', 'uri_template', 'prompt_toolkit-3.0.38.dist-info', 'macholib', 'asttokens', 'jupyterlab', 'Cryptodome', 'argon2_cffi_bindings-21.2.0.dist-info', 
'setuptools', 'marshmallow_enum-1.5.1.dist-info', 'Pygments-2.14.0.dist-info', 'numpy-1.21.5.dist-info', 'pkg_resources', 'notebook', 'tenacity-8.2.2.dist-info', 'setuptools-57.0.0.dist-info', 'charset_normalizer-2.0.12.dist-info', '_distutils_hack', 'sniffio-1.3.0.dist-info', '_pyrsistent_version.py', 'pyzmq-25.0.1.dist-info', 'fastjsonschema-2.16.3.dist-info',
 'vnpy-3.0.0.dist-info', 'llvmlite-0.39.1.dist-info', 'notebook_shim-0.2.2.dist-info', 'terminado', 'tornado-6.2.dist-info', 'openai_whisper-20230308.dist-info', 'websockets-10.4.dist-info', 'parso', 'pydantic', 'ypy_websocket-0.8.2.dist-info', 'zipp-3.7.0.dist-info', 'QtPy-2.0.1.dist-info', 'mutagen-1.46.0.dist-info', 'webcolors.py', 'y_py-0.5.9.dist-info', 'beautifulsoup4-4.11.2.dist-info', 'anyio-3.6.2.dist-info', 'openai-0.27.2.dist-info', 'typer-0.3.2.dist-info', 
'peewee.py', 'psutil', 'traitlets', 'libfuturize', 'nbconvert-7.2.9.dist-info', 'matplotlib-3.5.1.dist-info', 'mistune-2.0.5.dist-info', 'future', 'typing_inspect.py', 'lxml', 'aiohttp-3.8.4.dist-info', 'typing_inspect-0.8.0.dist-info', 'scipy', 'vnpy_sqlite-1.0.0.dist-info', 'yarl', 'vnpy_ctabacktester', 'functorch', 'vnpy_paperaccount-1.0.1.dist-info', 'zmq', 'packaging-21.3.dist-info', 'yarl-1.8.2.dist-info', 'qtpy', 'vnpy_ctabacktester-1.0.5.dist-info', 
'kiwisolver-1.4.0.dist-info', 'libpasteurize', '_brotli.cpython-310-darwin.so', 
'plotlywidget', 'ipykernel', 'tzlocal', 'aiosignal', '_plotly_future_', 'jedi-0.18.2.dist-info', 
'y_py', 'pandas', 'dateutil', 'commonmark', 'nest_asyncio.py', 'openai', 'typing_extensions.py', 'whisper', 'gpt_index', 'platformdirs', 'llama_index-0.4.28.dist-info', 'jupyterlab_widgets', 'jupyter.py', 'deap-1.3.1.dist-info', 
'psutil-5.9.4.dist-info', 'traitlets-5.9.0.dist-info', 'jsonschema-4.17.3.dist-info', 'jupyter_server_ydoc']

Hopefully there is a solution. Many thanks!
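
One thing worth trying (a guess on my part: this assertion can fail when a previously cached download is stale or corrupted) is clearing tiktoken's on-disk cache and re-running:

```python
import os
import shutil
import tempfile

# tiktoken caches downloaded encoding files here unless TIKTOKEN_CACHE_DIR
# is set; deleting the directory forces a fresh download on next use
cache_dir = os.environ.get("TIKTOKEN_CACHE_DIR") or os.path.join(
    tempfile.gettempdir(), "data-gym-cache"
)
shutil.rmtree(cache_dir, ignore_errors=True)
```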

token count differs from the actual token usage from completions API

tiktoken returns 1942, while the completion API claims only 748.

import tiktoken

a = """
{
"data": {
"attributes": {
"last_dns_records": [
{
"type": "AAAA",
"value": "2a04:4e42::773",
"ttl": 24
},
{
"type": "TXT",
"value": "d1xTs9+kADZZSz3bPphLpkMXXxBGjqn5vsQHhi2M6lo0r8AdIbm6j8LfQXPujsywVgeGSP+AXWX0vO9Iep5cUg==",
"ttl": 300
},
{
"type": "TXT",
"value": "299762315-4422055",
"ttl": 300
},
{
"type": "TXT",
"value": "google-site-verification=_QivaXNjhXy-V1y_YqrycXdAWZi2mVrcwbXerX6THeY",
"ttl": 300
},
{
"type": "TXT",
"value": "764482256-4422025",
"ttl": 300
},
{
"type": "TXT",
"value": "umfe3f9bni2s85tm3m666qbfal.",
"ttl": 300
},
{
"type": "TXT",
"value": "553992719-4400647",
"ttl": 300
},
{
"type": "TXT",
"value": "755973593-4422016",
"ttl": 300
},
{
"rname": "awsdns-hostmaster.amazon.com",
"retry": 900,
"refresh": 7200,
"minimum": 86400,
"value": "ns-47.awsdns-05.com",
"expire": 1209600,
"ttl": 900,
"serial": 1,
"type": "SOA"
},
{
"type": "TXT",
"value": "globalsign-domain-verification=2lI5pahhCu_jg_2RC5GEdolQmAa4K7rhP7_OA-lZBK",
"ttl": 300
},
{
"type": "TXT",
"value": "google-site-verification=R-Btow3Z8oU_9H1IWU4Gm4lvUQ_OVmsfxonIKhIaiPE",
"ttl": 300
},
{
"type": "TXT",
"value": "294913881-4422049",
"ttl": 300
},
{
"type": "TXT",
"value": "882269757-4422010",
"ttl": 300
},
"""

print(len(tiktoken.get_encoding("gpt2").encode(a)))

import openai

completion_api_params = {
# We use temperature of 0.0 because it gives the most predictable, factual answer.
"temperature": 0.0,
"max_tokens": 1,
"model": "text-davinci-003",
}
response = openai.Completion.create(prompt=a, **completion_api_params)
print(f"completion token usage: {response['usage']}")

unknown encoding

Installed with pip install tiktoken on Python 3.10.

import tiktoken
enc = tiktoken.encoding_for_model("text-davinci-003")

The reported error: ValueError: Unknown encoding p50k_base

assert ENCODING_CONSTRUCTORS is not None
59 if encoding_name not in ENCODING_CONSTRUCTORS:
---> 60 raise ValueError(f"Unknown encoding {encoding_name}")

There seems to be a problem with tokenizer installation

I installed tiktoken, but when I try to call it with:

import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

df.n_tokens.hist()

I get the following error:

ModuleNotFoundError Traceback (most recent call last)
Cell In [1], line 1
----> 1 import tiktoken
3 # Load the c1100k base tokenizer which is designed to work with the ada-002 model
4 tokenizer = tiktoken.get_encoding("c1100k_base")
ModuleNotFoundError: No module named 'tiktoken'

Extremely Long Text results in PanicException, which is hard to catch in python code

For some extremely long sequences, the tokenizer can result in a PanicException. Example

import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
text = "^" * 1000000

tokenizer.encode(text)  # this throws a PanicException

The issue is that PanicException is not caught even while catching Exception, and can only be caught by catching a BaseException, which is too broad.

Would it be possible to raise a better exception for such a scenario (maybe something similar to what was done here?)

The workaround that I currently have is catching the BaseException, and checking for "PanicException" in the exception message. Not sure if it is the best way to do this. Would be grateful for any guidance :)
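
One way to package that workaround (a sketch; PanicException comes from pyo3 and subclasses BaseException, so it is matched by name here rather than imported):

```python
def encode_safely(enc, text: str) -> list[int]:
    """Encode text, converting a Rust panic into a catchable ValueError."""
    try:
        return enc.encode(text)
    except BaseException as e:
        # pyo3's PanicException is not an Exception subclass, so a plain
        # `except Exception` never sees it; match it by name instead
        if type(e).__name__ == "PanicException":
            raise ValueError(f"tiktoken failed to encode text: {e}") from e
        raise
```

Callers can then use an ordinary `except ValueError` without also swallowing KeyboardInterrupt and friends.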

Supporting python 3.7 for tiktoken

Hi @hauntsaninja, thank you very much for your work on tiktoken, and for supporting both Python 3.8 and 3.9.

Could you please add support for Python 3.7? Or could you give guidance on whether that is feasible and, if so, how to go about it?

Thank you!

Hugging Face tokenizers

Hi, what is the best way to port tokenizers (e.g. vocabs) from Hugging Face for use within tiktoken? For example, the T5 tokenizer.

tiktoken hangs when getting model encoding.

Running in AWS Lambda, calling tiktoken.encoding_for_model("gpt-3.5-turbo") causes my script to hang. I found that there is some cache, and set the environment variable TIKTOKEN_CACHE_DIR to /tmp/ since Lambdas have a read-only filesystem except for /tmp, but I have had no luck.

API models to .tiktoken mapping?

Hi! Could you confirm if the following mappings are accurate?

code-davinci-002 -> p50k_base.tiktoken
text-*-003 -> p50k_base.tiktoken
text-*-002 -> p50k_base.tiktoken
text-*-001 -> r50k_base.tiktoken
embeddings-*-001 -> r50k_base.tiktoken
embeddings-*-002 -> cl100k_base.tiktoken

NPM package

Would it be possible to add a wasm target and make tiktoken available for Node.js projects?

I'm currently relying on gpt-3-encoder but would prefer to use tiktoken for performance reasons.

Unable to find ENCODING_CONSTRUCTORS

This happens when building tiktoken with Bazel:

enc = tiktoken.get_encoding(encoder)
File "/private/var/tmp/_bazel_dheerajagrawal/795e110180f9443c94e6bab86cf49f84/execroot/main/bazel-out/darwin_arm64-fastbuild/bin/metadata_extraction/functions/server/app.runfiles/main/utils/pypi/tiktoken_0.3.0/site-packages/tiktoken/registry.py", line 56, in get_encoding
_find_constructors()
File "/private/var/tmp/_bazel_dheerajagrawal/795e110180f9443c94e6bab86cf49f84/execroot/main/bazel-out/darwin_arm64-fastbuild/bin/metadata_extraction/functions/server/app.runfiles/main/utils/pypi/tiktoken_0.3.0/site-packages/tiktoken/registry.py", line 36, in _find_constructors
raise ValueError(
ValueError: tiktoken plugin tiktoken_ext.__pycache__ does not define ENCODING_CONSTRUCTORS

Improve packaging metadata for PyPI

The tiktoken package on PyPI could use some metadata to indicate that it is an official OpenAI project because, unlike this repo, the project on PyPI does not mention its link to OpenAI at all.

  • Add the openai user as an author or maintainer
  • Link back to this repo as the project homepage
  • Use the README as a project description

These steps help because phishing and other security issues are a problem on PyPI, and linking back to this OpenAI-controlled repo (which then indicates the proper package on PyPI to install) provides a signal that it's a legit package, at least for those of us who double-check package sources before pip installing :)

Can you make an alternative, pure Python version of tiktoken? (no Rust dependency)

Can you make an alternative, pure Python version of tiktoken, with no Rust dependency? This would help those who cannot compile and run Rust binaries on their system (for various reasons: package manager support, company policy, intranet or local machine security, Docker container limitations, VM restrictions, environment virtualization, lack of Rust support in remotely hosted Jupyter notebooks, etc.). Of course it would be slower than the Rust version, but as a fallback it would allow tiktoken to be used on a much wider set of platforms and in more working environments.

Thanks!

pyinstaller has some bug that results in improper packaging of tiktoken

What could be the fix for this error? I am trying out the library for the first time.

import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [47], in <cell line: 2>()
      1 import tiktoken
----> 2 enc = tiktoken.get_encoding("gpt2")
      3 assert enc.decode(enc.encode("hello world")) == "hello world"

File ~/work/p3ds/lib/python3.10/site-packages/tiktoken/registry.py:60, in get_encoding(encoding_name)
     57     assert ENCODING_CONSTRUCTORS is not None
     59 if encoding_name not in ENCODING_CONSTRUCTORS:
---> 60     raise ValueError(f"Unknown encoding {encoding_name}")
     62 constructor = ENCODING_CONSTRUCTORS[encoding_name]
     63 enc = Encoding(**constructor())

ValueError: Unknown encoding gpt2


Follow-up question on https://github.com/openai/tiktoken/issues/38

Hi @hauntsaninja, it seems that GitHub doesn't give me a way to message you directly with questions, so could I ask follow-up questions on the closed issue #38? Thank you very much. I will clean the issue up afterwards.

Follow-up questions on #38:

I used Hugging Face's GPT2Tokenizer, with the code below:

from transformers import GPT2Tokenizer
enc = GPT2Tokenizer.from_pretrained("gpt2")

Given the minor bugs noted in #34 (23/50275):

a. Could you tell me whether you think the tokenized results should be comparable between tiktoken and Hugging Face's GPT2Tokenizer?

b. If I wanted to fix this, what would you suggest? Should I make the change directly in Hugging Face's GPT-2 vocab.json, or switch to using encoder.json from gpt-2 instead? (I would like to keep using Hugging Face's GPT2Tokenizer because I need Python 3.7.)


Thank you!

request error

blobfile: error message=request failed with exception HTTPSConnectionPool(host='login.microsoftonline.com', port=443): Read timed out. (read timeout=10), request=, status=0, error=None, error_description=None, error_headers=None when executing http request attempt 1, sleeping for 0.1 seconds before retrying

How can I solve this?

how to generate _tiktoken_bg.wasm?

Hi! I wonder how to generate or download _tiktoken_bg.wasm, since I always encounter errors like "chatgpt activate error, not found _tiktken_bg.wasm" while debugging a VS Code extension for ChatGPT. Looking forward to your reply.

Azure blob is not accessible

>>> import tiktoken
>>> tiktoken.get_encoding("gpt2")

Error:

blobfile._common.Error: Encountered an error when requesting an access token: `invalid_request: AADSTS900144: The request body must contain the following parameter: 'client_id'.

Stacktrace

>>> import tiktoken
>>> tiktoken.get_encoding("gpt2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/tiktoken/registry.py", line 61, in get_encoding
    enc = Encoding(**constructor())
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/tiktoken_ext/openai_public.py", line 11, in gpt2
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/tiktoken/load.py", line 58, in data_gym_to_mergeable_bpe_ranks
    vocab_bpe_contents = read_file_cached(vocab_bpe_file).decode()
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/tiktoken/load.py", line 30, in read_file_cached
    with blobfile.BlobFile(blobpath, "rb") as f:
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_ops.py", line 359, in BlobFile
    return default_context.BlobFile(
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_context.py", line 910, in BlobFile
    f = azure.StreamingReadFile(self._conf, path, size=file_size)
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_azure.py", line 1015, in __init__
    st = maybe_stat(conf, path)
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_azure.py", line 1280, in maybe_stat
    resp = execute_api_request(conf, req)
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_azure.py", line 856, in execute_api_request
    return common.execute_request(conf, build_req)
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_common.py", line 504, in execute_request
    req = build_req()
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_azure.py", line 853, in build_req
    req, auth=access_token_manager.get_token(conf, key=(account, container))
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_common.py", line 755, in get_token
    self._tokens[key], self._expirations[key] = self._get_token_fn(
  File "/home/miniconda3/envs/medeina/lib/python3.9/site-packages/blobfile/_azure.py", line 726, in _get_access_token
    raise Error(
blobfile._common.Error: Encountered an error when requesting an access token: `invalid_request: AADSTS900144: The request body must contain the following parameter: 'client_id'.
Trace ID: 83948fe0-79a9-436b-ade1-c22eb4687000
Correlation ID: 84dc9ac6-fbf9-4c1c-87ee-56b866f2058c
Timestamp: 2022-12-20 18:59:07Z`.  You can attempt to fix this by re-running `az login`.

Error while installing in Docker dev environment

I tried to install tiktoken in a Docker dev environment with Python 3.9 using the default approach: pip install tiktoken
But I got the error: Could not build wheels for tiktoken, which is required to install pyproject.toml-based projects
Here is the stack trace:

root ➜ /com.docker.devenvironments.code (master ✗) $ pip install tiktoken
Collecting tiktoken
  Using cached tiktoken-0.3.0.tar.gz (24 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting blobfile>=2
  Using cached blobfile-2.0.1-py3-none-any.whl (73 kB)
Collecting requests>=2.26.0
  Using cached requests-2.28.2-py3-none-any.whl (62 kB)
Collecting regex>=2022.1.18
  Using cached regex-2022.10.31-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (769 kB)
Requirement already satisfied: lxml~=4.9 in /usr/local/lib/python3.9/dist-packages (from blobfile>=2->tiktoken) (4.9.2)
Requirement already satisfied: urllib3<3,>=1.25.3 in /usr/local/lib/python3.9/dist-packages (from blobfile>=2->tiktoken) (1.26.13)
Requirement already satisfied: filelock~=3.0 in /usr/local/lib/python3.9/dist-packages (from blobfile>=2->tiktoken) (3.8.2)
Collecting pycryptodomex~=3.8
  Using cached pycryptodomex-3.17-cp35-abi3-manylinux2014_aarch64.whl (2.1 MB)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.9/dist-packages (from requests>=2.26.0->tiktoken) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests>=2.26.0->tiktoken) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests>=2.26.0->tiktoken) (2022.9.24)
Building wheels for collected packages: tiktoken
  Building wheel for tiktoken (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for tiktoken (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [37 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-aarch64-cpython-39
      creating build/lib.linux-aarch64-cpython-39/tiktoken
      copying tiktoken/registry.py -> build/lib.linux-aarch64-cpython-39/tiktoken
      copying tiktoken/__init__.py -> build/lib.linux-aarch64-cpython-39/tiktoken
      copying tiktoken/model.py -> build/lib.linux-aarch64-cpython-39/tiktoken
      copying tiktoken/core.py -> build/lib.linux-aarch64-cpython-39/tiktoken
      copying tiktoken/load.py -> build/lib.linux-aarch64-cpython-39/tiktoken
      creating build/lib.linux-aarch64-cpython-39/tiktoken_ext
      copying tiktoken_ext/openai_public.py -> build/lib.linux-aarch64-cpython-39/tiktoken_ext
      running egg_info
      writing tiktoken.egg-info/PKG-INFO
      writing dependency_links to tiktoken.egg-info/dependency_links.txt
      writing requirements to tiktoken.egg-info/requires.txt
      writing top-level names to tiktoken.egg-info/top_level.txt
      reading manifest file 'tiktoken.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no files found matching 'Makefile'
      adding license file 'LICENSE'
      writing manifest file 'tiktoken.egg-info/SOURCES.txt'
      copying tiktoken/py.typed -> build/lib.linux-aarch64-cpython-39/tiktoken
      running build_ext
      running build_rust
      error: can't find Rust compiler
      
      If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
      
      To update pip, run:
      
          pip install --upgrade pip
      
      and then retry package installation.
      
      If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tiktoken
Failed to build tiktoken
ERROR: Could not build wheels for tiktoken, which is required to install pyproject.toml-based projects
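The error message's own suggestion can be followed inside the dev-environment image. A minimal sketch of a Dockerfile layer that installs a Rust toolchain before building tiktoken from source (assuming a Debian-based Python 3.9 image; package names and paths may differ):

```dockerfile
# Sketch, not a tested Dockerfile: install build tools and rustup, then retry.
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl build-essential && \
    curl https://sh.rustup.rs -sSf | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
# Upgrading pip first may also let it find a prebuilt wheel instead.
RUN pip install --upgrade pip && pip install tiktoken
```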

Unsound transmute in lib

Just a small note, but the repr(Rust) memory layout does not guarantee that two similarly defined structs are laid out the exact same way in memory. Thus, transmutes between repr(Rust) types are UB.

For reference, please see this nomicon article here:
https://doc.rust-lang.org/nomicon/repr-rust.html

Rust does guarantee that two instances of A have their data laid out in exactly the same way. However Rust does not currently guarantee that an instance of A has the same field ordering or padding as an instance of B.

The struct in question requires repr(transparent) or repr(C) to be sound; however, ThreadId has neither, so adding one only to FakeThreadId will do little good.

std::mem::transmute::<std::thread::ThreadId, FakeThreadId>(thread::current().id()).0
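As an illustrative sketch (mine, not from the issue): with repr(C) the field layout is pinned down by the ABI, so a transmute between two identically declared structs is sound; drop the attributes (default repr(Rust)) and the compiler is free to lay the two types out differently, making the same transmute undefined behaviour even if it happens to "work" today.

```rust
// Both types are repr(C) with identical declarations, so this particular
// transmute is sound. With the default repr(Rust) it would be UB.
#[repr(C)]
struct A(u64);

#[repr(C)]
struct B(u64);

fn main() {
    let a = A(42);
    let b: B = unsafe { std::mem::transmute::<A, B>(a) };
    assert_eq!(b.0, 42);
    println!("{}", b.0);
}
```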

Does tiktoken support CLIP's tokenizer?

Hi! CLIP currently uses an ad-hoc tokenizer. I was wondering if tiktoken supports the codec that CLIP uses, or if, by coincidence, CLIP's tokenizer is equivalent to, e.g., GPT-2's.

Multithreading (Python) vs Parallelism (Rust)

Hi,
I just tried to test your comment on why you went with the Python threading option, and indeed I got results similar to yours.

------------------------------------------------------------------------------ benchmark: 2 tests ------------------------------------------------------------------------------
Name (time in s)         Min               Max              Mean            StdDev            Median               IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_python_batch     1.9514 (1.0)      2.3618 (1.0)      2.1631 (1.0)      0.1870 (1.0)      2.1691 (1.0)      0.3524 (1.29)          2;0  0.4623 (1.0)           5           1
test_rust_batch       2.5516 (1.31)     3.4357 (1.45)     2.7726 (1.28)     0.3729 (1.99)     2.6189 (1.21)     0.2722 (1.0)           1;1  0.3607 (0.78)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Obtained with pytest-benchmark plugin

Fork: https://github.com/mert-kurttutan/tiktoken/tree/main

I am wondering why there is no performance gain with Rust in comparison to Python. Is this expected, given that each encoding operation is CPU-heavy? Did you do any analysis/profiling of this issue?

Tiktoken not published to cargo

It seems that the tiktoken package is not linkable from Rust using Cargo's default registry.

Are there plans to publish the tiktoken crate? Is it published on another registry?

Thanks for your work on this BPE encoder, I've already found it very useful!


Repro:

In a Rust project, run

cargo add tiktoken

Expected behavior:

Cargo should find, download and add tiktoken to the available crates

Actual behavior:

$ cargo add tiktoken
    Updating crates.io index
error: the crate `tiktoken` could not be found in registry index.

Public .tiktoken files available?

The files listed on https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py

e.g.
az://openaipublic/gpt-2/encodings/main/encoder.json
az://openaipublic/gpt-2/encodings/main/vocab.bpe
az://openaipublic/encodings/cl100k_base.tiktoken
az://openaipublic/encodings/r50k_base.tiktoken

Are these publicly available somewhere? I'm not familiar with Azure cloud; I ended up trying AzCopy, but had no luck.

I'm starting to wonder whether these files are actually public, since, going through the repo, it seems they are loaded from the local filesystem.

What is the recommended way to get these files?

Really amazing work, and very happy to see it as an open-source contribution with excellent explanations throughout the comments in the code!
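For what it's worth, az:// paths like these are Azure Blob Storage paths, and public blobs are typically also reachable over plain HTTPS at https://<account>.blob.core.windows.net/<path>. A hedged sketch of that mapping (the az_to_https helper is mine, not part of tiktoken or blobfile, and whether a given blob is anonymously readable is an assumption):

```python
def az_to_https(az_path: str) -> str:
    """Map an az://<account>/<path> blob path to its HTTPS form."""
    assert az_path.startswith("az://")
    account, _, rest = az_path[len("az://"):].partition("/")
    return f"https://{account}.blob.core.windows.net/{rest}"

print(az_to_https("az://openaipublic/encodings/cl100k_base.tiktoken"))
# https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
```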

Performance ideas

I made a toy GPT2 tokenizer as a python rust extension. It seems to be slightly faster than tiktoken in my tests. It looks like #31 may get most or all the way there, but I thought I'd post the results from this script:

import os
import time
from typing import Any, cast

import numpy as np
import tiktoken

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

def benchmark_batch(documents: list[bytes]) -> None:
    num_threads = 1
    num_bytes = sum(map(len, documents))
    print(f"num_threads: {num_threads}, num_bytes: {num_bytes}")
    documents_decoded = [doc.decode("utf8") for doc in documents]

    enc = tiktoken.get_encoding("gpt2")
    enc.encode("warmup")

    start = time.perf_counter_ns()
    tiktoken_output = enc.encode_ordinary_batch(documents_decoded, num_threads=num_threads)
    end = time.perf_counter_ns()
    print(f"tiktoken \t{num_bytes / (end - start) * 1e9} bytes / s")

    import transformers

    hf_enc = cast(Any, transformers).GPT2TokenizerFast.from_pretrained("gpt2")
    hf_enc.model_max_length = 1e30  # silence!
    hf_enc.encode("warmup")

    start = time.perf_counter_ns()
    hf_enc_output = hf_enc(documents_decoded)
    end = time.perf_counter_ns()
    print(f"huggingface \t{num_bytes / (end - start) * 1e9} bytes / s")

    import csh_bpe.codec
    csh_bpe_enc = csh_bpe.codec.RustGPTCodec(word_encoder_kind="bigram", doc_splitter_kind="direct")
    csh_bpe_enc.encode(np.frombuffer(b"warmup", dtype=np.uint8))
    
    start = time.perf_counter_ns()
    csh_bpe_output = csh_bpe_enc.encode(np.frombuffer(documents[0], dtype=np.uint8))
    end = time.perf_counter_ns()
    print(f"csh_bpe \t{num_bytes / (end - start) * 1e9} bytes / s")

    assert hf_enc_output["input_ids"][0] == tiktoken_output[0]
    assert csh_bpe_output.tolist() == tiktoken_output[0]


def main():
    with open(os.path.join(SCRIPT_DIR, "..", "local-data", "64MB.txt"), "rb") as f:
        contents = f.read()
    benchmark_batch([contents])


if __name__ == "__main__":
    main()

The text is 64MiB of wikipedia wikitext, probably enwik8, but I just found it on my hard drive.

python -m csh_bpe.compare_tiktoken
num_threads: 1, num_bytes: 67108864
tiktoken        6004366.360373783 bytes / s
huggingface     1120214.7857500792 bytes / s
csh_bpe         17070974.6114367 bytes / s

There are no fancy optimizations here (like SIMD stuff), the library has a few things it might do differently from tiktoken:

  1. The word splitting regular expression is implemented using rust code instead of a regexp library. It uses Go's unicode tables: https://github.com/golang/go/blob/19309779ac5e2f5a2fd3cbb34421dafb2855ac21/src/unicode/tables.go and this seems to produce the same output at least for this 64MB file. The splitting is done with a function that takes a u8 numpy array and start offset and returns the end offset.
  2. The bigram encoder takes a u8 slice for the word, a HashMap<(i32, i32), i32> mergelist, an i32 slice mapping bytes to tokens (used to populate the initial output), and a mutable i32 slice of output tokens. It keeps a list of skip lengths for each index of the output tokens (initially all 1s), which it updates whenever it merges two tokens together, then compacts the output tokens when it is done.
  3. (I think tiktoken does this) after splitting, before encoding a word, it will check the vocab hashmap to see if the word is already a single token.
  4. The interface uses numpy arrays instead of bytes, and the output array is provided as one of the inputs so the caller can manage more memory allocations (not sure if this has any performance impact)

I didn't implement rust regexps so I don't know if the word splitting matters, though I could benchmark just the splitting part.
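The skip-length bookkeeping described in point 2 can be sketched in pure Python (my illustration of the idea, not the csh_bpe code; merges maps a token pair to its merged id, and lower merged ids are assumed to merge first, as in GPT-2-style rank ordering):

```python
def bpe_merge(ids: list[int], merges: dict[tuple[int, int], int]) -> list[int]:
    toks = list(ids)
    skip = [1] * len(toks)  # skip[i] = length of the span headed at index i

    while True:
        # find the highest-priority (lowest merged id) adjacent pair of spans
        best_rank, best_i = None, None
        i = 0
        while i < len(toks):
            j = i + skip[i]  # head of the next live span
            if j < len(toks):
                r = merges.get((toks[i], toks[j]))
                if r is not None and (best_rank is None or r < best_rank):
                    best_rank, best_i = r, i
            i = j
        if best_i is None:
            break
        # merge in place: widen the left span instead of shifting the array
        j = best_i + skip[best_i]
        toks[best_i] = merges[(toks[best_i], toks[j])]
        skip[best_i] += skip[j]

    # compact once at the end: keep only the heads of live spans
    out, i = [], 0
    while i < len(toks):
        out.append(toks[i])
        i += skip[i]
    return out
```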

Invalid cross-device link error

I receive the following error:

Invalid cross-device link: '/tmp/tmpz830uswg/token.json' -> '/home/stromae/.blobfile/azure_access_tokens.json'

using

def limit_to_3500_tokens(transcript):
    # limit transcript to 3500 tokens using tiktoken
    enc = tiktoken.get_encoding('gpt2')
    # grow the head in 100-character steps until it would exceed 3500 tokens
    limit = 0
    while limit < len(transcript):
        if len(enc.encode(transcript[:limit + 100])) > 3500:
            break
        limit += 100
    return transcript[:limit]

some tokens begin with extra space

23 of the tiktoken tokens differ from those of GPT2TokenizerFast, in that they start with an extra space ' '.
Why is it like that, and how can I remove that ' ' from these tokens?

This is the code:

import tiktoken
from transformers import GPT2TokenizerFast

tik = tiktoken.get_encoding("gpt2")
hug2 = GPT2TokenizerFast.from_pretrained("gpt2")

print(hug2.vocab_size)    # 50257
print(len(tik.token_byte_values()))    # 50256

count = 0
for i in range(50257):
    if tik.decode([i]) != hug2.decode([i]):
        count += 1
        print(i, tik.decode([i]), hug2.decode([i]))
print(count)

output:

764  . .
837  , ,
2644  ... ...
5145  ! !
5633  ? ?
11485  .. ..
14512  != !=
19153  ?? ??
19424  .... ....
20004  ........ ........
22135  ." ."
24457  ./ ./
34913  ??? ???
35713  ..." ..."
37867  !! !!
39864  .......... ..........
41349  ?) ?)
42911  ," ,"
44713  ................ ................
44912  .............. ..............
46328  .) .)
47082  ...... ......
47540  ._ ._
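A hedged guess at what is happening (my sketch, not verified against the issue): transformers applies a "clean up tokenization spaces" step on decode, on by default via the clean_up_tokenization_spaces parameter, which strips the space before common punctuation; tiktoken decodes the raw bytes unchanged. The replacement list below approximates that cleanup step:

```python
def clean_up_tokenization(text: str) -> str:
    # approximates the cleanup transformers applies after decoding
    for src, dst in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
                     (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
                     (" 's", "'s"), (" 've", "'ve"), (" 're", "'re")]:
        text = text.replace(src, dst)
    return text

print(repr(clean_up_tokenization(" .")))   # '.'
print(repr(clean_up_tokenization(" !=")))  # '!='
```

If this is the cause, passing clean_up_tokenization_spaces=False to hug2.decode should make the two outputs match.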

wheels for ARM64 Linux

When attempting to install tiktoken in the python:3.11-slim Docker container running in Docker for Mac, I get an error about a missing Rust compiler (error: can't find Rust compiler...).

I'm not familiar with the Rust toolchain, but I believe it should be possible to compile for aarch64 by adding the following to the build setup in pyproject.toml:

rustup target add aarch64-unknown-linux-gnu

As a workaround, I'm just building from source, but it'd be super nice to have pre-built wheels.

My workaround:

export PATH="/root/.cargo/bin:$PATH"
apt-get update && apt-get install -y --no-install-recommends curl gcc &&  curl https://sh.rustup.rs -sSf | sh -s -- -y && apt-get install --reinstall libc6-dev -y
