
gpt-3-encoder's People

Contributors

andrew-healey, caleb-artifact, invadersquibs, kafkas, michaelsharpe, nickwalton, schnerd


gpt-3-encoder's Issues

Some Updates RFC

Hi, I have spent a little time adding to this library in my fork. We simply need to change the URLs back and pull to update these repos.

If anyone is looking for some newer stuff, so to speak, feel free to open an issue. I think I'll basically leave it here; the goal is to make a 1.4.2 release and have that be final as far as what is needed for this component.

The major things added are:

  • countTokens: a faster way to count tokens if you don't care about the token values themselves (see the sketch after this list)

  • tokenStats: some interesting insights into encoded strings, such as frequency and position maps

  • JSDoc comments: implementation and interface docs

  • browserify build: browser support
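
For context, a minimal sketch of what countTokens could look like, assuming the package's existing encode() export; the fork's actual implementation may differ (for example, by skipping the decode tables entirely):

const { encode } = require('gpt-3-encoder')

// Hypothetical countTokens: when only the count matters, the encoded array
// can be discarded immediately instead of being kept around by the caller.
function countTokens(text) {
  return encode(text).length
}

console.log(countTokens('hello world')) // prints the token count, not the tokens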

The major thing to test is whether we can get a good working version of the original Python implementation. I added it to the npm package in case it just works, but it would be good to have the package support Node, the browser, and Python.

Also check out the JSDocs and the browser demo.


Compatible with Node >= 12

Overall, just let me know what you think.

Input with emojis tokenizes differently than the Python implementation

Input string: hello 👋 world 🌍

Python

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "hello 👋 world 🌍"
encoded = tokenizer.encode(prompt)

print(f"Count: {len(encoded)}") # 7
print(f"Decoded: {tokenizer.decode(encoded)}") # hello ๐Ÿ‘‹ world ๐ŸŒ

GPT-3-Encoder

const {encode, decode} = require('./encoder.js')

const str = 'hello 👋 world 🌍'
const encoded = encode(str)

console.log('Count: ', encoded.length); // 4
console.log('Decoded: ', decode(encoded)); // hello  world

Just wanted to document the issue, no immediate fix expected. Maybe someone from the community will feel the urge to submit a PR if you all don't get around to it.
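
A hedged illustration of the likely root cause: GPT-2's tokenizer is byte-level, so an emoji is four UTF-8 bytes that must each pass through the byte-to-unicode table before the BPE merges run. A splitter that drops characters its regex doesn't match loses those bytes entirely, which is consistent with the count of 4 and the missing emoji above:

// 👋 is one visible character but four UTF-8 bytes; a byte-level BPE must
// see all four, while a character-level lookup finds no match at all.
const bytes = Array.from(new TextEncoder().encode('👋'))
console.log(bytes) // [240, 159, 145, 139]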

It gets stuck when encoding a large amount of text.

I tried to encode the text of a large PDF, about 11 MB in size. The text must be more than 100k tokens. But gpt-3-encoder fails to process this amount of text without throwing any error; the program is stuck forever on this line:
const encoded = encode(textOfDocument);

How can I solve this?
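
One possible workaround while the underlying performance issue is open, sketched here under the assumption that slightly inaccurate counts at chunk boundaries are acceptable (a token can split differently where the text is cut):

const { encode } = require('gpt-3-encoder')

// Encode the document in fixed-size slices and sum the counts; each slice
// stays small enough for bpe() to finish in reasonable time.
function countTokensChunked(text, chunkSize = 10000) {
  let total = 0
  for (let i = 0; i < text.length; i += chunkSize) {
    total += encode(text.slice(i, i + chunkSize)).length
  }
  return total
}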

Is there an alternative that doesn't use Node.js's file system module?

Hi everyone, this is currently the de facto way to encode and decode strings in JavaScript. But I cannot import this Node library in a Chrome extension because it relies on the fs module:

Uncaught TypeError: fs.readFileSync is not a function
    at ./node_modules/gpt-3-encoder/Encoder.js (Encoder.js:5:1)
    at __webpack_require__ (bootstrap:19:1)
    at ./node_modules/gpt-3-encoder/index.js (index.js:1:28)
    at __webpack_require__ (bootstrap:19:1)
    at make namespace object:7:1
    at popup.js:22:2
    at popup.js:22:2

This is problematic since a great use case for this library is counting token usage before sending text to ChatGPT. Is there any way this library can be made not to rely on fs, so that it can be imported in a Chrome-extension-like setting?
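
A hedged sketch of one direction: load the two data files the package ships (encoder.json and vocab.bpe) as static assets via fetch instead of fs.readFileSync. The loader below is hypothetical; the library itself would still need a way to accept the data instead of reading it from disk:

// Hypothetical browser-side loader: serve encoder.json and vocab.bpe as
// extension/static assets and fetch them at startup.
async function loadEncoderData() {
  const [encoder, bpeText] = await Promise.all([
    fetch('encoder.json').then(r => r.json()),
    fetch('vocab.bpe').then(r => r.text()),
  ])
  return { encoder, bpeText }
}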

New release

Can we please get a new release incorporating the recent changes?

Unusable and does not match token output from GPT-3

When a character like “ is used, it gives back faulty output, as shown below.

encode('“wrote jack a letter”');

[null, 222, 250, 42910, 14509, 257, 3850, null, 222, 251]

Whereas OpenAI gives the output as:

[447, 250, 42910, 14509, 257, 3850, 447, 251]

This can be triggered by other characters like █ and many more.
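
A hedged check pointing at the same byte-level root cause as the emoji issue above: “ is three UTF-8 bytes, and the reference tokenizer merges those bytes into [447, 250]; a character-level lookup finds nothing and yields null:

console.log(Array.from(new TextEncoder().encode('“'))) // [226, 128, 156]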

4.0 api

I am interested in acquiring a ChatGPT 4.0 API account. I understand the value of such an account and am willing to pay a premium for it. If anyone has an account that they are willing to sell, please feel free to reach out to me. I assure you a fair deal. Thank you

TypeError: bpe(...).split is not a function

I am getting an error from the encode function. I am using it in this way: encode("Some string text").length. For seemingly any value I get the error TypeError: bpe(...).split is not a function from this line: const new_tokens = bpe(token).split(' ').map(x => encoder[x]).

Note: I have also tried some non-string values, like [], 112, and ['']. For these values a different error occurs, TypeError: text.matchAll is not a function, from this line: const matches = Array.from(text.matchAll(pat)).map(x => x[0]).

Please fix these issues and release a new version.
Thanks in advance!
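
As a stopgap for the second error, a minimal guard sketch; it rejects non-string inputs before matchAll is reached, but does not address the underlying bpe(...).split bug for string inputs:

const { encode } = require('gpt-3-encoder')

// encode() assumes a string input; fail fast with a clear message otherwise.
function safeTokenCount(input) {
  if (typeof input !== 'string') {
    throw new TypeError(`encode() expects a string, got ${typeof input}`)
  }
  return encode(input).length
}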

UTF-8 issue when decoding back

Using the official example from the README, but with UTF-8 quotes:

Encoded this string looks like:  [
   1212, 318,   281, 1672,
    564, 250, 34086,  594,
    447, 250,   284, 1949,
  21004, 503,   319,    0
]
We can look at each token and what it represents
{ token: 1212, string: 'This' }
{ token: 318, string: ' is' }
{ token: 281, string: ' an' }
{ token: 1672, string: ' example' }
{ token: 564, string: ' �' }
{ token: 250, string: '�' }
{ token: 34086, string: 'sent' }
{ token: 594, string: 'ence' }
{ token: 447, string: '�' }
{ token: 250, string: '�' }
{ token: 284, string: ' to' }
{ token: 1949, string: ' try' }
{ token: 21004, string: ' encoding' }
{ token: 503, string: ' out' }
{ token: 319, string: ' on' }
{ token: 0, string: '!' }
We can decode it back into:
This is an example “sentence“ to try encoding out on!

You can see that when we decode it back token by token, it won't decode correctly, but it's OK when you decode the entire array back.
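
A hedged sketch of why per-token decoding shows �: the three UTF-8 bytes of “ are split across two tokens, so each token alone is an incomplete byte sequence. A TextDecoder in streaming mode carries partial bytes across token boundaries; the tokenBytes helper below is hypothetical, since gpt-3-encoder does not expose per-token raw bytes:

const decoder = new TextDecoder('utf-8')
let text = ''
for (const token of encoded) {
  // stream: true keeps an incomplete multi-byte sequence buffered until the
  // remaining bytes arrive with the next token
  text += decoder.decode(tokenBytes(token), { stream: true }) // tokenBytes is hypothetical
}
text += decoder.decode() // flush any trailing partial sequence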

Possible bug with BPE RegEx

I want to start off by saying thanks so much for providing this library and doing all the hard work of converting the GPT-2 byte pair encoder from Python to JavaScript.

When going through the code and comparing it to the original GPT-2 implementation, I noticed a difference between the regex used in this repository and the original. The original GPT-2 regex contains the substring |'ll|:

https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53

However, in this repo there is a space between the l's:

r"""'s|'t|'re|'ve|'m|'l l|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

It also exists in the JavaScript version:

const pat = /'s|'t|'re|'ve|'m|'l l|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu

We also looked at how Hugging Face has implemented it, and theirs also does not contain a space:

https://github.com/huggingface/transformers/blob/ac588594e29f39a235433f00108d1416fb7c7fe5/src/transformers/models/gpt2/tokenization_gpt2.py#L193

If the space is intentional I'd be very curious to understand why it exists :)

Thanks!
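
For reference, the pattern with the space removed, matching the GPT-2 and Hugging Face references quoted above (assuming the space is indeed unintentional):

// Corrected pattern: 'll joined, everything else unchanged.
const pat = /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu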

"TextEncoder is not defined" error when running tests in Node 12.22.0

Package installed

"dependencies": {
    ...
    "gpt-3-encoder": "^1.1.3",
    ...
}

When run:

    ReferenceError: TextEncoder is not defined 
      at Object.<anonymous> (node_modules/gpt-3-encoder/Encoder.js:21:21)

Which references: https://github.com/latitudegames/GPT-3-Encoder/blob/master/Encoder.js#L21

Suggestion: potentially allow passing in a TextEncoder, or switch to the Node.js recommendation:

const util = require('util');
const encoder = new util.TextEncoder();
const uint8Array = encoder.encode('Hello');
console.log(uint8Array);
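
A hedged variant of that suggestion that works both where TextEncoder is a global (browsers, modern Node) and where only util exposes it (some test environments, such as jsdom under Jest, strip the global):

// Prefer the global when available, fall back to util.TextEncoder otherwise.
const TextEncoderImpl = typeof TextEncoder !== 'undefined'
  ? TextEncoder
  : require('util').TextEncoder
const textEncoder = new TextEncoderImpl()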

encoder.json does not exist in chunks directory.

There is an error during building in Next.js applications:

info  - Collecting page data ...Error: ENOENT: no such file or directory, open '/home/hh/backre-com/.next/server/chunks/encoder.json'

I tried Node.js 16 and 18; both result in the same error.
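
One commonly suggested direction, sketched here without having been verified against this exact setup: copy the package's data files next to the server chunks so the runtime readFileSync finds them. This assumes webpack 5 and copy-webpack-plugin; the paths are illustrative and may need adjusting to your build layout:

// next.config.js (hypothetical workaround sketch)
const CopyPlugin = require('copy-webpack-plugin')

module.exports = {
  webpack: (config, { isServer }) => {
    if (isServer) {
      config.plugins.push(new CopyPlugin({
        patterns: [
          { from: 'node_modules/gpt-3-encoder/encoder.json', to: 'chunks/' },
          { from: 'node_modules/gpt-3-encoder/vocab.bpe', to: 'chunks/' },
        ],
      }))
    }
    return config
  },
}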

The cache in bpe() may occupy a large amount of memory after running for a long time.

I use a large amount of Chinese text with the GPT service, and the Chinese phrases occupy a significant amount of memory in the cache.

After running for one day, it occupies more than 1 GB of memory, which for a moment made me think there was a memory leak in my own code.

GPT-3-Encoder/Encoder.js

Lines 87 to 153 in 9df47fc

function bpe(token) {
  if (cache.has(token)) {
    return cache.get(token)
  }
  let word = token.split('')
  let pairs = get_pairs(word)
  if (!pairs) {
    return token
  }
  while (true) {
    const minPairs = {}
    Array.from(pairs).map(pair => {
      const rank = bpe_ranks[pair]
      minPairs[(isNaN(rank) ? 10e10 : rank)] = pair
    })
    const bigram = minPairs[Math.min(...Object.keys(minPairs).map(x => {
      return parseInt(x)
    }))]
    if (!(bigram in bpe_ranks)) {
      break
    }
    const first = bigram[0]
    const second = bigram[1]
    let new_word = []
    let i = 0
    while (i < word.length) {
      const j = word.indexOf(first, i)
      if (j === -1) {
        new_word = new_word.concat(word.slice(i))
        break
      }
      new_word = new_word.concat(word.slice(i, j))
      i = j
      if (word[i] === first && i < word.length - 1 && word[i + 1] === second) {
        new_word.push(first + second)
        i = i + 2
      } else {
        new_word.push(word[i])
        i = i + 1
      }
    }
    word = new_word
    if (word.length === 1) {
      break
    } else {
      pairs = get_pairs(word)
    }
  }
  word = word.join(' ')
  cache.set(token, word)
  return word
}
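
A hedged mitigation sketch: bound the cache instead of letting it grow without limit. A Map iterates keys in insertion order, so deleting the first key gives simple FIFO eviction (a true LRU would also re-insert entries on reads):

const MAX_CACHE_SIZE = 50000 // illustrative limit; tune to your memory budget

function cacheSet(token, word) {
  if (cache.size >= MAX_CACHE_SIZE) {
    // evict the oldest entry; Map.keys() yields keys in insertion order
    cache.delete(cache.keys().next().value)
  }
  cache.set(token, word)
}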

Tokenization Count based on Model

Thank you for this library and your blog posts; I really appreciate them for learning more about prompt programming GPT-3.
I assume that if I want to count the number of tokens for a prompt, I would run:

const str = 'This is an example sentence to try encoding out on!'
const encoded = encode(str)
const tokenCount = encoded.length

I assume this is the tokenization algorithm used for the davinci models, is that correct?
Do you happen to know where to find a tokenizer for the other models, or a general way to predict token usage before submitting a prompt?

Thanks for your feedback!
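
A hedged note on the question above: this package implements the GPT-2/GPT-3 byte pair encoding, which matches the original davinci family; newer models (e.g. gpt-3.5-turbo, gpt-4) use different encodings. One option, assuming the js-tiktoken package and its encodingForModel export, is:

const { encodingForModel } = require('js-tiktoken')

// Pick the encoding by model name rather than hard-coding the GPT-2 BPE.
const enc = encodingForModel('gpt-3.5-turbo')
console.log(enc.encode('This is an example sentence to try encoding out on!').length)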
