
gpt-3-encoder's People

Contributors

andrew-healey, caleb-artifact, invadersquibs, kafkas, michaelsharpe, nickwalton, schnerd


gpt-3-encoder's Issues

Some Updates RFC

Hi, I have spent a little time adding to this library in my fork. We simply need to change the URLs back and pull to update these repos.

If anyone is looking for some newer stuff, so to speak, feel free to open an issue. I think I'll basically leave it here; the goal is to make a 1.4.2 release and have that be final as far as what is needed for this component.

The major things added are:

  • countTokens: a faster way to count tokens if you don't care about the token values themselves (see the sketch after this list)

  • tokenStats: some interesting insights into encoded strings, such as frequency and position maps

  • JSDoc comments: implementation and interface docs

  • browserify build: browser support
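
For context, a minimal sketch of what countTokens could look like, assuming the package's existing encode() export; the fork's actual implementation may differ (for example, by skipping the decode tables entirely):

const { encode } = require('gpt-3-encoder')

// Hypothetical countTokens: when only the count matters, the encoded array
// can be discarded immediately instead of being kept around by the caller.
function countTokens(text) {
  return encode(text).length
}

console.log(countTokens('hello world')) // prints the token count, not the tokens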

The major thing to test is whether we can get a good working version of the original Python implementation. I added it to the npm package in case it just works, but it would be good to have the package support Node, the browser, and Python.

Also check out the JSDocs and the browser demo.


Compatible with Node >= 12

Overall, just let me know what you think.

Input with emojis tokenizes differently than the Python implementation

Input string: hello 👋 world 🌍

Python

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "hello 👋 world 🌍"
encoded = tokenizer.encode(prompt)

print(f"Count: {len(encoded)}") # 7
print(f"Decoded: {tokenizer.decode(encoded)}") # hello ๐Ÿ‘‹ world ๐ŸŒ

GPT-3-Encoder

const {encode, decode} = require('./encoder.js')

const str = 'hello 👋 world 🌍'
const encoded = encode(str)

console.log('Count: ', encoded.length); // 4
console.log('Decoded: ', decode(encoded)); // hello  world

Just wanted to document the issue, no immediate fix expected. Maybe someone from the community will feel the urge to submit a PR if you all don't get around to it.
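
A hedged illustration of the likely root cause: GPT-2's tokenizer is byte-level, so an emoji is four UTF-8 bytes that must each pass through the byte-to-unicode table before the BPE merges run. A splitter that drops characters its regex doesn't match loses those bytes entirely, which is consistent with the count of 4 and the missing emoji above:

// 👋 is one visible character but four UTF-8 bytes; a byte-level BPE must
// see all four, while a character-level lookup finds no match at all.
const bytes = Array.from(new TextEncoder().encode('👋'))
console.log(bytes) // [240, 159, 145, 139]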

It gets stuck when encoding a large amount of text.

I tried to encode the text of a large PDF, about 11 MB in size. The text must be more than 100k tokens. But gpt-3-encoder fails to process this amount of text without throwing any error; the program is stuck forever on this line:
const encoded = encode(textOfDocument);

How can I solve this?
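
One possible workaround while the underlying performance issue is open, sketched here under the assumption that slightly inaccurate counts at chunk boundaries are acceptable (a token can split differently where the text is cut):

const { encode } = require('gpt-3-encoder')

// Encode the document in fixed-size slices and sum the counts; each slice
// stays small enough for bpe() to finish in reasonable time.
function countTokensChunked(text, chunkSize = 10000) {
  let total = 0
  for (let i = 0; i < text.length; i += chunkSize) {
    total += encode(text.slice(i, i + chunkSize)).length
  }
  return total
}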

Is there an alternative that doesn't use Node.js's file system module?

Hi everyone, this is currently the de facto way to encode and decode strings in JavaScript. But I cannot import this Node library in a Chrome extension because it relies on the fs module:

Uncaught TypeError: fs.readFileSync is not a function
    at ./node_modules/gpt-3-encoder/Encoder.js (Encoder.js:5:1)
    at __webpack_require__ (bootstrap:19:1)
    at ./node_modules/gpt-3-encoder/index.js (index.js:1:28)
    at __webpack_require__ (bootstrap:19:1)
    at make namespace object:7:1
    at popup.js:22:2
    at popup.js:22:2

This is problematic since a great use case for this library is counting token usage before sending text to ChatGPT. Is there any way this library can be made not to rely on fs, so that it can be imported in a Chrome-extension-like setting?
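
A hedged sketch of one direction: load the two data files the package ships (encoder.json and vocab.bpe) as static assets via fetch instead of fs.readFileSync. The loader below is hypothetical; the library itself would still need a way to accept the data instead of reading it from disk:

// Hypothetical browser-side loader: serve encoder.json and vocab.bpe as
// extension/static assets and fetch them at startup.
async function loadEncoderData() {
  const [encoder, bpeText] = await Promise.all([
    fetch('encoder.json').then(r => r.json()),
    fetch('vocab.bpe').then(r => r.text()),
  ])
  return { encoder, bpeText }
}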

New release

Can we please get a new release incorporating the recent changes?

Unusable and does not match token output from GPT-3

When a character like “ is used, it gives back faulty output, as shown below.

encode('“wrote jack a letter”');

[null, 222, 250, 42910, 14509, 257, 3850, null, 222, 251]

Whereas OpenAI gives the output as:

[447, 250, 42910, 14509, 257, 3850, 447, 251]

This can be triggered by other characters like █ and many more.
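
A hedged check pointing at the same byte-level root cause as the emoji issue above: “ is three UTF-8 bytes, and the reference tokenizer merges those bytes into [447, 250]; a character-level lookup finds nothing and yields null:

console.log(Array.from(new TextEncoder().encode('“'))) // [226, 128, 156]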

4.0 api

I am interested in acquiring a ChatGPT 4.0 API account. I understand the value of such an account and am willing to pay a premium for it. If anyone has an account that they are willing to sell, please feel free to reach out to me. I assure you a fair deal. Thank you

TypeError: bpe(...).split is not a function

I am getting an error from the encode function. I am using it in this way: encode("Some string text").length. For seemingly any value I get the error TypeError: bpe(...).split is not a function from this line: const new_tokens = bpe(token).split(' ').map(x => encoder[x]).

Note: I have also tried some non-string values, like [], 112, and ['']. For these values a different error occurs, TypeError: text.matchAll is not a function, from this line: const matches = Array.from(text.matchAll(pat)).map(x => x[0]).

Please fix these issues and release a new version.
Thanks in advance!
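
As a stopgap for the second error, a minimal guard sketch; it rejects non-string inputs before matchAll is reached, but does not address the underlying bpe(...).split bug for string inputs:

const { encode } = require('gpt-3-encoder')

// encode() assumes a string input; fail fast with a clear message otherwise.
function safeTokenCount(input) {
  if (typeof input !== 'string') {
    throw new TypeError(`encode() expects a string, got ${typeof input}`)
  }
  return encode(input).length
}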

UTF-8 issue when decoding back

Using the official example from the README, but with UTF-8 quotes:

Encoded this string looks like:  [
   1212, 318,   281, 1672,
    564, 250, 34086,  594,
    447, 250,   284, 1949,
  21004, 503,   319,    0
]
We can look at each token and what it represents
{ token: 1212, string: 'This' }
{ token: 318, string: ' is' }
{ token: 281, string: ' an' }
{ token: 1672, string: ' example' }
{ token: 564, string: ' �' }
{ token: 250, string: '�' }
{ token: 34086, string: 'sent' }
{ token: 594, string: 'ence' }
{ token: 447, string: '�' }
{ token: 250, string: '�' }
{ token: 284, string: ' to' }
{ token: 1949, string: ' try' }
{ token: 21004, string: ' encoding' }
{ token: 503, string: ' out' }
{ token: 319, string: ' on' }
{ token: 0, string: '!' }
We can decode it back into:
This is an example “sentence“ to try encoding out on!

You can see that when we decode it back token by token, it won't decode correctly, but it's OK when you decode the entire array back.
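
A hedged sketch of why per-token decoding shows �: the three UTF-8 bytes of “ are split across two tokens, so each token alone is an incomplete byte sequence. A TextDecoder in streaming mode carries partial bytes across token boundaries; the tokenBytes helper below is hypothetical, since gpt-3-encoder does not expose per-token raw bytes:

const decoder = new TextDecoder('utf-8')
let text = ''
for (const token of encoded) {
  // stream: true keeps an incomplete multi-byte sequence buffered until the
  // remaining bytes arrive with the next token
  text += decoder.decode(tokenBytes(token), { stream: true }) // tokenBytes is hypothetical
}
text += decoder.decode() // flush any trailing partial sequence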

Possible bug with BPE RegEx

I want to start off by saying thanks so much for providing this library and doing all the hard work of converting the GPT-2 byte pair encoder from Python to JavaScript.

When going through the code and comparing it to the original GPT-2 implementation, I noticed a difference between the regex used in this repository and the original. The original GPT-2 regex contains the substring |'ll|:

https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53

However, in this repo there is a space between the l's:

r"""'s|'t|'re|'ve|'m|'l l|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

It also exists in the JavaScript version:

const pat = /'s|'t|'re|'ve|'m|'l l|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu

We also looked at how Hugging Face has implemented it, and theirs also does not contain a space:

https://github.com/huggingface/transformers/blob/ac588594e29f39a235433f00108d1416fb7c7fe5/src/transformers/models/gpt2/tokenization_gpt2.py#L193

If the space is intentional I'd be very curious to understand why it exists :)

Thanks!
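
For reference, the pattern with the space removed, matching the GPT-2 and Hugging Face references quoted above (assuming the space is indeed unintentional):

// Corrected pattern: 'll joined, everything else unchanged.
const pat = /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu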

"TextEncoder is not defined" error when running tests in Node 12.22.0

Package installed

"dependencies": {
    ...
    "gpt-3-encoder": "^1.1.3",
    ...
}

When run:

    ReferenceError: TextEncoder is not defined 
      at Object.<anonymous> (node_modules/gpt-3-encoder/Encoder.js:21:21)

Which references: https://github.com/latitudegames/GPT-3-Encoder/blob/master/Encoder.js#L21

Suggestion: potentially allow passing in a TextEncoder, or switch to the Node.js recommendation:

const util = require('util');
const encoder = new util.TextEncoder();
const uint8Array = encoder.encode('Hello');
console.log(uint8Array);
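
A hedged variant of that suggestion that works both where TextEncoder is a global (browsers, modern Node) and where only util exposes it (some test environments, such as jsdom under Jest, strip the global):

// Prefer the global when available, fall back to util.TextEncoder otherwise.
const TextEncoderImpl = typeof TextEncoder !== 'undefined'
  ? TextEncoder
  : require('util').TextEncoder
const textEncoder = new TextEncoderImpl()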

encoder.json does not exist in chunks directory.

There is an error during building in Next.js applications:

info  - Collecting page data ...Error: ENOENT: no such file or directory, open '/home/hh/backre-com/.next/server/chunks/encoder.json'

I tried Node.js 16 and 18; both result in the same error.
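
One commonly suggested direction, sketched here without having been verified against this exact setup: copy the package's data files next to the server chunks so the runtime readFileSync finds them. This assumes webpack 5 and copy-webpack-plugin; the paths are illustrative and may need adjusting to your build layout:

// next.config.js (hypothetical workaround sketch)
const CopyPlugin = require('copy-webpack-plugin')

module.exports = {
  webpack: (config, { isServer }) => {
    if (isServer) {
      config.plugins.push(new CopyPlugin({
        patterns: [
          { from: 'node_modules/gpt-3-encoder/encoder.json', to: 'chunks/' },
          { from: 'node_modules/gpt-3-encoder/vocab.bpe', to: 'chunks/' },
        ],
      }))
    }
    return config
  },
}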

The cache in bpe() may occupy a large amount of memory after running for a long time.

I use a large amount of Chinese text with the GPT service, and the Chinese phrases occupy a significant amount of memory in the cache.

After running for one day, it occupies more than 1 GB of memory, which for a moment made me think there was a memory leak in my own code.

GPT-3-Encoder/Encoder.js

Lines 87 to 153 in 9df47fc

function bpe(token) {
  if (cache.has(token)) {
    return cache.get(token)
  }
  let word = token.split('')
  let pairs = get_pairs(word)
  if (!pairs) {
    return token
  }
  while (true) {
    const minPairs = {}
    Array.from(pairs).map(pair => {
      const rank = bpe_ranks[pair]
      minPairs[(isNaN(rank) ? 10e10 : rank)] = pair
    })
    const bigram = minPairs[Math.min(...Object.keys(minPairs).map(x => {
      return parseInt(x)
    }))]
    if (!(bigram in bpe_ranks)) {
      break
    }
    const first = bigram[0]
    const second = bigram[1]
    let new_word = []
    let i = 0
    while (i < word.length) {
      const j = word.indexOf(first, i)
      if (j === -1) {
        new_word = new_word.concat(word.slice(i))
        break
      }
      new_word = new_word.concat(word.slice(i, j))
      i = j
      if (word[i] === first && i < word.length - 1 && word[i + 1] === second) {
        new_word.push(first + second)
        i = i + 2
      } else {
        new_word.push(word[i])
        i = i + 1
      }
    }
    word = new_word
    if (word.length === 1) {
      break
    } else {
      pairs = get_pairs(word)
    }
  }
  word = word.join(' ')
  cache.set(token, word)
  return word
}
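
A hedged mitigation sketch: bound the cache instead of letting it grow without limit. A Map iterates keys in insertion order, so deleting the first key gives simple FIFO eviction (a true LRU would also re-insert entries on reads):

const MAX_CACHE_SIZE = 50000 // illustrative limit; tune to your memory budget

function cacheSet(token, word) {
  if (cache.size >= MAX_CACHE_SIZE) {
    // evict the oldest entry; Map.keys() yields keys in insertion order
    cache.delete(cache.keys().next().value)
  }
  cache.set(token, word)
}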

Tokenization Count based on Model

Thank you for this library and your blog posts; I really appreciate them for learning more about prompt programming GPT-3.
I assume that if I want to count the number of tokens for a prompt, I would run:

const str = 'This is an example sentence to try encoding out on!'
const encoded = encode(str)
const tokenCount = encoded.length

I assume this is the tokenization algorithm used for the davinci models, is that correct?
Do you happen to know where to find a tokenizer for the other models, or a general way to predict token usage before submitting a prompt?

Thanks for your feedback!
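
A hedged note on the question above: this package implements the GPT-2/GPT-3 byte pair encoding, which matches the original davinci family; newer models (e.g. gpt-3.5-turbo, gpt-4) use different encodings. One option, assuming the js-tiktoken package and its encodingForModel export, is:

const { encodingForModel } = require('js-tiktoken')

// Pick the encoding by model name rather than hard-coding the GPT-2 BPE.
const enc = encodingForModel('gpt-3.5-turbo')
console.log(enc.encode('This is an example sentence to try encoding out on!').length)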
