latitudegames / GPT-3-Encoder
Javascript BPE Encoder Decoder for GPT-2 / GPT-3
License: MIT License
Hi, I have spent a little time adding to this library in my fork; we simply need to change the URLs back and pull to update these repos.
If anyone is looking for newer features, speak up and feel free to open an issue. I think I'll basically leave it here: the goal is to make a 1.4.2 release and have that be final as far as what is needed for this component.
The major things added are:
countTokens: a faster way to count tokens if you don't care about their contents
tokenStats: some interesting insights into encoded strings, such as frequency and position maps
jsdocs: implementation interface docs
browserify: browser support
The major thing to test is whether we can get a good working version of the original Python implementation. I added it to the npm package in case it just works, but it would be good to have the package support Node, the browser, and Python.
Also check out the browser demo.
Compatible with Node >= 12
Overall, just let me know what you think.
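For reference, here is a rough sketch of what a tokenStats-style helper could compute over an encoded array. This is illustrative only, assuming frequency and position maps as described above; it is not necessarily the fork's actual implementation or API.

```javascript
// Illustrative sketch, not the fork's actual code: given an array of token
// ids (as returned by encode()), build frequency and position maps.
function tokenStats(tokens) {
  const frequency = new Map();   // token id -> number of occurrences
  const positions = new Map();   // token id -> indices where it appears
  tokens.forEach((token, i) => {
    frequency.set(token, (frequency.get(token) || 0) + 1);
    if (!positions.has(token)) positions.set(token, []);
    positions.get(token).push(i);
  });
  return { count: tokens.length, frequency, positions };
}

const stats = tokenStats([1212, 318, 1212, 0]);
console.log(stats.count);                // 4
console.log(stats.frequency.get(1212));  // 2
console.log(stats.positions.get(1212));  // [ 0, 2 ]
```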
Input string: hello 👋 world 🌍
Python
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
prompt = "hello 👋 world 🌍"
encoded = tokenizer.encode(prompt)
print(f"Count: {len(encoded)}")  # 7
print(f"Decoded: {tokenizer.decode(encoded)}")  # hello 👋 world 🌍
GPT-3-Encoder
const {encode, decode} = require('./encoder.js')
const str = 'hello 👋 world 🌍'
const encoded = encode(str)
console.log('Count: ', encoded.length); // 4
console.log('Decoded: ', decode(encoded)); // hello world
Just wanted to document the issue, no immediate fix expected. Maybe someone from the community will find the urge to submit a PR if you all don't get around to it.
I tried to encode the text of a large PDF, about 11 MB in size. The text must be more than 100k tokens. But gpt-3-encoder fails to process this amount of data: it never throws an error, the program is simply stuck forever on this line:
const encoded = encode(textOfDocument);
How can this be solved?
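Until the underlying performance issue is fixed, one possible workaround (a sketch of my own, not part of the library) is to split the text into smaller pieces and encode each one. Note that splitting at arbitrary offsets can merge or split tokens differently at chunk boundaries, so the total is an approximation.

```javascript
// Workaround sketch: encode very large text in chunks instead of one call.
// The encode function is passed in; in real use it would be
// require('gpt-3-encoder').encode. Chunk boundaries can slightly change
// the token count, so treat the result as approximate.
function encodeInChunks(text, encode, chunkSize = 10000) {
  const tokens = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    tokens.push(...encode(text.slice(i, i + chunkSize)));
  }
  return tokens;
}

// Demo with a toy encoder (one "token" per whitespace-separated word):
const toyEncode = (s) => s.split(/\s+/).filter(Boolean);
console.log(encodeInChunks('a b c d', toyEncode, 4).length); // 4
```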
Hi everyone, this is currently the de facto way to encode and decode strings in JavaScript. But I cannot import this Node library in a Chrome extension because it relies on the fs module:
Uncaught TypeError: fs.readFileSync is not a function
at ./node_modules/gpt-3-encoder/Encoder.js (Encoder.js:5:1)
at __webpack_require__ (bootstrap:19:1)
at ./node_modules/gpt-3-encoder/index.js (index.js:1:28)
at __webpack_require__ (bootstrap:19:1)
at make namespace object:7:1
at popup.js:22:2
at popup.js:22:2
This is problematic, since a great use case for this library is counting token usage BEFORE sending text to ChatGPT. Is there any way this library can be made not to rely on fs, so that it can be imported in a Chrome-extension-like setting?
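One way to remove the hard fs dependency (a sketch, not the library's current API; file paths are illustrative) is to load the BPE data per environment, so a bundler targeting the browser never has to resolve fs:

```javascript
// Sketch: load encoder.json without a hard fs dependency. In Node we
// lazily require('fs'); in a browser or extension the JSON would instead
// be fetched as a static asset (or bundled as a JSON import).
async function loadEncoderData() {
  if (typeof window === 'undefined' && typeof require === 'function') {
    // Node: read from disk, but only when actually running under Node.
    const fs = require('fs');
    return JSON.parse(fs.readFileSync('./encoder.json', 'utf8'));
  }
  // Browser / extension: serve encoder.json as a static asset instead.
  const res = await fetch('./encoder.json');
  return res.json();
}
```

The key point is that the require('fs') call is reached only at runtime under Node, so webpack can be configured to ignore it for browser builds.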
Can we please get a new release incorporating the recent changes?
When a character like “ is used, it gives back a faulty output, as shown below.
encode('“wrote jack a letter”');
[null, 222, 250, 42910, 14509, 257, 3850, null, 222, 251]
Whereas OpenAI's tokenizer gives the output:
[447, 250, 42910, 14509, 257, 3850, 447, 251]
This can be triggered by other characters like ’ and many more.
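The root cause is that “ is a multi-byte character: GPT-2's BPE operates on UTF-8 bytes mapped through a byte-to-unicode table, so every byte of the character must have an entry in that table before merge lookup. A sketch of the byte view (TextEncoder is the standard web/Node API; the token ids above come from the issue, not from this snippet):

```javascript
// “ (U+201C) is three UTF-8 bytes. Byte-level BPE must map each byte
// through its byte-to-unicode table before looking up merges; if a byte
// is handled incorrectly, the id lookup yields null, as reported above.
const bytes = new TextEncoder().encode('\u201C');
console.log(Array.from(bytes)); // [ 226, 128, 156 ] = 0xE2 0x80 0x9C
```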
I am getting an error from the encode function. I am using it in this way: encode("Some string text").length. But for some values I am getting the error TypeError: bpe(...).split is not a function from this line:
const new_tokens = bpe(token).split(' ').map(x => encoder[x])
Note: I have also tried some random non-string values, like [], 112, and ['']. For these values a different error comes, TypeError: text.matchAll is not a function, from this line:
const matches = Array.from(text.matchAll(pat)).map(x => x[0])
Please solve these issues and release another version.
Thanks in advance
Using the official example from the README, but with UTF-8 quotes
Encoded this string looks like: [
1212, 318, 281, 1672,
564, 250, 34086, 594,
447, 250, 284, 1949,
21004, 503, 319, 0
]
We can look at each token and what it represents
{ token: 1212, string: 'This' }
{ token: 318, string: ' is' }
{ token: 281, string: ' an' }
{ token: 1672, string: ' example' }
{ token: 564, string: ' �' }
{ token: 250, string: '�' }
{ token: 34086, string: 'sent' }
{ token: 594, string: 'ence' }
{ token: 447, string: '�' }
{ token: 250, string: '�' }
{ token: 284, string: ' to' }
{ token: 1949, string: ' try' }
{ token: 21004, string: ' encoding' }
{ token: 503, string: ' out' }
{ token: 319, string: ' on' }
{ token: 0, string: '!' }
We can decode it back into:
This is an example “sentence“ to try encoding out on!
You can see that when we decode it back token by token, it won't decode correctly, but it's fine when you decode the entire array at once.
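This behavior is expected with byte-level BPE: a single token can end partway through a multi-byte UTF-8 character, and decoding such a fragment on its own yields U+FFFD (�). A sketch using the standard TextDecoder, with the bytes of “ (U+201C):

```javascript
// The bytes of “ are 0xE2 0x80 0x9C. If one token carries only the first
// byte and the next token carries the rest, decoding each token's bytes
// separately produces replacement characters, while decoding the
// concatenated bytes round-trips correctly.
const dec = new TextDecoder();
console.log(dec.decode(Uint8Array.of(0xE2)));             // '�'
console.log(dec.decode(Uint8Array.of(0x80, 0x9C)));       // '��'
console.log(dec.decode(Uint8Array.of(0xE2, 0x80, 0x9C))); // '“'
```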
I want to start off by saying thanks so much for providing this library and doing all the hard work of converting the GPT-2 byte pair encoder from Python to JavaScript.
When going through the code and comparing it to the original GPT-2 implementation, I noticed a difference between the regex used in this repository and the original. The original GPT-2 regex contains the substring |'ll|:
https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
However, in this repo there is a space in between the l's:
Line 51 in 79387f4
It also exists in the JavaScript version:
Line 68 in 79387f4
We also looked at how Hugging Face has implemented it, and it also does not contain a space:
If the space is intentional I'd be very curious to understand why it exists :)
Thanks!
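The effect of the stray space is observable in how contractions are pre-tokenized. A sketch comparing the two pattern variants (simplified renderings of the GPT-2 pre-tokenization regex, written here for illustration):

```javascript
// GPT-2-style pre-tokenization regex, without and with the stray space
// inside the 'll alternative.
const correct = /'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;
const spaced  = /'s|'t|'re|'ve|'m|'l l|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/gu;

const split = (pat, s) => Array.from(s.matchAll(pat), m => m[0]);
console.log(split(correct, "we'll")); // [ "we", "'ll" ]
console.log(split(spaced,  "we'll")); // [ "we", "'", "ll" ]
```

With the space, "'ll" never matches the contraction alternative, so the apostrophe and the letters fall through to the catch-all alternatives and get split apart.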
Package installed
"dependencies": {
...
"gpt-3-encoder": "^1.1.3",
...
}
When run:
ReferenceError: TextEncoder is not defined
at Object.<anonymous> (node_modules/gpt-3-encoder/Encoder.js:21:21)
Which references: https://github.com/latitudegames/GPT-3-Encoder/blob/master/Encoder.js#L21
Suggestion: potentially allow passing in a TextEncoder, or switch to the Node.js recommendation:
const util = require('util');
const encoder = new util.TextEncoder();
const uint8Array = encoder.encode('Hello');
console.log(uint8Array);
There is an error when building Next.js applications:
info - Collecting page data ...Error: ENOENT: no such file or directory, open '/home/hh/backre-com/.next/server/chunks/encoder.json'
I tried Node.js 16 and 18; both give the same error.
I use a large amount of Chinese text in my GPT service, and the Chinese phrases occupy a significant amount of memory here.
After running for one day it occupies more than 1 GB of memory, which made me think for a moment that there was a memory leak in my own code.
Lines 87 to 153 in 9df47fc
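The memory growth is consistent with an unbounded module-level cache in the bpe function referenced above: every distinct token string gets cached forever, and Chinese text produces many distinct byte sequences. A sketch of a size-bounded cache (simple FIFO eviction, my own illustration, not the library's code):

```javascript
// Sketch: a size-bounded cache to stop unbounded memory growth. When the
// cap is reached, the oldest entry is evicted (Map preserves insertion
// order). Real code might prefer a proper LRU policy.
class BoundedCache {
  constructor(maxSize = 50000) {
    this.maxSize = maxSize;
    this.map = new Map();
  }
  get(key) { return this.map.get(key); }
  set(key, value) {
    if (this.map.size >= this.maxSize) {
      this.map.delete(this.map.keys().next().value); // evict oldest entry
    }
    this.map.set(key, value);
  }
}

const cache = new BoundedCache(2);
cache.set('a', 1); cache.set('b', 2); cache.set('c', 3); // 'a' evicted
console.log(cache.get('a')); // undefined
console.log(cache.get('c')); // 3
```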
If a token is the string 'toString', the encoder errors out with this error:
https://github.com/latitudegames/GPT-3-Encoder/blob/master/Encoder.js#L163
Suggested fix: cast token to a string before splitting. A console transcript shows the difference:
(toString).split(' ')
VM236:1 Uncaught TypeError: toString.split is not a function
    at <anonymous>:1:12
('toString').split(' ')
['toString']
EDIT: the same issue occurred with constructor. The encoder may struggle with every inherited Object.prototype property name :/
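The underlying bug is that the cache is a plain object, so looking up keys like toString or constructor hits Object.prototype instead of finding no cached entry, and bpe() gets back a function rather than a string. Using a Map or a null-prototype object avoids the collision; a sketch:

```javascript
// With a plain object, inherited properties shadow "missing" entries:
const plain = {};
console.log(typeof plain['toString']); // 'function' -- inherited!

// A Map has no inherited keys:
const safe = new Map();
console.log(safe.get('toString')); // undefined

// Likewise, a null-prototype object inherits nothing:
const noProto = Object.create(null);
console.log('toString' in noProto); // false
```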
Thank you for this library and your blog posts, really appreciate it for learning more about prompt programming GPT-3.
I assume if I want to count the number of tokens for a prompt, I would run:
const str = 'This is an example sentence to try encoding out on!'
const encoded = encode(str)
const tokenCount = encoded.length
I assume this is the tokenization algorithm used for the davinci models, is that correct?
Do you happen to know where to find the tokenization algorithms for the other models, or a general way to predict token usage before submitting a prompt?
Thanks for your feedback!