gpt3-tokenizer-java's Introduction

GPT3/4 Java Tokenizer

This is a Java implementation of a GPT3/4 tokenizer, loosely ported from Tiktoken with the help of ChatGPT.

Usage Examples

Encoding Text to Tokens

GPT3Tokenizer tokenizer = new GPT3Tokenizer(Encoding.CL100K_BASE);
List<Integer> tokens = tokenizer.encode("example text here");

Decoding Tokens to Text

GPT3Tokenizer tokenizer = new GPT3Tokenizer(Encoding.CL100K_BASE);
List<Integer> tokens = Arrays.asList(123, 456, 789);
String text = tokenizer.decode(tokens);
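To give a feel for what encode and decode do conceptually, here is a minimal toy sketch of byte-pair-style merging. It is not this library's implementation: the class name ToyBpeSketch and its six-entry vocabulary are made up for illustration, and it works on characters where real tokenizers such as Tiktoken work on bytes with a vocabulary of roughly 100k entries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of BPE-style encoding and decoding (hypothetical, not the library's code). */
final class ToyBpeSketch {

    // Made-up vocabulary for illustration only (NOT cl100k_base).
    // A token's id is its index; a lower id means its merge is applied first.
    static final List<String> VOCAB = List.of("a", "b", "c", "ab", "bc", "abc");
    static final Map<String, Integer> RANKS = new HashMap<>();
    static {
        for (int id = 0; id < VOCAB.size(); id++) {
            RANKS.put(VOCAB.get(id), id);
        }
    }

    static List<Integer> encode(String text) {
        // Start from single characters (real tokenizers start from bytes).
        List<String> parts = new ArrayList<>();
        for (char ch : text.toCharArray()) {
            parts.add(String.valueOf(ch));
        }
        // Repeatedly merge the adjacent pair whose concatenation has the lowest id.
        while (true) {
            int best = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i + 1 < parts.size(); i++) {
                Integer rank = RANKS.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    best = i;
                    bestRank = rank;
                }
            }
            if (best < 0) {
                break; // no adjacent pair is in the vocabulary
            }
            String merged = parts.get(best) + parts.get(best + 1);
            parts.set(best, merged);
            parts.remove(best + 1);
        }
        // Map the remaining pieces to their ids.
        List<Integer> ids = new ArrayList<>();
        for (String part : parts) {
            Integer id = RANKS.get(part);
            if (id == null) {
                throw new IllegalArgumentException("Not in toy vocabulary: " + part);
            }
            ids.add(id);
        }
        return ids;
    }

    static String decode(List<Integer> ids) {
        // Decoding is just concatenating each token's text.
        StringBuilder sb = new StringBuilder();
        for (int id : ids) {
            sb.append(VOCAB.get(id));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> ids = encode("abcab");
        System.out.println(ids);          // [5, 3]
        System.out.println(decode(ids));  // abcab
    }
}
```

The property worth noticing is the round trip: decoding the ids produced by encode gives back the original text, which is also what you would expect from the tokenizer calls above.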

Counting Number of Tokens in Chat Messages

var messages = List.of(
        new ChatMessage(ChatMessageRole.SYSTEM.value(), "You are a helpful assistant."),
        new ChatMessage(ChatMessageRole.USER.value(), "Hello there!")
);
var model = ModelType.GPT_3_5_TURBO;
var count = TokenCount.fromMessages(messages, model);
System.out.println("Prompt tokens: " + count);
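A prompt's token count is more than the sum of its message texts, because the chat format adds a fixed overhead per message. The sketch below follows the accounting described in the OpenAI cookbook (3 tokens per message and 1 per name for current models, 4 and -1 for gpt-3.5-turbo-0301, plus 3 tokens that prime the reply). It is an illustration under those assumptions, not necessarily how TokenCount is implemented; the Message record, the legacy0301 flag, and the word-counting stand-in are hypothetical.

```java
import java.util.List;
import java.util.function.ToIntFunction;

/** Sketch of cookbook-style chat prompt accounting (hypothetical, not the library's code). */
final class ChatTokenOverheadSketch {

    /** Hypothetical message holder; name is optional and null when absent. */
    record Message(String role, String content, String name) {}

    /**
     * @param legacy0301  true for the accounting used by gpt-3.5-turbo-0301
     * @param tokenLength returns the number of tokens in a string (plug in a real tokenizer here)
     */
    static int countPromptTokens(List<Message> messages, boolean legacy0301,
                                 ToIntFunction<String> tokenLength) {
        int tokensPerMessage = legacy0301 ? 4 : 3;
        int tokensPerName = legacy0301 ? -1 : 1;
        int total = 0;
        for (Message m : messages) {
            total += tokensPerMessage;
            total += tokenLength.applyAsInt(m.role());
            total += tokenLength.applyAsInt(m.content());
            if (m.name() != null) {
                total += tokenLength.applyAsInt(m.name()) + tokensPerName;
            }
        }
        return total + 3; // every reply is primed with a few assistant tokens
    }

    public static void main(String[] args) {
        // Stand-in counter: one "token" per whitespace-separated word (NOT a real tokenizer).
        ToIntFunction<String> wordCount = s -> s.isBlank() ? 0 : s.trim().split("\\s+").length;
        var messages = List.of(
                new Message("system", "You are a helpful assistant.", null),
                new Message("user", "Hello there!", null));
        System.out.println(countPromptTokens(messages, false, wordCount)); // 18
        System.out.println(countPromptTokens(messages, true, wordCount));  // 20
    }
}
```

The printed numbers come from the word-counting stand-in, so they only demonstrate the overhead arithmetic; real counts require a real tokenizer for tokenLength.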

Did you know...

  1. ...that all 3.5-turbo models released after 0613 now have tokenization counts for messages consistent with gpt-4 models?

  2. ...that the OpenAI Tokenizer available at https://platform.openai.com/tokenizer uses the p50k_base encoding, so it does not count tokens correctly for gpt-3.5 and gpt-4 models? If you are looking for a decent alternative, you may like https://tiktokenizer.vercel.app/, but keep in mind that message tokenization changed for gpt-3.5 models released after 0613 (see point 1).

  3. ...that in the cl100k_base encoding every sequence of up to 81 spaces is just a single token? So the next time someone tells you that passing YAML to ChatGPT is not efficient, you can argue that...

var tokenizer = ModelType.GPT_3_5_TURBO.getTokenizer();
List<Integer> tokens;
// Keep appending spaces for as long as the whole run still encodes to a single token.
for (var sb = new StringBuilder(" "); (tokens = tokenizer.encode(sb)).size() == 1; sb.append(' ')) {
    var plural = sb.length() == 1 ? "" : "s";
    System.out.printf("`%s`'s token is %s, and that's %d space%s!%n", sb, tokens, sb.length(), plural);
}

Output (last lines):
`                                                                           `'s token is [14984], and that's 75 spaces!
`                                                                            `'s token is [56899], and that's 76 spaces!
`                                                                             `'s token is [59691], and that's 77 spaces!
`                                                                              `'s token is [82321], and that's 78 spaces!
`                                                                               `'s token is [40584], and that's 79 spaces!
`                                                                                `'s token is [98517], and that's 80 spaces!
`                                                                                 `'s token is [96529], and that's 81 spaces!
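Part of the reason whitespace runs stay together is that text is split into chunks by a regular expression before any byte-pair merging happens. The sketch below uses what is assumed to be the cl100k_base split pattern as published in Tiktoken; treat the exact pattern as an assumption and check it against the tokenizer you actually use. PreTokenizeSketch is a hypothetical name, and the splitting inside this library may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of regex pre-tokenization (hypothetical, not the library's code). */
final class PreTokenizeSketch {

    // Assumed cl100k_base split pattern. Note that Java's \s is ASCII-only by default,
    // so exotic Unicode whitespace may be split differently than in other regex engines.
    static final Pattern SPLIT = Pattern.compile(
            "(?i:'s|'t|'re|'ve|'m|'ll|'d)"
            + "|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+"
            + "|\\p{N}{1,3}"
            + "| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*"
            + "|\\s*[\\r\\n]+"
            + "|\\s+(?!\\S)"
            + "|\\s+");

    /** Returns the chunks that would each be byte-pair encoded on their own. */
    static List<String> split(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher m = SPLIT.matcher(text);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // A YAML-like line indented by four spaces.
        for (String piece : split("    key: value")) {
            System.out.println("[" + piece + "]");
        }
        // Prints, one per line: [   ]  [ key]  [:]  [ value]
    }
}
```

The indentation comes out as one chunk of three spaces, and the fourth space attaches to the following word. Whether a chunk then becomes a single token depends on the vocabulary, which is what the loop above measures.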

License

This project is licensed under the MIT License.

Contributors

didalgolab, alessandroborges
