Another WebAssembly binding for llama.cpp. Inspired by tangledgroup/llama-cpp-wasm, but unlike it, Wllama aims to support a low-level API for (de)tokenization, embeddings, and more.
- Version 1.4.0
  - Add `single-thread/wllama.js` and `multi-thread/wllama.js` to the list of `CONFIG_PATHS`
  - `createEmbedding` now adds BOS and EOS tokens by default

Features:
- Typescript support
- Can run inference directly on browser (using WebAssembly SIMD), no backend or GPU is needed!
- No runtime dependency (see package.json)
- High-level API: completions, embeddings
- Low-level API: (de)tokenize, KV cache control, sampling control,...
- Ability to split the model into smaller files and load them in parallel (same as `split` and `cat`)
- Auto switch between single-thread and multi-thread build based on browser support
- Inference is done inside a worker, does not block UI render
- Pre-built npm package @wllama/wllama
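The split-and-load feature mirrors the standard Unix `split` and `cat` tools. A minimal sketch (file names are illustrative, and a random file stands in for a real .gguf) splitting a model into chunks and verifying they reassemble losslessly:

```shell
# Create a 1 MB dummy "model" file for illustration (stands in for a real .gguf)
head -c 1048576 /dev/urandom > model.gguf

# Split it into 512 kB chunks: model.gguf.part-aa, model.gguf.part-ab, ...
split -b 524288 model.gguf model.gguf.part-

# Reassemble with cat and verify the result is byte-identical
cat model.gguf.part-* > model.reassembled.gguf
cmp model.gguf model.reassembled.gguf && echo "chunks reassemble losslessly"
```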
Limitations:
- To enable multi-thread, you must add `Cross-Origin-Embedder-Policy` and `Cross-Origin-Opener-Policy` headers. See this discussion for more details.
- No WebGL support, but it may be possible in the future
- Max model size is 2GB, due to the size restriction of ArrayBuffer
Documentation: https://ngxson.github.io/wllama/docs/
Demo:
- Basic usages with completions and embeddings: https://ngxson.github.io/wllama/examples/basic/
- Advanced example using low-level API: https://ngxson.github.io/wllama/examples/advanced/
- Embedding and cosine distance: https://ngxson.github.io/wllama/examples/embeddings/
Install it:
npm i @wllama/wllama
For complete code, see examples/reactjs
NOTE: this example only covers completions usage. For embeddings, please see examples/embeddings/index.html
For complete code, see examples/basic/index.html
import { Wllama } from './esm/index.js';
(async () => {
const CONFIG_PATHS = {
'single-thread/wllama.js' : './esm/single-thread/wllama.js',
'single-thread/wllama.wasm' : './esm/single-thread/wllama.wasm',
'multi-thread/wllama.js' : './esm/multi-thread/wllama.js',
'multi-thread/wllama.wasm' : './esm/multi-thread/wllama.wasm',
'multi-thread/wllama.worker.mjs': './esm/multi-thread/wllama.worker.mjs',
};
// Automatically switch between single-thread and multi-thread version based on browser support
// If you want to enforce single-thread, add { "n_threads": 1 } to LoadModelConfig
const wllama = new Wllama(CONFIG_PATHS);
await wllama.loadModelFromUrl('https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf', {});
  // elemInput is assumed to be an <input> element on the page; any prompt string works here
  const outputText = await wllama.createCompletion(elemInput.value, {
nPredict: 50,
sampling: {
temp: 0.5,
top_k: 40,
top_p: 0.9,
},
});
console.log(outputText);
})();
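The embeddings flow is similar. A hedged sketch comparing two texts by cosine distance, assuming `createEmbedding` resolves to a plain array of numbers (as in the embeddings demo linked above); the `cosineSimilarity` helper is ours, not part of the wllama API:

```javascript
// Cosine similarity between two equal-length vectors (helper, not wllama API)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hedged usage sketch (API shape assumed from the docs above):
// const embA = await wllama.createEmbedding('Hello world');
// const embB = await wllama.createEmbedding('Bonjour le monde');
// console.log(cosineSimilarity(embA, embB)); // closer to 1 = more similar
```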
This repository already comes with pre-built binaries. If you want to build them yourself, use the commands below:
# Requires docker compose to be installed
# First, build llama.cpp into wasm
npm run build:wasm
# (Optional) Build the ES module
npm run build
Short term:
- Guide: How to split a gguf file into smaller ones?
- Add a more practical embedding example (using a better model)
- Maybe do a full RAG-in-browser example using tinyllama?
Long term:
- Support GPU inference via WebGL
- Support multi-sequences: given the resource limitations of WASM, I don't think multi-sequences are a good idea
- Multi-modal: waiting for the LLaVA implementation in llama.cpp to be refactored