
cria's Introduction

πŸ‘‹ Hello! I'm Amine.

  • πŸ€– I'm a Moroccan freelance data scientist working primarily on state-of-the-art computer vision problems.
  • 🌱 I'm currently building an end-to-end document near-duplicate detection solution.
  • πŸ’ͺ I'm the co-founder of the Fitroulette app, a WebRTC-based application for remote fitness sessions.
  • πŸ’¬ You can ask me about ML/DL, Python, Golang, distributed systems, React Native, and WebRTC.
  • πŸ“« You can reach me on LinkedIn.
  • πŸ–‹οΈ I write about various technical subjects on my blog.
  • ⚑ I'm also an amateur kickboxer and grappler.

⬆️ Latest GitHub Activity

  1. πŸ—£ Commented on #1118 in voxel51/fiftyone
  2. ❗️ Closed issue #4 in AmineDiro/Adversarial-Attacks
  3. πŸ—£ Commented on #4 in AmineDiro/Adversarial-Attacks
  4. πŸ’ͺ Opened PR #1118 in voxel51/fiftyone
  5. πŸ—£ Commented on #1097 in voxel51/fiftyone

πŸ“« Where to find me

Github LinkedIn Medium

cria's People

Contributors

aminediro, aparo, benjamint22, bringitup


cria's Issues

Llama 2 Chat Prompting incorrect?

From this guide: https://replicate.com/blog/how-to-prompt-llama

A prompt with history would look like:

<s>[INST] <<SYS>>
You are a helpful... bla bla... assistant
<</SYS>>

Hi there! [/INST] Hello! How can I help you today? </s><s>[INST] What is a neutron star? [/INST] A neutron star is a ... </s><s> [INST] Okay cool, thank you! [/INST]

It may even be that the newlines can be removed.

So I think this prompt technique should replace the one in https://github.com/AmineDiro/cria/blob/main/src/routes/chat.rs#L16
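
A minimal sketch of that serialization, following the format quoted above (the Message type and function name are illustrative, not cria's actual API):

// Illustrative message type; cria's own chat types would take its place.
struct Message {
    role: String, // "system", "user", or "assistant"
    content: String,
}

fn build_llama2_prompt(messages: &[Message]) -> String {
    let mut prompt = String::new();
    let mut system = String::new();
    let mut first_user = true;
    for msg in messages {
        match msg.role.as_str() {
            "system" => system = msg.content.clone(),
            "user" if first_user => {
                // The system prompt is folded into the first user turn.
                prompt.push_str(&format!(
                    "<s>[INST] <<SYS>>\n{}\n<</SYS>>\n\n{} [/INST]",
                    system, msg.content
                ));
                first_user = false;
            }
            "user" => prompt.push_str(&format!("<s>[INST] {} [/INST]", msg.content)),
            // Each completed assistant turn is closed with </s>.
            "assistant" => prompt.push_str(&format!(" {} </s>", msg.content)),
            _ => {}
        }
    }
    prompt
}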

llama-2 70B support

I'm getting this error when trying to run on macOS:

error: invalid value 'llama-2' for '<MODEL_ARCHITECTURE>': llama-2 is not one of supported model architectures: [Bloom, Gpt2, GptJ, GptNeoX, Llama, Mpt]

If I use Llama instead, it crashes (as it probably should):

GGML_ASSERT: llama-cpp/ggml.c:6192: ggml_nelements(a) == ne0*ne1*ne2
fish: Job 1, 'target/release/cria Llama ../ll…' terminated by signal SIGABRT (Abort)

"git submodule update --init --recursive" requires ssh

I am trying to get this running in Docker.

Here is my current Dockerfile:

FROM ubuntu:latest

# Build dependencies (git included; sudo is unnecessary as root)
RUN apt-get update && \
    apt-get install -y \
    curl \
    build-essential \
    libssl-dev \
    pkg-config \
    git

# Install Rust and Cargo
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

RUN git clone https://github.com/AmineDiro/cria.git
WORKDIR /cria

# Fails here: the submodule is fetched over ssh
#RUN git submodule update --init --recursive

# Build the project using Cargo in release mode
#RUN cargo build --release

#COPY ggml-model-q4_0.bin .

Unfortunately, it seems like the git submodule update --init --recursive command is fetching the submodule over ssh.

Have you considered setting this up so that the dependencies can be installed without ssh?
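
One workaround that doesn't require any change to cria is git's URL rewriting (a standard git feature): map ssh-style GitHub URLs to https before updating submodules, e.g. as an extra step in the Dockerfile above:

# Rewrite ssh-style GitHub URLs to https so the submodule
# fetch works without ssh keys:
RUN git config --global url."https://github.com/".insteadOf "git@github.com:" && \
    git submodule update --init --recursive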


Failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

I was following the steps given in the README. I'm new to Rust, so I have no idea how to get through this one; I tried googling but found nothing.

   Compiling ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)
error: failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

Caused by:
  process didn't exit successfully: `D:\textgen\cria\target\release\build\ggml-sys-347ad8dfc92431e3\build-script-build` (exit code: 101)
  --- stdout
  cargo:rerun-if-changed=llama-cpp
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None

  --- stderr
  thread 'main' panicked at 'Please make sure nvcc is executable and the paths are defined using CUDA_PATH, CUDA_INCLUDE_PATH and/or CUDA_LIB_PATH', llm\crates\ggml\sys\build.rs:344:33
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
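
The panic comes from the GPU-enabled build script failing to find nvcc. If you don't need GPU support, building without the GPU feature should skip the check; otherwise point the build at your CUDA toolkit using the variables named in the panic (the toolkit path below is illustrative, and the cublas feature name is an assumption about cria's build flags):

REM CPU-only build, no nvcc required:
cargo build --release

REM GPU build: make nvcc reachable and set the CUDA paths:
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2
cargo build --release --features cublas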

Can this project handle multiple requests at once?

Hello, I came across this project while searching for OpenAI-API-compatible servers for llama.cpp, and I was wondering: can it handle multiple requests at once?

Loading another model into RAM for each concurrent user doesn't seem like a great idea, and I was wondering if this was even possible at all with this project.

Thank you for your work!

GGUF support

I use Docker for deployment. I downloaded a 7B file, and the environment configuration file looks like this:

CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
# Used in docker-compose
CRIA_MODEL_PATH=/llama/llama-2-7b/consolidated.00.pth
CRIA_USE_GPU=true
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans
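
Note that consolidated.00.pth is a raw PyTorch checkpoint, which cria's loader cannot read; the other issues here show it loading GGML .bin files and rejecting GGUF. A hypothetical working value, with an illustrative file name:

CRIA_MODEL_PATH=/llama/llama-2-7b/llama-2-7b-chat.ggmlv3.q4_0.bin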

Implement streaming for the /v1/chat/completions route

I think /v1/completions has streaming code, but /v1/chat/completions does not. The former API is deprecated and I'm using the new one.

To me it also looks like the code could be combined somehow, i.e. after creating the prompt from the incoming JSON (with reference to #18).

Perhaps it's a case of passing it to the streaming code in routes/completions.rs, something like the sketch below.
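
A minimal sketch of that idea (every name here is a hypothetical stand-in for whatever routes/chat.rs and routes/completions.rs actually define, not cria's real API):

// Hypothetical types standing in for cria's request structs.
struct ChatMessage {
    role: String,
    content: String,
}

struct ChatCompletionRequest {
    messages: Vec<ChatMessage>,
    temperature: f32,
}

struct CompletionRequest {
    prompt: String,
    temperature: f32,
    stream: bool,
}

// Flatten the chat history into a single prompt, then hand the request to
// the code path that already streams /v1/completions responses.
fn to_completion_request(req: ChatCompletionRequest) -> CompletionRequest {
    let prompt = req
        .messages
        .iter()
        // Placeholder serialization; the real thing should use the Llama 2
        // format discussed in the prompting issue above.
        .map(|m| format!("{}: {}", m.role, m.content))
        .collect::<Vec<_>>()
        .join("\n");
    CompletionRequest {
        prompt,
        temperature: req.temperature,
        stream: true,
    }
}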

So you're aware, and for some context to my requests: we have integrated cria into https://github.com/purton-tech/bionicgpt

Thanks

Missing LICENSE

I see you have no LICENSE file for this project. Without one, the default is full copyright.

I would suggest releasing the code under the GPL-3.0-or-later or AGPL-3.0-or-later license so that others are encouraged to contribute changes back to your project.

Response cutting off at around 256 tokens

The response cuts off at around 256 tokens.

# .env
CRIA_MODEL_PATH=/home/bala/Models/llama-2-13b-chat.ggmlv3.q8_0.bin

# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=false
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

Input

{
  "prompt":"[INST]<<SYS>>.<</SYS>>How do I get from UNSW to Central Station?[/INST]",
  "temperature":0.1
}

Output

console.log(response.choices[0].text)

There are several ways to get from the University of New South Wales (UNSW) to Central Station in Sydney. Here are some options: 1. Train: The easiest and most convenient way to get to Central Station from UNSW is by train. The UNSW campus is located near the Kensington station, which is on the Airport & South Line. You can take a train from Kensington station to Central Station. The journey takes around 20 minutes. 2. Bus: You can also take a bus from UNSW to Central Station. The UNSW campus is served by several bus routes, including the 395, 397, and 398. These buses run frequently throughout the day and the journey takes around 45-60 minutes, depending on traffic. 3. Light Rail: Another option is to take the light rail from UNSW to Central Station. The light rail runs along Anzac Parade and stops at Central Station. The journey takes around 30-40 minutes. 4. Taxi or Ride-sharing: You can also take a taxi or ride-sharing service such as Uber or Ly
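
This looks like a default generation limit rather than garbage output. If cria honors the OpenAI-style max_tokens field (an assumption on my part), raising it in the request body should produce longer answers:

{
  "prompt":"[INST]<<SYS>>.<</SYS>>How do I get from UNSW to Central Station?[/INST]",
  "temperature":0.1,
  "max_tokens":1024
}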

Unable to build on Windows

I attempted to build on Windows, specifically Windows 11, and it failed with the following errors:

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:25:21
    |
25  |             .decode(vec![idx as u32], true)
    |              ------ ^^^^^^^^^^^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:70:21
    |
70  |             .decode(tokens, skip_special_tokens)
    |              ------ ^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^
help: consider borrowing here
    |
70  |             .decode(&tokens, skip_special_tokens)
    |                     +

For more information about this error, try `rustc --explain E0308`.
error: could not compile `llm-base` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

If anyone knows how to fix it, I'd appreciate the help. I just followed the steps for running cria from the README.

It seems like there is a problem with the llm submodule?
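
For what it's worth, the compiler output already points at the fix: tokenizers 0.13.4 takes &[u32], so both decode call sites in the llm submodule just need a borrow. A sketch of the patch, following rustc's own suggestion:

// llm/crates/llm-base/src/tokenizer/huggingface.rs

// line 25: pass a slice instead of an owned Vec
.decode(&[idx as u32], true)

// line 70: borrow the Vec, exactly as rustc suggests
.decode(&tokens, skip_special_tokens)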

llama.cpp no longer supports .bin

thread 'main' panicked at 'Failed to load LLaMA model from "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf": invalid magic number 46554747 (GGUF) for "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf"', /home/cisco/git/cria/src/lib.rs:54:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
