
codeshell-vscode's Introduction

CodeShell VSCode Extension

English readme

The codeshell-vscode project is an intelligent coding assistant extension for Visual Studio Code built on the CodeShell large language model. It supports multiple programming languages, including Python, Java, C/C++, JavaScript, and Go, and provides code completion, code explanation, code optimization, comment generation, and conversational Q&A, helping developers code more efficiently.

Requirements

Building the Extension

To package the extension from source, install Node.js v18 or later and run the following commands:

git clone https://github.com/WisdomShell/codeshell-vscode.git
cd codeshell-vscode
npm install
npm exec vsce package

This produces a file named codeshell-vscode-${VERSION_NAME}.vsix.

Model Service

The llama_cpp_for_codeshell project provides a 4-bit quantized version of the CodeShell model, named codeshell-chat-q4_0.gguf. The steps to deploy the model service are as follows:

Build the Code

  • Linux / Mac (Apple Silicon devices)

    git clone https://github.com/WisdomShell/llama_cpp_for_codeshell.git
    cd llama_cpp_for_codeshell
    make

    On macOS, Metal is enabled by default; with Metal the model is loaded onto the GPU, which significantly improves performance.

  • Mac (non-Apple Silicon devices)

    git clone https://github.com/WisdomShell/llama_cpp_for_codeshell.git
    cd llama_cpp_for_codeshell
    LLAMA_NO_METAL=1 make

    Mac users without an Apple Silicon chip can disable the Metal build at compile time with the option LLAMA_NO_METAL=1 (or the LLAMA_METAL=OFF CMake option) so that the model runs correctly.

  • Windows

    On Windows, you can either build inside the Windows Subsystem for Linux following the Linux instructions, or follow the approach described in the llama.cpp repository: set up w64devkit first, then build with the Linux instructions.

Download the Model

On the Hugging Face Hub we provide three models: CodeShell-7B, CodeShell-7B-Chat, and CodeShell-7B-Chat-int4. The steps to download them are as follows.

  • To run inference with the CodeShell-7B-Chat-int4 model, download it locally and place it in the llama_cpp_for_codeshell/models folder from the build step above (a script-based alternative is sketched after the commands below)
git clone https://huggingface.co/WisdomShell/CodeShell-7B-Chat-int4
git clone https://huggingface.co/WisdomShell/CodeShell-7B-Chat
git clone https://huggingface.co/WisdomShell/CodeShell-7B
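
If cloning the full repositories is inconvenient, the quantized file can also be fetched programmatically. The sketch below is an optional alternative that assumes the huggingface_hub Python package is installed; the local_dir value mirrors the models folder mentioned above and should be adjusted to your setup:

    # Fetch only the 4-bit GGUF file with huggingface_hub instead of git.
    # Assumes `pip install huggingface_hub`; local_dir is the assumed target folder.
    from huggingface_hub import hf_hub_download

    gguf_path = hf_hub_download(
        repo_id="WisdomShell/CodeShell-7B-Chat-int4",
        filename="codeshell-chat-q4_0.gguf",
        local_dir="llama_cpp_for_codeshell/models",
    )
    print("Model saved to:", gguf_path)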

Load the Model

  • For the CodeShell-7B-Chat-int4 model, the server command from the llama_cpp_for_codeshell project provides an API service:
./server -m ./models/codeshell-chat-q4_0.gguf --host 127.0.0.1 --port 8080

Note: If the build has Metal enabled and you hit runtime errors, you can add the command-line argument -ngl 0 to explicitly disable Metal GPU inference so that the model runs correctly.
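
Before wiring up the VS Code extension, you can check that the service answers requests. The sketch below assumes the server exposes the llama.cpp-style /completion endpoint on the host and port used above; the exact field names may differ in this fork, so treat it as a smoke test rather than a reference client:

    # Minimal smoke test against the local model service (assumed llama.cpp-style API).
    import json
    import urllib.request

    payload = {"prompt": "def fibonacci(n):", "n_predict": 64}
    req = urllib.request.Request(
        "http://127.0.0.1:8080/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The upstream llama.cpp server returns the generated text in the "content" field.
        print(json.loads(resp.read())["content"])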

Model Service [NVIDIA GPU]

Users who want to run inference on NVIDIA GPUs can deploy the CodeShell model with the text-generation-inference (TGI) project. The deployment steps are as follows:

Download the Model

Download the model from the Hugging Face Hub and place it in the $HOME/models folder so it can be loaded from the local path.

git clone https://huggingface.co/WisdomShell/CodeShell-7B-Chat

Deploy the Model

The following command deploys the model with GPU-accelerated inference using text-generation-inference:

docker run --gpus 'all' --shm-size 1g -p 9090:80 -v $HOME/models:/data \
        --env LOG_LEVEL="info,text_generation_router=debug" \
        ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 \
        --model-id /data/CodeShell-7B-Chat --num-shard 1 \
        --max-total-tokens 5000 --max-input-length 4096 \
        --max-stop-sequences 12 --trust-remote-code

For more detailed parameter descriptions, see the text-generation-inference project documentation.
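
Once the container is running, you can verify the endpoint before configuring the extension. A minimal sketch, assuming the 9090:80 port mapping from the command above and TGI's documented /generate endpoint:

    # Minimal smoke test against the TGI service started above (port 9090 assumed).
    import json
    import urllib.request

    payload = {
        "inputs": "Write a Python function that reverses a string.",
        "parameters": {"max_new_tokens": 128},
    }
    req = urllib.request.Request(
        "http://127.0.0.1:9090/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # TGI returns a JSON object with a "generated_text" field.
        print(json.loads(resp.read())["generated_text"])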

Configure the Extension

In VS Code, run the Install from VSIX... command and select codeshell-vscode-${VERSION_NAME}.vsix to install the extension.

  • Set the address of the CodeShell model service
  • Configure whether code completion suggestions are triggered automatically
  • Configure the delay before code completion suggestions are triggered automatically
  • Configure the maximum number of tokens for completion
  • Configure the maximum number of tokens for Q&A
  • Configure the model runtime environment

Note: The model runtime environment can be configured in the extension. For the CodeShell-7B-Chat-int4 model, choose CPU with llama.cpp under the Code Shell: Run Env For LLMs option; for the CodeShell-7B and CodeShell-7B-Chat models, choose GPU with TGI toolkit.

Extension configuration screenshot

Features

1. Code Completion

  • Automatically triggered code suggestions
  • Hotkey-triggered code suggestions

While coding, completion suggestions can be triggered automatically when you pause typing (the delay can be set to 1-3 seconds via the Auto Completion Delay option), or you can trigger them manually with the shortcut Alt+\ (Windows) or option+\ (Mac).

When the extension offers a suggestion, it is shown in gray at the cursor position in the editor. Press Tab to accept it, or keep typing to ignore it.

Code suggestion screenshot

2. Code Assistance

  • Explain/optimize/clean up a selected piece of code
  • Generate comments/unit tests for a selected piece of code
  • Check a selected piece of code for performance/security issues

Open the extension's Q&A panel in the VS Code sidebar, select a piece of code in the editor, right-click and choose the corresponding item in the CodeShell menu, and the extension will reply in the Q&A panel.

Code assistance screenshot

3. Chat Q&A

  • Multi-turn conversations
  • Conversation history
  • Multi-turn conversations based on the previous history (as context)
  • Edit a question and ask it again
  • Regenerate the answer to any question
  • Interrupt an answer while it is being generated

Chat Q&A screenshot

In a code block inside the Q&A panel, click the copy button to copy the block, or click the insert button to insert its content at the editor cursor.

License

Apache 2.0

Star History

Star History Chart

codeshell-vscode's People

Contributors

0ut0fm1nd, codeshell-liu, cove9988, weiye, zhuohaoyu, zzr0


codeshell-vscode's Issues

Output does not stop automatically after answering a question

Question:
Implement an HTTP server in Python

Import the required modules

import socketserver
import http.server

Answer

Define the HTTP request handler

class MyHttpRequestHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'This is a GET request.')

Define the HTTP server

class MyHTTPServer(socketserver.TCPServer):
    def __init__(self, server_address, request_handler_class):
        socketserver.TCPServer.__init__(self, server_address, request_handler_class)
        self.request_handler_class = request_handler_class

    def handle_request(self, request, client_address):
        request_handler = self.request_handler_class(request, client_address, self)
        request_handler.handle()

Start the HTTP server

server = MyHTTPServer(('', 8000), MyHttpRequestHandler)
print('Starting HTTP server on port 8000...')
server.serve_forever()

This HTTP server uses the SimpleHTTPRequestHandler class from the http.server module to handle HTTP requests. The request_handler object in the handle_request method is an instance of SimpleHTTPRequestHandler; it handles the HTTP request and sends the response.

To start the HTTP server, we first define the MyHTTPServer class, which inherits from socketserver.TCPServer. We then define the MyHttpRequestHandler class, which inherits from http.server.SimpleHTTPRequestHandler. Next, we create a MyHTTPServer object and pass MyHttpRequestHandler as the request_handler_class argument. Finally, we call server.serve_forever() to start the HTTP server.

Note that this HTTP server is single-threaded and cannot handle multiple requests concurrently. If you need to handle concurrent requests, you can use multi-threading. |end>|
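
For reference, the multi-threaded variant that the answer alludes to is available directly in the standard library; a minimal sketch (not part of the original model output) using http.server.ThreadingHTTPServer, which serves each request on its own thread:

    # Minimal concurrent variant of the example above (Python 3.7+).
    from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler

    class MyHttpRequestHandler(SimpleHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'This is a GET request.')

    # ThreadingHTTPServer handles each incoming request in a separate thread.
    server = ThreadingHTTPServer(('', 8000), MyHttpRequestHandler)
    print('Starting threaded HTTP server on port 8000...')
    server.serve_forever()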

(Screenshot: 2023-11-03 10:53:20)

Errors when downloading and installing the model

When downloading the model I hit an error: there is no codeshell_q4_0.gguf file under "https://huggingface.co/WisdomShell/CodeShell-7B-Chat-int4/codeshell_q4_0.gguf", only a similarly named file, codeshell-chat-q4_0.gguf.

After manually downloading codeshell-chat-q4_0.gguf into the models folder, starting the model reports an error:
“./server : 无法将“./server”项识别为 cmdlet、函数、脚本文件或可运行程序的名称。请检查名称的拼写,如果包括路径,请确保路径正确,然后再试一次。
所在位置 行:1 字符: 1

  • ./server -m ./models/codeshell-chat-q4_0.gguf --host 127.0.0.1 --port ...
  •   + CategoryInfo          : ObjectNotFound: (./server:String) [], CommandNotFoundException
      + FullyQualifiedErrorId : CommandNotFoundException”
    

Webview page is blank after starting locally

My VS Code and Node versions both meet the requirements.
It started successfully at first, then suddenly stopped working for no obvious reason; pulling the latest code did not help either. Could someone take a look?

image

Deployment models

I used the code provided in the README.md to deploy the model, but an error occurred after I executed the command. Why? I have carefully examined the path to llama_cpp_for_codeshell. Thank you!

The error message is:
(base) root@9020:~/llama_cpp_for_codeshell$ ./server -m ./models/codeshell-chat-q4_0.gguf --host 127.0.0.1 --port 8185
-bash: ./server: No such file or directory

Error running the model service with a GPU

Running the command from the README:
./server -m ./models/codeshell-chat-q4_0.gguf --host 127.0.0.1 --port 8080

The error output is as follows:
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5461.34 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 558.13 MB
llama_new_context_with_model: max tensor size = 224.77 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 4096.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 486.91 MB, offs = 4059267072, ( 4583.53 / 5461.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1346.00 MB, ( 5929.53 / 5461.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 552.02 MB, ( 6481.55 / 5461.34), warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: ggml-metal.m:1459: false
Abort trap: 6

Machine info:
M1 MacBook Pro
MacOS Sonoma 14.1

For anyone who cannot build the server on Windows

I tried to use WSL to run the server program. Here are some tips:

  • Make sure that scripts/build-info.sh uses LF instead of CRLF after migrating the whole project to WSL. This can be done easily with VSCode
  • After running server on WSL, you may access the service from Windows via http://127.0.0.1:PORT instead of http://ip.addr.of.eth0:PORT. According to link this seems to be a bug (or a feature).

docker run --gpus 'all' fails; are multiple GPUs not supported?

On a server with two A40 GPUs, when deploying the service with GPUs:
docker run --gpus 'all' --shm-size 1g -p 9090:80 -v $HOME/models:/data \
        --env LOG_LEVEL="info,text_generation_router=debug" \
        ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 \
        --model-id /data/CodeShell-7B-Chat --num-shard 1 \
        --max-total-tokens 5000 --max-input-length 4096 \
        --max-stop-sequences 12 --trust-remote-code

The error is as follows:
2024-01-19T08:15:44.995533Z ERROR warmup{max_input_length=4096 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Error: Warmup(Generation("Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!"))
2024-01-19T08:15:45.052858Z ERROR text_generation_launcher: Webserver Crashed
2024-01-19T08:15:45.052873Z INFO text_generation_launcher: Shutting down shards
2024-01-19T08:15:45.395141Z INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: WebserverFailed

RAG + LangChain to make it more powerful?

Adding this to an AI-powered coding assistant involves advanced capabilities, such as understanding a codebase and its unique style, as well as providing project-specific assistance.

Retrieval-Augmented Generation for coding could work by scanning and indexing a project's codebase so that the model can retrieve relevant snippets of code or documentation when generating code or explanations.

If we can do this, it will exceed any other AI assistant on the market :)
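
One way to prototype the retrieval side without any extra infrastructure is a plain lexical index over the workspace, prepending the top matches to the prompt sent to the model. A rough sketch, assuming scikit-learn is installed; the chunking (one file per document) and the prompt wording are illustrative only:

    # Illustrative retrieval-augmented prompt builder (assumes `pip install scikit-learn`).
    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def build_prompt(question: str, repo_root: str = ".", top_k: int = 3) -> str:
        # Index every Python file in the project as one document (naive chunking).
        files = [f for f in Path(repo_root).rglob("*.py") if f.stat().st_size > 0]
        docs = [f.read_text(errors="ignore") for f in files]
        if not docs:
            return question
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(docs + [question])
        # Rank project files by similarity to the question and keep the top_k snippets.
        scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
        best = scores.argsort()[::-1][:top_k]
        context = "\n\n".join(f"# {files[i]}\n{docs[i][:1000]}" for i in best)
        return f"Use the following project code as context:\n{context}\n\nQuestion: {question}"

    print(build_prompt("How does the HTTP client retry requests?"))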

Is deployment on dual-GPU machines supported?

sudo docker run --gpus 'all' --shm-size 1g -p 9090:80 -v /home/llh/model_hub/WisdomShell_CodeShell-7B-Chat:/data --env LOG_LEVEL="info,text_generation_router=debug" ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 --model-id /data --num-shard 2 --max-total-tokens 5000 --max-input-length 4096 --max-stop-sequences 12 --trust-remote-code

The machine has two RTX 6000 GPUs and CUDA 12.2; running the command fails with the following error:

2023-10-25T03:43:06.938048Z  INFO text_generation_launcher: Args { model_id: "/data", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 12, max_top_n_tokens: 5, max_input_length: 4096, max_total_tokens: 5000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "02da084c587e", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-25T03:43:06.938115Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/data` do not contain malicious code.
2023-10-25T03:43:06.938126Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-10-25T03:43:06.938328Z  INFO download: text_generation_launcher: Starting download process.
2023-10-25T03:43:09.670454Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-10-25T03:43:10.042577Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-25T03:43:10.042982Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-25T03:43:10.043031Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-10-25T03:43:12.861796Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-10-25T03:43:12.881244Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-10-25T03:43:12.933483Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")
ValueError: sharded is not supported for AutoModel

2023-10-25T03:43:12.952449Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")
ValueError: sharded is not supported for AutoModel

2023-10-25T03:43:13.348876Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")

ValueError: sharded is not supported for AutoModel
 rank=0
2023-10-25T03:43:13.446787Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-25T03:43:13.446824Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
2023-10-25T03:43:13.448962Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 266, in get_model
    raise ValueError("sharded is not supported for AutoModel")

ValueError: sharded is not supported for AutoModel
 rank=1

Does it support DeepSeek Coder yet?

Hi, I was using llama.cpp with DeepSeek Coder connected to this extension; code completion does not work (some error about it not being a string), but chat with the server works.
Then I tried llama_cpp_for_codeshell; code completion seems to connect (no error in VSCode), but the result is garbled characters:

@@@@@@

The same happens in chat with the server.

System: M2 Pro

Thanks

Generated content contains garbled characters

企业微信截图_dbb94793-3363-4b0b-9589-a8548e51e63f

[Environment]: macOS; the model service is deployed and accessed directly from the Chrome browser [version 117.0.5938.88 (official build) (arm64)].

Error when loading a local model with TGI

I am using TGI to load the local CodeShell-7B-Chat model, but it fails during loading. The command I used is:

sudo docker run --gpus 'all' --shm-size 1g -p 9090:80 -v /home/CodeShell/WisdomShell:/data --env LOG_LEVEL="info,text_generation_router=debug" ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 --model-id /data/CodeShell-7B-Chat --num-shard 1 --max-total-tokens 5000 --max-input-length 4096 --max-stop-sequences 12 --trust-remote-code

The output and error messages are as follows:

2023-10-24T01:47:14.674168Z  INFO text_generation_launcher: Args { model_id: "/data/CodeShell-7B-Chat", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 12, max_top_n_tokens: 5, max_input_length: 4096, max_total_tokens: 5000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "e2df4ceac2dc", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-24T01:47:14.674233Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `/data/CodeShell-7B-Chat` do not contain malicious code.
2023-10-24T01:47:14.685067Z  INFO download: text_generation_launcher: Starting download process.
2023-10-24T01:47:21.825629Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-10-24T01:47:23.136555Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-10-24T01:47:23.137089Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-10-24T01:47:30.969269Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
 rank=0
2023-10-24T01:47:30.969335Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 4 rank=0
Error: ShardCannotStart
2023-10-24T01:47:31.066204Z ERROR text_generation_launcher: Shard 0 failed to start
2023-10-24T01:47:31.066262Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

My current environment is:

GPU: NVIDIA V100
OS: Ubuntu 20.04
Python version: 3.10
Docker version: 24.0.5

Build error compiling the model code on an M1 Max

OS: macOS 14
Build error:
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -pthread
I CXXFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi
I NVCCFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-pedantic -Xcompiler "-Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi "
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC: Apple clang version 15.0.0 (clang-1500.0.40.1)
I CXX: Apple clang version 15.0.0 (clang-1500.0.40.1)

cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -pthread -c ggml.c -o ggml.o
ggml.c:543:5: error: call to undeclared function 'clock_gettime'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
clock_gettime(CLOCK_MONOTONIC, &ts);
^
ggml.c:543:5: note: did you mean 'clock_set_time'?
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/mach/clock_priv.h:79:15: note: 'clock_set_time' declared here
kern_return_t clock_set_time
^
ggml.c:543:19: error: use of undeclared identifier 'CLOCK_MONOTONIC'
clock_gettime(CLOCK_MONOTONIC, &ts);
^
ggml.c:549:5: error: call to undeclared function 'clock_gettime'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
clock_gettime(CLOCK_MONOTONIC, &ts);
^
ggml.c:549:19: error: use of undeclared identifier 'CLOCK_MONOTONIC'
clock_gettime(CLOCK_MONOTONIC, &ts);
^
ggml.c:555:12: error: call to undeclared function 'clock'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
return clock();
^
ggml.c:559:12: error: use of undeclared identifier 'CLOCKS_PER_SEC'
return CLOCKS_PER_SEC/1000;
^
ggml.c:896:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:937:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:978:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:1026:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:1073:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % QK8_0 == 0);
^
ggml.c:1098:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(QK8_0 == 32);
^
ggml.c:1286:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(QK8_1 == 32);
^
ggml.c:1321:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % QK8_1 == 0);
^
ggml.c:1540:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:1560:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:1581:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:1607:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
ggml.c:1634:5: error: call to undeclared function 'assert'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
assert(k % qk == 0);
^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make: *** [ggml.o] Error 1

Illegal instruction when running CodeShell in Termux on Android

Hi,
I have built the vsix and the server successfully in Termux running on an Android phone,
but an illegal instruction occurred and the server stopped.
Do you have any clue how to keep the server running?

For example, which source code should I try to debug or modify?

./server -m codeshell-chat-q4_0.gguf --host 127.0.0.1 --port 8081
..............................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1344.00 MB
llama_new_context_with_model: compute buffer total size = 558.13 MB
Illegal instruction
~/llama_cpp_for_codeshell $ free -m -h

Does CodeShell not support sharded mode?

The model is hosted via TGI and launched with:
BNB_CUDA_VERSION=122 CUDA_VISIBLE_DEVICES=0,1 text-generation-launcher --model-id /data/llms/codeshell-7b-chat --tokenizer-config-path /data/llms/codeshell-7b-chat/tokenizer_config.json --sharded true --trust-remote-code --port=8080

With CUDA_VISIBLE_DEVICES=0,1 and --sharded true set, it fails with:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

General usage questions and GPU mode not working

I followed the tutorial and ran the clone and build steps in order.

Case 1: Normal

After running make, the server runs normally.

The extension side has been confirmed to connect to the server normally, but inside the VS Code extension I see the following:

  1. The CodeShell service connection status shown in the bottom-left corner of VS Code always indicates failure, even though the service actually works.
  2. The chat panel often returns nothing, even after adjusting the thresholds.
  3. Auto-completion never seems to take effect, even after adjusting Auto Completion Delay. Or does auto-completion work differently from what I expect? Isn't it supposed to show gray code after the current edit position while typing?

Given the issues above, I suspected it might be related to speed, so I switched to running on the GPU, but the follow-up problems were even worse.

Case 2: GPU

Running make LLAMA_CUBLAS=1

returns:

I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  examples/main/main.cpp ggml.o llama.o common.o sampling.o console.o grammar-parser.o k_quants.o ggml-cuda.o ggml-alloc.o ggml-backend.o -o main -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o sampling.o grammar-parser.o k_quants.o ggml-cuda.o ggml-alloc.o ggml-backend.o -o server -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

Judging from the log, the build succeeded.

I then ran ./main -m ../models/CodeShell-7B-Chat-int4/codeshell-chat-q4_0.gguf --color -i -r "User:" and ./server -m ../models/CodeShell-7B-Chat-int4/codeshell-chat-q4_0.gguf --host 0.0.0.0 --port 8008 -mg 0

Both fail:

zsh: segmentation fault (core dumped)  ./main -m ../models/CodeShell-7B-Chat-int4/codeshell-chat-q4_0.gguf -n 256  -

{"timestamp":1697764284,"level":"INFO","function":"main","line":1356,"message":"build info","build":1385,"commit":"7382f26"}
{"timestamp":1697764284,"level":"INFO","function":"main","line":1358,"message":"system info","n_threads":16,"n_threads_batch":-1,"total_threads":32,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
zsh: segmentation fault (core dumped)  ./server -m ../models/CodeShell-7B-Chat-int4/codeshell-chat-q4_0.gguf --host

That is where things stand.

I hope the development team can answer these questions: 1) Can the q4 model run correctly on the GPU? 2) With the CPU-built server, everything except chat (which barely works) fails in the extension. Is that a problem on my end, or does it need further work?

Encountered issue when trying to build server on Windows

After running make server, an error occurs:

process_begin: CreateProcess(NULL, uname -s, ...) failed.
process_begin: CreateProcess(NULL, uname -p, ...) failed.
process_begin: CreateProcess(NULL, uname -m, ...) failed.
process_begin: CreateProcess(NULL, cc --version, ...) failed.
'cc' 不是内部或外部命令,也不是可运行的程序
或批处理文件。
process_begin: CreateProcess(NULL, expr >= 070100, ...) failed.
process_begin: CreateProcess(NULL, expr >= 080100, ...) failed.
process_begin: CreateProcess(NULL, cc -dumpmachine, ...) failed.
I llama.cpp build info:
I UNAME_S:
I UNAME_P:
I UNAME_M:
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -DGGML_USE_K_QUANTS  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -march=native -mtune=native
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -DGGML_USE_K_QUANTS  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn  -Wno-array-bounds -march=native -mtune=native
I NVCCFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -DNDEBUG -DGGML_USE_K_QUANTS  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn    -Wno-pedantic -Xcompiler "-Wno-array-bounds -march=native -mtune=native "
I LDFLAGS:
'cc' 不是内部或外部命令,也不是可运行的程序
或批处理文件。
I CC:
'head' 不是内部或外部命令,也不是可运行的程序
或批处理文件。
I CXX:

'sh' 不是内部或外部命令,也不是可运行的程序
或批处理文件。
make: *** [build-info.h] 错误 1

I'm not familiar with make. It seems that this Makefile does not support Windows yet?

Error starting TGI

Environment: Ubuntu 18.04
Memory: 64 GB

The command executed was:

docker run --gpus 'all' --shm-size 1g -p 8080:80 -v /opt/models:/data \
        --env LOG_LEVEL="info,text_generation_router=debug" \
        ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 \
        --model-id /data/CodeShell-7B-Chat-int4 --num-shard 1 \
        --max-total-tokens 5000 --max-input-length 4096 \
        --max-stop-sequences 12

Log:
2023-10-31T03:02:33.425212Z INFO text_generation_launcher: Args { model_id: "/data/CodeShell-7B-Chat-int4", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 12, max_top_n_tokens: 5, max_input_length: 4096, max_total_tokens: 5000, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "f9c8519bf276", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-10-31T03:02:33.425555Z INFO download: text_generation_launcher: Starting download process.
2023-10-31T03:02:54.427892Z WARN text_generation_launcher: No safetensors weights found for model /data/CodeShell-7B-Chat-int4 at revision None. Converting PyTorch weights to safetensors.

Error: DownloadError
2023-10-31T03:03:25.740295Z ERROR download: text_generation_launcher: Download encountered an error: Traceback (most recent call last):

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 195, in download_weights
utils.convert_files(local_pt_files, local_st_files, discard_names)

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 106, in convert_files
convert_file(pt_file, sf_file, discard_names)

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 68, in convert_file
to_removes = _remove_duplicate_names(loaded, discard_names=discard_names)

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 25, in _remove_duplicate_names
shareds = _find_shared_tensors(state_dict)

File "/opt/conda/lib/python3.9/site-packages/safetensors/torch.py", line 72, in _find_shared_tensors
if v.device != torch.device("meta") and storage_ptr(v) != 0 and storage_size(v) != 0:

AttributeError: 'list' object has no attribute 'device'

Downloading the model

The model download keeps timing out with a network connection error. What can I do?

Consider using a template for prompting the LLM?

When using the extension, the questions sent to the LLM are basically fixed, for example:
Explain the code
Optimize the code

The current prompts are too simple. Could we consider using an instructive prompt with an example? It does not make much of a difference for the 7B model,
but it might work very well for a larger model.
Also, if the same template were used during fine-tuning, it could greatly improve the LLM's consistency. A sketch of such a template follows below.
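
A minimal sketch of what a shared instructive template could look like; the wording, placeholders, and one-shot example are assumptions for illustration, not what the extension currently sends:

    # Hypothetical instructive prompt template with a one-shot example.
    EXPLAIN_TEMPLATE = (
        "You are a senior {language} developer. Explain code clearly.\n\n"
        "Example:\nCode:\ndef add(a, b):\n    return a + b\n"
        "Explanation:\nDefines add(), which returns the sum of its two arguments.\n\n"
        "Code:\n{code}\nExplanation:"
    )

    def build_explain_prompt(code: str, language: str = "Python") -> str:
        # Reusing the same template when building fine-tuning data keeps
        # inference and fine-tuning formats consistent.
        return EXPLAIN_TEMPLATE.format(language=language, code=code)

    print(build_explain_prompt("for i in range(3):\n    print(i)"))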

Error when deploying via text-generation-inference

The command is as follows:

docker run --gpus 'all' --shm-size 1g -p 9090:80 -v $HOME/codeshell/CodeShell-7B-Chat:/data \
        --env LOG_LEVEL="info,text_generation_router=debug" \
        ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 \
        --model-id WisdomShell/CodeShell-7B-Chat-int4 --num-shard 1 \
        --max-total-tokens 5000 --max-input-length 4096 \
        --max-stop-sequences 12 --trust-remote-code 

2023-10-24T07:02:11.814270Z INFO download: text_generation_launcher: Starting download process.
Error: DownloadError
2023-10-24T07:02:16.019924Z ERROR download: text_generation_launcher: Download encountered an error: Traceback (most recent call last):

File "/opt/conda/lib/python3.9/site-packages/urllib3/connection.py", line 203, in _new_conn
sock = connection.create_connection(

File "/opt/conda/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err

File "/opt/conda/lib/python3.9/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)

TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790, in urlopen
response = self._make_request(

File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e

File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)

File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1092, in _validate_conn
conn.connect()

File "/opt/conda/lib/python3.9/site-packages/urllib3/connection.py", line 611, in connect
self.sock = sock = self._new_conn()

File "/opt/conda/lib/python3.9/site-packages/urllib3/connection.py", line 218, in _new_conn
raise NewConnectionError(

urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f1b325b6fd0>: Failed to establish a new connection: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(

File "/opt/conda/lib/python3.9/site-packages/urllib3/connectionpool.py", line 844, in urlopen
retries = retries.increment(

File "/opt/conda/lib/python3.9/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/CodeShell-7B-Chat (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1b325b6fd0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 113, in download_weights
utils.weight_files(model_id, revision, extension)

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/hub.py", line 96, in weight_files
filenames = weight_hub_files(model_id, revision, extension)

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/hub.py", line 25, in weight_hub_files
info = api.model_info(model_id, revision=revision)

File "/opt/conda/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)

File "/opt/conda/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 1677, in model_info
r = get_session().get(path, headers=headers, timeout=timeout, params=params)

File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)

File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)

File "/opt/conda/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)

File "/opt/conda/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 63, in send
return super().send(request, *args, **kwargs)

File "/opt/conda/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)

requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/CodeShell-7B-Chat (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1b325b6fd0>: Failed to establish a new connection: [Errno 110] Connection timed out'))"), '(Request ID: a7ec4c93-df44-453f-95e5-7e027bc442b9)')

The network environment in mainland China cannot connect to Hugging Face.

Consider supporting multiple languages?

  1. Substitute hardcoded Chinese text with a language definition JSON file.
  2. In the CODESELL extension settings, incorporate a language choice for the user (English, 中文).
  3. Return the answer from the LLM in the specified language.
    I would like to contribute to this task.

The extension panel stays empty with a loading spinner after installing in VS Code

As the title says; screenshots below. The model service is already running.
Symptom: after clicking the extension in VS Code, the panel on the left stays empty and keeps showing a loading spinner.

企业微信截图_e225969a-d865-4bf6-be72-4fc0de542717

The service is accessible when opened in the local Chrome browser.
企业微信截图_6c6d800d-4a73-4928-a289-aadce77272fa
