milvus-standalone-local | [2024/01/05 08:31:22.854 +00:00] [ERROR] [gr

Project deployment error (qanything, open issue)

netease-youdao commented on September 27, 2024

Project deployment error

from qanything.

Comments (7)

xixihahaliu commented on September 27, 2024

It seems that there is a problem with the container startup of Milvus. You can try the following docker-compose.yaml, which is the official Milvus startup file. If executing this also results in an error, it means that Milvus cannot start properly.

version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.3.3
    command: ["milvus", "run", "standalone"]
    security_opt:
    - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

networks:
  default:
    name: milvus
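
Once the compose file above is up, one way to confirm that Milvus actually started is to poll the same health endpoint the `standalone` healthcheck uses (`http://localhost:9091/healthz`, published through the `9091:9091` port mapping). The sketch below is only an illustration: the function name `milvus_is_healthy`, the retry count, and the delay are assumptions, not part of QAnything or Milvus.

```python
import time
import urllib.error
import urllib.request


def milvus_is_healthy(url="http://localhost:9091/healthz", attempts=10, delay=3.0,
                      opener=urllib.request.urlopen):
    """Poll the health endpoint until it answers HTTP 200 or attempts run out.

    `opener` is injectable so the function can be exercised without a
    running container; by default it is urllib's urlopen.
    """
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Container not up yet, port not published, or a non-2xx answer.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `milvus_is_healthy()` a minute or two after `docker compose up -d` should return `True` on a working setup. If it keeps returning `False`, `docker logs milvus-standalone` is the next place to look.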

qmhl1 commented on September 27, 2024

Is this a bug? Log from /model_repos/QAEnsemble/QAEnsemble.log:

I0105 12:26:15.434683 85 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f1416000000' with size 268435456
I0105 12:26:15.435029 85 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0105 12:26:15.439093 85 model_lifecycle.cc:462] loading: rerank:1
I0105 12:26:15.439135 85 model_lifecycle.cc:462] loading: embed:1
I0105 12:26:15.439166 85 model_lifecycle.cc:462] loading: base:1
I0105 12:26:15.442887 85 onnxruntime.cc:2504] TRITONBACKEND_Initialize: onnxruntime
I0105 12:26:15.442939 85 onnxruntime.cc:2514] Triton TRITONBACKEND API version: 1.12
I0105 12:26:15.442967 85 onnxruntime.cc:2520] 'onnxruntime' TRITONBACKEND API version: 1.12
I0105 12:26:15.442993 85 onnxruntime.cc:2550] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0105 12:26:15.477325 85 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: embed (version 1)
I0105 12:26:15.477402 85 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: rerank (version 1)
I0105 12:26:15.477980 85 onnxruntime.cc:666] skipping model configuration auto-complete for 'embed': inputs and outputs already specified
I0105 12:26:15.478452 85 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: embed (GPU device 0)
I0105 12:26:15.478770 85 onnxruntime.cc:666] skipping model configuration auto-complete for 'rerank': inputs and outputs already specified
I0105 12:26:15.480427 85 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: rerank (GPU device 0)
[1704457575.790988] [739a2c81d313:85   :0]       **ib_device.c:1173 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::f652:14ff:feda:687e sgid_index=0 traffic_class=0) for UD verbs connect on mlx4_0 failed: Cannot allocate memory**
[739a2c81d313:00085] pml_ucx.c:424  Error: ucp_ep_create(proc=0) failed: Address not valid
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[739a2c81d313:00085] *** An error occurred in MPI_Init_thread
[739a2c81d313:00085] *** reported by process [682491905,0]
[739a2c81d313:00085] *** on a NULL communicator
[739a2c81d313:00085] *** Unknown error
[739a2c81d313:00085] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[739a2c81d313:00085] ***    and potentially your MPI job)
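
To see at a glance which models the log above got through, the `loading:` lines can be compared with the `TRITONBACKEND_ModelInstanceInitialize` lines. This is a rough sketch: the function name and the two regular expressions are assumptions based only on the line shapes in this log. A model missing from the initialized set is not proof of a failure in that model, only a hint about where the log stopped.

```python
import re

# Patterns modeled on the log lines quoted in this thread (an assumption,
# not a documented Triton log format).
LOADING = re.compile(r"loading: (\w+):\d+")
INSTANCE_INIT = re.compile(r"TRITONBACKEND_ModelInstanceInitialize: (\w+)")


def summarize_triton_log(text):
    """Return (requested, initialized, pending) sets of model names."""
    requested = set(LOADING.findall(text))
    initialized = set(INSTANCE_INIT.findall(text))
    return requested, initialized, requested - initialized


sample = """\
I0105 12:26:15.439093 85 model_lifecycle.cc:462] loading: rerank:1
I0105 12:26:15.439135 85 model_lifecycle.cc:462] loading: embed:1
I0105 12:26:15.439166 85 model_lifecycle.cc:462] loading: base:1
I0105 12:26:15.478452 85 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: embed (GPU device 0)
I0105 12:26:15.480427 85 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: rerank (GPU device 0)
"""
requested, initialized, pending = summarize_triton_log(sample)
```

On the excerpt here, `pending` contains only `base`: `embed` and `rerank` reach instance initialization through the onnxruntime backend, while no such line appears for `base` before the MPI_Init failure.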

successren commented on September 27, 2024

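I ran into this too, but this error does not seem to be the important one; the real bug appears to be elsewhere. On my side tritonserver fails to start: I waited a whole day and it never finished. From what I found, it looks like a memory problem, and adding something to docker-compose.yaml seems to fix it.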

How much memory does your machine have?
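
For anyone answering the memory question on Linux, the figures can be read from `/proc/meminfo`. The helper names below are hypothetical and the sketch assumes the usual `Name: value kB` layout of that file. Inside a container it normally reports the host's memory rather than any container limit, so it is best run on the host.

```python
def meminfo_gib(text):
    """Parse /proc/meminfo-style text into {field: GiB} for kB-valued fields."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Only lines of the form "Name:   12345 kB" are converted.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            values[name.strip()] = int(parts[0]) / (1024 * 1024)
    return values


def read_host_memory(path="/proc/meminfo"):
    """Read and parse the memory summary from the given file (Linux only)."""
    with open(path) as handle:
        return meminfo_gib(handle.read())
```

`read_host_memory()["MemTotal"]` and `["MemAvailable"]` give the total and currently available RAM in GiB, which is the information asked for above.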

qmhl1 commented on September 27, 2024

After deleting the models folder, I can start it. Did I make a mistake when decompressing the models?

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:65:00.0 Off |                  N/A |
| 22%   26C    P8              15W / 250W |     15MiB / 22528MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1174      G   /usr/lib/xorg/Xorg                            9MiB |
|    0   N/A  N/A      1418      G   /usr/bin/gnome-shell                          4MiB |
+---------------------------------------------------------------------------------------+
```
~/QAnything/models/...
> Is this a bug? Log from /model_repos/QAEnsemble/QAEnsemble.log:
> 
> ```
> I0105 12:26:15.434683 85 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f1416000000' with size 268435456
> I0105 12:26:15.435029 85 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
> I0105 12:26:15.439093 85 model_lifecycle.cc:462] loading: rerank:1
> I0105 12:26:15.439135 85 model_lifecycle.cc:462] loading: embed:1
> I0105 12:26:15.439166 85 model_lifecycle.cc:462] loading: base:1
> I0105 12:26:15.442887 85 onnxruntime.cc:2504] TRITONBACKEND_Initialize: onnxruntime
> I0105 12:26:15.442939 85 onnxruntime.cc:2514] Triton TRITONBACKEND API version: 1.12
> I0105 12:26:15.442967 85 onnxruntime.cc:2520] 'onnxruntime' TRITONBACKEND API version: 1.12
> I0105 12:26:15.442993 85 onnxruntime.cc:2550] backend configuration:
> {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
> I0105 12:26:15.477325 85 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: embed (version 1)
> I0105 12:26:15.477402 85 onnxruntime.cc:2608] TRITONBACKEND_ModelInitialize: rerank (version 1)
> I0105 12:26:15.477980 85 onnxruntime.cc:666] skipping model configuration auto-complete for 'embed': inputs and outputs already specified
> I0105 12:26:15.478452 85 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: embed (GPU device 0)
> I0105 12:26:15.478770 85 onnxruntime.cc:666] skipping model configuration auto-complete for 'rerank': inputs and outputs already specified
> I0105 12:26:15.480427 85 onnxruntime.cc:2651] TRITONBACKEND_ModelInstanceInitialize: rerank (GPU device 0)
> [1704457575.790988] [739a2c81d313:85   :0]       **ib_device.c:1173 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::f652:14ff:feda:687e sgid_index=0 traffic_class=0) for UD verbs connect on mlx4_0 failed: Cannot allocate memory**
> [739a2c81d313:00085] pml_ucx.c:424  Error: ucp_ep_create(proc=0) failed: Address not valid
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   PML add procs failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [739a2c81d313:00085] *** An error occurred in MPI_Init_thread
> [739a2c81d313:00085] *** reported by process [682491905,0]
> [739a2c81d313:00085] *** on a NULL communicator
> [739a2c81d313:00085] *** Unknown error
> [739a2c81d313:00085] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [739a2c81d313:00085] ***    and potentially your MPI job)
> ```

successren commented on September 27, 2024

😂 So, have you successfully run it now?

qmhl1 commented on September 27, 2024

Could it be a problem with /opt/tritonserver/backends/qa_ensemble? That may be what prevents the base model from loading.

xixihahaliu commented on September 27, 2024

| 22% 26C P8 15W / 250W | 15MiB / 22528MiB | 0% Default |

Why does this 2080 Ti have 22 GB of VRAM? The stock card ships with 11 GB. Is it a modified, non-official version of the GPU? That could cause unknown errors.
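
The arithmetic behind that question can be checked directly from the quoted `nvidia-smi` row. This is a small sketch: the function name is hypothetical, the regular expression assumes the `used MiB / total MiB` layout shown above, and the 11 GiB reference figure is the stock RTX 2080 Ti memory size.

```python
import re

STOCK_2080TI_MIB = 11 * 1024  # 11 GiB on the reference RTX 2080 Ti


def total_vram_mib(row):
    """Extract the total memory (MiB) from an nvidia-smi 'used / total' row."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    if match is None:
        raise ValueError("no 'used MiB / total MiB' field found")
    return int(match.group(2))


row = "| 22% 26C P8 15W / 250W | 15MiB / 22528MiB | 0% Default |"
total = total_vram_mib(row)
ratio = total / STOCK_2080TI_MIB
```

Here `total` is 22528 MiB, exactly 22 GiB, so `ratio` comes out at 2.0: twice the stock memory size.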
