
modelmesh's Introduction

KServe


KServe provides a Kubernetes Custom Resource Definition for serving predictive and generative machine learning (ML) models. It aims to solve production model serving use cases by providing high-abstraction interfaces for Tensorflow, XGBoost, ScikitLearn, PyTorch, and Huggingface Transformer/LLM models using standardized data plane protocols.

It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU Autoscaling, Scale to Zero, and Canary Rollouts to your ML deployments. It enables a simple, pluggable, and complete story for Production ML Serving including prediction, pre-processing, post-processing and explainability. KServe is being used across various organizations.

For more details, visit the KServe website.


KFServing has been rebranded to KServe since v0.7.

Why KServe?

  • KServe is a standard, cloud agnostic Model Inference Platform for serving predictive and generative AI models on Kubernetes, built for highly scalable use cases.
  • Provides a performant, standardized inference protocol across ML frameworks, including the OpenAI specification for generative models.
  • Supports modern serverless inference workloads with request-based autoscaling, including scale-to-zero on CPU and GPU.
  • Provides high scalability, density packing and intelligent routing using ModelMesh.
  • Simple and pluggable production serving for inference, pre/post processing, monitoring and explainability.
  • Advanced deployments for canary rollouts, pipelines, and ensembles with InferenceGraph.

Learn More

To learn more about KServe, how to use various supported features, and how to participate in the KServe community, please follow the KServe website documentation. Additionally, we have compiled a list of presentations and demos that dive into various details.

🛠️ Installation

Standalone Installation

  • Serverless Installation: KServe by default installs Knative for serverless deployment for InferenceService.
  • Raw Deployment Installation: Compared to Serverless Installation, this is a more lightweight installation. However, this option does not support canary deployment and request based autoscaling with scale-to-zero.
  • ModelMesh Installation: You can optionally install ModelMesh to enable high-scale, high-density and frequently-changing model serving use cases.
  • Quick Installation: Install KServe on your local machine.
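
The ModelMesh option, for example, has a quickstart path; the commands below are a minimal sketch assuming the kserve/modelmesh-serving repository layout (script name, flags, and namespace may differ between releases).

# Hedged quickstart sketch for ModelMesh Serving; verify the script and flags for your release.
git clone https://github.com/kserve/modelmesh-serving.git
cd modelmesh-serving
kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace modelmesh-serving --quickstart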

Kubeflow Installation

KServe is an important addon component of Kubeflow; please learn more from the Kubeflow KServe documentation. Check out the following guides for running on AWS or on OpenShift Container Platform.

💡 Roadmap

🧰 Developer Guide

✍️ Contributor Guide

🤝 Adopters

modelmesh's People

Contributors

aluu317, amnpandey, anhuong, cezhang, ckadner, ddelange, funbiscuit, israel-hdez, joerunde, kserve-oss-bot, legion2, modassarrana89, njhill, pvaneck, rafvasq, spolti, tteofili, vedantmahabaleshwarkar


modelmesh's Issues

ModelMesh serving not working with Kserve Client APIs

I used the KServe Client API to serve a model with ModelMesh Serving.

from kubernetes import client
from kserve import KServeClient, V1beta1InferenceService, V1beta1InferenceServiceSpec, \
    V1beta1PredictorSpec, V1beta1SKLearnSpec, constants, utils

name = 'iris-minio'
_namespace = utils.get_default_target_namespace()
default_model_spec = V1beta1InferenceServiceSpec(predictor=V1beta1PredictorSpec(sklearn=V1beta1SKLearnSpec(storage_uri='s3://model-serving-scikit/v4/model.pkl')))
isvc = V1beta1InferenceService(api_version=constants.KSERVE_V1BETA1, kind=constants.KSERVE_KIND, metadata=client.V1ObjectMeta(name=name, namespace=_namespace, annotations={'serving.kserve.io/deploymentMode': 'ModelMesh', 'serving.kserve.io/secretKey': 'localMinIO'}), spec=default_model_spec)
KServe = KServeClient()
KServe.create(isvc)

The InferenceService was created, but it never becomes ready:

  Model Status:
    Copies:
      Failed Copies:  0
    Last Failure Info:
      Message:              No ServingRuntime supports specified model type and/or protocol
      Model Revision Name:  iris-minio__isvc-b789ab6f4a
      Reason:               NoSupportingRuntime
      Time:                 2022-12-22T09:45:59Z
    States:
      Active Model State:  FailedToLoad
      Target Model State:

Please check; this looks like a bug.
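
The NoSupportingRuntime failure means that no ServingRuntime in the target namespace advertises support for the model format the InferenceService resolved to (here, sklearn). A hedged way to check, assuming the modelmesh-serving namespace:

# List the available runtimes and the formats they support, then inspect how the
# InferenceService was resolved (namespace and resource names are assumptions).
kubectl get servingruntimes -n modelmesh-serving
kubectl get servingruntimes -n modelmesh-serving -o yaml | grep -A3 supportedModelFormats
kubectl describe inferenceservice iris-minio -n modelmesh-serving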

ByteBuf leak observed in production deployment

A few instances of this were reported recently

2023-05-05 14:30:25
modelmesh-runtime ERROR LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:403)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140)
	com.ibm.watson.modelmesh.GrpcSupport$3.parse(GrpcSupport.java:403)
	com.ibm.watson.modelmesh.GrpcSupport$3.parse(GrpcSupport.java:318)
	io.grpc.MethodDescriptor.parseRequest(MethodDescriptor.java:307)
	io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:333)
	io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:316)
	io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:835)
	io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.base/java.lang.Thread.run(Thread.java:833)
	com.ibm.watson.litelinks.server.ServerRequestThread.run(ServerRequestThread.java:47)

This indicates a bug in either model-mesh or grpc-java. Circumstantially, it may have been introduced by #84; this needs investigation.

Questions about functioning of ModelMesh

I'm currently trying out ModelMesh to see if it's a good solution for the following problems:

  1. Only load models into GPUs and serve them when they are actually being used.
  2. Distribute the total set of loaded models across GPUs because they can not all fit in the memory of a single GPU.
  3. Automatically scale up models and GPUs when the workload increases.

It seems like ModelMesh handles (1) through its local and global LRU caching. To test this behavior, I created about 20 Predictor resources, expecting the models to be loaded only once the first inference request comes in. However, ModelMesh tried to load all of the models into memory immediately after the Predictors were created. Is this expected behavior? Are there ways to configure the LRU parameters?

I was testing the 20 Predictors with 2 NVIDIA Triton runtime instances (using the CPU and 1 GiB of RAM) and noticed that the pods would continuously be OOMKilled. It seemed like ModelMesh was trying to load more models into memory than the available capacity allows. This happened even when changing the memBufferBytes to 950 MiB. While debugging this, I realized that I never had to specify the required memory for any of the Predictors, and now I wonder how ModelMesh determines whether a model will "fit". Does it estimate how much memory a model requires by simply loading it once and then measuring how much memory it consumes?

The third thing I'm wondering is how ModelMesh ensures that there are enough runtime instances to host all of the (scaled) models. I understand that it can auto-scale models based on their usage and distribute them across more runtime instances in that case, but how does it scale the runtime instances themselves? I couldn't find any references to "replicas" in the source code of ModelMesh.

So, to summarize my questions:

  • Why does ModelMesh try to load a model immediately after a Predictor has been defined, even if no inference requests have come in yet?
  • How does ModelMesh determine a model's memory usage and whether it will fit on a given runtime instance?
  • How does ModelMesh scale the number of runtime instances to fit all of the models and their multiple replicas?
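
Not an authoritative answer to the third question, but in ModelMesh Serving the runtime pods themselves are scaled by the controller rather than by the model-mesh core (which may be why "replicas" does not appear in this repository): the count comes from the ServingRuntime's replicas field or the controller's global defaults. A hedged sketch, where the runtime name is a placeholder and the ConfigMap key is an assumption based on modelmesh-serving defaults:

# Scale a specific runtime by setting spec.replicas on the ServingRuntime
kubectl patch servingruntime <runtime-name> --type merge -p '{"spec":{"replicas":3}}'
# Or adjust the controller-wide default pod count per runtime (key name assumed)
kubectl create configmap model-serving-config -n modelmesh-serving \
  --from-literal=config.yaml='podsPerRuntime: 3'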

Error when enabling TLS for ModelMesh

Hi, I am following this doc to enable TLS for ModelMesh.

I have enabled rest-proxy, and TLS works successfully for it, so the TLS secret should be mounted correctly into the Pod. The error is coming from the mm container.

Error logs from mm container:

โฏ k logs modelmesh-serving-custom-mlserver-1.x-6755949cc9-mvr7j mm
Running as uid=2000(app) gid=2000(app) groups=2000(app): app
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH=
PKG_CONFIG_PATH=
XDG_DATA_DIRS=

Running in Kubernetes
Registering internal (pod) endpoint only
PORT_ARGS=-p 8080 -r 8080
WATSON_SERVICE_ADDRESS=10.244.0.73
Registering instance id (from Kubernetes pod name) as "949cc9-mvr7j"
WARNING: MM_SVC_GRPC_PRIVATE_KEY_PATH not set *AND/OR* MM_ENABLE_SSL=false, using PLAINTEXT for internal comms
SERVICE_VERSION=
build-version=20220721-36830
Registering ModelMesh Service version as "20220721-36830"
JAVA_HOME set to /usr/lib/jvm/jre-17-openjdk
Java version information:
openjdk version "17.0.3" 2022-04-19 LTS
OpenJDK Runtime Environment 21.9 (build 17.0.3+7-LTS)
OpenJDK 64-Bit Server VM 21.9 (build 17.0.3+7-LTS, mixed mode, sharing)
KV_STORE=etcd:/opt/kserve/mmesh/etcd/etcd_connection
LL_REGISTRY=
ZOOKEEPER=
WATSON_SERVICE_ADDRESS=10.244.0.73
MM_SERVICE_NAME=modelmesh-serving
PRIVATE_ENDPOINT=
MM_LOCATION=172.18.0.2
MM_SERVICE_CLASS=com.ibm.watson.modelmesh.SidecarModelMesh
INTERNAL_GRPC_PORT=8085
SERVICE_ARGS=-p 8080 -r 8080 -i 949cc9-mvr7j -v 20220721-36830
Certificate was added to keystore
Imported provided CA cert into litelinks truststore: /opt/kserve/mmesh/tls/tls.crt
Using provided private key for litelinks (internal) TLS: /opt/kserve/mmesh/tls/tls.key
Certificate was added to keystore
Imported Kubernetes CA certificate into litelinks truststore: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
No process mem limit provided or found, defaulting to 1536MiB
MEM_LIMIT_MB=1536
Using default heap size of MIN(41% of MEM_LIMIT_MB, 640MiB) = 629MiB
HEAP_SIZE_MB=629
MEM_HEADROOM_MB=189
MAX_GC_PAUSE=50 millisecs
MAX_DIRECT_BUFS_MB=715
Health probe HTTP endpoint will use port 8089
SHUTDOWN_TIMEOUT_MS=90000
+ exec /usr/lib/jvm/jre-17-openjdk/bin/java -cp 'lib/litelinks-core-1.7.2.jar:lib/*' -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:-ResizePLAB -Xmx629m -Xms629m -XX:MaxInlineLevel=28 -Xlog:gc:/opt/kserve/mmesh/log/vgc_modelmesh-serving-custom-mlserver-1.x-6755949cc9-mvr7j.log -Dfile.encoding=UTF8 -Dio.netty.tryReflectionSetAccessible=true --add-opens=java.base/java.nio=ALL-UNNAMED -Dcom.redhat.fips=false -XX:MaxDirectMemorySize=33554432 -Dio.netty.maxDirectMemory=749731840 -Dio.grpc.netty.useCustomAllocator=false -Dlitelinks.ssl.key.path=/opt/kserve/mmesh/tls/tls.key -Dlitelinks.ssl.key.certpath=/opt/kserve/mmesh/tls/tls.crt -Dwatson.ssl.truststore.path=/opt/kserve/mmesh/lib/truststore.jks -Dwatson.ssl.truststore.password=watson15qa -Dlitelinks.cancel_on_client_close=true -Dlitelinks.threadcontexts=log_mdc -Dlitelinks.shutdown_timeout_ms=90000 -Dlitelinks.produce_pooled_bytebufs=true -Dlitelinks.ssl.use_jdk=false -Dlog4j.configurationFile=/opt/kserve/mmesh/lib/log4j2.xml -Dlog4j2.enable.threadlocals=true -Dlog4j2.garbagefree.threadContextMap=true com.ibm.watson.litelinks.server.LitelinksService -s com.ibm.watson.modelmesh.SidecarModelMesh -n modelmesh-serving -a /opt/kserve/mmesh/model-mesh.anchor -p 8080 -r 8080 -i 949cc9-mvr7j -v 20220721-36830 -h 8089
using service registry string: etcd:/opt/kserve/mmesh/etcd/etcd_connection
{"instant":{"epochSecond":1665134888,"nanoOfSecond":351742124},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.WatchedService","message":"Starting service-watching wrapper; hostname=10.244.0.73","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":449461245},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.ProbeHttpServer","message":"Starting litelinks health probe http server on port 8089","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":455341678},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.NettyCommon","message":"Litelinks using native transport (epoll)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":480884593},"thread":"main","level":"INFO","loggerName":"com.ibm.watson.litelinks.NettyCommon","message":"Creating litelinks shared worker ELG with 3 threads","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":1,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":524638590},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.DefaultThriftServer","message":"initializing service implementation...","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
service starting, type "stop" to stop
{"instant":{"epochSecond":1665134888,"nanoOfSecond":751308930},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.kvutils.factory.KVUtilsFactory","message":"KV_STORE=etcd:/opt/kserve/mmesh/etcd/etcd_connection","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
{"instant":{"epochSecond":1665134888,"nanoOfSecond":752015442},"thread":"ll-svc-events-3","level":"INFO","loggerName":"com.ibm.watson.kvutils.factory.KVUtilsFactory","message":"creating new etcd KV factory with config file: /opt/kserve/mmesh/etcd/etcd_connection","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":20,"threadPriority":5}
service failed, exiting
java.lang.IllegalStateException: Expected the service modelmesh-serving to be RUNNING, but the service has FAILED
        at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:381)
        at com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:321)
        at com.ibm.watson.litelinks.server.LitelinksService.run(LitelinksService.java:699)
        at com.ibm.watson.litelinks.server.LitelinksService.launch(LitelinksService.java:144)
        at com.ibm.watson.litelinks.server.LitelinksService.main(LitelinksService.java:108)
Caused by: java.io.IOException: Can't find certificate file: tls.key
        at com.ibm.etcd.client.config.EtcdClusterConfig.certFromJson(EtcdClusterConfig.java:225)
        at com.ibm.etcd.client.config.EtcdClusterConfig.fromJson(EtcdClusterConfig.java:199)
        at com.ibm.etcd.client.config.EtcdClusterConfig.fromJsonFileOrSimple(EtcdClusterConfig.java:265)
        at com.ibm.watson.etcd.EtcdUtilsFactory.<init>(EtcdUtilsFactory.java:58)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
        at com.ibm.watson.kvutils.factory.KVUtilsFactory.newFactory(KVUtilsFactory.java:99)
        at com.ibm.watson.kvutils.factory.KVUtilsFactory.getDefaultFactory(KVUtilsFactory.java:60)
        at com.ibm.watson.modelmesh.ModelMesh.<init>(ModelMesh.java:470)
        at com.ibm.watson.modelmesh.SidecarModelMesh.<init>(SidecarModelMesh.java:147)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
        at java.base/java.lang.reflect.ReflectAccess.newInstance(ReflectAccess.java:128)
        at java.base/jdk.internal.reflect.ReflectionFactory.newInstance(ReflectionFactory.java:347)
        at java.base/java.lang.Class.newInstance(Class.java:645)
        at com.ibm.watson.litelinks.server.DefaultThriftServer.lambda$doStart$0(DefaultThriftServer.java:384)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
        at com.ibm.watson.litelinks.ThreadPoolHelper$3$1.run(ThreadPoolHelper.java:91)
{"instant":{"epochSecond":1665134888,"nanoOfSecond":769517095},"thread":"ll-svc-events-5","level":"INFO","loggerName":"com.ibm.watson.litelinks.server.ProbeHttpServer","message":"Stopping litelinks health probe http server on port 8089","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":22,"threadPriority":5}

mm container spec:

  containers:
  - env:
    - name: MM_SERVICE_NAME
      value: modelmesh-serving
    - name: MM_SVC_GRPC_PORT
      value: "8033"
    - name: WKUBE_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: WKUBE_POD_IPADDR
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: MM_LOCATION
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: KV_STORE
      value: etcd:/opt/kserve/mmesh/etcd/etcd_connection
    - name: MM_METRICS
      value: prometheus:port=2112;scheme=https
    - name: SHUTDOWN_TIMEOUT_MS
      value: "90000"
    - name: MM_SVC_GRPC_MAX_HEADERS_SIZE
      value: "32768"
    - name: INTERNAL_SERVING_GRPC_PORT
      value: "8001"
    - name: INTERNAL_GRPC_PORT
      value: "8085"
    - name: MM_SVC_GRPC_MAX_MSG_SIZE
      value: "16777216"
    - name: MM_KVSTORE_PREFIX
      value: mm
    - name: MM_DEFAULT_VMODEL_OWNER
      value: ksp
    - name: MM_LABELS
      value: mt:custom,mt:custom:1,rt:custom-mlserver-1.x
    - name: MM_TYPE_CONSTRAINTS_PATH
      value: /etc/watson/mmesh/config/type_constraints
    - name: MM_DATAPLANE_CONFIG_PATH
      value: /etc/watson/mmesh/config/dataplane_api_config
    - name: MM_TLS_KEY_CERT_PATH
      value: /opt/kserve/mmesh/tls/tls.crt
    - name: MM_TLS_PRIVATE_KEY_PATH
      value: /opt/kserve/mmesh/tls/tls.key
    image: kserve/modelmesh:v0.9.0
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /opt/kserve/mmesh/stop.sh
          - wait
    livenessProbe:
      failureThreshold: 2
      httpGet:
        path: /live
        port: 8089
        scheme: HTTP
      initialDelaySeconds: 90
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    name: mm
    ports:
    - containerPort: 8033
      name: grpc
      protocol: TCP
    - containerPort: 2112
      name: prometheus
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready
        port: 8089
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "3"
        memory: 448Mi
      requests:
        cpu: 300m
        memory: 448Mi
    securityContext:
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/watson/mmesh/config
      name: tc-config
    - mountPath: /opt/kserve/mmesh/etcd
      name: etcd-config
      readOnly: true
    - mountPath: /opt/kserve/mmesh/tls
      name: tls-certs
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-8gc8b
      readOnly: true

...
  volumes:
  - emptyDir:
      sizeLimit: 1536Mi
    name: models-dir
  - name: storage-config
    secret:
      defaultMode: 420
      secretName: storage-config
  - configMap:
      defaultMode: 420
      name: tc-config
    name: tc-config
  - name: etcd-config
    secret:
      defaultMode: 420
      secretName: model-serving-etcd
  - name: tls-certs
    secret:
      defaultMode: 420
      secretName: etcd-client-certificate
  - name: kube-api-access-8gc8b
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
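
One thing worth noting from the stack trace: the failure comes from EtcdClusterConfig.certFromJson, so the tls.key it cannot find is a certificate path referenced from the etcd connection config (mounted at /opt/kserve/mmesh/etcd from the model-serving-etcd secret above), not the litelinks/gRPC listener key. A hedged way to inspect that config; the secret name and key follow the volumes section above, and the namespace is a placeholder:

# Dump the etcd connection JSON that the mm container reads and check that any
# certificate/key entries in it point at paths that exist inside the pod.
kubectl get secret model-serving-etcd -n <namespace> \
  -o jsonpath='{.data.etcd_connection}' | base64 -d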

"Nowhere available to load" error in inference response

Describe the bug

Setup: 2 serving runtimes (host-1 and host-2) and 2 inference services (dep-1 and dep-2) running within a single namespace, with 1 inference service loaded onto each serving runtime pod in the following way:

{
	"host-1": "dep-1",
	"host-2": "dep-2"
}

What was done: 10 inference requests were sent to each inference service and the inference responses were recorded. About half of the requests returned a successful inference response, but the other half of the responses looked like this with status code 500:

{
	"code": 13,
	"message": "Nowhere available to load"
}

Logs from the mm sidecar containers of the 2 pods were recorded and some interesting observations are as follows:

  1. The following error logs, which correspond to the error response shown to the user, are seen in the mm container. Based on the trace shown in the logs, the line of code that threw the error seems to suggest that some kind of state inconsistency arose.
 {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 171991809
            },
            "thread": "invoke-ex-dep-2__isvc-f9b60f9db5",
            "level": "WARN",
            "loggerName": "com.ibm.watson.modelmesh.ModelMesh",
            "message": "Nowhere available to load for model dep-2__isvc-f9b60f9db5: type=rt:host-2, constraintBlocked=true, loadTargetFilter=LTF[][l=[bbd974-mk99v], f=[], e=null], instanceTable=[bbd974-mk99v, 87f6fc-6rwf9]",
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.log4j.spi.AbstractLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        },
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 172214768
            },
            "thread": "mmesh-req-thread-4",
            "level": "ERROR",
            "loggerName": "com.ibm.watson.modelmesh.ModelMeshApi",
            "message": "Exception invoking inference.GRPCInferenceService/ModelInfer method of resolved model dep-2__isvc-f9b60f9db5 of vmodel dep-2: ModelLoadException: Nowhere available to load",
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.slf4j.Log4jLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        },
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 172293096
            },
            "thread": "mmesh-req-thread-4",
            "level": "ERROR",
            "loggerName": "com.ibm.watson.modelmesh.ModelMeshApi",
            "message": "Exception invoking inference.GRPCInferenceService/ModelInfer method for vmodel ids [dep-2]",
            "thrown": {
                "commonElementCount": 0,
                "localizedMessage": "Nowhere available to load",
                "message": "Nowhere available to load",
                "name": "com.ibm.watson.modelmesh.thrift.ModelLoadException",
                "extendedStackTrace": [
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMesh",
                        "method": "throwIfLocalLoadNotAllowed",
                        "file": "ModelMesh.java",
                        "line": 3991,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMesh",
                        "method": "invokeModel",
                        "file": "ModelMesh.java",
                        "line": 3818,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.SidecarModelMesh",
                        "method": "callModel",
                        "file": "SidecarModelMesh.java",
                        "line": 1106,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMeshApi",
                        "method": "callModel",
                        "file": "ModelMeshApi.java",
                        "line": 436,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMeshApi$4",
                        "method": "onHalfClose",
                        "file": "ModelMeshApi.java",
                        "line": 705,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl",
                        "method": "halfClosed",
                        "file": "ServerCallImpl.java",
                        "line": 340,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed",
                        "method": "runInContext",
                        "file": "ServerImpl.java",
                        "line": 866,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "io.grpc.internal.ContextRunnable",
                        "method": "run",
                        "file": "ContextRunnable.java",
                        "line": 37,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "io.grpc.internal.SerializingExecutor",
                        "method": "run",
                        "file": "SerializingExecutor.java",
                        "line": 133,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "java.util.concurrent.ThreadPoolExecutor",
                        "method": "runWorker",
                        "file": "ThreadPoolExecutor.java",
                        "line": 1136,
                        "exact": false,
                        "location": "?",
                        "version": "?"
                    },
                    {
                        "class": "java.util.concurrent.ThreadPoolExecutor$Worker",
                        "method": "run",
                        "file": "ThreadPoolExecutor.java",
                        "line": 635,
                        "exact": false,
                        "location": "?",
                        "version": "?"
                    },
                    {
                        "class": "io.netty.util.concurrent.FastThreadLocalRunnable",
                        "method": "run",
                        "file": "FastThreadLocalRunnable.java",
                        "line": 30,
                        "exact": false,
                        "location": "netty-common-4.1.77.Final.jar",
                        "version": "4.1.77.Final"
                    },
                    {
                        "class": "java.lang.Thread",
                        "method": "run",
                        "file": "Thread.java",
                        "line": 833,
                        "exact": false,
                        "location": "?",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.litelinks.server.ServerRequestThread",
                        "method": "run",
                        "file": "ServerRequestThread.java",
                        "line": 47,
                        "exact": false,
                        "location": "litelinks-core-1.7.2.jar",
                        "version": "?"
                    }
                ]
            },
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.slf4j.Log4jLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        }
  2. Even though dep-2 should be loaded onto host-2, there were logs from the thread invoke-ex-dep-2__isvc-f9b60f9db5 found in the host-1 pod. These logs are:
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 171180816
            },
            "thread": "invoke-ex-dep-2__isvc-f9b60f9db5",
            "level": "INFO",
            "loggerName": "com.ibm.watson.litelinks.client.ClientInvocationHandler",
            "message": "Retryable failure occurred for service=modelmesh-serving, method=applyModelMulti; retrying: class com.ibm.watson.litelinks.TConnectionClosedException: NOT_OPEN",
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.slf4j.Log4jLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        },
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 171306712
            },
            "thread": "invoke-ex-dep-2__isvc-f9b60f9db5",
            "level": "WARN",
            "loggerName": "com.ibm.watson.modelmesh.ModelMesh",
            "message": "ForwardingLB received no instances",
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.log4j.spi.AbstractLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        },
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 171343456
            },
            "thread": "invoke-ex-dep-2__isvc-f9b60f9db5",
            "level": "WARN",
            "loggerName": "com.ibm.watson.modelmesh.ModelMesh",
            "message": "ForwardingLB received no instances",
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.log4j.spi.AbstractLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        },
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 171991809
            },
            "thread": "invoke-ex-dep-2__isvc-f9b60f9db5",
            "level": "WARN",
            "loggerName": "com.ibm.watson.modelmesh.ModelMesh",
            "message": "Nowhere available to load for model dep-2__isvc-f9b60f9db5: type=rt:host-2, constraintBlocked=true, loadTargetFilter=LTF[][l=[bbd974-mk99v], f=[], e=null], instanceTable=[bbd974-mk99v, 87f6fc-6rwf9]",
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.log4j.spi.AbstractLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        },
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 172214768
            },
            "thread": "mmesh-req-thread-4",
            "level": "ERROR",
            "loggerName": "com.ibm.watson.modelmesh.ModelMeshApi",
            "message": "Exception invoking inference.GRPCInferenceService/ModelInfer method of resolved model dep-2__isvc-f9b60f9db5 of vmodel dep-2: ModelLoadException: Nowhere available to load",
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.slf4j.Log4jLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        },
        {
            "instant": {
                "epochSecond": 1668759502,
                "nanoOfSecond": 172293096
            },
            "thread": "mmesh-req-thread-4",
            "level": "ERROR",
            "loggerName": "com.ibm.watson.modelmesh.ModelMeshApi",
            "message": "Exception invoking inference.GRPCInferenceService/ModelInfer method for vmodel ids [dep-2]",
            "thrown": {
                "commonElementCount": 0,
                "localizedMessage": "Nowhere available to load",
                "message": "Nowhere available to load",
                "name": "com.ibm.watson.modelmesh.thrift.ModelLoadException",
                "extendedStackTrace": [
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMesh",
                        "method": "throwIfLocalLoadNotAllowed",
                        "file": "ModelMesh.java",
                        "line": 3991,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMesh",
                        "method": "invokeModel",
                        "file": "ModelMesh.java",
                        "line": 3818,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.SidecarModelMesh",
                        "method": "callModel",
                        "file": "SidecarModelMesh.java",
                        "line": 1106,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMeshApi",
                        "method": "callModel",
                        "file": "ModelMeshApi.java",
                        "line": 436,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.modelmesh.ModelMeshApi$4",
                        "method": "onHalfClose",
                        "file": "ModelMeshApi.java",
                        "line": 705,
                        "exact": false,
                        "location": "model-mesh-0.4.2-SNAPSHOT.jar",
                        "version": "?"
                    },
                    {
                        "class": "io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl",
                        "method": "halfClosed",
                        "file": "ServerCallImpl.java",
                        "line": 340,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed",
                        "method": "runInContext",
                        "file": "ServerImpl.java",
                        "line": 866,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "io.grpc.internal.ContextRunnable",
                        "method": "run",
                        "file": "ContextRunnable.java",
                        "line": 37,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "io.grpc.internal.SerializingExecutor",
                        "method": "run",
                        "file": "SerializingExecutor.java",
                        "line": 133,
                        "exact": false,
                        "location": "grpc-core-1.46.0.jar",
                        "version": "1.46.0"
                    },
                    {
                        "class": "java.util.concurrent.ThreadPoolExecutor",
                        "method": "runWorker",
                        "file": "ThreadPoolExecutor.java",
                        "line": 1136,
                        "exact": false,
                        "location": "?",
                        "version": "?"
                    },
                    {
                        "class": "java.util.concurrent.ThreadPoolExecutor$Worker",
                        "method": "run",
                        "file": "ThreadPoolExecutor.java",
                        "line": 635,
                        "exact": false,
                        "location": "?",
                        "version": "?"
                    },
                    {
                        "class": "io.netty.util.concurrent.FastThreadLocalRunnable",
                        "method": "run",
                        "file": "FastThreadLocalRunnable.java",
                        "line": 30,
                        "exact": false,
                        "location": "netty-common-4.1.77.Final.jar",
                        "version": "4.1.77.Final"
                    },
                    {
                        "class": "java.lang.Thread",
                        "method": "run",
                        "file": "Thread.java",
                        "line": 833,
                        "exact": false,
                        "location": "?",
                        "version": "?"
                    },
                    {
                        "class": "com.ibm.watson.litelinks.server.ServerRequestThread",
                        "method": "run",
                        "file": "ServerRequestThread.java",
                        "line": 47,
                        "exact": false,
                        "location": "litelinks-core-1.7.2.jar",
                        "version": "?"
                    }
                ]
            },
            "endOfBatch": false,
            "loggerFqcn": "org.apache.logging.slf4j.Log4jLogger",
            "contextMap": {},
            "threadId": 61,
            "threadPriority": 5
        }

To Reproduce
Steps to reproduce the behavior:

  1. Create 2 serving runtimes and 2 inference services. The same spec is used for both serving runtimes, and the same spec is used for both inference services.

Spec of inference service:

  spec:
    predictor:
      model:
        modelFormat:
          name: custom
          version: "1"
        name: ""
        resources: {}
        runtime: host-1
        storageUri: xxx

Spec of serving runtime:

spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: mlserver
  containers:
  - env:
    - name: MLSERVER_MODELS_DIR
      value: /models/
    - name: MLSERVER_GRPC_PORT
      value: "8001"
    - name: MLSERVER_HTTP_PORT
      value: "8002"
    - name: MLSERVER_LOAD_MODELS_AT_STARTUP
      value: "false"
    - name: MLSERVER_MODEL_NAME
      value: dummy-model-fixme
    - name: MLSERVER_HOST
      value: 127.0.0.1
    - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
      value: "16777216"
    - name: MLSERVER_DEBUG
      value: "true"
    - name: MLSERVER_MODEL_PARALLEL_WORKERS
      value: "0"
    image: lizzzcai/my-custom-mlserver:0.0.5
    name: mlserver
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 1Gi
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  multiModel: true
  replicas: 1
  storageHelper:
    disabled: false
  supportedModelFormats:
  - autoSelect: true
    name: custom
    version: "1"
  volumes:
  - name: etcd-peer-certificate
    secret:
      defaultMode: 420
      secretName: etcd-peer-certificate
  2. Once the models are loaded, send multiple inference requests to each inference service (for example with grpcurl, as sketched below). Some of the inference responses should return the "Nowhere available to load" error.
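
For step 2, any KServe v2 gRPC client works; a hedged grpcurl sketch, where the service address, proto file, and payload file are placeholders and the mm-vmodel-id header follows modelmesh-serving's gRPC examples:

# Send repeated ModelInfer calls through the modelmesh-serving service (port 8033).
# payload.json holds a v2 ModelInferRequest; mm-vmodel-id selects the InferenceService.
for i in $(seq 1 10); do
  grpcurl -plaintext -proto grpc_predict_v2.proto -H 'mm-vmodel-id: dep-1' \
    -d @ modelmesh-serving:8033 inference.GRPCInferenceService/ModelInfer < payload.json
done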

Expected behavior

Expected that all inference requests would return successful responses.


Additional context

Here is the etcd data for the 2 serving runtimes and inference services.

modelmesh-serving/mm_ns/my_namespace/litelinks/modelmesh-serving/100.96.35.150_8080_18489c040ab
{"hostname":"100.96.35.150","instanceId":"87f6fc-6rwf9","port":8080,"version":"20220721-36830","registrationTime":1668758388909,"connConfig":{"transport.tprotocol.factory":"org.apache.thrift.protocol.TCompactProtocol$Factory","transport.framed":"false","transport.ssl.enabled":"false","transport.extrainfo_supported":"true","service.class":"com.ibm.watson.modelmesh.thrift.ModelMeshService","methodinfo.applyModel":"idp=t","methodinfo.applyModelMulti":"idp=t","app.kv_store_type":"etcd"}}

modelmesh-serving/mm_ns/my_namespace/litelinks/modelmesh-serving/100.96.35.151_8080_18489c1df9f
{"hostname":"100.96.35.151","instanceId":"bbd974-mk99v","port":8080,"version":"20220721-36830","registrationTime":1668758495137,"connConfig":{"transport.tprotocol.factory":"org.apache.thrift.protocol.TCompactProtocol$Factory","transport.framed":"false","transport.ssl.enabled":"false","transport.extrainfo_supported":"true","service.class":"com.ibm.watson.modelmesh.thrift.ModelMeshService","methodinfo.applyModel":"idp=t","methodinfo.applyModelMulti":"idp=t","app.kv_store_type":"etcd"}}

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/instances/87f6fc-6rwf9
{"startTime":1668758384148,"loc":"10.250.7.232","labels":["mt:custom","mt:custom:1","rt:host-1"],"actionable":true,"lruTime":1668758582923,"count":1,"cap":373064,"used":49,"lThreads":1}

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/instances/bbd974-mk99v
{"startTime":1668758490389,"loc":"10.250.7.232","labels":["mt:custom","mt:custom:1","rt:host-2"],"actionable":true,"lruTime":1668758586937,"count":1,"cap":373064,"used":49,"lThreads":1}

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/leaderLatch/_87f6fc-6rwf9
_87f6fc-6rwf9

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/leaderLatch/_bbd974-mk99v
_bbd974-mk99v

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/registry/dep-1__isvc-69314cdd9d
{"type":"rt:host-1","encKey":"{\"storage_key\":\"default\",\"storage_params\":{\"bucket\":\"hcp-160e0506-618b-49a7-b3ac-269b5849c4e3\",\"type\":\"s3\"},\"model_type\":{\"name\":\"custom\",\"version\":\"1\"}}","mPath":"i543026/mlserver_models/text-classifier","autoDel":true,"instanceIds":{"87f6fc-6rwf9":1668758581118},"refs":1,"lu":1668758581118}

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/registry/dep-2__isvc-f9b60f9db5
{"type":"rt:host-2","encKey":"{\"storage_key\":\"default\",\"storage_params\":{\"bucket\":\"hcp-160e0506-618b-49a7-b3ac-269b5849c4e3\",\"type\":\"s3\"},\"model_type\":{\"name\":\"custom\",\"version\":\"1\"}}","mPath":"i543026/mlserver_models/text-classifier","autoDel":true,"instanceIds":{"bbd974-mk99v":1668758583512},"refs":1,"lu":1668758583512}

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/vmodels/dep-1
{"o":"isvc","amid":"dep-1__isvc-69314cdd9d","tmid":"dep-1__isvc-69314cdd9d"}

modelmesh-serving/mm_ns/my_namespace/mm/modelmesh-serving/vmodels/dep-2
{"o":"isvc","amid":"dep-2__isvc-f9b60f9db5","tmid":"dep-2__isvc-f9b60f9db5"}

Incorrect limits in dockerfile when cgroup v2 is used

Currently the start.sh script contains the following code to automatically set the memory limit:

if [ "$MEM_LIMIT_MB" = "" ]; then
	DOCKER_LIM_FILE="/sys/fs/cgroup/memory/memory.limit_in_bytes"

	if [ -e "${DOCKER_LIM_FILE}" ]; then
		MEM_LIMIT_MB=$(($(cat ${DOCKER_LIM_FILE})/1024/1024))
		echo "Using process mem limit of ${MEM_LIMIT_MB}MiB from ${DOCKER_LIM_FILE}"
	else
    	MEM_LIMIT_MB="1536"
    	echo "No process mem limit provided or found, defaulting to ${MEM_LIMIT_MB}MiB"
	fi
fi

But it doesn't work in my case on Kubernetes, printing the following message:

No process mem limit provided or found, defaulting to 1536MiB
MEM_LIMIT_MB=1536
Using default heap size of MIN(41% of MEM_LIMIT_MB, 640MiB) = 629MiB
HEAP_SIZE_MB=629

But there is a limit on this container (set by default by the modelmesh controller) equal to 448Mi. This leads to an OOMKilled error because the Java heap tries to grow to 629MiB, which is greater than the actual limit. I checked, and the actual limit is present at this path:

[app@modelmesh-serving-test-server-745dc7fb57-fxvxv mmesh]$ cat /sys/fs/cgroup/memory.max 
469762048

This is due to cgroup v2, which stores limits at a different path. Is it possible to change the start.sh script so it works with both cgroup versions?
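
A minimal sketch of how start.sh could look up the limit under both cgroup versions; the v2 path matches the one shown above, and the surrounding defaulting logic mirrors the existing script (this is an assumption, not the project's actual fix):

if [ "$MEM_LIMIT_MB" = "" ]; then
	CGV1_LIM_FILE="/sys/fs/cgroup/memory/memory.limit_in_bytes"
	CGV2_LIM_FILE="/sys/fs/cgroup/memory.max"
	if [ -e "${CGV1_LIM_FILE}" ]; then
		MEM_LIMIT_MB=$(($(cat ${CGV1_LIM_FILE})/1024/1024))
		echo "Using process mem limit of ${MEM_LIMIT_MB}MiB from ${CGV1_LIM_FILE}"
	elif [ -e "${CGV2_LIM_FILE}" ] && [ "$(cat ${CGV2_LIM_FILE})" != "max" ]; then
		# cgroup v2 stores the limit in memory.max and uses "max" for unlimited
		MEM_LIMIT_MB=$(($(cat ${CGV2_LIM_FILE})/1024/1024))
		echo "Using process mem limit of ${MEM_LIMIT_MB}MiB from ${CGV2_LIM_FILE}"
	else
		MEM_LIMIT_MB="1536"
		echo "No process mem limit provided or found, defaulting to ${MEM_LIMIT_MB}MiB"
	fi
fi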

OOMKilled Error

When running a ServingRuntime with 2 replicas, models are not unloaded/loaded correctly, leading to an OOMKilled error.
In this example, the model size is ~891.7 MB. 4 InferenceServices were created incrementally. As shown below (number of models -> memory usage), not all models were loaded into both replicas. Since Pod 2 had only 1 model loaded and sufficient memory available, the fourth model was expected to be loaded into Pod 2. However, that was not the case.

Pod 1:
2 models -> 3.1GiB
3 models -> 4GiB
4 models -> crash

Pod 2:
Only 1 copy of model loaded

mlserver-adapter container logs:


2022-12-16T15:05:38+08:00 1.6711743387692351e+09	INFO	MLServer Adapter.MLServer Adapter Server.Load Model	Setting 'SizeInBytes' to a multiple of model disk size	{"model_id": "de-id-55__isvc-60ac5fc3c4", "SizeInBytes": 1118056985, "disk_size": 894445588, "multiplier": 1.25}
2022-12-16T15:05:38+08:00 1.6711743387692757e+09	INFO	MLServer Adapter.MLServer Adapter Server.Load Model	MLServer model loaded	{"model_id": "de-id-55__isvc-60ac5fc3c4"}
2022-12-16T15:08:00+08:00 1.6711744809072714e+09	INFO	MLServer Adapter.MLServer Adapter Server.Load Model	Using model type	{"model_id": "de-id-66__isvc-60ac5fc3c4", "model_type": "custom"}
2022-12-16T15:08:00+08:00 1.6711744809073634e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	Reading storage credentials
2022-12-16T15:08:00+08:00 1.6711744809074411e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	found existing client in cache	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e"}
2022-12-16T15:08:00+08:00 1.6711744809523559e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	found objects to download	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/", "count": 7}
2022-12-16T15:08:00+08:00 1.6711744809525683e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/classifier_pipeline.pkl", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/classifier_pipeline.pkl"}
2022-12-16T15:08:00+08:00 1.6711744809526064e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/config.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/config.json"}
2022-12-16T15:08:00+08:00 1.6711744809526336e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/model-settings.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/model-settings.json"}
2022-12-16T15:08:00+08:00 1.6711744809526622e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/pytorch_model.bin", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/pytorch_model.bin"}
2022-12-16T15:08:00+08:00 1.671174480952697e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/special_tokens_map.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/special_tokens_map.json"}
2022-12-16T15:08:00+08:00 1.6711744809527254e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/tokenizer.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/tokenizer.json"}
2022-12-16T15:08:00+08:00 1.6711744809527464e+09	DEBUG	MLServer Adapter.MLServer Adapter Server	downloading object	{"type": "s3", "cacheKey": "s3|0x2989f8038a8de21e", "path": "example/model-2/tokenizer_config.json", "filename": "/models/de-id-66__isvc-60ac5fc3c4/model-2/tokenizer_config.json"}
2022-12-16T15:08:36+08:00 1.6711745169826386e+09	ERROR	MLServer Adapter.MLServer Adapter Server.Load Model	MLServer failed to load model	{"model_id": "de-id-66__isvc-60ac5fc3c4", "error": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2022-12-16T15:08:36+08:00 github.com/kserve/modelmesh-runtime-adapter/internal/proto/mmesh._ModelRuntime_LoadModel_Handler
2022-12-16T15:08:36+08:00 	/opt/app/internal/proto/mmesh/model-runtime_grpc.pb.go:181
2022-12-16T15:08:36+08:00 google.golang.org/grpc.(*Server).processUnaryRPC
2022-12-16T15:08:36+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1282
2022-12-16T15:08:36+08:00 google.golang.org/grpc.(*Server).handleStream
2022-12-16T15:08:36+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1619
2022-12-16T15:08:36+08:00 google.golang.org/grpc.(*Server).serveStreams.func1.2
2022-12-16T15:08:36+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:921
2022-12-16T15:08:38+08:00 1.6711745180047865e+09	ERROR	MLServer Adapter.MLServer Adapter Server	Failed to unload model from MLServer	{"model_id": "de-id-66__isvc-60ac5fc3c4", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:8001: connect: connection refused\""}
2022-12-16T15:08:38+08:00 github.com/kserve/modelmesh-runtime-adapter/internal/proto/mmesh._ModelRuntime_UnloadModel_Handler
2022-12-16T15:08:38+08:00 	/opt/app/internal/proto/mmesh/model-runtime_grpc.pb.go:199
2022-12-16T15:08:38+08:00 google.golang.org/grpc.(*Server).processUnaryRPC
2022-12-16T15:08:38+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1282
2022-12-16T15:08:38+08:00 google.golang.org/grpc.(*Server).handleStream
2022-12-16T15:08:38+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1619
2022-12-16T15:08:38+08:00 google.golang.org/grpc.(*Server).serveStreams.func1.2
2022-12-16T15:08:38+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:921
2022-12-16T15:08:39+08:00 1.6711745190083663e+09	ERROR	MLServer Adapter.MLServer Adapter Server	Failed to unload model from MLServer	{"model_id": "de-id-66__isvc-60ac5fc3c4", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:8001: connect: connection refused\""}
2022-12-16T15:08:39+08:00 github.com/kserve/modelmesh-runtime-adapter/internal/proto/mmesh._ModelRuntime_UnloadModel_Handler
2022-12-16T15:08:39+08:00 	/opt/app/internal/proto/mmesh/model-runtime_grpc.pb.go:199
2022-12-16T15:08:39+08:00 google.golang.org/grpc.(*Server).processUnaryRPC
2022-12-16T15:08:39+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1282
2022-12-16T15:08:39+08:00 google.golang.org/grpc.(*Server).handleStream
2022-12-16T15:08:39+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:1619
2022-12-16T15:08:39+08:00 google.golang.org/grpc.(*Server).serveStreams.func1.2
2022-12-16T15:08:39+08:00 	/root/go/pkg/mod/google.golang.org/[email protected]/server.go:921
2022-12-16T15:08:49+08:00 1.6711745290097501e+09	INFO	MLServer Adapter.MLServer Adapter Server	Unload request for model not found in MLServer	{"error": "rpc error: code = NotFound desc = Model de-id-66__isvc-60ac5fc3c4 not found", "model_id": "de-id-66__isvc-60ac5fc3c4"}

mm container logs

2022-12-16T15:05:05+08:00 {"instant":{"epochSecond":1671174305,"nanoOfSecond":669210223},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Starting load for model de-id-55__isvc-60ac5fc3c4 type=rt:dep-host-11","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:05+08:00 {"instant":{"epochSecond":1671174305,"nanoOfSecond":674346508},"thread":"mm-task-thread-9","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=1671169768301 (76 minutes ago), count=3, capacity=575816, used=273087 (47%), loc=10.250.94.161, zone=<none>, labels=[mt:custom, mt:custom:1, rt:dep-host-11], startTime=1671173325261 (16 minutes ago), vers=0, loadThreads=1, loadInProg=1, reqsPerMin=0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":48,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":770227346},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Load of model de-id-55__isvc-60ac5fc3c4 type=rt:dep-host-11 completed in 33100ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":770733818},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Model de-id-55__isvc-60ac5fc3c4 loading completed successfully","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":771008980},"thread":"model-load-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Model de-id-55__isvc-60ac5fc3c4 size = 136482 units, ~1066MiB","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:05:38+08:00 {"instant":{"epochSecond":1671174338,"nanoOfSecond":778074687},"thread":"mm-task-thread-9","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=1671169768301 (76 minutes ago), count=3, capacity=575816, used=409446 (71%), loc=10.250.94.161, zone=<none>, labels=[mt:custom, mt:custom:1, rt:dep-host-11], startTime=1671173325261 (17 minutes ago), vers=0, loadThreads=1, loadInProg=0, reqsPerMin=0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":48,"threadPriority":5}
2022-12-16T15:06:15+08:00 {"instant":{"epochSecond":1671174375,"nanoOfSecond":3748926},"thread":"invoke-ex-de-id-55__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Attempting to add second copy of model de-id-55__isvc-60ac5fc3c4 in another instance since it is in use","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":57,"threadPriority":5}
2022-12-16T15:06:51+08:00 {"instant":{"epochSecond":1671174411,"nanoOfSecond":773707170},"thread":"janitor-task","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Janitor cache pruning task took 0ms for 3 entries","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":47,"threadPriority":5}
2022-12-16T15:06:51+08:00 {"instant":{"epochSecond":1671174411,"nanoOfSecond":774445128},"thread":"janitor-task","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Janitor registry pruning task took 0ms for 3/8 entries","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":47,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":899406533},"thread":"mmesh-req-thread-3","level":"INFO","loggerName":"com.ibm.watson.modelmesh.VModelManager","message":"Added new VModel de-id-66 pointing to model de-id-66__isvc-60ac5fc3c4","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":56,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":905774032},"thread":"invoke-ex-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"About to enqueue load for model de-id-66__isvc-60ac5fc3c4 with initial weight 123 units (~1MiB), with priority 1671170880884","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":56,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":906271861},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Starting load for model de-id-66__isvc-60ac5fc3c4 type=rt:dep-host-11","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:00+08:00 {"instant":{"epochSecond":1671174480,"nanoOfSecond":911276807},"thread":"mm-task-thread-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Published new instance record: InstanceRecord [lruTime=1671169768301 (79 minutes ago), count=4, capacity=575816, used=409569 (71%), loc=10.250.94.161, zone=<none>, labels=[mt:custom, mt:custom:1, rt:dep-host-11], startTime=1671173325261 (19 minutes ago), vers=0, loadThreads=1, loadInProg=1, reqsPerMin=0]","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":46,"threadPriority":5}
2022-12-16T15:08:36+08:00 {"instant":{"epochSecond":1671174516,"nanoOfSecond":985824480},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"ERROR","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Load failed for model de-id-66__isvc-60ac5fc3c4 type=rt:dep-host-11 after 36078ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:37+08:00 {"instant":{"epochSecond":1671174516,"nanoOfSecond":986546840},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"ERROR","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Model load failed for de-id-66__isvc-60ac5fc3c4 type=rt:dep-host-11","thrown":{"commonElementCount":0,"localizedMessage":"UNAVAILABLE: Failed to load Model due to MLServer runtime error: rpc error: code = Unavailable desc = error reading from server: EOF","message":"UNAVAILABLE: Failed to load Model due to MLServer runtime error: rpc error: code = Unavailable desc = error reading from server: EOF","name":"io.grpc.StatusRuntimeException","extendedStackTrace":[{"class":"io.grpc.Status","method":"asRuntimeException","file":"Status.java","line":535,"exact":false,"location":"grpc-api-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.stub.ClientCalls$UnaryStreamToFuture","method":"onClose","file":"ClientCalls.java","line":542,"exact":false,"location":"grpc-stub-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl","method":"closeObserver","file":"ClientCallImpl.java","line":562,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl","method":"access$300","file":"ClientCallImpl.java","line":70,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed","method":"runInternal","file":"ClientCallImpl.java","line":743,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed","method":"runInContext","file":"ClientCallImpl.java","line":722,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.ContextRunnable","method":"run","file":"ContextRunnable.java","line":37,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"io.grpc.internal.SerializingExecutor","method":"run","file":"SerializingExecutor.java","line":133,"exact":false,"location":"grpc-core-1.46.0.jar","version":"1.46.0"},{"class":"java.util.concurrent.ThreadPoolExecutor","method":"runWorker","file":"ThreadPoolExecutor.java","line":1136,"exact":false,"location":"?","version":"?"},{"class":"java.util.concurrent.ThreadPoolExecutor$Worker","method":"run","file":"ThreadPoolExecutor.java","line":635,"exact":false,"location":"?","version":"?"},{"class":"java.lang.Thread","method":"run","file":"Thread.java","line":833,"exact":false,"location":"?","version":"?"}]},"endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:37+08:00 {"instant":{"epochSecond":1671174517,"nanoOfSecond":2424372},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Delaying unload of model de-id-66__isvc-60ac5fc3c4 for at least 1000ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:37+08:00 {"instant":{"epochSecond":1671174517,"nanoOfSecond":11694103},"thread":"model-load-de-id-66__isvc-60ac5fc3c4","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"ModelRecord updated for de-id-66__isvc-60ac5fc3c4: locations={}, failures={458b8f-m2cgf=1671174516986}","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":51,"threadPriority":5}
2022-12-16T15:08:38+08:00 {"instant":{"epochSecond":1671174518,"nanoOfSecond":3067017},"thread":"mm-task-thread-3","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Initiating unload of model de-id-66__isvc-60ac5fc3c4","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":36,"threadPriority":5}
2022-12-16T15:08:38+08:00 {"instant":{"epochSecond":1671174518,"nanoOfSecond":6060163},"thread":"grpc-default-executor-7","level":"WARN","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Unload of model de-id-66__isvc-60ac5fc3c4 failed, queueing 90 more time(s) for retry: UNAVAILABLE: Failed to unload model from MLServer","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}
2022-12-16T15:08:49+08:00 {"instant":{"epochSecond":1671174529,"nanoOfSecond":370139037},"thread":"grpc-default-executor-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Model de-id-66__isvc-60ac5fc3c4 unloaded successfully following retry","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}
2022-12-16T15:08:49+08:00 {"instant":{"epochSecond":1671174529,"nanoOfSecond":370627156},"thread":"grpc-default-executor-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.SidecarModelMesh","message":"Model de-id-66__isvc-60ac5fc3c4 unloaded successfully following retry","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}
2022-12-16T15:08:49+08:00 {"instant":{"epochSecond":1671174529,"nanoOfSecond":370910525},"thread":"grpc-default-executor-7","level":"INFO","loggerName":"com.ibm.watson.modelmesh.ModelMesh","message":"Unload of de-id-66__isvc-60ac5fc3c4 completed in 11367ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.log4j.spi.AbstractLogger","contextMap":{},"threadId":81,"threadPriority":5}

Version 0.9.0

inference with timeouts leads to Internal error

ServingRuntime: torchserve

Current behavior

  • we send requests with client timeouts set (load-testing our ModelMesh)
  • after some time, the client starts to receive:
ERROR:
  Code: Internal
  Message: org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions: INTERNAL: Model "test-search-vectorizer__isvc-409d074e44" has no worker to serve inference request. Please use scale workers API to add workers. 
  • if we stop the load and send a single request without a timeout, we still get the same error, or the request hangs

In my opinion, some resources are not released when a client timeout occurs, which leads to the workers becoming exhausted; if we turn off client timeouts, there are no errors.

Expected behavior

  • requests with client timeouts should not lead to exhausted workers or to a single request to ModelMesh hanging

Log Info

  • message
org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions: INTERNAL: Model "test-search-vectorizer__isvc-409d074e44" has no worker to serve inference request. Please use scale workers API to add workers.
InternalServerException.()
  • trace
ApplierException(message:org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions: INTERNAL: Model "test-search-vectorizer__isvc-409d074e44" has no worker to serve inference request. Please use scale workers API to add workers.
InternalServerException.(), causeStacktrace:io.grpc.StatusException: INTERNAL: Model "test-search-vectorizer__isvc-409d074e44" has no worker to serve inference request. Please use scale workers API to add workers.
InternalServerException.()
	at com.ibm.watson.modelmesh.SidecarModelMesh$ExternalModel$1.onClose(SidecarModelMesh.java:450)
, grpcStatusCode:INTERNAL)
	at com.ibm.watson.modelmesh.thrift.ModelMeshService$applyModelMulti_result$applyModelMulti_resultStandardScheme.read(ModelMeshService.java:1940)
	at com.ibm.watson.modelmesh.thrift.ModelMeshService$applyModelMulti_result$applyModelMulti_resultStandardScheme.read(ModelMeshService.java:1905)
	at com.ibm.watson.modelmesh.thrift.ModelMeshService$applyModelMulti_result.read(ModelMeshService.java:1815)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:93)
	at com.ibm.watson.modelmesh.thrift.ModelMeshService$Client.recv_applyModelMulti(ModelMeshService.java:74)
	at com.ibm.watson.modelmesh.thrift.ModelMeshService$Client.applyModelMulti(ModelMeshService.java:59)
	at jdk.internal.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.ibm.watson.litelinks.client.ClientInvocationHandler.invoke(ClientInvocationHandler.java:496)
	at com.ibm.watson.litelinks.client.ClientInvocationHandler.invokeWithRetries(ClientInvocationHandler.java:383)
	at com.ibm.watson.litelinks.client.ClientInvocationHandler.doInvoke(ClientInvocationHandler.java:184)
	at com.ibm.watson.litelinks.client.ClientInvocationHandler.invoke(ClientInvocationHandler.java:118)
	at jdk.proxy2/jdk.proxy2.$Proxy30.applyModelMulti(Unknown Source)
	at jdk.internal.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at com.ibm.watson.modelmesh.SidecarModelMesh.invokeRemoteModel(SidecarModelMesh.java:1071)
	at com.ibm.watson.modelmesh.ModelMesh.invokeRemote(ModelMesh.java:4399)
	at com.ibm.watson.modelmesh.ModelMesh.invokeModel(ModelMesh.java:3644)
	at com.ibm.watson.modelmesh.SidecarModelMesh.callModel(SidecarModelMesh.java:1106)
	at com.ibm.watson.modelmesh.ModelMeshApi.callModel(ModelMeshApi.java:457)
	at com.ibm.watson.modelmesh.ModelMeshApi$4.onHalfClose(ModelMeshApi.java:733)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:355)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:867)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:833)
	at com.ibm.watson.litelinks.server.ServerRequestThread.run(ServerRequestThread.java:47)

AsyncPayloadProcessor owned payloads are released too early

Issue:

AsyncPayloadProcessor sometimes sends null response payloads, despite the original infer caller getting a valid output back to the calling process.

Triggers

If the payloads are processed instantly by the AsyncPayloadProcessor, the majority (but not all) of responses arrive intact. If the AsyncPayloadProcessor is slow, the majority of responses arrive with 'null' data.

Investigation

It looks like the response ByteBufs are read too early somewhere, causing their readerIndex to equal their capacity by the time they are processed. This read occurs somewhere after their addition to the queue in AsyncPayloadProcessor but before payloads.take() is called. A hacky patch is to call byteBuf = byteBuf.readerIndex(0); as the first line of RemotePayloadProcessor.encodeBinaryToString() to reset the reader index, and this does prevent the issue from arising.
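
Below is a minimal, self-contained sketch of that workaround, assuming a Netty ByteBuf whose reader index may already have been advanced by an earlier consumer. It is only an illustration of why rewinding the reader index recovers the payload bytes, not the actual RemotePayloadProcessor code.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ReaderIndexResetSketch {
    // Mirrors the shape of the hacky patch described above: rewind before encoding.
    static String encodeBinaryToString(ByteBuf byteBuf) {
        byteBuf = byteBuf.readerIndex(0); // reset so previously-read bytes are visible again
        byte[] bytes = new byte[byteBuf.readableBytes()];
        byteBuf.readBytes(bytes);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        ByteBuf buf = Unpooled.copiedBuffer("example payload", StandardCharsets.UTF_8);
        buf.readBytes(new byte[buf.readableBytes()]); // simulate an earlier read draining the buffer
        System.out.println(encodeBinaryToString(buf)); // still recovers the full payload
    }
}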

Is it possible to log the actual payload data instead of just the number of bytes?

The log entries emitted by LoggingPayloadProcessor only log the number of bytes of payload data. This is not useful if the request data is JSON.

Example:

The value of the data field is data=86B.

{"instant":{"epochSecond":1694276873,"nanoOfSecond":586464093},"thread":"pool-4-thread-1","level":"INFO","loggerName":"com.ibm.watson.modelmesh.payload.LoggingPayloadProcessor","message":"Payload: Payload{id='69-1', modelId='with-logging', method='inference.GRPCInferenceService/ModelInfer', status=request, metadata=Metadata(content-type=application/grpc,user-agent=grpc-go/1.51.0,grpcgateway-user-agent=python-requests/2.31.0,authorization=Basic xxxx,grpcgateway-authorization=Basic xxxx,grpcgateway-accept=/,grpcgateway-content-type=application/json,x-forwarded-host=with-logging-wines.apps.fmflask2.faisallabs.net,x-forwarded-for=3.105.194.26, 10.131.2.48, ::1), data=63B}","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":45,"threadPriority":5}
{"instant":{"epochSecond":1694276873,"nanoOfSecond":586949553},"thread":"pool-4-thread-1","level":"INFO","loggerName":"com.ibm.watson.modelmesh.payload.LoggingPayloadProcessor","message":"Payload: Payload{id='69-1', modelId='with-logging', method='inference.GRPCInferenceService/ModelInfer', status=Status{code=OK, description=null, cause=null}, metadata=Metadata(grpc-encoding=identity,grpc-accept-encoding=gzip), data=86B}","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":45,"threadPriority":5}
Wouldn't it be more useful to log the actual payload data instead?

At line 108 of Payload.java, I think it would be more useful to use ByteBuf.toString(charset) instead of .readAllBytes(). Or at least provide an option to configure the payload serialization/deserialization used for payload logging.
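
As an illustration only (this is not the actual Payload code, and the logContent flag is hypothetical), decoding the body with ByteBuf.toString(charset) leaves the buffer's reader index untouched and makes JSON payloads readable in the log line:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.charset.StandardCharsets;

public class PayloadLogSketch {
    static String describeData(ByteBuf data, boolean logContent) {
        if (data == null) {
            return "data=<none>";
        }
        return logContent
                ? "data=" + data.toString(StandardCharsets.UTF_8) // decodes without consuming the buffer
                : "data=" + data.readableBytes() + "B";            // current behavior: size only
    }

    public static void main(String[] args) {
        ByteBuf body = Unpooled.copiedBuffer("{\"inputs\":[1,2,3]}", StandardCharsets.UTF_8);
        System.out.println(describeData(body, false)); // data=18B
        System.out.println(describeData(body, true));  // data={"inputs":[1,2,3]}
    }
}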

Are there any plans to support streaming of prediction responses?

I'm currently trying to set up streaming responses for LLM generation from vLLM, but I receive a Streaming not yet supported error from modelmesh. I think this is coming from this code snippet:

String msg = "Streaming not yet supported";

It looks like implementing streaming is a non-trivial task in the SidecarModelMesh class. Are there any plans to implement streaming support, or are there any blockers for this?

Log reason for unready model mesh container

I'm currently debugging constantly-unready ModelMesh containers, which do not directly indicate the problem in their logs. Only by reading the code did I discover abortStartup, which fails the readiness probe without logging the reason:

if (abortStartup) {
    return false;
}

ModelMesh.abortStartup indicates an unrecoverable failure, which can only be resolved by restarting the ModelMesh container.
If the readiness probe fails because of abortStartup, that should be logged to allow debugging the root cause of the issue.

Alternatively, the liveness probe of the container should fail, to indicate an unrecoverable application failure.
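
A minimal sketch of what such logging could look like (hypothetical class and field names, assuming SLF4J; this is not the actual ModelMesh code):

import java.util.concurrent.atomic.AtomicBoolean;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ReadinessCheckSketch {
    private static final Logger logger = LoggerFactory.getLogger(ReadinessCheckSketch.class);

    private volatile boolean abortStartup; // set when startup hits an unrecoverable error
    private final AtomicBoolean abortLogged = new AtomicBoolean();

    public boolean isReady() {
        if (abortStartup) {
            // Log the reason exactly once so operators can see why the probe keeps failing
            if (abortLogged.compareAndSet(false, true)) {
                logger.error("Readiness probe failing permanently: startup was aborted"
                        + " due to an unrecoverable error; the container must be restarted");
            }
            return false;
        }
        return true;
    }
}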

missing com.ibm.watson.modelmesh.api package?

Hi guys,
I cloned the code and loaded the project in my IDE, and I see some compile errors; it seems the api package is missing.
Where can I get these classes? Thanks.

import com.ibm.watson.modelmesh.api.ModelInfo;
import com.ibm.watson.modelmesh.api.ModelStatusInfo;
import com.ibm.watson.modelmesh.api.ModelStatusInfo.ModelStatus;
import com.ibm.watson.modelmesh.api.SetVModelRequest;
import com.ibm.watson.modelmesh.api.VModelStatusInfo;
import com.ibm.watson.modelmesh.api.VModelStatusInfo.VModelStatus;

How to run modelmesh without k8s

Can you please describe the steps for running ModelMesh locally with the runtime adapter, etcd, and a serving runtime?
This is needed for local debugging and for clarifying some of the internal logic.

Option to configure Inference request timeout

I'm looking for an option to configure request timeouts for inference requests.
Either a global or a per-request timeout would be nice.
Currently we are experiencing many "stuck" inference requests which block ModelMesh. To resolve this we currently need to restart ModelMesh when it happens.
We don't yet know the root cause of this behaviour, but setting a timeout would mitigate the impact of requests getting stuck.
We have already set timeouts on all upstream components, but the upstream timeout/cancellation of a request is not propagated down to ModelMesh, so the inference requests are not canceled and keep consuming resources somewhere in the mesh.
It would also be nice to have distributed tracing in ModelMesh to see where in the mesh the requests get stuck.

We are currently using v0.11.2 with triton runtime.
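
For reference, a client-side deadline can be applied to every outgoing call with a generic grpc-java interceptor like the sketch below (this is not a ModelMesh feature; the class name and timeout value are illustrative). It makes the client give up after a fixed time, but as described above the cancellation is not propagated into ModelMesh, so server-side resources can still be held.

import java.util.concurrent.TimeUnit;
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.MethodDescriptor;

public class DefaultDeadlineInterceptor implements ClientInterceptor {
    private final long timeoutMillis;

    public DefaultDeadlineInterceptor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
            MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        // Only add a default deadline if the caller has not already set one explicitly
        CallOptions options = callOptions.getDeadline() == null
                ? callOptions.withDeadlineAfter(timeoutMillis, TimeUnit.MILLISECONDS)
                : callOptions;
        return next.newCall(method, options);
    }
}

It can be attached when building the client channel, e.g. via ManagedChannelBuilder.intercept(new DefaultDeadlineInterceptor(2000)).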

Support individual model metrics

Model-mesh currently exposes a comprehensive set of prometheus metrics, but those associated with a particular model are not currently labeled with the model id. This was an intentional decision since in the use cases model-mesh was designed for, there's a prohibitively large number of models managed, and many of them change frequently. Prometheus guidelines assert that the cardinality of label value permutations across all metrics should be constrained.

However, there are many use cases where the number of models managed is smaller, and for those it can be very useful to monitor metrics at the model level.

We should support this as a configurable option, either globally or per-model.

Some thoughts:

  • This would not apply to all published metrics. Which of them would be of questionable use at the model level is TBD; candidates include req_queue_delay_milliseconds and age_at_eviction_milliseconds.
  • We could consider less granular labels such as the model type as an (additional) intermediate option. It might be more complicated however to determine the model type in the relevant metric publishing contexts. Also, the way that the type field is currently used within modelmesh-serving would limit the utility.
  • We need to think about what to do in relation to VModels - in many cases the VModel rather than Model would be of more interest. Probably configuration can take a form similar to labels=model,vmodel,type, so that an arbitrary combo of supported labels can be chosen.
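
To make the last point concrete, here is a rough sketch of how a labels=model,vmodel,type style setting could be parsed; the parameter name and enum are purely illustrative and nothing like this exists in model-mesh today.

import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

public class MetricLabelConfigSketch {
    enum MetricLabel { MODEL, VMODEL, TYPE }

    static Set<MetricLabel> parse(String value) {
        Set<MetricLabel> labels = EnumSet.noneOf(MetricLabel.class);
        if (value == null || value.isBlank()) {
            return labels; // default: no per-model labels, i.e. current behavior
        }
        for (String part : value.split(",")) {
            labels.add(MetricLabel.valueOf(part.trim().toUpperCase(Locale.ROOT)));
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(parse("model,vmodel")); // [MODEL, VMODEL]
    }
}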

Add unit test for Payload Processor

It would also be useful to have a unit test for this, but the tests included here don't exercise the actual bug.

Ideally we'd have a test that actually runs a local modelmesh with an asyncprocessor configured and can trigger the problem (other unit tests already run modelmesh to test other stuff and could hopefully be used as a starting point and adapted).

Related issues:

Originally posted by @njhill in #112 (comment)

ModelMesh model-loading decision process

Hi,

We are currently evaluating ModelMesh's capabilities to see whether it can serve many (similar) models, around 100k, each weighing about 500MB on disk.
We are running it on a K8s cluster with GPU pods (g4dn.xlarge) and using the model serving controller to orchestrate the model registration and servingRuntime creation.

The main issue that we are currently experiencing is that the runtime machines are constantly being killed when they require more and more memory (classic OOMKilled) while we run predictions that require loading of different models.
How can we troubleshoot that? We do see that the triton machines are requesting more and more memory without limit, but we could not find the piece of code that is supposed to manage that, i.e., decide which model to load on which instance, and which model to unload in order to free memory for new models.

So basically my questions are:
(1) Where is the code that makes the loading/unloading decisions: the runtime-adapter, the mm container, or somewhere else? Which class/function?
(2) How does it work? I understand that the model size is inferred via some heuristics that try to predict the size in advance, but I could not find anything decisive in the docs.

This is our servingRuntime configuration:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
  containers:
    - args:
        - -c
        - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
      "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
      "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
      "--allow-sagemaker=false" "--log-verbose=1" '
      command:
        - /bin/sh
      image: nvcr.io/nvidia/tritonserver:22.09-py3
      livenessProbe:
        exec:
          command:
            - curl
            - --fail
            - --silent
            - --show-error
            - --max-time
            - "9"
            - http://localhost:8000/v2/health/live
        initialDelaySeconds: 5
        periodSeconds: 30
        timeoutSeconds: 10
      name: triton
      resources:
        limits:
          cpu: 2000m
          memory: 4Gi
        requests:
          cpu: 2000m
          memory: 4Gi
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  multiModel: true
  protocolVersions:
    - grpc-v2
  supportedModelFormats:
    - name: triton
      version: "2"

Thanks!

rewrite modelmesh from java to golang

thank you for this project!

Do you have plans to rewrite modelmesh from Java to Golang in the future? That seems more consistent with the general logic of modelmesh-serving, and it seems to me it would make contributing across the whole project easier.

best regards.

how about using modelmesh to serve thousands of stable diffusion models

I want to use modelmesh to serve thousands of Stable Diffusion models. Any advice would be appreciated~

  1. I'm using Triton as the serving runtime. Inference time is about 3~10s.
  2. I'm using Triton ensembles to implement business logic such as auditing and watermarking; these could become standalone services in the future.
  3. Currently every model has its own k8s service and ingress rules.

Goals:

  1. achieve higher cluster resource utilization, especially GPU
  2. keep the latency of every inference request as low as possible

Is it possible to use a different container for each individual model?

I'm trying to understand the project as it seems so promising. I'd like to ask for some clarification.

Let me start with the main idea:

A 'model' is rarely just the model's weights, contrary to what we see in many tutorials. In practice, the model is the whole pipeline from the API request, through data imputation and feature extraction, to the classifier/regressor prediction. It is often inconvenient to dump such a model using joblib/ONNX formats in the Python world.

And the question is:
Is it possible to use a different container for each individual model?

The idea of the puller and the charts suggest to me that it's not possible: the only difference between the models is the downloaded artifacts (model weights by default). This would be too big a restriction, both in terms of versioning and of individual models'/clients' requirements. New models are trained with different underlying package versions etc., which in practice requires a different container. Therefore, each time a new model is trained on some package version and added, all the rest of the models would have to be retrained on the new environment version to assure quality.

On the other hand

modelmesh-runtime-adapter - the containers which run in each model serving pod and act as an intermediary between ModelMesh and third-party model-server containers.

suggests that the third-party model-server containers are what I am looking for, if a different version can be used for each model.

Model per instance model-mesh by default

Currently a model is loaded on only one instance, and is lazily loaded onto other pods when a request arrives.

Can we modify internal modelmesh parameters so that a model is loaded on all ServingRuntime instances by default?

Fundamental understanding

Hi @njhill, I'm trying to understand the basic idea behind model mesh. Based on the document https://kserve.github.io/website/0.7/modelserving/mms/modelmesh/overview/#overview, it looks like model A is only loaded in pod 1 of the Runtime X Deployment. I assume it may fail to load in pod 2 for some reason, for example a memory shortage. Then, when an inference request comes in and happens to be forwarded to pod 2, what happens? Would the mesh component know that the model is loaded in pod 1 and forward the request there?

Thank you very much!

fq_names boolean metrics sub-parameter is inverted

fq_names is a boolean sub-parameter that can be used to alter the metrics exposed by model-mesh. Unfortunately however, its value is inverted, so setting it to "true" actually disables fully qualified method names (which is already the default).

Exclude zookeeper dependency from the project

Zookeeper is a transitive dependency of ModelMesh (see litelinks-core and kv-utils), but it is not used when ModelMesh is deployed on K8s via the modelmesh-serving controller, so it should be possible to remove it.
The main benefit of this is to reduce the dependency tree and thus the attack surface of the project when used in K8s.

Using ModelMesh "core" as a standalone component with Zookeeper is not really a documented/supported use case, but it should remain possible by adding Zookeeper back manually.

Include the resolved model Id in Payloads

A resolved modelId in model-mesh might correspond to different versions of an existing model. Requests sent to the same vModelId could hit different models over time and it would be important to know which version of the model certain Payloads correspond to.
This issue should be addressed on top of #84.
