Comments (13)

5had3z commented on August 22, 2024

Yes, I was just in the process of commenting that things seemed to be running, but as soon as I peeked behind the curtain at the logs I saw that error. I'm giving up on 1.22: ElasticOperator, Volcano and PytorchOperator all use deprecated, now-removed APIs. I'm going into uni today to reset all the workstations and install a fresh 1.21.3, and hopefully get at least one of them running so I can get back to actual research.

from torchx.

d4l3k commented on August 22, 2024

I think the path forward here is:

  1. make torchx kubernetes scheduler robust to missing status and show "UNKNOWN" status for the job
  2. file an issue on Volcano for 1.22 compatibility
  3. update torchx documentation to show compatible versions
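A minimal sketch of step 1, assuming the job object returned by the Kubernetes API is a plain dict and that Volcano reports its phase under `status.state.phase` when it is present. The `AppState` enum, the `PHASE_TO_STATE` mapping and the `state_from_response` helper are hypothetical stand-ins for illustration, not the actual torchx code:

```python
from enum import Enum
from typing import Any, Dict


class AppState(Enum):
    """Hypothetical stand-in for torchx's AppState; only the members used here."""

    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


# Assumed mapping from Volcano job phases to app states (illustrative only).
PHASE_TO_STATE = {
    "Pending": AppState.PENDING,
    "Running": AppState.RUNNING,
    "Completed": AppState.SUCCEEDED,
    "Failed": AppState.FAILED,
}


def state_from_response(resp: Dict[str, Any]) -> AppState:
    """Return the app state for a job object, tolerating a missing status."""
    # .get() avoids the KeyError raised when "status" was never populated.
    status = resp.get("status") or {}
    phase = (status.get("state") or {}).get("phase")
    # Anything unrecognised, including a missing phase, becomes UNKNOWN.
    return PHASE_TO_STATE.get(phase, AppState.UNKNOWN)
```

With this shape, a response that contains only `apiVersion`, `kind`, `metadata` and `spec` is reported as `UNKNOWN` instead of raising.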

d4l3k commented on August 22, 2024

Filed volcano-sh/volcano#1665

tiagovrtr commented on August 22, 2024

@d4l3k is this no longer an issue?

5had3z commented on August 22, 2024

I inserted the following lines at torchx/schedulers/kubernetes_scheduler.py:ln343, just before status = resp['status']:

for key_, obj_ in resp.items():
    print(f"{key_}: {obj_}")

Output is below:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: {'creationTimestamp': '2021-08-07T06:56:36Z', 'generateName': 'echo-', 'generation': 1, 'managedFields': [{'apiVersion': 'batch.volcano.sh/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:generateName': {}}, 'f:spec': {'.': {}, 'f:maxRetry': {}, 'f:plugins': {'.': {}, 'f:env': {}, 'f:svc': {}}, 'f:queue': {}, 'f:schedulerName': {}, 'f:tasks': {}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2021-08-07T06:56:36Z'}], 'name': 'echo-5rksp', 'namespace': 'default', 'resourceVersion': '475515', 'uid': 'b178cdba-7773-4e8d-8893-8727c628b1a3'}
spec: {'maxRetry': 0, 'plugins': {'env': [], 'svc': []}, 'queue': 'test', 'schedulerName': 'volcano', 'tasks': [{'maxRetry': 0, 'name': 'echo-0', 'policies': [{'action': 'RestartJob', 'event': 'PodEvicted'}, {'action': 'RestartJob', 'event': 'PodFailed'}], 'replicas': 1, 'template': {'spec': {'containers': [{'command': ['/bin/echo', 'hello'], 'env': [], 'image': '/tmp', 'name': 'echo-0', 'ports': [], 'resources': {'limits': {}, 'requests': {}}}], 'restartPolicy': 'Never'}}}]}

So the call at ln335 is just returning the Kubernetes Job configuration rather than its status?

d4l3k commented on August 22, 2024

Hi, thanks for reporting this! I'll take a look and see if I can reproduce it.

d4l3k commented on August 22, 2024

I ran into this issue when I created a cluster with no workers, but once I created the workers it seemed to work fine. We've tested this on 1.18 and 1.21. I don't have access to a 1.22 cluster, but I'll try to spin up a local one on my laptop.

Can you install vcctl and send me the output from:

$ vcctl queue list
$ vcctl job list
$ kubectl get job.batch.volcano.sh/<jobid> -o yaml

This worked for me:

$ eksctl create cluster \
  --name torchx-dev-1-21 \
  --version 1.21 \
  --with-oidc \
  --without-nodegroup
            
$ eksctl create nodegroup \
  --cluster torchx-dev-1-21 \
  --name torchx-dev-1-21-workers \
  --node-type t3.medium \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 4 \
  --ssh-access \
  --ssh-public-key <key>
  
$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.3.0/installer/volcano-development.yaml
                                 
$ vcctl queue create test
$ torchx run --scheduler kubernetes --scheduler_args queue=test utils.echo --image alpine:latest --msg hello
kubernetes://torchx_tristanr/default:echo-dhbfd
=== RUN RESULT ===
Launched app: kubernetes://torchx_tristanr/default:echo-dhbfd
AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles: []
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

Job URL: None
$ torchx log kubernetes://torchx_tristanr/default:echo-dhbfd/echo
echo/0 2021-08-09T20:57:01.479780331Z hello

d4l3k commented on August 22, 2024

Looks like there's a compatibility issue between Volcano v1.3.0 and Kubernetes v1.22: the scheduler still watches the v1beta1 PriorityClass API, which 1.22 no longer serves.

https://kubernetes.io/docs/reference/using-api/deprecation-guide/#priorityclass-v122

tristanr@tristanr-arch2 ~> kubectl logs --namespace volcano-system pods/volcano-scheduler-5665cdc4d9-cv5kx
W0809 21:56:49.894796       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
E0809 21:56:49.920722       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
I0809 21:56:49.931104       1 event_handlers.go:199] Added pod <volcano-system/volcano-admission-init--1-srdfl> into cache.
I0809 21:56:49.931138       1 event_handlers.go:199] Added pod <volcano-system/volcano-controllers-5bbcc9c49f-fq5fl> into cache.
I0809 21:56:49.931147       1 event_handlers.go:199] Added pod <volcano-system/volcano-scheduler-5665cdc4d9-cv5kx> into cache.
I0809 21:56:49.931158       1 event_handlers.go:199] Added pod <kube-system/kube-controller-manager-minikube> into cache.
I0809 21:56:49.931171       1 event_handlers.go:199] Added pod <kube-system/kube-apiserver-minikube> into cache.
I0809 21:56:49.931178       1 event_handlers.go:199] Added pod <kube-system/storage-provisioner> into cache.
I0809 21:56:49.931185       1 event_handlers.go:199] Added pod <kube-system/coredns-78fcd69978-hrs6m> into cache.
I0809 21:56:49.931191       1 event_handlers.go:199] Added pod <kube-system/etcd-minikube> into cache.
I0809 21:56:49.931199       1 event_handlers.go:199] Added pod <kube-system/kube-scheduler-minikube> into cache.
I0809 21:56:49.931212       1 event_handlers.go:199] Added pod <kube-system/kube-proxy-4kkmq> into cache.
I0809 21:56:49.931231       1 event_handlers.go:199] Added pod <volcano-system/volcano-admission-5bb77cd5b7-zxqf9> into cache.
E0809 21:56:51.279727       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:56:53.375096       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:56:58.251774       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:57:09.886813       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:57:25.430937       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:58:00.806485       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
E0809 21:58:40.374955       1 reflector.go:127] volcano.sh/volcano/pkg/scheduler/cache/cache.go:440: Failed to watch *v1beta1.PriorityClass: failed to list *v1beta1.PriorityClass: the server could not find the requested resource
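To spot this kind of incompatibility without reading the log by eye, the repeated reflector errors can be tallied by resource kind. This is a rough sketch that assumes the klog line format shown above; the `WATCH_FAILURE` pattern and the `failed_watches` helper are hypothetical names, not part of Volcano or torchx:

```python
import re
from collections import Counter
from typing import Iterable

# Matches klog error lines such as:
#   E0809 21:56:49.920722  1 reflector.go:127] ...: Failed to watch *v1beta1.PriorityClass: ...
WATCH_FAILURE = re.compile(r"^E\d{4} .*Failed to watch \*(?P<kind>[\w.]+):")


def failed_watches(lines: Iterable[str]) -> Counter:
    """Count 'Failed to watch' errors per API kind in scheduler log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = WATCH_FAILURE.match(line)
        if match:
            counts[match.group("kind")] += 1
    return counts
```

A kind that shows up repeatedly here, such as `v1beta1.PriorityClass`, is a hint that the cluster no longer serves an API version the scheduler depends on.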

5had3z commented on August 22, 2024

Unless there is a better way to check whether the job launched successfully, I think a missing status could be treated as an indirect indication that the job failed to start for some reason.

try:
    status = resp["status"]
except KeyError:
    raise RuntimeError("Failed to retrieve status, job possibly didn't start???")

d4l3k commented on August 22, 2024

The error is thrown in describe, which only describes the job, so it doesn't make much sense to raise it there. Possibly a warning instead, but that's a bit clunky.

We do have an UNKNOWN AppState that would be a good fit for this: https://github.com/pytorch/torchx/blob/master/torchx/specs/api.py#L316

If we do want to throw an error, it might be better to add a check in the CLI run command that fires on an unknown status.
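As a rough illustration of that last idea, the check could live on the run side and leave describe free of exceptions. Everything here is hypothetical: `AppState` is a cut-down stand-in for the torchx enum, and `UnknownJobStateError` and `ensure_known_state` are invented names, not existing torchx APIs:

```python
from enum import Enum


class AppState(Enum):
    """Hypothetical stand-in for torchx's AppState; only the members used here."""

    PENDING = "PENDING"
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"


class UnknownJobStateError(RuntimeError):
    """Hypothetical error raised when a freshly submitted job has no usable status."""


def ensure_known_state(app_handle: str, state: AppState) -> AppState:
    """Fail loudly after submission if the scheduler cannot report a state."""
    if state is AppState.UNKNOWN:
        raise UnknownJobStateError(
            f"{app_handle}: the scheduler returned no status; "
            "the job may not have started (check the scheduler logs)"
        )
    return state
```

Keeping the raise in the run path means describe can still return `UNKNOWN` for callers that only want to inspect a job.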

d4l3k commented on August 22, 2024

This is still an outstanding issue with Volcano v1.4.

d4l3k commented on August 22, 2024

This might be fixed in Volcano v1.5 beta, but I haven't tested it. The Volcano issue is still open: https://github.com/volcano-sh/volcano/releases/tag/v1.5.0-Beta

d4l3k commented on August 22, 2024

Yes, this is fixed on the Volcano side in the newer Volcano releases.
