
Comments (19)

aktech commented on August 17, 2024

But our billing console says it took about 12 minutes. So what's up with that? Is our billing console reporting time in an unexpected way, or is the machine running for longer than GitHub Actions knows?

GitHub only reports the time it took to run the job on it, nothing before or after.

There are the following times to consider:

  • If you're using NVIDIA images, they take a very long time to start (they have some internal provisioning; I don't know a lot about it)
  • There is some time to provision/initialize GitHub Actions by Cirun (less than a minute)
  • There is some delay (variable: 10 s to 2 min) for GitHub to start the job on it; it's a known GitHub issue: actions/runner#676

Using your own custom AMI (it's basically just spinning up an Ubuntu machine with a GPU, installing the NVIDIA drivers yourself, and creating an AMI from it) would reduce the spin-up time significantly. I can get on a call to help with this if required.

aktech commented on August 17, 2024

@ivirshup Another tip: To reduce cost you can use preemptible (spot) instances:

runners:
  - name: aws-gpu-runner
    cloud: aws
    instance_type: g4dn.xlarge
    machine_image: ami-067a4ba2816407ee9
    region: eu-north-1
    preemptible:  # tried in order: spot first, then on-demand as a fallback
      - true
      - false
    labels:
      - cirun-aws-gpu

Doc: https://docs.cirun.io/reference/fallback-runners#example-3-preemptiblenon-preemptible-instances

This would try to spin up a preemptible instance first; if that fails, it will spin up an on-demand instance.
Preemptible instances are up to 90% cheaper (about 50% on average). Current prices in a couple of regions (https://aws.amazon.com/ec2/spot/pricing/):

  • us-east-2
    Instance        Spot                On-Demand
    g4dn.xlarge     $0.1578 per hour    $0.3418 per hour
  • eu-north-1
    Instance        Spot                On-Demand
    g4dn.xlarge     $0.1674 per hour    $0.3514 per hour

ivirshup commented on August 17, 2024

New issue: caching. My understanding is that GitHub Actions caching stores data on GitHub's servers, which doesn't reduce data ingress when our job isn't running on GitHub's servers. Right now this job downloads about 1 GB of data per run.

We should try to enable caching through AWS.
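A minimal sketch of what that could look like, assuming the runner's instance profile can read/write an S3 bucket in the same region (the bucket name anndata-ci-cache is hypothetical, and pip's wheel cache is just one candidate for what to keep):

# inside the GPU job's steps: list
- name: Restore pip cache from S3
  run: aws s3 sync s3://anndata-ci-cache/pip ~/.cache/pip --quiet || true

# ... install dependencies and run the tests as usual ...

- name: Save pip cache to S3
  if: always()
  run: aws s3 sync ~/.cache/pip s3://anndata-ci-cache/pip --quiet

Data transfer between EC2 and S3 in the same region is free, and restoring from S3 should be much faster than re-downloading everything each run.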

ivirshup commented on August 17, 2024

So the billing was a little higher than expected: about $2 for the one PR.

Admittedly this PR had a lot of troubleshooting pushes – about 29 commits had checks start. However, a number of these checks could have been cancelled by follow-up pushes, which I'll add.
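Concretely, I'm thinking of GitHub Actions' built-in concurrency setting, roughly:

# at the top level of the GPU workflow (group name is arbitrary)
concurrency:
  group: gpu-ci-${{ github.ref }}
  cancel-in-progress: true  # a newer push cancels the in-progress run for the same branch/PR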

Right now the entire CI run (as reported by GitHub Actions) takes about 4 minutes:

[screenshot of the GitHub Actions run summary]

But our billing console says it took about 12 minutes. So what's up with that? Is our billing console reporting time in an unexpected way, or is the machine running for longer than GitHub Actions knows?

Any thoughts @Zethson @aktech?

Zethson commented on August 17, 2024

I wouldn't be surprised if fetching the image and connecting to GitHub Actions takes some time? But I guess @aktech knows this better...

flying-sheep commented on August 17, 2024

Here are the docs on how to set up custom images with Cirun: https://docs.cirun.io/custom-images/cloud-custom-images

aktech commented on August 17, 2024

You want the second section in that doc: https://docs.cirun.io/custom-images/cloud-custom-images#aws-building-custom-images-with-user-modification (the first one uses the NVIDIA image).

ivirshup commented on August 17, 2024

Thanks for the info @aktech! I've been able to get something running using that. I had been trying to create a Dockerfile for this from some AWS docs, but it turns out that generating Dockerfiles for GPU setups gets more complicated 😢.

Right now I am trying to see how long the instance was actually around for, but I'm not actually sure where I can see logs for this. I think our setup is a little obfuscated here, and the view I have doesn't seem to update quickly.

aktech commented on August 17, 2024

Thanks for the info @aktech! I've been able to get something running using that. I had been trying to create a Dockerfile for this from some AWS docs, but it turns out that generating Dockerfiles for GPU setups gets more complicated 😢.

Are you planning to run the tests inside a Docker container? You'd still need nvidia/cuda in the base VM image, I think.

Right now I am trying to see how long the instance was actually around for, but I'm not actually sure where I can see logs for this. I think our setup is a little obfuscated here, and the view I have doesn't seem to update quickly.

Currently I don't think we have those statistics in the UI anywhere, but I can consider adding them to the check run. Meanwhile, as long as the instance is still visible in the AWS dashboard (it usually remains visible for some time after termination), you can run the following commands to see how long it was alive:

# When the instance was launched
aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].LaunchTime' --region eu-north-1
# Why and when it left the running state (for terminated instances this includes a timestamp)
aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].StateTransitionReason' --region eu-north-1

ivirshup commented on August 17, 2024

I had been trying to create a Dockerfile for this from some AWS docs, but it turns out that generating Dockerfiles for GPU setups gets more complicated 😢.
Are you planning to run the tests inside a Docker container?

No, but this was just following the Amazon ECR instructions for "how to create an image".

I believe you need nvidia-container-toolkit (an extension to Docker) to do this kind of thing.

You'd still need nvidia/cuda in the base VM image I think.

I would like to have a more programmatic way to construct these containers, so I will look into this. Thanks!


I found out that I had managed to get logged into a scope with very little access, which is what was making it so difficult to see anything... Still no idea how I did that; I think maybe via the Rackspace site? But now I can look at CloudTrail and have set up AWS Config, so I think we can use that.

I think times are down now. It was taking about 12 min a run last Friday (according to Rackspace); now it's more like 4.5 min (according to AWS). GitHub still says it's more like 2, so there's room for improvement, but it's still better. Of course it will be good to compare measurements from the same place.

@aktech, do you have any suggestions for how we could do caching for our CI? A non-trivial amount of time is spent building wheels and downloading things, which I think we could get down. However, I don't think that GitHub Actions caching is going to help a ton here since it's on GitHub's servers.

aktech commented on August 17, 2024

I believe you need nvidia-container-toolkit (an extension to docker) to do this kind of thing.

Yes, correct.

I would like to have a more programmatic way to construct these containers, so I will look into this. Thanks!

Yep, makes sense. We would have built these for customers, but NVIDIA's license doesn't allow distribution. If we had the CI set up for automating this, you could have used that, but we currently don't have a public one; it's a WIP.

@aktech, do you have any suggestions for how we could do caching for our CI? A non-trivial amount of time is spent building wheels and downloading things, which I think we could get down. However, I don't think that GitHub Actions caching is going to help a ton here since it's on GitHub's servers.

I didn't see that, which workflow? This one seems to take less than 2.5 minutes: https://github.com/scverse/anndata/actions/runs/5716599171/job/15494250907?pr=1084

ivirshup commented on August 17, 2024

I didn't see that, which workflow?

It's not that it takes a long time, it's that it takes longer to set up than to run the tests, so I'd like to bring that down.

ivirshup commented on August 17, 2024

Triggering GPU CI

So after billing came in a little higher than expected (which may be fixed, but I need to confirm once billing updates), we decided not to run CI on every commit. We set the action to run on workflow_dispatch so it would be manually triggered, but it seems we can't use this for branch protection since a workflow_dispatch run doesn't count towards a passing check.

So, we need something else to trigger this. It seems our options are:

  • a label
    Implementation (I think):

on:
  pull_request:
    types:
      - labeled
      - edited
      - synchronize

jobs:
  test:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}

  • a comment
  • an approving review
  • the merge queue

Currently thinking a label makes the most sense, since we can easily enable and disable it, and it isn't necessarily linked with merging. It could be that either a label or auto-merge is enough.

flying-sheep commented on August 17, 2024

yup, as I thought. except for the merge queue, all of these of course mean that it’ll run for all commits after the label/comment/whatever is added.

one option would be to have the workflow remove the label again:

  1. for each commit, all tests except for GPU tests are run
  2. we add the “run GPU tests once” label
  3. the workflow job first removes the label again, then runs the GPU tests (rough sketch below)
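Rough sketch of step 3, assuming the job can edit PR labels and the gh CLI is available on the runner image (the run-gpu-ci label name is taken from the snippet above):

jobs:
  gpu-tests:
    runs-on: cirun-aws-gpu  # runner label from the Cirun config
    permissions:
      issues: write          # PR labels go through the issues API
      pull-requests: write
    steps:
      - name: Remove the trigger label again
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh pr edit "${{ github.event.pull_request.number }}" --repo "${{ github.repository }}" --remove-label "run-gpu-ci"
      # ... then the actual GPU test steps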

Intron7 commented on August 17, 2024

a comment

@ivirshup the RAPIDS team does this with a comment from a member; this triggers a CI run. But from what I can tell they use workflow_dispatch.

flying-sheep commented on August 17, 2024

they use workflow_dispatch

Let’s avoid this if possible. It might be possible to manually call the GitHub API to list all PRs for a branch and then create and update a check for the PR that is found, but I’d rather not go down that road when it looks like there’s a much simpler solution.

ivirshup commented on August 17, 2024

I think there is value in giving a PR the green light to use paid CI, and not needing to approve each individual commit.

The one-off case could be useful too, but I think triggering via a comment makes more sense in this case.

ivirshup commented on August 17, 2024

@Intron7, I think RAPIDS is using API calls from checks to trigger workflow_dispatch. But they also have a pretty involved CI system: https://github.com/rapidsai/cudf/blob/branch-23.10/.github/workflows/pr.yaml
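My guess at the shape of that (not necessarily how RAPIDS actually wires it up): a small dispatcher workflow that listens for a command comment from a maintainer and triggers the GPU workflow on the PR's branch via workflow_dispatch. The /run-gpu-ci command and the gpu-ci.yml filename below are made up:

on:
  issue_comment:
    types: [created]

jobs:
  dispatch-gpu-ci:
    # only react to "/run-gpu-ci" comments on PRs from people with write access
    if: >-
      github.event.issue.pull_request &&
      contains(github.event.comment.body, '/run-gpu-ci') &&
      contains(fromJSON('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.comment.author_association)
    runs-on: ubuntu-latest
    permissions:
      actions: write       # needed to dispatch another workflow
      pull-requests: read
    steps:
      - env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # find the PR's head branch, then trigger the GPU workflow on it
          branch=$(gh pr view "${{ github.event.issue.number }}" --repo "${{ github.repository }}" --json headRefName --jq .headRefName)
          gh workflow run gpu-ci.yml --repo "${{ github.repository }}" --ref "$branch"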

ivirshup commented on August 17, 2024

We've still got a little room for improvement on GPU CI, but I think it's pretty much set up!

Costs per run are now down to about 1 cent for anndata
