NGC Replicator

Clones nvcr.io using either DGX (compute.nvidia.com) or NGC (ngc.nvidia.com) API keys.

The replicator will make an offline clone of the NGC/DGX container registry. In its current form, the replicator will download every CUDA container image as well as each Deep Learning framework image in the NVIDIA project.

Tarfiles will be saved in /output inside the container, so be sure to volume mount that directory. In the following example, we will collect our images in /tmp on the host.

Use --min-version to limit the number of versions to download. In the example below, we will only clone DL framework images with version 17.12 or later.

docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/output \
    deepops/replicator --project=nvidia --min-version=17.12 \
                       --api-key=<your-dgx-or-ngc-api-key>
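The tag comparison behind --min-version can be sketched as follows. This is a minimal illustration, not the replicator's own code. It assumes that tags begin with a YY.MM release number, optionally followed by a suffix such as -py3, and that tags without that prefix (e.g. latest) are skipped. The helpers parse_version and filter_min_version are hypothetical names.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed tag format: "YY.MM", optionally followed by a suffix such as "-py3".
_TAG_RE = re.compile(r"^(\d{2})\.(\d{2})")


def parse_version(tag: str) -> Optional[Tuple[int, int]]:
    """Return (year, month) from a tag like '18.01-py3', or None if there is no YY.MM prefix."""
    match = _TAG_RE.match(tag)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def filter_min_version(tags: Iterable[str], min_version: str) -> List[str]:
    """Keep tags whose YY.MM prefix is >= min_version; tags without a prefix are dropped."""
    floor = parse_version(min_version)
    if floor is None:
        raise ValueError(f"invalid --min-version: {min_version!r}")
    kept = []
    for tag in tags:
        version = parse_version(tag)
        if version is not None and version >= floor:
            kept.append(tag)
    return kept


if __name__ == "__main__":
    tags = ["17.10", "17.11-py3", "17.12", "18.01-py3", "18.01-py2", "latest"]
    print(filter_min_version(tags, "17.12"))  # ['17.12', '18.01-py3', '18.01-py2']
```

Tuples compare element by element, so (18, 1) correctly sorts after (17, 12).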

You can also filter on specific images. To keep only images whose names contain the strings "tensorflow", "pytorch", or "tensorrt", add one --image flag per string, e.g.

docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/output \
    deepops/replicator --project=nvidia --min-version=17.12 \
                       --image=tensorflow --image=pytorch --image=tensorrt \
                       --dry-run \
                       --api-key=<your-dgx-or-ngc-api-key>

Note: the --dry-run option lets you see what will happen without committing to a lengthy download.

By default, the --image flag does a substring match in order to ensure you match all images that may be desired. Sometimes, however, you only want to download a specific image with no substring matching. In this case, you can add the --strict-name-match flag, e.g.

docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v /tmp:/output \
    deepops/replicator --project=nvidia --min-version=17.12 \
                       --image=tensorflow \
                       --strict-name-match \
                       --dry-run \
                       --api-key=<your-dgx-or-ngc-api-key>
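The difference between the default substring match and --strict-name-match can be illustrated with a short sketch. The helper match_images and the image names in the example are hypothetical; the replicator's actual matching code may differ.

```python
from typing import Iterable, List


def match_images(names: Iterable[str], patterns: List[str], strict: bool = False) -> List[str]:
    """Return the image names selected by one or more --image patterns.

    strict=False: keep a name if any pattern is a substring of it (the default).
    strict=True:  keep a name only if it equals one of the patterns
                  (the behaviour described for --strict-name-match).
    """
    selected = []
    for name in names:
        if strict:
            hit = name in patterns
        else:
            hit = any(pattern in name for pattern in patterns)
        if hit:
            selected.append(name)
    return selected


if __name__ == "__main__":
    # Illustrative names only, not a listing of the NVIDIA project.
    names = ["tensorflow", "tensorrt", "tensorrtserver", "pytorch", "mxnet"]
    print(match_images(names, ["tensorrt"]))               # ['tensorrt', 'tensorrtserver']
    print(match_images(names, ["tensorrt"], strict=True))  # ['tensorrt']
```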

Note: a state.yml file will be created in the output directory. This saved state is used to avoid re-pulling images that were previously pulled. To re-pull and save an image, delete the entry in state.yml corresponding to the image_name and tag you wish to refresh.
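The skip-if-already-pulled behaviour can be sketched with an in-memory mapping. The actual layout of state.yml is not described here, so this sketch assumes, hypothetically, a mapping from image name to the set of tags already pulled. plan_pulls, record_pull and forget are illustrative names. Calling forget corresponds to deleting an entry from state.yml by hand.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Assumed shape of the saved state: image name -> tags already pulled and saved.
State = Dict[str, Set[str]]


def plan_pulls(candidates: Iterable[Tuple[str, str]], state: State) -> List[Tuple[str, str]]:
    """Return the (image, tag) pairs that are not yet recorded in the state."""
    return [(image, tag) for image, tag in candidates if tag not in state.get(image, set())]


def record_pull(state: State, image: str, tag: str) -> None:
    """Mark (image, tag) as pulled so that later runs skip it."""
    state.setdefault(image, set()).add(tag)


def forget(state: State, image: str, tag: str) -> None:
    """Drop (image, tag) from the state so that the next run pulls it again."""
    state.get(image, set()).discard(tag)


if __name__ == "__main__":
    state: State = {"tensorflow": {"18.01-py3"}}
    candidates = [("tensorflow", "18.01-py3"), ("tensorflow", "18.02-py3")]
    print(plan_pulls(candidates, state))  # [('tensorflow', '18.02-py3')]
    forget(state, "tensorflow", "18.01-py3")
    print(plan_pulls(candidates, state))  # both pairs are pulled again
```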

Kubernetes Deployment

If you don't already have a deepops namespace, create one now.

kubectl create namespace deepops

Next, create a secret with your NGC API key:

kubectl -n deepops create secret generic ngc-secret \
    --from-literal=apikey=<your-api-key-goes-here>

Next, create a persistent volume claim that will live outside the lifecycle of the CronJob. If you are using DeepOps, you can use a Rook/Ceph PVC similar to:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ngc-replicator-pvc
  namespace: deepops
  labels:
    app: ngc-replicator
spec:
  storageClassName: rook-raid0-retain  # <== Replace with your StorageClass
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 32Mi

Finally, create a CronJob that executes the replicator on a schedule. This example runs the replicator once a day at 04:00. Note: this example uses Rook block storage to provide a persistent volume that holds state.yml between executions, which ensures that only new container images are downloaded. For more details, see our DeepOps project.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: replicator-config
  namespace: deepops
data:
  ngc-update.sh: |
    #!/bin/bash
    ngc_replicator                                        \
      --project=nvidia                                    \
      --min-version=$(date -d "$(date +%Y-%m-15) -1 month" +"%y.%m") \
      --py-version=py3                                    \
      --image=tensorflow --image=pytorch --image=tensorrt \
      --no-exporter                                       \
      --registry-url=registry.local  # <== Replace with your local repo
---
apiVersion: batch/v1  # use batch/v1beta1 on Kubernetes older than 1.21
kind: CronJob
metadata:
  name: ngc-replicator
  namespace: deepops
  labels:
    app: ngc-replicator
spec:
  schedule: "0 4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/master: ""
          containers:
            - name: replicator
              image: deepops/replicator
              imagePullPolicy: Always
              command: [ "/bin/sh", "-c", "/ngc-update/ngc-update.sh" ]
              env:
              - name: NGC_REPLICATOR_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: ngc-secret
                    key: apikey
              volumeMounts:
              - name: registry-config
                mountPath: /ngc-update
              - name: docker-socket
                mountPath: /var/run/docker.sock
              - name: ngc-replicator-storage
                mountPath: /output
          volumes:
            - name: registry-config
              configMap:
                name: replicator-config
                defaultMode: 0777
            - name: docker-socket
              hostPath:
                path: /var/run/docker.sock
                type: File
            - name: ngc-replicator-storage
              persistentVolumeClaim:
                claimName: ngc-replicator-pvc
          restartPolicy: Never
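The update script sets --min-version to the YY.MM of the previous calendar month. A month-end-safe way to compute that value is to step back one day from the first of the current month, as in the Python sketch below (previous_month_version is a hypothetical helper, not part of the replicator). The GNU coreutils manual notes that relative items such as "-1 month" keep the day of the month, so near the end of a month they can land in an invalid date that is normalized back into the current month. Anchoring on a fixed day avoids that.

```python
from datetime import date, timedelta


def previous_month_version(today: date) -> str:
    """Return the YY.MM string for the calendar month before `today`.

    Stepping back one day from the 1st of the month always lands in the
    previous month, regardless of how many days either month has.
    """
    first_of_month = today.replace(day=1)
    last_of_previous = first_of_month - timedelta(days=1)
    return last_of_previous.strftime("%y.%m")


if __name__ == "__main__":
    print(previous_month_version(date(2024, 3, 31)))  # 24.02
    print(previous_month_version(date(2024, 1, 10)))  # 23.12
```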

Developer Quickstart

make dev
py.test

TODOs

  • Save markdown READMEs for each image (these are not version controlled).
  • Test the local registry push service (coded; in beta testing).
  • Add the templater to the workflow.

Contributors

ryanolson, samcmill, dependabot[bot], ajdecon, dholt, ksasagit
