
ofed-docker's Introduction

This project has been deprecated!


Containerized Nvidia Mellanox drivers

This repository provides means to build driver containers for various distributions.

Driver containers offered:

  • Mellanox OFED driver container : Mellanox out of tree networking driver
  • NV Peer Memory driver container : Nvidia Peer memory client driver for GPU-Direct

What are driver containers?

Driver containers are containers that allow provisioning of a driver on the host. They provide several benefits over a standard driver installation, for example:

  • Ease of deployment
  • Fast installation

Containerized Mellanox OFED driver

This container is intended to be used as an alternative to host installation. Once the container image is deployed on the host, the container will:

  • Reload the kernel modules provided by Mellanox OFED
  • Mount the container's root fs to /run/mellanox/drivers/. If this directory is mapped to the host, the content of the container becomes available to the host and to other containers. One use case is compiling the Nvidia Peer Memory client modules.

Containerized Mellanox OFED - Image build

The image must be built on the same OS and kernel as the host on which it will be deployed.

The provided Dockerfiles accept several build arguments, which make it possible to build a container image for various driver versions and platforms.

Build arguments

  • D_OFED_VERSION : Mellanox OFED version as it appears on the Mellanox OFED download page, e.g. 5.0-2.1.8.0
  • D_OS : Operating system version as it appears on the Mellanox OFED download page, e.g. ubuntu20.04
  • D_ARCH : CPU architecture as it appears on the Mellanox OFED download page, e.g. x86_64
  • D_BASE_IMAGE : Base image to be used for the driver container image build. Default: ubuntu:20.04
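
Since the image has to match the host it is deployed on, candidate values for D_OS and D_ARCH can often be read from that host. The following is a minimal sketch, assuming that the download-page naming equals the ID and VERSION_ID fields of /etc/os-release joined together (this holds for ubuntu20.04 but is not verified for other distributions). The helper names detect_d_os and detect_d_arch are hypothetical and not part of this repository.

```shell
# Hypothetical helpers; not part of ofed-docker.

# Print a D_OS candidate (e.g. ubuntu20.04) from an os-release style file.
detect_d_os() {
    # Source the file in a subshell so ID/VERSION_ID do not leak into the caller.
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "${ID}" "${VERSION_ID}" )
}

# Print a D_ARCH candidate (e.g. x86_64) for the current machine.
detect_d_arch() {
    uname -m
}
```

The results can then be passed on, for example as --build-arg D_OS="$(detect_d_os)". They should still be compared against the names listed on the download page before building.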

Build - Ubuntu

# docker build -t ofed-driver \
--build-arg D_BASE_IMAGE=ubuntu:20.04 \
--build-arg D_OFED_VERSION=5.0-2.1.8.0 \
--build-arg D_OS=ubuntu20.04 \
--build-arg D_ARCH=x86_64 \
ubuntu/

Build - Centos

Coming soon...

Containerized Mellanox OFED - Run

# docker run --rm -it \
-v /run/mellanox/drivers:/run/mellanox/drivers:shared \
-v /etc/network:/etc/network \
-v /etc:/host/etc \
-v /lib/udev:/host/lib/udev \
--net=host --privileged ofed-driver

Containerized Nvidia Peer Memory Client driver

This container is intended to be used as an alternative to host installation. Once the container image is deployed on the host, the container will:

  • Compile the nv_peer_mem kernel module
  • Reload the nv_peer_mem kernel module

As the Nvidia peer memory client module must be compiled against the Mellanox OFED and Nvidia drivers currently installed on the machine, the container expects the root fs where the Mellanox OFED drivers are installed to be mounted at /run/mellanox/drivers, and the root fs where the Nvidia drivers are installed to be mounted at /run/nvidia/drivers.

This is best suited to setups where both the Mellanox NIC and Nvidia GPU drivers are provisioned via driver containers, as these can expose their container rootfs.
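
Because the build depends on both root filesystems being present, a pre-flight check can catch a missing mount before compilation starts. This is a sketch only: check_driver_roots is a hypothetical helper, not something the container ships, and it assumes that an empty directory means the rootfs was not exposed.

```shell
# Hypothetical pre-flight check; not part of ofed-docker.
# Succeeds only if both driver root filesystems are present and non-empty.
check_driver_roots() {
    local mofed_root="${1:-/run/mellanox/drivers}"
    local nvidia_root="${2:-/run/nvidia/drivers}"
    local dir
    for dir in "$mofed_root" "$nvidia_root"; do
        # A missing or empty directory suggests the driver container rootfs is not exposed.
        if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "ERROR: driver root filesystem not found at ${dir}" >&2
            return 1
        fi
    done
}
```

Called without arguments, it checks the two default paths named above.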

Containerized Nvidia Peer Memory Client driver - Image build

Build arguments

  • D_BASE_IMAGE : Base image to be used when building the container image. Default: ubuntu:20.04
  • D_NV_PEER_MEM_BRANCH : Branch/tag of the nv_peer_memory repository. Default: master

Build - Ubuntu

# docker build -t nv-peer-mem \
--build-arg D_BASE_IMAGE=ubuntu:20.04 \
--build-arg D_NV_PEER_MEM_BRANCH=1.0-9 \
gpu-direct/ubuntu/

Build - Centos

Coming soon...

Containerized Nvidia Peer Memory Client driver - Run

In the example below, the Mellanox driver container rootfs is mounted on the host at /run/mellanox/drivers and the Nvidia driver container rootfs is mounted on the host at /run/nvidia/driver:

# docker run --rm -it \
-v /run/mellanox/drivers:/run/mellanox/drivers \
-v /run/nvidia/driver:/run/nvidia/drivers \
--privileged nv-peer-mem
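
Whether nv_peer_mem actually ended up in the running kernel can be checked from the host by looking at /proc/modules. A minimal sketch follows. The helper name module_loaded is hypothetical, and the optional second argument exists only so the function can be pointed at a different listing.

```shell
# Hypothetical helper; not part of ofed-docker.
# Succeeds if the named module appears in a /proc/modules style listing.
module_loaded() {
    local name="$1"
    local modules_file="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a space.
    grep -q "^${name} " "$modules_file"
}
```

For example, module_loaded nv_peer_mem returns success once the module is loaded.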

Driver container readiness

A driver container loads kernel modules into the running kernel, possibly preceded by a compilation step.

The process is not atomic as:

  1. A driver is often composed of multiple modules which are loaded sequentially into the kernel.
  2. Compilation (if it takes place) takes time.

To mark the completion of the driver loading phase by the driver container, a file is created at the container's root directory: /.driver-ready. Its existence indicates that the driver has been successfully loaded into the running kernel. This can be used by a container orchestrator to probe for readiness of a driver container.
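
A readiness check built on this marker file could look like the sketch below. The function name wait_for_driver_ready, the one-second polling interval, and the default timeout of 300 seconds are assumptions made for illustration. Only the /.driver-ready path comes from the description above.

```shell
# Hypothetical readiness helper; not part of ofed-docker.
# Polls once per second until the marker file exists or the timeout expires.
wait_for_driver_ready() {
    local marker="${1:-/.driver-ready}"
    local timeout="${2:-300}"
    local waited=0
    while [ ! -e "$marker" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "driver not ready after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

An orchestrator that supports exec-style probes can instead run the one-shot equivalent, test -e /.driver-ready, inside the container.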

Limitations

Having the rdma-core package installed on the host may prevent the Mellanox OFED driver container from loading drivers properly. This is because rdma-core installs udev rules that trigger driver module loading from the host and also load storage modules on system startup.
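
To see whether a host is affected, the package database can be queried before deploying the container. The sketch below assumes a dpkg- or rpm-based host. The helper name host_pkg_installed is hypothetical.

```shell
# Hypothetical helper; not part of ofed-docker.
# Succeeds if the named package is installed according to dpkg or rpm.
host_pkg_installed() {
    local pkg="$1"
    if command -v dpkg-query >/dev/null 2>&1; then
        # dpkg reports a status ending in "ok installed" for installed packages.
        dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q 'ok installed'
    elif command -v rpm >/dev/null 2>&1; then
        rpm -q "$pkg" >/dev/null 2>&1
    else
        echo "no supported package manager found" >&2
        return 2
    fi
}
```

Running host_pkg_installed rdma-core on the host then reports, through its exit status, whether the limitation described above may apply.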

ofed-docker's People

Contributors

adrianchiris, e0ne, maze88, moshe010, sjug, yehorov-nvidia, ykulazhenkov

ofed-docker's Issues

Question about Driver Container Updating

Does this container support being replaced (that is, updated)?
For example, if some version (say 5.2) is running on a host server, does updating to a newer version (say 5.3) take effect?
Of course, all containers are stopped during the update, and at the moment this is not a real issue.
I am wondering whether the driver installation/uninstallation process is handled by the driver container. From reading entrypoint.sh, it does not appear to be.

Add support for RHEL 7.x

I know the CentOS/RHEL support isn't official yet, but what is in the repo is RHEL8 based.

Would be great to have a working RHEL 7.x (7.7 or later) Dockerfile and entrypoint.sh as well (we use Ubuntu 20.04, RHEL 7.7 and RHEL 8.4 at present).

I'd be happy to test it on RHEL 7.x if that's an issue for the dev team prior to merge into master.

remove hardcoded kver value from nv-peer-mem entrypoint

In the inject_nvidia_driver() function, ${KERNEL_VERSION} should be used instead of the hardcoded 4.15.0-109-generic.

function inject_nvidia_driver() {
    # NVIDIA driver may be installed either with/out DKMs which affects the module location
    # always inject the modules under dkms as thats where nv_peer_mem is looking for the modules
    # alternative is to modify nv_peer_mem/create_nv_symvers.sh to support both locations
    has_files_matching ${NVIDIA}/usr/src/ nvidia-*
    if  [[ $? -eq 0 ]]; then
        ln -sf ${NVIDIA}/usr/src/nvidia-* /usr/src/.
    else
        echo "ERROR: Nvidia GPU driver sources not found."
        return 1
    fi

    has_files_matching ${NVIDIA}/lib/modules/${KERNEL_VERSION}/updates/dkms nvidia
    if [[ $? -eq 0 ]]; then
        ln -sf ${NVIDIA}/lib/modules/${KERNEL_VERSION}/updates/dkms/* /lib/modules/${KERNEL_VERSION}/updates/dkms/
    else
        has_files_matching ${NVIDIA}/lib/modules/4.15.0-109-generic/kernel/drivers/video/ nvidia
        if [[ $? -eq 0 ]]; then
            # Driver installed as non dkms kernel module
            ln -sf ${NVIDIA}/lib/modules/4.15.0-109-generic/kernel/drivers/video/nvidia* /lib/modules/4.15.0-109-generic/updates/dkms/
        else
            echo "ERROR: Failed to locate Nvidia GPU drivers in mount: ${NVIDIA}"
            return 1
        fi
    fi
    # ln -sf ${NVIDIA}/var/lib/dkms/nvidia /var/lib/dkms/nvidia
}
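
One possible shape for the fix is sketched below for the non-DKMS fallback branch only. It is not the repository's code: link_non_dkms_modules and its dest_root argument are hypothetical, and the kernel version is taken as a parameter that defaults to the running kernel.

```shell
# Hypothetical sketch of the non-DKMS fallback; not the repository's code.
# The kernel version is a parameter (defaulting to the running kernel)
# instead of the hardcoded 4.15.0-109-generic.
link_non_dkms_modules() {
    local nvidia_root="$1"
    local kver="${2:-$(uname -r)}"
    local dest_root="${3:-}"   # hypothetical: empty means the real root filesystem
    local src="${nvidia_root}/lib/modules/${kver}/kernel/drivers/video"
    local dst="${dest_root}/lib/modules/${kver}/updates/dkms"
    # Fail early if no nvidia* module files exist for this kernel version.
    if ! ls "${src}"/nvidia* >/dev/null 2>&1; then
        echo "ERROR: Failed to locate Nvidia GPU drivers in mount: ${nvidia_root}" >&2
        return 1
    fi
    mkdir -p "${dst}" && ln -sf "${src}"/nvidia* "${dst}/"
}
```

In the entrypoint this would be called as link_non_dkms_modules "${NVIDIA}" "${KERNEL_VERSION}".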

container with ofed driver will see all ib cards of the host

When we run a container with the OFED driver, all of the host's IB cards are visible inside the container. Is there a way to expose only one IB card, as nvidia-docker does for GPUs?
There, when the NVIDIA_VISIBLE_DEVICES env variable is set, only the GPU cards listed in it are visible.

enable NFS over RDMA(--with-nfsrdma) by default

These are all the default install parameters:

/bin/bash -c '/root/${D_OFED_PATH}/mlnxofedinstall --without-fw-update --kernel-only --add-kernel-support --distro ${D_OS} --skip-repo --force ${D_WITHOUT_FLAGS}'

If I want to enable NFS over RDMA (--with-nfsrdma), what should I do? We could add a new env variable such as "MLNXOFEDINSTALL_ARGS" to https://github.com/Mellanox/ofed-docker/blob/master/deployment/ofed-driver-pod.yaml, and pass it to https://github.com/Mellanox/ofed-docker/blob/master/rhel/entrypoint.sh.

For example:

apiVersion: v1
kind: Pod
metadata:
  name: ofed-driver
  labels:
    app: ofed-driver
spec:
  hostNetwork: true
  containers:
    - name: ofed-driver
      image: ofed_container
      imagePullPolicy: Never
      # Environment variables are set per container, under the lowercase "env" key
      env:
        - name: MLNXOFEDINSTALL_ARGS
          value: "--with-nfsrdma --with-nvmf"
      securityContext:
        privileged: true
      volumeMounts:
        - name: run-mofed
          mountPath: /run/mellanox/drivers
          mountPropagation: Bidirectional
        - name: etc-network
          mountPath: /etc/network
        - name: host-etc
          mountPath: /host/etc
        - name: host-udev
          mountPath: /host/lib/udev
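
On the entrypoint side, the proposal could be sketched as follows. This is an illustration only: ofed_install_cmd is a hypothetical function, MLNXOFEDINSTALL_ARGS is the variable name proposed in this issue rather than an existing option, and the function only prints the resulting command line instead of running the installer.

```shell
# Hypothetical entrypoint fragment; not the repository's code.
# Appends the optional flags from MLNXOFEDINSTALL_ARGS to the default
# mlnxofedinstall invocation and prints the resulting command line.
ofed_install_cmd() {
    printf '%s' "/root/${D_OFED_PATH}/mlnxofedinstall --without-fw-update --kernel-only --add-kernel-support --distro ${D_OS} --skip-repo --force"
    if [ -n "${D_WITHOUT_FLAGS:-}" ]; then
        printf ' %s' "${D_WITHOUT_FLAGS}"
    fi
    if [ -n "${MLNXOFEDINSTALL_ARGS:-}" ]; then
        printf ' %s' "${MLNXOFEDINSTALL_ARGS}"
    fi
    printf '\n'
}
```

With MLNXOFEDINSTALL_ARGS unset, the printed command is the same as the current default, so existing deployments would keep their behavior.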

What do you think? If you agree, I'm willing to implement this.

should support ubuntu 20.04

Please add support for ubuntu 20.04

docker build -t ofed-driver\
--build-arg D_OFED_VERSION=5.0-2.1.8.0 \
--build-arg D_OS=ubuntu20.04 \
--build-arg D_ARCH=x86_64 \
ubuntu/

The command above fails to execute.
