Continuous testing of xsuite

Since GitHub does not yet support GPUs on its own Actions runners, we set up our own self-hosted runner on OpenStack for this purpose. The configuration of the test workflow can be found in .github/test_gpu.yaml of the xsuite repository. However, the test machine first needs to be prepared. As we use Docker to run the tests in an isolated and controlled environment, some dependencies need to be installed: Docker itself, the Nvidia drivers, and the Nvidia Container Toolkit (assuming the test machine uses an Nvidia GPU). The steps to accomplish this are listed below.

Setup of the test runner machine (Ubuntu)

A GPU-capable OpenStack machine

This can be set up largely following the guide here.

# Find the image uuid you want
openstack image list --community | grep -i ubuntu
# Commands to be passed during the setup of our machine
cat > ubuntu-bootcmd.txt <<- EOF
#cloud-config
bootcmd:
- printf "[Resolve]\nDNS=137.138.16.5 137.138.17.5\n" > /etc/systemd/resolved.conf
- [systemctl, restart, systemd-resolved]
EOF
# Create the VM
openstack server create \
  --image <IMAGE_UUID> \
  --user-data ubuntu-bootcmd.txt \
  --flavor g2.5xlarge \
  --key-name lxplus xsuite-ubuntu-tests

Configure the test machine

Once we ssh to the machine (as user ubuntu), we need to install Docker and suitable Nvidia drivers. We can use the convenience script provided by Docker, as described here.

# Install docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Set up rootless for good practice
dockerd-rootless-setuptool.sh install

To be able to run containers with GPU support we need the Nvidia Container Toolkit. A prerequisite for that is the Nvidia drivers. The up-to-date install instructions for the toolkit can be found here. There is also a useful page on the topic in the CERN OpenStack guide.

# Install cuda drivers
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

# Install the container toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
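The commands above only register the package source; the toolkit packages themselves still need to be installed. A possible set of commands, following the legacy nvidia-docker instructions that this repository corresponds to (the package name is an assumption here and may differ for newer toolkit versions, e.g. nvidia-container-toolkit):

sudo apt-get update
sudo apt-get install -y nvidia-docker2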

After restarting the Docker daemon with sudo systemctl restart docker, we can check that everything works by running a sample image from Nvidia:

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Install the GitHub Actions runner

We can follow the steps listed under xsuite/xsuite > Settings > Actions > Runners > New self-hosted runner.

This involves downloading and configuring the runner with the repository.
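The GitHub page shows the concrete commands; roughly, they look like the following sketch, where the runner version and registration token are placeholders taken from that page at setup time:

mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-<VERSION>.tar.gz -L \
    https://github.com/actions/runner/releases/download/v<VERSION>/actions-runner-linux-x64-<VERSION>.tar.gz
tar xzf actions-runner-linux-x64-<VERSION>.tar.gz
./config.sh --url https://github.com/xsuite/xsuite --token <TOKEN>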

Afterwards, we install and run the runner as a service with user ubuntu:

./svc.sh install ubuntu
./svc.sh start

Setup of the test runner machine (Alma 8)

Synopsis

The AlmaLinux 8 virtual machine (the “host”) will run a GitHub runner executing Xsuite tests in a containerised environment. In order to support GPU execution contexts, the Nvidia drivers and the Nvidia Container Toolkit need to be installed. At the time of writing this guide, the Nvidia guide states that Docker is not supported under RHEL 8/CentOS 8 (and so effectively Alma 8 as well), which is why we will use Podman instead of Docker. Podman is a container environment similar to Docker, however it does not require a separate daemon to run containers, which makes it more lightweight.

Set up a user account

We can set up an appropriate GPU-capable OpenStack VM in the same way as in the previous section (Ubuntu), or simply by following the GUI wizard on openstack.cern.ch.
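If using the CLI, the commands mirror the Ubuntu case; the image UUID and instance name below are placeholders, and the same --user-data file can be passed if custom DNS is needed:

openstack image list --community | grep -i alma
openstack server create \
  --image <ALMA8_IMAGE_UUID> \
  --flavor g2.5xlarge \
  --key-name lxplus xsuite-alma-tests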

On the fresh Alma VM we first set up a user account which we will use to run our actions:

adduser xsuite

We add the user to the sudoers file by appending the line xsuite  ALL=(ALL)   NOPASSWD:ALL to /etc/sudoers. If necessary, copy the authorised SSH key from the root account (/root/.ssh/authorized_keys) to the new account (/home/xsuite/.ssh/authorized_keys) and fix the permissions (sudo chown -R xsuite:xsuite .ssh and chmod -R +rw .ssh).
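For example, the steps above can be performed with the following sketch (run as root or a sudo-capable user; editing /etc/sudoers via visudo is the safer option):

echo 'xsuite  ALL=(ALL)   NOPASSWD:ALL' | sudo tee -a /etc/sudoers
sudo mkdir -p /home/xsuite/.ssh
sudo cp /root/.ssh/authorized_keys /home/xsuite/.ssh/authorized_keys
sudo chown -R xsuite:xsuite /home/xsuite/.ssh
sudo chmod -R +rw /home/xsuite/.ssh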

Installing Nvidia drivers

We will largely be following the official Nvidia guide [1], though only as far as installing the drivers: CUDA is not necessary on the host machine, only inside the containers.

First, some prerequisites are necessary. In this guide, we will be installing the drivers using the “network RPM” method from Nvidia’s guide. We will perform a DKMS installation, so that the drivers are recompiled automatically whenever there is a kernel update, instead of having to redo it manually. To this end, we need to install the kernel headers:

sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

To satisfy other requirements of the Nvidia driver package, we enable the third party repository EPEL:

sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

Then we can enable the network repo and install the drivers:

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf clean all
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf clean expire-cache  # clean dnf cache afterwards

To check if everything works, we can sudo reboot, then run the following command, which, if all went well, should return a summary of the available GPUs:

nvidia-smi

Troubleshooting note: if at this stage the driver is not working, it could be that it was not picked up by DKMS. We can verify this by running dkms status: if there is no nvidia/... entry, or if its status is not listed as installed, we can run dkms autoinstall to attempt to recompile the drivers.
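For example:

dkms status          # expect an nvidia/<version> entry with status "installed"
sudo dkms autoinstall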

Installing the Nvidia Container Toolkit

We will follow the instructions of the official Nvidia guide [2], the steps of which are summarised below.

A container environment is a prerequisite for installing the NCT. We can easily install Podman with:

sudo dnf install podman

Podman is compatible with the Container Device Interface specification, which means that only the base components of the Nvidia Container Toolkit are needed. We install the required package:

sudo dnf clean expire-cache
sudo dnf install -y nvidia-container-toolkit-base

Check that it works:

nvidia-ctk --version

And generate the CDI specification with:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
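In recent toolkit versions, the generated device names (e.g. nvidia.com/gpu=all) can be listed with:

nvidia-ctk cdi list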

To be able to run rootless containers with Podman, we change the following configuration:

sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml

When running rootless, we may also encounter permission issues with SELinux. We need to add an appropriate label container_file_t to the Nvidia device files:

sudo semanage fcontext -a -t container_file_t '/dev/nvidia.*'
sudo restorecon -v /dev/*

Note that it may be necessary to relabel the device files with the restorecon command in the case of changes/updates to the hypervisor.

Check that everything works with:

podman run --rm --gpus all cupy/cupy:latest nvidia-smi

Finally, we make a link called docker pointing to podman, so that the workflows (which presume Docker) work on the new machine:

sudo ln -s /usr/bin/podman /usr/bin/docker
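A quick way to confirm the alias works (the exact output will differ):

docker --version   # should now report the Podman version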

Set up the GitHub runner

Navigate to Settings > Actions > Runners on GitHub and follow the instructions for creating the new runner. Once this is done, there are three final steps to complete before we enable the runner service.

Configure SELinux

We need to label the service script as executable for SELinux, otherwise it will prevent us from launching the service.

sudo semanage fcontext -a -t bin_t '/home/xsuite/actions-runner/runsvc.sh'
sudo semanage fcontext -a -t bin_t '/home/xsuite/actions-runner/bin(/.*)?'
sudo restorecon -v -R /home/xsuite/actions-runner/

Set the right container format

Since we are using the Docker Dockerfile format, which is slightly different from the OCI format to which Podman defaults, we need to configure Podman to use the Docker format. To achieve this, we add an environment variable to the runner service file:

sudo ./svc.sh install xsuite && sudo ./svc.sh stop  # create but don't start the service
sudo systemctl edit actions.runner.xsuite-xsuite.xsuite-alma-tests.service

In the opened editor (which may be empty), we paste the following:

[Service]
Environment="BUILDAH_FORMAT=docker"
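We can verify that the drop-in was recorded with, for example:

sudo systemctl cat actions.runner.xsuite-xsuite.xsuite-alma-tests.service | grep BUILDAH_FORMAT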

Enable account lingering

In order for Podman to be able to function headless, we need to enable account lingering, as otherwise systemd will kill any user processes when there is no login session for the user.

sudo loginctl enable-linger xsuite
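We can check the setting with:

loginctl show-user xsuite --property=Linger   # should print Linger=yes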

Enable and start the service

Finally, we can start the runner service, which will immediately begin listening for new jobs:

sudo ./svc.sh install xsuite
sudo ./svc.sh start