BuildKite + ArgoWF for ML Training Jobs

MLOps

MLOps: Leveraging ArgoWF and Buildkite to train models

Author

Sachin Abeywardana

Published

July 31, 2024

Now a disclaimer before we go on. I am not an MLOps engineer, and have no plans to become one. However, the past week has been quite an interesting learning curve.

To start with the problem that we were having, in the past week one of our engineers (Mark) was doing some amazing work on the modelling front. However, this work was continuing on in a silo in a notebook, and it was hard to give feedback on certain modelling choices.

Using Buildkite and Argo Workflows, we were able to build out a CLI app where you can code in a repo, and push your experiments to start training. We structured our code so that we could test most, if not all, of our hypotheses in parallel without having to wait for a notebook to finish.

We will not be covering how to set up Argo and Buildkite in your workplace's stack, as this was done for me by another amazing engineer (Wei Feng).

Dockerising your experiments

Docker is like a virtual environment but better. It goes beyond just your Python environment to control almost the entire software stack. The best part is that you can build on top of the work of other engineers to get well-tuned environments without having to install the software yourself.

ARG SHA_ID=fc47f8018254e6df30f48c48f2db1c758d44de21a8c553de1a1c451a65baa70a
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime@sha256:$SHA_ID

WORKDIR /app

# Copy the requirements.txt file into the container at /app
COPY requirements.txt /app/

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the current directory contents into the container at /app
COPY ./src/ /app

In our case we built our container on top of Meta's PyTorch base image, and installed a few libraries via a requirements.txt file. All the Python source code lives under the src folder, which sits at the same level as the Dockerfile within the repo.

Given that we have a number of projects within the repo, we do not want to rebuild a container when an unrelated part of the repo changes. To do this, we use Buildkite to build a container only if the current commit changed code in that project's folder. We use the following snippet to do this work:

DOCKERFILE_PATHS=$(find "${ROOT_FOLDER}/${PROJECT_FOLDER}" -name "Dockerfile" -type f -exec dirname {} \;)

for DOCKERFILE_PATH in ${DOCKERFILE_PATHS}; do
    RELATIVE_PATH="${DOCKERFILE_PATH#"${ROOT_FOLDER}"}"
    IMAGE_NAME=$(basename "${DOCKERFILE_PATH}")
    IMAGE_NAME=${IMAGE_NAME//_/-}

    if [[ $(git diff --name-only HEAD~1..HEAD "${ROOT_FOLDER}/${RELATIVE_PATH}") ]]; then
        echo "Change detected in subfolder: ${RELATIVE_PATH}"

        cd "${ROOT_FOLDER}/${RELATIVE_PATH}"

        REPOSITORY="${HOST}/${PROJECT_NAME}/${ARTIFACTS_REPOSITORY}/${IMAGE_NAME}"

        echo "+++ Building and tagging"
        docker build --platform linux/amd64 --file "./Dockerfile" \
        -t "${IMAGE_NAME}:${BUILDKITE_BUILD_NUMBER}" .

        # Tagging and pushing follow here (omitted)
    fi
done
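The same "did anything in this folder change?" decision can be sketched in Python, which makes it easy to unit test. This is a minimal sketch and not our pipeline code: folders_to_rebuild is a hypothetical helper, and it assumes you pass in the output of git diff --name-only HEAD~1..HEAD as a list of repo-relative paths.

```python
from pathlib import PurePosixPath


def folders_to_rebuild(changed_files, dockerfile_dirs):
    """Return the Dockerfile folders that contain at least one changed file.

    changed_files: repo-relative paths, e.g. the lines printed by
        `git diff --name-only HEAD~1..HEAD`
    dockerfile_dirs: repo-relative folders that hold a Dockerfile
    """
    changed = [PurePosixPath(f) for f in changed_files]
    to_rebuild = []
    for folder in dockerfile_dirs:
        folder_path = PurePosixPath(folder)
        # A file belongs to a folder only if the folder is one of its parents.
        # Comparing path components avoids matching "training/cnn_v2" against
        # "training/cnn", which a plain string prefix check would do.
        if any(folder_path in f.parents for f in changed):
            to_rebuild.append(folder)
    return to_rebuild
```

Comparing path components rather than string prefixes is the main design choice here; everything else is plain list filtering.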

We also go on to tag the build with the commit sha, and with latest if it's the master/main branch. We make this distinction because we want production workflows to use the latest stable release, and not any experimental features we are running on a branch.
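The tagging rule above can be sketched as a small function. The names image_tags and stable_branches are made up for illustration; the rule itself is the one described: every build gets the build number and the commit sha, and only master/main builds also get latest.

```python
def image_tags(branch, commit_sha, build_number, stable_branches=("main", "master")):
    """Return the list of tags to apply to a freshly built image."""
    # Every build is addressable by its CI build number and its commit sha
    tags = [str(build_number), commit_sha]
    # Only stable branches are allowed to move the "latest" tag
    if branch in stable_branches:
        tags.append("latest")
    return tags
```

An experiment branch can then be pinned to its exact commit, while production workflows keep pulling latest.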

Argo Workflows

Argo is a container orchestrator for data workflows, built on top of Kubernetes. At my previous workplace I had some experience using Argo; however, the way it was structured through jsonnet meant only a few engineers could work on that code base (I’m looking at you Greg 😒).

Instead of using pure YAML, we chose to use Hera, which is a Python wrapper for Argo Workflows. Some of the advantages of using Python as opposed to YAML were that 1. it was readable, 2. it had code completion, and 3. did I say readable? Compared to a YAML config, where you have to guess the number of spaces for an indentation and have no idea what the parameters mean without going back and forth with the documentation, Hera was a godsend.

The following code snippet explains how we achieved our multiple parallel training runs:

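import logging
import time
from datetime import datetime

from hera.auth import ArgoCLITokenGenerator
from hera.shared import global_config
from hera.workflows import Container, EmptyDirVolume, Resources, Step, Steps, Workflow

logger = logging.getLogger(__name__)

# WORKFLOW_NAME, NAMESPACE, TASK_NAME and MODEL_IMAGE are project-specific
# constants defined elsewhere; the tag helper is shown further down.

def run_workflow(trainer_config: str):
    logger.info(f"Running the workflow with tag: {_get_tag()}")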
    # Define the workflow
    workflow_name = WORKFLOW_NAME + datetime.now().strftime("-%Y%m%d-%H%M%S")
    with Workflow(name=workflow_name, entrypoint="steps", namespace=NAMESPACE) as w:
        shared_memory_volume = EmptyDirVolume(
            name="dshm",
            mount_path="/dev/shm",
            size_limit="50Gi",
        )
        model_training_task = Container(
            name=TASK_NAME,
            image=f"{MODEL_IMAGE}:{_get_tag()}",
            command=["python", "main.py"],
            args=[
                "train",
                "--trainer-config-json",
                trainer_config,
            ],
            resources=Resources(
                gpus=1, cpu_request=4, cpu_limit=8, memory_request="64Gi", memory_limit="128Gi"
            ),
            volumes=[shared_memory_volume],
        )

        # Define the steps
        with Steps(name="steps") as _:
            Step(name=TASK_NAME, template=model_training_task)

    # Submit the workflow
    w.create()
    url = f"{global_config.host}/workflows/{global_config.service_account_name}/{w.name}"
    logger.info(f"Open url {url}")

if __name__ == "__main__":
    trainer_configs = [
        '{"is_local": false, "epochs": 10, "learning_rate": 0.01, "batch_size": 8, ...}',
        '{"is_local": false, "epochs": 5, "learning_rate": 0.001, "batch_size": 16, ...}',
        '{"is_local": false, "epochs": 10, "learning_rate": 0.001, "batch_size": 16, ...}',
    ]
    for trainer_config in trainer_configs:
        run_workflow(trainer_config)
        time.sleep(2) # to ensure no name clashes

Please ask the DevOps engineer who set up your Argo cluster what the host and service_account_name are. Apart from that, you should be able to copy and paste most of my code.
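On the time.sleep(2) in the loop above: it is there because the workflow name only has one-second resolution, so two submissions in the same second would clash. One possible alternative, sketched below, is to generate all the names up front with an index suffix. This is an assumption on my part rather than what the code above does: workflow_names is a hypothetical helper, and run_workflow would need an extra argument to accept a name.

```python
from datetime import datetime


def workflow_names(prefix, count, now=None):
    """Return `count` distinct workflow names sharing one timestamp."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    # The zero-padded index keeps names unique even within the same second
    return [f"{prefix}-{stamp}-{i:02d}" for i in range(count)]
```

Kubernetes resource names must be lowercase, so the prefix should be lowercase as well.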

Some of the most important things to note about the above code are the following:

You can allocate the resources for your machine without having to write Terraform, as shown here:

resources=Resources(
    gpus=1, cpu_request=4, cpu_limit=8, memory_request="64Gi", memory_limit="128Gi"
),

You may have noticed that the container image has a tag associated with it. It is calculated using the following function, which calls git underneath.

import subprocess


def _get_tag() -> str:
    # Get the current branch name
    branch = subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()

    root_folder = subprocess.check_output(["git", "rev-parse", "--show-toplevel"]).decode().strip()

    # Get the last commit sha where this folder was updated
    commit_sha = (
        subprocess.check_output(
            [
                "git",
                "log",
                "-n",
                "1",
                "--pretty=format:%H",
                f"{root_folder}/training/cnn_finetune_attributes",
            ]
        )
        .decode()
        .strip()
    )

    # Set the tag variable based on the branch
    tag = "latest" if branch == "main" else commit_sha
    return tag

There was a big gotcha with the shared memory needed for Docker containers due to PyTorch DataLoaders. I still don’t understand the problem entirely, but the solution was to include a shared_memory_volume as shown above.

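Some background on the shared memory gotcha, as far as I understand it: PyTorch DataLoader workers are separate processes that hand batches back to the main process through shared memory, and containers usually get only a small /dev/shm by default, so a larger volume has to be mounted there. A quick way to confirm the mount took effect is to log its size when training starts. This is a minimal sketch using only the standard library; the helper names are hypothetical.

```python
import shutil


def shared_memory_gib(path="/dev/shm"):
    """Return the total size, in GiB, of the filesystem mounted at `path`."""
    return shutil.disk_usage(path).total / 1024**3


def has_enough_shared_memory(min_gib, path="/dev/shm"):
    """True if the mount at `path` is at least `min_gib` GiB in size."""
    return shared_memory_gib(path) >= min_gib
```

Calling has_enough_shared_memory at the top of the training entrypoint and failing fast is cheaper than finding out from a crashed DataLoader worker halfway through an epoch.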
The biggest strength is that I can spam the Kubernetes cluster with as many experiments as I wish via the trainer_configs, as shown in the for loop above.
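Hand-writing the JSON strings gets tedious once the number of experiments grows. A minimal sketch of generating them from a grid of values is below. build_trainer_configs is a hypothetical helper, and the keys are just the ones that appear in the example configs above; your own trainer config will have more.

```python
import itertools
import json


def build_trainer_configs(grid, base=None):
    """Return one JSON string per combination of the values in `grid`.

    grid: dict mapping a config key to the list of values to try
    base: dict of settings shared by every experiment
    """
    base = base or {}
    keys = list(grid)
    configs = []
    for values in itertools.product(*(grid[k] for k in keys)):
        # Grid values override anything with the same key in `base`
        config = {**base, **dict(zip(keys, values))}
        configs.append(json.dumps(config))
    return configs


trainer_configs = build_trainer_configs(
    grid={"learning_rate": [0.01, 0.001], "batch_size": [8, 16]},
    base={"is_local": False, "epochs": 10},
)
```

The resulting list can be fed to the same for loop as before. Keep in mind that every extra grid value multiplies the number of GPUs you are asking for.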

Local Feedback loop

Running all your experiments exclusively through Argo makes for a slow feedback loop, not to mention an expensive one if you are using a large GPU. If you have a silly typo in your code, you will only find out after the container builds, Argo provisions a GPU (this can take a while these days due to the competition for GPU machines), and the job finally hits the error, which you then have to hunt for in the logs.

Most of the above can be avoided if you have a decent local VS Code setup. In my case I use the launch.json file under the .vscode folder to run my code. It is also best to set up a virtual env with the following code snippet:

cd /path/to/parent/of/src/folder
python3 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt

Use the following and edit accordingly in your launch.json file:

{
    "configurations": [
        {
            "name": "CNN Trainer",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/path/to/parent/of/src/folder/src/main.py",
            "env": {
                "PYTHONPATH": "${workspaceFolder}/path/to/parent/of/src/folder/src"
            },
            "python": "/path/to/parent/of/src/folder/myenv/bin/python",
            "args": [
                "train",
                "--trainer-config-json",
                "{\"is_local\": true, \"epochs\": 1, \"batch_size\": 16, ...}",
            ],
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}

Once you have this file, you can set visual breakpoints in your code if needed, and run the debugger under the Run and Debug tab.

Note that you will most likely want to run this on a tiny subset of your dataset rather than the full training set, which you can leave for a proper machine on Argo. I have done the following in my main.py to keep local training as small as possible. You can and should add other optimisations to save time.

if trainer_config.is_local:
    df = df.sample(100)
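For completeness, here is a minimal sketch of how a main.py could turn the command line used throughout this post into a config object with an is_local flag. The train subcommand and the --trainer-config-json flag come from the commands above; the TrainerConfig dataclass, its defaults and parse_args are assumptions for illustration, not the actual main.py.

```python
import argparse
import json
from dataclasses import dataclass


@dataclass
class TrainerConfig:
    # Field names mirror the JSON keys shown in this post; defaults are made up
    is_local: bool = False
    epochs: int = 10
    learning_rate: float = 0.001
    batch_size: int = 16


def parse_args(argv=None):
    """Parse `train --trainer-config-json '{...}'` into (command, TrainerConfig)."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    train = subparsers.add_parser("train")
    train.add_argument("--trainer-config-json", required=True)
    args = parser.parse_args(argv)
    # Unknown keys raise a TypeError here, which catches config typos early
    config = TrainerConfig(**json.loads(args.trainer_config_json))
    return args.command, config
```

Because the same JSON string drives both the local debugger and the Argo container, the only difference between the two runs is the value of is_local.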

Final Thoughts

While notebooks can be an amazing tool for iteration and experimentation, once you are satisfied with the general direction of your model, it is best to create a CLI so that you can test multiple hypotheses. Argo (or rather Hera) has been an amazing tool to achieve this goal.

I personally hope more and more people adopt this tool to train their own models.

Kudos

Big shout out to Ryan senpai who (patiently 😅) taught me everything I know about Argo, and to Geoff Breemer who showed me launch.json and many other VS Code goodies. Of course, the biggest kudos goes to Wei Feng for setting up the Argo cluster for us.