Introduction To Training Pipelines

A typical machine learning training process consists of multiple tasks, such as data preparation, model training, model validation, and model deployment.

 

Because you may want to iterate on your model even after deploying it to production, it is paramount to automate these tasks, e.g. to rerun them with new data, new hyper-parameters, or changes in the code. Training pipelines help automate exactly that.

 

At the same time, it is crucial to keep the pipeline reproducible, in the sense that you, as well as others, can replicate it and obtain similar results. Ensuring reproducibility requires tracking infrastructure, code, data, hyper-parameters, experiment metrics, etc.


In this article, I’ll give an overview of how WandB and dstack together can help build reproducible training pipelines.

Experiment tracking

Let’s start with experiment tracking.

 

Experiment tracking involves tracking

  • hyper-parameters (e.g., step-size, batch size, etc.)
  • experiment metrics (e.g., accuracy, training loss, validation loss, etc.)
  • hardware system metrics (e.g., the utilization of GPUs, memory, etc.)

 

WandB provides one of the easiest ways to perform experiment tracking (a minimal usage sketch follows this list), as it

  • is easy to integrate into ML code (only a few lines need to be added)
  • stores the results in the cloud, so you can always find the metrics by run name (don’t underestimate this: keeping metrics only on your local machine is an anti-pattern with regard to reproducibility)
  • automatically visualizes various metrics with nice charts
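
As a quick illustration, here is a minimal sketch of plain wandb usage (the project name matches the one used later in this post; the hyper-parameters and metric values are placeholders):

import wandb

# start a run and record hyper-parameters
wandb.init(project="my-test-project", config={"learning_rate": 1e-3, "batch_size": 32})

# log experiment metrics over time; WandB collects system metrics automatically
for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # placeholder value for illustration
    wandb.log({"epoch": epoch, "train_loss": train_loss})

wandb.finish()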

 

WandB is very easy to integrate into your existing codebase. Below is an example of how to do it if you use PyTorch Lightning.

 

We need to import the WandbLogger class, instantiate it, and pass a project name as its argument, as shown below.

 

from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-test-project")


The created wandb_logger can be passed to the logger argument of the Trainer object, as shown below.

 

trainer = Trainer(logger=wandb_logger)
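
Any metric your LightningModule reports via self.log is then forwarded to WandB by this logger. Here is a minimal, hypothetical sketch (the model and data are placeholders and not part of the original post):

import torch
from torch import nn
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)  # placeholder model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.layer(x.view(x.size(0), -1)), y)
        self.log("train_loss", loss)  # forwarded to WandB by the logger
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

wandb_logger = WandbLogger(project="my-test-project")
trainer = Trainer(logger=wandb_logger, max_epochs=10)
# trainer.fit(LitClassifier(), train_dataloaders=...)  # dataloader omitted for brevity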

 

Once the run is completed, the information about it is stored under the my-test-project project.

 

You can open wandb.ai and click your project to see the results (loss, accuracy, system metrics, etc.) of its various runs.

 

To store hyper-parameters or other configuration parameters, use wandb.config; integrating it into the ML code is easy, as shown below.

 

import wandb
wandb.init()
wandb.config.epochs = 10
wandb.config.batch_size = 32
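
Alternatively (an equivalent pattern, assuming the same project), the configuration can be passed as a dict directly to wandb.init:

import wandb

# pass the configuration up front instead of setting wandb.config fields one by one
wandb.init(project="my-test-project", config={"epochs": 10, "batch_size": 32})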


The experiment configuration for each run is automatically stored in the cloud and can be found in the config.yaml file in the Files tab of a run at wandb.ai.

Automating workflows

Now, let’s talk about the automation of tasks and tracking the rest of our pipeline, which may include data, code, and infrastructure.

For this purpose, we use dstack. Here’s a brief list of what dstack can do:

  • version data, code, and infrastructure
  • automate training tasks through declarative configuration files
  • automatically provision infrastructure in a linked cloud account (it supports AWS, GCP, and Azure)
  • most importantly, be used from your favorite IDE (or terminal)

 

To automate workflows with dstack, you define them in the .dstack/workflows.yaml file.

 

Here’s a very simple example:

 

workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]

  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare
    artifacts: ["model"]
    resources:
      gpu: 4

 

Here, we can define multiple workflows (i.e. tasks) and configure dependencies between them.

 

In this particular example, we have two workflows: prepare and train.

 

The train workflow depends on the prepare workflow.

 

As you see, each workflow defines how to run the code, including what folders to store as output artifacts, and what infrastructure is needed (e.g. the number of GPUs, amount of memory, etc).
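
For instance, a hypothetical prepare.py (not from the original post) only needs to write its output into the data folder declared under artifacts; dstack then stores that folder as an output artifact of the run:

# prepare.py: a sketch of a data-preparation step
import csv
import os

os.makedirs("data", exist_ok=True)  # the folder listed under `artifacts`

# write a toy dataset; in a real pipeline this would download and preprocess data
with open("data/train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["feature", "label"])
    for i in range(100):
        writer.writerow([i, i % 2])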

 

The dstack CLI is used to run workflows from your local terminal. To run the train workflow defined above, execute the following command:

 

dstack run train

 

Once the run is submitted, you can access the relevant logs, code changes, artifacts, etc. in the dstack UI or CLI.

 

To monitor the run, log in to your dstack.ai account and open it in the dstack UI.

 

 

Artifacts can be browsed through the user interface.

 

Moreover, it is possible to download the contents of artifacts using the dstack CLI with the following command.

 

dstack artifacts download <run-name>

 

Versioning data

 

In order to ensure the reproducibility of the training pipelines, it is crucial to track data too.

With dstack, data artifacts can be versioned by assigning tags, which can later be referenced in other workflows.

 

In the example above, the train workflow depended on the prepare workflow. Each time you ran the train workflow, dstack also ran the prepare workflow and then passed its output artifacts to the train workflow.

 

Now let’s imagine that we’d like to run the prepare workflow independently and then reuse the output artifacts of that particular run in the train workflow.

 

To do that, run the prepare workflow and then assign a tag to it (e.g., through the UI or the CLI).

 

Then, you can refer to this tag from the train workflow:

 

workflows:
  - name: prepare
    provider: python
    script: "prepare.py"
    requirements: "requirements.txt"
    artifacts: ["data"]

  - name: train
    provider: python
    version: "3.9"
    requirements: "requirements.txt"
    script: "train.py"
    depends-on:
      - prepare:latest
    artifacts: ["model"]
    resources:
      gpu: 4

 

By decoupling the workflows of preparing data and training the model, it becomes easier to train the model iteratively and keep every run reproducible.

 

dstack + WandB Configuration

It is possible to seamlessly use dstack and WandB together.

 

To do that, obtain the WandB API key from “Settings” in WandB,

 

and add it to dstack Secrets.

 

You can do this via “Settings” in dstack.ai: click the “Add secret” button and add a secret with the key WANDB_API_KEY and the copied WandB API key as its value.

 

The secret should then appear on the dstack Settings page.

 


 

To use the same run names across dstack and WandB, you can read the RUN_NAME environment variable to get the dstack run name and pass it as the display name to WandbLogger, as shown below:

 

import os

# dstack sets the RUN_NAME environment variable for each run
run_name = os.environ['RUN_NAME']

# log results to a WandB project under the dstack run name
wandb_logger = WandbLogger(name=run_name, project="my-test-project")
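
If you also want to run the same script outside of dstack, where RUN_NAME is not set, you might fall back to a default name (a small assumption on my part, not part of the original post):

import os

# use the dstack run name when available, otherwise a local placeholder
run_name = os.environ.get("RUN_NAME", "local-run")
wandb_logger = WandbLogger(name=run_name, project="my-test-project")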

 

That’s it for now. The source code used in this blog post can be found here.

 

You’re very welcome to share your feedback, ask questions, and, of course, give WandB and dstack a spin yourself.

 

I want to try it. Where do I start?

 
