
Weights & Biases Experiment Tracking

Weights & Biases (W&B) Experiment Tracking provides a centralized, collaborative platform for logging, visualizing, and comparing machine learning experiments. It enables developers to systematically track model training, evaluate performance, and debug issues across various iterations and configurations. The primary purpose is to bring clarity and reproducibility to the iterative process of machine learning development, moving beyond ad-hoc logging to a structured, shareable system.

Core Capabilities

The platform offers a robust set of features designed to streamline the ML development lifecycle:

  • Run Tracking: Automatically logs hyperparameters, output metrics, system statistics (CPU, GPU, memory usage), and custom data for each experiment run. This creates a comprehensive record of every training session.
  • Interactive Visualizations: Provides dynamic dashboards to visualize logged data. Developers can create custom plots, compare multiple runs side-by-side, and analyze trends in metrics, loss curves, and data distributions.
  • Artifact Management: Facilitates versioning and tracking of datasets, models, and other arbitrary files used or generated during experiments. This ensures reproducibility and provides a clear lineage for all assets.
  • Hyperparameter Optimization (Sweeps): Automates the process of finding optimal hyperparameters. Developers define a search space and strategy (e.g., grid search, random search, Bayesian optimization), and W&B orchestrates the execution of multiple runs with different configurations.
  • Reports: Enables the creation of interactive, shareable reports directly from experiment data. These reports combine visualizations, code snippets, and markdown text to document findings and share insights with teams.
  • Model Registry: Manages the lifecycle of machine learning models, allowing for versioning, tagging, and promotion of models through different stages (e.g., staging, production).
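
As a quick illustration of the registry workflow, the sketch below logs a trained model file as a versioned artifact and links it into a registered-model collection so it can be tagged and promoted. The file path, collection name, and target-path format are assumptions for illustration, not fixed W&B conventions.

import wandb

run = wandb.init(project="my-ml-project")

# Log a trained model file as a versioned "model" artifact
model_artifact = wandb.Artifact(name="my-model", type="model")
model_artifact.add_file("model_checkpoint.pth")  # assumed local checkpoint path
run.log_artifact(model_artifact)

# Link the new version into a registered-model collection; the target path
# here assumes a collection named "my-registered-model" in the model registry
run.link_artifact(model_artifact, "model-registry/my-registered-model")

run.finish()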

Common Use Cases

Developers leverage W&B Experiment Tracking in various scenarios:

  • Hyperparameter Tuning: Systematically explores different hyperparameter combinations to identify the best performing model configurations.
  • Model Comparison and Selection: Evaluates and compares multiple model architectures, training strategies, or data preprocessing techniques to select the most effective solution.
  • Debugging and Analysis: Pinpoints issues in training by visualizing loss curves, gradient flows, and other internal model states, helping to diagnose problems like overfitting or underfitting.
  • Reproducibility: Ensures that any experiment can be recreated exactly as it was run, including the code, data, dependencies, and environment.
  • Team Collaboration: Shares experiment results, insights, and reports with team members, fostering efficient collaboration and knowledge transfer.
  • Production Monitoring: Integrates with deployed models to track their performance in real-time, logging predictions, actuals, and drift metrics.

Getting Started and Basic Tracking

Integrating W&B into a project typically begins with initializing a run and then logging relevant information.

Initialization

The init function establishes a connection to the W&B service and creates a new experiment run. It accepts parameters like project (to group runs), entity (for team collaboration), and config (for initial hyperparameters).

import wandb

# Initialize a new run
wandb.init(project="my-ml-project", entity="my-team", config={
    "learning_rate": 0.01,
    "epochs": 10,
    "batch_size": 32
})

# Access configuration later
config = wandb.config
print(f"Learning rate: {config.learning_rate}")

Logging Metrics and Hyperparameters

During training, the log function records metrics, images, plots, and other data. It accepts a dictionary where keys are metric names and values are their current states. The config object, accessible via wandb.config, stores static hyperparameters for the run.

# Simulate a training loop
for epoch in range(config.epochs):
    # ... perform training step ...
    loss = 0.5 - epoch * 0.01      # Example loss
    accuracy = 0.7 + epoch * 0.02  # Example accuracy

    # Log metrics for the current epoch
    wandb.log({"loss": loss, "accuracy": accuracy, "epoch": epoch})

# Log a final metric or summary
wandb.log({"final_accuracy": accuracy})

Tracking Model Internals and System Metrics

System utilization (CPU, GPU, memory) is recorded automatically for every run. In addition, the watch function integrates with popular deep learning frameworks to log gradients and model parameters during training, giving visibility into how the model itself evolves.

import torch
import torch.nn as nn

# Define a simple model
model = nn.Linear(10, 1)

# Watch the model for gradients and parameters
# 'log' can be "gradients", "parameters", or "all"
# 'log_freq' specifies how often to log (in steps)
wandb.watch(model, log="all", log_freq=10)

# ... continue with training loop ...

Advanced Tracking and Artifact Management

Beyond basic metrics, W&B provides robust tools for managing complex data and models.

Model and Data Versioning

The Artifact class manages datasets, models, and other files as versioned objects. This ensures that experiments are reproducible and provides a clear lineage for all assets. Artifacts can be logged, downloaded, and linked across different runs and projects.

# Create an artifact
artifact = wandb.Artifact(name="my-dataset", type="dataset")
artifact.add_dir("data/processed_data") # Add a directory
artifact.add_file("data/metadata.json") # Add a single file

# Log the artifact to the current run
wandb.log_artifact(artifact)

# To use the artifact in a later run:
# run = wandb.init(project="my-ml-project")
# artifact = run.use_artifact('my-dataset:v0', type='dataset')
# artifact_dir = artifact.download()
# print(f"Dataset downloaded to: {artifact_dir}")

Saving Files and Checkpoints

The save function uploads files from the local run directory to the W&B run page. This is particularly useful for saving model checkpoints, configuration files, or custom plots.

# Save a model checkpoint
torch.save(model.state_dict(), "model_checkpoint.pth")
wandb.save("model_checkpoint.pth")

# Save a custom plot
import matplotlib.pyplot as plt
plt.plot([1,2,3])
plt.savefig("my_plot.png")
wandb.log({"custom_plot": wandb.Image("my_plot.png")})

Hyperparameter Optimization with Sweeps

W&B Sweeps automate the process of hyperparameter tuning, allowing developers to efficiently explore the parameter space.

Defining a Sweep Configuration

A sweep is defined by a configuration dictionary that specifies the search strategy (e.g., grid, random, bayes), the metric to optimize (metric), and the parameter space (parameters).

sweep_config = {
    'method': 'random',  # or 'grid', 'bayes'
    'metric': {
        'name': 'val_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {
            'min': 0.0001,
            'max': 0.1
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        },
        'epochs': {
            'value': 10
        }
    }
}

# Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project="my-ml-project")
print(f"Sweep ID: {sweep_id}")

Running a Sweep Agent

After defining a sweep, an agent runs the training function multiple times, each time with a new set of hyperparameters suggested by the sweep controller. The training function calls wandb.init(), and the agent injects the selected parameters into wandb.config; any config passed explicitly acts only as defaults that the sweep values override.

def train():
    # Initialize a new run for each sweep iteration;
    # the config is populated by the sweep agent
    wandb.init()
    config = wandb.config

    # Use config parameters in your training logic
    print(f"Training with LR: {config.learning_rate}, Optimizer: {config.optimizer}")

    # Simulate training
    for epoch in range(config.epochs):
        val_accuracy = 0.5 + (config.learning_rate * 10) + (epoch * 0.01)  # Example
        wandb.log({"val_accuracy": val_accuracy, "epoch": epoch})

# Run the sweep agent, typically in a separate terminal or process:
#   wandb agent <sweep_id>
# or programmatically from Python:
#   wandb.agent(sweep_id, function=train, count=10)

Integration Patterns

W&B integrates seamlessly with popular machine learning frameworks, often requiring minimal code changes.

  • PyTorch: Use wandb.watch() for model introspection and wandb.log() within the training loop.
  • TensorFlow/Keras: Integrate with Keras Callbacks for automatic logging of metrics, gradients, and model checkpoints.
  • Scikit-learn: Log hyperparameters and final evaluation metrics after model training (see the sketch after the Keras example below).
  • Hugging Face Transformers: The Trainer class offers direct integration, automatically logging metrics and model checkpoints.

# Example with a Keras callback
import tensorflow as tf
from tensorflow import keras
from wandb.keras import WandbCallback

model = keras.Sequential([
    keras.layers.Dense(10, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

wandb.init(project="keras-example")
# x_train, y_train, x_test, y_test are assumed to be loaded elsewhere
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test),
          callbacks=[WandbCallback()])
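
For scikit-learn, which has no callback hook, the pattern referenced above is simply to record hyperparameters up front and final evaluation metrics after fitting. The sketch below assumes an illustrative random-forest classifier on a toy dataset; only the wandb calls reflect the actual tracking API.

import wandb
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hyperparameters are stored in the run config so they appear in the W&B UI
run = wandb.init(project="sklearn-example", config={"n_estimators": 100, "max_depth": 4})
config = run.config

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=config.n_estimators, max_depth=config.max_depth)
clf.fit(X_train, y_train)

# Log the final evaluation metric for this run
run.log({"test_accuracy": accuracy_score(y_test, clf.predict(X_test))})
run.finish()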

Best Practices and Considerations

Structuring Runs

Organize experiments logically using projects and groups. A project groups related experiments, while a group within a project can denote a specific model architecture or dataset version. This improves navigability in the W&B UI.

wandb.init(project="image-classification", group="resnet50-v2", name="run-001-baseline")

Performance Implications

While W&B is optimized for minimal overhead, frequent logging of large objects (e.g., high-resolution images, large tensors) can introduce latency. Log critical metrics at appropriate intervals (e.g., per epoch, or every N steps) rather than every single step if performance is a concern. The log_freq parameter in wandb.watch helps manage this for model tracking.
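
A common pattern, sketched below with an arbitrarily chosen interval and a placeholder loss, is to gate logging on the step counter and to buffer related values into a single logged step with commit=False:

import math
import wandb

total_steps = 1000        # illustrative values
log_every_n_steps = 50    # tune to your workload

run = wandb.init(project="logging-frequency-example")

for step in range(total_steps):
    loss = math.exp(-step / total_steps)  # placeholder for a real training loss
    if step % log_every_n_steps == 0:
        # Buffer this value without advancing the logged step...
        run.log({"loss": loss}, commit=False)
        # ...then commit everything for this step in a single call
        run.log({"step": step})

run.finish()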

Security and Privacy

For sensitive data, consider using W&B's self-hosted solution (W&B Local) to keep all data within your private infrastructure. When using the cloud service, ensure that no personally identifiable information (PII) or highly sensitive data is logged directly. Artifacts provide a secure way to manage data versions, but the content itself should be handled with care.

Reproducibility

To ensure full reproducibility, always log the exact code version (e.g., Git commit hash), environment dependencies (e.g., requirements.txt), and data artifacts used for each run. W&B automatically captures some environment details, but explicit logging of these elements is crucial.
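
One way to capture these pieces explicitly is sketched below; the file paths are assumptions about a typical project layout.

import wandb

run = wandb.init(project="my-ml-project")

# Snapshot the Python source files under the current directory as a code artifact
run.log_code(root=".")

# Upload the dependency list alongside the run files (assumes the file exists)
wandb.save("requirements.txt")

# Version the exact training data used for this run
data_artifact = wandb.Artifact(name="training-data", type="dataset")
data_artifact.add_dir("data/processed_data")  # path reused from the artifact example above
run.log_artifact(data_artifact)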