
Examples

Practical examples demonstrating how to use Galadriel for common ML/AI workloads.

Training a 70B LLM

Complete example of training a large language model from start to finish.

1. Purchase GPUs with Cost Optimization

# Check orderbook first
galadriel orderbook --gpu-type h100 --zone us-west-1

# Use limit order to save 15%
galadriel order buy \
  --gpu-type h100 \
  --gpus 16 \
  --duration 72h \
  --price 3.20 \
  --zone us-west-1
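
If you prefer the Python SDK used later on this page, the same purchase can be sketched with client.orders.create(); the type="limit" value here is an assumption, so substitute whatever order type your SDK version expects:
from galadriel import Client

client = Client(api_token="your_token")

# Limit buy: 16x H100 for 72 hours in us-west-1 at $3.20/GPU/hr
# (type="limit" is assumed; the call mirrors client.orders.create()
#  as used in the cost-optimization strategies below)
order = client.orders.create(
    side="buy",
    type="limit",
    gpu_type="h100",
    gpu_count=16,
    duration_hours=72,
    limit_price=3.20,
    zone="us-west-1"
)
print(f"Order created: {order.order_id}")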

2. Setup Training Environment

# Set kubeconfig
export KUBECONFIG=~/.kube/galadriel-lse_xyz123.conf

# Verify nodes
kubectl get nodes

3. Deploy Training Job

Create llm-training.yaml. Because torchrun runs with --nnodes=2, the Job uses Indexed completion mode plus a headless Service so the two pods get stable, resolvable names and distinct node ranks:
apiVersion: v1
kind: Service
metadata:
  name: llama-70b-training
spec:
  clusterIP: None  # headless: pods resolve each other as <pod>.llama-70b-training
  selector:
    job-name: llama-70b-training
  ports:
  - port: 29500
    name: master
---
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-70b-training
spec:
  completions: 2        # one pod per node
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      subdomain: llama-70b-training  # must match the headless Service name
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:23.10-py3
        command:
          - torchrun
          - --nproc_per_node=8
          - --nnodes=2
          - --node_rank=$(NODE_RANK)
          - --master_addr=llama-70b-training-0.llama-70b-training
          - --master_port=29500
          - train.py
          - --model=llama-70b
          - --batch-size=4
          - --gradient-checkpointing
        resources:
          limits:
            nvidia.com/gpu: 8
        volumeMounts:
        - name: infiniband
          mountPath: /dev/infiniband
        - name: shm
          mountPath: /dev/shm
        env:
        - name: NODE_RANK  # completion index (0 or 1) set by the Job controller
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        - name: NCCL_IB_DISABLE
          value: "0"
        - name: NCCL_DEBUG
          value: "INFO"
      volumes:
      - name: infiniband
        hostPath:
          path: /dev/infiniband
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
Deploy:
kubectl apply -f llm-training.yaml
kubectl logs -f job/llama-70b-training

4. Monitor Training

# Watch pods
kubectl get pods -w

# Check GPU utilization
kubectl exec -it llama-70b-training-xxx -- nvidia-smi

# View logs
kubectl logs -f llama-70b-training-xxx

5. Finished Early? Resell!

# Model converged after 48h instead of 72h
# Resell remaining 24h

galadriel lease resell lse_xyz123 \
  --price 1.20 \
  --from "2025-11-13T14:00:00Z"
Cost Breakdown:
Purchase: 72h × 16 GPUs × $1.30 = $1,497.60
Used: 48h × 16 GPUs × $1.30 = $998.40
Resold: 24h × 16 GPUs × $1.20 = $460.80
Platform fee (10%): -$46.08
Net recovery: $414.72

Effective cost: $1,497.60 - $414.72 = $1,082.88
Effective rate: $1,082.88 / (48h × 16 GPUs) = $1.41/GPU/hr, vs. $1.95/GPU/hr if the unused 24h had simply gone to waste (~28% savings)
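
A quick sanity check of the arithmetic above in plain Python (no SDK required):
# Recompute the resell economics (all prices are per GPU-hour)
gpus = 16
buy_price, resell_price, platform_fee = 1.30, 1.20, 0.10
purchased_h, used_h = 72, 48

purchase_cost = purchased_h * gpus * buy_price               # 1497.60
resale_gross = (purchased_h - used_h) * gpus * resell_price  # 460.80
net_recovery = resale_gross * (1 - platform_fee)             # 414.72
effective_cost = purchase_cost - net_recovery                # 1082.88
effective_rate = effective_cost / (used_h * gpus)            # ~1.41 $/GPU/hr
rate_no_resell = purchase_cost / (used_h * gpus)             # ~1.95 $/GPU/hr

print(f"Effective cost: ${effective_cost:,.2f}")
print(f"Effective rate: ${effective_rate:.2f}/GPU/hr "
      f"(vs ${rate_no_resell:.2f}/GPU/hr without reselling)")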

Distributed Training with Multiple Nodes

Multi-Node PyTorch Training

# train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Initialize distributed training (torchrun sets RANK, WORLD_SIZE, LOCAL_RANK)
    dist.init_process_group(backend='nccl')

    # Get rank and world size
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Bind this process to its GPU using the LOCAL_RANK set by torchrun
    local_rank = int(os.environ.get("LOCAL_RANK", rank % torch.cuda.device_count()))
    torch.cuda.set_device(local_rank)

    print(f"Rank {rank}/{world_size} on GPU {local_rank}")

    # Load model (YourLargeModel, dataloader, train_step and num_epochs
    # are placeholders for your own code)
    model = YourLargeModel().cuda()
    model = DDP(model, device_ids=[local_rank])

    # Training loop
    for epoch in range(num_epochs):
        for batch in dataloader:
            loss = train_step(model, batch)

            if rank == 0:
                print(f"Epoch {epoch}, Loss: {loss}")

    # Cleanup
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
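
The dataloader placeholder above should shard data across ranks; a minimal sketch using PyTorch's DistributedSampler (YourDataset stands in for your own dataset class):
from torch.utils.data import DataLoader, DistributedSampler

# Each rank reads a disjoint shard of the dataset
dataset = YourDataset()  # placeholder
sampler = DistributedSampler(dataset, shuffle=True)
dataloader = DataLoader(dataset, batch_size=4, sampler=sampler,
                        num_workers=4, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles differently each epoch
    for batch in dataloader:
        ...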

Kubernetes Job for Multi-Node

apiVersion: v1
kind: Service
metadata:
  name: training-headless
spec:
  clusterIP: None
  selector:
    job-name: distributed-training
  ports:
  - port: 29500
    name: master
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completions: 2  # 2 nodes
  parallelism: 2
  completionMode: Indexed  # gives pods stable hostnames: distributed-training-0, -1
  template:
    spec:
      subdomain: training-headless  # must match the headless Service above
      containers:
      - name: trainer
        image: your-training-image:latest
        command:
          - python
          - -m
          - torch.distributed.run
          - --nproc_per_node=8  # 8 GPUs per node
          - --nnodes=2
          - --node_rank=$(NODE_RANK)
          - --master_addr=distributed-training-0.training-headless
          - --master_port=29500
          - train.py
        env:
        - name: NODE_RANK  # completion index (0 or 1) set by the Job controller
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        resources:
          limits:
            nvidia.com/gpu: 8
        volumeMounts:
        - name: infiniband
          mountPath: /dev/infiniband
      volumes:
      - name: infiniband
        hostPath:
          path: /dev/infiniband
      restartPolicy: OnFailure

High-Throughput Inference

vLLM Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 2  # 2 nodes for load balancing
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
          - python
          - -m
          - vllm.entrypoints.openai.api_server
          - --model=meta-llama/Llama-2-70b-hf
          - --tensor-parallel-size=8
          - --gpu-memory-utilization=0.95
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 8
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer

Test Inference

# Deploy
kubectl apply -f vllm-deployment.yaml

# Get service IP
kubectl get svc vllm-service

# Test inference
curl http://<service-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-70b-hf",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.7
  }'
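
Because vLLM serves an OpenAI-compatible API, the same request works through the official openai Python client; replace <service-ip> with the LoadBalancer address (the api_key can be any placeholder unless you configured one):
from openai import OpenAI

# Point the OpenAI client at the vLLM service instead of api.openai.com
client = OpenAI(base_url="http://<service-ip>:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-70b-hf",
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.7
)
print(completion.choices[0].text)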

Fine-Tuning with LoRA

Efficient Fine-Tuning Setup

# finetune.py
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Load base model in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

# Prepare for training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable parameter count -- only a small fraction of the 70B total

# Training loop (train_dataset / eval_dataset are your tokenized datasets)
training_args = TrainingArguments(
    output_dir="./lora-llama-70b",
    per_device_train_batch_size=1,
    num_train_epochs=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
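
After training, only the small LoRA adapter needs to be saved; it can later be attached to the base model for inference. A minimal sketch (the directory name is arbitrary):
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save just the adapter weights (tiny compared to the full 70B checkpoint)
model.save_pretrained("./lora-llama-70b-adapter")

# Later: reload the base model and attach the adapter
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", device_map="auto", torch_dtype="auto"
)
model = PeftModel.from_pretrained(base, "./lora-llama-70b-adapter")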

Deployment

# Purchase smaller GPU allocation for LoRA
galadriel order buy \
  --gpu-type h100 \
  --gpus 8 \
  --duration 24h \
  --price 3.20

# Deploy fine-tuning job
kubectl apply -f lora-finetune.yaml

Cost Optimization Strategies

Strategy 1: Off-Peak Scheduling

from galadriel import Client
from datetime import datetime, timedelta

client = Client(api_token="your_token")

def schedule_offpeak_training():
    # Get price history
    prices = client.prices.history(
        instance_type="8xh100",
        zone="us-west-1",
        start=(datetime.utcnow() - timedelta(days=7)).isoformat(),
        end=datetime.utcnow().isoformat(),
        interval="1h"
    )

    # Find cheapest hour
    cheapest = min(prices, key=lambda p: p.close)
    print(f"Cheapest time: {cheapest.timestamp} at ${cheapest.close}/hr")

    # Schedule for similar time tomorrow
    tomorrow = datetime.fromisoformat(cheapest.timestamp) + timedelta(days=1)

    # Create flexible order
    order = client.orders.create(
        side="buy",
        type="flexible",
        gpu_type="h100",
        gpu_count=16,
        duration_hours=24,
        limit_price=3.00,
        zone="us-west-1",
        start_time=tomorrow.isoformat(),
        flex_window_hours=6  # 6-hour window around target time
    )

    print(f"Order created: {order.order_id}")
    print(f"Target start: {tomorrow}")

schedule_offpeak_training()

Strategy 2: Automatic Reselling

from galadriel import Client
import time

client = Client(api_token="your_token")

def auto_resell_on_completion(lease_id, check_interval=300):
    """Monitor job and automatically resell when complete"""

    while True:
        # Check if training job is complete
        # (implement your own completion check)
        job_complete = check_training_complete()

        if job_complete:
            print("Training complete! Reselling unused time...")

            # Get current market price
            orderbook = client.orderbook.get(
                instance_type="8xh100",
                zone="us-west-1"
            )

            # List at competitive price
            resell_price = orderbook.market_depth.best_ask * 0.95

            # Create resell order
            resell_order = client.leases.resell(
                lease_id=lease_id,
                limit_price=resell_price,
                transfer_immediately=True
            )

            print(f"Resell order created: {resell_order.order_id}")
            print(f"Recovery: ${resell_order.net_recovery}")
            break

        time.sleep(check_interval)

# Usage
auto_resell_on_completion("lse_xyz123")

Strategy 3: Batch Processing with Limit Orders

#!/bin/bash
# batch-training.sh

# Submit multiple limit orders at different prices
for price in 3.00 3.10 3.20 3.30; do
  galadriel order buy \
    --gpu-type h100 \
    --gpus 8 \
    --duration 12h \
    --price $price \
    --zone us-west-1
done

echo "Submitted 4 limit orders at different price points"
echo "Will execute when market reaches those levels"

Jupyter Notebook on GPUs

Launch JupyterLab

apiVersion: v1
kind: Pod
metadata:
  name: jupyter-gpu
  labels:
    app: jupyter-gpu  # matched by the jupyter-service selector below
spec:
  containers:
  - name: jupyter
    image: jupyter/tensorflow-notebook:latest
    command:
      - jupyter
      - lab
      - --ip=0.0.0.0
      - --allow-root
      - --NotebookApp.token=''
    ports:
    - containerPort: 8888
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: workspace
      mountPath: /home/jovyan/work
  volumes:
  - name: workspace
    emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-service
spec:
  selector:
    app: jupyter-gpu
  ports:
  - port: 8888
    targetPort: 8888
  type: LoadBalancer
Deploy and access:
kubectl apply -f jupyter.yaml
kubectl get svc jupyter-service  # Get external IP
# Open http://<external-ip>:8888 in browser
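
Once connected, run a quick cell to confirm the notebook can see the GPU. This uses TensorFlow to match the tensorflow-notebook image; note that GPU visibility also requires a CUDA-enabled TensorFlow build in the image:
import tensorflow as tf

# Lists the physical GPU devices visible to TensorFlow inside the pod
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible: {len(gpus)}")
for gpu in gpus:
    print(gpu)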

Ray Cluster for Distributed Computing

Deploy Ray Head

apiVersion: v1
kind: Service
metadata:
  name: ray-head
spec:
  ports:
  - port: 6379
    name: gcs
  - port: 8265
    name: dashboard
  selector:
    component: ray-head
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-head
spec:
  replicas: 1
  selector:
    matchLabels:
      component: ray-head
  template:
    metadata:
      labels:
        component: ray-head
    spec:
      containers:
      - name: ray-head
        image: rayproject/ray:latest-gpu
        command:
          - ray
          - start
          - --head
          - --port=6379
          - --dashboard-host=0.0.0.0
          - --block  # keep Ray in the foreground so the pod stays running
        resources:
          limits:
            nvidia.com/gpu: 8

Deploy Ray Workers

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-worker
spec:
  replicas: 1  # Scale as needed
  selector:
    matchLabels:
      component: ray-worker
  template:
    metadata:
      labels:
        component: ray-worker
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:latest-gpu
        command:
          - ray
          - start
          - --address=ray-head:6379
          - --block  # keep Ray in the foreground so the pod stays running
        resources:
          limits:
            nvidia.com/gpu: 8
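
With the head and workers up, any pod in the cluster can connect and schedule GPU work; a minimal sketch (run from a pod that has the ray Python package installed):
import ray

# Connect to the existing cluster via the head's GCS address
ray.init(address="ray-head:6379")

@ray.remote(num_gpus=1)
def gpu_task():
    # ray.get_gpu_ids() reports the GPU(s) assigned to this task
    return ray.get_gpu_ids()

print(ray.get(gpu_task.remote()))  # e.g. [0]
print(ray.cluster_resources())     # total CPUs/GPUs across head + workers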

Monitoring & Debugging

GPU Monitoring Dashboard

# Install NVIDIA DCGM exporter
kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

# Port-forward to view metrics
kubectl port-forward svc/dcgm-exporter 9400:9400

# View metrics
curl http://localhost:9400/metrics | grep gpu
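
While the port-forward is active, utilization can also be pulled programmatically; this assumes the standard DCGM_FI_DEV_GPU_UTIL metric exported by dcgm-exporter:
import requests

# Scrape the Prometheus-format metrics exposed by dcgm-exporter
resp = requests.get("http://localhost:9400/metrics", timeout=10)
for line in resp.text.splitlines():
    # DCGM_FI_DEV_GPU_UTIL reports per-GPU utilization in percent
    if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
        print(line)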

Debug GPU Issues

# Check GPU status
kubectl exec -it <pod-name> -- nvidia-smi

# Check GPU memory
kubectl exec -it <pod-name> -- nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# Check CUDA version
kubectl exec -it <pod-name> -- nvcc --version

# Test GPU performance
kubectl exec -it <pod-name> -- python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
print(f'Current GPU: {torch.cuda.current_device()}')
print(f'GPU name: {torch.cuda.get_device_name(0)}')
"

Next Steps