Examples
Practical examples demonstrating how to use Galadriel for common ML/AI workloads.
Training a 70B LLM
Complete example of training a large language model from start to finish.
1. Purchase GPUs with Cost Optimization
# Check orderbook first
galadriel orderbook --gpu-type h100 --zone us-west-1
# Use limit order to save 15%
galadriel order buy \
--gpu-type h100 \
--gpus 16 \
--duration 72h \
--price 3.20 \
--zone us-west-1
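If you prefer the Python SDK, the same limit order can be sketched with the client.orders.create call shown later on this page; argument names mirror that example and are illustrative.
# SDK version of the CLI order above; argument names follow the
# client.orders.create example later on this page.
from galadriel import Client

client = Client(api_token="your_token")

order = client.orders.create(
    side="buy",
    gpu_type="h100",
    gpu_count=16,
    duration_hours=72,
    limit_price=3.20,
    zone="us-west-1",
)
print(f"Order created: {order.order_id}")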
2. Set Up the Training Environment
# Set kubeconfig
export KUBECONFIG=~/.kube/galadriel-lse_xyz123.conf
# Verify nodes
kubectl get nodes
3. Deploy Training Job
Create llm-training.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-70b-training
spec:
  completions: 2           # one pod per node
  parallelism: 2
  completionMode: Indexed  # gives each pod a stable index and hostname for NODE_RANK
  template:
    spec:
      subdomain: llama-70b-training  # needs a matching headless Service so the master hostname resolves (see the multi-node example below)
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:23.10-py3
        command:
        - torchrun
        - --nproc_per_node=8
        - --nnodes=2
        - --node_rank=$(NODE_RANK)
        - --master_addr=llama-70b-training-0.llama-70b-training
        - --master_port=29500
        - train.py
        - --model=llama-70b
        - --batch-size=4
        - --gradient-checkpointing
        resources:
          limits:
            nvidia.com/gpu: 8
        volumeMounts:
        - name: infiniband
          mountPath: /dev/infiniband
        - name: shm
          mountPath: /dev/shm
        env:
        - name: NODE_RANK  # taken from the Indexed Job's completion index
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        - name: NCCL_IB_DISABLE
          value: "0"
        - name: NCCL_DEBUG
          value: "INFO"
      volumes:
      - name: infiniband
        hostPath:
          path: /dev/infiniband
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
kubectl apply -f llm-training.yaml
kubectl logs -f job/llama-70b-training
4. Monitor Training
# Watch pods
kubectl get pods -w
# Check GPU utilization
kubectl exec -it llama-70b-training-xxx -- nvidia-smi
# View logs
kubectl logs -f llama-70b-training-xxx
5. Finished Early? Resell!
# Model converged after 48h instead of 72h
# Resell remaining 24h
galadriel lease resell lse_xyz123 \
--price 1.20 \
--from "2025-11-13T14:00:00Z"
Cost breakdown:
Purchase: 72h × 16 GPUs × $1.30 = $1,497.60
Used: 48h × 16 GPUs × $1.30 = $998.40
Resold: 24h × 16 GPUs × $1.20 = $460.80
Platform fee (10%): -$46.08
Net recovery: $414.72
Effective cost: $998.40 - $414.72 = $583.68
Effective rate: $0.76/GPU/hr (42% savings!)
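The same arithmetic as a short Python sketch, so you can plug in your own hours and rates (the figures below are the ones from this example):
# Reproduces the breakdown above; rates are per GPU per hour.
gpus = 16
purchase_rate, resell_rate = 1.30, 1.20
purchased_h, used_h = 72, 48
platform_fee = 0.10

resold_h = purchased_h - used_h
used_cost = used_h * gpus * purchase_rate          # 998.40
resale = resold_h * gpus * resell_rate             # 460.80
net_recovery = resale * (1 - platform_fee)         # 414.72
effective_cost = used_cost - net_recovery          # 583.68
effective_rate = effective_cost / (used_h * gpus)  # ~0.76/GPU/hr

print(f"Effective cost: ${effective_cost:.2f} (${effective_rate:.2f}/GPU/hr)")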
Distributed Training with Multiple Nodes
Multi-Node PyTorch Training
# train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Initialize distributed training (torchrun provides RANK, WORLD_SIZE, MASTER_ADDR)
    dist.init_process_group(backend='nccl')

    # Get rank and world size
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Set device
    local_rank = rank % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    print(f"Rank {rank}/{world_size} on GPU {local_rank}")

    # Load model (YourLargeModel is a placeholder for your model class)
    model = YourLargeModel().cuda()
    model = DDP(model, device_ids=[local_rank])

    # Training loop (num_epochs, dataloader and train_step are defined elsewhere)
    for epoch in range(num_epochs):
        for batch in dataloader:
            loss = train_step(model, batch)
        if rank == 0:
            print(f"Epoch {epoch}, Loss: {loss}")

    # Cleanup
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
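The dataloader is left abstract above. One way to build it for DDP, sketched here under the assumption that you already have a map-style dataset object, is to shard it with DistributedSampler so each rank sees a distinct slice of the data:
# Hypothetical data pipeline for the skeleton above; `dataset` is assumed
# to be any map-style torch Dataset you have already defined.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_dataloader(dataset, batch_size=4):
    # Each rank gets a disjoint shard of the dataset
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
    )
    return loader, sampler

# Call sampler.set_epoch(epoch) at the start of each epoch so shuffling
# differs between epochs.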
Kubernetes Job for Multi-Node
apiVersion: v1
kind: Service
metadata:
  name: training-headless
spec:
  clusterIP: None
  selector:
    job-name: distributed-training
  ports:
  - port: 29500
    name: master
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completions: 2           # 2 nodes
  parallelism: 2
  completionMode: Indexed  # pods get stable hostnames distributed-training-0/1
  template:
    spec:
      subdomain: training-headless
      containers:
      - name: trainer
        image: your-training-image:latest
        command:
        - python
        - -m
        - torch.distributed.run
        - --nproc_per_node=8  # 8 GPUs per node
        - --nnodes=2
        - --node_rank=$(NODE_RANK)
        - --master_addr=distributed-training-0.training-headless
        - --master_port=29500
        - train.py
        env:
        - name: NODE_RANK  # Indexed Job completion index
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        resources:
          limits:
            nvidia.com/gpu: 8
        volumeMounts:
        - name: infiniband
          mountPath: /dev/infiniband
      volumes:
      - name: infiniband
        hostPath:
          path: /dev/infiniband
      restartPolicy: OnFailure
High-Throughput Inference
vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 2  # 2 nodes for load balancing
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
        - --model=meta-llama/Llama-2-70b-hf
        - --tensor-parallel-size=8
        - --gpu-memory-utilization=0.95
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 8
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer
Test Inference
# Deploy
kubectl apply -f vllm-deployment.yaml
# Get service IP
kubectl get svc vllm-service
# Test inference
curl http://<service-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-70b-hf",
    "prompt": "Once upon a time",
    "max_tokens": 100,
    "temperature": 0.7
  }'
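Because vLLM exposes an OpenAI-compatible API, the same request can be made from Python. A minimal sketch, assuming the openai package is installed and <service-ip> is the LoadBalancer address (any placeholder API key works unless the server enforces one):
# Query the OpenAI-compatible endpoint exposed by vLLM.
# Replace <service-ip> with the external IP of vllm-service.
from openai import OpenAI

client = OpenAI(base_url="http://<service-ip>:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="meta-llama/Llama-2-70b-hf",
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.7,
)
print(completion.choices[0].text)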
Fine-Tuning with LoRA
Efficient Fine-Tuning Setup
# finetune.py
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Load base model in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 134M || all params: 70B || trainable: 0.19%

# Training loop (training_args, train_dataset and eval_dataset defined elsewhere)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
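After training, only the adapter weights need to be kept. A short sketch of saving and reloading them; the llama-70b-lora directory name is illustrative:
# Save just the LoRA adapter (small), not the full 70B model
trainer.save_model("llama-70b-lora")

# Later, reload the adapter on top of the base model for inference
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "llama-70b-lora")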
Deployment
# Purchase smaller GPU allocation for LoRA
galadriel order buy \
--gpu-type h100 \
--gpus 8 \
--duration 24h \
--price 3.20
# Deploy fine-tuning job
kubectl apply -f lora-finetune.yaml
Cost Optimization Strategies
Strategy 1: Off-Peak Scheduling
from galadriel import Client
from datetime import datetime, timedelta

client = Client(api_token="your_token")

def schedule_offpeak_training():
    # Get price history
    prices = client.prices.history(
        instance_type="8xh100",
        zone="us-west-1",
        start=(datetime.utcnow() - timedelta(days=7)).isoformat(),
        end=datetime.utcnow().isoformat(),
        interval="1h"
    )

    # Find cheapest hour
    cheapest = min(prices, key=lambda p: p.close)
    print(f"Cheapest time: {cheapest.timestamp} at ${cheapest.close}/hr")

    # Schedule for similar time tomorrow
    tomorrow = datetime.fromisoformat(cheapest.timestamp) + timedelta(days=1)

    # Create flexible order
    order = client.orders.create(
        side="buy",
        type="flexible",
        gpu_type="h100",
        gpu_count=16,
        duration_hours=24,
        limit_price=3.00,
        zone="us-west-1",
        start_time=tomorrow.isoformat(),
        flex_window_hours=6  # 6-hour window around target time
    )

    print(f"Order created: {order.order_id}")
    print(f"Target start: {tomorrow}")

schedule_offpeak_training()
Strategy 2: Automatic Reselling
from galadriel import Client
import time

client = Client(api_token="your_token")

def auto_resell_on_completion(lease_id, check_interval=300):
    """Monitor job and automatically resell when complete"""
    while True:
        # Check if training job is complete
        # (implement your own completion check)
        job_complete = check_training_complete()

        if job_complete:
            print("Training complete! Reselling unused time...")

            # Get current market price
            orderbook = client.orderbook.get(
                instance_type="8xh100",
                zone="us-west-1"
            )

            # List at competitive price
            resell_price = orderbook.market_depth.best_ask * 0.95

            # Create resell order
            resell_order = client.leases.resell(
                lease_id=lease_id,
                limit_price=resell_price,
                transfer_immediately=True
            )

            print(f"Resell order created: {resell_order.order_id}")
            print(f"Recovery: ${resell_order.net_recovery}")
            break

        time.sleep(check_interval)

# Usage
auto_resell_on_completion("lse_xyz123")
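check_training_complete is left to you above. One possible sketch uses the official kubernetes Python client, assuming KUBECONFIG points at your Galadriel cluster and the workload is the llama-70b-training Job from earlier:
# Hypothetical completion check using the Kubernetes Python client.
from kubernetes import client as k8s, config

def check_training_complete(job_name="llama-70b-training", namespace="default"):
    config.load_kube_config()
    job = k8s.BatchV1Api().read_namespaced_job_status(job_name, namespace)
    # status.succeeded is the number of successfully completed pods
    return bool(job.status.succeeded)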
Strategy 3: Batch Processing with Limit Orders
#!/bin/bash
# batch-training.sh
# Submit multiple limit orders at different prices
for price in 3.00 3.10 3.20 3.30; do
  galadriel order buy \
    --gpu-type h100 \
    --gpus 8 \
    --duration 12h \
    --price $price \
    --zone us-west-1
done
echo "Submitted 4 limit orders at different price points"
echo "Will execute when market reaches those levels"
Jupyter Notebook on GPUs
Launch JupyterLab
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-gpu
  labels:
    app: jupyter-gpu  # matched by the Service selector below
spec:
  containers:
  - name: jupyter
    image: jupyter/tensorflow-notebook:latest
    command:
    - jupyter
    - lab
    - --ip=0.0.0.0
    - --allow-root
    - --NotebookApp.token=''
    ports:
    - containerPort: 8888
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: workspace
      mountPath: /home/jovyan/work
  volumes:
  - name: workspace
    emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-service
spec:
  selector:
    app: jupyter-gpu
  ports:
  - port: 8888
    targetPort: 8888
  type: LoadBalancer
kubectl apply -f jupyter.yaml
kubectl get svc jupyter-service # Get external IP
# Open http://<external-ip>:8888 in browser
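To confirm the notebook actually sees the GPU, run a quick check in a cell. This assumes the image ships a CUDA-enabled TensorFlow build; install one first if it does not:
# Run in a notebook cell to confirm the GPU is visible.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))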
Ray Cluster for Distributed Computing
Deploy Ray Head
apiVersion: v1
kind: Service
metadata:
  name: ray-head
spec:
  ports:
  - port: 6379
    name: client
  - port: 8265
    name: dashboard
  selector:
    component: ray-head
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-head
spec:
  replicas: 1
  selector:
    matchLabels:
      component: ray-head
  template:
    metadata:
      labels:
        component: ray-head
    spec:
      containers:
      - name: ray-head
        image: rayproject/ray:latest-gpu
        command:
        - ray
        - start
        - --head
        - --port=6379
        - --dashboard-host=0.0.0.0
        - --block  # keep ray in the foreground so the container stays running
        resources:
          limits:
            nvidia.com/gpu: 8
Deploy Ray Workers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-worker
spec:
  replicas: 1  # Scale as needed
  selector:
    matchLabels:
      component: ray-worker
  template:
    metadata:
      labels:
        component: ray-worker
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:latest-gpu
        command:
        - ray
        - start
        - --address=ray-head:6379
        - --block  # keep ray in the foreground so the container stays running
        resources:
          limits:
            nvidia.com/gpu: 8
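To sanity-check the cluster, a small Ray script can be run from the head pod (for example via kubectl exec); a minimal sketch using Ray's public API:
# Run from the head pod, e.g. `kubectl exec -it <ray-head-pod> -- python`.
import ray

ray.init(address="auto")  # attach to the running cluster

@ray.remote(num_gpus=1)
def which_gpu():
    # Returns the GPU IDs Ray assigned to this task
    return ray.get_gpu_ids()

# Schedule a few GPU tasks across the cluster and show total resources
print(ray.get([which_gpu.remote() for _ in range(4)]))
print(ray.cluster_resources())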
Monitoring & Debugging
GPU Monitoring Dashboard
# Install NVIDIA DCGM exporter
kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
# Port-forward to view metrics
kubectl port-forward svc/dcgm-exporter 9400:9400
# View metrics
curl http://localhost:9400/metrics | grep gpu
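A quick way to pull just the GPU-utilization series out of that endpoint, sketched in Python with the requests library (DCGM_FI_DEV_GPU_UTIL is the exporter's per-GPU utilization gauge):
# Print per-GPU utilization from the DCGM exporter's Prometheus endpoint.
import requests

metrics = requests.get("http://localhost:9400/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
        print(line)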
Debug GPU Issues
# Check GPU status
kubectl exec -it <pod-name> -- nvidia-smi
# Check GPU memory
kubectl exec -it <pod-name> -- nvidia-smi --query-gpu=memory.used,memory.free --format=csv
# Check CUDA version
kubectl exec -it <pod-name> -- nvcc --version
# Test GPU performance
kubectl exec -it <pod-name> -- python -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
print(f'Current GPU: {torch.cuda.current_device()}')
print(f'GPU name: {torch.cuda.get_device_name(0)}')
"