Cluster Specs

Galadriel offers the latest NVIDIA datacenter GPUs for AI/ML workloads. All clusters come with full node exclusivity, native InfiniBand support, and bare-metal Kubernetes access.

Example Cluster Specifications

The specifications below are examples of available cluster configurations. Actual specifications may vary based on availability and location.
| Specification | B300 | B200 | H200 |
| --- | --- | --- | --- |
| Operating System | Ubuntu 22.04/24.04 | Ubuntu 22.04/24.04 | Ubuntu 22.04/24.04 |
| Processor | 2x Intel Xeon 6787P, 86 cores, 2.0 GHz | 2x Intel Xeon Platinum 8568Y+, 48 cores, 2.3 GHz | 2x Intel Xeon Platinum 8568Y+, 48 cores, 2.3 GHz |
| GPU | 8x ThinkSystem NVIDIA B300 GPU 288GB | 8x ThinkSystem NVIDIA HGX B200 180GB | 8x ThinkSystem NVIDIA H200 141GB |
| Memory | 4TB Total: 32x ThinkSystem DDR5 6400MHz | 2TB Total: 32x ThinkSystem 64GB TruDDR5 5600MHz | 2TB Total: 32x ThinkSystem 32GB DDR5 5600MHz |
| OS drives | 1.92TB Total: 2x ThinkSystem M.2 960GB SSD | 1.92TB Total: 2x ThinkSystem M.2 PM9A3 960GB SSD | 1.92TB Total: 2x Samsung 960GB M.2 SSD |
| Data drives | 30.72TB Total: 8x ThinkSystem 2.5" 3.84TB NVMe | 30.72TB Total: 8x ThinkSystem 2.5" U.3 PM1743 3.84TB SSD | 30.72TB Total: 8x Samsung 3.84TB U.2 SSD |
| Power | 8x ThinkSystem 3200W PSU, N+N | 6x ThinkSystem 2600W PSU, N+N | 8x ThinkSystem 2400W PSU, N+N |
| Standard networking | ThinkSystem NVIDIA BlueField-3 B3240 DPU, dual-port 400Gb/s | ThinkSystem NVIDIA BlueField-3 B3220 2-port 200G | ThinkSystem NVIDIA ConnectX-7 NDR200/HDR 2-port 200G |
| Additional networking | 8x OSFP 800Gb/s ports via ConnectX-8 | 8x NVIDIA BlueField-3 B3140H SuperNIC 1-port 400 GbE | 8x ThinkSystem NVIDIA ConnectX-7 NDR400 1-port 400G |
| Lenovo Type Model | ThinkSystem SR680a V4 | ThinkSystem SR780a V3 | ThinkSystem SR680a V3 |
| Data center locations | Charlotte, NC | Charlotte, NC | Raleigh, NC; Dallas/Ft. Worth, TX |

Available GPU Clusters

H100 (80GB)

Proven workhorse. Best price/performance for most workloads.

H200 (141GB)

Same Hopper architecture as the H100, with higher memory capacity (141GB vs 80GB).

B200

Blackwell architecture. Current generation, with up to 2.3x faster training than the H100.

B300

Blackwell high-end. Maximum performance available.

Instance Types

GPUs are allocated in full-node increments (8 GPUs per node):
| Instance Type | GPUs | Nodes | Use Case |
| --- | --- | --- | --- |
| 8xh100 | 8 | 1 | Single-node training |
| 16xh100 | 16 | 2 | Multi-node training |
| 24xh100 | 24 | 3 | Large-scale training |
| 32xh100 | 32 | 4 | Distributed training |
| 8xh200 | 8 | 1 | Memory-intensive single-node |
| 16xh200 | 16 | 2 | Memory-intensive multi-node |
| 8xb200 | 8 | 1 | Frontier model training |
| 16xb200 | 16 | 2 | Large frontier models |
| 8xb300 | 8 | 1 | Maximum performance |
Minimum allocation: 8 GPUs (1 full node). This ensures you get full node exclusivity and optimal InfiniBand performance.
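
For example, to preview what a 16xh100 instance (2 full nodes) would cost before placing an order, you can use the price-estimate command described under Availability below:
# Estimate pricing for a 16-GPU (2-node) H100 allocation
galadriel prices estimate --gpu-type h100 --gpus 16 --zone us-west-1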

System & Access

Storage

Each node includes a 2TB NVMe SSD for local storage.

Directory Structure:
/mnt/local/              # 2TB NVMe scratch space (shared on node)
/mnt/customers/          # Customer data directories
  └─ {customer_id}/
     └─ {lease_id}/      # Your lease-specific directory
        ├─ workspace/    # Mounted as /workspace in pods
        └─ tmp/          # Temporary files
Key Details:
  • Per-Node Storage: Each node in multi-node setups gets its own 2TB NVMe
  • Access: Storage mounted as /workspace in your Kubernetes pods
  • Persistence: Ephemeral - data is deleted when your lease expires
  • Backup: Save important data before lease termination (no automatic backups)
Example: Saving Your Data
# Copy results before lease expires
kubectl cp my-pod:/workspace/results ./local-results

# Or use your own S3/cloud storage from within pods
aws s3 sync /workspace/checkpoints s3://my-bucket/checkpoints
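
If you checkpoint during long runs, a simple sync loop keeps copies off the node as you go. A minimal sketch, assuming AWS credentials are configured inside the pod and s3://my-bucket is your own bucket:
# Sync checkpoints to your own bucket every 10 minutes while training runs
while true; do
  aws s3 sync /workspace/checkpoints s3://my-bucket/checkpoints --only-show-errors
  sleep 600
done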

Operating System

Base System:
  • OS: Ubuntu 22.04 LTS
  • Kernel: Linux 5.15+
  • Container Runtime: containerd 1.7+
GPU Stack:
  • NVIDIA Driver: 535+
  • CUDA Toolkit: 12.3.0
  • NVIDIA Container Toolkit: Pre-configured
  • GPU Operator: Managed by Galadriel
Network Stack:
  • InfiniBand: Mellanox OFED 5.8+
  • RDMA: GPUDirect RDMA enabled
  • Kubernetes: v1.30+
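
A quick way to confirm the stack from inside a pod (a sketch; some of these tools may not be present in every container image):
# Driver version and visible GPUs
nvidia-smi

# CUDA toolkit version (if the toolkit is installed in your image)
nvcc --version

# InfiniBand link state and rate (requires InfiniBand userspace tools)
ibstat

# Kubernetes client version
kubectl version --client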

Access Model

Galadriel uses bare-metal Kubernetes, the industry standard for GPU clouds (the same model used by CoreWeave and Lambda Labs).

Architecture:
Your Workload (Container)
        ↓
Kubernetes Pod (privileged)
        ↓
Bare Metal Node (no hypervisor)
        ↓
Physical Hardware (GPUs)
What This Means:
  • 0-2% overhead vs 5-10% for VMs (no hypervisor layer)
  • Direct GPU access through CUDA in your containers
  • Full node exclusivity - no noisy neighbors
  • InfiniBand/RDMA for multi-node communication
  • Privileged containers with elevated permissions
Access Methods:
  1. kubectl - Deploy and manage workloads via Kubernetes
  2. SSH - Direct shell access to your pod environment
  3. Kubeconfig - Standard Kubernetes API access
Example: SSH Access
# SSH into your pod
ssh [email protected]

# Check GPUs
nvidia-smi

# Your storage is mounted at /workspace
ls /workspace
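
Example: Deploying via kubectl
A minimal sketch of a pod that claims all 8 GPUs on a node and prints nvidia-smi output. The pod name and image are placeholders; /workspace is mounted for you by the platform, so no volume spec is shown.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/pytorch:24.04-py3   # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 8
EOF

# Follow the output
kubectl logs -f pod/gpu-smoke-test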

What You Can Do

Within your privileged pods, you can:
  • ✅ Install any software packages (apt, pip, conda)
  • ✅ Run CUDA applications and ML frameworks
  • ✅ Access all GPUs directly via nvidia-smi
  • ✅ Use InfiniBand for multi-node RDMA
  • ✅ Modify files and environment in your container
  • ✅ Run docker-in-docker for custom workflows
  • ✅ Deploy any containerized application
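
For example, from a shell inside your pod (package names and the CUDA wheel index are illustrative; install whatever your workflow needs):
# Install system and Python packages
apt-get update && apt-get install -y tmux htop
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Confirm the framework sees every GPU on the node
python -c "import torch; print(torch.cuda.device_count(), 'GPUs visible')"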

Limitations

Due to the Kubernetes security model:
  • ❌ Cannot load custom kernel modules
  • ❌ Cannot modify host system networking
  • ❌ Cannot access other customers’ data (node exclusivity ensures isolation)
  • ❌ Cannot reboot or modify the physical host
These limitations are industry standard for GPU clouds. If you need capabilities like custom kernel modules, please contact [email protected] to discuss dedicated hardware options.

Network & Interconnect

All clusters come with:

InfiniBand/RDMA

  • Bandwidth: 400 Gbps per node
  • Latency: Under 2μs GPU-to-GPU
  • GPUDirect RDMA: Enabled
  • Topology: Fat-tree, non-blocking

NVLink

  • H100/H200: 900 GB/s per GPU (4th-gen NVLink)
  • B200/B300: 1.8 TB/s per GPU (5th-gen NVLink)
  • GPU-to-GPU: Full mesh within the node
This allows near-linear scaling for multi-GPU and multi-node training.
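
As a sketch of what a two-node launch can look like (train.py, the master address, and the NCCL settings are illustrative; tune them for your own job):
# Run on each of the two nodes; MASTER_ADDR is the first node's reachable IP
export NCCL_IB_HCA=mlx5            # route NCCL traffic over the InfiniBand adapters
export NCCL_NET_GDR_LEVEL=SYS      # allow GPUDirect RDMA where the topology permits

torchrun \
  --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
  train.py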

Performance Characteristics

Training Throughput (Relative to H100)

| GPU Type | FP16/BF16 | FP8 | TF32 | Effective Speed |
| --- | --- | --- | --- | --- |
| H100 | 1.0x | N/A | 1.0x | Baseline |
| H200 | 1.0x | N/A | 1.0x | Same as H100 |
| B200 | 2.3x | 4.5x | 2.3x | 2.3-4.5x faster |
| B300 | 3.0x | 6.0x | 3.0x | 3-6x faster |
B200/B300 offer massive speedups when using FP8 precision. Most modern frameworks (PyTorch, JAX) support FP8 training with minimal code changes.
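
As a rough sketch of FP8 in PyTorch via NVIDIA Transformer Engine (assumes the transformer-engine package is available in your image and a Hopper or Blackwell GPU; layer sizes are illustrative):
python - <<'EOF'
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 recipe with delayed scaling
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda")

# The GEMMs inside this context run in FP8
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
print("FP8 forward/backward OK")
EOF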

Memory Capacity vs Bandwidth

H100:  80GB  @ 3.35 TB/s = 42 GB/s per GB
H200: 141GB  @ 4.80 TB/s = 34 GB/s per GB
B200: 192GB  @ 8.00 TB/s = 42 GB/s per GB
B300: 256GB  @ 10.0 TB/s = 39 GB/s per GB
Insight: All four GPUs sit in a similar bandwidth-per-GB range (roughly 34-42 GB/s per GB), so per-byte memory performance stays consistent as capacity grows.

Pricing Comparison

Prices are determined by marketplace supply and demand. Check current prices using the orderbook:
# H100 pricing
galadriel orderbook --gpu-type h100 --zone us-west-1

# H200 pricing
galadriel orderbook --gpu-type h200 --zone us-west-1

# B200 pricing
galadriel orderbook --gpu-type b200 --zone us-west-1

# B300 pricing
galadriel orderbook --gpu-type b300 --zone us-west-1
Example H100 Orderbook:
Market Depth:
  Best Bid:  $1.30/GPU/hr  (24 GPUs)
  Best Ask:  $1.40/GPU/hr  (16 GPUs)
  Spread:    $0.10
Example B200 Orderbook:
Market Depth:
  Best Bid:  $3.10/GPU/hr  (16 GPUs)
  Best Ask:  $3.20/GPU/hr  (8 GPUs)
  Spread:    $0.10
Prices fluctuate based on demand. Check the orderbook before placing orders to see real-time pricing.
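
For example, at the best ask shown in the H100 orderbook above, a 16xh100 instance (2 nodes) running for 24 hours works out to:
16 GPUs × $1.40/GPU/hr × 24 hr = $537.60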

Cost Optimization by GPU Type

Cost per TFLOP (FP16/BF16)

Based on illustrative per-GPU hourly prices:
H100:  $1.40/hr ÷ 1,979 TFLOPS = $0.00071 per TFLOP
H200:  ~$5.00/hr ÷ 1,979 TFLOPS = $0.00253 per TFLOP
B200:  $7.20/hr ÷ 4,500 TFLOPS = $0.00160 per TFLOP
B300:  ~$10.00/hr ÷ 6,000 TFLOPS = $0.00167 per TFLOP
Winner: The H100 offers the best compute per dollar on raw TFLOPS at these example prices. However, consider:
  • H100: excellent value for models that fit in 80GB
  • H200: worth it for memory-bound workloads despite the higher cost
  • B200: best value among the Blackwell GPUs
  • B300: for absolute maximum throughput

Availability

Check current availability:
# View inventory
galadriel inventory --zone us-west-1

# Check specific GPU type
galadriel prices estimate --gpu-type h100 --gpus 16 --zone us-west-1
GPU clusters are offered in the following zones:
  • us-west-1 (California)
  • us-east-1 (Virginia)
  • eu-west-1 (Ireland) - Coming soon

Next Steps