Cluster Specs
Galadriel offers the latest NVIDIA datacenter GPUs for AI/ML workloads. All clusters come with full node exclusivity, native InfiniBand support, and bare-metal Kubernetes access.
Example Cluster Specifications
The specifications below are examples of available cluster configurations. Actual specifications may vary based on availability and location.
| Specification | B300 | B200 | H200 |
|---|---|---|---|
| Operating System | Ubuntu 22.04/24.04 | Ubuntu 22.04/24.04 | Ubuntu 22.04/24.04 |
| Processor | 2x Intel Xeon 6787P 86-core 2.0GHz | 2x Intel Xeon Platinum 8568Y+ 48-core 2.3GHz | 2x Intel Xeon Platinum 8568Y+ 48-core 2.3GHz |
| GPU | 8x ThinkSystem NVIDIA B300 GPU 288GB | 8x ThinkSystem NVIDIA HGX B200 180GB | 8x ThinkSystem NVIDIA H200 141GB |
| Memory | 4TB Total: 32x ThinkSystem DDR5 6400MHz | 2TB Total: 32x ThinkSystem 64GB TruDDR5 5600MHz | 2TB Total: 32x ThinkSystem 32GB DDR5 5600MHz |
| OS drives | 1.92TB Total: 2x ThinkSystem M.2 960GB SSD | 1.92TB Total: 2x ThinkSystem M.2 PM9A3 960GB SSD | 1.92TB Total: 2x Samsung 960GB M.2 SSD |
| Data drives | 30.72TB Total: 8x ThinkSystem 2.5” 3.84TB NVMe | 30.72TB Total: 8x ThinkSystem 2.5” U.3 PM1743 3.84TB SSD | 30.72TB Total: 8x Samsung 3.84TB U.2 SSD |
| Power | 8x ThinkSystem 3200W PSU N+N | 6x ThinkSystem 2600W PSU N+N | 8x ThinkSystem 2400W PSU N+N |
| Standard networking | ThinkSystem Dual 400Gb/s BlueField-3 B3240 DPU | ThinkSystem NVIDIA BlueField-3 B3220 2-Port 200G | ThinkSystem NVIDIA ConnectX-7 NDR200/HDR 2-Port 200G |
| Additional networking | 8x OSFP 800Gb/s ports via ConnectX-8 | 8x NVIDIA BlueField-3 B3140H SuperNIC 1-Port 400 GbE | 8x ThinkSystem NVIDIA ConnectX-7 NDR400 1-port 400G |
| Lenovo Type Model | ThinkSystem SR680a V4 | ThinkSystem SR780a V3 | ThinkSystem SR680a V3 |
| Data center locations | Charlotte, NC | Charlotte, NC | Raleigh, NC; Dallas/Ft. Worth, TX |
Available GPU Clusters
H100 (80GB)
Proven workhorse. Best price/performance for most workloads.
H200 (141GB)
Higher memory capacity. Same architecture as H100 with more memory.
B200
Blackwell architecture. Current generation with 2.3x faster training than H100.
B300
Blackwell high-end. Maximum performance available.
Instance Types
GPUs are allocated in full-node increments (8 GPUs per node):
| Instance Type | GPUs | Nodes | Use Case |
|---|---|---|---|
| 8xh100 | 8 | 1 | Single-node training |
| 16xh100 | 16 | 2 | Multi-node training |
| 24xh100 | 24 | 3 | Large-scale training |
| 32xh100 | 32 | 4 | Distributed training |
| 8xh200 | 8 | 1 | Memory-intensive single-node |
| 16xh200 | 16 | 2 | Memory-intensive multi-node |
| 8xb200 | 8 | 1 | Frontier model training |
| 16xb200 | 16 | 2 | Large frontier models |
| 8xb300 | 8 | 1 | Maximum performance |
Minimum allocation: 8 GPUs (1 full node). This ensures you get full node exclusivity and optimal InfiniBand performance.
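For illustration, the small helper below (hypothetical, not part of any Galadriel SDK) turns a requested GPU count into the instance type strings used in the table, enforcing the full-node rule:

```python
GPUS_PER_NODE = 8  # Galadriel allocates full nodes only


def instance_type(gpu_type: str, gpus: int) -> str:
    """Return an instance type string like '16xh100' for a valid GPU count."""
    if gpus < GPUS_PER_NODE or gpus % GPUS_PER_NODE != 0:
        raise ValueError(f"GPUs must be requested in multiples of {GPUS_PER_NODE}")
    return f"{gpus}x{gpu_type.lower()}"


# Example: 3 nodes of H100 -> '24xh100'
print(instance_type("H100", 24), "->", 24 // GPUS_PER_NODE, "node(s)")
```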
System & Access
Storage
Each node includes a 2TB NVMe SSD for local storage:
- Per-Node Storage: Each node in multi-node setups gets its own 2TB NVMe
- Access: Storage mounted as /workspace in your Kubernetes pods
- Persistence: Ephemeral - data is deleted when your lease expires
- Backup: Save important data before lease termination (no automatic backups); see the sketch below
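A minimal sketch of working with the ephemeral storage described above: it checks free space on /workspace and copies checkpoints off the node before the lease expires. The rsync destination is a placeholder, not a Galadriel-provided service.

```python
import shutil
import subprocess

# /workspace is the node-local NVMe mount described above (ephemeral).
usage = shutil.disk_usage("/workspace")
print(f"free: {usage.free / 1e12:.2f} TB of {usage.total / 1e12:.2f} TB")

# Copy checkpoints off the node before the lease expires; there are no
# automatic backups. The destination is a placeholder for your own storage.
subprocess.run(
    ["rsync", "-a", "/workspace/checkpoints/", "user@backup-host:/backups/run-01/"],
    check=True,
)
```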
Operating System
Base System:
- OS: Ubuntu 22.04 LTS
- Kernel: Linux 5.15+
- Container Runtime: containerd 1.7+
- NVIDIA Driver: 535+
- CUDA Toolkit: 12.3.0
- NVIDIA Container Toolkit: Pre-configured
- GPU Operator: Managed by Galadriel
- InfiniBand: Mellanox OFED 5.8+
- RDMA: GPUDirect RDMA enabled
- Kubernetes: v1.30+
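From inside a pod, the base system above can be verified quickly. The sketch below assumes PyTorch is installed in the container and uses nvidia-smi, which the pre-configured NVIDIA Container Toolkit exposes:

```python
import subprocess

import torch

# Driver version and GPU inventory via nvidia-smi.
subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,driver_version", "--format=csv"],
    check=True,
)

# CUDA as seen by the framework.
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)         # expect 12.x per the list above
print("GPUs visible:", torch.cuda.device_count())  # expect 8 on a full node
```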
Access Model
Galadriel uses Bare Metal Kubernetes - the industry standard for GPU clouds (same as CoreWeave, Lambda Labs):
Architecture:
- ✅ 0-2% overhead vs 5-10% for VMs (no hypervisor layer)
- ✅ Direct GPU access through CUDA in your containers
- ✅ Full node exclusivity - no noisy neighbors
- ✅ InfiniBand/RDMA for multi-node communication
- ✅ Privileged containers with elevated permissions
Access Methods:
- kubectl - Deploy and manage workloads via Kubernetes
- SSH - Direct shell access to your pod environment
- Kubeconfig - Standard Kubernetes API access
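Because access is standard Kubernetes, workloads can be submitted with kubectl or any Kubernetes client. The sketch below uses the official Python client to request a full node (8 GPUs); the pod name, namespace, image, and entrypoint are placeholders, and nvidia.com/gpu is the resource name the GPU Operator normally exposes.

```python
from kubernetes import client, config

config.load_kube_config()  # uses the kubeconfig supplied with your cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job"),  # placeholder name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.04-py3",  # example image
                command=["python", "train.py"],            # placeholder entrypoint
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "8"}  # full node: 8 GPUs
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
print("pod submitted; follow logs with: kubectl logs -f train-job")
```

An equivalent YAML manifest applied with kubectl works just as well; the Python client is shown only because the rest of this page's examples use Python.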
What You Can Do
Within your privileged pods, you can:
- ✅ Install any software packages (apt, pip, conda)
- ✅ Run CUDA applications and ML frameworks
- ✅ Access all GPUs directly via nvidia-smi
- ✅ Use InfiniBand for multi-node RDMA
- ✅ Modify files and environment in your container
- ✅ Run docker-in-docker for custom workflows
- ✅ Deploy any containerized application
Limitations
Due to the Kubernetes security model:
- ❌ Cannot load custom kernel modules
- ❌ Cannot modify host system networking
- ❌ Cannot access other customers’ data (node exclusivity ensures isolation)
- ❌ Cannot reboot or modify the physical host
These limitations are industry standard for GPU clouds. If you need capabilities like custom kernel modules, please contact [email protected] to discuss dedicated hardware options.
Network & Interconnect
All clusters come with:
InfiniBand/RDMA
- Bandwidth: 400 Gbps per node
- Latency: Under 2μs GPU-to-GPU
- GPUDirect RDMA: Enabled
- Topology: Fat-tree non-blocking
NVLink
- H100/H200: 900 GB/s (4th gen)
- B200/B300: 1.8 TB/s (5th gen)
- GPU-to-GPU: Full mesh within node
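Multi-node jobs typically reach the interconnect through NCCL, which selects NVLink inside a node and InfiniBand/GPUDirect RDMA between nodes on its own. A minimal all-reduce sketch, assuming it is launched with torchrun (which sets the rank and master-address environment variables):

```python
import os

import torch
import torch.distributed as dist

# Optional NCCL diagnostics; NCCL picks InfiniBand/GPUDirect RDMA and NVLink
# automatically when they are available.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Rank/world size and MASTER_ADDR/MASTER_PORT are set by the launcher,
# e.g. torchrun --nnodes=2 --nproc-per-node=8 this_script.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# All-reduce across every GPU in the job; traffic stays on NVLink within a
# node and crosses InfiniBand between nodes.
x = torch.ones(1024, device="cuda") * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: sum of ranks = {x[0].item():.0f}")

dist.destroy_process_group()
```

Running with 16 GPUs across two nodes, every rank should print the same sum (0 + 1 + ... + 15 = 120) once the collective completes.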
Performance Characteristics
Training Throughput (Relative to H100)
| GPU Type | FP16/BF16 | FP8 | TF32 | Effective Speed |
|---|---|---|---|---|
| H100 | 1.0x | N/A | 1.0x | Baseline |
| H200 | 1.0x | N/A | 1.0x | Same as H100 |
| B200 | 2.3x | 4.5x | 2.3x | 2.3-4.5x faster |
| B300 | 3.0x | 6.0x | 3.0x | 3-6x faster |
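As a rough planning aid, the table's FP16/BF16 factors can be turned into wall-clock estimates. The sketch below uses only those relative numbers; real throughput varies with model size, parallelism strategy, and interconnect utilization.

```python
# Relative FP16/BF16 training throughput from the table above (H100 = 1.0).
RELATIVE_SPEED = {"H100": 1.0, "H200": 1.0, "B200": 2.3, "B300": 3.0}


def estimated_hours(h100_hours: float, gpu_type: str) -> float:
    """Rough wall-clock estimate for a job that takes `h100_hours` on H100s."""
    return h100_hours / RELATIVE_SPEED[gpu_type]


# Example: a 100-hour H100 run lands near 100 / 2.3 ≈ 43 hours on B200.
for gpu in RELATIVE_SPEED:
    print(f"{gpu}: ~{estimated_hours(100, gpu):.0f} h")
```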
Memory Capacity vs Bandwidth
Per-GPU memory grows from 80GB (H100) to 141GB (H200), 180GB (B200), and 288GB (B300); choose the smallest GPU whose memory comfortably holds your model and activations, and step up only when memory, not compute, is the bottleneck.
Pricing Comparison
Prices are determined by marketplace supply and demand. Check current prices using the orderbook:
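A sketch of querying the orderbook over HTTP. The endpoint URL, authentication scheme, and response fields below are assumptions for illustration, not a documented Galadriel API; consult the API/CLI reference for the real interface.

```python
import os

import requests

# Placeholder endpoint and schema; see the Galadriel API docs for the actual
# orderbook interface.
ORDERBOOK_URL = "https://api.example-galadriel.com/v1/orderbook"

resp = requests.get(
    ORDERBOOK_URL,
    params={"instance_type": "8xh100"},
    headers={"Authorization": f"Bearer {os.environ['GALADRIEL_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()

for offer in resp.json().get("offers", []):  # hypothetical response shape
    print(offer["instance_type"], offer["price_per_gpu_hour"], offer["region"])
```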
Cost Optimization by GPU Type
Cost per TFLOP (FP16/BF16)
Based on example orderbook prices:
- H100 excellent value for models that fit in 80GB
- H200 worth it for memory-bound workloads despite higher cost
- B200 best value among Blackwell GPUs
- B300 for absolute maximum throughput
Availability
Check current availability:
- us-west-1 (California)
- us-east-1 (Virginia)
- eu-west-1 (Ireland) - Coming soon