Architecture

The Seven-Layer AI Factory Reference Architecture

A reference approach for teams building production GPU platforms: workload-first design, opinionated component choices, and a delivery plan that maps to real capital budgets and timelines.

Seven layers, one operating model.

Each engagement maps your workload onto the seven layers of an AI factory. Every layer has measurable budgets — watts, tokens, dollars, latency, risk — and a clear owner.

stack/ai-factory v1.0
L7 Applications: Agents, RAG, Inference APIs, Fine-tuning, Internal copilots
L6 Platform: Model registry, API gateway, Auth, Billing, Policy, Observability
L5 Runtime: PyTorch, vLLM, SGLang, Triton, TensorRT-LLM, NIM-style services
L4 Orchestration: Kubernetes, Slurm, Ray, Kueue, Volcano
L3 Fabric: InfiniBand, RoCE, Spectrum-X Ethernet, Object & parallel storage
L2 Hardware: GPU nodes, NVLink / NVSwitch, NICs, Local NVMe
L1 Facility: Power, Cooling, Racks, Cabling

How it comes together

Seven structured workstreams.

/01 workstream

Workload-first architecture

Define the workload — model sizes, batch shapes, throughput, latency SLOs, training cadence — before any vendor conversation. Architecture choices follow workload, not the other way round.
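As a rough illustration, the workload definition can be written down as a small typed record before any sizing or vendor discussion. The sketch below is hypothetical: the field names, the example numbers, and the per-replica throughput figure are assumptions made for illustration, not measurements or recommendations.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSpec:
    """Hypothetical workload definition; every field name is an assumption."""
    model_params_b: float         # model size, in billions of parameters
    peak_requests_per_sec: float  # peak request rate
    avg_output_tokens: int        # average generated tokens per request
    p95_latency_slo_ms: int       # latency SLO, recorded but not used in the sizing below

    def required_tokens_per_sec(self) -> float:
        """Aggregate output-token throughput needed at peak."""
        return self.peak_requests_per_sec * self.avg_output_tokens


def replicas_needed(spec: WorkloadSpec, tokens_per_sec_per_replica: float) -> int:
    """Model replicas required to meet peak throughput, rounded up."""
    return math.ceil(spec.required_tokens_per_sec() / tokens_per_sec_per_replica)


# Illustrative numbers only.
spec = WorkloadSpec(model_params_b=70, peak_requests_per_sec=40.0,
                    avg_output_tokens=300, p95_latency_slo_ms=2000)
print(spec.required_tokens_per_sec())  # 12000.0
print(replicas_needed(spec, 2500.0))   # 5
```

The per-replica throughput would normally come from benchmarking the chosen runtime against the latency SLO, which is why the spec is fixed first and the hardware count is derived from it.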

/02 workstream

Hardware and topology

GPU class, NVLink / NVSwitch domains, NIC count and placement, local NVMe, rack and pod design. Sized to the workload and the facility envelope.
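One piece of this sizing is simple arithmetic: whether the model weights fit in GPU memory at all. The sketch below is a first-pass estimate under stated assumptions. The function names are hypothetical, and `usable_fraction` is an assumed share of GPU memory left for weights after reserving room for KV cache, activations, and runtime overhead.

```python
import math


def weights_gib(params_billions: float, bytes_per_param: int) -> float:
    """Memory needed for model weights alone, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30


def min_gpus_for_weights(params_billions: float, bytes_per_param: int,
                         gpu_mem_gib: float, usable_fraction: float) -> int:
    """Smallest GPU count whose usable memory holds the weights."""
    usable_gib = gpu_mem_gib * usable_fraction
    return math.ceil(weights_gib(params_billions, bytes_per_param) / usable_gib)


# Illustrative: a 70B-parameter model at 2 bytes per parameter (FP16/BF16)
# on 80 GiB GPUs, assuming 60% of memory is available for weights.
print(round(weights_gib(70, 2), 1))            # 130.4
print(min_gpus_for_weights(70, 2, 80, 0.6))    # 3
```

In practice the count is usually rounded up to a tensor-parallel degree that divides evenly across an NVLink / NVSwitch domain, so the result here is a lower bound rather than a topology.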

/03 workstream

Cluster orchestration

Kubernetes with the NVIDIA GPU Operator, Slurm or Ray for batch and training, Kueue or Volcano for fair-share scheduling, and policy-driven multi-tenancy.
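To make the idea of fair-share scheduling concrete, here is a deliberately simplified sketch: the next job goes to the tenant with pending work and the lowest usage relative to its entitlement. This is not how Kueue or Volcano are implemented; the `Tenant` type, its fields, and the example tenants are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tenant:
    name: str
    weight: float          # relative share entitlement (assumed policy input)
    gpu_hours_used: float  # consumption in the current accounting window
    pending_jobs: int


def next_tenant(tenants: list[Tenant]) -> Optional[Tenant]:
    """Pick the tenant with pending work and the lowest usage-to-weight ratio."""
    candidates = [t for t in tenants if t.pending_jobs > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.gpu_hours_used / t.weight)


tenants = [
    Tenant("research", weight=2.0, gpu_hours_used=100.0, pending_jobs=3),  # ratio 50
    Tenant("prod", weight=1.0, gpu_hours_used=40.0, pending_jobs=1),       # ratio 40
    Tenant("batch", weight=1.0, gpu_hours_used=10.0, pending_jobs=0),      # nothing queued
]
print(next_tenant(tenants).name)  # prod
```

Real schedulers add quotas, borrowing, preemption, and gang scheduling on top of this basic ordering.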

/04 workstream

Inference & training platform

vLLM, SGLang, Triton, TensorRT-LLM, and NIM-style services for inference. PyTorch DDP and FSDP for training, plus DiLoCo-style low-communication training. Shared model registry, checkpoints, and evaluation harnesses.

/05 workstream

Security & tenancy

Tenant isolation, secrets, audit logs, agent policy gates, supply-chain controls, and control mapping aligned to SOC 2, ISO 27001, and GDPR.

/06 workstream

Observability & operating model

DCGM, Prometheus, Grafana, Datadog, and OpenSearch for telemetry. Metrics cover GPU utilisation, queue depth, latency percentiles, tokens/sec, and cost/token. The operating model adds SLOs, runbooks, and incident response.
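Two of these metrics reduce to short formulas. The sketch below shows one plausible way to compute cost per token and a latency percentile from raw samples. The function names, the hourly GPU price, and the sample latencies are assumptions for illustration; a production setup would more likely derive these from the metrics backend.

```python
import math


def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_sec: float) -> float:
    """USD per one million tokens for a fleet at steady throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative inputs only: 8 GPUs at an assumed 2.00 USD per GPU-hour,
# sustaining 10,000 tokens/sec in aggregate.
print(round(cost_per_million_tokens(2.0, 8, 10_000.0), 4))  # 0.4444

latencies_ms = [120, 95, 110, 400, 105, 130, 98, 102, 115, 900]
print(percentile(latencies_ms, 50))  # 110
print(percentile(latencies_ms, 95))  # 900
```

The gap between the median and the 95th percentile in the sample data is the reason latency SLOs are usually stated as percentiles rather than averages.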

/07 workstream

Delivery roadmap

Phased delivery plan — assessment, design, build, harden, scale — with measurable acceptance criteria for each milestone.
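Measurable acceptance criteria can be expressed as thresholds that are checked against measured values at the end of a phase. The sketch below is one hypothetical encoding; the `Criterion` type, the metric names, and the targets are assumptions, not criteria taken from any real engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion; metric names and targets are hypothetical."""
    metric: str
    target: float
    higher_is_better: bool

    def passed(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def failed_criteria(criteria: list[Criterion], measured: dict[str, float]) -> list[str]:
    """Return metrics that miss their target; a missing measurement counts as a miss."""
    return [c.metric for c in criteria
            if c.metric not in measured or not c.passed(measured[c.metric])]


# Example criteria for a "harden" milestone (illustrative values).
harden = [
    Criterion("gpu_utilisation_pct", 70.0, higher_is_better=True),
    Criterion("p95_latency_ms", 2000.0, higher_is_better=False),
]
print(failed_criteria(harden, {"gpu_utilisation_pct": 74.0, "p95_latency_ms": 2300.0}))
# ['p95_latency_ms']
```

A milestone is accepted when the returned list is empty, which keeps the sign-off decision tied to recorded measurements rather than judgement.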

Next

Bring the architecture into focus.

An infrastructure review captures the workload, facility, vendor, and operating assumptions, and produces a concrete reference architecture and delivery plan.