Architecture

The Seven-Layer AI Factory Reference Architecture

A reference approach for teams building serious GPU platforms: workload-first design, opinionated component choices, and a delivery plan that maps to real capital and timelines.


Seven layers, one operating model.

Each engagement maps your workload onto the seven layers of an AI factory. Every layer has measurable budgets — watts, tokens, dollars, latency, risk — and a clear owner.

stack/ai-factory v1.0

L7 Applications

Agents, RAG, Inference APIs, Fine-tuning, Internal copilots

L6 Platform

Model registry, API gateway, Auth, Billing, Policy, Observability

L5 Runtime

PyTorch, vLLM, SGLang, Triton, TensorRT-LLM, NIM-style services

L4 Orchestration

Kubernetes, Slurm, Ray, Kueue, Volcano

L3 Fabric

InfiniBand, RoCE, Spectrum-X Ethernet, Object & parallel storage

L2 Hardware

GPU nodes, NVLink / NVSwitch, NICs, Local NVMe

L1 Facility

Power, Cooling, Racks, Cabling
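
One way to make "measurable budgets and a clear owner" checkable is to record them per layer and flag the gaps. The sketch below is illustrative only: the layer names and budget dimensions come from the text above, while `LayerBudget`, `layers_missing_coverage`, the team names, and the figures are hypothetical.

```python
from dataclasses import dataclass, field

# Layer numbers and names as listed in the stack above.
LAYERS = {
    7: "Applications",
    6: "Platform",
    5: "Runtime",
    4: "Orchestration",
    3: "Fabric",
    2: "Hardware",
    1: "Facility",
}

# Budget dimensions named in the text.
BUDGET_KINDS = {"watts", "tokens", "dollars", "latency", "risk"}


@dataclass
class LayerBudget:
    layer: int
    owner: str  # hypothetical team name
    budgets: dict[str, float] = field(default_factory=dict)  # kind -> limit


def layers_missing_coverage(entries: list[LayerBudget]) -> list[int]:
    """Return layer numbers (top-down) that lack an owner or any budget."""
    covered = set()
    for e in entries:
        if e.layer not in LAYERS:
            raise ValueError(f"unknown layer: {e.layer}")
        unknown = set(e.budgets) - BUDGET_KINDS
        if unknown:
            raise ValueError(f"unknown budget kinds: {sorted(unknown)}")
        if e.owner and e.budgets:
            covered.add(e.layer)
    return sorted(set(LAYERS) - covered, reverse=True)


# Illustrative entries; the numbers are placeholders, not recommendations.
entries = [
    LayerBudget(1, "facilities", {"watts": 1_200_000}),
    LayerBudget(5, "ml-platform", {"latency": 0.2, "tokens": 50_000}),
]
print(layers_missing_coverage(entries))
```

Run against a real engagement, an empty result would mean every layer has both an owner and at least one budget recorded.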

How it comes together

Seven structured workstreams.

/01 workstream

Workload-first architecture

Define the workload — model sizes, batch shapes, throughput, latency SLOs, training cadence — before any vendor conversation. Architecture choices follow workload, not the other way round.
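
A workload definition of this kind can be captured as a small, validated record before any sizing starts. This is a minimal sketch: the fields mirror the items listed above (model sizes, batch shapes, throughput, latency SLOs, training cadence), but the `WorkloadSpec` name, field names, units, and example values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSpec:
    model_params_b: float         # model size, billions of parameters
    batch_size: int               # sequences per batch
    seq_len: int                  # tokens per sequence
    target_tokens_per_sec: float  # aggregate throughput target
    p95_latency_ms: float         # latency SLO
    training_runs_per_month: int  # training cadence (0 = inference only)

    def problems(self) -> list[str]:
        """List unset or implausible fields; an empty list means usable."""
        issues = []
        for name in ("model_params_b", "batch_size", "seq_len",
                     "target_tokens_per_sec", "p95_latency_ms"):
            if getattr(self, name) <= 0:
                issues.append(f"{name} must be positive")
        if self.training_runs_per_month < 0:
            issues.append("training_runs_per_month must not be negative")
        return issues

    @property
    def tokens_per_batch(self) -> int:
        return self.batch_size * self.seq_len


# Example values are placeholders chosen only to exercise the checks.
spec = WorkloadSpec(
    model_params_b=70,
    batch_size=32,
    seq_len=4096,
    target_tokens_per_sec=20_000,
    p95_latency_ms=500,
    training_runs_per_month=2,
)
print(spec.problems(), spec.tokens_per_batch)
```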

/02 workstream

Hardware and topology

GPU class, NVLink / NVSwitch domains, NIC count and placement, local NVMe, rack and pod design. Sized to the workload and the facility envelope.
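
"Sized to the workload and the facility envelope" reduces, at its simplest, to arithmetic like the following. It is a rough sketch under stated assumptions: per-GPU throughput and per-node power are inputs you would have to measure or obtain from a vendor, and `size_cluster` and its parameter names are hypothetical. Real sizing also has to account for memory capacity, interconnect topology, and headroom.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizing:
    gpus: int            # GPUs provisioned (whole nodes)
    nodes: int
    power_kw: float      # estimated node power draw
    fits_facility: bool  # True if power_kw is within the facility envelope


def size_cluster(target_tokens_per_sec: float,
                 tokens_per_sec_per_gpu: float,
                 gpus_per_node: int,
                 node_power_kw: float,
                 facility_power_kw: float) -> Sizing:
    """Round the workload up to whole nodes and compare against facility power."""
    inputs = (target_tokens_per_sec, tokens_per_sec_per_gpu,
              gpus_per_node, node_power_kw, facility_power_kw)
    if min(inputs) <= 0:
        raise ValueError("all inputs must be positive")
    gpus_needed = math.ceil(target_tokens_per_sec / tokens_per_sec_per_gpu)
    nodes = math.ceil(gpus_needed / gpus_per_node)
    power_kw = nodes * node_power_kw
    return Sizing(nodes * gpus_per_node, nodes, power_kw,
                  power_kw <= facility_power_kw)


# Placeholder numbers for illustration only.
print(size_cluster(20_000, 1_500, 8, 10.0, 25.0))
```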

/03 workstream

Cluster orchestration

Kubernetes with the GPU Operator, Slurm or Ray for batch and training, Kueue / Volcano for fair-share scheduling, and policy-driven multi-tenancy.

/04 workstream

Inference & training platform

vLLM, SGLang, Triton, TensorRT-LLM, and NIM-style services for inference. PyTorch DDP / FSDP / DiLoCo for training. Shared model registry, checkpoints, and evaluation harnesses.

/05 workstream

Security & tenancy

Tenant isolation, secrets, audit logs, agent policy gates, supply-chain controls, and SOC 2 / ISO 27001 / GDPR-aligned mapping.

/06 workstream

Observability & operating model

Tooling: DCGM, Prometheus, Grafana, Datadog, OpenSearch. Tracked metrics: GPU utilisation, queue depth, latency percentiles, tokens/sec, and cost/token. Operating model: SLOs, runbooks, and incident response.
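
Two of the metrics listed above, cost/token and latency percentiles, are simple to derive once throughput and request latencies are being collected. The sketch below assumes a flat hourly GPU price and uses the nearest-rank percentile method; the function names and sample numbers are hypothetical, and a production setup would normally compute these in the monitoring stack rather than in application code.

```python
import math


def cost_per_million_tokens(gpu_hour_usd: float, gpus: int,
                            tokens_per_sec: float) -> float:
    """Cluster cost per hour divided by tokens per hour, scaled to 1M tokens."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hour_usd * gpus / tokens_per_hour * 1_000_000


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Placeholder inputs for illustration only.
latencies_ms = [120.0, 95.0, 310.0, 150.0, 480.0]
print(round(cost_per_million_tokens(2.0, 8, 10_000), 4))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```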

/07 workstream

Delivery roadmap

Phased delivery plan — assessment, design, build, harden, scale — with measurable acceptance criteria for each milestone.
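
Measurable acceptance criteria can be treated as a gate between phases: a milestone is passed only when every criterion has a measurement that meets its target. In this sketch the phase names come from the plan above; the `Criterion` class, `next_phase` function, criterion names, and thresholds are hypothetical.

```python
from dataclasses import dataclass

# Phase order as given in the delivery plan.
PHASES = ("assessment", "design", "build", "harden", "scale")


@dataclass(frozen=True)
class Criterion:
    name: str
    target: float
    higher_is_better: bool = True

    def met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def next_phase(current: str, criteria: list[Criterion],
               measurements: dict[str, float]) -> str:
    """Advance one phase only if every criterion is measured and met."""
    if current not in PHASES:
        raise ValueError(f"unknown phase: {current}")
    for c in criteria:
        if c.name not in measurements or not c.met(measurements[c.name]):
            return current  # stay put until the gate is satisfied
    idx = PHASES.index(current)
    return PHASES[min(idx + 1, len(PHASES) - 1)]


# Illustrative gate for leaving the build phase; thresholds are placeholders.
build_gate = [
    Criterion("gpu_utilisation", 0.70),
    Criterion("p95_latency_ms", 500, higher_is_better=False),
]
print(next_phase("build", build_gate,
                 {"gpu_utilisation": 0.80, "p95_latency_ms": 420}))
```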

Bring the architecture into focus.

An infrastructure review captures the workload, facility, vendor, and operating assumptions, and produces a concrete reference architecture and delivery plan.
