Generative AI infrastructure

Technical
AI

This document is my preparation for a GenAI infrastructure interview. I wanted to try something different and use Gemini's guided learning. It may not cover every aspect, but it gives me a basic understanding of the main considerations when choosing a tech stack.

Comparing Sandboxed Runtimes

Each of these solutions aims to provide a stronger boundary than standard Linux namespaces and cgroups, but they do so through very different architectural paths.

| Feature | Kata Containers | Firecracker | gVisor | Sysbox |
| --- | --- | --- | --- | --- |
| Isolation Model | Lightweight VM 🖥️ | Micro-VM ⚡ | User-space Kernel 🛡️ | Enhanced Container 📦 |
| Technology | KVM / QEMU | KVM (minimalist) | ptrace / KVM | Linux namespaces+ |
| Performance | Near-native CPU | Ultra-fast boot | High syscall overhead | Near-native |
| Compatibility | High (full kernel) | Low (minimal devices) | Medium (Go-based Sentry) | High (system software) |
| AI/GPU Support | Excellent | Limited (experimental) | Limited | Good |
  1. Kata Containers: The “Luxury Suite” 🏨 Kata wraps each container (or Pod) in a dedicated lightweight VM. Because it runs a real (though slimmed) Linux kernel inside, it is 100% OCI compatible. If it runs in Docker, it runs in Kata. It’s the go-to for production Kubernetes clusters where you need hardware-level isolation but want to keep using standard tools.

  2. Firecracker: The “Safety Airbag” 🚗 Developed by AWS for Lambda and Fargate, Firecracker is a Virtual Machine Monitor (VMM) written in Rust. It strips away almost everything a traditional VM has (no BIOS, no legacy devices) to achieve sub-150ms boot times. It is designed for “transient” workloads—perfect for a quick AI agent task, but harder to manage for long-running, complex services.

  3. gVisor: The “Bodyguard” 👤 Google’s approach is different. Instead of a VM, gVisor provides a “user-space kernel” called the Sentry. When an app makes a system call (like opening a file), gVisor intercepts it and handles it itself rather than letting it reach the host kernel. This provides a huge security buffer but creates a “performance tax” on applications that do a lot of I/O.

  4. Sysbox: The “Invisible Upgrade” 🪄 Sysbox is an alternative to runc. It doesn’t use a VM at all. Instead, it uses advanced Linux kernel features to allow you to run “system” software—like Docker-in-Docker or Kubernetes nodes—inside a container without needing the dangerous --privileged flag. It’s perfect for building “infrastructure-as-a-service” platforms.

gVisor

The Three Pillars of gVisor

  • The Sentry (The Brain) 🧠: This is the “Application Kernel” written in Go. It implements the Linux system call interface. When an application in the sandbox tries to do something (like getpid or mmap), the Sentry handles it. Crucially, the Sentry never passes a syscall directly through to the host. It has its own PID table, memory management, and process lifecycle.

  • The Gofer (The File Guard) 🛡️: The Sentry is deliberately not trusted to talk to the host’s filesystem directly. Instead, it talks to a separate process called the Gofer. The Gofer acts as a mediator, using a protocol (like 9P or LISAFS) to provide file descriptors to the Sentry. This way, even if the Sentry is compromised, the attacker still doesn’t have direct access to the host’s files.

  • The Netstack (The Networker) 🌐: To avoid using the host’s network stack (which is a massive attack surface), gVisor includes its own Userspace Network Stack. It handles TCP/IP, UDP, and more, entirely within the Sentry. It only talks to the host via a simple virtual ethernet device (tap/tun).

Why Not I/O-Bound Workloads

This “Handshake” is what kills performance in heavy I/O workloads:

Context Switches: Every syscall bounces from the application to the Sentry, and potentially to the Gofer.

Data Copying: For security, data often has to be copied between these different memory spaces.

Protocol Overhead: The communication between Sentry and Gofer adds latency that a native kernel simply doesn’t have.
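
To make that cost concrete, here is a small, hedged sketch (plain Python, nothing gVisor-specific; the /tmp file path is just an example): reading the same 16 MiB file in 512-byte chunks issues tens of thousands of read() syscalls, while 1 MiB chunks issue only sixteen. Natively the gap is modest; inside gVisor, every one of those syscalls is an extra round trip through the Sentry (and possibly the Gofer), so the small-chunk pattern is exactly the kind of I/O that hurts.

```python
import os, time

def read_in_chunks(path: str, chunk_size: int) -> tuple[int, float]:
    """Read a file with os.read(); each counted iteration is one read() syscall."""
    syscalls = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_RDONLY)            # one open() syscall
    try:
        while os.read(fd, chunk_size):         # one read() syscall per loop
            syscalls += 1
    finally:
        os.close(fd)                           # one close() syscall
    return syscalls, time.perf_counter() - start

if __name__ == "__main__":
    path = "/tmp/sample.bin"                   # example scratch file
    with open(path, "wb") as f:
        f.write(os.urandom(16 * 1024 * 1024))  # 16 MiB of data

    for size in (512, 1024 * 1024):
        calls, secs = read_in_chunks(path, size)
        print(f"chunk={size:>8} B -> {calls:>6} read() calls in {secs:.4f}s")
```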

Mitigation

If you are asked “How would you make gVisor performant for an AI training run?”, you can mention two advanced levers:

  • VFS2 and LISAFS: Google recently rewrote the filesystem layer (VFS2) and the communication protocol (LISAFS) to reduce the number of “trips” between the Sentry and Gofer. This is why newer GKE sandboxes feel faster than old ones.

  • Direct I/O & Caching Modes: You can configure gVisor with different caching flags. For example, using --vfs-cache=always allows gVisor to trust its internal cache more, reducing the need to ask the Gofer for every single stat or open call.

Trap-and-Relay (Systrap)

  • Step 1: Setting the Trap

    When you start a container with the Systrap platform, gVisor installs a seccomp filter on the application’s threads. Specifically, it uses a feature called SECCOMP_RET_TRAP.

    In a normal container, seccomp might just kill a process if it tries a forbidden syscall.

    In Systrap, the filter says: “If this application tries to make any syscall, don’t execute it. Instead, pause the thread and send a SIGSYS signal to it.”

  • Step 2: The Stub Signal Handler

    This is where it gets clever. Every application thread in the sandbox has a tiny companion called a Stub.

    When the SIGSYS signal arrives, the application thread is forced to jump into a signal handler (the Stub). This handler is very minimal. Its only job is to:

    Save the current state of the CPU (registers, stack pointer, etc.).

    Tell the Sentry (the Go-based kernel) that a syscall is waiting to be handled.

  • Step 3: The “Fast Lane” Communication

    In older versions of gVisor (ptrace), the Sentry had to use slow system calls to read the application’s memory. Systrap replaces this with a shared memory region called sysmsg.

    The Stub writes the syscall details into sysmsg.

    The Sentry reads from sysmsg, does the work (like writing to a virtual file), and writes the result back.

    The Stub then restores the CPU registers and lets the application continue.

  • the futex

    A futex allows a process to wait on a specific value in a shared memory location:

    Wait: The Stub writes the syscall into shared memory and then calls futex(wait). The Linux kernel puts that thread to sleep, so it consumes zero CPU cycles. 😴

    Wake: The Sentry (the Go kernel) sees the new work, processes it, and then calls futex(wake). The kernel then “taps the Stub on the shoulder” and wakes it up to continue. ⏰
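
As a mental model only (this is not gVisor’s code, and all names are illustrative), the sketch below reproduces the same pattern in Python: a “stub” writes a request into shared memory and goes to sleep, a “sentry” process handles it, writes the result back, and wakes the stub. Python’s multiprocessing events are themselves built on futex-backed primitives on Linux, so the waiting side really does burn zero CPU.

```python
import multiprocessing as mp
from multiprocessing import shared_memory

def sentry(shm_name: str, req_ready, resp_ready):
    """Toy "Sentry": sleep until a request appears in shared memory, handle it, wake the stub."""
    shm = shared_memory.SharedMemory(name=shm_name)
    req_ready.wait()                      # sleep (a futex wait under the hood) until work arrives
    request = shm.buf[0]                  # read the "syscall number" the stub posted
    shm.buf[1] = (request + 1) % 256      # pretend to handle it and write the result back
    resp_ready.set()                      # "futex wake": unblock the stub
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=2)  # stand-in for the sysmsg region
    req_ready, resp_ready = mp.Event(), mp.Event()
    p = mp.Process(target=sentry, args=(shm.name, req_ready, resp_ready))
    p.start()

    # The "stub" side: post a request, wake the sentry, then sleep until it answers.
    shm.buf[0] = 39                       # e.g. __NR_getpid on x86-64, purely illustrative
    req_ready.set()
    resp_ready.wait()                     # consumes no CPU while waiting
    print("result byte:", shm.buf[1])

    p.join()
    shm.close()
    shm.unlink()
```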

Why and Why Not

  • Why CPU-heavy workloads work well: “Staying in the CPU” (User Mode / Ring 3)

    When your code is “staying in the CPU,” it means the processor is executing instructions that it can handle all by itself without asking the Operating System (the Kernel) for help.

  • Why I/O-heavy workloads suffer: “Talking to the Kernel” (System Calls / Ring 0)

    Your application is never allowed to touch the hardware (Disk, Network, Memory) directly. To do anything “real,” it must perform a System Call (syscall).

    What it looks like: open() a file, send() a packet over the network, malloc() to get more memory, or printf() to show something on the screen.

    The Process:

    1. The CPU pauses your application.
    2. It switches from “User Mode” to “Kernel Mode.”
    3. The Kernel (the boss) checks if you are allowed to do this.
    4. The Kernel does the work and then switches back to your app.
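
A quick way to feel this difference is the hedged sketch below (plain Python): the first loop stays entirely in user mode, the second makes one tiny write() syscall per iteration (to /dev/null, so there is no real I/O cost). Natively the gap is already visible; under a runtime that intercepts every syscall, such as gVisor, only the second loop pays the extra toll.

```python
import os, time

N = 200_000

# 1) CPU-bound: pure arithmetic, never leaves user mode (Ring 3).
start = time.perf_counter()
total = 0
for i in range(N):
    total += i * i
cpu_secs = time.perf_counter() - start

# 2) Syscall-bound: one write() per iteration, i.e. N user->kernel crossings.
fd = os.open("/dev/null", os.O_WRONLY)
start = time.perf_counter()
for _ in range(N):
    os.write(fd, b"x")
os.close(fd)
io_secs = time.perf_counter() - start

print(f"user-mode loop: {cpu_secs:.3f}s")
print(f"syscall loop:   {io_secs:.3f}s  ({N} kernel crossings)")
```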

Firecracker: The Micro-VM Approach

Firecracker shrinks the hardware environment. Developed by AWS for Lambda and Fargate, it uses the Linux KVM (Kernel-based Virtual Machine) to run a real, but extremely minimal, Linux kernel.

Key differences from a traditional VM (like VMware or VirtualBox):

  • No Legacy Devices: It doesn’t emulate old keyboards, VGA cards, or floppy disks. It only provides a few essential virtio devices (Network, Block storage, and a serial console). 💾

  • Speed: It can boot in less than 150ms.

  • Memory Footprint: Each Micro-VM uses about 5MB of RAM overhead. 📏
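
Everything in Firecracker is driven through a small REST API served on a Unix socket. The sketch below uses only the Python standard library and assumes a firecracker process is already running with --api-sock /tmp/fc.sock; the kernel and rootfs paths are placeholders you would replace with your own images.

```python
import http.client, json, socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, which is how Firecracker exposes its API."""
    def __init__(self, path: str):
        super().__init__("localhost")
        self._path = path
    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._path)

def api_put(sock_path: str, path: str, body: dict) -> int:
    conn = UnixHTTPConnection(sock_path)
    conn.request("PUT", path, json.dumps(body), {"Content-Type": "application/json"})
    status = conn.getresponse().status
    conn.close()
    return status

SOCK = "/tmp/fc.sock"  # assumes: firecracker --api-sock /tmp/fc.sock

# Point the microVM at a kernel and a root filesystem (placeholder paths).
api_put(SOCK, "/boot-source", {
    "kernel_image_path": "/images/vmlinux.bin",
    "boot_args": "console=ttyS0 reboot=k panic=1",
})
api_put(SOCK, "/drives/rootfs", {
    "drive_id": "rootfs",
    "path_on_host": "/images/rootfs.ext4",
    "is_root_device": True,
    "is_read_only": False,
})

# Start the guest; with a minimal kernel this completes in well under a second.
api_put(SOCK, "/actions", {"action_type": "InstanceStart"})
```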

The Kernel Panic Problem

In a standard Docker/Kubernetes cluster, every container shares the same host kernel.

  • The Risk: If an AI model triggers a bug that causes a “Kernel Panic,” the entire physical server (and every other container on it) crashes.

  • The Firecracker Solution: Because Firecracker uses KVM to create a Micro-VM, each agent has its own isolated kernel. If the AI agent crashes its kernel, only that 5MB Micro-VM dies. The host and other agents are completely unaffected.

The Attack Surface: “Minimalist by Design” 📉

Traditional VMs (like those in OpenStack or VMware) emulate a lot of hardware: keyboards, VGA cards, USB controllers. Each of these is a potential path for a hacker to “break out.”

  • Firecracker only emulates 5 devices:

    • virtio-net: For network connectivity.
    • virtio-block: For virtual hard drive/storage.
    • virtio-vsock: A secure channel for host-guest communication.
    • Serial Console: For logging and basic interaction.
    • Minimal Keyboard Controller: Only used to signal shutdown to the microVM.
  • By removing everything else, the “attack surface” is almost zero.

Performance: The “Snapshot” Advantage 📸

Firecracker’s secret weapon for scaling is MicroVM Snapshots. You can “freeze” a Micro-VM that has already loaded a massive AI model into memory and save it to disk.

To start a new agent, you don’t “boot” it; you just restore the snapshot. This can happen in milliseconds, allowing you to spin up thousands of agents to handle a sudden burst of AI requests.
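
A hedged sketch of the “pause and snapshot” half of that flow, against the same Unix-socket API (the endpoints are the public snapshot API, but request fields have shifted between Firecracker versions, so treat this as an illustration rather than a reference):

```python
import http.client, json, socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """Same Unix-socket HTTP helper as in the boot sketch above."""
    def __init__(self, path: str):
        super().__init__("localhost")
        self._path = path
    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._path)

def api(sock_path: str, method: str, path: str, body: dict) -> int:
    conn = UnixHTTPConnection(sock_path)
    conn.request(method, path, json.dumps(body), {"Content-Type": "application/json"})
    status = conn.getresponse().status
    conn.close()
    return status

SOCK = "/tmp/fc.sock"  # a running microVM that already has the model loaded in memory

# 1) Pause the guest so vCPU, device, and memory state are consistent.
api(SOCK, "PATCH", "/vm", {"state": "Paused"})

# 2) Write a full snapshot: VM state file plus the guest memory file.
api(SOCK, "PUT", "/snapshot/create", {
    "snapshot_type": "Full",
    "snapshot_path": "/snapshots/agent.vmstate",   # placeholder paths
    "mem_file_path": "/snapshots/agent.mem",
})

# Restoring is a PUT to /snapshot/load on a *fresh* Firecracker process; its request
# body differs across versions, so consult the API spec for the release you ship.
```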

Sysbox: The “System Container” (The Third Way)

Sysbox allows you to run Docker-in-Docker or Kubernetes-in-Docker without using the dangerous --privileged flag.

Why use it? If the “agentic infrastructure” you are building needs to launch other containers or system services (like systemd) inside a sandbox, Sysbox is the tool for the job. It uses advanced Linux kernel features (like User Namespaces and shiftfs) to trick the container into thinking it is a full physical host.

Kata Containers

How it works:

  • The Wrapper: When you run a container, Kata starts a lightweight Virtual Machine (using QEMU, Cloud Hypervisor, or even Firecracker as the engine).

  • The Agent: Inside that VM, it runs a minimal Linux Guest Kernel and a process called the kata-agent.

  • The Lifecycle: The kata-agent manages the container lifecycle inside the VM, while the kata-runtime on the host talks to Kubernetes (CRI-O/containerd) [1.4, 2.1].

Why it matters for AI:

  • GPU & Hardware Passthrough: Unlike gVisor (which struggles with hardware access), Kata supports PCIe Passthrough. This is huge for AI because it means you can give the container direct, isolated access to a GPU (NVIDIA/AMD) or TPU without losing VM-level security [2.4].

  • Confidential Containers (CoCo): Kata is the foundation for the “Confidential Containers” project. It allows you to run AI models in Trusted Execution Environments (TEEs) like Intel TDX or AMD SEV, ensuring that even the cloud provider cannot see your sensitive training data or model weights [2.3].

”Post-K8s” Platform

Why does Kubernetes fail at “Massive Scale”?

  • The etcd Bottleneck: Kubernetes uses a single database (etcd). When you have 100,000 nodes and millions of Pods, etcd becomes the “single point of slowness.”

  • Centralized Scheduler: The Kube-scheduler tries to look at the entire cluster to make one perfect decision. At massive scale, the “queue” of things waiting to be scheduled becomes a mile long.

  • The “Thundering Herd”: If 5,000 nodes reboot at once, they all try to talk to the Kube-API server simultaneously, effectively DDoS-ing the control plane.

1. Advanced Scheduling: Beyond the Centralized Queue

In a standard cluster, a single scheduler watches an API server. At 100k nodes, the time to “find a home” for a pod becomes the bottleneck. Here are the three architectures used by the world’s largest clusters (Google, AWS, and top AI labs):

  • A. Shared-State Scheduling (The “Parallel” Path) 🔄

    Instead of one scheduler, you run multiple schedulers in parallel.

    Mechanism: Every scheduler has a local, read-only copy of the entire cluster state. They make “optimistic” decisions.

    Conflict Resolution: If two schedulers pick the same GPU for two different agents, the State Store (the database) rejects the second transaction (see the sketch after this list).

    Scaling: This allows you to handle thousands of scheduling requests per second because the schedulers don’t have to wait in line.

  • B. Two-Level / Hierarchical Scheduling 🥪 This is how systems like Mesos or the newer Kueue work.

    • Level 1 (The Resource Allocator): It doesn’t know about containers. It just looks at the hardware and hands out “Resource Offers” (e.g., “Here is a slice of 80GB VRAM on Node X”) to different sub-schedulers.

    • Level 2 (The Task Scheduler): A specialized scheduler (e.g., one specifically for PyTorch, another for Batch inference) takes those offers and fits its specific tasks into them.
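
Here is a minimal sketch of the shared-state idea from (A), with an in-memory, versioned store standing in for the real state database (all names are illustrative): each scheduler reads a snapshot, makes an optimistic placement, and commits with a compare-and-swap on the version; whoever loses a race simply retries against fresher state.

```python
import threading

class StateStore:
    """Stand-in for the cluster state store: versioned, with compare-and-swap commits."""
    def __init__(self, free_gpus):
        self._lock = threading.Lock()
        self.version = 0
        self.free_gpus = set(free_gpus)

    def snapshot(self):
        with self._lock:
            return self.version, set(self.free_gpus)   # local read-only copy

    def commit(self, read_version: int, gpu: str, pod: str) -> bool:
        """Claim a GPU atomically, but only if nobody committed since our read."""
        with self._lock:
            if read_version != self.version or gpu not in self.free_gpus:
                return False                           # conflict: another scheduler won the race
            self.free_gpus.remove(gpu)
            self.version += 1
            print(f"{pod} -> {gpu}")
            return True

def scheduler(store: StateStore, pod: str):
    while True:
        version, free = store.snapshot()
        if not free:
            print(f"{pod}: no capacity, queueing")
            return
        choice = sorted(free)[0]                       # optimistic decision on a possibly stale view
        if store.commit(version, choice, pod):
            return                                     # otherwise loop and retry on fresh state

if __name__ == "__main__":
    store = StateStore(free_gpus=["gpu-0", "gpu-1", "gpu-2"])
    threads = [threading.Thread(target=scheduler, args=(store, f"pod-{i}")) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
```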

C. Workload-Aware “Gang” Scheduling 🤝

AI training doesn’t care about “single pods.” It needs all N GPUs to start at the exact same time.

  • The Problem: Standard K8s might schedule 7 out of 8 pods. The 7 pods sit there doing nothing, wasting expensive GPU time, waiting for the 8th.

  • The Solution: A custom orchestrator uses Gang Scheduling (or Atomic Scheduling). It only “commits” the resources if the entire training job can be placed. If not, it holds the resources in a queue.
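
A toy sketch of that all-or-nothing commit (illustrative names, not a real scheduler): the plan is built against a copy of the free-GPU map and only written back if every worker fits, so a partially placed job never holds GPUs hostage.

```python
from collections import deque

def try_gang_schedule(job: dict, free_gpus: dict):
    """Place every worker of the job (one GPU each) or place nothing at all."""
    plan, remaining = {}, dict(free_gpus)        # work on a copy; commit only if complete
    for worker in range(job["workers"]):
        node = next((n for n, free in remaining.items() if free > 0), None)
        if node is None:
            return None                          # the whole gang does not fit: reserve nothing
        plan[f'{job["name"]}-{worker}'] = node
        remaining[node] -= 1
    free_gpus.update(remaining)                  # atomic commit of the full reservation
    return plan

if __name__ == "__main__":
    cluster = {"node-a": 4, "node-b": 3}         # free GPUs per node
    queue = deque([{"name": "train-llm", "workers": 8},   # needs all 8 at once
                   {"name": "finetune", "workers": 4}])
    while queue:
        job = queue.popleft()
        plan = try_gang_schedule(job, cluster)
        if plan is None:
            print(f'{job["name"]}: held in queue, no GPUs reserved')
        else:
            print(f'{job["name"]}: placed -> {plan}')
```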

2. State Management: The “etcd” Successors

Kubernetes uses etcd, which is a Strongly Consistent (CP) database using the Raft consensus. This is its “Achilles’ heel” at scale. For a novel platform, you look at these alternatives:

  • A. FoundationDB/NewSQL (The “Scalable Backbone”) 🛡️

    Used by Snowflake and Apple (iCloud), and increasingly explored as a K8s backend replacement.

    Why it’s better: It decouples the “Transaction” layer from the “Storage” layer. You can scale the database horizontally without the Raft bottleneck.

    Multi-Tenancy: It handles “Tenants” natively, allowing you to isolate different AI research teams at the database level.

  • B. Sharded State & “Cells” 🏘️

    Instead of one giant database, you shard the state by Resource Type or Geographic Cell.

    Cell Isolation: If the database for “Training Region A” crashes, “Training Region B” continues uninterrupted.

    Latency: By keeping the state physically closer to the compute nodes (the edge), you reduce the round-trip time for agent status updates.

  • C. Event-Driven State (The “Log” Pattern) 📜

    For AI agents that only live for 20 seconds, you don’t need a persistent record in a heavy database.

    The Pattern: Use a distributed log like Kafka or a high-speed memory store like Redis for “Hot” state. Only move the “Cold” state (final results, logs) to a persistent store once the agent finishes.
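
A sketch of that hot/cold split (assumes a local Redis reachable through the redis-py client; the key layout and the append_to_cold_store helper are made up for illustration): the agent’s heartbeat lives in Redis with a TTL so it simply expires if the agent dies, and only the final result is written somewhere durable.

```python
import json, time
import redis  # assumes `pip install redis` and a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def heartbeat(agent_id: str, status: str, ttl_seconds: int = 30) -> None:
    """Hot state: overwrite the agent's status; it expires on its own if the agent dies."""
    payload = json.dumps({"status": status, "ts": time.time()})
    r.set(f"agent:{agent_id}:status", payload, ex=ttl_seconds)

def append_to_cold_store(agent_id: str, result: dict) -> None:
    """Placeholder durable sink; a real system would write to object storage or a database."""
    with open(f"/tmp/{agent_id}.result.json", "w") as f:
        json.dump(result, f)

def finish(agent_id: str, result: dict) -> None:
    """Cold state: persist the final result once, then drop the hot record."""
    append_to_cold_store(agent_id, result)
    r.delete(f"agent:{agent_id}:status")

if __name__ == "__main__":
    heartbeat("agent-42", "running")
    finish("agent-42", {"tokens": 1834, "exit": "ok"})
```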

The API Control Plane: FastAPI & gRPC 🔌

This serves as the interface for your “agentic infrastructure.”

  • FastAPI: Use for the external-facing REST interface. Leverages async/await to handle thousands of concurrent tool-call requests from AI agents.

    We use FastAPI (REST) for the external dashboard because it is browser-native. You can’t call gRPC directly from a standard web browser without a proxy. FastAPI’s async/await is perfect for the I/O-heavy work of a dashboard: it can wait for 1,000 different database queries simultaneously without blocking the server, keeping the UI responsive even during heavy cluster load [3.2]. A minimal endpoint sketch appears at the end of this section.

  • gRPC (Protobufs): The internal “Engine Room” language. Binary serialization makes it 5-10x faster than JSON for high-density node-to-master communication.

    The binary Protobuf format is much denser than JSON. For internal service-to-service communication, this isn’t just about saving bandwidth; it’s about CPU efficiency. Parsing JSON is a heavy string-processing task, while Protobuf decoding is closer to a memory copy. In a 100k-node cluster, this adds up to real CPU and power savings and up to 10x lower serialization latency [1.1]. Additionally, gRPC supports bidirectional streaming, allowing agents to report status and receive tool-call results over a single, long-lived TCP connection without the overhead of repeated HTTP headers.

  • Handling the “Thundering Herd”

    “If 50,000 agents suddenly reconnect after a network partition, we face a ‘Thundering Herd’ that could crash our API servers. To prevent this, I would implement three layers of defense”:

    • Exponential Backoff with Jitter: We don’t let agents retry immediately. They must wait 1s, then 2s, then 4s, and so on. Crucially, we add Jitter (randomness) so that the 50,000 agents don’t all retry at the exact same millisecond [2.1] (sketched at the end of this section).

    • Load Shedding & Quotas: I would implement Circuit Breakers and Rate Limiting at the API Gateway. If the backend is overwhelmed, we return a 503 Service Unavailable or gRPC RESOURCE_EXHAUSTED status immediately to preserve the health of the core database.

    • gRPC Connection Management: I’d use HTTP/2 Multiplexing to ensure that once a connection is established, it stays open for multiple requests, reducing the CPU “handshake” storm during the reconnection phase [1.4].
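
The backoff-with-jitter policy from the list above is small enough to show in full; this is a sketch, and the connect callable is a placeholder for whatever reconnection the agent actually performs.

```python
import random, time

def reconnect_with_backoff(connect, max_attempts: int = 8,
                           base: float = 1.0, cap: float = 60.0) -> bool:
    """Retry `connect` with capped exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            connect()
            return True
        except ConnectionError:
            # 1s, 2s, 4s, ... capped at `cap`, then jittered so 50,000 agents
            # do not all come back on the same millisecond.
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
    return False

# Usage (placeholder callable): reconnect_with_backoff(agent.connect)
```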
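
Tying back to the FastAPI point earlier in this section, here is a minimal async endpoint sketch (the route, limit parameter, and query_agent_status helper are all illustrative): because the handler awaits its I/O, a single worker can keep thousands of dashboard lookups in flight without blocking.

```python
import asyncio
from fastapi import FastAPI

app = FastAPI()

async def query_agent_status(agent_id: str) -> dict:
    # Placeholder for a real async database or gRPC call; the sleep stands in for I/O wait.
    await asyncio.sleep(0.05)
    return {"agent_id": agent_id, "status": "running"}

@app.get("/agents/status")
async def agents_status(limit: int = 1000):
    """Fan out many I/O-bound lookups concurrently; the event loop is never blocked."""
    ids = [f"agent-{i}" for i in range(limit)]
    return await asyncio.gather(*(query_agent_status(a) for a in ids))

# Run with: uvicorn this_module:app   (uvicorn assumed to be installed alongside FastAPI)
```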

Infrastructure as Code (Terraform)

  • State Separation: Use separate state files for Core VPC, GPU Clusters, and Regional Control Planes.

  • Module Interfaces: Standardize the “Agent Sandbox” module so any research team can deploy a secure environment with one click.
