On this page
- 1. Kubernetes Architecture Overview
- 2. Control Plane Components
- 3. Worker Node Components
- 4. Pod Lifecycle
- 5. Deployments & ReplicaSets
- 6. Services & Service Types
- 7. ConfigMaps & Secrets
- 8. Resource Requests & Limits
- 9. Liveness & Readiness Probes
- 10. Horizontal Pod Autoscaler (HPA)
- 11. Rolling Updates & Rollbacks
- 12. Namespaces & Multi-Tenancy
- 13. RBAC & Access Control
- 14. Persistent Volumes & Storage Classes
- 15. StatefulSets vs Deployments
- 16. Kubernetes Networking Basics
Kubernetes Production Concepts
1. Kubernetes Architecture Overview
Before diving into individual components, you need a mental model of what Kubernetes actually is and why it exists, because everything else makes more sense with that foundation.
Imagine you’re running a large restaurant chain with hundreds of locations. Each location needs staff, supplies, and coordination. You could manage each location manually — calling each one, making sure they have what they need, replacing staff who don’t show up. But that doesn’t scale. You’d hire a regional management system that automatically ensures each location is staffed, restocked, and operating correctly — and if something goes wrong at one location, the system responds without you having to intervene personally.
Kubernetes is that regional management system, but for containerized applications. You have applications packaged as containers, and you need to run them reliably across many machines, scale them based on demand, replace them when they crash, and update them without downtime. Kubernetes automates all of this.
The fundamental model of Kubernetes is declarative. You tell Kubernetes what you want — “I need 5 replicas of this model serving container, each with 2 CPUs and 8GB of RAM, accessible on port 8080” — and Kubernetes figures out how to make that happen and keeps making it happen continuously. If a machine dies and takes 2 of your replicas with it, Kubernetes notices and starts 2 new replicas on healthy machines without you doing anything.
Kubernetes is a cluster — a group of machines (called nodes) that work together as a single system. These machines are split into two roles: the control plane and worker nodes. The control plane is the brain — it makes decisions, maintains the desired state, and orchestrates everything. Worker nodes are the muscle — they actually run your application containers.
Everything in Kubernetes is represented as an object — a structured record stored in the control plane’s database. A Pod is an object. A Deployment is an object. A Service is an object. You interact with Kubernetes by creating, modifying, and deleting these objects, and Kubernetes continuously works to make the real world match the objects you’ve declared.
The API server is the single entry point for all interactions with Kubernetes. Whether you’re using kubectl from a terminal, a CI/CD pipeline deploying an application, or an internal Kubernetes component checking the state of the cluster — everything goes through the API server. This centralization is what makes Kubernetes’s consistent, auditable, and securable.
For ML systems specifically, Kubernetes provides the foundation for running training jobs that need GPU resources, serving models at scale with automatic scaling, managing the many microservices that make up an ML platform, and isolating different teams’ workloads from each other. Understanding Kubernetes at a production level is non-negotiable for anyone working in MLOps or platform engineering.
2. Control Plane Components
The control plane is the collection of processes that make up Kubernetes’s brain. It runs on dedicated nodes (separate from where your application workloads run in production setups) and is responsible for maintaining the desired state of the entire cluster. If the control plane goes down, your running workloads continue running, but you lose the ability to make changes, and Kubernetes stops automatically healing failures.
Let’s go through each control plane component and understand exactly what it does.
kube-apiserver is the front door to everything. Every interaction with the cluster — from kubectl commands, from internal components talking to each other, from your CI/CD pipeline deploying a new model — goes through the API server. It’s a RESTful HTTP API. When you run kubectl apply -f deployment.yaml, kubectl sends an HTTP request to the API server with your deployment object. The API server validates the request (is this syntactically valid? does this user have permission to do this?), persists the object to etcd, and returns a response. The API server is stateless — it doesn’t store anything itself. It reads from and writes to etcd for all persistent data. In production, you run multiple replicas of the API server behind a load balancer for high availability.
etcd is the cluster’s database — a distributed, consistent key-value store where all cluster state is persisted. Every Kubernetes object you create — every Pod, Deployment, ConfigMap, Secret, Service — is stored in etcd. It’s the source of truth for the entire cluster. If etcd is lost and you have no backup, your cluster state is gone. You know what you’re running from memory, but Kubernetes doesn’t. This is why backing up etcd is critical in production. etcd uses the Raft consensus algorithm to ensure consistency across multiple etcd nodes — writes are only confirmed after a majority of etcd nodes have acknowledged them. In production, you run etcd as a cluster of 3 or 5 nodes to tolerate node failures while maintaining consistency.
kube-scheduler is responsible for deciding which worker node each new Pod should run on. When a new Pod is created and doesn’t yet have a node assigned, the scheduler notices it, evaluates all available nodes against the Pod’s requirements, and assigns it to the most suitable node. The scheduler considers many factors: does the node have enough CPU and memory (resource requests)? Does the Pod have node affinity rules specifying it must or should run on certain nodes? Does the Pod have anti-affinity rules saying it shouldn’t run on the same node as other specific Pods (important for spreading replicas across nodes for fault tolerance)? Does the Pod require a GPU and does the node have one? The scheduler writes its decision back to the API server (updating the Pod object with the assigned node), and then the kubelet on that node picks up the Pod and starts running it.
kube-controller-manager runs a collection of control loops, each of which watches the state of some type of Kubernetes object and takes action to reconcile the current state with the desired state. The Deployment controller watches Deployment objects and ensures the correct number of ReplicaSets and Pods exist. The ReplicaSet controller watches ReplicaSet objects and ensures the correct number of Pods are running. The Node controller watches Node objects and marks nodes as unhealthy if they stop reporting in. The Job controller manages batch Jobs. There are dozens of built-in controllers. Each controller runs an independent loop: observe current state, compare to desired state, take action to close the gap, repeat. This reconciliation loop model is the core pattern of Kubernetes and understanding it conceptually unlocks almost everything else.
cloud-controller-manager integrates Kubernetes with the underlying cloud provider. When you create a Service of type LoadBalancer on AWS, something needs to actually call the AWS API to create an Elastic Load Balancer. That’s the cloud controller manager. It handles cloud-specific operations: provisioning load balancers, managing cloud storage volumes, updating cloud routing tables when nodes join or leave the cluster. It runs as a separate component from kube-controller-manager because each cloud provider implements it differently.
3. Worker Node Components
Worker nodes are the machines that actually run your application containers. Each worker node runs a set of components that enable it to receive work from the control plane, run containers, report status back, and provide networking.
kubelet is the primary agent on every worker node. It’s the link between the control plane and the actual container runtime on the node. The kubelet constantly watches the API server for Pods that have been scheduled to its node. When a new Pod is assigned to the node, the kubelet reads the Pod specification and instructs the container runtime to start the containers described in the spec. Once containers are running, the kubelet continuously monitors their health, runs liveness and readiness probes (more on these later), and reports back to the API server. If a container crashes, the kubelet restarts it according to the Pod’s restart policy. If the kubelet itself stops communicating with the API server (node failure, network partition), the node controller in the control plane marks the node as NotReady and eventually evicts its Pods to be rescheduled elsewhere.
Container Runtime is the software that actually pulls container images and runs containers. Kubernetes doesn’t run containers directly — it delegates this to a container runtime via the Container Runtime Interface (CRI). containerd is the most common container runtime in modern Kubernetes clusters, and it’s what most managed Kubernetes services (EKS, GKE, AKS) use. Docker used to be the dominant runtime but was deprecated as a Kubernetes runtime (not as a build tool — Docker is still fine for building images). The container runtime pulls images from registries, creates container file systems, manages container namespaces and cgroups, and handles container lifecycle operations.
kube-proxy runs on every node and maintains network rules that allow Pods to communicate with Services. When you create a Kubernetes Service, kube-proxy is responsible for ensuring that traffic sent to the Service’s IP address gets forwarded to one of the backing Pods. It does this by maintaining iptables rules (or more modern alternatives like IPVS) on each node. When a Pod on node A makes a request to a Service, kube-proxy’s rules on node A intercept the traffic and forward it to an appropriate backend Pod — which might be on the same node or a different node. In newer Kubernetes environments, eBPF-based CNI plugins like Cilium replace kube-proxy with a more efficient implementation.
Container Network Interface (CNI) Plugin handles Pod-to-Pod networking. Every Pod in a Kubernetes cluster gets its own IP address, and Pods on different nodes need to be able to communicate with each other directly using those IP addresses. The CNI plugin (Calico, Cilium, Flannel, Weave) implements the networking overlay or underlay that makes this possible. It creates network interfaces inside each Pod, assigns IP addresses, and sets up routing so that packets between Pods on different nodes get routed correctly. The choice of CNI plugin also determines which networking features are available — Calico and Cilium support NetworkPolicy for pod-level network isolation, while Flannel doesn’t.
Node-level resource management — the kubelet also manages resources on the node. It enforces the CPU and memory limits you set on containers using Linux cgroups. It manages local ephemeral storage. It coordinates with the device plugin framework to expose specialized hardware to containers — this is how GPU resources are made available to ML training and inference containers.
4. Pod Lifecycle
A Pod is the smallest deployable unit in Kubernetes. Not a container — a Pod. A Pod is a group of one or more containers that share a network namespace (same IP address), the same storage volumes, and run on the same node. Containers within a Pod communicate with each other via localhost. In practice, most Pods have a single application container, sometimes with sidecar containers (like a service mesh proxy, a log shipper, or a metrics exporter) that provide auxiliary functionality.
Understanding the Pod lifecycle — the states a Pod goes through from creation to termination — is essential for debugging and for designing reliable systems.
Pending — the Pod has been created (the object exists in etcd) but hasn’t been scheduled to a node yet, or has been scheduled but the containers haven’t started yet. A Pod can stay in Pending for several reasons: the scheduler hasn’t found a suitable node (not enough CPU/memory/GPU available), the container image is being pulled, or init containers are running. When debugging a stuck Pending Pod, kubectl describe pod <name> shows you the events and the reason it’s not progressing.
Init Containers — before the main containers in a Pod start, you can define init containers that run to completion first. These are used for setup tasks: waiting for a database to be ready before starting the application, downloading configuration from an external service, initializing a shared volume. Init containers run sequentially — each must complete successfully before the next starts. Only after all init containers complete do the main containers start. This is useful for ML systems where the model server needs to download model weights before it can start serving.
Running — all containers in the Pod have been created and at least one is running. Running doesn’t necessarily mean healthy — a container can be running but unable to serve traffic.
ContainerCreating — a sub-state of Pending/Running where the container runtime is pulling the image and setting up the container environment. This can take time for large images (ML model serving images are often several gigabytes).
Succeeded — all containers in the Pod have terminated successfully (exit code 0) and will not be restarted. This is the terminal state for batch job Pods.
Failed — all containers have terminated, and at least one container terminated with a failure (non-zero exit code or killed by the system). For Pods managed by Deployments, the controller creates a replacement Pod.
CrashLoopBackOff — this is one of the most common problematic states you’ll see. It means a container is repeatedly crashing and Kubernetes keeps restarting it, but each time it crashes again. The “BackOff” part means Kubernetes adds increasing delays between restart attempts (10s, 20s, 40s, 80s, up to 5 minutes) to avoid hammering a broken container. The causes range from application errors (an exception on startup), misconfiguration (missing environment variable, wrong config file path), missing dependencies, insufficient resources, or invalid startup commands. kubectl logs <pod> and kubectl logs <pod> --previous (to see logs from the last crashed container) are your primary debugging tools.
Termination — when a Pod is deleted or evicted, it goes through a graceful termination process. Kubernetes sends a SIGTERM signal to all containers, giving them time to finish in-flight requests and clean up. The grace period is 30 seconds by default (configurable via terminationGracePeriodSeconds). During this period, the Pod is removed from Service endpoints so it no longer receives new traffic. After the grace period, if containers haven’t exited, Kubernetes sends SIGKILL, forcefully terminating them. For ML serving containers, proper SIGTERM handling is important — you want to finish any in-flight prediction requests before shutting down.
Restart policies — Always restarts containers regardless of exit code (default for Deployment Pods, appropriate for long-running services). OnFailure only restarts on non-zero exit (appropriate for batch jobs). Never never restarts (for one-shot tasks).
5. Deployments & ReplicaSets
In practice, you almost never create Pods directly. You create Deployments, and Deployments manage Pods through an intermediate object called a ReplicaSet. Understanding this layering is important for understanding how updates and rollbacks work.
ReplicaSet is the object responsible for ensuring that a specified number of identical Pod replicas are running at any time. A ReplicaSet has a pod template (the specification for what the Pods should look like), a replica count (how many Pods should be running), and a selector (how it identifies which Pods it manages — typically using labels). The ReplicaSet controller continuously compares the number of running Pods matching its selector against its desired replica count. Too few Pods — create more. Too many — delete some. Pod crashes — create a replacement.
You could use ReplicaSets directly, but almost never should, because ReplicaSets don’t support updates well. If you change the Pod template in a ReplicaSet (to update the container image version, for example), existing Pods are not automatically updated. You’d have to delete old Pods manually and let the ReplicaSet create new ones with the new template. This is error-prone.
Deployment is a higher-level abstraction that manages ReplicaSets to provide controlled, rolling updates and rollback capability. When you create a Deployment, it creates a ReplicaSet. When you update a Deployment (changing the container image, environment variables, or any part of the Pod template), the Deployment controller creates a new ReplicaSet with the updated Pod template and orchestrates a controlled transition from the old ReplicaSet to the new one.
A Deployment spec looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: model-server
namespace: ml-production
spec:
replicas: 5
selector:
matchLabels:
app: model-server
template:
metadata:
labels:
app: model-server
spec:
containers:
- name: model-server
image: myregistry/model-server:v2.1.0
resources:
requests:
memory: "8Gi"
cpu: "2"
limits:
memory: "16Gi"
cpu: "4"
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2 # Create up to 2 extra Pods during update
maxUnavailable: 0 # Never reduce below 5 Pods during updateThe strategy section controls how updates happen. RollingUpdate is the default and appropriate for most ML serving scenarios — it gradually replaces old Pods with new ones, maintaining service throughout the update. Recreate kills all old Pods before creating new ones, causing a brief downtime — only appropriate when you absolutely cannot run two versions simultaneously.
maxSurge controls how many extra Pods above the desired replica count can exist during an update. maxUnavailable controls how many Pods can be unavailable below the desired count. Setting maxSurge: 2 and maxUnavailable: 0 means during an update, Kubernetes first creates 2 new Pods with the new version (taking you to 7 total), waits for them to be ready, then deletes 2 old Pods (back to 5), and repeats until all old Pods are replaced. You always have at least 5 serving Pods throughout.
Deployment vs Pod directly — always use Deployments (or other controllers like StatefulSets or Jobs) rather than bare Pods. A bare Pod deleted is gone forever — nothing recreates it. A Deployment whose Pod is deleted immediately creates a replacement. Deployments give you declarative updates, rollback history, and the full suite of Kubernetes reliability features.
6. Services & Service Types
Pods are ephemeral. They come and go — they crash and get replaced, they get rescheduled to different nodes, they’re replaced during updates. Each new Pod gets a new IP address. If your frontend needs to talk to your model server, it can’t rely on a fixed Pod IP because that IP changes constantly.
Services solve this by providing a stable network endpoint that sits in front of a dynamic set of Pods. A Service has a fixed IP address (called a ClusterIP) and a fixed DNS name that never change, regardless of how many times the backing Pods are replaced. Traffic sent to the Service is load-balanced across the healthy Pods matching the Service’s selector.
The selector on a Service uses the same label system as everything else in Kubernetes. If your model server Pods have the label app: model-server, a Service with selector app: model-server automatically directs traffic to all Pods with that label. When a new Pod is created with that label, it’s automatically added to the Service’s endpoint list. When a Pod is deleted, it’s automatically removed. The Service tracks healthy backends dynamically without any manual configuration.
Kubernetes has four Service types, each appropriate for different networking scenarios.
ClusterIP is the default. It creates a virtual IP address accessible only from within the cluster. Services use this for internal communication — your API gateway talks to your model server via ClusterIP, your model server talks to your feature store via ClusterIP. This IP is not accessible from outside the cluster, which is correct for most backend services. Every ClusterIP Service also gets a DNS name in the format service-name.namespace.svc.cluster.local, so services can discover each other by name rather than IP.
NodePort exposes the Service on a static port (between 30000-32767) on every node in the cluster. Traffic sent to any node’s IP address on that port is forwarded to the Service. This makes the Service accessible from outside the cluster by connecting to <any-node-ip>:<nodeport>. NodePort is rarely used in production because you’re exposing a high-numbered port on every node, there’s no load balancing across nodes, and you’re relying on node IPs which can change. It’s sometimes used for development or debugging.
LoadBalancer is the standard way to expose a Service to the internet or external network in cloud environments. When you create a LoadBalancer Service, the cloud-controller-manager calls the cloud provider’s API (AWS, GCP, Azure) to provision an external load balancer (ELB, GCLB, Azure LB) and configures it to forward traffic to the NodePort on all cluster nodes, which then forwards to the appropriate Pods. You get a single, stable external IP or DNS name. This is how you’d expose a model serving API to external clients. The downside is cost — each LoadBalancer Service provisions a cloud load balancer, which costs money. For exposing many services, an Ingress controller is more cost-effective.
ExternalName maps a Service to an external DNS name rather than a Pod selector. This lets you refer to external services using Kubernetes’s internal DNS system. Your model server can access a legacy external database at legacy-db instead of a long external hostname, and the ExternalName Service handles the DNS alias. This is useful for migrating services gradually — you point the ExternalName at an external service, and when you bring that service into the cluster, you update the Service type without changing how other services reference it.
Ingress is not a Service type but is closely related. An Ingress is a set of rules for routing external HTTP/HTTPS traffic to Services within the cluster. An Ingress controller (nginx, Traefik, AWS ALB Ingress Controller) watches Ingress objects and configures itself accordingly. Ingress lets you route different URL paths to different Services, handle TLS termination, and expose many services through a single external load balancer. For ML platforms with many services (prediction API, training dashboard, monitoring, model registry UI), Ingress is the typical production approach.
7. ConfigMaps & Secrets
Applications need configuration — database hostnames, feature flags, model hyperparameters, API endpoints, log levels. This configuration should not be hardcoded into container images, because then every configuration change requires rebuilding and redeploying the image. ConfigMaps and Secrets are Kubernetes’s mechanisms for injecting configuration into containers without embedding it in images.
ConfigMaps store non-sensitive configuration data as key-value pairs. A ConfigMap can store simple values (a single string for an environment variable) or entire configuration files (a YAML config file, a properties file, an nginx configuration). ConfigMaps are stored in etcd as plain text — they’re not encrypted or treated with any special security.
Creating a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: model-server-config
namespace: ml-production
data:
MODEL_VERSION: "v2.1.0"
BATCH_SIZE: "32"
LOG_LEVEL: "INFO"
feature_store_config.yaml: |
host: feature-store-service
port: 6379
connection_pool_size: 10
timeout_seconds: 5This ConfigMap has both simple key-value pairs and an entire YAML file as a value.
You consume ConfigMaps in Pods in two ways. As environment variables, where individual keys become environment variables in the container:
env:
- name: MODEL_VERSION
valueFrom:
configMapKeyRef:
name: model-server-config
key: MODEL_VERSIONOr mounting the entire ConfigMap as files in a directory, where each key becomes a filename and the value becomes the file contents:
volumeMounts:
- name: config-volume
mountPath: /etc/model-server
volumes:
- name: config-volume
configMap:
name: model-server-configAfter mounting, your container finds /etc/model-server/feature_store_config.yaml with the content you defined in the ConfigMap. This is how you inject complete configuration files into containers.
Secrets work identically to ConfigMaps in terms of how they’re consumed (environment variables or volume mounts), but they’re designed for sensitive data — passwords, API keys, TLS certificates, database credentials. Kubernetes applies some protections to Secrets that aren’t applied to ConfigMaps: they’re not displayed in kubectl describe output (values are masked), they’re stored in tmpfs (memory) on nodes rather than disk, and they’re only sent to nodes that have Pods requesting them.
apiVersion: v1
kind: Secret
metadata:
name: ml-database-credentials
namespace: ml-production
type: Opaque
data:
username: bWwtdXNlcg== # base64-encoded "ml-user"
password: c3VwZXJzZWNyZXQ= # base64-encoded "supersecret"Important caveat: Kubernetes Secrets are base64-encoded, not encrypted. Base64 is encoding, not encryption — anyone with access to the Secret object can decode it trivially. Kubernetes Secrets are more about operational separation (distinguishing sensitive from non-sensitive config) than genuine security. For real security, you should either enable encryption at rest for Secrets in etcd (so the data is actually encrypted in the database) or use an external secrets management system like Vault with the External Secrets Operator, which keeps the actual secret values in Vault and only creates Kubernetes Secrets at runtime.
Automatic Secret injection — Kubernetes automatically injects a service account token Secret into every Pod, which the Pod’s processes can use to authenticate to the Kubernetes API. This enables patterns where application code queries Kubernetes (checking pod count, reading ConfigMaps dynamically, etc.) without needing separately managed credentials.
8. Resource Requests & Limits
Resource management in Kubernetes is one of the most important and most often misconfigured aspects of running production workloads, especially ML workloads which are inherently resource-intensive.
Every container in a Pod can (and in production should) specify two things: how much CPU and memory it expects to use (requests), and the maximum CPU and memory it’s allowed to use (limits).
Resource Requests are the amount of CPU and memory that the scheduler uses when deciding where to place a Pod. The scheduler finds nodes that have at least as much unallocated (not already requested by other Pods) CPU and memory as the Pod’s requests. Requests don’t mean the container is actually using that much — they’re a reservation. If you request 4 CPUs and your container is idle, it’s using almost no CPU, but the scheduler treats 4 CPUs as reserved on that node and won’t place other Pods there if it would exceed the node’s total capacity.
Resource Limits are the maximum amount of CPU and memory the container is allowed to use, enforced by Linux cgroups.
CPU limits are enforced through throttling — if a container tries to use more CPU than its limit, the kernel throttles it. The container continues running but is slowed down. CPU is measured in millicores — 1000m (millicores) equals 1 CPU core. cpu: "500m" means half a CPU core.
Memory limits are enforced through OOMKill — if a container tries to allocate more memory than its limit, the kernel’s Out-of-Memory killer terminates the process with an OOMKilled exit code. The container is then restarted by the kubelet according to the restart policy. Memory is measured in bytes, with suffixes: Mi (mebibytes), Gi (gibibytes). memory: "4Gi" means 4 gibibytes.
resources:
requests:
memory: "8Gi"
cpu: "2"
limits:
memory: "16Gi"
cpu: "4"This container is guaranteed at least 2 CPUs and 8Gi of memory (the scheduler ensures this is available), and is allowed to burst up to 4 CPUs and 16Gi.
Setting requests and limits correctly is nuanced and critically important for ML workloads.
If requests are too low, the scheduler will overcommit nodes — placing many Pods on a node whose combined requests fit, but whose combined actual usage exceeds the node’s capacity. This causes performance problems and OOMKills when actual usage spikes.
If requests are too high (over-requesting), Pods won’t be scheduled efficiently — nodes appear full to the scheduler even when actual resource usage is low, wasting money on underutilized nodes.
If memory limits are too low, your model server will be OOMKilled mid-inference, causing failed predictions and poor user experience. This is a common problem with ML containers where memory usage can be variable depending on input size.
If you don’t set limits at all, a single misbehaving container can consume all resources on a node, starving other containers. In production, always set limits.
GPU resources — for ML training and inference containers, you also request GPU resources:
resources:
limits:
nvidia.com/gpu: 2 # Request 2 GPUsGPU resources are always specified as limits only (requests equal limits automatically). GPUs are not shared between containers by default — one Pod gets exclusive access to the requested GPUs.
LimitRange is a namespace-level object that sets default requests and limits for Pods that don’t specify them, and enforces minimum and maximum values. This prevents anyone from deploying a container with no limits (which could monopolize node resources) or requesting more resources than is reasonable.
ResourceQuota sets aggregate resource limits for an entire namespace — the total CPU, memory, and GPU that all Pods in the namespace combined can request. This is how you implement multi-tenancy resource allocation — give the ML training namespace a budget of 100 CPUs and 400Gi of memory, and Kubernetes enforces that no more than that is requested across all Pods in that namespace.
9. Liveness & Readiness Probes
Kubernetes can automatically detect that a container’s process has crashed (the process exits), but it can’t automatically detect more subtle failures — a web server that’s running but stuck in a deadlock and not responding to requests, a model server that’s loaded but still initializing and not ready to serve traffic yet, or an application that has run out of a connection pool and is returning errors. Probes are how you teach Kubernetes to detect these conditions.
There are three types of probes, each serving a distinct purpose.
Liveness Probes answer the question: is this container still alive and functional? If a liveness probe fails, Kubernetes considers the container to be broken and kills and restarts it. This handles scenarios where the process is still running but is in an unrecoverable bad state — hung, deadlocked, or stuck in an infinite error loop.
A liveness probe for an HTTP server checks a specific endpoint:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30 # Wait 30s after container starts before probing
periodSeconds: 10 # Check every 10 seconds
failureThreshold: 3 # Kill and restart after 3 consecutive failures
timeoutSeconds: 5 # Probe times out after 5 secondsThe initialDelaySeconds is critical — without it, Kubernetes might try to probe your container before it has finished starting up and immediately kill it, causing a crash loop before the application even has a chance to start. For ML model servers that need to load large model weights, this might need to be 60, 120, or even more seconds.
Readiness Probes answer the question: is this container ready to receive traffic? A container that fails its readiness probe is removed from the Service’s endpoint list — traffic stops being routed to it — but the container is not killed or restarted. This handles the startup phase (container is running but still loading model weights, still warming up caches, still connecting to dependencies) and temporary overload (container is too busy to handle more traffic).
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
successThreshold: 1 # Must pass once to be considered ready againThe difference between liveness and readiness is subtle but important. Liveness failure = restart. Readiness failure = stop sending traffic. A container can be live (not dead, no need to restart) but not ready (temporarily unable to serve). A common pattern for model servers: the readiness probe checks that the model is loaded and the server can handle requests, while the liveness probe checks that the server process itself is not hung.
Startup Probes are the third type and solve a specific problem — applications that have very long initialization times. If your liveness probe runs during startup and the application takes 3 minutes to load models, the liveness probe would repeatedly fail and restart the container before it ever gets a chance to finish starting. You could set initialDelaySeconds: 180 on the liveness probe, but then you’d have a 3-minute detection gap for actual failures after startup.
Startup probes solve this cleanly. The startup probe runs from container start and disables both liveness and readiness probes until the startup probe succeeds. Once the startup probe succeeds (application has finished initializing), normal liveness and readiness probes take over. This gives slow-starting applications all the time they need to start without compromising health detection once they’re running.
startupProbe:
httpGet:
path: /health
port: 8080
failureThreshold: 30 # Try 30 times
periodSeconds: 10 # Every 10 seconds = up to 300 seconds (5 minutes) to startProbe types — probes are not limited to HTTP. You can also use TCP probes (checks that a TCP connection can be established on a port — useful for non-HTTP services) and exec probes (runs a command inside the container and checks the exit code — flexible but more resource-intensive).
10. Horizontal Pod Autoscaler (HPA)
Manually setting replica counts for your Deployments is appropriate for stable, predictable workloads. But ML serving workloads are rarely stable and predictable. Traffic has daily patterns (low at night, high during business hours), viral moments can cause sudden 10× traffic spikes, and batch inference jobs can suddenly increase load. The Horizontal Pod Autoscaler automatically adjusts the number of Pods in a Deployment based on observed metrics.
The HPA is a control loop. It periodically (every 15 seconds by default) queries the metrics server for current resource usage of the Pods it’s managing, compares that to the target you’ve specified, calculates the desired replica count, and updates the Deployment if the current count differs.
The basic form of HPA scales on CPU utilization:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: model-server-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: model-server
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # Scale to maintain 70% CPU utilization averageThis HPA maintains between 3 and 20 replicas of the model-server Deployment, scaling up when average CPU utilization across all Pods exceeds 70% and scaling down when it drops significantly below 70%. The formula Kubernetes uses is: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric)). If you have 5 replicas at 140% CPU utilization and target is 70%, desired = ceil(5 × (140/70)) = ceil(10) = 10 replicas.
Custom metrics — for ML systems, CPU is often not the best scaling signal. A model server might be CPU-idle but GPU-saturated. Or it might be better to scale on inference request latency or queue depth. The HPA supports custom metrics via adapters like the Prometheus Adapter or KEDA.
Scaling on requests per second using Prometheus metrics:
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100" # Target 100 requests per second per PodKEDA (Kubernetes Event-Driven Autoscaling) extends HPA with many more scaling sources — Kafka topic lag, SQS queue depth, Redis list length, Prometheus metrics. For ML systems where inference requests come through a message queue, scaling on queue depth is much more effective than scaling on CPU. More messages in the queue means more workers needed — straightforward and responsive.
Scale-down stabilization — Kubernetes is conservative about scaling down to prevent flapping (rapid scale-up and scale-down cycles). By default, scale-down decisions require 5 minutes of sustained low load before pods are removed. This window is configurable. For ML serving, you generally want fast scale-up (add capacity immediately when traffic spikes) and slow scale-down (don’t remove pods until you’re confident traffic has sustainably dropped).
Resource requests are required — HPA based on CPU utilization only works if your containers have resource requests defined. The utilization percentage is calculated relative to the requested CPU. Without requests, the metrics server can’t calculate utilization percentage, and HPA won’t function.
11. Rolling Updates & Rollbacks
Updates to production services are a constant reality — new model versions, bug fixes, configuration changes, dependency updates. Kubernetes provides built-in mechanisms for updating Deployments with zero downtime and quickly reverting if something goes wrong.
Rolling Updates are the default update strategy for Deployments. When you update a Deployment (change the container image tag, modify an environment variable, update resource limits), Kubernetes doesn’t kill all old Pods and start new ones simultaneously. It gradually replaces old Pods with new ones, ensuring that the service remains available throughout the update.
The process: Kubernetes creates a new ReplicaSet with the updated Pod template. It starts creating new Pods from the new ReplicaSet. As new Pods become Ready (pass their readiness probes), old Pods are terminated from the old ReplicaSet. This continues until all Pods are running the new version.
The speed and safety of the rolling update is controlled by maxSurge and maxUnavailable in the Deployment strategy (covered in the Deployments section). For ML serving, the typical production configuration is maxSurge: 25% and maxUnavailable: 0 — this ensures you always have full capacity, at the cost of needing up to 25% extra capacity during the update.
Triggering an update — the most common trigger is updating the container image:
kubectl set image deployment/model-server model-server=myregistry/model-server:v2.2.0Or editing the Deployment manifest and applying it:
kubectl apply -f deployment.yamlMonitoring the rollout — Kubernetes provides a command to watch the rollout progress:
kubectl rollout status deployment/model-serverThis blocks and shows progress until the rollout completes. In CI/CD pipelines, this command is used to verify that a deployment actually succeeded before marking the pipeline as successful.
Rollout history — Kubernetes keeps a history of Deployment revisions. You can view the history:
kubectl rollout history deployment/model-serverThis shows each revision with the change cause (if you annotate your deployments with change descriptions using --record or the kubernetes.io/change-cause annotation).
Rollbacks — if a new version is causing problems, rolling back is a single command:
kubectl rollout undo deployment/model-serverThis reverts to the previous revision. To revert to a specific revision:
kubectl rollout undo deployment/model-server --to-revision=3Under the hood, a rollback is just a new rolling update — Kubernetes updates the Deployment’s Pod template back to the previous version and performs a new rolling update. The old ReplicaSet (which Kubernetes kept around for exactly this reason) is scaled back up while the current ReplicaSet is scaled down.
The key to safe rollouts — readiness probes. During a rolling update, Kubernetes only considers a new Pod ready (and starts terminating an old Pod) when the readiness probe passes. If your new model version fails its readiness probe — because it crashes on startup, because the model weights are wrong, because there’s a configuration error — Kubernetes stops the rollout and leaves your old Pods running. The rollout hangs rather than replacing working Pods with broken ones. This is the primary safety mechanism — without readiness probes, Kubernetes would replace working Pods with broken ones regardless.
12. Namespaces & Multi-Tenancy
A Kubernetes cluster is a shared resource. Multiple teams, multiple applications, multiple environments might all share the same cluster. Without mechanisms for isolation and organization, this shared resource becomes chaotic — one team’s misbehaving application affects another’s, there are no resource boundaries, and permissions are a mess.
Namespaces are Kubernetes’s primary mechanism for organizing a cluster into virtual partitions. Each namespace is a scope for resources — Pods, Deployments, Services, ConfigMaps, Secrets — that provides naming isolation, access control boundaries, and resource quota enforcement.
The key properties of namespaces:
Naming isolation — resource names must be unique within a namespace but can be duplicated across namespaces. You can have a Deployment named model-server in both the ml-team-a namespace and ml-team-b namespace simultaneously. Each team manages their own resources without naming conflicts.
Access control — RBAC permissions are scoped to namespaces. You can give Team A’s engineers admin access to the ml-team-a namespace without giving them any access to ml-team-b. Service accounts in one namespace don’t automatically have access to resources in another. This is how you implement multi-tenancy where different teams can use the same cluster without being able to interfere with each other.
Resource quotas — you can set ResourceQuota objects on namespaces to limit the total CPU, memory, GPU, and number of objects that can be created in that namespace. Give the training namespace a budget of 200 CPUs and 800Gi of memory. Once that budget is consumed, new Pods won’t be scheduled until existing ones are deleted. This prevents one team from monopolizing shared cluster resources.
Network isolation — combined with NetworkPolicy objects, namespaces provide a unit for network segmentation. You can define policies that allow traffic within a namespace but restrict traffic between namespaces, or that allow specific cross-namespace communication for well-defined integration points.
Typical namespace organization patterns:
By environment: development, staging, production namespaces. Each environment gets resource quotas appropriate to its scale. This works well for smaller organizations but has the downside that a noisy staging workload can affect production if they’re in the same cluster.
By team: ml-platform, data-science-team-a, data-science-team-b, monitoring, infrastructure. Each team manages their namespace independently. Resource quotas enforce fair sharing.
By function: model-training, model-serving, feature-store, monitoring, data-pipelines. Different components of your ML platform in different namespaces with appropriate resource allocations for each function.
Hybrid approaches are common — production-model-serving, production-model-training, staging-ml — combining environment and function for clear separation.
Default Kubernetes namespaces — clusters come with built-in namespaces: default (where resources land if you don’t specify a namespace — avoid using this in production), kube-system (Kubernetes system components), kube-public (publicly readable cluster info), kube-node-lease (node heartbeat objects). Don’t put your workloads in kube-system or default.
13. RBAC & Access Control
We covered RBAC conceptually in the Security section, but here we’ll go deeper into the Kubernetes-specific mechanics because understanding these mechanics is essential for production cluster operation.
RBAC in Kubernetes has four object types that combine to define who can do what.
Role defines a set of permissions within a specific namespace. It contains rules that each specify: which API groups, which resources, and which verbs (actions) are allowed.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: model-deployer
namespace: ml-production
rules:
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["pods", "pods/logs"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch"]This Role allows getting, listing, watching, creating, and updating Deployments in the ml-production namespace, and reading Pods and ConfigMaps. It cannot delete Deployments, cannot access Secrets, and cannot do anything in other namespaces.
ClusterRole works like Role but applies cluster-wide across all namespaces, or for cluster-scoped resources (Nodes, PersistentVolumes, Namespaces themselves) that don’t belong to any namespace. Use ClusterRole for permissions that genuinely need cluster-wide scope, and namespace-scoped Roles for everything else.
RoleBinding grants a Role to a subject (user, group, or service account) within a specific namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: ml-engineer-model-deployer
namespace: ml-production
subjects:
- kind: User
name: "harsh@company.com"
apiGroup: rbac.authorization.k8s.io
- kind: Group
name: "ml-engineers"
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role
name: model-deployer
apiGroup: rbac.authorization.k8s.ioClusterRoleBinding grants a ClusterRole to subjects cluster-wide. Use sparingly — most permissions should be namespace-scoped.
Service Accounts are the identity mechanism for Pods. Every Pod runs as a service account. By default, Pods run as the default service account in their namespace, which has minimal permissions. You create dedicated service accounts for Pods that need specific Kubernetes API access:
apiVersion: v1
kind: ServiceAccount
metadata:
name: model-server-sa
namespace: ml-productionBind a Role to this service account:
subjects:
- kind: ServiceAccount
name: model-server-sa
namespace: ml-productionAnd specify it in the Pod spec:
spec:
serviceAccountName: model-server-saCommon RBAC mistakes to avoid:
Using the cluster-admin ClusterRole for anything other than absolute necessity — it gives unrestricted access to everything and bypasses all RBAC checks.
Binding ClusterRoles using ClusterRoleBindings when RoleBindings would suffice — giving namespace-scoped permissions via ClusterRoleBinding accidentally grants access to all namespaces.
Giving service accounts of application Pods broad permissions — a model serving Pod almost certainly needs zero Kubernetes API access. If it does need some (reading its own ConfigMaps dynamically, for example), scope it precisely.
Not auditing RBAC regularly — permissions accumulate. That service account created for a one-time migration task six months ago still has broad permissions and nobody remembers what it does.
14. Persistent Volumes & Storage Classes
Containers are ephemeral by design — when a container restarts, its local filesystem is reset to the image state. Any data written inside the container during its lifetime is lost. For most stateless application containers, this is fine and even desirable. But for ML workloads — training data, model checkpoints, experiment logs, feature data — you need storage that persists independently of container lifecycle.
Kubernetes has a three-layer storage abstraction: PersistentVolumes, PersistentVolumeClaims, and StorageClasses.
PersistentVolume (PV) represents an actual storage resource in the cluster — an EBS volume on AWS, a Persistent Disk on GCP, an NFS share, a local disk on a node. PVs exist independently of Pods. A PV provisioned from an AWS EBS volume persists even when all Pods using it are deleted. PVs have a capacity, access mode, and reclaim policy.
Access modes define how the volume can be mounted:
ReadWriteOnce (RWO)— can be mounted read-write by a single node at a time. Most block storage (EBS, GCE PD) is RWO. You can have multiple Pods on the same node reading and writing, but not Pods on different nodes.ReadOnlyMany (ROX)— can be mounted read-only by many nodes simultaneously. Useful for distributing static data (like pre-trained model weights) to many Pods.ReadWriteMany (RWX)— can be mounted read-write by many nodes simultaneously. Requires network-attached storage (NFS, EFS, CephFS). Important for ML training jobs that need multiple Pods to write to shared storage.
PersistentVolumeClaim (PVC) is a request for storage by a user or application. A Pod doesn’t reference a PV directly — it references a PVC, and Kubernetes finds a PV that satisfies the claim and binds them together. This decouples the Pod specification from the specific storage implementation. Your Pod just says “I need 50Gi of ReadWriteOnce storage” — whether that comes from an EBS volume or a GCE Persistent Disk is an infrastructure detail the Pod doesn’t need to know.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-checkpoint-storage
namespace: ml-training
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: gp3-encryptedAnd the Pod references the PVC:
volumes:
- name: checkpoints
persistentVolumeClaim:
claimName: model-checkpoint-storage
containers:
- name: training-job
volumeMounts:
- name: checkpoints
mountPath: /checkpointsStorageClass defines how PVs are dynamically provisioned. Instead of manually creating EBS volumes and PV objects before every training job, StorageClass enables dynamic provisioning — when a PVC is created, Kubernetes automatically provisions the underlying storage and creates the PV. You define a StorageClass with the provisioner (the component that creates storage), parameters (volume type, encryption, IOPS), and reclaim policy (what happens to the PV when the PVC is deleted — Delete to destroy the storage, or Retain to keep it).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
type: gp3
encrypted: "true"
iops: "3000"
throughput: "125"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumerWaitForFirstConsumer means the volume is only provisioned when a Pod actually claims it, ensuring it’s provisioned in the same availability zone as the Pod (critical for EBS which is AZ-local).
For ML workloads, storage is a first-class concern. Training checkpoints need durable, high-throughput storage. Model artifacts need persistent storage accessible by serving infrastructure. Feature data might need ReadWriteMany storage accessible by multiple training jobs in parallel. Shared dataset storage for distributed training. Understanding what access modes and performance characteristics each workload needs and selecting the appropriate StorageClass is part of production ML infrastructure design.
15. StatefulSets vs Deployments
Deployments are the right abstraction for stateless applications — your model serving containers, API gateways, and most microservices. But some components of your ML infrastructure are stateful — they have identity, stable storage, and ordered startup/shutdown requirements. For these, StatefulSets are the appropriate abstraction.
Understanding the difference between Deployments and StatefulSets comes down to understanding the difference between stateless and stateful workloads.
In a Deployment, all Pods are interchangeable. If Pod A dies, Pod B (its replacement) is functionally identical and clients don’t care which one they’re talking to. Pods don’t have persistent individual identities. You can scale a Deployment up and down and any Pod can handle any request.
StatefulSets provide three things that Deployments don’t:
Stable, predictable Pod names — Pods in a StatefulSet are named <statefulset-name>-0, <statefulset-name>-1, <statefulset-name>-2, etc. These names are stable — if Pod ml-feature-store-2 is deleted, its replacement is also named ml-feature-store-2, not a random generated name. This is important for systems where identity matters — like distributed databases where each node has a specific role or partition assignment.
Stable network identity — each StatefulSet Pod gets a stable DNS hostname: <pod-name>.<statefulset-name>.<namespace>.svc.cluster.local. This hostname is persistent even when the Pod is rescheduled to a different node. Cluster members can address each other by predictable hostnames rather than ephemeral IPs.
Ordered, controlled startup and shutdown — StatefulSet Pods are created in order: Pod 0 must be Running and Ready before Pod 1 starts, Pod 1 before Pod 2, and so on. They’re also deleted in reverse order. This ordering is essential for distributed systems that have bootstrap sequences — the first node initializes the cluster, subsequent nodes join it. Disordered startup would leave the cluster in an undefined state.
Persistent storage per Pod — StatefulSets support volume claim templates, which automatically create a dedicated PVC for each Pod. Pod 0 gets its own PVC, Pod 1 gets its own, and so on. When a Pod is rescheduled, it reconnects to its own PVC — the same storage it had before. This is how distributed databases maintain their data across Pod restarts.
When to use StatefulSets:
- Databases (PostgreSQL, MySQL, MongoDB, Cassandra)
- Distributed coordination services (ZooKeeper, etcd when running as application)
- Message brokers (Kafka, RabbitMQ)
- Feature stores with stateful backends (Redis cluster, Cassandra-backed feature stores)
- Any application where each instance has unique identity, stable storage, or ordered lifecycle requirements
When to use Deployments:
- Model serving containers (stateless — any replica can serve any request)
- API gateways
- Web servers
- Data processing workers (stateless — any worker can process any job)
- Most microservices
A common mistake is reaching for StatefulSets when Deployments would suffice, because it seems like the “more serious” choice. StatefulSets are significantly more complex to operate — rolling updates are slower (sequential rather than parallel), debugging requires understanding Pod ordinals, and scaling has operational implications. Use Deployments by default and only switch to StatefulSets when you genuinely need what they provide.
16. Kubernetes Networking Basics
Networking in Kubernetes is a topic that confuses many engineers because there are multiple layers of networking happening simultaneously, each solving a different problem. Let’s build up the mental model layer by layer.
The fundamental networking model — Kubernetes has a flat networking model with three rules: every Pod gets a unique IP address, every Pod can communicate directly with every other Pod in the cluster without NAT (no address translation, the original source IP is preserved), and Pods and nodes can communicate directly. This is a remarkably simple model for what could be a very complex problem across many machines.
The flat model is implemented by the CNI plugin (Calico, Cilium, Flannel). The CNI plugin creates an overlay network (a virtual network on top of the physical network) or configures the physical network routing so that Pod IPs are routable across nodes. The specifics differ between CNI plugins, but the result is the same: Pod A on Node 1 can send a packet to Pod B on Node 3 using Pod B’s IP address directly.
Pod networking — each Pod gets a private network namespace. The Pod’s containers share this namespace — they all have the same IP address and can communicate with each other via localhost. The Pod’s IP is assigned from the cluster’s Pod CIDR (a range of IP addresses designated for Pods, typically something like 10.244.0.0/16).
Service networking — Services have a separate IP range (ClusterIP range, typically 10.96.0.0/12). Service IPs are virtual — no machine actually has that IP address. When a Pod sends a packet to a Service IP, kube-proxy’s iptables rules on the local node intercept the packet, select a backend Pod from the Service’s endpoint list, and rewrite the destination IP to the chosen Pod’s real IP before forwarding. This NAT happens transparently, and the reverse NAT happens on the return path.
DNS — CoreDNS runs as a Deployment in the kube-system namespace and provides DNS for the cluster. Every Service gets a DNS record: <service-name>.<namespace>.svc.cluster.local. Pods are configured to use CoreDNS as their DNS resolver, so curl http://feature-store.ml-platform.svc.cluster.local from any Pod resolves to the feature-store Service’s ClusterIP. Within the same namespace, you can use just the service name: curl http://feature-store.
Ingress networking — traffic coming from outside the cluster enters through an Ingress controller (or a LoadBalancer Service). The Ingress controller is typically a reverse proxy (nginx, Envoy, Traefik) running as a Pod with a Service of type LoadBalancer. External traffic hits the load balancer (provisioned in the cloud), which forwards to the Ingress controller Pod, which inspects the HTTP host and path, matches it against Ingress rules, and forwards to the appropriate backend Service.
Network Policies — by default, all Pods can communicate with all other Pods. NetworkPolicy objects restrict this. A NetworkPolicy selects Pods by label and defines ingress rules (who can send traffic to the selected Pods) and egress rules (where the selected Pods can send traffic). NetworkPolicies are additive — if no NetworkPolicy selects a Pod, all traffic is allowed. Once any NetworkPolicy selects a Pod, only traffic explicitly allowed by NetworkPolicies is permitted.
The typical production pattern is a default-deny NetworkPolicy in each namespace that blocks all ingress and egress, followed by specific allow policies for required communication paths. Your model server allows ingress from the API gateway namespace on port 8080, and egress to the feature store on port 6379. Everything else is blocked.
DNS resolution troubleshooting — a very common networking issue in Kubernetes is DNS resolution failures. Debug using: kubectl run test --image=busybox --rm -it -- nslookup feature-store.ml-platform.svc.cluster.local. This creates a temporary Pod and runs a DNS lookup from inside the cluster, confirming whether the Service DNS record resolves correctly.
Port forwarding for debugging — when you need to access a service from your local machine without exposing it externally: kubectl port-forward service/model-server 8080:80. This tunnels traffic from your localhost port 8080 to the cluster Service’s port 80, letting you test the service from your laptop without creating an external LoadBalancer.
Headless Services — a Service with clusterIP: None doesn’t get a virtual IP. Instead, DNS queries for the Service return the individual Pod IPs directly. This is how StatefulSet Pods get their stable DNS names — each Pod gets its own DNS record via a headless Service. Applications that need to connect to specific individual backends (like a Cassandra driver connecting to specific Cassandra nodes) use headless Services to discover and connect to individual Pods by name.
The unifying idea across all of them is the reconciliation loop — Kubernetes is continuously watching the difference between what you declared and what actually exists, and constantly working to close that gap. Every component, from the scheduler to the HPA to the Deployment controller, is a specialized reconciliation loop watching some part of the cluster state and taking action to make reality match your declarations. Once you internalize this mental model, Kubernetes stops being a collection of confusing objects and becomes a coherent system with a clear purpose.