Blog/Article
Why Kubernetes' latest device failure logic is good news for bare metal
August 28, 2025
AI and machine learning workloads have pushed infrastructure requirements far beyond simple CPU upgrades: modern workloads demand specialized hardware, such as GPUs and accelerators, that can cost ten times more than traditional compute resources.
Summary
If you're running workloads on bare metal infrastructure, Kubernetes' new device failure handling directly addresses some of the most frustrating challenges you've been facing with hardware reliability and workload management.
The Hidden Cost of Device Failures in Your Workloads
When you choose bare metal, you gain direct access to high-performance hardware, but you also inherit the complexity of managing hardware failures.
Recent data shared in a Kubernetes blog post reveals the scope of this challenge: NVIDIA reports 19 remediation requests per 1000 nodes daily in their GeForce NOW infrastructure. That's nearly 2% of nodes requiring intervention every single day.
For your LLM training jobs that have been running for weeks, a GPU failure doesn't just mean losing compute time. It often means losing weeks of progress as your entire multi-node workload needs to restart from scratch.
Where Kubernetes Has Been Letting You Down
Until recently, Kubernetes has treated hardware in frustratingly binary terms: a device either exists and works perfectly, or it doesn't exist at all. But anyone who's worked with real hardware knows that failures are rarely that clean.
Your GPU might start throwing ECC errors, overheating, or losing NVLink connectivity, yet Kubernetes will still consider it "available" and continue scheduling your workloads to it. Your pods end up sitting on degraded hardware indefinitely, delivering unpredictable slowdowns or mysterious failures that can take hours to diagnose and resolve.
The built-in detection mechanisms compound the problem. Kubernetes' liveness and readiness probes are designed to detect application crashes, not the subtle signs of hardware degradation, such as intermittent errors or gradual performance loss.
As long as your container process keeps running, Kubernetes assumes everything is fine, even while the hardware underneath silently degrades your job's performance.
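Until the improvements described below land, the closest thing to device-aware detection you can assemble today is a probe that checks the hardware yourself. Here is a minimal sketch, assuming an NVIDIA device plugin and a container image that ships nvidia-smi; the pod name, image, and thresholds are illustrative, and even this only catches a GPU that has stopped responding entirely, not one that is slowly degrading.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker                  # illustrative name
spec:
  containers:
  - name: trainer
    image: registry.example.com/llm-trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
    # Default probes only watch the container process. This exec probe also
    # fails the container when the driver can no longer reach the GPU.
    livenessProbe:
      exec:
        command: ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"]
      initialDelaySeconds: 30
      periodSeconds: 60
      failureThreshold: 3
```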
When these issues surface, your options are limited and disruptive. Without proper device health tracking, you're often forced to drain entire nodes, recreate compute pools, or manually migrate workloads to healthier hardware.
For large AI/ML training jobs spanning hundreds of pods, this typically means scrapping days of progress and starting over.
The Improvements
The enhancements outlined in Kubernetes' device failure roadmap directly address the operational pain points you've been experiencing with bare metal workloads. Here's what's particularly significant for your day-to-day operations:
Enhanced Device Health Status (KEP-4680): Instead of the current binary working/broken status, per-device health will be surfaced directly in the pod status through the "Add Resource Health Status to the Pod Status" enhancement. This means you can make informed decisions about workload placement and catch problems before they destroy your jobs. A sketch of how that health status appears is shown below.
Intelligent Pod Rescheduling: New capabilities will automatically move your workloads away from unhealthy devices, including support for descheduling pods with restartPolicy: Always. No more manually hunting down stuck workloads or losing days of progress to failed hardware.
In-Place Recovery Options: Rather than constantly rescheduling to new nodes, Kubernetes will support restarting containers in place. For bare metal users, this means faster recovery times and less bandwidth usage from repeatedly pulling large container images.
Node-Local Retry Policies: Instead of immediately escalating every failure to the cluster level, Kubernetes will handle more transient issues locally through improved pod failure policies, reducing the disruption to your workloads from temporary hardware hiccups. A sketch of today's podFailurePolicy, which these improvements build on, also appears below.
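To make the enhanced device health status concrete, here is a rough sketch of how an unhealthy GPU is expected to show up in the pod status once the KEP-4680 feature (currently behind the ResourceHealthStatus feature gate) is enabled. The container name and device ID are illustrative, and the exact field names may still shift as the enhancement graduates.

```yaml
# Illustrative pod status fragment per KEP-4680; subject to change.
status:
  containerStatuses:
  - name: trainer                   # hypothetical container name
    allocatedResourcesStatus:
    - name: nvidia.com/gpu          # resource the container was allocated
      resources:
      - resourceID: GPU-6e9c        # device ID reported by the plugin (illustrative)
        health: Unhealthy           # reported as Healthy, Unhealthy, or Unknown
```

Once this signal lives in the pod status, controllers and operators can react to a failing device directly instead of inferring it from job-level symptoms.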
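The node-local retry work builds on the pod failure policy API that Jobs already expose. As a point of reference, here is a minimal sketch of a Job that ignores disruptions, such as a drain triggered by failing hardware, instead of burning through its backoff limit, while still failing fast on a non-retriable application error; the job name, image, and exit code are illustrative assumptions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training                # hypothetical job name
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Pod evicted by a disruption (e.g. a node drained for bad hardware):
    # recreate it without counting against backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    # Non-retriable application error: fail the whole Job immediately.
    - action: FailJob
      onExitCodes:
        containerName: trainer
        operator: In
        values: [42]                # illustrative exit code
  template:
    spec:
      restartPolicy: Never          # required when using podFailurePolicy
      containers:
      - name: trainer
        image: registry.example.com/llm-trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 8
```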
Why This Transformation Matters More on Bare Metal
When you run workloads on major cloud platforms, hardware failures are largely abstracted away. A failed GPU typically means you are handed a new instance with working hardware; the cloud provider absorbs that complexity behind the scenes.
On bare metal, you are choosing direct hardware access for the performance and control advantages it provides. But that choice has historically come with the burden of dealing directly with hardware complexity and failures.
These Kubernetes improvements essentially give you cloud-like failure resilience while preserving the performance and control benefits that drew you to bare metal in the first place.
What These Changes Mean for Your Operations
As these improvements roll out, clusters that stay on recent Kubernetes releases will see significant operational benefits:
Higher Reliability for Long-Running Jobs: Better failure handling means your weeks-long AI training runs are far less likely to be destroyed by hardware issues that could have been detected and handled gracefully.
Reduced Time Spent on Infrastructure Problems: Automatic detection and remediation of device issues enable you to focus on your actual workloads rather than constantly troubleshooting mysterious performance issues or stuck pods.
More Predictable Performance: With proper device health monitoring, you gain visibility into hardware degradation before it impacts your applications, enabling proactive rather than reactive management.
Lower Total Cost of Ownership: Less time spent on manual intervention and fewer job restarts translates directly into better utilization of your infrastructure investment and reduced operational overhead.
Planning for the Transition
The Kubernetes community is implementing these improvements with a focus on extensibility and backwards compatibility, which is good news if you have existing workloads and established operational processes.
Rather than prescriptive one-size-fits-all solutions, the new capabilities provide better extension points that can be configured to match your specific hardware configurations and operational requirements.
So now is an excellent time to start preparing for these capabilities. Understanding how the new device health status reporting will work, planning how to integrate improved failure handling into your existing workflows, and potentially testing these features in development environments will help you take full advantage when they reach general availability.
The demand for reliable, high-performance infrastructure for AI/ML workloads continues to grow, and Kubernetes' enhanced device failure handling represents a significant step toward making bare metal infrastructure operationally smoother, while maintaining the performance and cost advantages that make bare metal attractive for hardware-demanding workloads.
If you haven’t signed up yet, get started on Latitude.sh today and deploy bare metal GPUs in just a few seconds.
FAQ
How will these improvements affect my running AI/ML jobs?
Your long-running training jobs will be less likely to fail catastrophically due to hardware issues. Instead of losing weeks of progress when a GPU starts failing, Kubernetes will detect the problem and automatically move your workload to healthy hardware.
What's the most critical change for bare metal users?
Moving from binary device status (working/broken) to detailed health monitoring, combined with automatic workload migration away from failing devices. This eliminates the frustrating scenario where pods get stuck forever on degraded hardware.
When should I plan to upgrade my Kubernetes clusters?
These improvements are rolling out across upcoming Kubernetes releases. Staying current with new releases will give you access to the improved failure handling as it lands, which can dramatically improve the reliability of your bare metal workloads.