Wednesday November 6, 2024 3:00pm - 3:25pm PST
Sourav Khandelwal, Databricks, Sr Software Engineer

Managing a large fleet of Kubernetes clusters across multiple cloud providers such as AWS, Azure, and GCP often depends on manual processes built from scripts and pipelines. These methods are slow, error-prone, and costly in engineering time. In this session, I will present how we addressed these challenges at Databricks by developing an automated system for the lifecycle management of over a thousand managed Kubernetes clusters across multiple clouds.
Our solution features a Kubernetes-style continuous reconciliation mechanism for cluster provisioning and deprovisioning, a fast and reliable cluster state change discovery system integrated with Databricks’ product services, and blue-green cluster rotations (cluster swaps) that allow for seamless upgrades by creating new clusters with updated configurations and shifting workloads without downtime.
This automation enables us to implement major infrastructure changes and upgrade Kubernetes versions with low risk through staged rollouts, seamless rollbacks, zero downtime, and minimal operator intervention. I will share our methodologies and experiences in constructing this loosely coupled system that orchestrates product workloads and cloud provider APIs.
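As a rough illustration of the Kubernetes-style continuous reconciliation described above, the following minimal Python sketch compares a desired set of clusters with the observed set and derives the actions a single pass would take. Everything in it is hypothetical and chosen for illustration: `ClusterSpec`, `reconcile`, and the action names `provision`, `rotate`, and `deprovision` are assumptions, not the system presented in the talk, and no cloud provider API is called.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical desired or observed configuration of one cluster."""
    name: str
    k8s_version: str


def reconcile(
    desired: Dict[str, ClusterSpec],
    actual: Dict[str, ClusterSpec],
) -> List[Tuple[str, str]]:
    """Return the (action, cluster_name) pairs for one reconciliation pass."""
    actions: List[Tuple[str, str]] = []
    for name, spec in desired.items():
        current = actual.get(name)
        if current is None:
            # Desired but not present: provision it.
            actions.append(("provision", name))
        elif current != spec:
            # Present but drifted: assume a blue-green rotation, i.e. create
            # a replacement with the new spec, shift workloads, retire the old.
            actions.append(("rotate", name))
    for name in actual:
        if name not in desired:
            # Present but no longer desired: deprovision it.
            actions.append(("deprovision", name))
    return actions


if __name__ == "__main__":
    desired = {
        "a": ClusterSpec("a", "1.30"),
        "b": ClusterSpec("b", "1.30"),
    }
    actual = {
        "b": ClusterSpec("b", "1.29"),
        "c": ClusterSpec("c", "1.29"),
    }
    print(reconcile(desired, actual))
    # [('provision', 'a'), ('rotate', 'b'), ('deprovision', 'c')]
```

In a continuously running system, a pass like this would be repeated on a schedule or on state-change events until the observed state matches the desired state; how the actual system triggers and executes those passes is not specified in the abstract.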
Speakers
Sourav Khandelwal

Sr Software Engineer, Databricks
I am a seasoned software engineer with over 12 years of experience in designing and managing large-scale platforms in cloud-native environments. At Databricks, I have led and contributed to several innovative projects that have scaled and automated our Kubernetes Compute Platform...
CloudX -- Main Stage