Today, software teams frequently face significant challenges with Kubernetes operations. Developers often grapple with limited visibility, constrained independence, and restricted access to resources within Kubernetes environments. Concurrently, DevOps teams are tasked with daily support duties such as provisioning environments, monitoring usage, and managing governance across multiple clusters, which prevents them from focusing on core DevOps initiatives. These issues contribute to a poor developer experience, reduced productivity, and increased frustration within teams. This talk will delve into strategies for enabling developer self-service in Kubernetes, emphasizing how to maintain both flexibility and control over infrastructure to enhance overall efficiency and satisfaction.
As Co-Founder & Chief Product Officer at mogenius, Jan Lepsky specializes in Platform Engineering, focusing on enhancing DevOps efficiency and empowering developers to work independently with Kubernetes. mogenius provides practical tools that enable developers to define and...
Sourav Khandelwal, Databricks, Sr Software Engineer
Managing a vast fleet of Kubernetes clusters across multiple cloud providers like AWS, Azure, and GCP is often fraught with inefficient manual processes involving scripts and pipelines. These methods are time-consuming, error-prone, and consume valuable engineering resources. In this session, I will present how we addressed these challenges at Databricks by developing an automated system for the lifecycle management of over a thousand multi-cloud-managed Kubernetes clusters. Our solution features a Kubernetes-style continuous reconciliation mechanism for cluster provisioning and deprovisioning, a fast and reliable cluster state change discovery system integrated with Databricks’ product services, and blue-green cluster rotations (cluster swaps) that allow for seamless upgrades by creating new clusters with updated configurations and shifting workloads without downtime. This automation enables us to implement major infrastructure changes and upgrade Kubernetes versions with low risk through staged rollouts, seamless rollbacks, zero downtime, and minimal operator intervention. I will share our methodologies and experiences in constructing this loosely coupled system that orchestrates product workloads and cloud provider APIs.
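As a rough illustration of the Kubernetes-style continuous reconciliation idea described above, the following minimal sketch compares a desired set of clusters with the observed set and derives the actions needed to converge. It is not the Databricks implementation, whose details the abstract does not give: `ClusterSpec`, `reconcile`, the action names, and the version strings are all hypothetical, and a version mismatch is simply assumed to trigger a blue-green rotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of one cluster: a name and a Kubernetes version."""
    name: str
    version: str


def reconcile(desired, actual):
    """Return the actions that would move `actual` toward `desired`.

    Both arguments map cluster name -> ClusterSpec. A real controller would
    run this repeatedly and call cloud provider APIs; this sketch only plans.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        current = actual.get(name)
        if current is None:
            # Declared but not running: provision it.
            actions.append(("provision", name))
        elif current.version != spec.version:
            # Assumption: a version change is handled by a blue-green rotation.
            actions.append(("rotate", name))
    for name in sorted(actual):
        if name not in desired:
            # Running but no longer declared: deprovision it.
            actions.append(("deprovision", name))
    return actions


desired = {
    "a": ClusterSpec("a", "1.30"),
    "b": ClusterSpec("b", "1.30"),
}
actual = {
    "b": ClusterSpec("b", "1.29"),
    "c": ClusterSpec("c", "1.29"),
}
plan = reconcile(desired, actual)
```

Because the plan is derived from the difference between desired and observed state, running `reconcile` again once the fleet has converged yields no actions, which is what makes repeated execution safe.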
I am a seasoned software engineer with over 12 years of experience in designing and managing large-scale platforms in cloud-native environments. At Databricks, I have led and contributed to several innovative projects that have scaled and automated our Kubernetes Compute Platform...