Peppermint AI

Design distributed scheduler

A distributed scheduler manages time-based job execution across a cluster of machines. It aims for timely scheduling, fault tolerance, and effectively exactly-once execution by coordinating workers through locks and shared state. For example, Kubernetes CronJobs run recurring tasks such as database backups or report generation across a cluster, although Kubernetes documents that a CronJob may occasionally start a job twice or not at all, which is why such jobs should be idempotent.

About

Deconstruct a distributed job scheduler, from architecture to code. We unpack the trade-offs: pessimistic DB locking for concurrency control, idempotent APIs to prevent duplicate work, and retries with exponential backoff plus jitter. The post summarizes a real candidate’s mock interview: what they focused on, the design they proposed, and the alternatives they could have considered. The goal is to help you justify your choices and demonstrate the architectural maturity needed to build resilient, scalable systems in an interview.
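To make the retry strategy concrete, here is a minimal sketch of exponential backoff with "full" jitter, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name `backoff_delays` and the parameters `base` and `cap` are assumptions chosen for illustration; the source does not specify them.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    base and cap are hypothetical tuning knobs, not values from the post.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential ceiling: base * 2^attempt, never above cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so that many failed
        # runs do not all retry at the same instant.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(5)):
        print(f"retry {i + 1}: wait {delay:.2f}s")
```

The jitter matters as much as the exponent: without it, executors that failed together retry together and can overload the dependency again.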

Terminology

  • Job: The definition of the work to be done and when it should be done, usually specified by the user.
  • Run: A single scheduled execution of a job. A job can have multiple runs, depending on its definition.
  • Scheduler: Determines when runs are due and queues them for execution.
  • Executor: Executes a scheduled run and handles failed runs.
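The relationship between these terms can be sketched as a small data model. This is an illustration under stated assumptions, not the design from the interview: the class names, the fields, the fixed-interval recurrence, and the `materialize_runs` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class RunStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class Job:
    job_id: str
    command: str                           # what to do
    first_run_at: datetime                 # when to do it
    interval: Optional[timedelta] = None   # None means a one-time job


@dataclass
class Run:
    job_id: str
    scheduled_for: datetime
    status: RunStatus = RunStatus.PENDING


def materialize_runs(job: Job, window_start: datetime,
                     window_end: datetime) -> List[Run]:
    """Scheduler step: expand a job into runs due in [window_start, window_end)."""
    t = job.first_run_at
    if job.interval is None:
        return [Run(job.job_id, t)] if window_start <= t < window_end else []
    if job.interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if t < window_start:
        # Jump to the first occurrence at or after window_start.
        t += ((window_start - t) // job.interval) * job.interval
        if t < window_start:
            t += job.interval
    runs = []
    while t < window_end:
        runs.append(Run(job.job_id, t))
        t += job.interval
    return runs
```

For example, an hourly job first scheduled at midnight yields two runs (03:00 and 04:00) for the window from 02:30 to 05:00. An executor would then pick up each `PENDING` run and move it through the remaining statuses.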

Requirements

  • Allow users to create and schedule new jobs (one-time or recurring)
  • Provide visibility into job runs, including current status, execution logs, and results
  • Enable users to cancel an in-progress job run or delete an entire scheduled job definition
  • Ensure high reliability and fault tolerance so jobs run exactly once, even under failures
  • Scale to handle up to 1000 job run executions per second across the cluster
  • Minimize latency between scheduled time and actual job start for timely execution
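One way to approach the exactly-once requirement is to let executors claim a run atomically before starting it. The sketch below uses a conditional `UPDATE` in SQLite so that it stays self-contained; this is a deliberate simplification and not the pessimistic row locking the post discusses (which in a server database would typically use something like `SELECT ... FOR UPDATE`). The `runs` table, its columns, and the `try_claim` helper are assumptions made for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical runs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs ("
        " run_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " claimed_by TEXT)"
    )
    return conn


def try_claim(conn, run_id, executor_id):
    """Return True only for the executor whose UPDATE flips PENDING to RUNNING."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE runs SET status = 'RUNNING', claimed_by = ? "
            "WHERE run_id = ? AND status = 'PENDING'",
            (executor_id, run_id),
        )
    # The WHERE clause matches at most once, so only one caller sees rowcount 1.
    return cur.rowcount == 1


if __name__ == "__main__":
    db = make_db()
    db.execute("INSERT INTO runs VALUES ('r1', 'PENDING', NULL)")
    print(try_claim(db, "r1", "executor-a"))
    print(try_claim(db, "r1", "executor-b"))
```

A claim like this only prevents two executors from starting the same run. It does not by itself cover an executor that crashes mid-run, which is where timeouts, retries, and idempotent job logic come in.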

♡ Drafted by PAI, verified by engineers