PATTERN
Scheduler-Agent-Supervisor
The Scheduler-Agent-Supervisor pattern addresses the challenge of reliably executing tasks in a distributed or concurrent environment. A Scheduler creates tasks and assigns them to Agents. Agents execute the tasks and report their status. A Supervisor independently monitors the Agents and, if an Agent fails, restarts it so that task execution continues. This provides fault tolerance and resilience for task processing.
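A minimal sketch of the three roles in Python may help make the division of responsibilities concrete; the in-process queue, integer task IDs, and timing values are illustrative assumptions, not a prescribed implementation:

```python
import multiprocessing as mp
import time

def agent(task_queue):
    """Agent: pulls tasks from the shared queue and executes them."""
    while True:
        task_id = task_queue.get()      # block until the Scheduler assigns work
        print(f"agent executing task {task_id}")
        time.sleep(0.1)                 # stand-in for real work

def scheduler(task_queue, n_tasks):
    """Scheduler: creates tasks and hands them to Agents via the queue."""
    for task_id in range(n_tasks):
        task_queue.put(task_id)

if __name__ == "__main__":
    queue = mp.Queue()
    scheduler(queue, n_tasks=10)

    # Supervisor: keeps one Agent process alive, restarting it if it dies.
    worker = mp.Process(target=agent, args=(queue,), daemon=True)
    worker.start()
    deadline = time.time() + 5
    while time.time() < deadline:
        if not worker.is_alive():       # Agent crashed or was killed
            worker = mp.Process(target=agent, args=(queue,), daemon=True)
            worker.start()
        time.sleep(0.5)
```

In a real deployment the three roles typically run as separate processes or services, but the responsibilities stay the same: create and assign work, execute it, and watch the executors.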
Usage
This pattern is commonly used in scenarios where tasks need to be reliably executed, even in the face of agent failures or network instability. Examples include:
- Distributed Task Queues: Systems like Celery or RQ utilize this pattern to distribute jobs across multiple worker processes. The scheduler adds jobs to the queue, agents pick them up and run them, and a monitor (often the queue system itself) restarts failing agents.
- Cron Job Management: Although cron itself is simple, more robust setups may employ a supervisor to ensure the cron daemon stays healthy and to restart it if it crashes (see the watchdog sketch after this list).
- Continuous Integration/Continuous Delivery (CI/CD): Agents execute build and deployment steps, monitored by a supervisor that handles errors and keeps pipelines running to completion.
- Monitoring Systems: Agents collect metrics from systems and report them to a central scheduler; a supervisor keeps the agents running.
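In the task-queue and cron scenarios, the supervisor role can be as small as a watchdog loop around the agent process. The sketch below assumes a hypothetical `worker.py` agent script; in practice this job is usually delegated to systemd, Supervisord, or the orchestration platform itself:

```python
import subprocess
import time

# "worker.py" is a hypothetical agent entry point; substitute your own worker command.
WORKER_CMD = ["python", "worker.py"]

def supervise():
    """Supervisor loop: restart the worker process whenever it exits."""
    while True:
        worker = subprocess.Popen(WORKER_CMD)
        worker.wait()                 # block until the agent exits or crashes
        print(f"worker exited with code {worker.returncode}; restarting in 2s")
        time.sleep(2)                 # short back-off so a crash loop does not spin

if __name__ == "__main__":
    supervise()
```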
Examples
- Celery (Python): Celery is a popular asynchronous task queue/job queue based on distributed message passing. The Scheduler is the Celery client that publishes tasks to a message broker (e.g., RabbitMQ, Redis). Agents (Celery workers) consume tasks from the broker and execute them. A supervisor process (such as systemd or Supervisord) manages the Celery worker processes, restarting them if they become unresponsive; a minimal sketch follows these examples.
- Kubernetes: Kubernetes embodies this pattern at the system level. The scheduler assigns pods (containing containers, which are the Agents) to nodes. The kubelet on each node acts as a Supervisor, continuously monitoring the health of the pods on its node and restarting failed containers, while higher-level controllers reschedule pods onto healthy nodes. The entire system is designed with agent (pod) failure in mind.
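As a rough sketch of how the Celery example maps onto the pattern, the following assumes a Redis broker at a local URL and a hypothetical `resize_image` task; the worker and its supervisor run as separate processes:

```python
from celery import Celery

# Scheduler side: the client process publishes tasks to the broker.
# The broker URL is an assumption; any supported broker (RabbitMQ, Redis) works.
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def resize_image(path):
    """Agent side: Celery workers consume this task from the broker and run it."""
    print(f"resizing {path}")

if __name__ == "__main__":
    # Publish a task; a worker started with `celery -A tasks worker` picks it up.
    # The Supervisor role is played by systemd or Supervisord, configured to
    # restart the worker process if it exits unexpectedly.
    resize_image.delay("/tmp/photo.jpg")
```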