Highly Available MAAS
Discussion around what is required to make MAAS highly available (resistant to failures).
Blueprint information
- Status: Complete
- Approver: Daniel Westervelt
- Priority: Essential
- Drafter: None
- Direction: Approved
- Assignee: None
- Definition: Obsolete
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by:
- Completed by: Adam Collard
Whiteboard
MAAS HA
=======
Components that *should* be HA
- DNS
- Web app
- RabbitMQ
- Region's Squid
SPOFs:
* Region Celery
  * We would probably need to run one Celery worker per appserver instance and use Celery's "Broadcast" queue type, so that a task is sent to every region worker consuming from the broadcast queue.
* PostgreSQL? Can be done, but it is hard; defer responsibility to charms (out of scope for the MAAS project).
* RabbitMQ does not guarantee that messages are always delivered in HA mode (a server that dies takes its messages with it).
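
The broadcast idea for region Celery can be sketched as a toy model. This is plain Python, not Celery's API: the class, worker names, and task name below are all hypothetical and only illustrate the delivery semantics, where each region worker has its own queue and every published task is copied to all of them. In Celery itself the corresponding building block is `kombu.common.Broadcast`.

```python
from collections import deque


class FanoutExchange:
    """Toy model of a broadcast (fanout) exchange: every bound queue gets a copy."""

    def __init__(self):
        self.queues = {}  # worker name -> that worker's private queue

    def bind(self, worker_name):
        # Each region worker gets its own queue bound to the exchange.
        self.queues[worker_name] = deque()
        return self.queues[worker_name]

    def publish(self, task):
        # Deliver one copy of the task to every bound worker queue.
        for queue in self.queues.values():
            queue.append(task)


def drain(queue):
    """Consume and return everything currently waiting in one worker's queue."""
    tasks = []
    while queue:
        tasks.append(queue.popleft())
    return tasks


exchange = FanoutExchange()
q1 = exchange.bind("region-worker-1")  # hypothetical worker names
q2 = exchange.bind("region-worker-2")
exchange.publish("write_dns_config")   # hypothetical task name

received_1 = drain(q1)
received_2 = drain(q2)
```

Note that this model only covers workers that are bound when the task is published; a worker that is down at that moment misses the task, which is the same class of problem as the RabbitMQ delivery caveat above.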
Other problems:
* If a cluster dies, the region controller does not know and would try to allocate machines in it
* What about pending celery jobs when a cluster dies?
* We don't look for or handle silent failures, e.g. nodes failing to netboot.
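
One possible way for the region controller to notice a dead cluster is a heartbeat with a timeout. The sketch below is an assumption for illustration, not how MAAS does it: `ClusterRegistry`, the 60-second timeout, and the cluster and node names are all hypothetical. Clusters that have not reported within the timeout are treated as dead and their machines are left out of allocation.

```python
import time


class ClusterRegistry:
    """Tracks the last heartbeat per cluster (hypothetical helper, not MAAS code)."""

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock        # injectable so the sketch can be tested
        self.last_seen = {}       # cluster name -> time of last heartbeat

    def heartbeat(self, cluster):
        # Called whenever a cluster controller reports in.
        self.last_seen[cluster] = self.clock()

    def is_alive(self, cluster):
        # A cluster is alive if it has reported within the timeout window.
        seen = self.last_seen.get(cluster)
        return seen is not None and self.clock() - seen <= self.timeout_seconds

    def allocatable(self, nodes):
        """Filter (node, cluster) pairs down to nodes whose cluster is alive."""
        return [node for node, cluster in nodes if self.is_alive(cluster)]


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
registry = ClusterRegistry(timeout_seconds=60.0, clock=lambda: now[0])
registry.heartbeat("cluster-a")
registry.heartbeat("cluster-b")
now[0] = 45.0
registry.heartbeat("cluster-a")   # cluster-b stays silent from here on
now[0] = 90.0

nodes = [("node-1", "cluster-a"), ("node-2", "cluster-b")]
available = registry.allocatable(nodes)
```

The same deadline approach could in principle be applied to individual nodes, for example flagging a node that has not checked in some time after it was told to netboot.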
To do:
* Find out if we can bin the CD installer and its related Avahi service.
* Investigate Celery's HA story
Notes on postgres HA:
* Switching masters is a manual step. Has to be.
* Multi-master is coming, according to Herb McNew.