Making sure LAVA test boards stay healthy
Registered by Paul Larson
We've started running health check jobs in the lab every day to help us spot problems in the boards, the infrastructure, or LAVA itself. There are still some things we can do to improve this.
1. Health check UI
Spring has been working on this. We should review what he has so far and discuss what else we'd like to see to make it easier to track the health of machines.
2. Automatic detection and response to problems
Once we are comfortable that these jobs will fail ONLY when there is a real problem, we should make sure the pieces are in place to automatically take the board offline and notify the team that something needs to be looked at.
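As a rough illustration of the response step described above, here is a minimal sketch in Python. It is not LAVA code: `Board`, `HealthMonitor`, `record_result`, and the `failures_before_offline` threshold are all hypothetical names made up for this example, and the notification is just a callable the caller supplies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Board:
    """Hypothetical stand-in for a test board record (not a LAVA class)."""
    name: str
    online: bool = True


@dataclass
class HealthMonitor:
    """Takes a board offline and notifies the team after repeated failures."""
    notify: Callable[[str], None]          # assumed hook, e.g. email or IRC
    failures_before_offline: int = 1       # assumed threshold, tune as needed
    _failures: Dict[str, int] = field(default_factory=dict)

    def record_result(self, board: Board, passed: bool) -> None:
        if passed:
            # A passing health check resets the consecutive-failure count.
            self._failures[board.name] = 0
            return
        count = self._failures.get(board.name, 0) + 1
        self._failures[board.name] = count
        # Only act once, when an online board reaches the threshold.
        if board.online and count >= self.failures_before_offline:
            board.online = False
            self.notify(
                f"{board.name} taken offline after "
                f"{count} failed health check(s)"
            )
```

Requiring more than one consecutive failure before acting is one way to guard against the false alarms mentioned above; whether that is the right trade-off is a decision for the team.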
Blueprint information
- Status: Complete
- Approver: Paul Larson
- Priority: Undefined
- Drafter: Spring Zhang
- Direction: Needs approval
- Assignee: Spring Zhang
- Definition: Superseded
- Series goal: None
- Implementation: Unknown
- Milestone target: None
- Started by:
- Completed by: Paul Larson