Health Checking Pattern
Certain services in flight software are critical for the correct execution of the system. For example, command dispatching is crucial to maintain control of the system. It is good practice to monitor these services to ensure they remain responsive during execution of the system. The health checking pattern is used to establish a component as a critical component of the system and periodically check it for responsiveness.
The fprime-examples repository provides an example of the health checking pattern in its Manager Component as the Manager is intended to stay responsive at all times.
Applicability
Any active component that must remain responsive for the system's continued functioning should implement the health checking pattern. Such components make up most of the active components in a system and might include:
- Command Dispatch
- Telemetry Handling
- Event Handling
- File Management
- Data Product Management
- Communications
Additionally, if a component is at risk of losing responsiveness (e.g. because of potentially long-running operations or unbounded file I/O), the health checking pattern can be applied to ensure it remains alive.
Design
An active component needing periodic health checking should implement a pair of callback ports of type Svc.Ping and should be connected to the Svc.Health component. The input Svc.Ping port must be asynchronous to ensure the check runs on the component's own thread. Upon receiving a ping, the component immediately responds by sending the same key back through its output port.
sequenceDiagram
Health->>+Component: Ping In
Component->>-Health: Ping Response
Svc.Health tracks how long it takes for the component to respond to the ping message placed on its queue. Svc.Health produces a WARNING_HI event after a configurable amount of time, followed by a FATAL event after a longer configured time. Thus the system issues a WARNING_HI event if a component does not respond, and escalates to a FATAL event (triggering a reset or other FATAL handling) if the component remains unresponsive.
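The escalation described above can be modeled in a few lines. The following is a minimal sketch, not the actual Svc.Health implementation: PingTracker, PingStatus, and their methods are hypothetical names, and the exact comparison made against each threshold is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of one health-check cycle; not part of the F Prime API.
enum class PingStatus { OK, WARNING_HI, FATAL };

// Hypothetical tracker for a single pinged component.
class PingTracker {
  public:
    PingTracker(std::uint32_t warnCycles, std::uint32_t fatalCycles)
        : m_warn(warnCycles), m_fatal(fatalCycles) {
        // The FATAL threshold is expected to be the longer of the two.
        assert(warnCycles <= fatalCycles);
    }

    // Record that a ping carrying this key was sent to the component.
    void sendPing(std::uint32_t key) {
        m_key = key;
        m_pending = true;
        m_cycles = 0;
    }

    // Record a response. Returns true only if a ping is outstanding and the key matches.
    bool receiveResponse(std::uint32_t key) {
        if (!m_pending || key != m_key) {
            return false;
        }
        m_pending = false;
        m_cycles = 0;
        return true;
    }

    // Called once per cycle: count cycles without a response and report the status.
    PingStatus tick() {
        if (!m_pending) {
            return PingStatus::OK;
        }
        ++m_cycles;
        if (m_cycles >= m_fatal) {
            return PingStatus::FATAL;
        }
        if (m_cycles >= m_warn) {
            return PingStatus::WARNING_HI;
        }
        return PingStatus::OK;
    }

  private:
    std::uint32_t m_warn;
    std::uint32_t m_fatal;
    std::uint32_t m_key = 0;
    std::uint32_t m_cycles = 0;
    bool m_pending = false;
};
```

With thresholds of 3 and 5 cycles, an unanswered ping reports WARNING_HI on the third cycle and FATAL on the fifth in this model; a matching response before then returns the tracker to OK.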
Implementation
Implementation of the health checking pattern involves placing a pair of Svc.Ping ports on your active component, one as an async input and the other as an output. Typically these ports are named pingIn and pingOut respectively.
Component Model Snippet
active component CriticalComponent {
    @ Ping input port to show responsiveness
    async input port pingIn: Svc.Ping

    @ Ping output port for response to the ping
    output port pingOut: Svc.Ping
}
The C++ implementation of the component must respond via pingOut when handling pingIn.
Component C++ Snippet
void CriticalComponent ::pingIn_handler(FwIndexType portNum, U32 key) {
    this->pingOut_out(portNum, key);
}
Note
This implementation of a component's pingIn_handler is always the same. It is safe to copy the above code verbatim as long as CriticalComponent is replaced with your component's name.
At the system topology level, you must specify a component handling system health.
topology MyTopology {
    ...
    health connections instance $health # Use instance 'health' as the handler of health (Svc.Ping) connections
}
You must also configure the delays (measured in ticks of the Health component's rate group) that trigger the warning and fatal responses from Svc.Health. This is done by defining a component instance configuration block in the PingEntries namespace under your topology's namespace:
namespace MyTopology {
    namespace PingEntries {
        // Health
        namespace criticalComponent {
            enum {
                WARN = 3,  // WARNING_HI after 3 ticks without a response from criticalComponent instance of CriticalComponent
                FATAL = 5  // FATAL after 5 ticks without a response from criticalComponent instance of CriticalComponent
            };
        }
    }
}
Note
This configuration is set for each instance of the component. In this example, criticalComponent is an instance of CriticalComponent.
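Because the thresholds are counted in ticks, their meaning in wall-clock time depends on how often the Health component runs. The sketch below repeats the example configuration and adds compile-time sanity checks plus a tick-to-time conversion. The helper ticksToMillis and the 1 Hz rate used in the check are assumptions for illustration only and are not part of F Prime.

```cpp
#include <cstdint>

namespace MyTopology {
namespace PingEntries {
namespace criticalComponent {
enum {
    WARN = 3,  // ticks without a response before WARNING_HI
    FATAL = 5  // ticks without a response before FATAL
};
}  // namespace criticalComponent
}  // namespace PingEntries
}  // namespace MyTopology

// Hypothetical helper: convert a tick count to milliseconds for a given
// rate group frequency in Hz.
constexpr std::uint32_t ticksToMillis(std::uint32_t ticks, std::uint32_t rateHz) {
    return ticks * 1000u / rateHz;
}

// The FATAL threshold is the longer of the two, so check the ordering at compile time.
static_assert(MyTopology::PingEntries::criticalComponent::WARN > 0,
              "WARN must be at least one tick");
static_assert(MyTopology::PingEntries::criticalComponent::WARN <
                  MyTopology::PingEntries::criticalComponent::FATAL,
              "FATAL must be configured later than WARN");
```

If Health were driven at 1 Hz, the values above would correspond to a warning after 3 seconds and a FATAL after 5 seconds without a response.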
Testing and Verification
The health checking pattern can be tested by a combination of unit and integration tests. For basic functionality, invoke the input Svc.Ping port in a unit test, dispatch the component, and assert the output Svc.Ping port returned the supplied key.
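The invoke, dispatch, and assert sequence can be illustrated with a self-contained stand-in. This is a sketch only: FakeActiveComponent, dispatchOne, pingOutHistory, and verifyPingEcho are hypothetical names that model the behavior of an async port and are not the F Prime generated test harness, which a real unit test would use instead.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an active component with an async pingIn port.
class FakeActiveComponent {
  public:
    // Async input: queue the ping rather than handling it on the caller's thread.
    void pingIn(std::int32_t portNum, std::uint32_t key) {
        m_queue.push_back([this, portNum, key]() { this->pingInHandler(portNum, key); });
    }

    // Process one queued message. Returns false if the queue was empty.
    bool dispatchOne() {
        if (m_queue.empty()) {
            return false;
        }
        std::function<void()> msg = std::move(m_queue.front());
        m_queue.pop_front();
        msg();
        return true;
    }

    // Keys sent on the modeled pingOut port, in order.
    const std::vector<std::uint32_t>& pingOutHistory() const { return m_history; }

  private:
    // Mirrors the pattern's handler: echo the key on the output port.
    void pingInHandler(std::int32_t /*portNum*/, std::uint32_t key) { m_history.push_back(key); }

    std::deque<std::function<void()>> m_queue;
    std::vector<std::uint32_t> m_history;
};

// The unit-test sequence: invoke the input port, dispatch, assert the echoed key.
void verifyPingEcho(std::uint32_t key) {
    FakeActiveComponent component;
    component.pingIn(0, key);
    // Nothing is sent until the component processes its queue.
    assert(component.pingOutHistory().empty());
    assert(component.dispatchOne());
    assert(component.pingOutHistory().size() == 1);
    assert(component.pingOutHistory()[0] == key);
}
```

The assertion before the dispatch is the important detail: with an async port, the response only appears once the component processes its queue, which is exactly what Svc.Health relies on to detect a stalled component.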
Integration tests can exercise this pattern together with the Svc.Health component. Use the HLTH_CHNG_PING command to set the component's ping thresholds below the time it needs to respond, then assert that the expected WARNING_HI and/or FATAL events occur.
Other Considerations
The health checking pattern can be used to monitor any active component, not just critical ones. However, take care when choosing the configured values: Svc.Health raises a FATAL event for an unresponsive component, which leads to a system reset or other FATAL handling actions.
Conclusion
The health checking pattern can be used to ensure critical services within the system remain responsive over the course of the software's execution. If a component fails to respond within the first configured duration, a WARNING_HI event results; if it is still unresponsive after the second, longer duration, a FATAL event follows.