Redundancy Manager Service

group Redundancy Manager

Aggregates component faults to determine system-wide health.

group Types

Constants, enumerations, and payload structures for fault sources, system states, and events.

Defines

REDUNDANCY_MANAGER_MAX_FAULTS

Maximum number of active faults to track simultaneously.

REDUNDANCY_MANAGER_SERVICE_UID

Service Unique Identifier (16-bit), used to construct unique event IDs. The value 0x5366 is ASCII “Sf” (SafeMode/System Fault).

REDUNDANCY_EVENT_CRITICAL_HEALTH
REDUNDANCY_EVENT_HEALTH_DEGRADED
REDUNDANCY_EVENT_HEALTH_RECOVERED
REDUNDANCY_EVENT_COMPONENT_DEGRADED
REDUNDANCY_EVENT_COMPONENT_RECOVERED
REDUNDANCY_EVENT_HEALTH_RESPONSE
REDUNDANCY_EVENT_COMPONENT_STATUS_RESPONSE
REDUNDANCY_EVENT_FAULT_LIST_RESPONSE
REDUNDANCY_EVENT_TELEMETRY
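
The note on REDUNDANCY_MANAGER_SERVICE_UID says the 16-bit UID is used to construct unique event IDs, but the packing is not documented here. The sketch below shows one plausible layout, assuming the UID occupies the upper 16 bits of a 32-bit ID and a per-service index occupies the lower 16 bits. The helper names (make_event_id, event_id_service, event_id_index) are hypothetical and are not part of redundancy_manager.h.

```c
#include <stdint.h>

/* Value taken from the "0x5366 = Sf" note; redefined here only so the
 * sketch compiles on its own. */
#define REDUNDANCY_MANAGER_SERVICE_UID 0x5366u

/* ASSUMED layout: [31:16] service UID, [15:0] per-service event index.
 * The real header may pack event IDs differently. */
static inline uint32_t make_event_id(uint16_t service_uid, uint16_t local_index)
{
    return ((uint32_t)service_uid << 16) | (uint32_t)local_index;
}

/* Recover the owning service from an event ID built by make_event_id(). */
static inline uint16_t event_id_service(uint32_t event_id)
{
    return (uint16_t)(event_id >> 16);
}

/* Recover the per-service index from an event ID built by make_event_id(). */
static inline uint16_t event_id_index(uint32_t event_id)
{
    return (uint16_t)(event_id & 0xFFFFu);
}
```

Under this assumed layout, a subscriber could filter on event_id_service() to receive only Redundancy Manager events without listing each ID.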

Typedefs

typedef uint32_t fault_code_t

Service-specific error code.

Specific codes are defined in the reporting service’s header. Example: a fault from FAULT_SOURCE_RAIL might carry the code RAIL_FAULT_OVERCURRENT.

Enums

enum redundancy_manager_event_id_t

Events published by the Redundancy Manager.

Values:

enumerator REDUNDANCY_EVENT_CRITICAL_HEALTH

Published when system enters critical health state.

Applications should initiate Safe Mode transition.

Payload: system_health_t (always SYSTEM_HEALTH_FAULT)

enumerator REDUNDANCY_EVENT_HEALTH_DEGRADED

Published when system health becomes degraded.

Mission can continue with reduced capability. Services may need to adapt (e.g., reduce power consumption, disable non-critical features).

Payload: system_health_t (always SYSTEM_HEALTH_DEGRADED)

enumerator REDUNDANCY_EVENT_HEALTH_RECOVERED

Published when system recovers to nominal health.

All critical and degraded faults have been cleared.

Payload: system_health_t (always SYSTEM_HEALTH_OK)

enumerator REDUNDANCY_EVENT_COMPONENT_DEGRADED

Published when a specific component becomes degraded.

Indicates a component failure with fallback available. Affected services should switch to backup/redundant hardware.

Example: Primary UART fails → switch to secondary UART (COMPONENT_UART_SECONDARY)

Payload: component_degradation_t

enumerator REDUNDANCY_EVENT_COMPONENT_RECOVERED

Published when a degraded component recovers.

Services may optionally switch back to primary hardware.

Payload: component_id_t

enumerator REDUNDANCY_EVENT_HEALTH_RESPONSE

Published in response to REQUEST_REDUNDANCY_HEALTH query.

Payload: health_response_t

enumerator REDUNDANCY_EVENT_COMPONENT_STATUS_RESPONSE

Published in response to REQUEST_REDUNDANCY_COMPONENT_STATUS query.

Payload: component_status_response_t

enumerator REDUNDANCY_EVENT_FAULT_LIST_RESPONSE

Published in response to REQUEST_REDUNDANCY_FAULT_LIST query.

Sent as multiple events (chunked response) when the fault list does not fit in one payload; each fault_list_response_t carries up to 4 faults.

Payload: fault_list_response_t

enumerator REDUNDANCY_EVENT_TELEMETRY

Periodic telemetry broadcast (e.g., every 30 seconds).

Contains summary of system health and fault counts.

Payload: redundancy_telemetry_t
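
As an illustration of the REDUNDANCY_EVENT_COMPONENT_DEGRADED and REDUNDANCY_EVENT_COMPONENT_RECOVERED descriptions, the sketch below shows how a service might pick its active UART from the event payloads. The types are cut-down local stand-ins for the documented ones (only the fields used here), and the handler names on_component_degraded and on_component_recovered are hypothetical. How handlers are registered with the event bus is not covered by this page and is left out.

```c
#include <stdbool.h>

/* Local stand-ins: a subset of the documented component_id_t values. */
typedef enum {
    COMPONENT_UART_PRIMARY,
    COMPONENT_UART_SECONDARY
} component_id_t;

/* Stand-in for component_degradation_t; fault_source is omitted. */
typedef struct {
    component_id_t component;
    bool fallback_available;
} component_degradation_t;

/* Returns the UART to use after a COMPONENT_DEGRADED event.
 * Switches only if the degraded component is the primary UART currently
 * in use and the payload reports that a fallback exists. */
static component_id_t on_component_degraded(component_id_t active_uart,
                                            const component_degradation_t *d)
{
    if (d->component == COMPONENT_UART_PRIMARY &&
        active_uart == COMPONENT_UART_PRIMARY &&
        d->fallback_available) {
        return COMPONENT_UART_SECONDARY;
    }
    return active_uart;
}

/* Returns the UART to use after a COMPONENT_RECOVERED event.
 * Switching back is optional per the event description; this sketch
 * chooses to return to the primary as soon as it recovers. */
static component_id_t on_component_recovered(component_id_t active_uart,
                                             component_id_t recovered)
{
    if (recovered == COMPONENT_UART_PRIMARY) {
        return COMPONENT_UART_PRIMARY;
    }
    return active_uart;
}
```

A service that prefers stability over restoring the primary link could simply ignore the recovered event, which the description above permits.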

enum fault_source_t

Identifies the subsystem reporting a failure.

Values:

enumerator FAULT_SOURCE_BATTERY

BMS issues (voltage, temperature, etc.)

enumerator FAULT_SOURCE_MPPT

Solar charging failures

enumerator FAULT_SOURCE_RAIL

Rail controller (overcurrent, enable failures)

enumerator FAULT_SOURCE_SENSOR

I2C/SPI sensor timeouts or bad data

enumerator FAULT_SOURCE_UART

UART communication errors

enumerator FAULT_SOURCE_WATCHDOG

Watchdog timeout or service hang

enumerator FAULT_SOURCE_MEMORY

Flash/EEPROM errors

enumerator FAULT_SOURCE_COUNT

Number of fault sources

enum fault_severity_t

Severity classification for individual faults.

Values:

enumerator FAULT_SEVERITY_INFO

Informational, no action required

enumerator FAULT_SEVERITY_WARNING

Potential issue, monitor closely

enumerator FAULT_SEVERITY_DEGRADED

Component degraded, fallback available

enumerator FAULT_SEVERITY_CRITICAL

Critical failure, Safe Mode required

enum system_health_t

High-level classification of EPS (electrical power system) health.

Used by applications to drive state transitions.

Values:

enumerator SYSTEM_HEALTH_OK

All systems nominal

enumerator SYSTEM_HEALTH_DEGRADED

Non-critical faults, mission continues

enumerator SYSTEM_HEALTH_FAULT

Critical failure, requires Safe Mode
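
The mapping from individual fault severities to system_health_t is implied by the descriptions above but not spelled out. A minimal sketch of one consistent rule follows, assuming that any active critical fault yields SYSTEM_HEALTH_FAULT, any active degraded fault yields SYSTEM_HEALTH_DEGRADED, and info or warning faults leave health unchanged. The enums are local copies of the documented ones; sketch_fault_t and aggregate_health are hypothetical names, and the manager's actual aggregation logic may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copies of the documented enums so the sketch compiles alone. */
typedef enum {
    FAULT_SEVERITY_INFO,
    FAULT_SEVERITY_WARNING,
    FAULT_SEVERITY_DEGRADED,
    FAULT_SEVERITY_CRITICAL
} fault_severity_t;

typedef enum {
    SYSTEM_HEALTH_OK,
    SYSTEM_HEALTH_DEGRADED,
    SYSTEM_HEALTH_FAULT
} system_health_t;

/* Cut-down stand-in for fault_t: only the fields this rule needs. */
typedef struct {
    fault_severity_t severity;
    bool active;
} sketch_fault_t;

/* ASSUMED rule: the worst active severity decides the system health. */
static system_health_t aggregate_health(const sketch_fault_t *faults, size_t n)
{
    system_health_t health = SYSTEM_HEALTH_OK;
    for (size_t i = 0; i < n; i++) {
        if (!faults[i].active) {
            continue; /* cleared faults do not affect health */
        }
        if (faults[i].severity == FAULT_SEVERITY_CRITICAL) {
            return SYSTEM_HEALTH_FAULT; /* nothing can be worse */
        }
        if (faults[i].severity == FAULT_SEVERITY_DEGRADED) {
            health = SYSTEM_HEALTH_DEGRADED;
        }
    }
    return health;
}
```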

enum component_id_t

Identifiers for components with redundancy/fallback options.

Values:

enumerator COMPONENT_UART_PRIMARY

Primary UART (port 1)

enumerator COMPONENT_UART_SECONDARY

Secondary UART (port 3)

enumerator COMPONENT_I2C_BUS_1

I2C bus 1

enumerator COMPONENT_I2C_BUS_2

I2C bus 2

enumerator COMPONENT_I2C_BUS_3

I2C bus 3

enumerator COMPONENT_I2C_BUS_4

I2C bus 4

enumerator COMPONENT_SOLAR_STRING_1

Solar panel string 1

enumerator COMPONENT_SOLAR_STRING_2

Solar panel string 2

enumerator COMPONENT_SOLAR_STRING_3

Solar panel string 3

enumerator COMPONENT_SOLAR_STRING_4

Solar panel string 4

enumerator COMPONENT_SOLAR_STRING_5

Solar panel string 5

enumerator COMPONENT_SOLAR_STRING_6

Solar panel string 6

enumerator COMPONENT_COUNT

Number of tracked components

struct fault_t
#include <redundancy_manager.h>

Represents a single active fault in the system.

Public Members

fault_source_t source

Subsystem that reported the fault

fault_code_t code

Service-specific error code

fault_severity_t severity

Severity classification

uint32_t timestamp_ms

When the fault was first detected

uint32_t count

Number of times this fault has occurred

bool active

True if fault is currently active
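
To make the roles of timestamp_ms, count, and active concrete, here is a hedged sketch of recording a fault into a fixed-size table. It assumes a repeated report of the same (source, code) pair increments count and keeps the first-detection timestamp, and that a new fault takes the first inactive slot. The table size, the reduced struct, and sketch_record_fault are all hypothetical; the real manager may deduplicate or evict differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_FAULTS 8u /* hypothetical; the real limit is REDUNDANCY_MANAGER_MAX_FAULTS */

/* Stand-in for fault_t with plain integers for source and code;
 * severity is omitted because this sketch does not use it. */
typedef struct {
    uint32_t source;
    uint32_t code;
    uint32_t timestamp_ms; /* first detection time */
    uint32_t count;        /* number of occurrences */
    bool active;
} sketch_fault_t;

/* Records a fault report. Returns the slot index used, or -1 if the
 * fault is new and the table has no inactive slot left. */
static int sketch_record_fault(sketch_fault_t *table, uint32_t source,
                               uint32_t code, uint32_t now_ms)
{
    int free_slot = -1;
    for (size_t i = 0; i < SKETCH_MAX_FAULTS; i++) {
        if (table[i].active && table[i].source == source &&
            table[i].code == code) {
            table[i].count++; /* repeat: keep the original timestamp */
            return (int)i;
        }
        if (!table[i].active && free_slot < 0) {
            free_slot = (int)i; /* remember the first free slot */
        }
    }
    if (free_slot >= 0) {
        table[free_slot] = (sketch_fault_t){ source, code, now_ms, 1u, true };
    }
    return free_slot;
}
```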

struct component_degradation_t
#include <redundancy_manager.h>

Payload for component degradation events.

Public Members

component_id_t component

Which component is degraded

fault_source_t fault_source

What caused the degradation

bool fallback_available

True if fallback/redundant option exists

struct health_response_t
#include <redundancy_manager.h>

Response payload for health queries.

Public Members

system_health_t health

Current system health

uint32_t active_fault_count

Number of active faults

uint32_t timestamp_ms

When health was sampled

struct component_status_request_t
#include <redundancy_manager.h>

Request payload to query specific component status.

Public Members

component_id_t component

Which component to query

struct component_status_response_t
#include <redundancy_manager.h>

Response payload for component status queries.

Public Members

component_id_t component

Requested component

bool is_ok

True if operational, false if degraded

fault_source_t fault_source

What caused degradation (if degraded)

uint32_t timestamp_ms

When status was sampled

struct fault_list_response_t
#include <redundancy_manager.h>

Response payload for fault list queries.

Contains a subset of active faults. May require multiple events for complete fault list.

Public Members

uint32_t total_faults

Total number of active faults

uint32_t chunk_index

Which chunk this is (0-based)

uint32_t faults_in_chunk

Number of faults in this response

fault_t faults[4]

Up to 4 faults per response
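
Because each response carries at most 4 faults, a receiver can work out how many chunks to expect from total_faults. The sketch below shows that arithmetic, assuming the manager fills every chunk completely except possibly the last and numbers them from 0 as documented. The helper names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_FAULTS_PER_CHUNK 4u /* matches fault_t faults[4] */

/* Number of fault_list_response_t events needed for total_faults.
 * Written without (total + 3) to avoid overflow near UINT32_MAX. */
static uint32_t fault_chunk_count(uint32_t total_faults)
{
    return total_faults / SKETCH_FAULTS_PER_CHUNK +
           ((total_faults % SKETCH_FAULTS_PER_CHUNK) != 0u ? 1u : 0u);
}

/* Expected faults_in_chunk for a given 0-based chunk_index, assuming
 * all chunks before the last are full. Returns 0 past the last chunk. */
static uint32_t expected_faults_in_chunk(uint32_t total_faults,
                                         uint32_t chunk_index)
{
    if (chunk_index >= fault_chunk_count(total_faults)) {
        return 0u;
    }
    uint32_t remaining = total_faults - chunk_index * SKETCH_FAULTS_PER_CHUNK;
    return remaining < SKETCH_FAULTS_PER_CHUNK ? remaining
                                               : SKETCH_FAULTS_PER_CHUNK;
}
```

A receiver could treat the list as complete once it has seen chunk indices 0 through fault_chunk_count(total_faults) - 1, and compare each faults_in_chunk against the expected value as a sanity check.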

struct redundancy_telemetry_t
#include <redundancy_manager.h>

Periodic telemetry payload.

Public Members

system_health_t health

Current health

uint32_t active_fault_count

Active faults

uint32_t total_faults_since_boot

Lifetime fault counter

uint32_t degraded_components

Bitmask of degraded components

uint32_t timestamp_ms

Telemetry timestamp
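
The bit assignment of degraded_components is not stated on this page. A natural reading, assumed in the sketch below, is that bit n corresponds to the component_id_t enumerator with value n, with enumerators numbered from 0 in the order documented. The enum is a local copy, and the two helpers are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copy of component_id_t, in the documented order. */
typedef enum {
    COMPONENT_UART_PRIMARY,
    COMPONENT_UART_SECONDARY,
    COMPONENT_I2C_BUS_1,
    COMPONENT_I2C_BUS_2,
    COMPONENT_I2C_BUS_3,
    COMPONENT_I2C_BUS_4,
    COMPONENT_SOLAR_STRING_1,
    COMPONENT_SOLAR_STRING_2,
    COMPONENT_SOLAR_STRING_3,
    COMPONENT_SOLAR_STRING_4,
    COMPONENT_SOLAR_STRING_5,
    COMPONENT_SOLAR_STRING_6,
    COMPONENT_COUNT
} component_id_t;

/* ASSUMPTION: bit n of the mask is set when component n is degraded. */
static bool component_is_degraded(uint32_t degraded_components,
                                  component_id_t id)
{
    if ((unsigned)id >= (unsigned)COMPONENT_COUNT) {
        return false; /* out-of-range IDs are never reported degraded */
    }
    return (degraded_components & (UINT32_C(1) << (unsigned)id)) != 0u;
}

/* Counts degraded components, ignoring bits beyond COMPONENT_COUNT. */
static unsigned degraded_component_count(uint32_t degraded_components)
{
    unsigned n = 0u;
    for (unsigned i = 0u; i < (unsigned)COMPONENT_COUNT; i++) {
        if ((degraded_components >> i) & 1u) {
            n++;
        }
    }
    return n;
}
```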

struct redundancy_manager_t
#include <redundancy_manager.h>

The redundancy manager state.

Public Members

fault_t faults[REDUNDANCY_MANAGER_MAX_FAULTS]

Active fault list

system_health_t health

Current aggregated system health

bool component_status[COMPONENT_COUNT]

True = OK, False = Degraded

uint32_t total_fault_count

Total faults since boot (for telemetry)

bool initialized

True if initialized

group Public API

Functions for initializing the redundancy manager.

All interaction with the redundancy manager occurs through events. The manager publishes health updates and responds to query requests via the event bus.

Functions

void redundancy_manager_init(redundancy_manager_t *manager)

Initialize the Redundancy Manager.

Clears all faults, sets system health to SYSTEM_HEALTH_OK, and subscribes to:

  • Fault events from all services (e.g., BATTERY_EVENT_FAULT_DETECTED)

  • Query request events from applications

  • Recovery events from services

Publishes an initial REDUNDANCY_EVENT_HEALTH_RECOVERED event once initialization completes.

Parameters:

manager[inout] The redundancy manager to initialize.
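
The state-reset portion of initialization can be sketched as follows. This is not the library's implementation: it works on a stand-in struct with the same fields as redundancy_manager_t, uses hypothetical array sizes, assumes every component starts in the OK state, and leaves out the event-bus subscriptions and the initial publish, whose APIs are not documented on this page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_FAULTS      16 /* hypothetical value */
#define SKETCH_COMPONENT_COUNT 12 /* number of documented components */

typedef enum {
    SKETCH_HEALTH_OK,
    SKETCH_HEALTH_DEGRADED,
    SKETCH_HEALTH_FAULT
} sketch_health_t;

/* Reduced stand-in for fault_t. */
typedef struct {
    uint32_t code;
    uint32_t count;
    bool active;
} sketch_fault_t;

/* Stand-in mirroring the documented redundancy_manager_t fields. */
typedef struct {
    sketch_fault_t faults[SKETCH_MAX_FAULTS];
    sketch_health_t health;
    bool component_status[SKETCH_COMPONENT_COUNT]; /* true = OK */
    uint32_t total_fault_count;
    bool initialized;
} sketch_manager_t;

/* Clears all faults and sets health to OK, as the init description states.
 * Marking every component OK is an assumption of this sketch. */
static void sketch_manager_reset(sketch_manager_t *m)
{
    if (m == NULL) {
        return;
    }
    memset(m, 0, sizeof *m); /* all faults inactive, counters zero */
    m->health = SKETCH_HEALTH_OK;
    for (size_t i = 0; i < SKETCH_COMPONENT_COUNT; i++) {
        m->component_status[i] = true;
    }
    m->initialized = true;
}
```

In real code the caller would typically keep a statically allocated redundancy_manager_t and pass its address to redundancy_manager_init once at startup, before any service begins publishing fault events.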