Resilient Failure Model
Overview
Resilient’s failure model is designed for safety-critical embedded systems. It unifies compile-time guarantees, runtime diagnostics, recoverable faults, and explicit recovery semantics into one coherent framework.
The model answers four core questions:
- What can fail? (contract violations, invalid operations, resource exhaustion)
- How does it fail? (panics, errors, recoverable exceptions, proof of safety)
- What can recover? (explicit
live {}blocks, error handling, fault tolerance) - What is statically guaranteed? (memory safety, type safety, provable contracts)
Failure Categories
Category 1: Compile-Time Guarantees (No Runtime Failure)
Definition: Errors provably impossible at compile time.
Guarantee: Code that passes typechecking is guaranteed not to produce these errors at runtime.
Examples:
- Memory safety violations (use-after-free, double-free)
- Type mismatches in expressions
- Dangling pointer dereferences
- Array out-of-bounds access (statically known)
- Missing function parameters
Mechanism:
- Rust-like borrow checker
- Type system constraints
- Bounds checking at compile time
- Ownership and lifetime rules
Error reporting: Compiler diagnostic, non-recoverable.
Category 2: Runtime Errors (Recoverable via Result/try)
Definition: Errors that occur at runtime but can be caught and handled.
Guarantee: Error is either handled via Result / try or causes program termination.
Examples:
- Division by zero
- Array access (dynamically unknown bounds)
- Integer overflow (checked arithmetic)
- File I/O failures
- Network timeouts
- Resource allocation failures
Mechanism:
Result<T, E>return typestryexpressions for error propagation- Match expressions for error handling
Example:
fn safe_divide(a: int, b: int) -> Result<int, string> {
if b == 0 {
return Err("division by zero");
}
return Ok(a / b);
}
fn main() {
match safe_divide(10, 2) {
Ok(result) => { print(result); }
Err(msg) => { eprintln(msg); }
}
}
Error reporting: Result or panic (if unhandled).
Category 3: Recoverable Faults (via live {} blocks)
Definition: Failures from which execution can recover and continue from a known safe state.
Guarantee: Faults trigger explicit recovery code; without recovery, execution terminates.
Examples:
- Invalid input format
- Expected data not received
- Sensor reading out of range
- Configuration mismatch
- Transient hardware faults
Mechanism:
live { attempt } recover { recovery_code }blocks- Automatic state snapshots and rollback
- Fault injection testing with Stateright
Example:
live {
sensor_reading = read_sensor(); // May return bad value
if sensor_reading < MIN || sensor_reading > MAX {
fault "invalid sensor";
}
} recover {
sensor_reading = default_value;
log_fault("sensor out of range");
}
process(sensor_reading); // Always valid after recover
Guarantees:
- If
attemptsucceeds, execution continues normally - If
attemptsignals fault,recoverexecutes - After recovery, state is guaranteed to be valid per contract
- No partial state leaks between attempt and recovery
Error reporting: Fault signal + recovery execution + continue.
Category 4: Contract Violations (Checked at Entry/Exit)
Definition: Precondition and postcondition violations on functions.
Guarantee (with Z3 verifier):
- If verified: no contract violation is possible
- If not verified: violations cause runtime error with diagnostic
Example:
#[requires(x >= 0)]
#[requires(y >= 0)]
#[ensures(return >= x and return >= y)]
fn max(int x, int y) -> int {
if x > y { return x; }
return y;
}
Precondition violation: Caller passes invalid arguments → runtime error
Postcondition violation: Function returns invalid result → runtime error
Mechanism:
- Contract assertions at function entry/exit
- Optional SMT verification (Z3) for static proof
- Runtime checking when not verified
Error reporting: Contract assertion failure + diagnostic.
Category 5: Panic (Unrecoverable System Failure)
Definition: A state from which safe recovery is impossible.
Guarantee: Program terminates immediately with diagnostic.
Examples:
- Invariant violation
- Unreachable code path
- Stack overflow
- Out-of-memory (allocator failure)
Mechanism:
panic!()macro- Invariant violation detection
- Stack/memory exhaustion detection
- Uncaught exceptions
Rule: Panics should not occur in production-grade code. If a panic is possible, it’s a bug that should be fixed via proper error handling.
Error reporting: Panic message + stack trace + termination.
Effect Annotations and Failure
Pure Functions (No Effects)
fn add(int x, int y) -> int {
return x + y; // No failures possible
}
Guarantee: Function cannot fail (no I/O, no allocation, no division).
I/O Functions
fn read_file(path: string) -> Result<string, string> ! IO {
// May fail: file not found, permission denied, read error
}
Guarantee: Failures are only via Result or panic.
Mutation Functions
fn update_counter(counter: &mut int) -> unit ! Mutation {
*counter += 1; // Cannot fail (mutation is atomic)
}
Guarantee: Mutation itself cannot fail; side effects are atomic from caller’s perspective.
Error Handling Patterns
Pattern 1: Propagate Errors with try
fn read_and_process(path: string) -> Result<string, string> {
let data = try read_file(path); // Errors propagate
let result = try parse_data(data); // Errors propagate
return Ok(result);
}
Pattern 2: Recover with live {}
live {
let value = fetch_from_network();
if value < 0 {
fault "invalid network value";
}
} recover {
value = cached_value;
}
process_value(value); // Guaranteed to be valid
Pattern 3: Convert Panics to Errors
fn safe_divide(a: int, b: int) -> Result<int, string> {
if b == 0 {
return Err("division by zero");
}
return Ok(a / b);
}
Pattern 4: Contract-Based Safety
#[requires(items.len() > 0)]
#[ensures(return >= 0)]
fn find_minimum(items: &array<int>) -> int {
let min = items[0];
for i in 1..items.len() {
if items[i] < min {
min = items[i];
}
}
return min;
}
// Caller must ensure non-empty array
fn main() {
let data = [1, 2, 3];
let min = find_minimum(&data); // Precondition satisfied
}
Verification and Proof
Static Verification (Z3-based)
#[verify] // Request formal verification
fn safe_add(x: int, y: int) -> int {
return x + y;
}
After verification:
safe_addis proven to never overflow (for bounded inputs)- Z3 generates a proof
- No runtime checking needed for this function
Runtime Checking (Fallback)
#[requires(x >= 0 and y >= 0)]
fn checked_add(int x, int y) -> int {
// Runtime check: x >= 0 and y >= 0
return x + y; // May overflow but precondition verified
}
Safety Guarantees
Memory Safety
| Guarantee | Mechanism |
|---|---|
| No use-after-free | Borrow checker + lifetime rules |
| No double-free | Ownership model |
| No dangling pointers | Reference lifetime checking |
| No data races | Exclusive access rules + effect tracking |
| No buffer overflows | Bounds checking + static analysis |
Type Safety
| Guarantee | Mechanism |
|---|---|
| No type mismatches | Compile-time type checking |
| No invalid casts | Type system constraints |
| No null pointer dereferences | Non-null by default (Option for nullable) |
| No uninitialized variables | Must-init checking |
Concurrency Safety
| Guarantee | Mechanism |
|---|---|
| No data races | Exclusive vs. shared access rules |
| No deadlocks (future) | Effect types + Stateright verification |
| No use-after-free in concurrent code | Lifetime + borrow rules |
Failure Timeline
┌─────────────────┐
│ Function called │
└────────┬────────┘
│
v
┌─────────────────────────────┐
│ Check preconditions │
│ (if @requires present) │
└────────┬──────────┬─────────┘
│ │
│ FAIL │ OK
v │
Error │
(Contract) │
v
┌──────────────────┐
│ Execute function │
│ body │
└────┬────┬────┬───┘
│ │ │
┌───────┘ │ └──────┐
v v v
Success Error/Fault Panic
(Normal) (Recoverable) (Fatal)
│ │ │
v v v
┌──────────┐ ┌────────────┐ ┌──────┐
│Continue │ │Live/Recover│ │Abort │
└────┬─────┘ │or try/match│ └──────┘
│ └────────────┘
v
┌─────────────────┐
│ Check │
│ postconditions │
└────┬────────┬───┘
│ │
│ OK │ FAIL
v v
Return Error
(Contract)
When to Use Each Failure Mode
| Scenario | Recommended | Why |
|---|---|---|
| File not found | Result<T, Error> |
Expected, recoverable |
| Invalid JSON | Result<T, Error> + try |
Expected, needs specific handling |
| Sensor out of range | live {} recover {} |
Transient, needs fallback |
| Precondition violated | Contract assertion | Caller bug, should not happen |
| Out of memory | Panic | System failure, cannot recover |
| Network timeout | Result<T, Error> |
Expected in distributed systems |
| Invariant broken | panic!() |
Internal logic error |
| User input mismatch | live {} recover {} |
Transient, auto-recovery |
Best Practices
1. Be Explicit About Failure
// Good
fn parse_int(s: string) -> Result<int, string> {
// ... parsing logic ...
}
// Avoid: silent fallback
fn parse_int_unsafe(s: string) -> int {
// may panic on bad input
}
2. Prefer Errors Over Panics in Libraries
// Library function: use Result
pub fn query_database(sql: string) -> Result<Data, string> {
// Return errors, don't panic
}
// Application code: can panic if unrecoverable
fn main() {
let data = query_database("SELECT ...")
.expect("database must be accessible");
}
3. Use live {} for Transient Faults
// Good: recover from transient sensor failure
live {
reading = sensor.read();
} recover {
reading = last_known_good_value;
}
// Avoid: silently ignoring failures
let reading = sensor.read_or_zero(); // Hides real problems
4. Document Contract Assumptions
#[requires(list.len() > 0, "list must not be empty")]
fn head<T>(list: &array<T>) -> T {
return list[0];
}
5. Test Error Paths
#[test]
fn test_division_by_zero() {
let result = safe_divide(10, 0);
assert!(result.is_err());
assert!(result.unwrap_err().contains("division"));
}
Roadmap
v0.3: Enhance Error Messages
- Better contract violation diagnostics
- Suggestions for fixing common errors
- Error backtrace with function names
v0.4: Recovery Semantics
- Formalize
live {} recover {}semantics with Stateright - Add fault injection testing
- Document recovery patterns
v0.5: SMT Verification
- Z3 verification for all contract forms
- Automatic precondition inference
- Proof generation and storage
v0.6+: Future Directions
- Exception types (checked exceptions)
- Async error handling (Future<T, E>)
- Distributed system failure modes
References
- RES-3505: Consolidate the failure and recovery semantics
- RES-3504: Specify and enforce the memory model
- MEMORY_MODEL.md: Memory safety model
- TYPE_SYSTEM_ROADMAP.md: Type system design