Dom Williams

Application-wide panic handling

A panic in Rust is the final resort when the code reaches an erroneous or impossible state. The stack of the thread it occurred on is unwound, every value that falls out of scope dropped, and the thread terminated. This is isolated to that specific thread1, which allows other threads to detect and gracefully handle the error condition by e.g. spawning a new thread to replace it, or logging and exiting cleanly.

In most programs this is pretty useful, but I've been tripped up enough by this behaviour in my game engine to warrant implementing an alternate method of panic handling.

All panics are terminal

Ideally, no panics should ever occur. Clearly this isn't possible, so I'm striving for the next best thing - immediate detection and graceful shutdown. Any panic on any thread should be treated as a hard error, and the main thread should attempt to save all game data and exit as soon as possible.

This provides the benefit of increased confidence in finding error conditions during testing, especially automated end-to-end tests where the game is run headless with debug asserts enabled. I've been bitten by this many times, when a worker thread panics but the main thread continues on oblivious, exiting cleanly and marking the test as a "success". It can be frustrating to hunt down bugs like these where the error is swallowed and silently breaks things2, as I describe in the section below.

The pain of silent panics

My game engine makes heavy use of thread pools to process expensive workloads over several ticks without affecting the frame rate - examples include generating chunks of terrain, or calculating paths for navigation. The main thread will post work to the thread pool via a channel, then consume available results each tick, all without blocking. A pseudo-code example:

// set up channels
let (request_tx, request_rx) = channel();
let (result_tx, result_rx) = channel();

// spawn worker thread
std::thread::spawn(move || {
    while let Ok(req) = request_rx.recv() {
        // process request (potentially taking a long time)
        // and post response back
        let result = do_work(req);
        result_tx.send(result).unwrap();
    }
});

// core game loop
loop {
    // submit work to worker threads
    for chunk in required_chunks() {
        request_tx.send(chunk).unwrap(); // does not block
    }

    // consume completed results, again without blocking
    while let Ok(res) = result_rx.try_recv() {
        // ... use result ...
    }

    // ...
}   

This works swimmingly until something causes a panic in the worker thread. Depending on the exact setup, one or more of the following could happen:

I've seen all of these happen, and any of the last 4 are a real pain to either notice or track down. The most amusing is #4; the main thread has requested a chunk of terrain and now waits forever for it, leaving a gaping hole in the world where that terrain should be4.

Screenshot of single missing chunkA panic killed the thread before it could finish processing that chunk, so I guess they'll just have to learn to live without it.

I've written a small crate to avoid this happening again, which is the whole point of this post.

panik-rs logoI refuse to apologise for this.

panik-rs to the rescue

panik is a simple wrapper around the standard panic functionality, and does the following:

The docs include some basic examples, but I want to show here how the API fits into the code base it was designed for.

Example in context

Below is a cut-down snippet from the game engine's main. The game spends its whole runtime inside the context of panik::run_and_handle_panics, which boils down to installing a custom panic hook and a catch_unwind, which catches any panics on the main thread and prevents them from bubbling up to kill the program too soon.

// ... initialize game state ....

// run the game inside panik's handler
let result = panik::run_and_handle_panics(|| { do_main() });

let exit_code = match result {
    None => {
        // 1 or more panics occurred which have already been
        // logged, nothing more to do
        1
    }
    Some(_) => 0, // no panics occurred
};

// ... optionally use panik::panics() to programatically access
//     panic messages, backtraces etc ...

// ... attempt to save any game data ...

std::process::exit(exit_code);

Now that panics will be detected by the custom panic handler, we need a way to interrupt the main thread if a panic occurred on different thread, in order to gracefully exit. This is done in two places:

The first was an early cause of annoyance; the game initially requests a radius of terrain around the player's position, then blocks with a timeout until the world is available. While the game is in an early state, there isn't much to do and world generation finishes quite quickly, but if something panics it will sit there and wait until the timeout has elapsed5. That code now looks like this:

pub fn block_until_world_loaded(
    &mut self,
    timeout: Duration,
    bail: &impl Fn() -> bool,
) -> Result<SlabLocation, TerrainSourceError> {
    let end_time = Instant::now() + timeout;
    loop {
        if bail() {
            break Err(TerrainSourceError::Bailed);
        }

        let this_timeout = {
            let now = Instant::now();
            if now >= end_time {
                break Err(TerrainSourceError::TimedOut);
            }
            let left = end_time - now;
            left.min(Duration::from_secs(1)).max(Duration::from_secs(3))
        };

        // ... block for `this_timeout` and return `Ok(res)`
        //     if a result is returned ...
    }
}

The lines of interest are highlighted - a function pointer is passed in which triggers an early exit if it returns true when called with no arguments. The limit on the timeout and the loop mean that bail() will be called periodically. As you've probably guessed, a function that fits that function signature is panik::has_panicked!

The other place this is checked is every tick in the game loop. As soon as a panic is detected the loop is broken with an error:

// core game loop
loop {
        if panik::has_panicked() {
            debug!("breaking out of loop due to panics");
            break Exit::Stop;
        }

        // ... tick and render ...
}

That's it! The API is pretty simple and easy enough to integrate into a code base.

Before/after comparison

This is what a panic in a worker thread during world initialisation looked like before. The panic and backtrace is mixed in with other logs, and the game window was unresponsive for 30 seconds before dying.

...
  000000 DEBG populating chunk with slabs
  00000thread '0wrld-worker-0' panicked at 'intentional panic!!11!', /home/.../name-needed/game/world/src/loader/finalizer.rs:84: 21
stack backtrace:
DEBG removed 0 nodes in slab range, upper: SlabIndex(4), lower: SlabIndex(-1), removed: 0
  000000 DEBG added 960 edges and 256 nodes, area: ChunkArea { slab: SlabIndex(1), area: SlabAreaIndex(1) }, nodes: 256, edges: 960
   0: std::panicking::begin_panic
   1: world::loader::finalizer::SlabFinalizer<C>::finalize::{{closure}}
   2: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   ... lots of frames omitted ...
  45: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
... 30 seconds later ...
000000 CRIT critical error, error: Timed out
000000 CRIT more detail, error: TimedOut
000000 INFO waiting 2 seconds to allow other threads to finish logging, seconds: 2
000000 INFO exiting cleanly, code: 1

But now, with the power of panik:

  000000 DEBG populating chunk with slabs
000000 ERRO handling panic on thread ThreadId(9) (wrld-worker-3): 'intentional panic!!11!'
000000 WARN panic occurred in another thread, swallowing unpanicked result: Err(Error(Bailed))
000000 ERRO 1 threads panicked, count: 1
000000 CRIT panic on thread "ThreadId(9) (wrld-worker-3)": "intentional panic!!11!"
   0: panik::register_panic
   1: panik::GlobalStateGuard::init::{{closure}}
   ... lots of frames omitted ...
  62: __GI___clone
000000 INFO waiting 2 seconds to allow other threads to finish logging, seconds: 2
000000 INFO exiting cleanly, code: 1

The logs are clear and not intermingled, and the game exited immediately. Much better!

Watch out for async runtimes

Since moving my manual thread pools to tokio, panics have become more of a pain, especially unconditional panics when I forget to remove a todo!(). The runtime's behaviour of auto-spawning new threads to replace the dead ones doesn't lead to pretty results, and I'm sure the following log speaks for itself:

...
000000 ERRO handling panic on thread ThreadId(7) (wrld-worker-1): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(9) (wrld-worker-3): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(11) (wrld-worker-5): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(10) (wrld-worker-4): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(6) (wrld-worker-0): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(8) (wrld-worker-2): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(9) (wrld-worker-3): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(11) (wrld-worker-5): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(10) (wrld-worker-4): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(6) (wrld-worker-0): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(8) (wrld-worker-2): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(6) (wrld-worker-0): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(11) (wrld-worker-5): 'unconditional panic!!11!'
000000 ERRO handling panic on thread ThreadId(10) (wrld-worker-4): 'unconditional panic!!11!'
... many many more ...
000000 ERRO handling panic on thread ThreadId(7) (wrld-worker-1): 'unconditional panic!!11!'
000000 WARN panic occurred in another thread, swallowing unpanicked result: Err(Error(Bailed))
000000 ERRO 402 threads panicked, count: 402
000000 CRIT panic on thread "ThreadId(7) (wrld-worker-1)": "unconditional panic!!11!"
   0: panik::register_panic
   1: panik::GlobalStateGuard::init::{{closure}}
   ... lots of frames omitted ...
  61: start_thread
  62: __GI___clone

000000 CRIT panic on thread "ThreadId(9) (wrld-worker-3)": "unconditional panic!!11!"
   0: panik::register_panic
   ... repeat for every single panic ...

This is the reason for enforcing a limit on the number of backtraces to resolve, defaulting to 8. If there are 8+ panics occurring, the first 8 should give you enough information to fix the rest.

Summary

Hopefully this crate can help others react to panics properly in their fearlessly concurrent programs! I've linked to it a few times above, but here it is again in case you missed it.

Kalm meme, sorryOk, I apologise for this one.


  1. Unless it's the main thread or you're using panic = "abort", then you lose the whole process too. 

  2. By "silent" I mean breaking the game but not enough to crash it, leaving it running in a broken state. The panic backtrace will probably be visible in the logs, but sometimes interleaved with other logs if it's written directly to stderr while they're buffered and written to by a separate thread. And by "sometimes" I mean every time. 

  3. Assuming of course you check the return value of send and don't swallow it with let _ = request_tx.send(..)

  4. Ok, I'll fess up - it's not normally as surgical as shown in the image. I intentionally added an unconditional panic when loading that chunk in particular, just for the screenshot. But the point still stands! 

  5. This timeout was only 30 seconds, but when running and rerunning the game constantly during dev those lost seconds really add up; sometimes minutes per day are lost to waiting on a "peculiarly long load time".