Error handling considered hard. In an analysis of 48 critical failures of large distributed databases, the cited paper [1] finds that most of the bugs were due to bad error handling. "Drilling down further, 25% of bugs are from simply ignoring an error, 8% are from catching the wrong exception, 2% are from incomplete TODOs, and another 23% are “easily detectable”, which are defined as cases where “the error handling logic of a non-fatal error was so wrong that any statement coverage testing or more careful code reviews by the developers would have caught the bugs”."
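
To make the first three categories concrete, here is a minimal Python sketch of what each one tends to look like, followed by a more careful handler. Nothing in it comes from the paper: `ReplicaUnavailable`, `send_to_replica`, and the `write_*` functions are hypothetical names invented for illustration.

```python
import logging

logger = logging.getLogger("replica")


class ReplicaUnavailable(Exception):
    """Hypothetical non-fatal error: one replica could not be reached."""


def send_to_replica(record, healthy):
    # Hypothetical stand-in for a network call that can fail non-fatally.
    if not healthy:
        raise ReplicaUnavailable("replica did not acknowledge write")
    return "ack"


def write_ignoring_error(record, healthy):
    # Category 1: the error is swallowed, so the caller believes the write succeeded.
    try:
        send_to_replica(record, healthy)
    except ReplicaUnavailable:
        pass
    return True


def write_catching_too_much(record, healthy):
    # Category 2: a catch-all handler treats every failure, including
    # unrelated programming bugs, as if it were the expected error.
    try:
        send_to_replica(record, healthy)
    except Exception:
        return False
    return True


def write_with_todo(record, healthy):
    # Category 3: the handler is a placeholder that was never finished.
    try:
        send_to_replica(record, healthy)
    except ReplicaUnavailable:
        # TODO: retry or fail over
        pass
    return True


def write_carefully(record, healthy):
    # Catch only the expected error, log it, and report the failure to the caller.
    try:
        send_to_replica(record, healthy)
    except ReplicaUnavailable as exc:
        logger.warning("write of %r not acknowledged: %s", record, exc)
        return False
    return True
```

With a failing replica, `write_ignoring_error("row", healthy=False)` and `write_with_todo("row", healthy=False)` both report success, which is the kind of mistake a statement-coverage test that exercises the `except` branch would expose.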

On the upside, with the right tooling, "98% of critical failures can be reproduced in a 3 node cluster."

[1] "Simple testing can prevent most critical failures"
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan