RFC: Error Context and Compatibility

Status: Implemented

Applies to: Generated clients and shared rust-runtime crates

This RFC proposes a pattern for writing Rust errors to provide consistent error context AND forwards/backwards compatibility. The goal is to strike a balance between these four goals:

Errors are forwards compatible, and changes to errors are backwards compatible
Errors are idiomatic and ergonomic. It is easy to match on them and extract additional information for cases where that's useful. The type system prevents errors from being used incorrectly (for example, incorrectly retrieving context for a different error variant)
Error messages are easy to debug
Errors implement best practices with Rust's Error trait (for example, implementing the optional source() function where possible)

Note: This RFC is not about error backwards compatibility when it comes to error serialization/deserialization for transfer over the wire. The Smithy protocols cover that aspect.

Past approaches in smithy-rs

This section examines some examples found in aws-config that illustrate different problems that this RFC will attempt to solve, and calls out what was done well, and what could be improved upon.

Case study: `InvalidFullUriError`

To start, let's examine InvalidFullUriError (doc comments omitted):

#[derive(Debug)]
#[non_exhaustive]
pub enum InvalidFullUriError {
    #[non_exhaustive] InvalidUri(InvalidUri),
    #[non_exhaustive] NoDnsService,
    #[non_exhaustive] MissingHost,
    #[non_exhaustive] NotLoopback,
    DnsLookupFailed(io::Error),
}

impl Display for InvalidFullUriError {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        match self {
            InvalidFullUriError::InvalidUri(err) => write!(f, "URI was invalid: {}", err),
            InvalidFullUriError::MissingHost => write!(f, "URI did not specify a host"),
            // ... omitted ...
        }
    }
}

impl Error for InvalidFullUriError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            InvalidFullUriError::InvalidUri(err) => Some(err),
            InvalidFullUriError::DnsLookupFailed(err) => Some(err),
            _ => None,
        }
    }
}

This error does a few things well:

Using #[non_exhaustive] on the enum allows new errors to be added in the future.
Breaking out different error types allows for more useful error messages, potentially with error-specific context. Customers can match on these different error variants to change their program flow, although it's not immediately obvious if such use cases exist for this error.
The error cause is available through the Error::source() impl for variants that have a cause.

However, there are also a number of things that could be improved:

All tuple/struct enum members are public, and InvalidUri is an error from the http crate. Exposing a type from another crate can potentially lock the GA SDK into a specific crate version if breaking changes are ever made to the exposed types. In this specific case, it prevents using alternate HTTP implementations that don't use the http crate.
DnsLookupFailed is missing #[non_exhaustive], so new members can never be added to it.
Use of enum tuples, even with #[non_exhaustive], adds friction to evolving the API since the tuple members cannot be named.
Printing the source error in the Display impl leads to error repetition by reporters that examine the full source chain.
The source() impl has a _ match arm, which means future implementers could forget to propagate a source when adding new error variants.
The error source can be downcasted to InvalidUri type from http in customer code. This is a leaky abstraction where customers can start to rely on the underlying library the SDK uses in its implementation, and if that library is replaced/changed, it can silently break the customer's application. Note: later in the RFC, I'll demonstrate why fixing this issue is not practical.

Case study: `ProfileParseError`

Next, let's look at a much simpler error. The ProfileParseError is focused purely on the parsing logic for the SDK config file:

#[derive(Debug, Clone)]
pub struct ProfileParseError {
    location: Location,
    message: String,
}

impl Display for ProfileParseError {
    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "error parsing {} on line {}:\n  {}",
            self.location.path, self.location.line_number, self.message
        )
    }
}

impl Error for ProfileParseError {}

What this error does well:

The members are private, so #[non_exhaustive] isn't even necessary
The error is completely opaque (maximizing compatibility) while still being debuggable thanks to the flexible messaging

What could be improved:

It needlessly implements Clone, which may prevent it from holding an error source in the future since errors are often not Clone.
In the future, if more error variants are needed, a private inner error kind enum could be added to change messaging, but there's not a nice way to expose new variant-specific information to the customer.
Programmatic access to the error Location may be desired, but this can be trivially added in the future without a breaking change by adding an accessor method.

Case study: code generated client errors

The SDK currently generates errors such as the following (from S3):

#[non_exhaustive]
pub enum Error {
    BucketAlreadyExists(BucketAlreadyExists),
    BucketAlreadyOwnedByYou(BucketAlreadyOwnedByYou),
    InvalidObjectState(InvalidObjectState),
    NoSuchBucket(NoSuchBucket),
    NoSuchKey(NoSuchKey),
    NoSuchUpload(NoSuchUpload),
    NotFound(NotFound),
    ObjectAlreadyInActiveTierError(ObjectAlreadyInActiveTierError),
    ObjectNotInActiveTierError(ObjectNotInActiveTierError),
    Unhandled(Box<dyn Error + Send + Sync + 'static>),
}

Each error variant gets its own struct, which can hold error-specific contextual information. Except for the Unhandled variant, both the error enum and the details on each variant are extensible. The Unhandled variant should move the error source into a struct so that its type can be hidden. Otherwise, the code generated errors are already aligned with the goals of this RFC.

Approaches from other projects

`std::io::Error`

The standard library uses an Error struct with an accompanying ErrorKind enum for its IO error. Roughly:

#![allow(unused)]
fn main() {
#[derive(Debug)]
#[non_exhaustive]
pub enum ErrorKind {
    NotFound,
    // ... omitted ...
    Other,
}

#[derive(Debug)]
pub struct Error {
    kind: ErrorKind,
    source: Box<dyn std::error::Error + Send + Sync>,
}
}

What this error does well:

It is extensible since the ErrorKind is non-exhaustive
It has an Other error type that can be instantiated by users in unit tests, making it easier to unit test error handling

What could be improved:

There isn't an ergonomic way to add programmatically accessible error-specific context to this error in the future
The source error can be downcasted, which could be a trap for backwards compatibility.

Hyper 1.0

Hyper has outlined some problems they want to address with errors for the coming 1.0 release. To summarize:

It's difficult to match on specific errors (Hyper 0.x's Error relies on is_x methods for error matching rather than enum matching).
Error reporters duplicate information since the hyper 0.x errors include the display of their error sources
Error::source() can leak internal dependencies

Opaque Error Sources

There is discussion in the errors working group about how to avoid leaking internal dependency error types through error source downcasting. One option is to create an opaque error wrapping new-type that removes the ability to downcast to the other library's error. This, however, can be circumvented via unsafe code, and also breaks the ability for error reporters to properly display the error (for example, if the error has backtrace information, that would be inaccessible to the reporter).

This situation might improve if the nightly request_value/request_ref/provide functions on std::error::Error are stabilized, since then contextual information needed for including things such as a backtrace could still be retrieved through the opaque error new-type.

This RFC proposes that error types from other libraries not be directly exposed in the API, but rather, be exposed indirectly through Error::source as &dyn Error + 'static.

Errors should not require downcasting to be useful. Downcasting the error's source should be a last resort, and with the understanding that the type could change at a later date with no compile-time guarantees.

Error Proposal

Taking a customer's perspective, there are two broad categories of errors:

Actionable: Errors that can/should influence program flow; where it's useful to do different work based on additional error context or error variant information
Informative: Errors that inform that something went wrong, but where it's not useful to match on the error to change program flow

This RFC proposes that a consistent pattern be introduced to cover these two use cases for all errors in the public API for the Rust runtime crates and generated client crates.

Actionable error pattern

Actionable errors are represented as enums. If an error variant has an error source or additional contextual information, it must use a separate context struct that is referenced via tuple in the enum. For example:

// Good: new error types can be added in the future
#[non_exhaustive]
pub enum Error {
    // Good: This is exhaustive and uses a tuple, but its sole member is an extensible struct with private fields
    VariantA(VariantA),

    // Bad: The fields are directly exposed and can't have accessor methods. The error
    // source type also can't be changed at a later date since.
    #[non_exhaustive]
    VariantB {
        some_additional_info: u32,
        source: AnotherError // AnotherError is from this crate
    },

    // Bad: There's no way to add additional contextual information to this error in the future, even
    // though it is non-exhaustive. Changing it to a tuple or struct later leads to compile errors in existing
    // match statements.
    #[non_exhaustive]
    VariantC,

    // Bad: Not extensible if additional context is added later (unless that context can be added to `AnotherError`)
    #[non_exhaustive]
    VariantD(AnotherError),

    // Bad: Not extensible. If new context is added later (for example, a second endpoint), there's no way to name it.
    #[non_exhaustive]
    VariantE(Endpoint, AnotherError),

    // Bad: Exposes another library's error type in the public API,
    // which makes upgrading or replacing that library a breaking change
    #[non_exhaustive]
    VariantF {
        source: http::uri::InvalidUri
    },

    // Bad: The error source type is public, and even though its a boxed error, it won't
    // be possible to change it to an opaque error type later (for example, if/when
    // opaque errors become practical due to standard library stabilizations).
    #[non_exhaustive]
    VariantG {
        source: Box<dyn Error + Send + Sync + 'static>,
    }
}

pub struct VariantA {
    some_field: u32,
    // This is private, so it's fine to reference the external library's error type
    source: http::uri::InvalidUri
}

impl VariantA {
    fn some_field(&self) -> u32 {
        self.some_field
    }
}

Error variants that contain a source must return it from the Error::source method. The source implementation should not use the catch all (_) match arm, as this makes it easy to miss adding a new error variant's source at a later date.

The error Display implementation must not include the source in its output:

// Good
impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::VariantA => write!(f, "variant a"),
            Self::VariantB { some_additional_info, .. } => write!(f, "variant b ({some_additional_info})"),
            // ... and so on
        }
    }
}

// Bad
impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::VariantA => write!(f, "variant a"),
            // Bad: includes the source in the `Display` output, which leads to duplicate error information
            Self::VariantB { some_additional_info, source } => write!(f, "variant b ({some_additional_info}): {source}"),
            // ... and so on
        }
    }
}

Informative error pattern

Informative errors must be represented as structs. If error messaging changes based on an underlying cause, then a private error kind enum can be used internally for this purpose. For example:

#[derive(Debug)]
pub struct InformativeError {
    some_additional_info: u32,
    source: AnotherError,
}

impl fmt::Display for InformativeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "some informative message with {}", self.some_additional_info)
    }
}

impl Error for InformativeError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

In general, informative errors should be referenced by variants in actionable errors since they cannot be converted to actionable errors at a later date without a breaking change. This is not a hard rule, however. Use your best judgement for the situation.

Displaying full error context

In code where errors are logged rather than returned to the customer, the full error source chain must be displayed. This will be made easy by placing a DisplayErrorContext struct in aws-smithy-types that is used as a wrapper to get the better error formatting:

tracing::warn!(err = %DisplayErrorContext(err), "some message");

This might be implemented as follows:

#[derive(Debug)]
pub struct DisplayErrorContext<E: Error>(pub E);

impl<E: Error> fmt::Display for DisplayErrorContext<E> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write_err(f, &self.0)?;
        // Also add a debug version of the error at the end
        write!(f, " ({:?})", self)
    }
}

fn write_err(f: &mut fmt::Formatter<'_>, err: &dyn Error) -> fmt::Result {
    write!(f, "{}", err)?;
    if let Some(source) = err.source() {
        write!(f, ": ")?;
        write_err(f, source)?;
    }
    Ok(())
}

Changes Checklist

Update every struct/enum that implements Error in all the non-server Rust runtime crates
Hide error source type in Unhandled variant in code generated errors
Remove Clone from ProfileParseError and any others that have it

Error Code Review Checklist

This is a checklist meant to aid code review of new errors:

The error fits either the actionable or informative pattern
If the error is informative, it's clear that it will never be expanded with additional variants in the future
The Display impl does not write the error source to the formatter
The catch all _ match arm is not used in the Display or Error::source implementations
Error types from external libraries are not exposed in the public API
Error enums are #[non_exhaustive]
Error enum variants that don't have a separate error context struct are #[non_exhaustive]
Error context is exposed via accessors rather than by public fields
Actionable errors and their context structs are in an error submodule for any given module. They are not mixed with other non-error code

Smithy Rust