RFC: Fine-grained timeout configuration

Status: Implemented

For a summarized list of proposed changes, see the Changes Checklist section.

While it is currently possible for users to implement request timeouts by racing operation send futures against timeout futures, this RFC proposes a more ergonomic solution that would also enable users to set timeouts for things like TLS negotiation and "time to first byte".

Terminology

There's a lot of terminology to define, so I've broken it up into three sections.

General terms

Smithy Client: An aws_smithy_client::Client<C, M, R> struct that is responsible for gluing together the connector, middleware, and retry policy. This is not generated and lives in the aws-smithy-client crate.
Fluent Client: A code-generated Client<C, M, R> that has methods for each service operation on it. A fluent builder is generated alongside it to make construction easier.
AWS Client: A specialized Fluent Client that defaults to using a DynConnector, AwsMiddleware, and Standard retry policy.
Shared Config: An aws_types::Config struct that is responsible for storing shared configuration data that is used across all services. This is not generated and lives in the aws-types crate.
Service-specific Config: A code-generated Config that has methods for setting service-specific configuration. Each Config is defined in the config module of its parent service. For example, the S3-specific config struct is useable from aws_sdk_s3::config::Config and re-exported as aws_sdk_s3::Config. In this case, "service" refers to an AWS offering like S3.

HTTP stack terms

Service: A trait defined in the tower-service crate. The lowest level of abstraction we deal with when making HTTP requests. Services act directly on data to transform and modify that data. A Service is what eventually turns a request into a response.
Layer: Layers are a higher-order abstraction over services that is used to compose multiple services together, creating a new service from that combination. Nothing prevents us from manually wrapping services within services, but Layers allow us to do it in a flexible and generic manner. Layers don't directly act on data but instead can wrap an existing service with additional functionality, creating a new service. Layers can be thought of as middleware. NOTE: The use of Layers can produce compiler errors that are difficult to interpret and defining a layer requires a large amount of boilerplate code.
Middleware: A term with several meanings:
- Generically speaking, middleware are similar to Services and Layers in that they modify requests and responses.
- In the SDK, "Middleware" refers to a layer that can be wrapped around a DispatchService. In practice, this means that the resulting Service (and the inner service) must meet the bound T: Service<operation::Request, Response=operation::Response, Error=SendOperationError>.
  - Note: This doesn't apply to the middlewares we use when generating presigned requests because those don't wrap a DispatchService.
- The most notable example of a Middleware is the AwsMiddleware. Other notable examples include MapRequest, AsyncMapRequest, and ParseResponse.
DispatchService: The innermost part of a group of nested services. The Service that actually makes an HTTP call on behalf of a request. Responsible for parsing success and error responses.
Connector: A term with several meanings:
- DynConnectors (a struct that implements DynConnect) are Services with their specific type erased so that we can do dynamic dispatch.
- A term from hyper for any object that implements the Connect trait. Really just an alias for tower_service::Service. Sometimes referred to as a Connection.
Stage: A form of middleware that's not related to tower. These currently function as a way of transforming requests and don't have the ability to transform responses.
Stack: A higher-order abstraction over Layers, defined in the tower crate. Layers wrap services in one another, and Stacks wrap layers within one another.
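To make the Service and Layer relationship concrete, here is a deliberately simplified, synchronous sketch. The two traits below are stand-ins written for this illustration and are not the real tower traits (which are asynchronous and also have poll_ready). The names Echo, Prefix, PrefixLayer, and build_stack are hypothetical.

```rust
/// Simplified stand-in for `tower::Service`: turns a request into a response.
trait Service<Request> {
    type Response;
    fn call(&mut self, req: Request) -> Self::Response;
}

/// Simplified stand-in for `tower::Layer`: wraps one service in another.
trait Layer<S> {
    type Service;
    fn layer(&self, inner: S) -> Self::Service;
}

/// Innermost service: returns the request unchanged.
struct Echo;

impl Service<String> for Echo {
    type Response = String;
    fn call(&mut self, req: String) -> String {
        req
    }
}

/// A service that modifies the request before delegating to `inner`.
struct Prefix<S> {
    inner: S,
    prefix: &'static str,
}

impl<S: Service<String>> Service<String> for Prefix<S> {
    type Response = S::Response;
    fn call(&mut self, req: String) -> S::Response {
        self.inner.call(format!("{}{}", self.prefix, req))
    }
}

/// The layer holds configuration only; it acts on services, not on data.
struct PrefixLayer(&'static str);

impl<S> Layer<S> for PrefixLayer {
    type Service = Prefix<S>;
    fn layer(&self, inner: S) -> Prefix<S> {
        Prefix { inner, prefix: self.0 }
    }
}

/// Wrap `Echo` twice; the layer applied last becomes the outermost service.
fn build_stack() -> impl Service<String, Response = String> {
    PrefixLayer("outer:").layer(PrefixLayer("inner:").layer(Echo))
}
```

A request passes through the outermost service first, so each wrapper sees the request before the services nested inside it. A timeout wrapper placed outside another service therefore covers everything that service does.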

Timeout terms

Connect Timeout: A limit on the amount of time after making an initial connect attempt on a socket to complete the connect-handshake.
- TODO: The runtime is based on Hyper, which reuses connections and doesn't currently have a way of guaranteeing that a fresh connection will be used for a given request.
TLS Negotiation Timeout: A limit on the amount of time a TLS handshake takes from when the CLIENT HELLO message is sent to the time the client and server have fully negotiated ciphers and exchanged keys.
Time to First Byte Timeout: Sometimes referred to as a "read timeout." A limit on the amount of time an application takes to attempt to read the first byte over an established, open connection after writing the request.
HTTP Request Timeout For A Single Attempt: A limit on the amount of time between when the first byte is sent over an established, open connection and when the last byte is received from the service.
HTTP Request Timeout For Multiple Attempts: This timeout acts like the previous timeout but constrains the total time it takes to make a request plus any retries.
- NOTE: In a way, this is already possible in that users are free to race requests against timer futures with the futures::future::select function or to use tokio::time::timeout. See relevant discussion in hyper#1097.
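The difference between the single-attempt and multiple-attempt timeouts is easiest to see with numbers. The sketch below is a pure simulation that does no I/O: it takes the time each attempt would need and reports which limit, if any, fires first. The simulate function and the Outcome enum are hypothetical names used only for this illustration, and the simulation assumes that retries start immediately with no backoff.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Outcome {
    /// An attempt finished within both limits (zero-based attempt index).
    Succeeded { attempt: usize },
    /// Every attempt hit the per-attempt limit and no attempts remain.
    AttemptsExhausted,
    /// The overall limit fired before any attempt could finish.
    TotalTimeout,
}

/// `attempts[i]` is how long attempt `i` would take if left alone.
fn simulate(attempts: &[Duration], per_attempt: Duration, total: Duration) -> Outcome {
    let mut elapsed = Duration::ZERO;
    for (attempt, &needed) in attempts.iter().enumerate() {
        // Each attempt runs until it finishes or its own limit fires.
        let spent = needed.min(per_attempt);
        // The overall limit spans all attempts, so it is checked first.
        if elapsed + spent > total {
            return Outcome::TotalTimeout;
        }
        elapsed += spent;
        if needed <= per_attempt {
            return Outcome::Succeeded { attempt };
        }
        // Otherwise this attempt timed out and the next one starts.
    }
    Outcome::AttemptsExhausted
}
```

With attempts needing 5s, 5s, and 1s, a 2s per-attempt limit, and a 10s overall limit, the first two attempts are cut off at 2s each and the third succeeds. Lowering the overall limit to 3s makes the overall timeout fire during the second attempt instead.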

Configuring timeouts

Just like with Retry Behavior Configuration, these settings can be configured in several places and have the same precedence rules (paraphrased here for clarity).

Service-specific config builders
Shared config builders
Environment variables
Profile config file (e.g., ~/.aws/credentials)

The above list is in order of decreasing precedence. For example, configuration set in an app will override values from environment variables.
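One way to picture these precedence rules is as a merge in which a value set by a higher-precedence source is never overwritten by a lower one. The following is a minimal sketch under that assumption. TimeoutSettings, its fields, and resolve are hypothetical names for illustration and are not the SDK's actual types.

```rust
use std::time::Duration;

/// Hypothetical, trimmed-down timeout settings; `None` means "not set here".
#[derive(Clone, Debug, Default, PartialEq)]
struct TimeoutSettings {
    connect: Option<Duration>,
    api_call: Option<Duration>,
}

impl TimeoutSettings {
    /// Keep values already set on `self`; fill the gaps from `fallback`.
    fn or(self, fallback: TimeoutSettings) -> TimeoutSettings {
        TimeoutSettings {
            connect: self.connect.or(fallback.connect),
            api_call: self.api_call.or(fallback.api_call),
        }
    }
}

/// Merge sources listed from highest to lowest precedence.
fn resolve(sources: Vec<TimeoutSettings>) -> TimeoutSettings {
    sources
        .into_iter()
        .fold(TimeoutSettings::default(), |merged, next| merged.or(next))
}
```

Passing the sources in the order service-specific config, shared config, environment, profile yields the behavior described above: each field takes its value from the first source that sets it.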

Configuration options

The table below details the specific ways each timeout can be configured. In all cases, valid values are non-negative floats representing the number of seconds before a timeout is triggered.

Timeout	Environment Variable	AWS Config Variable	Builder Method
Connect	AWS_CONNECT_TIMEOUT	connect_timeout	connect_timeout
TLS Negotiation	AWS_TLS_NEGOTIATION_TIMEOUT	tls_negotiation_timeout	tls_negotiation_timeout
Time To First Byte	AWS_READ_TIMEOUT	read_timeout	read_timeout
HTTP Request - single attempt	AWS_API_CALL_ATTEMPT_TIMEOUT	api_call_attempt_timeout	api_call_attempt_timeout
HTTP Request - all attempts	AWS_API_CALL_TIMEOUT	api_call_timeout	api_call_timeout
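Since every value is a non-negative float giving a number of seconds, parsing can be sketched with the standard library alone. The helper names parse_timeout_secs and timeout_from_env are hypothetical, and the error type is a plain String to keep the example short. This is an illustration of the rule, not the SDK's actual parser.

```rust
use std::env::VarError;
use std::time::Duration;

/// Parse a timeout value such as "2.5" (seconds) into a `Duration`.
/// Negative, non-finite, overflowing, and non-numeric input is rejected.
fn parse_timeout_secs(raw: &str) -> Result<Duration, String> {
    let secs: f64 = raw
        .trim()
        .parse()
        .map_err(|err| format!("invalid timeout {:?}: {}", raw, err))?;
    // `try_from_secs_f64` fails for negative, NaN, infinite, or too-large values.
    Duration::try_from_secs_f64(secs)
        .map_err(|err| format!("invalid timeout {:?}: {}", raw, err))
}

/// Read a timeout from the environment; an unset variable yields `Ok(None)`.
fn timeout_from_env(var: &str) -> Result<Option<Duration>, String> {
    match std::env::var(var) {
        Ok(raw) => parse_timeout_secs(&raw).map(Some),
        Err(VarError::NotPresent) => Ok(None),
        Err(err) => Err(err.to_string()),
    }
}
```

For example, timeout_from_env("AWS_CONNECT_TIMEOUT") would return a 2.5 second Duration when the variable is set to "2.5", and None when it is unset.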

SDK-specific defaults set by AWS service teams

QUESTION: How does the SDK currently handle these defaults?

Prior Art

hjr3/hyper-timeout is a Connector for hyper that enables setting connect, read, and write timeouts
sfackler/tokio-io-timeout provides timeouts for tokio IO operations. Used within hyper-timeout.
tokio::time::sleep_until creates a Future that completes once a given deadline has been reached. Used within tokio-io-timeout.

Behind the scenes

Timeouts are achieved by racing a future against a tokio::time::Sleep future. The question, then, is "how can I create a future that represents a condition I want to watch for?". For example, in the case of a ConnectTimeout, how do we watch an ongoing request to see if it's completed the connect-handshake? Our current stack of Middleware acts on requests at different levels of granularity. The timeout Middlewares will be no different.
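The racing idea can be sketched using only the standard library. In the sketch below, Sleep is a toy deadline future standing in for tokio::time::Sleep, Timeout is the future that races the two, and block_on is a minimal busy-polling executor so the example can run without a runtime. All of these names are hypothetical and written for this illustration. A real implementation would rely on the runtime's timer and wakers rather than busy polling.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Toy sleep future: ready once `deadline` has passed. It never registers a
/// waker, so it only works with the busy-polling `block_on` below.
struct Sleep {
    deadline: Instant,
}

impl Future for Sleep {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if Instant::now() >= self.deadline {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

#[derive(Debug, PartialEq)]
struct TimedOut;

/// Races `inner` against `sleep`; whichever completes first decides the result.
struct Timeout<F> {
    inner: Pin<Box<F>>,
    sleep: Sleep,
}

impl<F: Future> Future for Timeout<F> {
    type Output = Result<F::Output, TimedOut>;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Poll the wrapped future first so that a finished request wins ties.
        if let Poll::Ready(out) = self.inner.as_mut().poll(cx) {
            return Poll::Ready(Ok(out));
        }
        match Pin::new(&mut self.sleep).poll(cx) {
            Poll::Ready(()) => Poll::Ready(Err(TimedOut)),
            Poll::Pending => Poll::Pending,
        }
    }
}

/// Start the clock now and wrap `inner` so that it fails after `limit`.
fn with_timeout<F: Future>(inner: F, limit: Duration) -> Timeout<F> {
    Timeout {
        inner: Box::pin(inner),
        sleep: Sleep { deadline: Instant::now() + limit },
    }
}

/// Waker that does nothing; enough for a loop that polls unconditionally.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for this demo: polls until ready, yielding in between.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

The granularity of a timeout is determined entirely by what the inner future represents: wrapping the whole retry loop gives a multiple-attempt timeout, while wrapping only the connect step would give a connect timeout.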

Middlewares for AWS Client requests

View AwsMiddleware in GitHub

#[derive(Debug, Default)]
#[non_exhaustive]
pub struct AwsMiddleware;
impl<S> tower::Layer<S> for AwsMiddleware {
  type Service = <AwsMiddlewareStack as tower::Layer<S>>::Service;

  fn layer(&self, inner: S) -> Self::Service {
    let credential_provider = AsyncMapRequestLayer::for_mapper(CredentialsStage::new());
    let signer = MapRequestLayer::for_mapper(SigV4SigningStage::new(SigV4Signer::new()));
    let endpoint_resolver = MapRequestLayer::for_mapper(AwsAuthStage);
    let user_agent = MapRequestLayer::for_mapper(UserAgentStage::new());
    ServiceBuilder::new()
            .layer(endpoint_resolver)
            .layer(user_agent)
            .layer(credential_provider)
            .layer(signer)
            .service(inner)
  }
}

The above code is only included for context. This RFC doesn't define any timeouts specific to AWS so AwsMiddleware won't require any changes.

Middlewares for Smithy Client requests

View aws_smithy_client::Client::call_raw in GitHub

impl<C, M, R> Client<C, M, R>
  where
          C: bounds::SmithyConnector,
          M: bounds::SmithyMiddleware<C>,
          R: retry::NewRequestPolicy,
{
  // ...other methods omitted
  pub async fn call_raw<O, T, E, Retry>(
    &self,
    input: Operation<O, Retry>,
  ) -> Result<SdkSuccess<T>, SdkError<E>>
    where
            R::Policy: bounds::SmithyRetryPolicy<O, T, E, Retry>,
            bounds::Parsed<<M as bounds::SmithyMiddleware<C>>::Service, O, Retry>:
            Service<Operation<O, Retry>, Response=SdkSuccess<T>, Error=SdkError<E>> + Clone,
  {
    let connector = self.connector.clone();

    let mut svc = ServiceBuilder::new()
            // Create a new request-scoped policy
            .retry(self.retry_policy.new_request_policy())
            .layer(ParseResponseLayer::<O, Retry>::new())
            // These layers can be considered as occurring in order. That is, first invoke the
            // customer-provided middleware, then dispatch over the wire.
            .layer(&self.middleware)
            .layer(DispatchLayer::new())
            .service(connector);

    svc.ready().await?.call(input).await
  }
}

The Smithy Client creates a new Stack of services to handle each request it sends. Specifically:

The retry method is used to set the retry handler. The configuration for this was set during creation of the Client.
ParseResponseLayer inserts a service for transforming responses into operation-specific outputs or errors. The O generic parameter of input is what decides exactly how the transformation is implemented.
A middleware stack that was included during Client creation is inserted into the stack. In the case of the AWS SDK, this would be AwsMiddleware.
DispatchLayer inserts a service for transforming an operation::Request into an http::Request. It's also responsible for re-attaching the property bag from the Operation that triggered the request.
The innermost Service is a DynConnector wrapping a hyper client (which one depends on which TLS implementation was enabled by cargo features).

The HTTP Request Timeout For A Single Attempt and HTTP Request Timeout For Multiple Attempts can be implemented at this level. The same Layer can be used to create both TimeoutServices. The TimeoutLayer would require two inputs:

sleep_fn: A runtime-specific implementation of sleep. The SDK is currently tokio-based and would default to tokio::time::sleep (this default is set in the aws_smithy_async::rt::sleep module).
The duration of the timeout as a std::time::Duration

The resulting code would look like this:

impl<C, M, R> Client<C, M, R>
  where
          C: bounds::SmithyConnector,
          M: bounds::SmithyMiddleware<C>,
          R: retry::NewRequestPolicy,
{
  // ...other methods omitted
  pub async fn call_raw<O, T, E, Retry>(
    &self,
    input: Operation<O, Retry>,
  ) -> Result<SdkSuccess<T>, SdkError<E>>
    where
            R::Policy: bounds::SmithyRetryPolicy<O, T, E, Retry>,
            bounds::Parsed<<M as bounds::SmithyMiddleware<C>>::Service, O, Retry>:
            Service<Operation<O, Retry>, Response=SdkSuccess<T>, Error=SdkError<E>> + Clone,
  {
    let connector = self.connector.clone();
    let sleep_fn = aws_smithy_async::rt::sleep::default_async_sleep();

    let mut svc = ServiceBuilder::new()
            // `sleep_fn` is needed again below, so pass a clone here.
            .layer(TimeoutLayer::new(
              sleep_fn.clone(),
              self.timeout_config.api_call_timeout(),
            ))
            // Create a new request-scoped policy
            .retry(self.retry_policy.new_request_policy())
            .layer(TimeoutLayer::new(
              sleep_fn,
              self.timeout_config.api_call_attempt_timeout(),
            ))
            .layer(ParseResponseLayer::<O, Retry>::new())
            // These layers can be considered as occurring in order. That is, first invoke the
            // customer-provided middleware, then dispatch over the wire.
            .layer(&self.middleware)
            .layer(DispatchLayer::new())
            .service(connector);

    svc.ready().await?.call(input).await
  }
}

Note: Our HTTP client supports multiple TLS implementations. We'll likely have to implement this feature once per library.

Timeouts will be implemented in the following places:

HTTP request timeout for multiple attempts will be implemented as the outermost Layer in Client::call_raw.
HTTP request timeout for a single attempt will be implemented within RetryHandler::retry.
Time to first byte, TLS negotiation, and connect timeouts will be implemented within the central hyper connector.

Changes checklist

Changes are broken into two sections:

HTTP request timeouts (single or multiple attempts) are implementable as layers within our current stack
Other timeouts will require changes to our dependencies and may be slower to implement

Implementing HTTP request timeouts

Add TimeoutConfig to smithy-types
Add TimeoutConfigProvider to aws-config
- Add provider that fetches config from environment variables
- Add provider that fetches config from profile
Add timeout method to aws_types::Config for setting timeout configuration
Add timeout method to generated Configs too
Create a generic TimeoutService and accompanying Layer
- TimeoutLayer should accept a sleep function so that it doesn't have a hard dependency on tokio
Insert a TimeoutLayer before the RetryPolicy to handle timeouts for multiple-attempt requests
Insert a TimeoutLayer after the RetryPolicy to handle timeouts for single-attempt requests
Add tests for timeout behavior
- test multi-request timeout triggers after 3 slow retries
- test single-request timeout triggers correctly
- test single-request timeout doesn't trigger if request completes in time
