RFC: Serialization and Deserialization
Status: RFC
Applies to: Output, Input, and Builder types as well as DateTime, Document, Blob, and Number implemented in the aws_smithy_types crate.
Terminology
- Builder: Refers to data types prefixed with Builder, which convert themselves into a corresponding data type upon being built, e.g. aws_sdk_dynamodb::input::PutItemInput.
- serde: Refers to the serde crate.
- Serialize: Refers to the Serialize trait available in the serde crate.
- Deserialize: Refers to the Deserialize trait available in the serde crate.
Overview
We are going to implement the Serialize and Deserialize traits from the serde crate for some data types.
The data types affected are:
- builder data types
- operation Input types
- operation Output types
- data types that builder types may have in their field(s)
- aws_smithy_types::DateTime
- aws_smithy_types::Document
- aws_smithy_types::Blob
- aws_smithy_types::Number
DateTime and Blob implement different serialization/deserialization formats for human-readable and non-human readable formats. We must emphasize that these two formats are not compatible with each other. The reasons for this are explained in the Blob and DateTime sections.
Additionally, we add fn set_fields to fluent builders to allow users to set the data they deserialized on fluent builders.
Lastly, we emphasize that this RFC does NOT aim to serialize the entire response or request, or to implement serde traits on data types for server-side code.
Use Case
Users have requested that serde traits be implemented on the data types in the Rust SDK.
We have created this RFC with the following use cases in mind.
- [request]: Serialize/Deserialize of models for Lambda events #269
- Tests as suggested in the design FAQ.
- Building tools
Feature Gate
Enabling Feature
To enable any of the features from this RFC, users must pass --cfg aws-sdk-unstable to rustc.
You can do this either in .cargo/config.toml or through an environment variable.
- Specifying it in .cargo/config.toml
[build]
rustflags = ["--cfg", "aws-sdk-unstable"]
- As an environment variable
export RUSTFLAGS="--cfg aws-sdk-unstable"
cargo build
We considered allowing users to enable this feature at the crate level, e.g.
[dependencies]
aws_sdk_dynamodb = { version = "0.22.0", features = ["unstable", "serialize"] }
Compared to the cfg approach, this is a lot easier for users. However, we believe that the cfg approach ensures users won't enable this feature by surprise, and communicates to users that features behind this gate can be taken away or experience breaking changes at any time in the future.
Feature Gate for Serialization and De-serialization
Serde traits are implemented behind feature gates.
Serialize is implemented behind serde-serialize, while Deserialize is implemented behind serde-deserialize.
Users must enable the unstable feature to expose those features.
We considered giving each feature a dedicated feature gate such as unstable-serde-serialize.
In that case, we would need to rename the feature gates entirely once they leave unstable status, which would force users to change their code base.
We conclude that this brings no benefit to users.
Furthermore, we considered naming the feature gates serialize/deserialize.
However, this would be confusing for users if we later add support for a different serialization/deserialization framework such as deser.
Thus, to emphasize that the traits are from the serde crate, we decided to name them serde-serialize/serde-deserialize.
Keeping both features behind the same feature gate
We considered keeping both features behind the same feature gate. There is no significant difference in the complexity of implementation. We do not see any benefit in keeping them behind the same feature gate as this will only increase compile time when users do not need one of the features.
Different feature gates for different data types
We considered implementing different feature gates for output, input, and their corresponding data types.
For example, output and input types could have output-serde-* and input-serde-*.
We are unable to do this as the relevant metadata is not available during code generation.
Implementation
Smithy Types
aws_smithy_types is a crate that implements Smithy's data types.
These data types must implement serde traits as well, since the SDK uses them.
Blob
Serialize and Deserialize are not implemented with the derive macro.
In human-readable formats, Blob is serialized as a base64-encoded string, and any data to be deserialized as this data type must be encoded in base64.
Encoding must be carried out by the base64::encode function available from the aws_smithy_types crate.
Non-human readable formats serialize Blob with fn serialize_bytes.
- Reason behind the implementation of human-readable format
The aws_smithy_types crate comes with functions for encoding/decoding base64, which makes the implementation simpler.
Additionally, the AWS CLI and AWS SDKs for other languages require data to be encoded in base64 when they take a Blob type as input.
We also considered serializing them with serialize_bytes, without encoding them in base64.
In this case, the output depends on how the author of each data format crate implements serialize_bytes.
There are many different crates, so we decided to survey how some of the most popular crates implement this feature.
library | version | implementation | all-time downloads on crates.io as of writing (Dec 2022) |
---|---|---|---|
serde_json | 1.0 | Array of number | 109,491,713 |
toml | 0.5.9 | Array of number | 63,601,994 |
serde_yaml | 0.9.14 | Unsupported | 23,767,300 |
First of all, a Blob could have hundreds of bytes; reading an array of hundreds of numbers is never a pleasant experience, and it is especially troublesome when writing data for test cases. Additionally, some crates do not support byte serialization at all, which would hinder users' productivity and tie their hands.
For the reasons described above, we believe that it is crucial to encode the bytes as a string, and base64 is preferable to other encoding schemes such as base16, base32, or Ascii85.
- Reason behind the implementation of a non-human readable format
We considered using the same logic for non-human readable formats as well. However, readability is not necessary for a non-human readable format. Additionally, non-human readable formats tend to prioritize resource efficiency; a base64-encoded string takes up more space, which is not what users would want.
Thus, we believe that implementing tailored serialization logic is beneficial to users.
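To make the human-readable trade-off concrete, here is a minimal sketch that compares the two textual renderings of the same bytes. The helpers base64_encode and number_array are hypothetical and written with the standard library only; they are not the aws_smithy_types implementation, and the exact output of a real data format crate may differ in whitespace.

```rust
// Standard base64 alphabet (RFC 4648), used by the hypothetical encoder below.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical helper: encodes bytes as padded base64.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, padding with zeros.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Hypothetical helper: renders bytes as an "array of numbers".
fn number_array(input: &[u8]) -> String {
    let items: Vec<String> = input.iter().map(|b| b.to_string()).collect();
    format!("[{}]", items.join(","))
}

fn main() {
    let blob = b"hello";
    println!("base64:       {}", base64_encode(blob));
    println!("number array: {}", number_array(blob));
}
```

Even for five bytes the number array is longer and harder to read than the base64 string, and the gap grows with the size of the Blob.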
DateTime
Serialize and Deserialize are not implemented with the derive macro.
For human-readable formats, DateTime is serialized in RFC-3339 format, and it expects the value to be in RFC-3339 format when it is deserialized.
Non-human readable formats represent DateTime as a tuple of u32 and i64; the latter corresponds to the seconds field and the former to subsecond_nanos.
- Reason behind the implementation of a human-readable format
For serialization, DateTime already implements a function to encode itself into RFC-3339 format.
For deserialization, it is possible to accept other formats; we can add this later if we find it reasonable.
- Reason behind the implementation of a non-human readable format
Serializing it as a tuple of two integers results in a smaller data size and requires less computing power than any string-based format. A tuple is also smaller than a map, as it does not require a tag (field name) for each element.
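The size argument can be illustrated with a small sketch. The DateTime struct below is a hypothetical stand-in with only the two fields named in this RFC, and the 12-byte little-endian layout is an assumption made for illustration; a real binary format chooses its own encoding.

```rust
/// Hypothetical stand-in for aws_smithy_types::DateTime, reduced to the two
/// fields named in this RFC. It is not the real type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct DateTime {
    seconds: i64,
    subsecond_nanos: u32,
}

impl DateTime {
    /// Tuple in the order described above: (subsecond_nanos, seconds).
    fn to_tuple(self) -> (u32, i64) {
        (self.subsecond_nanos, self.seconds)
    }

    /// Rebuilds the value from the same tuple order.
    fn from_tuple((subsecond_nanos, seconds): (u32, i64)) -> Self {
        DateTime { seconds, subsecond_nanos }
    }
}

/// Assumed compact layout: 4 + 8 little-endian bytes, with no field names.
fn tuple_to_bytes((nanos, seconds): (u32, i64)) -> Vec<u8> {
    let mut out = Vec::with_capacity(12);
    out.extend_from_slice(&nanos.to_le_bytes());
    out.extend_from_slice(&seconds.to_le_bytes());
    out
}

fn main() {
    let dt = DateTime { seconds: 1_674_259_200, subsecond_nanos: 0 };
    let bytes = tuple_to_bytes(dt.to_tuple());
    // An RFC-3339 string for the same instant, used for size comparison only.
    let text = "2023-01-21T00:00:00Z";
    println!("tuple: {} bytes, RFC-3339 text: {} bytes", bytes.len(), text.len());
    // The tuple form round-trips without any parsing.
    assert_eq!(DateTime::from_tuple(dt.to_tuple()), dt);
}
```

The tuple needs no string formatting or parsing on either side, which is where the saving in computing power comes from.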
Document
Serialize and Deserialize are implemented with the derive macro.
Additionally, it uses the container attribute #[serde(untagged)].
Serde can distinguish each variant without tagging thanks to the difference in each variant's data type.
Number
Serialize and Deserialize are implemented with the derive macro.
Additionally, it uses the container attribute #[serde(untagged)].
Serde can distinguish each variant without a tag as each variant's content is different.
Builder Types and Non-Builder Types
Builder types and non-Builder types implement Serialize and Deserialize with the derive macro.
Example:
#[cfg_attr(
    all(aws-sdk-unstable, feature = "serde-serialize"),
    derive(serde::Serialize)
)]
#[cfg_attr(
    all(aws-sdk-unstable, feature = "serde-deserialize"),
    derive(serde::Deserialize)
)]
#[non_exhaustive]
#[derive(std::clone::Clone, std::cmp::PartialEq)]
pub struct UploadPartCopyOutput {
    ...
}
Enum Representation
serde allows programmers to choose one of four tagging styles (internal, external, adjacent, and untagged) when serializing an enum.
untagged
With the untagged representation, serialized data cannot be deserialized back correctly in some cases.
For example, aws_sdk_dynamodb::model::AttributeValue has Null(bool) and Bool(bool), whose serialized values cannot be distinguished without a tag.
internal
This results in a compile-time error. Using a #[serde(tag = "...")] attribute on an enum containing a tuple variant is an error at compile time.
external and adjacent
We are left with external and adjacent tagging.
External tagging is serde's default.
The goals of this RFC can be achieved either way.
The resulting size of the serialized data is smaller when tagged externally, as adjacent tagging requires a tag even when a variant has no content.
For the reasons mentioned above, we implement enums as externally tagged.
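The untagged ambiguity can be sketched without serde. The AttributeValue enum below is a hypothetical two-variant reduction of the DynamoDB type, and the four functions hand-write what an untagged and an externally tagged JSON-like encoding would look like. They are illustrative only and are not serde's generated code, although the untagged decoder follows serde's documented rule of trying variants in declaration order.

```rust
/// Hypothetical two-variant reduction of aws_sdk_dynamodb::model::AttributeValue.
#[derive(Debug, Clone, PartialEq)]
enum AttributeValue {
    Null(bool),
    Bool(bool),
}

/// Untagged: only the content is written, so both variants look the same.
fn to_untagged(v: &AttributeValue) -> String {
    match v {
        AttributeValue::Null(b) | AttributeValue::Bool(b) => b.to_string(),
    }
}

/// Untagged decoding tries variants in declaration order; the first match wins.
fn from_untagged(s: &str) -> Option<AttributeValue> {
    s.parse::<bool>().ok().map(AttributeValue::Null)
}

/// Externally tagged: the variant name wraps the content.
fn to_external(v: &AttributeValue) -> String {
    match v {
        AttributeValue::Null(b) => format!("{{\"Null\":{}}}", b),
        AttributeValue::Bool(b) => format!("{{\"Bool\":{}}}", b),
    }
}

/// Minimal parser for the exact shape produced by to_external.
fn from_external(s: &str) -> Option<AttributeValue> {
    let inner = s.strip_prefix('{')?.strip_suffix('}')?;
    let (tag, value) = inner.split_once(':')?;
    let value = value.trim().parse::<bool>().ok()?;
    match tag.trim() {
        "\"Null\"" => Some(AttributeValue::Null(value)),
        "\"Bool\"" => Some(AttributeValue::Bool(value)),
        _ => None,
    }
}

fn main() {
    let original = AttributeValue::Bool(true);
    // The untagged round trip silently changes the variant.
    println!("{:?}", from_untagged(&to_untagged(&original)));
    // The externally tagged round trip preserves it.
    println!("{:?}", from_external(&to_external(&original)));
}
```

The untagged round trip returns a Null variant for a value that started as Bool, which is the data loss this section describes.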
Data Types to Skip Serialization/Deserialization
We are going to skip serialization and deserialization of fields whose data type corresponds to @streaming blob from Smithy.
Any fields with these data types are tagged with #[serde(skip)].
When a field is skipped, it is assigned the value generated by the Default trait on deserialization.
As of writing, aws_smithy_http::byte_stream::ByteStream is the only data type affected by this decision.
Here is an example of a data type affected by this decision:
aws_sdk_s3::input::put_object_input::PutObjectInput
We considered serializing them as bytes. However, it could take some time for a stream to reach the end, and the resulting serialized data may be too big to fit into RAM.
Here is an example snippet.
#[allow(missing_docs)]
#[cfg_attr(
    all(aws-sdk-unstable, feature = "serde-serialize"),
    derive(serde::Serialize)
)]
#[cfg_attr(
    all(aws-sdk-unstable, feature = "serde-deserialize"),
    derive(serde::Deserialize)
)]
#[non_exhaustive]
#[derive(std::fmt::Debug)]
pub struct PutObjectInput {
    pub acl: std::option::Option<crate::model::ObjectCannedAcl>,
    // the streaming body is skipped, as described above
    #[cfg_attr(
        any(
            all(aws-sdk-unstable, feature = "serde-serialize"),
            all(aws-sdk-unstable, feature = "serde-deserialize")
        ),
        serde(skip)
    )]
    pub body: aws_smithy_http::byte_stream::ByteStream,
    // ... other fields
}
Data types to exclude from ser/de code generation
For data types that include @streaming union in any of their fields, we do NOT implement serde traits.
As of writing, the following Rust data types correspond to @streaming union:
- aws_smithy_http::event_stream::Receiver
- aws_smithy_http::event_stream::EventStreamSender
Here is an example of a data type affected by this decision:
aws_sdk_transcribestreaming::client::fluent_builders::StartMedicalStreamTranscription
We considered skipping the relevant fields on serialization and creating a custom de-serialization function that creates an event stream which always results in an error when a user tries to send/receive data. However, we believe that our decision is justified for the following reasons.
- In every operation that features event streams, the stream is ephemeral (tied to the HTTP connection) and is effectively unusable after serialization and deserialization.
- Most event stream operations don't have fields that go along with them, making the stream the sole component in them, which makes ser/de not so useful.
- An SDK that uses event streams, such as aws-sdk-transcribestreaming, has only just over 5000 all-time downloads, with recent downloads of just under 1000 as of writing (2023/01/21). This makes the implementation difficult to justify, since it would benefit only a small number of people.
Serde traits implemented on Builder of Output Types
Output data, such as aws_sdk_dynamodb::output::UpdateTableOutput, has builder types.
These builder types are available to users; however, no API requires users to build these data types themselves.
We considered removing serde traits from these data types.
Removing them would help reduce compile time; however, builder types can be useful, for example, for testing. We have prepared examples in the Use Case Examples section of the Appendix.
fn set_fields to allow users to use externally created Input
Currently, to set values on fluent builders, users must call setter methods for each field.
The SDK does not have a method that allows users to use a deserialized Input.
Thus, we add a new method fn set_fields to Client types.
This method accepts an input and replaces all parameters that Client has with the new ones.
pub fn set_fields(mut self, input_type: path::to::input_type) -> Self {
    self.inner = input_type;
    self
}
Users can use fn set_fields to replace the parameters in fluent builders.
You can find examples in the Use Case Examples section of the Appendix.
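The replace-everything behaviour of set_fields can be sketched with a mock fluent builder. ListTablesInput and ListTablesFluentBuilder below are hypothetical simplified types written for this illustration; they are not the generated SDK types, and the as_input accessor exists only so the result can be inspected.

```rust
/// Hypothetical, simplified input type (not the generated SDK type).
#[derive(Debug, Clone, Default, PartialEq)]
struct ListTablesInput {
    exclusive_start_table_name: Option<String>,
    limit: Option<i32>,
}

/// Hypothetical fluent builder that holds the input it will send.
#[derive(Debug, Default)]
struct ListTablesFluentBuilder {
    inner: ListTablesInput,
}

impl ListTablesFluentBuilder {
    /// Per-field setter, as fluent builders offer today.
    fn limit(mut self, value: i32) -> Self {
        self.inner.limit = Some(value);
        self
    }

    /// Replaces every parameter with the ones from `input`.
    fn set_fields(mut self, input: ListTablesInput) -> Self {
        self.inner = input;
        self
    }

    /// Illustration-only accessor for inspecting the current parameters.
    fn as_input(&self) -> &ListTablesInput {
        &self.inner
    }
}

fn main() {
    // Pretend this value came from a deserializer.
    let deserialized = ListTablesInput {
        exclusive_start_table_name: Some("table1".to_string()),
        limit: None,
    };
    let builder = ListTablesFluentBuilder::default()
        .limit(5)
        .set_fields(deserialized.clone());
    // The earlier `limit(5)` is gone: set_fields replaces, it does not merge.
    assert_eq!(builder.as_input(), &deserialized);
    println!("{:?}", builder.as_input());
}
```

Because the method replaces rather than merges, any per-field setters should be called after set_fields if both are used.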
Other Concerns
Model evolution
SDK will introduce new fields and we may see new data types in the future.
We believe that this will not be a problem.
Introduction of New Fields
Most fields are of type Option<T>.
When the user de-serializes data written before the new fields were introduced, the new fields will be assigned None.
If a field isn't an Option, serde can fill it with the value from the Default trait (when the field is marked with #[serde(default)]), unless custom de-serialization logic is specified.
If the new field is not an Option<T> and has no Default implementation, we must implement custom de-serialization logic.
In the case of serialization, the introduction of new fields will not be an issue unless the data format requires a schema (e.g. Parquet, Avro). However, this is outside the scope of this RFC.
Introduction of New Data Type
If a new field introduces a new data type, it will not require any additional work if the data type can derive serde traits.
If the data type cannot derive serde traits on its own, then we have two options.
To clarify, this is the same approach we took in the Data Types to Skip Serialization/Deserialization section.
- skip: We will simply skip serializing/de-serializing. However, we may need to implement custom serialization/de-serialization logic if a value is not wrapped with Option.
- custom serialization/de-serialization logic: We can implement tailored serialization/de-serialization logic.
Either way, we will mention this in the generated docs to avoid surprising users.
e.g.
#[derive(serde::Serialize, serde::Deserialize)]
struct OutputV1 {
    string_field: Option<String>,
}

#[derive(serde::Serialize, serde::Deserialize)]
struct OutputV2 {
    string_field: Option<String>,
    // this will always be treated as None value by serde
    #[serde(skip)]
    skip_not_serializable: Option<SomeComplexDataType>,
    // We can implement a custom serialization logic
    #[serde(serialize_with = "custom_serialization_logic", deserialize_with = "custom_deserialization_logic")]
    not_derive_able: SomeComplexDataType,
    // Serialization will be skipped, and de-serialization will be handled with the function provided on default tag
    #[serde(skip, default = "default_value")]
    skip_with_custom: DataTypeWithoutDefaultTrait,
}
Discussions
Sensitive Information
If serialized data contains sensitive information, it will not be masked. We note in the documentation of every struct field that serialization may expose such information, to ensure that users know this.
Compile Time
We ran the following benchmark on a c6a.2xlarge instance with 50 GB of gp2 SSD. The commit hash of the code is a8e2e19129aead4fbc8cf0e3d34df0188a62de9f.
It clearly shows an increase in compile time. Users are advised to consider the use of software such as sccache or mold to reduce the compile time.
- aws-sdk-dynamodb
  - when compiled with debug profile
command | real time | user time | sys time |
---|---|---|---|
cargo build | 0m35.728s | 2m24.243s | 0m11.868s |
cargo build --features unstable-serde-serialize | 0m38.079s | 2m26.082s | 0m11.631s |
cargo build --features unstable-serde-deserialize | 0m45.689s | 2m34.000s | 0m11.978s |
cargo build --all-features | 0m48.959s | 2m45.688s | 0m13.359s |
  - when compiled with release profile
command | real time | user time | sys time |
---|---|---|---|
cargo build --release | 0m52.040s | 5m0.841s | 0m11.313s |
cargo build --release --features unstable-serde-serialize | 0m53.153s | 5m4.069s | 0m11.577s |
cargo build --release --features unstable-serde-deserialize | 1m0.107s | 5m10.231s | 0m11.699s |
cargo build --release --all-features | 1m3.198s | 5m26.076s | 0m12.311s |
- aws-sdk-ec2
  - when compiled with debug profile
command | real time | user time | sys time |
---|---|---|---|
cargo build | 1m20.041s | 2m14.592s | 0m6.611s |
cargo build --features unstable-serde-serialize | 2m0.555s | 4m24.881s | 0m16.131s |
cargo build --features unstable-serde-deserialize | 3m10.857s | 5m34.246s | 0m18.844s |
cargo build --all-features | 3m31.473s | 6m1.052s | 0m19.681s |
  - when compiled with release profile
command | real time | user time | sys time |
---|---|---|---|
cargo build --release | 2m29.480s | 9m19.530s | 0m15.957s |
cargo build --release --features unstable-serde-serialize | 2m45.002s | 9m43.098s | 0m16.886s |
cargo build --release --features unstable-serde-deserialize | 3m47.531s | 10m52.017s | 0m18.404s |
cargo build --release --all-features | 3m45.208s | 8m46.168s | 0m10.211s |
Misleading Results
The SDK team previously expressed concern that serialized data may be misleading. We believe that the features implemented as part of this RFC do not produce misleading results, as we focus on builder types and their corresponding data types, which are mapped to serde's data model with the derive macro.
Appendix
Use Case Examples
use aws_sdk_dynamodb::{Client, Error};

async fn example(read_builder: bool) -> Result<(), Error> {
    // getting the client
    let shared_config = aws_config::load_from_env().await;
    let client = Client::new(&shared_config);

    // de-serializing input's builder types and input types from json
    let deserialized_input = if read_builder {
        let parameter: aws_sdk_dynamodb::input::list_tables_input::Builder =
            serde_json::from_str(include_str!("./builder.json")).unwrap();
        // setters take an Option; `build` on an input builder is assumed to return a Result
        parameter
            .set_exclusive_start_table_name(Some("some_name".to_string()))
            .build()
            .unwrap()
    } else {
        let input: aws_sdk_dynamodb::input::ListTablesInput =
            serde_json::from_str(include_str!("./input.json")).unwrap();
        input
    };

    // sending request using the deserialized input
    let res = client.list_tables().set_fields(deserialized_input).send().await?;
    println!("DynamoDB tables: {:?}", res.table_names);

    let out: aws_sdk_dynamodb::output::ListTablesOutput = {
        // say you want some of the field to have certain values
        let out_builder: aws_sdk_dynamodb::output::list_tables_output::Builder =
            serde_json::from_str(r#"
            {
                "table_names": [ "table1", "table2" ]
            }
            "#).unwrap();
        // but you don't really care about some other values
        out_builder
            .set_last_evaluated_table_name(res.last_evaluated_table_name().map(String::from))
            .build()
    };
    assert_eq!(res, out);

    // serializing json output
    let json_output = serde_json::to_string(&res).unwrap();
    // you can save the serialized output
    println!("{}", json_output);
    Ok(())
}
Changes checklist
- Implement human-readable serialization for DateTime and Blob in aws_smithy_types
- Implement non-human-readable serialization for DateTime and Blob in aws_smithy_types
- Implement Serialize and Deserialize for relevant data types in aws_smithy_types
- Modify Kotlin's codegen so that generated Builder and non-Builder types implement Serialize and Deserialize
- Add feature gate for Serialize and Deserialize
- Prepare examples
- Prepare reproducible compile time benchmark