From: Sergey Matveev
Date: Tue, 8 Oct 2024 11:47:41 +0000 (+0300)
Subject: More rationale
X-Git-Url: http://www.git.cypherpunks.su/?a=commitdiff_plain;h=7feb6d6f8d5077031866668ce10274ded8c1d95724f00b6d4eba7e64fb23d721;p=keks.git

More rationale
---

diff --git a/spec/encoding/str.texi b/spec/encoding/str.texi
index 288349a..be0e6e9 100644
--- a/spec/encoding/str.texi
+++ b/spec/encoding/str.texi
@@ -33,7 +33,7 @@ If length value equals to:
 String's length @strong{must} be encoded in shortest possible form.
 
 UTF-8 strings @strong{must} be valid UTF-8 sequences, except that null
-byte is not allowed.
+byte @strong{is not} allowed. That should be normalized Unicode string.
 
 Example representations:
 
diff --git a/spec/index.texi b/spec/index.texi
index b188c2f..71fc1e2 100644
--- a/spec/index.texi
+++ b/spec/index.texi
@@ -27,62 +27,82 @@ of structured data. But why!?
     transparently.
 @end itemize
 
+Non goal is: very fast machine friendly decoding capability. As a rule,
+that means non-compact data representation.
+
 Are not there any satisfiable codecs?
 
-@multitable @columnfractions .30 .10 .10 .10 .10 .10 .10 .10
+@multitable @columnfractions .30 .05 .05 .05 .05 .05
 
 @headitem name @tab
     Schemaless @tab
     Simple @tab
     Deterministic @tab
     Streamable @tab
-    Compact @tab
-    Large strings @tab
-    Many types
-
-@item ASN.1 DER @tab N @tab @strong{N} @tab Y @tab N @tab ~ @tab Y @tab ~
-@item ASN.1 CER @tab N @tab @strong{N} @tab Y @tab Y @tab ~ @tab Y @tab ~
-@item @url{https://protobuf.dev/, Protocol Buffers}
-    @tab N @tab ~ @tab N @tab N @tab Y @tab N @tab Y
-@item @url{https://flatbuffers.dev/, FlatBuffers}
-    @tab Y @tab N @tab N @tab N @tab N @tab Y @tab Y
+    Compact
+
+@item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#DER_encoding, DER} @tab
+    N @tab @strong{N} @tab Y @tab N @tab N
+@item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#CER_encoding, CER} @tab
+    N @tab @strong{N} @tab Y @tab Y @tab N
+@item @url{https://datatracker.ietf.org/doc/html/rfc1014, XDR} @tab
+    N @tab Y @tab N @tab N @tab N
 @item @url{https://bsonspec.org/, BSON} @tab
-    Y @tab Y @tab N @tab N @tab N @tab N @tab Y
+    Y @tab Y @tab N @tab N @tab N
 @item @url{https://msgpack.org/, MessagePack} @tab
-    Y @tab Y @tab N @tab N @tab N @tab N @tab Y
+    Y @tab Y @tab N @tab N @tab Y
 @item @url{https://datatracker.ietf.org/doc/html/rfc8949, CBOR} @tab
-    Y @tab N @tab N @tab Y @tab Y @tab Y @tab Y
-@item Deterministic Encoded CBOR @tab
-    Y @tab N @tab Y @tab N @tab Y @tab Y @tab Y
+    Y @tab N @tab N @tab Y @tab Y
+@item @url{https://datatracker.ietf.org/doc/html/draft-mcnally-deterministic-cbor-11, dCBOR} @tab
+    Y @tab @strong{N} @tab Y @tab N @tab Y
 @item @url{http://cr.yp.to/proto/netstrings.txt, Netstrings} @tab
-    Y @tab Y @tab Y @tab N @tab N @tab Y @tab N
+    Y @tab Y @tab Y @tab N @tab N
 @item @url{https://wiki.theory.org/BitTorrentSpecification#Bencoding, Bencode} @tab
-    Y @tab Y @tab Y @tab Y @tab N @tab Y @tab N
-@item @url{https://en.wikipedia.org/wiki/Canonical_S-expressions, Canonical S-expression}
-    @tab Y @tab Y @tab Y @tab Y @tab N @tab Y @tab N
+    Y @tab Y @tab Y @tab Y @tab N
+@item @url{https://en.wikipedia.org/wiki/Canonical_S-expressions, Canonical S-expression} @tab
+    Y @tab Y @tab Y @tab Y @tab N
 @item YAC @tab
-    Y @tab Y @tab Y @tab Y @tab Y @tab Y @tab Y
+    Y @tab Y @tab Y @tab Y @tab Y
 
 @end multitable
 
-@itemize
-@item Streamable formats allow you to send a part of the data
-    immediately, for example element of the list or map. Simplifying
-    encoder and requiring less memory usage. All formats who needs to
-    know the size of maps/lists are not streamable.
-@item Compactness means small amount of bytes overhead for the given
-    data. For example any codec with ASCII decimal lengths of the
-    strings or integers representation is not compact.
-@item "Large strings" is a strings bigger than 4GiB. Some codecs allow
-    you to send only even 2GiB of data in a single chunk. That will
-    force you code and structures be more complex when dealing with big
-    data transfer
-@item "Many types" is a subjective thing of course. If codec can encode
-    everything JSON can, then it is enough types. ASN.1 codecs support
-    many various types, but they can not represent arbitrary map.
-@item Hardly you will find CBOR libraries supporting strict validation
-    of deterministically encoded CBOR structures.
-@end itemize
+@multitable @columnfractions .30 .05 .05 .05 .05 .05 .05
+
+@headitem name @tab
+    Large strings @tab
+    Human strings @tab
+    Integers @tab
+    Lists @tab
+    Structures @tab
+    Datetime
+
+@item ASN.1 DER @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+@item ASN.1 CER @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+@item XDR @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab N
+@item BSON @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab Y
+@item MessagePack @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab N
+@item CBOR @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab N
+@item dCBOR @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab N
+@item Netstrings @tab
+    Y @tab N @tab N @tab N @tab N @tab N
+@item Bencode @tab
+    Y @tab N @tab Y @tab Y @tab Y @tab N
+@item CSExp @tab
+    Y @tab N @tab N @tab Y @tab N @tab N
+@item YAC @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+
+@end multitable
+
+But hardly you will find wide range of CBOR libraries supporting strict
+validation of deterministically encoded CBOR structures.
 
 YAC deals with those problems by using only streaming deterministic
 encoding. Its other important differences:
@@ -103,6 +123,7 @@ encoding. Its other important differences:
 
 @insertcopying
 
+@include rationale.texi
 @include install.texi
 @include encoding/index.texi
 @include schema.texi
diff --git a/spec/rationale.texi b/spec/rationale.texi
new file mode 100644
index 0000000..6102f36
--- /dev/null
+++ b/spec/rationale.texi
@@ -0,0 +1,67 @@
+@node Rationale
+@unnumbered Rationale
+
+@itemize
+
+@item
+We do not want ASCII decimal parsing. This is not trivial and not very
+fast to load an integer. Although it is human readable and
+understandable. Also it is not compact.
+
+@item
+We do not want varints (where most significant bit means continuation)
+and zig-zag-like encoding. This is not trivial code, prohibiting fast
+integer load.
+
+@item
+We do not want formats where maps and lists need to know their
+lengths/sizes in advance. That means no streaming possibility. That
+complicates encoder and requires more memory usage. Containers can be
+terminated with explicit signal tag.
+
+@item
+We want formats with ability to store maps/dictionaries/tables. Of
+course they can be emulated by reassembling lists, but that is manual
+action after the codec did his job.
+
+@item
+Differentiation of binary and human-readable strings (UTF-8 for example)
+is a must for a format that is intended to be looked and analysed by a human.
+
+@item
+ISO-based (string) representation of data is a no: because it requires
+complex parsing and takes much space. Naive UNIX timestamp
+representation raises questions about its length and dealing with the
+dates before 1970. Moreover they are not suitable for tasks requiring
+monotonous clocks, because of UTC.
+
+@item
+No tagging ability, context specifying, marking, hinting, extension
+mechanism or anything like that. That brings huge complications to the
+state and questions when you do not know how to deal with unknown
+entities. Any unsupported data type must be a string, possibly enveloped
+in a map with additional data. @code{@{"cp": "koi8-r", "str": BIN(...)@}}.
+
+@item
+Large (>2GiB) strings support is a must. Nowadays even a single
+multimedia file can easily exceed that size. General-purpose codec must
+be able to send it without complication of inventing your own chunked
+format.
+
+@item
+Is not embedded strings length, like in YAC and CBOR, is a more
+complicated code? Definitely. But there are so many short strings in a
+schemaless format for specifying map/structure keys. So many algorithm
+identifiers, that are also relatively short human-readable strings. So
+that is a compromise between slightly larger code and much shorter
+resulting structures, that is worth of it.
+
+@item
+We want strong distinguishing of continuous strings and streamable ones
+(BLOBs). ASN.1 CER does not distinguish them, making representation of
+every string in memory far from being convenient and easy to work with.
+Different tasks have different constraints: many of them do not need
+streamable strings at all, some of them may use them solely. YAC gives
+flexibility in choosing necessary data type for your needs.
+
+@end itemize
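
The rationale item about containers "terminated with explicit signal tag" can be illustrated with a minimal sketch. It is not the YAC wire format: the one-byte tags `TAG_LIST`, `TAG_ITEM` and `TAG_END` and the functions `encode_stream` and `decode_stream` are all invented here for illustration, and items are assumed to be integers in the range 0..255. The point is only that the encoder never needs the element count, so it can emit each item as soon as it is produced.

```python
from typing import Iterable, Iterator, List

# Hypothetical one-byte tags, invented for this sketch only.
TAG_LIST = b"L"  # opens a list
TAG_ITEM = b"I"  # precedes one single-byte item
TAG_END = b"E"   # explicit terminator that closes the list


def encode_stream(items: Iterable[int]) -> Iterator[bytes]:
    """Yield encoded chunks one by one, without knowing the list length."""
    yield TAG_LIST
    for item in items:
        # No element count is written, so nothing has to be buffered.
        yield TAG_ITEM + bytes([item])
    yield TAG_END


def decode_stream(data: bytes) -> List[int]:
    """Collect items until the terminator tag is met."""
    if data[:1] != TAG_LIST:
        raise ValueError("list tag expected")
    out: List[int] = []
    pos = 1
    while data[pos:pos + 1] == TAG_ITEM:
        if pos + 1 >= len(data):
            raise ValueError("truncated item")
        out.append(data[pos + 1])
        pos += 2
    if data[pos:pos + 1] != TAG_END:
        raise ValueError("terminator expected")
    return out
```

A length-prefixed container would instead force the encoder to collect the whole generator first, which is the extra memory usage the rationale refers to.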