From: Sergey Matveev <stargrave@stargrave.org>
Date: Tue, 8 Oct 2024 11:47:41 +0000 (+0300)
Subject: More rationale
X-Git-Url: http://www.git.cypherpunks.su/?a=commitdiff_plain;h=7feb6d6f8d5077031866668ce10274ded8c1d95724f00b6d4eba7e64fb23d721;p=keks.git

More rationale
---

diff --git a/spec/encoding/str.texi b/spec/encoding/str.texi
index 288349a..be0e6e9 100644
--- a/spec/encoding/str.texi
+++ b/spec/encoding/str.texi
@@ -33,7 +33,7 @@ If length value equals to:
 String's length @strong{must} be encoded in shortest possible form.
 
 UTF-8 strings @strong{must} be valid UTF-8 sequences, except that null
-byte is not allowed.
+byte @strong{is not} allowed. Strings should be normalized Unicode.
 
 Example representations:
 
diff --git a/spec/index.texi b/spec/index.texi
index b188c2f..71fc1e2 100644
--- a/spec/index.texi
+++ b/spec/index.texi
@@ -27,62 +27,82 @@ of structured data. But why!?
     transparently.
 @end itemize
 
+A non-goal: very fast, machine-friendly decoding. As a rule, that
+requires a non-compact data representation.
+
 Are not there any satisfiable codecs?
 
-@multitable @columnfractions .30 .10 .10 .10 .10 .10 .10 .10
+@multitable @columnfractions .30 .05 .05 .05 .05 .05
 
 @headitem name @tab
     Schemaless @tab
     Simple @tab
     Deterministic @tab
     Streamable @tab
-    Compact @tab
-    Large strings @tab
-    Many types
-
-@item ASN.1 DER @tab N @tab @strong{N} @tab Y @tab N @tab ~ @tab Y @tab ~
-@item ASN.1 CER @tab N @tab @strong{N} @tab Y @tab Y @tab ~ @tab Y @tab ~
-@item @url{https://protobuf.dev/, Protocol Buffers}
-    @tab N @tab ~ @tab N @tab N @tab Y @tab N @tab Y
-@item @url{https://flatbuffers.dev/, FlatBuffers}
-    @tab Y @tab N @tab N @tab N @tab N @tab Y @tab Y
+    Compact
+
+@item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#DER_encoding, DER} @tab
+    N @tab @strong{N} @tab Y @tab N @tab N
+@item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#CER_encoding, CER} @tab
+    N @tab @strong{N} @tab Y @tab Y @tab N
+@item @url{https://datatracker.ietf.org/doc/html/rfc1014, XDR} @tab
+    N @tab Y @tab N @tab N @tab N
 @item @url{https://bsonspec.org/, BSON} @tab
-    Y @tab Y @tab N @tab N @tab N @tab N @tab Y
+    Y @tab Y @tab N @tab N @tab N
 @item @url{https://msgpack.org/, MessagePack} @tab
-    Y @tab Y @tab N @tab N @tab N @tab N @tab Y
+    Y @tab Y @tab N @tab N @tab Y
 @item @url{https://datatracker.ietf.org/doc/html/rfc8949, CBOR} @tab
-    Y @tab N @tab N @tab Y @tab Y @tab Y @tab Y
-@item Deterministic Encoded CBOR @tab
-    Y @tab N @tab Y @tab N @tab Y @tab Y @tab Y
+    Y @tab N @tab N @tab Y @tab Y
+@item @url{https://datatracker.ietf.org/doc/html/draft-mcnally-deterministic-cbor-11, dCBOR} @tab
+    Y @tab @strong{N} @tab Y @tab N @tab Y
 @item @url{http://cr.yp.to/proto/netstrings.txt, Netstrings} @tab
-    Y @tab Y @tab Y @tab N @tab N @tab Y @tab N
+    Y @tab Y @tab Y @tab N @tab N
 @item @url{https://wiki.theory.org/BitTorrentSpecification#Bencoding, Bencode} @tab
-    Y @tab Y @tab Y @tab Y @tab N @tab Y @tab N
-@item @url{https://en.wikipedia.org/wiki/Canonical_S-expressions, Canonical S-expression}
-    @tab Y @tab Y @tab Y @tab Y @tab N @tab Y @tab N
+    Y @tab Y @tab Y @tab Y @tab N
+@item @url{https://en.wikipedia.org/wiki/Canonical_S-expressions, Canonical S-expression} @tab
+    Y @tab Y @tab Y @tab Y @tab N
 @item YAC @tab
-    Y @tab Y @tab Y @tab Y @tab Y @tab Y @tab Y
+    Y @tab Y @tab Y @tab Y @tab Y
 
 @end multitable
 
-@itemize
-@item Streamable formats allow you to send a part of the data
-    immediately, for example element of the list or map. Simplifying
-    encoder and requiring less memory usage. All formats who needs to
-    know the size of maps/lists are not streamable.
-@item Compactness means small amount of bytes overhead for the given
-    data. For example any codec with ASCII decimal lengths of the
-    strings or integers representation is not compact.
-@item "Large strings" is a strings bigger than 4GiB. Some codecs allow
-    you to send only even 2GiB of data in a single chunk. That will
-    force you code and structures be more complex when dealing with big
-    data transfer
-@item "Many types" is a subjective thing of course. If codec can encode
-    everything JSON can, then it is enough types. ASN.1 codecs support
-    many various types, but they can not represent arbitrary map.
-@item Hardly you will find CBOR libraries supporting strict validation
-    of deterministically encoded CBOR structures.
-@end itemize
+@multitable @columnfractions .30 .05 .05 .05 .05 .05 .05
+
+@headitem name @tab
+    Large strings @tab
+    Human strings @tab
+    Integers @tab
+    Lists @tab
+    Structures @tab
+    Datetime
+
+@item ASN.1 DER @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+@item ASN.1 CER @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+@item XDR @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab N
+@item BSON @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab Y
+@item MessagePack @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab N
+@item CBOR @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab N
+@item dCBOR @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab N
+@item Netstrings @tab
+    Y @tab N @tab N @tab N @tab N @tab N
+@item Bencode @tab
+    Y @tab N @tab Y @tab Y @tab Y @tab N
+@item CSExp @tab
+    Y @tab N @tab N @tab Y @tab N @tab N
+@item YAC @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+
+@end multitable
+
+However, you will hardly find a wide range of CBOR libraries that support
+strict validation of deterministically encoded CBOR structures.
 
 YAC deals with those problems by using only streaming deterministic
 encoding. Its other important differences:
@@ -103,6 +123,7 @@ encoding. Its other important differences:
 
 @insertcopying
 
+@include rationale.texi
 @include install.texi
 @include encoding/index.texi
 @include schema.texi
diff --git a/spec/rationale.texi b/spec/rationale.texi
new file mode 100644
index 0000000..6102f36
--- /dev/null
+++ b/spec/rationale.texi
@@ -0,0 +1,67 @@
+@node Rationale
+@unnumbered Rationale
+
+@itemize
+
+@item
+We do not want ASCII decimal parsing. Loading an integer that way is
+neither trivial nor very fast, although the result is human-readable
+and understandable. It is also not compact.
+
+@item
+We do not want varints (where the most significant bit means
+continuation) or zig-zag-like encoding. That code is not trivial and
+prevents fast integer loading.
+
+@item
+We do not want formats where maps and lists need to know their
+lengths/sizes in advance. That rules out streaming, complicates the
+encoder and requires more memory. Containers can instead be
+terminated with an explicit signal tag.
+
+@item
+We want a format able to store maps/dictionaries/tables. Of
+course they can be emulated by reassembling lists, but that is a manual
+step after the codec has done its job.
+
+@item
+Distinguishing binary from human-readable strings (UTF-8 for example)
+is a must for a format intended to be inspected and analysed by a human.
+
+@item
+ISO-based (string) representation of dates is a no, because it requires
+complex parsing and takes much space. A naive UNIX timestamp
+representation raises questions about its length and about handling
+dates before 1970. Moreover, such timestamps are not suitable for tasks
+requiring monotonic clocks, because of UTC.
+
+@item
+No tagging, context specification, marking, hinting, extension
+mechanism or anything like that. Those bring huge complications to the
+state and raise questions about how to deal with unknown
+entities. Any unsupported data type must be a string, possibly enveloped
+in a map with additional data: @code{@{"cp": "koi8-r", "str": BIN(...)@}}.
+
+@item
+Support for large (>2GiB) strings is a must. Nowadays even a single
+multimedia file can easily exceed that size. A general-purpose codec must
+be able to send it without the complication of inventing your own chunked
+format.
+
+@item
+Is embedding the string length, as in YAC and CBOR, not more
+complicated code? Definitely. But a schemaless format has so many short
+strings specifying map/structure keys, and so many algorithm
+identifiers, which are also relatively short human-readable strings. So
+that is a compromise between slightly larger code and much shorter
+resulting structures, and it is worth it.
+
+@item
+We want a strong distinction between continuous strings and streamable
+ones (BLOBs). ASN.1 CER does not distinguish them, making the in-memory
+representation of every string far from convenient and easy to work with.
+Different tasks have different constraints: many of them do not need
+streamable strings at all, some may use them exclusively. YAC gives
+flexibility in choosing the data type that fits your needs.
+
+@end itemize