More rationale

author Sergey Matveev <stargrave@stargrave.org>

Tue, 8 Oct 2024 11:47:41 +0000 (14:47 +0300)

committer Sergey Matveev <stargrave@stargrave.org>

Tue, 8 Oct 2024 12:56:22 +0000 (15:56 +0300)
diff --git a/spec/encoding/str.texi b/spec/encoding/str.texi
index 288349a082d4c58a2dba267394e6cdb8ce58d7c1bb0df5d24da0b3e84d71c204..be0e6e9d2e54d29f229f83f1613696fa2522559d35026d94a7e8bcae772c03fe 100644
--- a/spec/encoding/str.texi
+++ b/spec/encoding/str.texi
@@ -33,7 +33,7 @@ If length value equals to:
  String's length @strong{must} be encoded in shortest possible form.
  
  UTF-8 strings @strong{must} be valid UTF-8 sequences, except that null
-byte is not allowed.
+byte @strong{is not} allowed. It should be a normalized Unicode string.
  
  Example representations:
  
diff --git a/spec/index.texi b/spec/index.texi
index b188c2fe35d0cdd99cb291be25c0a9a1001502165e8275959db8a30a953b11d2..71fc1e210992bea74beb0222ee0c73aad2e6237eff0dd1914d5438c509e13c0e 100644
--- a/spec/index.texi
+++ b/spec/index.texi
@@ -27,62 +27,82 @@ of structured data. But why!?
      transparently.
  @end itemize
  
+Non-goal: very fast, machine-friendly decoding. As a rule, that
+requires a non-compact data representation.
+
  Are not there any satisfiable codecs?
  
-@multitable @columnfractions .30 .10 .10 .10 .10 .10 .10 .10
+@multitable @columnfractions .30 .05 .05 .05 .05 .05
  
  @headitem name @tab
      Schemaless @tab
      Simple @tab
      Deterministic @tab
      Streamable @tab
-    Compact @tab
-    Large strings @tab
-    Many types
-
-@item ASN.1 DER @tab N @tab @strong{N} @tab Y @tab N @tab ~ @tab Y @tab ~
-@item ASN.1 CER @tab N @tab @strong{N} @tab Y @tab Y @tab ~ @tab Y @tab ~
-@item @url{https://protobuf.dev/, Protocol Buffers}
-    @tab N @tab ~ @tab N @tab N @tab Y @tab N @tab Y
-@item @url{https://flatbuffers.dev/, FlatBuffers}
-    @tab Y @tab N @tab N @tab N @tab N @tab Y @tab Y
+    Compact
+
+@item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#DER_encoding, DER} @tab
+    N @tab @strong{N} @tab Y @tab N @tab N
+@item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#CER_encoding, CER} @tab
+    N @tab @strong{N} @tab Y @tab Y @tab N
+@item @url{https://datatracker.ietf.org/doc/html/rfc1014, XDR} @tab
+    N @tab Y @tab N @tab N @tab N
  @item @url{https://bsonspec.org/, BSON} @tab
-    Y @tab Y @tab N @tab N @tab N @tab N @tab Y
+    Y @tab Y @tab N @tab N @tab N
  @item @url{https://msgpack.org/, MessagePack} @tab
-    Y @tab Y @tab N @tab N @tab N @tab N @tab Y
+    Y @tab Y @tab N @tab N @tab Y
  @item @url{https://datatracker.ietf.org/doc/html/rfc8949, CBOR} @tab
-    Y @tab N @tab N @tab Y @tab Y @tab Y @tab Y
-@item Deterministic Encoded CBOR @tab
-    Y @tab N @tab Y @tab N @tab Y @tab Y @tab Y
+    Y @tab N @tab N @tab Y @tab Y
+@item @url{https://datatracker.ietf.org/doc/html/draft-mcnally-deterministic-cbor-11, dCBOR} @tab
+    Y @tab @strong{N} @tab Y @tab N @tab Y
  @item @url{http://cr.yp.to/proto/netstrings.txt, Netstrings} @tab
-    Y @tab Y @tab Y @tab N @tab N @tab Y @tab N
+    Y @tab Y @tab Y @tab N @tab N
  @item @url{https://wiki.theory.org/BitTorrentSpecification#Bencoding, Bencode} @tab
-    Y @tab Y @tab Y @tab Y @tab N @tab Y @tab N
-@item @url{https://en.wikipedia.org/wiki/Canonical_S-expressions, Canonical S-expression}
-    @tab Y @tab Y @tab Y @tab Y @tab N @tab Y @tab N
+    Y @tab Y @tab Y @tab Y @tab N
+@item @url{https://en.wikipedia.org/wiki/Canonical_S-expressions, Canonical S-expression} @tab
+    Y @tab Y @tab Y @tab Y @tab N
  @item YAC @tab
-    Y @tab Y @tab Y @tab Y @tab Y @tab Y @tab Y
+    Y @tab Y @tab Y @tab Y @tab Y
  
  @end multitable
  
-@itemize
-@item Streamable formats allow you to send a part of the data
-    immediately, for example element of the list or map. Simplifying
-    encoder and requiring less memory usage. All formats who needs to
-    know the size of maps/lists are not streamable.
-@item Compactness means small amount of bytes overhead for the given
-    data. For example any codec with ASCII decimal lengths of the
-    strings or integers representation is not compact.
-@item "Large strings" is a strings bigger than 4GiB. Some codecs allow
-    you to send only even 2GiB of data in a single chunk. That will
-    force you code and structures be more complex when dealing with big
-    data transfer
-@item "Many types" is a subjective thing of course. If codec can encode
-    everything JSON can, then it is enough types. ASN.1 codecs support
-    many various types, but they can not represent arbitrary map.
-@item Hardly you will find CBOR libraries supporting strict validation
-    of deterministically encoded CBOR structures.
-@end itemize
+@multitable @columnfractions .30 .05 .05 .05 .05 .05 .05
+
+@headitem name @tab
+    Large strings @tab
+    Human strings @tab
+    Integers @tab
+    Lists @tab
+    Structures @tab
+    Datetime
+
+@item ASN.1 DER @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+@item ASN.1 CER @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+@item XDR @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab N
+@item BSON @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab Y
+@item MessagePack @tab
+    N @tab Y @tab Y @tab Y @tab Y @tab N
+@item CBOR @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab N
+@item dCBOR @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab N
+@item Netstrings @tab
+    Y @tab N @tab N @tab N @tab N @tab N
+@item Bencode @tab
+    Y @tab N @tab Y @tab Y @tab Y @tab N
+@item CSExp @tab
+    Y @tab N @tab N @tab Y @tab N @tab N
+@item YAC @tab
+    Y @tab Y @tab Y @tab Y @tab Y @tab Y
+
+@end multitable
+
+However, you will hardly find a wide range of CBOR libraries supporting
+strict validation of deterministically encoded CBOR structures.
  
  YAC deals with those problems by using only streaming deterministic
  encoding. Its other important differences:
@@ -103,6 +123,7 @@ encoding. Its other important differences:
  
  @insertcopying
  
+@include rationale.texi
  @include install.texi
  @include encoding/index.texi
  @include schema.texi
diff --git a/spec/rationale.texi b/spec/rationale.texi
new file mode 100644
index 0000000..6102f36
--- /dev/null
+++ b/spec/rationale.texi
@@ -0,0 +1,67 @@
+@node Rationale
+@unnumbered Rationale
+
+@itemize
+
+@item
+We do not want ASCII decimal parsing. Loading an integer that way is
+neither trivial nor very fast, although the form is human-readable and
+understandable. It is also not compact.
+
+@item
+We do not want varints (where the most significant bit means continuation)
+or zig-zag-like encoding. That is non-trivial code, which prevents fast
+integer loading.
+
+@item
+We do not want formats where maps and lists need to know their
+lengths/sizes in advance. That makes streaming impossible, complicates
+the encoder and requires more memory. Containers can instead be
+terminated with an explicit signal tag.
+
+@item
+We want formats with the ability to store maps/dictionaries/tables. Of
+course they can be emulated by reassembling lists, but that is a manual
+action after the codec has done its job.
+
+@item
+Differentiation of binary and human-readable strings (UTF-8 for example)
+is a must for a format that is intended to be looked at and analysed by a human.
+
+@item
+ISO-based (string) representation of dates is a no, because it requires
+complex parsing and takes much space. Naive UNIX timestamp
+representation raises questions about its length and about dealing with
+dates before 1970. Moreover, such timestamps are not suitable for tasks
+requiring monotonic clocks, because of UTC.
+
+@item
+No tagging ability, context specifying, marking, hinting, extension
+mechanism or anything like that. Such mechanisms hugely complicate the
+state and raise questions about how to deal with unknown
+entities. Any unsupported data type must be a string, possibly enveloped
+in a map with additional data, for example @code{@{"cp": "koi8-r", "str": BIN(...)@}}.
+
+@item
+Support for large (>2GiB) strings is a must. Nowadays even a single
+multimedia file can easily exceed that size. A general-purpose codec must
+be able to send it without the complication of inventing your own chunked
+format.
+
+@item
+Does embedding the string's length, as in YAC and CBOR, not make for more
+complicated code? Definitely. But a schemaless format has so many short
+strings specifying map/structure keys, and so many algorithm
+identifiers that are also relatively short human-readable strings. So
+it is a compromise between slightly larger code and much shorter
+resulting structures, and it is worth it.
+
+@item
+We want a strong distinction between continuous strings and streamable ones
+(BLOBs). ASN.1 CER does not distinguish them, making the representation of
+every string in memory far from convenient and easy to work with.
+Different tasks have different constraints: many of them do not need
+streamable strings at all, while some may use only those. YAC gives
+flexibility in choosing the necessary data type for your needs.
+
+@end itemize
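
The rationale's point about containers terminated by an explicit signal tag, instead of a length known in advance, can be illustrated with a minimal Python sketch. It is not taken from the YAC specification: the tag values, the one-byte item length and the function names below are hypothetical, chosen only to show that the encoder can emit each element as soon as it is produced and never needs to know how many elements there are.

```python
import io
from typing import BinaryIO, Iterable, Iterator

# Hypothetical tag values for this sketch only; NOT the YAC wire format.
TAG_END = 0x00   # explicit end-of-container signal
TAG_LIST = 0x01  # start of a list
TAG_ITEM = 0x02  # binary string item, followed by a one-byte length


def encode_list_streaming(items: Iterable[bytes], out: BinaryIO) -> None:
    """Write items one by one; the total count is never needed."""
    out.write(bytes([TAG_LIST]))
    for item in items:
        if len(item) > 255:
            raise ValueError("this sketch only supports items up to 255 bytes")
        out.write(bytes([TAG_ITEM, len(item)]))
        out.write(item)
    out.write(bytes([TAG_END]))


def decode_list_streaming(src: BinaryIO) -> Iterator[bytes]:
    """Yield items as they are read, stopping at the end tag."""
    if src.read(1) != bytes([TAG_LIST]):
        raise ValueError("expected list tag")
    while True:
        tag = src.read(1)
        if tag == bytes([TAG_END]):
            return
        if tag != bytes([TAG_ITEM]):
            raise ValueError("unexpected tag or end of input")
        length = src.read(1)
        if len(length) != 1:
            raise ValueError("truncated item length")
        data = src.read(length[0])
        if len(data) != length[0]:
            raise ValueError("truncated item data")
        yield data


if __name__ == "__main__":
    buf = io.BytesIO()
    # A generator has no len(), yet it can still be encoded.
    encode_list_streaming((s for s in [b"a", b"bc"]), buf)
    buf.seek(0)
    print(list(decode_list_streaming(buf)))
```

A length-prefixed list would instead force the encoder to collect or count all elements before writing the first byte of the container, which is the extra memory use and encoder complexity the rationale objects to.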