From: Sergey Matveev Date: Wed, 30 Oct 2024 08:47:33 +0000 (+0300) Subject: Various trivial additions X-Git-Url: http://www.git.cypherpunks.su/?a=commitdiff_plain;h=999acedefaac107e41917f255d7642577486ca3e5804d0a67674cbd5fb13f40e;p=keks.git Various trivial additions --- diff --git a/spec/encoding/blob.texi b/spec/encoding/blob.texi index 9f3c572..c3ba278 100644 --- a/spec/encoding/blob.texi +++ b/spec/encoding/blob.texi @@ -1,14 +1,15 @@ @node Blobs @cindex BLOB +@cindex chunk @section Blobs Blob (binary large object) allows you to transfer binary data in chunks, -in a streaming way, when data may not fit in memory at once. +in a streaming way, when data may not fit in memory. 64-bit big-endian integer follows the BLOB tag, setting the following -chunks payload size (+1). Then come zero or more NIL tags with -fixed-length payload after each of them. Blob is terminated by -@ref{Strings, BIN}, probably having zero length. +chunks payload size (+1). Then come zero or more NIL tags, each followed +by fixed-length payload. Blob is terminated by @ref{Strings, BIN}, +probably having zero length. Data format definition must specify exact chunk size expected to be used, if it needs deterministic encoding. diff --git a/spec/encoding/cont.texi b/spec/encoding/cont.texi index 6c1774c..c915bb4 100644 --- a/spec/encoding/cont.texi +++ b/spec/encoding/cont.texi @@ -1,15 +1,19 @@ @node Containers +@cindex containers @section Containers Containers do not have any explicit length, but are terminated by EOC (end of contents) tag. +@cindex LIST +@cindex EOC LIST contains a concatenation of items of arbitrary type. @verbatim LIST [ITEM0 || ITEM1 || ...] EOC @end verbatim +@cindex MAP MAP contains concatenation of @ref{Strings, STR(key)}-value pairs. Keys @strong{must} be non-empty, unique and length-first bytewise ascending ordered. @@ -20,6 +24,10 @@ MAP [STR(KEY0) || ITEM0 || STR(KEY1) || ITEM1 || ... 
] EOC Hint: Encoding code for known format can be ordered itself to emit values in an already properly sorted way. +@cindex SET +SET is emulated by using MAPs with NIL values. That gives only 1-byte +overhead for each element, but reuses already existing code. + Example representations: @multitable @columnfractions .5 .5 @@ -27,5 +35,6 @@ Example representations: @item LIST[] @tab @code{08 00} @item LIST[INT(123) FALSE] @tab @code{08 207B 02 00} @item MAP[foo: LIST["bar"]] @tab @code{09 C3666F6F 08 C3626172 00 00} +@item SET[sig, dh] @tab @code{09 C26468 01 C3736967 01 00} @end multitable diff --git a/spec/encoding/index.texi b/spec/encoding/index.texi index 9126cfc..0dfd7f0 100644 --- a/spec/encoding/index.texi +++ b/spec/encoding/index.texi @@ -2,7 +2,7 @@ @cindex encoding @unnumbered Encoding -YAC can store various primitive scalar types (strings, integers, ...), +YAC can store various primitive scalar types (strings, integers, ...) and container types (lists, maps, ...). Serialisation process is just emitting the TLV-like encoding for each item recursively. diff --git a/spec/encoding/int.texi b/spec/encoding/int.texi index 0d71a40..df31255 100644 --- a/spec/encoding/int.texi +++ b/spec/encoding/int.texi @@ -13,9 +13,8 @@ Long form is encoded as a big-endian number of varying length. @code{*INT(len=1)..*INT(len=16)} types are for 8-, 16-, 24-, ..., 128-bit integer representations. -Shortest possible form @strong{must} be used. Leading zero bytes are -@strong{forbidden}. Short values (<32) @strong{must} be encoded in a -short form. +Shortest possible form @strong{must} be used, that means no leading zero byte. +Short values (<32) @strong{must} be encoded in a short form. Negative integers store their absolute value the same way as positive integers. After decoding, their value is subtracted from -1. Negative @@ -23,9 +22,9 @@ value encoded as @code{0x02} means @code{-1 - 0x02 => -3}. 
Hint: both positive and negative long integer tag's value keeps the length in the last 16 bits. So there is no need in dealing with every -reserved value. You can check first 4 bits of the header to determine is -it positive or negative integer, and then treat remaining 4 bits as a -length (+1). +tag's reserved value. You can check first 4 bits of the header to +determine is it positive or negative integer, and then treat remaining 4 +bits as a length (+1). Example representations: diff --git a/spec/index.texi b/spec/index.texi index 71fc1e2..cbb24e9 100644 --- a/spec/index.texi +++ b/spec/index.texi @@ -23,7 +23,7 @@ of structured data. But why!? @item Its encoding must be deterministic -- there must be only a single representation of the structured data, allowing its usage in cryptography-related contexts. -@item It should support enough data types to be able to replace JSON +@item It should support enough data types for being able to replace JSON transparently. @end itemize @@ -34,12 +34,7 @@ Are not there any satisfiable codecs? @multitable @columnfractions .30 .05 .05 .05 .05 .05 -@headitem name @tab - Schemaless @tab - Simple @tab - Deterministic @tab - Streamable @tab - Compact +@headitem @tab Schemaless @tab Simple @tab Deterministic @tab Streamable @tab Compact @item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#DER_encoding, DER} @tab N @tab @strong{N} @tab Y @tab N @tab N @@ -68,13 +63,7 @@ Are not there any satisfiable codecs? @multitable @columnfractions .30 .05 .05 .05 .05 .05 .05 -@headitem name @tab - Large strings @tab - Human strings @tab - Integers @tab - Lists @tab - Structures @tab - Datetime +@headitem @tab Large strings @tab Human strings @tab Integers @tab Lists @tab Structures @tab Datetime @item ASN.1 DER @tab Y @tab Y @tab Y @tab Y @tab Y @tab Y diff --git a/spec/install.texi b/spec/install.texi index 09dbcb0..2ede379 100644 --- a/spec/install.texi +++ b/spec/install.texi @@ -7,7 +7,8 @@ and Tcl. 
But all of them are currently badly covered with tests. @cindex git You can obtain development source code with @command{git clone git://git.cypherpunks.su/yac.git} -(also you can use @url{https://git.cypherpunks.su/yac.git}). +(also you can use @url{http://git.cypherpunks.su/yac.git}, +@url{https://git.cypherpunks.su/yac.git}). Also there is @url{https://yggdrasil-network.github.io/, Yggdrasil} accessible address: @url{http://y.www.yac.cypherpunks.su/}. diff --git a/spec/rationale.texi b/spec/rationale.texi index d34d426..1c091e0 100644 --- a/spec/rationale.texi +++ b/spec/rationale.texi @@ -4,49 +4,46 @@ @itemize @item -We do not want ASCII decimal parsing. This is not trivial and not very -fast to load an integer. Although it is human readable and -understandable. Also it is not compact. +No ASCII decimal parsing. That is not trivial code, not fast, not +compact. Although it is human readable and understandable. @item -We do not want varints (where most significant bit means continuation) -and zig-zag-like encoding. This is not trivial code, prohibiting fast -integer load. +No varints (where most significant bit means continuation) and +zig-zag-like encoding. That is not trivial code. @item -We do not want formats where maps and lists need to know their -lengths/sizes in advance. That means no streaming possibility. That -complicates encoder and requires more memory usage. Containers can be -terminated with explicit signal tag. +No formats where maps and lists need to know their lengths/sizes in +advance. That means no streaming possibility. Complicates encoder and +requires more memory usage. @item -We want formats with ability to store maps/dictionaries/tables. Of -course they can be emulated by reassembling lists, but that is manual -action after the codec did his job. +No formats without ability to store maps/dictionaries/tables. Of course +they can be emulated by reassembling lists, but that is manual action +after the codec did his job. 
@item Differentiation of binary and human-readable strings (UTF-8 for example) is a must for a format that is intended to be looked and analysed by a human. @item -ISO-based (string) representation of data is a no: because it requires -complex parsing and takes much space. Naive UNIX timestamp -representation raises questions about its length and dealing with the -dates before 1970. Moreover they are not suitable for tasks requiring -monotonous clocks, because of UTC. +No ISO-based (string) representation of datetime: it requires complex +parsing and takes much space. Naive UNIX timestamp representation raises +questions about its length and dealing with the dates before 1970. +Moreover they are not suitable for tasks requiring monotonous clocks, +because of UTC. @item No tagging ability, context specifying, marking, hinting, extension -mechanism or anything like that. That brings huge complications to the -state and questions when you do not know how to deal with unknown -entities. Any unsupported data type must be a string, possibly enveloped -in a map with additional data. @code{@{"cp": "koi8-r", "str": BIN(...)@}}. +mechanism or anything like that. That brings complications to the state +and questions with unknown entities. Any unsupported data type must be a +string, possibly enveloped in a map with additional data. +@code{@{"cp": "koi8-r", "str": BIN(...)@}}. @item Large (>2GiB) strings support is a must. Nowadays even a single -multimedia file can easily exceed that size. General-purpose codec must -be able to send it without complication of inventing your own chunked -format. +multimedia file can easily exceed that size. General-purpose codec +should be able to send it without complication of inventing your own +chunked format. @item Is not embedded strings length, like in YAC and CBOR, is a more @@ -54,14 +51,14 @@ complicated code? Definitely. But there are so many short strings in a schemaless format for specifying map/structure keys. 
So many algorithm identifiers, that are also relatively short human-readable strings. So that is a compromise between slightly larger code and much shorter -resulting structures, that is worth of it. +resulting structures. @item -We want clear distinguishing of continuous strings and streamable ones -(BLOBs). ASN.1 CER does not distinguish them, making representation of -every string in memory far from being convenient and easy to work with. -Different tasks have different constraints: many of them do not need -streamable strings at all, some of them may use them solely. YAC gives -flexibility in choosing necessary data type for your needs. +There should be clear distinguishing of continuous strings and +streamable ones (BLOBs). ASN.1 CER does not do that, making +representation of every string in memory far from being convenient and +easy to work with. Different tasks have different constraints: many of +them do not need streamable strings at all, some of them may use it +solely. @end itemize diff --git a/spec/schema.texi b/spec/schema.texi index 78415e5..afcb2ad 100644 --- a/spec/schema.texi +++ b/spec/schema.texi @@ -30,9 +30,6 @@ identifiers. OIDs database can be considered as an external schema. Lacking it, or lacking its actual state, you probably won't be able even guessing the context of the data inside. -Sets can be emulated by using MAPs with NIL values. That gives only -1-byte overhead for each element, but reuses already existing code. - If you really desire more compact encoding, even agree to use schema definitions, then think about replacing MAPs with LISTs. Non-present values can be indicated by NIL tag. 
diff --git a/tyac/tyac.tcl b/tyac/tyac.tcl
index f4232fd..ee0ae7a 100644
--- a/tyac/tyac.tcl
+++ b/tyac/tyac.tcl
@@ -134,6 +134,12 @@ proc MAP {pairs} {
     EOC
 }
 
+proc SET {v} {
+    set args [list]
+    foreach k $v { lappend args $k NIL }
+    MAP $args
+}
+
 proc BLOB {chunkLen v} {
     char [expr 0x0B]
     toBE 8 [expr {$chunkLen - 1}]
@@ -223,6 +229,6 @@ proc RAW {t v} {
 
 namespace export EOC NIL FALSE TRUE UUID INT STR BIN RAW
 namespace export TAI64 UTCFromISO
-namespace export LIST MAP LenFirstSort BLOB
+namespace export LIST MAP SET LenFirstSort BLOB
 }
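
Illustration (not part of the commit): the SET emulation added by this patch can be checked against the new example row in spec/encoding/cont.texi with a small standalone sketch. It is written in Python rather than Tcl so that it does not depend on the tyac.tcl helpers that are not shown in the diff. The tag values are read off the example representations above (09 for MAP, 01 for NIL, 00 for EOC, C0 + length for a short STR). The function names and the 60-byte cap on the short string form are assumptions made for this sketch only; the long string form is not covered.

```python
# Tag values taken from the example representations in cont.texi above.
TAG_EOC = 0x00
TAG_NIL = 0x01
TAG_MAP = 0x09
STR_SHORT_BASE = 0xC0  # inferred from "C2 6468" (dh) and "C3 736967" (sig)
MAX_SHORT_LEN = 60     # assumption: longer strings use a form not shown here


def encode_short_str(raw: bytes) -> bytes:
    """Encode a short STR: one tag byte carrying the length, then the bytes."""
    if len(raw) > MAX_SHORT_LEN:
        raise ValueError("long strings are outside the scope of this sketch")
    return bytes([STR_SHORT_BASE | len(raw)]) + raw


def encode_set(elements) -> bytes:
    """Encode a SET as a MAP whose values are all NIL (hypothetical helper)."""
    # A set has no duplicates, and MAP keys must be unique.
    keys = {e.encode("utf-8") for e in elements}
    if b"" in keys:
        raise ValueError("MAP keys must be non-empty")
    out = bytearray([TAG_MAP])
    # Keys must be ordered length-first, then bytewise ascending.
    for key in sorted(keys, key=lambda k: (len(k), k)):
        out += encode_short_str(key)
        out.append(TAG_NIL)  # the 1-byte per-element overhead
    out.append(TAG_EOC)
    return bytes(out)


if __name__ == "__main__":
    # Matches the spec row: SET[sig, dh] -> 09 C26468 01 C3736967 01 00
    print(encode_set(["sig", "dh"]).hex())  # 09c2646801c37369670100
```

"dh" is emitted before "sig" because of the length-first ordering, even though the elements were given in the opposite order.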