@node Blobs
@cindex BLOB
+@cindex chunk
@section Blobs
Blob (binary large object) allows you to transfer binary data in chunks,
-in a streaming way, when data may not fit in memory at once.
+in a streaming way, when data may not fit in memory.
64-bit big-endian integer follows the BLOB tag, setting the following
-chunks payload size (+1). Then come zero or more NIL tags with
-fixed-length payload after each of them. Blob is terminated by
-@ref{Strings, BIN}, probably having zero length.
+chunks payload size (+1). Then come zero or more NIL tags, each followed
+by a fixed-length payload. Blob is terminated by @ref{Strings, BIN},
+possibly having zero length.
Data format definition must specify exact chunk size expected to be
used, if it needs deterministic encoding.
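The blob layout described above can be sketched in Python. This is an illustration under stated assumptions, not the reference encoder: the BLOB tag value 0x0B is taken from the Tcl encoder, the NIL tag value 0x01 from the SET example bytes, and @code{encode_bin} is a hypothetical caller-supplied function, because the BIN encoding is defined in the Strings section and not here.

```python
import struct

BLOB_TAG = 0x0B  # from the Tcl encoder: char [expr 0x0B]
NIL_TAG = 0x01   # NIL as it appears in the SET example bytes


def encode_blob(data: bytes, chunk_len: int, encode_bin) -> bytes:
    """Frame data as a blob with fixed-size chunks.

    encode_bin is a hypothetical callable that returns the BIN
    encoding of its argument; it is not defined by this sketch.
    """
    if chunk_len < 1:
        raise ValueError("chunk_len must be at least 1")
    out = bytearray([BLOB_TAG])
    # 64-bit big-endian integer holding the chunk size minus one.
    out += struct.pack(">Q", chunk_len - 1)
    full_chunks = len(data) // chunk_len
    for i in range(full_chunks):
        # Each full chunk is announced by a NIL tag.
        out.append(NIL_TAG)
        out += data[i * chunk_len:(i + 1) * chunk_len]
    # The remainder (possibly empty) terminates the blob as a BIN.
    out += encode_bin(data[full_chunks * chunk_len:])
    return bytes(out)
```

Only one chunk needs to be held at a time if the function is adapted to write to a stream instead of building a buffer.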
@node Containers
+@cindex containers
@section Containers
Containers do not have any explicit length, but are terminated by EOC
(end of contents) tag.
+@cindex LIST
+@cindex EOC
LIST contains a concatenation of items of arbitrary type.
@verbatim
LIST [ITEM0 || ITEM1 || ...] EOC
@end verbatim
+@cindex MAP
MAP contains concatenation of @ref{Strings, STR(key)}-value pairs. Keys
@strong{must} be non-empty, unique and length-first bytewise ascending ordered.
Hint: an encoder for a known format can be written so that it emits
values in an already properly sorted order.
+@cindex SET
+SET is emulated by using MAPs with NIL values. That gives only 1-byte
+overhead for each element, but reuses already existing code.
+
Example representations:
@multitable @columnfractions .5 .5
@item LIST[] @tab @code{08 00}
@item LIST[INT(123) FALSE] @tab @code{08 207B 02 00}
@item MAP[foo: LIST["bar"]] @tab @code{09 C3666F6F 08 C3626172 00 00}
+@item SET[sig, dh] @tab @code{09 C26468 01 C3736967 01 00}
@end multitable
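The example representations above can be reproduced with a short Python sketch. It is an illustration only: all function names are hypothetical, the tag values are read off the example bytes, and the string helper covers only the short form seen in the examples (tag 0xC0 combined with the length). The 31-byte bound on that helper is a conservative assumption, since the real threshold is defined elsewhere.

```python
EOC = b"\x00"
NIL = b"\x01"
FALSE = b"\x02"
LIST_TAG = b"\x08"
MAP_TAG = b"\x09"
MAX_SHORT_STR = 31  # assumed bound for this sketch, not taken from the spec


def short_str(raw: bytes) -> bytes:
    """Encode a short string as seen in the examples (C3 666F6F)."""
    if len(raw) > MAX_SHORT_STR:
        raise ValueError("long string form is outside this sketch")
    return bytes([0xC0 | len(raw)]) + raw


def encode_list(items) -> bytes:
    """items is an iterable of already encoded values."""
    return LIST_TAG + b"".join(items) + EOC


def encode_map(pairs: dict) -> bytes:
    """pairs maps str keys to already encoded values."""
    raw_pairs = {key.encode("utf-8"): value for key, value in pairs.items()}
    if b"" in raw_pairs:
        raise ValueError("map keys must be non-empty")
    out = bytearray(MAP_TAG)
    # Length-first, then bytewise ascending order of keys.
    for raw in sorted(raw_pairs, key=lambda k: (len(k), k)):
        out += short_str(raw) + raw_pairs[raw]
    out += EOC
    return bytes(out)


def encode_set(members) -> bytes:
    """A set is a map whose values are all NIL."""
    return encode_map({member: NIL for member in members})
```

Sorting on the pair (length, bytes) is what places "dh" before "sig" in the SET example, even though "sig" was listed first.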
@cindex encoding
@unnumbered Encoding
-YAC can store various primitive scalar types (strings, integers, ...),
+YAC can store various primitive scalar types (strings, integers, ...)
and container types (lists, maps, ...). Serialisation process is just
emitting the TLV-like encoding for each item recursively.
@code{*INT(len=1)..*INT(len=16)} types are for 8-, 16-, 24-, ...,
128-bit integer representations.
-Shortest possible form @strong{must} be used. Leading zero bytes are
-@strong{forbidden}. Short values (<32) @strong{must} be encoded in a
-short form.
+Shortest possible form @strong{must} be used; that means no leading zero bytes.
+Short values (<32) @strong{must} be encoded in a short form.
Negative integers store their absolute value the same way as positive
integers. After decoding, their value is subtracted from -1. Negative
Hint: the tag value of both positive and negative long integers keeps the
length in its last 4 bits. So there is no need to deal with every
-reserved value. You can check first 4 bits of the header to determine is
-it positive or negative integer, and then treat remaining 4 bits as a
-length (+1).
+tag's reserved value. You can check the first 4 bits of the header to
+determine whether it is a positive or negative integer, and then treat
+the remaining 4 bits as a length (+1).
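The hint can be sketched as a small Python decoder for long integers. Several details are assumptions made for illustration: the high nibble 0x2 for positive integers matches the 0x20 tag in the INT(123) example, the high nibble 0x3 for negative integers is assumed, and the payload is assumed to be big-endian. The function name is hypothetical and the short form (values below 32) is not handled.

```python
POSITIVE_NIBBLE = 0x2  # matches the 0x20 tag in the INT(123) example
NEGATIVE_NIBBLE = 0x3  # assumption: not shown in this section


def decode_long_int(buf: bytes) -> tuple:
    """Return (value, number of bytes consumed) for a long integer."""
    if not buf:
        raise ValueError("empty input")
    tag = buf[0]
    kind = tag >> 4              # first 4 bits: sign of the integer
    if kind not in (POSITIVE_NIBBLE, NEGATIVE_NIBBLE):
        raise ValueError("not a long integer tag")
    length = (tag & 0x0F) + 1    # remaining 4 bits: length minus one
    payload = buf[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated integer")
    if payload[0] == 0:
        # The shortest possible form forbids leading zero bytes.
        raise ValueError("leading zero byte")
    magnitude = int.from_bytes(payload, "big")  # assumed byte order
    if kind == NEGATIVE_NIBBLE:
        # The decoded value is subtracted from -1.
        return -1 - magnitude, 1 + length
    return magnitude, 1 + length
```

With these assumptions, the bytes 20 7B decode to 123, as in the LIST example.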
Example representations:
@item Its encoding must be deterministic -- there must be only a single
representation of the structured data, allowing its usage in
cryptography-related contexts.
-@item It should support enough data types to be able to replace JSON
+@item It should support enough data types for being able to replace JSON
transparently.
@end itemize
@multitable @columnfractions .30 .05 .05 .05 .05 .05
-@headitem name @tab
- Schemaless @tab
- Simple @tab
- Deterministic @tab
- Streamable @tab
- Compact
+@headitem @tab Schemaless @tab Simple @tab Deterministic @tab Streamable @tab Compact
@item ASN.1 @url{https://en.wikipedia.org/wiki/Distinguished_Encoding_Rules#DER_encoding, DER} @tab
N @tab @strong{N} @tab Y @tab N @tab N
@multitable @columnfractions .30 .05 .05 .05 .05 .05 .05
-@headitem name @tab
- Large strings @tab
- Human strings @tab
- Integers @tab
- Lists @tab
- Structures @tab
- Datetime
+@headitem @tab Large strings @tab Human strings @tab Integers @tab Lists @tab Structures @tab Datetime
@item ASN.1 DER @tab
Y @tab Y @tab Y @tab Y @tab Y @tab Y
@cindex git
You can obtain development source code with
@command{git clone git://git.cypherpunks.su/yac.git}
-(also you can use @url{https://git.cypherpunks.su/yac.git}).
+(also you can use @url{http://git.cypherpunks.su/yac.git},
+@url{https://git.cypherpunks.su/yac.git}).
Also there is @url{https://yggdrasil-network.github.io/, Yggdrasil}
accessible address: @url{http://y.www.yac.cypherpunks.su/}.
@itemize
@item
-We do not want ASCII decimal parsing. This is not trivial and not very
-fast to load an integer. Although it is human readable and
-understandable. Also it is not compact.
+No ASCII decimal parsing. That is not trivial code, not fast and not
+compact, although it is human readable and understandable.
@item
-We do not want varints (where most significant bit means continuation)
-and zig-zag-like encoding. This is not trivial code, prohibiting fast
-integer load.
+No varints (where most significant bit means continuation) and
+zig-zag-like encoding. That is not trivial code.
@item
-We do not want formats where maps and lists need to know their
-lengths/sizes in advance. That means no streaming possibility. That
-complicates encoder and requires more memory usage. Containers can be
-terminated with explicit signal tag.
+No formats where maps and lists need to know their lengths/sizes in
+advance. That means no streaming possibility. It also complicates the
+encoder and requires more memory.
@item
-We want formats with ability to store maps/dictionaries/tables. Of
-course they can be emulated by reassembling lists, but that is manual
-action after the codec did his job.
+No formats without the ability to store maps/dictionaries/tables. Of
+course they can be emulated by reassembling lists, but that is a manual
+action after the codec did its job.
@item
Differentiation of binary and human-readable strings (UTF-8 for example)
is a must for a format that is intended to be looked at and analysed by a human.
@item
-ISO-based (string) representation of data is a no: because it requires
-complex parsing and takes much space. Naive UNIX timestamp
-representation raises questions about its length and dealing with the
-dates before 1970. Moreover they are not suitable for tasks requiring
-monotonous clocks, because of UTC.
+No ISO-based (string) representation of datetime: it requires complex
+parsing and takes much space. Naive UNIX timestamp representation raises
+questions about its length and dealing with the dates before 1970.
+Moreover they are not suitable for tasks requiring monotonic clocks,
+because of UTC.
@item
No tagging ability, context specifying, marking, hinting, extension
-mechanism or anything like that. That brings huge complications to the
-state and questions when you do not know how to deal with unknown
-entities. Any unsupported data type must be a string, possibly enveloped
-in a map with additional data. @code{@{"cp": "koi8-r", "str": BIN(...)@}}.
+mechanism or anything like that. That brings complications to the state
+and questions with unknown entities. Any unsupported data type must be a
+string, possibly enveloped in a map with additional data.
+@code{@{"cp": "koi8-r", "str": BIN(...)@}}.
@item
Large (>2GiB) strings support is a must. Nowadays even a single
-multimedia file can easily exceed that size. General-purpose codec must
-be able to send it without complication of inventing your own chunked
-format.
+multimedia file can easily exceed that size. General-purpose codec
+should be able to send it without complication of inventing your own
+chunked format.
@item
Is not embedded strings length, like in YAC and CBOR, is a more
schemaless format for specifying map/structure keys. So many algorithm
identifiers, that are also relatively short human-readable strings. So
that is a compromise between slightly larger code and much shorter
-resulting structures, that is worth of it.
+resulting structures.
@item
-We want clear distinguishing of continuous strings and streamable ones
-(BLOBs). ASN.1 CER does not distinguish them, making representation of
-every string in memory far from being convenient and easy to work with.
-Different tasks have different constraints: many of them do not need
-streamable strings at all, some of them may use them solely. YAC gives
-flexibility in choosing necessary data type for your needs.
+There should be clear distinguishing of continuous strings and
+streamable ones (BLOBs). ASN.1 CER does not do that, making
+representation of every string in memory far from being convenient and
+easy to work with. Different tasks have different constraints: many of
+them do not need streamable strings at all, some of them may use them
+exclusively.
@end itemize
Lacking it, or lacking its actual state, you probably won't even be able
to guess the context of the data inside.
-Sets can be emulated by using MAPs with NIL values. That gives only
-1-byte overhead for each element, but reuses already existing code.
-
If you really want a more compact encoding, and are even willing to use
schema definitions, then consider replacing MAPs with LISTs. Absent
values can be indicated by the NIL tag.
EOC
}
+proc SET {v} {
+ set args [list]
+ foreach k $v { lappend args $k NIL }
+ MAP $args
+}
+
proc BLOB {chunkLen v} {
char [expr 0x0B]
toBE 8 [expr {$chunkLen - 1}]
namespace export EOC NIL FALSE TRUE UUID INT STR BIN RAW
namespace export TAI64 UTCFromISO
-namespace export LIST MAP LenFirstSort BLOB
+namespace export LIST MAP SET LenFirstSort BLOB
}