Confused about BER (Basic Encoding Rules) - encoding

I'm trying to study and understand BER (Basic Encoding Rules).
I've been using the website http://asn1-playground.oss.com/ to experiment with different ASN.1 objects and encoding them using BER.
However, even the simplest encodings seem to confuse me.
Let's take a simple ASN.1 schema:
World-Schema DEFINITIONS AUTOMATIC TAGS ::=
BEGIN
Human ::= SEQUENCE {
name UTF8String
}
END
So basically this is just a SEQUENCE with a single UTF8String type field called name.
An example of a value that matches this sequence would be something like:
{ "Bob" }
So, using http://asn1-playground.oss.com/, I produce the BER encoding of the following data:
some-guy Human ::=
{
name "Bob"
}
I would expect this to produce one sequence object, followed by a single string object.
What I get is:
30 05 80 03 42 6F 62
Now, I understand some of this encoding. The first octet, 30, is the identifier, which tells us that the first object is a SEQUENCE type. 30 is 00110000 in binary, which means we have a class of 0 (universal), a PC (primitive/constructed) bit of 1 (meaning constructed), and a tag number of 10000 (16 in decimal), which means SEQUENCE.
So far so good. The next value is the LENGTH in bytes of the SEQUENCE, which is 05.
Okay, still so far so good.
But then... I'm totally confused by the next octet, 80. What does that mean??? I would have expected a value of 00001100 (for tag number 12, meaning UTF8String).
The bytes following the 80 are pretty straightforward: the 03 means a length of 3, and the 42 6F 62 is just the UTF8String value itself, "Bob".

The 80 is a context-specific tag 0. Please note that "AUTOMATIC TAGS" is used at the beginning of the module. This indicates that all SEQUENCE, SET and CHOICE types will have context-specific tags for their components, starting with [0] and incrementing by 1 for each subsequent component. This way, you don't have to worry about tag conflicts when creating your messages, especially when dealing with components that are OPTIONAL or have a DEFAULT value. If you change "AUTOMATIC" to "EXPLICIT" (which I would not recommend) you will see the [UNIVERSAL 12] that you were expecting in the encoding.
Please note that AUTOMATIC TAGS applies only to tags on components of a SEQUENCE, SET, or CHOICE. It does not apply to the top-level type itself, which is why you saw the [UNIVERSAL 16] for the SEQUENCE rather than a context-specific tag there as well.

80 indicates context-specific class, primitive, tag number 0. This is there because you specified an AUTOMATIC TAGGING environment, which automatically assigned a [0] tag to field name in type Human.
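If it helps to see the structure mechanically, here is a minimal sketch in Python (no ASN.1 library, short-form lengths only; the describe helper is just for illustration) that walks the TLV structure of the hex dump above:

# BER bytes from the question
data = bytes.fromhex("30 05 80 03 42 6F 62")

def describe(tag):
    cls = ["universal", "application", "context-specific", "private"][tag >> 6]
    constructed = bool(tag & 0x20)
    number = tag & 0x1F              # fine here; tag numbers >= 31 use a multi-byte form
    return cls, constructed, number

tag, length = data[0], data[1]
print(describe(tag), "length", length)    # ('universal', True, 16) -> constructed SEQUENCE, length 5

inner = data[2:2 + length]
itag, ilen = inner[0], inner[1]
print(describe(itag), "length", ilen)     # ('context-specific', False, 0) -> the [0] from AUTOMATIC TAGS
print(inner[2:2 + ilen].decode("utf-8"))  # 'Bob'; without automatic tagging the inner tag byte would be 0C ([UNIVERSAL 12], UTF8String)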

Are control sequences the same number in every encoding?

I am writing a parser, and the original spec states:
The file header ends with the control sequence Ctrl-Z
They do not specify which encoding the header is written in (it could be Latin-1, UTF-8, Windows-1252, ...), so I wonder whether the sequence is the same number in every encoding.
It appears that it always corresponds to decimal 26, or hex 1A.
It would be good to know, in a more general way, whether this holds for all control sequences.
Most likely, ASCII is assumed. In ASCII, dec 26 / hex 1A, which you say "Ctrl-Z" corresponds to, is the codepoint of the SUB (substitute) control character.
Other alternatives among the extended character sets/encodings don't come into play here, because dec 26 in ASCII is within the first/lower 7 bits of the byte (dec 0-127 of 256 total). The 8th bit was later used to toggle all the previous codepoints/patterns once more and gain the other half, the remaining 128 codepoints from dec 128-255. The idea is that the extended character sets usually retain the lower ASCII codepoints/mappings (also for backward compatibility) but introduce their own special characters in the higher codepoint range 128-255. There are many different encodings of this type, each trying to support more of the world's writing scripts with its own custom extended code set: Windows-1252, a Western European mix; ISO-8859-1 for Western European languages; ISO-8859-15, which is nearly identical but adds the Euro currency symbol; code page 437 for IBM DOS, which displays box-drawing and other graphical symbols, including at codepoints that are control sequences in ASCII, for drawing a TUI on the console; and so on. The problem, obviously, is that there are a lot of these:
each only gains 128 more characters
you can't combine/load/apply any two of them at the same time (if characters from multiple different code sets are needed)
each application has to know (or be told) beforehand which code set a file was saved in, in order to map its byte patterns to the correct character renderings/symbols on the screen. If a user or a tool saves the file with the wrong code set, not recognizing that it was previously created/saved with a different code set, the file becomes "corrupt": some bytes were stored under the assumption they would be rendered with code set A and some under the assumption they're for code set B, and both can't be applied at once. These flat, dumb plain-text files on old, memory-short DOS file systems had no mechanism to record which part of a file is for which code set, so the characters can never all be rendered correctly, and it can be difficult or impossible to retroactively guess and repair what the intended interpretation/rendering of a byte's binary pattern was
no hope of getting anywhere with only 128 more characters added on top of ASCII when it comes to Chinese etc.
So the improvement was to not use the 8th bit for these stupid code pages, but instead use it as a marker: if it is set, it indicates that another byte follows (UTF-8), hence expanding the number of codepoints greatly. This can even be repeated in the next, subsequent byte. But it's optional: if the character is within the 7-bit ASCII codepoints, UTF-8 does not need to set the 8th bit and add another byte.
This also means the extended code pages and UTF-8 cannot be mixed (used/applied at the same time). For most code pages, and for UTF-8 as well, the character-to-codepoint mappings (the codepoint being the bit pattern) are identical to ASCII in the 7-bit range (UTF-16 shares the codepoints, but encodes even ASCII characters in two bytes). If your characters are within the first/lower 7 bits of the byte, it does not matter what the encoding theoretically would be, as the 8th bit is not used by any of the code pages or by UTF-8. It matters a great deal only for characters that do have the 8th bit set; and if there are bytes like that, the choice of encoding usually applies to the entire file, even though some bytes may stay within single-byte ASCII, so you should take great care when inserting or interpreting binary patterns that have the 8th bit set in a byte.
The easy rule is: if all bytes (or the byte in question) don't have the 8th bit set, it only matters whether the lower 7 bits are ASCII or not. EBCDIC, for example, is a non-ASCII alternative where dec 26 / hex 1A is UBS (unit backspace); it also has a SUB (substitute), but at codepoint (binary pattern) dec 63 / hex 3F. Other encodings may not have ASCII's SUB at all, or may have something similar but with a slightly different meaning/use (or maybe ASCII got its SUB from EBCDIC, etc.). But there's no need to wonder/worry about UTF-8: it changes nothing if ASCII can be assumed, because the characters encoded in ASCII are encoded identically in UTF-8, as a single byte with the highest bit not set.
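If you want to see this concretely, here is a small Python check (assuming Python's built-in codecs; cp500 is one of its EBCDIC variants, which I'm assuming follows the usual IBM mapping) showing that the ASCII-compatible encodings all put SUB at byte 1A, while EBCDIC does not:

# SUB is codepoint U+001A; see which byte each codec encodes it to.
sub = "\x1a"
for codec in ("ascii", "latin-1", "cp1252", "utf-8"):
    print(codec, sub.encode(codec).hex())           # all print '1a'
print("cp500 (EBCDIC)", sub.encode("cp500").hex())  # '3f' here -- a different byte entirely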
Maybe it can be determined from the spec whether all the characters mentioned are within the ASCII range and follow the ASCII codepoint definitions, or whether there are characters that can only be found in UTF-8 (or UTF-16 or UTF-32) or in one of the old extended code pages (but not in others), or whether there's any indication that the encoding might not be ASCII/ASCII-based.
It's obviously problematic if a spec for a format, protocol or data representation doesn't explicitly state the encoding it's implicitly assuming. On the other hand, maybe the "Ctrl-Z" is misleading, because the byte dec 26 / hex 1A is always the same, no matter what encoding it would be read in as text/characters. Maybe the spec just uses this byte pattern as a marker with no meaning in terms of character display whatsoever, and is introducing only its own particular local meaning as defined within the spec.

0D type and n?0D randoms

In "A brief introduction to q and kdb+" there are several places where time values are created with code like 0D00:01.
And even a random time generation technique using syntax like:
n?0D0
fcn?0D00:00:20
I found 0D mentioned only in q4m3 2.5.2 Time Types as optional.
Are there any references to this syntax on code.kx? And do any other useful date/time random generators exist? I checked the capital letters, and 0D seems to be the only one, see: q)#[value;;::] each ("0",/:.Q.A)
Let me first note that the 0D... syntax is not specific to the rand operator. The 0D prefix is needed when the type kdb would infer for a literal without it is different from what you intended. For example:
q)type 08:09:10.123 / time
-19h
q)type 0D08:09:10.123 / timespan
-16h
The prefix is optional when the type can be inferred unambiguously; in the case of timespan literals it's sufficient to supply more than 4 digits after the dot when using the hh:mm:ss.nnnnnnnnn notation:
q)type 08:09:10.123 / time
-19h
q)type 08:09:10.1234 / still time
-19h
q)type 08:09:10.12345 / timespan
-16h
The 0D notation is very handy when you need a timespan value but don't want to specify all the details down to nanoseconds. I think you will agree that 0D00:01 (1 minute) is easier to type and read than 00:01:00.000000000.
Going back to your question, 0D0 is just a zero-valued timespan, the same as 00:00:00.000000000. However, ? treats it as if 1D0 (or 0D24:00:00.000000000) was passed. I didn't see this documented anywhere on code.kx.com, but if you think about it you'll agree that generating a timespan in the range [0; 24h) is such a common case that it definitely deserves a shortcut. And there you have it!

Implementing MD5: Inconsistent endianness?

So I tried implementing the MD5 algorithm according to RFC 1321 in C#, and it works, but there is one thing about the way the padding is performed that I don't understand. Here's an example:
If I want to hash the string "1" (without the quotation marks), this results in the following bit representation: 00110001
The next step is appending a single "1" bit, i.e. the byte 10000000, which is followed by "0" bits, followed by a 64-bit representation of the length of the original message in bits (low-order word first).
Since the length of the original message is 8 (bits), I expected 00000000000000000000000000001000 00000000000000000000000000000000 to be appended (low-order word first). However, this does not result in the correct hash value; appending 00001000000000000000000000000000 00000000000000000000000000000000 does.
This looks as if suddenly the little-endian format is being used, but that does not really seem to make any sense at all, so I guess there must be something else that I am missing?
Yes, for MD5 you have to add the message length in little-endian.
So, the message representation for "1" is 49 -> 00110001, followed by the single 1 bit and zeroes, and after that you add the message length in reversed byte order (least significant byte first).
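For what it's worth, here is a small sketch in Python (not your C# code, just the same padding rule spelled out) that builds the padded block for "1" and prints the reference digest from hashlib; note the length field bytes 08 00 00 00 00 00 00 00 at the end, least significant byte first:

import hashlib

msg = b"1"                                      # 0x31, i.e. 8 bits of message
bit_len = 8 * len(msg)

padded = msg + b"\x80"                          # the single appended "1" bit, then 7 zero bits
padded += b"\x00" * ((56 - len(padded)) % 64)   # zero padding up to 56 bytes mod 64
padded += bit_len.to_bytes(8, "little")         # 64-bit length in bits, little-endian

print(padded.hex(" "))                          # ends with ... 00 08 00 00 00 00 00 00 00
print(len(padded))                              # 64, one complete block
print(hashlib.md5(msg).hexdigest())             # reference digest to compare your implementation against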
You could also check permutations step by step on this site: https://cse.unl.edu/~ssamal/crypto/genhash.php.
Or here: https://github.com/MrBlackk/md5_sha256-512_debugger

Why were PNG chunks named like that?

I've been studying the PNG structure to develop something with it. And I found something interesting.
The names of the critical PNG chunks (IHDR, PLTE, IDAT, IEND) are all uppercase. And there is at least one lowercase character in the names of the ancillary PNG chunks (bKGD, cHRM, gAMA, hIST, iCCP, iTXt, pHYs, sBIT, sPLT, sRGB, sTER, tEXt, tIME, tRNS, zTXt, etc.).
I'm so curious. Was there a naming rule when they standardized them?
According to Jongware, the answer is this:
https://www.w3.org/TR/PNG/#5Chunk-naming-conventions
5.4 Chunk naming conventions
Four bits of the chunk type, the property bits, namely bit 5 (value 32) of each byte, are used to convey chunk properties. This choice means that a human can read off the assigned properties according to whether the letter corresponding to each byte of the chunk type is uppercase (bit 5 is 0) or lowercase (bit 5 is 1). However, decoders should test the properties of an unknown chunk type by numerically testing the specified bits; testing whether a character is uppercase or lowercase is inefficient, and even incorrect if a locale-specific case definition is used.
The property bits are an inherent part of the chunk type, and hence are fixed for any chunk type. Thus, CHNK and cHNk would be unrelated chunk types, not the same chunk with different properties.
The semantics of the property bits are defined in Table 5.2.
Table 5.2 — Semantics of property bits
Ancillary bit: first byte
0 (uppercase) = critical, 1 (lowercase) = ancillary.
Critical chunks are necessary for successful display of the contents of the datastream, for example the image header chunk (IHDR). A decoder trying to extract the image, upon encountering an unknown chunk type in which the ancillary bit is 0, shall indicate to the user that the image contains information it cannot safely interpret.
Ancillary chunks are not strictly necessary in order to meaningfully display the contents of the datastream, for example the time chunk (tIME). A decoder encountering an unknown chunk type in which the ancillary bit is 1 can safely ignore the chunk and proceed to display the image.
Private bit: second byte
0 (uppercase) = public, 1 (lowercase) = private.
A public chunk is one that is defined in this International Standard or is registered in the list of PNG special-purpose public chunk types maintained by the Registration Authority (see 4.9 Extension and registration). Applications can also define private (unregistered) chunk types for their own purposes. The names of private chunks have a lowercase second letter, while public chunks will always be assigned names with uppercase second letters. Decoders do not need to test the private-chunk property bit, since it has no functional significance; it is simply an administrative convenience to ensure that public and private chunk names will not conflict. See clause 14: Editors and extensions and 12.10.2: Use of private chunks.
Reserved bit: third byte
0 (uppercase) in this version of PNG. If the reserved bit is 1, the datastream does not conform to this version of PNG.
The significance of the case of the third letter of the chunk name is reserved for possible future extension. In this International Standard, all chunk names shall have uppercase third letters.
Safe-to-copy bit: fourth byte
0 (uppercase) = unsafe to copy, 1 (lowercase) = safe to copy.
This property bit is not of interest to pure decoders, but it is needed by PNG editors. This bit defines the proper handling of unrecognized chunks in a datastream that is being modified. Rules for PNG editors are discussed further in 14.2: Behaviour of PNG editors.
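In other words, a decoder only needs to test bit 5 (value 32) of each of the four bytes numerically. Here is a small Python sketch of that test (the function name is mine, not from the spec):

def chunk_properties(chunk_type):
    # Read the four property bits of a PNG chunk type: bit 5 (value 32) of each byte.
    assert len(chunk_type) == 4
    return {
        "ancillary":    bool(chunk_type[0] & 32),  # 0 = critical, 1 = ancillary
        "private":      bool(chunk_type[1] & 32),  # 0 = public,   1 = private
        "reserved":     bool(chunk_type[2] & 32),  # must be 0 in this version of PNG
        "safe_to_copy": bool(chunk_type[3] & 32),  # 0 = unsafe,   1 = safe to copy
    }

print(chunk_properties(b"IHDR"))  # all False: a critical, public chunk
print(chunk_properties(b"tIME"))  # ancillary True, the other bits clear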

Encoding that minimizes misreading / mistyping / misspeaking?

Let's say you have a system in which a fairly long key value can be accurately communicated to a user on-screen, via email or via paper; but the user needs to be able to communicate the key back to you accurately by reading it over the phone, or by reading it and typing it back into some other interface.
What is a "good" way to encode the key to make reading / hearing / typing it easy & accurate?
This could be an invoice number, a document ID, a transaction ID or some other abstract value. Let's say for the sake of this discussion the underlying key value is a big number, say 40 digits in base 10.
Some thoughts:
Shorter keys are generally better
a 40-digit base 10 value may not fit in the space given, and is easy to get lost in the middle of
the same value could be represented in base 16 in 33-34 digits
the same value could be represented in base 36 in 26 digits
the same value could be represented in base 64 in 22-23 digits
Characters that can't be visually confused with each other are better
e.g. an encoding that includes both O (oh) and 0 (zero), or S (ess) and 5 (five), could be bad
This issue depends on the font / face used to display the key, which you may be able to control in some cases (like printing on paper) but can't control in others (like web pages and email).
Also depends on whether you can control the exclusive use of upper and / or lower case -- e.g. capital D (dee) may look like O (oh) but lower case d (dee) would not, whereas lower case l (ell) looks like a 1 (one) while capital L (ell) would not. (With exceptions for especially exotic fonts / faces).
Characters that can't be verbally / aurally confused with each other are better
a (ay) 8 (eight)
B (bee) C (cee) D (dee) E (ee) g (gee) p (pee) t (tee) v (vee) z (zee) 3 (three)
This issue depends on the audio quality of the end-to-end channel -- bigger challenge if the expected user base could have a speech impediment, or may have to speak through a gas mask, or the communication channel could include CB radios or choppy VOIP phone systems.
Adding a check digit or two would detect errors but not help resolve errors.
An alpha - bravo - charlie - delta type dialog can help with hearing errors, but not reading errors.
Possible choices of encoding:
Base 64 -- compact, but too many hard-to-verbalize characters (underscore, dash etc.)
Base 34 -- 0-9 and A-Z but with O (oh) and I (aye) left out as the easiest to confuse with digits
Base 32 -- same as base 34 but leave out the 0 (zero) and 1 (one) as well
Is there a generally recognized encoding that is a reasonable solution for this scenario?
When I first heard of it, I liked the article A Proposal for Proquints: Identifiers that are Readable, Spellable, and Pronounceable. It encodes data as a sequence of alternating consonants and vowels. It's tied to the English language, though. (Because in German, f and v sound the same, so they should not both be used.) But I like the general idea.
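To make that concrete, here is a rough Python sketch of the proquint scheme as described in that proposal (16 consonants and 4 vowels, each 16-bit word becoming a consonant-vowel-consonant-vowel-consonant group); treat it as an illustration rather than a reference implementation:

# Proquint-style encoding: every 16-bit word -> 5 pronounceable letters.
CONS = "bdfghjklmnprstvz"   # 16 consonants, 4 bits each
VOWS = "aiou"               # 4 vowels, 2 bits each

def proquint_word(w):
    # Encode one 16-bit value as con-vo-con-vo-con.
    return (CONS[(w >> 12) & 0xF] + VOWS[(w >> 10) & 0x3] +
            CONS[(w >> 6) & 0xF]  + VOWS[(w >> 4) & 0x3]  +
            CONS[w & 0xF])

def proquint(data):
    # Encode a byte string (zero-padded to an even length) as dash-separated five-letter groups.
    if len(data) % 2:
        data += b"\x00"
    words = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return "-".join(proquint_word(w) for w in words)

print(proquint(bytes([127, 0, 0, 1])))   # the proposal's example for 127.0.0.1: lusab-babad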