Interpreting ASN.1 indefinite-length encoding with multiple encapsulated octet-strings - encoding

I have a BER structure like this...
$ openssl asn1parse -inform der -in test.der -i -dump
????:d=4 hl=2 l=inf cons: cont [ 0 ]
????:d=5 hl=3 l= 240 prim: OCTET STRING
0000 - AABBCCDD
????:d=5 hl=2 l= 8 prim: OCTET STRING
0000 - EEFF
????:d=5 hl=2 l= 0 prim: EOC
...or in der2ascii style...
[0] `80`
OCTET_STRING { `AABBCCDD` }
OCTET_STRING { `EEFF` }
`0000`
What I know: indefinite-length encoding must use a constructed type, because a primitive type could introduce ambiguities, e.g. when its contents contain 0x0000. What I want to know: How must a decoder behave when parsing this BER structure? Are the header bytes of both OCTET STRINGs included in the encoding? If yes, how is indefinite-length byte data encoded? How does an application interpret the value of the TLV field tagged [0] when the second OCTET STRING is, e.g., an INTEGER?
I am asking because in the CMS standard a field is defined as a single OCTET STRING, yet in most BER encodings I see two of them. Is this only due to the indefinite-length encoding? Am I missing something?
From ITU-T X.690:
8.1.4 Contents octets
The contents octets shall consist of zero, one or more octets, and shall encode the data value as specified in
subsequent clauses.
NOTE – The contents octets depend on the type of the data value;
subsequent clauses follow the same sequence as the definition of types
in ASN.1.
Does this mean that I can put in any constructed type, and the application must only interpret the value part of the constructed TLV structure?

When you encode a primitive OCTET STRING in indefinite length mode, the encoder must:
split the value into chunks of smaller OCTET STRINGs;
encode each chunk in definite length mode, so that each has its own TLV (with length!);
frame the whole sequence of definite length encoded primitive OCTET STRINGs with a single, indefinite length encoded constructed OCTET STRING "container" having its own TLV (without a definite length, but with an end-of-contents sentinel).
At the other end, the decoder extracts the V part from the inner, definite length OCTET STRING chunks (dropping their TL headers), then joins/consumes the V parts in order of arrival, dropping the TL of the outer frame as well.
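For illustration, here is a minimal decoder-side sketch in Python. It is not a full BER parser: it assumes single-byte tags and short-form (one-byte) chunk lengths, which is enough for the structure discussed here.

def join_chunks(contents):
    # 'contents' are the contents octets of the constructed, indefinite-length
    # container (everything after its tag and the 0x80 length octet),
    # including the trailing 00 00 end-of-contents sentinel.
    value, pos = b"", 0
    while True:
        tag, length = contents[pos], contents[pos + 1]
        if tag == 0x00 and length == 0x00:             # end-of-contents sentinel
            return value
        assert tag == 0x04                             # each chunk must be a primitive OCTET STRING
        value += contents[pos + 2 : pos + 2 + length]  # keep V, drop T and L
        pos += 2 + length

# Constructed, indefinite-length OCTET STRING (tag 0x24 = 0x04 with the constructed bit set),
# split into two primitive chunks:
ber = bytes.fromhex("24 80 04 04 AABBCCDD 04 02 EEFF 00 00")
assert join_chunks(ber[2:]) == bytes.fromhex("AABBCCDDEEFF")
# DER would encode the same value as one primitive TLV: 04 06 AABBCCDDEEFF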
Note that the idea behind the indefinite length encoding technique is that both encoder and decoder can emit/consume incomplete, possibly oversized, data.
The chunk size is chosen by the encoder/application based on data availability, memory constraints and possibly an estimate of the decoder's buffering capabilities. I think this is mentioned somewhere in the X.280/X.680 papers.
The encoder is not allowed to put chunks of different ASN.1 types into a single indefinite length encoded container. In other words, all chunks must be of the same type as the outer container.
That should hopefully explain why you may see multiple (depending on chunk size) OCTET STRINGs in the indefinite length encoded BER/CER stream where just a single OCTET STRING is expected.
DER forbids indefinite length encoding on the grounds that the serialized representation of the same data may change on re-encoding (due to a potentially different chunk size).


Why the trailing 0x00 byte after BSON string (not Cstring/ename)?

Obviously, for a BSON cstring the trailing byte is used to determine the length of the string, so it is: (byte*) "\x00". Cstrings are used for regex patterns, regex options and ename, which are not long and not used in iterations, so a length prefix is not necessary. But then comes...
A BSON string is written as: int32 (byte*) "\x00"
with the specification as follows: The int32 is the number of bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.
But why the trailing zero byte? If we have the length of the UTF-8 encoded string, that is sufficient for any byte-oriented workflow, and the 0x00 just adds an unneeded byte. Am I missing something?
The reasoning for both the length of the string and the null terminator is twofold: compatibility with existing C-style strings, and performance.
For performance, MongoDB needs to be able to quickly get to a specific field in a document without iterating through the whole BSON. This matters especially if you're looking for a field that is close to the end of a large (say 16 MB) document. With the length of the string encoded at the start of the string value, the reader can just skip that number of bytes and get to the next field. Otherwise, it would need to scan the whole string until it finds its end.
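As a rough illustration, here is a sketch in Python using the struct module (the helper names are made up and the example value is arbitrary): the int32 prefix is what lets a reader jump over a string value in a single step.

import struct

def encode_bson_string(text):
    # int32 length (UTF-8 byte count + 1 for the trailing 0x00), the bytes, then 0x00
    raw = text.encode("utf-8")
    return struct.pack("<i", len(raw) + 1) + raw + b"\x00"

def skip_string_value(buf, offset):
    # 'offset' points at the int32 prefix of a string value; return the offset
    # just past the value without scanning the string itself.
    (length,) = struct.unpack_from("<i", buf, offset)
    return offset + 4 + length

value = encode_bson_string("awesome")
assert value == b"\x08\x00\x00\x00awesome\x00"
assert skip_string_value(value, 0) == len(value)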
For compatibility, MongoDB is written in C++, where strings are null terminated. It could cut off that null terminator to save one byte, since the length is encoded anyway, but getting the string out of BSON into a format usable by C++ would then require tacking the null back on. That would need a specialized string handling routine whose only advantage is saving a single byte.
Overall, it was decided that "wasting" a single byte is an acceptable tradeoff.

How can I work with raw bytes in perl

All the documentation directs me to Unicode support, yet I don't think my request has anything to do with Unicode. I want to work with raw bytes within a single scalar; I need to be able to figure out its length (in bytes), take substrings of it (in bytes), and write the bytes to disk and over the network. Is there an easy way to do this without Perl treating the bytes as being in any sort of encoding?
EDIT
More explicitly,
my $data = "Perl String, unsure of encoding and don't need to know";
my @data_chunked_into_1024_bytes_each = ???
Perl strings are, conceptually, strings of characters, which are positive 32-bit integers that (normally) represent Unicode code points. A byte string, in Perl, is just a string in which all the characters have values less than 256.
(That's the conceptual view. The internal representation is somewhat more complicated, as the perl interpreter tries to store byte strings — in the above sense — as actual byte strings, while using a generalized UTF-8 encoding for strings that contain character values of 256 or higher. But this is all supposed to be transparent to the user, and in fact mostly is, except for some ugly historical corner cases like the bitwise not (~) operator.)
As for how to turn a general string into a byte string, that really depends on what the string you have contains and what the byte string is supposed to contain:
If your string already is a string of bytes — e.g. if you read it from a file in binary mode — then you don't need to do anything. The string shouldn't contain any characters above 255 to begin with, and if it does, that's an error and will probably be reported as such by the encryption code.
Similarly, if your string is supposed to encode text in the ASCII or ISO-8859-1 encodings (which encode the 7- and 8-bit subsets of Unicode respectively), then you don't need to do anything: any characters up to 255 are already correctly encoded, and any higher values are invalid for those encodings.
If your input string contains (Unicode) text that you want to encode in some other encoding, then you'll need to convert the string to that encoding. The usual way to do that is by using the Encode module, like this:
use Encode;
my $byte_string = encode( "name of encoding", $text_string );
Obviously, you can convert the byte string back to the corresponding character string with:
use Encode;
my $text_string = decode( "name of encoding", $byte_string );
For the special case of the UTF-8 encoding, it's also possible to use the built-in utf8::encode() function instead of Encode::encode():
utf8::encode( $string );
which does essentially the same thing as:
use Encode;
$string = encode( "utf8", $string );
Note that, unlike Encode::encode(), the utf8::encode() function modifies the input string directly. Also note that the "utf8" above refers to Perl's extended UTF-8 encoding, which allows values outside the official Unicode range; for strictly standards-compliant UTF-8 encoding, use "utf-8" with a hyphen (see Encode documentation for the gory details). And, yes, there's also a utf8::decode() function that does pretty much what you'd expect.
If I understood your question correctly, what you want is the pack/unpack functions: http://perldoc.perl.org/functions/pack.html
As long as your string doesn't contain characters above codepoint 255, it will mostly work as a plain byte string, with length and substr operating on bytes. Additionally, most output functions like print expect octets/bytes by default and will actually complain if you try to stuff anything else into them.
You may need to explicitly encode/decode your output if it is known to be in some encoding, but more details can only be added if you ask another specific question for each problematic part of your program.

hexadecimal encoding/decoding functions in mysql

Are there any hexadecimal encoding/decoding functions that MySQL will recognize, to avoid writing a whole algorithm out? base16_encode/decode does not work. Thanks
MySQL understands hexadecimal literals:
MySQL supports hexadecimal values, written using X'val', x'val', or 0xval format, where val contains hexadecimal digits (0..9, A..F). Lettercase of the digits does not matter. For values written using X'val' or x'val' format, val must contain an even number of digits. For values written using 0xval syntax, values that contain an odd number of digits are treated as having an extra leading 0. For example, 0x0a and 0xaaa are interpreted as 0x0a and 0x0aaa.
http://dev.mysql.com/doc/refman/5.0/en/hexadecimal-literals.html

How can I convert the tiger hash values from the official implementations into the form used by Direct Connect?

I am trying to implement a Direct Connect Client, and I am currently stuck at a point where I need to hash the files in order to be able to upload them to other clients.
As all the other clients require TTHL (Tiger Tree Hashing Leaves) support for verification of the downloaded data, I have searched for implementations of the algorithm and found tiger-hash-python.
I have implemented a routine that uses the hash function from before, and is able to hash large files, according to the logic specified in Tree Hash EXchange format (THEX) (basically, the tree diagram is the important part on that page).
However, the value it produces is similar to those shown on Wikipedia (a hex digest), but different from those shown in the DC clients I'm using for reference.
I have been unable to find out how the hex digest form is converted to this other one (39 characters, A-Z, 0-9). Could someone please explain how that is done?
Well ... I tried what Paulo Ebermann said, using the following functions:
import base64
import math

import tiger  # the aforementioned tiger-hash-python module

def strdivide(list, length):
    result = []
    # Calculate how many blocks there are, using the condition: i*length = len(list).
    # The additional maths operations are to deal with the last block, which might have a smaller size.
    for i in range(0, int(math.ceil(float(len(list)) / length))):
        result.append(list[i*length:(i+1)*length])
    return result

def dchash(data):
    result = tiger.hash(data)  # 48-char hex digest from tiger-hash-python
    result = "".join(["".join(strdivide(result[i:i+16], 2)[::-1]) for i in range(0, 48, 16)])  # representation transform: byte-swap each 64-bit word
    bits = "".join([chr(int(c, 16)) for c in strdivide(result, 2)])  # convert every 2 hex characters into 1 raw byte
    result = base64.b32encode(bits)  # result will be 40 characters
    return result[:-1]  # drop the trailing '='
The TTH for an empty file was found to be 8B630E030AD09E5D0E90FB246A3A75DBB6256C3EE7B8635A, which after the transformation specified here, becomes 5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6. Base-32 encoding this result yielded LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ, which was found to be what DC++ generates.
The only mention of the format of the hash in the Direct Connect protocol I found is on the $SR page on the NMDC Protocol wiki:
For files containing TTH, the <hub_name> parameter is replaced with TTH:<base32_encoded_tth_hash> (ref: TTH_Hash).
So, it is Base32-encoding. This is defined in RFC 4648 (and some earlier ones), section 6.
Basically, you use the capital letters A-Z and the decimal digits 2 to 7; one base32 digit represents 5 bits, while one base16 (hexadecimal) digit represents only 4.
This means that every 5 hex digits map to 4 base32 digits, and for a Tiger hash (192 bits) you will need 40 base32 digits (in the official encoding, the last one would be an = padding character, which seems to be omitted if you say that there are always 39 characters).
I'm not sure of an implementation of a conversion from hex (or bytes) to base32, but it shouldn't be too complicated with a lookup table and some bit-shifting.
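For what it's worth, here is a minimal sketch of such a conversion, assuming Python's standard base64 module (which avoids the manual lookup table and bit-shifting):

import base64

def hex_to_base32(hex_digest):
    # Turn the hex digest into raw bytes, then Base32-encode them (RFC 4648, section 6).
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii").rstrip("=")  # DC clients drop the '=' padding

# The byte-swapped Tiger digest of an empty file, from the question above:
print(hex_to_base32("5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6"))
# -> LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ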

Why does a base64 encoded string have an = sign at the end

I know what base64 encoding is and how to calculate base64 encoding in C#. However, I have seen several times that when I convert a string into base64, there is an = at the end.
A few questions came up:
Does a base64 string always end with =?
Why does an = get appended at the end?
Q: Does a base64 string always end with =?
A: No. (The word usb is base64-encoded as dXNi.)
Q: Why does an = get appended at the end?
A: As a short answer:
The last character (the = sign) is added only as padding in the final step of encoding a message whose length in bytes is not a multiple of three.
You will not have an = sign if your string is a multiple of 3 characters long, because Base64 encoding takes each group of three bytes (a character = 1 byte here) and represents it as four printable characters in the ASCII standard.
Example:
(a) If you want to encode
ABCDEFG <=> [ABC] [DEF] [G]
Base64 deals with the first block (producing 4 characters) and the second (as they are complete). But for the third, it will add a double == in the output in order to complete the 4 needed characters. Thus, the result will be QUJD REVG Rw== (without spaces).
[ABC] => QUJD
[DEF] => REVG
[G] => Rw==
(b) If you want to encode ABCDEFGH <=> [ABC] [DEF] [GH]
similarly, it will add one = at the end of the output to get 4 characters.
The result will be QUJD REVG R0g= (without spaces).
[ABC] => QUJD
[DEF] => REVG
[GH] => R0g=
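The two results above are easy to check, for example with Python's base64 module:

import base64

print(base64.b64encode(b"ABCDEFG"))   # b'QUJDREVGRw=='  (1 leftover byte  -> two '=')
print(base64.b64encode(b"ABCDEFGH"))  # b'QUJDREVGR0g='  (2 leftover bytes -> one '=')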
It serves as padding.
A more complete answer is that a base64 encoded string doesn't always end with =; it will only end with one or two = signs if they are required to pad the string out to the proper length.
From Wikipedia:
The final '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
Thus, this is some sort of padding.
It's defined in RFC 2045 as a special padding character used when fewer than 24 bits are available at the end of the data being encoded.
No.
To pad the Base64-encoded string to a multiple of 4 characters in length, so that it can be decoded correctly.
The equals sign (=) is used as padding in certain forms of base64 encoding. The Wikipedia article on base64 has all the details.
It's padding. From http://en.wikipedia.org/wiki/Base64:
In theory, the padding character is not needed for decoding, since the
number of missing bytes can be calculated from the number of Base64
digits. In some implementations, the padding character is mandatory,
while for others it is not used. One case in which padding characters
are required is concatenating multiple Base64 encoded files.
http://www.hcidata.info/base64.htm
Encoding "Mary had" to Base 64
In this example we are using a simple text string ("Mary had") but the principle holds no matter what the data is (e.g. graphics file). To convert each 24 bits of input data to 32 bits of output, Base 64 encoding splits the 24 bits into 4 chunks of 6 bits. The first problem we notice is that "Mary had" is not a multiple of 3 bytes - it is 8 bytes long. Because of this, the last group of bits is only 4 bits long. To remedy this we add two extra bits of '0' and remember this fact by putting a '=' at the end. If the text string to be converted to Base 64 was 7 bytes long, the last group would have had 2 bits. In this case we would have added four extra bits of '0' and remember this fact by putting '==' at the end.
= is a padding character. If the input stream has a length that is not a multiple of 3, the padding character will be added. This is required by the decoder: if no padding is present, the last byte would have an incorrect number of zero bits.
Better and deeper explanation here: https://base64tool.com/detect-whether-provided-string-is-base64-or-not/
The equals or double equals serves as padding. It's a stupid concept defined in RFC 2045, and it is actually superfluous. Any decent parser can encode and decode a base64 string without knowing about padding, by just counting the characters and filling in the rest if the size isn't divisible by 3 or 4 respectively. This actually leads to difficulties every now and then, because some parsers expect padding while others blatantly ignore it. My MPU base64 decoder, for example, needs padding, but it receives a non-padded base64 string over the network. This leads to erroneous parsing, and I had to account for it myself.
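For example, a minimal sketch (assuming Python's base64 module; the function name is made up) of restoring stripped padding before decoding:

import base64

def b64decode_forgiving(s):
    # -len(s) % 4 is the number of '=' characters missing from the last
    # 4-character group; re-append them, then decode normally.
    return base64.b64decode(s + "=" * (-len(s) % 4))

assert b64decode_forgiving("dXNi") == b"usb"
assert b64decode_forgiving("QUJDREVGRw") == b"ABCDEFG"   # padding stripped from 'QUJDREVGRw=='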