Why the trailing 0x00 byte after BSON string (not Cstring/ename)? - mongodb

Obviously, for a BSON cstring the trailing byte is used to determine the length of the string, so it is: (byte*) "\x00". Cstrings are used for regex patterns, regex options and e_name, which are not long and are not iterated over, so a length prefix is not necessary. But then comes...
A BSON string is written as: int32 (byte*) "\x00"
with the specification as follows: The int32 is the number of bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.
But why the trailing zero byte? If we already have the length of the UTF-8 encoded string, that is sufficient for working with the byte data, and the 0x00 just adds an unneeded byte. Am I missing something?

The reasoning for both the length of the string and the null terminator is twofold: compatibility with existing C-style strings, and performance.
For performance, MongoDB needs to be able to jump to a specific field in a document quickly, without iterating through the whole BSON. This matters especially if you're looking for a field that is close to the end of a large (say 16 MB) document. With the length of the string encoded right at the start of the string type, the parser can simply skip that many bytes and land on the next field. Otherwise, it would need to scan the whole string to find where it ends.
For compatibility, MongoDB is written in C++, where strings are null terminated. It could drop that null terminator to save one byte, since the length is encoded, but getting the string out of BSON into a format that C++ can use would then require tacking the null back on. That would need a specialized string-handling routine whose only advantage is saving a single byte.
Overall, it was decided that "wasting" a single byte is an acceptable tradeoff.
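To make the skipping behaviour concrete, here is a minimal Python sketch (not MongoDB's actual parser) that reads one BSON string element and uses the int32 length prefix to jump straight past the value; the field name "k" and the value "hello" are made up for the example.
import struct

# Hypothetical bytes of a single BSON string element: type 0x02, e_name "k\x00",
# then int32 length (value bytes + trailing 0x00), then the UTF-8 bytes and 0x00.
element = b"\x02" + b"k\x00" + struct.pack("<i", 6) + b"hello\x00"

pos = 1                                   # skip the type byte (0x02 = string)
name_end = element.index(b"\x00", pos)    # e_name is a cstring, scan for its 0x00
pos = name_end + 1
(length,) = struct.unpack_from("<i", element, pos)   # int32 = value bytes + 1
value = element[pos + 4 : pos + 4 + length - 1]      # the UTF-8 bytes, no terminator
next_field = pos + 4 + length             # jump here without scanning the value
print(value.decode("utf-8"), next_field)  # -> hello 13
Note how next_field is computed from the length prefix alone, which is exactly the skip-ahead described above; only the e_name, being a cstring, has to be scanned for its terminator.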

Related

Interpreting ASN.1 indefinite-length encoding with multiple encapsulated octet-strings

I have a BER structure like this...
$ openssl asn1parse -inform der -in test.der -i -dump
????:d=4 hl=2 l=inf cons: cont [ 0 ]
????:d=5 hl=3 l= 240 prim: OCTET STRING
0000 - AABBCCDD
????:d=5 hl=2 l= 8 prim: OCTET STRING
0000 - EEFF
????:d=5 hl=2 l= 0 prim: EOC
...or in der2ascii style...
[0] `80`
OCTET_STRING { `AABBCCDD` }
OCTET_STRING { `EEFF` }
`0000`
What I know: indefinite-length encoding must contain a constructed type, because primitive types may introduce ambiguities, e.g. when containing 0x0000. What I want to know: How must a decoder behave when parsing this BER structure? Are the header bytes of both OCTET STRINGs included in the encoding? If yes, how is indefinite-length byte data encoded? How does an application interpret the value of the TLV field tagged [0] when the second OCTET STRING is, e.g., an INTEGER?
I am asking this question, because in the CMS standard, a field is defined as single OCTET STRING, but in most BER encodings I always see two of them. Is this only due to the indefinite-length encoding? Am I missing something?
From ITU-T X.690:
8.1.4 Contents octets
The contents octets shall consist of zero, one or more octets, and shall encode the data value as specified in subsequent clauses.
NOTE – The contents octets depend on the type of the data value; subsequent clauses follow the same sequence as the definition of types in ASN.1.
Does this mean that I can put in any constructed type, and the application must only interpret the value part of the constructed TLV structure?
When you encode a primitive OCTET STRING in indefinite length mode, the encoder must:
split up the value into chunks of smaller OCTET STRINGs
encode each chunk in definite length mode so that each has its own TLV (with length!)
the whole sequence of definite length encoded primitive OCTET STRINGs must be framed by a single, indefinite length encoded constructed OCTET STRING "container" having its own TLV (without length, but with end-of-octets sentinel)
At the other end, the decoder extracts the V part from the inner, definite length OCTET STRING chunks (dropping their TL headers), then joins/consumes the V parts in order of arrival, dropping the TL part of the outer frame as well.
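As a rough illustration of that framing, here is a hand-rolled Python sketch (not a full BER codec, and assuming short-form lengths under 128 bytes) that chunks the example data, wraps it in an indefinite-length constructed OCTET STRING, and decodes it again:
data = b"\xAA\xBB\xCC\xDD\xEE\xFF"
chunk_size = 4

# Encode: definite-length primitive OCTET STRING chunks (tag 0x04) inside an
# indefinite-length constructed OCTET STRING (tag 0x24, length byte 0x80),
# terminated by the end-of-contents octets 0x00 0x00.
body = b""
for i in range(0, len(data), chunk_size):
    chunk = data[i:i + chunk_size]
    body += bytes([0x04, len(chunk)]) + chunk
encoded = bytes([0x24, 0x80]) + body + b"\x00\x00"

# Decode: skip the outer TL, read each inner TLV, concatenate the V parts.
decoded, pos = b"", 2
while encoded[pos:pos + 2] != b"\x00\x00":
    tag, length = encoded[pos], encoded[pos + 1]
    decoded += encoded[pos + 2 : pos + 2 + length]
    pos += 2 + length
assert decoded == data
With a chunk size of 4 this produces exactly the two inner OCTET STRINGs (AABBCCDD and EEFF) seen in the openssl dump above.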
Note that the idea behind indefinite length encoding technique is that both encoder and decoder can emit/consume incomplete, possibly oversized, data.
Chunk size is chosen by the encoder/application based on data availability, memory situation and possibly the estimation of decoder's buffering capabilities. I think this is mentioned somewhere in the X.280/X.680 papers.
Encoder is not allowed to put chunks of different ASN.1 types into any single indefinite length encoded container. In other words, all chunks must be of the same type as the outer container.
That should hopefully explain why you may see multiple (depending on chunk size) OCTET STRINGs in the indefinite length encoded BER/CER stream where just a single OCTET STRING is expected.
DER forbids indefinite length encoding on the grounds that serialized representation of the same data may change on re-encoding (due to potentially changing chunk size).

How do I iterate through a UTF-16 encoded string character by character?

I have a UTF-16 encoded string theUFT16string. It contains double-byte characters. I would like to iterate through it Unicode character by Unicode character. I understand that the chunk expressions work with single-byte characters?
An example
We have the following string
abcαβɣ
We want to iterate through it and put each character on a line of its own in another container.
In LiveCode, there are two ways to get a character from a UTF16 string. If the string is displayed in a field, you can do
select char 3 of fld 1
and if you have a Russian or Polish text in the field, it will correctly select 1 character. However, this feature isn't very well developed in LiveCode and will fail with many Chinese, Japanese and Arabic (and other) languages. Therefore, it is better to use bytes for now:
select byte 5 to 6 of fld 1
The latter will also be compatible with future versions of LiveCode, while the former may not be.
Anyway, you have your string in a variable, which means you have to handle the string as bytes (you could use chars, but bytes and chars are dealt with in the same way in this case, because the data is in a variable). You can iterate through the variable with steps of two, i.e. one char at a time:
repeat with x = 1 to the number of bytes of theUFT16String step 2
  put byte x to x+1 of theUFT16String into myChar
  // do something with myChar here, e.g. reverse the bytes?
  put byte 2 of myChar & byte 1 of myChar after myNewString
end repeat
// myNewString now contains the entire theUFT16String in reverse byte order.
(You could do this in 3 lines instead of 4, but for the purpose of the example I have added a line that stores the bytes in var myChar).
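For comparison, the same two-bytes-at-a-time walk can be sketched in Python over a UTF-16LE byte buffer (the escaped code points spell out the abcαβɣ example above). Like the byte-based LiveCode approach, this treats each 16-bit code unit as one character, so it would mis-handle surrogate pairs:
text = "abc\u03b1\u03b2\u0263"           # abcαβɣ
data = text.encode("utf-16-le")          # two bytes per BMP character

lines = []
for i in range(0, len(data), 2):         # step through 16-bit code units
    unit = data[i:i + 2]
    lines.append(unit.decode("utf-16-le"))
print("\n".join(lines))                  # each character on its own line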

Extract the first letter of a UTF-8 string with Lua

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return "?" rather than "Ø".
Is there a relatively simple UTF-8 parsing algorithm I could use on the string byte per byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?
Or is this way too complex, requiring a huge library, etc.?
You can easily extract the first letter from a UTF-8 encoded string with the following code:
function firstLetter(str)
return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This works because a UTF-8 code point either is a single byte from 0 to 127, or begins with a byte from 194 to 244 followed by one or more bytes from 128 to 191.
You can even iterate over UTF-8 code points in a similar manner:
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
print(code)
end
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
Lua 5.3 provides a UTF-8 library.
You can use utf8.codes to get each code point, and then use utf8.char to get the character:
local str = "ÆØÅ"
for _, c in utf8.codes(str) do
print(utf8.char(c))
end
This also works:
local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
print(w)
end
where utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", the pattern that matches exactly one UTF-8 byte sequence.
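For what it's worth, the byte-per-byte approach the question asks about can also be sketched outside Lua. Here is a small Python version of the same idea, using the lead-byte ranges behind the pattern above; it is a simplified sketch that assumes well-formed UTF-8 input:
def first_letter(s):
    # s is a UTF-8 byte string; the lead byte tells us how long the sequence is
    # (assumes well-formed UTF-8, no validation of continuation bytes).
    lead = s[0]
    if lead < 0x80:
        n = 1          # single-byte ASCII character
    elif lead < 0xE0:
        n = 2          # 0xC2-0xDF start a two-byte sequence
    elif lead < 0xF0:
        n = 3          # 0xE0-0xEF start a three-byte sequence
    else:
        n = 4          # 0xF0-0xF4 start a four-byte sequence
    return s[:n]

print(first_letter("ÆØÅ".encode("utf-8")).decode("utf-8"))   # -> Æ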

How can I work with raw bytes in Perl?

Documentation all directs me to Unicode support, yet I don't think my request has anything to do with Unicode. I want to work with raw bytes within the context of a single scalar; I need to be able to figure out its length (in bytes), take substrings of it (in bytes), and write the bytes to disk and over the network. Is there an easy way to do this, without treating the bytes as any sort of encoding in Perl?
EDIT
More explicitly,
my $data = "Perl String, unsure of encoding and don't need to know";
my @data_chunked_into_1024_bytes_each = #???
Perl strings are, conceptually, strings of characters, which are positive 32-bit integers that (normally) represent Unicode code points. A byte string, in Perl, is just a string in which all the characters have values less than 256.
(That's the conceptual view. The internal representation is somewhat more complicated, as the perl interpreter tries to store byte strings — in the above sense — as actual byte strings, while using a generalized UTF-8 encoding for strings that contain character values of 256 or higher. But this is all supposed to be transparent to the user, and in fact mostly is, except for some ugly historical corner cases like the bitwise not (~) operator.)
As for how to turn a general string into a byte string, that really depends on what the string you have contains and what the byte string is supposed to contain:
If your string already is a string of bytes — e.g. if you read it from a file in binary mode — then you don't need to do anything. The string shouldn't contain any characters above 255 to begin with, and if it does, that's an error and will probably be reported as such by the code that consumes it.
Similarly, if your string is supposed to encode text in the ASCII or ISO-8859-1 encodings (which encode the 7- and 8-bit subsets of Unicode respectively), then you don't need to do anything: any characters up to 255 are already correctly encoded, and any higher values are invalid for those encodings.
If your input string contains (Unicode) text that you want to encode in some other encoding, then you'll need to convert the string to that encoding. The usual way to do that is by using the Encode module, like this:
use Encode;
my $byte_string = encode( "name of encoding", $text_string );
Obviously, you can convert the byte string back to the corresponding character string with:
use Encode;
my $text_string = decode( "name of encoding", $byte_string );
For the special case of the UTF-8 encoding, it's also possible to use the built-in utf8::encode() function instead of Encode::encode():
utf8::encode( $string );
which does essentially the same thing as:
use Encode;
$string = encode( "utf8", $string );
Note that, unlike Encode::encode(), the utf8::encode() function modifies the input string directly. Also note that the "utf8" above refers to Perl's extended UTF-8 encoding, which allows values outside the official Unicode range; for strictly standards-compliant UTF-8 encoding, use "utf-8" with a hyphen (see Encode documentation for the gory details). And, yes, there's also a utf8::decode() function that does pretty much what you'd expect.
If I understood your question correctly, what you want is the pack/unpack functions: http://perldoc.perl.org/functions/pack.html
As long as your string doesn't contain characters above codepoint 255, it will mostly work as a plain byte string, with length and substr operating on bytes. Additionally, most output functions like print expect octets/bytes by default and will actually complain if you try to stuff anything else into them.
You may need to explicitly encode/decode your output if it is known to be in some encoding, but more details can only be added if you ask another specific question for each problematic part of your program.
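The chunking from the question's EDIT is really just taking fixed-size substrings of the byte string. As a language-neutral illustration (a Python sketch rather than Perl, purely to show the arithmetic, with a hypothetical file name):
data = open("some_file", "rb").read()     # hypothetical input, already raw bytes
chunks = [data[i:i + 1024] for i in range(0, len(data), 1024)]
# In Perl the same loop would use length() and substr() on the byte string.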

Why does a base64 encoded string have an = sign at the end

I know what base64 encoding is and how to calculate base64 encoding in C#; however, I have seen several times that when I convert a string into base64, there is an = at the end.
A few questions came up:
Does a base64 string always end with =?
Why does an = get appended at the end?
Q Does a base64 string always end with =?
A: No. (For example, the word usb is base64 encoded as dXNi.)
Q Why does an = get appended at the end?
A: As a short answer:
The last character (the = sign) is added only as padding, in the final step of encoding a message whose byte count is not a multiple of three.
You will not have an = sign if your string has a multiple of 3 characters, because Base64 encoding takes each group of three bytes (a character = 1 byte) and represents it as four printable characters in the ASCII standard.
Example:
(a) If you want to encode
ABCDEFG <=> [ABC] [DEF] [G]
Base64 deals with the first block and the second normally (producing 4 characters each), as they are complete. But for the third, it will add a double == to the output in order to complete the 4 needed characters. Thus, the result will be QUJD REVG Rw== (without spaces).
[ABC] => QUJD
[DEF] => REVG
[G] => Rw==
(b) If you want to encode ABCDEFGH <=> [ABC] [DEF] [GH]
similarly, it will add one = at the end of the output to get 4 characters.
The result will be QUJD REVG R0g= (without spaces).
[ABC] => QUJD
[DEF] => REVG
[GH] => R0g=
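You can reproduce those results with any Base64 library; for example, a quick check in Python:
import base64

print(base64.b64encode(b"ABCDEFG"))    # b'QUJDREVGRw=='  -> one byte left over, '=='
print(base64.b64encode(b"ABCDEFGH"))   # b'QUJDREVGR0g='  -> two bytes left over, '='
print(base64.b64encode(b"ABCDEF"))     # b'QUJDREVG'      -> multiple of 3, no padding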
It serves as padding.
A more complete answer is that a base64 encoded string doesn't always end with =; it will only end with one or two = characters if they are required to pad the string out to the proper length.
From Wikipedia:
The final '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
Thus, this is some sort of padding.
It's defined in RFC 2045 as a special padding character for when fewer than 24 bits are available at the end of the data being encoded.
No.
To pad the Base64-encoded string to a multiple of 4 characters in length, so that it can be decoded correctly.
The equals sign (=) is used as padding in certain forms of base64 encoding. The Wikipedia article on base64 has all the details.
It's padding. From http://en.wikipedia.org/wiki/Base64:
In theory, the padding character is not needed for decoding, since the number of missing bytes can be calculated from the number of Base64 digits. In some implementations, the padding character is mandatory, while for others it is not used. One case in which padding characters are required is concatenating multiple Base64 encoded files.
http://www.hcidata.info/base64.htm
Encoding "Mary had" to Base 64
In this example we are using a simple text string ("Mary had") but the principle holds no matter what the data is (e.g. graphics file). To convert each 24 bits of input data to 32 bits of output, Base 64 encoding splits the 24 bits into 4 chunks of 6 bits. The first problem we notice is that "Mary had" is not a multiple of 3 bytes - it is 8 bytes long. Because of this, the last group of bits is only 4 bits long. To remedy this we add two extra bits of '0' and remember this fact by putting a '=' at the end. If the text string to be converted to Base 64 was 7 bytes long, the last group would have had 2 bits. In this case we would have added four extra bits of '0' and remember this fact by putting '==' at the end.
= is a padding character. If the input stream has a length that is not a multiple of 3, the padding character will be added. This is required by the decoder: if no padding is present, the last byte would have an incorrect number of zero bits.
Better and deeper explanation here: https://base64tool.com/detect-whether-provided-string-is-base64-or-not/
The equals or double equals serves as padding. It's a stupid concept defined in RFC 2045 and it is actually superfluous. Any decent parser can encode and decode a base64 string without knowing about padding, by just counting up the number of characters and filling in the rest if the size isn't divisible by 3 or 4 respectively. This actually leads to difficulties every now and then, because some parsers expect padding while others blatantly ignore it. My MPU base64 decoder for example needs padding, but it receives a non-padded base64 string over the network. This led to erroneous parsing and I had to account for it myself.
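If you are stuck with a decoder that insists on padding but receive unpadded input, restoring the padding is a one-liner, since the missing characters follow from the length. A small Python sketch of the idea:
import base64

s = "dXNicw"                       # unpadded Base64 for b'usbs'
s += "=" * (-len(s) % 4)           # pad the length up to a multiple of 4
print(base64.b64decode(s))         # -> b'usbs'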