I am trying to understand the string character encoding of a proprietary file format (fp7-file format from Filemaker Pro).
I found that each character is obfuscated by XOR with 0b01011010 and that the string length is encoded using a single starting byte (max string length in Filemaker is 100 characters).
The encoding is a variable-length byte encoding; by default, ISO 8859-1 (Western) is used to encode most characters.
If a Unicode character outside ISO 8859-1 is to be encoded, control characters are inserted into the string that modify the decoding of the next character or of several following characters. These control characters use the ASCII control-character range (0x01 to 0x1F in particular). This is where I am stuck, as I can't find a pattern in how these control characters work.
Some examples of what I think I have found:
When encountering the control character 0x11, the following characters are created by adding 0x40 to the byte value, e.g. the character Ā (U+0100) is encoded as 0x11 0xC0 (0xC0 + 0x40 = 0x100).
When encountering the control character 0x10 the previous control character seems to be reset.
When encountering the control character 0x03, the next (only the next!) character is created by adding 0x100 to the byte value. If the control character 0x03 is preceded by 0x1B, then all following characters are created by adding 0x100.
An example string (0_ĀĐĠİŀŐŠŰƀƐƠưǀǐǠǰȀ), its unicode code points and the encoding in Filemaker:
char 0 _ Ā Đ Ġ İ ŀ Ő Š Ű ƀ Ɛ Ơ ư ǀ ǐ Ǡ ǰ Ȁ
unicode 30 5f 100 110 120 130 140 150 160 170 180 190 1a0 1b0 1c0 1d0 1e0 1f0 200
encoded 30 5f 11 c0 d0 e0 f0 3 40 3 50 3 60 3 70 1b 3 80 90 a0 b0 c0 d0 e0 f0 1c 4 80
As you can see the characters 0 and _ are encoded with their direct unicode/ASCII value. The characters ĀĐĠİ are encoded using the 0x11 control byte. Then ŀŐŠŰ are encoded using 0x03 for each character, then 0x1B 0x03 are used to encode the next 8 characters, etc.
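To make these guessed rules concrete, here is a minimal Python sketch of a decoder. It implements only the 0x11 / 0x10 / 0x03 / 0x1B 0x03 behaviour described above; the function name and the state model are my own assumptions, and the bytes are assumed to be already de-obfuscated (XOR 0x5A) with the length byte stripped.

```python
# Hypothetical decoder for the control bytes observed so far.
def decode_fp7(data):
    out = []
    offset = 0           # persistent offset set by 0x11 or by 0x1B 0x03
    one_shot = None      # one-shot offset set by a lone 0x03
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x11:                        # following bytes get +0x40
            offset = 0x40
        elif b == 0x10:                      # reset the previous control
            offset = 0
        elif b == 0x1B and i + 1 < len(data) and data[i + 1] == 0x03:
            offset = 0x100                   # all following bytes get +0x100
            i += 1
        elif b == 0x03:                      # only the next byte gets +0x100
            one_shot = 0x100
        else:
            out.append(chr(b + (one_shot if one_shot is not None else offset)))
            one_shot = None
        i += 1
    return "".join(out)

sample = [0x30, 0x5F, 0x11, 0xC0, 0xD0, 0xE0, 0xF0, 0x03, 0x40, 0x03, 0x50,
          0x03, 0x60, 0x03, 0x70, 0x1B, 0x03, 0x80, 0x90, 0xA0, 0xB0, 0xC0,
          0xD0, 0xE0, 0xF0]
print(decode_fp7(sample))   # 0_ĀĐĠİŀŐŠŰƀƐƠưǀǐǠǰ
```

This reproduces the example string up to the final 0x1C 0x04 sequence, which I have deliberately left out since its rule is still unknown.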
Does this encoding scheme look familiar to anybody?
The rules are simple for characters up to 0x200, but then become more and more confusing, even to the point where they seem position dependent.
I can provide more examples for a weekend of puzzles and joy.
Related
While reading the Strings and Characters chapter of The Swift Programming Language, I couldn't work out how U+203C (‼, the double exclamation mark) can be represented by (226, 128, 188) in UTF-8.
How does that happen?
I hope you already know how UTF-8 reserves certain bits to indicate that the Unicode character occupies several bytes. (This website can help).
First, write 0x203C in binary:
0x203C = 10000000111100
So this character takes 14 significant bits to represent, more than the 11 bits a two-byte sequence can carry. Due to the "header bits" in the UTF-8 encoding scheme, it takes 3 bytes to encode it:
0x203C = 0010 000000 111100

              1st byte   2nd byte   3rd byte
              --------   --------   --------
header        1110       10         10
actual data   0010       000000     111100
---------------------------------------------
full byte     11100010   10000000   10111100
decimal       226        128        188
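The derivation above can be checked in a few lines of Python; the bit masks follow the standard UTF-8 layout for a three-byte sequence:

```python
ch = "\u203C"                    # '‼', the double exclamation mark
encoded = ch.encode("utf-8")
print(list(encoded))             # [226, 128, 188]

# Rebuild the three bytes by hand from the code point:
cp = ord(ch)                                  # 0x203C
b1 = 0b11100000 | (cp >> 12)                  # header 1110 + top 4 bits
b2 = 0b10000000 | ((cp >> 6) & 0b111111)      # header 10 + middle 6 bits
b3 = 0b10000000 | (cp & 0b111111)             # header 10 + low 6 bits
assert bytes([b1, b2, b3]) == encoded
```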
Is representing UTF-8 encoding in decimal even possible? I think only values up to 255 would be correct, am I right?
As far as I know, we can only represent UTF-8 in hex or binary form.
I think it is possible. Let's look at an example:
The Unicode code point for ∫ is U+222B.
Its UTF-8 encoding is E2 88 AB, in hexadecimal representation. In octal, this would be 342 210 253. In decimal, it would be 226 136 171. That is, if you represent each byte separately.
If you look at the same 3 bytes as a single number, you have E288AB in hexadecimal; 70504253 in octal; and 14846123 in decimal.
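These numbers are easy to verify in Python, treating the bytes either separately or as one big-endian integer:

```python
b = "∫".encode("utf-8")          # U+222B, the integral sign
print([hex(x) for x in b])       # ['0xe2', '0x88', '0xab']
print([oct(x) for x in b])       # ['0o342', '0o210', '0o253']
print(list(b))                   # [226, 136, 171]

n = int.from_bytes(b, "big")     # the same 3 bytes read as one number
print(hex(n), oct(n), n)         # 0xe288ab 0o70504253 14846123
```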
For the input "hello", SHA-1 returns "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d", which is 40 hex digits. I know 1 byte can be denoted by 1 character, so the 160-bit output should be convertible to 20 characters. But when I look up "aa" in an ASCII table, there is no such hex value, and that confuses me. How do I map the 160-bit SHA-1 output to 20 ANSI characters?
ASCII only has 128 characters (7 bits), while ANSI has 256 (8 bits). As to the ANSI value of hex value AA (decimal 170), the corresponding ANSI character would be ª (see for example here).
Now, you have to keep in mind that a number of both ASCII and ANSI characters (0-31) are non-printable control characters (system bell, null character, etc.), so turning your hash into a readable 20-character string will not be possible in most cases. For instance, your example contains the hex value 0F, which translates to a shift-in control character.
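A short Python sketch illustrates both points: the 20 raw digest bytes do decode to 20 eight-bit characters under a Latin-1 (roughly "ANSI") codepage, but the result is not cleanly printable:

```python
import hashlib

digest = hashlib.sha1(b"hello").digest()   # 20 raw bytes
print(digest.hex())                        # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d

# Each byte maps to one 8-bit character in Latin-1, so the length works out,
# but control bytes such as 0x0F have no printable representation:
text = digest.decode("latin-1")
print(len(text))                           # 20
print(repr(text))
```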
I'm using the OWASP EnDe web-based tool to understand nibbles and encoding in general. I'm testing a sample input, which is abcdwxyz.
Now, the results of encoding it based upon the first nibble and the second nibble are given
as 36,1,36,2,36,3,36,4,37,7,37,8,37,9,37,A and 6,31,6,32,6,33,6,34,7,37,7,38,7,39,7,61 respectively.
A simple representation in hex of above sample input is 61 62 63 64 77 78 79 7a.
Would nibble 1 and nibble 2, in simple terms, mean the LSB nibble and the MSB nibble respectively? Can someone explain how this relates to their use in this tool?
Thanks
Looking at the code that performs the encoding, it seems to work on the hex string of the ASCII code instead of taking the nibble of the ASCII code itself. So for your example of "abcd", the first-nibble encoding works as follows.
'a' -> 0x61 -> '61'
First nibble of '61' is '6', with '6' -> 0x36 -> '36'
So 'a' ends up being encoded as %%361
'b' -> 0x62 -> '62'
First nibble of '62' is '6' and again will be '36'.
So 'b' ends up being encoded as %%362
....
I am not sure where this encoding is documented, perhaps you can try Google.
You can find the function that performs the encodings at https://github.com/EnDe/EnDe/blob/master/EnDe.js#L982
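Based on that reading of the code, a small Python sketch (the function names are mine, not the tool's) reproduces the first- and second-nibble outputs shown in the question:

```python
def first_nibble_encode(s):
    parts = []
    for ch in s:
        hx = format(ord(ch), "02x")            # 'a' -> '61'
        parts.append(format(ord(hx[0]), "x"))  # hex of the ASCII code of '6' -> '36'
        parts.append(hx[1])                    # second hex digit kept as-is
    return ",".join(parts)

def second_nibble_encode(s):
    parts = []
    for ch in s:
        hx = format(ord(ch), "02x")            # 'a' -> '61'
        parts.append(hx[0])                    # first hex digit kept as-is
        parts.append(format(ord(hx[1]), "x"))  # hex of the ASCII code of '1' -> '31'
    return ",".join(parts)

print(first_nibble_encode("abcd"))   # 36,1,36,2,36,3,36,4
print(second_nibble_encode("abcd"))  # 6,31,6,32,6,33,6,34
```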
Why do we have Base64 encoding? I am a beginner and I really don't understand why you would obfuscate the bytes into something else (unless it is encryption). In one of the books I read that Base64 encoding is useful when binary transmission is not possible, e.g. when we post a form it is encoded. But why do we convert bytes into letters? Couldn't we just convert bytes into string format with a space in between? For example, 00000001 00000100? Or simply 0000000100000100 without any space, because bytes always come in groups of 8 bits?
Base64 is a way to encode binary data into an ASCII character set known to pretty much every computer system, in order to transmit the data without loss or modification of the contents itself.
For example, mail systems cannot deal with binary data because they expect ASCII (textual) data. So if you want to transfer an image or another file, it will get corrupted because of the way it deals with the data.
Note: Base64 encoding is NOT a way of encrypting, nor a way of compacting data. In fact, a Base64-encoded piece of data is about 1.33 times bigger than the original. It is only a way to be sure that no data is lost or modified during the transfer.
Base64 is a mechanism to enable representing and transferring binary data over mediums that allow only printable characters. It is the most popular form of "Base Encoding", the others in common use being Base16 and Base32.
The need for Base64 arose from the need to attach binary content to emails, such as images, videos or arbitrary binary files. Since SMTP [RFC 5321] only allowed 7-bit US-ASCII characters within messages, there was a need to represent these binary octet streams using only those characters.
Hope this answers the question.
Base64 is a more or less compact way of transmitting (encoding, in fact, but with the goal of transmitting) any kind of binary data.
See http://en.wikipedia.org/wiki/Base64
"The general rule is to choose a set of 64 characters that is both part of a subset common to most encodings, and also printable."
That's a very general purpose and the common need is not to waste more space than needed.
Historically, it's based on the fact that there is a common subset of (almost) all encodings used to store chars into bytes and that a lot of the 2^8 possible bytes risk loss or transformations during simple data transfer (for example a copy-paste-emailsend-emailreceive-copy-paste sequence).
(please redirect upvote to Brian's comment, I just make it more complete and hopefully more clear).
For data transmission, data can be textual or non-text (binary): images, video, files, etc.
During transmission, many channels only reliably carry streams of textual/printable characters, hence we need a way to encode non-text data such as images, videos and files.
The binary representation of such non-text data (an image, video or file) is easily obtainable.
That binary representation is then encoded in textual form, where each output character is taken from a set of sixty-four possibilities (A-Z, a-z, 0-9, + and /).
Table 1: The Base 64 Alphabet
Value Encoding Value Encoding Value Encoding Value Encoding
0 A 17 R 34 i 51 z
1 B 18 S 35 j 52 0
2 C 19 T 36 k 53 1
3 D 20 U 37 l 54 2
4 E 21 V 38 m 55 3
5 F 22 W 39 n 56 4
6 G 23 X 40 o 57 5
7 H 24 Y 41 p 58 6
8 I 25 Z 42 q 59 7
9 J 26 a 43 r 60 8
10 K 27 b 44 s 61 9
11 L 28 c 45 t 62 +
12 M 29 d 46 u 63 /
13 N 30 e 47 v
14 O 31 f 48 w (pad) =
15 P 32 g 49 x
16 Q 33 h 50 y
This sixty-four-character set is called Base64, and encoding given data into this set of sixty-four allowed characters is called Base64 encoding.
Let us take a few examples of ASCII strings encoded to Base64.
1 ==> MQ==
12 ==> MTI=
123 ==> MTIz
1234 ==> MTIzNA==
12345 ==> MTIzNDU=
123456 ==> MTIzNDU2
Here a few points are to be noted:
Base64 output comes in groups of 4 characters, because every 3 input bytes (24 bits) are split into four 6-bit groups, each mapped to one Base64 character. If the input length is not a multiple of 3 bytes, the final group is padded with =.
= is not part of the Base64 character set itself; it is used only for padding.
Hence, one can see that the Base64 encoding is not encryption but just a way to transform any given data into a stream of printable characters which can be transmitted over network.
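The examples above are easy to reproduce with Python's standard library, which also shows the = padding and the 4-characters-per-3-bytes grouping:

```python
import base64

for s in [b"1", b"12", b"123", b"1234", b"12345", b"123456"]:
    print(s.decode(), "==>", base64.b64encode(s).decode())
# 1      ==> MQ==
# 12     ==> MTI=
# 123    ==> MTIz
# 1234   ==> MTIzNA==
# 12345  ==> MTIzNDU=
# 123456 ==> MTIzNDU2
```

Decoding simply reverses the mapping, so `base64.b64decode(b"MTIzNA==")` gives back `b"1234"`.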