Why does the Standard Compression Scheme for Unicode sample code not work?

I'm trying to learn SCSU:
http://unicode.org/reports/tr6
but when I try the Java sample code, the output is always larger than the input.
I tried this example from the report:
Öl fließt
The report gives the input as:
Unicode code points (9 code points):
00D6 006C 0020 0066 006C 0069 0065 00DF 0074
and the output as:
Compressed (9 bytes):
D6 6C 20 66 6C 69 65 DF 74
But what I got is:
Input:
famihug#hvn:/home/famihug/TestRoom/SCSU%xxd german.txt [0]
0000000: c396 6c20 666c 6965 c39f 7420 0a ..l flie..t .
Output:
famihug#hvn:/home/famihug/TestRoom/SCSU%java CompressMain /compress german.txt
Compressed german.txt: 6 chars to german.csu 13 bytes. Ratio: 108%.
famihug#hvn:/home/famihug/TestRoom/SCSU%ls -lt german.* [0]
-rw-r--r-- 1 famihug famihug 13 2012-06-09 10:24 german.csu
-rw-r--r-- 1 famihug famihug 13 2012-06-08 01:04 german.txt
famihug#hvn:/home/famihug/TestRoom/SCSU%xxd german.csu [0]
0000000: 0fc3 966c 2066 6c69 65c3 9f74 20
~~~~~~~~~~~~~
And this is what I got when I tried the Japanese sample:
famihug#hvn:/home/famihug/TestRoom/SCSU%wc -m jav.txt [0]
117 jav.txt
famihug#hvn:/home/famihug/TestRoom/SCSU%ls -lt jav.* [0]
-rw-r--r-- 1 famihug famihug 349 2012-06-08 01:13 jav.txt
-rw-r--r-- 1 famihug famihug 405 2012-06-08 01:01 jav.csu
The report says the output is "Compressed (178 bytes)".
I used gedit/Vim to paste the sample plaintext into the files. What am I doing wrong here?

It looks like the sample encoder is expecting UTF-16 input, and you're giving it UTF-8.
This input: c396 6c20 666c 6965 c39f 7420 0a is Öl fließt in UTF-8, with a trailing space and newline.
What you're getting back is 0fc3 966c 2066 6c69 65c3 9f74 20. The first 0f is the SCU tag, which indicates that the rest of the bytes are big-endian UTF-16. The problem is that instead of the UTF-16 equivalent of your input string, the rest of the bytes are exactly the same bytes as the input (minus the newline), and those bytes represent completely different characters in UTF-16 than in UTF-8.
The output you're getting back seems to represent 쎖氠晬楥쎟琠. Note that this is a 6-character string, as CompressMain reported. You could run your compressed output back through /expand of the same class to confirm.
If you encode your input file in UTF-16 rather than UTF-8, you should get the output you're expecting.
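For what it's worth, here is a minimal sketch (my own, not part of the TR6 sample code) that re-encodes a UTF-8 text file as big-endian UTF-16 before you feed it to CompressMain; whether the sample wants a BOM is something you may have to check:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the TR6 sample: re-encode a UTF-8
// text file as BOM-less big-endian UTF-16.
public class ToUtf16 {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);   // e.g. german.txt (UTF-8)
        Path out = Paths.get(args[1]);  // e.g. german16.txt
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, text.getBytes(StandardCharsets.UTF_16BE));
    }
}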

Related

[guid]::NewGuid().GetBytes() returns different result than [System.Text.Encoding]::UTF8.GetBytes(...)

I found this excellent approach to shortening GUIDs here on Stack Overflow: .NET Short Unique Identifier
I have some other strings that I wanted to treat the same way, but I found out that in most cases the Base64String is even longer than the original string.
My question is: why does [guid]::NewGuid().ToByteArray() return a significantly smaller byte array than [System.Text.Encoding]::UTF8.GetBytes([guid]::NewGuid().Guid)?
For example, let's look at the following GUID:
$guid = [guid]::NewGuid()
$guid
Guid
----
34c2b21e-18c3-46e7-bc76-966ae6aa06bc
With $guid.ToByteArray(), the following is returned:
30
178
194
52
195
24
231
70
188
118
150
106
230
170
6
188
And [System.Convert]::ToBase64String($guid.ToByteArray()) generates HrLCNMMY50a8dpZq5qoGvA==
[System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($guid.Guid)), however, returns MzRjMmIyMWUtMThjMy00NmU3LWJjNzYtOTY2YWU2YWEwNmJj, with [System.Text.Encoding]::UTF8.GetBytes($guid.Guid) being:
51
52
99
50
98
50
49
101
45
49
56
99
51
45
52
54
101
55
45
98
99
55
54
45
57
54
54
97
101
54
97
97
48
54
98
99
The GUID struct stores its value as a 16-byte array.
These are the 16 bytes you see when you call its .ToByteArray() method.
The 'normal' string representation is these bytes in hexadecimal, grouped 4-2-2-2-6 (bytes per group).
As for converting to Base64, this will always return a longer string because each Base64 digit represents exactly 6 bits of data.
Therefore, every three 8-bit bytes of the input (3×8 bits = 24 bits) can be represented by four 6-bit Base64 digits (4×6 = 24 bits).
The resulting string is padded with = characters at the end so that its length is always a multiple of 4.
The result is a string of length [math]::Ceiling(<original size> / 3) * 4.
Using [System.Text.Encoding]::UTF8.GetBytes([guid]::NewGuid().Guid) first performs the GUID's .ToString() method and then returns the ASCII value of each character of that string.
(Hexadecimal representation = 2 characters per byte = 32 characters, plus the four dashes, gives a 36-byte array.)
[guid]::NewGuid().ToByteArray()
In the scope of this question, a GUID can be seen as a 128-bit number (actually it is a structure, but that's not relevant to the question). When converting it into a byte array, you divide 128 by 8 (bits per byte) and get an array of 16 bytes.
[System.Text.Encoding]::UTF8.GetBytes([guid]::NewGuid().Guid)
This converts the GUID to a hexadecimal string representation first. Then this string gets encoded as UTF-8.
A hex string uses two characters per input byte (one hex digit for the lower and one for the upper 4 bits). So we need at least 32 characters (16 bytes of GUID multiplied by 2). When converted to UTF-8 each character relates to exactly one byte, because all hex digits as well as the dash are in the basic ASCII range which maps 1:1 to UTF-8. So including the dashes we end up with 32 + 4 = 36 bytes.
So this is what [System.Convert]::ToBase64String() has to work with - 16 bytes of input in the first case and 36 bytes in the second case.
Each Base64 output digit represents up to 6 input bits.
16 input bytes = 128 bits, divided by 6 and rounded up = 22 Base64 characters (24 with the == padding)
36 input bytes = 288 bits, divided by 6 = 48 Base64 characters
That's how you end up with more than twice the number of Base64 characters when converting a GUID to hex string first.
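The arithmetic is language-agnostic; here's a quick sanity check of those lengths (my own illustration in Java, not from the question) using the built-in Base64 encoder:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Lengths {
    public static void main(String[] args) {
        byte[] guidBytes = new byte[16]; // the 16 raw GUID bytes
        byte[] hexStringBytes = "34c2b21e-18c3-46e7-bc76-966ae6aa06bc"
                .getBytes(StandardCharsets.UTF_8); // 36 bytes: 32 hex digits + 4 dashes
        System.out.println(Base64.getEncoder().encodeToString(guidBytes).length());      // 24 (22 digits + "==" padding)
        System.out.println(Base64.getEncoder().encodeToString(hexStringBytes).length()); // 48 (no padding needed)
    }
}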

UTF8 Hex Codepoint to Decimal Mismatch

I'm working on a program that takes the hex value of a Unicode character, converts it to an integer, then to a byte array, then to a UTF-8 string. Everything is fine except that, for example, the hex value E2 82 AC (the € symbol) is 14 844 588 in decimal, but if you look at its value on the web page linked below, it's 226 130 172, which is a big difference.
http://utf8-chartable.de/unicode-utf8-table.pl?start=8320&number=128&names=-
If you sort the values there by decimal, it's clear they're not just converting the hex to decimal. Obviously I don't understand encodings as well as I thought I did.
E2 82 AC maps to 226 130 172 instead of 14 844 588.
Why is there this discrepancy?
Thanks in advance.
I think your statement, "the hex value E2 82 AC (€ symbol) is 14 844 588 in decimal", is incorrect.
How did you interpret the hex values E2, 82, and AC?
hex E2 = hex E * 16 + hex 2 = 14 * 16 + 2 = 226.
hex 82 = hex 8 * 16 + hex 2 = 8 * 16 + 2 = 130.
hex AC = hex A * 16 + hex C = 10 * 16 + 12 = 172.
So the hex bytes E2 82 AC (the UTF-8 encoding of the € symbol) are in fact 226 130 172 in decimal; the table lists each byte separately. Your 14 844 588 comes from reading the three bytes as a single 24-bit number, 0xE282AC.
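A tiny illustration of the two readings (my own example, not from the question):

public class HexReadings {
    public static void main(String[] args) {
        // The three UTF-8 bytes of '€' read as a single 24-bit number:
        System.out.println(Integer.parseInt("E282AC", 16)); // 14844588
        // Each byte read on its own, which is what the table on that page lists:
        System.out.println(Integer.parseInt("E2", 16));     // 226
        System.out.println(Integer.parseInt("82", 16));     // 130
        System.out.println(Integer.parseInt("AC", 16));     // 172
    }
}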

SonarQube 6.3 LDAP/SSO UTF-8 encoding

We're using LDAP/SSO in my company, which provides the username to SonarQube in UTF-8.
However, LDAP/SSO sends the username in UTF-8 while SonarQube expects Latin-1/ISO-8859-1, and there is no way to change the encoding on either LDAP/SSO or SonarQube.
The result is wrong umlauts:
Andrü Tingö = Andr«Ã Ting¼Ã OR äëüö = äëüÃ
Is there any workaround?
I wanted to post this as a comment, but I need 50 reputation to write comments.
We are using simplesamlphp for SSO as both IdP and SP. The IdP takes cn, givenName and sn from LDAP, which holds UTF-8 values. Login names/usernames are US-ASCII only.
When the user reaches Sonar, the non-US-ASCII characters are incorrect - they were converted from ... to UTF-8, even though they already are UTF-8.
If I use the attributes from the IdP in PHP, which sends the page in UTF-8, the characters are correct.
I just ran one test. In our Apache config we set X-Forwarded-Name to the MCAC_ATTR_CN attribute that the SP gets from the IdP. The original configuration is:
RequestHeader set X-Forwarded-Name "expr=%{reqenv:MCAC_ATTR_CN}"
Now I have added a fixed UTF-8 string:
RequestHeader set X-Forwarded-Name "expr=%{reqenv:MCAC_ATTR_CN} cäëöüc"
The "c" characters are just separators so the encoded text is easier to spot.
The hexdump of this configuration line is:
0000750: 09 0909 5265 7175 6573 7448 6561 ...RequestHea
0000760: 6465 7220 7365 7420 582d 466f 7277 6172 der set X-Forwar
0000770: 6465 642d 4e61 6d65 2022 6578 7072 3d25 ded-Name "expr=%
0000780: 7b72 6571 656e 763a 4d43 4143 5f41 5454 {reqenv:MCAC_ATT
0000790: 525f 434e 7d20 63c3 a4c3 abc3 b6c3 bc63 R_CN} c........c
00007a0: 220a ".
As you can see, the fixed UTF-8 characters are there: "ä" = c3a4, "ë" = c3ab, "ö" = c3b6, "ü" = c3bc.
From LDAP comes the following name:
xxxxxx xxxxx xxxx äëüö
In the Apache config, " cäëöüc" is appended, so the resulting name should be:
xxxxxx xxxxx xxxx äëüö cäëöüc
But in Sonar, the name is displayed as
xxxxxx xxxxx xxxx äëüö cäëöüc
You get a similar result if you convert the following text:
xxxxxx xxxxx xxxx äëüö cäëöüc
from ISO-8859-2 to UTF-8:
echo "xxxxxx xxxxx xxxx äëüö cäëöüc" | iconv -f iso-8859-2 -t utf-8
xxxxxx xxxxx xxxx äÍßÜ cäÍÜßc
The "¤" character is the UTF-8 sequence c2 a4:
00000000: c2a4 0a ...
I made a tcpdump on loopback to capture the communication from the Apache proxy module to SonarQube, and even there you can see the correct UTF-8 characters c3a4 c3ab c3bc c3b6 coming from the IdP, and then between the "c"s you can see c3a4 c3ab c3b6 c3bc coming directly from Apache.
00000000 47 45 54 20 2f 61 63 63 6f 75 6e 74 20 48 54 54 GET /acc ount HTT
...
00000390 58 2d 46 6f 72 77 61 72 64 65 64 2d 4e 61 6d 65 X-Forwar ded-Name
000003A0 3a 20 72 6f 62 65 72 74 20 74 65 73 74 32 20 77 : xxxxxx xxxxx x
000003B0 6f 6c 66 20 c3 a4 c3 ab c3 bc c3 b6 20 63 c3 a4 xxx .... .... c..
000003C0 c3 ab c3 b6 c3 bc 63 0d 0a ......c. .
...
The system has its locale set to en_US.UTF-8, if that matters.
So Sonar really does get UTF-8 text from Apache (both from the direct config and from the IdP), but then something apparently treats this UTF-8 text as if it were ISO-8859 and converts it to UTF-8 again, producing nonsense.
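Here is a minimal Java sketch of that suspected double conversion (my own illustration, just to show the mechanism, not code from Sonar):

import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        String original = "äëöü";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);        // c3 a4 c3 ab c3 b6 c3 bc
        // Misread the UTF-8 bytes as ISO-8859-1 ...
        String misread = new String(utf8, StandardCharsets.ISO_8859_1); // "Ã¤Ã«Ã¶Ã¼"
        // ... and encode that string as UTF-8 again: the doubled bytes seen in Sonar
        byte[] doubled = misread.getBytes(StandardCharsets.UTF_8);      // c3 83 c2 a4 ...
        System.out.println(misread);
    }
}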
Do you have any idea now? Could this be something in Sonar, in the wrapper, or some option set incorrectly somewhere?
Regards,
Robert.

How to save values to a .dat file in a specific format using MATLAB

I have a matrix of values like [150 255 25; 400 80 10; 240 68 190]. I want to store these values in a text file in hexadecimal format such that each value in the matrix is represented by a 3-digit hex value (12 bits), i.e.
Decimal          Hex notation
150 255  25      096 0FF 019
400  80  10  ->  190 050 00A
240  68 190      0F0 044 0BE
I am using it like this:
fp = fopen('represen.dat','wb');
for i = 1:1:x
    for j = 1:1:y
        fprintf(fp, '%3x\t', A(i,j));
    end
    fprintf(fp, '\n');
end
It gives this result:
Decimal          Hex notation
150 255  25      96 FF 19
400  80  10  ->  190 50 0A
240  68 190      F0 44 BE
Please help me with this.
First you have to convert the data to hex:
myHexData = dec2hex(myDecimalData, 3)  % the second argument pads each value to at least 3 hex digits
Then you can save it, as explained here and mentioned in the comments by Deve:
how-to-save-values-to-text-file-in-specific-format-using-matlab

Date compression

I have a problem in a project: I need to uncompress a date.
(No documentation is available)
I have to convert a date that is represented by these 6 bytes:
0xFD 0x77 0x59 0x51 0x10 0x00
Does anyone know how to uncompress it?
The date is from today, ~10:30 GMT.
The programming language doesn't matter.
(It is just a question of understanding, not a programming question.)
Christian
Edit: some more examples:
11:09 -->
0x fd 77 59 fd 10 00
11:09 -->
0x fd 77 79 05 28 00
11:05 -->
0x fd 77 59 fd 28 00
The solution was to not convert the data byte array to hex via Java. In several cases (when a byte is > 127) the result is a ..FD hex value. If I convert it another way, the logged result is, for example:
0x DD7719b33A00
dd7 -> 7dd -> 2013 (year)
7   -> 7   -> month
19  -> 25  -> day
b   -> 11  -> hour
33  -> 51  -> min.
A   -> 10  -> seconds
0   -> 0   -> ms
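The stray ..FD values suggest the bytes were being run through a character decode first (bytes above 0x7F that don't form a valid sequence come back as the U+FFFD replacement character). A minimal Java sketch of hex-dumping the raw bytes directly instead, using the corrected example value (my own illustration, not the poster's code):

public class BytesToHex {
    public static void main(String[] args) {
        byte[] packed = {(byte) 0xDD, 0x77, 0x19, (byte) 0xB3, 0x3A, 0x00};
        StringBuilder hex = new StringBuilder();
        for (byte b : packed) {
            // Mask to 0..255 so Java's signed bytes print as two plain hex digits.
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex); // DD7719B33A00 -> nibbles decode to 2013-07-25 11:51:10
    }
}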