Mixing UTF-8 with UTF-16 - encoding

I'm currently working on a Korean program which should be translated into Chinese. What I found strange is that the application seems to be mixing UTF-8 and UTF-16 characters.
Let's say we have a string such as:
"게임을 정말로 종료하시겠습니까"
8C AC 84 C7 44 C7 20 00 15 C8 D0 B9 5C B8 20 00
85 C8 CC B8 58 D5 DC C2 A0 AC B5 C2 C8 B2 4C AE 00
But it's stored as
B0 D4 C0 D3 C0 BB 20 C1 A4 B8 BB B7 CE 20 C1 BE
B7 E1 C7 CF BD C3 B0 DA BD C0 B4 CF B1 EE 3F 00
just to prevent zeros, it seems. I'd like to know whether this is some kind of encryption, or just a normal method used by compilers to keep a string terminator from appearing in the middle of the string. In any case, the final result is the first string I mentioned. Any reading material would be greatly appreciated.

A string must be either UTF-8 or UTF-16 (or some other encoding). If you mix encodings within a string, it is an error. However, it is very common to pass strings around as UTF-8 and only convert them to UTF-16 when needed by a Windows function. There are several reasons for this; Basile Starynkevitch has provided a link.
If you need routines to read UTF-8, I've got some here.
https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c
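For what it's worth, the two dumps in the question appear to be the same sentence in two different encodings rather than a mix inside one string: the first looks like UTF-16LE, the second like the legacy Korean code page (EUC-KR / CP949). A minimal Python 3 sketch to check this (not part of the original answer; the hex is copied from the question, with the trailing NUL terminator dropped):
# Same Korean sentence, two encodings (Python 3 sketch).
utf16le = bytes.fromhex(
    "8CAC84C744C7200015C8D0B95CB82000"
    "85C8CCB858D5DCC2A0ACB5C2C8B24CAE"
)
cp949 = bytes.fromhex(
    "B0D4C0D3C0BB20C1A4B8BBB7CE20C1BE"
    "B7E1C7CFBDC3B0DABDC0B4CFB1EE3F"
)
print(utf16le.decode("utf-16-le"))  # 게임을 정말로 종료하시겠습니까
print(cp949.decode("cp949"))        # 게임을 정말로 종료하시겠습니까?
If both lines print the same text, the stored form is simply the CP949 encoding of the string (which contains no zero bytes for Korean characters), not an encryption.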

Where are files marked as assume-unchanged in Git? [duplicate]

I like to modify config files directly (like .gitignore and .git/config) instead of remembering arbitrary commands, but I don't know where Git stores the file references that get passed to "git update-index --assume-unchanged file".
If you know, please do tell!
It says where in the command itself: git update-index.
So you can't really edit the index by hand, as it is not a text file.
Also, to give more detail on what is stored by the git update-index --assume-unchanged command, see the Using “assume unchanged” bit section in the manual.
As others said, it's stored in the index, which is located at .git/index.
After some detective work, I found that it is stored in the "assume valid" bit of each index entry.
Therefore, before understanding what follows, you should first understand the global format of the index, as explained in my other answer.
Next, I will explain how I verified that the "assume valid" bit is the culprit:
empirically
by reading the source
Empirical
Time to hd it up.
Setup:
git init
echo a > b
git add b
Then:
hd .git/index
Gives:
00000000 44 49 52 43 00 00 00 02 00 00 00 01 54 e9 b6 f3 |DIRC........T...|
00000010 2d 4f e1 2f 54 e9 b6 f3 2d 4f e1 2f 00 00 08 05 |-O./T...-O./....|
00000020 00 de 32 ff 00 00 81 a4 00 00 03 e8 00 00 03 e8 |..2.............|
00000030 00 00 00 00 e6 9d e2 9b b2 d1 d6 43 4b 8b 29 ae |...........CK.).|
00000040 77 5a d8 c2 e4 8c 53 91 00 01 62 00 c9 a2 4b c1 |wZ....S...b...K.|
00000050 23 00 1e 32 53 3c 51 5d d5 cb 1a b4 43 18 ad 8c |#..2S<Q]....C...|
00000060
Now:
git update-index --assume-unchanged b
hd .git/index
Gives:
00000000 44 49 52 43 00 00 00 02 00 00 00 01 54 e9 b6 f3 |DIRC........T...|
00000010 2d 4f e1 2f 54 e9 b6 f3 2d 4f e1 2f 00 00 08 05 |-O./T...-O./....|
00000020 00 de 32 ff 00 00 81 a4 00 00 03 e8 00 00 03 e8 |..2.............|
00000030 00 00 00 00 e6 9d e2 9b b2 d1 d6 43 4b 8b 29 ae |...........CK.).|
00000040 77 5a d8 c2 e4 8c 53 91 80 01 62 00 17 08 a8 58 |wZ....S...b....X|
00000050 f7 c5 b3 e1 7d 47 ac a2 88 d9 66 c7 5c 2f 74 d7 |....}G....f.\/t.|
00000060
By comparing the two indexes and looking at the global structure of the index, we see that the only differences are:
byte 0x48 (the 9th byte on the 00000040 line), which changed from 00 to 80. That is our flag: the most significant bit of the cache entry flags.
the 20 bytes from 0x4C to 0x5F. This is expected, since they are a SHA-1 over the entire index.
This has also taught me that the SHA-1 of the index entry in bytes 0x34 to 0x47 does not take the flags into account, since it did not change between the two indexes. This is probably why the flags are placed after the SHA, which only considers what comes before it.
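To double-check the empirical result programmatically, here is a minimal Python 3 sketch (assuming a plain version-2 .git/index with at least one entry, like the one dumped above) that reads the flags field of the first index entry directly:
# Read the 16-bit flags of the first index entry and test the
# "assume valid" bit (0x8000). Header: "DIRC", version, entry count.
import struct

with open(".git/index", "rb") as f:
    data = f.read()

signature, version, nentries = struct.unpack(">4sLL", data[:12])
assert signature == b"DIRC" and version == 2 and nentries >= 1

# A v2 entry starts with ten 4-byte stat fields and a 20-byte SHA-1,
# so its 2-byte flags field sits at entry offset 60 (absolute 0x48 here).
flags = struct.unpack(">H", data[12 + 60:12 + 62])[0]
print("assume-valid bit set:", bool(flags & 0x8000))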
Source code
Now let's see if that is consistent with the source code of Git 2.3.
First, look at the source of update-index and grep for assume-unchanged.
This leads to the following line:
{OPTION_SET_INT, 0, "assume-unchanged", &mark_valid_only, NULL,
N_("mark files as \"not changing\""),
PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, MARK_FLAG},
{OPTION_SET_INT, 0, "no-assume-unchanged", &mark_valid_only, NULL,
N_("clear assumed-unchanged bit"),
PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, UNMARK_FLAG},
so the value is stored in mark_valid_only. Grep for it, and you'll find that it is used in only one place:
if (mark_valid_only) {
    if (mark_ce_flags(path, CE_VALID, mark_valid_only == MARK_FLAG))
        die("Unable to mark file %s", path);
    return;
}
CE means Cache Entry.
By quickly inspecting mark_ce_flags, we see that:
if (mark)
    active_cache[pos]->ce_flags |= flag;
else
    active_cache[pos]->ce_flags &= ~flag;
So the function basically sets or unsets the CE_VALID bit, depending on mark_valid_only, which is a tri-state:
mark: --assume-unchanged
unmark: --no-assume-unchanged
do nothing: the default value 0 of the option set at {OPTION_SET_INT, 0
Next, by grepping under builtin/, we see that no other place sets the value of CE_VALID, so --assume-unchanged must be the only command that sets it.
The flag is, however, used in many places in the source code, which is to be expected since it has many side effects, and it is always used like:
ce->ce_flags & CE_VALID
so we conclude that it is part of the ce_flags field of struct cache_entry.
The index format is specified in cache.h, because one of its functions is to serve as a cache that makes creating commits faster.
By looking at the definition of CE_VALID under cache.h and surrounding lines we have:
#define CE_STAGEMASK (0x3000)
#define CE_EXTENDED (0x4000)
#define CE_VALID (0x8000)
#define CE_STAGESHIFT 12
So we conclude that it is the most significant bit of that integer (0x8000), right next to CE_EXTENDED, which is consistent with the earlier experiment.
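Tying the two parts together (based on the dump above): the two flag bytes at offset 0x48 read 80 01 after the command, i.e. flags = 0x8001. That is CE_VALID (0x8000) plus the path name length, which is stored in the low 12 bits of the flags field and is 1 for the file b:
flags = 0x8001
print(bool(flags & 0x8000))  # assume-valid (CE_VALID) bit -> True
print(flags & 0x0FFF)        # name length -> 1 (the path "b")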

Identifying an unknown data encoding?

I'm trying to understand an undocumented API I have discovered, and I can't get past the format of the data being returned.
Here is an example of what I get back when I perform a GET on the URL I'm looking at:
A+uZL4258wXdnWztlEPJNXtdl3Tu4hRITtW2AUwQHUK5c6BATSBU/XsQEVIttCpI7wrW/oXWiBloT8+cdtUWBag3mzk3cLohKPvi7PWpf7jqCSbjNGh+5Iv5Gb8by2k31kp62sfwZ+i8r/3TA6nGrnJb6edOB7d0c6F34RTFRrrZSeJtiWYXAJ5JeD3yJY+C
At first I thought this was base64-encoded, but decoding it just gives me back gibberish:
echo -n "<above snippet>" | base64 -D
?/???ݝl?C?5Vy??????,?8?s?#M T?{R-?*H?
???ֈhOϜv??7?97p?!(??????? &?4h~???i7?Jz???g輯???Ʈr[??N?ts?w??F??I?m?f?Ix=?%?
When I strip the URL down to just the domain, I get a website with Cyrillic text. Maybe the data could be converted to Cyrillic somehow?
Does this data format look familiar to you?
I'll keep trying and report back if I make any progress.
This is definitely base64, because of the / and + characters.
When you decode that string using base64, you get this hexdump:
00000000 03 eb 99 2f 8d b9 f3 05 dd 9d 6c ed 94 43 c9 35 |.../......l..C.5|
00000010 7b 5d 97 74 ee e2 14 48 4e d5 b6 01 4c 10 1d 42 |{].t...HN...L..B|
00000020 b9 73 a0 40 4d 20 54 fd 7b 10 11 52 2d b4 2a 48 |.s.#M T.{..R-.*H|
00000030 ef 0a d6 fe 85 d6 88 19 68 4f cf 9c 76 d5 16 05 |........hO..v...|
00000040 a8 37 9b 39 37 70 ba 21 28 fb e2 ec f5 a9 7f b8 |.7.97p.!(.......|
00000050 ea 09 26 e3 34 68 7e e4 8b f9 19 bf 1b cb 69 37 |..&.4h~.......i7|
00000060 d6 4a 7a da c7 f0 67 e8 bc af fd d3 03 a9 c6 ae |.Jz...g.........|
00000070 72 5b e9 e7 4e 07 b7 74 73 a1 77 e1 14 c5 46 ba |r[..N..ts.w...F.|
00000080 d9 49 e2 6d 89 66 17 00 9e 49 78 3d f2 25 8f 82 |.I.m.f...Ix=.%..|
This just looks like 144 bytes of random data. And whenever you call this API URL again, you get a different string, although it starts with the same few characters.
Perhaps you should ask the maintainers of that website how to use their API. Maybe this string is some session ID that you should use in further calls.
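If you want to reproduce that check yourself, here is a short Python 3 sketch (not from the original answer) that does the same decode and byte count:
# Decode the base64 snippet and look at the raw bytes.
import base64

snippet = "A+uZL4258wXdnWzt..."  # paste the full string from the question here
raw = base64.b64decode(snippet)
print(len(raw), "bytes")  # 144 for the full snippet
print(raw.hex())          # same bytes as the hexdump above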

cstring m_pszdata doesn't match converted char* in UNICODE

I tested the Unicode conversion with a UNICODE MFC dialog app, where I can input some Chinese in the edit box. After reading in the characters using
DDX_Text(pDX, IDC_EDIT1, m_strUnicode)
UpdateData(TRUE)
the m_pszdata of m_strUnicode shows "e0 65 2d 4e 1f 75 09 67". Then I used the following code to convert it to char*:
char *psText = new char[dwMinSize];
WideCharToMultiByte(CP_OEMCP, NULL, m_strUnicode, -1, psText,
    dwMinSize, NULL, FALSE);
The psText contains "ce de d6 d0 c9 fa d3 d0", nothing like the m_pszdata of m_strUnicode. Would anyone please explain why that is?
ce de d6 d0 c9 fa d3 d0 is 无中生有 in GBK. You sure you're manipulating Unicode?
CP_OEMCP instructs the API to use the currently set default OEM codepage.
So my guess here is that you're on a Chinese PC with GBK as default codepage.
无中生有 in UTF-16LE is e0 65 2d 4e 1f 75 09 67, so basically you are converting a UTF-16LE string to GBK.
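You can reproduce both byte sequences with a quick Python 3 sketch (an addition to the answer; .hex() with a separator needs Python 3.8+):
# The same four characters in the two encodings discussed above.
s = "无中生有"
print(s.encode("utf-16-le").hex(" "))  # e0 65 2d 4e 1f 75 09 67
print(s.encode("gbk").hex(" "))        # ce de d6 d0 c9 fa d3 d0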

Decoding an arbitrary block of NSData?

If I have an arbitrary block of NSData as a hex value, is there a way to determine what the object might have been before it was archived or serialized? I don't mind a few guess and check methods, but I need some pointers in the right direction.
I have an NSData object with some hex in it. What methods of NSData should I look at? Are there other classes to try as well?
Don't want to scare people away from answering, but I have a file of game data which was likely encoded using a Cocoa Touch class. The data, when viewed in a hex editor, shows gibberish and a username, which leads me to suspect that it's an archived or encoded object of sorts. I have copied the hex from the hex editor into a sample project which I am using to try and unarchive the data.
I don't believe this is related to the 3D format; the file extension is arbitrary.
Here's the data. I'm hoping it doesn't get lost in translation:
'µköXN[ÎÀü÷h/F9ó9Vìñ°ceE¸z¶=Hmoshbermú«ó¼Ppù#ÝVÔ=4â®L,K;Êç;ASÀ&Ë÷ëÓ%È;Úf¬G}tmQ;µéüø_87´y©ã©!߶óQòAçÛl©âSG4S½3ýJת9äô¡wxiD²M¼ÏB]39øþ:óñ7ª¾÷躣È3Ï¢ÍEFÍ¢ª»r]BmÁ'Ò+åygÞÅQ?luó>÷ú¼è6¸|}[¼[¶Ñ¦g!\OÎÒJSE..pSß&_ÈEäø)6òëó¨¼2¶ð°æà`ï7Ë=Ã¥:cƧ=L4qG-"µ(ÐÝïß ÓãXkÀ4fzæ·p\ññT<tu¥Æ©;Ìn4£³Ï¢ÌFåG´
And the corresponding hex:
27 B5 6B F6 01 00 00 00 58 4E 5B CE C0 FC F7 68 2F 46 86 87 83 39 F3 39 9E 56 EC F1 B0 63 9E 65 45 B8 7A B6 3D 07 99 48 6D 6F 73 68 62 65 72 6D 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90 86 FA 03 0E AB F3 BC 0B 50 70 F9 23 DD 87 56 03 D4 3D 34 90 E2 AE 4C 2C 94 9E 8E 15 4B 0C 83 8C 3B 03 CA E7 3B 1B 41 53 C0 26 04 CB F7 EB D3 25 C8 3B DA 66 8A AC 47 7D 8A 7F 74 6D 51 3B B5 19 E9 FC F8 5F 38 37 B4 11 0C 79 A9 12 E3 A9 21 DF B6 F3 51 F2 41 E7 DB 85 02 9F 6C A9 E2 53 47 1F 34 86 53 BD 33 FD 4A D7 AA 39 C3 A4 F4 A1 77 78 69 44 B2 4D BC CF 42 5D 33 39 F8 FE 97 3A 81 F3 F1 10 37 AA BE 86 91 F7 1F E8 83 BA A3 C8 33 CF 1D A2 CD 45 7F 46 1F CD A2 AA BB 1A 72 5D 42 02 6D C1 0F 27 D2 2B E5 0B 79 67 DE C5 1A 51 3F 14 6C 75 F3 3E F7 FA BC E8 36 8E B8 7C 02 1C 7D 01 00 92 8C 19 5B BC 5B B6 D1 A6 67 7F 21 5C 84 13 4F CE 0C D2 4A 53 19 82 45 1B 2E 2E 96 70 53 DF 26 5F C8 1C 45 8F E4 F8 29 36 F2 EB 9D 95 F3 A8 BC 32 B6 F0 B0 E6 91 98 1A E0 99 60 EF 37 CB 3D C3 A5 3A 63 0C C6 A7 3D 4C 34 71 47 2D 22 B5 28 D0 DD EF DF 09 D3 E3 58 6B C0 17 34 66 7A E6 B7 70 5C F1 F1 54 3C 74 94 75 A5 C6 15 A9 9E 14 3B CC 15 10 83 6E 34 A3 B3 CF 0F A2 9C CC 8E 46 8C E5 00 00 47 B4 17 05 00 00 00 00
If anyone cares to help figure this out it would be much appreciated.
If I have an arbitrary block of NSData as a hex value, is there a way to determine what the object might have been before it was archived or serialized?
Not really. That's about as 'trivial' as reading arbitrary files correctly without the use of a UTI, extension, or MIME type. Of course, your program would also need to support reading all of those files/formats.
I don't mind a few guess and check methods, but I need some pointers in the right direction.
You need to narrow your problem/inputs down, if you don't want an impossibly difficult task.
I have an NSData object with some hex in it. What methods of NSData should I look at?
It's just a data blob, length bytes long. It could represent anything if you don't know where it came from.
Are there other classes to try as well?
Perhaps you would start by saving all your data via NSCoder or another serializer/archiver which offers some introspection and support for you to enter your own information (which would be comparable to a UTI or MIME type).
Edit:
Don't want to scare people away from answering, but I have a file of game data which was likely encoded using a Cocoa Touch class. The data, when viewed in a hex editor, shows gibberish and a username, which leads me to suspect that it's an archived or encoded object of sorts. I have copied the hex from the hex editor into a sample project which I am using to try and unarchive the data.
Using these APIs, the data may be represented in multiple ways. You're probably facing something in the range from 1) a proprietary file format to 2) a keyed archive.
The latter is easier for nontrivial data representations. You would need to define any Objective-C classes you do not have available when the data is unarchived. In that case, a few sample representations would offer a rough outline of the data structures you will need (under conventional implementations). It could also be an archive similar to an NSDictionary, if the unarchiver is capable of opening it. This problem is easier than in other languages, since archiving in Cocoa often falls back on keys and values mapped to members.
Edit2:
The file came from the Draw Something directory. It's called gamedata.i3d
(shrug)
Try using NSKeyedUnarchiver to read it. It's not uncommon to use just the standard Foundation containers like NSArray, NSDictionary, and NSString to store data, so you might get lucky. That obviously won't work if custom classes were involved, but it might be worth 15 minutes of your time to try it.
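One quick sanity check before spending those 15 minutes (a Python sketch rather than Cocoa, added here as an aside): NSKeyedArchiver output is a property list, and the binary form starts with the ASCII magic bplist00, so you can tell at a glance whether NSKeyedUnarchiver has any realistic chance of reading the file:
# Check whether the blob even looks like a binary plist / keyed archive.
with open("gamedata.i3d", "rb") as f:
    magic = f.read(8)
print("binary plist / keyed archive:", magic == b"bplist00")
If the magic is missing, the data is probably not a plain keyed archive (though it could still be compressed or encrypted on top of one).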

Translate unreadable Russian text

I'm trying to read documentation written in what I believe is Russian, but I'm not sure if what I'm seeing is even encoded correctly. The text looks something like this:
Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1
(appears as several special A's and o's)
when opened in Firefox. In other programs it looks like this:
���������� ������� ��������� ����� � ��������� �� -1 �� 1
(appears as several question marks)
Is there any hope to translate this?
Decode as CP1251.
>>> print u'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîí'.encode('latin-1').decode('cp1251')
Генерирует матрицу случайных чисел в диапазон
You need to determine which of multiple possible Cyrillic codesets was used - the linked site lists more than a dozen possibilities, of which ISO 8859-5 and CP-1251 are perhaps the most likely.
You may be able to get one of the translation web sites (Babelfish or Google, and no doubt others) to help. However, you may have to translate from the original codeset to UTF-8 to get it to work -- simply copying the bytes above did not work.
When copying the original text to a Mac, it was encoded as UTF-8:
0x0000: C3 83 C3 A5 C3 AD C3 A5 C3 B0 C3 A8 C3 B0 C3 B3 ................
0x0010: C3 A5 C3 B2 20 C3 AC C3 A0 C3 B2 C3 B0 C3 A8 C3 .... ...........
0x0020: B6 C3 B3 20 C3 B1 C3 AB C3 B3 C3 B7 C3 A0 C3 A9 ... ............
0x0030: C3 AD C3 BB C3 B5 20 C3 B7 C3 A8 C3 B1 C3 A5 C3 ...... .........
0x0040: AB 20 C3 A2 20 C3 A4 C3 A8 C3 A0 C3 AF C3 A0 C3 . .. ...........
0x0050: A7 C3 AE C3 AD C3 A5 20 C3 AE C3 B2 20 2D 31 20 ....... .... -1
0x0060: C3 A4 C3 AE 20 31 0A .... 1.
0x0067:
So, to translate this with Perl, I used the Encode module first to convert the UTF-8 string back to Latin-1, and then I told Perl to treat the Latin-1 as if it was CP-1251 and convert that back to UTF-8:
#!/usr/bin/env perl
use Encode qw( from_to );
my $source = 'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1';
# from_to changes things 'in situ'
my $nbytes = from_to($source, "utf-8", "latin-1");
# print "$nbytes: $source\n";
$nbytes = from_to($source, "cp-1251", "utf-8");
print "$nbytes: $source\n";
The output is:
102: Генерирует матрицу случайных чисел в диапазоне от -1 до 1
Which Babelfish translates as:
102: It generates the matrix of random numbers in the range from -1 to 1
and Google translates as:
102: Generate a matrix of random numbers ranging from -1 to 1
The initial UTF-8 to Latin-1 translation was required because of the setup on my Mac (my terminal uses UTF-8 by default, etc): YMMV.