Some hexadecimal numbers get modified when writing to a file - encoding

I'm writing a program which packs hexadecimal strings into bytes and writes them to disk. I expect the hexdump of the file to match the original hexadecimal string. I'm doing this in Clojure:
;; Assumes (require '[clojure.java.io :as io]) for io/output-stream.
(defn- hex-char->int
  [hex-char]
  (-> hex-char
      str
      (Integer/parseInt 16)))

(defn- pack
  [hex-1 hex-2]
  (-> hex-1
      (bit-shift-left 4)
      (bit-or hex-2)
      unchecked-char))

(defn- hex-str->packed-bytes
  [hex-str]
  (->> hex-str
       (map hex-char->int)
       (partition 2)
       (mapv (partial apply pack))))

(defn write-bytes
  [bs]
  (with-open [f (io/output-stream "test.txt")]
    (.write f (.getBytes bs))))

(defn test-write
  [hex-str]
  (->> hex-str
       hex-str->packed-bytes
       (apply str)
       write-bytes))
This works as expected for hex pairs from "00" to "7f": I see the same hex numbers when I hexdump the output file.
But for pairs from "80" to "ff" it doesn't. The hexdump for "80" is "c280" and for "ff" it is "c3bf".
The problem goes away if I skip the conversion to characters and write the bytes directly, so I assume this is related to encoding.
I even found this: https://superuser.com/questions/1349494/filling-file-with-0xff-gives-c3bf-in-osx
But I want to understand how to solve this in Clojure's context.
Pasting the hexdump of "000f101f202f303f404f505f606f707f808f909fa0afb0bfc0cfd0dfe0eff0ff" for reference:
00000000 00 0f 10 1f 20 2f 30 3f 40 4f 50 5f 60 6f 70 7f |.... /0?@OP_`op.|
00000010 c2 80 c2 8f c2 90 c2 9f c2 a0 c2 af c2 b0 c2 bf |................|
00000020 c3 80 c3 8f c3 90 c3 9f c3 a0 c3 af c3 b0 c3 bf |................|
00000030
Please help me solve this.
Thanks! :)

As you suspected, the problem is encoding. I'm guessing it happens when you (apply str) in test-write: that builds a String of chars, and .getBytes then encodes the String with the platform's default charset (typically UTF-8), so every code point from 0x80 to 0xff comes out as two bytes.
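You can see the same two-byte expansion in isolation; here is a minimal sketch (Python, used purely to illustrate how UTF-8 encodes these code points, not part of your program):

# Code points below 0x80 are one byte in UTF-8; 0x80 through 0xff take two.
for cp in (0x7f, 0x80, 0xff):
    print(hex(cp), chr(cp).encode("utf-8").hex())
# 0x7f 7f
# 0x80 c280
# 0xff c3bf

The fix is to keep bytes as bytes all the way to the OutputStream instead of going through a String. So, I slightly rewrote your code as follows: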
user> (defn- hex-char->int
        [hex-char]
        (-> hex-char
            str
            (Integer/parseInt 16)))
#'user/hex-char->int
user> (defn- pack
        [hex-1 hex-2]
        (-> hex-1
            (bit-shift-left 4)
            (bit-or hex-2)))
#'user/pack
user> (defn- hex-str->packed-bytes
        [hex-str]
        (->> hex-str
             (map hex-char->int)
             (partition 2)
             (mapv (partial apply pack))))
#'user/hex-str->packed-bytes
user> (defn write-bytes
        [bs]
        (with-open [f (io/output-stream "test.txt")]
          (.write f bs)))
#'user/write-bytes
user> (defn test-write
        [hex-str]
        (->> hex-str
             hex-str->packed-bytes
             (mapv unchecked-byte)
             (byte-array)
             write-bytes))
#'user/test-write
user> (test-write "000f101f202f303f404f505f606f707f808f909fa0afb0bfc0cfd0dfe0eff0ff")
nil
user>
And showing the contents of the resultant file in hex (note that od -h prints 16-bit little-endian words, so each pair of bytes appears swapped; byte for byte, the file now matches the input):
dorabs-imac:example dorab$ od -h test.txt
0000000 0f00 1f10 2f20 3f30 4f40 5f50 6f60 7f70
0000020 8f80 9f90 afa0 bfb0 cfc0 dfd0 efe0 fff0
0000040

Related

perl replace non UTF-8 characters or binary contents with whitespace

I have a file with non-ascii characters.
$ od -t c -t x1 -A d tmp.txt
0000000 S o - c a l l e d 364 217 261 204 l a b
53 6f 2d 63 61 6c 6c 65 64 f4 8f b1 84 6c 61 62
0000016 e l e d 364 217 261 204 p a t t e r n s
65 6c 65 64 f4 8f b1 84 70 61 74 74 65 72 6e 73
0000032 364 217 261 204 c a n b e 364 217 261 204 u s
f4 8f b1 84 63 61 6e 20 62 65 f4 8f b1 84 75 73
0000048 e d 364 217 261 204 w i t h 364 217 261 204 s i
65 64 f4 8f b1 84 77 69 74 68 f4 8f b1 84 73 69
0000064 n g l e , 364 217 261 204 d o u b l e
6e 67 6c 65 2c 20 f4 8f b1 84 64 6f 75 62 6c 65
0000080 , 364 217 261 204 a n d 364 217 261 204 t r i
2c 20 f4 8f b1 84 61 6e 64 f4 8f b1 84 74 72 69
0000096 p l e 364 217 261 204 b l a n k s .
70 6c 65 f4 8f b1 84 62 6c 61 6e 6b 73 2e
As you can see, \x{f4}\x{8f}\x{b1}\x{84} has several occurrences. I want to replace \x{f4}\x{8f}\x{b1}\x{84} with whitespace. According to this, I try:
s/\x{f4}\x{8f}\x{b1}\x{84}/ /g;
tr/\x{f4}\x{8f}\x{b1}\x{84}/ /;
It doesn't work.
But if I remove these two lines from the script:
use utf8;
use open qw( :std :encoding(UTF-8) );
It works. Why?
I suspect it is because Perl only deals in characters, and \x{f4}\x{8f}\x{b1}\x{84} is not regarded as a single character. Is there a way to remove \x{f4}\x{8f}\x{b1}\x{84}, or any other binary content or non-UTF-8 characters, with Perl?
While the file may contain "\x{f4}\x{8f}\x{b1}\x{84}", your string contains "\x{10FC44}" — "\N{U+10FC44}" if you prefer — because you decoded what you read. As such, you'd need
tr/\N{U+10FC44}/ /
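To confirm that those four bytes really decode to that single code point, here is a quick check (sketched in Python purely for illustration):

# f4 8f b1 84 is one four-byte UTF-8 sequence encoding one code point.
ch = bytes.fromhex("f48fb184").decode("utf-8")
print(f"U+{ord(ch):X}")   # U+10FC44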
It's a private-use Code Point. To replace all 137,468 private-use Code Points, you can use
s/\p{General_Category=Private_Use}/ /g
General_Category can be abbreviated to Gc.
Private_Use can be abbreviated to Co.
General_Category= can be omitted.
So these are equivalent:
s/\p{Gc=Private_Use}/ /g
s/\p{Private_Use}/ /g
s/\p{Co}/ /g
Co makes me think of "control", so maybe it's best to avoid that one. (Control characters are identified by the Control aka Cc general category.)
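The same category-based replacement works outside Perl too; here is a minimal sketch (Python, assuming the text has already been decoded to a string; the sample string is hypothetical):

import unicodedata

text = "So-called\U0010FC44labeled\U0010FC44patterns"
# Replace every private-use code point (general category Co) with a space.
cleaned = "".join(" " if unicodedata.category(c) == "Co" else c for c in text)
print(cleaned)   # So-called labeled patterns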

Poor C optimization in Watcom

I am using the Watcom C compiler (wcc) version 2.0. I compile this simple code with the strongest optimizations enabled, but the resulting assembly looks very unoptimized to me.
int test(int x) {
return x ? 1 : 2;
}
Compiling with the 8086 as target and the fastest possible optimizations (-otexan):
wcc test.c -i="C:\Data\Projects\WATCOM/h" -otexan -d2 -bt=dos -fo=.obj -mc
Then disassembling with:
wdis test.obj -s > test.dasm
The resulting assembly looks like this:
...
return x ? 1 : 2;
02C7 83 7E FA 00 cmp word ptr -0x6[bp],0x0000
02CB 74 03 je L$39
02CD E9 02 00 jmp L$40
02D0 L$39:
02D0 EB 07 jmp L$41
02D2 L$40:
02D2 C7 46 FE 01 00 mov word ptr -0x2[bp],0x0001
02D7 EB 05 jmp L$42
02D9 L$41:
02D9 C7 46 FE 02 00 mov word ptr -0x2[bp],0x0002
02DE L$42:
02DE 8B 46 FE mov ax,word ptr -0x2[bp]
02E1 89 46 FC mov word ptr -0x4[bp],ax
02E4 8B 46 FC mov ax,word ptr -0x4[bp]
...
These jumps look heavily unoptimized to me. I would expect fewer unnecessary jumps, and the result could be put directly into AX instead of bouncing through the BP-relative locals (last two lines). Something like this:
cmp word ptr -0x6[bp],0x0000
jz L$39
mov ax,0x0001
jmp L$40
L$39:
mov ax,0x0002
L$40:
...
Am I missing something there? Is wcc ignoring my switches for some reason? Thanks.
The problem is the -d2 switch, which generates detailed debug info. It apparently suppresses optimization so that the generated code still corresponds line-by-line to the original C file (to be able to put breakpoints on every line, maybe?). I replaced -d2 with -d1.
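That is, the same command line as before with only the debug level lowered:

wcc test.c -i="C:\Data\Projects\WATCOM/h" -otexan -d1 -bt=dos -fo=.obj -mc

And voilà: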
01A0 test_:
01A0 85 C0 test ax,ax
01A2 74 04 je L$15
01A4 B8 01 00 mov ax,0x0001
01A7 C3 ret
01A8 L$15:
01A8 B8 02 00 mov ax,0x0002
01AB C3 ret

Random symbols in Source window instead of Russian characters in RStudio

I have been googling and stackoverflowing (yes, that is a word now) on how to fix a wrong-encoding problem, but I could not find a solution.
I am trying to load an .Rmd file with UTF-8 encoding which has Russian characters in it. They do not display properly: the code lines in the Source window show up as random symbols.
Initially, I created this .Rmd file long ago on my previous laptop. Now, I am using another one and I cannot spot the issue here.
I have already tried to use some Sys.setlocale() commands with no success whatsoever.
I run RStudio on Windows 10.
Edited
This is the output of readBin('raw[1].Rmd', raw(), 10000). Slice from 2075 to 2211:
[2075] 64 31 32 2c 20 71 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2 80 93 d0 a0 d0 8e d0 a0 d1 99
[2109] d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 32 6d 24 71 68 35 20 3d 20 4e 55 4c 4c 0d 0a 64 31 35 6d
[2143] 20 3d 20 66 69 6c 74 65 72 28 64 31 35 2c 20 74 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2
[2177] 80 93 d0 a0 d0 8e d0 a0 d1 99 d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 35 6d 24 74 68 35 20 3d 20
Thank you.
Windows doesn't have very good support for UTF-8. Likely your locale encoding is something else.
RStudio normally reads files using the system encoding. If that is wrong, you can use "File | Reopen with encoding..." to re-open the file using a different encoding.
Edited to add:
The first line of the sample output looks like UTF-8 encoding with some Cyrillic letters, but not Russian-language text. I decode it as "d12, qh5 == \"РњРЈР–РЎРљ". Is that what RStudio gave you when you re-opened the file, declaring it as UTF-8?
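If you want to reproduce that decode yourself, here is a minimal sketch (Python, used purely to decode the dumped bytes; it is not part of the R workflow):

# First 21 bytes of the readBin dump (indices 2075 onward), decoded as UTF-8.
data = bytes.fromhex("6431322c20716835203d3d2022d0a0d19ad0a0d088")
print(data.decode("utf-8"))   # d12, qh5 == "РњРЈ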

Mixing UTF-8 with UTF-16

I'm currently working on a Korean program which should be translated into Chinese. What I find strange is that the application seems to mix UTF-8 and UTF-16 characters.
Let's say we have a string:
"게임을 정말로 종료하시겠습니까"
8C AC 84 C7 44 C7 20 00 15 C8 D0 B9 5C B8 20 00
85 C8 CC B8 58 D5 DC C2 A0 AC B5 C2 C8 B2 4C AE 00
But it's stored as
B0 D4 C0 D3 C0 BB 20 C1 A4 B8 BB B7 CE 20 C1 BE
B7 E1 C7 CF BD C3 B0 DA BD C0 B4 CF B1 EE 3F 00
just to prevent zeros, it seems. I'd like to know whether this is some kind of encryption, or just a normal method used by compilers to prevent an end-of-string zero from appearing in the middle of the string. After all, the final result displayed is the first string I mentioned. Any reading would be greatly appreciated.
A string must be either UTF-8 or UTF-16 (or some other encoding); mixing encodings within one string is an error. However, it is very common to pass strings around as UTF-8 and only convert them to UTF-16 when needed by a Windows function. There are several reasons for this; Basile Starynkevitch has provided a link.
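As an aside, the second dump in your question doesn't look like UTF-8 at all: the bytes appear to match the legacy Korean codepage EUC-KR (cp949), which indeed contains no zero bytes for Hangul. A minimal sketch reproducing both dumps (Python, used here only for the round-trip):

s = "게임을"   # first word of the string in the question

print(s.encode("utf-16-le").hex(" "))   # 8c ac 84 c7 44 c7  (first dump)
print(s.encode("cp949").hex(" "))       # b0 d4 c0 d3 c0 bb  (stored bytes)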
If you need routines to read UTF-8, I've got some here:
https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c

Translate unreadable Russian text

I'm trying to read documentation written in what I believe is Russian, but I'm not sure whether what I'm seeing is even encoded correctly. The text looks something like this:
Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1
when opened in Firefox. In other programs it looks like this:
���������� ������� ��������� ����� � ��������� �� -1 �� 1
Is there any hope of translating this?
Decode as CP1251.
>>> print u'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîí'.encode('latin-1').decode('cp1251')
Генерирует матрицу случайных чисел в диапазон
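(That snippet is Python 2; a Python 3 version of the same round-trip, for anyone trying it today:)

# The mojibake was produced by reading CP1251 bytes as Latin-1,
# so reverse that mapping.
garbled = 'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîí'
print(garbled.encode('latin-1').decode('cp1251'))
# Генерирует матрицу случайных чисел в диапазон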
You need to determine which of several possible Cyrillic codesets was used; the linked site lists more than a dozen possibilities, of which ISO 8859-5 and CP-1251 are perhaps the most likely.
You may be able to get one of the translation web sites (Babelfish or Google, and no doubt others) to help. However, you may have to convert from the original codeset to UTF-8 to get it to work; simply copying the bytes above did not work.
When copying the original text to a Mac, it was encoded as UTF-8:
0x0000: C3 83 C3 A5 C3 AD C3 A5 C3 B0 C3 A8 C3 B0 C3 B3 ................
0x0010: C3 A5 C3 B2 20 C3 AC C3 A0 C3 B2 C3 B0 C3 A8 C3 .... ...........
0x0020: B6 C3 B3 20 C3 B1 C3 AB C3 B3 C3 B7 C3 A0 C3 A9 ... ............
0x0030: C3 AD C3 BB C3 B5 20 C3 B7 C3 A8 C3 B1 C3 A5 C3 ...... .........
0x0040: AB 20 C3 A2 20 C3 A4 C3 A8 C3 A0 C3 AF C3 A0 C3 . .. ...........
0x0050: A7 C3 AE C3 AD C3 A5 20 C3 AE C3 B2 20 2D 31 20 ....... .... -1
0x0060: C3 A4 C3 AE 20 31 0A .... 1.
0x0067:
So, to translate this with Perl, I used the Encode module first to convert the UTF-8 string back to Latin-1, and then I told Perl to treat the Latin-1 as if it was CP-1251 and convert that back to UTF-8:
#!/usr/bin/env perl
use Encode qw( from_to );
my $source = 'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1';
# from_to changes things 'in situ'
my $nbytes = from_to($source, "utf-8", "latin-1");
# print "$nbytes: $source\n";
$nbytes = from_to($source, "cp-1251", "utf-8");
print "$nbytes: $source\n";
The output is:
102: Генерирует матрицу случайных чисел в диапазоне от -1 до 1
Which Babelfish translates as:
102: It generates the matrix of random numbers in the range from -1 to 1
and Google translates as:
102: Generate a matrix of random numbers ranging from -1 to 1
The initial UTF-8 to Latin-1 translation was required because of the setup on my Mac (my terminal uses UTF-8 by default, etc): YMMV.