Translate unreadable Russian text - unicode

I'm trying to read documentation written in what I believe is Russian, but I'm not sure the text I'm seeing is even encoded correctly. It looks something like this:
Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1
(appears as several special A's and o's)
when opened in Firefox. In other programs it looks like this:
���������� ������� ��������� ����� � ��������� �� -1 �� 1
(appears as several question marks)
Is there any hope to translate this?

Decode as CP1251.
>>> print u'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîí'.encode('latin-1').decode('cp1251')
Генерирует матрицу случайных чисел в диапазон
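For Python 3, where strings are Unicode by default, the same round-trip looks like this (a minimal sketch; the literal is the mojibake text from the question):
# Re-encode the mojibake back to its raw byte values, then decode as CP1251.
mojibake = 'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîí'
raw = mojibake.encode('latin-1')   # recover the original bytes
print(raw.decode('cp1251'))        # Генерирует матрицу случайных чисел в диапазон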

You need to determine which of multiple possible Cyrillic codesets was used - the linked site lists more than a dozen possibilities, of which ISO 8859-5 and CP-1251 are perhaps the most likely.
You may be able to get one of the translation web sites (Babelfish or Google, and no doubt others) to help. However, you may have to translate from the original codeset to UTF-8 to get it to work -- simply copying the bytes above did not work.
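If you are unsure which code page applies, a short brute-force check can help (a Python 3 sketch; the candidate list below is just an assumption of common Cyrillic code pages):
# Recover the raw bytes, then try each candidate code page in turn.
mojibake = 'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1'
raw = mojibake.encode('latin-1')
for codec in ('cp1251', 'iso8859_5', 'koi8_r', 'cp866'):
    try:
        print(codec, '->', raw.decode(codec))
    except UnicodeDecodeError:
        print(codec, '-> (raw bytes are not valid in this encoding)')
Only the correct code page produces readable Russian; here that is cp1251.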
When copying the original text to a Mac, it was encoded as UTF-8:
0x0000: C3 83 C3 A5 C3 AD C3 A5 C3 B0 C3 A8 C3 B0 C3 B3 ................
0x0010: C3 A5 C3 B2 20 C3 AC C3 A0 C3 B2 C3 B0 C3 A8 C3 .... ...........
0x0020: B6 C3 B3 20 C3 B1 C3 AB C3 B3 C3 B7 C3 A0 C3 A9 ... ............
0x0030: C3 AD C3 BB C3 B5 20 C3 B7 C3 A8 C3 B1 C3 A5 C3 ...... .........
0x0040: AB 20 C3 A2 20 C3 A4 C3 A8 C3 A0 C3 AF C3 A0 C3 . .. ...........
0x0050: A7 C3 AE C3 AD C3 A5 20 C3 AE C3 B2 20 2D 31 20 ....... .... -1
0x0060: C3 A4 C3 AE 20 31 0A .... 1.
0x0067:
So, to translate this with Perl, I used the Encode module first to convert the UTF-8 string back to Latin-1, and then I told Perl to treat the Latin-1 as if it was CP-1251 and convert that back to UTF-8:
#!/usr/bin/env perl
use Encode qw( from_to );
my $source = 'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1';
# from_to changes things 'in situ'
my $nbytes = from_to($source, "utf-8", "latin-1");
# print "$nbytes: $source\n";
$nbytes = from_to($source, "cp-1251", "utf-8");
print "$nbytes: $source\n";
The output is:
102: Генерирует матрицу случайных чисел в диапазоне от -1 до 1
Which Babelfish translates as:
102: It generates the matrix of random numbers in the range from -1 to 1
and Google translates as:
102: Generate a matrix of random numbers ranging from -1 to 1
The initial UTF-8 to Latin-1 translation was required because of the setup on my Mac (my terminal uses UTF-8 by default, etc): YMMV.

Related

Some hexadecimal numbers get modified when writing to a file

I'm writing a program which packs hexadecimal strings into bytes and writes them to disk. I expect the hexdump of the file to be the same as the hexadecimal input. I'm doing this in Clojure:
(defn- hex-char->int
  [hex-char]
  (-> hex-char
      str
      (Integer/parseInt 16)))

(defn- pack
  [hex-1 hex-2]
  (-> hex-1
      (bit-shift-left 4)
      (bit-or hex-2)
      unchecked-char))

(defn- hex-str->packed-bytes
  [hex-str]
  (->> hex-str
       (map hex-char->int)
       (partition 2)
       (mapv (partial apply pack))))

(defn write-bytes
  [bs]
  (with-open [f (io/output-stream "test.txt")]
    (.write f (.getBytes bs))))

(defn test-write
  [hex-str]
  (->> hex-str
       hex-str->packed-bytes
       (apply str)
       write-bytes))
This program works well for hex pairs from "00" to "7f": I see the same hex numbers when I hexdump the output file.
But for pairs from "80" to "ff" it doesn't work. The hexdump for "80" is "c280" and for "ff" it is "c3bf".
This gets solved if I skip the character conversion and write bytes directly, so I assume it is related to encoding.
I even found this: https://superuser.com/questions/1349494/filling-file-with-0xff-gives-c3bf-in-osx
But I want to understand how to solve this in Clojure's context.
Pasting the hexdump of "000f101f202f303f404f505f606f707f808f909fa0afb0bfc0cfd0dfe0eff0ff" for reference:
00000000 00 0f 10 1f 20 2f 30 3f 40 4f 50 5f 60 6f 70 7f |.... /0?@OP_`op.|
00000010 c2 80 c2 8f c2 90 c2 9f c2 a0 c2 af c2 b0 c2 bf |................|
00000020 c3 80 c3 8f c3 90 c3 9f c3 a0 c3 af c3 b0 c3 bf |................|
00000030
Please help me solve this.
Thanks! :)
As you suspected, the problem is in the encoding. It happens when you (apply str) in test-write: that turns the packed values into a Java String, and .getBytes then encodes each character above 0x7f as two bytes under the platform default charset (typically UTF-8). So, I slightly re-wrote your code as follows:
user> (defn- hex-char->int
        [hex-char]
        (-> hex-char
            str
            (Integer/parseInt 16)))
#'user/hex-char->int
user> (defn- pack
        [hex-1 hex-2]
        (-> hex-1
            (bit-shift-left 4)
            (bit-or hex-2)))
#'user/pack
user> (defn- hex-str->packed-bytes
        [hex-str]
        (->> hex-str
             (map hex-char->int)
             (partition 2)
             (mapv (partial apply pack))))
#'user/hex-str->packed-bytes
user> (defn write-bytes
        [bs]
        (with-open [f (io/output-stream "test.txt")]
          (.write f bs)))
#'user/write-bytes
user> (defn test-write
        [hex-str]
        (->> hex-str
             hex-str->packed-bytes
             (mapv unchecked-byte)
             (byte-array)
             write-bytes))
#'user/test-write
user> (test-write "000f101f202f303f404f505f606f707f808f909fa0afb0bfc0cfd0dfe0eff0ff")
nil
user>
And showing the contents of the resultant file in hex (note that od -h prints 16-bit words, so each pair of bytes appears swapped):
dorabs-imac:example dorab$ od -h test.txt
0000000 0f00 1f10 2f20 3f30 4f40 5f50 6f60 7f70
0000020 8f80 9f90 afa0 bfb0 cfc0 dfd0 efe0 fff0
0000040
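For comparison, the same idea in a short Python sketch: bytes.fromhex packs the hex string straight into raw bytes, and opening the file in binary mode bypasses the character-encoding layer that produced the c2/c3 prefixes:
hex_str = '000f101f202f303f404f505f606f707f808f909fa0afb0bfc0cfd0dfe0eff0ff'

# Code points above U+007F take two bytes in UTF-8, which explains the symptom:
assert chr(0x80).encode('utf-8').hex() == 'c280'
assert chr(0xff).encode('utf-8').hex() == 'c3bf'

# Writing raw bytes in binary mode avoids any encoding entirely.
with open('test.txt', 'wb') as f:
    f.write(bytes.fromhex(hex_str))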

Mixing UTF-8 with UTF-16

I'm currently working on a Korean program which is to be translated into Chinese. What I find strange is that the application seems to mix UTF-8 and UTF-16 characters.
Let's say we have a string such as:
"게임을 정말로 종료하시겠습니까"
8C AC 84 C7 44 C7 20 00 15 C8 D0 B9 5C B8 20 00
85 C8 CC B8 58 D5 DC C2 A0 AC B5 C2 C8 B2 4C AE 00
But it's stored as:
B0 D4 C0 D3 C0 BB 20 C1 A4 B8 BB B7 CE 20 C1 BE
B7 E1 C7 CF BD C3 B0 DA BD C0 B4 CF B1 EE 3F 00
just to prevent zeros, as far as I can tell. I'd like to know whether this is some kind of encryption, or just a normal method used by compilers to prevent an end-of-string marker from appearing in the middle of the string? The final result is the first string that I mentioned. Any reading would be strongly appreciated.
A string must be either UTF-8 or UTF-16 (or some other encoding); mixing encodings within one string is an error. However, it is very common to pass strings around as UTF-8 and only convert them to UTF-16 when needed by a Windows function. There are several reasons for this; Basile Starynkevitch has provided a link.
If you need routines to read UTF-8, I've got some here:
https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c
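Incidentally, neither dump in the question looks like a mix of encodings: the first byte sequence is the sentence in UTF-16LE, and the stored bytes appear to be EUC-KR, the legacy Korean code page, which (unlike UTF-16) contains no zero bytes for Korean text. A quick Python check (a sketch; the literal is the Korean sentence from the question):
s = '게임을 정말로 종료하시겠습니까'
print(s.encode('utf-16-le').hex(' '))  # 8c ac 84 c7 44 c7 20 00 ... (first dump)
print(s.encode('euc-kr').hex(' '))     # b0 d4 c0 d3 c0 bb 20 ...   (stored bytes)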

Merge specific rows from two files if a number in a row of file 1 is between two numbers in a row of file 2

I've been searching for a couple of hours (actually already two days) but I can't find an answer to my problem. I've tried sed and awk but I can't get the parameters right.
Essentially, this is what I'm looking for:
FOR every line in file_1
    IF [value in column 2 in file_1]
        IS EQUAL TO [value in column 4 in some row in file_2]
        OR IS EQUAL TO [value in column 5 in some row in file_2]
        OR IS BETWEEN [value in column 4 and value in column 5 in some row in file_2]
    THEN
        ADD columns 3, 6 and 7 of that row of file_2 to columns 3, 4 and 5 of file_1
NB: the values that need to be compared are INTs; the values in columns 3, 6 and 7 (which only need to be copied) are STRINGs.
And this is the context, but probably not necessary to read:
I have two files with genome data which I want to merge in a specific way (the columns are tab separated).
The first file contains variants (only SNPs, for those interested) of which, effectively, only the second column is relevant. This column is a list of numbers (the position of each variant on the chromosome).
I also have a structural annotation file that contains the following data:
Column 4 holds the begin position of the specific structure and column 5 the end position.
Columns 3, 7 and 9 contain information that describes the specific structure (name of a gene etc.).
I would like to annotate the variants in the first file with the data in the annotation file. Therefore, if the number in column 2 of the variants file is equal to the value in column 4 or 5, OR lies between those values in a specific row, columns 3, 7 and 9 of that row of the annotation file need to be added.
Sample File 1
SOME_NON_RELEVANT_STRING 142
SOME_NON_RELEVANT_STRING 182
SOME_NON_RELEVANT_STRING 320
SOME_NON_RELEVANT_STRING 321
SOME_NON_RELEVANT_STRING 322
SOME_NON_RELEVANT_STRING 471
SOME_NON_RELEVANT_STRING 488
SOME_NON_RELEVANT_STRING 497
SOME_NON_RELEVANT_STRING 541
SOME_NON_RELEVANT_STRING 545
SOME_NON_RELEVANT_STRING 548
SOME_NON_RELEVANT_STRING 4105
SOME_NON_RELEVANT_STRING 15879
SOME_NON_RELEVANT_STRING 26534
SOME_NON_RELEVANT_STRING 30000
SOME_NON_RELEVANT_STRING 30001
SOME_NON_RELEVANT_STRING 40001
SOME_NON_RELEVANT_STRING 44752
SOME_NON_RELEVANT_STRING 50587
SOME_NON_RELEVANT_STRING 87512
SOME_NON_RELEVANT_STRING 96541
SOME_NON_RELEVANT_STRING 99541
SOME_NON_RELEVANT_STRING 99871
Sample File 2
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A1 0 38 B1 C1
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A2 40 2100 B2 C2
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A3 2101 9999 B3 C3
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A4 10000 15000 B4 C4
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A5 15001 30000 B5 C5
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A6 30001 40000 B6 C6
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A7 40001 50001 B7 C7
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A8 50001 50587 B8 C8
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A9 50588 83054 B9 C9
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A10 83055 98421 B10 C10
SOME_NON_RELEVANT_STRING SOME_NON_RELEVANT_STRING A11 98422 99999 B11 C11
Sample output file
142 A2 B2 C2
182 A2 B2 C2
320 A2 B2 C2
321 A2 B2 C2
322 A2 B2 C2
471 A2 B2 C2
488 A2 B2 C2
497 A2 B2 C2
541 A2 B2 C2
545 A2 B2 C2
548 A2 B2 C2
4105 A3 B3 C3
15879 A5 B5 C5
26534 A5 B5 C5
30000 A5 B5 C5
30001 A6 B6 C6
40001 A7 B7 C7
44752 A7 B7 C7
50587 A8 B8 C8
87512 A10 B10 C10
96541 A10 B10 C10
99541 A11 B11 C11
99871 A11 B11 C11
As a start, here's how to write the algorithm you posted in awk, assuming that by "ADD" you meant "append", and assuming all lines in file1 have unique values of the 2nd field (run against the sample input provided):
awk '
BEGIN { FS=OFS="\t"; startIdx=1 }
NR==FNR {
    if ($2 in seen) {
        printf "%s on line %d, first seen on line %d\n", $2, NR, seen[$2] | "cat>&2"
    }
    else {
        f2s[++endIdx] = $2
        seen[$2] = NR
    }
    next
}
{
    inBounds = 1
    for (idx=startIdx; (idx<=endIdx) && inBounds; idx++) {
        f2 = f2s[idx]
        if (f2 >= $4) {
            if (f2 <= $5) {
                print f2, $3, $6, $7
            }
            else {
                inBounds = 0
            }
        }
        else {
            startIdx = idx
        }
    }
}
' file1 file2
142 A2 B2 C2
182 A2 B2 C2
320 A2 B2 C2
321 A2 B2 C2
322 A2 B2 C2
471 A2 B2 C2
488 A2 B2 C2
497 A2 B2 C2
541 A2 B2 C2
545 A2 B2 C2
548 A2 B2 C2
4105 A3 B3 C3
15879 A5 B5 C5
26534 A5 B5 C5
30000 A5 B5 C5
30001 A6 B6 C6
40001 A7 B7 C7
44752 A7 B7 C7
50587 A8 B8 C8
87512 A10 B10 C10
96541 A10 B10 C10
99541 A11 B11 C11
99871 A11 B11 C11
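For comparison, here is the same interval join as a minimal Python sketch (an illustration under assumptions, not a replacement for the awk: it assumes tab-separated files shaped like the samples, and it scans all of file2 for every line of file1, so it is O(n*m)):
# Load the intervals and their annotation columns (1-based columns 3, 6 and 7).
rows = []
with open('file2') as f2:
    for line in f2:
        c = line.rstrip('\n').split('\t')
        rows.append((int(c[3]), int(c[4]), c[2], c[5], c[6]))

# For each position in file1, print the columns of the enclosing interval.
with open('file1') as f1:
    for line in f1:
        pos = int(line.rstrip('\n').split('\t')[1])
        for lo, hi, a, b, c in rows:
            if lo <= pos <= hi:
                print(pos, a, b, c)
                break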

Why does encoding, then decoding strings make Arabic characters lose their context?

I'm (belatedly) testing the Unicode waters for the first time and am failing to understand why encoding, then decoding an Arabic string has the effect of separating out the individual letters the word is made of.
In the example below, the word "ﻟﻠﺒﻴﻊ" comprises five individual letters, "ع","ي","ب","ل","ل", written right to left. Depending on the surrounding context (adjacent letters), the letters change form.
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "For sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
print "Original string : $str\n"; # ل ل ب ي ع
print "Decoded string 1: $dec\n" # ل ل ب ي ع
print "Decoded string 2: $decoded\n"; # ل ل ب ي ع
ADDITIONAL INFO
When pasting the string to this post, the rendering is reversed so it looks like "ﻊﻴﺒﻠﻟ". I'm reversing it manually to get it to look 'right'. The correct hexdump is given below:
$ echo "ﻟﻠﺒﻴﻊ" | hexdump
0000000 bbef ef8a b4bb baef ef92 a0bb bbef 0a9f
0000010
The output of the Perl script (per ikegami's request):
$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63
0000040 6f 64 65 64 20 73 74 72 69 6e 67 20 31 3a 20 d8
0000060 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63 6f 64 65
0000100 64 20 73 74 72 69 6e 67 20 32 3a 20 d8 b9 d9 8a
0000120 d8 a8 d9 84 d9 84 0a
0000127
And if I just print $str:
$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
0000035
Finally (per ikegami's request):
$ grep 'For sale' unicode.pl | od -t x1
0000000 6d 79 20 24 73 74 72 20 3d 20 27 d8 b9 d9 8a d8
0000020 a8 d9 84 d9 84 27 3b 20 20 23 20 22 46 6f 72 20
0000040 73 61 6c 65 22 20 0a
0000047
Perl details
$ perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 53 registered patches, see perl -V for more detail)
Outputting to file reverses the string: "ﻊﻴﺒﻠﻟ"
QUESTIONS
I have several:
How can I preserve the context of each character while printing?
Why is the original string printed out to screen as individual letters, even though it hasn't been 'processed'?
When printing to file, the word is reversed (I'm guessing this is due to the script's right-to-left nature). Is there a way I can prevent this from happening?
Why does the following not hold true: $str !~ /\P{Bidi_Class: Right_To_Left}/;
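A note on the last question: Arabic letters carry Bidi_Class Arabic_Letter (AL) rather than Right_To_Left (R), so \P{Bidi_Class: Right_To_Left} still matches them and the test fails. A quick Python sketch to inspect the classes (assumes the string from the script):
import unicodedata

# Arabic letters report bidi class 'AL' (Arabic_Letter), not 'R' (Right_To_Left).
for ch in 'ﻟﻠﺒﻴﻊ':
    print('U+%04X %s' % (ord(ch), unicodedata.bidirectional(ch)))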
Source code returned by StackOverflow (as fetched using wget):
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a ...
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM
perl output I get from the source code returned by StackOverflow:
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM
U+000A LINE FEED
So I get exactly what's in the source, as I should.
perl output you got:
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
U+0639 ARABIC LETTER AIN
U+064A ARABIC LETTER YEH
U+0628 ARABIC LETTER BEH
U+0644 ARABIC LETTER LAM
U+0644 ARABIC LETTER LAM
U+000A LINE FEED
OK, so you could have a buggy Perl (one that reverses and changes Arabic characters and only those), but it's far more likely that your source doesn't contain what you think it does. You need to check which bytes make up your source.
echo output you got:
ef bb 8a ef bb b4 ef ba 92 ef bb a0 ef bb 9f 0a
U+FECA ARABIC LETTER AIN FINAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+000A LINE FEED
There are significant differences in what you got from perl and from echo, so it's no surprise they show up differently.
Output inspected using:
$ perl -Mcharnames=:full -MEncode=decode_utf8 -E'
say sprintf("U+%04X %s", $_, charnames::viacode($_))
for unpack "C*", decode_utf8 pack "H*", $ARGV[0] =~ s/\s//gr;
' '...'
(Don't forget to swap the bytes of hexdump.)
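A Python equivalent of that inspection one-liner might look like this (a sketch; paste the hex dump as one string):
import unicodedata

hex_bytes = 'ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a'
for ch in bytes.fromhex(hex_bytes.replace(' ', '')).decode('utf-8'):
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch, '<unnamed>')))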
Maybe something is odd with your shell? If I redirect the output to a file, the result is the same. Please try this out:
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "For sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
open(F1, '>', "original.txt") or die;
open(F2, '>', "decoded.txt") or die;
open(F3, '>', "decoded2.txt") or die;
binmode(F1, ':utf8'); binmode(F2, ':utf8'); binmode(F3, ':utf8');
print F1 "$str\n"; # ل ل ب ي ع
print F2 "$dec\n"; # ل ل ب ي ع
print F3 "$decoded\n";

CString m_pszData doesn't match converted char* in UNICODE

I tested Unicode conversion with a UNICODE MFC dialog app, where I can input some Chinese in the edit box. After reading in the characters using
DDX_Text(pDX, IDC_EDIT1, m_strUnicode);
UpdateData(TRUE);
the m_pszData of m_strUnicode shows "e0 65 2d 4e 1f 75 09 67". Then I used the following code to convert it to char*:
char *psText;
psText = new char[dwMinSize];
WideCharToMultiByte(CP_OEMCP, NULL, m_strUnicode, -1, psText,
                    dwMinSize, NULL, FALSE);
psText contains "ce de d6 d0 c9 fa d3 d0", nothing like the m_pszData of m_strUnicode. Would anyone please explain why?
ce de d6 d0 c9 fa d3 d0 is 无中生有 in GBK. You sure you're manipulating Unicode?
CP_OEMCP instructs the API to use the currently set default OEM codepage.
So my guess here is that you're on a Chinese PC with GBK as default codepage.
无中生有 in UTF-16LE is e0 65 2d 4e 1f 75 09 67, so basically you are converting a UTF-16LE string to GBK.
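A quick way to verify both byte sequences (a Python sketch):
s = '无中生有'
print(s.encode('utf-16-le').hex(' '))  # e0 65 2d 4e 1f 75 09 67
print(s.encode('gbk').hex(' '))        # ce de d6 d0 c9 fa d3 d0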