Cyrillic sets CP 1048 conversion - unicode

I am using a printer which needs the Cyrillic character set CP 1048 for printing Kazakh and Russian in text mode. How do we convert text to CP 1048? CP 1048 is a combined character set for the Kazakh and Russian languages. These languages come together in text files, and this code page is available as a standard feature in the printer.

You convert text with a text-encoding converter. Since no tool or language was specified, I'll use a Python script. Note this requires Python 3.5 or later: that was the first version to define the kz1048 codec.
Unicode string to KZ-1048 encoding:
>>> 'Россия/Ресей'.encode('kz1048')
b'\xd0\xee\xf1\xf1\xe8\xff/\xd0\xe5\xf1\xe5\xe9'
Note that in Python b'\xd0\xee' denotes a bytes object containing the hexadecimal byte values D0 and EE, which in KZ-1048 represent Р and о.
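If whole files need converting rather than individual strings, a short Python sketch along these lines would do it (the filenames here are placeholders, and the input is assumed to be UTF-8):

# Read UTF-8 text and write it back out as KZ-1048 (CP 1048).
with open('input.txt', encoding='utf-8') as src:
    text = src.read()
with open('output.txt', 'w', encoding='kz1048') as dst:
    dst.write(text)  # raises UnicodeEncodeError for characters outside KZ-1048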

Related

How to implement Unicode (UTF-8) support for a CID-keyed font (Adobe's Type 0 Composite font) converted from ttf?

This post is a sequel to Conversion from ttf to type 2 CID font (type 42 base font).
It is futile to have a CID-keyed font (containing a CIDMap that enforces an identity mapping, i.e. Glyph index = CID) without inherent Unicode support. How, then, can an application provide UTF-8 support for such a CID-keyed font externally?
Note: The application program that uses the CID-keyed font can be written in C, C++, Postscript or any other language.
The CID-keyed font NotoSansTamil-Regular.t42 has been converted from Google's Tamil ttf font.
You need this conversion because without it, a PostScript program cannot access a TrueType font!
Refer to the post Conversion from ttf to type 2 CID font (type 42 base font) for the conversion procedure.
The CIDMap of t42 font enforces an identity mapping as follows:
Character code 0 maps to Glyph index 0
Character code 1 maps to Glyph index 1
Character code 2 maps to Glyph index 2
...
Character code NumGlyphs-1 maps to Glyph index NumGlyphs-1
It is clearly evident that there is no Unicode involved in this mapping inherently.
To understand this concretely, consider the following PostScript program tamil.ps, which accesses the t42 font through PostScript's findfont operator.
%!PS-Adobe-3.0
/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def
13 myNoTo
100 600 moveto
% தமிழ் தங்களை வரவேற்கிறது!
<0019001d002a005e00030019004e00120030002200030024001f002f0024005b0012002a0020007a00aa> show
100 550 moveto
% Tamil Welcomes You!
<0155017201aa019801a500030163018801a5017f01b101aa018801c20003016901b101cb00aa00b5> show
showpage
Issue the following Ghostscript command to execute the postscript program tamil.ps.
gswin64c.exe "D:\cidfonts\NotoSansTamil-Regular.t42" "D:\cidfonts\tamil.ps" (on the Windows platform).
gs ~/cidfonts/NotoSansTamil-Regular.t42 ~/cidfonts/tamil.ps (on the Linux platform).
This will display the two strings தமிழ் தங்களை வரவேற்கிறது! and Tamil Welcomes You! on successive rows.
Note that the strings given to the show operator are in hexadecimal format, embedded within angle brackets. The show operator extracts 2 bytes at a time and maps each CID (a 16-bit value) to a glyph.
For example, the first 4 hex digits of the first string are 0019, whose decimal equivalent is 25; this CID maps to the glyph த.
In order to use this t42 font, every string (drawn from the ttf's character set) would have to be converted into a hexadecimal string by hand, which is practically impossible, and the font is therefore futile as it stands.
Now consider the following C++ code, which generates a PostScript program called myNotoTamil.ps that accesses the same t42 font through PostScript's findfont operator.
#include <cstdio> // fopen, fprintf, fclose

// strps, ELang and EMyFont are declared in the author's UTF8Map sources (main.cpp/mapunicode.h).
const short lcCharCodeBufSize = 200; // Character Code buffer size.
char bufCharCode[lcCharCodeBufSize]; // Character Code buffer
FILE *fps = fopen ("D:\\cidfonts\\myNotoTamil.ps", "w");
fprintf (fps, "%%!PS-Adobe-3.0\n");
fprintf (fps, "/myNoTo {/NotoSansTamil-Regular findfont exch scalefont setfont} bind def\n");
fprintf (fps, "13 myNoTo\n");
fprintf (fps, "100 600 moveto\n");
fprintf (fps, u8"%% தமிழ் தங்களை வரவேற்கிறது!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"தமிழ் தங்களை வரவேற்கிறது!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "%% Tamil Welcomes You!\n");
fprintf (fps, "<%s> show\n", strps(ELang::eTamil, EMyFont::eNoToSansTamil_Regular, u8"Tamil Welcomes You!", bufCharCode, lcCharCodeBufSize));
fprintf (fps, "showpage\n");
fclose (fps);
Although the contents of tamil.ps and myNotoTamil.ps are identical, the difference in how those ps files are produced is like the difference between heaven and earth!
Observe that unlike tamil.ps (with its handmade hexadecimal strings), myNotoTamil.ps is generated by a C++ program that uses UTF-8 encoded strings directly, hiding the hex strings completely. The function strps produces, from UTF-8 encoded strings, hex strings identical to those present in tamil.ps.
The futile t42 font has suddenly become fruitful thanks to the strps function's ability to map UTF-8 to CIDs (every 2 bytes in a hex string map to one CID)!
The strps function consults a mapping table, aNotoSansTamilMap (implemented as a one-dimensional array constructed with the help of Unicode blocks), in order to map Unicode code points (extracted from the UTF-8 encoded string) to character identifiers (CIDs).
The buffer bufCharCode (the 4th parameter of strps) carries the resulting hex string out to PostScript's show operator.
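To make the idea concrete, here is a toy sketch in Python of what such a mapping does (the table below is hypothetical and covers a single character; the real aNotoSansTamilMap spans the whole Tamil Unicode block):

# Map Unicode code points to CIDs and emit the hex string for PostScript's show.
# Toy table: TAMIL LETTER TA (U+0BA4) -> CID 25 (0x0019), as seen in tamil.ps above.
toy_map = {0x0BA4: 0x0019}

def to_hex_cids(text, cid_map):
    # Each CID becomes 4 hex digits, i.e. the 2 bytes per glyph that show expects.
    return ''.join('%04X' % cid_map.get(ord(ch), 0) for ch in text)

print('<%s> show' % to_hex_cids('\u0BA4', toy_map))  # prints: <0019> show

The real strps does the same job in C++: decode the UTF-8 string to code points, look each one up in the table, and append 4 hex digits per CID to bufCharCode.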
To benefit others, I have released this UTF8Map program on GitHub for the following platforms.
Windows 10 Platform (Github Public Repository for UTF8Map Program on Windows 10)
Open up DOS command line and issue the following clone command to download source code:
git clone https://github.com/marmayogi/UTF8Map-Win
Or execute the following curl command to download source code release in zip form:
curl -o UTF8Map-Win-2.0.zip -L https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Or execute the following wget command to download source code release in zip form:
wget -O UTF8Map-Win-2.0.zip https://github.com/marmayogi/UTF8Map-Win/archive/refs/tags/v2.0.zip
Linux Platform (Github Public Repository for UTF8Map Program on Linux)
Issue the following clone command to download source code:
git clone https://github.com/marmayogi/UTF8Map-Linux
Or execute the following curl command to download source code release in tar form:
curl -o UTF8Map-Linux-1.0.tar.gz -L https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Or execute the following wget command to download source code release in tar form:
wget -O UTF8Map-Linux-1.0.tar.gz https://github.com/marmayogi/UTF8Map-Linux/archive/refs/tags/v1.0.tar.gz
Note:
This program uses the t42 file to generate a ps file (a PostScript program) which displays the following on a single page:
A welcome message in Tamil and English.
List of Vowels (12 + 1 Glyphs). All of them are associated with Unicode Points.
List of Consonants (18 + 6 = 24 Glyphs). None of them are associated with Unicode Points.
List of combined glyphs (combinations of Vowels + Consonants) in 24 lines. Each line displays 12 glyphs. Out of 288 Glyphs, 24 are associated with Unicode Points and the rest are not.
List of Numbers in two lines. All 13 Glyphs for Tamil numbers are associated with Unicode Points.
A footnote.
The two program files (main.cpp and mapunicode.h) are 100% portable, i.e. the contents of the two files are identical across platforms.
The two mapping tables aNotoSansTamilMap and aLathaTamilMap are given in mapunicode.h file.
A README Document in Markdown format has been included with the release.
This software has been tested for t42 fonts converted from the following ttf files.
Google's Noto Tamil ttf
Microsoft's Latha Tamil ttf

How to remove accents and keep Chinese characters using a command?

I’m trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently I’m using iconv to remove the accented characters, but all the Chinese characters come out as “?????”. I can’t figure out how to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; that encoding only covers the code points between NUL (0x00) and DEL (0x7F), which basically means the basic control characters plus English letters and punctuation. (Look at an ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be comfortably easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by Remove accents from accented characters; the Chinese characters to test with were gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).
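The same approach ports readily to Python's standard library if Perl isn't at hand. A sketch reading standard input and writing standard output, like the one-liner above:

import sys
import unicodedata

for line in sys.stdin:
    # Decompose (NFKD), then drop mark characters (categories Mn/Mc/Me);
    # CJK ideographs have no decomposition, so they pass through untouched.
    decomposed = unicodedata.normalize('NFKD', line)
    sys.stdout.write(''.join(c for c in decomposed
                             if not unicodedata.category(c).startswith('M')))

Run it as, say, python3 strip_accents.py < unizh.txt (the script name is a placeholder).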
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the digits, some punctuation, and the basic alphabet in upper and lower case), you can keep only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.

Encode binary data into ASCII while preserving most of its valid characters

When displaying a bytes object in Python, the print function will display a select number of byte values as ASCII characters instead of their numeric representation.
>>> b"1 2 \x30 a b \x80"
b'1 2 0 a b \x80'
Is there a known encoding that would allow a set of binary data containing mostly ASCII text to be put into a valid ASCII string, where the few invalid characters are replaced by a numeric representation, similarly to what Python does with bytes?
Edit:
I used Python's bytes repr as an example of what the encoding would do; the project we need this for is written in C++. Something like a "language-agnostic" spec would be nice.
Edit 2:
Think of this as a base64 alternative so that binary data that is mostly ASCII does not get altered too much.
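No single standard codec does exactly this, but the rules are simple enough to specify language-agnostically. Here is a sketch in Python of one possible scheme (the function name and exact escaping rules are made up here, not an established spec; the key point is that escaping the backslash itself keeps the encoding reversible):

def escape_bytes(data):
    # Printable ASCII passes through; backslash is doubled so decoding is
    # unambiguous; every other byte becomes \xNN.
    out = []
    for b in data:
        if b == 0x5C:
            out.append('\\\\')
        elif 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            out.append('\\x%02x' % b)
    return ''.join(out)

print(escape_bytes(b"1 2 \x30 a b \x80"))  # 1 2 0 a b \x80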

Why can iconv convert precomposed form but not decomposed form of "É" (from UTF-8 to CP1252)

I use the iconv library to interface from a modern input source that uses UTF-8 to a legacy system that uses Latin1, aka CP1252 (superset of ISO-8859-1).
The interface recently failed to convert the French string "Éducation", where the "É" was encoded as hex 45 CC 81. Note that the destination encoding does have an "É" character, encoded as C9.
Why does iconv fail to convert that "É"? I checked that the iconv command-line tool that ships with Mac OS X 10.7.3 says it cannot convert, and that the Perl iconv module fails too.
This is all the more puzzling because the precomposed form of the "É" character (encoded as C3 89) converts just fine.
Is this a bug with iconv or did I miss something?
Note that I also have the same issue if I try to convert from UTF-16 (where "É" is encoded as 00 C9 composed or 00 45 03 01 decomposed).
Unfortunately, iconv indeed doesn't deal with decomposed characters in UTF-8, except for the version installed on Mac OS X.
When dealing with Mac file names, you can use iconv with the "utf8-mac" character set option. It also takes into account a few idiosyncrasies of the Mac decomposed form.
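For example, on such a system the conversion from the question could be attempted as follows (file names are placeholders):

iconv -f utf8-mac -t cp1252 input.txt > output.txt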
However, non-mac versions of iconv or libiconv don't support this, and I could not find the sources used on Mac which provide this support.
I agree with you that iconv should be able to deal with both NFC and NFD forms of UTF8, but until someone patches the sources we have to detect this manually and deal with it before passing stuff to iconv.
Faced with this annoying problem, I used Perl's Unicode::Normalize module as suggested by Jukka.
#!/usr/bin/perl
# Filter: recompose UTF-8 input to Normalization Form C (NFC).
use Encode qw/decode_utf8 encode_utf8/;
use Unicode::Normalize;
while (<>) {
    print encode_utf8( NFC(decode_utf8 $_) );
}
Use a normalizer (in this case, to Normalization Form C) before calling iconv.
A program that deals with character encodings (different representations of characters, or more exactly code points, as sequences of bytes) and converts between them should be expected to treat precomposed and decomposed forms as distinct. The decomposed É is two code points, and as such distinct from the precomposed É, which is one code point.
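Python makes the distinction easy to demonstrate (a quick sketch, with cp1252 standing in for the legacy target; the decomposed form fails exactly the way iconv does):

import unicodedata

nfc = unicodedata.normalize('NFC', 'E\u0301')  # one code point: U+00C9
nfd = unicodedata.normalize('NFD', '\u00C9')   # two code points: U+0045 U+0301

print(nfc.encode('utf-8'))   # b'\xc3\x89'  -> the C3 89 that converts fine
print(nfd.encode('utf-8'))   # b'E\xcc\x81' -> the 45 CC 81 that fails
print(nfc.encode('cp1252'))  # b'\xc9'
nfd.encode('cp1252')         # raises UnicodeEncodeError: no mapping for U+0301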

translate data file with odd Hebrew encoding

I have a binary data file, in a format used by a relatively ancient program, which I am trying to convert into something sane. With the help of a Hex editor I have basically worked out the file format except that it contains Hebrew characters with an odd encoding.
All characters are 8 bits. The "standard" 27 consonants (including "final" consonants) go from hex 80 to 9A. Then there are vowels that seem to start around hex 9B or so (I'm guessing right after the standard consonants end). Then there are "dotted" consonants that seem to start at hex E0.
If I remember correctly, I think this is some sort of DOS encoding. What encoding is this and what encoding should I translate it to so that a user in Israel will be able to most easily open it in, say, Microsoft Word? Are there any tools that I could use to do the translation?
80 to 9A seem to match the code points in CP862, but I could not find any match for the vowel code points. I think what you should do is just make a custom mapping to Unicode and produce the output as a UTF-8 or UTF-16LE plain text file. If you add a BOM (byte order mark), Notepad and/or Word should be able to read it without issues. I would probably write a small Python script along the lines sketched below, but it shouldn't be hard in any other language.
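A minimal sketch of such a script (only the consonant range 80-9A is filled in, on the assumption that it lines up with CP862, i.e. Unicode U+05D0 through U+05EA; the vowel and dotted-consonant ranges would have to be worked out from the hex dump and added to the table, and the file names are placeholders):

# Translate the odd legacy Hebrew encoding to a UTF-8 file with a BOM.
hebrew_map = {0x80 + i: chr(0x05D0 + i) for i in range(27)}  # alef..tav

def translate(data):
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                       # plain ASCII passes through
        else:
            out.append(hebrew_map.get(b, '\uFFFD'))  # unknown bytes -> U+FFFD
    return ''.join(out)

with open('legacy.dat', 'rb') as f:
    text = translate(f.read())
with open('output.txt', 'w', encoding='utf-8-sig') as f:  # utf-8-sig writes the BOM
    f.write(text)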