ZPL Utf-8 characters dissapearing - encoding

I've encountered a problem when trying to print a simple ZPL string.
My ZPL contains some UTF-8 characters like so:
^XA
^FT16,591^A0N,34^FH^FVM_F6lntorp^FS
^FT16,626^A0N,34^FH^FVV_E4gen^FS
^XZ
This should print out Mölntorp (_F6 = ö) and Vägen (_E4 = ä). And it does.
BUT, here comes the problem, I tried adding a danish ø (_F8 = ø), like so:
^XA
^FT16,626^A0N,34^FH^FVK_F8benhavnsvej
^XZ
But what comes out is K°benhavnsvej (which corresponds to _F8 = ° in CP-850). I have no clue why it successfully translates one hex code and then mucks up on the other one, since they should both be using the same encoding table. (None specified)
If I add ^CI28 below the starting ^XA tag, the UTF-8 characters simply vanish, and the output is just Kbenhavnsvej
I hope someone could give me input on why this is happening. It's frustrating.

^XA
^FT16,626^CI4^A0N,34^FH^FVK_7Cbenhavnsvej
^XZ
I haven't tried this - but it should work in theory.
^CI4 selects international character set for Denmark; character 7C should be the character you require (5C for upper-case)

This may also be the font you are using, the font simply might not contain this in its character set. You may have to use the Swiss Eastern European font.

Related

Read turkish characters from txt file

I am trying to read string data from txt file which has special turkish characters in it.
I want to store content in a string. I tried some methods like textscan , fileread but, instead of special turkish characters like ş,ç,ı,ö,ğ, there are some weird symbols. Are there any way to do that?
I created a file called turkish.txt with the characters you mentioned (ş,ç,ı,ö,ğ). Trying to read it gave me the following:
fid = fopen('turkish.txt','r','n','UTF-8');
str=fread(fid);
native2unicode(str')
ans =
ÿþ_, ç , 1, ö ,
As you can see, ş,ı,ğ are not rendered correctly. If you type
help slCharacterEncoding
You can see a list of most commonly supported encodings by platforms. I played with the encodings a little, some which I have checked were:
ISO-8891-1
US-ASCII
Windows-1252
Shift_JIS
The last one is related to japanese characters. They contain some of the turkish characters, which were rendered correctly such as ç and ö, but not all of them.
If you skim through the docs it says:
If you want to use a different character encoding, you need to start MATLAB with the appropriate locale settings for your operating system. Consult your operating system manual to change the locale setting.
The instructions for setting the locale on windows platforms, which I haven't tried, can be found here.
Hope it helps.

Print Chinese / Japanese character in Zebra Printer with ZPL

I have loaded the Mono Chinese/ Japanese font onto my ZM400 printer. So far I have no success printing both Chinese & English together on the same field.
Here is some example code:
^XA^CW1,B:ANMDS.TTF
^SEB:GB.DAT^CI14
^FO100,100^A1,50,50^FD中文English Here^FS
^XZ
Since I change the international code to 14 (with ^CI14), it only prints the Chinese text without the English text.
I have also try using the ^FL command, but can't seen to get it to work.
Does anyone have a working example of printing Chinese / Japanese text along with English text on the same FD (data field)?
You should probably use ^CI28 (UTF-8), and make sure that your labels are encoded in UTF-8.
As far as I know, ^CI14 only supports Asian encodings.
If anyone is looking at how to do this, I imagine what I did for Japanese will work for Chinese.
Firstly, I didn't want to purchase the Asian Font Pack because I think it's a bit of a ripoff, so I found an appropriate open source Japanese Unitype Font. I then uploaded this to the printer using Zebra Tools... make sure you upload it as a file, NOT using the font upload.
Then I managed to get it printing by escaping the characters.
So my final ZPL is
^XA
^LL150
^CI28^A#N,60,60,E:OSAKA.TTF
^FO0,0
^FH
^FD_5F_E3_81_93_E3_82_8C_E3_81_AF_E4_BD_95_E3_81_A8_E8_A8_80_E3_81_A3_E3_81_A6_E3_81_84_E3_81_BE_E3_81_99^FS
^XZ
Essentially you have to escape the bytes of each value (original Japanese これは何と言っています)
You also have to put ^FH in front of ^FD so it knows you're escaping characters.
Hopefully this helps the poster and anyone else who is looking to overcome problems with ZPL and Unicode fonts / characters.
I have figured out why. The Chinese text needs to be in gibberish format.
What I meant by gibberish is that. When you use Chinese in ZPL code, it needs to be in the windows codepage format text. This windows codepage format Text that is Chinese will be displayed as gibberish in English environment.
For example. In ZPL Code, your code might look like this:
^H ~!!####$ (this gibberish is actual the ASCII representation of Chinese text in windows code page format)
However, you can't type in unicode Chinese because ZPL would not print it.
^H 中文 (this is Chinese text in unicode format)

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing .txt file from a remote server and saving it to a database. I use a .Net script for this purpose. I sometimes notice a garbled word/characters (Ullerهkersvنgen) inside the files, which makes a problem while saving to the database.
I want to filter all such characters and convert them to unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script does know the correct encoding of your text snippet than that should be the regular expression to find all Non-ASCII charactres:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or as munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text are normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)

Convert non english characters into Unicode (UTF-8)

I copied large amount of text from another system to my PC. When I viewed the text in my PC, it looked weird. So I copied all the fonts from the other PC and installed them in mine too. Now the text looks okay, but actually it seems that is not in Unicode. For example, if I copy the text and paste in another UTF-8 supported editor such as Notepad++, I get English characters ("bgah;") only like shown below.
How to convert this whole text into unicode text, like the one below. So I can copy the text and paste anywhere else.
பெயர்
The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil
I need this conversion to be done, so I can copy them into database tables.
'Ja-01' is a font with a custom 'visual encoding'.
That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.
This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil, but this approach was common in the pre-Unicode days especially for less-widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.
Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. It does not appear to be any standard character ordering, so you will have to look at the font (eg in charmap.exe) and note down every character, find the matching character in Unicode and map between them.
For example here's a trivial Python script to replace characters in a file:
mapping= {
u'a': u'\u0BAF', # Tamil letter Ya
u'b': u'\u0BAA', # Tamil letter Pa
u'g': u'\u0BC6', # Tamil vowel sign E (combining)
u'h': u'\u0BB0', # Tamil letter Ra
u';': u'\u0BCD', # Tamil sign virama (combining)
# fill in the rest of the mapping information here!
}
with open('ja01data.txt', 'rb') as fp:
data= fp.read().decode('utf-8')
for char in mapping:
data= data.replace(char, mapping[char])
with open('utf8data.txt', 'wb') as fp:
fp.write(data.encode('utf-8'))
The font you found is getting you into trouble. The actual cell text is "bgah;", it gets rendered to பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;" since that's the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.
Ditch the font and enter Unicode so it looks like this:
"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.
As the others mentioned, it looks like a custom visual encoding that mimics the performance of a foreign script while maintaining ASCII encoding.
Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml
Let me know if this works. If not, I can ask around and get something for you.
You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.

Some UTF-8 characters do not show up on browser

Some UTF-8 characters like the UTF-8 equivalent of C2 96 (hyphen). On the browser it displays it as (utf box with 00 96). And not as '-'(hyphen). Any reasons for this behavior? How do we correct this?
http://stuffofinterest.com/misc/utf8.php?s=128 (Refer this URL for the codes)
I found that this can be handled with html entities. Is there any way to display this without converting to html entities?
The character you're talking about is an en-dash, not a hyphen. Its Unicode code point is U+2013, and its UTF-8 encoding is E2 80 93, not C2 96. That table you linked to is incorrect. The first two columns have nothing to do with UCS-2 or Unicode; they actually contain the windows-1252 encodings for the characters in question. The columns labeled "UTF-8 Hex" and "UTF-8 Native" are just plain wrong, at least for the rows labeled 128 to 159. The entities – and – represent an en-dash, but the UTF-8 sequence C2 96 represents a non-displayable control character.
You shouldn't need to encode those characters manually anyway. Just tell your text editor (or whatever you use to create the content) to save the file as UTF-8.
I suspect this is because the characters between U+0080 and U+009F inclusive are control characters. I'm still slightly surprised that they show differently when encoded directly in the HTML than using entities, but basically you shouldn't be using them to start with. U+0096 isn't really "hyphen", it's "start of guarded area".
See the U+0080-U+00FF code chart for more information. Basically, try to avoid control characters...
Two reasons come to mind:
Are you sure that you have output the correct character code to the browser? Better check in some hex viewer.
The font you are using doesn't have a glyph defined at this code point.