Force Unicode on Data Transfer utility for iSeries AS400 for TSV tab delimited files - encoding

I am using Data Transfer utility for IBM i in order to create TSV files from my AS400s and then import them to my SQl Server Data Warehouse.
Following this: SO Question about SSIS encoding script i want to stop using conversion in SSIS task and have the data ready from the source.
I have tried using vatious codepages in TSV creation (1200 etc.) but 1208 only does the trick in half: It creates UTF8 which then i have to convert to unicode as shown in the other question.
What CCSID i have to use to get unicode from the start?
Utility Screenshot:

On IBM i, CCSID support is intended to be seamless. Imagine the situation where the table is in German encoding, your job is in English and you are creating a new table in French - all on a system whose default encoding is Chinese. Use the appropriate CCSID for each of these and the operating system will do the character encoding conversion for you.
Unfortunately, many midrange systems aren't configured properly. Their system default CCSID is 'no CCSID / binary' - a remnant of a time some 20 years ago, before CCSID support. DSPSYSVAL QCCSID will tell you what the default CCSID is for your system. If it's 65535, that's 'binary'. This causes no end of problems, because the operating system can't figure out what the true character encoding is. Because CCSID(65535) was set for many years, almost all the tables on the system have this encoding. All the jobs on the system run under this encoding. When everything on the system is 65535, then the OS doesn't need to do any character conversion, and all seems well.
Then, someone needs multi-byte characters. It might be an Asian language, or as in your case, Unicode. If the system as a whole is 'binary / no conversion' it can be very frustrating because, essentially, the system admins have lied to the operating system with respect to the character encoding that is effect for the database and jobs.
I'm guessing that you are dealing with a CCSID(65535) environment. I think you are going to have to request some changes. At the very least, create a new / work table using an appropriate CCSID like EBCDIC US English (37). Use a system utility like CPYF to populate this table. Now try to download that, using a CCSID of say, 13488. If that does what you need, then perhaps all you need is an intermediate table to pass your data through.
Ultimately, the right solution is a proper CCSID configuration. Have the admins set the QCCSID system value and consider changing the encoding on the existing tables. After that, the system will handle multiple encodings seamlessly, as intended.

The CCSID on IBM i called 13488 is Unicode type UCS-2 (UTF-16 Big Endian). There is not "one unicode" - there are several types of Unicode formats. I looked at your other question. 1208 is also Unicode UTF-8. So what exactly is meant "to get Unicode to begin with" is not clear (you are getting Unicode to begin with in format UTF-8) -- but then I read your other question and the function you mention does not say what kind of "unicode" it expects :
using (StreamWriter writer = new StreamWriter(to, false, Encoding.Unicode, 1000000))
The operating system on IBM i default is to mainly store data in EBCDIC database tables, and there are some rare applications that are built on this system to use Unicode natively. It will translate the data into whatever type of Unicode it supports.
As for SQL Server and Java - I am fairly sure they use UCS-2 type Unicode so if you try using CCSID 13488 on the AS/400 side to transfer, it may let you avoid the extra conversion from UTF-8 Unicode because CCSID 13488 is UCS-2 style Unicode.

https://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.html
There are 2 CCSID's for UTF-8 unicode on system i 1208 and 1209. 1208 is UTF-8 with IBM PAU 1209 is for UTF-8. See link above.

Related

What happens if you set your integration package to Unicode?

I'm importing data from flat-files (text files). I do not know which encoding they will use, it may be unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues? I am using SSIS 2012.
What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues?
The encoding that Microsoft misleadingly call “Unicode” is actually UTF-16LE, an encoding based around two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages) so if you read a file this is actually encoded in an ASCII superset you will get unreadable nonsense.
There's no magic ‘do the right thing’ option for reading characters from files, you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark on the front of the data that usually allows you to make a good guess, but otherwise you're on your own.

What charset to use to store russian text into javascript files as an array

I am creating a coldfusion page, that takes language translation data stored in a table in my database, and makes static js files for each language pairing of english to ___ etc...
I am now starting to work on russian, I was able to get the other languages to work fine..
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as utf-8 or ISO-8859-1 but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. Coldfusion is quite agnostic in that regard, as it will happily use any jdbc driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (with or without a BOM it matters not for Russian). In any case UTF-8 is absolutely what you should be using. Make sure you get a UTF-8 compliant editor. Which is most things on Mac. On Windows you could use Scite or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru= {
launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
when it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì" so the charset is wrong
Looks like you've saved as cp1251 (ie. default codepage on a Russian machine) and then copied the file to a Western server where the default codepage is cp1252.
I also just found out that my text editor of choice, textpad, doesn't support unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Win tools) default to “ANSI”. This is a misleading name as it is nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard, but in fact means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all text files in “ANSI” that have non-ASCII characters like accented letters in will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
It is very importaint since your whatevertool will show false chars/whatever if you use the wrong encoding. Try to load a kyrillic file in Notepad without using UTF-8 or so and see a lot of "?" coming up. :)

What encoding does InstallShield expect non-latin-alphabet string table entries to use?

I work on an app that gets distributed via a single installer containing multiple localizations. The build process includes a script that updates the .ism string table with translations for each supported language.
This works fine for languages like French and German. But when testing the installer in, i.e. Japanese, the text shows up as a series of squares. It's unlikely to be a font problem, since the InstallShield-supplied strings show up fine; only the string table entries are mangled. So the problem seems to be that the strings are in the wrong encoding.
The .ism is in XML format, with UTF-8 declared as its encoding, so I assumed the strings needed to be UTF-8 encoded as well. Do they actually need to use the encoding of the target platform? Is there any concern, then, about targets having different encodings, i.e. Chinese systems using one GB-encoding versus another? What is the right thing to do here?
Edit: Using InstallShield 2009, since there is apparently a difference between that and 2010.
In InstallShield 2009 and earlier, the encoding is a base-64 encoding of the binary string in the ANSI encoding specific to the language in question (e.g. CP932 for Japanese). In InstallShield 2010 and later, it will still accept that or use UTF-8, depending on other columns in that table.
Thanks (up-voted his answer) go to Michael Urman, for pointing us in the right direction. But this is the actual working (with InstallShield 2009) algorithm, reverse-engineered by a co-worker:
Start with a unicode (multi-byte-character) string
Write out the length as the encoded-length field in the ism-file
Encode the string as UTF-16-little-endian
Base-64 using the uuencode dictionary, except with ` (back-tick) instead of spaces.
Write the result to the ism-file, escaping XML entities
Be aware that base-64ing using the uuencode dictionary is not the same as using the uuencode algorithm. Standard uuencode produces a set of newline-separated lines, including a header, footers and one or more data lines, each of which begins with a length-character. If you're implementing this using a uuencode codec, you'll need to strip all of that off.
I'm also trying to figure this out...
I've inhereted some Installshield 12 (which is pre-2009) projects with string table entries containing characters outside the range of base64 'target' characters.
For example, one of the Japanese strings is:
4P!H&$9!O'<4!R&\=!E&,=``#$(80!C&L=0!P"00!G`&4`;#!T`)(PI##S,+DPR##\,.LP5S!^,%DP`C
After much searching I happened upon Base85 encoding, which looks much closer to being plausible, but have not yet verified this to be the solution.

Toad unicode input problem

In toad, I can see unicode characters that are coming from oracle db. But when I click one of the fields in the data grid into the edit mode, the unicode characters are converted to meaningless symbols, but this is not the big issue.
While editing this field, the unicode characters are displayed correctly as I type. But as soon as I press enter and exit edit mode, they are converted to the nearest (most similar) non-unicode character. So I cannot type unicode characters on data grids. Copy & pasting one of the unicode characters also does not work.
How can I solve this?
Edit: I am using toad 9.0.0.160.
We never found a solution for the same problems with toad. In the end most people used Enterprise Manager to get around the issues. Sorry I couldn't be more help.
Quest officially states, that they currently do not fully support Unicode, but they promise a full Unicode version of Toad in 2009: http://www.quest.com/public-sector/UTF8-for-Toad-for-Oracle.aspx
An excerpt from know issues with Toad 9.6:
Toad's data layer does not support UTF8 / Unicode data. Most non-ASCII characters will display as question marks in the data grid and should not produce any conversion errors except in Toad Reports. Toad Reports will produce errors and will not run on UTF8 / Unicode databases. It is therefore not advisable to edit non-ASCII Unicode data in Toad's data grids. Also, some users are still receiving "ORA-01026: multiple buffers of size > 4000 in the bind list" messages, which also seem to be related to Unicode data.