Is it possible to store Unicode (UTF-8) characters in ColdFusion's CLIENT scope? (CF9)
If I set a CLIENT scope variable and dump it immediately, it looks fine. But on the next page load (i.e. when the CLIENT scope is read back from storage) I just see question marks for the Unicode characters.
I'm using a database for persistence and the data column in the CDATA table has been set to ntext.
Looking directly in the database I can see that the records have not been written correctly (again, just question marks showing for unicode characters).
(From the comments)
Have you checked/enabled the "String Format -- Enable High ASCII characters and Unicode ..." option in your client datasource?
From the docs:
Enable this option if your application uses Unicode data in
DBMS-specific Unicode data types, such as National Character or nchar.
Related
I am currently working on a project in my organisation where we are migrating Informatica PowerCenter in our application from v8.1 to v9.1.
Informatica PC is loading data from data files but is not able to maintain certain special characters present in a few of the input .dat files.
The data was getting loaded correctly in v8.1.
I tried changing the character set settings in Informatica as below:
CodePage movement = Unicode
NLS_LANG = AMERICAN_AMERICA.UTF8 to ENGLISH_UNITEDKINGDOM.UTF8
"DataMovementMode" = Unicode
After making the above changes I am getting the below warnings in the Informatica log:
READER_1_2_1> FR_3015 Warning! Row [2258], field [exDestination]: Data [TO] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [exDestination]: Data [IOMR] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [parentOID]: Data [O-MS1109ZTRD00:esm4:iomr-2_20040510_0_0] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2268], field [exDestination]: Data [IOMR] was truncated.
The special characters that are being sent in the data and are not being handled correctly are:
Ø
Ù
Ɨ
¿
Á
Can somebody please advise how to resolve this issue? What else needs to be changed at the Informatica end?
Do any session parameters need to be set in the database?
I posted this in another thread about special characters. Please check if this is of any help.
Start with the source in Designer. Are you able to see the data correctly in the Source Qualifier preview? If not, you might want to set the flat file source definition encoding to UTF-8.
The Integration service should be running in Unicode mode and not ASCII mode. You can check this from the Integration service properties in Admin Console.
The target should be UTF-8 encoding.
Check the relational connection encoding (if the target is a database) in Workflow Manager to ensure it is UTF-8.
If the problem persists, write the output to a UTF-8 flat file and check if the data is loading properly. If yes, then the issue is with writing to the database.
Check the database settings like NLS_LANG, NLS_CHARACTERSET (for Oracle), etc.
Also set your Integration Service (IS) to run in Unicode mode for best results, in addition to configuring ODBC and relational connections to use Unicode.
Details for Unicode & ASCII
a) Unicode - the IS allows 2 bytes for each character and uses an additional byte for each non-ASCII character (such as Japanese/Chinese characters)
b) ASCII - the IS holds all data in a single byte
Make sure that the size of the variable is big enough to hold the data. Sometimes the warnings mentioned above are received when the size is too small to hold the incoming data.
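A quick illustration of why the size matters (a Python sketch, not Informatica itself; the sample value is made up): once multi-byte characters are involved, the same text occupies more bytes, so a field whose precision was sized for single-byte data can start truncating after the switch to Unicode data movement mode.

# Illustration only: byte counts for the same text under different encodings.
sample = "IOMR-Ø-Ù"                      # hypothetical field value with special characters
print(len(sample))                       # 8 characters
print(len(sample.encode("latin-1")))     # 8 bytes in a single-byte code page
print(len(sample.encode("utf-8")))       # 10 bytes in UTF-8
print(len(sample.encode("utf-16-le")))   # 16 bytes at 2 bytes per character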
I'm inserting a large amount of data into an Oracle database.
In that database, text is stored in windows-1252 format.
It turns out that there are a lot of entries to be made, and all of them need to be converted to this format. Also, all of that data is Arabic words.
Can someone help me find an online converter or a tool that encodes Arabic words to windows-1252 format?
I hope the details are enough.
--rangana
The pair of Win32 APIs, MultiByteToWideChar and WideCharToMultiByte, allow you to convert code-page encoding to Unicode and Unicode data to code-page encoding, respectively. Each of these APIs takes as an argument the value of the code page to be used for that conversion. You can, therefore, either specify the value of a given code page (example: 1256 for Arabic) or use predefined flags such as:
CP_ACP: for the currently selected system Windows code page
CP_OEMCP: for the currently selected system OEM code page
CP_UTF8: for conversions between UTF-16 and UTF-8
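For illustration, here is a rough Windows-only sketch (Python via ctypes) of how you might drive those two APIs; the helper names are mine, and 1256 and 65001 are the Arabic ANSI and UTF-8 code pages respectively:

import ctypes

# Windows-only sketch: calls the two Win32 APIs above through ctypes.
kernel32 = ctypes.windll.kernel32

def codepage_to_unicode(raw, codepage):
    # First call with a zero-length output buffer to ask for the required size.
    size = kernel32.MultiByteToWideChar(codepage, 0, raw, len(raw), None, 0)
    buf = ctypes.create_unicode_buffer(size)
    kernel32.MultiByteToWideChar(codepage, 0, raw, len(raw), buf, size)
    return buf[:size]

def unicode_to_codepage(text, codepage):
    size = kernel32.WideCharToMultiByte(codepage, 0, text, len(text), None, 0, None, None)
    buf = ctypes.create_string_buffer(size)
    kernel32.WideCharToMultiByte(codepage, 0, text, len(text), buf, size, None, None)
    return buf.raw[:size]

# Example: code page 1256 (Arabic) bytes -> Unicode -> UTF-8 bytes (CP_UTF8 = 65001).
# text = codepage_to_unicode(raw_bytes, 1256)
# utf8 = unicode_to_codepage(text, 65001)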
Since windows-1252 does not encode Arabic letters at all, the only way to do the conversion would be to use some kind of transliteration. This is something completely different from encoding conversion (which does not change the identity of characters, only their coded representation).
There is a large number of transliteration (romanization) schemes for Arabic. Almost all of them are non-reversible, and almost all of them are unsuitable for fully automatic processing (mainly because normal Arabic writing does not indicate short vowels but most transliteration schemes do, i.e. the transliterator needs to know how the word is pronounced in order to insert the vowel characters).
You could fake a conversion by converting to windows-1256 and then inserting the windows-1256 encoded data into the database as raw bytes. You would then need to keep track of the encoding of each value in the database, so that you know which bytes are windows-1252 and which are really windows-1256. This sounds like a mess, so consider whether it is possible to convert the database to use UTF-8.
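To make the point concrete, here is a minimal sketch (Python; the sample word is illustrative only) showing that Arabic text encodes fine to windows-1256 but simply cannot be represented in windows-1252:

arabic = "مرحبا"                          # sample Arabic word, illustrative only
raw_1256 = arabic.encode("windows-1256")  # works: cp1256 is the Arabic ANSI code page
print(raw_1256)
try:
    arabic.encode("windows-1252")         # cp1252 has no Arabic letters at all
except UnicodeEncodeError as err:
    print("windows-1252 cannot represent Arabic:", err)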
I am creating a ColdFusion page that takes language translation data stored in a table in my database and makes static .js files for each language pairing of English to ___, etc.
I am now starting to work on Russian. I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as UTF-8 or ISO-8859-1, but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (Notepad++ handles that nicely, and so do Eclipse and the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. Coldfusion is quite agnostic in that regard, as it will happily use any jdbc driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
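As a minimal sketch of what a lossy conversion looks like (Python, purely illustrative): a character with no equivalent in the target character set gets replaced by a substitution character, which is exactly where the question marks in the symptoms above come from.

text = "Запуск"                                 # valid Unicode (Russian)
print(text.encode("cp1252", errors="replace"))  # b'??????' - every character substituted
print(text.encode("utf-8"))                     # UTF-8 can represent all of them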
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (With or without a BOM; it doesn't matter for Russian.) In any case, UTF-8 is absolutely what you should be using. Make sure you get a UTF-8-compliant editor, which is most things on Mac. On Windows you could use SciTE or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru= {
launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
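If you need to generate such escapes from the stored translations, here is one hedged way to do it (a Python sketch; the helper name is made up, and it only handles BMP characters, which covers Cyrillic):

def js_escape(s):
    # Replace every non-ASCII character with a \uXXXX escape so the emitted
    # .js file is pure ASCII and the including page's encoding no longer matters.
    return ''.join(c if ord(c) < 128 else '\\u%04x' % ord(c) for c in s)

print(js_escape("Запуск Мой календарь"))
# -> \u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c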
When it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì", so the charset is wrong.
Looks like you've saved as cp1251 (i.e. the default codepage on a Russian machine) and then copied the file to a Western server where the default codepage is cp1252.
I also just found out that my text editor of choice, TextPad, doesn't support Unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.
I work on an app that gets distributed via a single installer containing multiple localizations. The build process includes a script that updates the .ism string table with translations for each supported language.
This works fine for languages like French and German. But when testing the installer in, for example, Japanese, the text shows up as a series of squares. It's unlikely to be a font problem, since the InstallShield-supplied strings show up fine; only the string table entries are mangled. So the problem seems to be that the strings are in the wrong encoding.
The .ism is in XML format, with UTF-8 declared as its encoding, so I assumed the strings needed to be UTF-8 encoded as well. Do they actually need to use the encoding of the target platform? Is there any concern, then, about targets having different encodings, e.g. Chinese systems using one GB encoding versus another? What is the right thing to do here?
Edit: Using InstallShield 2009, since there is apparently a difference between that and 2010.
In InstallShield 2009 and earlier, the encoding is a base-64 encoding of the binary string in the ANSI encoding specific to the language in question (e.g. CP932 for Japanese). In InstallShield 2010 and later, it will still accept that or use UTF-8, depending on other columns in that table.
Thanks go to Michael Urman (I up-voted his answer) for pointing us in the right direction. But this is the actual working algorithm (with InstallShield 2009), reverse-engineered by a co-worker:
Start with a Unicode (multi-byte-character) string
Write out the length as the encoded-length field in the ism-file
Encode the string as UTF-16-little-endian
Base-64 using the uuencode dictionary, except with ` (back-tick) instead of spaces.
Write the result to the ism-file, escaping XML entities
Be aware that base-64ing using the uuencode dictionary is not the same as using the uuencode algorithm. Standard uuencode produces a set of newline-separated lines, including a header, footers and one or more data lines, each of which begins with a length-character. If you're implementing this using a uuencode codec, you'll need to strip all of that off.
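Here is a hedged sketch of that algorithm in Python (the function names are mine; zero-byte padding to a multiple of three is an assumption borrowed from plain uuencode, and whether the length field is the character count or the UTF-16 byte count is also an assumption):

from xml.sax.saxutils import escape

def uu_char(value):
    # uuencode dictionary: 6-bit value v -> chr(32 + v), but with a back-tick
    # instead of a space for the value 0, as described above.
    return '`' if value == 0 else chr(32 + value)

def encode_ism_string(text):
    raw = text.encode('utf-16-le')              # step 3: UTF-16 little-endian
    raw += b'\x00' * (-len(raw) % 3)            # pad to a multiple of 3 bytes (assumption)
    chars = []
    for i in range(0, len(raw), 3):
        b1, b2, b3 = raw[i], raw[i + 1], raw[i + 2]
        chars.append(uu_char(b1 >> 2))
        chars.append(uu_char(((b1 & 0x03) << 4) | (b2 >> 4)))
        chars.append(uu_char(((b2 & 0x0F) << 2) | (b3 >> 6)))
        chars.append(uu_char(b3 & 0x3F))
    return escape(''.join(chars))               # step 5: escape XML entities

# Step 2: the separate encoded-length field (assumed here to be the character count).
# length_field = len(some_translated_string)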
I'm also trying to figure this out...
I've inherited some InstallShield 12 (which is pre-2009) projects with string table entries containing characters outside the range of base64 'target' characters.
For example, one of the Japanese strings is:
4P!H&$9!O'<4!R&\=!E&,=``#$(80!C&L=0!P"00!G`&4`;#!T`)(PI##S,+DPR##\,.LP5S!^,%DP`C
After much searching I happened upon Base85 encoding, which looks much closer to being plausible, but have not yet verified this to be the solution.
In Toad, I can see Unicode characters that are coming from an Oracle DB. But when I click one of the fields in the data grid into edit mode, the Unicode characters are converted to meaningless symbols; this is not the big issue, though.
While editing this field, the Unicode characters are displayed correctly as I type. But as soon as I press Enter and exit edit mode, they are converted to the nearest (most similar) non-Unicode character. So I cannot type Unicode characters in data grids. Copying and pasting one of the Unicode characters also does not work.
How can I solve this?
Edit: I am using toad 9.0.0.160.
We never found a solution for the same problems with Toad. In the end most people used Enterprise Manager to get around the issues. Sorry I couldn't be of more help.
Quest officially states, that they currently do not fully support Unicode, but they promise a full Unicode version of Toad in 2009: http://www.quest.com/public-sector/UTF8-for-Toad-for-Oracle.aspx
An excerpt from the known issues with Toad 9.6:
Toad's data layer does not support UTF8 / Unicode data. Most non-ASCII characters will display as question marks in the data grid and should not produce any conversion errors except in Toad Reports. Toad Reports will produce errors and will not run on UTF8 / Unicode databases. It is therefore not advisable to edit non-ASCII Unicode data in Toad's data grids. Also, some users are still receiving "ORA-01026: multiple buffers of size > 4000 in the bind list" messages, which also seem to be related to Unicode data.