How to send text in UTF-8 using Indy TIdTCPServer in C++ Builder - sockets

My J2ME client application reads the text input stream using UTF-8:
reader = new InputStreamReader(in,"UTF-8");
and my server, when a client connects, sends text with this statement:
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text,TEncoding::UTF8);
but the resulting text shows weird characters like ?????????????????????????? ?????????????
Where am I going wrong?
Also, when I tried to send a UTF-8 encoded data file like this:
AContext->Connection->IOHandler->WriteFile("c:\\fids.xml");
it's all the same!

Indy 10 completely supports UTF-8 encoding. I've worked with its TIdFTP component myself and successfully uploaded Unicode text files. From what I can make of it, either:
Your connection/transfer type is set to ftASCII rather than ftBinary, or
Your J2ME applet/host platform does not support UTF-8.

'?' characters occur when data is going through a Unicode-to-Ansi conversion to an Ansi charset that does not support the Unicode characters being converted.
What version of C++Builder are you using? In versions prior to CB2009, you need to tell Indy the encoding of the AnsiString data that you are passing in. Indy defaults to ASCII (ie: TIdTextEncoding::ASCII) for most String-based operations. That can be overridden when needed, either with the optional AAnsiEncoding parameters, the TIdIOHandler::DefAnsiEncoding property, or the global IdGlobal::GIdDefaultAnsiEncoding setting. If you do not specify the correct encoding, the AnsiString data may not be converted to Unicode correctly before then being converted to UTF-8. For example:
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text, TIdTextEncoding_UTF8, TIdTextEncoding_Default);
Or:
AContext->Connection->IOHandler->DefAnsiEncoding = TIdTextEncoding_Default;
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text, TIdTextEncoding_UTF8);
You can optionally also use the TIdIOHandler::DefStringEncoding property if you do not want to specify the UTF-8 encoding on every call:
AContext->Connection->IOHandler->DefStringEncoding = TIdTextEncoding_UTF8;
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text);
Now, with that said, the fact that WriteFile() is also sending data that J2ME is not handling correctly tells me that Indy is not the root of the issue. WriteFile() simply dumps the raw file data as-is to the connection without any interpretation at all. If you send a UTF-8 encoded file, then UTF-8 encoded octets will be sent to J2ME.
I suggest you use a packet sniffer, such as Wireshark, to verify the data that Indy is sending. That will tell you for sure whether Indy is really at fault or not.
*PS: notice in the examples above that I use Indy's TIdTextEncoding macros instead of TEncoding directly. This is because Indy's TIdTextEncoding logic works around some bugs in Embarcadero's TEncoding classes. Also, we're going to phase out direct support for TEncoding in Indy 11 and expand on TIdTextEncoding so Indy has more control than Embarcadero offers.
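In the same spirit as the sniffer suggestion, you can also rule out the client side by reading the stream with a known-good UTF-8 reader and printing what actually arrives. Below is a minimal test-client sketch in desktop Java (not J2ME, which lacks BufferedReader); the host and port are placeholders:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

public class Utf8TestClient {
    public static void main(String[] args) throws IOException {
        // Hypothetical host/port; replace with your server's address
        try (Socket s = new Socket("localhost", 9000);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Note: the console itself must support the characters to display them
                System.out.println(line);
            }
        }
    }
}
If this test client shows the correct text, the bytes on the wire are fine and the problem lies in the J2ME side's decoding or display.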

Related

Why Base64 is used "only" to encode binary data?

I have seen many resources about the usage of Base64 on today's internet. As I understand it, all of those resources seem to spell out a single use case in different ways: encode binary data in Base64 to avoid it getting misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers in the path to the client. Why would intermediate servers/systems/routers need to interpret something that they receive? Any examples of such systems which may corrupt/wrongly interpret data that they receive, on today's internet?
Why do we fear only binary data being corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by this same logic, any text characters that do not belong to the Base64 alphabet could be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea, when we use a browser, are JavaScript and HTML files transferred in Base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time" when some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these would have similar effects when transferring binary (i.e. non-text) data over them: they would try to interpret the binary data as textual data in a character encoding that obviously doesn't make sense (since there is no character encoding in binary data) and as a consequence modify the data in an unfixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
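To make the "ASCII compatible" point concrete, here is a small Java sketch (purely illustrative) showing that the same pure-ASCII bytes decode to the same characters under several common encodings, but not under UTF-16:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiCompatible {
    public static void main(String[] args) {
        // Base64 output only ever contains ASCII characters
        byte[] ascii = "SGVsbG8=".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.UTF_8));          // SGVsbG8=
        System.out.println(new String(ascii, StandardCharsets.ISO_8859_1));     // SGVsbG8=
        System.out.println(new String(ascii, Charset.forName("windows-1252"))); // SGVsbG8=
        // UTF-16 is not ASCII compatible: the same bytes decode to something else
        System.out.println(new String(ascii, StandardCharsets.UTF_16));
    }
}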
As to your second question, the answer is two-fold:
textual data often comes with some kind of metadata (either an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might be the use of quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback, namely that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
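A quick Java illustration of the 3-bytes-in, 4-characters-out overhead:
import java.util.Base64;

public class Base64Overhead {
    public static void main(String[] args) {
        byte[] data = new byte[300]; // 300 raw bytes
        String encoded = Base64.getEncoder().encodeToString(data);
        // Prints "300 bytes -> 400 characters": the ~33% size increase
        System.out.println(data.length + " bytes -> " + encoded.length() + " characters");
    }
}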
One of the uses of Base64 is to send email.
Mail servers used a terminal to transmit data. It was also common to have translation, e.g. \r\n into a single \n and the reverse. Note: there was also no guarantee that 8 bits could be used (the email standard is old, and it also allowed non-"internet" email, with ! instead of @ in addresses). Also, systems may not have been fully ASCII.
Also, a lone "." on a line (\n.\n) is considered the end of the body, and mbox files use \nFrom to mark the start of a new mail (which is why body lines starting with "From" get escaped to ">From"), so even once the 8-bit flag was common in mail servers, the problems were not totally solved.
Base64 was a good way to remove all of these problems: the content is just sent as characters that all servers must know, and the problem of encoding/decoding requires only sender and receiver agreement (and the right programs), without worrying about the many relay servers in between. Note: stray \r, \n, etc. are simply ignored by Base64 decoders.
Note: you can also use Base64 to encode strings in URLs, without worrying about how web browsers will interpret them. You may also see Base64 in configuration files (e.g. to include icons): specially crafted images cannot then be misinterpreted as configuration. In short, Base64 is handy for encoding binary data into protocols which were not designed for binary data.
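One caveat on the URL use: the standard Base64 alphabet uses +, / and =, which are themselves special in URLs, so the URL-safe variant (- and _ instead of + and /) is normally used there. A small Java sketch of the difference:
import java.util.Base64;

public class UrlSafeBase64 {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0xFB, (byte) 0xEF, (byte) 0xFF};
        // Standard alphabet: prints "++//", which is URL-hostile
        System.out.println(Base64.getEncoder().encodeToString(raw));
        // URL-safe alphabet: prints "--__", safe to embed in a query string
        System.out.println(Base64.getUrlEncoder().withoutPadding().encodeToString(raw));
    }
}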

The File Encoding Is utf8 but is in Windows-1256 readable

I am working with files whose encoding is unknown at first, but I get the encoding with these lines in Java:
InputStream in = new FileInputStream(new File("D:\\lbl2\\1 (26).LBL"));
InputStreamReader inputStreamReader = new InputStreamReader(in);
System.out.print(inputStreamReader.getEncoding());
and I get UTF8 as output.
But the problem is that when I try to view the file content with a browser or a text editor like Notepad++, I can't see the characters correctly. However, when I change the encoding to Windows-1256, all of the characters display correctly and are readable.
Am I making a mistake somewhere?
Java does not attempt to detect the encoding of a file. getEncoding returns the encoding that was selected in the InputStreamReader constructor. If you don't use one of the constructors that take a character set parameter, you get the 'platform default charset', according to Oracle's documentation.
This question discusses what the platform default charset is, and how you can change it.
If you know in advance that this file is Windows-1256, you can use:
InputStreamReader inputStreamReader = new InputStreamReader(in, "Windows-1256");
Attempting to detect the encoding of a file usually fails - see for example the Bush hid the facts issue in Windows Notepad.
Unfortunately there is no 100% reliable way to detect the encoding of a file and as the other answer points out Java by default doesn't try. It simply assumes the platform's default encoding.
If you know the files are all in a single encoding then great, you can just specify that encoding and life is good.
If you know that some files are in UTF-8 and some files are in a single legacy encoding then you can generally get away with trying a strict* UTF-8 decode first. If the strict UTF-8 decode errors out then you move on to your legacy encoding.
If you have a wider mix of encodings, things get considerably harder; you may have to resort to some quite complex language processing to sort them out.
* I believe that to get a strict decode in Java you need to first get a Charset, then get a CharsetDecoder from it, and then use the onMalformedInput method to set it to strict (REPORT) mode.
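For example, a sketch of the strict-then-fallback approach described above (the file path is the asker's; Windows-1256 is the assumed legacy encoding):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    public static void main(String[] args) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting characters
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new InputStreamReader(
                new FileInputStream("D:\\lbl2\\1 (26).LBL"), strictUtf8)) {
            char[] buf = new char[4096];
            while (reader.read(buf) != -1) { /* just validating */ }
            System.out.println("File decodes cleanly as UTF-8");
        } catch (MalformedInputException e) {
            System.out.println("Not valid UTF-8; fall back to Windows-1256");
        }
    }
}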

internal string encoding

I'm trying to understand how ASP classic handles strings internally. I've googled and debugged, but I still don't know how a string is encoded within the ASP script.
See the illustration below.
Is input data transformed so that all string variables have the same encoding no matter what source?
Most ASP pages are saved on disk as UTF-8. They do, however, #include ASP files that are saved with another encoding. At the top of front-end pages I set the response encoding to Unicode:
response.codepage = 65001 //unicode
response.charset = 'utf-8'
http://www.designerline.se/db/aspclassicencoding.png
First of all, it's worth considering that both UTF-8 and Windows-1252 (and ISO-8859-1 and others) are based on US-ASCII. The first 128 characters in all of these codepages are identical: they use exactly the same byte values and each occupies just one byte.
In many cases the vast majority of the content is within the US-ASCII range, so it's hard to tell there is any difference between them. Frequently the whole file uses just US-ASCII characters, and hence the files are identical regardless of the chosen encoding (save perhaps a BOM at the start of the file).
Basic Script Processing
First the processor combines an ASP file with all its includes, and the includes of those includes. This is done very simply, by sequentially replacing the include markers with the content of the include file being referenced. This happens purely at the byte level; no attempt is made to convert files of different encodings.
Next the combined version of the file is parsed, tokenized, even "compiled" into a tight interpreter-friendly form. It's at this point that chunks of static content in the file (the stuff outside of script code blocks) are turned into a special form of Response.Write. It's special in that, at the point script execution reaches these special writes, the processor simply copies the bytes as found in the file verbatim to the output stream; again, no attempt is made to convert any encodings.
Script code and character encoding
The ASP processor just doesn't cope well with anything that isn't ASCII. All your code and especially your string literals in your code should only be in ASCII.
What can be a bit confusing is that once a script is executing, all string variables are stored using Unicode encoding.
When code writes content to the response using the proper Response.Write method, this is where Response.CodePage comes into effect. It will encode the Unicode string the script provides into the response code page before adding it to the output stream.
What is the effect of Response.CharSet
It adds the CharSet attribute to the Content-Type HTTP header. That is it; it has no other impact. If you set this to one character set but send a different one, either because your Response.CodePage doesn't match it or because the byte content of the files is not in that encoding, then you can expect problems.
Input encoding
Things get really messy here. When form data is posted to the server, there is no provision in the form URL-encoding standard to declare the code page used. Browsers can be told what encoding to use, and they will default to the charset of the HTML page containing the form, but there is no mechanism to communicate that choice to the server.
ASP takes the view that the code page of posted form fields will be the same as the code page of the response it is about to send. Take a moment to absorb that.... This means that, quite counter-intuitively, the Response.CodePage value has an impact on the strings returned by Request.Form. For this reason it's important to get the correct code page set early; doing some form processing and then setting the code page later, just before sending a response, can lead to unexpected results.
The classic "the web page looks fine but the data in the database is corrupt" gotcha
One common gotcha this behaviour results in is where the developer has set CharSet="UTF-8" but left the codepage at something like "Windows-1252".
What ends up happening is that the user enters text which is sent to the server in UTF-8 encoding, but the script code reads it as 1252. This corrupt string gets stored in the database. A subsequent web page looks at this data and pulls the corrupt string from the DB. This string is then sent by Response.Write using 1252 encoding, but the destination page is told it's UTF-8. This has the effect of reversing the corruption, and everything looks fine to the user.
However when other components, say a report generator, creates content from the database then the data appears corrupt because it is.
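The reversal is easy to demonstrate. This Java sketch (Java stands in for the ASP behaviour purely for illustration) shows a UTF-8 form post decoded as Windows-1252, then the corruption being undone on the way back out:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRoundTrip {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String typed = "Привет";                     // what the user typed
        // Browser sends UTF-8 bytes, script reads them as 1252: this goes in the DB
        String stored = new String(typed.getBytes(StandardCharsets.UTF_8), cp1252);
        System.out.println(stored);                  // ÐŸÑ€Ð¸Ð²ÐµÑ‚ - what the report generator sees
        // Page encodes as 1252 but labels the response UTF-8: corruption reversed
        String displayed = new String(stored.getBytes(cp1252), StandardCharsets.UTF_8);
        System.out.println(displayed.equals(typed)); // true - looks fine in the browser
    }
}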
The Bottom Line
You are already doing the correct thing: get that CharSet and CodePage set early and consistently. Where other files may not be saved as UTF-8, you will have problems if there is non-ASCII content in them, but otherwise you will be fine.
Many include ASPs are purely code with no content, and since that code ought to be purely ASCII, its encoding doesn't really matter.

What charset to use to store russian text into javascript files as an array

I am creating a ColdFusion page that takes language translation data stored in a table in my database and makes static .js files for each language pairing, English to ___, etc.
I am now starting to work on Russian; I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as UTF-8 or ISO-8859-1, but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your .cfm templates were written to disk with UTF-8 encoding (Notepad++ handles that nicely, and so do Eclipse and the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. Coldfusion is quite agnostic in that regard, as it will happily use any jdbc driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (With or without a BOM, it matters not for Russian.) In any case, UTF-8 is absolutely what you should be using. Make sure you get a UTF-8 compliant editor, which is most things on Mac. On Windows you could use SciTE or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor on a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru = {
    launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
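If you need to generate those escapes yourself, here is a hypothetical Java helper (the names are mine, not from the question) that turns any string into a pure-ASCII JavaScript literal:
public class JsEscape {
    // Escapes every non-ASCII character as \uXXXX so the output is pure ASCII
    static String toJsLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints \u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 ...
        System.out.println(toJsLiteral("Запуск Мой календарь"));
    }
}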
when it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì" so the charset is wrong
Looks like you've saved as cp1251 (ie. default codepage on a Russian machine) and then copied the file to a Western server where the default codepage is cp1252.
I also just found out that my text editor of choice, TextPad, doesn't support Unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.

is it almost a standard that web applications assume the query string to be in UTF-8?

When we type or paste the following into a browser's address bar:
http://www.google.com/search?q=%E5%A4%A9
I think there is no way to tell whether the encoding is UTF-8 or any other encoding, so the application will usually assume it is UTF-8. So is it entirely up to the app to interpret it as whatever encoding it wants to or assumes it to be?
(For all websites, and even the platform I worked on, it seems to almost always be UTF-8.)
Update: changed to the webapp instead.
RFC 3986 says:
"When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63];"
So UTF-8 is definitely the way to go for any new HTTP GET APIs.
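For instance, this is exactly how the query string in the question is produced (a Java sketch; the URLEncoder.encode overload taking a Charset requires Java 10+):
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryEncode {
    public static void main(String[] args) {
        // 天 (U+5929) percent-encoded as its three UTF-8 octets
        String q = URLEncoder.encode("天", StandardCharsets.UTF_8);
        System.out.println("http://www.google.com/search?q=" + q); // ...?q=%E5%A4%A9
    }
}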
The servers don't care. They just pass it to PHP or Ruby or whatever.