How to read text file without knowing the encoding - iphone

When reading a text file that was created somewhere else outside my app, the encoding used is unknown. My app has being using NSUnicodeStringEncoding (which is the same as NSUTF16StringEncoding) so have problems reading other than UTF16 encoded files.
Is there a way I can guess the encoding of a file? My priority is to be able to read UTF8 files and then all other files.
Is iterating through available encodings and check if read string's length is more than zero is really a good approach?
Thanks in advance.
Ignacio

Apple's documentation has some guidance on how to proceed: String Programming Guide: Reading data with an unknown encoding:
If you are forced to guess the encoding (and note that in the absence of explicit information, it is a guess):
Try stringWithContentsOfFile:usedEncoding:error: or initWithContentsOfFile:usedEncoding:error: (or the URL-based equivalents).
These methods try to determine the encoding of the resource, and if successful return by reference the encoding used.
If (1) fails, try to read the resource by specifying UTF-8 as the encoding.
If (2) fails, try an appropriate legacy encoding.
"Appropriate" here depends a bit on circumstances; it might be the default C string encoding, it might be ISO or Windows Latin 1, or something else, depending on where your data is coming from.

If the file is properly constructed you can read the first four bytes and see if it is a BOM (Byte Order Mark):
http://en.wikipedia.org/wiki/Byte-order_mark

Related

What happens if you set your integration package to Unicode?

I'm importing data from flat-files (text files). I do not know which encoding they will use, it may be unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues? I am using SSIS 2012.
What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues?
The encoding that Microsoft misleadingly call “Unicode” is actually UTF-16LE, an encoding based around two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages) so if you read a file this is actually encoded in an ASCII superset you will get unreadable nonsense.
There's no magic ‘do the right thing’ option for reading characters from files, you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark on the front of the data that usually allows you to make a good guess, but otherwise you're on your own.

How to determine if a file is IBM1047 encoded

I have a bunch of XML files that are declared as encoding="IBM1047" but they don't seem to be:
when converted with iconv from IBM1047 to UTF-8 or ISO8859-1 (Latin 1) they result in indecipherable garbage
file -i <name_of_file> says "unknown 8-bit encoding"
when parsed by an XML parser the parser complains there is text before the prolog but there isn't; this error doesn't happen if I change the encoding in the XML declaration to something else
It would be nice to find out the real encoding of these files (I tried 'file -i' as mentioned above, and 'enca' but it's limited to Slavic languages (the files are in French)).
I have little control about how these files are produced; short of finding the actual encoding, if I can prove conclusively that the files are not in fact IBM1047 I may get the producer to do something about it.
How do I prove it?
Some special chars:
'é' is '©'
'à' is 'ë'
'è' is 'Û'
'ê' is 'ª'
The only way to prove that any class of data streams is encoded or not encoded in a particular way is to know, for at least one instance of the class, exactly what characters are supposed to be in the stream. If you have agreement on what characters are (supposed to be) in a particular test case, you can then calculate the bits that should be in the IBM 1047 (or any other) encoding of the test case, and compare those bits to the bits you actually see.
One simple way for EBCDIC data to be mangled, of course, is for it to have passed through some EBCDIC/ASCII gateway along the way that used a translate table designed for some other EBCDIC code page. But if you are working with EBCDIC data you presumably already know that.

Character Encoding Issue

I'm using an API that processes my files and presents optimized output, but some special characters are not preserved, for example:
Input: äöü
Output: äöü
How do I fix this? What encoding should I use?
Many thanks for your help!
It really depend what processing you are done to your data. But in general, one powerful technique is to convert it to UTF-8 by Iconv, for example, and pass it through ASCII-capable API or functions. In general, if those functions don't mess with data they don't understand as ASCII, then the UTF-8 is preserved -- that's a nice property of UTF-8.
I am not sure what language you're using, but things like this occur when there is a mismatch between the encoding of the content when entered and encoding of the content when read in.
So, you might want to specify exactly what encoding to read the data. You may have to play with the actual encoding you need to use
string.getBytes("UTF-8")
string.getBytes("UTF-16")
string.getBytes("UTF-16LE")
string.getBytes("UTF-16BE")
etc...
Also, do some research about the system where this data is coming from. For example, web services from ASP.NET deliver the content as UTF-16LE, but Java uses UTF-16BE encoding. When these two system talk to each other with extended characters, they might not understand each other exactly the same way.

What text encoding scheme do you use when you have binary data that you need to send over an ascii channel?

If you have binary data that you need to encode, what encoding scheme do you use?
I know about:
Hex encoding. Very simple, but quite verbose, expands one byte to two.
Base 64. Most common, not so verbose, expands three bytes to four.
Base 85. Not common, less verbose again, expands four bytes to five.
Are there any other encoding schemes in common use? If so, what are there advantages and disadvantages?
Edit: This is useful, for example, when trying to store arbitrary data in a cookie. Cookies can only store text, not arbitrary data, so you need to convert it in some way, preferably with a way to convert it back. Further, assume that you are using a stateless server so that you cannot save the state on the server and just put an identifier into the cookie. Of course, if you do this you would also need some way of verifying that what the user is passing back to you is what you passed to the user, for example a signature.
Also, since the current consensus is that you should use base64 since it is widespread, I will also point out that this is what I use... I am just curious if anyone used anything else, and if so, why.
Edit: Just in case someone stumbles across this, if you do want to use Base64 to store data in a cookie, you need to use a modified Base64 implementation. See this answer for the reason why.
For encoding cookie values, you need to be careful. See this older answer:
With Version 0 cookies, values should
not contain white space, brackets,
parentheses, equals signs, commas,
double quotes, slashes, question
marks, at signs, colons, and
semicolons. Empty values may not
behave the same way on all browsers.
Base64 encoding can generate = symbols for certain inputs, and this technically is not permitted in cookies (version 0 cookies, anyway, which are the most widely supported). In practice, I suspect the = will actually work fine, but maybe not.
I would suggest that to be absolutely sure that your encoded binary is cookie-compatible, then basic hex encoding is safest (e.g. in java).
edit: As #Paul helpfully pointed out, there is a modified version of Base 64 that is "URL safe" (and, I assume, "cookie safe"). Using a modified version of a standard algorithm rather dilutes its charm, mind you.
edit: #shoosh pointed out that the = is only used to denote the end of the base64 string, so you could trim the =, set the cookie, then reattach the = again when you need to decode it.
Base64 wins because it's so common that I don't have to ever worry about rolling my own encoder/decoder. I haven't run into any applications where I've been worried about saving bandwidth or filespace in encoded binary data.
Once upon a time, there was UTF-7. It's officially deprecated, but it still works as an ACE (ASCII Compatible Encoding). Now there's IDN.
uuencode is popular is some circles
HTML and XML encode unicode using this syntax
Base64 is the de-facto standard. Using anything else is asking for trouble.

Is there a standard encoding for NEEDED entries in ELF?

I'm trying to make some of my code a bit more friendly to non-pure-ascii systems and was wondering if there was a particular character encoding used for NEEDED entries in ELF binaries, or is it rather unstandard and based on the creating system's filesystem encoding (or even just directly the bytes that were passed to whatever created the binary) (if so is there any place in the binary that specifies the encoding? assuming the current systems encoding wouldn't work very well for my usage I think), are non-ascii names pretty much banned or something else?
ELF format specifies NEEDED fields as "null-terminated string" and does not say more about the encoding, which pretty much implies 8-bit ASCII string.
I personally don't see any point in complicating executable file format specification that does not provide any additional value for the final product or development process: the user won't see library names, so they wouldn't care about localization of thereof. You may try to use UTF-8, but actual file system encoding is not guaranteed to be UTF-8. To be sure you need to know how your target linker handles those strings.
As far as I know, the standard Unix way of dealing with non-ASCII characters is to encode them as UTF-8.