read() function for FileReader reads 8212 - encoding

I'm trying to read the binary content of a text file (which contains the compressed version of a different text file). The first two characters (01111011 and 00100110) are correct (going by the values that were original put there during the compression process.
However, when it gets to the third character, it should be reading 10010111 (again, going by what was added during the compression process), but instead it reads 10000000010100 (aka 8212). Does anyone know what is causing this discrepancy, or how to fix it? Thanks!

The Java FileReader should not be used to read binary data from files since it reads a character at a time using the default encoding (which is most likely not very good for binary reading)
Instead, use FileInputStream which has read methods that reads actual raw bytes without any encoding applied.

Related

Why importdata is not working here?

I have two data text files with the same text content but they have different sizes. The following snapshot compares them (using Beyond Compare).
It seems that the hex content of the files is different.
The MATLAB function importdata reads fine the file to the left but gives the following error with the file to the right (the bigger size file):
Unable to load file. Use TEXTSCAN or FREAD for more complex formats.
What exactly is the difference between the two files ?
How to make importdata work with the file to the right ?
The problem and solution is already mentioned in the comments:
it seems like one text file is in ascii encoding (that is 8 bits per char) while the second one is unicode (16 bits per char). try converting the big file into simple ascii and re-read it
You may see the difference already in BeyondCompare: On the left, you have an ANSI-File, on the right you have a unicode with BOM (The hex-code looks like UTF-16LE. I'm not sure, which version the BeyondCompare you used to get the screenshot. My BeyondCompare would not show 'Unicode' but 'UTF16-LE' ...).

Does Apache Tika do character set conversion?

I'm using org.apache.tika.Tika.parseToString() to convert documents into plain text (i.e., unformatted text) files. My application potentially needs to convert documents that don't use a Unicode character set. For instance, some documents may be encoded in the Chinese GB2312 character set. It would be great if Tika re-coded the output into UTF-8. This would require Tika to reference a mapping between many different character sets and Unicode in order to convert the characters.
Does Tika convert the non-Unicode character set text into Unicode as the output of parseToString()? There are a lot of character sets out there so I would be impressed if Tika did this for more than a few character sets.
Update: I was able to create a couple different files with some non-Latin charsets (GB2312 (Chinese) and KOI8-R (Russian)). Tika.parseToString() couldn't even detect the charset or encoding. I opened an issue on the Tika bug tracker here: https://issues.apache.org/jira/browse/TIKA-1262
When talking about Character Sets in Apache Tika, you need to consider two kinds of files differently. One kind is that of basically just plain text, the other are more complex types (including binary ones)
With the more complex files, Tika mostly uses third party libraries, and these libraries are responsible for returning Java Strings. The exact way of doing that will depend on the file format in question - sometimes the file format will including encoding information, other times it'll be fixed in what it supports. Either way, Tika gets Java Strings, and returns to you a Java String. How you choose to encode that for output is up to you. (For Windows users especially, check the encoding of your terminal, and the font used. There've been lots of "Tika Encoding Problems" which were actually people failing to correctly set the default Java encoding on output, or failing to have a Unicode capable terminal!)
With plain text files, there's no encoding information in the file, all we have is a bunch of bytes. Here, Apache Tika uses one of a number of EncodingDetector instances to do the detection. These use hints, n-grams, language detection etc, to try to work out the most likely encoding of the file based on information given, pattern of bytes in the file etc.
The definition of EncodingDetector is held in the Tika-Core jar, but most of the implentations are held in the Tika-Parsers jar (and loaded by the service loader method, just like Detectors and Parsers). The main ones are here in SVN. If you check there, you'll see the main list of encodings that Tika can detect.
One final thing - the encoding detection is only performed on files that are text files, it isn't done on the binary type files. Depending on how you call Tika, you might need to tweak that and/or provide a hint that it's a text file, so that the EncodingDetector logic gets triggered.
This answer actually comes from a JIRA user on the Tika project. https://issues.apache.org/jira/browse/TIKA-1262
It turns out that if you tell Tika that the file extension is '.txt' it will treat the file as plain text, attempt to detect the encoding, and convert it to UTF.
An easy way to do this is to pass an empty Metadata object to TikaInputStream.get(). This will fill out the resourceName field of the Metadata object. Then pass this object to parseToString(). With the resourceName field set to a file name that ends with .txt the parser knows to treat this file as plain text and will do a encoding detection to try to discover how to decode the file. The string returned from parseToString() is a Java UTF-16 String object. When written to a file you can see that it is Unicode and uses the UCS charset.
Tika tika = new Tika();
Metadata metadata = new Metadata();
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
String contents = tika.parseToString(reader, metadata);
So far this has worked for text files using either GB2312/GB18030 and KOI8-R. This is the expected behavior and it's perfect! I don't know what other charsets/encoding is can handle.

How do I use the StackExchange API from Matlab?

How do I access data from the StackExchange API using Matlab?
The naive
sitedata = urlread('http://api.stackoverflow.com/1.1/questions?tagged=matlab')
fails since the data is compressed. However, when I write this to file (using fprintf(fileID,'%s',sitedata)), I get a zip-file that cannot be uncompressed.
Try urlwrite() instead:
urlwrite('http://api.stackoverflow.com/1.1/questions?tagged=matlab',...
'tempfile.zip')
gunzip('tempfile.zip')
fid = fopen('tempfile');
str = textscan(fid,'%s',Delimiter','\n');
fclose(fid);
A better version of this snippet would use tempname to dynamically generate temporary filenames.
Matlab's urlread assumes you're getting text data back, not binary. The gzip binary data is getting mangled either when urlread is decoding the character data to Unicode values to stick in Matlab chars, or when the formatted-output fprintf function is writing them out, encoding them to UTF-8 or whatever default character encoding you're using for fileID and changing the byte sequence, or maybe both.
IIRC, urlread will default to using ISO-8859-1 encoding, which means the bytes will be turned in to the Unicode code points with the same numeric values - effectively just a widening. So you can get the byte data back by doing sitebytes = uint8(sitedata). (That's a regular uint8() conversion, not a typecast().) (If this isn't the case, you can probably fiddle with urlread's CharSet option.)
If you can't get the right bytes out from urlread by fiddling with the encoding and casts, then you can drop down and make calls against the Java HttpAgent like urlread does and bypass the character set decoding step, or fiddle with its options. See the urlread source for how to do it.
Once you have the right bytes in memory, you can write them out to a file using the lower-level fwrite() function, which won't mangle them by doing character set encoding. Then you'll have a valid gzip file of the site's original response. (I think it'll work if you also just use fwrite(fileID, sitedata, 'uint8') directly on the char string, but it's uglier IMHO.)
You can also unzip it in memory using Java classes and save a trip to the filesystem. Do jsitebytes = typecast(sitebytes 'int8') to get them as Java-friendly signed bytes and then stick it into a ByteArrayInputStream and read it out through a GZIPInputStream. You'll need to build a little Java helper class because Matlab doesn't play well with passing byte[] buffers by reference like java.io wants, but it may be worthwhile if you do a lot of in-memory munging like this.
When working with web services or fancier data downloads (e.g. sites that need sessions or certificates), I've often ended up dropping down and coding directly against the HttpAgent and java.io classes from within Matlab.

How to determine if a file is IBM1047 encoded

I have a bunch of XML files that are declared as encoding="IBM1047" but they don't seem to be:
when converted with iconv from IBM1047 to UTF-8 or ISO8859-1 (Latin 1) they result in indecipherable garbage
file -i <name_of_file> says "unknown 8-bit encoding"
when parsed by an XML parser the parser complains there is text before the prolog but there isn't; this error doesn't happen if I change the encoding in the XML declaration to something else
It would be nice to find out the real encoding of these files (I tried 'file -i' as mentioned above, and 'enca' but it's limited to Slavic languages (the files are in French)).
I have little control about how these files are produced; short of finding the actual encoding, if I can prove conclusively that the files are not in fact IBM1047 I may get the producer to do something about it.
How do I prove it?
Some special chars:
'é' is '©'
'à' is 'ë'
'è' is 'Û'
'ê' is 'ª'
The only way to prove that any class of data streams is encoded or not encoded in a particular way is to know, for at least one instance of the class, exactly what characters are supposed to be in the stream. If you have agreement on what characters are (supposed to be) in a particular test case, you can then calculate the bits that should be in the IBM 1047 (or any other) encoding of the test case, and compare those bits to the bits you actually see.
One simple way for EBCDIC data to be mangled, of course, is for it to have passed through some EBCDIC/ASCII gateway along the way that used a translate table designed for some other EBCDIC code page. But if you are working with EBCDIC data you presumably already know that.

How to read text file without knowing the encoding

When reading a text file that was created somewhere else outside my app, the encoding used is unknown. My app has being using NSUnicodeStringEncoding (which is the same as NSUTF16StringEncoding) so have problems reading other than UTF16 encoded files.
Is there a way I can guess the encoding of a file? My priority is to be able to read UTF8 files and then all other files.
Is iterating through available encodings and check if read string's length is more than zero is really a good approach?
Thanks in advance.
Ignacio
Apple's documentation has some guidance on how to proceed: String Programming Guide: Reading data with an unknown encoding:
If you are forced to guess the encoding (and note that in the absence of explicit information, it is a guess):
Try stringWithContentsOfFile:usedEncoding:error: or initWithContentsOfFile:usedEncoding:error: (or the URL-based equivalents).
These methods try to determine the encoding of the resource, and if successful return by reference the encoding used.
If (1) fails, try to read the resource by specifying UTF-8 as the encoding.
If (2) fails, try an appropriate legacy encoding.
"Appropriate" here depends a bit on circumstances; it might be the default C string encoding, it might be ISO or Windows Latin 1, or something else, depending on where your data is coming from.
If the file is properly constructed you can read the first four bytes and see if it is a BOM (Byte Order Mark):
http://en.wikipedia.org/wiki/Byte-order_mark