I am familiar with the .zip file format, and able to read the internal file table content so far.
The problem occurs with non-english characters in the file name.
The specification states that file names use OEM character set, yet sometimes I get UTF-8 representation and sometimes I get OEM represantation.
The specification states the "version made by" field should be in range 0-20, yet I get versions 31 and 63 which may or may not affect the character set.
Another related problem: When I read the "extra field" there is "up" (unicode path, id=0x7075) which suppose to store the utf-8 represantation of the filename, well, it starts with 5 redundant bytes before the actual utf-8 string (Created by WinRar), yet the other softwares seems to read it correctly.
Any input about the issue?
Related
I am having problems opening files which contain special characters like é, è, ë, ê, à, á, ö, etc. The error message I get from OxygnXML is:
File encoding (UTF8) does not support all characters from the current file.
To ignore these errors or to replace invalid characters follow the link below to change the "Encoding errors handling" option value from REPORT to IGNORE or REPLACE.
The strange thing is: when I alter the file (by swapping the 'ó' for an 'o', for instance), I can import the files both in OxygenXML and in FontoXML.Afterwards I can correct them again and save the file. But I don't see a difference between the original file and the altered file.
This is the original file
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is zó zenuwachtig, dat ze bijna aan de ... moet .</p>
And this is the saved corrected file (from FontoXML, in this case - just to show the added instructions):
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is
z<?fontoxml-change-addition-start author-id="erik.verhaar" change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22" timestamp="1627473671530"?>ó<?fontoxml-change-addition-end change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22"?><?fontoxml-change-deletion author-id="erik.verhaar" change-id="0296c77c-863b-421f-bf5c-c0901c7a2751" text="ó" timestamp="1627473669483"?>
zenuwachtig, dat ze bijna aan de ... moet .</p>
What is the difference between the original ó and the corrected one? And how can I change my original files so they can be imported in OxygenXML?
Thanks!!
Text files (XML for example) are saved on disk using bytes, they are edited and presented using characters. An encoding takes care of converting bytes to characters (sometimes multiple bytes are converted to characters) when the document is opened and again the encoding does the conversion of characters to bytes when the document is saved.
There are many encodings but with the most popular (like UTF-8) characters belonging to the 0-128 ASCII range like a-z A-Z are usually saved to a single byte. Characters outside of the range, for example e-acute (é) usually get saved as multiple bytes, depending on the encoding used for saving.
When an XML document is opened Oxygen attempts to understand what encoding to use for reading it. If the XML document has a heading like this:
Oxygen uses the encoding specified in the heading. If the XML doc is lacking the heading Oxygen will fallback to UTF-8. Basically Oxygen implements the XML specification when it comes to detecting the encoding of the XML file:
https://www.w3.org/TR/xml/#sec-guessing
In your case Oxygen detected the encoding as UTF-8 and started to use UTF-8 to convert bytes to characters. It encountered a sequence of bytes which were not encoded using UTF-8. Oxygen does not continue loading the file because in such cases you may end up with corrupt content when saving it back.
In my opinion the other editor tool you used to create the XML files was not XML aware, it did not actually saved the XML as UTF-8 even if the heading in the XML document specified this.
We do not actually know with what encoding that other editing tool used to save the XML, one thing you could try would be to reopen the XML document in that other editing tool and change its encoding heading declaration from:
<?xml version='1.0' encoding='UTF-8'?>
to:
<?xml version='1.0' encoding='CP1250'?>
because I suspect that other editing tool actually used for saving the XML document the default platform encoding which on Windows should usually be CP1250.
Then save the XML document in the other editing tool and try to re-open it in Oxygen, if it works change its heading encoding declaration back to UTF-8 and save the XML document in Oxygen in order to properly save it using the UTF-8 encoding.
This older set of slides I made about XML encoding might also be useful to you:
https://www.oxygenxml.com/events/2018/large_xml_documents.pdf
I'm trying to accomplish following task in Octave:
Read filename from text file
Search for this file in particular location on hard drive
My script works for most files, but for certain files containing unicode characters I'm unable to match the filename from textfile with filename as it appears in the file system.
Filenames in textfile are in UTF-8 encoding and I read them in Octave with function fgetl().
Filenames from file system are obtained via function readdir(). I'm on Windows, NTFS file system.
For example, one problematic filename contains character "Č".
When printed out in Octave console, the characters appear exactly the same. However, a HEX viewer reveals that the characters are not actually the same. In the first case the character is encoded as 0x010C, in the second case as 0x0043 + 0x030C. Comparing both of them via strcmp() fails, of course.
What I tried to do is to omitt all non-ASCII characters from the filename and then compare them. But this didn't work, probably because in the second variant the first part of the character (0x0043) is actually ASCII.
Now I'm looking for some way of converting one format to another to be able to compare them. Any ideas?
EDIT:
As I discovered later, the character Č in the filename on Windows is actually written as C+ˇ, which is just another way you can write that character. So the difference probably insn't in encoding standard, but in 2 different ways to achieve 1 visible character (glyph).
This question basically then changes to a task of matching characters written "at once" and corresponding pair of letter+combining character.
Using SSIS we are extracting names and addresses from one system and providing them to a downstream system via a vendor specific file format that only accepts UTF-8, & parses the data based on character positions, so it expects each row to be an exact length.
Many users have umlauts, apostrophes or accents in their names and addresses.
These characters do not translate well in UTF-8, showing up as xD3, xE1 and similar.
As one char is now replaced with 3, the row length is now incorrect and the upload fails.
Is there a way to represent characters with accents & umlauts in UTF-8?
We can change them in the source system, but that means the spelling is now technically incorrect.
I'm using org.apache.tika.Tika.parseToString() to convert documents into plain text (i.e., unformatted text) files. My application potentially needs to convert documents that don't use a Unicode character set. For instance, some documents may be encoded in the Chinese GB2312 character set. It would be great if Tika re-coded the output into UTF-8. This would require Tika to reference a mapping between many different character sets and Unicode in order to convert the characters.
Does Tika convert the non-Unicode character set text into Unicode as the output of parseToString()? There are a lot of character sets out there so I would be impressed if Tika did this for more than a few character sets.
Update: I was able to create a couple different files with some non-Latin charsets (GB2312 (Chinese) and KOI8-R (Russian)). Tika.parseToString() couldn't even detect the charset or encoding. I opened an issue on the Tika bug tracker here: https://issues.apache.org/jira/browse/TIKA-1262
When talking about Character Sets in Apache Tika, you need to consider two kinds of files differently. One kind is that of basically just plain text, the other are more complex types (including binary ones)
With the more complex files, Tika mostly uses third party libraries, and these libraries are responsible for returning Java Strings. The exact way of doing that will depend on the file format in question - sometimes the file format will including encoding information, other times it'll be fixed in what it supports. Either way, Tika gets Java Strings, and returns to you a Java String. How you choose to encode that for output is up to you. (For Windows users especially, check the encoding of your terminal, and the font used. There've been lots of "Tika Encoding Problems" which were actually people failing to correctly set the default Java encoding on output, or failing to have a Unicode capable terminal!)
With plain text files, there's no encoding information in the file, all we have is a bunch of bytes. Here, Apache Tika uses one of a number of EncodingDetector instances to do the detection. These use hints, n-grams, language detection etc, to try to work out the most likely encoding of the file based on information given, pattern of bytes in the file etc.
The definition of EncodingDetector is held in the Tika-Core jar, but most of the implentations are held in the Tika-Parsers jar (and loaded by the service loader method, just like Detectors and Parsers). The main ones are here in SVN. If you check there, you'll see the main list of encodings that Tika can detect.
One final thing - the encoding detection is only performed on files that are text files, it isn't done on the binary type files. Depending on how you call Tika, you might need to tweak that and/or provide a hint that it's a text file, so that the EncodingDetector logic gets triggered.
This answer actually comes from a JIRA user on the Tika project. https://issues.apache.org/jira/browse/TIKA-1262
It turns out that if you tell Tika that the file extension is '.txt' it will treat the file as plain text, attempt to detect the encoding, and convert it to UTF.
An easy way to do this is to pass an empty Metadata object to TikaInputStream.get(). This will fill out the resourceName field of the Metadata object. Then pass this object to parseToString(). With the resourceName field set to a file name that ends with .txt the parser knows to treat this file as plain text and will do a encoding detection to try to discover how to decode the file. The string returned from parseToString() is a Java UTF-16 String object. When written to a file you can see that it is Unicode and uses the UCS charset.
Tika tika = new Tika();
Metadata metadata = new Metadata();
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
String contents = tika.parseToString(reader, metadata);
So far this has worked for text files using either GB2312/GB18030 and KOI8-R. This is the expected behavior and it's perfect! I don't know what other charsets/encoding is can handle.
I have to write some code working with character encoding. Is there a good introduction to the subject to get me started?
First posted at What every developer should know about character encoding.
If you write code that touches a text file, you probably need this.
Lets start off with two key items
1.Unicode does not solve this issue for us (yet).
2.Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And lets add a codacil to this – most Americans can get by without having to take this in to account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
The computer industry started with diskspace and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 bits for each character. There of course were numerous charactersets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
And for awhile this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, is HTML and XML. Every HTML and XML file can optionally have the character encoding set in it's header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guess wrong – the file will be misread.
Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
Now lets' look at UTF-8 because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible which mattered a lot back when it was designed and many people were still using dial-up modems.
UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a sersies of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivilent of every unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first character fo a 2 byte sequence. You either get a different character or if the second byte is not a legal value for that first byte – an error.
Point 2 – Always create HTML and XML in a program that writes it out correctly using the encode. If you must create with a text editor, then view the final file in a browser.
Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Lets take what is actually a very difficlut example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
Wrapping it up
I think there are two key items to keep in mind here. First, make sure you are taking the encoding in to account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding, it's when they ignore the issue that they get in to trouble.
From Joel Spolsky
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
As usual, Wikipedia is a good starting point: http://en.wikipedia.org/wiki/Character_encoding
I have a very basic introduction on my blog, which also includes links to in-depth resources if you REALLY want to dig into the subject matter.
http://www.dotnetnoob.com/2011/12/introduction-to-character-encoding.html