Trim UTF8 from the end of a UTF16 string - unicode

Here's an interesting puzzle, and I'd welcome any thoughts that anyone has. I don't think that there's a right or wrong answer here.
My program is loading a file which contains (amongst other things) string data structures. It helpfully declares what type the structure is (UTF8, UTF16 etc), and its length (of course) before the structure, so my program knows how to handle the data. Up to now this has worked perfectly every time.
Now I have been given a data file to load which has rubbish on the end of it - and, when I say rubbish, I mean that it looks like UTF8 at the end of a structure which is declared as UTF16.
D·a·v·e···E·d·m·u·n·d·s·o·n·d`upbqp!c°rÞPrpupÎÐAgâh(28.RïSÿ
The Dave Edmundson part is fine - it's everything after that which needs to be trimmed in this case. The fly in the ointment is that I still need to be able to handle legitimate UTF16 extended characters (like Korean, Chinese etc).
I could just fling my hands up in the air and say, "this data is corrupt", and spit out an error. But I'd like to be able to clean it if at all possible. Any ideas that anyone might have would be very welcome!
This is a logical issue, hence no code - if anyone is interested, I'm using Objective-C, but all I'm really after is some intelligent conversation on how I might approach this issue. I don't need code to be written for me!

Related

CSV in bad Encoding

We have uploaded a file with bad encoding now when downloading it again all the "strange" French characters are mixed up.
Example of the bad text:
R�union
Now when opening the CSV with Openoffice we tried all of the encodings in the Dropdown none of them seem to work.
Anyone have a way to fix the encoding to the correct one that we can view the chars?
Links to file https://drive.google.com/file/d/0BwgeuQK3LAFRWkJuNHd2TlF2WjQ/view?usp=sharing
Kr.
Sadly there is no way to automatically fix the linked file. Consider the two words afectación and sécurité. In the file they have been converted incorrectly to afectaci?n and s?curit?. There is no way to convert the question marks back because sometimes they're ó and other times é.
(Actually instead of question marks the file uses the unicode replacement character, but that doesn't change the problem).
Hopefully you have an earlier version of the file that has not been converted incorrectly.
Next time try to use a consistent encoding. This question gives some suggestions for how to do this.
If the original data cannot be obtained, there is one thing that could be done outside of retyping the whole thing. It is possible to use dictionary lookups to guess the missing words. However this would be a difficult project, and there would be mistakes where incorrect guesses were made. It's probably not worth it.

How "Expensive" is Perl's Encode::Detect::Detector

I've been having problems with "gremlins" from different encodings getting mixed into form input and data from a database within a Perl program. At first, I wasn't decoding, and smart quotes and similar things would generate multiple gibberish characters; but, blindly decoding everything as UTF-8 caused older Windows-1252 content to be filled with question marks.
So, I've used Encode::Detect::Detector and the decode() function to detect and decode all POST and GET input, along with data from a SQL database (the decoding process probably occurs on 10-20 strings of text each time a page is generated now). This seems to clean things up so UTF-8, ASCII and Windows-1252 content all display properly as UTF-8 output (as I've designated in the HTML headers):
my $encoding_name = Encode::Detect::Detector::detect($value);
eval { $value = decode($encoding_name, $value) };
My question is this: how resource heavy is this process? I haven't noticed a slowdown, so I think I'm happy with how this works, but if there's a more efficient way of doing this, I'd be happy to hear it.
The answer is highly application-dependent, so the acceptability of the 'expense' accrued is your call.
The best way to quantify the overhead is through profiling your code. You may want to give Devel::NYTProf a spin.
Tim Bunce's YAPC::EU presentation provide more details about the module.

Sample parser code for the CEDICT

Does anyone have a sample code for parsing the CEDICT file? CEDICT is a Chinese-English Dictionary. For instance, currently, if I open it in a text editor, a line in the CEDICT file looks like:
不 不 [bu4] /(negative prefix)/not/no/
I would like to see it as:
不 不 [bu4] /(negative prefix)/not/no/
I found Textwrangler to do this for me as a text editor. What I now need is sample code that achieves the same.
The thing is, it's just an encoding problem. If the line looks like
不 不 [bu4] /(negative prefix)/not/no/
It's because the text editor doesn't know/realize that the text is encoded as UTF-8. Text Wrangler, or its big brother BBEdit, are very good at guessing encoding, and can even be asked to display text in a specific encoding.
Since we don't know what you want, in the end, to achieve, it's hard to tell you exactly what has to be done, specifically. What I can say is that your app (which language are you using anyway?) needs to be Unicode aware (and be able to read/manipulate UTF strings).
I wrote a couple of apps based on the CEDICT, one for Mac OS X, one for Android. Parsing and indexing the CEDICT is not very hard.
UPDATE
Regarding the parsing itself of the CEDICT, it's nothing complicated. I don't do Objective-C, never have, never will, but the process would be the same in any language:
Read a line. Say your own example: 不 不 [bu4] /(negative prefix)/not/no/
You have four fields: Trad. Ch., Simp. Ch., Reading, Meaning(s).
These fields are space separated. Of course the 4th field may contain spaces, so be careful.
Store (I used an sqlite db) the 4 fields in to db.
You might want to remove the slashes from the definition field, replace them with something else.
Loop
You have now converted the CEDICT to a database. That's the easy part. As for tokenizing Chinese, good luck with that, mate. Better minds than mine are still banging their heads on this one.

Strange character rendered correctly in notepad, but as a control character elsewhere

I have a .csv list of businesses. The file has some strange characters in. For example, in this field: Stocktonon-Tees, the first hyphen, between Stockton and on seems to be a character with the value 6 rather than a hyphen, with the value 45. Stack overflow will probably sanatize this so you can't see it, so here is a pastebin:
http://pastebin.com/NuyyaQy9
Can anyone explain why this could be? Is it some encoding issue that I have missed? Or a corruption in the dataset?
Yes, it's almost certainly an encoding issue. A file just consists of binary data - it's how you interpret that binary data that matters. It sounds like Notepad is guessing at the originally-intended encoding, but whatever else you're using isn't.
Unfortunately you haven't said anything about what software is trying to read the file or what wrote it in the first place - but you should look at what encoding Notepad thinks it is, and work from there.
If it's your code that wrote the file out, and you get to decide the encoding, I'd recommend UTF-8 as a good general purpose, platform-portable encoding.

Should I convert overlong UTF-8 strings to their shortest normal form?

I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form.
My question is quite simply "is this a bad idea"?
A number of sources (including this RFC) suggest that any over-long UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.
Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have, it simply converts them into a normalised form.
Am I missing something. Is there a hidden danger I haven't considered?
Yes, this is a bad idea.
Maybe some of the data in one of these messy data files was checked to see that it didn't contain a dangerous sequence of ASCII characters.
The canonical example that caused many problems: '\xC0\xBCscript>'. ‘Fix’ the overlong sequence to plain ASCII < and you have accidentally created a security hole.
No tool has ever generated overlongs for any legitimate purpose. If you're trying to repair mixed encoding files, you should consider encountering one as a sign that you have mis-guessed the encoding.
I don't think this is a bad idea from a security or usability perspective.
From security perspective you should be sanitizing user input before use. So you can run your clean up routines, and then make sure the data doesn't contain greater-than/less-than symbols <> before it is printed out. You should also make sure you call mysql_real_escape_string() before inserting it into the database. Keep in mind that language encoding issues such as GBK vs Latin1 can lead to sql injection when you aren't using mysql_real_escape_string(). (This function name should be pretty similar regardless of your platform specific mysql library bindings)
Sanitizing all user input is generally a terrible idea because you don't know how the specific variable will be used. For instance sql injection and xss have very different control characters involved and the same sensitization for both often leads to vulnerabilities.
I don't know if it is a bad idea in your scenario, however, as this kind of change is not bijective, it may lead to data loss.
If you incorrectly detected the encoding of your data, you may interpret data as being legitimate UTF-8 overlongs and change them in the shortest normal form. There will be no way to later retrieve the original data.
As a personal experience, I know that when such things may happen, they WILL and you will potentially not notice the error before it is too late...