How to detect malformed UTF characters - perl

I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?

Consider Python. It lets you extend the codecs machinery with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs
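# the error handler returns a replacement (one space) and the position at which to resume decoding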
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')
This prints (with two spaces after "spam", one for each replaced byte):
spam  eggs bacon

EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.

RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).
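To make that concrete, here is a rough Python sketch (not Perl, and the byte string is just made-up sample data) of exactly that rule: copy valid sequences through, replace anything else with a space, and resume at the next byte that can legally start a character.

def scrub_utf8(data):
    # Replace malformed UTF-8 with spaces; resume at the next legal boundary
    # (an ASCII byte or a 110/1110/11110 lead byte). Sketch only: it does not
    # reject overlong or surrogate encodings the way a strict decoder would.
    data = bytearray(data)              # yields ints in both Python 2 and 3
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                    # 0xxxxxxx: single-byte character
            n = 1
        elif 0xC2 <= b <= 0xDF:         # 110xxxxx: 2-byte sequence
            n = 2
        elif 0xE0 <= b <= 0xEF:         # 1110xxxx: 3-byte sequence
            n = 3
        elif 0xF0 <= b <= 0xF4:         # 11110xxx: 4-byte sequence
            n = 4
        else:                           # stray continuation or illegal lead byte
            out += b' '
            i += 1
            continue
        chunk = data[i:i + n]
        if len(chunk) == n and all(0x80 <= c <= 0xBF for c in chunk[1:]):
            out += chunk                # complete, well-formed sequence
            i += n
        else:                           # truncated or broken sequence
            out += b' '
            i += 1
    return bytes(out)

print(scrub_utf8(b'spam\xb0\xc0eggs\xd0bacon'))    # spam  eggs bacon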
But BKB is probably correct - the easiest answer is to let perl do it for you, although I'm not sure what Perl does when it detects the incorrect utf-8 with that filter in effect.

Related

PyGame: Proper use of Unicode

My goal is to create a program with which the user can learn Bible verses by being shown a problem and solving it through input (e.g. "Quote vers Gen 3:15"). As the Bible translation I have to work with is German, it contains a ton of umlauts, which never show up properly.
My PyGame file's header:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Later on, I list the three German umlauts:
u'ö'.encode('utf-8')
u'ä'.encode('utf-8')
u'ü'.encode('utf-8')
The txt-file is parsed by this function:
def load_list(listname):
    fullname = os.path.join("daten", listname + ".txt")
    with codecs.open(fullname, "r", "utf-8-sig") as name:
        lines = name.readlines()
    for x in range(0, len(lines)):
        lines[x] = lines[x].strip("\n")
        lines[x] = lines[x].strip("\r")
    print lines
I'm aware that I could combine the two strip calls into one line, but that's not the topic here.
How can I get PyGame to display the umlauts from the text file correctly, and also display the umlauts in the user's input correctly? I've checked hundreds of suggestions, but I can't get anything really working here.
Any help is highly appreciated, before I lose my mind (well, as I'm sitting here coding games, I probably have already anyway :D).
I'll try to summarize:
Printing something other than a string or unicode object triggers that object's __repr__() method. If it is a sequence, this applies to the contained elements as well, causing any non-ASCII character to be escaped with \xXX (or \uXXXX) notation. Note the difference between print 'text' and print ['text']: in the latter case, the string's quotes will be printed as well (besides the brackets, of course). Use str.join() for concatenating lists of strings in order to control the way the output looks.
It's a good idea to always explicitly decode input (as you do by using codecs) and encode the output (which is not done in the code snippets in your question).
The source file encoding (the # coding: utf8 line in the header) has nothing to do with the encoding of input and output. It only enables you to type non-ASCII characters in string literals (i.e. characters inside quotes in the source file), instead of using \xXX escapes.
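A tiny Python 2 sketch of those three points (the verse strings here are made up, and a UTF-8 terminal is assumed):

# -*- coding: utf-8 -*-
# Python 2: repr() escaping vs. join(), and explicit encoding of the output.
verses = [u'Schöpfung', u'Gebüsch']       # already-decoded unicode objects

print verses                              # list -> repr() of each element:
                                          # [u'Sch\xf6pfung', u'Geb\xfcsch']
print u', '.join(verses)                  # one unicode string, printed as text
print u', '.join(verses).encode('utf-8')  # explicit encode for the output stream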
Hope that makes some things clearer. There's a lot that can go wrong that looks like an encoding error, and it's not always easy to find out what's actually happening.

Interpretation of Greek characters by FOP

Can you please help me interpret the Greek character whose HTML entity is &#8062; (hex value U+1F7E)?
Details of this character can be found at the URL below:
http://www.isthisthingon.org/unicode/index.php?page=01&subpage=F&hilite=01F7E
When I run this character through Apache FOP, it gives me an ArrayIndexOutOfBoundsException:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)
When I looked into the FOP code, I was unable to understand the need for the lineBreakProperties[][] array in LineBreakUtils.java.
I also noticed that FOP fails with a similar error for all of the Greek characters mentioned on the above page which are non-displayable.
What are these special characters?
Why is there no display for these characters? Are they line breaks or tabs?
Has anyone solved a similar issue with FOP?
The U+1F7E code point is part of the Greek Extended Unicode block, but it does not represent any actual character; it is a reserved but unassigned code point. Here is the chart from Unicode 6.0: http://www.unicode.org/charts/PDF/U1F00.pdf.
So the errors you are getting are perhaps not so surprising.
I ran a FO file that included the following <fo:block> through both FOP 0.95 and FOP 1.0:
<fo:block>Unassigned code point: ὾</fo:block>
I did get the same java.lang.ArrayIndexOutOfBoundsException that you are seeing.
When using an adjacent "real" character, there was no error:
<fo:block>Assigned code point: ώ</fo:block>
So it seems like you have to ensure that your datastream does not contain non-characters like U+1F7E.
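If you can preprocess the data before FOP sees it, a small Python sketch along these lines would do that; unicodedata reports unassigned code points such as U+1F7E as category 'Cn' (the file names here are hypothetical):

import io
import unicodedata

def strip_unassigned(text):
    # Drop unassigned code points (category 'Cn'); substitute u' ' instead of
    # dropping if character positions matter downstream.
    return u''.join(ch for ch in text if unicodedata.category(ch) != 'Cn')

with io.open('input.fo', encoding='utf-8') as src:         # hypothetical input
    cleaned = strip_unassigned(src.read())
with io.open('cleaned.fo', 'w', encoding='utf-8') as dst:  # hypothetical output
    dst.write(cleaned)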
Answer from Apache
At first glance, this seems like a minor oversight in the implementation of Unicode linebreaking in FOP. This does not take into account the possibility that a given codepoint is not assigned a 'class' in linebreaking context. (U+1F7E does not appear in the file http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as a basis to generate those arrays in LineBreakUtils.java.)
On the other hand, one could obviously raise the question why you so desperately need to have an unassigned codepoint in your output. Are you absolutely sure you need this? If yes, then can you elaborate on the exact reason? (i.e. What exactly is this unassigned codepoint used for?)
The most straightforward 'fix' seems to be roughly as follows:
Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java
--- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (revision 1054383)
+++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (working copy)
@@ -87,6 +87,7 @@
/* Initial conversions */
switch (currentClass) {
+ case 0: // Unassigned codepoint: consider as AL?
case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
case LineBreakUtils.LINE_BREAK_PROPERTY_XX:
What this does is assign the class 'AL' or 'Alphabetic' to any codepoint that has not been assigned a class by Unicode. This means it will be treated as a regular letter.
Now, the reason why I am asking the question whether you are sure you know what you're doing, is that this may turn out to be undesirable. Perhaps the character in question needs to be treated as a space rather than a letter.
Unicode does not define U+1F7E other than as a 'reserved' character, so it makes sense that Unicode cannot say what should happen with this character in the context of linebreaking...
That said, it is also wrong of FOP to crash in this case, so the bug is definitely genuine.

base64 encoding that doesn't use "+/=" (plus or equals) characters?

I need to encode a string of about 1000 characters that can be any byte value (00-FF). I don't want to use hex because it's not dense enough. The problem with base64, as I understand it, is that it includes +, / and =, which are characters I cannot tolerate in my application.
Any suggestions?
Base58Check is an option. It is starting to become something of a de facto standard in cryptocurrency addresses.
Basic improvements over Base64:
Only alphanumeric characters [0-9a-zA-Z]
No look-alike characters: 0 (zero), O (capital o), I (capital i) and l (lower-case L) are excluded
No punctuation to trigger word wrap or line break in documents and emails
The entire value can also be selected with a single double-click, since there is no punctuation.
The Bitcoin Address Utility is an implementation example; geared for Bitcoins.
Note: A novel de facto standard may not be adequate for your needs. It is unclear if the Base58Check encoding method will formalise across current protocols.
Pick your replacements. Consider some other variants: base64 Variant table from Wikipedia.
While base64 encoders/decoders are trivial, the replacement substitution can be done as a simple pre/post-processing step around an existing base64 encode/decode function (inside wrappers) -- no need to re-invent the wheel (entirely). Or, better yet, as Mr. Skeet points out, find an existing library with enough flexibility.
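For example, a minimal Python wrapper around the standard base64 module; the replacement characters '-', '_' and '.' are arbitrary placeholders, to be swapped for whatever your application can tolerate:

import base64

def encode(data):
    # standard base64, then substitute the three unwanted characters
    text = base64.b64encode(data)
    return text.replace(b'+', b'-').replace(b'/', b'_').replace(b'=', b'.')

def decode(text):
    # reverse the substitution, then decode normally
    text = text.replace(b'-', b'+').replace(b'_', b'/').replace(b'.', b'=')
    return base64.b64decode(text)

blob = bytes(bytearray(range(256)))    # arbitrary bytes 0x00-0xFF
assert decode(encode(blob)) == blob

If only '+' and '/' are the problem, base64.urlsafe_b64encode already maps them to '-' and '_' for you.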
If you have no alternative suitable "funny" characters to choose from (perhaps all the other characters are invalid leaving only the 62 alphanumeric characters to choose from), you can always use an escape character for a very slight (~3/64?) increase in size. For instance, 0 (A) would be encoded as "AA", 62 (+) would be encoded as "AB" and 63 (/) would be encoded as "AC". This too could be done as a pre/post step if you don't want to write your own encoder/decoder from the ground-up. The disadvantage with this approach is that the ratio of output characters to input bytes is not fixed.
If it's just those particular characters that bother you, and you can find some other characters to use instead, then how about implementing your own custom base64 module? It's not all that difficult.
You could use Base32 instead. Less dense than Base64, but eliminates unwanted characters completely.
As Ciaran says, base64 isn't terribly hard to implement - but you may want to have a look for existing libraries which allow you to specify a custom set of characters to use. I'm pretty sure there are plenty out there, but you haven't specified which platform you need this for.
Basically, you just need 65 ASCII characters which are acceptable - preferably in addition to line breaks.
Sure. Why not write your own Base64 encoder/decoder, but replace those chars in your algorithm? Sure, it will not be decodable with a normal decoder, but if that's not an issue, then why worry about it. But you'd better have at least three other chars that ARE usable in your app to represent +, / and =...
base62 is essentially base64 but alphanumeric only.

Detect presence of a specific charset

I need a way to detect whether a file contains characters from a certain charset.
Specifically, I want to detect the presence of UTF8-encoded cyrillic characters in a series of files. Is there a tool to do this?
Thanks
If you are looking for a ready-made solution, you might want to try Enca.
However, if you only want to detect the presence of what can possibly be decoded as UTF-8 Cyrillic characters (without any complete UTF-8 validity checks), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp matches n consecutive UTF-8-encoded Russian Cyrillic characters). For an additional check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).
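As a rough Python illustration of the grep approach, with n fixed at 3 and a hypothetical file name:

import re

# A run of at least three UTF-8-encoded Russian Cyrillic characters,
# same byte pattern as the regexp above.
CYRILLIC_RUN = re.compile(b'(\xd0[\x81\x90-\xbf]|\xd1[\x80-\x8f\x91]){3,}')

def has_cyrillic(path):
    with open(path, 'rb') as f:
        return CYRILLIC_RUN.search(f.read()) is not None

print(has_cyrillic('example.txt'))    # hypothetical file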
Both methods have their good and bad sides and may sometimes give wrong results.
IIRC the ICU library has code that does character set detection. Though it's basically a best effort guess.
Edit: I did remember correctly, check out this paper / tutorial

How do I determine the character set of a string?

I have several files that are in several different languages. I thought they were all encoded UTF-8, but now I'm not so sure. Some characters look fine, some do not. Is there a way that I can break out the strings and try to identify the character sets? Perhaps split on white space then identify each word? Finally, is there an easy way to translate characters from one set to UTF-8?
If you don't know the character set for sure, you can only guess, basically. utf8::valid might help you with that, but you can't really know for sure. If you know that, when it isn't Unicode, it must be a specific character set (like Latin-1), you're lucky. If you have no idea, you're screwed. In any case, you should always assume the whole file is in the same character set, unless otherwise specified. You will lose your sanity if you don't.
As for your question about how to convert between character sets: Perl's Encode module is there to do that for you.
Determining whether a file is probably UTF-8 or not should be pretty easy. Determining the encoding if it is not UTF-8 would be very difficult in general.
If the file is encoded with UTF-8, the high bits of each byte should follow a pattern. If a character is one byte, its high bit will be cleared (zero). Otherwise, an n byte character (where n is 2–4) will have the high n bits of the first byte set to one, followed by a single zero bit. The following n - 1 bytes should all have the highest bit set and the second-highest bit cleared.
If all the bytes in your file follow these rules, it's probably encoded with UTF-8. I say probably, because anyone can invent a new encoding that happens to follow the same rules, deliberately or by chance, but interprets the codes differently.
Note that a file encoded with US-ASCII will follow these rules, but the high bit of every byte is zero. It's okay to treat such a file as UTF-8, since they are compatible in this range. Otherwise, it's some other encoding, and there's not an inherent test to distinguish the encoding. You'll have to use some contextual knowledge to guess.
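In practice, the easiest way to apply those rules is to let a strict UTF-8 decoder do the bit-pattern checking for you; here is a Python sketch with a hypothetical file name:

def is_probably_utf8(path):
    # A strict decode attempt enforces the structural rules described above.
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_probably_utf8('some_file.txt'))    # hypothetical file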
Take a look at iconv
http://www.gnu.org/software/libiconv/
Text::Iconv