Tika is not detecting plain ascii input - encoding

We have byte-sequence input and need to check whether it is UTF-8, plain ASCII, or something else. In other words, we have to reject ISO-8859-X (Latin-X) or otherwise-encoded input.
Our first choice was Tika, but we have a problem with it: plain ASCII input (input with no accented characters at all) is often detected as ISO-8859-2 or ISO-8859-1!
This is the problematic part:
// requires org.apache.tika.parser.txt.CharsetDetector (from tika-parsers)
CharsetDetector detector = new CharsetDetector();
String ascii = "Only ascii Visible:a;Invisible:GUID\nX;XXddd\n";
detector.setText(ascii.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
String ascii2 = "Only ascii plain english text";
detector.setText(ascii2.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
String ascii3 = "this is ISO-8859-2 do not know why";
detector.setText(ascii3.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
String ascii4 = "this is UTF-8 but tell me why o why maybe sdlkfjlksdjlkfjlksdjflkjlskdjflkjsdjkflkdsjlkfjldsjlkfjldkjkfljdlkjsdfhjshdkjfhjksdhjfkksdfksjdfhkjsdhj";
detector.setText(ascii4.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
This is the output:
detected charset: ISO-8859-2
detected charset: ISO-8859-1
detected charset: ISO-8859-2
detected charset: UTF-8
How should I use Tika to get sensible results?
PS: here is a mini demo: https://github.com/riskop/tikaproblem

There's a detectAll() method on the detector; with it you can get all the encodings Tika considered a match for the input. I can solve my problem with the following rule: if UTF-8 is among the matching encodings, the input is accepted (because it is possibly UTF-8); otherwise the input is rejected as not UTF-8.
I understand that Tika must use heuristics, and I understand that there are inputs which can be valid UTF-8 or other encoded texts at the same time.
So for example
bytes = "Only ascii plain english text".getBytes("UTF-8");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of ISO-8859-1 in nl with confidence 40
Match of ISO-8859-2 in ro with confidence 30
Match of UTF-8 with confidence 15
Match of ISO-8859-9 in tr with confidence 10
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
Which is usable in my case: although the two "best" matches are ISO-8859-1 and ISO-8859-2, the third best is UTF-8, so I can accept the input.
It also seems to work for invalid UTF-8 input.
For example 0xc3, 0xa9, 0xa9
bytes = new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}; // illegal utf-8: Cx leading byte followed by two continuation bytes
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results:
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Which is good: there's no UTF-8 among the matches.
A more likely input is text with accented characters in a non-UTF-8 encoding:
bytes = "this is somethingó not utf8 é".getBytes("ISO-8859-2");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results:
Match of ISO-8859-2 in hu with confidence 31
Match of ISO-8859-1 in en with confidence 31
Match of KOI8-R in ru with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
Which is good, because there is no UTF-8 among the results.
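Since the real requirement is "accept only input that could be UTF-8", it can also be cross-checked without any detector heuristics: a strict java.nio decoder deterministically rejects malformed UTF-8, and plain ASCII always passes because ASCII is a subset of UTF-8. A minimal sketch (not from the original answer; class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true iff the bytes decode as well-formed UTF-8 (ASCII is a subset).
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(
                "Only ascii plain english text".getBytes(StandardCharsets.US_ASCII))); // true
        // the invalid sequence from above: lead byte followed by two continuation bytes
        System.out.println(isValidUtf8(new byte[]{(byte) 0xC3, (byte) 0xA9, (byte) 0xA9})); // false
    }
}
```

Note this is stricter than the detectAll() rule: ISO-8859-2 text whose bytes happen to form valid UTF-8 would still be accepted, exactly as with Tika.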

Related

how does UTF-8 end up with bigger bits than UTF-16 [duplicate]

I have a university programming exam coming up, and one section is on unicode.
I have checked all over for answers to this, and my lecturer is useless so that’s no help, so this is a last resort for you guys to possibly help.
The question will be something like:
The string 'mЖ丽' has these unicode codepoints U+006D, U+0416 and
U+4E3D, with answers written in hexadecimal, manually encode the
string into UTF-8 and UTF-16.
Any help at all will be greatly appreciated as I am trying to get my head round this.
Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)
The clearest description I've seen so far for the rules to encode UCS codepoints to UTF-8 are from the utf-8(7) manpage on many Linux systems:
Encoding
The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
[... removed obsolete five and six byte forms ...]
The xxx bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multibyte sequence which can represent the
code number of the character can be used.
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
as 0xfffe and 0xffff (UCS noncharacters) should not appear in
conforming UTF-8 streams.
It might be easier to remember a 'compressed' version of the chart:
The initial byte of a multi-byte sequence starts with one 1 bit per byte in the sequence, followed by a 0. Subsequent (continuation) bytes start with 10.
From 0x80: 5 data bits in the lead byte, plus one continuation byte
From 0x800: 4 data bits in the lead byte, plus two continuation bytes
From 0x10000: 3 data bits in the lead byte, plus three continuation bytes
You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:
2**(5+1*6) == 2048 == 0x800
2**(4+2*6) == 65536 == 0x10000
2**(3+3*6) == 2097152 == 0x200000
I know I could remember the rules to derive the chart easier than the chart itself. Here's hoping you're good at remembering rules too. :)
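The powers-of-two derivation above is easy to sanity-check with plain shifts (illustrative only):

```java
public class Utf8Ranges {
    public static void main(String[] args) {
        // lead-byte data bits + 6 data bits per continuation byte
        System.out.println(1 << 7);            // 128     = 0x80     (1-byte limit)
        System.out.println(1 << (5 + 1 * 6));  // 2048    = 0x800    (2-byte limit)
        System.out.println(1 << (4 + 2 * 6));  // 65536   = 0x10000  (3-byte limit)
        System.out.println(1 << (3 + 3 * 6));  // 2097152 = 0x200000 (4-byte limit)
    }
}
```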
Update
Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:
U+4E3E
This fits in the 0x00000800 - 0x0000FFFF range (0x4E3E < 0xFFFF), so the representation will be of the form:
1110xxxx 10xxxxxx 10xxxxxx
0x4E3E is 100111000111110 in binary. Drop the bits into the x slots above (start from the right; we'll fill in missing bits at the start with 0):
1110x100 10111000 10111110
There is an x spot left over at the start, fill it in with 0:
11100100 10111000 10111110
Convert from bits to hex:
0xE4 0xB8 0xBE
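The whole chart fits in a few lines of code. A sketch (the helper name is mine) that reproduces the worked E4 B8 BE result:

```java
public class ManualUtf8 {
    // Encode one codepoint to UTF-8 by hand, following the chart above.
    static byte[] utf8(int cp) {
        if (cp < 0x80)    return new byte[]{(byte) cp};
        if (cp < 0x800)   return new byte[]{(byte) (0xC0 | cp >> 6),
                                            (byte) (0x80 | cp & 0x3F)};
        if (cp < 0x10000) return new byte[]{(byte) (0xE0 | cp >> 12),
                                            (byte) (0x80 | cp >> 6 & 0x3F),
                                            (byte) (0x80 | cp & 0x3F)};
        return new byte[]{(byte) (0xF0 | cp >> 18),
                          (byte) (0x80 | cp >> 12 & 0x3F),
                          (byte) (0x80 | cp >> 6 & 0x3F),
                          (byte) (0x80 | cp & 0x3F)};
    }

    public static void main(String[] args) {
        for (byte b : utf8(0x4E3E)) System.out.printf("%02X ", b); // E4 B8 BE
    }
}
```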
The descriptions on Wikipedia for UTF-8 and UTF-16 are good.
Procedures for your example string:
UTF-8
UTF-8 uses up to 4 bytes to represent Unicode codepoints. For the 1-byte case, use the following pattern:
1-byte UTF-8 = 0xxxxxxx (binary) = 7 bits = 0-7F (hex)
The initial byte of 2-, 3- and 4-byte UTF-8 starts with 2, 3 or 4 one bits, followed by a zero bit. Follow-on bytes always start with the two-bit pattern 10, leaving 6 bits for data:
2-byte UTF-8 = 110xxxxx 10xxxxxx (binary) = 5+6 (11) bits = 80-7FF (hex)
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (binary) = 4+6+6 (16) bits = 800-FFFF (hex)
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary) = 3+6+6+6 (21) bits = 10000-10FFFF (hex)†
† Unicode codepoints are undefined beyond 10FFFF (hex).
Your codepoints are U+006D, U+0416 and U+4E3D requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:
U+006D = 1101101 (binary) = 01101101 (binary) = 6D (hex)
U+0416 = 10000 010110 (binary) = 11010000 10010110 (binary) = D0 96 (hex)
U+4E3D = 0100 111000 111101 (binary) = 11100100 10111000 10111101 (binary) = E4 B8 BD (hex)
Final byte sequence:
6D D0 96 E4 B8 BD
or if nul-terminated strings are desired:
6D D0 96 E4 B8 BD 00
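As a check, the platform encoder produces the same bytes (using \u escapes for the three codepoints from the question):

```java
import java.nio.charset.StandardCharsets;

public class CheckUtf8 {
    public static void main(String[] args) {
        // "mЖ丽" written as escapes: U+006D, U+0416, U+4E3D
        byte[] b = "\u006D\u0416\u4E3D".getBytes(StandardCharsets.UTF_8);
        for (byte x : b) System.out.printf("%02X ", x); // 6D D0 96 E4 B8 BD
    }
}
```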
UTF-16
UTF-16 uses 2 or 4 bytes to represent Unicode codepoints. Algorithm:
U+0000 to U+D7FF uses 2-byte 0000 (hex) to D7FF (hex)
U+D800 to U+DFFF are invalid codepoints, reserved for 4-byte UTF-16
U+E000 to U+FFFF uses 2-byte E000 (hex) to FFFF (hex)
U+10000 to U+10FFFF uses 4-byte UTF-16, encoded as follows:
Subtract 10000 (hex) from the codepoint.
Express the result as 20-bit binary.
Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx (binary) to encode the upper and lower 10 bits into two 16-bit words.
Using your codepoints:
U+006D = 006D (hex)
U+0416 = 0416 (hex)
U+4E3D = 4E3D (hex)
Now, we have one more issue. Some machines store the two bytes of a 16-bit word least significant byte first (so-called little-endian machines) and some store most significant byte first (big-endian machines). UTF-16 uses the codepoint U+FEFF (called the byte order mark or BOM) to help a machine determine if a byte stream contains big- or little-endian UTF-16:
big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E
With nul-termination, U+0000 = 0000 (hex):
big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00
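These sequences can be compared against library output; note, for example, that Java's "UTF-16" charset writes big-endian with a leading BOM, matching the big-endian line above, while UTF-16LE and UTF-16BE write no BOM. A small check:

```java
import java.nio.charset.StandardCharsets;

public class CheckUtf16 {
    public static void main(String[] args) {
        // Java's "UTF-16" emits a big-endian stream with a leading BOM (FE FF).
        byte[] be = "\u006D\u0416\u4E3D".getBytes(StandardCharsets.UTF_16);
        for (byte x : be) System.out.printf("%02X ", x); // FE FF 00 6D 04 16 4E 3D
        System.out.println();
        // UTF-16LE: byte-swapped, and no BOM is written.
        byte[] le = "\u006D\u0416\u4E3D".getBytes(StandardCharsets.UTF_16LE);
        for (byte x : le) System.out.printf("%02X ", x); // 6D 00 16 04 3D 4E
    }
}
```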
Since your instructor didn't give a codepoint that required 4-byte UTF-16, here's one example:
U+1F031 = 1F031 (hex) - 10000 (hex) = F031 (hex) = 0000111100 0000110001 (binary) =
1101100000111100 1101110000110001 (binary) = D83C DC31 (hex)
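The subtract-and-split procedure is mechanical enough to code directly; a sketch (helper name is mine):

```java
public class ManualUtf16 {
    // Split a supplementary codepoint (>= U+10000) into a UTF-16 surrogate pair.
    static char[] surrogatePair(int cp) {
        int v = cp - 0x10000;                 // leaves a 20-bit value
        return new char[]{
            (char) (0xD800 | v >>> 10),       // high surrogate: upper 10 bits
            (char) (0xDC00 | (v & 0x3FF))     // low surrogate: lower 10 bits
        };
    }

    public static void main(String[] args) {
        for (char c : surrogatePair(0x1F031)) System.out.printf("%04X ", (int) c); // D83C DC31
    }
}
```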
The following program will do the necessary work. It may not be "manual" enough for your purposes, but at a minimum you can check your work.
#!/usr/bin/perl
use 5.012;
use strict;
use utf8;
use autodie;
use warnings;
use warnings qw< FATAL utf8 >;
no warnings qw< uninitialized >;
use open qw< :std :utf8 >;
use charnames qw< :full >;
use feature qw< unicode_strings >;
use Encode qw< encode decode >;
use Unicode::Normalize qw< NFD NFC >;
my ($x) = "mЖ丽";
open(U8,">:encoding(utf8)","/tmp/utf8-out");
print U8 $x;
close(U8);
open(U16,">:encoding(utf16)","/tmp/utf16-out");
print U16 $x;
close(U16);
system("od -t x1 /tmp/utf8-out");
my $u8 = encode("utf-8",$x);
print "utf-8: 0x".unpack("H*",$u8)."\n";
system("od -t x1 /tmp/utf16-out");
my $u16 = encode("utf-16",$x);
print "utf-16: 0x".unpack("H*",$u16)."\n";

How to convert string in UTF-8 to ASCII ignoring errors and removing non ASCII characters

I am new to Scala.
Please advise how to convert a string in UTF-8 to ASCII, ignoring errors and removing non-ASCII characters from the output.
For example, how to remove the non-ASCII character \uc382 from the result string "hello���", so that "hello" is printed in the output.
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If you had text in UTF-8 as bytes that is now in a String then it was converted.
If you have text in a String and you want it in ASCII as bytes, you can convert it later.
It seems that you just want to filter for only the UTF-16 code units for the C0 Controls and Basic Latin codepoints. Fortunately, such codepoints take only one code unit so we can filter them directly without converting them to codepoints.
"hello\uC382"
  .filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
  .getBytes(StandardCharsets.US_ASCII)
  .foreach(println)
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, use the encoder's ability to ignore characters that are not present in the target Charset. An encoder requires a bit more wrapping and unwrapping (the API design is based on streaming and reusing the buffer within the same stream, and even across streams). So, with ISO_8859_1 as an example:
// needs java.nio.CharBuffer and java.nio.charset.{StandardCharsets, CodingErrorAction}
val encoder = StandardCharsets.ISO_8859_1
  .newEncoder()
  .onMalformedInput(CodingErrorAction.IGNORE)
  .onUnmappableCharacter(CodingErrorAction.IGNORE)
val string = "ñhello\uc382"
println(string)
val chars = CharBuffer.allocate(string.length()).put(string)
chars.rewind()
val buffer = encoder.encode(chars)
val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes) // prints the array reference, not its contents
bytes.foreach(println)
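For comparison, here is the same IGNORE technique in plain Java, since the Scala code above calls straight into java.nio (class and method names here are mine, and US-ASCII is used to match the original question):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ToAscii {
    // Drop every character the target charset cannot represent.
    static String asciiOnly(String s) {
        CharsetEncoder enc = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            ByteBuffer out = enc.encode(CharBuffer.wrap(s));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return new String(bytes, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // unreachable with IGNORE
        }
    }

    public static void main(String[] args) {
        System.out.println(asciiOnly("hello\uC382")); // hello
    }
}
```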

ASCII characters converted to 2 hexadecimal numbers

Could someone tell me why characters from the extended ASCII table are being converted to 2 hexadecimal numbers instead of 1? For example:
a = 61
â = C3 A2 (even though it should be normally encoded as E2)
This is "Hex UTF-8 bytes": the tool shows the character's UTF-8 encoding, not its single-byte Latin-1 code.
up to U+007F (127) -> 1 byte
up to U+07FF (2,047) -> 2 bytes
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A2&mode=char
http://unicode.mayastudios.com/examples/utf8.html
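A quick way to see both encodings side by side (illustrative Java sketch; \u00E2 is â):

```java
import java.nio.charset.StandardCharsets;

public class TwoBytes {
    public static void main(String[] args) {
        byte[] utf8 = "\u00E2".getBytes(StandardCharsets.UTF_8);        // two bytes
        byte[] latin1 = "\u00E2".getBytes(StandardCharsets.ISO_8859_1); // one byte
        for (byte b : utf8) System.out.printf("%02X ", b);   // C3 A2
        System.out.println();
        for (byte b : latin1) System.out.printf("%02X ", b); // E2
    }
}
```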

Swift: Check if first character of String is ASCII or a letter with an accent

In Swift, how do I check whether the first character of a String is ASCII or a letter with an accent? Wider characters, like emoticons & logograms, don't count.
I'm trying to copy how the iPhone Messages app shows a person's initials beside their message in a group chat. But, it doesn't include an initial if it's an emoticon or a Chinese character for example.
I see decomposableCharacterSet & nonBaseCharacterSet, but I'm not sure if those are what I want.
There are many Unicode characters with an accent.
Is this, for you, a character with an accent?
Ê
Ð
Ï
Ḝ
Ṹ
é⃝
In Unicode there are combining characters, so two Unicode scalars can appear as one character:
let eAcute: Character = "\u{E9}" // é
let combinedEAcute: Character = "\u{65}\u{301}" // e followed by ́
// eAcute is é, combinedEAcute is é
To Swift, the two Characters are the same!
Good reference is here.
If you want to know the code units of the Characters in the String, you can use the utf8 or utf16 properties. They are different!
let characterString: String = "abc"
for codeUnit in characterString.utf8 {
    print("\(codeUnit) ")
}
// output is decimal numbers: 97 98 99
// output for only é: 195 169 (C3 A9, the UTF-8 bytes of the precomposed é)
Then you could check for ASCII alphabet A-Z as 65-90 and a-z as 97-122.
And then check for the standard grave and acute accents:
À 192
Á 193
È 200
É 201
à 224
á 225
è 232
é 233
... and the combined ones and everything you like.
But there are symbols that look like a Latin letter with an accent yet don't have the same meaning!
You should make sure that only the characters you want, with the correct linguistic meaning, are accepted.
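The question is Swift-specific, but the underlying idea (decompose the character, then test whether its base letter is ASCII) can be sketched in Java with java.text.Normalizer. The helper name is mine, and the edge cases the answer warns about still apply: for example, Ð has no decomposition and is rejected here.

```java
import java.text.Normalizer;

public class FirstCharCheck {
    // True for plain ASCII and for letters whose NFD base character is an ASCII
    // letter (e.g. é decomposes to e + combining acute); false for CJK, emoji, etc.
    static boolean isAsciiOrAccentedLatin(int cp) {
        if (cp < 128) return true;
        String nfd = Normalizer.normalize(new String(Character.toChars(cp)),
                                          Normalizer.Form.NFD);
        int base = nfd.codePointAt(0);
        return base < 128 && Character.isLetter(base);
    }

    public static void main(String[] args) {
        System.out.println(isAsciiOrAccentedLatin('m'));    // true
        System.out.println(isAsciiOrAccentedLatin(0xE9));   // true  (é)
        System.out.println(isAsciiOrAccentedLatin(0x4E3D)); // false (丽)
    }
}
```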
