Swift: Check if first character of String is ASCII or a letter with an accent

In Swift, how do I check whether the first character of a String is ASCII or a letter with an accent? Wider characters, like emoticons & logograms, don't count.
I'm trying to copy how the iPhone Messages app shows a person's initials beside their message in a group chat. But it doesn't show an initial if it would be an emoticon or a Chinese character, for example.
I see decomposableCharacterSet & nonBaseCharacterSet, but I'm not sure if those are what I want.

There are many Unicode characters with an accent.
Which of these would you count as a character with an accent?
Ê
Ð
Ï
Ḝ
Ṹ
é⃝
In Unicode there are combined characters, where two Unicode code points form one visible character:
let eAcute: Character = "\u{E9}" // é
let combinedEAcute: Character = "\u{65}\u{301}" // e followed by ́
// eAcute is é, combinedEAcute is é
For Swift, these two Characters are the same!
Good reference is here.
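A quick playground check (purely illustrative) confirms that the two forms compare as equal even though they contain a different number of Unicode scalars:
let single: Character = "\u{E9}"           // é as one scalar
let combined: Character = "\u{65}\u{301}"  // e + combining acute
print(single == combined)                  // true: Characters compare by canonical equivalence
print(String(single).unicodeScalars.count, String(combined).unicodeScalars.count) // 1 2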
If you want to see the code units of the Characters in a String, you can use the utf8 or utf16 view. They are different!
let characterString: String = "abc"
for character in characterString.utf8 {
    print("\(character) ", terminator: "")
}
// output is the decimal UTF-8 code units: 97 98 99
// for "é" as the single code point U+00E9 the output is: 195 169
// for the combined form "e" + U+0301 it would be: 101 204 129
Then you could check for the ASCII alphabet: A-Z is 65-90 and a-z is 97-122.
And then check for the standard letters with grave and acute accents:
À 192
Á 193
È 200
É 201
à 224
á 225
è 232
é 233
... and the combined ones and everything you like.
But there are symbols that look like a Latin letter with an accent yet don't have the same meaning!
You should make sure that only the characters you actually want, with the correct linguistic meaning, are accepted.
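Putting this together, here is a minimal sketch of the check the question asks for, built on decomposedStringWithCanonicalMapping and nonBaseCharacters; the function name and the exact acceptance rules are assumptions to adapt, not a canonical solution:
import Foundation

// Accept the first character only if it is an ASCII letter, optionally
// followed (after NFD decomposition) by combining marks such as accents.
func isAcceptableInitial(_ s: String) -> Bool {
    guard let first = s.first else { return false }
    // Decompose (NFD), so "é" becomes "e" followed by a combining acute.
    let decomposed = String(first).decomposedStringWithCanonicalMapping
    guard let base = decomposed.unicodeScalars.first else { return false }
    // The base scalar must be A-Z (65-90) or a-z (97-122) ...
    let isAsciiLetter = (65...90).contains(base.value) || (97...122).contains(base.value)
    // ... and everything after it must be a combining (non-base) mark.
    let marks = decomposed.unicodeScalars.dropFirst()
    return isAsciiLetter && marks.allSatisfy({ CharacterSet.nonBaseCharacters.contains($0) })
}

isAcceptableInitial("Émile")   // true  (É decomposes to E + U+0301)
isAcceptableInitial("张伟")     // false (logogram)
isAcceptableInitial("😀 hi")    // false (emoji)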

Related

Detect encoding of text

I want to know the encoding of the text below:
€Mƒ€ƒ. 3&# ÿÿÿ € `y "&A 0ÿ ÿ ÿÿÿÿÿÿÿÿ3‚Pq3b!#o58D2bŸ "&A0 "&BX!38(&sip:+33825071071#ims.mnc001.mcc208.3gp€=€Q€€Mqbs7108613c23tcc43c03t86ohe83t87nhkamghddrn2khdc5rs6nfpaf3j8l2tb92kc7r477n3o1J)K ÿÿ€Uÿÿÿÿ "&B ÿ€€2IEEE-802.11;country=FR;i-wlan-node-id=ffffffffffff€ €028B9D90486C8-368f-4a9ec700-f2864d-635938f2-d93969-1-11€€INVITE[0\D020€Mƒ€ƒ. 3&# ÿÿÿ € `y "&EW ÿ ÿ ÿÿÿÿÿÿÿÿ3‚P3?3b!#o58D2bŸ "&ED0 "&G!38(&sip:+33825033033#ims.mnc001.mcc208.3gp€=€Q€€Mfks7108613c23tcc43c03t86ohe83t87nhkamghddrn2khdc5rs6nfpaf3j8l2tb92kc7rova64o1J)K ÿÿ€Uÿÿÿÿ "&F' ÿ€€2IEEE-802.11;country=FR;i-wlan-node-id=ffffffffffff€ €028B9D90486C8-368f-4a9ec700-f29a30-63593a08-46e5f9-1-11€€INVITE[0\D020
I received it in a file with a .trf extension containing TLV (tag-length-value) data.
I want to know the encoding of this text so that I can decode it.

Blocking other type keyboards

I have a text field, and the checking is done in the shouldChangeCharactersIn delegate method.
This is working well in most cases.
First I have a CharacterSet of allowed characters:
// Check characters
var allowedCharacters: CharacterSet? = CharacterSet.decimalDigits.union(.letters)
allowedCharacters = allowedCharacters?.union(CharacterSet(charactersIn: "àÀáÁâÂãÃäÄåāÅæèÈéÉêÊëËìÌíÍîÎïÏòÒóÓöÖôÔõÕøØùÙúÚûÛüÜýÝÿçÇñÑ"))
allowedCharacters = allowedCharacters?.union(CharacterSet(charactersIn: " ,.:;##%*+&_=<>!?\r\n'(){}[]/-"))
The allowedCharacters variable now holds all the characters that I wish to allow in my app. The trimmingCharacters(in:) call below strips the allowed characters from both ends of the string; if anything is left over, the string contained a disallowed character.
guard string.trimmingCharacters(in: allowedCharacters!) == "" else { return false }
This seems to be working okay, but when the user switches the keyboard to Turkish or Chinese, it is possible to enter characters that are not in the list above.
Eg. from the Turkish keyboard: ğ and ş
And from Chinese keyboard: ㄎ ㄕ and ㄨ
I want to block all characters not in the allowed CharacterSet. How can I prevent the user from inputting these characters?
The letters character set includes every Unicode scalar whose general category starts with "L" or "M".
Well, ğ and ş are both in the category Ll (Lowercase Letter), and the Bopomofo symbols ㄎ ㄕ and ㄨ are all in the category Lo (Other Letter). All these characters are in the letters character set!
Note that the same goes for decimalDigits, which includes everything in the category Nd. This includes the Arabic-Indic digits ٠١٢٣٤٥٦٧٨٩, for example.
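You can verify this quickly (a purely illustrative check):
import Foundation

print(CharacterSet.letters.contains("ğ"))        // true
print(CharacterSet.letters.contains("ㄎ"))        // true
print(CharacterSet.decimalDigits.contains("٣"))  // true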
You seem to have a rather specific set of characters that you want to allow, so you should just write that out explicitly:
let allowedCharacters = CharacterSet(charactersIn: "a"..."z") // assuming these chars are what you want
    .union(.init(charactersIn: "A"..."Z"))
    .union(.init(charactersIn: "0"..."9"))
    .union(.init(charactersIn: "àÀáÁâÂãÃäÄåāÅæèÈéÉêÊëËìÌíÍîÎïÏòÒóÓöÖôÔõÕøØùÙúÚûÛüÜýÝÿçÇñÑ"))
    .union(.init(charactersIn: " ,.:;##%*+&_=<>!?\r\n'(){}[]/-"))
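As a sketch of how that set might be wired into the question's shouldChangeCharactersIn check (the class name and the reduced character set here are assumptions; substitute the full set built above):
import UIKit

final class FormViewController: UIViewController, UITextFieldDelegate {
    // Explicit allowed set, abbreviated here to plain letters and digits.
    private let allowedCharacters = CharacterSet(charactersIn: "a"..."z")
        .union(.init(charactersIn: "A"..."Z"))
        .union(.init(charactersIn: "0"..."9"))

    func textField(_ textField: UITextField,
                   shouldChangeCharactersIn range: NSRange,
                   replacementString string: String) -> Bool {
        // Reject the edit if the replacement contains any scalar outside the allowed set.
        return string.rangeOfCharacter(from: allowedCharacters.inverted) == nil
    }
}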

What is the difference between ö and ö?

The following characters look alike, but they are not the same, and I cannot see their difference visually. Could anybody tell me what their difference is? Why are there two Unicode characters that are so similar?
$ xxd <<< ö
00000000: c3b6 0a ...
$ xxd <<< ö
00000000: 6fcc 880a o...
The first is a single Unicode code point, while the second is two Unicode code points. They are two forms of the same glyph (examples in Python):
import unicodedata as ud
o1 = 'ö' # '\xf6'
o2 = 'ö' # 'o\u0308'
for c in o1:
    print(f'U+{ord(c):04X} {ud.name(c)}')
print()
for c in o2:
    print(f'U+{ord(c):04X} {ud.name(c)}')
U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
U+006F LATIN SMALL LETTER O
U+0308 COMBINING DIAERESIS
Ensure the two strings are in the same normalization form (either composed or decomposed) for comparison:
print(ud.normalize('NFC',o1) == ud.normalize('NFC',o2))
print(ud.normalize('NFD',o1) == ud.normalize('NFD',o2))
True
True
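Incidentally, in Swift (the language of the first question above) String comparison already uses canonical equivalence, so the two forms compare equal even though their bytes differ; a small illustrative check:
let composed = "\u{F6}"      // ö as a single code point (U+00F6)
let decomposed = "o\u{308}"  // o followed by U+0308 COMBINING DIAERESIS
print(composed == decomposed)                        // true: canonical equivalence
print(Array(composed.utf8), Array(decomposed.utf8))  // [195, 182] [111, 204, 136]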

Tika is not detecting plain ascii input

We have a byte sequence as input and we need to check whether it is UTF-8, plain ASCII, or something else. In other words, we have to reject ISO-8859-x (Latin-x) or otherwise encoded input.
Our first choice was Tika, but we have a problem with it: plain ASCII input (input with no accented chars at all) is often detected as ISO-8859-2 or ISO-8859-1!
This is the problematic part:
CharsetDetector detector = new CharsetDetector();
String ascii = "Only ascii Visible:a;Invisible:GUID\nX;XXddd\n";
detector.setText(ascii.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
String ascii2 = "Only ascii plain english text";
detector.setText(ascii2.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
String ascii3 = "this is ISO-8859-2 do not know why";
detector.setText(ascii3.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
String ascii4 = "this is UTF-8 but tell me why o why maybe sdlkfjlksdjlkfjlksdjflkjlskdjflkjsdjkflkdsjlkfjldsjlkfjldkjkfljdlkjsdfhjshdkjfhjksdhjfkksdfksjdfhkjsdhj";
detector.setText(ascii4.getBytes());
System.out.println("detected charset: " + detector.detect().getName());
This is the output
detected charset: ISO-8859-2
detected charset: ISO-8859-1
detected charset: ISO-8859-2
detected charset: UTF-8
How should I use Tika to get sensible results?
Ps: Here is a mini demo: https://github.com/riskop/tikaproblem
There's a detectAll() method on the detector; with it, one can get all the encodings Tika considers a match for the input. I can solve my problem by following this rule: if UTF-8 is among the matching encodings, the input is accepted (because it is possibly UTF-8); otherwise the input is rejected as not UTF-8.
I understand that Tika must use heuristics, and I understand that there are inputs which can be valid UTF-8 or other encoded texts at the same time.
So for example
bytes = "Only ascii plain english text".getBytes("UTF-8");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results in:
Match of ISO-8859-1 in nl with confidence 40
Match of ISO-8859-2 in ro with confidence 30
Match of UTF-8 with confidence 15
Match of ISO-8859-9 in tr with confidence 10
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
This is usable in my case: although the two "best" matches are ISO-8859-1 and ISO-8859-2, the third best is UTF-8, so I can accept the input.
It also seems to work for invalid UTF-8 input.
For example, the bytes 0xC3, 0xA9, 0xA9:
bytes = new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}; // illegal utf-8: Cx leading byte followed by two continuation bytes
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results:
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Which is good, there's no UTF-8 among the matches.
A more likely input is text with accented chars encoded in a non-UTF-8 encoding:
bytes = "this is somethingó not utf8 é".getBytes("ISO-8859-2");
printCharsetArray(new CharsetDetector().setText(bytes).detectAll());
results:
Match of ISO-8859-2 in hu with confidence 31
Match of ISO-8859-1 in en with confidence 31
Match of KOI8-R in ru with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10
Which is good, because UTF-8 is not among the results.

What is this character separator: ^_?

I dumped a SQLite3 table (from an Anki deck) to a CSV file. I found that the sfld column is separated by ^_.
What is this character or escape character in Unicode?
It's a control-underscore (Control-_), or 0x1F, the UNIT SEPARATOR character from the ASCII (and ISO 8859-x and Unicode) control characters.
The upper-case letters in ASCII, ISO 8859-x and Unicode have code points (all numbers in hex):
41 U+0041 LATIN CAPITAL LETTER A
…
5A U+005A LATIN CAPITAL LETTER Z
The subsequent characters are:
5B U+005B LEFT SQUARE BRACKET
5C U+005C REVERSE SOLIDUS
5D U+005D RIGHT SQUARE BRACKET
5E U+005E CIRCUMFLEX ACCENT
5F U+005F LOW LINE
Control characters like Control-A have a code 0x40 less than the corresponding printable character (the upper-case letters and the punctuation that follows them), so you have:
01 U+0001 START OF HEADING (aka SOH or Control-A)
…
1A U+001A SUBSTITUTE (aka SUB or Control-Z)
and then you get:
1B U+001B ESCAPE (aka ESC or Control-[)
1C U+001C FILE SEPARATOR (aka FS or Control-\)
1D U+001D GROUP SEPARATOR (aka GS or Control-])
1E U+001E RECORD SEPARATOR (aka RS or Control-^)
1F U+001F UNIT SEPARATOR (aka US or Control-_)
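That 0x40 offset is easy to check; a tiny Swift illustration:
// The ^X caret notation means "the character whose code is X's code minus 0x40".
let printables: [Character] = ["A", "Z", "[", "\\", "]", "^", "_"]
for ch in printables {
    let control = ch.asciiValue! - 0x40   // e.g. 'A' (0x41) -> 0x01
    print("Control-\(ch) = 0x\(String(control, radix: 16, uppercase: true))")
}
// last line printed: Control-_ = 0x1F, the UNIT SEPARATOR that shows up as ^_ in the dump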