Example in Python:
>>> import unicodedata
>>> s = 'ı̇'
>>> len(s)
2
>>> list(s)
['ı', '̇']
>>> print(", ".join(map(unicodedata.name, s)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
>>> normalized = unicodedata.normalize('NFC', s)
>>> print(", ".join(map(unicodedata.name, normalized)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
As you can see, NFC normalization does not compose the dotless i + a dot to a normal i. Is there a rationale for this? Is this an oversight? Or is it not included because NFC is supposed to be the perfect inverse of NFD (and one wouldn't want to decompose i to dotless i + dot)?
While NFC isn't the "perfect inverse" of NFD, this follows from NFC being defined in terms of the same decomposition mappings as NFD. NFC is basically defined as NFD followed by recomposing certain NFD decomposition pairs. Since there's no decomposition mapping for LATIN SMALL LETTER I, it can never be the result of a recomposition.
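You can check this with the `unicodedata` module: `i` has no canonical decomposition mapping at all, whereas a character like `é` does, so only the latter can arise from recomposition:

```python
import unicodedata

# LATIN SMALL LETTER I has no canonical decomposition mapping at all...
print(repr(unicodedata.decomposition('i')))     # '' (empty: no mapping)

# ...whereas LATIN SMALL LETTER E WITH ACUTE decomposes to e + U+0301
print(unicodedata.decomposition('\u00e9'))      # 0065 0301

# So NFC can recompose e + combining acute back into é...
print(unicodedata.normalize('NFC', 'e\u0301') == '\u00e9')   # True

# ...but dotless i + combining dot above is left as two code points
print(len(unicodedata.normalize('NFC', '\u0131\u0307')))     # 2
```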
I was just reading a post and saw some odd text effects, however I cannot locate how it is achieved or what it is called:
P̢̢̲̭̘̣̪͉͞͞h̴̛̫͉͖̜͙̳͎̕͞͠'̶̀͢҉̯̞̹͈ṉ̶̘̠̯̬̭̖̳͘͞ģ̵̛͠҉̰̝͇̩͍̗͍̘̫͈̺̭̥͉l̨͍̘͔̰͔̖͍̹̠̭̱̰̖͙̦̦͎̕͟u̢̡҉̲̭̲̺̮̖͖͖i̴̢̹̳͉͎̥̪̜͎̼̣̦̖̻͈̖͉͚ͅ ̵͏͇̗̭ͅm̶̨͍̤̪̱͇̤̬̥̥͔̼͍̠̼͕g̷̷̰̩͙̪̫͉̺̯͘͟͠ļ̶̭͇̘̮̕͢ẃ̵̸̷҉͕̬̠̥̤͖̙̲͇̼̹'̺̩̖̟̣͈̖͙̤̫̰̗̯̀͡ń̷̴̶̰̮̺͔̼̺̹̘̟a̷̰̪͙͇̤͓̤̭͎̦͕̻f͏̨͙̰̘͔̟̜̠͈̯̻͕̖̳̝̝́͘ͅḩ̴̛͉͉̲͇̠͙̣̩͙̩͚̮̼̺ͅ ̧̛̟͓̤͇̯͍̫͖͎͈̫̳͓̞͘Ç͘͏͈̹̠̙͎̳̯͚͔̼͙̻͔͖̲̩̹̕ͅt͏̖̲̤̫̤̫̼̪̥̠͙͚͍̭́ͅḩ̡̲͈̫̯͚͉̱͍̳͝ù̧͙̭̙̻̲̙͚͔̲̬͚͢͝͡ḻ̴̵̨̹͉͙̟̯̞̠͔̦̝̩͜h̶̼̜̦͖͍͎͍̕ṷ̴̶̢͙̗̬͇̯̞̗̰̣̬̥̲̣̦ ̵̲͍̩̭̩̗͈͚͟͝R͏̛͘͟҉̫̝̞̪̣̪̻̤̼͖̪͎'̛̯͚͎̳͎̼͓̘͉͢l͟҉̵̘͈͙̣̹̜͍͎̬̺̹̪̜̀y͏͓̞̬͙̥̞̦͎͖̞͖͎̖̀e̶̵̡̺͉̯̭̣̗h͇̺͇̖̼̻̟͓͜͟͜͞ͅ ̴̷̡̨̪͍̙̳̞̭̙̫̯̘͚͇͚̼͙͟w̧̮̜̯̭̘͈̫̳̖̕͜͠g̢̨̗͖̬̠͎͓̱̞͓̭̯̺͕̭̯̦ͅa̴̠̘̬̩͍͜ͅh̵̷̨̜̻͔̖͈̤͈̩͔͈͇̩̞̲̜̩͍̺'̸̨͇̞̜͈͟n̨͟͞҉̤͚͎͇̣̺͚̻̖͖́ͅà̻͉̙̲̲̞͘͝ģ̙̗̙͓̜̣͔̥̫͟͡l̴̨̨̼͚̫̞̙̳͙͢͟ ̢̦͚̲͇̞̺̗̫͇f̸̸̫̠͖͙̜͉̲͖͓̭͇̦̭̩̲͡͠ḩ̸̲̤͍̖̻̣̝̼́̕͝ͅt̴͝҉҉̵͔̮̞̪á̢̕͢͏̗̯̗̙͙͉̪͓͙̣̰̣g͏̶̡͓̤͍͖̜̠̜ͅn̴̶̛̝̼͉̠̻͓
Don't worry though, unless thousands read it I think we are safe.
It's called Zalgo text.
You can Google for an online generator and use it:
TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
As a side note, don't try to parse HTML with RegEx.
You can get a clue if you pick a sample and submit it to Unicode character inspector:
C U+0043 LATIN CAPITAL LETTER C Lu Basic Latin
̧ U+0327 COMBINING CEDILLA Mn Combining Diacritical Marks
͘ U+0358 COMBINING DOT ABOVE RIGHT Mn Combining Diacritical Marks
U+034F COMBINING GRAPHEME JOINER Mn Combining Diacritical Marks
̕ U+0315 COMBINING COMMA ABOVE RIGHT Mn Combining Diacritical Marks
͈ U+0348 COMBINING DOUBLE VERTICAL LINE BELOW Mn Combining Diacritical Marks
[…]
… where Lu stands for Unicode Character Category 'Letter, Uppercase' and Mn stands for Unicode Character Category 'Mark, Nonspacing'.
In short, they're just regular letters attached to all sorts of combining diacritics, thanks to the magic of Unicode. It abuses the fact that é can also be written as e + ´ for entertainment purposes.
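As an illustration (not any particular generator's actual algorithm), you can produce the same effect yourself by stacking randomly chosen combining marks from the Combining Diacritical Marks block (U+0300–U+036F) onto each letter:

```python
import random

def zalgo(text, intensity=5, seed=42):
    """Stack random combining diacritics (U+0300-U+036F) onto each character."""
    rng = random.Random(seed)          # seeded so the output is reproducible
    out = []
    for ch in text:
        out.append(ch)
        # append between 1 and `intensity` combining marks after each character
        out.extend(chr(rng.randrange(0x0300, 0x0370))
                   for _ in range(rng.randrange(1, intensity + 1)))
    return ''.join(out)

print(zalgo("He comes"))
```

Stripping the combining marks back out recovers the original text, since the base letters are untouched.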
If you look at the Unicode Block for Mathematical Alphanumeric Symbols you notice that the MATHEMATICAL DOUBLE-STRUCK CAPITAL C is missing. And it is not the only one. Why? What is the point of having DOUBLE-STRUCK if you don't have all 26?
The code chart (PDF) for Mathematical Alphanumeric Symbols contains the following explanation:
Double-struck symbols already encoded in the Letterlike
Symbols block and omitted here to avoid duplicate encoding.
Here “and” is apparently to be read as “are”. Anyway, the point is that the Letterlike Symbols block already contains the double-struck C (as well as a few other double-struck letters). This reflects their relatively common use in mathematics (e.g. the use of ℂ as an alternative to C to denote the set of complex numbers) and their presence in old character codes. The block does not have enough code points for adding all the double-struck letters, so the additions were made elsewhere. To keep the allocation natural, holes (reserved code points) were left there.
The code chart contains cross references to the characters allocated elsewhere, e.g. for the reserved code point 1D53A it has the comment “→ 2102 ℂ double-struck capital c”.
The ℂ for example is in another block, as said in the article:
The characters with pink background are located in other Unicode blocks, such as Letterlike symbols.
The ℂ specifically is in Letterlike symbols:
ℂ Double-Struck Capital C U+2102
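A quick check in Python confirms the hole: U+2102 has a name in Letterlike Symbols, the neighbouring mathematical double-struck letters exist, but U+1D53A itself is a reserved, unnamed code point.

```python
import unicodedata

print(unicodedata.name('\u2102'))        # DOUBLE-STRUCK CAPITAL C (Letterlike Symbols)
print(unicodedata.name('\U0001D538'))    # MATHEMATICAL DOUBLE-STRUCK CAPITAL A

# U+1D53A, where the mathematical double-struck C "should" sit, is reserved:
try:
    unicodedata.name('\U0001D53A')
except ValueError:
    print('U+1D53A has no name: reserved code point')
```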
Given a NFC normalized string, applying full case folding to that string, can I assume that the result is NFC normalized too?
I don't understand what the Unicode standard is trying to tell me in this quote:
Normalization also interacts with case folding. For any string X, let
Q(X) = NFC(toCasefold(NFD(X))). In other words, Q(X) is the result
of normalizing X, then case folding the result, then putting the
result into Normalization Form NFC format. Because of the way
normalization and case folding are defined, Q(Q(X)) = Q(X).
Repeatedly applying Q does not change the result; case folding is
closed under canonical normalization for either Normalization Form
NFC or NFD.
A Unicode string might not be in NFC after case folding. An example is U+00DF (LATIN SMALL LETTER SHARP S) followed by U+0301 (COMBINING ACUTE ACCENT).
X = U+00DF U+0301
NFC(X) = U+00DF U+0301
toCasefold(NFC(X)) = U+0073 U+0073 U+0301
NFC(toCasefold(NFC(X))) = U+0073 U+015B
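This is easy to reproduce in Python, where `str.casefold()` implements full case folding:

```python
import unicodedata

x = '\u00df\u0301'                       # LATIN SMALL LETTER SHARP S + COMBINING ACUTE
assert unicodedata.normalize('NFC', x) == x     # x is already in NFC

folded = x.casefold()                    # full case folding maps ß -> ss
print([f'U+{ord(c):04X}' for c in folded])      # ['U+0073', 'U+0073', 'U+0301']

renorm = unicodedata.normalize('NFC', folded)
print([f'U+{ord(c):04X}' for c in renorm])      # ['U+0073', 'U+015B']

# So case folding an NFC string can yield a result that is no longer in NFC:
print(folded == renorm)                  # False
```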
You have asked two questions:
Question 1: Is toCasefold(NFC(X)) binary equal to NFC(toCasefold(NFC(X)))?
The standard doesn't explicitly answer this question. (In fact the answer is no: as the U+00DF example above shows, case folding can take an NFC string out of NFC.)
Question 2: What is the Unicode standard telling me in the quote?
The standard is only saying it is not necessary to do case folding again after canonical normalization. In other words, canonical normalization (to NFC or NFD form) does not change the case of any characters from uppercase to lowercase or vice versa. This doesn't answer your first question.
It is not saying whether or not it is necessary to do canonical normalization again after case folding.
Is there an existing subset of the alphanumerics that is easier to read? In particular, is there a subset that has fewer characters that are visually ambiguous, and by removing (or equating) certain characters we reduce human error?
I know "visually ambiguous" is somewhat waffly of an expression, but it is fairly evident that D, O and 0 are all similar, and 1 and I are also similar. I would like to maximize the size of the set of alpha-numerics, but minimize the number of characters that are likely to be misinterpreted.
The only precedent I am aware of for such a set is the Canada Postal code system that removes the letters D, F, I, O, Q, and U, and that subset was created to aid the postal system's OCR process.
My initial thought is to use only capital letters and numbers as follows:
A
B = 8
C = G
D = 0 = O = Q
E = F
H
I = J = L = T = 1 = 7
K = X
M
N
P
R
S = 5
U = V = Y
W
Z = 2
3
4
6
9
This problem may be difficult to separate from the given type face. The distinctiveness of the characters in the chosen typeface could significantly affect the potential visual ambiguity of any two characters, but I expect that in most modern typefaces the above characters that are equated will have a similar enough appearance to warrant equating them.
I would be grateful for thoughts on the above – are the above equations suitable, or perhaps are there more characters that should be equated? Would lowercase characters be more suitable?
I needed a replacement for hexadecimal (base 16) for similar reasons (e.g. for encoding a key, etc.), the best I could come up with is the following set of 16 characters, which can be used as a replacement for hexadecimal:
0 1 2 3 4 5 6 7 8 9 A B C D E F Hexadecimal
H M N 3 4 P 6 7 R 9 T W C X Y F Replacement
In the replacement set, we consider the following:
All characters used have major distinguishing features that would only be omitted in a truly awful font.
Vowels A E I O U omitted to avoid accidentally spelling words.
Sets of characters that could potentially be very similar or identical in some fonts are avoided completely (none of the characters in any set are used at all):
0 O D Q
1 I L J
8 B
5 S
2 Z
By avoiding these characters completely, the hope is that the user will enter the correct characters, rather than trying to correct mis-entered characters.
For sets of less similar but potentially confusing characters, we only use one character in each set, hopefully the most distinctive:
Y U V
Here Y is used, since it always has the lower vertical section, and a serif in serif fonts
C G
Here C is used, since it seems less likely that a C would be entered as G, than vice versa
X K
Here X is used, since it is more consistent in most fonts
F E
Here F is used, since it is not a vowel
In the case of these similar sets, entry of any character in the set could be automatically converted to the one that is actually used (the first one listed in each set). Note that E must not be automatically converted to F if hexadecimal input might be used (see below).
Note that there are still similar-sounding letters in the replacement set; this is pretty much unavoidable. When reading aloud, a phonetic alphabet should be used.
Where characters that are also present in standard hexadecimal are used in the replacement set, they are used for the same base-16 value. In theory mixed input of hexadecimal and replacement characters could be supported, provided E is not automatically converted to F.
Since this is just a character replacement, it should be easy to convert to/from hexadecimal.
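Since it is a character-for-character substitution, the conversion is a one-liner with `str.translate`. A sketch of the idea (the function names are mine, the alphabets are those given above):

```python
HEX = '0123456789ABCDEF'
REP = 'HMN34P67R9TWCXYF'   # replacement alphabet, same base-16 values position by position

to_rep = str.maketrans(HEX, REP)
to_hex = str.maketrans(REP, HEX)

def hex_to_rep(h):
    """Encode a hexadecimal string into the replacement alphabet."""
    return h.upper().translate(to_rep)

def rep_to_hex(r):
    """Decode a replacement-alphabet string back to hexadecimal."""
    return r.upper().translate(to_hex)

print(hex_to_rep('C0FFEE'))   # CHFFYY
print(rep_to_hex('CHFFYY'))   # C0FFEE
```

Note that the characters shared by both alphabets (3 4 6 7 9 C F) map to themselves, so they carry the same value whichever alphabet the user thinks they are typing in.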
Upper case seems best for the "canonical" form for output, although lower case also looks reasonable, except for "h" and "n", which should still be relatively clear in most fonts:
h m n 3 4 p 6 7 r 9 t w c x y f
Input can of course be case-insensitive.
There are several similar systems for base 32; see http://en.wikipedia.org/wiki/Base32. However, these obviously need to introduce more similar-looking characters, in return for an additional 25% more information per character.
Apparently the following set was also used for Windows product keys in base 24, but again has more similar-looking characters:
B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
My set of 23 unambiguous characters is:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
I needed a set of unambiguous characters for user input, and I couldn't find anywhere that others have already produced a character set and set of rules that fit my criteria.
My requirements:
No capitals: this is supposed to be used in URIs, and typed by people who might not have a lot of typing experience, for whom even the shift key can slow them down and cause uncertainty. I also want someone to be able to say "all lowercase" so as to reduce uncertainty, so I want to avoid capital letters.
Few or no vowels: an easy way to avoid creating foul language or surprising words is to simply omit most vowels. I think keeping "e" and "y" is ok.
Resolve ambiguity consistently: I'm open to using some ambiguous characters, so long as I only use one character from each group (e.g., out of lowercase s, uppercase S, and five, I might only use five); that way, on the backend, I can just replace any of these ambiguous characters with the one correct character from their group. So, the input string "3Sh" would be replaced with "35h" before I look up its match in my database.
Only needed to create tokens: I don't need to encode information like base64 or base32 do, so the exact number of characters in my set doesn't really matter, besides my wanting it to be as large as possible. It only needs to be useful for producing random UUID-type id tokens.
Strongly prefer non-ambiguity: I think it's much more costly for someone to enter a token and have something go wrong than it is for someone to have to type out a longer token. There's a tradeoff, of course, but I want to strongly prefer non-ambiguity over brevity.
The confusable groups of characters I identified:
A/4
b/6/G
8/B
c/C
f/F
9/g/q
i/I/1/l/7 - just too ambiguous to use; note that a European "1" can look a lot like many people's "7"
k/K
o/O/0 - just too ambiguous to use
p/P
s/S/5
v/V
w/W
x/X
y/Y
z/Z/2
Unambiguous characters:
I think this leaves only 9 totally unambiguous lowercase/numeric chars, with no vowels:
d,e,h,j,m,n,r,t,3
Adding back in one character from each of those ambiguous groups (and trying to prefer the character that looks most distinct, while avoiding uppercase), there are 23 characters:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
Analysis:
Using the rule of thumb that a UUID with a numerical equivalent range of N possibilities is sufficient to avoid collisions for sqrt(N) instances:
an 8-digit UUID using this character set should be sufficient to avoid collisions for about 300,000 instances
a 16-digit UUID using this character set should be sufficient to avoid collisions for about 80 billion instances.
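To illustrate the scheme (the function names here are my own, not from any library): a token generator over the 23-character set, plus the back-end canonicalization described above, where each ambiguous look-alike is folded onto its group's canonical member. The o/O/0 and i/I/1/l/7 groups are excluded entirely, so they are deliberately not mapped.

```python
import secrets

ALPHABET = 'cdefhjkmnprtvwxy2345689'     # the 23 unambiguous characters above

def make_token(length=8):
    """Generate a random token over the unambiguous alphabet."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

def canonicalize(token):
    """Fold user input onto the canonical alphabet, e.g. '3Sh' -> '35h'."""
    # G and g belong to different groups (b/6/G vs 9/g/q), as do B and b,
    # so resolve the uppercase members before lowercasing:
    token = token.translate(str.maketrans('GB', '68'))
    token = token.lower()                # C -> c, K -> k, S -> s, ...
    # a->4, s->5, b->6, g->9, z->2, q->9
    return token.translate(str.maketrans('asbgzq', '456929'))

print(canonicalize('3Sh'))   # 35h
```

Canonicalizing a freshly generated token is a no-op, since tokens only ever use canonical characters.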
Mainly drawing inspiration from this UX thread, mentioned by #rwb:
Several programs use similar things. The list in your post seems to be very similar to those used in these programs, and I think it should be enough for most purposes. You can always add redundancy (error correction) to "forgive" minor mistakes; this will require you to space out your codes (see Hamming distance), though.
No references are given for the particular method used in deriving the lists, except trial and error with humans (which is great for non-OCR use: your users are humans).
It may make sense to use character grouping (say, groups of 5) to increase context ("first character in the second of 5 groups")
Ambiguity can be eliminated by using complete nouns (from a dictionary with few look-alikes; word-edit-distance may be useful here) instead of characters. People may confuse "1" with "i", but few will confuse "one" with "ice".
Another option is to make your code into a (fake) word that can be read out loud. A markov model may help you there.
If you have the option to use only capitals, I created this set based on characters which users commonly mistyped. Note, however, that this wholly depends on the font the text is read in.
Characters to use: A C D E F G H J K L M N P Q R T U V W X Y 3 4 6 7 9
Characters to avoid:
B similar to 8
I similar to 1
O similar to 0
S similar to 5
Z similar to 2
What you seek is an unambiguous, efficient human-computer code. What I recommend is to encode the entire data with literal (meaningful) words, nouns in particular.
I have been developing software to do just that, as efficiently as possible. I call it WCode. Technically it's just base-1024 encoding, wherein you use words instead of symbols.
Here are the links:
Presentation: https://docs.google.com/presentation/d/1sYiXCWIYAWpKAahrGFZ2p5zJX8uMxPccu-oaGOajrGA/edit
Documentation: https://docs.google.com/folder/d/0B0pxLafSqCjKOWhYSFFGOHd1a2c/edit
Project: https://github.com/San13/WCode (Please wait while I get around to uploading...)
This is a general problem in OCR. For end-to-end solutions where the OCR encoding is controlled, specialised fonts have been developed to solve the "visual ambiguity" issue you mention.
See: http://en.wikipedia.org/wiki/OCR-A_font
As additional information: you may want to know about Base32 encoding, wherein the symbol for the digit '1' is not used, as users may confuse it with the letter 'l'.
Unambiguous looking letters for humans are also unambiguous for optical character recognition (OCR). By removing all pairs of letters that are confusing for OCR, one obtains:
!+2345679:BCDEGHKLQSUZadehiopqstu
See https://www.monperrus.net/martin/store-data-paper
It depends how large you want your set to be. For example, just the set {0, 1} will probably work well. Similarly the set of digits only. But probably you want a set that's roughly half the size of the original set of characters.
I have not done this, but here's a suggestion. Pick a font, pick an initial set of characters, and write some code to do the following. Draw each character to fit into an n-by-n square of black and white pixels, for n = 1 through (say) 10. Cut away any all-white rows and columns from the edge, since we're only interested in the black area. That gives you a list of 10 codes for each character. Measure the distance between any two characters by how many of these codes differ. Estimate what distance is acceptable for your application. Then do a brute-force search for a set of characters which are that far apart.
Basically, use a script to simulate squinting at the characters and see which ones you can still tell apart.
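A toy version of that squint test, with hand-drawn 5×5 glyphs standing in for rendered bitmaps (a real version would rasterize characters from an actual font at several sizes):

```python
# Hand-drawn 5x5 "glyphs" stand in for font-rendered bitmaps.
GLYPHS = {
    'O': ['.###.', '#...#', '#...#', '#...#', '.###.'],
    '0': ['.###.', '#...#', '#...#', '#...#', '.###.'],
    'I': ['..#..', '..#..', '..#..', '..#..', '..#..'],
    'T': ['#####', '..#..', '..#..', '..#..', '..#..'],
}

def crop(rows):
    """Remove all-white rows and columns, keeping only the black area."""
    rows = [r for r in rows if '#' in r]
    cols = [c for c in range(len(rows[0])) if any(r[c] == '#' for r in rows)]
    return [''.join(r[c] for c in cols) for r in rows]

def distance(a, b):
    """Count differing pixels between two cropped glyphs, padded to a common size."""
    ga, gb = crop(GLYPHS[a]), crop(GLYPHS[b])
    h, w = max(len(ga), len(gb)), max(len(ga[0]), len(gb[0]))
    pad = lambda g: [r.ljust(w, '.') for r in g] + ['.' * w] * (h - len(g))
    return sum(x != y for ra, rb in zip(pad(ga), pad(gb)) for x, y in zip(ra, rb))

print(distance('O', '0'))   # 0  -> indistinguishable, keep only one
print(distance('I', 'T'))   # 12 -> far apart, safe to keep both
```

The brute-force search would then keep only characters whose pairwise distances all exceed the threshold chosen for the application.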
Here's some Python I wrote to encode and decode integers using the system of characters described above.
def base20encode(i):
    """Convert a non-negative integer into a base-20 string of unambiguous characters."""
    if not isinstance(i, int):
        raise TypeError('This function must be called on an integer.')
    chars, s = '012345689ACEHKMNPRUW', ''
    while i > 0:
        i, remainder = divmod(i, 20)
        s = chars[remainder] + s
    return s or '0'  # make sure 0 encodes as '0', not ''

def base20decode(s):
    """Fold ambiguous characters onto their unambiguous equivalents, then return the integer the resulting base-20 string represents."""
    if not isinstance(s, str):
        raise TypeError('This function must be called on a string.')
    # map look-alikes: B->8, G->C, D/O/Q->0, F->E, I/J/L/T/7->1, K->X, S->5, V/Y->U, Z->2
    s = s.upper().translate(str.maketrans('BGDOQFIJLT7KSVYZ', '8C000E11111X5UU2'))
    chars, i, exponent = '012345689ACEHKMNPRUW', 0, 1
    for digit in s[::-1]:
        i += chars.index(digit) * exponent
        exponent *= 20
    return i
base20decode(base20encode(10))
base58: 123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz
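That is the Base58 alphabet (used e.g. for Bitcoin addresses): it drops 0, O, I and l precisely because they are confusable. A minimal integer codec over it, for illustration:

```python
B58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

def b58encode_int(i):
    """Encode a non-negative integer in Base58."""
    s = ''
    while i > 0:
        i, r = divmod(i, 58)
        s = B58[r] + s
    return s or B58[0]   # zero encodes as '1', the alphabet's first symbol

def b58decode_int(s):
    """Decode a Base58 string back to an integer."""
    i = 0
    for ch in s:
        i = i * 58 + B58.index(ch)
    return i

print(b58encode_int(255))   # 5Q
```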