I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a character is supposed to move horizontally across a line and stay within a certain "container". Obviously the Zalgo text is moving vertically and doesn't seem to be restricted to any space.
Is this a bug/flaw/exploit/hack in Unicode? Are these individual characters with weird properties? "What" is happening here?
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀ơ̯̗̱̘̮͒̄̀̈ͤ̀͡w͓̲͙͖̥͉̹͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮̙̣͓͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸̤͓̞̱̫ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ ̢͈͙͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx͎̬̠͇̌ͤ̓̂̓͐͐́͋͡ț̗̹̝̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤͍͇̰̄͗ͭ̃͗ͮ̐o̢̯̻̰̼͕̾ͣͬ̽̔̍͟ͅr̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬͙̟̮͕ͤ̌͗ͩ̕͡
The text uses combining characters, also known as combining marks. See section 2.11, Combining Characters, of the Unicode Standard (PDF).
In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box of a given height. Combining marks may be rendered above, below, or inside a base character.
So you can easily construct a character sequence, consisting of a base character and “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning of course, and even a monkey could produce it (e.g., given a keyboard with suitable driver).
And you can mix “combining above” and “combining below” marks.
The sample text in the question starts with:
LATIN CAPITAL LETTER H - H
COMBINING LATIN SMALL LETTER T - ͭ
COMBINING GREEK KORONIS - ̓
COMBINING COMMA ABOVE - ̓
COMBINING DOT ABOVE - ̇
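A breakdown like this can be reproduced by printing the name of every code point in a string. The following is a minimal Python sketch, assuming only the standard `unicodedata` module; `describe` is a name made up for this example, and the sample string is spelled out as escapes for the five code points listed above.

```python
import unicodedata

def describe(text):
    """Return (code point, name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# First five code points of the sample: H followed by four combining marks.
sample = "H\u036d\u0343\u0313\u0307"

for code, name in describe(sample):
    print(code, name)
```

Each combining mark is a separate code point in the string, even though the renderer draws all of them on the single base letter H.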
Zalgo text works because of combining characters. These are special characters that modify the character that comes before them. For example:
y + ̆ = y̆ which is actually
y + &#x0306; = y&#x0306;
Since you can stack them one atop the other you can produce the following:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which is actually just y followed by the same combining mark (&#x0306;) repeated over and over.
The same goes for putting stuff underneath:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which is in fact the same sequence with a run of marks that render below the base character (here U+0330 COMBINING TILDE BELOW) added after the y.
In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F.
More about it here
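The stacking described above can be sketched in a few lines of Python. This is only an illustration: `stack` is a hypothetical helper, and the two marks (U+0306 COMBINING BREVE above, U+0330 COMBINING TILDE BELOW below) are simply the ones chosen for this example.

```python
import unicodedata

BREVE = "\u0306"        # COMBINING BREVE, rendered above the base character
TILDE_BELOW = "\u0330"  # COMBINING TILDE BELOW, rendered below the base character

def stack(base, above=0, below=0):
    """Append `below` marks and then `above` marks to a base character."""
    return base + TILDE_BELOW * below + BREVE * above

tall = stack("y", above=18)
both = stack("y", above=18, below=15)

# One visible "character", but many code points.
print(tall, len(tall))
print(both, len(both))

# A non-zero canonical combining class marks a character as combining.
print(unicodedata.combining(BREVE), unicodedata.combining(TILDE_BELOW))
```

How tall the result looks depends entirely on the renderer; the string itself is just a base letter followed by ordinary code points from the U+0300–U+036F block.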
To produce a list of combining diacritical marks you can use the following script (since links keep on dying):
// Print each code point in U+0300–U+036F next to its HTML entity
for (let i = 0x0300; i <= 0x036F; i++) {
  console.log(String.fromCodePoint(i) + " &#" + i + ";");
}
Also check em out
Mͣͭͣ̾ Vͣͥͭ͛ͤͮͥͨͥͧ̾
Is it possible to use Unicode combining characters to, for example, make the characters x and y appear to partially overlap each other?
I know that in layout systems like CSS there are other ways to achieve this, but I specifically want to know if it's possible with just Unicode, so I can for example do it in Slack messages.
No, there is no Unicode mechanism to make arbitrary letters overlap each other. You can put an x above a y using the character U+036F COMBINING LATIN SMALL LETTER X like so: yͯ, but that’s about it.
Latin letters partially overlapping each other serves no semantic function, so it is not part of the Unicode standard. And if it was found to be used to convey actual meaning in some writing system, it would most likely not be encoded as a generalised mechanism but as individual characters representing specific such ligatures.
The Unicode Consortium does not consider styling features like that to be part of plain text. That is also why those bold and italic mathematical letters you sometimes see on Twitter (𝐀, 𝐴, 𝓐 etc.) aren’t implemented as the base letters plus some style modifiers, but as separate character codes entirely. A character that means “display the preceding letter as bold” would have been too general; non-crucial style variation should be dealt with through higher-level protocols (like the CSS you mentioned) which are much more powerful and enjoy more widespread support anyway.
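Both points can be checked from code. The following Python sketch, which assumes nothing beyond the standard `unicodedata` module, shows that the bold mathematical A is its own code point (not an A plus a style modifier), and that the x-above-y trick is a base letter plus one combining mark.

```python
import unicodedata

bold_a = "\U0001D400"  # the bold A from the examples above

# A single, separate code point with its own name ...
print(len(bold_a), unicodedata.name(bold_a))

# ... which compatibility normalization (NFKC) folds back to a plain "A".
print(unicodedata.normalize("NFKC", bold_a))

# y followed by U+036F COMBINING LATIN SMALL LETTER X: two code points.
x_above_y = "y\u036f"
print(x_above_y, len(x_above_y), unicodedata.name(x_above_y[1]))
```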
I'm writing a program that converts an integer to a Roman numeral.
Roman numerals over 3999 are overlined, so IV overlined is 4000, CM overlined is 900'000, etc. These lines can stack.
So as to not limit my program, stopping it at just 3999 isn't good enough.
The question is, how do I add the "combining overline" unicode character to my string to achieve this?
My program is written in Rust, but I suspect the solution is similar across most languages that support unicode strings.
Just add the combining mark after each character.
Here's a Python example. What you see depends on support for combining marks in your console/IDE/browser.
with open('test.txt', 'w', encoding='utf-8-sig') as f:
    print('I\u0305V\u0305', file=f)
Output:
I̅V̅
In testing, U+0305 COMBINING OVERLINE could stack up to two, but Chrome drew incorrectly for three. There is also U+033F COMBINING DOUBLE OVERLINE.
You can just use them in string constants, either with the Unicode escape sequence (here shown for Rust) or directly (as they can be easily represented in UTF-8 source code files):
println!("I\u{0305}V\u{0305} - I̅V̅");
Note, however, that each overlined letter requires two Unicode code points, so it does not fit into a single char. You need to use a string.
The combining overline character itself fits into a single character:
let combining_overline = '\u{0305}';
To apply it, insert it after the base character that needs the overline.
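Putting the pieces together, here is a rough Python sketch of a converter that overlines the thousands part. The question is about Rust, but the approach (append U+0305 after each base letter) carries over directly. The names `plain_roman`, `overline` and `roman` are invented for this example, and the convention assumed here is that a single overline multiplies by 1000, so the sketch stops at 3,999,999.

```python
OVERLINE = "\u0305"  # COMBINING OVERLINE

NUMERALS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def plain_roman(n):
    """Classic Roman numeral for 0 <= n <= 3999 (0 gives an empty string)."""
    parts = []
    for value, symbol in NUMERALS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

def overline(text, levels=1):
    """Insert `levels` combining overlines after every character of text."""
    return "".join(ch + OVERLINE * levels for ch in text)

def roman(n):
    """Roman numeral for 1..3,999,999, overlining the thousands from 4000 up."""
    if not 0 < n < 4_000_000:
        raise ValueError("supported range is 1..3999999")
    if n < 4000:
        return plain_roman(n)
    thousands, rest = divmod(n, 1000)
    return overline(plain_roman(thousands)) + plain_roman(rest)

for n in (3999, 4000, 4001, 900_000):
    print(n, roman(n))
```

Supporting larger numbers would mean calling `overline` with `levels=2` for the millions part, subject to the rendering limits mentioned above.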
Let's take COMBINING ACUTE ACCENT, for example. Its browser test page does include it alone in the page, but it reacts in a strange way: I can't select it with my mouse, and if I try to interact with it in the DOM inspector, it feels like it's not part of the text at all (there's no before and after this character).
Is a combining character, used alone, still a valid Unicode string?
Or does it have to follow another character?
Yes, a combining character alone is a valid Unicode string (even though its behaviour may be weird without a base character). Section 2.11 of the Unicode Standard emphasises this:
In the Unicode Standard, all sequences of character codes are permitted.
The presentation of such strings is described in D52:
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character [...] In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
However, if you want to display a combining character by itself, it is recommended that you attach it to a no-break space base character:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be employed, for example, when talking about the combining mark itself as a mark, rather than using it in its normal way in text (that is, applied as an accent to a base letter or in other combinations).
Also, the dotted circle ◌ (U+25CC DOTTED CIRCLE) can be used as a base character.
Source: https://en.wikipedia.org/wiki/Dotted_circle
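A short Python sketch of these three options, using COMBINING ACUTE ACCENT from the question. The variable names are just for illustration; how each string is drawn still depends on the font and renderer.

```python
import unicodedata

ACUTE = "\u0301"          # COMBINING ACUTE ACCENT
NBSP = "\u00a0"           # NO-BREAK SPACE
DOTTED_CIRCLE = "\u25cc"  # DOTTED CIRCLE

# An isolated combining mark is still a valid string and round-trips through UTF-8.
alone = ACUTE
encoded = alone.encode("utf-8")
print(encoded, encoded.decode("utf-8") == alone)

# General_Category "Mn" means nonspacing mark.
print(unicodedata.category(ACUTE))

# The two recommended ways to show the mark by itself.
on_space = NBSP + ACUTE
on_circle = DOTTED_CIRCLE + ACUTE
print(repr(on_space), on_circle)
```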
I'm trying to gather a Unicode list of all the 'o'-like shapes in the Hindi character set. In fact, a list of any characters (in any language) that make use of separate characters to indicate an accent would be better.
I intend to use this unicode-list in a RegExp.
I've been trying to edit a list of character ranges by outputting them in an input TextField, but editing this text causes weird issues (the keyboard cursor isn't placed on the correct character, selections suddenly disappear or warp incorrectly... in other words... HINDI HELL!)
I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player text field. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.
Anyways, all I want is a list of the accents.
An example of a few are in the image below (but I would need ALL accents):
Thanks!
You can find PDFs containing lists of Unicode ranges, grouped by script, here: http://unicode.org/charts/
For Hindi, you probably want Devanagari or Devanagari Extended.
Here is the character class for Devanagari combining marks:
[\u0901\u0902\u0903\u093c\u093e\u093f\u0940\u0941\u0942\u0943
\u0944\u0945\u0946\u0947\u0948\u0949\u094a\u094b\u094c\u094d
\u0951\u0952\u0953\u0954\u0962\u0963]
This is only the basic Devanagari block (not Devanagari Extended).
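As a rough illustration of how such a class might be used, here is a Python sketch with the same code points collapsed into ranges. `DEVANAGARI_MARKS` and `strip_marks` are names made up for this example, and the test word is written as escapes to sidestep the editor problems described in the question.

```python
import re

# Same code points as the class above, written as ranges.
DEVANAGARI_MARKS = re.compile(
    "[\u0901-\u0903\u093c\u093e-\u094d\u0951-\u0954\u0962\u0963]"
)

def strip_marks(text):
    """Remove the Devanagari combining marks matched by the class."""
    return DEVANAGARI_MARKS.sub("", text)

# HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II
word = "\u0939\u093f\u0902\u0926\u0940"

print(DEVANAGARI_MARKS.findall(word))
print(strip_marks(word))
```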
If you want the complete set (for all languages), you can do it programmatically.
You start from the Unicode data file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by UAX #44 (http://unicode.org/reports/tr44/#Property_Definitions).
You can use the Canonical_Combining_Class field (see http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want.
I can't be more precise, because "accent" is a bit vague :-)
You might even have to also look at General_Category to get the filter right (and exclude certain marks, or symbols, or punctuation).
And a script doing this would definitely be better than trying to mess with text editors.
One of the characteristics of combining characters is that they combine :-)
So you might get all kinds of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)
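A script along those lines could look like the Python sketch below. It is only one possible approach: instead of downloading UnicodeData.txt it relies on the copy of the Unicode Character Database bundled with Python's `unicodedata` module, and it filters on General_Category (Mn, Mc, Me) rather than on Canonical_Combining_Class alone, because Devanagari vowel signs such as U+093E have combining class 0. The function name `combining_marks` is invented for this example.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")  # nonspacing, spacing, enclosing marks

def combining_marks(start, end):
    """Code points in [start, end] whose General_Category is a mark."""
    return [
        cp for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) in MARK_CATEGORIES
    ]

# Basic Devanagari block; pass 0x0000, 0x10FFFF to scan every script.
devanagari = combining_marks(0x0900, 0x097F)

for cp in devanagari:
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch), unicodedata.combining(ch),
          unicodedata.name(ch, "<unnamed>"))
```

The exact list depends on the Unicode version that ships with the Python interpreter, so newer versions may report a few more marks than the hand-written character class.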