I was just reading a post and saw some odd text effects, but I cannot work out how they are achieved or what they are called:
P̢̢̲̭̘̣̪͉͞͞h̴̛̫͉͖̜͙̳͎̕͞͠'̶̀͢҉̯̞̹͈ṉ̶̘̠̯̬̭̖̳͘͞ģ̵̛͠҉̰̝͇̩͍̗͍̘̫͈̺̭̥͉l̨͍̘͔̰͔̖͍̹̠̭̱̰̖͙̦̦͎̕͟u̢̡҉̲̭̲̺̮̖͖͖i̴̢̹̳͉͎̥̪̜͎̼̣̦̖̻͈̖͉͚ͅ ̵͏͇̗̭ͅm̶̨͍̤̪̱͇̤̬̥̥͔̼͍̠̼͕g̷̷̰̩͙̪̫͉̺̯͘͟͠ļ̶̭͇̘̮̕͢ẃ̵̸̷҉͕̬̠̥̤͖̙̲͇̼̹'̺̩̖̟̣͈̖͙̤̫̰̗̯̀͡ń̷̴̶̰̮̺͔̼̺̹̘̟a̷̰̪͙͇̤͓̤̭͎̦͕̻f͏̨͙̰̘͔̟̜̠͈̯̻͕̖̳̝̝́͘ͅḩ̴̛͉͉̲͇̠͙̣̩͙̩͚̮̼̺ͅ ̧̛̟͓̤͇̯͍̫͖͎͈̫̳͓̞͘Ç͘͏͈̹̠̙͎̳̯͚͔̼͙̻͔͖̲̩̹̕ͅt͏̖̲̤̫̤̫̼̪̥̠͙͚͍̭́ͅḩ̡̲͈̫̯͚͉̱͍̳͝ù̧͙̭̙̻̲̙͚͔̲̬͚͢͝͡ḻ̴̵̨̹͉͙̟̯̞̠͔̦̝̩͜h̶̼̜̦͖͍͎͍̕ṷ̴̶̢͙̗̬͇̯̞̗̰̣̬̥̲̣̦ ̵̲͍̩̭̩̗͈͚͟͝R͏̛͘͟҉̫̝̞̪̣̪̻̤̼͖̪͎'̛̯͚͎̳͎̼͓̘͉͢l͟҉̵̘͈͙̣̹̜͍͎̬̺̹̪̜̀y͏͓̞̬͙̥̞̦͎͖̞͖͎̖̀e̶̵̡̺͉̯̭̣̗h͇̺͇̖̼̻̟͓͜͟͜͞ͅ ̴̷̡̨̪͍̙̳̞̭̙̫̯̘͚͇͚̼͙͟w̧̮̜̯̭̘͈̫̳̖̕͜͠g̢̨̗͖̬̠͎͓̱̞͓̭̯̺͕̭̯̦ͅa̴̠̘̬̩͍͜ͅh̵̷̨̜̻͔̖͈̤͈̩͔͈͇̩̞̲̜̩͍̺'̸̨͇̞̜͈͟n̨͟͞҉̤͚͎͇̣̺͚̻̖͖́ͅà̻͉̙̲̲̞͘͝ģ̙̗̙͓̜̣͔̥̫͟͡l̴̨̨̼͚̫̞̙̳͙͢͟ ̢̦͚̲͇̞̺̗̫͇f̸̸̫̠͖͙̜͉̲͖͓̭͇̦̭̩̲͡͠ḩ̸̲̤͍̖̻̣̝̼́̕͝ͅt̴͝҉҉̵͔̮̞̪á̢̕͢͏̗̯̗̙͙͉̪͓͙̣̰̣g͏̶̡͓̤͍͖̜̠̜ͅn̴̶̛̝̼͉̠̻͓
Don't worry though, unless thousands read it I think we are safe.
It's called Zalgo text.
You can Google for an online generator and use it:
TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
As a side note, don't try to parse HTML with RegEx.
You can get a clue if you pick a sample and submit it to Unicode character inspector:
C U+0043 LATIN CAPITAL LETTER C Lu Basic Latin
̧ U+0327 COMBINING CEDILLA Mn Combining Diacritical Marks
͘ U+0358 COMBINING DOT ABOVE RIGHT Mn Combining Diacritical Marks
U+034F COMBINING GRAPHEME JOINER Mn Combining Diacritical Marks
̕ U+0315 COMBINING COMMA ABOVE RIGHT Mn Combining Diacritical Marks
͈ U+0348 COMBINING DOUBLE VERTICAL LINE BELOW Mn Combining Diacritical Marks
[…]
… where Lu stands for Unicode Character Category 'Letter, Uppercase' and Mn stands for Unicode Character Category 'Mark, Nonspacing'.
In short, they're just regular letters with all sorts of combining diacritics attached, thanks to the magic of Unicode. It abuses, for entertainment purposes, the fact that é can also be written as e + ´.
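To make that concrete, here is a minimal Python sketch using the standard `unicodedata` module. It lists the code points of the "C" cluster from the inspector output above and then drops every nonspacing mark. The helper names `describe` and `strip_marks` are made up for this illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, name, general category) for every code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

def strip_marks(text):
    """Decompose canonically, then drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# The "C" cluster from the inspector output, written with escapes.
sample = "C\u0327\u0358\u034F\u0315\u0348"

for row in describe(sample):
    print(*row)

print(strip_marks(sample))

# The same mechanism in its legitimate use: e + combining acute composes to é.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00E9")
```

Note that `strip_marks` also removes legitimate accents (it turns é into e), so treat it as a way to see what is going on rather than as a ready-made sanitizer.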
how to use regular expression to find strings containing two capital letters in a row?
^([A-Z\s]+)$
^.*[A-Z]{2}.*$ matches as follows
^ Beginning of the line
.* Any char for any number of times
[A-Z]{2} Two consecutive capital letters
.* Any char for any number of times
$ End of line
Find a live example here:
https://regex101.com/r/m2hPbh/1
([A-Z][A-Z][a-z0-9]*) would find every pair of capital letters in a row, together with any lowercase letters or digits that directly follow the pair.
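For anyone who wants to try these patterns outside regex101, here is a small sketch with Python's `re` module. The constant and function names are only for illustration, and `[A-Z]` covers ASCII capitals only, so letters like É or Đ are not counted.

```python
import re

# Two consecutive capitals anywhere in the string. With search() the
# surrounding ^.* and .*$ are not needed.
TWO_CAPS = re.compile(r"[A-Z]{2}")

# The anchored whole-line form from the answer above.
WHOLE_LINE = re.compile(r"^.*[A-Z]{2}.*$")

# The capturing pattern from the other answer.
CAPS_THEN_TAIL = re.compile(r"([A-Z][A-Z][a-z0-9]*)")

def has_two_caps(s):
    """True if s contains two consecutive ASCII capital letters."""
    return TWO_CAPS.search(s) is not None

for s in ["HTml parser", "Html Parser", "aBCd", "A B"]:
    print(repr(s), has_two_caps(s), WHOLE_LINE.match(s) is not None)

# The match starts at the capital pair and stops at the next capital,
# so it returns fragments rather than whole words.
print(CAPS_THEN_TAIL.findall("HTml and XMLfile"))
```

If all you need is a yes/no answer, the short `[A-Z]{2}` with `search()` is enough. The anchored version is mainly useful when the tool only offers whole-line matching.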
One renders as Ð and the other as Đ, and to me they look the same. I think that pretty much sums it up. Is there an inherent difference?
They render slightly differently, or similarly, depending on your system. Char U+00D0 is "Latin Capital Letter Eth" and char U+0110 is "Latin Capital Letter D with Stroke". Try opening these pages in different tabs and switching back and forth; the example is rendered slightly differently.
Unicode Character 'LATIN CAPITAL LETTER ETH' (U+00D0)
Unicode Character 'LATIN CAPITAL LETTER D WITH STROKE' (U+0110)
They render very similarly (1), but they are different characters, and the lowercase versions of those uppercase letters render very differently (1).
For comparison:
| Name | Capital | Small |
|---|---|---|
| Latin Letter Eth | Ð (U+00D0) | ð (U+00F0) |
| Latin Letter D with Stroke | Đ (U+0110) | đ (U+0111) |
| Latin Letter African D / Latin Letter D with Tail | Ɖ (U+0189) | ɖ (U+0256) |
1) Depends on the font, of course.
Credit: This answer is mostly a refinement/combination of the answer by doublesharp and the comment by John Montgomery.
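If you want to check this on your own machine instead of trusting the font, a quick Python sketch along these lines prints the names and the lowercase mappings. The function name `case_pairs` is made up for this example.

```python
import unicodedata

LOOKALIKES = ("\u00D0", "\u0110", "\u0189")  # Ð, Đ, Ɖ written as escapes

def case_pairs(chars=LOOKALIKES):
    """Return (capital, capital name, small, small name) for each character."""
    rows = []
    for ch in chars:
        small = ch.lower()
        rows.append((ch, unicodedata.name(ch), small, unicodedata.name(small)))
    return rows

for capital, capital_name, small, small_name in case_pairs():
    print(f"U+{ord(capital):04X} {capital_name} -> U+{ord(small):04X} {small_name}")

# Compatibility normalization does not unify them either.
print(unicodedata.normalize("NFKC", "\u00D0") == unicodedata.normalize("NFKC", "\u0110"))
```

The practical consequence is that a plain string comparison between Ð and Đ fails even when they look identical on screen, and normalizing first does not help.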
I was surprised to find that no Unicode normalization form maps the Ł character to something like L + combining stroke. That was my best guess for why Ł gets mapped to L rather than ? when converting from a Unicode-capable encoding to ASCII or to a code page that doesn't have the Ł character. How does it work otherwise? Does the standard define fallback characters?
Example in Python:
>>> import unicodedata
>>> s = 'ı̇'
>>> len(s)
2
>>> list(s)
['ı', '̇']
>>> print(", ".join(map(unicodedata.name, s)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
>>> normalized = unicodedata.normalize('NFC', s)
>>> print(", ".join(map(unicodedata.name, normalized)))
LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE
As you can see, NFC normalization does not compose the dotless i + a dot into a normal i. Is there a rationale for this? Is this an oversight? Or is it not included because NFC is supposed to be the perfect inverse of NFD (and one wouldn’t want to decompose i to dotless i + dot)?
While NFC isn't the "perfect inverse" of NFD, this follows from NFC being defined in terms of the same decomposition mappings as NFD. NFC is basically defined as NFD followed by recomposing certain NFD decomposition pairs. Since there's no decomposition mapping for LATIN SMALL LETTER I, it can never be the result of a recomposition.
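You can see this in the decomposition data itself. The following is a small sketch with Python's `unicodedata`, comparing é (which has a canonical decomposition) with plain i (which has none).

```python
import unicodedata

# é has a canonical decomposition, so NFC can recompose e + U+0301 into it.
print(repr(unicodedata.decomposition("\u00E9")))  # '0065 0301'

# Plain i has no decomposition mapping at all.
print(repr(unicodedata.decomposition("i")))       # ''

composed = unicodedata.normalize("NFC", "e\u0301")
dotless = unicodedata.normalize("NFC", "\u0131\u0307")

print(composed == "\u00E9")       # recomposed into a single code point
print(dotless == "\u0131\u0307")  # left as two code points
```

Since recomposition only ever undoes a decomposition pair, a character with an empty decomposition string can never appear as the result of composing two other code points.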
If you look at the Unicode Block for Mathematical Alphanumeric Symbols you notice that the MATHEMATICAL DOUBLE-STRUCK CAPITAL C is missing. And it is not the only one. Why? What is the point of having DOUBLE-STRUCK if you don't have all 26?
The code chart (PDF) for Mathematical Alphanumeric Symbols contains the following explanation:
Double-struck symbols already encoded in the Letterlike
Symbols block and omitted here to avoid duplicate encoding.
Here “and” is apparently to be read as “are”. Anyway, the point is that the Letterlike Symbols block already contains the double-struck C (as well as a few other double-struck letters). This reflects their relatively common use in mathematics (e.g. for ℂ the use as an alternative to C to denote the set of complex numbers) and their presence in old character codes. The block does not have enough code points for adding all the double-struck letters, so additions were made elsewhere. To keep the allocation natural, holes (reserved code points) were left there.
The code chart contains cross references to the characters allocated elsewhere, e.g. for the reserved code point 1D53A it has the comment “→ 2102 ℂ double-struck capital c”.
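The holes can also be listed programmatically. This is a rough sketch that assumes the capital double-struck letters occupy the 26 consecutive slots starting at U+1D538, with reserved slots showing up as unnamed code points. The helper `name_or_reserved` is made up for the example.

```python
import unicodedata

FIRST_CAPITAL = 0x1D538  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A

def name_or_reserved(code_point):
    """Character name, or '<reserved>' if the code point is unassigned."""
    return unicodedata.name(chr(code_point), "<reserved>")

# Letters whose slot in the Mathematical Alphanumeric Symbols block is a hole.
holes = "".join(
    chr(ord("A") + i)
    for i in range(26)
    if name_or_reserved(FIRST_CAPITAL + i) == "<reserved>"
)

print(name_or_reserved(0x1D538))
print(name_or_reserved(0x1D53A))
print(name_or_reserved(0x2102))
print("holes:", holes)
```

The letters reported as holes are exactly the ones that already exist in the Letterlike Symbols block, such as ℂ at U+2102.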
The ℂ for example is in another block, as said in the article:
The characters with pink background are located in other Unicode blocks, such as Letterlike symbols.
The ℂ specifically is in Letterlike symbols:
ℂ Double-Struck Capital C U+2102