Determine the individual unicode characters that make up a word

I'm having trouble breaking a word into its individual Unicode components. I'm working with the Devanagari script using Google Input Tools. An example is र्म (pronounced -rm), which I want to break into म (-m) and the hook at the top (-r). But I can't seem to find the Unicode character that corresponds to the hook at the top. Here are some of the solutions I tried:
1. Copy and paste र्म into MS Word and hit Alt+X. But this breaks the word into र् and म. It doesn't give me the Unicode character for the top hook.
2. I tried the site http://shapecatcher.com/. I found a character called latin egyptological ain; while similar in shape, it cannot be used on top of another character. I'm looking for the conjunct version of the hook.
Any help would be appreciated. I'm using TekMaker on Windows 8.

The ‘hook at the top’ representing a preceding र् is an inseparable part of the glyph for a variety of biconsonantal ligatures. It's not a discrete, freely-combinable diacritical mark as we would understand it in Latin-like scripts.
Consequently the visual rendering element doesn't have its own Unicode representation distinct from its linguistic meaning र्, sorry!
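To see which code points actually make up the conjunct, it can help to list them one by one. Here is a minimal sketch in Python (my choice of language, not something from the question), using the standard `unicodedata` module; the helper name `describe` is made up for illustration:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# The conjunct from the question, written with escapes: RA + VIRAMA + MA
word = "\u0930\u094D\u092E"

for code_point, name in describe(word):
    print(code_point, name)
```

The string contains only the letter RA, the sign VIRAMA and the letter MA. There is no separate entry for the hook, which is consistent with the answer above: the hook is how a font draws RA + VIRAMA in front of another consonant.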

Related

What kind of Characters are these? Are these unicode? How to write them? [duplicate]

I've seen weirdly formatted text called Zalgo like below written on various forums. It's kind of annoying to look at, but it really bothers me because it undermines my notion of what a character is supposed to be. My understanding is that a character is supposed to move horizontally across a line and stay within a certain "container". Obviously the Zalgo text is moving vertically and doesn't seem to be restricted to any space.
Is this a bug/flaw/exploit/hack in Unicode? Are these individual characters with weird properties? "What" is happening here?
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀ơ̯̗̱̘̮͒̄̀̈ͤ̀͡w͓̲͙͖̥͉̹͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮̙̣͓͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸̤͓̞̱̫ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇͓̔͋͊̓ ̢͈͙͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx͎̬̠͇̌ͤ̓̂̓͐͐́͋͡ț̗̹̝̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤͍͇̰̄͗ͭ̃͗ͮ̐o̢̯̻̰̼͕̾ͣͬ̽̔̍͟ͅr̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬͙̟̮͕ͤ̌͗ͩ̕͡

The text uses combining characters, also known as combining marks. See section 2.11, Combining Characters, in the Unicode Standard (PDF).
In Unicode, character rendering does not use a simple character cell model where each glyph fits into a box with a given height. Combining marks may be rendered above, below, or inside a base character.
So you can easily construct a character sequence, consisting of a base character and “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning of course, and even a monkey could produce it (e.g., given a keyboard with a suitable driver).
And you can mix “combining above” and “combining below” marks.
The sample text in the question starts with:
LATIN CAPITAL LETTER H - H
COMBINING LATIN SMALL LETTER T - ͭ
COMBINING GREEK KORONIS - ̓
COMBINING COMMA ABOVE - ̓
COMBINING DOT ABOVE - ̇

Zalgo text works because of combining characters. These are special characters that modify the character that comes before them. For example:
y + ̆ = y̆ which actually is
y + &#x0306; = y&#x0306;
Since you can stack them one atop the other you can produce the following:
y̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
which actually is y followed by a run of &#x0306; entities, one for each stacked breve.
The same goes for putting stuff underneath:
y̰̰̰̰̰̰̰̰̰̰̰̰̰̰̰̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆̆
that in fact is y followed by a run of &#x0330; entities (the marks below) and a run of &#x0306; entities (the marks above).
In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300–U+036F.
More about it here
To produce a list of combining diacritical marks you can use the following script (since links keep on dying)
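// Run in a browser console (needs DOMParser); 768 to 879 inclusive covers U+0300 to U+036F.
for (var i = 768; i <= 879; i++) { console.log(new DOMParser().parseFromString("&#" + i + ";", "text/html").documentElement.textContent + " " + "&#" + i + ";"); }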
Also check em out
Mͣͭͣ̾ Vͣͥͭ͛ͤͮͥͨͥͧ̾
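For anyone who would rather poke at this outside a browser, here is a rough Python sketch of the same stacking idea. The helper name `stack` and the counts 18 and 15 are arbitrary choices for illustration; U+0306 (COMBINING BREVE) and U+0330 (COMBINING TILDE BELOW) are just two marks picked from the U+0300–U+036F block.

```python
import unicodedata

def stack(base, above=0, below=0):
    """Append `below` tilde-below marks and `above` breves to a base character."""
    return base + "\u0330" * below + "\u0306" * above

tall = stack("y", above=18, below=15)

# One base letter; everything after it is a nonspacing mark (category Mn).
print(len(tall))
print(all(unicodedata.category(ch) == "Mn" for ch in tall[1:]))

# Python counterpart of the listing script above: the whole U+0300-U+036F block.
block = [chr(cp) for cp in range(0x0300, 0x0370)]
for ch in block:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Nothing limits how many marks follow the base character; how tall the result looks is entirely up to the renderer.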

Unicode characters aren't combined properly

I am working with some Devanagari text data I want to display in the browser. Unfortunately, there's one combination of nonspacing combining characters that doesn't get rendered as a properly combined character.
The problem occurs every time a base character is combined with the Devanagari Stress Sign Udatta ॑ (U+0951) and the Devanagari Sign Visarga ः (U+0903).
An example for this would be र॑ः, which is र (U+0930) + ॑ + ः and should be rendered as one character. But the stress sign and the other one don't seem to like each other (as you can see above!).
It's no problem to combine the base char with each of the other two signs alone, btw: र॑ / रः
I already tried to use several fonts which should be able to render Devanagari characters (some Noto fonts, Siddhanta, GentiumPlus) and tested it with different browsers, but the problem seems to be something else.
Does anyone have an idea? Is this not a valid combination of symbols?
EDIT: I just tried to switch around the two marks just to see what if - it renders as रः॑, so U+0951 and U+0903 don't seem to have the same function, as the stress sign gets rendered on top of the other mark.
It looks like I don't understand Unicode well enough yet.

This is NOT a solution for your problem, but might be useful information:
I am working with some Devanagari text data I want to display in the browser.
Like you, I couldn't get this to work in any browser despite trying several fonts, including Arial Unicode MS:
The browser was simply rendering the text Devanagari Test: रः॑ from within the <body> of a JSP. The stress sign is clearly appearing above the Sign Visarga instead of the base character.
Is this not a valid combination of symbols?
It is a valid combination. I don't know Devanagari, so I don't know whether it is semantically "valid", but it is trivial to generate exactly the character you want from a Java application:
System.out.println("Devanagari test: \u0930\u0903\u0951");
This is the output from executing the println() call, showing the stress sign above the base character:
The screenshot above is from NetBeans 8.2 on Windows 10, but the rendering also worked fine using the latest releases of Eclipse and Intellij IDEA. The constraints are:
The three characters must be specified in that order in println() for the rendering to work.
The Sign Visarga and the Stress Sign Udatta must be presented in their Unicode form. Pasting their glyph representations into the source code won't work, although this can be done for the base character.
An appropriate font must be used for the display. I used Arial Unicode MS for the screen shot above, but other fonts such as Serif, SansSerif and Monospaced also worked.
Does anyone have an idea?
Unfortunately not, although it is clear that:
The grapheme you want to render exists, and is valid.
Although it won't render in a browser, it can be written to the console by a Java application.
The problem seems to be that all browsers apply the diacritic (Stress Sign Udatta) to the immediately preceding character rather than the base character.
See Why are some combining diacritics shifted to the right in some programs? for more information on this.
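Not a fix either, but the Unicode properties of the characters involved can be inspected directly, and they show that the two signs are different kinds of mark. A small Python sketch using the standard `unicodedata` module (the constant names are mine):

```python
import unicodedata

RA, UDATTA, VISARGA = "\u0930", "\u0951", "\u0903"

# General category, canonical combining class and name of each character
for ch in (RA, UDATTA, VISARGA):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          unicodedata.combining(ch), unicodedata.name(ch))

# The two orderings are distinct sequences: normalization does not swap them.
a = unicodedata.normalize("NFC", RA + UDATTA + VISARGA)
b = unicodedata.normalize("NFC", RA + VISARGA + UDATTA)
print(a == b)
```

The udatta is a nonspacing mark (Mn, combining class 230) while the visarga is a spacing combining mark (Mc, combining class 0), so the two orders are not canonically equivalent. That may be part of why renderers treat र॑ः and रः॑ differently, though this sketch says nothing about what any particular browser or font does.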

What character is this:?

EDIT
While I was writing the question, the character I am asking about displayed fine for me, but after posting it no longer shows up. As it does not appear, please look it up on the original site.
EDIT2
I looked for Unicode chars associated with "alien", and found no matching ones. Here is how they are compared side by side:
I found that some texts inside my database contain a character like <0x85>. I am not sure how it would be rendered with different fonts and environments, so here is an image of how I see it:
I tried to identify it in different ways. For example, when I paste it into Sublime Text, it automatically shows as the control character <0x85>. When I tried to identify it in different Unicode detectors (http://www.babelstone.co.uk/Unicode/whatisit.html, https://unicode-table.com/en/, https://unicode-search.net/unicode-namesearch.pl), their conclusions were pretty much the same:
Unicode code point: U+0085
UTF-8 encoding: c2 85 (hexadecimal), 194 133 (decimal), 0302 0205 (octal)
Unicode character name: <control>
Unicode 1.0 character name (deprecated): NEXT LINE (NEL)
https://unicode-search.net/unicode-namesearch.pl also included this information:
HTML encoding: &#x85; (hexadecimal), &#133; (decimal)
which gave me a vague hint as to how it was possible that … became <0x85>. But this is not the main problem here.
My question is: how is it possible that a control character shows up like this, and what is the actual glyph used to represent it?
I tried sketching it at http://shapecatcher.com/ to identify it, but without success. I did not find such a glyph in any Unicode table.

The alien symbol is not a Unicode character, but it is in Microsoft's Webdings font, with character code 0x85. Running Start > Run > charmap, then selecting Webdings from the Font drop-down list, opens this window:
If I click that alien character in the leftmost column, the message Character Code : 0x85 is shown at the bottom of the window.
I can even copy that character from the Character Map and paste it into Microsoft Wordpad:
The Webdings symbols were included in Unicode Release 7: Pictographic symbols (including many emoji), geometric symbols, arrows, and ornaments originating from the Wingdings and Webdings sets. Therefore you would expect the alien symbol to also be in Unicode. However, I don't think the version of Webdings that was used included that alien symbol, since Windows 10 also has a ttf file for Webdings (version 5.01), and it also does not include the alien symbol:
So presumably what originally caught your attention was some text being rendered with an older version of the Webdings font which included that alien symbol.

The glyph is 👽 U+1F47D EXTRATERRESTRIAL ALIEN. I don't know why your system misrenders a control character.
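As a side note on the byte values quoted in the question, they can be reproduced with a few lines of Python. The last step assumes the data passed through Windows-1252 at some point, which is only a guess at how an ellipsis and U+0085 could get mixed up; nothing in the question confirms it.

```python
nel = "\u0085"  # the control character identified in the question

# UTF-8 bytes in the three notations quoted in the question
utf8 = nel.encode("utf-8")
print(utf8.hex(" "))                       # hexadecimal
print(" ".join(str(b) for b in utf8))      # decimal
print(" ".join(f"0{b:o}" for b in utf8))   # octal

# Assumption: Windows-1252 was involved somewhere along the way.
# In that code page the single byte 0x85 is the horizontal ellipsis U+2026.
print(bytes([0x85]).decode("cp1252") == "\u2026")
```

So the same byte value 0x85 means NEXT LINE when read as a Unicode code point and an ellipsis when read as Windows-1252, and a symbol font such as Webdings can draw yet another glyph at that position.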

Can a combining character be used alone in Unicode?

Let's take COMBINING ACUTE ACCENT, for example. Its browser test page does include it alone in the page, but it reacts in a strange way: I can't select it with my mouse, and if I try to interact with it in the DOM inspector, it feels like it's not part of the text at all (there's no before and after this character):
Is a combining character, used alone, still a valid Unicode string?
Or does it have to follow another character?

Yes, a combining character alone is a valid Unicode string (even though its behaviour may be weird without a base character). Section 2.11 of the Unicode Standard emphasises this:
In the Unicode Standard, all sequences of character codes are permitted.
The presentation of such strings is described in D52:
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character [...] In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
However, if you want to display a combining character by itself, it is recommended that you attach it to a no-break space base character:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be employed, for example, when talking about the combining mark itself as a mark, rather than using it in its normal way in text (that is, applied as an accent to a base letter or in other combinations).

Also, a dotted circle ◌ (U+25CC) character can be used as a base character.
Source: https://en.wikipedia.org/wiki/Dotted_circle
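The three options discussed in these answers (the mark alone, on a no-break space, on a dotted circle) can be written out as follows. This is just an illustrative Python sketch; the variable names are mine.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

isolated = ACUTE                 # valid string, but no base character
on_nbsp = "\u00A0" + ACUTE       # convention quoted from the Standard
on_circle = "\u25CC" + ACUTE     # dotted-circle alternative

for label, s in [("isolated", isolated), ("nbsp", on_nbsp), ("circle", on_circle)]:
    names = [unicodedata.name(ch) for ch in s]
    print(label, len(s), names)

# An isolated mark survives an encode/decode round trip like any other string.
print(isolated.encode("utf-8").decode("utf-8") == isolated)
```

How each variant is displayed and selected still depends on the renderer, as the question observed.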

What are the unicode ranges for Hindi accented characters?

I'm trying to gather a Unicode list of all the 'o'-like shapes in the Hindi character set. In fact, a list of any characters (in any language) that make use of separate characters to indicate an accent would be better.
I intend to use this Unicode list in a RegExp.
I've been trying to edit a list of character ranges by outputting them in an input TextField, but editing this text causes weird issues (the keyboard cursor isn't placed on the correct character, selections suddenly disappear or wrap incorrectly... in other words... HINDI HELL!)
I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player textfield. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.
Anyways, all I want is a list of the accents.
An example of a few are in the image below (but I would need ALL accents):
Thanks!

You can find PDFs containing lists of Unicode ranges, grouped by language, here: http://unicode.org/charts/
For Hindi, you probably want Devanagari or Devanagari Extended.

Here is the character class for Devanagari combining marks:
[\u0901\u0902\u0903\u093c\u093e\u093f\u0940\u0941\u0942\u0943
\u0944\u0945\u0946\u0947\u0948\u0949\u094a\u094b\u094c\u094d
\u0951\u0952\u0953\u0954\u0962\u0963]
This is only the basic Devanagari block (not Devanagari Extended).
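In case it is useful, here is how the same set of code points could be used from Python's `re` module. This is only a sketch in a different language from the one the question targets; the ranges are a compact spelling of the list above, and the sample word (हिन्दी, written with escapes) is my own.

```python
import re

# Same code points as the class above, collapsed into ranges
DEVANAGARI_MARKS = re.compile(
    "[\u0901-\u0903\u093C\u093E-\u094D\u0951-\u0954\u0962\u0963]"
)

# HA + VOWEL SIGN I + NA + VIRAMA + DA + VOWEL SIGN II
word = "\u0939\u093F\u0928\u094D\u0926\u0940"

marks = DEVANAGARI_MARKS.findall(word)
print([f"U+{ord(m):04X}" for m in marks])

# Removing the marks leaves only the base consonants.
print(DEVANAGARI_MARKS.sub("", word))
```

Working with escapes like this also avoids the editor cursor problems described in the question, since no combining character is ever typed directly.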

If you want the complete set (for all languages), you can do it programmatically.
You start from the Unicode data file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by TR-44 (http://unicode.org/reports/tr44/#Property_Definitions).
You can use the Canonical_Combining_Class field (see http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want.
Can't be more precise, because "accent" is a bit vague :-)
You might even have to also look at General_Category to get the filter right (and exclude certain marks, or symbols, or punctuation).
And a script doing this would definitely be better than trying to mess with text editors.
One of the characteristics of combining characters is that they combine :-)
So you might get all kind of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)
