What's the unicode glyph used to indicate combining characters? - unicode

My application needs to display "orphaned" combining characters. I would like to use the same format as the "official" unicode charts, using the dotted circle placeholder. See, for example:
Combining Diacritical Marks (PDF)
A quick scan through the charts and I came up with U+25CC "DOTTED CIRCLE". That looks good, but the note on this character reads:
note that the reference glyph for this
character is intentionally larger than
the dotted circle glyph used to
indicate combining characters in this
standard; see, for example, 0300
Which says (I think) that U+25CC is not the correct character. (Or, if it is, perhaps just a poorly worded note.)
So: if the dotted circle used on the "Combining Diacritical Marks" is not U+25CC, what is the correct code for that little booger?
I have tried:
Copying the text from the PDF and inspecting it, but the copy is disabled in the PDF.
Emailing it to myself in Gmail and then viewing the attachment as HTML, but there is gets converted to U+0024 ("DOLLAR SIGN"). Which means that either the conversion failed or they are just playing some font rendering games in the PDF.
[Clarification] I realize that the U+25CC looks OK (assuming one's font supports it), but it sounds like the spec says that this is the wrong character. Many unicode characters have similar glyphs but are different characters, semantically speaking. "Latin Capital Letter A" (U+0041) and "Greek Capital Letter Alpha" (U+0391) will look identical for most fonts, but they have different semantic meanings and are not interchangable.

I don't think there is an official placeholder character. The way I read that note, they chose U+25CC arbitrarily, purely for display purposes. Then, in the chart where the "real" dotted circle is listed, they made it a little larger to emphasize that it's not being used as a placeholder there. (Or maybe they shrunk it in the other charts; as you said, the note's poorly worded.)
Whatever the case, I don't see any reason not to use U+25CC as your placeholder.

Just tried this: create a blank .html file, copy the text, and load in Firefox. Displays as expected (although I really didn't expect space+combining character to display correctly):
<html>
<body>
<font size="24pt">
◌̀
◌́
◌̂
◌̃
<br/>
À
Á
Â
Ã
<br/>
̀
́
̂
̃
</font>
</body>
</html>

Related

Why do the printed unicode characters change?

The way the unicode symbol is displayed depends on whether I use the White Heavy Check Mark or the Negative Squared Cross Mark before it or not. If I do, the Warning Sign is coloured. If I put a space between the symbols, I get the mono-coloured text-like version.
Why does this behaviour exist and can I force the coloured symbol somehow?
I tried a couple of different REPLs, the behaviour was the same.
; No colour
(str (char 0x274e) " " (char 0x26A0))
; Coloured
(str (char 0x274e) "" (char 0x26A0))
Clojure unicode display.
I expect the symbol being displayed the same way regardless of which symbol comes before it.
Why does this behaviour exist
A vendor thought it would be a neat idea to render emoji glyhps in colour. The idea caught on.
https://en.wikipedia.org/wiki/Emoji#Emoji_versus_text_presentation
can I force the coloured symbol somehow
U+FE0E VARIATION SELECTOR-15 and U+FE0F VARIATION SELECTOR-16
http://mts.io/2015/04/21/unicode-symbol-render-text-emoji/
Unicode is about characters (code points), not glyphs (see it as "image" of a character).
Fonts are free to (and should) merge nearby characters into a single glyphs. In printed Latin scripts this is not very common (but we could have it e.g. ff,fi, ffi), without considering the combining codepoints which, per definition, should combine with other characters, to get just one glyph,
Many other scripts require it. Starting to cursive Latin scripts, but most cursive scripts requires changes. E.g. Arabic has different glyphs of initial, final, middle or separated character (+ special combination, common to cursive scripts). Indian scripts have similar behaviours.
So the base of Unicode has already this behaviour, and modern good fonts should be able to do it.
It was not so late, that emojii uses such functionality, e.g. country letters/flags to other common cases.
Often the Unicode documentation tell you of such possibilities, and the special code points which could change behaviour, but then it is task of the font to fullfil the expected behaviour (and to find good glyphs).
So: character (as unicode code point) is not one to one to a design (glyphs).

Unicode characters aren't combined properly

I am working with some Devanagari text data I want to display in the browser. Unfortunately, there's one combination of nonspacing combining characters that doesn't get rendered as a proberly combined character.
The problem occurs every time a base character is combined with the Devanagari Stress Sign Udatta ॑ (U+0951) and the Devanagari Sign Visarga ः (U+0903).
An example for this would be र॑ः, which is र (U+0930) + ॑ + ः and should be rendered as one character. But the stress sign and the other one don't seem to like each other (as you can see above!).
It's no problem to combine the base char with each of the other two signs alone, btw: र॑ / रः
I already tried to use several fonts which should be able to render Devanagari characters (some Noto fonts, Siddhanta, GentiumPlus) and tested it with different browsers, but the problem seems to be something else.
Does anyone have an idea? Is this not a valid combination of symbols?
EDIT: I just tried to switch around the two marks just to see what if - it renders as रः॑, so U+0951 and U+0903 don't seem to have the same function, as the stress sign gets rendered on top of the other mark.
It looks like i don't understand Unicode enough, yet.
This is NOT a solution for your problem, but might be useful information:
I am working with some Devanagari text data I want to display in the
browser.
Like you, I couldn't get this to work in any browser despite trying several fonts, including Arial Unicode MS:
The browser was simply rendering the text Devanagari Test: रः॑ from within the <body> of a JSP. The stress sign is clearly appearing above the Sign Visarga instead of the base character.
Is this not a valid combination of symbols?
It is a valid combination. I don't know Devanagari, so I don't know whether it is semantically "valid", but it is trivial to generate exactly the character you want from a Java application:
System.out.println("Devanagari test: \u0930\u0903\u0951");
This is the output from executing the println() call, showing the stress sign above the base character:
The screenshot above is from NetBeans 8.2 on Windows 10, but the rendering also worked fine using the latest releases of Eclipse and Intellij IDEA. The constraints are:
The three characters must be specified in that order in println() for the rendering to work.
The Sign Visarga and the Stress Sign Udatta must be presented in their Unicode form. Pasting their glyph representations into the source code won't work, although this can be done for the base character.
An appropriate font must be used for the display. I used Arial Unicode MS for the screen shot above, but other fonts such as Serif, SansSerif and Monospaced also worked.
Does anyone have an idea?
Unfortunately not, although it is clear that:
The grapheme you want to render exists, and is valid.
Although it won't render in a browser, it can be written to the console by a Java application.
The problem seems to be that all browsers apply the diacritic (Stress Sign Udatta) to the immediately preceding character rather than the base character.
See Why are some combining diacritics shifted to the right in some programs? for more information on this.

Unicode Keystroke Characters?

Does unicode have characters in it similar to stuff like the things formed by the <kbd> tag in HTML? I want to use it as part of a game to indicate that the user can press a key to perform a certain action, for example:
Press R to reset, or S to open the settings menu.
Are there characters for that? I don't need anything fancy like ⇧ Shift or Tab ⇆, single-letter keys are plenty. I am looking for something that would work somewhat like the Enclosed Alphanumerics subrange.
If there are characters for that, where could I find a page describing them? All the google searches I tried turned only turned up "unicode character keyboard shortcuts" stuff.
If there are not characters for that, how can I display something like that as part of (or at least in line with) a text string in Processing 2.0.1?
(The rendering referred to is not the default rendering of kbd, which simply shows the content in the system’s default monospace font. But e.g. in StackOverflow pages, a style sheet is used to format kbd so that it looks like a keycap.)
Somewhat surprisingly, there is a Unicode way to create something that looks like a character in a keycap: enter the character, then immediately COMBINING ENCLOSING KEYCAP U+20E3.
Font support to this character is very limited but contains a few free fonts. Unfortunately, none of them is a sans-serif font, and the character to be shown inside should normally appear in such a font – after all, real keycaps contains very simple shapes for characters, without serifs. And generally, a character and an enclosing mark should be taken from the same font; otherwise they might be incompatible. However, it seems that taking the normal character from the sans-serif font (FreeSans) in GNU Freefont and the combining mark from the serif font (FreeSerif) of the same source creates a reasonable presentation:
I’m afraid it won’t work here in text, but I’ll try: A⃣ .
Whether this works depends on the use of suitable fonts, as mentioned, but also on the rendering software. Programs have been rather bad at displaying combining marks, but there has been some improvement. I tested this in Word 2007, where it works OK, and also on web browsers (Chrome, Firefox, IE) with good results using code like this:
<style>
.cap { font-family: FreeSerif; }
.cap span { font-family: FreeSans; }
</style>
<span class="cap"><span>A</span>⃣</span>
It isn’t perfect, when using the fonts mentioned. The character in the cap is not quite centered. Moreover, if I try to use the technique e.g. for the character Å (which is present on normal Nordic keyboards), the ring above A extends out of the cap. You could tweak this by setting the font size of the letter in the cap to, say, 85% of the font size of the combining mark, but then the horizontal position of the letter is even more off.
To summarize, it is possible to do such things at the character level, but if you can use other methods, like using a border or a background image for a character, you can probably achieve better rendering.

Superscript in markdown (Github flavored)?

Following this lead, I tried this in a Github README.md:
<span style="vertical-align: baseline; position: relative;top: -0.5em;>text in superscript</span>
Does not work, the text appears as normal. Help?
Use the <sup></sup>tag (<sub></sub> is the equivalent for subscripts). See this gist for an example.
You have a few options for this. The answer depends on exactly what you're trying to do, how readable you want the content to be when viewed as Markdown and where your content will be rendered:
HTML Tags
As others have said, <sup> and <sub> tags work well for arbitrary text. Embedding HTML in a Markdown document like this is well supported so this approach should work with most tools that render Markdown.
Personally, I find HTML impairs the readable of Markdown somewhat, when working with it "bare" (eg. in a text editor) but small tags like this aren't too bad.
LaTeX (New!)
As of May 2022, GitHub supports embedding LaTeX expressions in Markdown docs directly. This gives us new way to render arbitrary text as superscript or subscript in GitHub flavoured Markdown, and it works quite well.
LaTeX expressions are delineated by $$ for blocks or $ for inline expressions. In LaTeX you indicate superscript with the ^ and subscript with _. Curly braces ({ and }) can be used to group characters. You also need to escape spaces with a backslash. The GitHub implementation uses MathJax so see their docs for what else is possible.
You can use super or subscript for mathematical expressions that require it, eg:
$$e^{-\frac{t}{RC}}$$
Which renders as..
Or render arbitrary text as super or subscript inline, eg:
And so it was indeed: she was now only $_{ten\ inches\ high}$, and her face brightened up at the thought that she was now the right size for going through the little door into that lovely garden.
Which renders as..
I've put a few other examples here in a Gist.
Unicode
If the superscript (or subscript) you need is of a mathematical nature, Unicode may well have you covered.
I've compiled a list of all the Unicode super and subscript characters I could identify in this gist. Some of the more common/useful ones are:
⁰ SUPERSCRIPT ZERO (U+2070)
¹ SUPERSCRIPT ONE (U+00B9)
² SUPERSCRIPT TWO (U+00B2)
³ SUPERSCRIPT THREE (U+00B3)
ⁿ SUPERSCRIPT LATIN SMALL LETTER N (U+207F)
People also often reach for <sup> and <sub> tags in an attempt to render specific symbols like these:
™ TRADE MARK SIGN (U+2122)
® REGISTERED SIGN (U+00AE)
℠ SERVICE MARK (U+2120)
Assuming your editor supports Unicode, you can copy and paste the characters above directly into your document or find them in your systems emoji and symbols picker.
On MacOS, simultaneously press the Command ⌘ + Control + Space keys to open the emoji picker. You can browse or search, or click the small icon in the top right to open the more advanced Character Viewer.
On Windows, you can a emoji and symbol picker by pressing ⊞ Windows + ..
Alternatively, if you're putting these characters in an HTML document, you could use the hex values above in an HTML character escape. Eg, ² instead of ². This works with GitHub (and should work anywhere else your Markdown is rendered to HTML) but is less readable when presented as raw text.
Images
If your requirements are especially unusual, you can always just inline an image. The GitHub supported syntax is:
![Alt text goes here, if you'd like](path/to/image.png)
You can use a full path (eg. starting with https:// or http://) but it's often easier to use a relative path, which will load the image from the repo, relative to the Markdown document.
If you happen to know LaTeX (or want to learn it) you could do just about any text manipulation imaginable and render it to an image. Sites like Quicklatex make this quite easy. Of course, if you know your document will be rendered on GitHub, you can use the new (2022) embedded LaTeX syntax discussed earlier)
Comments about previous answers
The universal solution is using the HTML tag <sup>, as suggested in the main answer.
However, the idea behind Markdown is precisely to avoid the use of such tags:
The document should look nice as plain text, not only when rendered.
Another answer proposes using Unicode characters, which makes the document look nice as a plain text document but could reduce compatibility.
Finally, I would like to remember the simplest solution for some documents: the character ^.
Some Markdown implementation (e.g. MacDown in macOS) interprets the caret as an instruction for superscript.
Ex.
Sin^2 + Cos^2 = 1
Clearly, Stack Overflow does not interpret the caret as a superscript instruction. However, the text is comprehensible, and this is what really matters when using Markdown.
If you only need superscript numbers, you can use pure Unicode. It provides all numbers plus several additional characters as superscripts:
x⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿⁱ
However, it might be that the chosen font does not support them, so be sure to check the rendered output.
In fact, there are even quite a few superscript letters, however, their intended use might not be for superscript, and font support might be even worse. Use your own judgement.

What are the unicode ranges for Hindi accented characters?

I'm trying to gather a Unicode list of all the 'o' like shapes in the Hindi character-set. In fact, a list of any characters (in any language) that makes uses of separate characters to indicate an accent would be better.
I intend to use this unicode-list in a RegExp.
I been trying to edit a list of character-ranges by outputting them in an Input TextField, but editing this text causes weird issues (the keyboard-cursor isn't place on the correct character, selections suddenly dissappear / incorrectly warps... in other words... HINDI HELL!)
I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player textfield. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.
Anyways, all I want is a list of the accents.
An example of a few are in the image below (but I would need ALL accents):
Thanks!
You can find pdf's containing lists of unicode ranges, grouped by language, here: http://unicode.org/charts/
For Hindi, you probably want Devanagari or Devanagari Extended.
Here is the character class for Devanagari combining marks:
[\u901\u902\u903\u93c\u93e\u93f\u940\u941\u942\u943
\u944\u945\u946\u947\u948\u949\u94a\u94b\u94c\u94d
\u951\u952\u953\u954\u962\u963]
This is only the basic Devanagari block (not Devanagari Extended).
If you want the complete set (for all languages), you can do it problematically.
You start from the Unicode date file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by TR-44 (http://unicode.org/reports/tr44/#Property_Definitions)
You can use the Canonical_Combining_Class field (see at http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want.
Can't be more precise, because "accent" a bit vague :-)
You might even have to also look at General_Category to get the filter right (and exclude certain marks, or symbols, or punctuation).
And a script doing this would definitely be better than trying to mess with text editors.
One of the characteristics of combining characters is that they combine :-)
So you might get all kind of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)