The way the unicode symbol is displayed depends on whether I use the White Heavy Check Mark or the Negative Squared Cross Mark before it or not. If I do, the Warning Sign is coloured. If I put a space between the symbols, I get the mono-coloured text-like version.
Why does this behaviour exist and can I force the coloured symbol somehow?
I tried a couple of different REPLs, the behaviour was the same.
; No colour
(str (char 0x274e) " " (char 0x26A0))
; Coloured
(str (char 0x274e) "" (char 0x26A0))
I expect the symbol to be displayed the same way regardless of which symbol comes before it.
Why does this behaviour exist
A vendor thought it would be a neat idea to render emoji glyphs in colour. The idea caught on.
https://en.wikipedia.org/wiki/Emoji#Emoji_versus_text_presentation
can I force the coloured symbol somehow
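Yes, with variation selectors: U+FE0E VARIATION SELECTOR-15 requests text presentation and U+FE0F VARIATION SELECTOR-16 requests emoji presentation. Place the selector directly after the symbol.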
http://mts.io/2015/04/21/unicode-symbol-render-text-emoji/
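As a rough illustration of the variation selectors (in Python rather than Clojure; the helper names as_emoji and as_text are made up for this sketch), the selector is simply appended to the symbol. Whether the request is honoured still depends on the font and the terminal.

```python
VS15 = "\uFE0E"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: request emoji presentation

def as_emoji(symbol):
    """Hypothetical helper: ask the renderer for the coloured glyph."""
    return symbol + VS16

def as_text(symbol):
    """Hypothetical helper: ask the renderer for the monochrome glyph."""
    return symbol + VS15

warning = "\u26A0"  # WARNING SIGN
# Show the code points that make up each variant.
for variant in (as_emoji(warning), as_text(warning)):
    print(" ".join(f"U+{ord(ch):04X}" for ch in variant))
```

Printing the strings themselves (instead of their code points) shows how your own terminal renders each variant.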
Unicode is about characters (code points), not glyphs (think of a glyph as the "image" of a character).
Fonts are free to (and should) merge nearby characters into a single glyph. In printed Latin scripts this is not very common (though we can have it, e.g. the ligatures ff, fi, ffi), leaving aside the combining code points which, by definition, should combine with other characters to form just one glyph.
Many other scripts require it. This starts with cursive Latin scripts, but most cursive scripts require such changes. E.g. Arabic has different glyphs for the initial, medial, final, and isolated forms of a character (plus special combinations, common to cursive scripts). Indic scripts behave similarly.
So the core of Unicode already has this behaviour, and good modern fonts should be able to handle it.
It was not long before emoji started using this functionality too, e.g. pairs of regional-indicator letters rendered as country flags, among other common cases.
Often the Unicode documentation tells you about such possibilities, and about the special code points which can change behaviour, but it is then the task of the font to fulfil the expected behaviour (and to find good glyphs).
So: a character (as a Unicode code point) does not map one-to-one to a design (glyph).
Let's take COMBINING ACUTE ACCENT, for example. Its browser test page does include it alone in the page, but it reacts in a strange way: I can't select it with my mouse, and if I try to interact with it in the DOM inspector, it feels like it's not part of the text at all (there's no before and after this character):
Is a combining character, used alone, still a valid Unicode string?
Or does it have to follow another character?
Yes, a combining character alone is a valid Unicode string (even though its behaviour may be weird without a base character). Section 2.11 of the Unicode Standard emphasises this:
In the Unicode Standard, all sequences of character codes are permitted.
The presentation of such strings is described in D52:
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character [...] In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
However, if you want to display a combining character by itself, it is recommended that you attach it to a no-break space base character:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent
isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be
employed, for example, when talking about the combining mark itself as a mark, rather
than using it in its normal way in text (that is, applied as an accent to a base letter or in
other combinations).
Also, a dotted circle character ◌ (U+25CC) can be used as a base character.
Source: https://en.wikipedia.org/wiki/Dotted_circle
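A minimal Python sketch of that convention, assuming you want to show a mark "in isolation" programmatically. The function name show_mark is hypothetical; it uses the General Category (any M* category) to decide whether a base is needed.

```python
import unicodedata

ACUTE = "\u0301"          # COMBINING ACUTE ACCENT
NBSP = "\u00A0"           # NO-BREAK SPACE
DOTTED_CIRCLE = "\u25CC"  # DOTTED CIRCLE

def show_mark(ch, base=NBSP):
    """Hypothetical helper: give a combining mark a neutral base character.

    Characters that are not marks (General Category not starting
    with "M") are returned unchanged.
    """
    if unicodedata.category(ch).startswith("M"):
        return base + ch
    return ch

# Report the resulting sequences as code points.
for s in (show_mark(ACUTE), show_mark(ACUTE, DOTTED_CIRCLE), show_mark("a")):
    print(" ".join(f"U+{ord(c):04X}" for c in s))
```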
I'm having trouble breaking a word into its individual Unicode components. I'm working with the Devanagari script using Google Input Tools. An example is र्म (pronounced -rm), which I want to break into म (-m) and that hook at the top (-r). But I can't seem to find the Unicode character that corresponds to the hook at the top. Here are some of the solutions I tried:
1. Copy and paste र्म into MS Word and hit Alt+X. But this breaks the word into र् and म. It doesn't give me the Unicode character for the top hook.
2. I tried the site http://shapecatcher.com/. I found a character called latin egyptological ain; while similar in shape, it cannot be used on top of another character. I'm looking for the conjunct version of the hook.
Any help would be appreciated. I'm using TekMaker on Windows 8.
The ‘hook at the top’ representing a preceding र् is an inseparable part of the glyph for a variety of biconsonantal ligatures. It's not a discrete, freely-combinable diacritical mark as we would understand it in Latin-like scripts.
Consequently the visual rendering element doesn't have its own Unicode representation distinct from its linguistic meaning र्, sorry!
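To see what the cluster is actually made of, you can list its code points. This is a small Python sketch (the function name code_points is made up for illustration); the string is written with escapes so there is no doubt about its content.

```python
import unicodedata

def code_points(text):
    """Return (code point label, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

cluster = "\u0930\u094D\u092E"  # the three code points behind the example cluster

for label, name in code_points(cluster):
    print(label, name)
# U+0930 DEVANAGARI LETTER RA
# U+094D DEVANAGARI SIGN VIRAMA
# U+092E DEVANAGARI LETTER MA
```

The hook never appears as a code point of its own: it is how the font draws RA + VIRAMA before another consonant.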
I'm having trouble understanding some concepts. In the Unicode spec, there's a property called general category.
OK, I understand what letters (usual characters; GC=L), numbers (like digits 0–9 and other characters that have numeric values; GC=N) and separators (dividers; GC=Z) are. But it's really hard to distinguish between symbols (GC=S), punctuation (GC=P), and marks (GC=M).
I looked up a list of them, but I couldn't find a conceptual difference. And the document doesn't help me a lot. What's the difference between all these?
Marks aren't standalone characters, but are applied to another character. Non-spacing marks are displayed over the target character, spacing marks are displayed attached to the target character and enclosing marks are displayed surrounding the target character. For example here's an a in a box (the character "a" combined with the enclosing square character):
a⃞
Regarding punctuation versus symbols: As the text you linked explains, some edge cases are classified rather arbitrarily, but in principle the difference is that punctuation is used "to organize and delimit textual units" (i.e. to mark the end of a sentence, separate different parts of a sentence, separate the elements of an enumeration etc.) and symbols "to represent concepts" (like units for example or mathematical notations).
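If it helps to see concrete cases, the General Category of any character can be looked up directly. The sketch below uses Python's unicodedata module; the sample characters are my own picks, not taken from the linked text.

```python
import unicodedata

# character -> General Category
samples = {
    "\u0301": "Mn",  # COMBINING ACUTE ACCENT: non-spacing mark
    "\u0903": "Mc",  # DEVANAGARI SIGN VISARGA: spacing mark
    "\u20DE": "Me",  # COMBINING ENCLOSING SQUARE: enclosing mark
    ",": "Po",       # punctuation, other
    "(": "Ps",       # punctuation, open
    "+": "Sm",       # symbol, math
    "$": "Sc",       # symbol, currency
    "\u00A9": "So",  # COPYRIGHT SIGN: symbol, other
}

def major_class(ch):
    """First letter of the General Category: L, M, N, P, S, Z or C."""
    return unicodedata.category(ch)[0]

for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), major_class(ch))
```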
Can anybody please tell me what is the range of Unicode printable characters? [e.g. Ascii printable character range is \u0020 - \u007f]
See, http://en.wikipedia.org/wiki/Unicode_control_characters
You might want to look especially at C0 and C1 control character http://en.wikipedia.org/wiki/C0_and_C1_control_codes
The wiki says that the C0 control characters are in the range U+0000 to U+001F, plus U+007F (the same positions as in ASCII), and that the C1 control characters are in the range U+0080 to U+009F.
Other than the C0/C1 control characters, Unicode also has many formatting control characters, e.g. the zero-width non-joiner, which prevents adjacent characters from being joined into a ligature or cursive connection, or the bidirectional text controls. These formatting control characters are rather scattered.
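The two groups correspond to the General Categories Cc (control) and Cf (format), so a lookup is more reliable than memorising ranges. A small Python sketch, with a made-up helper name:

```python
import unicodedata

def control_kind(ch):
    """Hypothetical helper: classify a character as control, format, or neither."""
    cat = unicodedata.category(ch)
    if cat == "Cc":
        return "control"  # C0/C1 controls and DEL
    if cat == "Cf":
        return "format"   # scattered formatting controls
    return None

for ch in ("\x1b", "\x7f", "\x85", "\u200c", "\u202e", "A"):
    print(f"U+{ord(ch):04X}", control_kind(ch))
```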
More importantly, what are you doing that requires you to know Unicode's non-printable characters? More likely than not, whatever you're trying to do is the wrong approach to solve your problem.
This is an old question, but it is still valid and I think there is more that can usefully, if briefly, be said on the subject than is covered by existing answers.
Unicode
Unicode defines properties for characters.
One of these properties is "General Category" which has Major classes and subclasses. The Major classes are Letter, Mark, Punctuation, Symbol, Separator, and Other.
By knowing the properties of your characters, you can decide whether you consider them printable in your particular context.
You must always remember that terms like "character" and "printable" are often difficult and have interesting edge-cases.
Programming Language support
Some programming languages assist with this problem.
For example, the Go language has a "unicode" package which provides many useful Unicode-related functions including these two:
func IsGraphic(r rune) bool
IsGraphic reports whether the rune is defined as a Graphic by Unicode. Such
characters include letters, marks, numbers, punctuation, symbols, and spaces,
from categories L, M, N, P, S, Zs.
func IsPrint(r rune) bool
IsPrint reports whether the rune is defined as printable by Go. Such
characters include letters, marks, numbers, punctuation, symbols, and
the ASCII space character, from categories L, M, N, P, S and the ASCII
space character. This categorization is the same as IsGraphic except
that the only spacing character is ASCII space, U+0020.
Notice that it says "defined as printable by Go", not "defined as printable by Unicode". It is almost as if there are some depths the wizards at Unicode dare not plumb.
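For comparison, the same two rules can be written against the General Category in a few lines. This is a Python sketch of the rules as quoted above, not Go's implementation; results can differ at the margins because each runtime ships its own Unicode version.

```python
import unicodedata

GRAPHIC_MAJOR = set("LMNPS")  # letters, marks, numbers, punctuation, symbols

def is_graphic(ch):
    """Categories L, M, N, P, S, plus space separators (Zs)."""
    cat = unicodedata.category(ch)
    return cat[0] in GRAPHIC_MAJOR or cat == "Zs"

def is_print(ch):
    """Same as is_graphic, except the only space allowed is U+0020."""
    return ch == " " or unicodedata.category(ch)[0] in GRAPHIC_MAJOR

for ch in (" ", "\u00a0", "\u200b", "\u0301", "\n"):
    print(f"U+{ord(ch):04X}", is_graphic(ch), is_print(ch))
```

Python's built-in str.isprintable() is documented with essentially the same rule as is_print here: everything except the "Other" and "Separator" categories, plus the ASCII space.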
Printable
The more you learn about Unicode, the more you realise how unexpectedly diverse and unfathomably weird human writing systems are.
In particular whether a particular "character" is printable is not always obvious.
Is a zero-width space printable? When is a hyphenation point printable? Are there characters whose printability depends on their position in a word or on what characters are adjacent to them? Is a combining-character always printable?
Footnotes
ASCII printable character range is \u0020 - \u007f
No it isn't. \u007f is DEL which is not normally considered a printable character. It is, for example, associated with the keyboard key labelled "DEL" whose earliest purpose was to command the deletion of a character from some medium (display, file etc).
In fact many 8-bit character sets have many non-consecutive ranges which are non-printable. See for example C0 and C1 controls.
First, you should remove the word 'UTF8' in your question, it's not pertinent (UTF8 is just one of the encodings of Unicode, it's something orthogonal to your question).
Second: the meaning of "printable/non printable" is less clear in Unicode. Perhaps you mean a "graphical character"; and one can even dispute whether a space is printable/graphical. The non-graphical characters would consist, basically, of control characters: the range 0x00-0x1F plus some others that are scattered.
Anyway, the vast majority of Unicode characters (well over 100,000) are "graphical". But this certainly does not imply that they are printable in your environment.
It seems to me a bad idea, if you intend to generate a "random printable" unicode string, to try to include all "printable" characters.
What you should do is pick a font, and then generate a list of which Unicode characters have glyphs defined for your font. You can use a font library like freetype to test glyphs (test for FT_Get_Char_Index(...) != 0).
Taking the opposite approach to #HoldOffHunger, it might be easier to list the ranges of non-printable characters, and negate the match to test whether a character is printable.
As a regex character class (to match printable characters instead, place a ^ after the opening bracket):
[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]
Which accounts for things like separator spaces and joiners
Note that unlike their answer, which is a whitelist that ignores all non-Latin languages, this blacklist won't permit non-printable characters just because they're in blocks with printable characters (their answer wholly includes the Non-Latin, Language Supplement blocks as 'printable', even though they contain things like 'zero-width non-joiner').
Be aware though, that if using this or any other solution, for sanitization for example, you may want to do something more nuanced than a blanket replace.
Arguably in that case, non-breaking spaces should change to space, not be removed, and invisible separator should be replaced with comma conditionally.
Then there are invalid character ranges, either [yet] unused or reserved for encoding purposes, and language-specific variation selectors.
NB: when using regular expressions, make sure you enable Unicode awareness if it isn't on by default (for JavaScript it's via /.../u).
You can tell if you have it correct by attempting to create the regular expression with some multi-byte character ranges.
For example, the above, plus the invalid character range \u{E0100}-\u{E01EF} in javascript:
/[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\u{E0100}-\u{E01EF}]/u
Without u, \u{E0100}-\u{E01EF} equates to \uDB40(\uDD00-\uDB40)\uDDEF, not (\uDB40\uDD00)-(\uDB40\uDDEF). When replacing, you should always enable u, even when the regex itself contains no multibyte Unicode, as you might otherwise break surrogate pairs that exist in the text.
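As one possible shape for the "more nuanced than a blanket replace" idea, here is a Python sketch using the same character class. The function name sanitize and the policy (space-like characters become a plain space, everything else in the class is dropped) are assumptions for illustration. Python's str patterns work on code points, so no flag like /u is needed.

```python
import re
import unicodedata

# Same ranges as the character class above.
NON_PRINTABLE = re.compile(
    r"[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F"
    r"\u2028-\u202F\u205F-\u206F\u3000\uFEFF]"
)

def sanitize(text):
    """Hypothetical policy: keep spacing, drop invisible controls."""
    def replace(match):
        ch = match.group()
        # Space separators (category Zs) become an ordinary space;
        # everything else matched by the class is removed.
        return " " if unicodedata.category(ch) == "Zs" else ""
    return NON_PRINTABLE.sub(replace, text)

print(repr(sanitize("a\u200bb\u202fc\x00d")))
```

Tab and line feed are outside the class, so they pass through untouched.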
What characters are valid?
At present, Unicode is defined as starting from U+0000 and ending at U+10FFFF. The first block, Basic Latin, spans U+0000 to U+007F and the last block, Supplementary Private Use Area-B, spans U+100000 to U+10FFFF. If you want to see all of these blocks, see here: Wikipedia.org: Unicode Block; List of Blocks.
Let's break down what's valid/invalid in the Latin block.
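The upper bound is easy to confirm from a language runtime. A short Python sketch (the function name is_valid_code_point is made up for illustration):

```python
import sys

MAX_CODE_POINT = 0x10FFFF

def is_valid_code_point(n):
    """True for integers inside the Unicode code space U+0000..U+10FFFF."""
    return 0 <= n <= MAX_CODE_POINT

print(hex(sys.maxunicode))  # the largest code point the interpreter accepts

try:
    chr(MAX_CODE_POINT + 1)
except ValueError as err:
    print("rejected:", err)
```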
The Latin Block: TLDR
If you're interested in filtering out invisible characters, you'll want to filter out:
U+0000 to U+0008: Control
U+000E to U+001F: Device (i.e., Control)
U+007F: Delete (Control)
U+0080 to U+009F: Device (i.e., Control)
The Latin Block: Full Ranges
Here's the Latin block, broken up into smaller sections...
U+0000 to U+0008: Control
U+0009 to U+000D: Whitespace (tab, line feed, vertical tab, form feed, carriage return)
U+000E to U+001F: Device (i.e., Control)
U+0020: Space
U+0021 to U+002F: Symbols
U+0030 to U+0039: Numbers
U+003A to U+0040: Symbols
U+0041 to U+005A: Uppercase Letters
U+005B to U+0060: Symbols
U+0061 to U+007A: Lowercase Letters
U+007B to U+007E: Symbols
U+007F: Delete (Control)
U+0080 to U+009F: C1 controls (Device, i.e., Control)
U+00A0: Non-breaking space. (i.e., &nbsp;)
U+00A1 to U+00BF: Symbols.
U+00C0 to U+00FF: Accented characters.
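A table like this can be cross-checked against the Unicode character database by grouping consecutive code points that share a General Category. The following Python sketch does that; category_ranges is a name invented here, and the categories printed are the two-letter Unicode codes (Cc = control, Zs = space, Nd = digit, Lu/Ll = upper/lowercase letter, and so on) rather than the informal labels used above.

```python
import unicodedata
from itertools import groupby

def category_ranges(start, stop):
    """Yield (first, last, category) for runs of equal General Category."""
    def category_of(cp):
        return unicodedata.category(chr(cp))
    for cat, run in groupby(range(start, stop), key=category_of):
        run = list(run)
        yield run[0], run[-1], cat

# Basic Latin and Latin-1 Supplement.
for first, last, cat in category_ranges(0x00, 0x100):
    print(f"U+{first:04X} to U+{last:04X}: {cat}")
```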
The Other Blocks
Unicode is famous for supporting non-Latin character sets, so what are these other blocks? This is just a broad overview, see the wikipedia.org page for the full, complete list.
Latin1 & Latin1-Related Blocks
U+0000 to U+007F : Basic Latin
U+0080 to U+00FF : Latin-1 Supplement
U+0100 to U+017F : Latin Extended-A
U+0180 to U+024F : Latin Extended-B
Combinable blocks
U+0250 to U+036F: 3 Blocks.
Non-Latin, Language blocks
U+0370 to U+1C7F: 55 Blocks.
Non-Latin, Language Supplement blocks
U+1C80 to U+209F: 11 Blocks.
Symbol blocks
U+20A0 to U+2BFF: 22 Blocks.
Ancient Language blocks
U+2C00 to U+2C5F: 1 Block (Glagolitic).
Language Extensions blocks
U+2C60 to U+FFEF: 66 Blocks.
Special blocks
U+FFF0 to U+FFFF: 1 Block (Specials).
One approach is to render each character to a texture and manually check if it is visible. This solution excludes spaces.
I've written such a program and used it to determine there are roughly 467241 printable characters within the first 471859 code points. I've selected this number because it covers all of the first 4 Planes of Unicode, which seem to contain all printable characters. See https://en.wikipedia.org/wiki/Plane_(Unicode)
I would much like to refine my program to produce the list of ranges, but for now here's what I am working with for anyone who needs immediate answers:
https://editor.p5js.org/SamyBencherif/sketches/_OE8Y3kS9
I am posting this tool because I think this question attracts a lot of people who are looking for slightly different applications of knowing printable ranges. Hopefully this is useful, even though it does not fully answer the question.
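For readers who want a number without rendering anything, a database-driven count is a rough alternative. This Python sketch is not the program linked above: it relies on str.isprintable(), counts the ASCII space as printable, depends on the Unicode version bundled with the interpreter, and says nothing about whether a font has a glyph.

```python
def count_printable(start, stop):
    """Count code points in [start, stop) that Python considers printable."""
    return sum(1 for cp in range(start, stop) if chr(cp).isprintable())

print(count_printable(0x00, 0x80))       # ASCII only
print(count_printable(0x00, 0x110000))   # whole code space; varies by Unicode version
```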
The printable ASCII character range, in decimal rather than hex, is 32 to 126 as an int.
Unicode, strictly speaking, has no fixed printable range; new characters are added with each version.
What you gave is not UTF-8, which uses 1 byte for ASCII characters.
As for the range, I believe there is no range of printable characters. It always evolves. Check the page I gave above.