Are there Unicode characters to represent bundles (and partial bundles) of 5 in the style of the tally/five-bar-gate?
If not, what would be the most standard/semantic/accessible solution to this problem?
Things I've tried but don't like:
Using the numbers 1-5 - easily confusing (3 bundles of 5 looks like 555)
1-4 pipes with strike-through for '5' - semantically messy, strike-through not reliably portable
Ogham letter Straif for '5' - non-contiguous/not-associated with representations of 1-4 (all Ogham characters contain the horizontal line)
Images with a descriptive alt tag - only works for the web, verbose for text-only browsers/screen readers.
References:
Wikipedia's page on tally notation shows nice Asian and Latin American systems. My audience is mostly European, but a method in these styles would be acceptable if the five-bar-gate is not possible.
Thanks a lot!
A proposal to include “Western-style tally marks” (L2/16-065, PDF) was accepted by the Unicode Technical Committee in May 2016.
This introduces two characters:
u1D377 - TALLY MARK ONE - 𝍷
u1D378 - TALLY MARK FIVE - 𝍸
As of early 2018, these characters are “accepted by UTC and in active ISO technical ballot (or on hold for ballot)”. Their adoption progress can be monitored in the proposed characters pipeline.
You could use 卌 U+534C (in fact, it's a Chinese symbol for 40)
Use four vertical bars with strikethrough text.
||||
GfBo
The closest thing Unicode has to tally marks are the counting rod numerals.
𝍠 U+1D360 COUNTING ROD UNIT DIGIT ONE
𝍡 U+1D361 COUNTING ROD UNIT DIGIT TWO
𝍢 U+1D362 COUNTING ROD UNIT DIGIT THREE
𝍣 U+1D363 COUNTING ROD UNIT DIGIT FOUR
𝍤 U+1D364 COUNTING ROD UNIT DIGIT FIVE
𝍥 U+1D365 COUNTING ROD UNIT DIGIT SIX
𝍦 U+1D366 COUNTING ROD UNIT DIGIT SEVEN
𝍧 U+1D367 COUNTING ROD UNIT DIGIT EIGHT
𝍨 U+1D368 COUNTING ROD UNIT DIGIT NINE
𝍩 U+1D369 COUNTING ROD TENS DIGIT ONE
𝍪 U+1D36A COUNTING ROD TENS DIGIT TWO
𝍫 U+1D36B COUNTING ROD TENS DIGIT THREE
𝍬 U+1D36C COUNTING ROD TENS DIGIT FOUR
𝍭 U+1D36D COUNTING ROD TENS DIGIT FIVE
𝍮 U+1D36E COUNTING ROD TENS DIGIT SIX
𝍯 U+1D36F COUNTING ROD TENS DIGIT SEVEN
𝍰 U+1D370 COUNTING ROD TENS DIGIT EIGHT
𝍱 U+1D371 COUNTING ROD TENS DIGIT NINE
As far as I can tell, there is no |||| character.
I personally just copy and paste this string:
||̸||
It's four of the ASCII vertical bar with one Unicode "combining long solidus overlay" sandwiched in between.
To be quite clear, it's the sequence of five characters specified by these Unicode code points:
U+007C
U+007C
U+0338
U+007C
U+007C
Related
I learned today that while common fractions have dedicated Unicode values, in order to form less common fractions like ³/₁₆ you have to use superscript/subscript characters followed by a slash. This is confirmed here and here.
This works for ¹¹/₁₆ and ¹³/₁₆, but it gets messed up with ¹⁵/₁₆. Do you see how the 5 rises higher than the one? I imagine this is because in order to show the number 5 clearly as a superscript, it requires more height than 1 and 3.
Well, that creates a problem. How do you display the fraction 15/16 nicely as Unicode characters? Unfortunately I can't use the sup and sub tags. I'm not displaying it in an HTML page. Rather, we're passing a string to a Java application that will then render these values. I know it renders Unicode values fine, but it wouldn't recognize HTML tags. Is there a Unicode solution?
The “proper” way of composing arbitrary vulgar fractions in Unicode is to not use the subscript and superscript digits at all, but to utilise the special properties of the character U+2044 FRACTION SLASH. You would simply type the regular ASCII digits and separate them with the slash like so: 15⁄16. The rendering engine will then automatically select the correct forms of the numbers, producing a clean, uniform look.
I put the word ‘proper’ in quotation marks because this method is not guaranteed to be supported on all systems, and some that do support it do so incorrectly or incompletely. If you absolutely need to make sure that 100% of recipients regardless of system will definitely see something that looks more or less right, I would therefore still (begrudgingly) recommend using the preformatted subscripts and superscripts as a substitute. As the other answer explained, the problem you are having is a font issue and cannot be solved if you do not have control over font settings.
This is indeed a font issue, however the problem arises from the fact that, in Unicode, ¹, ², and ³ belong to the Latin-1 Supplement block, while the other superscript digits belong to the Superscripts and Subscripts block, and some font substitution occurs.
Please see Why the display of Unicode characters for superscripted digits are not at the same height? for extra details; it is tagged as iOS, but I have the same problem on macOS too.
I found this site, Unicode Fraction Creator: https://lights0123.com/fractions/
Here's an example: ³⁄₂
Which is:
U+00B3 superscript three
U+2044 fraction slash
U+2082 subscript two
For a general answer on displaying fractions nicely, copy, paste, and change.
ASCII Characters
Name
hexadecimal value
⁄
Fraction Slash
8260
0
digit 0
48
1
digit 1
49
2
digit 2
50
3
digit 3
51
4
digit 4
52
5
digit 5
53
6
digit 6
54
7
digit 7
55
8
digit 8
56
9
digit 9
57
example: 1/0 =
1⁄0
What is the difference between zero-width space (U+200B) and zero-width non-joiner (U+200C) from practical point of view?
I have already read Wikipedia articles, but I can't understand if these characters are interchangeable or not.
I think they are completely interchangeable, but then I can't understand why we have two in Unicode set instead of one.
A zero-width non-joiner is almost non-existing. Its only purpose is to split things into two. For example, 123 zero-width-non-joiner 456 is two numbers with nothing in between.
A zero-width space is a space character, just a very very narrow one. For example 123 zero-width-space 456 is two numbers with a space character in between.
A zero width non joiner (ZWNJ) only interrupts ligatures. These are hard to notice in the latin alphabet but are most frequent in serif fonts displaying some specific combinations of lowercase letters. There are a few alphabets, such as the arabic abjad, that use ligatures very prominently.
A zero width space (ZWSP) does everything a ZWNJ does, but it also creates opportunities for line breaks. Very good for displaying file paths and long URLs, but beware that it might screw up copy pasting.
By the way, I tested regular expression matching in Python 3.8 and Javascript 1.5 and none of them match \s. Unicode considers these characters as formatting characters (similar to direction markers and such) as opposed to space/punctuation. There are other codepoints in the same Unicode block (e.g. Thin Space, U+2009) that are considered space by Unicode and do match \s.
American Standard Code Of Information Interchange
Unicode
or more character set Contains Digits? Example:1 2 3
Assuming you're asking which character sets contain digits, the answers are yes, yes and yes.
ASCII has the digits from 0 thru 9 at code points 0x30 thru 0x39.
So do ISO 10646 and Unicode, which share the same character encodings in the low range at least. They are also at code points 0x30/U+30 thru 0x39/U+39 (since the lower 128 characters are equivalent to ASCII).
As for the final paragraph asking if there are other character set that contain digits, I can think of at least one off the top of my head, EBCDIC. These are at code points 0xf0 thru 0xf9.
No doubt there are other encoding schemes that include digits as well, given their usefulness.
What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?
They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction?
The Unicode Standard, Chapter 3, D52
Combining character: A character with the General Category of Combining Mark (M).
Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non- joiner. The combining character is said to apply to that base character.
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character—for example, a carriage return, tab, or right-left mark. In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
The Unicode Standard, Chapter 3, D59
Grapheme extender: A character with the property Grapheme_Extend.
Grapheme extender characters consist of all nonspacing marks, zero width joiner, zero width non-joiner, U+FF9E, U+FF9F, and a small number of spacing marks.
A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character.
zero width joiner and zero width non-joiner are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.
The difference in actual usage is that combining characters are defined as a General Category for rough classification of characters and grapheme extenders are mainly used for UAX #29 text segmentation.
EDIT: Since you offered a bounty, I can elaborate a bit.
Combining characters are characters that can't be use as stand-alone characters but must be combined with another character. They're used to define combining character sequences.
Grapheme extenders were introduced in Unicode 3.2 to be used in Unicode Technical Report #29: Text Boundaries (then in a proposed status, now known as Unicode Standard Annex #29:
Unicode Text Segmentation). The main use is to define grapheme clusters. Grapheme clusters are basically user-perceived characters. According to UAX #29:
Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.
The main difference is that grapheme extenders don't include most of the spacing marks (the set is actually smaller than the set of combining characters). Most of the spacing marks are vowel signs for Asian scripts. In these scripts, vowels are sometimes written by modifying a consonant character. If this modification takes up horizontal space (spacing mark), it used to be seen as a separate user-perceived character and forms a new (legacy) grapheme cluster. In later versions of UAX #29, this was changed and extended grapheme clusters were introduced where most but not all spacing marks don't break a cluster.
I think they key sentence from the standard is: "A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character." Combining characters, on the other hand, also include spacing marks that are applied to the left or right. There are a few exceptions, though (see property Other_Grapheme_Extend).
Example
U+0995 BENGALI LETTER KA:
ক
U+09C0 BENGALI VOWEL SIGN II (combining character, but no grapheme extender):
ী
Combination of the two:
কী
This is a single combining character sequence consisting of two legacy grapheme clusters. The vowel sign can't be used by itself but it still counts as a legacy grapheme cluster. A text editor, for example, could allow to place the cursor between the two characters.
There are over 300 combining characters like this which do not extend graphemes, and four characters which are not combining but do extend graphemes.
I’ve posted this question on the Unicode mailing list and got some more responses. I’ll post some of them here.
Tom Gewecke wrote:
I'm not an expert on this aspect of Unicode, but I understand that
"grapheme extender" is a finer distinction in character properties
designed to be used in certain specific and complex processes like
grapheme breaking. You might find this blog article helpful in seeing
where it comes into play:
http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html
PS The answer by nwellnhof at StackOverflow is an excellent explanation of this issue in my view.
Philippe Verdy wrote:
Many grapheme extenders are not "combining characters". Combining
characters are classified this way for legacy reasons (the very weak
"general category" property) and this property is normatively
stabilized. As well most combining characters have a non-zero
combining class and they are stabilized for the purpose of
normalization.
Grapheme extenders include characters that are also NOT combining
characters but controls (e.g. joiners). Some graphemclusters are also
more complex in some scripts (there are extenders encoded BEFORE the
base character; and they cannot be classified as combining characters
because combining characters are always encoded AFTER a base
character)
For legacy reasons (and roundtrip compatibility with older standards)
not all scripts are encoded using the UCS character model using
combining characters. (E.g. the Thai script; not following the
"logical" encoding order; but following the model used in TIS-620 and
other standards based on it; including for Windows, and *nix/*nux).
Richard Wordingham wrote:
Spacing combining marks (category Mc) are in general not grapheme
extenders. The ones that are included are mostly included so that the
boundaries between 'legacy grapheme clusters'
http://www.unicode.org/reports/tr29/tr29-23.html are invariant under
canonical equivalence. There are six grapheme extenders that are not
nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
ZWNJ, ZWJ, U+302E HANGUL SINGLE DOT TONE MARK U+302F HANGUL DOUBLE DOT
TONE MARK U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK U+FF9F HALFWIDTH
KATAKANA SEMI-VOICED SOUND MARK
I can see that it will sometimes be helpful to ZWNJ and ZWJ along with
the previous base character. The fullwidth soundmarks U+3099 and
U+309A are included for reasons of canonical equivalence, so it makes
sense to include their halfwidth versions.
I don't actually see the logic for including U+302E and U+302F. If
you're going to encourage forcing someone who's typed the wrong base
character before a sequence of 3 non-spacing marks to retype the lot,
you may as well do the same with Hangul tone marks.
May I quote from Yannis Haralambous' Fonts and Encodings, page 116f.:
The idea is that a script or a system of notation is sometimes too
finely divided into characters. And when we have cut constructs up
into characters, there is no way to put them back together again to
rebuild larger characters. For example, Catalan has the ligature
‘ŀl’. This ligature is encoded as two Unicode characters: an ‘ŀ’
0x0140 latin small letter l with middle dot and an ordinary ‘l’. But
this division may not always be what we want.
Suppose that we wish to
place a circumflex accent over this ligature, as we might well wish to
do with the ligatures ‘œ’ and ‘æ’. How can this be done in Unicode?
To allow users to build up characters in constructs that play the rôle
of new characters, Unicode introduced three new properties (grapheme
base, grapheme extension, grapheme link) and one new character:
0x034F combining grapheme joiner.
So the way I see it, this means that grapheme extenders are used to apply (for example) accents on characters that are themselves composed of several characters.
I know unicode contains all characters from most world aphabets..but what about digits? Are they part of unicode or not? I was not able to find straight answer.
Thanks
As already stated, Indo-Arabic numerals (0,1,..,9) are included in Unicode, inherited from ASCII. If you're talking about representation of numbers in other languages, the answer is still yes, they are also part of Unicode.
//numbers (0-9) in Malayalam (language spoken in Kerala, India)
൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯
//numbers (0-9) in Hindi (India's national language)
० १ २ ३ ४ ५ ६ ७ ८ ९
You can use \p{N} or \p{Number} in a regular expression to match any kind of numeric character in any script.
This document (Page-3) describes the Unicode code points for Malayalam digits.
In short: yes, of course. There are three categories in UNICODE containing various representations of digits and numbers:
Number, Decimal Digit (characters) – e.g. Arabic, Thai, Devanagari digits;
Number, Letter (characters) – e.g. Roman numerals;
Number, Other (characters) – e.g. fractions.
The Unicode points below 128 are exactly the same as ASCII so, yes, they're at U+0030 through U+0039 inclusive.
Yes they are - codepoints 0030 to 0039, as you can see e.g. on decodeunicode.org
btw, codepoints 0000-007E are the same as ASCII (0-127, 128+ isn't ASCII anymore), so anything that you can find in ASCII you can find in Unicode.
Yes I think so:
Information Taken From Here
U+0030 0 30 DIGIT ZERO
U+0031 1 31 DIGIT ONE
U+0032 2 32 DIGIT TWO
U+0033 3 33 DIGIT THREE
U+0034 4 34 DIGIT FOUR
U+0035 5 35 DIGIT FIVE
U+0036 6 36 DIGIT SIX
U+0037 7 37 DIGIT SEVEN
U+0038 8 38 DIGIT EIGHT
U+0039 9 39 DIGIT NINE
You can answer that question yourself: if they weren’t part of Unicode, this would rather drastically reduce the usefulness of Unicode, don’t you think?
Basically, any text that needs to use numbers couldn’t be represented using Unicode code points. (This is assuming that you don’t switch to and fro between different character encodings in one text: I don’t know a single software / programming language that supports this, and for good reason.)
If such questions crop up, you badly need to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. Seriously. Go read it.