The Unicode Basic Latin (ASCII) code chart has the following entries for the parenthesis code points:
0028 ( LEFT PARENTHESIS
= opening parenthesis (1.0)
0029 ) RIGHT PARENTHESIS
= closing parenthesis (1.0)
• see discussion on semantics of paired
bracketing characters
As far as I can see, the document has no further information on which "discussion on semantics of paired bracketing characters" it is referring to. I'm guessing it is about "left"/"right" vs "opening"/"closing", but I would like to know more.
When I search the web for phrases like "semantics of paired bracketing characters" I only get the same document in various versions.
What discussion is it referring to and where can I read it?
In the mid-90s my father created a character encoding for engineering purposes on his company's computers. It was close to ISO 8859-2 (Latin 2), but with some differences.
For example, it added a special "MARKER CHARACTER". This character was not meant to appear as literal text, but it was not a control character either.
Its purpose was to be inserted by a machine wherever text needed to be split into parts. See the following Python parser script:
import re

# text is the input string; re.sub returns a new string, so reassign it.
# Put the marker '~' before opening and after closing double brackets.
text = re.sub(r'\{\{', r'~{{', text)
text = re.sub(r'\[\[', r'~[[', text)
text = re.sub(r'\]\]', r']]~', text)
text = re.sub(r'\}\}', r'}}~', text)
parts = text.strip('~').split('~')
inCurly = [False]
inSharp = [False]
whereAmI = ['']
for part in parts:
    if part[:2] == '{{':
        inCurly.append(True)
        whereAmI.append('Curly')
    elif part[:2] == '[[':
        inSharp.append(True)
        whereAmI.append('Sharp')
    if whereAmI[-1] == 'Sharp' and not inCurly[-1]:
        # some advanced magic on current part,
        # if it is directly surrounded by sharp brackets,
        # but these sharp brackets are not in curly brackets anyhow
        # (not: "{{ (( [[ some text ]] )) }}")
        pass
    # detecting closing brackets and popping inSharp, inCurly, whereAmI
# joining parts back to text
This is a simple parser for advanced purposes; you can detect more kinds of parentheses or quotation marks as needed. But it has one huge flaw: it breaks when a ~ occurs in the text.
For this and similar purposes (in C, I think) he added that marker character to his encoding/character set.
For years I have used three German "sharp s" characters, ßßß, for this purpose, because it is almost impossible to see three of them in a row. But this is not an ideal solution.
Yesterday my father told me this story and I immediately thought: is there an equivalent in Unicode? Unicode is a modern, evolving standard that has spread drastically all over the world in the past decade. Shouldn't there be a special character just for this purpose?
I don't think there's anything called that specifically, but you might find zero-width space or information separator, among others, suitable for the purpose. You can arbitrarily select any character as your marker, and use an escape character if it occurs within the string.
In the control pictures block, there is a symbol for the group separator.
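As a minimal sketch of the information-separator suggestion, the splitting step from the question could use U+001D (INFORMATION SEPARATOR THREE, the group separator) instead of ~. The function name split_on_brackets is hypothetical, and rejecting input that already contains the marker is just one possible policy; escaping it would be another.

```python
import re

# U+001D INFORMATION SEPARATOR THREE (group separator), chosen here as the marker.
MARKER = "\x1d"

def split_on_brackets(text):
    """Split text before '{{' / '[[' and after ']]' / '}}' (hypothetical helper)."""
    if MARKER in text:
        # Assumption: refuse such input rather than escape the marker.
        raise ValueError("text already contains the marker character")
    text = re.sub(r"(\{\{|\[\[)", MARKER + r"\1", text)
    text = re.sub(r"(\]\]|\}\})", r"\1" + MARKER, text)
    # Drop empty parts that arise when two brackets are adjacent.
    return [part for part in text.split(MARKER) if part]

print(split_on_brackets("a ~ {{b [[c]] d}} e"))
```

A literal ~ in the input now passes through untouched, since it no longer doubles as the marker.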
Is there any way I can write a superscript slash with Unicode?
My aim is to represent rational exponents in a nicer form than 123**(456/789).
Well, Unicode is full of characters; what they mean is up to interpretation.
For superscript slash you can use:
Canadian Syllabics Final Acute 123⁴⁵⁶ᐟ⁷⁸⁹
Right Raised Omission Bracket 123⁴⁵⁶⸍⁷⁸⁹
Musical Symbol Repeated Figure-1 123⁴⁵⁶𝄍⁷⁸⁹
For subscript slash you can use:
Right Low Paraphrase Bracket 123₄₅₆⸝₇₈₉
If you have other solutions please comment and I will update my answer.
A helpful site for finding special Unicode characters: shapecatcher
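For illustration, here is a small sketch that builds such a string for non-negative integers, using the Canadian Syllabics Final Acute (U+141F) as the look-alike slash. The function name rational_power is hypothetical, and whether the result renders well depends entirely on the font.

```python
# Map ASCII digits to their Unicode superscript counterparts.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")

# U+141F CANADIAN SYLLABICS FINAL ACUTE, used only because it looks like a raised slash.
SUPERSCRIPT_SLASH = "\u141f"

def rational_power(base, numerator, denominator):
    """Format base**(numerator/denominator) with superscript look-alikes."""
    num = str(numerator).translate(SUPERSCRIPT_DIGITS)
    den = str(denominator).translate(SUPERSCRIPT_DIGITS)
    return f"{base}{num}{SUPERSCRIPT_SLASH}{den}"

print(rational_power(123, 456, 789))
```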
No. On general grounds, we can be pretty sure that if such a character existed, it would be in the Superscripts and Subscripts block (not all superscripts are there, but the odds are that if any superscripts will be added, they will be placed there).
So you need some higher-level protocol, as you usually do, when you need superscripts beyond a fairly limited repertoire. Unicode is about encoding characters, not about layout and mathematical expressions.
Assuming http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts is accurate, the answer is no.
Looking at the complete official Unicode name list and making the bold assumption it would have "slash" either in its name or description, there is no such character at this time.
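The name-list search can be approximated with Python's unicodedata module. This is only a sketch: names_containing is a hypothetical helper, it sees only the Unicode version bundled with your Python build, and it searches formal names only, not the annotations in the name list. Note that "/" itself is formally named SOLIDUS, so both words are worth trying.

```python
import sys
import unicodedata

def names_containing(*words):
    """Return (code point, name) pairs whose Unicode name contains every word."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed or unassigned code points yield the default "".
        name = unicodedata.name(chr(cp), "")
        if name and all(word in name for word in words):
            hits.append((cp, name))
    return hits

for words in (("SUPERSCRIPT", "SLASH"), ("SUPERSCRIPT", "SOLIDUS")):
    print(words, names_containing(*words))
```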
What is the likelihood that I'll run into COMBINING LATIN SMALL LETTER C (U+0368) in "real life" (besides clever Scottish folk)?
I'm asking since it's in both the Unicode Block Combining Diacritical Marks and the Category Mark, Nonspacing [Mn].
As a result, it seems to get treated the same as characters such as COMBINING GRAVE ACCENT (U+0300) by utilities such as the ICU Transliterator (using either the suggested "NFD; [:Nonspacing Mark:] Remove; NFC" or a straight "Latin-ASCII" transliteration).
The likelihood is very close to zero, but not exactly zero. You cannot prevent anyone from using a Unicode character as he likes. There is no specific information about U+0368 in the Unicode Standard, but it has definitely been defined as a combining character that will cause a symbol (c) to be displayed above the preceding character. I would expect to find it mostly in digitized forms of medieval manuscripts, or something like that.
Using it after a space character, as in the “clever” page mentioned, is not the intended use, but not invalid either. Unicode lets you use any combining mark after any character, whether it makes sense or not.
It has no canonical or compatibility decomposition, so there is no clear-cut way to deal with it in a context where you cannot, or do not want to, retain the character.
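These properties can be checked with Python's unicodedata module. The sketch below is a plain-Python analogue of the "NFD; [:Nonspacing Mark:] Remove; NFC" rule, not ICU itself; strip_nonspacing_marks is a hypothetical name.

```python
import unicodedata

C_ABOVE = "\u0368"  # COMBINING LATIN SMALL LETTER C

print(unicodedata.name(C_ABOVE))           # formal character name
print(unicodedata.category(C_ABOVE))       # general category
print(repr(unicodedata.decomposition(C_ABOVE)))  # empty string means no decomposition

def strip_nonspacing_marks(text):
    """Decompose, drop every Mn character, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

# U+0368 is removed just like a grave accent would be.
print(strip_nonspacing_marks("e" + C_ABOVE), strip_nonspacing_marks("\u00e8"))
```

If U+0368 should survive such a filter, one option is to exclude it explicitly from the removal set; what to map it to instead is a policy decision the standard does not make for you.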
The likelihood is utterly indeterminate except to say that if you expect it not to occur, then it will occur.
I'm having trouble understanding some concepts. In the Unicode spec, there's a property called general category.
I understand what letters (usual characters; GC=L), numbers (digits 0–9 and other characters that have numeric values; GC=N) and separators (dividers; GC=Z) are. But I find it really hard to distinguish between symbols (GC=S), punctuation (GC=P), and marks (GC=M).
I looked up a list of them, but I couldn't find the conceptual difference, and the document doesn't help me much. What's the difference between all these?
Marks aren't standalone characters, but are applied to another character. Non-spacing marks are displayed over the target character, spacing marks are displayed attached to the target character and enclosing marks are displayed surrounding the target character. For example here's an a in a box (the character "a" combined with the enclosing square character):
a⃞
Regarding punctuation versus symbols: as the text you linked explains, some edge cases are classified rather arbitrarily, but in principle the difference is that punctuation is used "to organize and delimit textual units" (i.e. to mark the end of a sentence, separate different parts of a sentence, separate the elements of an enumeration, etc.) and symbols are used "to represent concepts" (like units or mathematical notation).
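To see the categories on concrete characters, Python's unicodedata.category returns the two-letter general category; its first letter is the major class (L, N, Z, P, S, M). The sample characters below are my own picks for illustration, and major_class is a hypothetical helper.

```python
import unicodedata

def major_class(ch):
    """Return the first letter of the general category, e.g. 'P' for punctuation."""
    return unicodedata.category(ch)[0]

SAMPLES = [
    "a", "5", " ",     # letter, number, separator
    "!", "(",          # punctuation: delimit textual units
    "+", "\u20ac",     # symbols: plus sign, euro sign
    "\u0300",          # nonspacing mark: combining grave accent
    "\u0903",          # spacing mark: Devanagari sign visarga
    "\u20de",          # enclosing mark: combining enclosing square
]

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```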
There are these arrows in Unicode ⬅ ⬆ ⬇ ⬈ ⬉ ⬊ ⬋ ⬌ ⬍
But a right-pointing one is missing. Its name should be something like RIGHTWARDS BLACK ARROW, but there's no Unicode character of that name.
There are some characters that look similar, but I couldn't find an exact match. I'm looking for the right-pointing character of this set (based on character name or semantics, not font appearance).
Anyone? I need the Unicode code point.
Here are the code points of some of these characters:
character: ⬅ (11013, #o25405, #x2b05)
character: ⬆ (11014, #o25406, #x2b06)
character: ⬍ (11021, #o25415, #x2b0d)
From the Unicode 7.0 Character Code Chart for Miscellaneous Symbols and Arrows:
2B95 ⮕ RIGHTWARDS BLACK ARROW
And from the Dingbats chart:
27A1 ➡ BLACK RIGHTWARDS ARROW
This issue was discussed on the Unicode Mail List, and Jukka Korpela (author of Unicode Explained) mentioned:
"My guess is that U+27A1 was included along with other dingbat arrows (which are mostly rightward-pointing arrowlike symbols) because it had been included in legacy character codes. It was then regarded as unnecessary to duplicate it in the Miscellaneous Symbols and Arrows block. This is somewhat unfortunate."
Here's the timeline of when these characters were added to Unicode:
1991: The Dingbats block (2700 to 27BF) has been present since Unicode 1.0. Its characters were copied from the ITC Zapf Dingbats font (released in 1978).
2003: The Miscellaneous Symbols and Arrows block was created in Unicode 4.0.
2014: The RIGHTWARDS BLACK ARROW character (2B95) was added in Unicode 7.0.
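Since the question asks for a match by name rather than appearance, a lookup by name is a quick check. This is a sketch using Python's unicodedata.lookup; find_code_point is a hypothetical wrapper, and the result for RIGHTWARDS BLACK ARROW depends on whether your Python's Unicode database is at version 7.0 or later.

```python
import unicodedata

def find_code_point(name):
    """Return the code point for an exact Unicode name, or None if unknown."""
    try:
        return ord(unicodedata.lookup(name))
    except KeyError:
        return None

NAMES = [
    "LEFTWARDS BLACK ARROW",
    "UPWARDS BLACK ARROW",
    "DOWNWARDS BLACK ARROW",
    "BLACK RIGHTWARDS ARROW",   # the Dingbats character
    "RIGHTWARDS BLACK ARROW",   # needs a Unicode 7.0+ database
]

print("Unicode database version:", unicodedata.unidata_version)
for name in NAMES:
    cp = find_code_point(name)
    print(name, "->", "not found" if cp is None else f"U+{cp:04X}")
```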
➡ http://www.fileformat.info/info/unicode/char/27a1/index.htm : ➡
None of those is the thick right arrow that the OP is looking for. If you go to the Wikipedia page for arrows and scroll down to Miscellaneous Symbols and Arrows, you will see that the OP is exactly right. There are thick arrows, but no right-pointing one.