How to add a vector notation above a variable name composed of multiple characters? - unicode

In Julia it is possible to enter Unicode characters with a LaTeX-like syntax. All allowed Unicode characters can be found here. For example, it is possible to add a right arrow over a character with this simple code
F\vec[TAB]
and it produces the following character
But I couldn't find a syntax to add the same right arrow over a whole word, as \vec always seems to add the arrow over the previous character and does not allow grouping several characters. For example
force\vec[TAB]
produces
Does a syntax for this feature exist?
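A minimal Python sketch of why the arrow lands on the last letter only, assuming the \vec completion inserts U+20D7 COMBINING RIGHT ARROW ABOVE. The helper name describe is made up for illustration. The sketch only inspects the resulting string and does not provide a way to draw one arrow over the whole word.

```python
import unicodedata

# Assumption: this is the character the \vec tab completion inserts.
VEC = "\u20d7"

def describe(text):
    """Return (code point, name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The mark is a separate code point appended after the final "e",
# so a renderer attaches it to that one letter.
word = "force" + VEC
for code_point, name in describe(word):
    print(code_point, name)
```

The string has six code points: five letters followed by one combining mark that belongs to the preceding letter.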

Related

VS code interprets wrong comma from my keyboard

When I type a comma in VS code I get
，
but what I really want is
,
I'm not sure why there's two types of commas, but the first one is giving me runtime errors in Python.
"I'm not sure why there's two types of commas"
Well ... it is because Unicode defines them [1]. The first one is U+FF0C : FULLWIDTH COMMA. The second one is U+002C : COMMA.
The latter is the one that should be bound to the "comma" key on your keyboard, but it is possible that something has changed your VSCode key bindings. This page describes how to examine your VSCode key bindings.
But I think the more likely explanation is that you have copy-pasted some source code from (for example) a PDF file that uses U+FF0C instead of U+002C for cosmetic reasons ... or something. It is also possible that they were placed there by the author of the original document, or by their word-processing software.
You could try using the Gremlins extension to highlight any potentially troublesome characters in your source code.
[1] - According to Wikipedia, the purpose is "so that older encodings containing both halfwidth and fullwidth characters can have lossless translation to/from Unicode."
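If you would rather check a file yourself, here is a small sketch that reports fullwidth ASCII variants (U+FF01 to U+FF5E, which includes the fullwidth comma) and maps them back to their ordinary counterparts. The function names find_fullwidth and to_ascii are invented for this example, and the fixed offset of 0xFEE0 only holds for that one block.

```python
import unicodedata

FULLWIDTH_FIRST = 0xFF01  # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_LAST = 0xFF5E   # FULLWIDTH TILDE
OFFSET = 0xFEE0           # distance from a fullwidth form to its ASCII twin

def is_fullwidth(ch):
    return FULLWIDTH_FIRST <= ord(ch) <= FULLWIDTH_LAST

def find_fullwidth(source):
    """Return (index, character, Unicode name) for each fullwidth ASCII variant."""
    return [(i, ch, unicodedata.name(ch))
            for i, ch in enumerate(source) if is_fullwidth(ch)]

def to_ascii(source):
    """Replace fullwidth ASCII variants with the plain ASCII characters."""
    return "".join(chr(ord(ch) - OFFSET) if is_fullwidth(ch) else ch
                   for ch in source)

snippet = "print(1\uff0c 2)"
print(find_fullwidth(snippet))
print(to_ascii(snippet))
```

Note that to_ascii also rewrites fullwidth characters inside string literals, which may not be what you want in text that is meant to contain them.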

Can a combining character be used alone in Unicode?

Let's take COMBINING ACUTE ACCENT, for example. Its browser test page does include it alone in the page, but it reacts in a strange way: I can't select it with my mouse, and if I try to interact with it in the DOM inspector, it feels like it's not part of the text at all (there's no before and after this character):
Is a combining character, used alone, still a valid Unicode string?
Or does it have to follow another character?
Yes, a combining character alone is a valid Unicode string (even though its behaviour may be weird without a base character). Section 2.11 of the Unicode Standard emphasises this:
In the Unicode Standard, all sequences of character codes are permitted.
The presentation of such strings is described in D52:
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character [...] In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
However, if you want to display a combining character by itself, it is recommended that you attach it to a no-break space base character:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent
isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be
employed, for example, when talking about the combining mark itself as a mark, rather
than using it in its normal way in text (that is, applied as an accent to a base letter or in
other combinations).
Also, a dotted circle character ◌ (U+25CC) can be used as a base character.
Source: https://en.wikipedia.org/wiki/Dotted_circle
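As a rough illustration of the points above, the following Python sketch shows that a lone combining mark is an ordinary, encodable string, and builds the two display forms mentioned (on a no-break space and on a dotted circle). The helper names is_isolated_mark and show_mark are assumptions made for this example.

```python
import unicodedata

ACUTE = "\u0301"          # COMBINING ACUTE ACCENT
NBSP = "\u00a0"           # NO-BREAK SPACE
DOTTED_CIRCLE = "\u25cc"  # DOTTED CIRCLE

def is_isolated_mark(text):
    """True if text begins with a combining mark, so nothing precedes it as a base."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

def show_mark(mark, base=NBSP):
    """Attach a combining mark to a visible or invisible base for display."""
    return base + mark

# A lone combining mark encodes without error.
encoded = ACUTE.encode("utf-8")

print(is_isolated_mark(ACUTE))                     # mark with no base
print(is_isolated_mark(show_mark(ACUTE)))          # mark on NO-BREAK SPACE
print(repr(show_mark(ACUTE, base=DOTTED_CIRCLE)))  # mark on DOTTED CIRCLE
```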

Determine the individual unicode characters that make up a word

I'm having trouble breaking a word into its individual Unicode components. I'm working with the Devanagari script using Google Input Tools. An example is र्म (pronounced -rm), which I want to break into म (-m) and the hook at the top (-r). But I can't seem to find the Unicode character that corresponds to the hook at the top. Here are some of the things I tried:
1. Copy and paste र्म into MS Word and hit Alt+X. But this breaks the word into र् and म. It doesn't give me the Unicode character for the top hook.
2. I tried the site http://shapecatcher.com/. I found a character called latin egyptological ain; while similar in shape, it cannot be used on top of another character. I'm looking for the conjunct version of the hook.
Any help would be appreciated. I'm using TekMaker on Windows 8.
The ‘hook at the top’ representing a preceding र् is an inseparable part of the glyph for a variety of biconsonantal ligatures. It's not a discrete, freely-combinable diacritical mark as we would understand it in Latin-like scripts.
Consequently the visual rendering element doesn't have its own Unicode representation distinct from its linguistic meaning र्, sorry!
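To see what is actually stored, you can list the code points of the conjunct, for instance with this short Python sketch (code_points is just an illustrative helper name). The hook never shows up as its own entry; it only appears when the font renders the sequence of RA followed by VIRAMA before another consonant.

```python
import unicodedata

def code_points(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The conjunct from the question, written with escapes to avoid copy-paste damage.
conjunct = "\u0930\u094d\u092e"

for code_point, name in code_points(conjunct):
    print(code_point, name)
```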

Funny strange (unicode) characters take more than one line

I found some "funny" characters (e.g. ḓ̵̙͎̖̯̞̜̞̪̠ and •̩̩̩̩̩̩̩̩̩̩) on social media that take up more than one line. At first I thought it was a Firefox bug, but Gedit and LibreOffice Writer render them the same way. So what is actually going on? I am asking about the character encoding and rendering.
I tried to find the characters in GNOME Character Map, but they could not be found.
I tried to check the character codes of both of them in Unicode (probably UTF-8). It seems each of them takes more than one character. How can one character be more than one character? This is the result from Python.
Character •̩̩̩̩̩̩̩̩̩̩
u'\u2022\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329
\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329\u0329'
Character ḓ̵̙͎̖̯̞̜̞̪̠
u'\u1e13\u0335\u0319\u034e\u0316\u032f\u031e\u031c\u031e\u032a\u0320\u033c\u031e
\u0320\u034e\u033c\u0353\u034b\u036e\u034c\u0346\u0300\u035c\u0345'
U+0329 is COMBINING VERTICAL LINE BELOW. It is a combining character (and so are all the others in there except U+2022 and U+1E13), meaning that it combines with the previous one. What you see here is merely the result of someone stacking way too many combining characters on the same base.
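A small sketch of how such a string can be taken apart in Python, assuming that "combining character" here means any code point whose general category starts with M (Mn, Mc or Me). The function name split_marks is made up for this example.

```python
import unicodedata

def is_mark(ch):
    """True for combining marks (general categories Mn, Mc and Me)."""
    return unicodedata.category(ch).startswith("M")

def split_marks(text):
    """Return (text without marks, list of the marks that were stacked on it)."""
    base = "".join(ch for ch in text if not is_mark(ch))
    marks = [ch for ch in text if is_mark(ch)]
    return base, marks

stacked = "\u2022" + "\u0329" * 22
base, marks = split_marks(stacked)
print(repr(base), len(marks), unicodedata.name(marks[0]))
```

What looks like one tall character is a single base followed by a long run of marks, which is why len() reports many characters.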

Is it possible to change an emacs syntax table based on context?

I'm working on improving an emacs major mode for UnrealScript. One of the (many) quirks is that it allows syntax like this for specifying tooltips in the Unreal editor:
var() int MyEditorVar <Foo=Bar|Tooltip=My tooltip text isn't quoted>;
The angle brackets after the variable declaration denote a pipe-separated list of Key=Value metadata pairs, and the metadata is not quoted but can contain quote marks -- a pipe (|) or right angle bracket (>) denotes the end.
Is there a way I can get the emacs syntax table to recognize this context-dependent syntax in a useful way? I'd like everything except for pipes and right angle brackets to be highlighted in some way inside of these variable metadata declarations, but otherwise retain their normal highlighting.
Right now, the single quote character is set up to be a quote delimiter (syntax designator "), so font-lock-mode interprets such a quote as the start of a quoted string. In this very specific instance it is not one, so everything is mishighlighted until another, supposedly matching, single quote is found.
You'll need to set up a syntax-propertize-function, which lets you apply different syntax designators to different characters in the buffer, depending on their context.
Grep for syntax-propertize-function in Emacs's lisp directory to see various examples (from simple to pretty complex ones).
You'll probably want to mark the "=" chars after your "Foo" and after your "Tooltip" as "generic string delimiter", then do the same with the corresponding terminating "|" and ">". An alternative could be to mark the char before the ">" as a (closing) generic string delimiter, so that you can then mark the "<" and ">" as open and close parens.
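This is not Emacs Lisp, only a language-neutral sketch in Python of the first suggestion: finding which character positions would receive the "generic string delimiter" property. It assumes a metadata block is any <...> run without nested angle brackets, which is a simplification that would also match unrelated angle brackets. The name delimiter_positions is hypothetical. A real syntax-propertize-function would apply the property at these positions.

```python
import re

# Assumption: a metadata block is a <...> run with no nested angle brackets.
META_BLOCK = re.compile(r"<[^<>]*>")
# A value starts at "=" and runs up to the next "|" or ">".
VALUE = re.compile(r"=[^|>]*[|>]")

def delimiter_positions(line):
    """Return (open_index, close_index) pairs, one per metadata value.

    open_index points at the "=" and close_index at the terminating "|" or ">".
    """
    positions = []
    for block in META_BLOCK.finditer(line):
        # Only search inside the block, keeping indices relative to the line.
        for value in VALUE.finditer(line, block.start(), block.end()):
            positions.append((value.start(), value.end() - 1))
    return positions

line = "var() int MyEditorVar <Foo=Bar|Tooltip=My tooltip text isn't quoted>;"
for open_index, close_index in delimiter_positions(line):
    print(repr(line[open_index + 1:close_index]))
```

The single quote in "isn't" falls between a pair of marked positions, so it would be treated as string content rather than as the start of a new string.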