Padding mixed CJK/ASCII input for monospaced output - unicode

I have been trying to solve this for some time. I have a need to output console stuff like:
ABC-花海 | 123 | 456789
JustSomeAnsi | 123 | 456789
花花花花花花 | 123 | 456789
追不到我别生气 | 123 | 456789
Now I know perfect alignment is not possible, but the example above would be tolerable to me. The question is how would you figure out how much to pad and with what space char.
If context is relevant, this is being output into Unreal Engine console. I've already have replacement for original Roboto font - either Sarasa Gothic or GNU Unifont or just any normal monospaced font like Fira Code + fallback for CJK (DroindSans right now, but it is not monospace)

In Unicode, there are fullwidth versions of ASCII characters compatible with CJK in the U+FF00 code point block (See Unicode code chart).
Here is your text using only fullwidth characters, including IDEOGRAPHIC SPACE U+3000:
           ABC-花海 | 123 | 456789
     JustSomeAnsi | 123 | 456789
           花花花花花花 | 123 | 456789
          追不到我别生气 | 123 | 456789
Also see this answer for a related Python-coded example that translates ASCII to the fullwidth forms.

Related

Two different eye emojis?

As far as I knew, there are currently two emojis for eyes. The pair of eyes (U+1F440) with hex code f09f9180 (👀), and a single eye (U+1F441) with hex code f09f9181 (👁).
I now found when using the emojis of the keyboard in my phone that another eye emoji exists, with hex code f09f9181efb88f (👁️).
The gajim messenger on the PC, and the Conversations app on the mobile phone, can display both. The gajim emoji-chooser only contains the short sequence and the Swiftkey-Keyboard Emoji-Chooser only the longer one.
When I copy and paste the emojis i.e. in the Firefox URL address bar, they look the same (blue eye, while the messengers both display them in black). When I Google for the emojis, I only find pages describing the shorter code point.
Firefox renders both emojis the same, but Vivaldi (Chromium based) shows the one with the shorter code point as narrow black and white emoji and the other one as larger brown eye.
When I Google for the hex dump, I find a lot of emojipedia sites for the shorter dump, and nothing useful at all for the longer one.
Is there somewhere any documentation about the additional emoji? Why aren't both emojis available in both emoji choosers?
f0 9f 91 80 is the UTF-8 encoded form of codepoint U+1F440.
f0 9f 91 81 is the UTF-8 encoded form of codepoint U+1F441.
f0 9f 91 81 ef b8 8f is the UTF-8 encoded form of codepoints U+1F441 U+FE0F.
U+FE0F is a Variation Selector:
Variation Selectors is a Unicode block containing 16 Variation Selector format characters (designated VS1 through VS16). They are used to specify a specific glyph variant for a Unicode character. They are currently used to specify standardized variation sequences for mathematical symbols, emoji symbols, 'Phags-pa letters, and CJK unified ideographs corresponding to CJK compatibility ideographs. At present only standardized variation sequences with VS1, VS15 and VS16 have been defined.
Where U+FE0F is VARIATION SELECTOR-16:
U+FE0F was added to Unicode in version 3.2 (2002). It belongs to the block Variation Selectors in the Basic Multilingual Plane.
This character is a Nonspacing Mark and inherits its script property from the preceding character.
The glyph is not a composition. It has a Ambiguous East Asian Width. In bidirectional context it acts as Nonspacing Mark and is not mirrored. In text U+FE0F behaves as Combining Mark regarding line breaks. It has type Extend for sentence and Extend for word breaks. The Grapheme Cluster Break is Extend.
This codepoint may change the appearance of the preceding character. If that is a symbol, dingbat or emoji, U+FE0F forces it to be rendered as a colorful image as compared to a monochrome text variant. The Unicode standard defines some standardized variants. See also “Unicode symbol as text or emoji” for a discussion of this codepoint.
In other words, U+FE0F tells VS-aware software to render U+1F441 as a colorful emoji instead of as monochromatic text.
The singular ‘👁’ is used as an emoji, but is defined as being text-style (i.e. black-and-white rather than colourful) by default. This isn’t implemented consistently across all platforms, however, so sometimes the character will also display as emoji style instead. In order to explicitly force one or the other style, the characters U+FE0E and U+FE0F can be appended to 👁 to make it appear as text style (👁︎) or emoji style (👁️) respectively. Because of the inconsistencies I mentioned, some devices and applications automatically add U+FE0F to the character (resulting in the longer code your phone keyboard produced), while others leave the character as-is (leaving just the code for the eye itself).

How to escape '|' in emacs's org-mode? [duplicate]

I've got a table in Emacs org-mode, and the contents are regular expressions. I can't seem to figure out how to escape a literal pipe-character (|) that's part of a regex though, so it's interpreted as a table-cell separator. Could someone point me to some help? Thanks.
Update: I'm also looking for escapes for a slash (/), so that it doesn't trigger the start of an italic/emphasis sequence. I experimented with \/ and \// - for example, suppose I want the literal text /foo/ in a table cell. Here are 3 ways of attempting it:
| /foo/ | \/foo/ | \//foo/ |
In LaTeX export, that becomes:
\emph{foo} & \/foo/ & \//foo/
So none of them is the plain /foo/ I'm hoping for.
\vert for the pipe.
Forward slashes seem to work fine for me unescaped when exporting both to HTML and PDF.
Use a broken-bar character, “¦”, Unicode 00A6 BROKEN BAR. This may or may not work for your specific needs, but it’s a good visual approximation.
You could also format the relevant text as verbatim or code:
Text in the code and verbatim string is not processed for Org mode
specific syntax; it is exported verbatim.
So you might try something like =foo | bar= (code) or foo ~|~ bar (verbatim). It does change the output format, though.

Unicode: English characters above code point 127

I'm giving a tech talk about Unicode and encoding in my company, in which I'm trying to make the point that strings are always encoded, and developers should never carelessly assume that everything is 0-127 ASCII.
I have numerous examples of problems caused by mis-encoded text, but I didn't find any example of simple English text with numbers that's encoded above Unicode code point 127.
The basic English alphabet is mapped in Unicode to the same numerical value as the plain old ASCII: The range A-Z is mapped to [65-90] (or [0x41-0x5a] in hex), and [a-z] is mapped to [97-122] (hex [0x61-0x7a]).
Does the English alphabet appear elsewhere in the code charts? I do not mean circumflex letters or other Latin variants, just the plain English alphabet.
CJK characters are generally monospaced in all fonts, since that's how those languages tend to be written.
When mixing CJK and English characters, however, you run into a problem: ASCII characters do not in general have the width of a CJK character. This means that if you use ASCII, you lose the monospaced property - which may not always be desirable.
For this purpose, fullwidth characters (U+FF00-FFEE, Wikipedia, Unicode code chart) may be used in place of "regular" characters. These have the property that they have the same width as a single CJK character.
Note, however, that fullwidth characters are virtually never used outside of a CJK context, and even in those contexts, plain ASCII is frequently used as well, when monospacing is considered unimportant.
Plenty of punctuation and symbols have code point values above U+007F:
“Hello.”
He had been given the comprehensive sixty-four-crayon Crayola box—including the gold and silver crayons—and would not let me look.
x ≠ y
The above examples use:
U+201C and U+201D — smart quotes
U+2014 — em-dash
U+2260 — not equal to
See the Unicode charts for more.
Well, if you just mean a-z and A-Z then no, there are no English characters above 127. But words like fiancé, resumé etc are sometimes spelled like that in English and use codepoints above 127.
Then there are various punctuation signs, currency symbols and so on that are above 127. Not sure if this counts as simple English text.

How to display cross symbol as superscript with unicode?

I want to display a cross in the place of superscript, I know the unicode character of the cross (\u2020).
Unicode encodes plain text. Superscripting isn’t plain text, so you need something external to, or “on top of” plain text. For example, on a web page, you could use the CSS to position character above the baseline (and reduce font size). In a word processor, you would use superscripting command or style.
In Unicode, there is a limited number of superscript characters, i.e. variants of characters in superscript style encoded as separate characters, such as superscript two “²”. But Unicode has no mechanism for superscripts in general.

Escape pipe-character in org-mode

I've got a table in Emacs org-mode, and the contents are regular expressions. I can't seem to figure out how to escape a literal pipe-character (|) that's part of a regex though, so it's interpreted as a table-cell separator. Could someone point me to some help? Thanks.
Update: I'm also looking for escapes for a slash (/), so that it doesn't trigger the start of an italic/emphasis sequence. I experimented with \/ and \// - for example, suppose I want the literal text /foo/ in a table cell. Here are 3 ways of attempting it:
| /foo/ | \/foo/ | \//foo/ |
In LaTeX export, that becomes:
\emph{foo} & \/foo/ & \//foo/
So none of them is the plain /foo/ I'm hoping for.
\vert for the pipe.
Forward slashes seem to work fine for me unescaped when exporting both to HTML and PDF.
Use a broken-bar character, “¦”, Unicode 00A6 BROKEN BAR. This may or may not work for your specific needs, but it’s a good visual approximation.
You could also format the relevant text as verbatim or code:
Text in the code and verbatim string is not processed for Org mode
specific syntax; it is exported verbatim.
So you might try something like =foo | bar= (code) or foo ~|~ bar (verbatim). It does change the output format, though.