How to generate a "phantom" string for a Unicode string, consisting of whitespace characters of same width? - unicode

Given a sequence of Unicode characters, how can I obtain a string of whitespace characters that has the same width (at least in monospace fonts that display each character with single or double width of the characters from Basic Latin)?
Examples
For example, given the string \u0061\u0020\u0062\u0020\u0063, which has five characters and looks like this:
a b c
('a', space, 'b', space, 'c'), I would like to obtain a string consisting of just five spaces:
\u0020\u0020\u0020\u0020\u0020
and given \u6b22\u8fce\u5149\u4e34 that looks like
欢迎光临
I'd want to obtain a string containing four ideographic spaces: \u3000\u3000\u3000\u3000.
Background
Here is an example where this matters: error reporting in compilers for languages that support Unicode. Suppose that we have some hypothetical programming language PL (could be Python, Java, Scala, Ruby ...) that has string literals and parentheses. Suppose that this is an invalid snippet of PL-code, because it contains an unmatched parenthesis:
"stringLiteral")
If we tried to compile it, the compiler of PL could produce an error message that looks as follows:
:1: error: ';' expected but ')' found.
"stringLiteral")
               ^
Note the "phantom string" followed by '^' in the last line: it points exactly at the unmatched closing parenthesis.
If I try the same with CJK characters, here is what I get:
:1: error: ';' expected but ')' found.
"欢迎光临欢迎光临欢迎光临欢迎光临欢迎光临欢迎")
                        ^
Note that now the "phantom string" in the last line consists of ordinary Latin whitespaces, and in the console, the '^' looks as if it's somewhere in the middle of the string of the CJK characters, instead of at the parenthesis.
If I try the same with Croatian characters:
:1: error: ';' expected but ')' found.
"DŽDždžLJLjljNJNjnj")
           ^
the '^' pointer also ends up at a seemingly completely wrong position, because those special Croatian characters are much wider than ordinary spaces.
All of the examples produce similar results in such languages as Python, Java, Scala, Ruby (just copy-paste " y⃝e҈s҉ ") or "临欢迎光临欢迎") into the interactive shell, and see where the ^ ends up).
Solution attempt
Here is a naïve attempt to generate "phantom"-strings in Scala. There is a method Character.isIdeographic. I can use it to define a phantom method by mapping every ideographic character to \u3000, and all other characters to ' ' (ordinary space).
// Map each ideographic character to U+3000 and every other character to U+0020.
def phantom(s: String) =
  s.map(c => if (Character.isIdeographic(c)) '\u3000' else ' ')
In simple cases, it works. For example, if I define a string
val s = "foo欢迎光临欢迎bar光临欢baz"
and then print the string followed by a vertical bar |, a line break, and then the phantom(s) followed by vertical bar |,
println(s + "|\n" + phantom(s) + "|")
then I obtain:
foo欢迎光临欢迎bar光临欢baz|
           |
and the vertical bars at the end of the two lines line up perfectly, because phantom(s) is now
\u0020\u0020\u0020\u3000\u3000\u3000\u3000\u3000\u3000\u0020\u0020\u0020\u3000\u3000\u3000\u0020\u0020\u0020
that is:
three ordinary spaces corresponding to "foo"
six ideographic spaces corresponding to the "欢迎光临欢迎" piece
then again three spaces corresponding to "bar"
...
and so on.
However, if I try the same with Croatian characters, I again get a mess:
DŽDždžLJLjljNJNjnj|
         |
(vertical bars don't line up).
Question
Does Unicode define any properties that would allow me to generate robust "phantom" strings of same width?
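One property that comes close is East_Asian_Width (Unicode Standard Annex #11). Below is a minimal Python sketch of a phantom function built on it. It assumes a terminal that renders Wide (W) and Fullwidth (F) characters in two cells and draws non-spacing or enclosing marks on top of the preceding character. The name `phantom` mirrors the Scala attempt above, and the mapping is an illustration, not a complete answer.

```python
import unicodedata


def phantom(s: str) -> str:
    """Return whitespace intended to occupy the same cells as s."""
    out = []
    for ch in s:
        if unicodedata.category(ch) in ("Mn", "Me"):
            # Non-spacing and enclosing marks attach to the previous
            # character, so they are assumed to add no width.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Ideographic space, assumed to be two cells wide.
            out.append("\u3000")
        else:
            out.append(" ")
    return "".join(out)


print(repr(phantom("a b c")))
print(repr(phantom("\u6b22\u8fce\u5149\u4e34")))
```

As far as I can tell, the digraphs such as DŽ (U+01C4) are not classified as Wide by this property, so their displayed width still depends on the font and this sketch would not fix that case.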

Related

Alphanumeric substitution with vim

I'm using the vscode vimplugin. I have a bunch of lines that look like:
Terry,169,80,,,47,,,22,,,6,,
I want to remove all the alphanumeric characters after the first comma so I get:
Terry,,,,,,,,,,,,,
In command mode I tried:
s/^.+\,[a-zA-Z0-9-]\+//g
But this does not appear to do anything. How can I get this working?
edit:
s/^[^,]\+,[a-zA-Z0-9-]\+//g
\+ is greedy; ^.\+, eats the entire line up to the last ,.
Instead of the dot (which means "any character") use [^,] which means "any but a comma". Then ^[^,]\+, means "any characters up to the first comma".
The problem with your requirement is that you want to anchor at the beginning using ^ so you cannot use flag g — with the anchor any substitution will be done once. The only way I can solve the puzzle is to use expressions: match and preserve the anchored text and then use function substitute() with flag g.
I managed with the following expression:
:s/\(^[^,]\+\)\(,\+\)\(.\+\)$/\=submatch(1) . submatch(2) . substitute(submatch(3), '[^,]', '', 'g')/
Let me split it in parts. Searching:
\(^[^,]\+\) — first, match any non-commas
\(,\+\) — any number of commas
\(.\+\)$ — all chars to the end of the string
Substituting:
\= — the substitution is an expression
See http://vimdoc.sourceforge.net/htmldoc/change.html#sub-replace-expression
submatch(1) — replace with the first match (non-commas anchored with ^)
submatch(2) — replace with the second match (commas)
substitute(submatch(3), '[^,]', '', 'g') — replace in the rest of the string
The last call to substitute() is simple, it replaces all non-commas with empty strings.
PS. Tested in real vim, not vscode.
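For comparison, the same "keep the anchored part, then substitute globally in the rest" idea can be sketched outside Vim. This is a Python illustration only; the function name `blank_after_first_field` is made up for the example.

```python
import re


def blank_after_first_field(line: str) -> str:
    # Split once at the first comma; `head` plays the role of submatch(1).
    head, comma, rest = line.partition(",")
    # Delete every non-comma character in the remainder, which is what
    # substitute(submatch(3), '[^,]', '', 'g') does in the Vim expression.
    return head + comma + re.sub(r"[^,]", "", rest)


print(blank_after_first_field("Terry,169,80,,,47,,,22,,,6,,"))
```

Splitting first avoids needing an anchored pattern together with a global flag, which is the conflict described above.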

create a generic regex for a string in perl

I have tried to create regex for the below:
STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert
REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/
I am unable to figure out the mistake here.
I created the above by adapting another regex, which works fine and is given below, together with the line it matches:
/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/
sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag
Can someone please help me? Any leads would be highly appreciated.
Let's start from a brief look at your source text (the first that you included).
It is composed of "sections" separated with | char.
This char (|) must be matched by \|. Remember about the preceding
backslash, otherwise, a "bare" | would mean the alternative separator
(you used it in one place).
And now take a look at each section (between |):
Some of them contain only a sequence of word chars (and can be matched
by \w+).
Other sections, however, contain also other chars, e.g. slashes,
backslash, braces and dots, so each such section is actually a sequence
of chars other than "|" and must be matched by [^|]+ (here,
between [ and ], the vertical bar may be unescaped).
Now let's write each section and its "type":
sou_u02_..._FW_ALERT - word chars.
/opt/app/.../UnifiedLogging - other chars (because of slashes).
UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8}
and dots).
FATAL|red|1h - 3 sections composed of word chars.
An empty section, between 2 consecutive | chars.
fw_alert - word chars.
And now, how to match these groups, and the separating |:
Point 1: \w+\| - word chars and (escaped) vertical bar.
Point 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing
group - (?:...), containing a sequence of "other" chars - [^|]+
and a vertical bar - \|, occurring 2 times {2}.
Point 4 (three "word char" groups): (?:\w+\|){3} - similar to
the previous point.
Point 5: Just as in your solution - ([^|]+|)\|, a capturing group -
(...), with 2 alternatives ...|.... The first alternative is
[^|]+ (a sequence of "other" chars), and the second alternative
is empty. After the capturing group there is \| to match the vertical
bar.
Point 6: \w+ - word chars. This time no \|, as this is the last
section.
The regex assembled so far must be:
prepended with a ^ (start of string) and
appended with a $ (end of string).
So the whole regex, matching your source text can be:
^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$
Actually, the only capturing group can be written another way,
as ([^|]*) - without alternatives, but with * as the
repetition count, allowing also empty content.
Your choice, which variant to apply.
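A quick way to check the assembled regex against the sample line is sketched below in Python, whose `re` module accepts these patterns unchanged. The constant names are made up for the example; the sample text and the patterns are copied from the question and from the steps above.

```python
import re

# Sample line from the question, kept raw so that \d{8} stays literal text.
SAMPLE = (
    r"sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT"
    r"|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging"
    r"|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv"
    r"|FATAL|red|1h||fw_alert"
)

# The regex from the question: the third section is matched with \w+.
ORIGINAL = re.compile(r"^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$")
# The regex assembled above, and the variant using ([^|]*).
ASSEMBLED = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$")
STAR_VARIANT = re.compile(r"^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]*)\|\w+$")

print(ORIGINAL.match(SAMPLE) is None)             # the original does not match
print(repr(ASSEMBLED.match(SAMPLE).group(1)))     # captured (empty) section
print(repr(STAR_VARIANT.match(SAMPLE).group(1)))  # same capture with *
```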
The third field
UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv
Contains a backslash, \, braces { } and dots .. None of these can be matched by \w
Note also that there is no need to escape a pipe | inside a character class: [^|]+ is fine.

Character literal for vertical tab?

How can I write a character literal for a vertical tab ('\v', ASCII 11) in Scala?
'\v' doesn't work. (invalid escape character)
'\11' should be it, but...
scala> '\11'.toInt
res13: Int = 9
But 9 is the ASCII code for a normal tab('\t'). What is going on there?
EDIT: This works and produces the right character, but I'd still like to know the syntax for a literal.
val c:Char = 11
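You need to use '\13'. The escape is in octal: 13 in octal is 11 in decimal, while '\11' is octal 11, which is 9 in decimal (the normal tab).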
For more information see Scala Language Specification.
1.3.4 Character Literals
Syntax:
characterLiteral ::= ‘\’’ printableChar ‘\’’ | ‘\’’ charEscapeSeq ‘\’’
A character literal is a single character enclosed in quotes. The
character is either a printable unicode character or is described by
an escape sequence (§1.3.6).
Example 1.3.4 Here are some character literals:
’a’ ’\u0041’ ’\n’ ’\t’
Note that ‘\u000A’ is not a valid character literal because Unicode
conversion is done before literal parsing and the Unicode character
\u000A (line feed) is not a printable character. One can use instead
the escape sequence ‘\n’ or the octal escape ‘\12’ (§1.3.6).
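The octal arithmetic behind '\11' and '\13' can be checked with a few lines of Python. This is only a worked example of the base conversion; the helper name `octal_escape_value` is made up, and nothing here is Scala API.

```python
def octal_escape_value(digits: str) -> int:
    """Interpret the digits of an octal escape such as \\13 in base 8."""
    return int(digits, 8)


print(octal_escape_value("11"))  # code of the horizontal tab
print(octal_escape_value("13"))  # code of the vertical tab
print(chr(octal_escape_value("13")) == "\v")
```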

Io string (Sequence) manipulation/formatting?

Does Io have built in methods that mirror the ord() and chr() functions in other languages (namely being able to take an integer and return the ASCII character associated with it, or take a string character and return the ASCII number for that character)?
Is there a print/write function that allows for formatting of the output? I'm wanting to create ANSI colored output to the command line, and need the means to print an escape character (ASCII character 27) to do that.
For chr() see asCharacter in the Number object.
For ord() either asBinarySignedInteger or asBinaryUnsignedInteger from the Sequence object seems to fit the bill.
# ord
"@" asBinarySignedInteger println # => 64
# chr
64 asCharacter println # => "@"
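For readers comparing against the ord() and chr() the question mentions, here is the same round trip in Python, followed by an escape character (ASCII 27) used for ANSI color. This is a Python sketch, not Io; the helper `colored` and the color code 31 (red) are chosen only for illustration.

```python
ESC = chr(27)  # the escape character, ASCII 27


def colored(text: str, code: int = 31) -> str:
    # Wrap text in an ANSI SGR sequence and reset attributes afterwards.
    return f"{ESC}[{code}m{text}{ESC}[0m"


print(ord("@"))   # character to number
print(chr(64))    # number to character
print(colored("error"))
```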

sed - remove specific subscript from string

please provide me a sed oneliner which provides this output:
sdc3 sdc2
for Input :
sdc3[1] sdc2[0]
I mean remove all subscript value from the string ..
sed 's/\[[^]]*\]//g'
reads: substitute any string with literal "[" followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching treating "[1] sdc2[0]" as a single match in your sample string.
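The difference between the greedy form and the negated character class can be seen in a short Python sketch. It uses Python's `re.sub` instead of sed, with the closing bracket escaped inside the class, and the variable names are made up for the example.

```python
import re

s = "sdc3[1] sdc2[0]"

# Greedy .* runs from the first "[" to the last "]" on the line.
greedy = re.sub(r"\[.*\]", "", s)
# The negated class stops at the first "]" after each "[".
bounded = re.sub(r"\[[^\]]*\]", "", s)

print(repr(greedy))
print(repr(bounded))
```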
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switch the separator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one
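The capture-group version translates directly to other regex engines. A minimal Python sketch, assuming space-delimited values as in the post; the function name `to_dev_paths` is made up for the example.

```python
import re


def to_dev_paths(s: str) -> str:
    # Group 1 captures the name before "[" (no "[" and no space allowed);
    # the bracketed index is matched and dropped; re.sub replaces every match.
    return re.sub(r"([^\[ ]*)\[[^\]]*\]", r"/dev/\1", s)


print(to_dev_paths("sdc3[1] sdc2[0]"))
```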