Io string (Sequence) manipulation/formatting? - iolanguage

Does Io have built in methods that mirror the ord() and chr() functions in other languages (namely being able to take an integer and return the ASCII character associated with it, or take a string character and return the ASCII number for that character)?
Is there a print/write function that allows for formatting of the output? I'm wanting to create ANSI colored output to the command line, and need the means to print an escape character (ASCII character 27) to do that.

For chr() see asCharacter in the Number object.
For ord(), either asBinarySignedInteger or asBinaryUnsignedInteger from the Sequence object seems to fit the bill.
# ord
"@" asBinarySignedInteger println # => 64
# chr
64 asCharacter println # => @
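For comparison only, here is the same round trip in Python rather than Io, together with the escape character (ASCII 27) that the question needs for ANSI colors. The helper name colorize and the choice of SGR code 31 (red) are assumptions made for this sketch; whether the colors actually render depends on the terminal.

```python
ESC = chr(27)  # the escape character, ASCII 27 (0x1b)

def colorize(text: str, color_code: int) -> str:
    """Hypothetical helper: wrap text in an ANSI SGR color sequence and reset afterwards."""
    return f"{ESC}[{color_code}m{text}{ESC}[0m"

print(ord("@"))                 # character -> code
print(chr(64))                  # code -> character
print(colorize("warning", 31))  # shown in red on terminals that support ANSI colors
```

In Io, the analogous step would presumably be to build the escape character with 27 asCharacter and concatenate it into the output string.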

Related

Combining characters using their hexa name

This works:
say "\c[COMBINING BREVE, COMBINING DOT ABOVE]" # OUTPUT: «̆̇␤»
However, this does not:
say "\c[0306, 0307]"; # OUTPUT: «IJij␤»
It's treating them as two different characters. Is there a way to make it work directly with the numbers, other than using uniname to convert them to names?
The \c[…] escape is for declaring a character by its name or an alias.
0306 is not a name, it is the ordinal/codepoint of a character.
The \x[…] escape is for declaring a character by its hexadecimal ordinal.
say "\x[0306, 0307]"; # OUTPUT: «̆̇␤»
(Hint: There is an x in a hexadecimal literal 0x0306)
\c uses decimal numbers:
say "\c[774, 775]"
where 774 is the decimal equivalent of 0306, works perfectly.
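The hexadecimal versus decimal mix-up can be reproduced outside Raku. The following Python sketch (an illustration, not part of the original answers) builds the same two combining characters by name, by hexadecimal code point, and by decimal code point, and shows what happens when the hex digits are misread as decimal.

```python
import unicodedata

# Three equivalent ways to get COMBINING BREVE + COMBINING DOT ABOVE.
by_name = unicodedata.lookup("COMBINING BREVE") + unicodedata.lookup("COMBINING DOT ABOVE")
by_hex = chr(0x0306) + chr(0x0307)
by_decimal = chr(774) + chr(775)

# Reading the hex digits as decimal numbers selects different characters.
misread = chr(306) + chr(307)

print(by_name == by_hex == by_decimal)
print([unicodedata.name(c) for c in misread])
```

The misread pair is the IJ/ij ligature pair seen in the question's output.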

How to generate a "phantom" string for a Unicode string, consisting of whitespace characters of same width?

Given a sequence of Unicode characters, how can I obtain a string of whitespace characters that has the same width (at least in monospace fonts that display each character with single or double width of the characters from Basic Latin)?
Examples
For example, given the string \u0061\u0020\u0062\u0020\u0063 with five characters that looks like this:
a b c
('a', space, 'b', space, 'c'), I would like to obtain a string consisting of just five spaces:
\u0020\u0020\u0020\u0020\u0020
and given \u6b22\u8fce\u5149\u4e34 that looks like
欢迎光临
I'd want to obtain a string containing four ideographic spaces: \u3000\u3000\u3000\u3000.
Background
Here is an example where this matters: error reporting in compilers for languages that support Unicode. Suppose that we have some hypothetical programming language PL (could be Python, Java, Scala, Ruby ...) that has string literals and parentheses. Suppose that this is an invalid snippet of PL-code, because it contains an unmatched parenthesis:
"stringLiteral")
If we tried to compile it, the compiler of PL could produce an error message that looks as follows:
:1: error: ';' expected but ')' found.
"stringLiteral")
               ^
Note the "phantom string" followed by '^' in the last line: it points exactly at the unmatched closing parenthesis.
If I try the same with CJK characters, here is what I get:
:1: error: ';' expected but ')' found.
"欢迎光临欢迎光临欢迎光临欢迎光临欢迎光临欢迎")
^
Note that now the "phantom string" in the last line consists of ordinary Latin whitespaces, and in the console, the '^' looks as if it's somewhere in the middle of the string of the CJK characters, instead of at the parenthesis.
If I try the same with Croatian characters:
:1: error: ';' expected but ')' found.
"DŽDždžLJLjljNJNjnj")
           ^
the '^' pointer also ends up at a seemingly completely wrong position, because those special Croatian characters are much wider than ordinary spaces.
All of the examples produce similar results in such languages as Python, Java, Scala, Ruby (just copy-paste " y⃝e҈s҉ ") or "临欢迎光临欢迎") into the interactive shell, and see where the ^ ends up).
Solution attempt
Here is a naïve attempt to generate "phantom"-strings in Scala. There is a method Character.isIdeographic. I can use it to define a phantom method by mapping every ideographic character to \u3000, and all other characters to ' ' (ordinary space).
def phantom(s: String) =
  s.map(c => if (Character.isIdeographic(c)) '\u3000' else ' ')
In simple cases, it works. For example, if I define a string
val s = "foo欢迎光临欢迎bar光临欢baz"
and then print the string followed by a vertical bar |, a line break, and then the phantom(s) followed by vertical bar |,
println(s + "|\n" + phantom(s) + "|")
then I obtain:
foo欢迎光临欢迎bar光临欢baz|
           |
and the vertical bars in the end of the strings line up perfectly, because the phantom(s) is now
\u0020\u0020\u0020\u3000\u3000\u3000\u3000\u3000\u3000\u0020\u0020\u0020\u3000\u3000\u3000\u0020\u0020\u0020
that is:
three ordinary spaces corresponding to "foo"
six ideographic spaces corresponding to the "欢迎光临欢迎" piece
then again three spaces corresponding to "bar"
...
and so on.
However, if I try the same with Croatian characters, I again get a mess:
DŽDždžLJLjljNJNjnj|
         |
(vertical bars don't line up).
Question
Does Unicode define any properties that would allow me to generate robust "phantom" strings of same width?
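One property that may help here is East_Asian_Width (Unicode Standard Annex #11), which Python exposes as unicodedata.east_asian_width. The sketch below is written in Python rather than Scala and rests on two assumptions: that Wide and Fullwidth characters occupy two cells, and that combining marks and format characters occupy none. It is not a complete answer: the Croatian digraphs are classified as narrow, so the property cannot account for a font that draws them wider than one cell.

```python
import unicodedata

def phantom(s: str) -> str:
    """Sketch: build a whitespace string of roughly the same display width as s."""
    out = []
    for ch in s:
        # Assumption: nonspacing marks, enclosing marks and format characters take no cell.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide (W) and Fullwidth (F) characters map to an ideographic space.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            out.append("\u3000")
        else:
            out.append(" ")
    return "".join(out)

s = "foo欢迎光临欢迎bar光临欢baz"
print(s + "|\n" + phantom(s) + "|")
```

Unlike the isIdeographic check, this also covers wide characters that are not ideographs and skips combining marks, but the final alignment still depends on the font and the terminal.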

What does \x do in print

I would like to start by saying that I am not familiar with Perl. That being said, I came across this piece of code and I could not figure out what the \x was for in the code below. In addition, I was unsure why nothing was displayed when I ran the following:
perl -e 'print "\x7c\x8e\x04\x08"'
It's not about print: it's about string representation, in which codes represent characters from your character set. For more information, read the "Quote and Quote-like Operators" section of the Perl documentation and "Effects of Character Semantics".
In your case the character code is in hex. You should look in your character set table, and you may need to convert to decimal first.
You said "I was unsure why nothing was displayed when I ran the following:"
perl -e 'print "\x7c\x8e\x04\x08"'
That command outputs 4 characters to STDOUT. Each of the characters is specified in hexadecimal. The "\x7c" part will output the vertical bar character |. The other three characters are control characters, so probably wouldn't produce any visible output. If you redirect output to a file, you will end up with a 4 byte file.
It's possible that you're not seeing the vertical bar character because it's being overwritten by your command prompt. Unlike the shell echo or Python's print, Perl's print function does not automatically append a newline to all output. If you want new lines, you can insert them in the string using \n.
\x signifies the start of a hexadecimal character notation.
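To see why only one of the four characters is visible, the same escapes can be inspected in Python, which uses the same \x notation in byte strings. This is an illustrative sketch, not Perl; the helper name describe is an assumption made for the example.

```python
def describe(data: bytes) -> list:
    """Hypothetical helper: report whether each byte maps to a printable character."""
    report = []
    for byte in data:
        kind = "printable" if chr(byte).isprintable() else "non-printable"
        report.append(f"0x{byte:02x} {kind}")
    return report

data = b"\x7c\x8e\x04\x08"  # the same four bytes as in the Perl one-liner
print(len(data))
for line in describe(data):
    print(line)
```

Only 0x7c (the vertical bar) is reported as printable; the other three are control characters.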

Check characters inside string for their Unicode value

I would like to replace characters that have certain Unicode values in a variable with a dash. I have two ideas that might work, but I do not know how to check the value of a character:
1/ process the variable as a string, check every character's value, and place the characters in a new variable (replacing those characters which are invalid)
2/ use this magic :-)
$variable =~ s/[$char_range]/-/g;
char_range should be similar to [0-9] or [A-Z], but it should contain values for utf-8 characters. I need the range from 0x00 to 0x7F to be exact.
The following expression should replace anything that is not ASCII with a hyphen, which is (I think) what you want to do:
s/[\N{U+0080}-\N{U+10FFFF}]/-/g
There's no such thing as UTF-8 characters. There are only characters that you encode into UTF-8. Even then, you don't want to make ranges outside of the magical ones that Perl knows about. You're likely to get more than you expect.
To get the ordinal value for a character, use ord:
use v5.10; # enables say
use utf8;  # the source contains a literal non-ASCII character
my $code_number = ord '😸'; # U+1F638
say sprintf "%#x", $code_number; # 0x1f638
However, I don't think that's what you need. It sounds like you want to replace characters in the ASCII range with a -. You can specify ranges of code numbers:
s/[\000-\177]/-/g; # in octal
s/[\x00-\x7f]/-/g; # in hexadecimal
You can specify wide character ordinal values in braces:
s/[\x80-\x{10ffff}]/-/g; # wide characters, replace non-ASCII in this case
When the characters have a common property, you can use that:
s/\p{ASCII}/-/g;
However, if you are replacing things character for character, you might want a transliteration:
$string =~ tr/\000-\177/-/;
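For readers more familiar with Python, a rough equivalent of the two ideas above (checking a character's code number, and replacing by code-point range) might look like the following sketch. The function name replace_non_ascii is an assumption made for this example; it keeps the ASCII range 0x00 to 0x7F and replaces everything else.

```python
import re

def replace_non_ascii(text: str, replacement: str = "-") -> str:
    """Hypothetical helper: replace every character outside 0x00-0x7F."""
    return re.sub(r"[^\x00-\x7f]", replacement, text)

print(hex(ord("\U0001F638")))                 # code number of a single character
print(replace_non_ascii("na\u00efve caf\u00e9"))
```

Swapping the negated class for [\x00-\x7f] would instead replace the ASCII characters, which mirrors the range-based Perl substitutions shown above.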

Which encoding uses the \x (backslash x) prefix?

I'm attempting to decode text which is prefixing certain 'special characters' with \x. I've worked out the following mappings by hand:
\x28 (
\x29 )
\x3a :
e.g. 12\x3a39\x3a03 AM
Does anyone recognise what this encoding is?
It's ASCII. All occurrences of the four characters \xST are converted to 1 character, whose ASCII code is ST (in hexadecimal), where S and T are any of 0123456789abcdefABCDEF.
The '\xAB' notation is used in C, C++, Perl, and other languages taking a cue from C, as a way of expressing hexadecimal character codes in the middle of a string.
The notation '\007' means use octal for the character code, when there are digits after the backslash.
In C99 and later, you can also use \uabcd and \U00abcdef to encode Unicode characters in hexadecimal (with 4 and 8 hex digits required; the first two hex digits in \U must be 0 to be valid, and often the third digit will be 0 too — 1 is the only other valid value).
Note that in C, octal escapes are limited to a maximum of 3 digits but hexadecimal escapes are not limited to 2 or 3 digits; the hexadecimal escape ends at the first character that's not a hexadecimal digit. In the question, the sequence is "12\x3a39\x3a03". That is a string containing 4 characters: 1, 2, \x3a39 and \x3a03. The actual value used for the 4-digit hex characters is implementation-defined. To achieve the desired result (using \x3A to represent a colon :), the code would have to use string concatenation:
"12\x3a" "39\x3a" "03"
This now contains 8 characters: 1, 2, :, 3, 9, :, 0, 3.
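If the text really contains a literal backslash followed by x and two hex digits, as in the question, it can be decoded by hand. The Python sketch below is one way to do it, assuming every escape is exactly two hex digits; the function name decode_hex_escapes is hypothetical.

```python
import re

def decode_hex_escapes(text: str) -> str:
    """Hypothetical helper: turn each literal \\xNN (two hex digits) into its character."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

print(decode_hex_escapes(r"12\x3a39\x3a03 AM"))
```

Because the pattern consumes exactly two hex digits, the digits 39 and 03 that follow each escape are left alone, which sidesteps the greedy C behaviour described above.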
I use CyberChef for this sort of thing.
If you drop it in the input field and drag Magic from the Favourites list into the recipe it'll tell you the conversion and that you could've used the From_Hex recipe with a \x delimiter.
I'm guessing that you are dealing with a Unicode string whose encoding differs from that of the output stream it was sent to, e.g. a UTF-16 string output to a Latin-1 device. In that situation, certain characters are output as escape values to avoid sending control characters or wrong characters to the output device. This happens in Python, at least.