SWI-Prolog: How to get unicode char from escaped string?

I have a problem: I've got an escaped string, for example "\\u0026", and I need to transform it into the unicode char '\u0026'.
Tricks like
string_concat('\\', S, "\\u0026"), write(S).
didn't help, because that removes the backslash itself rather than undoing the escape. So basically my problem is how to remove the escape characters from the string.
EDIT: Oh, I've just noticed that Stack Overflow also plays with the escape \.
write_canonical/1 gives me "\\u0026"; how do I transform that into a single '&' char?

In ISO Prolog a char is usually considered an atom of length 1.
Atoms and chars are enclosed in single quotes, or written without
quotes if possible. Here are some examples:
?- X = abc. /* an atom, but not a char */
X = abc
?- X = a. /* an atom and also a char */
X = a
?- X = '\u0061'.
X = a
The \u notation is SWI-Prolog specific and is not found in ISO
Prolog. SWI-Prolog also has a string data type, again not found
in ISO Prolog, which is always enclosed in double quotes. Here are
some examples:
?- X = "abc". /* a string */
X = "abc"
?- X = "a". /* again a string */
X = "a"
?- X = "\u0061".
X = "a"
If you have a string of length 1 at hand, you can convert it to a char
via the predicate atom_string/2. This is a SWI-Prolog specific predicate,
not in ISO Prolog:
?- atom_string(X, "\u0061").
X = a
?- atom_string(X, "\u0026").
X = &
A recommendation: start learning the ISO Prolog atom predicates first;
there are quite a number of them. Then learn the SWI-Prolog atom and string predicates.
You don't have to learn many new SWI-Prolog predicates, since in SWI-Prolog most of the ISO Prolog predicates also accept strings. Here is an example of the ISO Prolog predicate atom_codes/2 used with a string as the first argument:
?- atom_codes("\u0061\u0026", L).
L = [97, 38].
?- L = [0'\u0061, 0'\u0026].
L = [97, 38].
?- L = [0x61, 0x26].
L = [97, 38].
P.S.: The 0' notation is defined in ISO Prolog; it is neither a char, an atom, nor a string, but an integer. Its value is the code of the char that follows the 0'. I have combined it with the SWI-Prolog \u notation.
P.P.S.: Using the 0' notation together with the \u notation is of course redundant; in ISO Prolog one can directly use the hexadecimal prefix 0x for integer values.
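Putting the pieces together for the original problem, here is a minimal sketch (SWI-Prolog only; unescape_u/2 is just an illustrative name, and it assumes number_string/2 accepts Prolog number syntax such as 0x0026) that takes a string holding the literal text \uXXXX and yields the corresponding char:
unescape_u(Escaped, Char) :-
    string_concat("\\u", Hex, Escaped),   % "\\u0026" -> Hex = "0026"
    string_concat("0x", Hex, HexNum),     % "0026" -> "0x0026"
    number_string(Code, HexNum),          % parsed as a hexadecimal integer: Code = 38
    char_code(Char, Code).                % code 38 -> Char = '&'

?- unescape_u("\\u0026", C).
C = &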

The thing is that "\\u0026" is already what you are searching for, because it represents the literal text \u0026.
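You can see that difference directly at the toplevel, for example with SWI-Prolog's string_length/2: the doubled-backslash form is six characters long, while the \u escape denotes a single character.
?- string_length("\\u0026", L).
L = 6.
?- string_length("\u0026", L).
L = 1.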

Related

Getting a random emoji/character from a unicode string

My goal is to get a random emoticon, from a list, in F#.
I started with this:
let pickOne (icons: string) : char = icons.[Helpers.random.Next(icons.Length)]
let happySymbols = "🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀"
let sadSymbols = "😭😔😒😩😢🤦🤷😱👎🤨😑😬🙄🤮😵🤯🧐😕😟😤😡🤬"
that doesn't work because:
"๐Ÿ”ฅ๐Ÿ˜‚๐Ÿ˜Š๐Ÿ˜๐Ÿ™๐Ÿ˜Ž๐Ÿ’ช๐Ÿ˜‹๐Ÿ˜‡๐ŸŽ‰๐Ÿ™Œ๐Ÿค˜๐Ÿ‘๐Ÿค‘๐Ÿคฉ๐Ÿคช๐Ÿค ๐Ÿฅณ๐Ÿ˜Œ๐Ÿคค๐Ÿ˜๐Ÿ˜€".Length
is returning 44 as length returns the number of chars in a string, which is not working well with unicode characters.
I can't just divide by 2 because I may add some single byte characters in the string at some point.
Indexing doesn't work either:
let a = "🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀"
a.[0]
will not return 🔥; I get some unknown character symbol instead.
so, plan B was: let's make this an array instead of a string:
let a = [| '🔥'; '😂'; '😊'; '😍'; '🙏'; '😎'; '💪'; '😋'; '😇'; '🎉'; '🙌'; '🤘'; '👍'; '🤑'; '🤩'; '🤪'; '🤠'; '🥳'; '😌'; '🤤'; '😁'; '😀' |]
this is not compiling, I'm getting:
Parse error Unexpected quote symbol in binding. Expected '|]' or other token.
why is that?
anyhow, I can make a list of strings and get it to work, but I'm curious: is there a "proper" way to make the first one work and take a random unicode character from a unicode string?
Asti's answer works for your purpose, but I wasn't too happy about where we landed on this. I guess I got hung up on the word "proper" in the question. After a lot of research in various places, I got curious about the method String.EnumerateRunes, which in turn led me to the type Rune. The documentation for that type is particularly enlightening about proper string handling and about what's in a Unicode (UTF-16) string in .NET. I also experimented in LINQPad, and got this.
let dump x = x.Dump()
let runes = "abcABCæøåÆØÅ😂😊😍₅茨茧茦茥".EnumerateRunes().ToArray()
runes.Length |> dump
// 20
runes |> Array.iter (fun rune -> dump (string rune))
// a b c A B C æ ø å Æ Ø Å 😂 😊 😍 ₅ 茨 茧 茦 茥
dump runes
// see screenshot
let smiley = runes.[13].ToString()
dump smiley
// 😊
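Building on that, here is a hedged sketch (my own, not from the documentation) of how the original pickOne could be written on top of EnumerateRunes; it assumes open System and System.Linq for ToArray, and it returns a string because a Rune may need two chars:
open System
open System.Linq

let random = Random()

// Pick a random Unicode code point (Rune) from a string and return it as a string.
let pickOneRune (icons: string) : string =
    let runes = icons.EnumerateRunes().ToArray()
    runes.[random.Next(runes.Length)].ToString()

let happySymbols = "🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀"
printfn "%s" (pickOneRune happySymbols)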
All strings in .NET are UTF-16 Unicode strings.
That's the definition of char:
Represents a character as a UTF-16 code unit.
Every character takes at least the minimum encoding size (2 bytes in UTF-16) and as many more bytes as required. Emoji don't fit in 2 bytes, so they take 4 bytes, i.e. 2 chars (a surrogate pair).
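You can verify this in F# interactive (a small illustration of the surrogate-pair point above, not part of the original answer):
printfn "%d" ("🔥".Length)                            // 2  -- two UTF-16 code units
printfn "%b" (System.Char.IsSurrogatePair("🔥", 0))   // true
printfn "%d" (System.Char.ConvertToUtf32("🔥", 0))    // 128293 = 0x1F525, the fire emoji's code point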
So what's the solution? align(4) all the things! (insert GCC joke here).
First we convert everything into UTF32:
open System.Text

let utf32 (source: string) =
    Encoding.Convert(Encoding.Unicode, Encoding.UTF32, Encoding.Unicode.GetBytes(source))
Then we can pick and choose any "character":
let pick (arr: byte[]) index =
    Encoding.UTF32.GetString(arr, index * 4, 4)
Test:
let happySymbols = "🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀YTHO"
pick (utf32 happySymbols) 0;;
val it : string = "🔥"
> pick (utf32 happySymbols) 22;;
val it : string = "Y"
For the actual length, just div by 4.
let surpriseMe arr =
    let rnd = System.Random()   // fully qualified so no extra open is needed
    pick arr (rnd.Next(0, arr.Length / 4))
Hmmm
> surpriseMe (utf32 happySymbols);;
val it : string = "😍"
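As a further alternative that neither approach above uses: System.Globalization.StringInfo enumerates whole text elements (grapheme clusters), which on .NET 5 or later also copes with emoji composed of several code points. A minimal sketch under that assumption:
open System
open System.Globalization

// Pick a random text element (grapheme cluster) from a string.
let pickTextElement (s: string) =
    let info = StringInfo(s)
    info.SubstringByTextElements(Random().Next(info.LengthInTextElements), 1)

pickTextElement "🔥😂😊👍🏽"   // may return the full skin-toned 👍🏽, which Rune would split in two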

Alphabetic comparison - why is a less than b? [duplicate]

in this example:
var str1 = "hello"
var str2 = "Hello"
if str1 < str2 { print("hello is less than Hello")}
else {print("hello is more than Hello")}
On what basis is it determined that str1 is greater than str2?
Swift strings are compared according to the
Unicode Collation Algorithm,
which means that (effectively)
each string is put into "Unicode Normalization Form D", and
the Unicode scalar values of these "decomposed" strings are compared lexicographically.
In your example, "hello" and "Hello" have the Unicode values
hello: U+0068, U+0065, U+006C, U+006C, U+006F
Hello: U+0048, U+0065, U+006C, U+006C, U+006F
and therefore "Hello" < "hello".
The "normalization" or "decomposing" is relevant e.g. for characters
with diacritical marks. As an example,
a = U+0061
ä = U+00E4
b = U+0062
have the decomposed form
a: U+0061
ä: U+0061, U+0308 // LATIN SMALL LETTER A + COMBINING DIAERESIS
b: U+0062
and therefore "a" < "ä" < "b".
For more details and examples, see What does it mean that string and character comparisons in Swift are not locale-sensitive?
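To see this in a playground (an illustrative check, not from the answer above): the precomposed and decomposed forms of "ä" compare equal, and the ordering follows the decomposed scalars.
let precomposed = "\u{00E4}"         // ä as a single scalar
let decomposed = "\u{0061}\u{0308}"  // a + COMBINING DIAERESIS
print(precomposed == decomposed)     // true
print("a" < "ä", "ä" < "b")          // true true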
The two strings are compared, character by character, using each character's Unicode value. Since h has a higher code (U+0068) than H (U+0048), str1 is "greater" than str2.
Based on Martin's comment below the question, it's slightly more complex than I stated. Please see What does it mean that string and character comparisons in Swift are not locale-sensitive? for more detail.
I think it is based on lexicographical order: https://en.wikipedia.org/wiki/Lexicographical_order
In Swift 4.2 -
// The Unicode values give you an idea of why "hello" is greater than "Hello", since both strings have the same length.
var str1 = "hello"
var str2 = "Hello"
if str1 < str2 {
    print("hello is less than Hello")
} else {
    print("hello is more than Hello")
}
print(str1.unicodeScalars[str1.unicodeScalars.startIndex].value) // 104, the scalar value of "h"
print(str2.unicodeScalars[str2.unicodeScalars.startIndex].value) // 72, the scalar value of "H"

Matlab: Function that returns a string with the first n characters of the alphabet

I'd like to have a function generate(n) that generates the first n lowercase characters of the alphabet concatenated into a string (therefore 1 <= n <= 26).
For example:
generate(3) --> 'abc'
generate(5) --> 'abcde'
generate(9) --> 'abcdefghi'
I'm new to MATLAB and I'd be happy if someone could show me an approach for writing the function. Presumably this involves doing arithmetic with the characters' ASCII codes, but I have no idea how to do that or which types MATLAB provides for it.
I would rely on ASCII codes for this. You can convert an integer to a character using char.
So for example if we want an "e", we could look up the ASCII code for "e" (101) and write:
char(101)
'e'
This also works for arrays:
char([101, 102])
'ef'
The nice thing in your case is that in ASCII, the lowercase letters are all the numbers between 97 ("a") and 122 ("z"). Thus the following code works by taking ASCII "a" (97) and creating an array of length n starting at 97. These numbers are then converted using char to strings. As an added bonus, the version below ensures that the array can only go to 122 (ASCII for "z").
function output = generate(n)
    output = char(97:min(96 + n, 122));
end
Note: For the upper limit we use 96 + n because if n were 1, then we want 97:97 rather than 97:98 as the second would return "ab". This could be written as 97:(97 + n - 1) but the way I've written it, I've simply pulled the "-1" into the constant.
You could also make this a simple anonymous function.
generate = @(n) char(97:min(96 + n, 122));
generate(3)
'abc'
To write the most portable and robust code, I would probably not want those hard-coded ASCII codes, so I would use something like the following:
output = 'a':char(min('a' + n - 1, 'z'));
...or, you can just generate the entire alphabet and take the part you want:
function str = generate(n)
    alphabet = 'a':'z';
    str = alphabet(1:n);
end
Note that this will fail with an index out of bounds error for n > 26, so you might want to check for that.
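For example, here is a hedged sketch of that check using MATLAB's validateattributes (any equivalent input check would do):
function str = generate(n)
    % Reject anything outside 1..26 before indexing into the alphabet.
    validateattributes(n, {'numeric'}, {'scalar', 'integer', '>=', 1, '<=', 26});
    alphabet = 'a':'z';
    str = alphabet(1:n);
end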
You can use the char built-in function, which converts an integer value (or array) into a character array.
EDIT
Bug fixed (ref. Suever's comment)
function [str] = generate(n)
    a = 97;
    % str = char(a:a+n)   % original version, off by one
    str = char(a:a+n-1)
Hope this helps.
Qapla'

Comparing characters in Rebol 3

I am trying to compare characters to see if they match. I can't figure out why it doesn't work. I'm expecting true on the output, but I'm getting false.
character: "a"
word: "aardvark"
(first word) = character ; expecting true, getting false
So "a" in Rebol is not a character, it is actually a string.
A single unicode character is its own independent type, with its own literal syntax, e.g. #"a". For example, it can be converted back and forth from INTEGER! to get a code point, which the single-letter string "a" cannot:
>> to integer! #"a"
== 97
>> to integer! "a"
** Script error: cannot MAKE/TO integer! from: "a"
** Where: to
** Near: to integer! "a"
A string is not a series of one-character STRING!s, it's a series of CHAR!. So what you want is therefore:
character: #"a"
word: "aardvark"
(first word) = character ;-- true!
(Note: Interestingly, binary conversions of both a single character string and that character will be equivalent:
>> to binary! "μ"
== #{CEBC}
>> to binary! #"μ"
== #{CEBC}
...those are UTF-8 byte representations.)
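If you prefer to keep character as a one-letter string, another option (a small sketch of the same idea as above) is to take its first element before comparing, so that both sides are char! values:
character: "a"
word: "aardvark"
(first word) = (first character)  ; true, both sides are the char! #"a"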
For cases like this, when things start to behave differently than you expected, I recommend using things like probe and type?. This will help you get a sense of what's going on, and you can try small pieces of code in the interactive Rebol console.
For instance:
>> character: "a"
>> word: "aardvark"
>> type? first word
== char!
>> type? character
== string!
So you can indeed see that the first element of word is the character #"a", while your character is the string! "a". (Although I agree with @HostileFork that, for a human, comparing a string of length 1 and a character amounts to the same thing.)
Other places you can test things are http://tryrebol.esperconsultancy.nl or in the chat room with RebolBot