I have a problem: I've got an escaped string, for example "\\u0026", and I need to transform it into the Unicode char '\u0026'.
Tricks like
string_concat('\\', S, "\\u0026"), write(S).
didn't help, because it removes every \, not only the one in the escape sequence. So basically my problem is how to remove escape chars from the string.
EDIT: Oh, I've just noticed that Stack Overflow also plays with the escape \.
write_canonical/1 gives me "\\u0026"; how do I transform that into a single '&' char?
In ISO Prolog a char is usually considered an atom of length 1.
Atoms and chars are enclosed in single quotes, or written without
quotes if possible. Here are some examples:
?- X = abc. /* an atom, but not a char */
X = abc
?- X = a. /* an atom and also a char */
X = a
?- X = '\u0061'.
X = a
The \u notation is SWI-Prolog specific and is not found in ISO
Prolog. SWI-Prolog also has a string data type, again not found
in ISO Prolog, which is always enclosed in double quotes. Here are
some examples:
?- X = "abc". /* a string */
X = "abc"
?- X = "a". /* again a string */
X = "a"
?- X = "\u0061".
X = "a"
If you have a string at hand of length 1, you can convert it to a char
via the predicate atom_string/2. This is a SWI-Prolog specific predicate,
not in ISO Prolog:
?- atom_string(X, "\u0061").
X = a
?- atom_string(X, "\u0026").
X = &
A recommendation: start by learning the ISO Prolog atom predicates first;
there are quite a number of them. Then learn the SWI-Prolog atom and string predicates.
You don't have to learn that many new SWI-Prolog predicates, since in SWI-Prolog most of the ISO Prolog predicates also accept strings. Here is an example of the ISO Prolog predicate atom_codes/2 used with a string in the first argument:
?- atom_codes("\u0061\u0026", L).
L = [97, 38].
?- L = [0'\u0061, 0'\u0026].
L = [97, 38].
?- L = [0x61, 0x26].
L = [97, 38].
P.S.: The 0' notation is defined in ISO Prolog; it is neither a char, an atom, nor a string, but an integer. Its value is the code of the char given after the 0'. I have combined it with the SWI-Prolog \u notation.
P.P.S.: Combining the 0' notation with the \u notation is of course redundant; in ISO Prolog one can directly use the hex prefix 0x for integer values.
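For completeness, the ISO predicate char_code/2 goes from such an integer code back to a char, which is what the question ultimately asks for:
?- char_code(C, 0x26).
C = &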
The thing is that "\\u0026" is already what you are searching for, because it represents \u0026 (the \\ in the source is just the escape for a single backslash).
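If, on the other hand, your string really contains the six characters \u0026 and you want the single char '&', a minimal SWI-Prolog sketch could look like this (the predicate name unescape_unicode/2 is made up for illustration):
unescape_unicode(Escaped, Char) :-
    string_concat("\\u", Hex, Escaped),         % strip the leading \u
    string_codes(Hex, HexCodes),                % "0026" -> [0'0, 0'0, 0'2, 0'6]
    number_codes(Code, [0'0, 0'x | HexCodes]),  % read the digits as hex: 0x0026 = 38
    char_code(Char, Code).                      % 38 -> '&'
?- unescape_unicode("\\u0026", C).
C = &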
My goal is to get a random emoticon, from a list, in F#.
I started with this:
let pickOne (icons: string) : char = icons.[Helpers.random.Next(icons.Length)]
let happySymbols = "๐ฅ๐๐๐๐๐๐ช๐๐๐๐๐ค๐๐ค๐คฉ๐คช๐ค ๐ฅณ๐๐คค๐๐"
let sadSymbols = "๐ญ๐๐๐ฉ๐ข๐คฆ๐คท๐ฑ๐๐คจ๐๐ฌ๐๐คฎ๐ต๐คฏ๐ง๐๐๐ค๐ก๐คฌ"
that doesn't work because:
"๐ฅ๐๐๐๐๐๐ช๐๐๐๐๐ค๐๐ค๐คฉ๐คช๐ค ๐ฅณ๐๐คค๐๐".Length
is returning 44, because Length counts the chars in the string, which does not work well with Unicode characters.
I can't just divide by 2 because I may add some single-byte characters to the string at some point.
Indexing doesn't work either:
let a = "๐ฅ๐๐๐๐๐๐ช๐๐๐๐๐ค๐๐ค๐คฉ๐คช๐ค ๐ฅณ๐๐คค๐๐"
a.[0]
will not return the first emoji, but some unknown character symbol instead.
so, plan B was: let's make this an array instead of a string:
let a = [| '๐ฅ'; '๐'; '๐'; '๐'; '๐'; '๐'; '๐ช'; '๐'; '๐'; '๐'; '๐'; '๐ค'; '๐'; '๐ค'; '๐คฉ'; '๐คช'; '๐ค '; '๐ฅณ'; '๐'; '๐คค'; '๐'; '๐' |]
this is not compiling, I'm getting:
Parse error Unexpected quote symbol in binding. Expected '|]' or other token.
why is that?
anyhow, I can make a list of strings and get it to work, but I'm curious: is there a "proper" way to make the first one work and take a random unicode character from a unicode string?
Asti's answer works for your purpose, but I wasn't too happy about where we landed on this. I guess I got hung up on the word "proper". After a lot of research in various places, I got curious about the method String.EnumerateRunes, which in turn led me to the type Rune. The documentation for that type is particularly enlightening about proper string handling and about what's in a Unicode string in .NET. I also experimented in LINQPad, and got this.
let dump x = x.Dump()
let runes = "abcABCรฆรธรฅรรร
๐๐๐โ
่จ่ง่ฆ่ฅ".EnumerateRunes().ToArray()
runes.Length |> dump
// 20
runes |> Array.iter (fun rune -> dump (string rune))
// a b c A B C æ ø å Æ Ø Å
๐ ๐ ๐ โ
่จ ่ง ่ฆ ่ฅ
dump runes
// see screenshot
let smiley = runes.[13].ToString()
dump smiley
// ๐
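To tie this back to the original goal, here is a minimal sketch of a rune-based pickOne (assuming .NET Core 3.0 or later for EnumerateRunes; the local rnd stands in for the question's Helpers.random):
open System

let rnd = Random()

// pick a random Rune and return it as a string (a whole emoji, never half a surrogate pair)
let pickOne (icons: string) : string =
    let runes = icons.EnumerateRunes() |> Seq.toArray
    runes.[rnd.Next(runes.Length)].ToString()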
All strings in .NET are 16-bit unicode strings.
That's the definition of char:
Represents a character as a UTF-16 code unit.
Every char is one UTF-16 code unit (2 bytes); characters that don't fit in a single code unit take as many code units as required. Emojis don't fit in 2 bytes, so they take 4 bytes, or 2 chars (a surrogate pair).
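A quick check in F# Interactive, using 😀 (U+1F600) as a stand-in for one of the emoji above:
> "😀".Length;;
val it : int = 2
> System.Char.IsHighSurrogate("😀".[0]);;
val it : bool = true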
So what's the solution? align(4) all the things! (insert GCC joke here).
First we convert everything into UTF32:
open System.Text

let utf32 (source: string) =
    Encoding.Convert(Encoding.Unicode, Encoding.UTF32, Encoding.Unicode.GetBytes(source))
Then we can pick and choose any "character":
let pick (arr: byte[]) index =
    Encoding.UTF32.GetString(arr, index * 4, 4)
Test:
let happySymbols = "๐ฅ๐๐๐๐๐๐ช๐๐๐๐๐ค๐๐ค๐คฉ๐คช๐ค ๐ฅณ๐๐คค๐๐YTHO"
pick (utf32 happySymbols) 0;;
val it : string = "๐ฅ"
> pick (utf32 happySymbols) 22;;
val it : string = "Y"
For the actual length, just div by 4.
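For instance, with the string above (22 emoji plus YTHO):
> (utf32 happySymbols).Length / 4;;
val it : int = 26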
let surpriseMe arr =
    let rnd = Random()
    pick arr (rnd.Next(0, arr.Length / 4))
Hmmm
> surpriseMe (utf32 happySymbols);;
val it : string = "๐"
in this example:
var str1 = "hello"
var str2 = "Hello"
if str1 < str2 { print("hello is less than Hello")}
else {print("hello is more than Hello")}
On what basis is it determined that str1 is greater than str2?
Swift strings are compared according to the
Unicode Collation Algorithm, which means that (effectively)
each string is put into "Unicode Normalization Form D", and then
the Unicode scalar values of these "decomposed" strings are compared lexicographically.
In your example, "hello" and "Hello" have the Unicode values
hello: U+0068, U+0065, U+006C, U+006C, U+006F
Hello: U+0048, U+0065, U+006C, U+006C, U+006F
and therefore "Hello" < "hello".
The "normalization" or "decomposing" is relevant e.g. for characters
with diacritical marks. As an example,
a = U+0061
ä = U+00E4
b = U+0062
have the decomposed form
a: U+0061
ä: U+0061, U+0308 // LATIN SMALL LETTER A + COMBINING DIAERESIS
b: U+0062
and therefore "a" < "ä" < "b".
For more details and examples, see What does it mean that string and character comparisons in Swift are not locale-sensitive?
The two strings are compared, character by character, using each character's Unicode value. Since h has a higher code (U+0068) than H (U+0048), str1 is "greater" than str2.
Based on Martin's comment below the question, it's slightly more complex than I stated. Please see What does it mean that string and character comparisons in Swift are not locale-sensitive? for more detail.
I think it is based on lexicographical order: https://en.wikipedia.org/wiki/Lexicographical_order
In Swift 4.2 -
// The Unicode value gives you an idea of why "hello" is greater than "Hello", as both strings have the same length.
var str1 = "hello"
var str2 = "Hello"
if (str1 < str2){
print("hello is less than Hello")
}
else {
print("hello is more than Hello")
}
print(str1.unicodeScalars[str1.unicodeScalars.startIndex].value)
print(str2.unicodeScalars[str2.unicodeScalars.startIndex].value)
I'd like to have a function generate(n) that generates the first n lowercase characters of the alphabet concatenated into a string (therefore 1 <= n <= 26).
For example:
generate(3) --> 'abc'
generate(5) --> 'abcde'
generate(9) --> 'abcdefghi'
I'm new to MATLAB and I'd be happy if someone could show me an approach to writing the function. For sure this will involve doing arithmetic with the ASCII codes of the characters, but I have no idea how to do this or which types MATLAB provides for it.
I would rely on ASCII codes for this. You can convert an integer to a character using char.
So for example if we want an "e", we could look up the ASCII code for "e" (101) and write:
char(101)
'e'
This also works for arrays:
char([101, 102])
'ef'
The nice thing in your case is that in ASCII, the lowercase letters are all the numbers between 97 ("a") and 122 ("z"). Thus the following code works by taking ASCII "a" (97) and creating an array of length n starting at 97. These numbers are then converted using char to strings. As an added bonus, the version below ensures that the array can only go to 122 (ASCII for "z").
function output = generate(n)
output = char(97:min(96 + n, 122));
end
Note: For the upper limit we use 96 + n because if n were 1, then we want 97:97 rather than 97:98 as the second would return "ab". This could be written as 97:(97 + n - 1) but the way I've written it, I've simply pulled the "-1" into the constant.
You could also make this a simple anonymous function.
generate = @(n)char(97:min(96 + n, 122));
generate(3)
'abc'
To write the most portable and robust code, I would probably not want those hard-coded ASCII codes, so I would use something like the following:
output = 'a':char(min('a' + n - 1, 'z'));
...or, you can just generate the entire alphabet and take the part you want:
function str = generate(n)
alphabet = 'a':'z';
str = alphabet(1:n);
end
Note that this will fail with an index out of bounds error for n > 26, so you might want to check for that.
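For example, a guarded version might look like this (the error message is just a suggestion):
function str = generate(n)
assert(n >= 1 && n <= 26, 'n must be between 1 and 26');
alphabet = 'a':'z';
str = alphabet(1:n);
end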
You can use the char built-in function, which converts an integer value (or array) into a character array.
EDIT
Bug fixed (ref. Suever's comment)
function [str]=generate(n)
a=97;
% str=char(a:a+n)
str=char(a:a+n-1)
Hope this helps.
Qapla'
I am trying to compare characters to see if they match. I can't figure out why it doesn't work. I'm expecting true on the output, but I'm getting false.
character: "a"
word: "aardvark"
(first word) = character ; expecting true, getting false
So "a" in Rebol is not a character, it is actually a string.
A single unicode character is its own independent type, with its own literal syntax, e.g. #"a". For example, it can be converted back and forth from INTEGER! to get a code point, which the single-letter string "a" cannot:
>> to integer! #"a"
== 97
>> to integer! "a"
** Script error: cannot MAKE/TO integer! from: "a"
** Where: to
** Near: to integer! "a"
A string is not a series of one-character STRING!s, it's a series of CHAR!. So what you want is therefore:
character: #"a"
word: "aardvark"
(first word) = character ;-- true!
(Note: Interestingly, binary conversions of both a single character string and that character will be equivalent:
>> to binary! "ฮผ"
== #{CEBC}
>> to binary! #"ฮผ"
== #{CEBC}
...those are UTF-8 byte representations.)
For cases like this, when things start to behave differently than you expected, I recommend using things like probe and type?. This will help you get a sense of what's going on, and you can use the interactive Rebol console on small pieces of code.
For instance:
>> character: "a"
>> word: "aardvark"
>> type? first word
== char!
>> type? character
== string!
So you can indeed see that the first element of word is the character #"a", while your character is the string! "a". (Although I agree with @HostileFork that, for a human, comparing a string of length 1 and a character amounts to the same thing.)
Other places where you can test things are http://tryrebol.esperconsultancy.nl or the chat room with RebolBot.