Comparing characters in Rebol 3 - unicode

I am trying to compare characters to see if they match. I can't figure out why it doesn't work. I'm expecting true on the output, but I'm getting false.
character: "a"
word: "aardvark"
(first word) = character ; expecting true, getting false

So "a" in Rebol is not a character, it is actually a string.
A single unicode character is its own independent type, with its own literal syntax, e.g. #"a". For example, it can be converted back and forth from INTEGER! to get a code point, which the single-letter string "a" cannot:
>> to integer! #"a"
== 97
>> to integer! "a"
** Script error: cannot MAKE/TO integer! from: "a"
** Where: to
** Near: to integer! "a"
A string is not a series of one-character STRING!s, it's a series of CHAR!. So what you want is therefore:
character: #"a"
word: "aardvark"
(first word) = character ;-- true!
(Note: Interestingly, binary conversions of both a single character string and that character will be equivalent:
>> to binary! "μ"
== #{CEBC}
>> to binary! #"μ"
== #{CEBC}
...those are UTF-8 byte representations.)

I recommend for cases like this, when things start to behave in a different way than you expected, to use things like probe and type?. This will help you get a sense of what's going on, and you can use the interactive Rebol console on small pieces of code.
For instance:
>> character: "a"
>> word: "aardvark"
>> type? first word
== char!
>> type? character
== string!
So you can indeed see that the first element of word is a character #"a", while your character is the string! "a". (Although I agree with #HostileFork that comparing a string of length 1 and a character is for a human the same.)
Other places you can test things are http://tryrebol.esperconsultancy.nl or in the chat room with RebolBot

Related

Swift string indexing combines "\r\n" as one char instead of two

I am dealing with strings containing \r\n with Swift 4.2. I ran into kind of strange behavior of Swift index, it appears \r\n will be treated as one character instead of two by Swift indexing methods. I wrote a piece of code to present this behavior:
var text = "ABC\r\n\r\nDEF"
func printChar(_ lower: Int, _ upper: Int) {
let start = text.index(text.startIndex, offsetBy: lower)
let end = text.index(text.startIndex, offsetBy: upper)
print("\"" + text[start..<end] + "\"")
}
printChar(0, 1) // "A"
printChar(1, 2) // "B"
printChar(2, 3) // "C"
printChar(3, 4) // new line
printChar(4, 5) // new line (okay, what's going on here?)
printChar(5, 6) // "D"
printChar(6, 7) // "E"
printChar(7, 8) // "F"
The print result will be
"A"
"B"
"C"
"
"
"
"
"D"
"E"
"F"
Any idea why it's like this?
TLDR: \r\n is a grapheme cluster and is treated as a single Character in Swift because Unicode.
Swift treats \r\n as one Character.
Objective-C NSString treats it as two characters (in terms of the result from length).
On the swift-users forum someone wrote:
– "\r\n" is a single Character. Is this the correct behaviour?
– Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.
And the subsequent response posted a link to Unicode documentation, check out this table which officially states CRLF is a grapheme cluster.
Take a look at the Apple documentation on Characters and Grapheme Clusters.
It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.
The Swift documentation on Strings and Characters is also worth reading.
This overview from objc.io is interesting as well.
NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.
Another example of this is an emoji like 👍🏻. This single character is actually %uD83D%uDC4D%uD83C%uDFFB, four different unicode scalars. But if you called count on a string with just that emoji you'd (correctly) get 1.
If you wanted to see the scalars you could iterate them as follows:
for scalar in text.unicodeScalars {
print("\(scalar.value) ", terminator: "")
}
Which for "\r\n" would give you 13 10
In the Swift documentation you'll find why NSString is different:
The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character and \r\n is a single Character.

String to Integer (atoi) [Leetcode] gave wrong answer?

String to Integer (atoi)
This problem is implement atoi to convert a string to an integer.
When test input = " +0 123"
My code return = 123
But why expected answer = 0?
======================
And if test input = " +0123"
My code return = 123
Now expected answer = 123
So is that answer wrong?
I think this is expected result as it said
Requirements for atoi:
The function first discards as many whitespace characters as necessary until the first non-whitespace character is found. Then, starting from this character, takes an optional initial plus or minus sign followed by as many numerical digits as possible, and interprets them as a numerical value.
Your first test case has a space in between two different digit groups, and atoi only consider the first group which is '0' and convert into integer

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through datastructures that are ASCII for all practical purposes: particularly with file I/O. The absence of a built in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As #MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D which is not in the same order as ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä for example). See #MartinR's answer here for more info.
If you need to check if a character is in between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to an unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
println("yep")
} else {
println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
println("yep")
} else {
println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why do I exhume 7 year old posts? Fun I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the User perspective, which is a sequence of grapheme clusters, each (usually) which a visually separable appearance, or from the encoding perspective, which can be one of several (UTF32, UTF16, UTF8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design the encoding of ASCII values 0-127 will have an identical encoding in UTF-8, so loading that stream with a UTF8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé#¶œŎO!##"
foo.forEach { c in
print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
+ ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
# is ascii with value 64; # is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
# is ascii with value 64; # is not a letter
# is ascii with value 35; # is not a letter

How to use Unicode codepoints above U+FFFF in Rebol 3 strings like in Rebol 2?

I know you can't use caret style escaping in strings for codepoints bigger than ^(FF) in Rebol 2, because it doesn't know anything about Unicode. So this doesn't generate anything good, it looks messed up:
print {Q: What does a Zen master's {Cow} Say? A: "^(03BC)"!}
Yet the code works in Rebol 3 and prints out:
Q: What does a Zen master's {Cow} Say? A: "μ"!
That's great, but R3 maxes out its ability to hold a character in a string at all at U+FFFF apparently:
>> type? "^(FFFF)"
== string!
>> type? "^(010000)"
** Syntax error: invalid "string" -- {"^^(010000)"}
** Near: (line 1) type? "^(010000)"
The situation is a lot better than the random behavior of Rebol 2 when it met codepoints it didn't know about. However, there used to be a workaround in Rebol for storing strings if you knew how to do your own UTF-8 encoding (or got your strings by way of loading source code off disk). You could just assemble them from individual characters.
So the UTF-8 encoding of U+010000 is #F0908080, and you could before say:
workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
And you'd get a string with that single codepoint encoded using UTF-8, that you could save to disk in code blocks and read back in again. Is there any similar trick in R3?
There is a workaround using the string! datatype as well. You cannot use UTF-8 in that case, but you can use UTF-16 workaround as follows:
utf-16: "^(d800)^(dc00)"
, which encodes the ^(10000) code point using UTF-16 surrogate pair. In general, the following function can do the encoding:
utf-16: func [
code [integer!]
/local low high
] [
case [
code < 0 [do make error! "invalid code"]
code < 65536 [append copy "" to char! code]
code < 1114112 [
code: code - 65536
low: code and 1023
high: code - low / 1024
append append copy "" to char! high + 55296 to char! low + 56320
]
'else [do make error! "invalid code"]
]
]
Yes, there is a trick...which is the trick you should have been using in R2 as well. Don't use a string! Use a binary! if you have to do this sort of thing:
good-workaround: #{F0908080}
It would've worked in Rebol2, and it works in Rebol3. You can save it and load it without any funny business.
In fact, if care about Unicode at all, ever...stop doing string processing that is using codepoints higher than ^(7F) if you are stuck in Rebol 2 and not 3. We'll see why by looking at that terrible workaround:
terrible-workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
..."And you'd get a string with that single UTF-8 codepoint"...
The only thing you should get is a string with four individual character codepoints, and with 4 = length? terrible-workaround. Rebol2 is broken because string! is basically no different from binary! under the hood. In fact, in Rebol2 you could alias the two types back and forth without making a copy, look up AS-BINARY and AS-STRING. (This is impossible in Rebol3 because they really are fundamentally different, so don't get attached to the feature!)
It's somewhat deceptive to see these strings reporting a length of 4, and there's a false comfort of each character producing the same value if you convert them to integer!. Because if you ever write them out to a file or port somewhere, and they need to be encoded, you'll get bitten. Note this in Rebol2:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{80}
But in R3, you have a UTF-8 encoding when binary conversion is needed:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{C280}
So you will be in for a surprise when your seemingly-working code does something different at a later time, and winds up serializing differently. In fact, if you want to know how "messed up" R2 is in this regard, look at why you got a weird symbol for your "mu". In R2:
>> to binary! #"^(03BC)"
== #{BC}
It just threw the "03" away. :-/
So if you need for some reason to work with a Unicode strings and can't switch to R3, try something like this for the cow example:
mu-utf8: #{03BC}
utf8: rejoin [#{} {Q: What does a Zen master's {Cow} Say? A: "} mu-utf8 {"!}]
That gets you a binary. Only convert it to string for debug output, and be ready to see gibberish. But it is the right thing to do if you're stuck in Rebol2.
And to reiterate the answer: it's also what to do if for some odd reason stuck needing to use those higher codepoints in Rebol3:
utf8: rejoin [#{} {Q: What did the Mycenaean's {Cow} Say? A: "} #{010000} {"!}]
I'm sure that would be a very funny joke if I knew what LINEAR B SYLLABLE B008 A was. Which leads me to say that most likely, if you're doing something this esoteric you probably only have a few codepoints being cited as examples. You can hold most of your data as string up until you need to slot them in conveniently, and hold the result in a binary series.
UPDATE: If one hits this problem, here is a utility function that can be useful for working around it temporarily:
safe-r2-char: charset [#"^(00)" - #"^(7F)"]
unsafe-r2-char: charset [#"^(80)" - #"^(FF)"]
hex-digit: charset [#"0" - #"9" #"A" - #"F" #"a" - #"f"]
r2-string-to-binary: func [
str [string!] /string /unescape /unsafe
/local result s e escape-rule unsafe-rule safe-rule rule
] [
result: copy either string [{}] [#{}]
escape-rule: [
"^^(" s: 2 hex-digit e: ")" (
append result debase/base copy/part s e 16
)
]
unsafe-rule: [
s: unsafe-r2-char (
append result to integer! first s
)
]
safe-rule: [
s: safe-r2-char (append result first s)
]
rule: compose/deep [
any [
(either unescape [[escape-rule |]] [])
safe-rule
(either unsafe [[| unsafe-rule]] [])
]
]
unless parse/all str rule [
print "Unsafe codepoints found in string! by r2-string-to-binary"
print "See http://stackoverflow.com/questions/15077974/"
print mold str
throw "Bad codepoint found by r2-string-to-binary"
]
result
]
If you use this instead of a to binary! conversion, you will get the consistent behavior in both Rebol2 and Rebol3. (It effectively implements a solution for terrible-workaround style strings.)

How can I work with a single byte and binary! byte arrays in Rebol 3?

In Rebol 2, it is possible to use to char! to produce what is effectively a single byte, that you can use in operations on binaries such as append:
>> buffer: #{DECAFBAD}
>> data: #{FFAE}
>> append buffer (to char! (first data))
== #{DECAFBADFF}
Seems sensible. But in Rebol 3, you get something different:
>> append buffer (to char! (first data))
== #{DECAFBADC3BF}
That's because it doesn't model single characters as single bytes (due to Unicode). So the integer value of first data (255) is translated into a two-byte sequence:
>> to char! 255
== #"ÿ"
>> to binary! (to char! 255)
== #{C3BF}
Given that CHAR! is no longer equivalent to a byte in Rebol 3, and no BYTE! datatype was added (such that a BINARY! could be considered a series of these BYTE!s just as a STRING! is a series of CHAR!), what is one to do about this kind of situation?
Use an integer!, the closest match we have for expressing a byte in R3, at the moment.
Note that integers are range-checked when used as bytes in context of a binary!:
>> append #{} 1024
** Script error: value out of range: 1024
** Where: append
** Near: append #{} 1024
For your first example, you actually append one element of one series to another series of the same type. In R3 you can express this in the obvious and most straight-forward way:
>> append #{DECAFBAD} first #{FFAE}
== #{DECAFBADFF}
So for that matter, a binary! is a series of range-constrained integer!s.
Unfortunately, that won't work in R2, because its binary! model was just broken in many tiny ways, including the above. While conceptually a binary! in R2 can be considered a series of char!s, that concept is not consistently implemented.
>> buffer: #{DECAFBAD}
== #{DECAFBAD}
>> data: #{FFAE}
== #{FFAE}
>> append buffer data/1
== #{DECAFBADFF}
The easiest workaround for the issue in both Rebol 2 and Rebol 3 is to use COPY/PART. This way, rather than trying to reduce your content to a byte value, you keep it as a binary series type. You're merely appending a binary sequence of length 1:
>> append buffer (copy/part data 1)
== #{DECAFBADFF}
It does seem that for completeness, Rebol 3 would have a BYTE! type, as calling it a "series of INTEGER!" clearly does not match the precedent set by STRING!
It's open source, so you (um, I?) might try adding it and seeing what the full ramifications are. :-)