Getting a random emoji/character from a unicode string

My goal is to get a random emoticon, from a list, in F#.
I started with this:
let pickOne (icons: string) : char = icons.[Helpers.random.Next(icons.Length)]
let happySymbols = "🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀"
let sadSymbols = "😭😔😒😩😢🤦🤷😱👎🤨😑😬🙄🤮😵🤯🧐😕😟😤😡🤬"
That doesn't work, because:
"🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀".Length
returns 44: Length counts the UTF-16 chars in the string, and each of these emoji takes two chars, so it doesn't play well with Unicode characters outside the BMP.
I can't just divide by 2, because I may add some characters that only take a single char to the string at some point.
Indexing doesn't work either:
let a = "🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀"
a.[0]
will not return 🔥; instead I get an unknown-character symbol.
So plan B was: let's make this an array instead of a string:
let a = [| '🔥'; '😂'; '😊'; '😍'; '🙏'; '😎'; '💪'; '😋'; '😇'; '🎉'; '🙌'; '🤘'; '👍'; '🤑'; '🤩'; '🤪'; '🤠'; '🥳'; '😌'; '🤤'; '😁'; '😀' |]
This does not compile; I'm getting:
Parse error Unexpected quote symbol in binding. Expected '|]' or other token.
Why is that?
Anyhow, I can make a list of strings and get that to work, but I'm curious: is there a "proper" way to make the first approach work and take a random Unicode character from a Unicode string?

Asti's answer works for your purpose, but I wasn't too happy about where we landed on this. I guess I got hung up on the word "proper" in the question. After a lot of research in various places, I got curious about the method String.EnumerateRunes, which in turn led me to the type Rune. The documentation for that type is particularly enlightening about proper string handling and about what's actually inside a Unicode string in .NET. I also experimented in LINQPad, and got this.
let dump x = x.Dump()
let runes = "abcABCæøåÆØÅ😂😊😍₅茨茧茦茥".EnumerateRunes().ToArray()
runes.Length |> dump
// 20
runes |> Array.iter (fun rune -> dump (string rune))
// a b c A B C æ ø å Æ Ø Å 😂 😊 😍 ₅ 茨 茧 茦 茥
dump runes
// see screenshot
let smiley = runes.[13].ToString()
dump smiley
// 😊

All strings in .NET are UTF-16 Unicode strings.
That's the definition of char:
Represents a character as a UTF-16 code unit.
Each character takes up at least the minimum code-unit size (2 bytes for UTF-16), and as many code units as it needs beyond that. Emoji don't fit in 2 bytes, so they take 4 bytes, i.e. 2 chars (a surrogate pair).
So what's the solution? align(4) all the things! (insert GCC joke here).
First we convert everything into UTF32:
open System.Text // for Encoding

let utf32 (source: string) =
    Encoding.Convert(Encoding.Unicode, Encoding.UTF32, Encoding.Unicode.GetBytes(source))
Then we can pick and choose any "character":
let pick (arr: byte[]) index =
    Encoding.UTF32.GetString(arr, index * 4, 4)
Test:
let happySymbols = "🔥😂😊😍🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😁😀YTHO"
pick (utf32 happySymbols) 0;;
val it : string = "🔥"
> pick (utf32 happySymbols) 22;;
val it : string = "Y"
For the actual length, just div by 4.
let surpriseMe arr =
    let rnd = Random()
    pick arr (rnd.Next(0, arr.Length / 4))
Hmmm
> surpriseMe (utf32 happySymbols);;
val it : string = "😁"

Related

If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

The problem I have is that given a sequence of bytes, I want to determine its longest prefix which forms a valid Unicode character (extended grapheme cluster) assuming UTF8 encoding.
I am using Swift, so I would like to use Swift's built-in function(s) to do so. But these functions only decode a complete sequence of bytes. So I was thinking of converting prefixes of the byte sequence via Swift and taking the last prefix that didn't fail and consists of exactly one character. Obviously, this might mean trying out the entire sequence of bytes, which I want to avoid. A solution would be to stop trying prefixes after 4 prefixes in a row have failed. If the property asked about in my question holds, this would then guarantee that all longer prefixes must also fail.
I find the Unicode Text Segmentation Standard unreadable, otherwise I would try to directly implement boundary detection of extended grapheme clusters...
After taking a long hard look at the specification for computing the boundaries for extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules,
it is obvious that the rules for EGCs all have the shape of describing when it is allowed to append a code point to an existing EGC to form a longer EGC. From that fact alone my two questions follow: 1) Yes, every non-empty prefix of code points which form an EGC is also an EGC. 2) No, by adding a code point to a valid Unicode string you will not decrease its length in terms of number of EGCs it consists of.
So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):
func lex<S : Sequence>(_ input : S) -> (length : Int, out: Character)? where S.Element == UInt8 {
    // This code works under three assumptions, all of which are true:
    // 1) If a sequence of codepoints does not form a valid character, then appending codepoints to it does not yield a valid character
    // 2) Appending codepoints to a sequence of codepoints does not decrease its length in terms of extended grapheme clusters
    // 3) a codepoint takes up at most 4 bytes in an UTF8 encoding
    var chars : [UInt8] = []
    var result : String = ""
    var resultLength = 0
    func value() -> (length : Int, out : Character)? {
        guard let character = result.first else { return nil }
        return (length: resultLength, out: character)
    }
    var length = 0
    var iterator = input.makeIterator()
    while length - resultLength <= 4 {
        guard let char = iterator.next() else { return value() }
        chars.append(char)
        length += 1
        guard let s = String(bytes: chars, encoding: .utf8) else { continue }
        guard s.count == 1 else { return value() }
        result = s
        resultLength = length
    }
    return value()
}
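As a quick sanity check (my own example, not part of the original answer), feeding lex the UTF-8 bytes of a multi-code-point emoji followed by ASCII should yield only the leading cluster; "👍🏽" is two code points (thumbs-up plus a skin-tone modifier), eight bytes, but a single extended grapheme cluster:
// Hypothetical usage of lex (assumes the function above is in scope)
let bytes = Array("👍🏽abc".utf8)   // 8 bytes of emoji + 3 bytes of ASCII
if let (length, character) = lex(bytes) {
    print(length, character)        // prints: 8 👍🏽
}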

How do I format a string from a string with %@ in Swift

I am using Swift 4.2. I am getting extraneous characters when formatting one string (s1) from another string (s0) using the %@ format code.
I have searched extensively for details of string formatting but have come up with only partial answers including the code in the second line below. I need to be able to format s1 so that I can customize output from a Swift process. I ask this because I have not found an answer while searching for ways to format a string from a string.
I tried the following three statements:
let s0:[String] = ["abcdef"]
let s1:[String] = [String(format:"%@",s0)]
print(s1)
...
The output is shown below. It may not be clear here, but there are four leading spaces to the left of the abcdef string.
["(\n    abcdef\n)"]
How can I format s1 so it does not include the brackets, the \n escape characters, and the leading spaces?
The issue here is that you are using an array, not a string, for s0.
So indexing into the array, as below, will help you:
let s0:[String] = ["abcdef"]
let s1:[String] = [String(format:" %@",s0[0])]
I am getting extraneous characters when formatting one string (s1) from another string (s0) ...
The s0 is not a string. It is an array of strings (i.e. the square brackets of [String] indicate an array and it is the same as saying Array<String>). And your s1 is also an array, but one that has a single element, whose value is the string representation of the entire s0 array of strings. That's obviously not what you intended.
How can I format s1 so it does not include the brackets, the \n escape characters, and the leading spaces?
You're getting those brackets because s1 is an array. You're getting the string with the \n and spaces because its first value is the string representation of yet another array, s0.
So, if you're just trying to format a string, s0, you can do:
let s0: String = "abcdef"
let s1: String = String(format: "It is '%@'", s0)
Or, if you really want an array of strings, you can call String(format:) for each using the map function:
let s0: [String] = ["abcdef", "ghijkl"]
let s1: [String] = s0.map { String(format: "It is '%@'", $0) }
By the way, in the examples above, I didn't use a string format of just %@, because that doesn't accomplish anything at all, so I assumed you were formatting the string for a reason.
FWIW, we generally don't use String(format:) very often. Usually we do "string interpolation", with \( and ):
let s0: String = "abcdef"
let s1: String = "It is '\(s0)'"
Get rid of all the unneccessary arrays and let the compiler figure out the types:
let s0 = "abcdef" // a string
let s1 = String(format:"- %@ -",s0) // another string
print(s1) // prints "- abcdef -"

How to split a Korean word into its components?

So, for example the character 김 is made up of ㄱ, ㅣ and ㅁ. I need to split the Korean word into its components to get the resulting 3 characters.
I tried by doing the following but it doesn't seem to output it correctly:
let str = "김"
let utf8 = str.utf8
let first:UInt8 = utf8.first!
let char = Character(UnicodeScalar(first))
The problem is that this code returns ê, when it should be returning ㄱ.
You need to use the decomposedStringWithCompatibilityMapping property to get the Unicode scalar values, and then use those scalar values to get the characters. Something like below:
let string = "김"
for scalar in string.decomposedStringWithCompatibilityMapping.unicodeScalars {
    print("\(scalar) ")
}
Output:
ᄀ
ᅵ
ᆷ
You can create a list of single-character strings as:
let chars = string.decomposedStringWithCompatibilityMapping.unicodeScalars.map { String($0) }
print(chars)
// ["ᄀ", "ᅵ", "ᆷ"]
Korean related info in Apple docs
Extended grapheme clusters are a flexible way to represent many complex script characters as a single Character value. For example, Hangul syllables from the Korean alphabet can be represented as either a precomposed or decomposed sequence. Both of these representations qualify as a single Character value in Swift:
let precomposed: Character = "\u{D55C}" // 한
let decomposed: Character = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ
// precomposed is 한, decomposed is 한
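Since precomposed Hangul syllables are laid out algorithmically in Unicode, another option is to compute the jamo directly from the code point. This is a sketch of my own, not from the answers above; the constants (0xAC00, 588, 28 and the jamo base points) come from the Unicode Hangul decomposition algorithm:
// Sketch: arithmetic Hangul decomposition per the Unicode Hangul algorithm.
// SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7, NCount = 21 * 28 = 588, TCount = 28.
func decomposeHangul(_ syllable: Unicode.Scalar) -> [Unicode.Scalar]? {
    let value = Int(syllable.value)
    guard (0xAC00...0xD7A3).contains(value) else { return nil }   // not a precomposed syllable
    let index = value - 0xAC00
    let lead  = index / 588          // choseong (leading consonant)
    let vowel = (index % 588) / 28   // jungseong (vowel)
    let tail  = index % 28           // jongseong (trailing consonant), 0 if none
    var jamo = [Unicode.Scalar(UInt32(0x1100 + lead))!, Unicode.Scalar(UInt32(0x1161 + vowel))!]
    if tail > 0 { jamo.append(Unicode.Scalar(UInt32(0x11A7 + tail))!) }
    return jamo
}

if let jamo = decomposeHangul("김".unicodeScalars.first!) {
    jamo.forEach { print(String(Character($0))) }   // ᄀ ᅵ ᆷ, one per line
}
This gives the same conjoining jamo as the decomposed mapping above; decomposedStringWithCompatibilityMapping is still the simpler choice if you want the library to handle the edge cases for you.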

How to write all ASCII characters in Swift String

Trying to find the shortest / most compact way to write out ASCII characters in Swift into a single string. For example, in JavaScript you can do '\x00' for the decimal equivalent of 0 in ASCII, or you can write '\0', which is 2 characters shorter. So if you have a lot of these characters, that's a 2x smaller file size.
Wondering how to write the ASCII characters 0-31 and 127 in Swift as minimally as possible, into a single string. In JavaScript, that sort of looks like this:
'\0...\33abcdef...\127¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½...'
In general, you would use \u{x} where x is the hex value. In your case \u{0} through \u{1f} and \u{7f}.
As in C-based languages, Swift strings also support \0 for "null", \t for "tab", \n for "newline", and \r for "carriage return". Unlike C, Swift does not support \b or \f.
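For example, a small illustration of those escapes (the \u{...} values are the ones mentioned above):
// Escape sequences in Swift string literals (illustrative only)
let nul = "\0"                 // U+0000, same as "\u{0}"
let unitSeparator = "\u{1f}"   // U+001F
let del = "\u{7f}"             // U+007F
let mixed = "a\tb\nc\r"        // tab, newline, carriage return
print(nul.unicodeScalars.first!.value, unitSeparator.unicodeScalars.first!.value, del.unicodeScalars.first!.value)   // 0 31 127
print(mixed.count)             // 6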
If you want to create a single String with all 128 ASCII characters, then you can do:
let ascii = String(Array(0...127).map { Character(Unicode.Scalar($0)) })
If you have a lot of these characters, maybe put them in a Data object and then convert it to a string:
let data = Data(bytes: Array(0...31) + [127])
let text = String(data: data, encoding: .utf8)!
Based on your comment, you could do:
let tab = Data(bytes: [9])
let null = Data(bytes: [0])
let data = "abc".data(using: .utf8)! + tab + null + "morechars".data(using: .utf8)! + tab

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through data structures that are ASCII for all practical purposes: particularly with file I/O. The absence of a built in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As @MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D which is not in the same order as ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä for example). See @MartinR's answer here for more info.
If you need to check if a character is in between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to a unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
    println("yep")
} else {
    println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
    println("yep")
} else {
    println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
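For instance, combining UInt8(ascii:) with Character.asciiValue (available since Swift 5) gives a readable ASCII range check without going through NSString; a small illustration of my own, not part of the original answer:
// ASCII range check using the literal initializers above (illustrative)
let ch: Character = "m"
if let code = ch.asciiValue, (UInt8(ascii: "a")...UInt8(ascii: "z")).contains(code) {
    print("lowercase ASCII letter")   // prints for "m"
}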
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why do I exhume 7 year old posts? Fun I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the User perspective, which is a sequence of grapheme clusters, each (usually) with a visually separable appearance, or from the encoding perspective, which can be one of several (UTF32, UTF16, UTF8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design, ASCII values 0-127 have an identical encoding in UTF-8, so loading such a stream with a UTF-8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé@¶œŎO!@#"
foo.forEach { c in
    print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
        + ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
a is ascii with value 97; a is a letter
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
รฉ is not ascii; รฉ is a letter
@ is ascii with value 64; @ is not a letter
ยถ is not ascii; ยถ is not a letter
ล“ is not ascii; ล“ is a letter
ลŽ is not ascii; ลŽ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
@ is ascii with value 64; @ is not a letter
# is ascii with value 35; # is not a letter