I accidentally wrote this simple code to print the alphabet in the terminal:
var alpha: Int = 97
while alpha <= 122 {
    write(1, &alpha, 1)
    alpha += 1
}
write(1, "\n", 1)
// I'm using the write() function from C to avoid a newline after each symbol.
And I've got this output:
abcdefghijklmnopqrstuvwxyz
Program ended with exit code: 0
So, here is the question: Why does it work?
In my logic, it should display a row of numbers, because an integer variable is being used. In C, we would use a char variable, meaning that we refer to the symbol at some index of the ASCII table. Then:
char alpha = 97;
would refer to the character 'a', and by incrementing the alpha variable in a loop we would display each ASCII element up to the 122nd.
In Swift, though, I couldn't assign an integer to a Character or String variable. I used an Int and then declared several extra variables to hold UnicodeScalar values, but by accident I found out that when calling write I can point to my integer instead of the new UnicodeScalar variable, and it still works! The code is very short and readable, but I don't completely understand how it works, or why it works at all.
Has anyone run into this situation?
Why does it work?
This works “by chance” because the integer is stored in little-endian byte order.
The integer 97 is stored in memory as 8 bytes
0x61 0x00 0x00 0x00 0x00 0x00 0x00 0x00
and in write(1, &alpha, 1), the address of that memory location is passed to the write system call. Since the last parameter (nbyte) is 1, the first byte at that memory address is written to the standard output: that is 0x61, or 97, the ASCII code of the letter a.
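To see those bytes from Swift itself, here is a minimal sketch (not part of the original code; it assumes a 64-bit little-endian platform, e.g. an Intel or Apple Silicon Mac):
import Foundation  // for String(format:)

var alpha: Int = 97
withUnsafeBytes(of: &alpha) { bytes in
    // Print each of the 8 bytes as hex, lowest address first.
    // On a little-endian machine: 61 00 00 00 00 00 00 00
    print(bytes.map { String(format: "%02x", $0) }.joined(separator: " "))
}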
In Swift though, I couldn't assign an integer to Character or String type variable.
The Swift equivalent of char is CChar, a type alias for Int8:
var alpha: CChar = 97
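A sketch of the original loop rewritten with CChar, so that each value occupies exactly one byte regardless of endianness (assuming Darwin for the write import; on Linux it would be Glibc):
import Darwin  // provides write(); use Glibc on Linux

var alpha: CChar = 97            // 'a'
while alpha <= 122 {             // 'z'
    write(1, &alpha, 1)          // write exactly one byte to standard output
    alpha += 1
}
var newline: CChar = 10          // '\n'
write(1, &newline, 1)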
Here is a solution which does not rely on the memory layout and works for non-ASCII characters as well:
let first: UnicodeScalar = "α"
let last: UnicodeScalar = "ω"
for v in first.value...last.value {
    if let c = UnicodeScalar(v) {
        print(c, terminator: "")
    }
}
print()
// αβγδεζηθικλμνξοπρςστυφχψω
Related
I want to get letters from their ASCII code in Swift 3. I would do it like this in Java:
for (int i = 65; i < 75; i++)
{
    System.out.print((char) i);
}
Which would log letters from A to J.
Now I tried this in Swift:
let s = String(describing: UnicodeScalar(i))
Instead of only getting the letter, I get this:
Optional("A")
Optional("B")
Optional("C")
...
What am I doing wrong? Thanks for your help.
UnicodeScalar has a few failable initialisers for integer types that can represent values that aren't valid Unicode code points. Therefore you'll need to unwrap the UnicodeScalar? returned, as in the case where you pass an invalid code point, the initialiser will return nil.
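For example (a small sketch, not from the original answer), keeping a wider integer type and unwrapping the optional would look like this:
for i: UInt32 in 65..<75 {
    if let scalar = UnicodeScalar(i) {   // UnicodeScalar(UInt32) is failable
        print(scalar, terminator: "")
    }
}
print()  // prints: ABCDEFGHIJ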
However, given that you're dealing exclusively with ASCII characters, you can simply annotate i as a UInt8 and take advantage of the fact that UnicodeScalar has a non-failable initialiser for a UInt8 input (as it will always represent a valid code point):
for i: UInt8 in 65..<75 {
    print(UnicodeScalar(i))
}
I'm trying to convert from UTF-16 to, say, UTF-32 using the Apple Core Foundation API:
cfString = CFStringCreateWithBytes(nullptr, str, strLen, kCFStringEncodingUTF16, FALSE);
auto range = CFRangeMake(0, CFStringGetLength(cfString));
CFStringGetBytes(cfString, range, kCFStringEncodingUTF32, 0, false, buffer, bufferSize, usedSize);
Most of the time that works, until the input buffer contains one half of a surrogate pair, say U+DF9F; Core Foundation will then simply return output without the ill-formed characters.
So to be at least somewhat Unicode compliant, I have to detect that situation manually and follow the Unicode documentation to create the standard substitution for it in the form of U+FFFD: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
The same happens with other encodings: for example, with the byte 0x80 in the middle of UTF-8 input, CFStringCreateWithBytes always returns nullptr instead of pointing out the invalid character.
Is that expected behaviour (or undefined behaviour) of Core Foundation, or is there maybe a way to tune CF so that it reports malformed input somehow?
UPDATE:
I did exactly the following:
UInt8 str[] = {0x41, 0x00, 0x9f, 0xdf}; // corresponding to Unicode 'A' + an invalid (lone) surrogate
CFStringRef mystr = CFStringCreateWithBytes(nullptr, str, 4, kCFStringEncodingUTF16, false);
After that, mystr has a length of 2 characters according to CFStringGetLength(), so it looks like the invalid character gets processed.
std::vector<char> str(7);
CFStringGetCString(mystr, &*str.begin(), str.size(), kCFStringEncodingUTF8);
That gives me false, so no conversion to UTF-8 is possible, and the Xcode debug watches show nothing for the string mystr.
So the output is nothing for UTF-8 as a C string. OK, after that I checked the conversion to UTF-32 with the get-bytes routine:
result = CFStringGetBytes(mystr, range, kCFStringEncodingUTF32BE, 0, false, buffer, bufferSize, usedSize);
That gives me usedSize = 4, result = 1, and the output contains 0x0041, so only the 'A' symbol was converted. That is why I think no substitution happened for the malformed surrogate.
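(Aside, not a Core Foundation setting: the Swift standard library's transcode function performs the U+FFFD substitution described above when stoppingOnError is false. A minimal sketch with the same input:)
let utf16Units: [UInt16] = [0x0041, 0xDF9F]      // "A" followed by a lone surrogate

var utf32Units: [UInt32] = []
_ = transcode(utf16Units.makeIterator(),
              from: UTF16.self,
              to: UTF32.self,
              stoppingOnError: false) { utf32Units.append($0) }

print(utf32Units.map { String($0, radix: 16) })  // ["41", "fffd"]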
Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through data structures that are ASCII for all practical purposes: particularly with file I/O. The absence of a built-in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As #MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D, which is not in the same order as ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä for example). See #MartinR's answer here for more info.
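A quick illustration of that caveat (this compiles and evaluates to true, which may or may not be what you want):
let umlaut: Character = "ä"
print(("a"..."z").contains(umlaut))  // true: "ä" sorts between "a" and "z" in Swift's ordering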
If you need to check if a character is in between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to a unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
    println("yep")
} else {
    println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
    println("yep")
} else {
    println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
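A brief sketch combining that initializer with the asciiValue property (available from Swift 5 on; the variable names here are just for illustration):
let ch = Character("a")        // a Character built from a single-character String
if let code = ch.asciiValue {  // UInt8?, nil for non-ASCII characters
    print(code)                // 97
}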
Why do I exhume 7-year-old posts? Fun, I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the user's perspective, as a sequence of grapheme clusters, each of which (usually) has a visually separable appearance, or from the encoding perspective, which can be one of several (UTF-32, UTF-16, UTF-8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design, ASCII values 0-127 have an identical encoding in UTF-8, so loading such a stream with a UTF-8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé@¶œŎO!@#"
foo.forEach { c in
    print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
          + ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
Output:
a is ascii with value 97; a is a letter
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
@ is ascii with value 64; @ is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
@ is ascii with value 64; @ is not a letter
# is ascii with value 35; # is not a letter
Like some other emoji characters, the 0x0001F1E9 0x0001F1EA combination (German flag) is represented as a single character on screen although it is really two different Unicode code points combined. Is it represented as one or two different characters in Swift?
let flag = "\u{1f1e9}\u{1f1ea}"
then flag is 🇩🇪 .
For more regional indicator symbols, see:
http://en.wikipedia.org/wiki/Regional_Indicator_Symbol
Support for "extended grapheme clusters" has been added to Swift in the meantime.
Iterating over the characters of a string produces a single character for the "flags":
let string = "Hi🇩🇪!"
for char in string.characters {
    print(char)
}
Output:
H
i
🇩🇪
!
Swift 3 implements Unicode in its String struct. In Unicode, all flags are pairs of Regional Indicator Symbols. So, 🇩🇪 is actually 🇩 followed by 🇪 (try copying the two and pasting them next to each other!).
When two or more Regional Indicator Symbols are placed next to each other, they form an "Extended Grapheme Cluster", which means they're treated as one character. This is why "🇪🇺 = 🇫🇷🇪🇸🇩🇪...".characters gives you ["🇪🇺", " ", "=", " ", "🇫🇷🇪🇸🇩🇪", ".", ".", "."].
If you want to see every single Unicode code point (AKA "scalar"), you can use .unicodeScalars, so that "Hi🇩🇪!".unicodeScalars gives you ["H", "i", "🇩", "🇪", "!"]
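In current Swift (where the characters view is no longer needed), the difference can be sketched like this:
let text = "Hi🇩🇪!"
print(text.count)                   // 4  (the flag counts as one Character)
print(text.unicodeScalars.count)    // 5  (it consists of two Unicode scalars)
for scalar in text.unicodeScalars {
    print(scalar, terminator: " ")  // H i 🇩 🇪 !
}
print()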
tl;dr
🇩🇪 is one character (in both Swift and Unicode), which is made up of two code points (AKA scalars). Don't forget these are different! 🙂
See Also
Why are emoji characters like 👩👩👧👦 treated so strangely in Swift strings?
The Swift Programming Language (Swift 3.1) - Strings and Characters - Unicode
With Swift 5, you can iterate over the unicodeScalars property of a flag emoji character in order to print the Unicode scalar values that compose it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(scalar)
}
/*
prints:
🇮
🇹
*/
If you combine those scalars (that are Regional Indicator Symbols), you get a flag emoji:
let italianFlag = "🇮" + "🇹"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Each Unicode.Scalar instance also has a property value that you can use in order to display a numeric representation of it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(scalar.value)
}
/*
prints:
127470
127481
*/
You can create Unicode scalars from those numeric representations then associate them into a string:
let scalar1 = Unicode.Scalar(127470)
let scalar2 = Unicode.Scalar(127481)
let italianFlag = String(scalar1!) + String(scalar2!)
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
If needed, you can use Unicode.Scalar's escaped(asASCII:) method in order to display a string representation of the Unicode scalars (using ASCII characters):
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(scalar.escaped(asASCII: true))
}
/*
prints:
\u{0001F1EE}
\u{0001F1F9}
*/
let italianFlag = "\u{0001F1EE}\u{0001F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
String's init(_:radix:uppercase:) may also be relevant for converting the scalar value to a hexadecimal string:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(String(scalar.value, radix: 16, uppercase: true))
}
/*
prints:
1F1EE
1F1F9
*/
let italianFlag = "\u{1F1EE}\u{1F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Swift doesn't tell you what the internal representation of a String is. You interact with a String as a sequence of full Unicode characters rather than as raw UTF-8 or UTF-16 code units:
for character in "Dog!🐶" {
    println(character)
}
// prints D, o, g, !, 🐶
If you want to work with a string as a sequence of UTF-8 or UTF-16 code points, use its utf8 or utf16 properties. See Strings and Characters in the docs.
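In current Swift, a short sketch of those views; the listed values are the standard UTF-8 bytes and UTF-16 code units of the sample string:
let s = "Dog!🐶"
print(Array(s.utf8))    // [68, 111, 103, 33, 240, 159, 144, 182]
print(Array(s.utf16))   // [68, 111, 103, 33, 55357, 56374]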
I am looking for a way to convert an array of 16-bit unsigned integer into ASCII char array. I am using char to do the conversion
D=[65 65 65 65];
char(D)
which shows four 'A' characters. However, since each number in D is 16-bit, I expect it to convert each number into 2 chars. For example, if I have
D=[16707]
char(D)
I expect it to give me the two chars 'A' and 'C'. But char always returns 1 character. Is there any way to force char to convert the way I described? Thanks.
For this, you need to write your own function.
You can use char() to convert the most significant byte and the least significant byte separately.
k = 16707;
first = char(bitand(bitshift(k, -8), 255));
second = char(bitand(k, 255));
Have a look at
http://www.mathworks.com/help/matlab/ref/char.html
It clearly states that the char function is valid only for 8-bit numbers, so you can convert each element of the array this way and concatenate the two resulting chars per element.
Use typecast to convert each uint16 to two uint8, and then apply char. Make sure that the input to typecast is really of type uint16.
If you need to reverse char order, use swapbytes on the uint16 vector.
>> D = [16707 16708];
>> char(typecast(uint16(D),'uint8'))
ans =
CADA
>> char(typecast(swapbytes(uint16(D)),'uint8'))
ans =
ACAD