I'm trying to insert a symbol with ASCII code 255 (Telnet IAC) into a String, but when converting the data back to utf8 I'm getting a different symbol:
var s = "\u{ff}"
print(s.utf8.count) // 2
try! s.write(toFile: "output.txt", atomically: true, encoding: .utf8)
The file contains C3 BF, not FF. I've also tried using
var s = "\(Character(UnicodeScalar(255)))"
but this produced the same result. How do I escape it properly?
ASCII defines 128 characters from 0x00 to 0x7F. 0xFF (255) is not included.
In Unicode, U+00FF (in Swift, "\u{ff}") represents "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS).
Its UTF-8 representation is 0xC3 0xBF: in UTF-8, code points from U+0080 to U+07FF are encoded as a two-byte sequence.
Note also that 0xFF is never a valid byte in a UTF-8 byte sequence, so a well-formed UTF-8 text file cannot contain any 0xFF bytes.
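A quick way to check this is sketched below. It assumes Foundation's String(bytes:encoding:) for strict decoding and the standard library's String(decoding:as:) for repairing decoding; the variable names are only illustrative.

```swift
import Foundation

// A lone 0xFF byte is not well-formed UTF-8.
let bad: [UInt8] = [0xFF]

// Strict decoding fails and returns nil.
let strict = String(bytes: bad, encoding: .utf8)
print(strict == nil) // true

// The repairing initializer replaces the bad byte with U+FFFD.
let repaired = String(decoding: bad, as: UTF8.self)
print(repaired == "\u{FFFD}") // true

// Encoding U+00FF as UTF-8 yields two bytes, neither of which is 0xFF.
let encoded = Array("\u{ff}".utf8)
print(encoded) // [195, 191]
```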
If you want to output "\u{ff}" as a single-byte 0xFF, use ISO-8859-1 (aka ISO-Latin-1) instead:
try! s.write(toFile: "output.txt", atomically: true, encoding: .isoLatin1)
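To compare the two encodings without writing a file, something like the following sketch may help. The Telnet byte values at the end (IAC 255, WILL 251, ECHO 1) are an illustrative example of building the bytes directly; they are not taken from the question's code.

```swift
import Foundation

let s = "\u{ff}"

// The same string encodes to different bytes depending on the encoding.
let utf8Bytes = Array(s.utf8)
let latin1Bytes = s.data(using: .isoLatin1).map { Array($0) } ?? []

print(utf8Bytes)   // [195, 191]
print(latin1Bytes) // [255]

// For a binary protocol it may be simpler to skip String and build the bytes directly.
// Illustrative Telnet sequence: IAC (255), WILL (251), ECHO (1).
let command = Data([0xFF, 0xFB, 0x01])
print(command.count) // 3
```

If the data is really a byte protocol rather than text, keeping it in Data (or [UInt8]) avoids the question of string encodings altogether.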
Trying to find the shortest / most compact way to write out ASCII characters in Swift into a single string. For example, in JavaScript you can write '\x00' for ASCII code 0, or you can write '\0', which is 2 characters shorter. So if you have a lot of these characters, that halves the file size.
Wondering how to write the ASCII characters 0-31 and 127 in Swift so they are minimal, into a single string. In JavaScript, that sort of looks like this:
'\0...\33abcdef...\127¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½...'
In general, you would use \u{x} where x is the hex value. In your case \u{0} through \u{1f} and \u{7f}.
As in C-based languages, Swift strings also support \0 for "null", \t for "tab", \n for "newline", and \r for "carriage return". Unlike C, Swift does not support \b or \f.
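As a small sketch of the escapes above: backspace and form feed, which have no short escape in Swift, are written as \u{8} and \u{c}. The variable names are only illustrative.

```swift
// \0, \t, \n and \r are built-in escapes; everything else uses \u{...}.
let controls = "\0\t\n\r\u{8}\u{c}\u{1f}\u{7f}"

// Each of these ASCII characters is a single byte in UTF-8.
let controlBytes = Array(controls.utf8)
print(controlBytes) // [0, 9, 10, 13, 8, 12, 31, 127]
```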
If you want to create a single String with all 128 ASCII characters, then you can do:
let ascii = String(Array(0...127).map { Character(Unicode.Scalar($0)) })
If you have a lot of these characters, maybe put them in a Data object and then convert it to a string:
let data = Data([UInt8](0...31) + [127]) // Data(bytes:) with an array is deprecated since Swift 5
let text = String(data: data, encoding: .utf8)!
Based on your comment, you could do:
let tab = Data([9])
let null = Data([0])
let data = "abc".data(using: .utf8)! + tab + null + "morechars".data(using: .utf8)! + tab
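If Foundation is not needed, a similar result can be sketched with the standard library alone. This assumes the bytes are valid UTF-8, which always holds for values in 0...127; the variable names are hypothetical.

```swift
// Control characters 0...31 plus DEL (127), as raw bytes.
let asciiBytes = [UInt8](0...31) + [127]

// String(decoding:as:) needs no Foundation import and never fails;
// invalid sequences would be replaced by U+FFFD, but pure ASCII is always valid.
let controlText = String(decoding: asciiBytes, as: UTF8.self)

// Mixing text and raw bytes works the same way.
var mixedBytes = Array("abc".utf8)
mixedBytes += [9, 0]                  // tab, null
mixedBytes += Array("morechars".utf8)
mixedBytes.append(9)                  // trailing tab
let mixed = String(decoding: mixedBytes, as: UTF8.self)

print(controlText.utf8.count) // 33
print(mixed.utf8.count)       // 15
```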
I am new to Scala.
Please advise how to convert strings in UTF-8 to ASCII ignoring errors and removing non ASCII characters in output.
For example, how to remove non ASCII character \uc382 from result string: "hello���", so that "hello" is printed in output.
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If you had UTF-8 text as bytes and it is now in a String, it has already been converted: a String on the JVM holds UTF-16 code units, not UTF-8.
If you have text in a String and you want it as ASCII bytes, you can encode it afterwards.
It seems that you just want to keep only the UTF-16 code units for the C0 Controls and Basic Latin code points. Fortunately, such code points take only one code unit each, so we can filter them directly without converting them to code points.
import java.nio.charset.StandardCharsets

"hello\uC382"
  .filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
  .getBytes(StandardCharsets.US_ASCII)
  .foreach(println)
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, the feature of the encoder to ignore characters that are not present in the target Charset can be used. An Encoder requires a bit more wrapping and unwrapping. (The API design is based on streaming and reusing the buffer within the same stream and even other streams.) So, with ISO_8859_1 as an example:
import java.nio.CharBuffer
import java.nio.charset.{CodingErrorAction, StandardCharsets}

val encoder = StandardCharsets.ISO_8859_1
  .newEncoder()
  .onMalformedInput(CodingErrorAction.IGNORE)
  .onUnmappableCharacter(CodingErrorAction.IGNORE)
val string = "ñhello\uc382"
println(string)
// Wrap the string in a CharBuffer; characters the charset cannot represent are dropped.
val buffer = encoder.encode(CharBuffer.wrap(string))
val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes.mkString(" ")) // println(bytes) would only print the array reference
bytes.foreach(println)
How can I get the length (not number of bytes) of a string in its UTF-8 encoded form (PHP's mb_strlen(.., 'UTF-8') equivalent)?
I tried string.characters.count but it does not return the correct length for certain characters like an emoji.
Example:
let s = "✌🏿️"
print(s.characters.count) // prints 2, but should print 3.
You can access the UTF-8 encoding of a string with the .utf8 property. Use count on that to get the number of UTF-8 code units in the string:
let string = "\u{1f603}" // One of the smiley face emojis...
print(string.utf8.count) // prints "4"
Based on your edited question, what you are probably looking for is the number of UnicodeScalars used to encode the string. You access that with the unicodeScalars property:
let s = "✌🏿️"
print(s.unicodeScalars.count) // prints 3
The reason everyone is confused is that your original question asks for the length of the string in its UTF-8 encoded form. The answer that you actually wanted had nothing to do with the length of the string in its UTF-8 encoded form.
I think you are confused about the difference between Unicode "extended grapheme clusters", Unicode code points, and the various encodings (like UTF-8) that can be used to encode a Unicode code point.
A Character in Swift represents what Unicode calls an "extended grapheme cluster". That is to say, it is a single visual character, even if it is made up of multiple Unicode code points.
A Unicode code point is a number in the range U+0000 to U+10FFFF that is assigned to a single character or modifier. Two or more Unicode code points can combine to create a single Character. In Swift, a code point (outside the surrogate range) is represented by the UnicodeScalar type, which stores it as a 32-bit value.
When it comes time to store a string, or send it over the internet, or otherwise turn it into data that is represented by bytes, you have to decide how to encode it. There are all kinds of encodings; the most common is probably UTF-8, which encodes the string as a series of UInt8 values.
That's just a brief snippet of the difference between the three concepts. It is actually a really interesting subject and if you Google some of those terms, you will find a lot more good information.
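To make the three concepts concrete, here is a minimal sketch using a string chosen for illustration (it is not from the question): the letter "e" followed by a combining accent.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
let accented = "e\u{301}"

print(accented.count)                // 1 Character (extended grapheme cluster)
print(accented.unicodeScalars.count) // 2 Unicode scalars
print(accented.utf16.count)          // 2 UTF-16 code units
print(accented.utf8.count)           // 3 UTF-8 code units (bytes)
```

Which count is the "length" depends on what the length is needed for: display, storage, or interoperating with an API that counts in a particular unit.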
let str = "ačŘ"
print("str has \(str.count) characters") // 3
print("and \(str.utf8.count) bytes as encoded in UTF-8") // 5
update (based on your notes)
let s = "✌🏿️"
let arr:[UInt8] = [226, 156, 140, 240, 159, 143, 191, 239, 184, 143]
var arrCchar = arr.map { Int8(bitPattern: $0) }
arrCchar += [0] // null-terminate the C string
let str = String(cString: arrCchar) // String.fromCString was removed in Swift 3
print(str) // ✌🏿️
s == str // TRUE !!!!
by characters
s.characters.forEach { (c) -> () in
let str = String(c)
print(str.utf8.map{$0}, "which represents character: ", c)
str.unicodeScalars.forEach({ (u) -> () in
print("composed from unicode scalar(s): ", u.debugDescription)
})
}
/*
[226, 156, 140] which represents character: ✌
composed from unicode scalar(s): "\u{270C}"
[240, 159, 143, 191, 239, 184, 143] which represents character: 🏿️
composed from unicode scalar(s): "\u{0001F3FF}"
composed from unicode scalar(s): "\u{FE0F}"
*/
Every character in Unicode can be represented by one or more Unicode scalars. A Unicode scalar is a unique 21-bit number (and name) for a character or modifier, such as U+0061 for LATIN SMALL LETTER A ("a"), or U+1F425 for FRONT-FACING BABY CHICK ("🐥").
When a Unicode string is written to a text file or some other storage, these unicode scalars are encoded in one of several Unicode-defined formats. Each format encodes the string in small chunks known as code units. These include the UTF-8 format (which encodes a string as 8-bit code units) and the UTF-16 format (which encodes a string as 16-bit code units).
// copied from Apple's The Swift Programming Language guide
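The code units described in the quote can be inspected directly. This is a small sketch using the baby chick scalar mentioned above, written as an escape so it does not depend on the source file's encoding.

```swift
let chick = "\u{1F425}" // FRONT-FACING BABY CHICK

// UTF-8 uses four 8-bit code units for this scalar.
let utf8Units = Array(chick.utf8)
// UTF-16 uses a surrogate pair: two 16-bit code units.
let utf16Units = Array(chick.utf16)

print(utf8Units)                  // [240, 159, 144, 165]
print(utf16Units)                 // [55357, 56357]
print(chick.unicodeScalars.count) // 1
```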
On a String, I can use utf8 and count to get the number of bytes required to encode the String with UTF-8 encoding:
"a".utf8.count // 1
"チャオ".utf8.count // 9
"チ".utf8.count // 3
However, I don't see an equivalent method on a single Character value. To get the number of bytes required to encode a character in the string to UTF-8, I could iterate through the string by character, convert the Character to a String, and get the utf8.count of that String:
"チャオ".characters.forEach({print(String($0).utf8.count)}) // 3, 3, 3
This seems unnecessarily verbose. Is there a way to get the UTF-8 encoding of a Character in Swift?
Character has no direct (public) accessor to its UTF-8 representation.
There are some internal methods in Character.swift dealing with the UTF-8 bytes, but the public API is implemented in String.UTF8View in StringUTF8.swift.
Therefore String(myChar).utf8.count is the correct way to obtain the length of the character's UTF-8 representation.
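If the conversion feels verbose at the call site, it can be wrapped in a small extension. The property name utf8ByteCount below is a hypothetical helper, not part of the standard library.

```swift
extension Character {
    /// Number of bytes this character occupies when encoded as UTF-8.
    /// (Hypothetical helper name, not part of the standard library.)
    var utf8ByteCount: Int {
        return String(self).utf8.count
    }
}

let counts = "チャオ".map { $0.utf8ByteCount }
print(counts) // [3, 3, 3]
```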
I'm trying to convert from UTF-16 to, say, UTF-32 using the Apple Core Foundation API:
cfString = CFStringCreateWithBytes(nullptr, str, strLen, kCFStringEncodingUTF16, FALSE);
auto range = CFRangeMake(0, CFStringGetLength(cfString));
CFStringGetBytes(cfString, range, kCFStringEncodingUTF32, 0, false, buffer, bufferSize, usedsize);
Most of the time that works, until the input buffer contains one half of a surrogate pair, say U+DF9F; then Core Foundation simply returns the output without the ill-formed characters.
So, to be Unicode compliant, I have to detect that situation manually and follow the Unicode documentation to substitute U+FFFD for it: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
The same happens with other encodings: with a byte such as 0x80 in the middle of UTF-8 input, CFStringCreateWithBytes always returns nullptr instead of pointing to the invalid character.
Is that the expected behaviour of Core Foundation, or is it UB? Or maybe there is a way to tune CF so that it reports malformed input somehow?
UPDATE:
I did exactly the following:
UInt8 str[] = {0x41, 0x00, 0x9f, 0xdf}; // corresponds to Unicode A + an invalid (unpaired) surrogate
CFStringRef mystr = CFStringCreateWithBytes(nullptr, str, 4, kCFStringEncodingUTF16, false);
After that, mystr has a length of 2 characters according to CFStringGetLength(), so it looks like the invalid character was accepted.
std::vector<char> str(7);
CFStringGetCString(mystr, &*str.begin(), str.size(), kCFStringEncodingUTF8);
That returns false, so no conversion to UTF-8 is possible, and the Xcode debugger watch shows nothing for the string mystr.
So the output is empty for UTF-8 and for the C string. After that I checked the conversion to UTF-32 with the get-bytes routine:
result = CFStringGetBytes(s, range, kCFStringEncodingUTF32BE, 0, false, buffer, bufferSize, usedSize);
That gives me usedSize=4, result=1, and the output contains 0x0041, so only the A symbol was converted. That is why I think no substitution happened for the malformed surrogate.