Get the UTF-8 Encoding of a Character in Bytes - swift

On a String, I can use utf8 and count to get the number of bytes required to encode the String with UTF-8 encoding:
"a".utf8.count // 1
"チャオ".utf8.count // 9
"チ".utf8.count // 3
However, I don't see an equivalent method on a single Character value. To get the number of bytes required to encode a character in the string to UTF-8, I could iterate through the string by character, convert the Character to a String, and get the utf8.count of that String:
"チャオ".characters.forEach({print(String($0).utf8.count)}) // 3, 3, 3
This seems unnecessarily verbose. Is there a way to get the UTF-8 encoding of a Character in Swift?

Character has no direct (public) accessor to its UTF-8 representation.
There are some internal methods in Character.swift that deal with the UTF-8 bytes, but the public functionality is implemented in String.UTF8View in StringUTF8.swift.
Therefore, String(myChar).utf8.count is the correct way to obtain the length of the character's UTF-8 representation.
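If you need this in more than one place, you could wrap that one-liner in a small extension. This is just a sketch; utf8Count is a hypothetical name for illustration, not a standard library member:
extension Character {
    // Number of bytes needed to encode this character in UTF-8
    // (hypothetical helper, implemented via the String round-trip described above).
    var utf8Count: Int {
        return String(self).utf8.count
    }
}

let ch: Character = "チ"
print(ch.utf8Count) // 3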

Related

How to convert string in UTF-8 to ASCII ignoring errors and removing non ASCII characters

I am new to Scala.
Please advise how to convert a string from UTF-8 to ASCII, ignoring errors and removing the non-ASCII characters from the output.
For example, how can I remove the non-ASCII character \uc382 from the result string "hello���", so that only "hello" is printed?
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If the text was UTF-8 bytes and is now in a String, then it has already been decoded. If you have text in a String and you want it as ASCII bytes, you can encode it later.
It seems that you just want to keep only the UTF-16 code units for the C0 Controls and Basic Latin code points. Fortunately, such code points take only one code unit each, so we can filter them directly without first converting them to code points.
"hello\uC382"
.filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
.getBytes(StandardCharsets.US_ASCII)
.foreach {
println }
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, the feature of the encoder to ignore characters that are not present in the target Charset can be used. An Encoder requires a bit more wrapping and unwrapping. (The API design is based on streaming and reusing the buffer within the same stream and even other streams.) So, with ISO_8859_1 as an example:
import java.nio.CharBuffer
import java.nio.charset.{CodingErrorAction, StandardCharsets}

val encoder = StandardCharsets.ISO_8859_1
  .newEncoder()
  .onMalformedInput(CodingErrorAction.IGNORE)
  .onUnmappableCharacter(CodingErrorAction.IGNORE)

val string = "ñhello\uc382"
println(string)

val chars = CharBuffer.allocate(string.length()).put(string)
chars.rewind()
val buffer = encoder.encode(chars)

val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes)        // prints the array reference, not its contents
bytes.foreach(println)

I was wondering if someone could explain to me .decode and .encode in hashlib?

I understand that you have a hex string and perform SHA256 on it twice and then byte-swap the final hex string. The goal of this code is to find a Merkle Root by concatenating two transactions. I would like to understand what's going on in the background a bit more. What exactly are you decoding and encoding?
import hashlib
transaction_hex = "93a05cac6ae03dd55172534c53be0738a50257bb3be69fff2c7595d677ad53666e344634584d07b8d8bc017680f342bc6aad523da31bc2b19e1ec0921078e872"
transaction_bin = transaction_hex.decode('hex')
hash = hashlib.sha256(hashlib.sha256(transaction_bin).digest()).digest()
hash.encode('hex_codec')
'38805219c8ac7e9a96416d706dc1d8f638b12f46b94dfd1362b5d16cf62e68ff'
hash[::-1].encode('hex_codec')
'ff682ef66cd1b56213fd4db9462fb138f6d8c16d706d41969a7eacc819528038'
transaction_hex is a regular string of lowercase ASCII characters, and the decode() method with the 'hex' argument changes it to a binary string (a bytes object in Python 3) with the bytes 0x93, 0xa0, etc. In C it would be an array of unsigned char, of length 64 in this case.
This array/byte string of length 64 is then hashed with SHA256, and the result (another binary string, of size 32) is hashed again. So hash is a string of length 32, or a bytes object of that length in Python 3. Then encode('hex_codec') is a synonym for encode('hex') in Python 2; in Python 3 it replaces it (so maybe this code is meant to work in both versions). It outputs an ASCII (lowercase hex) string again, replacing each raw byte (which is just a small integer) with the two-character string that is its hexadecimal representation. So the final line reverses the double hash and outputs it as hexadecimal, in a form I usually call "lowercase hex ASCII".
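For comparison with the Swift questions on this page, here is a rough sketch of the same pipeline (hex-decode, SHA256 twice, reverse the bytes, hex-encode) in Swift. It assumes Apple's CryptoKit is available (macOS 10.15 / iOS 13 or later), and hexDecode is a hypothetical helper written here for illustration, not a standard API:
import CryptoKit
import Foundation

// Hypothetical helper: turn a hex string into bytes (nil if it is not valid hex).
func hexDecode(_ hex: String) -> Data? {
    var data = Data()
    var index = hex.startIndex
    while index < hex.endIndex {
        guard let next = hex.index(index, offsetBy: 2, limitedBy: hex.endIndex),
              let byte = UInt8(hex[index..<next], radix: 16) else { return nil }
        data.append(byte)
        index = next
    }
    return data
}

let transactionHex = "93a05cac6ae03dd55172534c53be0738a50257bb3be69fff2c7595d677ad53666e344634584d07b8d8bc017680f342bc6aad523da31bc2b19e1ec0921078e872"
if let transactionBin = hexDecode(transactionHex) {
    let hash = Data(SHA256.hash(data: Data(SHA256.hash(data: transactionBin))))
    let forward  = hash.map { String(format: "%02x", $0) }.joined()            // like hash.encode('hex_codec')
    let reversed = hash.reversed().map { String(format: "%02x", $0) }.joined() // like hash[::-1].encode('hex_codec')
    print(forward)
    print(reversed)
}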

Calculate UTF8 bytesize of string [duplicate]

How can I get the length (not number of bytes) of a string in its UTF-8 encoded form (PHP's mb_strlen(.., 'UTF-8') equivalent)?
I tried string.characters.count but it does not return the correct length for certain characters like an emoji.
Example:
let s = "✌🏿️"
print(s.characters.count) // prints 2, but should print 3.
You can access the UTF-8 encoding of a string with the .utf8 property. Use count on that to get the number of UTF-8 code units in the string:
let string = "\u{1f603}" // One of the smiley face emojis...
print(string.utf8.count) // prints "4"
Based on your edited question, what you are probably looking for is the number of UnicodeScalars used to encode the string. You access that with the unicodeScalars property:
let s = "✌🏿️"
print(s.unicodeScalars.count) // prints 3
The reason everyone is confused is that your original question asks for the length of the string in its UTF-8 encoded form, while the answer you actually wanted has nothing to do with the UTF-8 encoding at all.
I think you are confused about the difference between Unicode "extended grapheme clusters", Unicode code points, and the various encodings (like UTF-8) that can be used to encode a Unicode code point.
A Character in Swift represents what Unicode calls an "extended grapheme cluster". That is to say, it is a single visual character, even if it is made up of multiple Unicode code points.
A Unicode code point is a single numbered entry in the Unicode character set, a 21-bit value. Two or more Unicode code points can combine to create a single Character. In Swift, a Unicode code point is represented by the UnicodeScalar type.
When it comes time to store a string, send it over the internet, or otherwise turn it into data represented by bytes, you have to decide how to encode it. There are all kinds of encodings; the most common is probably UTF-8, which encodes the string as a series of UInt8 values.
That's just a brief snippet of the difference between the three concepts. It is actually a really interesting subject and if you Google some of those terms, you will find a lot more good information.
let str = "ačŘ"
print("str has \(str.characters.count) characters") // 3
print("and \(str.utf8.count) bytes as encoded in UTF-8") // 5
update (based on your notes)
let s = "✌🏿️"
let arr:[UInt8] = [226, 156, 140, 240, 159, 143, 191, 239, 184, 143]
var arrCchar = arr.map { (uint8) -> Int8 in
    Int8(bitPattern: uint8)
}
arrCchar += [0] // to be null terminated
let str = String.fromCString(&arrCchar)
print(str) // Optional("✌🏿️")
s == str // TRUE !!!!
by characters
s.characters.forEach { (c) -> () in
    let str = String(c)
    print(str.utf8.map { $0 }, "which represents character: ", c)
    str.unicodeScalars.forEach({ (u) -> () in
        print("composed from unicode scalar(s): ", u.debugDescription)
    })
}
/*
[226, 156, 140] which represents character: ✌
composed from unicode scalar(s): "\u{270C}"
[240, 159, 143, 191, 239, 184, 143] which represents character: 🏿️
composed from unicode scalar(s): "\u{0001F3FF}"
composed from unicode scalar(s): "\u{FE0F}"
*/
Every character in Unicode can be represented by one or more unicode scalars. A unicode scalar is a unique 21-bit number (and name) for a character or modifier, such as U+0061 for LOWERCASE LATIN LETTER A("a"), or U+1F425 for FRONT-FACING BABY CHICK ("\U0001f425").
When a Unicode string is written to a text file or some other storage, these unicode scalars are encoded in one of several Unicode-defined formats. Each format encodes the string in small chunks known as code units. These include the UTF-8 format (which encodes a string as 8-bit code units) and the UTF-16 format (which encodes a string as 16-bit code units).
// Quoted from Apple's Swift Programming Language guide.
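To make the difference between these views concrete, here are the three counts for the example string from this question; the scalar breakdown matches the byte arrays printed above:
let s = "✌🏿️"
print(s.unicodeScalars.count) // 3  (U+270C, U+1F3FF, U+FE0F)
print(s.utf16.count)          // 4  (U+1F3FF alone needs a surrogate pair)
print(s.utf8.count)           // 10 (3 + 4 + 3 bytes)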

Inserting ASCII symbols into a String (Swift)

I'm trying to insert a symbol with ASCII code 255 (Telnet IAC) into a String, but when converting the data back to utf8 I'm getting a different symbol:
var s = "\u{ff}"
print(s.utf8.count) // 2
try! s.write(toFile: "output.txt", atomically: true, encoding: .utf8)
The file contains C3 BF, not FF. I've also tried using
var s = "\(Character(UnicodeScalar(255)))"
but this produced the same result. How to escape it properly?
ASCII defines 128 characters from 0x00 to 0x7F. 0xFF (255) is not included.
In Unicode, U+00FF (in Swift, "\u{ff}") represents "ÿ" (LATIN SMALL LETTER Y WITH DIAERESIS),
and its UTF-8 representation is 0xC3 0xBF: in UTF-8, characters with code points from U+0080 to U+07FF are encoded as two-byte sequences.
Also note that 0xFF is never a valid byte in a UTF-8 sequence, which means you cannot get a 0xFF byte in a UTF-8 text file.
If you want to output "\u{ff}" as a single-byte 0xFF, use ISO-8859-1 (aka ISO-Latin-1) instead:
try! s.write(toFile: "output.txt", atomically: true, encoding: .isoLatin1)
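Since the Telnet IAC byte is protocol data rather than text, another option is to skip String entirely and write the raw bytes with Data. A minimal sketch, assuming Foundation and a recent Swift:
import Foundation

// 0xFF 0xFB 0x01 is IAC WILL ECHO, as an example; the file will contain exactly these bytes.
let bytes = Data([0xFF, 0xFB, 0x01])
try! bytes.write(to: URL(fileURLWithPath: "output.bin"))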

How to encode the Numeric code in iPhone

I have some numeric character codes and I want to encode them as characters. How can I encode the string? I have tried NSASCIIStringEncoding and NSUTF8StringEncoding, but they do not encode the string. Please help me out.
Eg:
&#304; -> İ
&#305; -> ı
Thanks!
What you have are Unicode code points, not strings. You don't need to specify a string encoding, because what you are dealing with aren't strings at all; they're just single characters. And an NSString does not have an "encoding" in this sense.
To get those characters into a string, you need +[NSString stringWithCharacters:length:], whose declaration is:
+ (instancetype)stringWithCharacters:(const unichar *)characters length:(NSUInteger)length;
For example: you don't want to be creating a string with the contents "304"; that's just a string of numbers. Instead, create a unichar with the value of 304:
unichar iWithDot = 304;
"Unichar" is just an unsigned short, so no pointer and no quotes; you are just assigning the code point to a numerical value. Bundle all of the characters you need into a C string and pass the pointer to stringWithCharacters.