What does the index of a UTF-8 encoding error indicate?

fn main() {
    let ud7ff = String::from_utf8(vec![0xed, 0x9f, 0xbf]);
    if ud7ff.is_ok() {
        println!("U+D7FF OK! Get {}", ud7ff.unwrap());
    } else {
        println!("U+D7FF Fail!");
    }
    let ud800 = String::from_utf8(vec![0xed, 0xa0, 0x80]);
    if ud800.is_ok() {
        println!("U+D800 OK! Get {}", ud800.unwrap());
    } else {
        println!("{}", ud800.unwrap_err());
    }
}
Running this code prints "invalid utf-8 sequence of 1 bytes from index 0". I understand it's an encoding error, but why does the error say index 0? Shouldn't it be index 1, since the byte at index 0 (0xed) is the same in both cases?

That's because Rust is reporting the byte index at which an invalid code point sequence begins, not any specific byte within that sequence. After all, the error could be in the second byte, or maybe the first byte was corrupted? Or maybe the leading byte of the sequence went missing.
Rust doesn't, and can't, know, so it just reports the most convenient position: the first offset at which it could no longer decode a complete code point.
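The two numbers behind that message are exposed on the error type; here is a minimal sketch using std::str::from_utf8, which reports the same way:
fn main() {
    // 0xED 0xA0 0x80 would encode the surrogate U+D800, which UTF-8 forbids.
    let bytes: [u8; 3] = [0xed, 0xa0, 0x80];
    if let Err(e) = std::str::from_utf8(&bytes) {
        // valid_up_to() is the index the message reports: everything before
        // it decoded cleanly. error_len() is how many bytes were rejected
        // starting there (None means the input ended mid-sequence).
        println!("valid_up_to = {}", e.valid_up_to()); // 0
        println!("error_len = {:?}", e.error_len());   // Some(1)
    }
}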

Related

How to print the contents of CharacterSet.decimalDigits?

I tried to print the contents of CharacterSet.decimalDigits with:
print(CharacterSet.decimalDigits)
output: CFCharacterSet Predefined DecimalDigit Set
But my expectation was something like this:
[1, 2, 3, 4 ...]
So my question is: how do I print the contents of CharacterSet.decimalDigits?
This is not easy. Character sets are not made to be iterated over; they are made to check whether a character is in them or not. They don't contain the characters themselves, and their ranges cannot be accessed.
The only thing you can do is iterate over all characters and check each one against the character set, e.g.:
let set = CharacterSet.decimalDigits
let allCharacters = UInt32.min ... UInt32.max
allCharacters
    .lazy
    .compactMap { UnicodeScalar($0) }
    .filter { set.contains($0) }
    .map { String($0) }
    .forEach { print($0) }
However, note that iterating the whole scalar range like this takes significant time and shouldn't be done in a production application.
I don't think you can do that, at least not directly. If you look at the output of
let data = CharacterSet.decimalDigits.bitmapRepresentation
for byte in data {
    print(String(format: "%02x", byte))
}
you'll see that the set internally stores bits at the code positions where the decimal digits are.
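As a rough sketch, assuming the documented CFCharacterSet bitmap layout (bit n & 7 of byte n >> 3 marks scalar n), you could decode the BMP portion of that bitmap yourself:
let data = CharacterSet.decimalDigits.bitmapRepresentation
// The first 8192 bytes cover the Basic Multilingual Plane.
for n in 0 ..< 8192 * 8 {
    if data[n >> 3] & UInt8(1 << (n & 7)) != 0,
        let scalar = UnicodeScalar(UInt32(n)) {
        print(n, String(scalar))
    }
}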

Unable to understand how withUnsafeBytes method works

I am trying to convert Data to UnsafePointer. I found an answer here saying I can use withUnsafeBytes to get at the bytes.
Then I did a small test myself to see if I could just print out the byte values of the string "abc":
let testData: Data = "abc".data(using: String.Encoding.utf8)!
testData.withUnsafeBytes { (bytes: UnsafePointer<UInt8>) -> Void in
    NSLog("\(bytes.pointee)")
}
But the output is just the value of the first character, "a":
2018-07-11 14:40:32.910268+0800 SwiftTest[44249:651107] 97
How could I get the byte value of all three characters then?
The "pointer" points to the address of the first byte in the sequence. If you want to want a pointer to the other bytes, you have to use pointer arithmetic, that is, move the pointer to the next address:
testData.withUnsafeBytes { (bytes: UnsafePointer<UInt8>) -> Void in
    NSLog("\(bytes.pointee)")
    NSLog("\(bytes.successor().pointee)")
    NSLog("\(bytes.advanced(by: 2).pointee)")
}
or
testData.withUnsafeBytes { (bytes: UnsafePointer<UInt8>) -> Void in
    NSLog("\(bytes[0])")
    NSLog("\(bytes[1])")
    NSLog("\(bytes[2])")
}
However, you must be aware of the byte count of testData and take care not to read past it.
You are getting 97 because bytes points to the starting address of testData.
You can get the byte values of all three (or any number of) characters as in the following code:
let testData: Data = "abc".data(using: String.Encoding.utf8)!
print(testData.count)
testData.withUnsafeBytes { (bytes: UnsafePointer<UInt8>) -> Void in
    for idx in 0..<testData.count {
        NSLog("\(bytes[idx])")
    }
}
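As a side note, newer Swift versions deprecate this overload in favour of one that passes an UnsafeRawBufferPointer, which is itself a collection of bytes, so no pointer arithmetic is needed; a minimal sketch:
let testData = "abc".data(using: .utf8)!
testData.withUnsafeBytes { (raw: UnsafeRawBufferPointer) in
    // UnsafeRawBufferPointer is a Collection of UInt8, so just iterate it
    for byte in raw {
        print(byte) // 97, 98, 99
    }
}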

Why is the utf8.ValidString function not detecting invalid Unicode characters?

From https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points, I learned that U+D800 through U+DFFF are invalid. In decimal, that is 55296 through 57343.
And the maximum valid Unicode code point is '\U0010FFFF', which is 1114111 in decimal.
My code:
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println("Case 1(Invalid Range)")
    str := fmt.Sprintf("%c", rune(55296+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
    fmt.Println("Case 2(More than maximum valid range)")
    str = fmt.Sprintf("%c", rune(1114111+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
}
Why is the ValidString function not returning false for the invalid Unicode characters given as input? I am sure my understanding is wrong; could someone explain?
Your problem happens in Sprintf. Since you give it an invalid character, Sprintf replaces it with rune(65533), the Unicode replacement character used in place of invalid characters. So your string is valid UTF-8.
This will also happen if you do something like str := string([]rune{55297}), so this is something that happens whenever runes are converted to strings. It's not immediately obvious from https://blog.golang.org/strings.
If you want to force your string to contain invalid UTF-8, you can write the first string like this:
str := string([]byte{237, 159, 193})
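A quick sketch to confirm that this byte string really fails validation:
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // 0xED 0x9F 0xC1: the third byte is not a valid continuation byte.
    str := string([]byte{237, 159, 193})
    fmt.Println(utf8.ValidString(str)) // false
}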
You take an invalid value and convert it using Sprintf. It's converted to the error value utf8.RuneError (U+FFFD, the replacement character). You then check that error value, which is itself a valid Unicode code point.
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println("Case 1: Invalid Range")
    str := fmt.Sprintf("%c", rune(55296+1))
    fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
    fmt.Println("Case 2: More than maximum valid range")
    str = fmt.Sprintf("%c", rune(1114111+1))
    fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
}
Output:
Case 1: Invalid Range
"�" EFBFBD 65533 65533
� is valid unicode character
Case 2: More than maximum valid range
"�" EFBFBD 65533 65533
� is valid unicode character
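If you want to catch the bad scalar value itself, before any conversion substitutes the replacement character, the standard library's utf8.ValidRune does that; a minimal sketch:
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    for _, r := range []rune{55297, 1114112} {
        // ValidRune reports whether r can be legally encoded as UTF-8:
        // it must be in range and must not be a surrogate half.
        fmt.Println(r, utf8.ValidRune(r)) // false for both
    }
}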

access element of fixed length C array in swift

I'm trying to convert some C code to Swift.
(Why? To use CoreMIDI on OS X, in case you asked.)
The C code is like this
void printPacketInfo(const MIDIPacket* packet) {
    int i;
    for (i = 0; i < packet->length; i++) {
        printf("%d ", packet->data[i]);
    }
}
And MIDIPacket is defined like this
struct MIDIPacket
{
    MIDITimeStamp timeStamp;
    UInt16 length;
    Byte data[256];
};
My Swift is like this
func printPacketInfo(packet: UnsafeMutablePointer<MIDIPacket>){
// print some things
print("length", packet.memory.length)
print("time", packet.memory.timeStamp)
print("data[0]", packet.memory.data.1)
for i in 0 ..< packet.memory.length {
print("data", i, packet.memory.data[i])
}
}
But this gives a compiler error
error: type '(UInt8, UInt8, .. cut .. UInt8, UInt8, UInt8)'
has no subscript members
So how can I access the i-th element of a fixed-size array?
In your case you could try to use something like this:
// this is a tuple with 8 Int values; in your case it would have 256 Byte (UInt8?) values
var t = (1, 2, 3, 4, 5, 6, 7, 8)
t.0
t.1
// ....
t.7

func arrayFromTuple<T, R>(tuple: T) -> [R] {
    let reflection = Mirror(reflecting: tuple)
    var arr: [R] = []
    for i in reflection.children {
        // better would be to throw an error if i.value is not R
        arr.append(i.value as! R)
    }
    return arr
}

let arr: [Int] = arrayFromTuple(t)
print(arr) // [1, 2, 3, 4, 5, 6, 7, 8]
...
let t2 = ("alfa", "beta", "gama")
let arr2: [String] = arrayFromTuple(t2)
arr2[1] // "beta"
This was suggested by https://gist.github.com/jckarter/ec630221890c39e3f8b9
func printPacketInfo(packet: UnsafeMutablePointer<MIDIPacket>) {
    // print some things
    print("length", packet.memory.length)
    print("time", packet.memory.timeStamp)
    let len = Int(packet.memory.length)
    withUnsafePointer(&packet.memory.data) { p in
        let p = UnsafeMutablePointer<UInt8>(p)
        for i: Int in 0 ..< len {
            print(i, p[i])
        }
    }
}
This is horrible - I hope the compiler turns this nonsense into some good code
The error message is a hint: it shows that MIDIPacket.data is imported not as an array, but as a tuple. (Yes, that's how all fixed length arrays import in Swift.) You seem to have noticed this in the preceding line:
print("data[0]", packet.memory.data.1)
Tuples in Swift are very static, so there isn't a way to dynamically access a tuple element. Thus, in some sense the only "safe" or idiomatic way to print your packet (in the way that you're hinting at) would be 256 lines of code (or up to 256, since the packet's length field tells you when it's safe to stop):
print("data[1]", packet.memory.data.2)
print("data[2]", packet.memory.data.3)
print("data[3]", packet.memory.data.4)
/// ...
print("data[254]", packet.memory.data.255)
print("data[255]", packet.memory.data.256)
Clearly that's not a great solution. Using reflection, per user3441734's answer, is one (cumbersome) alternative. Unsafe memory access, per your own answer (via jckarter), is another (but as the name of the API says, it's "unsafe"). And, of course, you can always work with the packet through (Obj)C.
If you need to do something beyond printing the packet, you can extend the UnsafePointer-based solution to convert it to an array like so:
extension MIDIPacket {
    var dataBytes: [UInt8] {
        mutating get {
            return withUnsafePointer(&data) { tuplePointer in
                let elementPointer = UnsafePointer<UInt8>(tuplePointer)
                return (0..<Int(length)).map { elementPointer[$0] }
            }
        }
    }
}
Notice that this uses the packet's existing length property to expose an array that has only as many valid bytes as the packet claims to have (rather than filling up the rest of a 256-element array with zeroes). This does allocate memory, however, so it might not be good for the kinds of real-time run conditions you might be using CoreMIDI in.
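Usage would then look something like this (a sketch; printPacketInfo is just a stand-in name):
func printPacketInfo(packet: UnsafeMutablePointer<MIDIPacket>) {
    // dataBytes is the computed property from the extension above
    print("data bytes:", packet.memory.dataBytes)
}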
Should this:
for i in 0 ..< packet.memory.length
Be this?
for i in 0 ..< packet.memory.data.length

How to insert multiple columns using PQputCopyData

I am trying to insert two columns using PQputCopyData with the following code. But when it checks the final result, it shows the error invalid byte sequence for encoding "UTF8" and the data is not inserted into the database.
Both columns are of type character varying. What am I doing wrong here?
const char *buffer = "john,doe";
PGresult *res;
res = PQexec(conn, "COPY john FROM STDIN DELIMITER ',';");
cout << buffer;
if (PQresultStatus(res) != PGRES_COPY_IN)
{
    cout << "copy in not ok";
}
else
{
    if (PQputCopyData(conn, buffer, 400) == 1)
    {
        if (PQputCopyEnd(conn, NULL) == 1)
        {
            PGresult *res = PQgetResult(conn);
            if (PQresultStatus(res) == PGRES_COMMAND_OK)
            {
                cout << "done";
            }
            else
            {
                cout << PQerrorMessage(conn); // here I get the error: invalid byte sequence for encoding "UTF8"
            }
        }
        else
        {
            cout << PQerrorMessage(conn);
        }
    }
}
if(PQputCopyData(conn,buffer,400) == 1)
What's wrong is passing 400 instead of the actual size of the contents in buffer, making it send unallocated garbage after the real data. Use strlen(buffer) instead.
Also, you want each line to finish with a newline, so buffer should be:
const char *buffer = "john,doe\n";
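Putting both fixes together, a minimal sketch of the COPY sequence (assuming an open connection conn and the two-column table john from the question; copy_one_row is just a stand-in name):
#include <string.h>
#include <libpq-fe.h>

/* Send one row through COPY; returns 0 on success, -1 on failure. */
static int copy_one_row(PGconn *conn)
{
    const char *buffer = "john,doe\n"; /* newline terminates the row */
    PGresult *res = PQexec(conn, "COPY john FROM STDIN DELIMITER ',';");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        PQclear(res);
        return -1;
    }
    PQclear(res);

    /* pass the real length of the data, not a made-up 400 */
    if (PQputCopyData(conn, buffer, (int) strlen(buffer)) != 1)
        return -1;
    if (PQputCopyEnd(conn, NULL) != 1)
        return -1;

    res = PQgetResult(conn);
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK) ? 0 : -1;
    PQclear(res);
    return ok;
}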