Why is the utf8.ValidString function not detecting invalid Unicode characters?

From https://en.wikipedia.org/wiki/UTF-8#Invalid_code_points, I learned that U+D800 through U+DFFF are invalid. In decimal, that is 55296 through 57343.
The maximum valid Unicode code point is '\U0010FFFF', which is 1114111 in decimal.
My code:
package main

import "fmt"
import "unicode/utf8"

func main() {
    fmt.Println("Case 1(Invalid Range)")
    str := fmt.Sprintf("%c", rune(55296+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
    fmt.Println("Case 2(More than maximum valid range)")
    str = fmt.Sprintf("%c", rune(1114111+1))
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
}
Why is the ValidString function not returning false for the invalid Unicode characters given as input? I am sure my understanding is wrong; could someone explain?

Your problem happens in Sprintf. Since you give it an invalid character, Sprintf replaces it with rune(65533), which is the Unicode replacement character used in place of invalid characters. So your string is valid UTF-8.
This will also happen if you do something like this: str := string([]rune{55297}), so the replacement happens when converting runes to a string. It's not immediately obvious from https://blog.golang.org/strings.
If you want to force your string to contain invalid UTF-8, you can write the first string like this:
str := string([]byte{237, 159, 193})
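A minimal sketch (reusing the byte sequence above) to confirm the difference: the hand-built byte slice fails validation, while the Sprintf result has already been replaced and passes:
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // Built directly from bytes, so no replacement character is substituted.
    invalid := string([]byte{237, 159, 193})
    fmt.Println(utf8.ValidString(invalid)) // false

    // Sprintf substitutes U+FFFD (rune 65533) for the invalid rune first.
    replaced := fmt.Sprintf("%c", rune(55297))
    fmt.Println(utf8.ValidString(replaced)) // true
}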

You take an invalid value and convert it using Sprintf. It's converted to the error value utf8.RuneError (U+FFFD). You then check that error value, which is a valid Unicode code point.
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println("Case 1: Invalid Range")
    str := fmt.Sprintf("%c", rune(55296+1))
    fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
    fmt.Println("Case 2: More than maximum valid range")
    str = fmt.Sprintf("%c", rune(1114111+1))
    fmt.Printf("%q %X %d %d\n", str, str, []rune(str)[0], utf8.RuneError)
    if !utf8.ValidString(str) {
        fmt.Print(str, " is not a valid Unicode")
    } else {
        fmt.Println(str, " is valid unicode character")
    }
}
Output:
Case 1: Invalid Range
"�" EFBFBD 65533 65533
� is valid unicode character
Case 2: More than maximum valid range
"�" EFBFBD 65533 65533
� is valid unicode character
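If the goal is to reject the invalid code points themselves rather than the strings built from them, the standard library's utf8.ValidRune checks a rune value directly; a minimal sketch:
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // ValidRune reports whether a rune can be legally encoded as UTF-8:
    // surrogates (U+D800..U+DFFF) and values above U+10FFFF are rejected.
    fmt.Println(utf8.ValidRune(rune(55296 + 1)))   // false: surrogate
    fmt.Println(utf8.ValidRune(rune(1114111 + 1))) // false: beyond U+10FFFF
    fmt.Println(utf8.ValidRune('A'))               // true
}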

Related

Swift. How to get the previous character?

For example: I have the character "b" and I want to get "a", so "a" is the previous character.
let b: Character = "b"
let a: Character = b - 1 // Compilation error
It's actually pretty complicated to get the previous character from Swift's Character type because Character is actually comprised of one or more Unicode.Scalar values. Depending on your needs you could restrict your efforts to just the ASCII characters. Or you could support all characters comprised of a single Unicode scalar. Once you get into characters comprised of multiple Unicode scalars (such as the flag Emojis or various skin toned Emojis) then I'm not even sure what the "previous character" means.
Here is a pair of methods added to a Character extension that can handle ASCII and single-Unicode scalar characters.
extension Character {
    var previousASCII: Character? {
        if let ascii = asciiValue, ascii > 0 {
            return Character(Unicode.Scalar(ascii - 1))
        }
        return nil
    }

    var previousScalar: Character? {
        if unicodeScalars.count == 1 {
            if let scalar = unicodeScalars.first, scalar.value > 0 {
                if let prev = Unicode.Scalar(scalar.value - 1) {
                    return Character(prev)
                }
            }
        }
        return nil
    }
}
Examples:
let b: Character = "b"
let a = b.previousASCII // Gives Optional("a")
let emoji: Character = "😆"
let previous = emoji.previousScalar // Gives Optional("😅")

What does the index of a UTF-8 encoding error indicate?

fn main() {
    let ud7ff = String::from_utf8(vec![0xed, 0x9f, 0xbf]);
    if ud7ff.is_ok() {
        println!("U+D7FF OK! Get {}", ud7ff.unwrap());
    } else {
        println!("U+D7FF Fail!");
    }
    let ud800 = String::from_utf8(vec![0xed, 0xa0, 0x80]);
    if ud800.is_ok() {
        println!("U+D800 OK! Get {}", ud800.unwrap());
    } else {
        println!("{}", ud800.unwrap_err());
    }
}
Running this code prints invalid utf-8 sequence of 1 bytes from index 0. I understand it's an encoding error, but why does the error say index 0? Shouldn't it be index 1 because index 0 is the same in both cases?
That's because Rust is reporting the byte index which begins an invalid code point sequence, not any specific byte within that sequence. After all, the error could be the second byte, or maybe the first byte was corrupted? Or maybe the leading byte of the sequence went missing.
Rust doesn't, and can't, know, so it just reports the most convenient position: the first offset at which it couldn't decode a complete code point.

Strange String.unicodeScalars and CharacterSet behaviour

I'm trying to use a Swift 3 CharacterSet to filter characters out of a String but I'm getting stuck very early on. A CharacterSet has a method called contains
func contains(_ member: UnicodeScalar) -> Bool
Test for membership of a particular UnicodeScalar in the CharacterSet.
But testing this doesn't produce the expected behaviour.
let characterSet = CharacterSet.capitalizedLetters
let capitalAString = "A"
if let capitalA = capitalAString.unicodeScalars.first {
    print("Capital A is \(characterSet.contains(capitalA) ? "" : "not ")in the group of capital letters")
} else {
    print("Couldn't get the first element of capitalAString's unicode scalars")
}
I'm getting Capital A is not in the group of capital letters yet I'd expect the opposite.
Many thanks.
CharacterSet.capitalizedLetters
returns a character set containing the characters in Unicode General Category Lt, aka "Letter, titlecase". Those are
"Ligatures containing uppercase followed by lowercase letters (e.g., Dž, Lj, Nj, and Dz)" (compare Wikipedia: Unicode character property or
Unicode® Standard Annex #44 – Table 12. General_Category Values).
You can find a list here: Unicode Characters in the 'Letter, Titlecase' Category.
You can also use the code from NSArray from NSCharacterset to dump the contents of the character set:
extension CharacterSet {
    func allCharacters() -> [Character] {
        var result: [Character] = []
        for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
            for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
                    result.append(Character(uniChar))
                }
            }
        }
        return result
    }
}
let characterSet = CharacterSet.capitalizedLetters
print(characterSet.allCharacters())
// ["Dž", "Lj", "Nj", "Dz", "ᾈ", "ᾉ", "ᾊ", "ᾋ", "ᾌ", "ᾍ", "ᾎ", "ᾏ", "ᾘ", "ᾙ", "ᾚ", "ᾛ", "ᾜ", "ᾝ", "ᾞ", "ᾟ", "ᾨ", "ᾩ", "ᾪ", "ᾫ", "ᾬ", "ᾭ", "ᾮ", "ᾯ", "ᾼ", "ῌ", "ῼ"]
What you probably want is CharacterSet.uppercaseLetters, which "returns a character set containing the characters in Unicode General Category Lu and Lt."

Convert normal space/whitespace to non-breaking space?

This is a simple question, but I still can't figure out how to do it.
Say I have this string:
x := "this string"
The whitespace between 'this' and 'string' defaults to the regular space character 32/U+0020. How would I convert it into the non-breaking space character U+00A0 in Go?
Use the documentation to identify the standard strings package as a likely candidate, and then search it (or read through it all; you should know what's available in the standard library of any language you use) to find strings.Map.
Then the obvious, short, simple solution to convert any whitespace would be:
package main

import (
    "fmt"
    "strings"
    "unicode"
)

func main() {
    const nbsp = '\u00A0'
    result := strings.Map(func(r rune) rune {
        if unicode.IsSpace(r) {
            return nbsp
        }
        return r
    }, "this string")
    fmt.Printf("%s → %[1]q\n", result)
}
Playground
As previously mentioned, if you really only want to replace " ", then strings.Replace (or strings.ReplaceAll) is probably all you need.
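For example, a minimal sketch with strings.ReplaceAll (equivalent to strings.Replace with n = -1):
package main

import (
    "fmt"
    "strings"
)

func main() {
    x := "this string"
    // Replace every regular space (U+0020) with a non-breaking space (U+00A0).
    y := strings.ReplaceAll(x, " ", "\u00A0")
    fmt.Printf("%q\n", y) // prints "this\u00a0string"
}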
I think a basic way to do it is by creating a simple function:
http://play.golang.org/p/YT8Cf917il
package main
import "fmt"
func ReplaceSpace(s string) string {
var result []rune
const badSpace = '\u0020'
for _, r := range s {
if r == badSpace {
result = append(result, '\u00A0')
continue
}
result = append(result, r)
}
return string(result)
}
func main() {
fmt.Println(ReplaceSpace("this string"))
}
If you need more advanced manipulations, you could create something with
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
Read http://blog.golang.org/normalization for more information on how to use them.
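For instance, one possible sketch of the transform-based approach; it pulls in the companion golang.org/x/text/runes package (an assumption, not mentioned above), whose Map helper wraps a rune-mapping function as a transform.Transformer:
package main

import (
    "fmt"

    "golang.org/x/text/runes"
    "golang.org/x/text/transform"
)

func main() {
    // A Transformer that maps the regular space to a non-breaking space.
    nbsp := runes.Map(func(r rune) rune {
        if r == ' ' {
            return '\u00A0'
        }
        return r
    })
    result, _, err := transform.String(nbsp, "this string")
    if err != nil {
        fmt.Println("transform failed:", err)
        return
    }
    fmt.Printf("%q\n", result) // prints "this\u00a0string"
}
Such a Transformer can then be chained with others (for example the normalization forms from golang.org/x/text/unicode/norm) via transform.Chain.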

Converting a byte array to a string given encoding

I read from a file to a byte array:
auto text = cast(immutable(ubyte)[]) read("test.txt");
I can get the type of character encoding using the following function:
enum EncodingType {ANSI, UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE}
EncodingType DetectEncoding(immutable(ubyte)[] data){
    switch (data[0]){
    case 0xEF:
        if (data[1] == 0xBB && data[2] == 0xBF){
            return EncodingType.UTF8;
        } break;
    case 0xFE:
        if (data[1] == 0xFF){
            return EncodingType.UTF16BE;
        } break;
    case 0xFF:
        if (data[1] == 0xFE){
            if (data[2] == 0x00 && data[3] == 0x00){
                return EncodingType.UTF32LE;
            }else{
                return EncodingType.UTF16LE;
            }
        } break;
    case 0x00:
        if (data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF){
            return EncodingType.UTF32BE;
        } break;
    default:
        break;
    }
    return EncodingType.ANSI;
}
I need a function that takes a byte array and returns the text as a string (UTF-8).
If the text is encoded in UTF-8, the transformation is trivial. The same goes for UTF-16 or UTF-32 in the system's native byte order.
string TextDataToString(immutable(ubyte)[] data){
    import std.utf;
    final switch (DetectEncoding(data[0..4])){
    case EncodingType.ANSI:
        return null;/*???*/
    case EncodingType.UTF8:
        return cast(string) data[3..$];
    case EncodingType.UTF16LE:
        wstring result;
        version(LittleEndian) { result = cast(wstring) data[2..$]; }
        version(BigEndian) { result = "";/*???*/ }
        return toUTF8(result);
    case EncodingType.UTF16BE:
        return null;/*???*/
    case EncodingType.UTF32LE:
        dstring result;
        version(LittleEndian) { result = cast(dstring) data[4..$]; }
        version(BigEndian) { result = "";/*???*/ }
        return toUTF8(result);
    case EncodingType.UTF32BE:
        return null;/*???*/
    }
}
But I could not figure out how to convert a byte array with ANSI-encoded text (for example, Windows-1251) or UTF-16/32 in a non-native byte order.
I marked the appropriate places in the code with /*???*/.
As a result, the following code should work with any encoding of a text file:
string s = TextDataToString(text);
writeln(s);
Please help!
BOMs are optional. You cannot use them to reliably detect the encoding. Even if there is a BOM, using it to distinguish UTF from code page encodings is problematic, because the byte sequences are usually valid (if nonsensical) in those, too. E.g. 0xFE 0xFF is "юя" in Windows-1251.
Even if you could tell UTF from code page encodings, you couldn't tell the different code pages from one another. You could analyze the whole text and make guesses, but that's super error-prone and not very practical.
So, I'd advise you to not try to detect the encoding. Instead, require a specific encoding, or add a mechanism to specify it.
As for transcoding from a different byte order, here is an example for UTF-16BE:
import std.algorithm: map;
import std.bitmanip: bigEndianToNative;
import std.conv: to;
import std.exception: enforce;
import std.range: chunks;

alias C = wchar;
// data holds the UTF-16BE bytes to transcode; it must contain whole code units.
enforce(data.length % C.sizeof == 0);
auto result = data
    .chunks(C.sizeof)                                  // split into 2-byte code units
    .map!(x => bigEndianToNative!C(x[0 .. C.sizeof]))  // swap each unit to native order
    .to!string;                                        // convert the wchar range to a UTF-8 string