Output unquoted Unicode in Go

I'm using goyaml as a YAML beautifier: by loading and dumping a YAML file, I can source-format it. I unmarshal the data from a YAML source file into a struct, marshal the struct back into bytes, and write those bytes to an output file. But the process turns my Unicode strings into their escaped, quoted ASCII form, and I don't know how to reverse that.
Example input subtitle.yaml:
line: 你好
I've stripped everything down to the smallest reproducible example. Here's the code, using _ to discard errors (none of which occur here):
package main

import (
    "io/ioutil"
    //"unicode/utf8"
    //"fmt"

    goyaml "gopkg.in/yaml.v1"
)

type Subtitle struct {
    Line string
}

func main() {
    filename := "subtitle.yaml"
    in, _ := ioutil.ReadFile(filename)
    var subtitle Subtitle
    _ = goyaml.Unmarshal(in, &subtitle)
    out, _ := goyaml.Marshal(&subtitle)
    //for len(out) > 0 { // For debugging, see what the runes are
    //    r, size := utf8.DecodeRune(out)
    //    fmt.Printf("%c ", r)
    //    out = out[size:]
    //}
    _ = ioutil.WriteFile(filename, out, 0644)
}
Actual output subtitle.yaml:
line: "\u4F60\u597D"
I want to reverse the weirdness in goyaml after I get the variable out.
The commented-out rune-printing block, which adds spaces between runes for clarity, outputs the following. It shows that Unicode runes like 你 aren't being emitted as UTF-8; they are escaped, and the escape sequences are written out literally:
l i n e : " \ u 4 F 6 0 \ u 5 9 7 D "
How can I unquote out, before writing it to the output file, so that the output looks like the input (albeit beautified)?
Desired output subtitle.yaml:
line: "你好"
Temporary Solution
I've filed https://github.com/go-yaml/yaml/issues/11. In the meantime, bobince's tip about yaml_emitter_set_unicode was helpful in uncovering the problem. The function is defined as a C binding but never called (and there is no option to enable it)! I changed encode.go and added yaml_emitter_set_unicode(&e.emitter, true) at line 20, and everything works as expected. It would be better to make this optional, but that would require a change to the Marshal API.

I had a similar issue and used the following to work around the bug in goyaml.Marshal(). (*Regexp).ReplaceAllFunc is your friend: you can use it to expand the escaped Unicode runes in the byte slice. Maybe a little too dirty for production, but it works for the example ;-)
package main

import (
    "io/ioutil"
    "regexp"
    "strconv"
    "unicode/utf8"

    "launchpad.net/goyaml"
)

type Subtitle struct {
    Line string
}

// (?m) makes ^ and $ match per line, so the pattern also works on multi-line YAML output.
var reFind = regexp.MustCompile(`(?m)^\s*[^\s\:]+\:\s*".*\\u.*"\s*$`)
var reFindU = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)

func expandUnicodeInYamlLine(line []byte) []byte {
    // TODO: restrict this to the quoted string value
    return reFindU.ReplaceAllFunc(line, expandUnicodeRune)
}

// expandUnicodeRune turns a \uXXXX escape into the UTF-8 bytes of that rune.
func expandUnicodeRune(esc []byte) []byte {
    ri, _ := strconv.ParseInt(string(esc[2:]), 16, 32)
    r := rune(ri)
    repr := make([]byte, utf8.RuneLen(r))
    utf8.EncodeRune(repr, r)
    return repr
}

func main() {
    filename := "subtitle.yaml"
    filenameOut := "subtitleout.yaml"
    in, _ := ioutil.ReadFile(filename)
    var subtitle Subtitle
    _ = goyaml.Unmarshal(in, &subtitle)
    out, _ := goyaml.Marshal(&subtitle)
    out = reFind.ReplaceAllFunc(out, expandUnicodeInYamlLine)
    _ = ioutil.WriteFile(filenameOut, out, 0644)
}
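With the subtitle.yaml from the question, the regex pass expands the escapes inside the marshalled bytes, so subtitleout.yaml ends up containing line: "你好" rather than the \uXXXX form.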

Related

How can we prove that copy on write is applied to strings of more than 15 characters in Swift?

I wrote the following code in Xcode to try and prove copy on write in Swift:
func print(address o: UnsafeRawPointer) {
    print(o)
}

func longStringMemoryTest() {
    var longStr1 = "abcdefghijklmnopqr"
    var longStr2 = longStr1
    print(address: longStr1)
    print(address: longStr2)

    print("[append 'stu' to 'longStr2']")
    longStr2 += "stu"
    print(address: longStr1)
    print(address: longStr2)

    var test = "abcdefghijklmnopqr"
    print(address: test)
    print("[Fin]")
}
However, the console always prints the same address for longStr1 and longStr2, even though longStr1 has a value of "abcdefghijklmnopqr" and longStr2 has a value of "abcdefghijklmnopqrstu". I can't figure out what I'm missing in this code. Can you explain how to prove copy on write for strings in Swift and why the address is always the same for longStr1 and longStr2?

getch() equivalent in Swift: read a single character from stdin without a newline

I'm looking for a Swift function like getch() from C to read a single character from terminal input without requiring the user to press the return key. getchar() and readLine() are not sufficient, as they both require return.
There's a getch() function from ncurses which looked promising, but unfortunately seems to require taking over the display of the whole window.
After searching for a while online, I landed on the following (partly based on this answer):
import Foundation

extension FileHandle {
    // Put the terminal into "raw" mode: no echo, no line buffering.
    // Returns the original settings so they can be restored later.
    func enableRawMode() -> termios {
        var raw = termios()
        tcgetattr(self.fileDescriptor, &raw)
        let original = raw
        raw.c_lflag &= ~UInt(ECHO | ICANON)
        tcsetattr(self.fileDescriptor, TCSADRAIN, &raw)
        return original
    }

    func restoreRawMode(originalTerm: termios) {
        var term = originalTerm
        tcsetattr(self.fileDescriptor, TCSADRAIN, &term)
    }
}

func getch() -> UInt8 {
    let handle = FileHandle.standardInput
    let term = handle.enableRawMode()
    defer { handle.restoreRawMode(originalTerm: term) }

    var byte: UInt8 = 0
    read(handle.fileDescriptor, &byte, 1)
    return byte
}

fputs("Press any key to continue... ", stdout)
fflush(stdout)
let x = getch()
print()
print("Got character: \(UnicodeScalar(x))")

Count the number of lines in a Swift String

After reading a medium-sized file (about 500 kB) from a web service I have a regular Swift String (lines), originally encoded as .isolatin1. Before actually splitting it I would like to count the number of lines (quickly) in order to initialise a progress bar.
What is the best Swift idiom to achieve this?
I came up with the following:
let linesCount = lines.reduce(into: 0) { (count, letter) in
    if letter == "\r\n" {
        count += 1
    }
}
This does not look too bad, but I am asking myself if there is a shorter/faster way to do it. The characters property provides access to a sequence of Unicode graphemes, which treat \r\n as only one entity. Checking against CharacterSet.newlines does not work, since CharacterSet is not a set of Character but of Unicode.Scalar (a little counter-intuitive in my book), i.e. a set of code points (where \r\n counts as two code points), not graphemes. Trying
var lines = "Hello, playground\r\nhere too\r\nGalahad\r\n"
lines.unicodeScalars.reduce(into: 0) { (cnt, letter) in
    if CharacterSet.newlines.contains(letter) {
        cnt += 1
    }
}
will count to 6 instead of 3. So this is more general than the above method, but it will not work correctly for CRLF line endings.
Is there a way to allow for more line ending conventions (as in CharacterSet.newlines) that still achieves the correct result for CRLF? Can the number of lines be computed with less code (while still remaining readable)?
If it's ok for you to use a Foundation method on an NSString, I suggest using
enumerateLines(_ block: @escaping (String, UnsafeMutablePointer<ObjCBool>) -> Void)
Here's an example:
import Foundation

let base = "Hello, playground\r\nhere too\r\nGalahad\r\n"
let ns = base as NSString
ns.enumerateLines { (str, _) in
    print(str)
}
It separates the lines properly, taking into account all linefeed types, such as "\r\n", "\n", etc:
Hello, playground
here too
Galahad
In my example I print the lines, but it's trivial to count them instead, as you need to; my version is just for demonstration.
As I did not find a generic way to count newlines, I ended up just solving my problem by iterating through all the characters using
let linesCount = text.reduce(into: 0) { (count, letter) in
    if letter == "\r\n" { // This treats CRLF as one "letter", contrary to UnicodeScalars
        count += 1
    }
}
I was sure this would be a lot faster than enumerating lines just for counting, but I resolved to eventually do the measurement. Today I finally got to it and found ... that I could not have been more wrong.
A 10000-line string counted lines as above in about 1.0 seconds, but counting through enumeration using
var enumCount = 0
text.enumerateLines { (str, _) in
    enumCount += 1
}
only took around 0.8 seconds and was consistently faster by a little more than 20%. I do not know what tricks the Swift engineers hide up their sleeves, but they sure manage to run enumerateLines very quickly. This is just for the record.
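For reference, here is a rough sketch of the kind of timing harness that could reproduce this comparison; the test string contents and the use of Date for timing are assumptions on my part:

import Foundation

// Build a test string with 10,000 CRLF-terminated lines (the line text itself is made up).
let line = "some reasonably long line of sample text\r\n"
let text = String(repeating: line, count: 10_000)

// Variant 1: iterate over Characters ("\r\n" is a single grapheme cluster).
let start1 = Date()
let reduceCount = text.reduce(into: 0) { count, letter in
    if letter == "\r\n" { count += 1 }
}
print("reduce:", reduceCount, Date().timeIntervalSince(start1))

// Variant 2: let Foundation enumerate the lines and just count the callbacks.
let start2 = Date()
var enumCount = 0
text.enumerateLines { _, _ in enumCount += 1 }
print("enumerateLines:", enumCount, Date().timeIntervalSince(start2))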
You can use the following extension:

extension String {
    var numberOfLines: Int {
        return self.components(separatedBy: "\n").count
    }
}
Swift 5 Extension
extension String {
    func numberOfLines() -> Int {
        return self.numberOfOccurrencesOf(string: "\n") + 1
    }

    func numberOfOccurrencesOf(string: String) -> Int {
        return self.components(separatedBy: string).count - 1
    }
}
Example:
let testString = "First line\nSecond line\nThird line"
let numberOfLines = testString.numberOfLines() // returns 3
I use this CharacterSet, which Apple provides exactly for this task:
let newLines = text.components(separatedBy: .newlines).count - 1

Golang encode string UTF16 little endian and hash with MD5

I am a Go beginner and I am stuck with a problem.
I want to encode a string as UTF-16 little endian and then hash it with MD5 (hexadecimal). I found a piece of Python code which does exactly what I want, but I am not able to translate it to Go.
md5 = hashlib.md5()
md5.update(challenge.encode('utf-16le'))
response = md5.hexdigest()
Here, challenge is a variable containing a string.
You can do it with less work (or at least, IMO, more understandably) by using golang.org/x/text/encoding and golang.org/x/text/transform to create a Writer chain that does the encoding and hashing without so much manual byte-slice handling. The equivalent function:
import (
    "crypto/md5"
    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

func utf16leMd5(s string) []byte {
    enc := unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM).NewEncoder()
    hasher := md5.New()
    t := transform.NewWriter(hasher, enc)
    t.Write([]byte(s))
    t.Close() // flush anything the encoder may still be buffering
    return hasher.Sum(nil)
}
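For a quick check, wrapping this in a main that calls fmt.Printf("%x\n", utf16leMd5("Hello, playground")) should print 8f4a54c6ac7b88936e990256cc9d335b, the same digest the other examples below produce.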
You can use the unicode/utf16 package for UTF-16 encoding. utf16.Encode() returns the UTF-16 encoding of the Unicode code point sequence (slice of runes: []rune). You can simply convert a string to a slice of runes, e.g. []rune("some string"), and you can easily produce the byte sequence of the little-endian encoding by ranging over the uint16 codes and sending/appending first the low byte then the high byte to the output (this is what Little Endian means).
For Little Endian encoding, alternatively you can use the encoding/binary package: it has an exported LittleEndian variable and it has a PutUint16() method.
As for the MD5 checksum, the crypto/md5 package has what you want, md5.Sum() simply returns the MD5 checksum of the byte slice passed to it.
Here's a little function that captures what you want to do:
func utf16leMd5(s string) [16]byte {
    codes := utf16.Encode([]rune(s))
    b := make([]byte, len(codes)*2)
    for i, r := range codes {
        b[i*2] = byte(r)
        b[i*2+1] = byte(r >> 8)
    }
    return md5.Sum(b)
}
Using it:
s := "Hello, playground"
fmt.Printf("%x\n", utf16leMd5(s))
s = "エヌガミ"
fmt.Printf("%x\n", utf16leMd5(s))
Output:
8f4a54c6ac7b88936e990256cc9d335b
5f0db9e9859fd27f750eb1a212ad6212
Try it on the Go Playground.
The variant that uses encoding/binary would look like this:
for i, r := range codes {
    binary.LittleEndian.PutUint16(b[i*2:], r)
}
(Although this is slower as it creates lots of new slice headers.)
So, for reference, I used this complete Python program:
import hashlib
import codecs
md5 = hashlib.md5()
md5.update(codecs.encode('Hello, playground', 'utf-16le'))
response = md5.hexdigest()
print response
It prints 8f4a54c6ac7b88936e990256cc9d335b
Here is the Go equivalent: https://play.golang.org/p/Nbzz1dCSGI
package main

import (
    "crypto/md5"
    "encoding/binary"
    "encoding/hex"
    "fmt"
    "unicode/utf16"
)

func main() {
    s := "Hello, playground"
    fmt.Println(md5Utf16le(s))
}

func md5Utf16le(s string) string {
    encoded := utf16.Encode([]rune(s))
    b := convertUTF16ToLittleEndianBytes(encoded)
    return md5Hexadecimal(b)
}

func md5Hexadecimal(b []byte) string {
    h := md5.New()
    h.Write(b)
    return hex.EncodeToString(h.Sum(nil))
}

func convertUTF16ToLittleEndianBytes(u []uint16) []byte {
    b := make([]byte, 2*len(u))
    for index, value := range u {
        binary.LittleEndian.PutUint16(b[index*2:], value)
    }
    return b
}
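Running it prints 8f4a54c6ac7b88936e990256cc9d335b, matching the Python output above.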

Convert normal space/whitespace to non-breaking space?

This is a simple question, but I still can't figure out how to do it.
Say I have this string:
x := "this string"
The whitespace between 'this' and 'string' defaults to the regular space character, code point 32 (U+0020). How would I convert it into the non-breaking space U+00A0 in Go?
Use the documentation to identify the standard strings package as a likely candidate, then search it (or read through it all; you should know what's available in the standard library of any language you use) to find strings.Map.
Then the obvious short simple solution to convert any white space would be:
package main

import (
    "fmt"
    "strings"
    "unicode"
)

func main() {
    const nbsp = '\u00A0'
    result := strings.Map(func(r rune) rune {
        if unicode.IsSpace(r) {
            return nbsp
        }
        return r
    }, "this string")
    fmt.Printf("%s → %[1]q\n", result)
}
Playground
As previously mentioned, if you really only want to replace " " then perhaps strings.Replace is enough.
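For example, a minimal sketch of that simpler route (replacing only the plain space U+0020 and nothing else):

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Replace every occurrence of a plain space with a non-breaking space (U+00A0).
    result := strings.Replace("this string", " ", "\u00A0", -1)
    fmt.Printf("%q\n", result)
}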
I think a basic way to do it is by creating a simple function:
http://play.golang.org/p/YT8Cf917il
package main

import "fmt"

func ReplaceSpace(s string) string {
    var result []rune
    const badSpace = '\u0020'
    for _, r := range s {
        if r == badSpace {
            result = append(result, '\u00A0')
            continue
        }
        result = append(result, r)
    }
    return string(result)
}

func main() {
    fmt.Println(ReplaceSpace("this string"))
}
If you need more advanced manipulations you could build something with
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
Read http://blog.golang.org/normalization for more information on how to use them.
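As a rough sketch of that transform-based route, here is one way it might look, using golang.org/x/text/runes (runes.Map) together with transform rather than the norm package; treat it as an illustration, not a canonical recipe:

package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/runes"
    "golang.org/x/text/transform"
)

func main() {
    // A Transformer that maps every white-space rune to a non-breaking space.
    t := runes.Map(func(r rune) rune {
        if unicode.IsSpace(r) {
            return '\u00A0'
        }
        return r
    })
    result, _, _ := transform.String(t, "this string")
    fmt.Printf("%q\n", result)
}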