This is a simple question, but I still can't figure out how to do it.
Say I have this string:
x := "this string"
The whitespace between 'this' and 'string' defaults to the regular unicode whitespace character 32/U+0020. How would I convert it into the non-breaking unicode whitespace character U+00A0 in Go?
Use the documentation to identify the standard strings package as a likely candidate, then search it (or read through it; it pays to know what is available in the standard library of any language you use) to find strings.Map.
Then the obvious short simple solution to convert any white space would be:
package main
import (
"fmt"
"strings"
"unicode"
)
func main() {
const nbsp = '\u00A0'
result := strings.Map(func(r rune) rune {
if unicode.IsSpace(r) {
return nbsp
}
return r
}, "this string")
fmt.Printf("%s → %[1]q\n", result)
}
As previously mentioned, if you really only want to replace " " then perhaps strings.Replace.
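A minimal sketch of that strings.Replace route, assuming only the ordinary space U+0020 should change and that Go 1.12 or later is available for strings.ReplaceAll. The helper name replaceSpaces is made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// replaceSpaces (hypothetical name) swaps every ordinary space (U+0020)
// for a no-break space (U+00A0). Tabs, newlines and other white space
// are left untouched.
func replaceSpaces(s string) string {
	return strings.ReplaceAll(s, " ", "\u00A0")
}

func main() {
	// %+q escapes non-ASCII characters, which makes the otherwise
	// invisible substitution visible.
	fmt.Printf("%+q\n", replaceSpaces("this string")) // "this\u00a0string"
}
```

On older Go versions, strings.Replace(s, " ", "\u00A0", -1) should behave the same way.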
I think a basic way to do it is by creating a simple function:
http://play.golang.org/p/YT8Cf917il
package main
import "fmt"
func ReplaceSpace(s string) string {
var result []rune
const badSpace = '\u0020'
for _, r := range s {
if r == badSpace {
result = append(result, '\u00A0')
continue
}
result = append(result, r)
}
return string(result)
}
func main() {
fmt.Println(ReplaceSpace("this string"))
}
If you need more advanced manipulations you could create something with
"golang.org/x/text/transform"
"golang.org/x/text/unicode/norm"
Read http://blog.golang.org/normalization for more information on how to use it
I want to check a string to be able to understand that string is suitable for using as a display name in the app. Below block looks only for english characters. How can I cover all language letters? Also all punctuations and numbers won't be allowed.
func isSuitableForDisplayName(inputString: String) -> Bool {
let mergedString = inputString.stringByRemovingWhitespaces
let characterset = CharacterSet(charactersIn: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
if mergedString.rangeOfCharacter(from: characterset.inverted) != nil {
return false
} else {
return true
}
}
You can use CharacterSet.letters, which contains all the characters in the Unicode categories L and M.
Category M includes combining marks. If you don't want those, use:
CharacterSet.letters.subtracting(.nonBaseCharacters)
Also, your way of checking whether a string contains only the characters in a character set is quite weird. I would do something like this:
return mergedString.trimmingCharacters(in: CharacterSet.letters) == ""
I am a Go beginner and stuck with a problem.
I want to encode a string with UTF16 little endian and then hash it with MD5 (hexadecimal). I have found a piece of Python code, which does exactly what I want. But I am not able to transfer it to Google Go.
md5 = hashlib.md5()
md5.update(challenge.encode('utf-16le'))
response = md5.hexdigest()
The challenge is a variable containing a string.
You can do it with less work (or at least more readably, IMO) by using golang.org/x/text/encoding/unicode and golang.org/x/text/transform to create a Writer chain that does the encoding and hashing without so much manual byte slice handling. Note that unicode below refers to golang.org/x/text/encoding/unicode, not the standard library's unicode package. The equivalent function:
func utf16leMd5(s string) []byte {
enc := unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM).NewEncoder()
hasher := md5.New()
t := transform.NewWriter(hasher, enc)
t.Write([]byte(s))
return hasher.Sum(nil)
}
You can use the unicode/utf16 package for UTF-16 encoding. utf16.Encode() returns the UTF-16 encoding of the Unicode code point sequence (slice of runes: []rune). You can simply convert a string to a slice of runes, e.g. []rune("some string"), and you can easily produce the byte sequence of the little-endian encoding by ranging over the uint16 codes and sending/appending first the low byte then the high byte to the output (this is what Little Endian means).
For Little Endian encoding, alternatively you can use the encoding/binary package: it has an exported LittleEndian variable and it has a PutUint16() method.
As for the MD5 checksum, the crypto/md5 package has what you want, md5.Sum() simply returns the MD5 checksum of the byte slice passed to it.
Here's a little function that captures what you want to do:
func utf16leMd5(s string) [16]byte {
codes := utf16.Encode([]rune(s))
b := make([]byte, len(codes)*2)
for i, r := range codes {
b[i*2] = byte(r)
b[i*2+1] = byte(r >> 8)
}
return md5.Sum(b)
}
Using it:
s := "Hello, playground"
fmt.Printf("%x\n", utf16leMd5(s))
s = "エヌガミ"
fmt.Printf("%x\n", utf16leMd5(s))
Output:
8f4a54c6ac7b88936e990256cc9d335b
5f0db9e9859fd27f750eb1a212ad6212
Try it on the Go Playground.
The variant that uses encoding/binary would look like this:
for i, r := range codes {
binary.LittleEndian.PutUint16(b[i*2:], r)
}
(Although this may be marginally slower, as it creates a new slice header on every iteration; measure before relying on either claim.)
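Another possible variant, assuming Go 1.19 or later, is the append-style helper binary.LittleEndian.AppendUint16, which avoids the index arithmetic altogether. This is only a sketch; the function name utf16leMD5Hex is made up for illustration, and no performance claim is implied:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// utf16leMD5Hex (hypothetical name) encodes s as UTF-16 little endian
// and returns the MD5 digest as a lowercase hex string.
func utf16leMD5Hex(s string) string {
	codes := utf16.Encode([]rune(s))
	// Pre-size the buffer: every UTF-16 code unit becomes two bytes.
	b := make([]byte, 0, len(codes)*2)
	for _, c := range codes {
		// AppendUint16 appends the low byte first, then the high byte.
		b = binary.LittleEndian.AppendUint16(b, c)
	}
	return fmt.Sprintf("%x", md5.Sum(b))
}

func main() {
	fmt.Println(utf16leMD5Hex("Hello, playground"))
}
```

For "Hello, playground" this should print the same digest as the function above, 8f4a54c6ac7b88936e990256cc9d335b.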
So, for reference, I used this complete python program:
import hashlib
import codecs
md5 = hashlib.md5()
md5.update(codecs.encode('Hello, playground', 'utf-16le'))
response = md5.hexdigest()
print(response)
It prints 8f4a54c6ac7b88936e990256cc9d335b
Here is the Go equivalent: https://play.golang.org/p/Nbzz1dCSGI
package main
import (
"crypto/md5"
"encoding/binary"
"encoding/hex"
"fmt"
"unicode/utf16"
)
func main() {
s := "Hello, playground"
fmt.Println(md5Utf16le(s))
}
func md5Utf16le(s string) string {
encoded := utf16.Encode([]rune(s))
b := convertUTF16ToLittleEndianBytes(encoded)
return md5Hexadecimal(b)
}
func md5Hexadecimal(b []byte) string {
h := md5.New()
h.Write(b)
return hex.EncodeToString(h.Sum(nil))
}
func convertUTF16ToLittleEndianBytes(u []uint16) []byte {
b := make([]byte, 2*len(u))
for index, value := range u {
binary.LittleEndian.PutUint16(b[index*2:], value)
}
return b
}
I've searched and searched but can't seem to find what the instr or strpos equivalent in Swift is. I just need to see if a string contains another string. How can I do this in Swift?
Here is a useful extension of String in Swift. Both functions work like the well known PHP functions. Just paste this code in any .swift file of your project:
import Foundation
extension String {
func strpos(needle:String)->Int{
//Returns the index of the first occurrence of a substring in a string, or -1 if absent
if let range = self.rangeOfString(needle) {
return startIndex.distanceTo(range.startIndex)
} else {
return -1
}
}
func instr(needle:String)->Bool{
return self.containsString(needle)
}
}
Usage, anywhere else in your project:
let myString = "Hello, world"
print(myString.strpos("world")) //->prints "7"
print(myString.strpos("Dolly")) //->prints "-1"
print(myString.strpos(",")) //->prints "5"
Use rangeOfString():
if string.rangeOfString("mySubstring") != nil
{
println("string contains substring")
}
I'm using goyaml as a YAML beautifier. By loading and dumping a YAML file, I can source-format it. I unmarshal the data from a YAML source file into a struct, marshal those bytes, and write the bytes to an output file. But the process morphs my Unicode strings into the literal version of the quoted strings, and I don't know how to reverse it.
Example input subtitle.yaml:
line: 你好
I've stripped everything down to the smallest reproducible problem. Here's the code, using _ to discard errors, none of which occur in practice:
package main
import (
"io/ioutil"
//"unicode/utf8"
//"fmt"
goyaml "gopkg.in/yaml.v1" // the package is named yaml, so alias it to match the calls below
)
type Subtitle struct {
Line string
}
func main() {
filename := "subtitle.yaml"
in, _ := ioutil.ReadFile(filename)
var subtitle Subtitle
_ = goyaml.Unmarshal(in, &subtitle)
out, _ := goyaml.Marshal(&subtitle)
//for len(out) > 0 { // For debugging, see what the runes are
// r, size := utf8.DecodeRune(out)
// fmt.Printf("%c ", r)
// out = out[size:]
//}
_ = ioutil.WriteFile(filename, out, 0644)
}
Actual output subtitle.yaml:
line: "\u4F60\u597D"
I want to reverse the weirdness in goyaml after I get the variable out.
The commented-out rune-printing code block, which adds spaces between runes for clarity, outputs the following. It shows that Unicode runes like 你 aren't being decoded, but treated literally:
l i n e : " \ u 4 F 6 0 \ u 5 9 7 D "
How can I unquote out, before writing it to the output file, so that the output looks like the input (albeit beautified)?
Desired output subtitle.yaml:
line: "你好"
Temporary Solution
I've filed https://github.com/go-yaml/yaml/issues/11. In the meantime, @bobince's tip on yaml_emitter_set_unicode was helpful in uncovering the problem. It was defined as a C binding but never called (nor was there an option to set it)! I changed encode.go and added yaml_emitter_set_unicode(&e.emitter, true) to line 20, and everything works as expected. It would be better to make it optional, but that would require a change in the Marshal API.
Had a similar issue and could apply this to circumvent the bug in goyaml.Marshal(). (*Regexp) ReplaceAllFunc is your friend which you can use to expand the escaped Unicode runes in the byte array. A little bit too dirty for production maybe, but works for the example ;-)
package main
import (
"io/ioutil"
"unicode/utf8"
"regexp"
"strconv"
"launchpad.net/goyaml"
)
type Subtitle struct {
Line string
}
var reFind = regexp.MustCompile(`^\s*[^\s\:]+\:\s*".*\\u.*"\s*$`)
var reFindU = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)
func expandUnicodeInYamlLine(line []byte) []byte {
// TODO: restrict this to the quoted string value
return reFindU.ReplaceAllFunc(line, expandUnicodeRune)
}
func expandUnicodeRune(esc []byte) []byte {
ri, _ := strconv.ParseInt(string(esc[2:]), 16, 32)
r := rune(ri)
repr := make([]byte, utf8.RuneLen(r))
utf8.EncodeRune(repr, r)
return repr
}
func main() {
filename := "subtitle.yaml"
filenameOut := "subtitleout.yaml"
in, _ := ioutil.ReadFile(filename)
var subtitle Subtitle
_ = goyaml.Unmarshal(in, &subtitle)
out, _ := goyaml.Marshal(&subtitle)
out = reFind.ReplaceAllFunc(out, expandUnicodeInYamlLine)
_ = ioutil.WriteFile(filenameOut, out, 0644)
}
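To try the escape expansion without any YAML dependency, here is a rough standard-library-only sketch. It assumes the marshaled bytes contain plain four-digit \uXXXX escapes, as in the output shown in the question; surrogate pairs and escaped backslashes are not handled. The names reEscape and expandEscapes are made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// reEscape matches a literal backslash, 'u', and exactly four hex digits.
var reEscape = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)

// expandEscapes (hypothetical name) replaces each \uXXXX escape in b
// with the UTF-8 encoding of the code point it names.
func expandEscapes(b []byte) []byte {
	return reEscape.ReplaceAllFunc(b, func(esc []byte) []byte {
		n, err := strconv.ParseUint(string(esc[2:]), 16, 32)
		if err != nil {
			return esc // leave anything unparsable as it was
		}
		return []byte(string(rune(n)))
	})
}

func main() {
	// Stand-in for the bytes returned by Marshal in the question.
	out := []byte(`line: "\u4F60\u597D"` + "\n")
	fmt.Printf("%s", expandEscapes(out)) // line: "你好"
}
```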
If you run fmt.Println("\u554a"), it shows '啊'.
But how to get unicode-style-string \u554a from a rune '啊' ?
package main
import "fmt"
import "strconv"
func main() {
quoted := strconv.QuoteRuneToASCII('啊') // quoted = "'\u554a'"
unquoted := quoted[1:len(quoted)-1] // unquoted = "\u554a"
fmt.Println(unquoted)
}
This outputs:
\u554a
IMHO, this would be better:
func RuneToAscii(r rune) string {
if r < 128 {
return string(r)
} else {
return "\\u" + strconv.FormatInt(int64(r), 16)
}
}
You can use fmt.Sprintf along with %U to get the hexadecimal value:
test := fmt.Sprintf("%U", '啊')
fmt.Println("\\u" + test[2:]) // Print \u554A
For example,
package main
import "fmt"
func main() {
r := rune('啊')
u := fmt.Sprintf("%U", r)
fmt.Println(string(r), u)
}
Output:
啊 U+554A
fmt.Printf("\\u%X", '啊')
http://play.golang.org/p/Jh9ns8Qh15
(Upper or lowercase 'x' will control the case of the hex characters)
As hinted at by package fmt's documentation:
%U Unicode format: U+1234; same as "U+%04X"
package main
import "fmt"
func main() {
fmt.Printf("%+q", '啊')
}
I'd like to add to the answer that hardPass has.
When the hex representation of the code point has fewer than 4 digits (ü, for example), strconv.FormatInt produces \ufc, which is a syntax error in a Go string literal, as opposed to the full \u00fc that Go understands.
Padding the hex with zeros using fmt.Sprintf with hex formatting will fix this:
func RuneToAscii(r rune) string {
if r < 128 {
return string(r)
} else {
return fmt.Sprintf("\\u%04x", r)
}
}
https://play.golang.org/p/80w29oeBec1
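One further case is worth a sketch: code points above U+FFFF do not fit in four hex digits, and Go's string literal syntax writes them as \U followed by eight hex digits. A possible extension, assuming that is the desired output format (the lowercase name runeToASCII is made up to keep it apart from the function above):

```go
package main

import "fmt"

// runeToASCII (hypothetical name) returns r unchanged if it is ASCII,
// a \uXXXX escape for code points up to U+FFFF, and a \UXXXXXXXX
// escape for anything larger.
func runeToASCII(r rune) string {
	switch {
	case r < 128:
		return string(r)
	case r <= 0xFFFF:
		return fmt.Sprintf(`\u%04x`, r)
	default:
		return fmt.Sprintf(`\U%08x`, r)
	}
}

func main() {
	for _, r := range "aü啊😀" {
		fmt.Println(runeToASCII(r))
	}
}
```

This should print a, \u00fc, \u554a and \U0001f600, one per line.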
This would do the job:
package main
import (
"fmt"
)
func main() {
str := fmt.Sprintf("%s", []byte{0x80})
fmt.Println(str)
}