Go and UTF-8 encoding - Is conversion automatic?

I am making http requests using Go.
request, err := http.NewRequest("GET", url, nil)
This request, if successful, returns a response.
response, err := client.Do(request)
After receiving a response, I want to save the content.
content, err := ioutil.ReadAll(response.Body)
ioutil.WriteFile(destination, content, 0644)
I looked at the Headers of the responses.
response.Header.Get("Content-Type")
I saw that the majority are already UTF-8 encoded, which is good. But some have different encodings. I know Go has built-in Unicode support. Does that mean that if I write, for example, the content of a Big5-encoded page, it will automatically be converted to UTF-8? Or do I need to decode it manually using the Big5 encoding and re-encode it as UTF-8?
Basically, I want to ensure that everything that gets written is utf-8 encoded. What is the best way to achieve this?
Thanks!

What ioutil.ReadAll reads will be written with ioutil.WriteFile without any conversions whatsoever.
If you want to force UTF-8 encoding, you will have to do the decoding and re-encoding yourself, e.g. with the help of golang.org/x/text/encoding{,/charmap} and/or the unicode/utf{8,16} packages.
Be prepared for all sorts of ugliness and a lot of pain.
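As a rough illustration of doing the conversion by hand, the sketch below reads the charset from a Content-Type value and re-encodes ISO-8859-1 bytes as UTF-8 using only the standard library. The helper names charsetOf, latin1ToUTF8 and toUTF8 are made up for this example. Latin-1 is used because each of its bytes maps to the Unicode code point with the same value; Big5 needs a real mapping table, such as the decoder in golang.org/x/text/encoding/traditionalchinese, which is not shown here.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
	"unicode/utf8"
)

// charsetOf extracts the charset parameter from a Content-Type header value.
// It returns "" when the header is empty, malformed, or has no charset.
func charsetOf(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// latin1ToUTF8 re-encodes ISO-8859-1 bytes as UTF-8.
// Every Latin-1 byte equals the Unicode code point it represents.
func latin1ToUTF8(in []byte) []byte {
	out := make([]byte, 0, len(in))
	var buf [utf8.UTFMax]byte
	for _, b := range in {
		n := utf8.EncodeRune(buf[:], rune(b))
		out = append(out, buf[:n]...)
	}
	return out
}

// toUTF8 returns content as UTF-8, or an error for charsets this sketch
// does not handle (hypothetical helper, not part of any library).
func toUTF8(content []byte, contentType string) ([]byte, error) {
	switch cs := charsetOf(contentType); cs {
	case "utf-8", "utf8":
		if !utf8.Valid(content) {
			return nil, fmt.Errorf("content is declared as UTF-8 but is not valid UTF-8")
		}
		return content, nil
	case "iso-8859-1", "latin1":
		return latin1ToUTF8(content), nil
	default:
		return nil, fmt.Errorf("unsupported charset %q", cs)
	}
}

func main() {
	// 0xE9 is "é" in ISO-8859-1.
	out, err := toUTF8([]byte{'c', 'a', 'f', 0xE9}, "text/html; charset=ISO-8859-1")
	fmt.Println(string(out), err) // prints "café <nil>"
}
```

Returning an error for unknown charsets, rather than writing the bytes through unchanged, keeps non-UTF-8 content from silently ending up on disk.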

Related

How do I set HTTP status code conditionally in a Go HTTP server?

I have the following HTTP handler function:
func (h *UptimeHttpHandler) CreateSchedule(w http.ResponseWriter, r *http.Request) {
    defer r.Body.Close()
    dec := json.NewDecoder(r.Body)
    var req ScheduleRequest
    if err := dec.Decode(&req); err != nil {
        // error handling omitted
    }
    result, err := saveToDb(req)
    if err != nil {
        // error handling omitted
    }
    // Responding with 201-Created instead of default 200-Ok
    w.WriteHeader(http.StatusCreated)
    enc := json.NewEncoder(w)
    if err := enc.Encode(result); err != nil {
        // this will have no effect as the status is already set before
        w.WriteHeader(http.StatusInternalServerError)
        fmt.Fprintf(w, "%v", err)
    }
}
The code above does the following:
The request arrives as JSON data and is decoded into req.
The request is persisted in the DB. This returns a result object, which becomes the response.
As soon as the DB insert succeeds, we set the status code to 201, then use a JSON encoder to stream the JSON-encoded value directly to the ResponseWriter.
When Encode returns an error, I need to change the status code to 500, but I can't, because Go allows the status code to be set only once.
I could handle this by keeping the encoded JSON in memory and setting the status code only once encoding has succeeded, but that creates an unwanted copy and is not very nice.
Is there a better way to handle this?
Expanding my comment a bit:
There is no way to do this without buffering. And if you think about it, even if there was, how would that be implemented? The response header needs to be sent before the contents. If the code depends on encoding the response, you must first encode, check the result, set the header and flush the encoded buffer.
So even if the Go API supported a "delayed" status, you would be basically pushing the buffering problem one level down, to the http library :)
You should make the call - either it's important to get the response code right at all costs and you can afford to buffer, or you want to stream the response and you cannot change the result.
Theoretically Go could create an encoding validator that will make sure the object you are trying to encode will 100% pass before actually encoding.
BTW, this makes me think of another thing - the semantic meaning of the response code. You return HTTP Created, right? It's the correct code since the object has in fact been created. But if you return a 500 error due to encoding, has the object not been created? It still has! So the Created code is still valid, the error is just in the encoding stage. So maybe a better design would be not to return the object as a response?
The JSON marshalling may fail if some of your types (deeply) can not be serialized to JSON.
So you should ask yourself: in which cases can the encoding of the response fail?
If you know the answer, just fix those cases. If you don't, those cases probably just don't exist.
If you are not sure, the only way to catch errors is to buffer the encoding. You have to balance the "not nice" between buffering and not returning a clean error.
In any case, returning a 500 error when only the JSON encoding failed is really not nice. At the very least, you should then revert the object creation (the work of saveToDb), which means wrapping it in a transaction that you roll back on any failure.

Vert.x-Web HttpServer returns garbled text when the response contains Chinese characters

I ran into an issue while learning Vert.x-Web: the code below returns garbled text for Chinese characters. Can anyone help?
HttpServer server = vertx.createHttpServer();
server.requestHandler(request -> {
    // This handler gets called for each request that arrives on the server
    HttpServerResponse response = request.response();
    response.putHeader("content-type", "text/plain charset='utf-8'");
    // Write to the response and end it
    response.end("Hello World!中文");
});
server.listen(8080);
I just found the reason. I think Vert.x does support UTF-8 encoding, but we need to make sure that all the HTML files and related files, including CSS, JS, and font files, are actually saved as UTF-8. You can open a file in Notepad to check whether it is UTF-8 and, if it is not, use "Save As..." to save it as UTF-8.

Is form data automatically encoded by browsers?

I have read some stuff about form data encoding, but one thing remains unclear. In case of enctype="application/x-www-form-urlencoded" we need to urlencode data by hand, don't we?
... Forms submitted with this content type must be encoded as follows
Must be encoded by whom? By browsers? Or by application developers?
The other thing is -- what encoding (if any) is used, or should be used, in case of multipart/form-data?
I'm a bit confused, so thanks a lot in advance.
Browsers URL-encode the data automatically. The W3C document is written first of all for those who make browsers, so the phrase "Forms submitted with this content type must be encoded as follows" means that the data must be encoded by the browser. You can check this by viewing the raw POST body in the script that handles the form data (in PHP, for example, it looks like file_get_contents("php://input");).

Hotmail messing with encoded URL parameters

We have a system that sends out regular emails with links in, many of which contain URL encoded parameters such as this:
href="http://www.mydomain.com/login.aspx?returnurl=http%3A%2F%2Fwww.mydomain.com%2Fview.aspx%3Fid%3D1234%26alert%3Dtrue"
You can see that the "returnurl" parameter is encoded. However, it seems that a large number of our users (seemingly on Hotmail) are receiving the emails with this parameter partly decoded, such as:
href="http://www.mydomain.com/login.aspx?returnurl=http://www.mydomain.com/view.aspx?view.aspx%3Fid%3D1234%26alert%3Dtrue"
Why would it be decoded like this, and why only partly? I have no idea how to deal with it. I thought of Base64 encoding, but Base64 strings contain characters that would themselves need URL encoding. I thought of double encoding, but then I would not know whether to double-decode the parameter or not. Can anyone help? Thanks.
One reason this could be happening is that the URL encoding rules differ before and after the ?. If the mechanism doing the decoding works from the back of the URL and applies the query decoding rules until it finds the first ?, it could cause the problem you are describing.
I'm not sure how to deal with it, though, since as I understand it the system doing this inappropriate decoding is outside your control. I would try to hide the ? in the return URL query somehow.
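One way to hide the ? (and every other reserved character) is the URL-safe Base64 alphabet, which uses only letters, digits, - and _ and therefore needs no percent-encoding at all. The sketch below uses Go's base64.RawURLEncoding. The helper names are hypothetical, the login page would have to be changed to decode the token, and whether a given mail client leaves such a token untouched is an assumption that would need testing.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeReturnURL turns a URL into a token made only of A-Z, a-z, 0-9, - and _.
func encodeReturnURL(raw string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(raw))
}

// decodeReturnURL reverses encodeReturnURL on the receiving page.
func decodeReturnURL(token string) (string, error) {
	b, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	token := encodeReturnURL("http://www.mydomain.com/view.aspx?id=1234&alert=true")
	fmt.Println("http://www.mydomain.com/login.aspx?returnurl=" + token)

	back, err := decodeReturnURL(token)
	fmt.Println(back, err)
}
```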

httprequest encoding mismatch

I'm using a Google Gears Worker to submit a POST httprequest (using var request = google.gears.factory.create('beta.httprequest'); )
with a parameter containing the string
"bford%20%24%23%26!%3F%40%20%E5%BE%B3%E5%8A%9B%E5%9F%BA%E5%BD%A6"
but the Django HttpRequest is receiving it as "bford $#&!?# å¾³å\u008a\u009bå\u009fºå½¦"
How do I specify to one or the other of the parties in the transaction to leave it untranslated?
Check the HttpRequest.encoding and the DEFAULT_CHARSET settings. Judging by the encoded value, this should be UTF-8 (which is indeed usually the right thing).
You can get the ‘untranslated’ value (with the %s still in) by looking at the input stream (for POST) or the environ QUERY_STRING (for GET) and decoding it manually, but it would really be better to fix Django's incorrect string-to-unicode decoding.
As I understand it, Django 1.0 should default to using UTF-8, so I'm not sure why it's not in your case.