Can't get correct Chinese characters from "google::protobuf::TextFormat::PrintToString" - protobuf-c

I am using protobuf to read and write a config file, but I found that Chinese characters are not written to the file correctly.
The encoding code:
#include <google/protobuf/text_format.h>

zrd::Config cfg;
zrd::Market *market = cfg.add_market();
market->set_id("11");
market->set_name("清江冷链市场");
market->set_district("六合区");

std::string content;
google::protobuf::TextFormat::PrintToString(cfg, &content);
When it finishes running, the content looks like this:
market {\n id: \"11\"\n name: \"\346\270\205\346\261\237\345\206\267\351\223\276\345\270\202\345\234\272\"\n district: \"\345\205\255\345\220\210\345\214\272\"\n}
Why are the Chinese characters converted that way? When I use ofstream to write the content to a file, such escaped characters are not convenient to read, although protobuf can still decode them successfully.
Is there a way to save the Chinese characters in a readable form?
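For reference, protobuf's C++ TextFormat::Printer exposes a SetUseUtf8StringEscaping option that should keep UTF-8 strings readable instead of octal-escaping them; a minimal sketch, assuming the same cfg as above (availability may depend on your protobuf version):

#include <google/protobuf/text_format.h>

// Printer configured to leave UTF-8 string fields readable
// instead of escaping each byte as \NNN.
google::protobuf::TextFormat::Printer printer;
printer.SetUseUtf8StringEscaping(true);

std::string content;
printer.PrintToString(cfg, &content);
// content should now contain name: "清江冷链市场" rather than the octal escapes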

Related

IW8ISO8859P8 to UTF-8 conversion

My Perl script reads values extracted from an Oracle DB based on the IW8ISO8859P8 codepage into a string. I also have an input file (saved as UTF-8) from which I read another string.
I am trying to compare both strings. Printing the first string gives me gibberish, e.g. òøáä -îåñáåú, whereas the other string gives me the Hebrew letters, e.g. הסבות. How can I encode the first string to get the right Hebrew string?
Thanks
Shimon
I tried using the form $string1 = Encode('utf-8', $string1) but it did not help.
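Oracle's IW8ISO8859P8 character set corresponds to ISO-8859-8 (Hebrew), so the DB string most likely needs to be decoded from that encoding rather than encoded to UTF-8. A minimal sketch using the Encode module, reusing $string1 from the question:

use Encode qw(decode encode);

# $string1 holds raw ISO-8859-8 bytes from the DB;
# decode them into Perl's internal character string form.
my $text = decode('iso-8859-8', $string1);

# It can now be compared against the (decoded) string read from the
# UTF-8 file, or re-encoded as UTF-8 bytes for output:
my $utf8_bytes = encode('utf-8', $text);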

Python requests says it's UTF-8, so why are there still unicode characters?

Using requests to query the DarkSky API, the response says it returns a UTF-8 encoded document, but the string is defaulting to ASCII with an error. If I explicitly encode it as UTF-8, there are no errors, but the string contains extra characters and raw Unicode escapes. What's going on? I've set my .py file to use UTF-8 encoding in Sublime.
import requests

# Fetch weather data from DarkSky, parse resulting JSON
# (API_KEY, LAT and LONG are defined elsewhere)
try:
    url = "https://api.darksky.net/forecast/" + API_KEY + "/" + LAT + "," + LONG \
          + "?exclude=[minutely,hourly,alerts,flags]&units=us"
    response = requests.get(url)
    data = response.json()
    print(response.headers['content-type'])
    print(response.encoding)
except requests.RequestException as err:
    print(err)
which returns:
application/json; charset=utf-8
d_summary = data['daily']['summary']
print("Daily Summary: ", d_summary.encode('utf-8'))
which returns: Daily Summary: b'No precipitation throughout the week, with temperatures rising to 82\xc2\xb0F on Tuesday.'
What's going on with the extra characters in front and quoted substring with unicode text?
I don't see any problem here. Decoding the JSON doesn't cause an error, and encoding to UTF-8 produces a byte string literal repr b'...' as expected. Top-bit-set bytes are expected to look like \xXX in byte string literals.
"string is defaulting to ASCII with error"
What do you mean by that? Please show us the actual problem.
My guess is you are trying to print non-ASCII characters to the terminal on Windows and getting UnicodeEncodeError. If so that's because the Windows Console is broken and can't print Unicode properly. PEP 528 works around the problem in Python 3.6.
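In Python 3, printing the str writes the characters themselves, while encoding first just prints the repr of a bytes object. A minimal sketch of the difference (the summary text is shortened from the question):

s = "temperatures rising to 82\u00b0F"
print(s)                  # ...82°F - printed as characters (if the console supports them)
print(s.encode('utf-8'))  # b'...82\xc2\xb0F' - \xc2\xb0 is the UTF-8 encoding of the degree sign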

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original I'm getting unexpected results: I need to use UTF-7 during decoding, and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed and shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation would be JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I need to first undo the Base64 encoding. To do that, System.Convert can be used (which results in a byte array), and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this looks like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL decode the whole thing. As it is a UTF-8 string, I'd expect to only need to run the URL decode without any further parameters. But if you do that, you end up with an accented a that looks like � in a file or ? on the console. To get the actual original string, it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works, but I don't really get why it's necessary, since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one that has the original text, in the same file, encoded as UTF-8.
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of UrlDecode in C#. If you run escape('✓ à la mode') in your browser you will end up with the following string: %u2713%20%E0%20la%20mode. So we have a %u Unicode escape for the check mark and a plain two-digit %E0 escape for the à. If we use this directly in UrlDecode we end up with the same error. My current assumption is that it's an issue with the encoding of the PowerShell window and pasting characters into it.
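The behavior can be reproduced directly; a sketch (the comments about the byte mapping are an assumption about .NET's lenient UTF7Encoding, not a documented guarantee):

Add-Type -AssemblyName System.Web
$escaped = "%u2713%20%E0%20la%20mode"
# The %uXXXX escapes decode independently of the byte encoding, but the
# lone 0xE0 byte is not a valid UTF-8 sequence, so it turns into U+FFFD:
[System.Web.HttpUtility]::UrlDecode($escaped, [System.Text.Encoding]::UTF8)
# UTF7Encoding appears to map raw bytes 0x00-0xFF straight to the same
# code units, so 0xE0 comes out as U+00E0 (à) - which is why UTF-7 "works":
[System.Web.HttpUtility]::UrlDecode($escaped, [System.Text.Encoding]::UTF7)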
Turns out it actually isn't all that strange. It's just that, for what I want to do, it's advantageous to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding. But anyway, as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out, What is the proper way to URL encode Unicode characters? would indicate that the %u format was never formally standardized. A newer way to escape characters exists though, which is encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The originally linked Mozilla article about Base64 encoding of UTF-8 strings modifies the whole process in a way that allows you to just call the Base64 decode function in order to get the whole string. This is achieved by converting the URL encoded version of the string to bytes.
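For illustration, assuming the string was encoded in the browser with btoa(unescape(encodeURIComponent('✓ à la mode'))) as that article suggests, the Base64 payload already contains the raw UTF-8 bytes, so no UrlDecode step (and no UTF-7) is needed; a sketch (the Base64 literal below is what that expression should produce):

# Base64 of the UTF-8 bytes of '✓ à la mode'
$inputstring = "4pyTIMOgIGxhIG1vZGU="
$bytes = [System.Convert]::FromBase64String($inputstring)
# The bytes are already UTF-8, so decoding them yields the original string directly.
[System.Text.Encoding]::UTF8.GetString($bytes) | Out-File -Encoding utf8 C:\temp\output.txt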

C# To transform Facebook Response to proper encoded string

I am using a regular StreamReader to get the response from the Facebook Graph API:
https://graph.facebook.com/XXXX?access_token=&fields=id,name,about,address,last_name
I am reading the response stream, yet it returns me:
{"id":"XXXXX","name":"K\u0131r\u0131nt\u0131 Reklam"...}
My code is below - I unsuccessfully tried explicitly using UTF-8 and "iso-8859-9" (Turkish) encodings and setting accept-charset headers. I read Joel's famous article about encodings. It looks like each of the characters '\', 'u', '0', '1', '3', '1' is coming from Facebook as a literal character - I thought this would have been 2 bytes for the code point U+0131 in UTF-8. I am confused. I expect this string to be "Kırıntı Reklam".
I could simply find/replace those strings - yet that would be far from elegant and maintainable. How should I properly process or convert the Facebook Graph API response for strings with accents?
using (WebResponse response = request.GetResponse())
{
    using (Stream dataStream = response.GetResponseStream())
    {
        if (dataStream != null)
        {
            using (StreamReader reader = new StreamReader(dataStream))
            {
                responseFromServer = reader.ReadToEnd();
            }
        }
    }
}
Thank you in advance
tldr; use a JSON library - I like Json.NET - and don't worry about it.
The JSON shown is valid JSON where \uABCD in a JSON string represents a UTF-16 encoded character [1]. The internal JSON character escaping format is useful to avoid having to deal with Unicode stream encoding issues - it allows JSON to be represented entirely in ASCII/7-bit-clean characters (which is a subset of UTF-8).
Using a conforming JSON library to parse the JSON with such escapes would restore the JSON into an appropriate object-graph, of which some values will be properly-decoded String values. The library is responsible for understanding JSON and converting/reading it as appropriate - this includes correctly handling any such \u escape sequences.
The stream itself (that of the JSON text) should use the encoding that the server says, is indicated by a BOM, or has been pre-negotiated: but really, just UTF-8 here. This is how the JSON text is encoded, but has no bearing on the escape sequences found in JSON strings.
[1] Per RFC 4627, The application/json Media Type for JavaScript Object Notation (JSON):
Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A through F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
Alternatively, there are two-character sequence escape representations of some popular characters. So, for example, a string containing only a single reverse solidus character may be represented more compactly as "\\".
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
For the doubters, here is a LINQPad example. This uses Json.NET and imports the Newtonsoft.Json.Linq namespace.
var json = @"{""name"":""K\u0131r\u0131nt\u0131 Reklam""}";
json.Dump(); // -> {"name":"K\u0131r\u0131nt\u0131 Reklam"}
var name = JObject.Parse(json)["name"].ToString();
(name == "Kırıntı Reklam").Dump(); // -> true

Decoding URL containing unicode characters

I have the following code in a Mako template:
<a href="#" onclick='getCompanyHTML("${fund.investments[inv_name].name | u}"); return false;'>${inv_name}</a>
This applies url escaping to the name string of an object representing a company. The resulting escaped string is then used in a url. The Mako documentation states that url encoding is provided using urllib.quote_plus(string.encode('utf-8')).
On the server I receive the company name part into the argument investment_name:
def Investment(client, fund_name, investment_name, **kwargs):
    client = urllib.unquote_plus(client)
    fund_name = urllib.unquote_plus(fund_name)
    investment_name = urllib.unquote_plus(investment_name)
I then use investment_name as a key back in to the same dictionary from which it was extracted in the template.
This works fine for all the standard cases, such as spaces, slashes, and single quotes in the company name. However, it fails if the company name contains Unicode characters outside the ASCII range.
For instance, the URL for the company name "Eptisa Servicios de Ingeniería S.L." is rendered as "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L.". When this value arrives back at the server, I'm reversing the URL escaping but clearly failing to decode the Unicode properly, because my attempt to use the result as a dictionary key raises a KeyError.
I've tried adding unicode decoding in these two forms, without luck:
investment_name = urllib.unquote_plus(investment_name.decode('utf-8'))
investment_name = urllib.unquote_plus(investment_name.encode('raw_unicode_escape').decode('utf-8'))
Can anyone suggest what I must do to "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L." to turn it back into "Eptisa Servicios de Ingeniería S.L."?
Do it in the reverse order: first unquote, then .decode('utf-8').
Do not mix bytes and Unicode strings.
Example
import urllib

q = "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L."
b = urllib.unquote_plus(q)  # byte string: '+' -> ' ', '%C3%AD' -> '\xc3\xad'
u = b.decode("utf-8")       # unicode string with the real 'í'
print u
Note: print u might produce UnicodeEncodeError. To fix it:
print u.encode(character_encoding_your_console_understands)
Or set PYTHONIOENCODING environment variable.
On Unix you could try locale.getpreferredencoding() as the character encoding; on Windows, see the output of chcp.
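To see the full round trip in Python 2 (a sketch matching the question's urllib usage): quoting the UTF-8 bytes reproduces the URL from the question, and reversing the steps in the right order recovers the dictionary key:

# -*- coding: utf-8 -*-
import urllib

name = u"Eptisa Servicios de Ingeniería S.L."

# What Mako's "| u" filter does: encode to UTF-8 bytes, then quote.
q = urllib.quote_plus(name.encode('utf-8'))
print q            # Eptisa+Servicios+de+Ingenier%C3%ADa+S.L.

# The reverse: unquote first (bytes), then decode to unicode.
key = urllib.unquote_plus(q).decode('utf-8')
print key == name  # True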