How to save non-ASCII characters in redis HMSET?

I'd like to save Arabic characters like سلام in a redis hash, like this:
HMSET arabicHash "سلام" 5
OK
But the result is not as intended:
127.0.0.1:6379> HGETALL arabicHash
1) "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85"
2) "5"
I'm wondering if there is a way to save سلام directly into the Redis hash. And if not, how can I convert "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85" back to human-readable characters after it is retrieved?
Update: I've tested this in my Ubuntu Bash terminal as well, but the result is still not formatted correctly. (Screenshot omitted.)

You need to put quotes around the key and value being stored in the hash.
Tested on try.redis.io (it shows the Redis output in UTF-8 decoded form).
The text may show up as UTF-8 byte escapes in the redis-cli response, but after decoding it displays correctly as Arabic characters.
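As a hedged sketch of the same thing from a client library: with the Python redis-py package, passing decode_responses=True makes the client decode every reply as UTF-8, so the key comes back as readable Arabic instead of byte escapes (redis-cli behaves similarly when started with the --raw option). The hash and key names below are the ones from the question.

import redis

# decode_responses=True tells redis-py to decode all replies as UTF-8 strings
r = redis.Redis(host='localhost', port=6379, decode_responses=True)

r.hset('arabicHash', 'سلام', 5)
print(r.hgetall('arabicHash'))   # {'سلام': '5'}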

Related

Polish name (Wężarów) returned from json service as W\u0119\u017car\u00f3w, renders as WÄ™Å¼arÃ³w. Can't figure out encoding/charset.

I'm using DB-IP.com to get city names from IP addresses. Many of these are international cities, with special characters in the names.
As an example, one of these cities is Wężarów in Poland. Checking the JSON return in the console or opening the request URL directly, it's being returned from DB-IP as "W\u0119\u017car\u00f3w" with a Content-Type of text/javascript;charset=UTF-8. This is rendered in the browser as WÄ™Å¼arÃ³w - it is also saved in my mysql database as WÄ™Å¼arÃ³w (which I've tried with both utf8 and latin1 encoding).
I'm ok with saving it in the DB as another format, as long as I can convert it back to Wężarów for display in browser. I've tried encoding and decoding to/from several formats, even just to display directly on the screen (ignoring the DB entirely). I'm completely confused on what I need to do here to get it in readable format.
I'm working with Perl; if I can figure out what I need to do with the encoding/decoding/charset (as I'm currently clueless), I'm sure I can work out the rest from there.
It looks like the UTF-8 encoded string was interpreted by the browser as if it were Windows-1252. Here's how I deduced it:
% python3
>>> s = "W\u0119\u017car\u00f3w"
>>> b = bytes(s, encoding='utf-8')
>>> b
b'W\xc4\x99\xc5\xbcar\xc3\xb3w'
>>> str(b, encoding='utf-8')
'Wężarów'
>>> str(b, encoding='latin-1')
'WÄ\x99Å¼arÃ³w'
>>> str(b, encoding='windows-1252')
'WÄ™Å¼arÃ³w'
If you're not good with Python, what I'm doing here is encoding the string "W\u0119\u017car\u00f3w" into UTF-8, yielding the byte sequence 'W\xc4\x99\xc5\xbcar\xc3\xb3w'. Decoding that with UTF-8 yielded 'Wężarów', confirming that this is the correct UTF-8 encoding of the string you want. So I took a guess that the browser is using the wrong encoding to render it, and decoded it using Latin-1. That gave me something very close, so I looked up Latin-1 and noticed that it's named as the basis for Windows-1252. Decoding again as Windows-1252 gives the result you saw.
What's gone wrong here is that the browser can't tell what encoding to use to render the page, and it's guessing wrong. You need to fix this by telling it explicitly to use UTF-8. Here's a page by the W3C that describes how to do that. Essentially what you need to do is add an HTML <meta> element to the document head. If you also set an HTTP header with the encoding name in it, make sure they are consistent.
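For reference, the declaration that the W3C page describes boils down to a single element in the document head, optionally mirrored by an HTTP header; shown here as a hedged example to adapt to your own markup and server:

<meta charset="utf-8">

and, as the HTTP response header form of the same declaration:

Content-Type: text/html; charset=utf-8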
(In Firefox, while you're debugging, you can go to View -> Character Encoding to set the encoding on a page-by-page basis. I assume other browsers have the same feature.)

How to get Python 3 to correctly handle unicode strings from MongoDB?

I'm using Windows 7 64-bit, Python 3, MongoDB, and PyMongo. I know that in Python 3, all strings are Unicode. I also know that MongoDB stores all strings as Unicode. So I don't understand why, when I pull a document from my database where the value of a particular field is "C:\Some Folder\E=mc².xyz", Python treats that string as "C:\Some Folder\E=mcÂ².xyz". It doesn't just print that way; os.path.exists() returns False. Now, as if that wasn't confusing enough, if I save the string to a text file and then open it with the encoding explicitly set to "utf-8", the string appears correctly, and os.path.exists() returns True. What's going wrong, and how do I fix it?
Edit:
Here's some code I just wrote to demonstrate my problem:
from pymongo import MongoClient
db = MongoClient().test_db
orig_doc = {'string': 'E=mc²'}
_id = db.test_col.insert_one(orig_doc).inserted_id  # the legacy insert() was removed in PyMongo 4
new_doc = db.test_col.find_one(_id)
print(new_doc['string'])
E=mc²
As you can see, it works exactly as it should! Thus I now realize that I must've messed up when I migrated from PostgreSQL. Now I just need to fix the strings. I know that it's possible, but there's got to be a better way than writing the strings to a text file and then reading them back. I could do that, just as I did in my previous testing, but it just doesn't seem like the right way.
You can't store "Unicode" as such; it is a concept. MongoDB has to be using a concrete encoding of Unicode, and it looks like UTF-8 (BSON strings are in fact defined to be UTF-8). Python 3 stores Unicode strings internally in one of several representations, depending on the content of the string. What you have is a string that was decoded from bytes with the wrong encoding:
>>> s = '"C:\Some Folder\E=mcÂ².xyz"' # The invalid decoding.
>>> print(s)
"C:\Some Folder\E=mcÂ².xyz"
>>> print(s.encode('latin1').decode('utf8')) # Undo the wrong decoding, and apply the right one.
"C:\Some Folder\E=mc².xyz"
There's not enough information to tell you how to read MongoDB correctly, but this should help you along.
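Since the remaining task in the question's edit is repairing the strings already in the database, here is a minimal, hedged sketch of doing that latin1/utf8 round-trip in place with PyMongo; the database, collection, and field names are the ones from the example above and would need adjusting:

from pymongo import MongoClient

col = MongoClient().test_db.test_col   # names taken from the example above

for doc in list(col.find({'string': {'$exists': True}})):
    s = doc['string']
    try:
        fixed = s.encode('latin1').decode('utf8')   # undo the wrong decoding
    except (UnicodeEncodeError, UnicodeDecodeError):
        continue   # string was not mangled this way; leave it alone
    if fixed != s:
        col.update_one({'_id': doc['_id']}, {'$set': {'string': fixed}})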

encoding error in Mojolicious' template

I have non-ASCII characters in UTF-8 encoding (Chinese characters), but they're not printed correctly. I have to add decode('utf8', $str) (in the controller or the template file) to get the right output. How can I make the templates recognize UTF-8 strings?
Oddly, a literal stashed string produces the right output, and I don't know why.
The content is stored in MySQL with a utf-8 collation. I added $DB->do("SET NAMES 'UTF8'"); after the database connection is made, but it has no effect.
Try setting the DBI option mysql_enable_utf8 to 1 in the connect attributes, e.g. DBI->connect($dsn, $user, $password, { mysql_enable_utf8 => 1 }). With that flag on, DBD::mysql marks the data coming back from MySQL as UTF-8, so the strings arrive in Perl already decoded and the templates render them correctly.

Blob data replace '+' with space

I have an iPhone app that converts an image into NSData and then into a base64-encoded string.
When this encoded string is submitted to the server, every '+' gets converted into a space as it is stored in the server's database, so the decoder no longer works properly.
I guess the issue is with the default encoding of the table in the database. Currently it's latin1; I tried changing it to UTF-8, but the problem still exists.
Is there another encoding that would help?
That actually has nothing to do with the database encoding. It is the format of POST and GET parameters that clashes with base64: in URL-encoded form data, '+' stands for a space, so any unescaped '+' in the base64 string is decoded back to a space on the server. In http://en.wikipedia.org/wiki/Base64#Variants_summary_table you can see alternatives that are designed to make base64 work with URLs and the like.
One of these variants is "Base64 with URL and Filename Safe Alphabet" (RFC 4648 'base64url' encoding), which replaces '+' with '-' and '/' with '_'.
Another alternative would be to replace the offending characters '+', '/' and '=' with their respective hex representations, %xx (i.e. percent-encoding) - but that makes the data unnecessarily longer.
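For illustration, a minimal sketch of the base64url variant in Python (the payload bytes are a placeholder for the image data; note that urlsafe_b64encode still emits '=' padding, which must itself be percent-encoded as %3D if it travels inside a URL):

import base64

data = b'raw image bytes here'             # placeholder for the NSData payload

standard = base64.b64encode(data)          # alphabet includes '+' and '/'
urlsafe = base64.urlsafe_b64encode(data)   # same data, with '-' and '_' instead

# the server decodes with the matching function
assert base64.urlsafe_b64decode(urlsafe) == data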

How do I replace literal \xNN escapes with their characters in Perl?

I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is, when I open the text file for viewing I see a lot of hex escapes like \x92 and \x93, which stand for single and double quotes, I guess.
I am using the DBI->quote function to escape the special characters before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both tables is latin1.
How do I get rid of those hex escapes and get the real characters to show up in the text file?
ISO Latin-1 does not define printable characters in the range 0x80 to 0x9F (they are C1 control codes), so displaying those bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin-1 except that it assigns printable characters (including the left/right curly quotes) to this range.
\x92 and \x93 are non-printing characters in the latin1 character set. If you are certain that you are indeed dealing with latin1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)
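Following the first answer's diagnosis, here is a minimal sketch in Python of repairing such an export (the filenames are placeholders, and it assumes the bytes really are Windows-1252):

# read the raw bytes, decode them as Windows-1252, re-save as UTF-8
with open('export.txt', 'rb') as f:
    text = f.read().decode('windows-1252')   # \x92 -> ’, \x93 -> “

with open('export_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)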