I am consuming a POST request that refers to an S3 key which may contain non-English characters, such as:
{'key': 'náme'}
When the POST body is loaded using json.loads it becomes:
{u'key': u'n\xe1me'}
Unfortunately, boto does not retrieve the object when given this Unicode key (I have confirmed that the object náme does exist in the desired bucket).
Is there a way to get the desired object via boto, or do I need to look at other options, such as enforcing a more boto-friendly naming policy?
UPDATE:
So, from what I can gather, the json.loads() output is Latin-1 encoded, and simply typing náme in the terminal returns the UTF-8 encoded version (n\xc3\xa1me), but I am not familiar with the encoding that boto uses when listing keys (u'na\u0301me'). I'm hoping that identifying that encoding will make it easy to convert the Latin-1 form to this unknown form, so I can begin accessing the keys when given the Latin-1 version from the posted data.
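For reference, the two reprs above are not two different byte encodings but two Unicode normalization forms of the same name: u'n\xe1me' is the composed (NFC) form and u'na\u0301me' is the decomposed (NFD) form. If the key really was stored in decomposed form, normalizing before the lookup may be enough, but that is an assumption to verify. A minimal sketch (Python 2 session, to match the reprs above):
% python
>>> import unicodedata
>>> key = u'n\xe1me'                       # the NFC form json.loads produced
>>> unicodedata.normalize('NFD', key)      # decompose to the NFD form the key listing shows
u'na\u0301me'
>>> unicodedata.normalize('NFC', u'na\u0301me')
u'n\xe1me'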
I'm using DB-IP.com to get city names from IP addresses. Many of these are international cities, with special characters in the names.
As an example, one of these cities is Wężarów in Poland. Checking the JSON return in the console or opening the request URL directly, it's being returned from DB-IP as "W\u0119\u017car\u00f3w" with a Content-Type of text/javascript;charset=UTF-8. This is rendered in the browser as WÄ™Å¼arÃ³w - it is also saved in my mysql database as WÄ™Å¼arÃ³w (which I've tried with both utf8 and latin1 encoding).
I'm ok with saving it in the DB as another format, as long as I can convert it back to Wężarów for display in browser. I've tried encoding and decoding to/from several formats, even just to display directly on the screen (ignoring the DB entirely). I'm completely confused on what I need to do here to get it in readable format.
I'm working with Perl; however, if I can figure out what I need to do with the encoding/decoding/charset (as I'm currently clueless), I'm sure I can figure out the rest from there.
It looks like the UTF-8 encoded string was interpreted by the browser as if it were Windows-1252. Here's how I deduced it:
% python3
>>> s = "W\u0119\u017car\u00f3w"
>>> b = bytes(s, encoding='utf-8')
>>> b
b'W\xc4\x99\xc5\xbcar\xc3\xb3w'
>>> str(b, encoding='utf-8')
'Wężarów'
>>> str(b, encoding='latin-1')
'WÄ\x99Å¼arÃ³w'
>>> str(b, encoding='windows-1252')
'WÄ™Å¼arÃ³w'
If you're not good with Python, what I'm doing here is encoding the string "W\u0119\u017car\u00f3w" into UTF-8, yielding the byte sequence 'W\xc4\x99\xc5\xbcar\xc3\xb3w'. Decoding that with UTF-8 yielded 'Wężarów', confirming that this is the correct UTF-8 encoding of the string you want. So I took a guess that the browser is using the wrong encoding to render it, and decoded it using Latin-1. That gave me something very close, so I looked up Latin-1 and noticed that it's named as the basis for Windows-1252. Decoding again as Windows-1252 gives the result you saw.
What's gone wrong here is that the browser can't tell what encoding to use to render the page, and it's guessing wrong. You need to fix this by telling it explicitly to use UTF-8. Here's a page by the W3C that describes how to do that. Essentially, you need to add an HTML <meta> element (e.g. <meta charset="utf-8">) to the document head. If you also set an HTTP header with the encoding name in it, make sure the two are consistent.
(In Firefox, while you're debugging, you can go to View -> Character Encoding to set the encoding on a page-by-page basis. I assume other browsers have the same feature.)
Is it possible to store unicode (utf-8) characters in ColdFusion's CLIENT scope? (CF9)
If I set a CLIENT scope variable and dump it immediately, it looks fine. But on the next page load (i.e. when the CLIENT scope is read back from storage), I just see question marks for the Unicode characters.
I'm using a database for persistence and the data column in the CDATA table has been set to ntext.
Looking directly in the database I can see that the records have not been written correctly (again, just question marks showing for unicode characters).
(From the comments)
Have you checked/enabled the "String format -- Enable High ASCII characters and Unicode ..." option in your client datasource?
From the docs:
Enable this option if your application uses Unicode data in
DBMS-specific Unicode data types, such as National Character or nchar.
I am using Spreadsheet::Read to get data from Excel (xls or xlsx) files and put them in a MySQL database using DBI.
If I print out the data to the console, it displays all special characters properly, but when I insert it into the database, some files end up with corrupted characters. For example, "Möbelwerkstätte" becomes "MÃ¶belwerkstÃ¤tte".
I think that Spreadsheet::Read "knows" which character set is coming out of the file, as it prints properly to the console each time, regardless of the file encoding. How do I make sure that it is going into the database in UTF-8?
The answer you have already received (and accepted) will probably work most of the time, but it's a little fragile, and it probably only works because Perl's internal character representation is a lot like UTF-8.
For a more robust solution, you should read the Perl Unicode Tutorial and follow the recommendations in there. They boil down to:
Decode any data that you get from outside your program
Encode any data that you send out of your program
In your case, you'll want to decode the data that you read from the spreadsheet and encode the data that you are sending to the database.
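As a rough illustration of that boundary rule (shown in Python only because the other worked examples on this page are Python; the Perl equivalents are Encode::decode / Encode::encode or UTF-8 I/O layers, and the sample value is the one from the question):
# Decode on the way in, encode on the way out.
raw = b"M\xc3\xb6belwerkst\xc3\xa4tte"   # bytes arriving from outside (UTF-8 in this case)
text = raw.decode("utf-8")               # decode immediately on input -> 'Möbelwerkstätte'
out = text.encode("utf-8")               # encode just before handing it to the DB driver
print(text, out)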
Both DBI and DBD::mysql default to Latin-1 (they are compiled with Latin-1).
By sending "SET NAMES utf8" as your first query, you will change it for that session.
From the manual:
SET NAMES indicates what character set the client will use to send SQL statements to the server. Thus, SET NAMES 'cp1251' tells the server, “future incoming messages from this client are in character set cp1251.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)
See http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html for full documentation.
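As a sketch of that "first query" approach (shown here in Python with the pymysql driver purely for illustration; the table, column, and credentials are placeholders):
import pymysql

# Make "SET NAMES utf8" the first statement of the session so both
# directions of the connection are treated as UTF-8.
conn = pymysql.connect(host="localhost", user="me", password="secret", database="test")
with conn.cursor() as cur:
    cur.execute("SET NAMES utf8")
    cur.execute("INSERT INTO firms (name) VALUES (%s)", ("Möbelwerkstätte",))
conn.commit()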
I am analyzing this Metasploit module, and I am wondering what encoding method payload.encoded retrieves by default in Metasploit.
I added a print payload.encoded in that exploit (without setting any encoder), and I get a normal string like:
PYIIIIIIIIIIQZVTX30VX4AP0A3HH0A00ABAABTAAQ2AB2BB0BBXP........
The module has an encoder option, but it's commented out.
I am used to seeing payloads encoded with the standard hex values, like:
\xd9\xf7\xbd\x0f\xee\xaa\x47.......
Could someone help me understand where that string returned by payload.encoded comes from and what encoding it uses?
It turns out the first one is an alpha_upper-encoded payload; the second is just binary data written out as hex escapes.
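As a side note on why the two look so different (plain Python for illustration, not Metasploit code): alphanumeric-encoded shellcode is already printable ASCII, while ordinary shellcode only becomes readable once its raw bytes are written out as \x escapes.
alpha = b"PYIIIIIIIIIIQZVTX30VX4AP0A3HH0A00ABAABTAAQ2AB2BB0BBXP"
print(alpha.decode("ascii"))                       # already printable, so it prints as-is

raw = bytes([0xd9, 0xf7, 0xbd, 0x0f, 0xee, 0xaa, 0x47])   # the first bytes from the question
print("".join("\\x%02x" % b for b in raw))         # -> \xd9\xf7\xbd\x0f\xee\xaa\x47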
I am creating a ColdFusion page that takes language translation data stored in a table in my database and makes static JS files for each language pairing of English to ___, etc.
I am now starting to work on Russian; I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as utf-8 or ISO-8859-1 but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. ColdFusion is quite agnostic in that regard, as it will happily use any JDBC driver that you may need.
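As a concrete illustration of the connection piece only (this assumes MySQL's Connector/J JDBC driver; other drivers use different property names, so check your driver's documentation): the connection is typically told to use UTF-8 by appending parameters like these to the JDBC URL or the datasource's connection string:
useUnicode=true&characterEncoding=UTF-8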
When working in a multi-character-set environment, character set conversion issues can occur, and it can be difficult to determine where the conversion went wrong.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
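To see that substitution happening in miniature (a quick Python illustration, since the other worked examples on this page are Python): a Cyrillic string survives UTF-8 but is reduced to substitution characters by a single-byte Western charset.
% python3
>>> 'календарь'.encode('utf-8')
b'\xd0\xba\xd0\xb0\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb4\xd0\xb0\xd1\x80\xd1\x8c'
>>> 'календарь'.encode('latin-1', errors='replace')
b'?????????'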
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself saved as UTF-8? (With or without a BOM; it doesn't matter for Russian.) In any case, UTF-8 is absolutely what you should be using. Make sure you get a UTF-8-capable editor, which is most things on a Mac. On Windows you could use SciTE or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So the best approach is to save both the CF template and the .js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru = {
    launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
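If you go the escaped-literals route, you don't have to write the \uXXXX escapes by hand. A small sketch of generating them (Python here purely as pseudocode for the idea; the actual generator in the question is a ColdFusion page):
def js_escape(s):
    # Replace every non-ASCII character with a \uXXXX JavaScript escape.
    # Fine for Cyrillic; characters outside the BMP would need surrogate pairs.
    return "".join(ch if ord(ch) < 128 else "\\u%04x" % ord(ch) for ch in s)

print(js_escape("Запуск Мой календарь"))
# -> \u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c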
when it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì" so the charset is wrong
Looks like you've saved the file in a single-byte Cyrillic encoding (the garbage above matches ISO-8859-5; cp1251, the default codepage on a Russian machine, would break in a similar way) and then copied it to a Western server where the default codepage is cp1252.
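A quick way to double-check that kind of diagnosis is to round-trip the suspected encodings; here is the string from the JavaScript example above, mis-decoded on purpose:
% python3
>>> s = 'Запуск Мой календарь'
>>> s.encode('iso-8859-5').decode('cp1252')
'·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì'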
I also just found out that my text editor of choice, TextPad, doesn't support Unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.