I'm writing an Emacs extension and want to fetch some data from the Internet. Using url-retrieve-synchronously and some simple text processing I can get a string like
"\273\313\271\311\267\335 abcd"
The first several characters are encoded in GBK. How can I decode them? Many thanks.
See decode-coding-string, for example (decode-coding-string str 'gbk), where str is the string you retrieved.
I would like to save and read a string using the Windows-1250 code page, but I don't know how to do it.
The correct way is to write an Encoding which is a Converter.
You need to map the incoming Unicode characters to their corresponding Windows-1250 equivalents (and probably throw if the input contains characters that are outside its range). You can take the Iso-Latin-1 encoder as a starting point: latin1.dart
I am pretty sure that this is a very basic question, but after hours of searching and many attempts to fix this myself I still haven't made progress.
Umlauts in my json file are saved like this. I found lots of ways to go from ö -> \xf6 but how can I go the other way round and end up with a utf-8 encoded file?
As per your comment, I'd assume you're using Python. When using json.load, pass it the utf-8 encoding parameter.
Look at the Python documentation.
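A minimal sketch of the round trip in Python 3, under the assumption that the file really contains the single byte 0xF6 for ö (that is, it is Latin-1 encoded). In current Python 3 the encoding is given to open() rather than to json.load. The function name reencode_json, the file names and the sample data are made up for illustration.

```python
import json
import os
import tempfile


def reencode_json(src_path, dst_path, src_encoding="latin-1"):
    """Read JSON stored in src_encoding and write it back out as UTF-8."""
    # Decoding happens in open(); json.load only ever sees text.
    with open(src_path, encoding=src_encoding) as src:
        data = json.load(src)
    # ensure_ascii=False keeps "ö" as a real character instead of "\u00f6".
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, ensure_ascii=False)
    return data


# Hypothetical sample file: "Köln" stored as Latin-1 (ö is the byte 0xF6).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "latin1.json")
dst_path = os.path.join(workdir, "utf8.json")
with open(src_path, "wb") as f:
    f.write(b'{"name": "K\xf6ln"}')

data = reencode_json(src_path, dst_path)

with open(dst_path, "rb") as f:
    utf8_bytes = f.read()  # ö is now the two bytes 0xC3 0xB6
```

If the file instead contains the literal four characters \xf6, this sketch does not apply and the escapes would have to be unescaped first.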
I've got a unicode string (s) which I want to write into a file.
In Python 2 I could write:
open('filename', 'w').write(s.encode('utf-8'))
But this fails for Python 3. Apparently, s.encode() returns something of type 'bytes', which the write() function does not accept:
TypeError: must be str, not bytes
Does anyone know how to port the above code to Python 3?
Edit:
Thanks to all of you who proposed using binary mode! Unfortunately, this causes a problem with the \n characters. Is there any way to achieve the same result I had with Python 2 (namely to encode non-ANSI characters in UTF-8 while keeping the OS-specific rendition of \n)?
Thanks!
You do not want to muck around with manually encoding each and every piece of data like that! Simply pass the encoding as an argument to open, like this:
#!/usr/bin/env python3.2
slist = [
    "Ca\N{LATIN SMALL LETTER N WITH TILDE}on City",
    "na\N{LATIN SMALL LETTER I WITH DIAERESIS}vet\N{LATIN SMALL LETTER E WITH ACUTE}",
    "fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade",
    "\N{GREEK SMALL LETTER BETA}-globulin"
]
with open("/tmp/sample.utf8", mode="w", encoding="utf8") as f:
    for s in slist:
        print(s, file=f)
Now if you cat the file you made, you’ll see that it says:
$ cat /tmp/sample.utf8
Cañon City
naïveté
façade
β-globulin
And you can see that those are the right code points this way:
$ uniquote -x /tmp/sample.utf8
Ca\x{F1}on City
na\x{EF}vet\x{E9}
fa\x{E7}ade
\x{3B2}-globulin
See how much easier that is? Let the stream object handle any low-level encoding or decoding for you.
Summary: Don't call encode or decode yourself when all you are doing is processing a homogeneous stream that is entirely in one encoding. That's way too much bother for zero gain. Pass the encoding argument once and be done with it.
Open the file in binary mode, that's the least invasive way in terms of changes.
On the other hand, you could set the output file encoding with open() and avoid explicit string encoding altogether.
You might want to read the manual of the open() function.
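As a rough illustration of the second option, the sketch below writes a string in text mode with an explicit encoding and then reads the raw bytes back. It assumes the default newline handling of open(), under which "\n" is translated to the platform's line separator on write; the sample string and the temporary path are placeholders.

```python
import os
import tempfile

s = "naïveté\n"  # placeholder text with non-ASCII characters and a newline

path = os.path.join(tempfile.mkdtemp(), "filename")

# Text mode plus encoding: the file object encodes to UTF-8 and also
# translates "\n" to os.linesep, as text mode did in Python 2.
with open(path, "w", encoding="utf-8") as f:
    f.write(s)

# Read the raw bytes back to inspect what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

expected = s.replace("\n", os.linesep).encode("utf-8")
print(raw == expected)
```

This keeps the OS-specific line ending that binary mode would lose, which is the concern raised in the question's edit.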
Open the file in binary mode
open('filename', 'wb').write(s.encode('utf-8'))
I am writing a game for iOS that uses .tmx map files. I am creating the maps in the application 'Tiled' and then at some point before they get to iOS, I'm parsing them with Perl.
When I save the files as straight XML, it's a cinch for perl to parse them. However, cocos2d insists that the files be base64-encoded. The 'Tiled' map editor has no problem saving files with this encoding scheme, and iOS reads them just fine, but it's presenting problems for my perl code.
For some reason, the standard MIME::Base64 decode_base64() method in Perl is not cutting the mustard here: when I decode the strings, I get one or two binary characters (question marks in diamond boxes and such).
And the vague documentation for the TMX file format makes it unclear whether there is some other encoding going on before or after the base64 encoding which might be causing these problems. I looked at the cpp source for the encoder, and I saw lots of references to Latin1, but I couldn't decipher what's going on in detail.
I noticed that when I tried doing my own tests with MIME::Base64, encoding and then decoding a test string, the encoded text looks dramatically different from what I see coming out of the TMX files. For instance, my base64-encoded text for a short string looks like this:
aGVyZSBpcyBhIHNlbnRlbmNl
But the base64-encoded text coming from the TMX files looks like this:
9QAAAAABAAANAQAAGAEAAA==
Any suggestions on what else I might try in attempts to decode a string that looks like that?
I think this page might be what you're looking for. It suggests that first you decode_base64, then (if the compression="gzip" attribute is present) use gunzip to uncompress it, and finally use unpack('V*', $data) to extract the list of 4-byte little-endian integers.
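The same three steps can be sketched in Python for comparison (the question itself is about Perl, where unpack('V*', $data) does the last step). The function name decode_tmx_layer and its compression parameter are made up here; only the gzip case mentioned above is handled.

```python
import base64
import gzip
import struct


def decode_tmx_layer(encoded, compression=None):
    """Decode a base64 TMX layer payload into a list of integers."""
    raw = base64.b64decode(encoded.strip())
    if compression == "gzip":
        raw = gzip.decompress(raw)
    # Each value is a 4-byte little-endian unsigned integer,
    # the equivalent of Perl's unpack('V*', $data).
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[:count * 4]))


# The sample string from the question, assumed to be uncompressed.
gids = decode_tmx_layer("9QAAAAABAAANAQAAGAEAAA==")
print(gids)
```

For the sample string this prints [245, 256, 269, 280], which explains the "binary characters": the payload is packed integers, not text.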
I need to get a string from <STDIN>, written in latin and russian mixed encodings, and convert it to some url:
$search_url = "http://searchengine.com/search?text=" . uri_escape($query);
But this process goes bad and gives out Mojibake (a mixture of weird letters). What can I do with Perl to solve it?
Before you can get started, there are a few things you need to know.
You'll need to know the encoding of your input. "Latin" and "russian" aren't (character) encodings.
If you're dealing with multiple encodings, you'll need to know what is encoded using which encoding. "It's a mix" isn't good enough.
You'll need to know the encoding the site expects the query to use. This should be the same encoding as the page that contains the search form.
Then, it's just a matter of decoding the input using the correct encoding, and encoding the query using the correct encoding. That's the easy part. Encode provides functions decode and encode to do just that.
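The decode-then-encode pattern looks roughly like this in Python, shown only as an analogue of what Encode's decode and encode would do in Perl. The encodings (KOI8-R for the input, UTF-8 or Windows-1251 for the site) and the helper build_search_url are assumptions for illustration; the real values depend on the answers to the questions above.

```python
from urllib.parse import quote


def build_search_url(raw_query, input_encoding, site_encoding):
    """Decode raw input bytes, then percent-encode them for the target site."""
    # Step 1: bytes -> text, using the encoding the input actually has.
    text = raw_query.decode(input_encoding)
    # Step 2: text -> percent-escaped bytes in the encoding the site expects.
    escaped = quote(text, safe="", encoding=site_encoding)
    return "http://searchengine.com/search?text=" + escaped


# Hypothetical input: the word "привет" arriving as KOI8-R bytes.
raw = "привет".encode("koi8-r")

url_utf8 = build_search_url(raw, "koi8-r", "utf-8")
url_cp1251 = build_search_url(raw, "koi8-r", "cp1251")
print(url_utf8)
print(url_cp1251)
```

The two URLs differ only in the escaped bytes, which is why picking the encoding the site expects matters as much as decoding the input correctly.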