I am pretty sure that this is a very basic question, but after hours of searching and many attempts to fix this myself I still haven't made progress.
Umlauts in my json file are saved like this. I found lots of ways to go from ö -> \xf6 but how can I go the other way round and end up with a utf-8 encoded file?
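As per your comment, I assume you're using Python. When using json.load, pass it the utf-8 encoding parameter (on Python 3, where json.load no longer takes that parameter, specify encoding='utf-8' when opening the file instead).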
Look at the python documentation.
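A minimal sketch of the round trip in Python 3, assuming the file contains standard JSON escapes such as \u00f6. The sample string and the function name `unescape_json_text` are made up for illustration. The JSON parser decodes the escapes itself, and `ensure_ascii=False` keeps the umlauts as real characters when serializing again:

```python
import json


def unescape_json_text(escaped_json: str) -> str:
    """Parse JSON text and re-serialize it with non-ASCII characters kept as-is."""
    data = json.loads(escaped_json)
    # ensure_ascii=False stops json.dumps from writing \uXXXX escapes.
    return json.dumps(data, ensure_ascii=False)


# Hypothetical sample; the real file content is not shown in the question.
escaped = '{"city": "K\\u00f6ln"}'
readable = unescape_json_text(escaped)
print(readable)
```

To end up with a UTF-8 encoded file, the result can then be written with `open(path, "w", encoding="utf-8")`.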
Related
I've been working on a small project and came across some information that has some sort of encoding (I assume).
7C-FC-1B-C9-97-1B-A9-EB-2E-45-2A-73-CE-E3-17-F9
01-3E-6A-50-09-ED-1C-A1-80-A0-27-B9-0C-D3-C4-9D
89-4C-B3-52-4A-B8-93-CB-95-4F-E2-9A-0C-59-7C-FD
Does anyone know what sort of encoding this is? I looked into UTF-8 since this came from a SQL file. No luck there.
I think that is raw bytes written in hexadecimal notation, not a text encoding.
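To check this, one of the rows can be turned back into raw bytes. The sketch below is only an illustration (the helper names `parse_hex_dump` and `is_valid_utf8` are made up). It shows that each row is 16 bytes long and that the bytes are not valid UTF-8, which matches the "no luck" with UTF-8. A 16-byte value has the same size as a 128-bit hash or identifier, but that is only a guess:

```python
def parse_hex_dump(text: str) -> bytes:
    """Convert a dash-separated hex string such as '7C-FC-1B' to bytes."""
    return bytes.fromhex(text.replace("-", ""))


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "7C-FC-1B-C9-97-1B-A9-EB-2E-45-2A-73-CE-E3-17-F9"
raw = parse_hex_dump(sample)
print(len(raw), is_valid_utf8(raw))
```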
I have a problem in my database where some of the Cyrillic text is shown like this: "болно Ð±Ð°Ñ Ð°Ð¼ÑŒÐ´Ñ€ÑƒÑƒÐ»Ð¶ ч Ð". Is there a way to convert this back to a human-readable format?
I need to read the actual content of this.
This is the best I could do from your data. It looks Cyrillic, but Google Translate didn't make anything of it. It seems the text was decoded with the default US Windows codec (cp1252) when it was really UTF-8, but the data is not quite right. I'm using Python to attempt to fix it:
>>> s.encode('cp1252').decode('utf8',errors='replace')
'болно ба� амьдруулж ч �'
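The same repair can be sketched as a self-contained round trip. The word used here is taken from the repaired output above, and the helper name `fix_mojibake` is made up for illustration. The sketch first garbles a known string the way the database apparently did, then reverses it:

```python
def fix_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Undo text that was UTF-8 but got decoded with the wrong codec."""
    # Re-encode with the wrong codec to recover the original bytes,
    # then decode those bytes as what they really were.
    return text.encode(wrong).decode(right, errors="replace")


original = "амьдруулж"
# Simulate the damage: UTF-8 bytes interpreted as cp1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)
print(fix_mojibake(garbled))
```

One possible reason for the replacement characters in the output above: some UTF-8 bytes, such as 0x81 (the second byte of the letter "с"), have no character assigned in cp1252, so they may have been dropped before the text was stored. That is an assumption, not something the data confirms.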
We uploaded a file with a bad encoding. Now, when downloading it again, all the "strange" French characters are mixed up.
Example of the bad text:
R�union
When opening the CSV with OpenOffice, we tried all of the encodings in the dropdown, but none of them seems to work.
Does anyone have a way to fix the encoding so that we can view the characters correctly?
Link to the file: https://drive.google.com/file/d/0BwgeuQK3LAFRWkJuNHd2TlF2WjQ/view?usp=sharing
Kr.
Sadly there is no way to automatically fix the linked file. Consider the two words afectación and sécurité. In the file they have been converted incorrectly to afectaci?n and s?curit?. There is no way to convert the question marks back because sometimes they're ó and other times é.
(Actually, instead of question marks the file uses the Unicode replacement character, but that doesn't change the problem.)
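A short sketch of why the damage cannot be undone. It assumes, purely for illustration, that the original text was Latin-1 and was read as UTF-8 somewhere along the way; the actual cause for this file is not known. The helper name `misread_as_utf8` is made up:

```python
def misread_as_utf8(text: str, real_encoding: str = "latin-1") -> str:
    """Simulate reading bytes in real_encoding as if they were UTF-8."""
    # Invalid byte sequences are replaced with U+FFFD, the replacement character.
    return text.encode(real_encoding).decode("utf-8", errors="replace")


for word in ("afectación", "sécurité", "Réunion"):
    print(word, "->", misread_as_utf8(word))
```

Both ó and é end up as the same U+FFFD character, so nothing in the damaged text says which letter was there originally.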
Hopefully you have an earlier version of the file that has not been converted incorrectly.
Next time try to use a consistent encoding. This question gives some suggestions for how to do this.
If the original data cannot be obtained, there is one thing that could be done outside of retyping the whole thing. It is possible to use dictionary lookups to guess the missing words. However this would be a difficult project, and there would be mistakes where incorrect guesses were made. It's probably not worth it.
Does anyone have a sample code for parsing the CEDICT file? CEDICT is a Chinese-English Dictionary. For instance, currently, if I open it in a text editor, a line in the CEDICT file looks like:
不 不 [bu4] /(negative prefix)/not/no/
I would like to see it as:
不 不 [bu4] /(negative prefix)/not/no/
I found that TextWrangler displays it correctly as a text editor. What I now need is sample code that achieves the same.
The thing is, it's just an encoding problem. If the line looks like
不 不 [bu4] /(negative prefix)/not/no/
It's because the text editor doesn't know/realize that the text is encoded as UTF-8. TextWrangler, or its big brother BBEdit, are very good at guessing encodings, and can even be asked to display text in a specific encoding.
Since we don't know what you ultimately want to achieve, it's hard to tell you exactly what has to be done. What I can say is that your app (which language are you using anyway?) needs to be Unicode-aware (and be able to read/manipulate UTF-8 strings).
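As an illustration of the difference, the sketch below decodes the same three bytes twice. The choice of Latin-1 as the "wrong" encoding is an assumption (the editor in question may have used a different one), and the function name `read_cedict_lines` and its path argument are hypothetical:

```python
# The UTF-8 bytes of the character from the example line.
raw = "不".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # decoded correctly
as_latin1 = raw.decode("latin-1")  # decoded with a wrong, single-byte encoding

print(repr(as_utf8), repr(as_latin1))


def read_cedict_lines(path):
    """Read a file while stating its encoding explicitly instead of guessing."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

The wrong decoding turns one Chinese character into three unrelated Latin characters, which is the kind of garbage a non-Unicode-aware editor shows.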
I wrote a couple of apps based on the CEDICT, one for Mac OS X, one for Android. Parsing and indexing the CEDICT is not very hard.
UPDATE
Regarding the parsing itself of the CEDICT, it's nothing complicated. I don't do Objective-C, never have, never will, but the process would be the same in any language:
Read a line. Say your own example: 不 不 [bu4] /(negative prefix)/not/no/
You have four fields: Trad. Ch., Simp. Ch., Reading, Meaning(s).
These fields are space separated. Of course, the reading (inside the square brackets) and the 4th field may themselves contain spaces, so be careful.
Store the 4 fields in a database (I used an sqlite db).
You might want to remove the slashes from the definition field, replace them with something else.
Loop
You have now converted the CEDICT to a database. That's the easy part. As for tokenizing Chinese, good luck with that, mate. Better minds than mine are still banging their heads on this one.
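The parsing steps above can be sketched in Python as follows. This is only one possible implementation, not the code of the apps mentioned: the regular expression, the table layout, the function names and the assumption that comment lines start with "#" are all choices made for this example:

```python
import re
import sqlite3

# One entry looks like: TRAD SIMP [reading] /meaning 1/meaning 2/
ENTRY_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_entry(line):
    """Return (trad, simp, reading, meanings) or None for comments and bad lines."""
    if line.startswith("#"):
        return None
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    trad, simp, reading, meanings = match.groups()
    return trad, simp, reading, meanings.split("/")


def load_entries(lines, db):
    """Parse every line and store the four fields in an sqlite table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(trad TEXT, simp TEXT, reading TEXT, meanings TEXT)"
    )
    for line in lines:
        entry = parse_entry(line)
        if entry is not None:
            trad, simp, reading, meanings = entry
            # The slashes are replaced by '; ' in the stored definition.
            db.execute(
                "INSERT INTO entries VALUES (?, ?, ?, ?)",
                (trad, simp, reading, "; ".join(meanings)),
            )
    db.commit()


db = sqlite3.connect(":memory:")
load_entries(["# a comment line", "不 不 [bu4] /(negative prefix)/not/no/"], db)
rows = db.execute("SELECT * FROM entries").fetchall()
print(rows)
```

Matching the bracketed reading with a pattern, rather than splitting on every space, avoids breaking entries whose reading or meanings contain spaces.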
I support a website written in Tcl which displays data in Traditional Chinese (big5). We then have a Java servlet, using the translation code from mandarintools.com, to translate a page request into Simplified Chinese. The conversion as specified to the translation code is from UTF-8 to UTF-8S; Java is apparently correctly translating the data to UTF-8 as it comes in.
The Java translation code works but is slow, and since the website is written in Tcl, someone on another list suggested I try using that. Unfortunately, Tcl doesn't support UTF-8S and I have been unable to figure out what translation to use in its place. I've tried gb2312, gb2312-raw, gb1988, euc-cn... all result in gibberish. My assumption is that Tcl is also translating to UTF-8 as it comes in, though I have tried converting from big5 first and it doesn't help.
My test code looks like this:
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 $page_body]
ns_write $translated_page_body
I have also tried
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 [encoding convertfrom big5 $page_body]]
ns_write $translated_page_body
But it didn't change anything.
Does anyone out there have enough experience with this to help me figure it out?
FYI for completeness' sake, I've been told by Tcl experts that you can't do the conversion this way, it has to be done via character replacement.
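To illustrate what "character replacement" means here, the sketch below uses Python rather than Tcl. Changing the encoding only changes how the same characters are stored as bytes; turning Traditional into Simplified Chinese needs a mapping from one character to another. The three-entry table is a tiny made-up sample, nowhere near a complete mapping, and the function name `to_simplified` is hypothetical:

```python
# Tiny illustrative table; a real conversion needs thousands of entries.
TRAD_TO_SIMP = str.maketrans({"國": "国", "語": "语", "體": "体"})


def to_simplified(text: str) -> str:
    """Replace each Traditional character that has an entry in the table."""
    return text.translate(TRAD_TO_SIMP)


traditional = "國語"
# Re-encoding does not change which characters the text contains.
round_tripped = traditional.encode("big5").decode("big5")
print(round_tripped == traditional)
print(to_simplified(traditional))
```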
By any chance, are you grabbing your data from Oracle?
If so, see if you can use the CONVERT function to convert from "utf8" to "al32utf8", which is the true UTF-8 standard and which Tcl should work with seamlessly.
If not, well, I guess I'll wait for your comment(s).