My Word document has strange characters. Nothing is readable. I tried converting it to English and it was exactly the same. I have no clue how this happened and don't know what to do. It is an important document to me. Please help. Thank you.
I've been working on a small project and came across some information that has some sort of encoding (I assume).
7C-FC-1B-C9-97-1B-A9-EB-2E-45-2A-73-CE-E3-17-F9
01-3E-6A-50-09-ED-1C-A1-80-A0-27-B9-0C-D3-C4-9D
89-4C-B3-52-4A-B8-93-CB-95-4F-E2-9A-0C-59-7C-FD
Does anyone know what sort of encoding this is? I looked into UTF-8 since this came from a SQL file. No luck there.
I think that is written in hexadecimal, not encoded: each two-character group is one byte.
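For illustration, a minimal Python sketch (the variable names are mine) that turns one of those dash-separated strings back into raw bytes; note that the result is 16 arbitrary bytes, not readable text in any character set:

line = "7C-FC-1B-C9-97-1B-A9-EB-2E-45-2A-73-CE-E3-17-F9"
raw = bytes.fromhex(line.replace("-", ""))   # strip the dashes, then decode the hex pairs
print(len(raw))                              # 16 -- each line is a 16-byte value, possibly a hash or cipher block
print(raw.decode("utf-8", errors="replace")) # mostly replacement characters: this is not text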
Accidentally created a collection with "�" in its name. Now I'm looking for a way to delete it.
P.S. I tried db['�'].drop() and that did not work out for me.
Figured it out myself. db['\ufffd'].drop() worked for me. Converting "�" to its Unicode escape solved the problem. Hope someone finds it useful. This should work for other special characters as well.
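If you hit the same thing from a driver rather than the mongo shell, here is a minimal pymongo sketch of the same fix (the connection and the database name "mydb" are placeholders, not from the original post):

from pymongo import MongoClient

client = MongoClient()   # assumes a local mongod on the default port
db = client["mydb"]      # placeholder database name
db["\ufffd"].drop()      # U+FFFD is the replacement character rendered as "�"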
We uploaded a file with bad encoding, and now when downloading it again all the "strange" French characters are mixed up.
Example of the bad text:
R�union
Now when opening the CSV with OpenOffice, we tried all of the encodings in the dropdown; none of them seems to work.
Does anyone have a way to fix the encoding to the correct one so that we can view the characters?
Link to the file: https://drive.google.com/file/d/0BwgeuQK3LAFRWkJuNHd2TlF2WjQ/view?usp=sharing
Kind regards.
Sadly, there is no way to automatically fix the linked file. Consider the two words afectación and sécurité. In the file they have been converted incorrectly to afectaci?n and s?curit?. There is no way to convert the question marks back, because sometimes the missing letter is ó and other times é.
(Actually, instead of question marks the file uses the Unicode replacement character, but that doesn't change the problem.)
Hopefully you have an earlier version of the file that has not been converted incorrectly.
Next time try to use a consistent encoding. This question gives some suggestions for how to do this.
If the original data cannot be obtained, there is one thing that could be done short of retyping the whole thing: use dictionary lookups to guess the missing words, as sketched below. However, this would be a difficult project, and there would be mistakes where incorrect guesses were made. It's probably not worth it.
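As a rough illustration of that idea, a minimal Python sketch (the tiny word list and the candidate letters are my own placeholders; a real attempt would need a full French dictionary):

FRENCH_WORDS = {"réunion", "sécurité", "été"}   # placeholder dictionary
CANDIDATES = "àâäçéèêëîïôöùûüÿ"                 # letters that may have been lost

def guess(word):
    # Yield dictionary words obtained by substituting a candidate letter
    # for the U+FFFD replacement character. Words that lost two different
    # letters would need per-position substitution; this keeps it simple.
    for ch in CANDIDATES:
        fixed = word.replace("\ufffd", ch)
        if fixed.lower() in FRENCH_WORDS:
            yield fixed

print(list(guess("R\ufffdunion")))   # ['Réunion'] -- but many words stay ambiguous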
Does anyone have sample code for parsing the CEDICT file? CEDICT is a Chinese-English dictionary. For instance, currently, if I open it in a text editor, a line in the CEDICT file looks like:
ä¸ ä¸ [bu4] /(negative prefix)/not/no/
I would like to see it as:
不 不 [bu4] /(negative prefix)/not/no/
I found TextWrangler to do this for me as a text editor. What I now need is sample code that achieves the same.
The thing is, it's just an encoding problem. If the line looks like
ä¸ ä¸ [bu4] /(negative prefix)/not/no/
it's because the text editor doesn't know/realize that the text is encoded as UTF-8. TextWrangler, or its big brother BBEdit, is very good at guessing encodings, and can even be asked to display text in a specific encoding.
Since we don't know what you ultimately want to achieve, it's hard to tell you exactly what has to be done. What I can say is that your app (which language are you using, anyway?) needs to be Unicode-aware (and be able to read/manipulate UTF-8 strings).
I wrote a couple of apps based on the CEDICT, one for Mac OS X, one for Android. Parsing and indexing the CEDICT is not very hard.
UPDATE
Regarding the parsing of the CEDICT itself, it's nothing complicated. I don't do Objective-C, never have, never will, but the process would be the same in any language:
Read a line. Say your own example: 不 不 [bu4] /(negative prefix)/not/no/
You have four fields: Trad. Ch., Simp. Ch., Reading, Meaning(s).
These fields are space separated. Of course the 4th field may contain spaces, so be careful.
Store the four fields in a database (I used an SQLite db).
You might want to remove the slashes from the definition field, or replace them with something else.
Loop. (See the sketch after this list.)
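For illustration, a minimal Python sketch of those steps (the file name, database name, and table layout are my own assumptions, not prescribed by CEDICT):

import re
import sqlite3

# A CEDICT line looks like: traditional simplified [pinyin] /meaning/meaning/
LINE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

conn = sqlite3.connect("cedict.db")   # placeholder output database
conn.execute("CREATE TABLE IF NOT EXISTS entries "
             "(trad TEXT, simp TEXT, reading TEXT, meanings TEXT)")

with open("cedict_ts.u8", encoding="utf-8") as f:   # placeholder file name; the file is UTF-8
    for raw in f:
        if raw.startswith("#"):       # skip the comment header
            continue
        m = LINE.match(raw.strip())
        if not m:
            continue
        trad, simp, reading, meanings = m.groups()
        # Replace the slash separators, as suggested above.
        conn.execute("INSERT INTO entries VALUES (?, ?, ?, ?)",
                     (trad, simp, reading, meanings.replace("/", "; ")))

conn.commit()
conn.close()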
You have now converted the CEDICT to a database. That's the easy part. As for tokenizing Chinese, good luck with that, mate. Better minds than mine are still banging their heads on this one.
I've read all the posts about dashes and tried pretty much everything mentioned in them, yet cannot figure out a strange problem I'm having.
For example, I have an author name like this:
Arturo Pérez-Reverte
A search for 'pérez-reverte' will not turn up anything, nor will 'pérez\-reverte', so escaping the dash is not the issue.
But a search for 'spider-man' will return hits, proving that the dash seems to be working.
However, a search for 'perez reverte' also finds a hit because it searches each word separately and finds the 'reverte' in 'perez-reverte' (but doesn't seem to find the 'perez').
A search for either 'pérez' or 'perez' finds the same number of documents, suggesting that the accent is not an issue (I do have a charset_table which accounts for accented characters).
So I'm very confused as to what's happening here. If it isn't the accent and it isn't the dash, what could it be?
I don't have any ignore_chars set, I'm using UTF-8 and have a charset_table to treat accented characters as regular characters.
The only difference between these two terms is that one of them is a title (spider-man) and the other an author, but they are both part of the same Sphinx index declaration, so I don't see that as an issue in any way.
Any help would be greatly appreciated.
After much fighting with it, I found out that even though my database is all UTF-8 with the proper collation, I needed to add this in sphinx.conf for everything to work properly:
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER SET utf8
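For context, a minimal sketch of where those lines sit in sphinx.conf (the source name, credentials, and query below are placeholders, not from my actual config):

source books_src
{
    # placeholder connection details
    type          = mysql
    sql_host      = localhost
    sql_user      = sphinx
    sql_pass      = secret
    sql_db        = mydb

    # the two pre-queries from above
    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET CHARACTER SET utf8

    # placeholder main query
    sql_query     = SELECT id, title, author FROM books
}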
After doing that, and having the proper charset_table, everything seems to be working fine.
Hope this helps someone else.