UTF-8 encoding inside database looks garbled ("encrypted") - encoding

I converted my database following this tutorial:
http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
but I didn't notice that the Arabic characters INSIDE the database come out garbled (as if encrypted), like
اوÙاµ ®ØµØ… „Ù‡ Øكلق§Ø‡Ø°Ù…ا؄مشٳÙÙ‹ ÙÙ„...
Through the PHP script that connects to the database everything is GOOD, but inside the database the Arabic characters look like that.
I tried to convert the database back to the old encoding, which is WINDOWS-1256, using iconv with the following command:
# iconv -f UTF-8 -t WINDOWS-1252 database.sql > database_1252.sql
I got this error:
iconv: illegal input sequence at position
so I tried to run the command again with the -c option:
# iconv -c -f UTF-8 -t WINDOWS-1252 database.sql > database_1252.sql
It worked and I can see the Arabic characters inside the database correctly, but a lot of characters are missing. For example:
i would like to go shopping
after the conversion becomes
i would like to
I want to know how I can fix the Arabic characters so they read normally inside the database, complete and not missing anything.
Thanks

Wait wait... you say your database was in WINDOWS-1256 (or WINDOWS-1252?) and you converted it based on a latin1 -> utf8 tutorial? No wonder the characters are malformed.
I wouldn't trust the tutorial's solution at all. I would recommend that you go back to your former version of the database and use MySQL's ALTER TABLE command to change the encoding, as sketched below.
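Something along these lines is what I mean (a minimal sketch; the table name is a placeholder, and it assumes the column's declared character set really matches the bytes stored in it):
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
Note that CONVERT TO re-encodes the stored text, whereas a plain ALTER TABLE my_table CHARACTER SET utf8 only changes the table default for columns added later and leaves existing data untouched.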

Related

How to fix syntax errors in postgresql .sql dump file when restoring with psql?

I have a postgresql .sql dump file created by pg_dump on another windows 10 box. I am trying to restore it on my windows 10 laptop with
"psql -U user -d database -1 -f filename.sql". I created the database, but when I run the command to do the restore I get an error from psql after I give it my password:
psql:filename.sql:1:1: ERROR: syntax error at or near "ÿ_"
LINE 1: ÿ_;
The file looks like straight ASCII (I only see two dashes on line one; I don't see a 'y' with an umlaut anywhere). I ran file on the .sql file with Cygwin bash, and it says:
Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators
I really don't want to recreate the database by hand. I am looking for any suggestions.
I tried psql with and without the '-1' option; no luck. I tried putting a ';' at the top of the sql file, which I found suggested somewhere; again no luck.
I did a psql -l on my postgresql installation and the encoding on all my databases (including the one to which I am trying to do the restore) shows UTF8.
There really is no code. It is just that I can't seem to restore this dump file because it errors out.
I think that captures my problem. The windows box that I got the dump from is not available to me now; so I'm just hoping there is a way to get around this problem. Recreating the database by hand table by table is something I would prefer to avoid.
Thanks--
Al
In my case, this exact thing happened because I was taking the dump using Windows PowerShell, due to which extra characters got included in the dump file.
Simply using the Command Prompt to take the dump solved my problem.
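For what it's worth, a sketch of the difference (user, database and file names are placeholders): in older Windows PowerShell versions the > redirection rewrites the stream as UTF-16 with a byte order mark, which matches the file output above, while letting pg_dump write the file itself avoids any shell re-encoding:
pg_dump -U user -f filename.sql database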
I can only give you leads on how to debug the problem, because the cause is not immediately obvious.
First, there should be a line close to the beginning of the dump file that sets client_encoding. The dump file should be in that encoding.
I can see two possibilities:
1. The file got mangled during transfer. To test for that, calculate a checksum for both files and compare. Always use binary mode to transfer PostgreSQL dumps.
2. Some editor or something else sneaked a BOM (byte order mark) into the file at the very beginning. That's my prime suspect, since the problem is at line 1. Use a hex editor or od (in Cygwin) to verify that, as sketched below. If this is the problem, simply replace the BOM with spaces.
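A quick way to check (assuming the dump file is named filename.sql; od and head are available in Cygwin or any Unix-like shell):
od -c filename.sql | head -n 1
If the output starts with \377 \376 (bytes 0xFF 0xFE), the file begins with a UTF-16LE byte order mark, which matches both the "Little-endian UTF-16" reported by file and the stray "ÿ" in the psql error.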

Does pg_dump preserve all Unicode characters when .sql file is ANSI?

I use
pg_dump.exe -U postgres -f "file-name.sql" database-name
to back up UTF-8 encoded databases on PostgreSQL 8.4 and 9.5 on a Windows host. Some may have foreign characters such as Chinese, Thai, etc. stored in character columns.
The resulting .sql file shows ANSI encoding when opened in Notepad++ (I'm NOT applying ANSI to opened files by default). How do I know whether Unicode characters are always preserved in the dump file? Should I be using an archive (object) backup file instead?
Quote from the manual
By default, the dump is created in the database encoding.
There is no difference between a text file in ANSI encoding and one in UTF-8 if no extended characters are used. Maybe your dump contains no special characters and thus the editor doesn't identify it as UTF-8.
If you want the SQL dump in a specific encoding, use the --encoding=encoding parameter or the PGCLIENTENCODING environment variable, for example:
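A minimal sketch based on the command from the question (same names, just with the encoding forced explicitly so the dump no longer depends on the database default):
pg_dump.exe -U postgres --encoding=UTF8 -f "file-name.sql" database-name
Setting PGCLIENTENCODING=UTF8 in the environment before running pg_dump has the same effect.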

STDERR of pg_dump in UTF-8

I am redirecting the stderr of pg_dump to file:
pg_dump ...... 2>pg_dump.log
but this file is ANSI-encoded. I would like to see it in UTF-8 or Unicode. Is this possible?
man pg_dump
-E encoding
--encoding=encoding
Create the dump in the specified character set encoding. By default, the dump is created in the database encoding.
BTW: regarding "UTF-8 or Unicode", the "or" does not make sense; UTF-8 is one of the encodings of Unicode (another is UTF-16).
Updated: Sorry, I misunderstood your problem. Are you interested in the error messages generated by PostgreSQL, or in text that comes from your own queries/data? If the former, I think the LC_MESSAGES setting should work: http://www.postgresql.org/docs/9.2/interactive/locale.html
Otherwise, you can always use iconv.
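For instance (a sketch; it assumes the log was written in the Windows-1252 ANSI code page, so substitute whatever code page your system actually uses):
iconv -f WINDOWS-1252 -t UTF-8 pg_dump.log > pg_dump_utf8.log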

wiki dump encoding

I'm using WikiPrep to process the latest wiki dump enwiki-20121101-pages-articles.xml.bz2. I replaced "use Parse::MediaWikiDump;" with "use MediaWiki::DumpFile::Compat;" and made the corresponding changes in the code. Then I ran
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2
I got an error
enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^
I guess there are some non-UTF-8 characters in the dump, so I ran
iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2
And indeed, I got some errors
BZh91AY&SYiconv: illegal input sequence at position 10
So, my question is: what is the encoding of the wiki dump, and if I wish to convert it to UTF-8, what should I do? Or how should I modify wikiprep.pl to avoid such problems?
Many thanks
-- [solved] I should unzip the file first.
You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
(Posting borrible's answer so that this resolved question is not listed as unanswered.)
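In concrete terms (a sketch; the -k flag keeps the compressed original, and it assumes wikiprep.pl accepts the uncompressed XML the same way it was given the .bz2):
bunzip2 -k enwiki-20121101-pages-articles.xml.bz2
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml
The "BZh91AY&SY" shown in both error messages is just the bzip2 file header read as text, which is why iconv and the XML parser both failed right at the start.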

MongoDB: errors while inserting document using mongoimport

I am trying to insert a huge (~831M) file into a mongo collection using mongoimport:
/Library/mongodb/bin/mongoimport --port 12345 -d staging -c collection < out.all.1
and see some errors like
exception:Failure parsing JSON string near: , 'Custome
and there are instances where I found some weird characters
'CustomerCity': u'Wall \xa0'
'CustomerCity': u'La Ca\xc3\xb1ada Flintridge'
'CustomerCity': u'La Ca\xf1ada Flintridge'
How do I resolve these issues?
Thank you
I struck a similar problem where mongoimport gave errors about non-UTF-8 characters in a flat file I'd asked it to import. A Google Groups thread led me to try putting my source data file through iconv on the Unix command line to 'correct' non-UTF-8 characters, thus:
iconv -f ISO-8859-1 -t UTF-8 inputfile.txt > outputfile.txt
That solved the issue for me. I wonder if that approach might help you? While the error you're seeing is different, it's the odd characters that are messing up the JSON parsing, no?
One wonders, however, how those odd characters are ending up in your output data if you're generating it yourself. Perhaps you could filter them out in the code that generates the output?
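Adapted to the command from the question (a sketch; it assumes the non-UTF-8 bytes are Latin-1, as the \xf1 and \xa0 above suggest, though the \xc3\xb1 line already looks like UTF-8, so spot-check the converted output):
iconv -f ISO-8859-1 -t UTF-8 out.all.1 > out.all.1.utf8
/Library/mongodb/bin/mongoimport --port 12345 -d staging -c collection < out.all.1.utf8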