I need to convert a bunch of files to:
UTF-8 without BOM
I have installed iconv:
http://www.gnu.org/software/libiconv/
But I can only find the plain UTF-8 option:
iconv --list
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8
ISO-10646-UCS-2 UCS-2 CSUNICODE
So after I run the conversion the file is still in UTF-8 with a BOM.
Is it not possible to convert to:
UTF-8 without BOM
using iconv?
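For what it's worth, iconv never writes a BOM when the target is plain UTF-8; if the input file already starts with one, it survives the conversion as a character and has to be stripped separately. A minimal sketch with GNU sed (the file name is just an example):
sed -i '1s/^\xEF\xBB\xBF//' file.txt
This deletes the three UTF-8 BOM bytes 0xEF 0xBB 0xBF from the start of the file if they are present.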
I'm importing a database called AdventureWorks to PostgreSQL,
and this message appears:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
CONTEXT: COPY businessentity, line 1
SQL state: 22021
As the error says, the byte 0xFF isn't valid in a UTF8 file. Since you're trying to load data from a SQL Server sample database, I suspect the file was saved as UTF16 with a Byte Order Mark. Unicode isn't a single encoding: Unicode text files can contain a signature at the start which specifies the encoding used in the file. For UTF16 that BOM can be 0xFF 0xFE or 0xFE 0xFF, values which are invalid in UTF8.
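One quick way to confirm this is to inspect the first bytes of the file, for example with xxd (the file name is a placeholder):
head -c 3 businessentity.csv | xxd
A file starting with ff fe has a UTF-16LE BOM, fe ff a UTF-16BE BOM, and ef bb bf a UTF-8 BOM.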
As far as I know you can't specify a UTF16 encoding with COPY, so you'll have to either convert the CSV file to UTF8 with a command-line tool or export it again as UTF8. If you exported the data using any SQL Server tool (SSMS, SSIS, bcp) you can easily specify the encoding you want. For example:
bcp Person.BusinessEntity out "c:\MyPath\BusinessEntity.csv" -c -C 65001
This will export the data using code page 65001, which is UTF8.
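If you'd rather convert the file you already have, iconv can do it from the command line. A minimal sketch, assuming the export really is UTF16 with a BOM (the file names are just examples):
iconv -f UTF-16 -t UTF-8 BusinessEntity.csv > BusinessEntity-utf8.csv
With -f UTF-16, iconv reads the BOM to determine the byte order and omits it from the UTF8 output.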
I use
pg_dump.exe -U postgres -f "file-name.sql" database-name
to back up UTF-8 encoded databases on PostgreSQL 8.4 and 9.5 on a Windows host. Some databases may have foreign characters such as Chinese, Thai, etc. stored in character columns.
The resulting .sql file shows ANSI encoding when opened in Notepad++ (I'm NOT applying ANSI to opened files by default). How do I know if Unicode characters are always preserved in the dump file? Should I be using an archive (object) backup file instead?
Quote from the manual:
By default, the dump is created in the database encoding.
There is no difference between a text file in ANSI encoding and one in UTF-8 if no extended characters are used. Maybe your dump contains no special characters, and thus the editor doesn't identify it as UTF-8.
If you want the SQL dump in a specific encoding, use the --encoding=encoding parameter or the PGCLIENTENCODING environment variable.
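Applied to the command from the question, that would be:
pg_dump.exe -U postgres --encoding=UTF8 -f "file-name.sql" database-name
Alternatively, set PGCLIENTENCODING=UTF8 in the environment before running pg_dump.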
I am reading a CSV file in my SQL script and copying its data into a PostgreSQL table. The line of code is below:
\copy participants_2013 from 'C:/Users/Acrotrend/Desktop/mip_sahil/mip/reelportdata/Participating_Individual_Extract_Report_MIPJunior_2013_160414135957.Csv' with CSV delimiter ',' quote '"' HEADER;
I am getting the following error: character with byte sequence 0x9d in encoding 'WIN1252' has no equivalent in encoding 'UTF8'.
Can anyone help me understand the cause of this issue and how I can resolve it?
The problem is that 0x9D is not a valid byte value in WIN1252.
There's a table here: https://en.wikipedia.org/wiki/Windows-1252
The problem may be that you are importing a UTF-8 file and PostgreSQL is defaulting to Windows-1252 (which I believe is the default on many Windows systems).
You need to change the character set of your Windows command line with chcp before running the script. Or, in PostgreSQL, you can run:
SET CLIENT_ENCODING TO 'utf8';
before importing the file.
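A minimal sketch of the first option, assuming the script is run with psql (the database and script names are placeholders):
chcp 65001
psql -U postgres -d mydb -f import_script.sql
chcp 65001 switches the console code page to UTF-8 before psql runs the script.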
Simply specify encoding 'UTF8' in the \copy command, e.g. (I broke it into two lines for readability, but keep it all on the same line):
\copy dest_table from 'C:/src-data.csv'
(format csv, header true, delimiter ',', encoding 'UTF8');
More details:
The problem is that the client encoding is set to WIN1252, most likely because it is running on a Windows machine, but the file has UTF-8 characters in it.
You can check the Client Encoding with
SHOW client_encoding;
client_encoding
-----------------
WIN1252
Every encoding has numeric ranges of valid codes. Are you sure your data is in WIN1252 encoding?
Postgres is very strict and doesn't import files with broken encoding. You can use iconv, which can work in tolerant mode and remove broken characters. After cleaning with iconv you can import the file.
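A sketch of that cleanup (the file names are placeholders):
iconv -c -f WINDOWS-1252 -t UTF-8 participants.csv > participants_clean.csv
The -c flag tells iconv to silently discard characters that cannot be converted instead of aborting.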
I had this problem today, and it was because a TEXT column contained fancy quotes that had been copied and pasted from an external source.
The following .pdf conversion fails on the Linux console with "ContentNotFoundError":
wkhtmltopdf --page-size A4 --encoding utf-8 --viewport-size 1024x768 http://localhost/möja.html /tmp/test.pdf
Same problem in lynx with the UTF-8 charset enabled:
The requested URL /möja.html was not found on this server.
The locale settings are UTF-8, and typing the German special characters in the console works correctly.
LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
Accessing the page in the browser, and with wkhtmltopdf on the development system (same Debian Wheezy distribution), works as expected. PDFs are created fine when the URL contains no German special characters. I can't find any differences.
Thank you for every hint!
Apparently the server doesn't expect to see UTF-8 encoded characters, it probably expects Latin-1. URLs cannot contain non-ASCII characters to begin with. Encode the umlaut in the URL in percent encoding according to the expected character encoding. The Latin-1 (ISO-8859-1) percent encoded version would be:
http://localhost/m%F6ja.html
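One way to compute the percent-encoded form for a given character encoding is Python's urllib (a sketch; assumes Python 3 is available):
python3 -c "import urllib.parse; print(urllib.parse.quote('möja.html', encoding='latin-1'))"
This prints m%F6ja.html, the Latin-1 percent-encoded form of the file name from the question.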
I opened a file and didn't notice that it was in windows-1251 encoding. It was opened as a UTF-8 encoded file, showing incorrect characters. Then I pasted in a bunch of code in UTF-8 encoding. After saving (with some error message about falling back to UTF-8) I can't restore the file's original content. I reopened the file, cut all the pasted code, and saved it. Neither "reopen with encoding" nor "save with encoding" gives a correctly encoded file.
iconv -f UTF-8 -t WINDOWS-1251 file.txt > file_1251.txt
Iconv says there's an illegal input sequence.
It looks like it's still in Windows-1251. If the editor had decoded the original file incorrectly as UTF-8 and overwritten it, the result would be valid (if garbled) UTF-8 and iconv wouldn't report an error; since it does, the file apparently still contains the original Windows-1251 bytes.
Try
iconv -f Windows-1251 -t UTF-8 file.txt > file_UTF8.txt
and open the resulting file normally, as UTF-8.