Non-ISO extended-ASCII CSV giving special characters while importing into DB - PostgreSQL

I am getting a CSV file from an S3 server and inserting it into PostgreSQL using Java.
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
BufferedReader reader = new BufferedReader(
new InputStreamReader(object.getObjectContent())
);
For some of the rows, the value in a column contains the special character �. I tried using the encodings UTF-8, UTF-16 and ISO-8859-1 with InputStreamReader, but it didn't work out.
When the encoding windows-1252 is used, the DB still shows some special characters, but when I export the data to CSV it shows the same characters that I found in the raw file.
But then again, when I open the file in Notepad the character looks fine, yet when I open it in Excel the same special character appears.

All the PostgreSQL stuff is quite irrelevant; PostgreSQL can deal with practically any encoding. Check your data with a utility such as enca to determine how it is encoded, and set your PostgreSQL session to that encoding. If the server is in the same encoding or in some Unicode encoding, it should work fine.
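As an illustration of that advice, here is a minimal sketch, assuming the file turns out to be Windows-1252 (the table and column names, and the jdbcUrl/user/password variables, are made up for the example). The stream is decoded with an explicit charset, and the JDBC driver then passes proper Unicode strings to the server, which converts them to the database encoding:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Decode the S3 stream with the encoding the file is actually in
// (assumed here to be Windows-1252, e.g. as reported by enca).
Charset fileCharset = Charset.forName("windows-1252");
BufferedReader reader = new BufferedReader(
        new InputStreamReader(object.getObjectContent(), fileCharset));

try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO my_table (my_column) VALUES (?)")) {
    String line;
    while ((line = reader.readLine()) != null) {
        ps.setString(1, line);  // the driver re-encodes to the server encoding
        ps.executeUpdate();
    }
}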

Related

How to convert an xlsx file from utf-8 to windows-1252

I am working on a Perl script that reads data from an .xlsx Excel file and inserts it into an Oracle database. The database has Windows-1252 encoding, the Excel file has UTF-8 encoding (as far as I know that is the standard for .xlsx files), and special characters such as ö, ü, ű, ő show up as ??. What is the correct way to convert the encoding of that .xlsx file? I have tried converting the read string into Windows-1252 before it is inserted into the DB, and I have tried converting the whole Excel file into Windows-1252, but neither worked.
The database has Windows-1252 encoding
The longer-term solution is to fix that so that the database encoding is UTF8.
In the meantime, you could parse your XML string using XML::LibXML and then serialise it to an alternative encoding, like this:
use XML::LibXML;
my $doc = XML::LibXML->load_xml(string => $xml);
$doc->setEncoding('ascii');
my $ascii_xml = $doc->toString();
Any non-ASCII character in the XML will then get converted to a numeric character entity with no loss of data, e.g.: <title>Café life</title> will become <title>Caf&#233; life</title>.
If you can't put UTF-8 XML in the database, then I would suggest there is no particular advantage to using windows-1252 instead of ASCII, and using ASCII eliminates a number of potential "foot-guns".

Escape Cyrillic, Chinese, Arabic, Hebrew characters in postgresql

I'm trying to load records from a flat file into a Postgres table. I'm doing it with the COPY command, which has worked well so far.
But now I am receiving fields with words in Chinese, Japanese, Cyrillic and other languages, and when I try to load them I get an error.
How can I escape those characters in Postgres? I searched, but I have not found any reference to this kind of topic.
You should not escape the characters; you should load them as they are.
Your database encoding is UTF8, so that's no problem. If your database encoding is not UTF8, change that.
For each file, figure out what its encoding is and use the ENCODING option of COPY or the environment variable PGCLIENTENCODING so that PostgreSQL knows which encoding the file is in.
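For example, with the PostgreSQL JDBC driver's CopyManager (a sketch, not from the answer; the file name, table name and connection details are assumed, and the file is assumed to be in Windows-1252):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Decode the flat file with the encoding it is actually in, so the
// driver sends correctly encoded text regardless of the source language.
Charset fileCharset = Charset.forName("windows-1252");
try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     BufferedReader reader = new BufferedReader(new InputStreamReader(
             new FileInputStream("data.csv"), fileCharset))) {
    CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
    copy.copyIn("COPY my_table FROM STDIN WITH (FORMAT csv)", reader);
}
In plain SQL, the equivalent is the ENCODING option itself, e.g. COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv, ENCODING 'WIN1252').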

WiX Installer heat.exe and non-ascii filenames

I added a file in my WiX script with the character " î " in the path name. Light.exe will complain:
A string was provided with characters that are not available in the specified database code page '1252'
The character in question is 0xEE in Windows-1252 encoding, that is, 0x00EE Unicode or 0xC3AE in UTF-8. These files are in a wxs file generated by heat.exe, and this xml is encoded as UTF-8.
I assume the error message comes from the fact that it tries to input the character in UTF encoding while the database is 1252? Since UTF isn't really supported by Windows Installer (as described in the WiX documentation), should I be using input XML encoded in 1252 or ISO-8859? If so, can I tell heat.exe to use another encoding for its output?
My question is similar to this one:
Leveraging heat.exe and harvest already localized file names and including them to msi using wix, but the difference is that in that case the characters are "true" non-ANSI characters; in my case the character can be encoded correctly in 1252, but it seems the conversion from the UTF-8 input files does not work.
The WiX toolset verifies codepages like so (roughly):
encoding = Encoding.GetEncoding(codepage, new EncoderExceptionFallback(),
                                new DecoderExceptionFallback());
writer = new StreamWriter(idtPath, false, encoding);
try
{
    // GetBytes will throw an exception if any character doesn't
    // match our current encoding
    rowBytes = writer.Encoding.GetBytes(rowString);
}
catch (EncoderFallbackException)
{
    rowBytes = convertEncoding.GetBytes(rowString);
    messageHandler.OnMessage(WixErrors.InvalidStringForCodepage(
        row.SourceLineNumbers,
        writer.Encoding.WindowsCodePage));
}
It is possible that NETFX is not translating that "î" correctly. Explicitly setting the codepage on your XML may help. To do that from heat, you can try to use an XSLT (I've never tried changing the XML doc codepage via XSL, but it seems possible) or post-edit the document.
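One way to post-edit the document (a sketch, with made-up file names, not part of the original answer) is an identity transform with the standard javax.xml.transform API, which re-serialises the heat.exe output with an explicit windows-1252 declaration:
import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Copy the UTF-8 .wxs produced by heat.exe, declaring and encoding it
// as windows-1252 so the toolchain reads it in the ANSI codepage.
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.ENCODING, "windows-1252");
t.transform(new StreamSource(new File("harvested.wxs")),
        new StreamResult(new File("harvested-1252.wxs")));
Whether light.exe then accepts the re-encoded file is something to verify against your WiX version.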

How is this file encoded?

I have a .csv file generated by Excel that I got from my customer. My software has to open and parse it in Java. I'm using universalchardet, but it did not detect the encoding from the first 1,000 bytes of the file.
Within these first 1,000 bytes there is a sequence that should be read as Boîte; however, I cannot find the correct encoding to use to convert this file to UTF-8 strings in Java.
In the file, Boîte is encoded as 42,6F,94,74,65 (read using a hex editor). B, o, t and e use the standard Latin encoding with 1 byte per character. The î is also encoded in only one byte, 0x94.
I don't know how to guess this charset, none of my searches online yielded relevant results.
I also tried to use file on that file:
$ file export.csv
/Users/bicou/Desktop/export.csv: Non-ISO extended-ASCII text, with CR line terminators
However, when I looked at the extended-ASCII charset, the value 0x94 stands for ö.
Have you got other ideas for guessing the encoding of that file?
This was Mac OS Roman encoding, in which 0x94 does stand for î. When using the following Java code, the text was properly decoded:
InputStreamReader isr = new InputStreamReader(new FileInputStream(targetFileName), "MacRoman");
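A slightly fuller sketch of the conversion the question asks for (file names assumed), reading the file as MacRoman and writing it back out as UTF-8:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

// Decode each line as Mac OS Roman and re-encode it as UTF-8.
try (BufferedReader in = new BufferedReader(new InputStreamReader(
             new FileInputStream("export.csv"), "MacRoman"));
     BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
             new FileOutputStream("export-utf8.csv"), "UTF-8"))) {
    String line;
    while ((line = in.readLine()) != null) {
        out.write(line);
        out.newLine();  // the source file uses CR line endings; readLine() handles them
    }
}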
I don't know how to delete my own question. I don't think it is useful anymore...

Creating files with french characters and encoding

Hi, I am creating a file like so.
FileStream temp = File.Create( this.FileName );
Then putting data in the file like so.
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this: MSDN says that the default encoding when creating a file this way is UTF-8. When I write a French character such as é, TextPad interprets the file as a UTF-8 file, but Notepad++ says it's "ANSI as UTF-8", or maybe it's an ANSI file that it is reading as UTF-8. When I create a file the same way without the French character, both TextPad and Notepad++ read the file as an ANSI file, even though according to MSDN it should still be a UTF-8 file.
Which program should be trusted, Notepad++ or TextPad? Notepad++ seems to be more consistent, but it is still the opposite of what MSDN says it should be. My problem is that we create files that get sent off to another company, and depending on whether there are French characters the encoding seems to keep changing.
Or is there a better way to determine the encoding of a file? I've read about byte order marks and preambles, but as far as I understand neither is guaranteed to be there.
We initially thought that all the files we were building were ANSI. Also, please note that both ANSI and UTF-8 should handle the French characters appropriately, as the characters are part of both character sets.
As far as I know, "ANSI" in this context means the Windows system code page (typically Windows-1252), which is a superset of US-ASCII.
If there are no characters in the file that aren't in the ASCII charset, then the file is valid ASCII and valid UTF-8 at the same time; there is no way to distinguish them. So your program can write it as UTF-8, and any other program would be correct in seeing it as ASCII (ANSI), just as it would be in seeing it as UTF-8.
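The question's code is C#, but the byte-level point is language-independent. A small Java illustration (strings chosen arbitrarily); note that .NET's default StreamWriter writes UTF-8 without a byte order mark, which is exactly why an ASCII-only file is indistinguishable from ANSI:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        Charset ansi = Charset.forName("windows-1252");

        // ASCII-only text: identical bytes in both encodings,
        // so without a BOM no editor can tell them apart.
        System.out.println(Arrays.equals(
                "hello".getBytes(ansi), "hello".getBytes(StandardCharsets.UTF_8)));  // true

        // Text containing é: one byte (0xE9) in Windows-1252,
        // two bytes (0xC3 0xA9) in UTF-8.
        System.out.println(Arrays.equals(
                "café".getBytes(ansi), "café".getBytes(StandardCharsets.UTF_8)));    // false
    }
}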