How to convert an xlsx file from utf-8 to windows-1252 - perl

I am working on a Perl script that reads data from an .xlsx Excel file and inserts it into an Oracle database. The database has Windows-1252 encoding, the Excel file has UTF-8 encoding (as far as I know that is the standard for xlsx files), and special characters such as ö, ü, ű, ő are shown as ??. What is the correct way to convert the encoding of that .xlsx file? I have tried converting the read string to windows-1252 before it is inserted into the DB, and I have tried converting the whole Excel file to windows-1252, but neither worked.
Thank you all for reading this and trying to help solve my problem.
Regards,
Krisz

The database has Windows-1252 encoding
The longer-term solution is to fix that so that the database encoding is UTF8.
In the meantime, you could parse your XML string using XML::LibXML and then serialise it to an alternative encoding, like this:
use XML::LibXML;
my $doc = XML::LibXML->load_xml(string => $xml);
$doc->setEncoding('ascii');
my $ascii_xml = $doc->toString();
Any non-ASCII character in the XML will then get converted to a numeric character entity with no loss of data, e.g. <title>Café life</title> will become something like <title>Caf&#xE9; life</title>.
If you can't put UTF-8 XML in the database, then I would suggest there is no particular advantage to using windows-1252 instead of ASCII, and using ASCII eliminates a number of potential "foot-guns".
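If the data really does have to go into the database as Windows-1252, the string-level conversion the question describes could look roughly like this. This is a minimal sketch, assuming the cell values arrive as decoded Perl character strings (for example from Spreadsheet::ParseXLSX) and that $sth is a prepared DBI insert handle:
use Encode qw(encode FB_CROAK);

# $cell_value is assumed to be a decoded Perl string (characters, not bytes).
# FB_CROAK makes encode() die on any character that has no Windows-1252
# equivalent, instead of silently substituting something else.
my $cp1252_value = encode('cp1252', $cell_value, FB_CROAK);
$sth->execute($cp1252_value);
Note that ö and ü exist in Windows-1252 but ű and ő do not, so a strict check such as FB_CROAK will also flag data that simply cannot be represented in that encoding.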

Related

Spark: importing text file in UTF-8 encoding

I am trying to process a file which contains a lot of special characters such as German umlauts (ä, ü, ö) etc. as follows:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
sc.textFile("/file/path/samele_file.txt")
But upon reading the contents, these special characters are not recognized.
I think the default encoding is not UTF-8 or a similar format. I would like to know if there is a way to set the encoding on this textFile method, such as:
sc.textFile("/file/path/samele_file.txt", mode="utf-8")
No. If you read a non-UTF-8 file in UTF-8 mode, non-ASCII characters will not be decoded properly. Please convert the file to UTF-8 encoding and then read it.
You can refer to Reading file in different formats.
The default mode is UTF-8, so you don't need to specify the format explicitly for UTF-8. If the file is not UTF-8, it depends on whether you need to read those unsupported characters or not.
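One way to follow that advice before handing the file to Spark is a small re-encoding pass. Here is a sketch in Perl (the main language of this thread); it assumes the source file is Latin-1, which is only a guess — adjust the input encoding to whatever the file actually uses, and treat the paths as placeholders:
use strict;
use warnings;

# Re-encode an (assumed) Latin-1 file as UTF-8 so that Spark's default
# UTF-8 decoding handles the umlauts correctly.
open my $in,  '<:encoding(ISO-8859-1)', '/file/path/samele_file.txt'      or die $!;
open my $out, '>:encoding(UTF-8)',      '/file/path/samele_file.utf8.txt' or die $!;
print {$out} $_ while <$in>;
close $in;
close $out;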

decipher encoding UTF-8 Issue

I have to compare two text files, one generated from SQL Server (generated directly) and one from Impala (through Unix). Both are saved as UTF-8 files. I have converted the SQL Server generated file using dos2unix for direct comparison in Unix. Some of the data is encoded and I am not able to tell what the encoding is.
Below is some sample data from the SQL Server file.
Rock<81>ller
<81>hern
<81>ber
R<81>cking
Below is sample data from Unix file.
Rock�ller
�ber
R�cking
�ber
I checked the files using the HxD editor, and both the SQL Server generated data and the Unix file showed the byte 0x81. I looked up code point 0x81 in Unicode and found it is a control character.
I am really lost, as encoding is fairly new to me. Any help in deciphering what encoding is actually used would be very helpful.

Non-ISO extended-ASCII CSV giving special character while importing in DB

I am getting a CSV from an S3 server and inserting it into PostgreSQL using Java.
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
BufferedReader reader = new BufferedReader(
new InputStreamReader(object.getObjectContent())
);
For some of the rows, the value in a column contains the special character �. I tried using the encodings UTF-8, UTF-16 and ISO-8859-1 with InputStreamReader, but it didn't work out.
When the encoding WIN-1252 is used, the DB still shows some special characters, but when I export the data to CSV it shows the same characters which I found in the raw file.
When I open the file in Notepad the character looks fine, but when I open it in Excel, the same special character appears again.
All the PostgreSQL stuff is quite irrelevant; PostgreSQL can deal with practically any encoding. Check your data with a utility such as enca to determine how it is encoded, and set your PostgreSQL session to that encoding. If the server is in the same encoding or in some Unicode encoding, it should work fine.
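For the session-encoding part of that advice, here is a minimal sketch (shown in Perl with DBD::Pg, since the rest of this thread is Perl; the connection details are placeholders, and 'WIN1252' is only an assumption about what the detection step might report):
use DBI;

# Connect to PostgreSQL and declare the client-side encoding so the server
# transcodes incoming text. Replace 'WIN1252' with whatever enca (or a
# similar tool) says the data actually is.
my $dbh = DBI->connect('dbi:Pg:dbname=mydb;host=localhost', 'user', 'password',
                       { RaiseError => 1 });
$dbh->do(q{SET client_encoding TO 'WIN1252'});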

"utf8 "\x96" does not map to Unicode at <somefile.pl> at line no - 321" Error in Perl

I am setting up a Perl application and I am getting this error: "utf8 "\x96" does not map to Unicode at"
Can anybody let me know the cause and the solution? Am I missing some configuration, or is it my installation problem?
Following is the code:
open(FILE,"<:encoding(UTF-8)",$self->{BASEDIR}.$self->{FILENAME}) || die "could not open file $basedir$filename - $!";
The byte 0x96 is not valid UTF-8 on its own. In UTF-8, the bytes just above 0x80 (up to 0xBF) are continuation bytes that can only appear inside a multi-byte character, so a lone 0x96 is malformed.
The input you are reading must not be UTF-8, and is most likely Latin1 or CP1252.
You will need to convert the input data to UTF-8, however one does that in Perl (it's been a long time since I did any Perl and it didn't use UTF-8 by default when I was writing Perl :-)
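A minimal sketch of that fix, assuming the data is really CP1252 and using a placeholder $path instead of the question's BASEDIR/FILENAME variables:
use strict;
use warnings;

my $path = '/some/dir/somefile.txt';   # placeholder path

# Open with the encoding the bytes are actually in (assumed CP1252 here)
# instead of UTF-8; Perl then decodes the 0x96 byte (an en dash in CP1252)
# correctly and the error goes away.
open(my $fh, '<:encoding(cp1252)', $path)
    || die "could not open file $path - $!";
while (my $line = <$fh>) {
    # $line now holds decoded characters; re-encode on output as needed.
}
close($fh);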
I suspect that something you believe to be encoded in UTF-8 is not, in fact, encoded with UTF-8.
Just putting this info out there in case it helps someone in the future.
If you're working with a Microsoft product, this can be caused by non-US characters (European, Chinese, etc.). For instance, if someone sends you an Excel spreadsheet of data that you need to process and it's saved in .csv format, those characters can end up encoded outside UTF-8 if the file wasn't saved properly.
Fortunately, at least in Excel for Mac v. 15, it is possible to take that data and "save as" specifically a CSV UTF-8 file - it's in the list of options. This is a separate option from the other CSV file option. This will convert non-US characters into the UTF-8 charset and solve this issue.

iPhone: Which encoding scheme should be used in importing CSV file into Sqlite database

In my iPhone app, I am importing a CSV file into a SQLite database using the CHCSV parser.
My CSV file contains data in European languages with special characters like umlauts, etc.
Which encoding should I use?
Should it be UTF8StringEncoding or some other encoding scheme?
What encoding is the original file in? Because that's the encoding you should use. CSV doesn't define any specific encoding, so it really depends on how the file was created.