I have exported the data in CSV format, but it contains funny characters like é.
What is the charset? UTF-8, or my computer's default?
Is there a way to specify the charset at export?
It is unfortunately impossible to specify the charset at export. But I think you can define the encoding during the CSV import process in LibreOffice or MS Excel. Tell me if that solves your issue.
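For what it's worth, those funny characters are usually a symptom of reading UTF-8 bytes with a Windows-1252 (or similar) viewer. A minimal illustrative sketch in Scala (the string is just an example, not taken from your file):

import java.nio.charset.{Charset, StandardCharsets}

// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9
val utf8Bytes = "é".getBytes(StandardCharsets.UTF_8)

// Decoding those two bytes as Windows-1252 shows them as two characters instead of one
println(new String(utf8Bytes, Charset.forName("windows-1252"))) // prints Ã©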
I am trying to import a value from a CSV, using an input statement with the encoding set to UTF-8.
The value contains a U+2019 character, which SAS doesn't recognize at all and displays as a box instead.
Does anyone know what the problem could be?
The session needs to be running UTF-8; otherwise SAS will try to transcode the text into the session encoding. Ask your SAS admin to show you how to connect to an application server that is running with UTF-8 encoding.
I am working on a Perl script that reads data from an .XLSX Excel file and inserts it into an Oracle database. The database has Windows-1252 encoding, the Excel file has UTF-8 encoding (as far as I know, that is the standard for xlsx files), and special characters such as ö, ü, ű, ő are shown as ??. What is the correct way to convert the encoding of that .xlsx file? I have tried converting the read string into Windows-1252 before it is inserted into the DB, and I have tried converting the whole Excel file into Windows-1252, but neither of them worked.
Thank you all for reading this and trying to help solve my problem.
Regards,
Krisz
The database has Windows-1252 encoding
The longer-term solution is to fix that so that the database encoding is UTF8.
In the meantime, you could parse your XML string using XML::LibXML and then serialise it to an alternative encoding, like this:
use XML::LibXML;

# $xml holds the XML string you already have
my $doc = XML::LibXML->load_xml(string => $xml);

# Serialise with an ASCII encoding; non-ASCII characters become numeric entities
$doc->setEncoding('ascii');
my $ascii_xml = $doc->toString();
Any non-ASCII character in the XML will then get converted to a numeric character entity with no loss of data, e.g. <title>Café life</title> will become <title>Caf&#233; life</title>.
If you can't put UTF-8 XML in the database, then I would suggest there is no particular advantage to using windows-1252 instead of ASCII, and using ASCII eliminates a number of potential "foot-guns".
I am trying to process a file which contains a lot of special characters such as German umlauts (ä, ü, ö), as follows:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
sc.textFile("/file/path/samele_file.txt")
But upon reading the contents, these special characters are not recognized.
I think the default encoding is not UTF-8 or a similar format. I would like to know if there is a way to set the encoding on this textFile method, such as:
sc.textFile("/file/path/samele_file.txt",mode="utf-8")`
No. If you read a non-UTF-8 file in UTF-8 mode, non-ASCII characters will not be decoded properly. Please convert the file to UTF-8 encoding and then read it.
You can refer to Reading file in different formats.
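As a rough sketch of that "convert first, then read" approach (assuming the file is small, sits on the local filesystem, and is actually ISO-8859-1; the output path is just a placeholder, and sc is your existing SparkContext):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Re-encode the whole file from ISO-8859-1 (assumed source encoding) to UTF-8
val raw = Files.readAllBytes(Paths.get("/file/path/samele_file.txt"))
val utf8 = new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8)
Files.write(Paths.get("/file/path/samele_file_utf8.txt"), utf8)

// textFile can now decode the records correctly
val lines = sc.textFile("/file/path/samele_file_utf8.txt")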
The default mode is UTF-8, so you don't need to specify the format explicitly for UTF-8 files. If it's a non-UTF-8 file, it depends on whether you need to read those unsupported characters or not.
I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv files, exported from MS SQL by a business partner.
I asked him to encode them as UTF-8 (just using the standard, I thought).
Now I can see in his files that a lot of UTF-8 codes such as "&#233;" appear as plain text when I open them in Notepad++ (or in my "ETL" tool).
Before I ask him to fix it and deliver proper UTF-8, I would like to understand the issue and whether my claim is actually correct.
Shouldn't special characters be shown as special characters when I open them in Notepad++ and not as plain text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity. For some reason the text is HTML formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that; we don't know from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
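To make the distinction concrete, here is a small illustrative sketch in Scala (not part of the original answer) showing that "&#233;" is just six ASCII characters, while the character é encoded in UTF-8 is two bytes:

import java.nio.charset.StandardCharsets

// The HTML entity is ordinary ASCII text: the six characters & # 2 3 3 ;
val entity = "&#233;"
println(entity.length) // 6

// The character itself, encoded as UTF-8, is the byte pair C3 A9
val encoded = "é".getBytes(StandardCharsets.UTF_8)
println(encoded.map(b => f"${b & 0xFF}%02X").mkString(" ")) // prints C3 A9

A UTF-8-aware editor shows the byte pair C3 A9 as é; the entity, by contrast, is displayed literally unless something interprets the text as HTML.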
I would like to import German-language values into OpenERP 7. Currently, the import fails due to special characters. Importing English text works perfectly.
Some of the sample values are:
beschränkt
öffentlich nach Einzelgewerken
Do I need to change the language preference to German first, before importing?
Also, do I need to know German to accomplish this task?
Any advice?
Changing the language of your User account to German is only useful if you'd like to load the German version of translatable fields. For example, if you were importing a list of Products as a CSV file, it would allow you to load the German translation of the product names. If you don't, the names will simply be stored as the master translation (English).
However, it is very likely that your import fails due to an encoding issue. Encoding comes into the picture because of German special characters such as umlauts. In that case you basically need to make sure that you are importing the CSV file using the same encoding setting that was used to export it.
If your CSV was produced on Windows using Excel or something similar, there is a good chance it was produced with Windows-1252 encoding. By default OpenERP will select utf-8, so you will need to change that in the Encoding selection box of the File Format Options that appear after you select the CSV file to import (in OpenERP 7.0).
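As a quick illustration of why the mismatch breaks the import (a Scala sketch, not OpenERP code; the sample word comes from the question): the Windows-1252 byte for "ä" is not a valid UTF-8 sequence, so a UTF-8 reader either rejects it or substitutes a replacement character.

import java.nio.charset.{Charset, StandardCharsets}

// "beschränkt" written out in Windows-1252 ("ä" becomes the single byte 0xE4)
val cp1252Bytes = "beschränkt".getBytes(Charset.forName("windows-1252"))

// Read back as UTF-8: 0xE4 does not start a valid UTF-8 sequence here, so it is replaced
println(new String(cp1252Bytes, StandardCharsets.UTF_8)) // prints besch�nkt (U+FFFD replacement character)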