Spark: importing text file in UTF-8 encoding - scala

I am trying to process a file which contains a lot of special characters such as German umlauts (ä, ü, ö) etc., as follows:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
sc.textFile("/file/path/samele_file.txt")
But upon reading the contents, these special characters are not recognized.
I think the default encoding is not UTF-8 or a similar format. I would like to know if there is a way to set the encoding on this textFile method, for example something like:
sc.textFile("/file/path/samele_file.txt", mode="utf-8")

No. If you read a file that is not UTF-8 encoded in UTF-8 mode, non-ASCII characters will not be decoded properly. Please convert the file to UTF-8 encoding and then read it.
You can refer to: Reading file in different formats

The default mode is UTF-8, so you don't need to specify the format explicitly for UTF-8 files. If the file is not UTF-8, then it depends on whether you need to read those unsupported characters or not.
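If converting the file up front is not an option, one workaround is to read the raw records and decode them with an explicit charset. This is only a minimal sketch, not a built-in textFile option; ISO-8859-1 below is just a placeholder for the file's real encoding, and the path is the one from the question:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import java.nio.charset.Charset
// The custom record delimiter set on hadoopConfiguration still applies here.
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/file/path/samele_file.txt")
  .map { case (_, line) =>
    // Decode the raw record bytes with an explicit charset instead of the UTF-8 default.
    new String(line.getBytes, 0, line.getLength, Charset.forName("ISO-8859-1"))
  }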

Related

How should a properly UTF-8 encoded file look in notepad++

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv files, exported from MS SQL by a business partner.
I asked him to encode them as UTF-8 (just using the standard, I thought).
Now I can see in his files that a lot of "UTF-8 codes" such as "&#233;" appear as plain text when I open them in Notepad++ (or in my "ETL" tool).
Before I ask him to fix it and deliver proper UTF-8, I would like to understand the issue and whether my claim is actually correct.
Shouldn't special characters be shown as special characters when I open the files in Notepad++, and not as plain-text codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity (it decodes to "é"). For some reason the text is HTML-encoded, which I wouldn't count as "plain text"/flat files. The file may or may not be encoded in UTF-8 on top of that; we can't tell from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
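For a concrete round trip, here is a minimal JVM sketch (the path is hypothetical): text written with an explicit UTF-8 charset and read back with the same charset comes out unchanged.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
// Hypothetical path; any writable location will do.
val path = Paths.get("/tmp/utf8-sample.txt")
// Write non-ASCII text explicitly as UTF-8 bytes.
Files.write(path, "ä ö ü é 正式名称".getBytes(StandardCharsets.UTF_8))
// Reading the bytes back and decoding them as UTF-8 reproduces the text exactly.
val roundTrip = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
println(roundTrip) // ä ö ü é 正式名称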

Character Encodings compatibility with ASCII

I'm currently reading mails from files and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If a file is encoded in UTF-8, Windows-1252 or one of the ISO-8859-* character encodings, I won't run into problems, because ASCII is embedded at the same code points in all of these charsets (so 0x41 is an "A" in all of them).
But what if a file is encoded using an encoding that does not embed ASCII in that way? Do such encodings even exist? And if so, is there a reliable way of detecting them?
Mozilla has a charset detector based on this very interesting article. It can detect a very large number of different encodings. There is also a port to C# available on GitHub, which I have used before; it turned out to be quite reliable. Of course, when the text contains only ASCII characters, it cannot distinguish between the different encodings that encode ASCII in the same way, but any encoding that encodes ASCII differently should be detected correctly by this library.
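To illustrate the ASCII-compatibility point, here is a small sketch (JVM charset names assumed): the same pure-ASCII header bytes decode to an identical string under UTF-8, ISO-8859-1 and windows-1252.
import java.nio.charset.Charset
// An RFC 2047 encoded-word header is pure ASCII ("Grüße" encoded as quoted-printable UTF-8),
// so it decodes to the same string under every ASCII-compatible charset.
val header = "Subject: =?utf-8?Q?Gr=C3=BC=C3=9Fe?=".getBytes("US-ASCII")
val decoded = Seq("UTF-8", "ISO-8859-1", "windows-1252")
  .map(name => new String(header, Charset.forName(name)))
// All three decodings are identical.
assert(decoded.distinct.size == 1)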

file encoding and deformed strings

I am just working with a text file that contains lots of deformed strings such as:
VyplÅ<88>te prosím pole "jméno
My editor says that the file encoding is latin1. The string is supposed to be a Czech sentence containing some diacritics, so no wonder it is displayed wrong. I have tried to force utf8 and latin2 encodings in my editor, but that did not help. I have also tried to use iconv to convert the file from latin1 to utf8 or latin2, but neither helped. I quite often encounter issues like this, and I don't know any solution other than rewriting the strings manually. Is there a better way to fix this?
EDIT:
Here is the original sentence:
Vyplňte prosím pole "jméno"
Here is hex dump of the part where the malformed string occurs:
0002640: 6a6d 656e 6f22 5d20 3d20 2744 453a 2056 jmeno"] = 'DE: V
0002650: 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20 ypl..te pros..m
0002660: 706f 6c65 2022 6a6d c3a9 6e6f 222e 273b pole "jm..no".';
EDIT2:
The sentence above really is correct utf8, as deceze said. But I have just found out something strange. If I try to transcode the file from utf8 to utf8 (with iconv), I get an error on the word Postgebühr, at the character ü. If I look at the hex dump, this character is represented as \xfc (252 in decimal), which is the valid latin1 encoding of ü but completely invalid as utf8. It seems that part of the file is in latin1 and another part in utf8. Here is the part of the file that is (probably) in latin1:
0000250: 506f 7374 6765 62fc 6872 273b 0a09 0963 Postgeb.hr';...c
0000260: 6f6e 665b 2277 6166 6572 7322 5d20 3d20 onf["wafers"] =
0000270: 2744 453a 206f 706c c3a1 746b 20c3 273b 'DE: opl..tk .';
As I look into this more, this does not even seem to be valid latin1, because even in latin1 it is garbled (DE: oplátk à instead of, probably, DE: oplatky za). This part of the file seems to contain some damaged text.
I can't understand how encoding in this file could have got mixed up like that. Any ideas?
If the file is supposed to contain Latin2 encoded text, then trying to convert it from Latin1 or similar is of course messing things up.
The problem is simply that your text editor does not automagically recognize the encoding, since the single-byte Latin-* encodings all look identical and interchangeable at the byte level. If your editor "tells" you the encoding is Latin1, what it means is that it is currently interpreting the file as Latin1. Obviously it has that wrong.
You either need to tell your editor to treat the file as Latin2 (Open As... Latin2, or however your editor gives you this choice) or to convert the file from Latin2 into an encoding your editor handles correctly.
To understand encodings better, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
In response to your posted hex dump: That file is UTF-8 encoded.
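For what it's worth, the garbled string from the question can be reproduced by decoding the UTF-8 bytes as Latin-1; a minimal sketch:
import java.nio.charset.StandardCharsets
// The correct Czech sentence from the question.
val original = "Vyplňte prosím pole \"jméno\""
// Its UTF-8 bytes match the hex dump: c5 88 for ň, c3 ad for í, c3 a9 for é.
val utf8Bytes = original.getBytes(StandardCharsets.UTF_8)
// Decoding those bytes as Latin-1 yields the garbled form from the editor
// ("Å" is 0xc5; the raw 0x88 control byte is what the editor shows as <88>).
val garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1)
// Decoding the same bytes as UTF-8 restores the correct sentence.
val restored = new String(utf8Bytes, StandardCharsets.UTF_8)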
iconv is the way to go, but you must know the correct encoding. Latin2 (ISO-8859-2) is only one of the possibilities, since many ASCII extensions were used in Europe. What language is this supposed to be in?

How to get vim to show a byte-by-byte representation of file data

I don't want vim to ever interpret my data in any encoding-specific way. In other words, when I'm in vim, I want the character that my cursor is on to correspond to the actual byte, not a UTF-* (etc.) representation of that byte.
I need to use vim to analyze issues caused by Unicode conversion errors made by other people (using other software) so it's important that I see what is actually there.
For example, in Cygwin's vim, I have been able to see UTF-8 BOMs as
ï»¿ [START OF FILE DATA]
This is perfect. I recognize this as a UTF-8 BOM and if I want to know what the hex for each character is, I can put the cursor on the characters and use 'ga'.
I recently got a proper Linux machine (Fedora). In /etc/vimrc, this line exists
set fileencodings=ucs-bom,utf-8,latin1
When I look at a UTF-8 BOM on this machine, the BOM is completely hidden.
When I add the following line to ~/.vimrc
set fileencodings=latin1
I see
Ã¯Â»Â¿
The first 3 characters are the BOM (when ga is used against them). I don't know what the last 3 characters are.
At one point, I even saw the UTF-8 BOM represented as "feff" - the UTF-16 BOM.
Anyway, you see my problem. I need to see exactly what is in my file without vim interpreting the bytes for me. I know I could use xxd, od, etc but vim has always been very convenient as an analysis tool. Plus I want to be able to edit the files and save them without any conversion problems.
Thanks for your help.
Use 'binary' mode:
:edit ++bin file
or
vim -b file
From :help 'binary':
The 'fileencoding' and 'fileencodings' options will not be used, the
file is read without conversion.
I get some good mileage from doing :e ++enc=latin1 after loading the file (Vim's initial guess at the encoding isn't important at this stage).
The sequence Ã¯Â»Â¿ is actually U+FEFF (the BOM) encoded as UTF-8, decoded as latin1, encoded as UTF-8, and decoded as latin1 again. ï»¿ is U+FEFF (the BOM) encoded as UTF-8 and decoded as latin1. You can't get away from encodings: those aren't the actual bytes, they are the latin1 characters produced by an incorrect decoding. If you want bytes, use a hex editor; otherwise, use the correct decoding.
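That chain is easy to check outside of vim; here is a minimal JVM sketch of the two decodings described above:
import java.nio.charset.StandardCharsets.{ISO_8859_1, UTF_8}
// U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
val bomBytes = "\uFEFF".getBytes(UTF_8)
// Decoded as latin1, those bytes read as "ï»¿".
val once = new String(bomBytes, ISO_8859_1)
// Re-encoding that string as UTF-8 and decoding it as latin1 again gives "Ã¯Â»Â¿".
val twice = new String(once.getBytes(UTF_8), ISO_8859_1)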

Creating files with french characters and encoding

Hi, I am creating a file like so:
FileStream temp = File.Create( this.FileName );
Then I put data in the file like so:
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this: MSDN says that the default encoding when creating a file this way is UTF-8. And when I write a French character such as é, TextPad interprets the file as a UTF-8 file, but Notepad++ says it's "ANSI as UTF8" (or maybe it's an ANSI file that is being read as UTF-8). When I create a file the same way without the French character, both TextPad and Notepad++ read the file as an ANSI file, even though according to MSDN it should still be a UTF-8 file.
Which program should be trusted, Notepad++ or TextPad? Notepad++ seems to be more consistent, but it is still the opposite of what MSDN says it should be. My problem is that we create files that get sent off to another company, and depending on whether there are French characters, the encoding seems to keep changing.
Or is there a better way to determine the encoding of a file? I've read about byte order marks and preambles, but as far as I understand, neither is guaranteed to be there.
We initially thought that all the files we were building were ANSI. Also, please note that both ANSI and UTF-8 should handle the French characters appropriately, as the characters are part of both character sets.
As far as I know, "ANSI" character encoding is another name for US-ASCII.
If there are no characters in the file that aren't in the ASCII charset, then the file is valid ASCII and valid UTF-8, and there's no way to distinguish them. So your program can write it as UTF-8, and any other program would be correct in seeing it as ASCII (ANSI), just as it would be in seeing it as UTF-8.
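If you need editors to agree on the encoding regardless of the content, write with an explicit encoding and, optionally, a BOM; in C# that means passing an Encoding to the StreamWriter constructor. Here is the same idea as a minimal JVM sketch (the file name is hypothetical, and whether to emit a BOM is a judgment call, since not every consumer expects one):
import java.io.{FileOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets
// Hypothetical file name; the point is the explicit charset, not the path.
val out = new OutputStreamWriter(new FileOutputStream("export.csv"), StandardCharsets.UTF_8)
// Optionally write a BOM (U+FEFF) first; editors such as Notepad++ use it to
// recognize the file as UTF-8 even when the rest of the content is pure ASCII.
out.write("\uFEFF")
out.write("é è à ç\n")
out.close()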