Scala java.nio.charset.UnmappableCharacterException: Input length = 1

I've found several questions with similar titles, but couldn't use any of them to resolve my issue. I can't seem to load my .csv file:
val source = io.Source.fromFile("C:/mon_usatotaldat.csv")
Returns:
java.nio.charset.UnmappableCharacterException: Input length = 1
So I tried:
val source = io.Source.fromFile("UTF-8", "C:/mon_usatotaldat.csv")
and got:
java.nio.charset.IllegalCharsetNameException: C:/mon_usatotaldat.csv
I guess UTF-8 wouldn't work, if the file isn't in UTF-8 format, so that makes sense, but I don't know what to do next.
I've managed to discover the encoding is windows-1252 using:
val source = io.Source.fromFile("C:/mon_usatotaldat.csv").codec.decodingReplaceWith("UTF-8")
But this didn't do what I had expected, which was to convert the file to UTF-8. I have no idea how to work with it.
Another thing I've tried was:
val source = io.Source.fromFile("windows-1252","C:/mon_usatotaldat.csv")
But that returned:
java.nio.charset.IllegalCharsetNameException: C:/mon_usatotaldat.csv
Please help. Thanks in advance.

Try converting your Excel file to UTF-8 first, and then try val source = io.Source.fromFile("C:/mon_usatotaldat.csv", "UTF-8") (the file name comes first, then the encoding name).
To convert to UTF-8:
(1) Open the Excel file that holds the data (.xls, .xlsx).
(2) In Excel, choose "CSV (Comma Delimited) (*.csv)" as the file type and save as that type.
(3) Open the saved .csv file in Notepad (found under "Programs" and then "Accessories" in the Start menu).
(4) Choose Save As... and, at the bottom of the "Save As" box, find the select box labelled "Encoding". Select UTF-8 (do NOT use ANSI or you lose all accents etc.). Then save the file under a slightly different file name from the original.
This file is in UTF-8 and retains all characters and accents and can be imported, for example, into MySQL and other database programs.
Reference: Excel to CSV with UTF8 encoding
Hope this helps!

Set up an InputStreamReader to correctly read windows-1252. Don't bother with intermediate UTF-8.
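A minimal sketch of that suggestion in Java, assuming the file really is windows-1252 as the question reports. The class and method names are made up for illustration, and the sample bytes stand in for the real file so the sketch runs on its own:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.List;
import java.util.stream.Collectors;

class Cp1252Reader {
    // Reads all lines from the stream, decoding the bytes as windows-1252.
    static List<String> readLines(InputStream in) {
        Charset cp1252 = Charset.forName("windows-1252");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, cp1252))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in data: 0xE9 is 'é' and 0x80 is '€' in windows-1252.
        // Neither byte is valid on its own in UTF-8.
        byte[] sample = {'c', 'a', 'f', (byte) 0xE9, '\n', (byte) 0x80};
        readLines(new ByteArrayInputStream(sample)).forEach(System.out::println);
    }
}
```

For the real file, pass Files.newInputStream(Paths.get("C:/mon_usatotaldat.csv")) instead of the sample stream. In Scala the same idea is io.Source.fromFile("C:/mon_usatotaldat.csv", "windows-1252"): the file name is the first argument and the encoding the second, which is why the reversed calls in the question fail with IllegalCharsetNameException.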

Related

What are the characters shown on a file after forcefully changing the extension?

Recently I changed the extension of an .apk file to .txt and despite this, I was able to open it on Notepad with some random characters, that weren't available on the keyboard in the file. org/antlr/runtime/ANTLRFileStream.class…TmOÓP=w[×QËÀ)ê|A…ÑETÔ¢NP¢™ãË—º•Q3ZÓcüþ¿j",£ß4ñGÏmÇñ˽Ïs{žçœçeûùëóW ±¨á0F5d0ÖA˔‹LÈã’ŠËR˜PqEƒ†Iy\•ØkÒºÞÁЂ´¦TL«˜H­95{ÙÚ°2K/­×–Y³Üªù(ð·:%œv\'¸!Гû÷óðª#¢èUܵä¸öòæÆÛ_±^ÔÂt^Ùª­Z¾#ýæc"XwêKž_5-7¨ù¦¿éΆmÞZ^Y*ÍS “ÛÖ¹µ¹7eûUàxn]%µ‘Ð^TÊvË^…kžUˆ;u_àTw<sÁ}µDL%ÛªØ>ùÄš#º…Rø˜¨;o)\,0ǚԞ݇ؓ‡àΪ<ò6ýr³¥GsÃ횪EOÌ_…É =è•Ç¬Ž#8ª£½ú^fùõ˜Ž›¸%pü IT{`Á2þ¶<Š:î`NÇ<î긇A˜èÿïˆ8Ç0Q¥»¨#- Ze7srRÉšíVƒõÐ]0rí&tÀ”O´‡[Y±K ö¬H›¯Ü %÷¬8Ì) r+åšW·ÑÏF†¿,bd—i%h³­ˆá8½YÄiª‘
It's not just this file: renaming many other extensions, such as .jar and .xapk, to .txt shows me similar characters. Can anyone please explain what these characters are based on, and how exactly the OS decides which characters to show for an unsupported file?
Is there a way to get the original content through this data?
Let's say you created a text editor that can write, save, and open text files. You also defined the encoding used to save text as binary data (all files are binary when saved). So your encoding looks something like the following (shown next to a hypothetical Emacs encoding):
Your encoding          Emacs encoding
TEXT   BINARY          TEXT   BINARY
A      01000001        ă      01000001
B      01000010        Ћ      01000010
...                    ...
Z      01011010        Ϡ      01011010
Let's say you create a file with 'ABZ' as its contents. When saved, this file contains the value 010000010100001001011010. When you open the file with your text editor, the editor finds 010000010100001001011010 as the file contents and, using the encoding above, knows that it is 'ABZ', so it prints 'ABZ' on the screen.
Now let's say you open the same file using Emacs. Since, in this example, Emacs uses its own encoding, it displays "ăЋϠ". There is nothing wrong with Emacs; it just doesn't know that the data was written using your custom encoding.
So the point is that every file is written in a specific format. For example, the APK format can only be correctly understood by software that knows it, such as the Android system. When you try to open an APK file in a text editor, the editor just tries to make sense of the binary data as text, in the same way as Emacs does in the example above.
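The same effect can be reproduced with two real encodings instead of the made-up ones in the table. This is only an illustrative sketch (the class name is invented): the two bytes 0xC3 0xA9 are one character in UTF-8 but two characters in ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class SameBytesDifferentText {
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        // In UTF-8 these two bytes form the single character U+00E9 (é).
        System.out.println(decodeAsUtf8(bytes));
        // In ISO-8859-1 each byte is its own character: U+00C3 (Ã) then U+00A9 (©).
        System.out.println(decodeAsLatin1(bytes));
    }
}
```

Neither result is "wrong" from the decoder's point of view; only the encoding that was used when writing tells you which one was intended.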
Is there a way to get the original content through this data?
If you know the encoding (or file format) that was originally used to write the data, then you can read the contents of the file using that same encoding.
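For the .apk and .jar files from the question, the "original encoding" is the ZIP archive format, which is why a readable entry name such as org/antlr/runtime/ANTLRFileStream.class shows up in the dump. A rough sketch of reading such an archive with the matching format follows. The class name, the helper that builds a stand-in archive, and its entry name are all hypothetical, so the example runs without a real .apk:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipPeek {
    // Lists the entry names of a ZIP-format archive held in memory.
    static List<String> listEntries(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a tiny stand-in archive with one made-up entry (not a real .apk).
    static byte[] sampleArchive() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("org/example/Hello.class"));
            zip.write("not real bytecode".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        listEntries(sampleArchive()).forEach(System.out::println);
    }
}
```

To inspect a real archive, pass Files.readAllBytes(...) of the .apk or .jar to listEntries, or simply open the file with any ZIP tool.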

PhpStorm: Converting folders encoding to another

I have a project with lots of files in ISO-8859-15, and I need to convert them to UTF-8. If I change one file, PhpStorm asks "Do you want to convert - blah blah blah", and if I say yes, the important symbols don't become ???.
However, since the number of files in my project is HUGE, I cannot do that one by one. If I change the encoding in the project settings, the encoding might change to UTF-8, but all the symbols become ??? (so no conversion happens).
So, how can I tell PhpStorm to convert all files to UTF-8? Is it possible and, if so, how? Is there an alternative method?
AFAIK it's not possible to do this for a whole folder at a time, but it can be done for multiple files (e.g. all files in a certain folder):
Select desired files in Project View panel
Use File | File Encoding
When asked -- make sure you choose "convert" and not just "read in another encoding".
You can repeat this procedure for each subfolder (still much faster than doing this for each file individually).
Another possible alternative is to use something like iconv (or any other similar tool) and do it in terminal/console.
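With iconv, a single file is converted with something like iconv -f ISO-8859-15 -t UTF-8 old.php > new.php. A scripted variant of the same idea is sketched below. It is not a PhpStorm feature; the class name and the ".php" filter are assumptions for illustration. It assumes every matching file really is ISO-8859-15 (a file that is already UTF-8 would be double-encoded), and it rewrites files in place, so run it on a copy or under version control.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class Latin9ToUtf8 {
    static final Charset SOURCE = Charset.forName("ISO-8859-15");

    // Decodes the bytes as ISO-8859-15 and re-encodes the text as UTF-8.
    static byte[] toUtf8(byte[] latin9Bytes) {
        return new String(latin9Bytes, SOURCE).getBytes(StandardCharsets.UTF_8);
    }

    // Rewrites every *.php file under root in place (assumption: all are ISO-8859-15).
    static void convertTree(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.toString().endsWith(".php"))
                 .forEach(p -> {
                     try {
                         Files.write(p, toUtf8(Files.readAllBytes(p)));
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After a conversion like this, the IDE's project encoding still has to be set to UTF-8 so the files are opened with the new encoding.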
Watch out when opening the file that you want to convert in PhpStorm. In my case all the files were still encoded in ISO-8859 but were opened as UTF-8, resulting in garbled umlauts. In this case direct conversion to UTF-8 is not possible.
If you encounter this, do the following:
Open the ISO-8859 file
Change file encoding dropdown (lower right corner) to ISO-8859-1 or ISO-8859-15 and choose REOPEN
Misspellings will now disappear
Then change the encoding again (dropdown lower right corner), this time to UTF-8 and choose CONVERT
Now the file is properly encoded in UTF-8
cheers

UTF-8 source files are not supported in avisynth

I use avisynth to demux video from audio.
When I use
x = "m.mkv"
ffvideosource(x)
It works correctly, but when I change my video filename to a non-ASCII (UTF-8) one and my script to:
x = "م.mkv"
ffvideosource(x)
I got the following error:
failed to open for hashing avisynth
I found a link (UTF-8 source files are not supported) which says that UTF-8 file names do not work in AviSynth and that, to correct the problem, you should:
specify the parameter utf8=true when calling ffvideosource, save the script as UTF-8 without BOM and then see if that works.
But I couldn't solve the problem. When I open the script in Notepad and save it in UTF-8 format, I get the following error:
UTF-8 Source files are not supported, re-save script with ANSI encoding
How can I solve the problem? How can I run my script with a UTF-8 filename?
“Without BOM” is important. You need to save the file as raw UTF-8 without the Microsoft-style faux-BOM. Notepad can't do this; it always saves UTF-8 files with that generally-undesirable 0xEF 0xBB 0xBF header. Most other text editors (e.g. Notepad++) can do it properly.
AviSynth isn't really Unicode-aware, so it doesn't want you using UTF-8 and will give that error message to try to stop you making mistakes. ffvideosource's workaround of hiding UTF-8 bytes in what AviSynth sees as ‘ANSI’ characters only works as long as AviSynth sees the file as ANSI. AviSynth doesn't have very sophisticated encoding-guessing, so removing the faux-BOM is enough to convince it that it is dealing with ANSI.
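If no suitable editor is at hand, the three header bytes can also be removed programmatically. The following is a small sketch with invented names that works on the raw bytes of a script; it only checks for and drops a leading 0xEF 0xBB 0xBF and leaves everything else untouched.

```java
import java.util.Arrays;

class BomStripper {
    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Returns the data without a leading BOM; data without a BOM is returned as-is.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'x', '='};
        System.out.println(hasUtf8Bom(withBom));               // true
        System.out.println(hasUtf8Bom(stripUtf8Bom(withBom))); // false
    }
}
```

For a script on disk, read it with Files.readAllBytes, pass the bytes through stripUtf8Bom, and write the result back with Files.write.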
Very common problem when using UTF-8 in AviSynth.
Follow these steps:
Check the plugins folder. These three files should exist there: ffms2.dll, ffmsindex.exe, and FFMS2.avsi. If you did not have a problem with ANSI, I guess that you don't have FFMS2.avsi in your plugins folder; in that case, download the latest version from here.
After that, make an AVS file with Notepad++. For example, I do this:
x = "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv"
ffmpegsource2(x,utf8=true)
Please note that here I used ffmpegsource2().
In the Encoding menu of Notepad++, select Encode in UTF-8 without BOM.
Save your file.
Check that the video file exists in the directory given in the script.
Double click on your AVS file.
Enjoy it!

Fetching Asian (Japanese / Chinese) characters from Excel file into TSV format using Perl's Spreadsheet::ParseExcel

Friends,
I am preparing a TSV file from an Excel file containing Chinese (special) characters, as follows: The Seonjeongneung ... Jeonghyeon (貞顯王后, 1462–1530) .....
I have tried using CPAN's Spreadsheet::ParseExcel and Spreadsheet::ParseExcel::FmtJapan, but with no success. These characters appear as ?? in the TSV file when it is opened in Vim.
I also tried binmode STDOUT, ':utf8'; and binmode STDOUT, ':encoding(cp932)';
Please help me find a way to extract information from Excel sheets into TSV format.
PS: Excel allows a direct save as TSV, but the output was garbled there as well.
I just exported your sample text perfectly from OpenOffice Calc, just by choosing the "Save as .csv" option and choosing UTF-8 as format. I'd be very surprised if Excel can't do the same. Have you considered the possibility that VIM / your console doesn't support Chinese characters correctly or that it's set to use a font that doesn't include Chinese characters? To check for this kind of error, open your .csv or .tsv file in your web browser. Web browsers will do anything to correctly display a file, including changing fonts as necessary.
If you want, send me the file you need to export and I'll check if there's anything weird about it. Could be one of the native Chinese encodings (gb or big5) instead of Unicode.
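Another way to tell a display problem from real data loss, besides opening the file in a browser, is to look at the bytes themselves. The sketch below (names invented, and assuming the export is meant to be UTF-8) counts Han characters after decoding and counts literal question-mark bytes. If the file contains 0x3F bytes where the Chinese text should be, the characters were already lost during export and no viewer or font will bring them back.

```java
import java.nio.charset.StandardCharsets;

class TsvCheck {
    // Counts code points in the Han script after decoding the bytes as UTF-8.
    static long countHan(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    // Counts literal '?' bytes (0x3F), the usual substitute for unencodable characters.
    static long countQuestionMarks(byte[] bytes) {
        long count = 0;
        for (byte b : bytes) {
            if (b == 0x3F) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The four characters from the question, written as escapes: 貞顯王后
        byte[] good = "\u8c9e\u986f\u738b\u540e".getBytes(StandardCharsets.UTF_8);
        System.out.println(countHan(good) + " Han, " + countQuestionMarks(good) + " '?'");
    }
}
```

Feed it Files.readAllBytes of the generated .tsv to check a real export.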

How is this file encoded?

I have a .csv file generated by Excel that I got from my customer. My software has to open and parse it in Java. I'm using universalchardet, but it did not detect the encoding from the first 1,000 bytes of the file.
Within these first 1,000 bytes, there is a sequence that should be read as Boîte; however, I cannot find the correct encoding to use to convert this file to UTF-8 strings in Java.
In the file, Boîte is encoded as 42,6F,94,74,65 (read using a hex editor). B, o, t and e are using the standard latin encoding with 1 byte per character. The î is also encoded on only one byte, 0x94.
I don't know how to guess this charset, none of my searches online yielded relevant results.
I also tried to use file on that file:
$ file export.csv
/Users/bicou/Desktop/export.csv: Non-ISO extended-ASCII text, with CR line terminators
However, when I looked at the extended-ASCII charset, the value 0x94 stands for ö.
Have you got other ideas for guessing the encoding of that file?
This was the Mac OS Roman encoding. With the following Java code, the text was decoded properly:
InputStreamReader isr = new InputStreamReader(new FileInputStream(targetFileName), "MacRoman");
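When a known word such as Boîte is available, one possible way to narrow down the encoding is to decode the observed bytes with every charset the JVM offers and keep those that produce the expected text. This is only a brute-force sketch with made-up names, not a real detector; several related charsets can match the same short sample, and which charsets are installed depends on the JDK.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class CharsetGuess {
    // Returns the names of all available charsets that decode bytes to expected.
    static List<String> charsetsDecodingTo(byte[] bytes, String expected) {
        List<String> matches = new ArrayList<>();
        for (Charset charset : Charset.availableCharsets().values()) {
            try {
                if (new String(bytes, charset).equals(expected)) {
                    matches.add(charset.name());
                }
            } catch (RuntimeException e) {
                // Defensive: skip any charset whose decoder misbehaves on this input.
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Bytes from the question: 42 6F 94 74 65, expected to read "Boîte".
        byte[] observed = {0x42, 0x6F, (byte) 0x94, 0x74, 0x65};
        charsetsDecodingTo(observed, "Bo\u00eete").forEach(System.out::println);
    }
}
```

The more non-ASCII sample words are checked this way, the shorter the list of candidates becomes.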
I don't know how to delete my own question. I don't think it is useful anymore...