CLI option to give encoding format to 'mongoimport' - mongodb

Does the mongoimport CLI command support only UTF-8 format files?
Is there a way to provide an encoding format so that it can accept non-UTF-8 files, without manually converting each file to UTF-8?

This is one way of doing it on Linux/Unix. You could use iconv to convert non-UTF-8 files to UTF-8 and then run mongoimport on the converted file:
iconv -f ISO-8859-1 -t utf-8 myfile.csv > myfileutf8.csv
man iconv should give you more details about the options.
Also, Import CSV file (contains some non-UTF8 characters) in MongoDb (below) discusses some options for Windows.
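If there are many files, the conversion and import can be scripted in one pass. A minimal sketch, assuming ISO-8859-1 CSV files in the current directory; the database and collection names are placeholders:
for f in *.csv; do
    # convert each file to UTF-8 first...
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "${f%.csv}.utf8.csv"
    # ...then import the converted copy (mydb/mycoll are placeholders)
    mongoimport --db mydb --collection mycoll --type csv --headerline --file "${f%.csv}.utf8.csv"
done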

Related

How to re-encode a source file which has "é" instead of "é"?

I've just inherited a legacy project in which my predecessor pushed incorrectly encoded files.
The comments, in French, should include special characters such as é, è, ç etc.
But, for instance here, an 'é' is shown as 'é'.
I'm looking for a command-line tool to handle all the files of the project. I'm pretty sure iconv should do the trick, but what I tried so far did not work.
Here is some initial information:
# problematic file example
$ file Parametres.cpp
Parametres.cpp: C source, ISO-8859 text
# check that my OS handles utf8
$ echo "éè" > test.tmp
$ file test.tmp
test.tmp: UTF-8 Unicode text
$ cat test.tmp
éè
I tried without success (meaning in Parametres.cpp.utf8 I still got 'é'):
iconv -f ISO-8859-1 -t UTF-8 Parametres.cpp -o Parametres.cpp.utf8
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT Parametres.cpp -o Parametres.cpp.utf8
iconv -f ISO-8859-1//TRANSLIT -t UTF-8 Parametres.cpp -o Parametres.cpp.utf8
My guess is that the original encoding was not ISO-8859-1 but something else. And due to a misconfigured IDE, the chars 'Ã' and '©' got permanently encoded in ISO-8859-1. From what I understood, TRANSLIT should do the job, but it seems not.
So, here are my questions:
Is there a better tool than iconv to do this job on CentOS 7.2 (yes, I know. Legacy is legacy...)?
Or, how do I determine (or guess) the original encoding so that iconv can solve my problem?
Any help or ideas are appreciated :-)
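For what it's worth, when 'é' is UTF-8 text that was decoded as Latin-1 and then saved again (the classic double encoding), one reverse pass through iconv often repairs it. A hedged sketch, assuming the file's current bytes decode as UTF-8, with the output file name chosen here for illustration:
# Reinterpret the mojibake characters as single Latin-1 bytes,
# which reassembles the original UTF-8 sequence for 'é'
iconv -f UTF-8 -t ISO-8859-1 Parametres.cpp > Parametres.cpp.fixed
file Parametres.cpp.fixed   # should report UTF-8 text if the guess was right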

Import CSV file (contains some non-UTF8 characters) in MongoDb

How can I import a CSV file that contains some non-UTF8 characters to MongoDB?
I tried the recommended import command.
mongoimport --db dbname --collection colname --type csv --headerline --file D:/fastfood.xls
Error Message
exception: Invalid UTF8 character detected
I would remove those invalid characters manually, but the data set is far too big for that.
Tried Google with no success.
PS: mongo -v = 2.4.6
Thanks.
Edit:
BTW, I'm on Win7
In Linux you could use the iconv command, as suggested in: How to remove non UTF-8 characters from text file
iconv -f utf8 -t utf8 -c file.txt > cleaned.txt
(the -c option silently discards any characters that cannot be converted; iconv writes to stdout, so redirect the output to keep the result)
I'm not familiar with MongoDB, so I have no insight on how to preserve the invalid characters during import.
For emacs users:
Open the CSV file in emacs and change the encoding using C-x C-m f, choosing utf-8 as the coding system. For more information see ChangingEncodings.
You're trying to import an xls file as a csv file. Save the file as csv first, then try again.
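Putting the two answers together: a rough sketch, assuming the spreadsheet has first been saved as fastfood.csv:
# Drop any bytes iconv cannot map to UTF-8, then import the cleaned copy
iconv -f utf-8 -t utf-8 -c fastfood.csv > fastfood.clean.csv
mongoimport --db dbname --collection colname --type csv --headerline --file fastfood.clean.csv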

Converting ASCII to BIG5 encoding in Unix

Can we change from ASCII to BIG5?
I have to generate a file in BIG5 format from an ASCII one, and I am not able to find a way to change the encoding of the file. The file contains Chinese data, which cannot be displayed in ASCII and can only be displayed in BIG5. So once I have created the ASCII file, I need to convert it to BIG5.
I have no idea how a file in ASCII encoding could contain Chinese data but if it were possible this would be the command:
iconv -f ASCII -t BIG5 asciifile -o big5file.txt
It will convert your file in ASCII encoding to BIG5 and write the output to big5file.txt.
But most likely it is not ASCII that you have in the original file. Make sure you detect the exact encoding and then use it in the command. Use iconv -l to view all available encodings.
You can try to figure out the real encoding with chardet or cchardet. If not available in your terminal, you can install it with pip install chardet (or pip install cchardet).
Once installed, pass the file name as the first argument:
chardetect Tian.Jiang.Xiong.Shi.srt
>>> Tian.Jiang.Xiong.Shi.srt: GB2312 with confidence 0.99
If you install with pip3 then the script name will be chardet3 or chardetect3.
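The detected encoding can feed straight into iconv. A small sketch, with the input file name assumed:
# Take chardet's best guess and hand it to iconv as the source encoding
SRC=$(chardetect chinesedata.txt | awk '{print $2}')   # e.g. GB2312
iconv -f "$SRC" -t BIG5 chinesedata.txt > big5file.txt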

iconv: Converting from Windows ANSI to UTF-8 with BOM

I want to use iconv to convert files on my Mac. The goal is to go from "Windows ANSI" to "whatever Windows Notepad saves, if you tell it to use UTF-8".
This is what I want:
$ file names.csv
names.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators
This is what I use:
$ iconv -f CP1252 -t UTF-8 names.csv > names.utf8.csv
This is what I get (not what I want):
$ file names.utf8.csv
names.utf8.csv: UTF-8 Unicode text, with CRLF line terminators
How do I get the BOM?
You can add it manually by first echoing the bytes into the file:
echo -ne '\xEF\xBB\xBF' > names.utf8.csv
and then concatenating your required information at the end:
iconv -f CP1252 -t UTF-8 names.csv >> names.utf8.csv
Note the >> rather than >.
Note that "Windows ANSI" may not be CP1252 - that is configured by users.
The BOM is not necessary for UTF-8.
And Windows Notepad can save UTF-8 with or without BOM.
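Both steps collapse into a small repeatable snippet. A sketch using printf, with the BOM bytes EF BB BF written as octal escapes (more portable than echo -ne), still assuming the source really is CP1252:
# Write the three-byte UTF-8 BOM, then append the converted text
printf '\357\273\277' > names.utf8.csv
iconv -f CP1252 -t UTF-8 names.csv >> names.utf8.csv
file names.utf8.csv   # expect: UTF-8 Unicode (with BOM) text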
I needed the opposite (convert German text from UTF-8 to ANSI).
So these are the commands I used:
1. iconv -l (check the available formats)
2. iconv -f UTF8 -t MS-ANSI de.txt > output.txt
And now if I open output.txt it is already in ANSI. Job done.

MysqlDump from Powershell and Windows encoding

I'm doing an export from the command line on Windows with mysqldump:
& mysqldump -u root -p --default-character-set=utf8 -W -B dbname > C:\mysql_backup.sql
My database/tables are encoded with UTF-8 and I specify the same encoding when I do the dump. But when I open the file with Notepad++ or SciTE I see an encoding of UTF-16 (UCS-2). If I don't convert the file with iconv to UTF-8 before running the import, I get an error.
It seems that the shell redirects with UTF-16 by default. Can I change this?
A side note: I use PowerShell to call mysqldump.
UPDATE: it seems that this occurs only when calling mysqldump from PowerShell. I have changed the command line above to the one I use in my PS script.
By default PowerShell represents text as Unicode, and when you save it to a file it saves it as Unicode (UTF-16) by default. You can change the file save format by using the Out-File cmdlet instead of the > operator, e.g.:
... | Out-File C:\mysql_backup.sql -Encoding UTF8
You may also need to give PowerShell a hint on how to interpret the UTF-8 text coming from the dump utility. This blog post shows how to handle this scenario in the event the utility isn't outputting a proper UTF-8 BOM.
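Putting the pieces together, a minimal PowerShell sketch of the whole dump, assuming the console-encoding hint from the linked approach is also wanted:
# Tell PowerShell to interpret mysqldump's output as UTF-8 rather than the OEM code page
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
# Write through Out-File so the file lands as UTF-8 instead of UTF-16
& mysqldump -u root -p --default-character-set=utf8 -W -B dbname |
    Out-File C:\mysql_backup.sql -Encoding UTF8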