Weird key-name with mongoimport - mongodb

I have a tab-separated values (TSV) file that I need to import into MongoDB.
I run:
mongoimport -d mydb -c blsItem --type tsv --file .\BLS_3.01.txt --fieldFile .\fieldnames-bls.txt
fieldnames-bls.txt is a UTF-8 file that contains all the keys, one per line:
blsKey
germanDescription
englishDescription
The result of the import is that the blsKey field name starts with gibberish in every document:
{ "_id" : ObjectId("4eee82136e6ffebe9085debd"), "´╗┐blsKey" : "B100000", "germanDescription" : "Vollkornbrote", "englishDescription" : ""
But even Vim shows fieldnames-bls.txt nice and clean.
What is going on?

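It looks like a UTF-8 BOM (byte order mark): the three bytes EF BB BF at the start of fieldnames-bls.txt are being read as part of the first field name. Convert the file to UTF-8 without BOM and the key will import cleanly.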
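If no editor with a "UTF-8 without BOM" option is at hand, the BOM can also be removed with a few lines of Python. This is only a minimal sketch: the helper names strip_utf8_bom and rewrite_without_bom are made up for illustration, and the file is rewritten in place, so keep a backup.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def rewrite_without_bom(path: str) -> None:
    """Hypothetical helper: rewrite the file at path in place, minus any BOM."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(strip_utf8_bom(data))
```

Calling rewrite_without_bom("fieldnames-bls.txt") before running mongoimport should leave the first key as plain blsKey.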

replace capture match with capture group in bash GNU sed

I've looked for a solution to my problem in the other posts listed below, but my regex looks quite different and needs special care:
How to output only captured groups with sed
Replace one capture group with another with GNU sed (macOS) 4.4
sed replace line with capture groups
I'm trying to replace a regex match with its capture group in a big JSON file.
The file contains exported MongoDB objects, and I'm trying to replace each ObjectId wrapper with the plain string:
{"_id":{"$oid":"56cad2ce0481320c111d2313"},"recordId":{"$oid":"56cad2ce0481320c111d2313"}}
So the output in the original file should look like this:
{"_id":"56cad2ce0481320c111d2313","recordId":"56cad2ce0481320c111d2313"}
That's the command I run in the shell:
sed -i 's/(?:{"\$oid":)("\w+")}/\$1/g' data.json
I get no error, but the file remains the same.
What exactly am I doing wrong?
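I finally managed to make it work. The regex syntax in sed is different from the one in the regexr.com tester tool: by default sed uses POSIX basic regular expressions, so capture groups are written \( \), non-capturing groups like (?: ) are not supported, and back-references are written \1 rather than $1.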
echo '{"$oid":"56cad2ce0481320c111d2313"}' | sed 's/{"$oid":\("\w*"\)}/\1/g'
gives the correct output:
"56cad2ce0481320c111d2313"
I found it even better to read from stdin and write to a file, instead of first writing a JSON file, then reading, replacing and writing it again.
Since I use mongoexport to export the collection, replace the ObjectIds and write the output to a JSON file, my final solution looks like this:
mongoexport --host localhost --db myDB --collection my_collection | sed 's/{"$oid":\("\w*"\)}/\1/g' >> data.json
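For comparison, the same capture-group replacement can be sketched in Python, whose re module uses the PCRE-like syntax that testers such as regexr.com assume. The function name flatten_oids is made up for illustration; the pattern mirrors the sed expression above.

```python
import re

# Matches {"$oid":"<word chars>"} and captures the quoted string as group 1.
OID_WRAPPER = re.compile(r'\{"\$oid":("\w*")\}')


def flatten_oids(line: str) -> str:
    """Replace every {"$oid":"..."} wrapper with the quoted id it contains."""
    return OID_WRAPPER.sub(r"\1", line)
```

Like the sed version, this treats each exported line as plain text and does not parse the JSON, so it relies on the compact {"$oid":"..."} layout shown in the question.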

CLI option to give encoding format to 'mongoimport'

Does the mongoimport CLI command support only UTF-8 files?
Is there a way to specify the encoding so that it accepts non-UTF-8 files, without us having to convert each file to UTF-8 manually?
This is one way of doing it on Linux/Unix. You could use iconv to convert the non-UTF-8 file to UTF-8 and then run mongoimport on the converted file:
iconv -f ISO-8859-1 -t utf-8 myfile.csv > myfileutf8.csv
man iconv should give you more details about the options.
Also, "Import CSV file (contains some non-UTF8 characters) in MongoDb" discusses some options for Windows.
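Where iconv is not available (for example on Windows), the same conversion can be sketched in Python. The function names are made up for illustration, and ISO-8859-1 is only assumed as the source encoding because the iconv example uses it; the real encoding of the file has to be known in advance.

```python
def latin1_to_utf8(data: bytes, src_encoding: str = "iso-8859-1") -> bytes:
    """Decode bytes from src_encoding and re-encode them as UTF-8."""
    return data.decode(src_encoding).encode("utf-8")


def transcode_file(src_path: str, dst_path: str,
                   src_encoding: str = "iso-8859-1") -> None:
    """Hypothetical helper: write a UTF-8 copy of src_path to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(latin1_to_utf8(data, src_encoding))
```

transcode_file("myfile.csv", "myfileutf8.csv") would then play the role of the iconv call, after which mongoimport can be run on the converted file. The whole file is read into memory, which is fine for a sketch but may need chunking for very large files.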

Import CSV file (contains some non-UTF8 characters) in MongoDb

How can I import a CSV file that contains some non-UTF8 characters to MongoDB?
I tried the recommended import command:
mongoimport --db dbname --collection colname --type csv --headerline --file D:/fastfood.xls
Error Message
exception: Invalid UTF8 character detected
I would remove those invalid characters manually, but the size of the data is considerably big.
Tried Google with no success.
PS: mongo -v = 2.4.6
Thanks.
Edit:
BTW, I'm on Win7
On Linux you could use the iconv command with -c, which omits invalid characters from the output (written to stdout), as suggested in "How to remove non UTF-8 characters from text file":
iconv -f utf8 -t utf8 -c file.txt
I'm not familiar with MongoDB, so I have no insight on how to preserve the invalid characters during import.
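Since the asker is on Windows, where iconv is usually not installed, a rough Python equivalent of iconv -c could look like the sketch below. The function name drop_invalid_utf8 is made up for illustration; like -c, it silently discards the offending bytes rather than preserving them.

```python
def drop_invalid_utf8(data: bytes) -> bytes:
    """Return data with every byte sequence that is not valid UTF-8 removed."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")
```

Reading the file in binary mode, passing its contents through drop_invalid_utf8 and writing the result to a new file should give mongoimport input that no longer triggers the "Invalid UTF8 character detected" exception, at the cost of losing the dropped characters.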
For emacs users:
Open CSV file in emacs and change encoding using ‘C-x C-m f’ and choosing utf-8 as the coding system. For more information see ChangingEncodings
You're trying to import an xls file as a csv file. Save the file as csv first, then try again.

mongoimport: append source field

With mongoimport I import the data of several external instances.
Does mongoimport allow me to add a field like source:"where-the-data-comes-from" to each document which is imported?
I.e. if i import the data of server A and B, I would like to store source:"A" or source:"B" to each document.
No. However, you can do this from the command line. First create a file 'header.txt' that lists the field names, one per line, plus an extra 'source' field. You can generate it from the header line of your existing CSV by running
cat <(head -1 test.csv | tr "," "\n") <(echo source) > header.txt
header.txt should look like this:
field_a
field_b
.......
source
Note that I have appended a 'source' field to this list.
Now you can run the command (assuming you have sed installed)
sed 's/$/,source-a/' test.csv | mongoimport -d test-db -c test-cl --type csv --fieldFile header.txt
If you already have a header line in your document, run this instead:
sed '1d' test.csv | sed 's/$/,source-a/' | mongoimport -d test -c test --type csv --fieldFile header.txt
Here 'source-a' is the label you want to attach to every document from this file.
You can easily script this in bash so that you only supply the source and csv for each import job.
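As an alternative to the sed pipeline, the source column could also be appended with Python's csv module, which copes with quoted fields that contain commas. This is a minimal sketch; the function name append_source and its parameters are made up for illustration.

```python
import csv
import io


def append_source(csv_text: str, source: str, has_header: bool = True) -> str:
    """Append a 'source' column to CSV text, labelling every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    first = True
    for row in reader:
        if not row:
            continue  # skip blank lines
        if first and has_header:
            writer.writerow(row + ["source"])  # extend the header line
        else:
            writer.writerow(row + [source])    # label a data row
        first = False
    return out.getvalue()
```

Because the header row is kept and extended, the output could be piped to mongoimport with --type csv --headerline instead of a separate field file.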

Convert pipe delimited csv to tab delimited using batch script

I am trying to write a batch script that will query a Postgres database and output the results to a csv. Currently, it queries the database and saves the output as a pipe delimited csv.
I want the output to be tab delimited rather than pipe delimited, since I will eventually be importing the csv into Access. Does anyone know how this can be achieved?
Current code:
cd C:\Program Files\PostgreSQL\9.1\bin
psql -c "SELECT * from jivedw_day;" -U postgres -A -o sample.csv cscanalytics
postgres = username
cscanalytics = database
You should be using COPY to dump CSV:
psql -c "copy jivedw_day to stdout csv delimiter E'\t'" -o sample.csv -U postgres -d cscanalytics
The delimiter E'\t' part will get you your output with tabs instead of commas as the delimiter. There are other options as well; please see the documentation for further details.
Using -A like you are just dumps the usual interactive output to sample.csv without the normal padding that makes the columns line up. That's why you're seeing the pipes:
-A
--no-align
Switches to unaligned output mode. (The default output mode is otherwise aligned.)
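If the pipe-delimited file already exists and re-running the query is not an option, the delimiter can also be swapped after the fact. The sketch below assumes plain psql -A output in which no field contains a pipe or a newline; the function name pipes_to_tabs is made up for illustration.

```python
import csv
import io


def pipes_to_tabs(text: str) -> str:
    """Convert pipe-delimited text to tab-delimited text."""
    # psql -A does not quote fields, so quote characters are read literally.
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()
```

Unaligned psql output may also end with a row-count footer such as (2 rows), which this sketch does not strip. The COPY approach above avoids both caveats.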