Mongoimport csv files with string _id and upsert - mongodb

I'm trying to use mongoimport to upsert data with string values in _id.
Since the ids look like integers (even though they're in quotes), mongoimport treats them as integers and creates new records instead of upserting the existing records.
Command I'm running:
mongoimport --host localhost --db database --collection my_collection --type csv --file mydata.csv --headerline --upsert
Example data in mydata.csv:
{ "_id" : "0364", someField: "value" }
The result is that mongo inserts a record like this: { "_id" : 364, someField: "value" } instead of updating the record with _id "0364".
Does anyone know how to make it treat the _id as strings?
Things that don't work:
Surrounding the data with double double quotes ""0364"", double and single quotes "'0364'" or '"0364"'
Appending empty string to value: { "_id" : "0364" + "", someField: "value" }

Unfortunately, there is currently no way to force number-like strings to be interpreted as strings:
https://jira.mongodb.org/browse/SERVER-3731
You could write a script in Python or some other language with which you're comfortable, along the lines of:
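import csv
import pymongo

# MongoClient replaces the removed pymongo.Connection class
client = pymongo.MongoClient()
collection = client.mydatabase.mycollection
with open('myfile.csv', newline='') as f:
    for line in csv.DictReader(f):
        print('_id', line['_id'])
        # csv yields every value as a string, so "0364" stays a string
        upsert_fields = {
            '_id': line['_id'],
            'my_other_upsert_field': line['my_other_upsert_field']}
        # replace_one with upsert=True inserts or replaces the matching document
        collection.replace_one(upsert_fields, line, upsert=True)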

Just encountered this same issue and discovered an alternative. You can force Mongo to use string types for non-string values by converting your CSV to JSON and quoting the field. For example, if your CSV looks like this:
key value
123 foo
abc bar
Then you'll get an integer field for key 123 and a string field for key abc. If you convert that to JSON, making sure that all the keys are quoted, and then use --type json when you import, you'll end up with the desired behavior:
{
"123":"foo",
"abc":"bar"
}
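A minimal sketch of such a conversion in Python, assuming a comma-separated file with a header row and one JSON document per input row (the layout mongoimport reads with --type json). The function name csv_to_json_lines and the sample data are hypothetical. csv.DictReader returns every value as a string, so json.dumps writes them all quoted:

```python
import csv
import io
import json


def csv_to_json_lines(csv_text):
    """Convert CSV text (with a header row) to one JSON document per line.

    csv.DictReader yields every value as str, so json.dumps emits them quoted.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data mirroring the question's _id value.
sample = "_id,someField\n0364,value\n"
print(csv_to_json_lines(sample))  # prints {"_id": "0364", "someField": "value"}
```

For a large file you would write each line out as it is produced rather than building one string in memory.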

I was able to prefix the numeric string and that worked for me. Example:
00012345 was imported as 12345 (Type Int)
string00012345 was imported as string00012345 (Type String)
My source was a SQL database so I just did
select 'string'+column as name
Of course, you also need to do a bit of post-processing to parse the string, but far less effort than converting a rather large tsv file to json.
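The post-processing step could look roughly like this in Python. It is only a sketch: the helper name strip_prefix is hypothetical, and the prefix is assumed to be the literal "string" used in the select above.

```python
PREFIX = "string"  # assumed to match the prefix added in the SQL select


def strip_prefix(value, prefix=PREFIX):
    """Return value without the leading prefix; leave other values untouched."""
    if isinstance(value, str) and value.startswith(prefix):
        return value[len(prefix):]
    return value


print(strip_prefix("string00012345"))  # prints 00012345
```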
I also added +1 to the jira link above for the enhancement.

As an alternative to @Jesse, you can do something similar in the mongo console, e.g.
db.my_collection.find().forEach(function (obj) {
    db.my_collection.remove({_id: obj._id}); // remove the old one
    obj._id = '' + obj._id; // change to string
    db.my_collection.save(obj); // resave
});
For non-_id fields you can simply do:
db.my_collection.find().forEach(function (obj) {
    obj.someField = '' + obj.someField; // change to string
    db.my_collection.save(obj); // resave
});

I encountered the same issue.
I feel the simplest way is to convert the CSV file to a JSON file using an online tool and then import.
This is the tool I used:
http://www.convertcsv.com/csv-to-json.htm
It lets you wrap the integer values of your CSV file in double quotes for your JSON file.
If importing this JSON file fails with an error, add --jsonArray to your import command, which is required when the file contains a single JSON array:
mongoimport --host localhost --db mydb -c mycollection --type json --jsonArray --file <file_path>

Related

MongoDB - mongoimport duplicate _id in JSON array

I would like to ask you for help. I encountered a problem where, when I'm importing JSON into mongodb via compass, it throws a duplicate _id error. Therefore, I tried to go to the terminal and go through mongoimport, which runs successfully and informs me that each document was imported without error, but I see that the documents are missing. Can you give me some advice on how to solve this problem?
This is the command I ran in the Windows cmd terminal:
mongoimport D:\DimplomaThesis_data\transfer_json\180000-190000.json -d diplomovka -c transfer --jsonArray --stopOnError --maintainInsertionOrder --upsertFields _id
This is structure of record in JSON array:
{
"_id":"5d6566d086dc8b72382bc376",
"name":"Peter",
"surname":"Zubrík",
"titles":{
"before":"",
"after":""
},
"sex":"M",
"citizenship":"SVK",
"birthyear":1991,
"age":31,
"transfer":{
"source_ppo":"tj-polana-siba.futbalnet.sk",
"org_profile_id":"sportovnik-klub-fc-mukarov.futbalnet.sk",
"org_id":"5d5d3974eccb8850917918cd",
"sector":{
"_id":"sport:futbal:futbal",
"category":"sport",
"itemId":"futbal",
"sectorId":"futbal"
},
"competence_type":"player",
"transfer_type":"transfer",
"issfMoveType":"PWP",
"date_from":"2014-05-09T00:00:00.000Z",
"date_to":null,
"_id":"62e6d12c0ae29819010f611f",
"org_profile_name":"Sportovník klub FC Mukařov",
"org_name":"Sportovník klub FC Mukařov",
"source_ppo_name":"TJ Poľana Šiba"
},
"issfId":"1208658"
}
The same "_id" value, such as "5d6566d086dc8b72382bc376", can appear in multiple records in the array. I download the data from APIs: around 30 JSON files, each containing 10,000 records. Ideally I would import all documents into MongoDB and then build a pipeline in Compass.
I found a solution to my problem.
I needed to use Python to create a compound _id (a new primary key that uniquely identifies each record in the JSON array).
This code works for me:
import json

# Load the JSON data from the file
with open("250000-260000", "r", encoding="utf-8") as f:
    data = json.load(f)
# Modify the data to include the compound_key and player_id fields
for doc in data:
    doc["player_id"] = doc["_id"]
    doc["compound_key"] = doc["player_id"] + "_" + doc["transfer"]["date_from"]
    doc["_id"] = doc["compound_key"]
# Save the modified data to a new JSON file
with open("26.json", "w") as f:
    json.dump(data, f, indent=2)
Basically, I created a new, modified JSON file and imported it through MongoDB Compass, where the import finished with 0 errors (no duplicate _id errors).
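The compound key is only unique if no player has two transfers with the same date_from, so it may be worth checking before importing. A small sketch of such a check; the helper name duplicate_ids and the sample documents are hypothetical:

```python
from collections import Counter


def duplicate_ids(docs, key="_id"):
    """Return {id: count} for every id that occurs more than once."""
    counts = Counter(doc[key] for doc in docs)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical documents, already carrying compound _id values.
docs = [
    {"_id": "a_2014-05-09"},
    {"_id": "a_2015-01-01"},
    {"_id": "a_2014-05-09"},
]
print(duplicate_ids(docs))  # prints {'a_2014-05-09': 2}
```

An empty result means every _id in the file is unique.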

mongoexport convert numeric value

I'm trying to export phone numbers from a collection. Below is the sample document
{ "_id" : ObjectId("5ad5cf864717256ff02b4923"),"userName":"9619324746", "firstName" : "D H", "contactPhone" : 9619324746}
The export command that I used is below
mongoexport --db dbname --collection accounts --type=json --out accounts.json --fields contactPhone,userName
And the contents of JSON looks like below
{"_id":{"$oid":"5ad5cf864717256ff02b4923"},"userName":"9619324746","contactPhone":9.619324746e+09}
Can somebody help me to get the contactPhone value not converted? Thank you.
-Srini
If mongoexport exported 123 as 123.0, then 123 was stored as a Double in the document. Try inserting the value as a 32- or 64-bit integer instead:
db.collection.insert({
    "tweetId" : NumberLong(1234567)
})
mongoexport exports JSON using the strict mode JSON representation, which inserts some type information into the JSON so that MongoDB JSON parsers (like mongoimport) can reproduce the correct BSON data types while the exported JSON still conforms to the JSON standard:
{ "tweetId" : { "$numberLong" : "1234567" } }
To preserve all the type information, use mongodump/mongorestore instead. To export all field values as strings, you'll need to write your own script with a driver that fetches each doc and stringifies all the values.
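A rough sketch of such a script in Python. The function names stringify and export_as_strings are hypothetical, and the collection argument is assumed to be a pymongo Collection object. Doubles that hold whole numbers are written without the exponent, so 9619324746.0 becomes "9619324746":

```python
def stringify(value):
    """Recursively convert every leaf value of a document to a string."""
    if isinstance(value, dict):
        return {k: stringify(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify(v) for v in value]
    if isinstance(value, float) and value.is_integer():
        # Whole-number doubles are written as plain integers
        return str(int(value))
    return str(value)


def export_as_strings(collection, fields):
    """Yield stringified documents; `collection` is assumed to be a pymongo Collection."""
    for doc in collection.find({}, {f: 1 for f in fields}):
        yield stringify(doc)


doc = {"userName": "9619324746", "contactPhone": 9619324746.0}
print(stringify(doc))  # prints {'userName': '9619324746', 'contactPhone': '9619324746'}
```

Note that None and booleans come out as "None", "True" and "False" here; adjust that if your data needs a different representation.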

how do I import json generated by javascript into Mongodb

I need to get rid of the newline character that separates objects in a json file so that I can import properly into mongodb without having the objects in an array. What do I use in javascript to do this? I need my data in this format so that I can import:
{ name: "Widget 1", desc: "This is Widget 1" }
{ name: "Widget 2", desc: "This is Widget 2" }
The answer is that you don't have to convert the file to an array: mongoimport expects the "JSON lines" format you already have.
This format is good for performance because the file doesn't have to be loaded all at once; mongoimport reads it line by line. Imagine a billion lines: converting them to an array would cost a lot of memory.
This way it's a linear-time operation, and the lines get streamed into the db.
Look here:
http://zaiste.net/2012/08/importing_json_into_mongodb/
However, if you think you need to do the conversion, just do it like this:
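var fs = require('fs');
// Read as UTF-8 text so that split() is available; drop blank lines before joining
fs.readFile('my.json', 'utf8', function(e, text) {
    if (e) throw e;
    var arrayLikeString = "[" + text.split('\n').filter(Boolean).join(',') + "]";
    var array = JSON.parse(arrayLikeString);
})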
To import an array of objects use this command:
mongoimport --db <db-name> --collection <coll-name> --type json --jsonArray --file seed.json
Note the option: --jsonArray
Finally:
Take a look at this npm package, it looks very promising:
https://www.npmjs.com/package/jsonlines
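To illustrate why the line-by-line format is cheap on memory, here is a small Python sketch that parses one object at a time instead of building an array. The function name iter_json_lines is hypothetical, and the sketch assumes strict JSON with quoted keys, unlike the unquoted keys in the sample above:

```python
import io
import json


def iter_json_lines(stream):
    """Yield one parsed object per non-empty line, without loading the whole stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


sample = io.StringIO('{"name": "Widget 1"}\n{"name": "Widget 2"}\n')
print([doc["name"] for doc in iter_json_lines(sample)])  # prints ['Widget 1', 'Widget 2']
```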

how to remove quotations for ObjectId imported from csv

I have imported a csv file into a collection,
My document saved like
"_id" : "ObjectID(53874d952f92e2af1a5f0afb)"
I was unable to query this
Can anyone help me to remove quotations for ObjectId
Let's assume your collection name is myCollection.
Do this:
db.myCollection.find().forEach(function(doc) {
    var oldId = doc._id;
    // Extract the hex string from "ObjectID(...)"
    var newIdStr = oldId.replace(/ObjectID\((\w+)\)/g, "$1");
    doc._id = ObjectId(newIdStr);
    db.myCollection.save(doc);             // insert the copy with the real ObjectId
    db.myCollection.remove({_id: oldId});  // remove the original string-keyed document
});
CSV does not fully preserve the types of your fields; it can represent only strings and numbers (not ObjectIds). If I were you, I would write a parser in the most suitable language and convert your _id values to ObjectIds there.
Another approach, if your _ids are not important: rename _id in the CSV header to any other name, and during the import mongo will create new ids; afterwards you can remove the renamed field.
Next time you can use mongodump and mongorestore to preserve the types.
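The parsing part of such a script might look like this in Python. It is a sketch only: the helper name extract_object_id_hex is hypothetical, and it assumes every value has the exact form ObjectID(<24 hex characters>). The returned hex string can then be passed to bson.ObjectId (shipped with pymongo) before inserting:

```python
import re

# Matches the literal text produced by the CSV import, e.g. ObjectID(53874d95...)
OBJECT_ID_RE = re.compile(r"^ObjectID\(([0-9a-f]{24})\)$", re.IGNORECASE)


def extract_object_id_hex(raw):
    """Return the 24-character hex string inside 'ObjectID(...)', or None."""
    match = OBJECT_ID_RE.match(raw.strip())
    return match.group(1) if match else None


print(extract_object_id_hex("ObjectID(53874d952f92e2af1a5f0afb)"))  # prints 53874d952f92e2af1a5f0afb
```

Returning None for anything that does not match makes malformed rows easy to spot before they reach the database.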

How to change delimiter from comma to # in a csv using MongoDB

Can the delimiter be changed from comma to # while exporting records to a CSV file, as in the example below?
mongoexport -d mydb -c coll --csv --fields "ProductId,ModerationStatus,Rating,TotalCommentCount" --out results.csv
Currently, mongoexport does not have this feature.
However, you can write a short JavaScript script to do this, which gives you control over the CSV format and the field data types.
export.js
conn = new Mongo();
db = conn.getDB("myDB");
var cur = db.myCollection.find();
var obj;
while(cur.hasNext()){
    obj = cur.next();
    print("\""+obj._id+"\";\""+obj.field_1+"\";\""+obj.field_2+"\"");
}
Call this script from your OS shell:
mongo --quiet export.js > file_name.csv
--quiet: disables Mongo default option to print "version", "connecting to", etc, therefore the output of the script will be just things printed explicitly using print()
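The same idea can be sketched in Python with the standard csv module, which accepts an arbitrary single-character delimiter. The function name to_delimited and the sample row are hypothetical; in practice the rows would come from a driver query:

```python
import csv
import io


def to_delimited(rows, fields, delimiter="#"):
    """Render dict rows as delimited text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter=delimiter,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical row; keys not listed in `fields` are ignored.
rows = [{"ProductId": 1, "Rating": 5, "_id": "x"}]
print(to_delimited(rows, ["ProductId", "Rating"]), end="")
```

Values that contain the delimiter are quoted automatically by the csv module.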
This is not a very clean solution, but you can build your own version of mongoexport by setting the "Comma" field of the CSV writer in the source code:
https://github.com/mongodb/mongo-tools/blob/master/mongoexport/csv.go
https://golang.org/pkg/encoding/csv/#Writer
// WriteHeader writes a comma-delimited list of fields as the output header row.
func (csvExporter *CSVExportOutput) WriteHeader() error {
    if !csvExporter.NoHeaderLine {
        //here the trick
        csvExporter.csvWriter.Comma = '|'
        //
        csvExporter.csvWriter.Write(csvExporter.Fields)
        return csvExporter.csvWriter.Error()
    }
    return nil
}