Updating mongoData with MongoSpark

From the following tutorial provided by Mongo:
MongoSpark.save(centenarians.write.option("collection", "hundredClub").mode("overwrite"))
am I correct in understanding that what is essentially happening is that Mongo first drops the collection and then overwrites it with the new data?
My question, then, is whether it is possible to use the MongoSpark connector to actually update records in Mongo.
Let's say I've got data that looks like
{"_id" : ObjectId(12345), "name" : "John" , "Occupation" : "Baker"}
What I would then like to do is merge the person's record with data from another file that has more details, i.e. that file looks like
{"name" : "John", "address" : "1800 some street"}
the goal is to update the record in Mongo so now the JSON looks like
{"_id" : ObjectId(12345) "name" : "John" , "address" : 1800 some street", "Occupation" : "Baker"}
Now here's the thing: let's assume that we just want to update John, and that there are millions of other records that we would like to leave as is.

There are a few questions here, I'll try to break them down.
What is essentially happening is that Mongo first drops the collection and then overwrites it with the new data?
Correct. As of mongo-spark v2.x, if you specify mode overwrite, the MongoDB Connector for Spark will first drop the collection and then save the new result into it. See the source snippet for more information.
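In shell terms, the overwrite save amounts to something like this (an illustrative sketch only, not the connector's actual code; newResults stands in for the dataframe contents):
// Rough shell analogue of mode "overwrite":
db.hundredClub.drop();                  // the collection is dropped first
db.hundredClub.insertMany(newResults);  // then the new data is written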
My question is then is it possible to use the MongoSpark connector to actually update records in Mongo,
Per the patch described in SPARK-66 (mongo-spark v1.1+): if a dataframe contains an _id field, the data will be upserted. This means any existing documents with the same _id value will be updated, and new documents whose _id value doesn't already exist in the collection will be inserted.
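For a single row carrying an _id, that behaviour is roughly equivalent to the following shell call (a sketch of the semantics only; doc stands for the row's document):
// Roughly what the upsert amounts to per row:
db.hundredClub.replaceOne(
    { "_id": doc._id },  // match on the row's _id
    doc,                 // replace with the row's full document
    { upsert: true }     // insert if no such _id exists yet
)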
What I would then like to do is to merge the record of the person from another file that has more details
As mentioned above, you need to know the _id value from your collection. Example steps:
1. Create a dataframe (A) by reading from your Person collection to retrieve John's _id value, i.e. ObjectId(12345).
2. Merge the _id value of ObjectId(12345) into your dataframe (B, from the other file with more information). Use a unique field value to join the two dataframes (A and B).
3. Save the merged dataframe (C) without specifying overwrite mode.
we just want to update John, and that there are millions of other records that we would like to leave as is.
In that case, before you merge the two dataframes, filter out any unwanted records from dataframe B (the one from the other file with more details). In addition, when you call save(), specify mode append; see the sketch below.
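Putting it together, the net effect on John's document can be sketched in the shell (the collection name person and the shortened ObjectId are illustrative; the connector does this via the dataframe save, not a manual shell call):
// Step 1: read John's _id from the Person collection
var john = db.person.findOne({ "name": "John" });
// Steps 2-3: the append-mode save of the merged row amounts to an upsert on _id
db.person.replaceOne(
    { "_id": john._id },
    { "_id": john._id, "name": "John",
      "address": "1800 some street", "Occupation": "Baker" },
    { upsert: true }
)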

Related

Presto & MongoDB - Schema creation and updates

I have Presto set up locally and am able to query data from MongoDB collections. At the start, I created a presto_schema collection in MongoDB to let Presto understand the collection details needed for querying, and I added one collection entry to it. However, I noticed later that any new MongoDB collection that was not added to presto_schema is still accessible from Presto: upon the first query, the new collection's details are automatically appended to the presto_schema collection with the relevant schema details.
But for collections with a nested schema, it fails to automatically add all the nested fields; it only adds what it identifies from the initial query.
For example, consider below is my Collection (new_collection), which it got created newly with content as below:
{
    "_id" : "13ec5e2a-ef04-4d05-b971-ef8e65638f83",
    "name" : "npt",
    "client" : "npt_client",
    "attributes" : {
        "level" : 697,
        "country" : "SC",
        "doy" : 2022
    }
}
And say if my first query from Presto is as below:
presto:mydb> select count(*) from new_collection where attributes.level > 200;
A new entry for this collection is automatically added to presto_schema. It includes all the non-nested fields, plus the nested fields referenced in the initial query, but it fails to add the other nested fields, so Presto does not recognize queries on them. I could go ahead and amend presto_schema with all the missing nested fields, but I am wondering if there is an automated way, so that we don't need to keep amending it manually on every new field added to the collection (consider a scenario where completely dynamic fields are added to the collection's nested object).
I would recommend upgrading to Trino (formerly PrestoSQL) because the MongoDB connector (version >= 360) supports mapping fields to JSON type. This type mapping is unavailable in prestodb.
https://trino.io/download.html
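If upgrading is not an option, amending presto_schema by hand for the nested fields would look roughly like the sketch below. The fields layout (name/type/hidden) follows the Presto MongoDB connector's schema-collection format, but the exact type strings depend on your connector version, so treat this as an assumption to verify:
// Hypothetical sketch: declare the nested attributes as a row type
db.presto_schema.updateOne(
    { "table": "new_collection" },
    { $set: { "fields": [
        { "name": "_id", "type": "varchar", "hidden": false },
        { "name": "name", "type": "varchar", "hidden": false },
        { "name": "client", "type": "varchar", "hidden": false },
        { "name": "attributes",
          "type": "row(level bigint, country varchar, doy bigint)",
          "hidden": false }
    ] } }
)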

Bulk update all references between two collections in Mongodb

I have two collections, raw_doc and unique_doc in Mongo.
raw_doc receives imports of a large amount of data on a regular basis (+500k rows). unique_doc has every unique instance of 3 fields found in raw_doc.
A shortened example of the data in each collection:
raw_doc
{Licence : "Free", Publisher : "Jeff's music", Name: "Music for all",Customer:"Dave", uniqueclip_id:12345},
{Licence : "Free", Publisher : "Jeff's music", Name: "Music for all",Customer:"Jim", uniqueclip_id:12345}
unique_doc
{ _id: 12345, Licence: "Free", Publisher: "Jeff's music", Name: "Music for all" }
I would like to add a reference to raw_doc, linking it to the appropriate unique_doc. I can't use the three fields in unique_doc for the key as those fields will be edited eventually, but the data in raw_doc will stay the same (thus the data will no longer match but still needs to be linked).
Is there a query I could run in Mongo that would pull in bulk the IDs from unique_doc and insert them into the appropriate raw_docs?
You can try updateMany:
db.raw_doc.updateMany({ uniqueclip_id: 12345 }, { $set: { uniqueclip_id: 54321 } })
This will update all the documents in raw_doc that contain uniqueclip_id: 12345 and set it to 54321. (Note that uniqueclip_id is a number in your sample data, so a string filter like "12345" would match nothing.)
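To do the bulk linking the question actually asks about, you can loop over unique_doc and stamp its _id onto the matching raw_docs. A sketch (unique_doc_id is an assumed name for the new reference field):
// For each unique_doc, write its _id into the matching raw_docs:
db.unique_doc.find().forEach(function(u) {
    db.raw_doc.updateMany(
        { uniqueclip_id: u._id },           // raw_docs sharing this clip id
        { $set: { unique_doc_id: u._id } }  // assumed reference field name
    );
});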
Generating my own id up front seems to be the way to go. I have managed to keep the processing time down to around 120s for 500k rows.

How to compare all records of two collections in MongoDB using mapreduce?

I have a use case in which I want to compare each record of two collections in MongoDB, and after comparing each record I need to find the mismatched fields of every record.
Let us take an example: in collection1 I have a record {id : 1, name : "bks"}
and in collection2 I have a record {id : 1, name : "abc"}.
When I compare the above two records with the same key, the field name is a mismatch because its value differs.
I am thinking of achieving this use case with mapreduce in MongoDB, but I am facing problems accessing another collection inside the map function. When I tried to do the comparison in the map function, I got this error: "errmsg" : "exception: ReferenceError: db is not defined near '
Can anyone give me some thoughts on how to compare records using mapreduce?
It might have helped to read the documentation:
When upgrading to MongoDB 2.4, you will need to refactor your code if your map-reduce operations, group commands, or $where operator expressions include any global shell functions or properties that are no longer available, such as db.
So from your error fragment, you appear to be referencing db in order to access another collection. You cannot do that.
If indeed you are intending to "compare" items in one collection to those in another, then there is no other approach other than looping code:
db.collection.find().forEach(function(doc) {
    var another = db.anothercollection.findOne({ "_id": doc._id });
    // Code to compare
})
There is simply no concept of "joins" as such available to MongoDB, and operations such as mapReduce or aggregate or others strictly work with one collection only.
The exception is db.eval(), but as per all the strict warnings in the documentation, this is almost always a very bad idea.
Live with your comparison in looping code.
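For the original question, the loop can collect the mismatched field names directly. A sketch, assuming both collections share _id values and the documents are flat:
// Report which top-level fields differ between matching documents:
db.collection1.find().forEach(function(doc) {
    var other = db.collection2.findOne({ "_id": doc._id });
    if (other === null) return;  // no counterpart to compare against
    var mismatches = [];
    Object.keys(doc).forEach(function(key) {
        if (String(doc[key]) !== String(other[key])) mismatches.push(key);
    });
    if (mismatches.length > 0)
        print(doc._id + " mismatched fields: " + mismatches.join(", "));
});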

MongoDB C driver _id generation

I use mongo_insert() three times to insert my data in three different collections. The problem is that the "_id" field must be exactly the same in each of the collections, but I do not know how to (ideally) recover and reuse the "_id" field generated in my first mongo_insert...
Please advise me how to do it.
Normally, you could have a different field, like CustomId, for your private needs, and leave _id for Mongo's generation.
But if you still need it to be exactly the same, there are two options:
1) Set a custom generated _id on each doc.
2) Save the first doc, then read it back, take its _id, and set it on the other docs.
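Variant 1 is usually simpler: generate the ObjectId yourself before the first insert and reuse it. The mongo shell analogue (the C driver exposes equivalent OID-generation helpers; collection names here are placeholders):
// Generate one ObjectId up front and reuse it across collections:
var sharedId = ObjectId();
db.collectionA.insertOne({ "_id": sharedId, "a": 1 });
db.collectionB.insertOne({ "_id": sharedId, "b": 2 });
db.collectionC.insertOne({ "_id": sharedId, "c": 3 });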

MongoDB - DBRef to a DBObject

Using Java ... not that it matters.
Having a problem and maybe it is just a design issue.
I assign "_id" field to all of my documents, even embedded ones.
I have a parent document (and a collection for those) which has an embedded document.
So I have something like:
{ "_id" : "49902cde5162504500b45c2c" ,
"name" : "MongoDB" ,
"type" : "database" ,
"count" : 1 ,
"info" : { "_id" : "49902cde5162504500b45c2y",
"x" : 203 ,
"y" : 102
}
}
Now I want to have another document which references my "info" via a DBRef, don't want a copy. So, I create a DBRef which points to the collection of the parent document and specifies the _id as xxxx5c2y. However, calling fetch() on the DBRef gives a NULL.
Does it mean that DBRef and fetch() only works on top level collection entry "_id" fields?
I would have expected that fetch() would consume all keys:values within the braces of the document, but maybe that is asking too much. Does anyone know? Is there no way to create cross-document references except at the top level?
Thanks
Yes, your DBRef _id references need to be to documents in your collection, not to embedded documents.
If you want to find the embedded document you'll need to do a query on info._id and you'll need to add an index on that too (for performance) OR you'll need to store that embedded document in a collection and treat the embedded one as a copy. Copying is OK in MongoDB ... 'one fact one place' doesn't apply here ... provided you have some way to update the copy when the main one changes (eventual consistency).
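For example (parent collection name is assumed):
// Find the parent document via the embedded document's _id:
db.parents.findOne({ "info._id": "49902cde5162504500b45c2y" })
// Index the embedded _id so those lookups stay fast:
db.parents.createIndex({ "info._id": 1 })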
BTW, on DBRef's, the official guidance says "Most developers only use DBRefs if the collection can change from one document to the next. If your referenced collection will always be the same, the manual references outlined above are more efficient."
Also, why do you want to reference info within a document? If it was an array I could understand why you might want to refer to individual entries but since it doesn't appear to be an array in your example, why not just refer to the containing document by its _id?