MapReduce MongoDB User Agent - mongodb

I have 5 million entries in a MongoDB that look like this:
{
"_id" : ObjectId("525facace4b0c1f5e78753ea"),
"productId" : null,
"name" : "example name",
"time" : ISODate("2013-10-17T09:23:56.131Z"),
"type" : "hover",
"url" : "www.example.com",
"userAgent" : "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 openssl/0.9.8r zlib/1.2.5"
}
I need to add a new field called device to every entry, which will have either the value desktop or mobile. The goal would be to end up with the following kind of entries:
{
"_id" : ObjectId("525facace4b0c1f5e78753ea"),
"productId" : null,
"device" : "desktop",
"name" : "example name",
"time" : ISODate("2013-10-17T09:23:56.131Z"),
"type" : "hover",
"url" : "www.example.com",
"userAgent" : "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 openssl/0.9.8r zlib/1.2.5"
}
I am working with the MongoDB Java driver and so far I am doing the following:
DBObject query = new BasicDBObject();
query.put("device", new BasicDBObject("$exists", false)); //some entries already have such field
DBCursor cursor = resource.find(query);
cursor.addOption(Bytes.QUERYOPTION_NOTIMEOUT);
Iterator<DBObject> iterator = cursor.iterator();
int size = cursor.count();
And then I am iterating with a while(iterator.hasNext()), doing an if-else with a huge regular expression I found out there, and depending on the result of that if-else I execute something like:
BasicDBObject newDocument = new BasicDBObject("$set", new BasicDBObject().append("device", "desktop")); // or "mobile", depending on the if-else
BasicDBObject searchQuery = new BasicDBObject("_id", id);
resource.getCollection(DatabaseConfiguration.WEBSITE_STATISTICS).update(searchQuery, newDocument);
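Put together, the update loop looks roughly like this (the Pattern is a simplified stand-in for the huge regular expression, and needs import java.util.regex.Pattern):
Pattern mobilePattern = Pattern.compile("(?i).*(mobile|android|iphone|ipad).*"); // placeholder regex, not the real one
while (iterator.hasNext()) {
    DBObject entry = iterator.next();
    String userAgent = String.valueOf(entry.get("userAgent"));
    String device = mobilePattern.matcher(userAgent).matches() ? "mobile" : "desktop";
    BasicDBObject newDocument = new BasicDBObject("$set", new BasicDBObject("device", device));
    BasicDBObject searchQuery = new BasicDBObject("_id", entry.get("_id"));
    resource.getCollection(DatabaseConfiguration.WEBSITE_STATISTICS).update(searchQuery, newDocument);
}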
However, due to the large amount of data (more than 5 million entries), this takes forever.
Is there a way of doing this with MapReduce? So far I've only used MapReduce for counting, so I am not sure whether it can be used for other purposes.

I found a way which was kind of tricky due to the whole configuration.
After installing Hadoop following this link, I did the following:
1. Created a class called MongoUpdate with a run method where I set up all the configuration (such as the input and output URIs), create a job, and configure all the settings. Among those there is job.setMapperClass(MongoMapper.class).
2. Created MongoMapper, which has the map method that receives a BSONObject. Here I perform the if-else condition (see the sketch after this list) and at the very end I do:
Text id = new Text(pValue.get("_id").toString());
pContext.write(id, new BSONWritable(pValue));
3. Created a Main class whose main method simply instantiates MongoUpdate and calls its run method.
4. Exported the jar with all the libraries and ran on the terminal: hadoop jar NameOfTheJar.jar
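In case it is useful, here is a rough sketch of what the MongoMapper from step 2 might look like (the regular expression is a simplified placeholder, not the huge one mentioned in the question):
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;
import com.mongodb.hadoop.io.BSONWritable;

public class MongoMapper extends Mapper<Object, BSONObject, Text, BSONWritable> {

    // Simplified placeholder; the real check uses a much larger user-agent regex.
    private static final Pattern MOBILE =
            Pattern.compile("(?i).*(mobile|android|iphone|ipad|blackberry).*");

    @Override
    protected void map(Object pKey, BSONObject pValue, Context pContext)
            throws IOException, InterruptedException {
        String userAgent = String.valueOf(pValue.get("userAgent"));
        pValue.put("device", MOBILE.matcher(userAgent).matches() ? "mobile" : "desktop");

        // Emit the modified document keyed by its _id so it is written back to MongoDB.
        Text id = new Text(pValue.get("_id").toString());
        pContext.write(id, new BSONWritable(pValue));
    }
}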

Related

MongoListener + Spring Detect updated fields in Document

I have a Spring Boot application + MongoDB and I need to audit every update made to a collection on specified fields (for data analysis purposes).
If I have a collection like:
{
"_id" : ObjectId("12345678910"),
"label_1" : ObjectId("someIdForLabel1"),
"label_2" : ObjectId("someIdForLabel2"),
"label_3" : ObjectId("someIdForLabel"),
"name": "my data",
"description": "some curious stuff",
"updatedAt" : ISODate("2022-06-21T08:28:23.115Z")
}
I want to write an audit document whenever a label_* is updated. Something like
{
"_id" : ObjectId("111213141516"),
"modifiedDocument" : ObjectId("12345678910"),
"modifiedLabel" : "label_1",
"newValue" : ObjectId("someNewIdForLabel1"),
"updatedBy" : ObjectId("userId"),
"updatedAt" : ISODate("2022-06-21T08:31:20.315Z")
}
How can I achieve this with MongoListener? I already have two methods for AfterSave and AfterDelete, for other purposes, but they give me the whole new Document.
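For context, the existing listeners are roughly along these lines (a minimal sketch, assuming a plain AbstractMongoEventListener; the class name and logging are placeholders, not my actual code):
import org.springframework.data.mongodb.core.mapping.event.AbstractMongoEventListener;
import org.springframework.data.mongodb.core.mapping.event.AfterDeleteEvent;
import org.springframework.data.mongodb.core.mapping.event.AfterSaveEvent;
import org.springframework.stereotype.Component;

@Component
public class AuditListener extends AbstractMongoEventListener<Object> {

    @Override
    public void onAfterSave(AfterSaveEvent<Object> event) {
        // Only the whole, already-saved document is available here;
        // there is no built-in diff against the previous version.
        System.out.println("saved: " + event.getDocument());
    }

    @Override
    public void onAfterDelete(AfterDeleteEvent<Object> event) {
        System.out.println("deleted: " + event.getSource());
    }
}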
I would rather avoid querying the DB again or using findAndModify() in the first place.
I took a look at Change Streams too, but I have too many doubts when it comes to running more than one instance.
Thank you so much, any tip will be appreciated!

pymongo update creates a new record without upserting

I have an issue where I do an update on a document; however, the update creates a new document even though I'm not upserting in my update.
This is my testing code.
I do a find to check whether the document exists and "lastseen" has not been set yet:
result = DATA_Collection.find({"sessionID":"12345","lastseen":{"$exists":False}})
if result.count() == 1:
    DATA_Collection.update({"sessionID":"12345"},{"$set":{"lastseen":"2021-05-07"}})
When I do an aggregate check to find duplicates I get a few, one example below.
> db.DATA_Collection.find({ "sessionID" : "237a5fb8" })
{ "_id" : ObjectId("60bdf7b05c961b4d27d33bde"), "sessionID" : "237a5fb8", "firstseen" : ISODate("1970-01-19T20:51:09Z"), "lastseen" : ISODate("2021-06-07T12:34:20Z") }
{ "_id" : ObjectId("60bdf7fa7d35ea0f046a2514"), "sessionID" : "237a5fb8", "firstseen" : ISODate("1970-01-19T20:51:09Z") }
I remove all the records in the collection and rerun the script, the same happens again.
Any advice will be much appreciated.
Firstly, your pymongo commands are deprecated: use update_one() or update_many() instead of update(), and count_documents() instead of count().
Secondly, double-check that you are referencing the same collection, as you mention both DATA_Collection and VPN_DATA.
How are you defining a "duplicate"? Unless you create a unique index on the field(s), the records won't be duplicates as they have different _id fields.
You need something like:
record = db.VPN_DATA.find_one({'sessionID': '12345', 'lastseen': {'$exists': False}})
if record is not None:
    db.VPN_DATA.update_one({'_id': record.get('_id')}, {'$set': {'lastseen': '2021-05-07'}})

Querying objectId in Meteor

I recently started to load MongoDB using mongoimport and realized that it has added an ObjectId value for the "_id" field. When I query this using the "meteor mongo" command line it works fine:
meteor:PRIMARY> db.Warehouses.find({"_id":ObjectId("571b7a89a990b5b8779b1315")})
{ "_id" : ObjectId("571b7a89a990b5b8779b1315"), "name" : "Stephan Lumber", "street" : "23 East St", "city" : "Plano", "state" : "TX"}
meteor:PRIMARY>
My code can read the value of "_id" using console.log("id ", currentId); it returns ObjectID("571b7a89a990b5b8779b1315").
The variable currentId contains the currently selected warehouse ID.
However, when I try to use this to access the data in the code I keep getting "undefined" errors. I have tried many different ways. Here are a few:
warehouse = Warehouses.findOne({"_id":Mongo.ObjectID(currentId)});
warehouse = Warehouses.findOne({"_id":ObjectId(currentId)});
Also, for some reason "ObjectId" is not recognized in the latter.
I don't know what else to try. Any help would be appreciated.
You don't have to wrap it in anything like Mongo.ObjectID or ObjectId; just pass currentId directly:
warehouse = Warehouses.findOne({"_id": currentId});

Mongo DB Update to a sub array document

I have a structure
{
"_id" : ObjectId("562dfb4c595028c9r74fda67"),
"office_id" : "123456",
"employee" : [
{
"status" : "declined",
"personId" : "123456",
"updated" : NumberLong("1428407042401")
}
]
}
This office can have multiple persons. Is there a way to update the employee status for all the persons under that specific office_id to, say, "approved"? I am trying to do this through the plain MongoDB Java driver. What I am doing now is fetch all the documents for the office_id using a query builder, then iterate over the list and save each document. I am not satisfied with this iterative approach (fetch, iterate, and save). Please suggest if there is an alternative way.
You can update using the $ positional operator:
db.collection.update(
{
"office_id" : "123456",
"employee.status": "declined"
},
{
"$set": { "employee.$.status": "approved" }
}
);
The positional operator saves the index (0 in the case above) of the element from the array that matched the query. This means that if you knew the position of the element beforehand (which is nearly impossible in a real-life case), you could just change the update statement to: {"$set": {"employee.0.status": "approved"}}.
Please note that the $ positional operator (for now) updates the first matching array element ONLY; there is a JIRA ticket requesting support for updating all matching elements.
EDIT:
Using the Java driver, the above update may be done like so (untested):
// match documents for this office that still have a declined employee
BasicDBObject query = new BasicDBObject();
query.put("office_id", "123456");
query.put("employee.status", "declined");
// flip the matched array element's status via the $ positional operator
BasicDBObject update = new BasicDBObject();
update.put("employee.$.status", "approved");
BasicDBObject set = new BasicDBObject("$set", update);
collection.update(query, set);
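As a side note, newer MongoDB versions (3.6+) added the all-positional operator $[], which updates every matching array element in one statement. A rough sketch with the modern Java driver (database name and connection string are assumptions):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ApproveAllEmployees {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // assumed URI
            MongoCollection<Document> offices =
                    client.getDatabase("test").getCollection("offices"); // assumed names

            // $[] targets every element of the employee array,
            // so all persons under the office get status "approved".
            offices.updateMany(
                    Filters.eq("office_id", "123456"),
                    Updates.set("employee.$[].status", "approved"));
        }
    }
}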

Mongodb - combine data from two collections

I know this has been covered quite a lot on here; however, I'm very new to MongoDB and am struggling with applying the answers I've found to my situation.
In short, I have two collections: 'total_by_country_and_isrc', which is the output from a MapReduce function, and 'asset_report', which contains an asset_id that is not present in the 'total_by_country_and_isrc' collection or in the original raw data collection it was MapReduced from.
An example of the data in 'total_by_country_and_isrc' is:
{ "_id" : { "custom_id" : 4748532, "isrc" : "GBCEJ0100080",
"country" : "AE" }, "value" : 0 }
And an example of the data in the 'asset_report' is:
{ "_id" : ObjectId("51824ef016f3edbb14ef5eae"), "Asset ID" :
"A836656134476364", "Asset Type" : "Web", "Metadata Origination" :
"Unknown", "Custom ID" : "4748532", "ISRC" : "", }
I'd like to end up with the following ('total_by_country_and_isrc_with_asset_id'):
{ "_id" : { "Asset ID" : "A836656134476364", "custom_id" : 4748532,
"isrc" : "GBCEJ0100080", "country" : "AE" }, "value" : 0 }
I know how I would approach this in a relational database, but I really want to get this working in Mongo, as I'm dealing with some pretty large collections and feel Mongo is the right tool for the job.
Can anyone offer some guidance here?
I think you want to use the "reduce" output action: Output to a Collection with an Action. You'll need to regenerate total_by_country_and_isrc, because it doesn't look like asset_report has the fields it needs to generate the keys you already have in total_by_country_and_isrc – so "joining" the data is impossible.
First, write a map method that is capable of generating the same keys from the original collection (used to generate total_by_country_and_isrc) and also from the asset_report collection. Think of these keys as the "join" fields.
Next, map and reduce your original collection to create total_by_country_and_isrc with the correct keys.
Finally, map asset_report with the same method you used to generate total_by_country_and_isrc, and use a reduce function that merges (by key) this mapped data from asset_report with the data already in total_by_country_and_isrc.
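A rough sketch of that last step with the legacy Java driver (the database name is an assumption, and the key shape and value layout here are assumptions too, since they depend on how you regenerate total_by_country_and_isrc in the earlier steps):
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MapReduceCommand;
import com.mongodb.MongoClient;

public class MergeAssetIds {
    public static void main(String[] args) {
        MongoClient client = new MongoClient();   // assumes a local mongod
        DB db = client.getDB("mydb");             // hypothetical database name

        // Map asset_report onto the shared key (here: custom_id only) and fold its
        // Asset ID into the regenerated output collection via the "reduce" output action.
        String map = "function() {"
                   + "  emit({ custom_id: parseInt(this['Custom ID']) },"
                   + "       { 'Asset ID': this['Asset ID'] });"
                   + "}";
        String reduce = "function(key, values) {"
                      + "  var merged = {};"
                      + "  values.forEach(function(v) {"
                      + "    for (var f in v) { if (v[f] != null) merged[f] = v[f]; }"
                      + "  });"
                      + "  return merged;"
                      + "}";

        DBCollection assetReport = db.getCollection("asset_report");
        assetReport.mapReduce(map, reduce,
                "total_by_country_and_isrc_with_asset_id",  // target collection
                MapReduceCommand.OutputType.REDUCE,          // re-reduce into existing docs
                new BasicDBObject());                        // no query filter

        client.close();
    }
}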