Optimize MongoDB document versioning - mongodb

In my application I need to load a lot of data, compare it with the existing documents in a specific collection, and version them.
To do this, for every new document I have to insert, I run a query that looks up the last version using a specific key (not _id): it groups the data by that key and picks the latest version.
Example of data:
{
"_id" : ObjectId("5c73a643f9bc1c2fg4ca6ef5"),
"data" : {
the data
},
"key" : {
"value1" : "545454344",
"value2" : "123212321",
"value3" : "123123211"
},
"version" : NumberLong("1"),
}
As you can see, key is composed of three values related to the data, and my query to find the last version is the following:
db.collection.aggregate(
[
{
"$sort" : {
"version" : NumberInt("-1")
}
},
{
"$group" : {
"_id" : "$key",
"content" : {
"$push" : "$data"
},
"version" : {
"$push" : "$version"
},
"_oid" : {
"$push" : "$_id"
}
}
},
{
"$project" : {
"data" : {
"$arrayElemAt" : [
"$content",
NumberInt("0")
]
},
"version" : {
"$arrayElemAt" : [
"$version",
NumberInt("0")
]
},
"_id" : {
"$arrayElemAt" : [
"$_oid",
NumberInt("0")
]
}
}
}
]
)
To improve performance (from exponential to linear), I built an index that holds key and version:
db.getCollection("collection").createIndex({ "key": 1, "version" : 1})
So my question is: are there any other capabilities/strategies to optimize this search?
Notes
in this collection there are some other fields that I already use to filter data with $match, omitted for brevity
my prerequisite is to load a lot of data and process the documents one by one before inserting them: if there is a better approach to calculating the version, I can consider changing this too
I'm not sure whether a unique index on key could do the same job as my query. I mean, if I create a unique index on key and version, I get uniqueness on that pair and can iterate on it, for example:
no data in the collection: just insert the first version
insert a new document: try to insert version 1, get an error, and iterate on the version; this should hit the unique index, right?
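The unique-index idea in the last note can be sketched roughly as follows. This is only an illustration, assuming the official Node.js driver, a unique index on the key and version pair, and versions that come back as plain numbers; the function name insertNextVersion and the maxAttempts parameter are made up for the example.

```javascript
// Sketch only: assumes the official Node.js driver and a unique index
// created with createIndex({ key: 1, version: 1 }, { unique: true }).
async function insertNextVersion(coll, key, data, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Read the highest version currently stored for this key.
    const [last] = await coll.find({ key }).sort({ version: -1 }).limit(1).toArray();
    const version = last ? last.version + 1 : 1;
    try {
      await coll.insertOne({ key, version, data });
      return version;
    } catch (err) {
      // 11000 is the duplicate-key error: another writer took this version, so retry.
      if (err.code !== 11000) throw err;
    }
  }
  throw new Error("no free version found after " + maxAttempts + " attempts");
}
```

The lookup is a sort on version restricted to one key, which the existing { key: 1, version: 1 } index should be able to serve without a $group over the whole collection.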

I had a similar situation and this is how I solved it.
Create a separate collection that holds each key and its corresponding latest version, say KeyVersionCollection
Make this collection "InMemory" for faster response
Store the key in the "_id" field
When inserting a document into your versioned collection, say EntityVersionedCollection:
Query the latest version from KeyVersionCollection
Increment the version number by 1, or insert a new document with version 0, in KeyVersionCollection
You can even combine the above two operations into one (https://docs.mongodb.com/manual/reference/method/db.collection.findAndModify/#db.collection.findAndModify)
Use the new version number to insert the document into EntityVersionedCollection
This saves the time spent on aggregation and sorting. On a side note, I would keep the latest versions in a separate collection, EntityCollection. In that case, for each entity, insert a new version into EntityVersionedCollection and upsert it into EntityCollection.
In corner cases, where the process is interrupted between getting the new version number and using it to insert the entity, you might see a skipped version in EntityVersionedCollection, but that should be OK. Use timestamps to track inserts/updates so that they can be used to correlate/audit in the future.
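The combined "read and increment" step could look roughly like this with the Node.js driver. It is a sketch under assumptions: counters stands for a handle to KeyVersionCollection, the field name version and both function names are illustrative, and the first version comes out as 1 rather than 0 because $inc on an upserted document starts from zero.

```javascript
// Builds the arguments for one atomic "increment or create" call.
function nextVersionRequest(key) {
  return {
    filter: { _id: key },
    update: { $inc: { version: 1 } },
    // upsert creates the counter on first use; "after" returns the incremented document.
    options: { upsert: true, returnDocument: "after" },
  };
}

// `counters` is assumed to be a driver collection handle (hypothetical name).
async function nextVersion(counters, key) {
  const { filter, update, options } = nextVersionRequest(key);
  const res = await counters.findOneAndUpdate(filter, update, options);
  // Newer driver versions return the document itself; older ones wrap it in { value }.
  const doc = res && res.value !== undefined ? res.value : res;
  return doc.version;
}
```

If the key is an embedded document used as _id, as in the question, its fields would need to be written in the same order every time, since embedded documents are compared field by field in order.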
Hope that helps.

You can simply pass an array to the MongoDB insert function, and it should insert the entire JSON payload without any memory problems.
You're welcome

Related

find() return the latest value only on MongoDB

I have this collection in MongoDB that contains the following entries. I'm using Robo3T to run the query.
{
"_id" : ObjectId("xxx1"),
"Evaluation Date" : "2021-09-09",
"Results" : [
{
"Name" : "ABCD",
"Version" : "3.2.x"
}
]
},
{
"_id" : ObjectId("xxx2"),
"Evaluation Date" : "2022-09-09",
"Results" : [
{
"Name" : "ABxD",
"Version" : "5.2.x"
}
]
}
The collection contains multiple entries in a similar format. Now, I need to extract the latest value for "Version".
Expected output:
5.2.x
Measures I've taken so far:
(1) I've tried findOne(), and while I was able to extract the value of "Version" with db.getCollection('TestCollectionName').findOne().Results[0].Version
...only the oldest entry was returned.
3.2.x
(2) Using find().sort().limit() as below returns the entire document for the latest entry, not just the value I wanted: db.getCollection('TestCollectionName').find({}).sort({"Results.Version":-1}).limit(1)
Results below:
"_id" : ObjectId("xxx2"),
"Evaluation Date" : "2022-09-09",
"Results" : [
{
"Name" : "ABxD",
"Version" : "5.2.x"
}
]
(3) I've tried to use sort() and limit() together with findOne(), but I've read that findOne may be deprecated and is also not compatible with sort, so this resulted in an error.
(4) Finally, if I try to use sort and limit on find like this: db.getCollection('LD_exit_Evaluation_Result_MFC525').find({"Results.New"}).sort({_id:-1}).limit(1) I get an unexpected token error.
What would be a good approach for this?
Did I simply misplace or remove a bracket, or do I need to reorder the syntax?
Thanks in advance.

I'm not sure if I understood correctly, but maybe this is what you are looking for:
db.collection.aggregate([
{
"$project": {
lastResult: {
"$last": "$Results"
},
},
},
{
"$project": {
version: "$lastResult.Version",
_id: 0
}
}
])
It uses aggregate with a few operators: the first $project computes a new field called lastResult holding the last element of each Results array, using the $last operator. The second $project just cleans up the output. If you need the _id for reference, remove _id: 0 or change its value to 1.
You can check how it works here: https://mongoplayground.net/p/jwqulFtCh6b
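The pipeline above yields one version per document. If only the value from the most recent evaluation is wanted, a sort and a limit can go in front of the projection. A minimal sketch, assuming "Evaluation Date" is always a yyyy-mm-dd string (so string order matches date order) and that the last element of Results is the one of interest; the variable name is made up:

```javascript
const latestVersionPipeline = [
  // Newest evaluation first; ISO-style date strings sort chronologically.
  { $sort: { "Evaluation Date": -1 } },
  { $limit: 1 },
  // "$Results.Version" resolves to the array of Version values; index -1 takes the last one.
  { $project: { _id: 0, version: { $arrayElemAt: ["$Results.Version", -1] } } },
];

// In the shell this would be run as:
// db.getCollection('TestCollectionName').aggregate(latestVersionPipeline)
```

$arrayElemAt with a negative index is used here instead of $last only because it is available on older server versions.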
Hope I helped

Add object to object array if an object property is not given yet

Use Case
I've got a collection band_profiles and I've got a collection band_profiles_history. The history collection is supposed to store a band_profile snapshot every 24 hours, and therefore I am using MongoDB's recommended format for historical tracking: each month+year is its own document, and in an object array I store the bandProfile snapshot along with the current day of the month.
My models:
A document in band_profiles_history looks like this:
{
"_id" : ObjectId("599e3bc406955db4cbffe0a8"),
"month" : 7,
"tag_lowercased" : "9yq88gg",
"year" : 2017,
"values" : [
{
"_id" : ObjectId("599e3bc41c073a7418fead91"),
"profile" : {
"_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
"tag" : "9YQ88GG",
"name_normalized" : "example name1",
},
"day" : 1
},
{
"_id" : ObjectId("599e3bc41c073a7418fead91"),
"profile" : {
"_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
"tag" : "9YQ88GG",
"name_normalized" : "new name",
},
"day" : 2
}
]
}
And a document in band_profiles:
{
"_id" : ObjectId("5989a6190f39d9fd70cddeb1"),
"tag" : "9V9LRGU",
"name_normalized" : "example name",
"tag_lowercased" : "9v9lrgu",
}
This is how I upsert my documents into band_profiles_history at the moment:
BandProfileHistory.update(
{ tag_lowercased: tag, year, month},
{ $push: {
values: { day, profile }
}
},
{ upsert: true }
)
My problem:
I only want to insert ONE snapshot for every day. Right now it always pushes a new object into the object array values, regardless of whether I already have an object for that day. How can I make it push that object only if there is no object for the current day yet?
Putting mongoose aside for a moment:
There is an operation, $addToSet, that adds an element to an array only if it doesn't already exist.
Caveat:
If the value is a document, MongoDB determines that the document is a duplicate if an existing document in the array matches the to-be-added document exactly; i.e. the existing document has the exact same fields and values and the fields are in the same order. As such, field order matters and you cannot specify that MongoDB compare only a subset of the fields in the document to determine whether the document is a duplicate of an existing array element.
Since you are trying to add an entire document you are subjected to this restriction.
So I see the following solutions for you:
Solution 1:
Read in the array, see if it contains the element you want, and if not, push it to the values array with $push.
This has the disadvantage of NOT being an atomic operation, meaning that you could end up with duplicates anyway. This could be acceptable if you ran a periodic clean-up job to remove duplicates from this field on each document.
It's up to you to decide if this is acceptable.
Solution 2:
Assuming you are putting the field _id in the subdocuments of your values field, stop doing it. Assuming mongoose is doing this for you (because it does, from what I understand) stop it from doing it like it says here: Stop mongoose from creating _id for subdocument in arrays.
Next you need to ensure that the fields in the document always have the same order, because order matters when comparing documents in the addToSet operation as stated in the citation above.
Solution 3
Change the schema of your band_profiles_history to something like:
{
"_id" : ObjectId("599e3bc406955db4cbffe0a8"),
"month" : 7,
"tag_lowercased" : "9yq88gg",
"year" : 2017,
"values" : {
"1": { "_id" : ObjectId("599e3bc41c073a7418fead91"),
"profile" : {
"_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
"tag" : "9YQ88GG",
"name_normalized" : "example name1"
}
},
"2": {
"_id" : ObjectId("599e3bc41c073a7418fead91"),
"profile" : {
"_id" : ObjectId("5989a65d0f39d9fd70cde1fe"),
"tag" : "9YQ88GG",
"name_normalized" : "new name"
}
}
}
}
Notice that the day field became the key for the subdocuments on the values. Notice also that values is now an Object instead of an Array.
Now you can run an update query that would update values.<day> only if values.<day> didn't exist.
Personally I don't like this as it is using the fact that JSON doesn't allow duplicate keys to support the schema.
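For Solution 3, the conditional update could be built along these lines. It is a sketch, not a tested migration: the function name is hypothetical, and the field names are taken from the documents in the question.

```javascript
// Builds the filter and update for "set values.<day> only if it is not there yet".
function snapshotOncePerDay(tag, year, month, day, profile) {
  const path = `values.${day}`;
  return {
    // The document matches only while no entry exists for this day.
    filter: { tag_lowercased: tag, year, month, [path]: { $exists: false } },
    update: { $set: { [path]: { profile } } },
  };
}

// Usage (mongoose or driver), e.g.:
// const { filter, update } = snapshotOncePerDay("9yq88gg", 2017, 7, 2, profile);
// BandProfileHistory.updateOne(filter, update)
```

One thing to keep in mind: combined with upsert: true, a filter that fails because the day already exists would insert a second document for the same month, unless a unique index on tag_lowercased, year and month turns that into a duplicate-key error.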
First of all, sadly MongoDB does not support uniqueness of a field inside an array within a document. You can see there is a major bug that has been open for 7 years and is still not closed (that is a shame in my opinion).
What you can do from here is limited, and all of it is at the application level. I had the same problem and solved it at the application level. Do something like this:
First read your document by its _id and values.day.
If the read in step 1 returns null, there is no record in the values array for the given day, so you can push the new value (I assume band_profile_history already has a record with that _id value).
If the read in step 1 returns a document, the values array already has a record for the given day. In that case you can use a $set operation with the positional $ operator.
As others said, this is not atomic, but since you are handling the problem at the application level, you can make the whole block of code synchronized. Of the three queries below, only two run against MongoDB for any given call:
db.getCollection('band_profiles_history').find({"_id": "1", "values.day": 3})
if returns null:
db.getCollection('band_profiles_history').update({"_id": "1"}, {$push: {"values": {<your new band profile history for given day>}}})
if returns not null:
db.getCollection('band_profiles_history').update({"_id": "1", "values.day": 3}, {$set: {"values.$": {<your new band profile history for given day>}}})
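A different technique from the read-then-write steps above, shown here only as a sketch: the existence check can be moved into the update filter itself, so that the push happens in a single operation. It assumes the month document already exists (no upsert); the function name is made up.

```javascript
// Builds a single conditional update: push only when no element has this day.
function pushOncePerDay(id, day, profile) {
  return {
    // $ne on an array field matches only if NO element has values.day equal to day.
    filter: { _id: id, "values.day": { $ne: day } },
    update: { $push: { values: { day, profile } } },
  };
}

// Usage, e.g. with the driver:
// const { filter, update } = pushOncePerDay("1", 3, profile);
// const res = await coll.updateOne(filter, update);
// res.modifiedCount === 0 then means a snapshot for that day was already present.
```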
To check whether the field is missing:
{ field: {$exists: false} }
or, if it is an array, whether it is empty:
{ field: {$eq: []} }
Mongoose also supports field: {type: Date}, so you can use that instead of counting days and do updates only for the current date.

Casbah MongoDB, how to both add and remove values to an array in a single operation, to multiple documents?

After searching, I was unable to figure out how to perform multiple updates to a single field.
I have a document with a "tags" array field. Every document will have random tags before I begin the update. In a single operation, I want to add some tags and remove some tags.
The following update operator returns an error "Invalid modifier specified: $and"
updateOperators: { "$and" : [
{ "$addToSet" : { "tags" : { "$each" : [ "tag_1" , "tag_2"]}}},
{ "$pullAll" : { "tags" : [ "tag_2", "tag_3"]}}]}
collection.update(query, updateOperators, multi=true)
How do I both add and remove values to an array in a single operation, to multiple documents?
You don't need the $and in the update document, but you cannot apply two operators to the same field in a single update, as you would see if you tried the following in the shell:
db.test.update({}, { "$addToSet" : { "tags" : { "$each" : [ "tag_1" , "tag_2"]}},
"$pullAll" : { "tags" : [ "tag_2", "tag_3"] }}, true, false)
You would get a Cannot update 'tags' and 'tags' at the same time error message. So how do you achieve this? With this schema you need to do it in multiple operations; you could use the new bulk operation API as shown below (shell):
var bulk = db.coll.initializeOrderedBulkOp();
bulk.find({ "tags": 1 }).updateOne({ "$addToSet": { "tags": { "$each" : [ "tag_1" , "tag_2"] } } });
bulk.find({ "tags": 1 }).updateOne({ "$pullAll": { "tags": [ "tag_2", "tag_3"] } });
bulk.execute();
Or in Casbah with the dsl helpers:
val bulk = collection.initializeOrderedBulkOperation
bulk.find(MongoDBObject("tags" -> 1)).updateOne($addToSet("tags") $each("tag_1", "tag_2"))
bulk.find(MongoDBObject("tags" -> 1)).updateOne($pullAll("tags" -> ("tag_2", "tag_3")))
bulk.execute()
It's not atomic and there is no guarantee that nothing else will try to modify the documents in between, but it is as close as you will currently get.
Mongo does atomic updates so you could just construct the tags you want in the array and then replace the entire array.
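Building the final array on the client and replacing it could be sketched like this. The helper names are invented for the example, and the guard on the previously read tags is an added assumption so that a concurrent change is not silently overwritten; it works per document, not as a multi-update.

```javascript
// Computes the tag list after adding and then removing, without duplicates.
function retag(tags, toAdd, toRemove) {
  const remove = new Set(toRemove);
  const merged = new Set([...tags, ...toAdd]); // a Set drops duplicates, like $addToSet
  return [...merged].filter((t) => !remove.has(t));
}

// Builds a replace-the-array update that only applies if tags is unchanged since the read.
function retagRequest(doc, toAdd, toRemove) {
  return {
    filter: { _id: doc._id, tags: doc.tags }, // exact array match, order included
    update: { $set: { tags: retag(doc.tags, toAdd, toRemove) } },
  };
}

console.log(retag(["a", "tag_3"], ["tag_1", "tag_2"], ["tag_2", "tag_3"])); // [ 'a', 'tag_1' ]
```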
I would advise against using an array to store all these values together, as this is an "unbounded" array of tags. Unbounded arrays cause documents to move on disk, which causes indexes to be updated and makes the OS and mongo do extra work.
Instead you should store each tag as a separate document in a different collection and "bucket" them based on the _id of the related document.
Example
{_id : <_id> <key> <value>} - single document
This will allow you to query for all the tags for a single user with db.collection.find({_id : /^<_id>/}) and bucket the results.

MongoDB - Count and group on key

I'm quite new to MongoDb and I'm trying to get my head around the grouping/counting functions. I want to retrieve a list of votes per track ID, ordered by their votes. Can you help me make sense of this?
List schema:
{
"title" : "Bit of everything",
"tracks" : [
ObjectId("5310af6518668c271d52aa8d"),
ObjectId("53122ffdc974dd3c48c4b74e"),
ObjectId("53123045c974dd3c48c4b74f"),
ObjectId("5312309cc974dd3c48c4b750")
],
"votes" : [
{
"track_id" : "5310af6518668c271d52aa8d",
"user_id" : "5312551c92d49d6119481c88"
},
{
"track_id" : "53122ffdc974dd3c48c4b74e",
"user_id" : "5310f488000e4aa02abcec8e"
},
{
"track_id" : "53123045c974dd3c48c4b74f",
"user_id" : "5312551c92d49d6119481c88"
}
]
}
I'm looking to generate the following result (ideally ordered by the number of votes, also ideally including those with no entries in the votes array, using tracks as a reference.):
[
{
track_id: 5310af6518668c271d52aa8d,
track_votes: 1
},
{
track_id: 5312309cc974dd3c48c4b750,
track_votes: 0
}
...
]
In MySQL, I would execute the following
SELECT COUNT(*) AS track_votes, track_id FROM list_votes GROUP BY track_id ORDER BY track_votes DESC
I've been looking into the documentation for the various aggregation/reduce functions, but I just can't seem to get anything working for my particular case.
Can anyone help me get started with this query? Once I know how these are structured I'll be able to apply that knowledge elsewhere!
I'm using mongodb version 2.0.4 on Ubuntu 12, so I don't think I have access to the aggregation functions provided in later releases. Would be willing to upgrade, if that's the general consensus.
Many thanks in advance!
I recommend upgrading your MongoDB version to 2.2 and using the Aggregation Framework as follows:
db.collection.aggregate(
{ $unwind:"$votes"},
{ $group : { _id : "$votes.track_id", number : { $sum : 1 } } }
)
MongoDB 2.2 introduced a new aggregation framework, modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.
The output will look like this:
{
"result":[
{
"_id" : "5310af6518668c271d52aa8d", <- ObjectId
"number" : 2
},
...
],
"ok" : 1
}
If upgrading is not possible, I recommend doing it programmatically.
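Doing it programmatically could look like the sketch below, which counts on the client after fetching one list document. It assumes the entries in tracks stringify to the same hex strings stored in votes.track_id (true for plain strings; the ObjectId string form depends on the driver or shell); the function name is made up. Unlike the pipeline above, it also reports tracks with zero votes and orders the result.

```javascript
function countTrackVotes(list) {
  // Start every known track at zero so tracks without votes still appear.
  const counts = new Map(list.tracks.map((id) => [String(id), 0]));
  for (const vote of list.votes) {
    counts.set(vote.track_id, (counts.get(vote.track_id) || 0) + 1);
  }
  // Highest vote count first.
  return [...counts]
    .map(([track_id, track_votes]) => ({ track_id, track_votes }))
    .sort((a, b) => b.track_votes - a.track_votes);
}
```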

Retrieving only the relevant part of a stored document

I'm a newbie with MongoDB, and am trying to store user activity performed on a site. My data is currently structured as:
{ "_id" : ObjectId("4decfb0fc7c6ff7ff77d615e"),
"activity" : [
{
"action" : "added",
"item_name" : "iPhone",
"item_id" : 6140,
},
{
"action" : "added",
"item_name" : "iPad",
"item_id" : 7220,
}
],
"name" : "Smith",
"user_id" : 2
}
If I want to retrieve, for example, all the activity concerning item_id 7220, I would use a query like:
db.find( { "activity.item_id" : 7220 } );
However, this seems to return the entire document, including the record for item 6140.
Can anyone suggest how this might be done correctly? I'm not sure if it's a problem with my query, or with the structure of the data itself.
Many thanks.
You have to wait for the following feature to be developed: https://jira.mongodb.org/browse/SERVER-828
You can use $slice only if you know the insertion order and the position of your element.
Standard queries on MongoDB always return the whole document.
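On newer servers the aggregation framework offers a way around this. The following is a sketch of one option, not what was available when this answer was written: it assumes a server with the $filter operator (MongoDB 3.2 or later), and the function name is made up.

```javascript
// Builds a pipeline that keeps only the activity entries for one item.
function activityForItem(itemId) {
  return [
    // Skip users who never touched the item.
    { $match: { "activity.item_id": itemId } },
    {
      $project: {
        name: 1,
        user_id: 1,
        // Keep only the array elements whose item_id matches.
        activity: {
          $filter: {
            input: "$activity",
            as: "a",
            cond: { $eq: ["$$a.item_id", itemId] },
          },
        },
      },
    },
  ];
}

// Usage in the shell, with a hypothetical collection name:
// db.users.aggregate(activityForItem(7220))
```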
(question also available here: MongoDB query to return only embedded document)