When doing an upsert to MongoDb is it possible to set a field with a timestamp only if other data in the record has changed? - mongodb

We need to cache records for a service with a terrible API.
This service provides us with API to query for data about our employees, but does not inform us whether employees are new or have been updated. Nor can we filter our queries to them for this information.
Our proposed solution to the problems this creates for us is to periodically (e.g. every 15 minutes) query all our employee data and upsert it into a Mongo database. Then, when we write to the MongoDb, we would like to include an additional property which indicates whether the record is new or whether the record has any changes since the last time it was upserted (obviously not including the field we are using for the timestamp).
The idea is, instead of querying the source directly, which we can't filter by such timestamps, we would instead query our cache which would include said timestamp and use it for a filter.
(Ideally, we'd like to write this in C# using the MongoDb driver, but more important right now is whether we can do this in an upsert call or whether we'd need to load all the records into memory, do comparisons, and then add the timestamps before upserting them....)

There might be a way of doing that, but how efficient that is, still needs to be seen. The update command in MongoDB can take an aggregation pipeline to perform an update operation. We can use the $addFields stage of MongoDB to add a new field denoting the update status, and we can use $function to compute its value. A short example is:
db.collection.update({
key: 1
},
[
{
"$addFields": {
changed: {
"$function": {
lang: "js",
"args": [
"$$ROOT",
{
"key": 1,
data: "somedata"
}
],
"body": "function(originalDoc, newDoc) { return JSON.stringify(originalDoc) !== JSON.stringify(newDoc) }"
}
}
}
}
],
{
upsert: true
})
Here's the playground link.
Some points to consider here, are:
If the order of fields in the old and new versions of the doc is not the same then JSON.stringify will fail.
The function specified in $function will run on the server-side, so ideally it needs to be lightweight. If there is a large number of users, that will get upserted, then it may or may not act as a bottleneck.

Related

Inserting multiple key value pair data under single _id in cloudant db at various timings?

My requirement is to get json pair from mqtt subscriber at different timings under single_id in cloudant, but I'm facing error while trying to insert new json pair in existing _id, it simply replace old one. I need at least 10 json pair under one _id. Injecting at different timings.
First, you should make sure about your architectural decision to update a particular document multiple times. In general, this is discouraged, though it depends on your application. Instead, you could consider a way to insert each new piece of information as a separate document and then use a map-reduce view to reflect the state of your application.
For example (I'm going to assume that you have multiple "devices", each with some kind of unique identifier, that need to add data to a cloudant DB)
PUT
{
"info_a":"data a",
"device_id":123
}
{
"info_b":"data b",
"device_id":123
}
{
"info_a":"message a"
"device_id":1234
}
Then you'll need a map function like
_design/device/_view/state
{
function (doc) {
emit(doc.device_id, 1);
}
Then you can GET the results of that view to see all of the "info_X" data that is associated with the particular device.
GET account.cloudant.com/databasename/_design/device/_view/state
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1},
{"id":"eaa710a5fa1ff4ba6156c997ddf6099b","key":1234,"value":1}
]}
Then you can use the query parameters to control the output, for example
GET account.cloudant.com/databasename/_design/device/_view/state?key=123&include_docs=true
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1,"doc":
{"_id":"28324b34907981ba972937f53113ac3f",
"_rev":"1-bac5dd92a502cb984ea4db65eb41feec",
"info_b":"data b",
"device_id":123}
},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1,"doc":
{"_id":"d50553d206d722b960fb176f11841974",
"_rev":"1-a2a6fea8704dfc0a0d26c3a7500ccc10",
"info_a":"data a",
"device_id":123}}
]}
And now you have the complete state for device_id:123.
Timing
Another issue is the rate at which you're updating your documents.
Bottom line recommendation is that if you are only updating the document once per ~minute or less frequently, then it could be reasonable for your application to update a single document. That is, you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database. You must make sure that your are providing the most recent _rev of that document and you should also check for conflicts that could occur if the document is being updated by multiple devices.
If you are acquiring new data for a particular device at a high rate, you'll likely run into conflicts very frequently -- because cloudant is a distributed document store. In this case, you should follow something like the example I gave above.
Example flow for the second approach outlined by #gadamcox for use cases where document updates are not required very frequently:
[...] you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database.
Your application first fetches the existing document by id: (https://docs.cloudant.com/document.html#read)
GET /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1"]
}
Then your application updates the document in memory
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1","2"]
}
and saves it in the database by specifying the _id and _rev (https://docs.cloudant.com/document.html#update)
PUT /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No":["1","2"]
}

MongoDB query subset of array of DBRefs

My main object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
...
"intervalAbsenceDates" : [
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a06560fd7047318463b")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184467")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184468")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184469")),
An embedded ScheduleIntervalContainer object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318463b"),
"end" : ISODate("2022-08-23T07:06:00Z"),
"available" : true,
"confirmation" : true,
"start" : ISODate("2022-08-19T09:33:00Z")
}
Now I will query all ScheduleIntervalContainers where start and end is in a range.
I have tried a lot but I not even can query one ScheduleIntervalContainer by id.
This is my approach:
db.InstitutionUserConnection.find( {
"intervalAbsenceDates" : {
"$ref" : "ScheduleIntervalContainer",
"$id" : ObjectId("56eb0a05560fd7047318446d")
}
})
Could anyone give me a hint how to query all ScheduleIntervalContainers which have start and end in a time range.
Using DBRef is fraught with problems and the usage is somewhat "antiquated" as it was more of a "shoehorn" solution to requests to provide a mechanism for providing references to external collection data than a really well thought solution.
The general recommendation is to not use DBRef and rather simply use a plain ObjectId or other unique identifier and resolve the actual "collection" or even "database container" with other information, being either another standard property stored in the document or just a plain external reference in your code which identifies the target collection.
The Basic BSON Problem
A good reason for this is that people often make the common mistake ( just like you have ) of interpretting the "serialized" output from a stored object containing a DBRef to actually be "properties" present on the object. This is not true, as in fact the object has it's own BSON type, just like Date and ObjectId. The only time that serialized form is valid is when used with a "strict mode" JSON parser, which would yeild the actual DBRef object.
The manual itself is not particularly helpful with that fact and says "DBRefs have the following fields". This is misleading since those properties are not actually available as fields for query, and would not be exposed other than object inspection available to JavaScript processing of $where or mapReduce. And not a good approach to use either of those for "query" purposes.
So those properties are not available for query, but you can of course specifiy the BSON form directly from your API. Such as:
db.InstitutionUserConnection.find({
"intervalAbsenceDates": DBRef(
"ScheduleIntervalContainer",
ObjectId("56eb0a05560fd70473184469")
)
})
Which resolves and matches correctly since the correct BSON form was sent in a query and the element will actually match.
From this principle we can then move on to the the concept of "ranges" as asked in the question.
MongoDB does not "really" do Joins
Until recent releases ( MongoDB 3.2.x series ) the statement was actually "MongoDB does not do joins", and this has been a general distinction in design philosophy away from relational databases.
The general mantra "has" been that "joins are costly" and therefore do not scale well in distributed data systems such as what MongoDB is principally designed for.
So if you are asking for referencing a property in the document which the DBRef refers to, then you are basically out of luck. The only possible action for filtering results based on an external property in this case would be:
Look at all the data in the master collection, either loaded to query in whole or processed individually
For each retrieved document, lookup and expand the DBRef values to their target collection data.
Filter out documents whose expanded data from external references do not actually meet the conditions.
This means all the "expansion" and "filtering" must take place on the "client" to the database rather than on the server itself. There simply is no mechanism to do this, so you end up pulling a lot of data over the network connection.
With a DBRef in place this is purely not possible to perform on the server even with modern releases. Again it's the same BSON type problem, since the "source" contains DBRef types and the "target" contains ObjectId types.
If however you can simply live with the fact that your "range" is looking at the "creation date" data that would be inherently present in any ObjectId, then there is of course another approach that does not involve a "join".
Filtering on ObjectId "range"
Every ObjectId starts with 4-bytes that represents the current timestamp value ( excluding milliseconds ) at the time the ObjectId was created. This is generally a good indicator of the time of "insertion" for the document in question where it was used.
From this you can determine that the "created date" of the documents refernced with DBRef is approximately equal to that portion of the ObjectId value used by that document in the target collection. This allows you to basically construct a "range" value for ObjectId values that would fall between the given range. As a consequence, you can then construct DBRef BSON objects that would work with range operators:
// Define start and end dates to query
var dateStart = new Date("2016-03-17T19:48:21Z"), // equal to 56eb0a05
dateEnd = new Date("2016-03-17T19:48:25Z"); // equal to 56eb0a09
// Convert to hex and pad to ObjectId length
var startRange = new ObjectId(
( dateStart.valueOf() / 1000 ).toString(16) + "0000000000000000"
),
// Yields ObjectId("56eb0a050000000000000000")
endRange = new ObjectId(
( dateEnd.valueOf() / 1000 ).toString(16) + "ffffffffffffffff"
);
// Yields ObjectId("56eb0a09ffffffffffffffff")
// Now query with contructed DBRef values
db.InstitutionUserConnection.find({
"intervalAbsenceDates": {
"$elemMatch": {
"$gte": DBRef("ScheduleIntervalContainer",startRange),
"$lt": DBRef("ScheduleIntervalContainer",endRange),
}
}
})
So as long as "created" is what you are looking for, then that method should suffice for selecting the matching parents without first expanding the DBRef values in the array for further inpection.
The Reverse Case Lookup
Of course the other case here is to simply query the "joined" collection first and then look for documents in the "master" collection that contain the ObjectId values within the DBRef. This does of course mean multiple queries to be issued, but it does cure the case of expanding every DBRef just to match the related properties:
// Create array of matching DBRef values
var refs = db.ScheduleIntervalContainer.find({
"start" { "$lte": targetDate },
"end": { "$gte": targetDate }
}).map(function(doc) {
return DBRef("ScheduleIntervalContainer",doc._id)
});
// Find documents that match the DBRef's within the array
db.InstitutionUserConnection.find({
"intervalAbsenceDates": { "$in": refs }
})
The practicallity of this varies on the number of matches from the related collection resulting in the array that would be passed to $in, but it does actually yield the desired result.
Actually doing Joins
I mention earlier that "modern" MongoDB releases now have an approach to "joining" data from different collections. This is the $lookup aggregation pipeline operator.
But while this can be used to "join" data, the current usage of DBRef does not work here. I did also mention earlier that the basic problem is the data in the array is a DBRef, but the data in the referenced collection is instead an ObjectId.
So if you wanted to use a $lookup approach, then you would first need to use plain ObjectId values in place of the existing DBRef values:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
"intervalAbsenceDates" : [
ObjectId("56eb0a06560fd7047318463b"),
ObjectId("56eb0a05560fd70473184467"),
ObjectId("56eb0a05560fd70473184468"),
ObjectId("56eb0a05560fd70473184469")
]
}
With data in that structure you could then use $lookup and other aggregation pipeline methods to just return the documents that actually match the value of a related property. I.e "end" within the related object:
db.InstitutionUserConnection.aggregate([
// Presently you need to unwind the array first
{ "$unwind": "$intervalAbsenceDates" },
// Then $lookup to get a resulting array of matches for each member
{ "$lookup": {
"from": "ScheduleIntervalContainer",
"localField": "intervalAbsenceDates",
"foreignField": "_id",
"as": "absenceDates"
}},
// unwind the array result field as well
{ "$unwind": "$absenceDates" },
// Now reform the documents
{ "$group": {
"_id": "$_id",
"intervalAbsenceDates": { "$push": "$absenceDates" }
}},
// Then query on the "end" property for the range
{ "$match": {
"intervalAbsenceDates": {
"$elemMatch": {
"end": {
"$gte": new Date("2016-03-23"),
"$lt": new Date("2016-03-24")
}
}
}
}}
])
The current behaviour of $lookup is that you cannot directly process on an array property in the document, so the procedure shown in "$lookup on ObjectId's in an array" is used to replace the current array with the expanded objects from the other collection.
Once the operations here actually produce a document that now has that related data embedded, it's a straightforward process of looking at the properties of the documents within the array to see if they match the query conditions.
Conclusion
This all should show that DBRef is not a good idea for storing references. Whist it is possible as demonstrated to work around the problem using the ObjectId values, you generally want to have a plain ObjectId or other key value as the reference and resolve them by other means. And even if the workaround is sufficient, it works just the same with plain ObjectId values or anything else that presents a natural range.
When it comes to using the "values" of refernced properties in such a "join", then of course regardless of using DBRef or other value, it is not a possibilty without $lookup to use that in query conditions on the server. All data would first need to be loaded to the client and then resolved with additional queries to the database before those properties can be inspected for filtering.
Since the mechanism of $lookup will in fact result in a form that will look exactly like if you "embedded" the data in the first place, then "embedding" is most often the correct approach, since the data is already present in the source collection and available for query.
There is a lot of "scare media" around regarding the BSON limit of 16MB and saying this is why you keep data in another collection. Sometimes this does indeed apply, but most of the time it does not. Afterall, 16MB is really quite a large amount of data, and more than most would actually use in general applications.
Citation from MongoDB: The Definitive Guide
To give you an idea of how much 16MB is, the entire text of War and Peace is just 3.14MB.
To examine for queries to need to get to an "embedded form" anyway, and it is arguable that if you can store an array of DBRef or ObjectId or whatever as embedded data, then storing all of the content they are actually pointing to is not really that much more of a stretch.
The general lesson is that you should be designing based on the actual usage patterns your applcation applies to the data. If you are querying "related data" all of the time, then it makes most sense to keep that data all in the one collection. Of course other factors apply, but always keep in mind the trade-off in performance considerations by what you are doing.

How to create an index in MongoDB which calls a JS function via system.js?

I have two collections viz. whitelist (id, count, expiry) and blacklist (id).
Now i would like to create an index such that when count>=200 then call a JS function which will remove the document from whitelist and add the id to blacklist.
So can i do this in Mongo using db.collection.createindex({"count":1}, ???);
or do i need to write a daemon to scan the entire collection? or is there any better method for the same?
You seem to be asking for what in a SQL relational database we would call a "trigger", which is something completely different from an "index" even in that world.
In the NoSQL world typically and especially with MongoDB, that sort of "server logic" is relegated to the "client" code operations rather than the server. Think of it as another part of the "scalability" philosphy of these products, where certain functions like "triggers" are taken away due to the stance that these "cost" a lot with distributed data.
So in order to do what you want you do it in "code" instead of defining a database "trigger". The process is simple enough, via .findAndModify() and other wrapping variants available to langauge API's:
// Increment below 200 and return the modified document
var doc = db.whitelist.findAndModify({
"query": { "_id": myId, "count": { "$lt": 200 } }
"update": { "count": { "$inc": 1 } },
"new": true
});
// Then remove the blacklist where the value meets conditions
if ( doc.hasOwnProperty("count") {
if ( doc.count >= 200 )
db.blacklist.remove({ "_id": myId });
}
Be careful with the actual language API method variant as the structure typically differs fromt the "query/update" keys as is provided in the shell method.
The basic principles remain the same. Modifiy and fetch, then remove from the other collection if your conditions are met. But it is "two" trips to the server, and there is no way to make the server "trigger" when such a condition is met by itself.
db.whitelist.insert(doc);
if(db.whitelist.find(criterion).count() >= 200) {
var bulkRemove = db.whitelist.initializeUnorderedBulkOp();
var bulkInsert = db.blacklist.initializeUnorderedBulkOp();
db.whitelist.find(criterion).forEach(
function(doc){
bulkInsert.insert({_id:doc._id});
bulkRemove.find({doc._id}).removeOne();
}
);
bulkInsert.execute();
bulkRemove.execute();
}
First, you insert the document as usual. Since criterion is going to use an index, the if clause should be determined fast and efficiently.
In case we have 200 or more documents matching that criterion, we use bulk operations to insert the ids into the blacklist and remove the documents from the whitelist, which will be executed in parallel.
The problem with only writing the _id to the backlist is that you need to check wether the criterion for being blacklisted is matched, so the _id needs to contain that criterion.
A better solution IMHO is to flag entries of a single collection using a field named blacklisted for individual entries or to use the aggregation framework to find blacklisted documents and write them to an a collection using the out pipeline stage. Sadly, you didn't give example data or a proper description of your use case, so you get an unspecified answer.

How to overwrite object Id's in Mongo db while creating an App in Sails

I am new to Sails and Mongo Db. Currently I am trying to implement a CRUD Function using Sails where I want to save user details in Mongo db.In the model I have the following attributes
"id":{
type:'Integer',
min:100,
autoincrement:true
},
attributes: {
name:{
type:'String',
required:true,
unique:true
},
email_id:{
type:'EMAIL',
required:false,
unique:false
},
age:{
type:'Integer',
required:false,
unique:false
}
}
I want to ensure that the _id is overridden with my values starting from 100 and is auto incremented with each new entry. I am using the waterline model and when I call the Api in DHC, I get the following output
"name": "abc"
"age": 30
"email_id": "abc#gmail.com"
"id": "5587bb76ce83508409db1e57"
Here the Id given is the object Id.Can somebody tell me how to override the object id with an Integer starting from 100 and is auto incremented with every new value.
Attention: Mongo id should be unique as possible in order to scale well. The default ObjectId is consist of a timestamp, machine ID, process ID and a random incrementing value. Leaving it with only the latter would make it collision prone.
However, sometimes you badly want to prettify the never-ending ObjectID value (i.e. to be shown in the URL after encoding). Then, you should consider using an appropriate atomic increment strategy.
Overriding the _id example:
db.testSOF.insert({_id:"myUniqueValue", a:1, b:1})
Making an Auto-Incrementing Sequence:
Use Counters Collection: Basically a separated collection which keeps track the last number of the sequence. Personally, I have found it more cohesive to store the findAndModify function in the system.js collection, although it lacks version control's capabilities.
Optimistic Loop
Edit:
I've found an issue in which the owner of sails-mongo said:
MongoDb doesn't have an auto incrementing attribute because it doesn't
support it without doing some kind of manual sequence increment on a
separate collection or document. We don't currently do this in the
adapter but it could be added in the future or if someone wants to
submit a PR. We do something similar for sails-disk and sails-redis to
get support for autoIncremeting fields.
He mentions the first technique I added in this answer:
Use Counters Collection. In the same issue, lewins shows a workaround.

Mongodb findAndModify atomicity

I want to know how to reference the returned document attributes
from find and use it within modify. E.x. :
var totalNoOfSubjects = 5;
db.people.findAndModify( {
query: { name: "Tom", state: "active", rating: { $gt: 10 } },
sort: { rating: 1 },
update: { $set: { average: <reference score value returned by find>/totalNoOfSubjects} }
} );
My understanding is that findAndModify locks the document, hence I want to perform the update in the modify using the attributes found in the find. This will make the operation
atomic.
I am wondering if this is supported by mongo.
No, you cannot refer to values in the found document during the update portion of a findAndModify. It's the same as update in this respect.
As such, you cannot do this atomically as you need to first fetch the document and then craft the update or findAndMondify to contain the value computed from your fetched doc.
See https://jira.mongodb.org/browse/SERVER-458 for one way this may be addressed in the future.
Atomicity is exactly the reason for findAndModify.
As stated in the docs, Mongo will find one or more documents (matching the query specified) modify one document (using the update specified). The whole process is atomic. Default implementation has Mongo returning the found document (in its unchanged state). This can be modified using the new option.