I'm currently using MongoDB and I have a collection called Product. I have a requirement in the system that asks to increment the collection version whenever any change happens to the collection (e.g. add a new product, remove, change price, etc...).
Question: Is there a recommended approach to set versions for collections in MongoDB?
I was expecting to find something like that:
db.collection.Product.setVersion("1.0.0");
and the corresponding get method:
db.collection.Product.getVersion();
I'm not sure if it makes sense. Personally, I would love to have collection metadata provided as a native implementation from MongoDB. Is there any document database that does so?
MongoDB itself is completely "schemaless" and as such does not have any concept of its own for document "metadata" or the general "version management" you seem to be looking for. The implementation is entirely up to you, and documents store whatever you supply them with.
You could implement such a scheme, generally by wrapping methods to include such things as version management in updates. So on document creation you would do this:
db.collection.myinsert({ "field": 1, "other": 2 })
Which wraps a normal insert to do this:
db.collection.insert({ "field": 1, "other": 2, "__v": 0 })
Having that data any "updates" would need to provide a similar wrapper. So this:
db.collection.myupdate({ "field": 1 },{ "$set": { "other": 4 } })
Actually checks that the stored version matches the one held in memory, and "increments" the version at the same time via $inc:
db.collection.update(
{ "field": 1, "__v": 0 },
{
"$set": { "other": 4 },
"$inc": { "__v": 1 }
}
)
That means the document to be modified in the database needs to match the same "version" as what is in memory in order to update. Changing the version number means subsequent updates with stale data would not succeed.
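To make the wrapper concrete, here is a minimal in-memory sketch of the same check-and-increment logic in plain JavaScript (the store object and myupdate function are stand-ins for illustration, not real driver calls):

```javascript
// In-memory stand-in for a collection holding one versioned document.
const store = { field: 1, other: 2, __v: 0 };

// Update only if the caller's expected version matches the stored one,
// then bump the version, mimicking { "$set": ..., "$inc": { "__v": 1 } }.
function myupdate(doc, expectedVersion, changes) {
  if (doc.__v !== expectedVersion) return false; // stale write rejected
  Object.assign(doc, changes);
  doc.__v += 1;
  return true;
}

console.log(myupdate(store, 0, { other: 4 })); // first write succeeds
console.log(myupdate(store, 0, { other: 9 })); // stale version, rejected
console.log(store.other, store.__v);
```

The second call fails because the document's version has already moved on, which is exactly how the stale-data update above is prevented.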
Generally though, there are several Object Document Mapper or ODM implementations available for various languages that have the sort of functionality built in. You would probably be best off looking at the Drivers section of the documentation to find something suitable for your language implementation. Also a little extra reading up on MongoDB would help as well.
Related
I have a collection of documents in MongoDB, and a subset of those documents is indexed in Elasticsearch for search purposes. I am using a custom scoring function on ES, and the indexed JSON is mainly used for scoring. Once I have a sorted list of documents, what I am actually interested in is getting those full documents from MongoDB (so ES is used to return a list of IDs that I will then query on MongoDB using an { "_id": { "$in": [...] } } filter).
The problem is, the documents indexed on Elasticsearch may not be synced correctly, and when I get a list of results from elasticsearch, there are some documents that are undesirable (for example unpublished data, etc.)
So what I would like to do is "filter" this list of IDs according to conditions on their attributes: at least 6 attributes must have a specific value, which is always the same (i.e. one attribute must be non-null, another must be false, etc.). I was thinking I could achieve this using a partial index filter, but I cannot create a duplicate index using a different partial filter expression (otherwise, I would have just added another { _id: 1 } index with a partialFilterExpression: { "my_field": true, ... } that suits me).
What would be the best way to go about it ?
Concrete scenario
Assume those docs are indexed on MongoDB, with "published" / "hidden" attributes relevant to my search action
(I do not want to show documents that are either unpublished or hidden)
{ _id: "1...", "created_at": "2019-01-20", "published": true, "hidden": false}
{ _id: "2...", "created_at": "2019-02-20", "published": false, "hidden": false}
{ _id: "3...", "created_at": "2029-03-20", "published": true, "hidden": false}
{ _id: "4...", "created_at": "2029-03-20", "published": false, "hidden": false}
{ _id: "5...", "created_at": "2029-03-20", "published": false, "hidden": true}
When a user searches the data with our ES implementation,
Elasticsearch runs a scoring function and returns the scores of each document
(here the example assumes only document 1-3 are retrieved)
Because of out-of-sync issues, an unpublished document (2) could be returned
{ _id: "1...", "score": 1}
{ _id: "2...", "score": 2}
{ _id: "3...", "score": 3}
Now I want to filter this data again and retrieve from MongoDB only the documents which are published and not soft-deleted, i.e.
{ _id: "1...", "created_at": "2019-01-20", "published": true, "hidden": false}
{ _id: "3...", "created_at": "2029-03-20", "published": true, "hidden": false}
So I need a way to run queries to retrieve the documents that would exclude document 2 which should not be visible (and is out of sync with my elasticsearch results)
Is there a trick to do this, maybe using a partial index? The scenario here is simple because I just have "published": false, "hidden": true, but my conditions are actually a bit more complex, as mentioned above, and it would be a waste to retrieve all those documents and then filter them, instead of retrieving only the filtered documents by just reading a "filtered index" of IDs to see whether those IDs are present or not.
I took a similar approach in one of my own tasks.
I wanted to run a text search and display results to users according to relevance.
Earlier we were using a simple Mongo regular expression for this, but then we decided to do the same with Elasticsearch.
However, in my case the result is also based on other factors and attributes (which are just some fields and user data).
Here is my approach, (I will explain it with a dummy example).
Consider an article system, where a user has his own written articles and can write more. If he wants, he can publish them too; if not published, they are in the DRAFT state, and once activated an article is in the ACTIVATED state. He can also read articles provided by the system owners, and a user can share his articles with other users.
So a typical DB design will look like..
{
articleType: Custom or System,
status: DRAFT, DEACTIVATED, ACTIVATED,
ownerId: unique id of the owner,
sharedWith: [unique ids of shared with users],
title: title of the article,
articleText: text description of the article,
tags: some tags related to article
}
and some other fields that are useful.
Basically, when articles are searched, we have to show a result if the user is the owner of the article, or it was shared with him by some other user, or it is a system article in the ACTIVATED state, based on the search text matched against the tags, description, and title fields of the document.
So, according to the Mongo query, I first exported all the necessary fields to the Elasticsearch index.
Then I prepared the exact same query in Elasticsearch as we were using for the Mongo search,
i.e. my Mongo query and Elasticsearch query were exactly the same.
I removed stop-words from the search text.
I performed a search on ElasticSearch only if there is a search text, otherwise, I used mongo results only.
we also have pagination implemented in here.
I created a text index on the mongo collection.
The strategy is: whenever there is an error from Elasticsearch, the cluster is down, or there is a timeout issue, I perform the text query on Mongo instead.
When a user searches articles with some text, I performed the search on Elasticsearch, with limit and skip as well, and fetched the _id and index of each document.
When I received the result from Elasticsearch, I converted the _ids to ObjectId. I already had a Mongo query, so I added this _ids array (from Elasticsearch) into the query,
like:
{$and: [{_id: {$in : _idsFromES}} , ...remaining query]}
I got exactly the same result from Mongo as well, and the results were satisfactory and exactly what I wanted.
I also handled delete, insert, and update of articles along with this, so the same changes are applied to the Elasticsearch index too.
That way the two were always in sync.
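The id-merging step above can be simulated in plain JavaScript (an in-memory stand-in for the {$and: [{_id: {$in: _idsFromES}}, ...]} query; the documents and statuses below are made up for illustration):

```javascript
// In-memory stand-in for the articles collection.
const articles = [
  { _id: "a1", status: "ACTIVATED", title: "first" },
  { _id: "a2", status: "DRAFT", title: "second" },
  { _id: "a3", status: "ACTIVATED", title: "third" },
];

// Ids returned by the Elasticsearch text search, in relevance order.
const _idsFromES = ["a3", "a2", "a1"];

// Equivalent of {$and: [{_id: {$in: _idsFromES}}, {status: "ACTIVATED"}]}:
// keep only documents ES returned that also satisfy the remaining Mongo conditions.
const matched = articles.filter(
  (a) => _idsFromES.includes(a._id) && a.status === "ACTIVATED"
);

console.log(matched.map((a) => a._id));
```

This mirrors why the Mongo query acts as the source of truth: any document ES returns that no longer satisfies the remaining conditions simply drops out of the final result.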
I hope this helps you find the best way to carry out your task. :-)
Currently I am working on a mobile app. Basically, people can post their photos and their followers can like the photos, like Instagram. I use MongoDB as the database. As on Instagram, there might be a lot of likes for a single photo, so using a separate document for each "like", with an index, seems unreasonable because it would waste a lot of memory. However, I'd like a user to be able to add a like quickly. So my question is: how do I model the "like"? Basically the data model is much like Instagram's, but using MongoDB.
No matter how you structure your overall document, there are basically two things you need: a property for a "count", and a "list" of those who have already posted their "like", in order to ensure no duplicates are submitted. Here's a basic structure:
{
    "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
    "photo": "imagename.png",
    "likeCount": 0,
    "likes": []
}
Whatever the case, there is a unique "_id" for your "photo post" and whatever information you want, but then the other fields as mentioned. The "likes" property here is an array, which is going to hold the unique "_id" values of the "user" objects in your system. So every "user" has their own unique identifier somewhere, whether in local storage or OpenID or something, but a unique identifier. I'll stick with ObjectId for the example.
When someone submits a "like" to a post, you want to issue the following update statement:
db.photos.update(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
"likes": { "$ne": ObjectId("54bb2244a3a0f26f885be2a4") }
},
{
"$inc": { "likeCount": 1 },
"$push": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
}
)
Now the $inc operation there will increase the value of "likeCount" by the number specified, so increase by 1. The $push operation adds the unique identifier for the user to the array in the document for future reference.
The important thing here is to keep a record of those users who voted, and what is happening in the "query" part of the statement. Apart from selecting the document to update by its own unique "_id", the other important thing is to check the "likes" array to make sure the current voting user is not already in there.
The same is true for the reverse case or "removing" the "like":
db.photos.update(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
"likes": ObjectId("54bb2244a3a0f26f885be2a4")
},
{
"$inc": { "likeCount": -1 },
"$pull": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
}
)
The important thing here is that the query conditions ensure no document is touched unless all conditions are met. So the count does not increase if the user has already voted, and does not decrease if their vote was not actually present at the time of the update.
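To illustrate the guarded updates, here is a minimal in-memory sketch in plain JavaScript (the like and unlike functions are stand-ins for the two update statements above, not driver calls):

```javascript
// In-memory photo document, matching the structure shown earlier.
const photo = { _id: "p1", likeCount: 0, likes: [] };

// Mirrors the $ne guard plus $inc/$push: only count a like if not already present.
function like(doc, userId) {
  if (doc.likes.includes(userId)) return false; // condition fails, no-op
  doc.likes.push(userId);
  doc.likeCount += 1;
  return true;
}

// Mirrors the array-membership guard plus $inc/$pull.
function unlike(doc, userId) {
  const i = doc.likes.indexOf(userId);
  if (i === -1) return false; // vote not present, no-op
  doc.likes.splice(i, 1);
  doc.likeCount -= 1;
  return true;
}

console.log(like(photo, "u1"));   // first like counts
console.log(like(photo, "u1"));   // duplicate is rejected
console.log(unlike(photo, "u1")); // removal succeeds
console.log(photo.likeCount);     // back where it started
```

The duplicate call is a no-op for the same reason the duplicate update above matches no document: the guard condition fails before any modification happens.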
Of course, it is not practical to read back an array with a couple of hundred entries in other parts of your application. But MongoDB has a very standard way to handle that as well:
db.photos.find(
{
"_id": ObjectId("54bb201aa3a0f26f885be2a3"),
},
{
"photo": 1
"likeCount": 1,
"likes": {
"$elemMatch": { "$eq": ObjectId("54bb2244a3a0f26f885be2a4") }
}
}
)
This usage of $elemMatch in projection will return only the current user if they are present, or just an empty array if they are not. This lets the rest of your application logic know whether the current user has already placed a vote.
That is the basic technique, and it may work for you as-is, but you should be aware that embedded arrays should not be extended indefinitely, and there is also a hard 16MB limit on BSON documents. So the concept is sound, but it cannot be used on its own if you are expecting thousands of "like votes" on your content. There is a concept known as "bucketing", discussed in some detail in this example of Hybrid Schema design, which offers one solution for storing a high volume of "likes". You can look at that, along with the basic concepts here, as a way to do this at volume.
I am designing a generic notification subscription system where user can specify a compound rule at the time of subscription in terms of MongoDB query, or more generally, json query. The subscription data is stored in MongoDB collection. For example,
{ "userId": 1, "rule": {"p1": "a"} }
{ "userId": 2, "rule": {"p1": "a", "p2": "b"} }
{ "userId": 3, "rule": {"p3": {$gt: 3} } }
Later, when an event arrives in the form of a JSON object such as the following, I want to find all the user rules the event matches:
{"p1": "a", "p3": 4}
The above event should match rules specified by userId 1 and 3 in the example. The event object doesn't have to be stored in MongoDB.
Although I could probably meet the requirement by writing a loop at the application layer, for efficiency I really want to implement it at the DB layer, preferably allowing distributed (sharded) execution due to volume and latency requirements.
Is this achievable? Any help is appreciated. In fact, I am open to other NoSQL DBs, as long as they support a dynamic event schema and there is a way to specify compound rules.
What you are trying to achieve is not possible, at least in MongoDB.
If you reason about how a query engine works, you will realize that this does not have a straightforward solution.
In high-level terms, the engine generates a condition object from your query, which is then evaluated against each document in the set, producing a boolean that determines whether the document belongs to the result set.
In your case you want to go the other way round: generate a condition object from each document, and then apply it to something (e.g. an object) that you supply.
Even if that were possible, the cost of doing it in the DB would be too high, as it would require compiling and executing an expression function for each object, and there would be no way to optimize the execution of the query.
It is more reasonable to actually do that outside the database, where you could have the expression functions already created.
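Doing the matching outside the database could look like the following minimal sketch (plain JavaScript; it only understands equality and $gt, and the data is taken from the question's example):

```javascript
// Stored subscriptions, as in the question.
const rules = [
  { userId: 1, rule: { p1: "a" } },
  { userId: 2, rule: { p1: "a", p2: "b" } },
  { userId: 3, rule: { p3: { $gt: 3 } } },
];

// Evaluate one rule clause against an event value.
function clauseMatches(cond, value) {
  if (cond !== null && typeof cond === "object" && "$gt" in cond) {
    return value > cond.$gt; // only $gt is supported in this sketch
  }
  return value === cond; // plain equality otherwise
}

// A rule matches when every one of its clauses matches the event.
function ruleMatches(rule, event) {
  return Object.keys(rule).every((k) => clauseMatches(rule[k], event[k]));
}

const event = { p1: "a", p3: 4 };
const matched = rules
  .filter((r) => ruleMatches(r.rule, event))
  .map((r) => r.userId);

console.log(matched); // userIds whose rules the event satisfies
```

Here the "expression functions" exist once in application code rather than being recompiled per document, which is the efficiency point made above.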
You can't store "Comparison Query Operators" in a Mongo database, but you can do this:
{ "userId": 1, "rule": {"p1": "a"} }
{ "userId": 2, "rule": {"p1": "a", "p2": "b"} }
{ "userId": 3, "rule": {"p3": {"value": 3, "operator":"gt"} } }
You store the value AND the OPERATOR (in string form), and then you can run a query like this:
db.test.find({"rule.p3.operator":"gt", "rule.p3.value":{$lt:4}})
Notice that if your stored "operator" is "gt", you must use $lt (the opposite comparison operator) in the query.
Your complete example looks something like this:
db.test.find({$or:[{"rule.p3.operator":"gt", "rule.p3.value":{$lt:4}}, {"rule.p1":"a"}]})
This query matches userId 1 and 3, as you want.
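As a plain JavaScript sketch of why the operator inversion works (an in-memory stand-in for the find above; the second rule is an extra made-up example):

```javascript
// Stored rules in "value plus operator" form, as suggested above.
const storedRules = [
  { userId: 3, rule: { p3: { value: 3, operator: "gt" } } },
  { userId: 4, rule: { p3: { value: 9, operator: "gt" } } },
];

const eventP3 = 4;

// A stored rule "p3 gt threshold" matches when eventP3 > threshold,
// which is the same condition as threshold < eventP3 -- hence the
// query flips $gt into $lt against the stored value.
const matched = storedRules
  .filter((r) => r.rule.p3.operator === "gt" && r.rule.p3.value < eventP3)
  .map((r) => r.userId);

console.log(matched);
```

Only the rule with threshold 3 matches the event value 4; the threshold-9 rule correctly fails.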
Update: the following solution doesn't work. The problem with MongoDB is that it doesn't use Node.js to run map-reduce JavaScript, nor does it support any package manager, so it's hard to use 3rd-party libraries.
My own proposed solution, which hasn't been confirmed :
Compose the query conforming to json-query syntax
Upon arrival of the event, call MongoDB's mapReduce function on the user-rules collection to invoke jsonQuery in the mapper
var jsonQuery = require('json-query')
var mapper = function(evt) {
    function map() {
        if (jsonQuery(this.rule, { data: evt })) {
            emit(this.userId, 1);
        }
    }
    return map;
};
db.userRules.mapReduce(mapper(evt), ...);
The reason to compose the query in json-query syntax rather than MongoDB query syntax is that only json-query offers a jsonQuery method that tries to match one rule against one object. For the above code to meet the requirements in the question, the following assumptions have to hold:
MongoDB can execute mapReduce on distributed nodes
In mapReduce I can use an external library such as json-query, which implies the library code has to be distributed to all MongoDB nodes, perhaps as part of a closure.
I have a collection of Parents that contain EmbeddedThings, and each EmbeddedThing contains a reference to the User that created it.
UserCollection: [
{
_id: ObjectId(…),
name: '…'
},
…
]
ParentCollection: [
{
_id: ObjectId(…),
EmbeddedThings: [
{
_id: 1,
userId: ObjectId(…)
},
{
_id: 2,
userId: ObjectId(…)
}
]
},
…
]
I soon realized that I need to get all EmbeddedThings for a given user, which I managed to accomplish using map/reduce:
"results": [
{
"_id": 1,
"value": [ `EmbeddedThing`, `EmbeddedThing`, … ]
},
{
"_id": 2,
"value": [ `EmbeddedThing`, `EmbeddedThing`, … ]
},
…
]
Is this where I should just normalize EmbeddedThing into its own collection, or should I keep using map/reduce to accomplish this? Some other design, perhaps?
If it helps, this is for users to see their list of EmbeddedThings across all Parents, as opposed to some reporting/aggregation task (which made me realize I might be doing this wrong).
Thanks!
"To embed or not to embed: that is the question" :)
My rules are:
embed if the embedded object makes sense only in the context of its parent object. For example, an OrderItem without an Order doesn't make sense.
embed if dictated by performance requirements. It's very cheap to read a full document tree (as opposed to having to make several queries and join them programmatically).
You should look at your access patterns. If you load a Parent several thousand times per second, but load a User's list once a week, then map-reduce is probably a good choice. The user query will be slow, but that might be OK for your application.
Yet another approach is to denormalize even more: when you add an embedded thing, add it to both the parent and the user.
Pros: queries are fast.
Cons: complicated code; double the amount of writes; potential loss of sync (you update/delete in one place but forget to do it in the other).
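As an illustration of the double-write approach, here is a minimal in-memory sketch (plain JavaScript; addEmbeddedThing is a hypothetical helper, not an existing API):

```javascript
// In-memory stand-ins for the two collections.
const parents = { p1: { _id: "p1", EmbeddedThings: [] } };
const users = { u1: { _id: "u1", EmbeddedThings: [] } };

// Write the same embedded thing to both places. Both writes must be
// kept in sync, which is exactly the "loss of sync" risk noted above.
function addEmbeddedThing(parentId, userId, thing) {
  const embedded = { ...thing, userId };
  parents[parentId].EmbeddedThings.push(embedded);
  users[userId].EmbeddedThings.push(embedded);
}

addEmbeddedThing("p1", "u1", { _id: 1 });

// The user's list is now readable without touching Parents at all,
// which is where the fast-query "pro" comes from.
console.log(users.u1.EmbeddedThings.length);
console.log(parents.p1.EmbeddedThings.length);
```

Every delete and update needs the same dual treatment, which is the "complicated code, double writes" trade-off listed above.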