How to model a "likes" voting system with MongoDB

Currently I am working on a mobile app. Basically, people can post their photos and their followers can like the photos, like Instagram. I use MongoDB as the database. As on Instagram, there might be a lot of likes for a single photo, so using a separate indexed document for each "like" does not seem reasonable because it would waste a lot of memory. However, I'd like a user to be able to add a like quickly. So my question is: how do I model the "like"? Basically the data model is very similar to Instagram's, but using MongoDB.

No matter how you structure your overall document, there are basically two things you need: a property for the "count", and a "list" of those who have already posted their "like", in order to ensure no duplicates are submitted. Here's a basic structure:
{
    "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
    "photo": "imagename.png",
    "likeCount": 0,
    "likes": []
}
Whatever the case, there is a unique "_id" for your "photo post" and whatever other information you want, plus the two fields mentioned. The "likes" property here is an array, and it is going to hold the unique "_id" values of the "user" objects in your system. So every "user" has their own unique identifier somewhere, whether in local storage, OpenID, or something else. I'll stick with ObjectId for the example.
When someone submits a "like" to a post, you want to issue the following update statement:
db.photos.update(
    {
        "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
        "likes": { "$ne": ObjectId("54bb2244a3a0f26f885be2a4") }
    },
    {
        "$inc": { "likeCount": 1 },
        "$push": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
    }
)
The $inc operation there increases the value of "likeCount" by the number specified, here 1. The $push operation adds the user's unique identifier to the array in the document for future reference.
The important thing here is to keep a record of those users who voted, and to note what is happening in the "query" part of the statement. Apart from selecting the document to update by its own unique "_id", the other important thing is to check the "likes" array to make sure the current voting user is not in there already.
The same is true for the reverse case or "removing" the "like":
db.photos.update(
    {
        "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
        "likes": ObjectId("54bb2244a3a0f26f885be2a4")
    },
    {
        "$inc": { "likeCount": -1 },
        "$pull": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
    }
)
The important thing here is the query conditions, which make sure that no document is touched unless all conditions are met. So the count does not increase if the user had already voted, and does not decrease if their vote was not actually present at the time of the update.
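To see why those query conditions keep the count consistent, here is a plain JavaScript model (not MongoDB code, just an in-memory sketch) of the same like/unlike semantics, where the "query" check and the modification happen together:

```javascript
// In-memory model of the conditional updates above: the modification
// only applies when the query condition matches, exactly once.

function like(photo, userId) {
  // Mirrors { "likes": { "$ne": userId } }: skip if already present.
  if (photo.likes.includes(userId)) return false;
  photo.likeCount += 1;     // $inc: { likeCount: 1 }
  photo.likes.push(userId); // $push: { likes: userId }
  return true;
}

function unlike(photo, userId) {
  // Mirrors { "likes": userId }: skip if the vote is not there.
  const idx = photo.likes.indexOf(userId);
  if (idx === -1) return false;
  photo.likeCount -= 1;       // $inc: { likeCount: -1 }
  photo.likes.splice(idx, 1); // $pull: { likes: userId }
  return true;
}

const photo = { photo: "imagename.png", likeCount: 0, likes: [] };
like(photo, "user1");
like(photo, "user1");   // duplicate like: no effect
unlike(photo, "user1");
unlike(photo, "user1"); // vote already gone: no effect
console.log(photo.likeCount); // 0
```

In MongoDB the check and the modification are a single atomic operation on the document, which is what makes this safe under concurrent requests.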
Of course it is not practical to read back an array with a couple of hundred entries in other parts of your application. But MongoDB has a very standard way to handle that as well:
db.photos.find(
    {
        "_id": ObjectId("54bb201aa3a0f26f885be2a3")
    },
    {
        "photo": 1,
        "likeCount": 1,
        "likes": {
            "$elemMatch": { "$eq": ObjectId("54bb2244a3a0f26f885be2a4") }
        }
    }
)
This usage of $elemMatch in projection returns only the current user in the "likes" array if they are present, or an empty array if they are not. That allows the rest of your application logic to know whether the current user has already placed a vote.
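The effect of that projection can be imitated client-side to show the shape of the returned document; this helper (a hypothetical name, not a MongoDB API) is just an illustration:

```javascript
// Approximates the $elemMatch projection: the photo document comes
// back with "likes" reduced to just the matching user id, or empty.
function projectLike(doc, userId) {
  return {
    _id: doc._id,
    photo: doc.photo,
    likeCount: doc.likeCount,
    likes: doc.likes.includes(userId) ? [userId] : []
  };
}

const doc = { _id: 1, photo: "imagename.png", likeCount: 3,
              likes: ["u1", "u2", "u3"] };
projectLike(doc, "u2").likes; // ["u2"]: this user has voted
projectLike(doc, "u9").likes; // []: this user has not voted
```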
That is the basic technique, and it may work for you as is, but you should be aware that embedded arrays should not be extended indefinitely, and there is also a hard 16MB limit on BSON documents. So the concept is sound, but it cannot be used on its own if you are expecting thousands of "like votes" on your content. There is a concept known as "bucketing", discussed in some detail in this example of Hybrid Schema design, which offers one solution for storing a high volume of "likes". You can look at that to use along with the basic concepts here as a way to do this at volume.
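As a rough illustration of the "bucketing" idea (the names and bucket size here are illustrative, not taken from the linked article), likes are spread across fixed-size bucket documents keyed to the photo, so no single document grows without bound:

```javascript
// Sketch of "bucketing": each bucket document holds at most
// BUCKET_SIZE likes for a photo; a new bucket is started when the
// current one fills up. Duplicate checking is omitted for brevity.
const BUCKET_SIZE = 1000;

function addLike(buckets, photoId, userId) {
  let bucket = buckets.find(b => b.photoId === photoId &&
                                 b.likes.length < BUCKET_SIZE);
  if (!bucket) {
    bucket = { photoId: photoId, likes: [] };
    buckets.push(bucket);
  }
  bucket.likes.push(userId);
}

const buckets = [];
for (let i = 0; i < 2500; i++) addLike(buckets, "p1", "user" + i);
// 2500 likes end up in three buckets of 1000 / 1000 / 500.
```

In a real schema each bucket would be its own document (keeping every document well under the 16MB limit), and the running total would still live on the photo document as a counter.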

Related

How to filter through possibly infinitely nested data NoSQL?

I'm new to NoSQL, so I might be wrong in my thinking process, but I am trying to figure out how to filter through a possibly infinitely nested object (a comment, replies to the comment, replies to replies to the comment). I am using MongoDB, but it probably applies to other NoSQL databases too.
This is the structure I wanted to use:
Post
{
    "name": "name",
    "comments": [
        {
            "id": "someid",
            "author": "author",
            "replies": [
                {
                    "id": "someid",
                    "author": "author",
                    "replies": [
                        {
                            ...
                        }
                    ]
                },
                {
                    "id": "someid",
                    "author": "author",
                    "replies": null
                }
            ]
        }
    ]
}
As you can see, replies can be infinitely nested. (Well, unless I set a limit, which doesn't sound that stupid.)
But now if a user wants to edit or delete a comment, I have to filter through them to find the right one, and I can't find any better way than to loop through all of them, which would be very slow with a lot of comments.
I was thinking of creating an ID for each comment that would help in finding it (something inspired by a hashmap, but not exactly). It could include the depth (how deeply nested the comment is), and then I would only filter through comments at at least that depth, but that would only improve performance slightly and only in specific cases; in the worst case I would still have to loop through all of them. The ID could also include the indexes of comments and replies, but that is limited, since an ID can't be infinite and replies can.
I couldn't find any MongoDB query for that.
Is there any solution / algorithm to do it more efficiently?

Filtering MongoDB results after Elasticsearch returns document IDs

I have a collection of documents in MongoDB, and a subset of those documents is indexed in Elasticsearch for search purposes. I am using a custom scoring function on ES, and the indexed JSON is mainly used for scoring. Once I have a sorted list of documents, what I am actually interested in is getting the full documents from MongoDB (so ES is used to return a list of IDs that I then query on MongoDB using an { "_id": { "$in": [...] } } filter).
The problem is, the documents indexed in Elasticsearch may not be synced correctly, and when I get a list of results from Elasticsearch, some of the documents are undesirable (for example, unpublished data, etc.).
So what I would like to do is "filter" this list of IDs according to conditions on their attributes: at least 6 attributes must have a specific value, which is always the same, i.e. one attribute must be non-null, another one false, etc. I was thinking I could achieve this using a partial index filter, but I cannot create a duplicate index using a different partial filter expression (otherwise, I would have just added another { _id: 1 } index with a partialFilterExpression: { "my_field": true, ... } that suits me).
What would be the best way to go about it?
Concrete scenario
Assume these docs are indexed in MongoDB, with "published" / "hidden" attributes relevant to my search action
(I do not want to show documents that are either unpublished or hidden)
{ _id: "1...", "created_at": "2019-01-20", "published": true, "hidden": false}
{ _id: "2...", "created_at": "2019-02-20", "published": false, "hidden": false}
{ _id: "3...", "created_at": "2029-03-20", "published": true, "hidden": false}
{ _id: "4...", "created_at": "2029-03-20", "published": false, "hidden": false}
{ _id: "5...", "created_at": "2029-03-20", "published": false, "hidden": true}
When a user searches the data with our ES implementation,
Elasticsearch runs a scoring function and returns the scores of each document
(here the example assumes only documents 1-3 are retrieved)
Because of out-of-sync issues, an unpublished document (2) could be returned
{ _id: "1...", "score": 1}
{ _id: "2...", "score": 2}
{ _id: "3...", "score": 3}
Now I want to filter this data again and retrieve from MongoDB the documents which are published and not soft-deleted, i.e.
{ _id: "1...", "created_at": "2019-01-20", "published": true, "hidden": false}
{ _id: "3...", "created_at": "2029-03-20", "published": true, "hidden": false}
So I need a way to run queries that retrieve the documents while excluding document 2, which should not be visible (and is out of sync with my Elasticsearch results).
Is there a trick to do this, maybe using a partial index? Here the scenario is simple because I have just "published": false, "hidden": true, but my conditions are actually much more complex, as mentioned above, and it would be a waste to retrieve all those documents and then filter them, instead of only retrieving the filtered documents by just reading a "filtered index" of IDs to see whether those IDs are there or not.
I took a similar kind of approach in a task of mine.
I wanted to carry out a text search and display results to users according to the relevance of each result.
Earlier we were using a simple Mongo regular expression to do this, but then we decided to do the same with Elasticsearch.
However, in my case the result is also based on other factors and attributes (which are just some fields and user data).
Here is my approach (I will explain it with a dummy example).
Consider an article system, where a user has his own written articles and can write more. If he wants, he can publish them too; if not published, they are in a DRAFT state, and if activated, the article is in an ACTIVATED state. He can also read some articles provided by the system owners, and a user can share his articles with others.
So a typical DB design will look like:
{
    articleType: Custom or System,
    status: DRAFT, DEACTIVATED, ACTIVATED,
    ownerId: unique id of the owner,
    sharedWith: [unique ids of shared-with users],
    title: title of the article,
    articleText: text description of the article,
    tags: some tags related to the article
}
and some other fields that are useful.
Basically, when articles are searched, we have to show an article as a result if the user is its owner, or it is shared with him by some other user, or it is a system article in the ACTIVATED state, based on the text in the tags, description, and title fields of the document.
So, following the Mongo query, I first exported all the necessary fields to the Elasticsearch index.
Then I prepared the exact same query in Elasticsearch as we were using for search with Mongo,
i.e. my Mongo query and Elasticsearch query were exactly the same.
I removed stop-words from the search text.
I performed a search on Elasticsearch only if there was search text; otherwise, I used Mongo results only.
We also have pagination implemented here.
I created a text index on the Mongo collection.
The strategy is: whenever there is an error from Elasticsearch, or the cluster is down, or there is a timeout issue, I perform a text query on Mongo instead.
When a user searches articles with some text, I perform a search on Elasticsearch, also with limit and skip, and fetch the _id and index of each document.
When I received the result from Elasticsearch, I converted the _ids to ObjectId. I already had a Mongo query, so I added this _ids array (from Elasticsearch) into the query,
like:
{ $and: [ { _id: { $in: _idsFromES } }, ...remaining query ] }
and I get exactly the same result from Mongo as well.
The results were satisfactory and exactly what I wanted.
I also handled delete, insert, and update of articles, so the same changes are applied to the Elasticsearch index too.
That way the two were always up to date.
I hope this helps you find the best way to carry out your task. :-)
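Applied to the question's concrete scenario, the combined query can be built programmatically. A minimal sketch, where the field names come from the question's example and the builder function is hypothetical:

```javascript
// Build a MongoDB filter that restricts results to the ids returned
// by Elasticsearch AND re-checks visibility conditions in Mongo, so
// out-of-sync documents (e.g. unpublished ones) are excluded.
function buildFilter(idsFromES) {
  return {
    $and: [
      { _id: { $in: idsFromES } },
      { published: true },
      { hidden: false }
    ]
  };
}

const filter = buildFilter(["1...", "2...", "3..."]);
// Passed to e.g. db.collection.find(filter): document 2, which is
// unpublished, is dropped even though Elasticsearch returned its id.
```

Since `_id` is always indexed, the `$in` clause narrows the candidates first and the extra conditions are cheap to check on that small set.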

What is the proper way to check if a user has liked a thread when he GET the thread? [duplicate]


MongoDB: how to set collection version?

I'm currently using MongoDB and I have a collection called Product. I have a requirement in the system that asks to increment the collection version whenever any change happens to the collection (e.g. add a new product, remove, change price, etc...).
Question: Is there a recommended approach to set versions for collections in MongoDB?
I was expecting to find something like that:
db.collection.Product.setVersion("1.0.0");
and the corresponding get method:
db.collection.Product.getVersion();
I'm not sure if it makes sense. Personally, I would love to have collection metadata provided as a native implementation from MongoDB. Is there any document database that does so?
MongoDB itself is completely "schemaless" and as such does not have any concept of its own for document "metadata" or the general "version management" you seem to be looking for. As such, the implementation is all up to you, and documents store whatever you supply them with.
You could implement such a scheme, generally by wrapping methods to include such things as version management in updates. So on document creation you would do this:
db.collection.myinsert({ "field": 1, "other": 2 })
Which wraps a normal insert to do this:
db.collection.insert({ "field": 1, "other": 2, "__v": 0 })
Having that data, any "updates" would need to provide a similar wrapper. So this:
db.collection.myupdate({ "field": 1 },{ "$set": { "other": 4 } })
actually checks for the same version as the one held, and "increments" the version at the same time via $inc:
db.collection.update(
    { "field": 1, "__v": 0 },
    {
        "$set": { "other": 4 },
        "$inc": { "__v": 1 }
    }
)
That means the document to be modified in the database needs to match the same "version" as the one held in memory in order to update. Changing the version number means subsequent updates with stale data will not succeed.
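This pattern is optimistic concurrency control, and its behavior can be modeled in plain JavaScript. The `myupdate` name comes from the answer above; the in-memory document is illustrative:

```javascript
// Optimistic-concurrency model: an update only applies when the
// stored "__v" matches the version the caller read, then bumps it.
function myupdate(doc, expectedVersion, changes) {
  if (doc.__v !== expectedVersion) return false; // stale write rejected
  Object.assign(doc, changes); // $set: changes
  doc.__v += 1;                // $inc: { __v: 1 }
  return true;
}

const doc = { field: 1, other: 2, __v: 0 };
myupdate(doc, 0, { other: 4 }); // succeeds: other is 4, __v is now 1
myupdate(doc, 0, { other: 9 }); // stale version 0: rejected, no change
```

In MongoDB the version check lives in the update's query document, so the compare-and-bump is atomic on the server rather than done in application code as here.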
Generally, though, there are several Object Document Mapper (ODM) implementations available for various languages that have this sort of functionality built in. You would probably be best off looking at the Drivers section of the documentation to find something suitable for your language. A little extra reading up on MongoDB would help as well.

mongodb: Is this where I should just normalize my embedded objects?

I have a collection of Parents that contain EmbeddedThings, and each EmbeddedThing contains a reference to the User that created it.
UserCollection: [
    {
        _id: ObjectId(…),
        name: '…'
    },
    …
]
ParentCollection: [
    {
        _id: ObjectId(…),
        EmbeddedThings: [
            {
                _id: 1,
                userId: ObjectId(…)
            },
            {
                _id: 2,
                userId: ObjectId(…)
            }
        ]
    },
    …
]
I soon realized that I need to get all EmbeddedThings for a given user, which I managed to accomplish using map/reduce:
"results": [
{
"_id": 1,
"value": [ `EmbeddedThing`, `EmbeddedThing`, … ]
},
{
"_id": 2,
"value": [ `EmbeddedThing`, `EmbeddedThing`, … ]
},
…
]
Is this where I should really just normalize EmbeddedThing into its own collection, or should I still keep map/reduce to accomplish this? Some other design perhaps?
If it helps, this is for users to see their list of EmbeddedThings across all Parents, as opposed to some reporting/aggregation task (which made me realize I might be doing this wrong).
Thanks!
"To embed or not to embed: that is the question" :)
My rules are:
embed if an embedded object makes sense only in the context of its parent object. For example, an OrderItem without an Order doesn't make sense.
embed if dictated by performance requirements. It's very cheap to read a full document tree (as opposed to having to make several queries and join them programmatically).
You should look at your access patterns. If you load ParentThing several thousand times per second, and load User once a week, then map/reduce is probably a good choice. The User query will be slow, but that might be OK for your application.
Yet another approach is to denormalize even more: when you add an embedded thing, add it to both the parent thing and the user.
Pros: queries are fast.
Cons: complicated code, twice the writes, and a potential loss of sync (you update/delete in one place but forget to do it in the other).
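That "denormalize even more" option amounts to a double write. A minimal sketch of what it looks like, using the question's collection names with everything held in memory for illustration:

```javascript
// Double-write denormalization: the embedded thing is written under
// the parent document AND duplicated under the user document.
function addEmbeddedThing(parent, user, thing) {
  parent.EmbeddedThings.push(thing); // write 1: under the parent
  user.EmbeddedThings.push(thing);   // write 2: copy under the user
  // Both writes must succeed (or be retried) to stay consistent;
  // missing one is the "potential loss of sync" noted above.
}

const parent = { _id: "p1", EmbeddedThings: [] };
const user = { _id: "u1", name: "alice", EmbeddedThings: [] };
addEmbeddedThing(parent, user, { _id: 1, userId: "u1" });
// Reading a user's things is now a single lookup on the user document.
```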