Sharding with MongoDB: optimal way to write my query

Let me try to explain my problem first, and then the solution I'm implementing. I have a collection of "events", which can be shared with specific users. I also have a collection of "users". Any user could share an event with any number of other users. When an event is shared with a user, it is seen in the home page of my website by that user (let's say that it is sorted by creation date to make it simple).
I want to use sharding to balance both my writes and my reads, and to be able to scale horizontally if needed. Before I thought of sharding, I had an events collection, which had an array of userIds within. Those userIds are the ones that can see the event. My query then was every event where the logged in user was contained within that array, sorted by creation date, limiting to my page size.
To implement sharding in this scenario, the obvious choice would be to somehow use the userId as the shard key, as every event returned by my query has the userId within that embedded array. However, my userId is contained within an array, so that wouldn't work. I then thought of adding a new collection with the following fields:
userId: ObjectId (hashed shard key, to avoid monotonically increasing key values)
eventId: ObjectId
creationDate: Date
This way, I can run my query by userId and have it go only to the corresponding shard. My problem with this solution, of course, is that I now have eventIds instead of events. An event is a fairly large document, so I wouldn't want to store it redundantly as an embedded document within that collection (remember that the same event can be shared with many users).
To solve this, I think the correct solution would be to make the eventId the shard key of the events collection (again, hashed to avoid monotonically increasing key values). I can then query the events collection by just those ids.
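The two-step read described above can be sketched as follows. This is only an in-memory illustration in Python, not driver code: the lists and dicts stand in for the two collections, and the names `user_events`, `events`, and `home_page` are assumptions made for the example.

```python
from datetime import datetime

# In-memory stand-ins for the two collections (hypothetical sample data).
user_events = [
    {"userId": "u1", "eventId": "e1", "creationDate": datetime(2024, 1, 1)},
    {"userId": "u1", "eventId": "e2", "creationDate": datetime(2024, 1, 3)},
    {"userId": "u2", "eventId": "e2", "creationDate": datetime(2024, 1, 3)},
    {"userId": "u1", "eventId": "e3", "creationDate": datetime(2024, 1, 2)},
]
events = {
    "e1": {"_id": "e1", "title": "Launch"},
    "e2": {"_id": "e2", "title": "Meetup"},
    "e3": {"_id": "e3", "title": "Review"},
}


def home_page(user_id, page_size):
    # Step 1: filter by userId (the shard key), newest first, one page only.
    rows = sorted(
        (r for r in user_events if r["userId"] == user_id),
        key=lambda r: r["creationDate"],
        reverse=True,
    )[:page_size]
    # Step 2: resolve the event ids against the events collection.
    return [events[r["eventId"]] for r in rows]


page = home_page("u1", 2)
```

The first step only ever needs the shard that owns the user; the second step is where the lookups may spread across shards.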
This raises two questions:
Is this the correct way to think about this particular problem? Is it a good solution?
As I now have several eventIds, let's say five, and each of them can be located on a different shard, which would be more performant: a single query looking for all five ids, or five separate queries looking for a single id each?

Yes, this is the correct way to think about it, and the solution is fine: users sharded on userId and events sharded on eventId.
The latter: five separate queries, each searching for a single id, because each query then goes to exactly one shard. A single query that looks up five ids at once ($in: [...]) will probably be scattered across multiple shards.
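To make the routing argument concrete, here is a minimal Python sketch of how ids spread over shards. The `shard_for` function is a stand-in invented for this example (an MD5-based modulo), not MongoDB's actual hashing or chunk map, and `NUM_SHARDS` is an assumed cluster size.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3  # hypothetical cluster size


def shard_for(event_id: str) -> int:
    # Stand-in for hashed-shard-key routing; NOT MongoDB's real hash or chunk map.
    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_touched(event_ids):
    # Group the requested ids by the shard that would own them.
    by_shard = defaultdict(list)
    for eid in event_ids:
        by_shard[shard_for(eid)].append(eid)
    return dict(by_shard)


ids = ["e1", "e2", "e3", "e4", "e5"]
single_lookups = [shards_touched([eid]) for eid in ids]  # five queries, one shard each
combined = shards_touched(ids)  # one query, possibly several shards
```

Each single-id lookup maps to exactly one shard, while the combined lookup can involve as many shards as there are distinct owners. Which variant is faster in practice depends on round-trip latency and is worth measuring on the real cluster.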

Related

never show same document to same user twice

I have a server storing 5,000 content documents. Let's say I have 1 million users who all query for 50 new documents at their own pace, until all content has been seen.
I want to make sure that each user only sees and interacts with the content once and never again, like Tinder.
My first thought was to tag each document with a list of the user ids of the users who have seen it. However, this list would get really long (up to 1 million user ids per document), which sounds like it would kill query performance.
Does anyone have any better ideas for how I can return content to users just once and never again?
P.S. I am planning to build this with MongoDB.
P.P.S. I thought about keeping a list of 'document-ids-seen' on the user's document and then, with every query made by that user, filtering out results that match 'document-ids-seen'. The same challenge applies here: the query length would grow linearly as the user keeps interacting and bringing in new content.
The solution depends on the exact meaning of "at their own pace".
Your second post suggests that the time schedule is up to the user, but that she will be presented with the documents in an order determined by your application, e.g. getting news items in the order of their creation timestamps. In that case, your timestamp or auto-increment solution will work, and it has only a small impact on data volume and query complexity.
If, however, the user may also choose which documents to view, this won't work any more, as the documents already viewed may be scattered across the entire document set. A solution to handle this efficiently consists of two design ideas:
(a) Consider whether most users, at a given point in time, will have viewed a small or a large part of the entire document set. If only a small selection of documents is expected to be of interest to a particular user, then the count of documents the user has viewed will be rather small. (E.g. assume the documents are about IT and one user only wants to look at MongoDB docs, another mainly at Linux docs.) If all users will be interested in most or all of the documents, then the count of documents a particular user has not viewed will be small. (E.g. a set of news that everyone tries to follow.) Depending on which is the case, store only the smaller list (viewed or not viewed document ids) with each user, which will also simplify the query for the documents still to be viewed.
(b) With each user, don't store a list of single document ids (viewed or not viewed), but a list of intervals of such ids. E.g., if you store ids of documents not yet viewed, and some documents get added to the database, then, when the user's record is loaded, her highest interval will be updated from (someLowerId, formerHighestId) to (someLowerId, currentHighestId). When a user views a document, the interval containing its id gets split from (lowId, highId) into (lowId, viewedId - 1) and (viewedId + 1, highId), where one or both of these intervals may be empty. Including or excluding intervals like these will also simplify the queries compared with listing single ids.
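The interval bookkeeping in (b) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: document ids are consecutive integers, intervals are inclusive (low, high) pairs kept in ascending order, and the function names `mark_viewed` and `extend_to` are made up for the example.

```python
def mark_viewed(intervals, viewed_id):
    """Remove viewed_id from a list of inclusive (low, high) not-yet-viewed intervals."""
    result = []
    for low, high in intervals:
        if low <= viewed_id <= high:
            # Split around the viewed id, dropping any part that would be empty.
            if low <= viewed_id - 1:
                result.append((low, viewed_id - 1))
            if viewed_id + 1 <= high:
                result.append((viewed_id + 1, high))
        else:
            result.append((low, high))
    return result


def extend_to(intervals, former_highest_id, current_highest_id):
    """Account for documents added since the user's record was last updated."""
    if current_highest_id <= former_highest_id:
        return list(intervals)
    if intervals and intervals[-1][1] == former_highest_id:
        # The highest interval still reaches the old maximum: stretch it.
        low, _ = intervals[-1]
        return intervals[:-1] + [(low, current_highest_id)]
    # The old maximum was already viewed: the new ids form their own interval.
    return list(intervals) + [(former_highest_id + 1, current_highest_id)]


unviewed = [(1, 10)]
unviewed = mark_viewed(unviewed, 4)     # splits (1, 10) around id 4
unviewed = extend_to(unviewed, 10, 15)  # five new documents were added
```

A user who views documents mostly in order keeps a very short interval list, which is what keeps the per-user state and the resulting range queries small.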
I just had the idea that I could avoid the many-to-many relationship of content-to-users' interaction altogether, if I put a time-stamp on each document, and therefore only queried for more documents after a particular time-stamp 'X'.
Where 'X' could be stored in my 'users' table.
So when opening the app, I would sync my 'users' table, then issue queries after time-stamp 'X', then when results are returned, I'd update my 'users' table again with my new time-stamp X.
Alternatively, 'X' need not be a time-stamp; it could just be an auto-incrementing id.
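A minimal in-memory sketch of this marker idea, using the auto-incrementing variant. The field names `seq` and `last_seen_seq` and the function `next_batch` are assumptions for illustration; the list and dict stand in for the documents and users collections.

```python
# Hypothetical content: seven documents with an auto-incrementing sequence number.
documents = [{"seq": i, "title": f"doc {i}"} for i in range(1, 8)]
users = {"alice": {"last_seen_seq": 0}}


def next_batch(user_id, batch_size):
    # Fetch only documents after the user's stored marker 'X'.
    marker = users[user_id]["last_seen_seq"]
    batch = sorted(
        (d for d in documents if d["seq"] > marker),
        key=lambda d: d["seq"],
    )[:batch_size]
    # Advance the marker so these documents are never returned again.
    if batch:
        users[user_id]["last_seen_seq"] = batch[-1]["seq"]
    return batch


first = next_batch("alice", 3)
second = next_batch("alice", 3)
```

The per-user state is a single number regardless of how much content the user has seen, at the cost of forcing everyone through the documents in the same order.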

In MongoDB, how likely is it two documents in different collections in the same database will have the same Id?

According to the MongoDB documentation, the _id field (if not specified) is automatically assigned a 12 byte ObjectId.
It says a unique index is created on this field on the creation of a collection, but what I want to know is how likely is it that two documents in different collections but still in the same database instance will have the same ID, if that can even happen?
I want my application to be able to retrieve a document using just the _id field without knowing which collection it is in, but if I cannot guarantee uniqueness based on the way MongoDB generates one, I may need to look for a different way of generating Id's.
Short answer to your question: yes, that's possible.
The post below on a similar topic will help you understand this better:
Possibility of duplicate Mongo ObjectId's being generated in two different collections?
You are not required to use a BSON ObjectId for the _id field. You could use a hash of a timestamp and some random number, or a field with extremely high cardinality (a US SSN, for example), to make it close to impossible that two objects in the world will share the same id.
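As a rough illustration of the "hash of a timestamp and some random number" idea, here is a small Python sketch. The helper name `make_id` and the choice of SHA-256 with 16 random bytes are assumptions made for the example, not something the answer prescribes.

```python
import hashlib
import secrets
import time


def make_id() -> str:
    # Hypothetical helper: hash a nanosecond timestamp together with 16 random bytes.
    payload = time.time_ns().to_bytes(8, "big") + secrets.token_bytes(16)
    return hashlib.sha256(payload).hexdigest()


# Generate a batch of ids; collisions are not expected in practice.
ids = {make_id() for _ in range(1000)}
```

An id like this does not depend on the collection it is stored in, so the same value can serve as `_id` in any collection of the database.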
The _id_ index requires the _id to be unique per collection. It is much like an RDBMS, where two objects in two tables may very likely have the same primary key when it's an auto-incremented integer.
You cannot retrieve a document solely by its _id. Any driver I am aware of requires you to explicitly name the collection.
My 2 cents: the only thing you could do is to manually iterate over the existing collections and query each for the _id you are looking for, which is inefficient, to put it politely. I'd rather semantically distinguish the documents in question by an additional field than by the collection they belong to. And remember, MongoDB uses dynamic schemas, so there is no reason to separate documents which semantically belong together but have a different set of fields. I'd guess there is something seriously wrong with your schema. Please elaborate so that we can help you with that.

MongoDB Shard considering DBRefs

I have a case where the first collection uses a DBRef to another collection.
The first collection is Books, the second is Users (who read those books). A user can have an avatar and various other information, which is reasonable to keep in a separate collection.
But now I need to shard the Books collection. If I shard it across 2 nodes, how will the Users collection be sharded? I would like to keep users that are related to particular books on the same node. Is that possible? Thanks!
At the moment this is not possible outside of tag-aware sharding ( http://docs.mongodb.org/manual/core/tag-aware-sharding/ ). Kristina (when she was still with 10gen) wrote a good article on how to distribute your data, which can easily be used to group multiple collections: http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
However, you might find that very difficult to maintain. I wouldn't advise it unless you are solely a DBA, since you will spend most of your time keeping it together as the network of related documents keeps expanding.
What I would do instead is shard the books on user_id and shard the users collection on hashed _id. That way you only need to query two shards at most.

MongoDB - Using email id as identifier across collections

I have a user collection in which email_id and _id are both unique. I want to store user data across various collections, and I would like to use email_id as the identifier in those collections, because it is easier to query those collections in the shell with an email_id than with a complex ObjectId.
Is this the right way? Will it cause any performance problems when creating indexes on long email ids?
Also, don't consider this option if you plan to allow users to change their email_id in the future.
While relational databases encourage you to normalize your data and spread it over many tables, this approach is usually not the best for MongoDB. MongoDB doesn't support JOINs over multiple collections or even multiple documents from the same collection. So you should try to design your database documents in a way that each query can be satisfied by searching for a single document. That means it is usually a good idea to store all information about a user in one document.
An exception to this is when certain data of the user grows indefinitely (like the posts made by a user in a forum). First, MongoDB documents have a size limit, and second, when the size of a document increases, the database needs to reallocate its hard drive space frequently. This slows down writes and leads to fragmentation in the database. In that case it's better to put each entity in a different collection.
The size of the fields covered by an index doesn't matter when you search for equality. When you have a unique index on email_id, it should be just as fast as searching by _id.

MongoDB - simulate join or subquery

I'm trying to figure out the best way to structure my data in Mongo to simulate what would be a simple join or subquery in SQL.
Say I have the classic Users and Posts example, with Users in one collection and Posts in another. I want to find all posts by users whose city is "london".
I've simplified things in this question. In my real-world scenario, storing Posts as an array in the User document won't work, as I have thousands of "posts" per user being inserted constantly.
Can Mongo's $in operator help here? Can $in handle an array of 10,000,000 entries?
Honestly, if you can't fit "Posts" into "Users", then you have two options.
1. Denormalize some User data inside of posts. Then you can search through just the one collection.
2. Do two queries (one to find users, the other to find posts).
Based on your question, you're trying to do #2.
Theoretically, you could build a list of User IDs (or refs) and then find all Posts belonging to a User $in that array. But obviously that approach is limited.
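The two-query approach looks roughly like this. The sketch is plain Python over in-memory lists standing in for the two collections, with made-up sample data; the set membership test plays the role that `$in` plays in the second query.

```python
# Hypothetical sample data standing in for the Users and Posts collections.
users = [
    {"_id": 1, "name": "Ann", "city": "london"},
    {"_id": 2, "name": "Bob", "city": "paris"},
    {"_id": 3, "name": "Cy", "city": "london"},
]
posts = [
    {"_id": 10, "user_id": 1, "text": "hello"},
    {"_id": 11, "user_id": 2, "text": "bonjour"},
    {"_id": 12, "user_id": 3, "text": "hi"},
    {"_id": 13, "user_id": 1, "text": "again"},
]

# Query 1: collect the ids of all users in the city.
london_ids = {u["_id"] for u in users if u["city"] == "london"}

# Query 2: find posts whose user_id is in that set.
london_posts = [p for p in posts if p["user_id"] in london_ids]
```

The size of `london_ids` is the limiting factor: the whole id list has to travel with the second query.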
Can $in handle an array of 10,000,000 entries?
Look, if you're planning to "query" your posts for all users in a set of 10,000,000 Users, you are well past the stage of "query". You say yourself that each User has thousands of posts, so you're talking about a query for "Users with Posts who live in London" returning hundreds of millions of records.
100M records isn't a query, that's a dataset!
If you're worried about breaking the $in command, then I highly suggest that you use map/reduce. The Mongo Map/Reduce will create a new collection for you. You can then trim down or summarize this dataset as you see fit.
$in can handle 100,000 entries. I've never tried 10,000,000 entries, but the query (a query is also a document) has to be smaller than the maximum document size (4 MB when this was written; 16 MB in current MongoDB versions), so 10,000,000 entries isn't possible.
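A back-of-the-envelope estimate supports this. The Python sketch below assumes the `$in` array holds 12-byte ObjectIds and uses the BSON array layout, in which each element costs one type byte, its decimal index as a null-terminated key, and the value itself; the function name `in_array_bson_size` is made up for the example. The 4 MB figure is the limit quoted in the answer, and 16 MB is the limit in current MongoDB versions.

```python
def in_array_bson_size(n, value_size=12):
    """Estimate the BSON size in bytes of an array of n fixed-size values."""
    total = 4 + 1  # int32 length prefix + trailing null byte of the array document
    digits = 1
    start = 0
    while start < n:
        # Indices with this many decimal digits lie in [start, end).
        end = min(n, 10 ** digits)
        count = end - start
        # Per element: type byte + index digits + key terminator + value.
        total += count * (1 + digits + 1 + value_size)
        start = end
        digits += 1
    return total


MB = 1024 * 1024
small = in_array_bson_size(100_000) / MB     # size in MB for 100,000 ids
large = in_array_bson_size(10_000_000) / MB  # size in MB for 10,000,000 ids
```

Under these assumptions 100,000 ids stay below 2 MB, while 10,000,000 ids come to roughly 200 MB, far beyond either limit.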
Why don't you include the user and its town in the Posts collection? You can index this town because you can index properties of embedded entities. You no longer have to simulate a join because you can query the Posts on the towns of its embedded users.
This means that you have to update the Posts when the town of a user changes but that doesn't happen very often. This update will be fast if you index the UserId in the Posts collection.
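A small in-memory sketch of this denormalized layout, again in plain Python rather than driver code. The embedded `user` sub-document and the function names `posts_in_town` and `move_user` are assumptions for the example.

```python
# Hypothetical Posts collection with the user's id and town embedded in each post.
posts = [
    {"_id": 10, "user": {"id": 1, "town": "london"}, "text": "hello"},
    {"_id": 11, "user": {"id": 2, "town": "paris"}, "text": "bonjour"},
    {"_id": 12, "user": {"id": 1, "town": "london"}, "text": "again"},
]


def posts_in_town(town):
    # A single query on one collection; no join needed.
    return [p for p in posts if p["user"]["town"] == town]


def move_user(user_id, new_town):
    # Propagate a town change to every post by that user; returns how many changed.
    changed = 0
    for p in posts:
        if p["user"]["id"] == user_id:
            p["user"]["town"] = new_town
            changed += 1
    return changed


london = posts_in_town("london")
```

Reads become a single lookup, while a town change fans out to all of that user's posts, which is the trade-off the answer describes.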
I have something similar, but my setup is geared towards "users" and "messages." What I did was add a reference to the user, sort of like a foreign key. I used the generated "_id" from the users collection and stored it as a key inside of "messages." Every message a user sends is saved to the "messages" collection. You should read up on DBRefs; I think they're what you're looking for.
You'll have to run multiple queries, but you should definitely do that on the app side.