MongoDB - simulate join or subquery - mongodb

I'm trying to figure out the best way to structure my data in Mongo to simulate what would be a simple join or subquery in SQL.
Say I have the classic Users and Posts example, with Users in one collection and Posts in another. I want to find all posts by users whose city is "london".
I've simplified things in this question. In my real-world scenario, storing Posts as an array in the User document won't work, as I have thousands of posts per user, with new ones constantly being inserted.
Can Mongo's $in operator help here? Can $in handle an array of 10,000,000 entries?

Honestly, if you can't fit "Posts" into "Users", then you have two options.
1. Denormalize some User data inside of Posts. Then you can search through just the one collection.
2. Do two queries (one to find the users, the other to find their posts).
Based on your question, you're trying to do #2.
Theoretically, you could build a list of User IDs (or refs) and then find all Posts belonging to a User $in that array. But obviously that approach is limited.
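A minimal sketch of the two-query approach, assuming a `city` field on users and a `user_id` field on posts (both names are hypothetical). It only builds the filter documents; running them with `find()` against the two collections is left to the application. The ids are split into batches so that no single $in list grows without bound; the batch size is an arbitrary choice.

```javascript
// Filter for the first query: users in a given city.
// The field name "city" is an assumption for illustration.
function usersInCityFilter(city) {
  return { city: city };
}

// Split a long list of ids into batches of at most `size` elements.
function chunk(ids, size) {
  const batches = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}

// Filters for the second query: one $in filter per batch of user ids.
// The field name "user_id" on posts is an assumption.
function postsByUsersFilters(userIds, batchSize) {
  return chunk(userIds, batchSize).map(function (batch) {
    return { user_id: { $in: batch } };
  });
}

// Example: five user ids, batches of two.
const filters = postsByUsersFilters([1, 2, 3, 4, 5], 2);
console.log(JSON.stringify(filters));
```

Each filter would then be passed to a separate find() on the posts collection, and the results concatenated in the application.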
Can $in handle an array of 10,000,000 entries?
Look, if you're planning to "query" your posts for all users in a set of 10,000,000 Users, you are well past the stage of "query". You say yourself that each User has thousands of posts, so a query for "Users with Posts who live in London" would return hundreds of millions of records.
100M records isn't a query, that's a dataset!
If you're worried about breaking the $in command, then I highly suggest that you use map/reduce. The Mongo Map/Reduce will create a new collection for you. You can then trim down or summarize this dataset as you see fit.

$in can handle 100,000 entries. I've never tried 10,000,000 entries, but the query (a query is also a document) has to be smaller than the maximum document size (16 MB in current versions, 4 MB in early releases), like every document, so 10,000,000 entries isn't possible.
Why don't you include the user and its town in the Posts collection? You can index this town because you can index properties of embedded entities. You no longer have to simulate a join because you can query the Posts on the towns of its embedded users.
This means that you have to update the Posts when the town of a user changes but that doesn't happen very often. This update will be fast if you index the UserId in the Posts collection.
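A rough sketch of this denormalized layout, under the assumption that each post embeds a small `user` sub-document with `_id`, `name` and `town` (all field names are hypothetical). The functions only build the documents you would hand to the database.

```javascript
// Shape of a post that carries a copy of its author's town.
// All field names here are assumptions for illustration.
function makePost(user, text) {
  return {
    text: text,
    user: { _id: user._id, name: user.name, town: user.town }
  };
}

// Filter for "all posts by users in a given town": a single-collection
// query on the embedded field, written with dot notation.
function postsInTownFilter(town) {
  return { "user.town": town };
}

// Filter and update to apply to the posts collection when a user moves.
// An index on "user._id" lets the update find that user's posts directly.
function moveUserUpdate(userId, newTown) {
  return {
    filter: { "user._id": userId },
    update: { $set: { "user.town": newTown } }
  };
}

const post = makePost({ _id: 7, name: "ann", town: "london" }, "hello");
const op = moveUserUpdate(7, "paris");
console.log(post, postsInTownFilter("london"), op);
```

In the shell this would be used roughly as db.posts.find(postsInTownFilter("london")) and db.posts.updateMany(op.filter, op.update), with indexes on "user.town" and "user._id".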

I have something similar, but my setup is geared towards "users" and "messages". What I did was add a reference to the user, sort of like a foreign key. I used the generated "_id" from the users collection and stored it as a key inside of "messages". For every message a user sends, I save it to the "messages" collection. You should read up on DBRefs; I think that's what you're looking for.
You'll have to run multiple queries, but you should definitely do that on the app side.

Related

Finding all documents that don't have a relations with others in MongoDB

I have a collection of Users and one of Posts. I want to find all the posts that one user has not viewed yet. I expect the number of posts one user views to grow over time, possibly reaching tens or hundreds of thousands for some users, although the majority of users will only have a few hundred.
How should I organize my data in a MongoDB database?
Should I keep the array of viewed posts in the User collection, in the Post collection, in a collection on its own (a document per view) or what else?
How should I then query the database?
The query can be built with the aggregation pipeline. First use $lookup to join posts to post views, then $match with $exists: false on the joined result to drop the posts the user has already viewed.
This won't be a cheap query with a large volume of data. One strategy for making it faster is by limiting, for example, the time allowed for posts prior to the join, or scoping the posts to a forum/tag, etc.
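One possible shape for such a pipeline, assuming views are stored one document per view in a `post_views` collection with `post_id` and `user_id` fields, and that posts have a `created_at` date (all of these names are assumptions, not taken from the question). The function only builds the pipeline array.

```javascript
// Build an aggregation pipeline that returns posts a user has not viewed.
// Collection and field names are assumptions for illustration.
function unviewedPostsPipeline(userId, since) {
  return [
    // Narrow the posts before the join, here by creation time.
    { $match: { created_at: { $gte: since } } },
    // Join each post to this user's view records for it.
    {
      $lookup: {
        from: "post_views",
        let: { postId: "$_id" },
        pipeline: [
          {
            $match: {
              $expr: {
                $and: [
                  { $eq: ["$post_id", "$$postId"] },
                  { $eq: ["$user_id", userId] }
                ]
              }
            }
          },
          { $limit: 1 }
        ],
        as: "views"
      }
    },
    // Keep only posts whose joined array has no first element,
    // i.e. posts with no matching view.
    { $match: { "views.0": { $exists: false } } },
    // Drop the helper array from the output.
    { $project: { views: 0 } }
  ];
}

const pipeline = unviewedPostsPipeline(42, new Date(0));
console.log(JSON.stringify(pipeline, null, 2));
```

It would be run against the posts collection, for example db.posts.aggregate(unviewedPostsPipeline(userId, cutoffDate)). An index on the view collection's post and user fields is likely needed for the join to be tolerable.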

Sharding with mongodb. Optimal way to write my query

Let me try to explain my problem first, and then the solution I'm implementing. I have a collection of "events", which can be shared with specific users. I also have a collection of "users". Any user could share an event with any number of other users. When an event is shared with a user, it is seen in the home page of my website by that user (let's say that it is sorted by creation date to make it simple).
I want to use sharding to balance both my writes and my reads, and to be able to scale horizontally if needed. Before I thought of sharding, I had an events collection, which had an array of userIds within. Those userIds are the ones that can see the event. My query then was every event where the logged in user was contained within that array, sorted by creation date, limiting to my page size.
To implement sharding in this scenario, the obvious choice would be to somehow have the userId as the shard key, as every event returned by my query has the userId within that embedded array. However, my userId is contained within an array, so that wouldn't work. I then thought of having a new collection, with the following fields:
userId: ObjectId (hashed shard key, to avoid monotony)
eventId: ObjectId
creationDate: Date
This way, I can run my query by userId, and have it go only to the corresponding shard. My problem with this solution, of course, is that I now have eventIds instead of events. An event is a fairly large document, so I wouldn't want to store it redundantly as an embedded document within that collection (remember that the same event can be shared with many users).
To solve this, I think the correct solution would be to have the eventId be the shard key of the events collection (again, hashed to avoid monotony). I can then query the events collection by just those ids.
This raises two questions:
1. Is this the correct way to think about this particular problem? Is it a good solution?
2. As I now have several eventIds, let's just say five, and each one of them can be located in a different shard, which would be more performant: a single query looking for the five ids, or five different queries looking for a single id each?
1. Yes, this is the correct way and the solution is fine: users sharded by userId and events sharded by eventId.
2. The latter: five different queries, each searching for a single id, because then each query goes to one shard. If you have a single query that looks for five ids at the same time ($in: []), it will probably be scattered across multiple shards.

Get all usernames from a items user_id? (Mongodb query)

I am having some difficulty with my leaderboard for my application.
I have a database with two collections.
Users
Fish
In my Fish collection I also have user_id. When I fetch all fish, I get everything including user_id.
However, the user_id doesn't really help me, I want to display the username belonging to that user_id.
This is how my query looks.
Fish.find().sort({weight: -1}).limit(10).exec(function(err, leaderboard) {
    if (err) return res.json(500, {errorMsg: 'Could not get leaderboard'});
    res.json(leaderboard);
});
I feel like I need to make another query, to get all the usernames belonging to the user_ids I get from the first query. Perhaps use a loop somehow?
MongoDB is pretty new to me and I don't really know what to look for.
Any advice, tips, or links are much appreciated.
You may find useful information on MongoDB's Database References documentation.
The first thing to consider on using fields from different collections in MongoDB is:
MongoDB does not support joins. In MongoDB some data is denormalized, or stored with related data in documents to remove the need for joins. However, in some cases it makes sense to store related information in separate documents, typically in different collections or databases.
In your case, you might want to consider storing the information from the Fish collection as embedded documents within the users from the User collection.
If this is not an option, then you might want to use Manual References or loop over the user_ids provided in the result from your query over the Fish collection.
With the second option you may use a query to obtain the corresponding usernames from the User collection such as:
Users.find({user_id:<USER_ID>},{username:1})
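Instead of one query per user_id, the ids can be collected and the usernames merged in application code. A minimal sketch, assuming the leaderboard entries and user documents are plain objects (with Mongoose, for example, results fetched with `.lean()`) and that users are keyed by `_id`; the helper names are hypothetical.

```javascript
// Collect the distinct user ids referenced by the leaderboard entries,
// keeping the original id values (they may be ObjectIds, not strings).
function collectUserIds(fish) {
  const seen = new Map();
  fish.forEach(function (f) {
    seen.set(String(f.user_id), f.user_id);
  });
  return Array.from(seen.values());
}

// Merge usernames into the leaderboard entries in application code.
// Entries whose user cannot be found get username: null.
function attachUsernames(fish, users) {
  const nameById = new Map(users.map(function (u) {
    return [String(u._id), u.username];
  }));
  return fish.map(function (f) {
    const name = nameById.get(String(f.user_id));
    return Object.assign({}, f, { username: name === undefined ? null : name });
  });
}

// Example with in-memory data standing in for the two query results.
const leaderboard = [
  { weight: 9, user_id: "u1" },
  { weight: 7, user_id: "u2" },
  { weight: 5, user_id: "u1" }
];
const users = [
  { _id: "u1", username: "alice" },
  { _id: "u2", username: "bob" }
];
console.log(collectUserIds(leaderboard));
console.log(attachUsernames(leaderboard, users));
```

The array from collectUserIds would go into a single second query such as Users.find({_id: {$in: ids}}, {username: 1}), and its result into attachUsernames.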

Many-to-many in document DBs

I am just starting out with MongoDB (Late to the party, I know...)
I am still trying to get 10+ years of relational DBing out of my head when thinking of a document design.
Let's say I have many users using many apps. Any user can use several apps, and any app can be used by any number of users.
In the login procedure I would like to access all the apps a user uses. In another procedure I would like to get all the users of a specific app.
Should I just have duplicate data? Maybe have an array of users in the App document and an array of apps in the user document? Does this make sense? Is this a conventional approach in document DBs?
Good question!
You have a many-to-many scenario.
In Mongo you can solve this problem in several ways: using a lookup table, as in SQL, or using an array.
What you should consider are indexes, same as in SQL, but this time you have more options.
Since it's a many-to-many scenario, I would probably go with the lookup table.
This is the most effective way to get the users of an app and the apps of a user.
An array is not a good fit for values that change often, especially if you need two array fields (app / user) and the app.users array field is going to change often.
The downside is that you can't "join": you will have to "select" data from two tables and do the "join" yourself. This shouldn't be an issue, especially since you can always cache the result (local caching in your application), and Mongo will return the result very quickly if you add an index on the user field:
{
    _id: "<appID>_<userID>",
    user: "<userID>"
}
_id is indexed by default. Another index should be created on the "user" field; Mongo will then load the B-tree into memory and you are all good.
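A small sketch of how such link documents and their queries might be built. The separate `app` field is an addition for illustration (the document above only shows `user`); it is assumed here so that "users of an app" can be answered with an indexed equality match as well.

```javascript
// Build one link document per (app, user) pair. The composite _id makes
// each pair unique, because _id always carries a unique index.
// The "app" field is an assumption added for illustration.
function makeLink(appId, userId) {
  return { _id: appId + "_" + userId, app: appId, user: userId };
}

// Filter for "all apps of a user" (served by the index on "user").
function appsOfUserFilter(userId) {
  return { user: userId };
}

// Filter for "all users of an app" (would need an index on "app").
function usersOfAppFilter(appId) {
  return { app: appId };
}

const link = makeLink("app1", "user9");
console.log(link, appsOfUserFilter("user9"), usersOfAppFilter("app1"));
```

The application then runs one query on the lookup collection and a second one on the apps or users collection to fetch the details.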
As per your scenario, you need not have duplicate data. Since it's a many to many relationship and the data is going to keep changing, you need to use document reference instead of document embedding.
So you will have two collections:
app collection:
{
    _id : appId,
    app_name : "appname",
    // other property of app
    users : [userid1, userid2]
}
users collection:
{
    _id : userId,
    // other details of user
    apps: [appid1, appid2, ..]
}
As you mentioned, you need an array of users in the app collection and an array of apps in the users collection.
When fetching data in the client, you first get the array of app IDs from the user document when the user logs in.
Then, with those app IDs, you query the app collection for the app details.
This extra round trip is unavoidable because we are using references, but you can improve performance by caching the details and by having proper indexes.
This is conventional in MongoDB for a many-to-many relationship.
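With references stored on both sides, every link or unlink touches two documents. A minimal sketch of the update documents involved, using the `apps` and `users` array fields from the layout above; the collection names and the helper functions are hypothetical, and the two writes are separate operations that the application has to issue together.

```javascript
// Updates needed when a user starts using an app.
// $addToSet avoids duplicate entries if the link already exists.
function linkUpdates(userId, appId) {
  return [
    { collection: "users", filter: { _id: userId }, update: { $addToSet: { apps: appId } } },
    { collection: "app", filter: { _id: appId }, update: { $addToSet: { users: userId } } }
  ];
}

// Updates needed when a user stops using an app.
function unlinkUpdates(userId, appId) {
  return [
    { collection: "users", filter: { _id: userId }, update: { $pull: { apps: appId } } },
    { collection: "app", filter: { _id: appId }, update: { $pull: { users: userId } } }
  ];
}

const ops = linkUpdates("u1", "a1");
console.log(JSON.stringify(ops));
console.log(JSON.stringify(unlinkUpdates("u1", "a1")));
```

Each entry would be applied with an update on the named collection, for example an updateOne(filter, update) call.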

many-to-many relationships for social app: Mongodb or graph databases like Neo4j

I have tried to understand embedding in MongoDB but could not find good enough documentation. Linking is not advised, as writes are not atomic across documents and two lookups are needed. Does someone know how to solve this, or would you suggest I move to a graph DB like Neo4j?
I am trying to build an application that needs many-to-many relationships. To explain, I will take the example of a library. It can suggest books to a user based on the books his friends are reading and the books that neighbors (like-minded users) are reading.
There are Users and Books. Users borrow books and have friends who are other users.
Given a user, I need all the books he is reading and the number of mutual friends for each book.
Given a book, I need all the people who are reading it. Maybe, given a user A, this would return the intersection of the people reading the book and the friends of user A. This is mutual friendship.
Users = [
    { name: 'xyz', 'id': '000000', friend_ids: ['949583', '958694'] },
    { name: 'abc', 'id': '000001', friend_ids: ['949582', '111111'] }
]
Books = [
    { 'book': 'da vinci code', 'author': 'dan brown', 'readers': ['949583', '000000'] },
    { 'book': 'iCon', 'author': 'Young', 'readers': ['000000', '000001'] }
]
As seen above, I generally need two documents if I use MongoDB, as I might need two-way lookups. Duplicating (embedding) one document in another could lead to a lot of duplication (these schemas could store much more information than shown).
Am I modeling my data correctly? Can this be done effectively in MongoDB, or should I look at graph DBs?
A disclaimer: I work for Neo4j
From your outline, requirements and type of data it seems that your app is rather in a sweetspot for graph databases.
I'd suggest you just do a quick spike with a graph database and see how it is going.
There will be no duplication
you have transactions for atomic operations
following links is the natural operation
local queries (e.g. from a user or a book) are cheap and fast
you can use graph algorithms like shortest path to find interesting information about your data
recommendations and similar operations are natural to graph databases
Some Questions:
Why did you choose MongoDB in the first place?
What implementation language do you use?
Your basic schema proposal above would work fine for MongoDB, with a few suggestions:
Use integers for identifiers, rather than strings. Integers will often be stored more compactly by MongoDB (a 32-bit integer takes 4 bytes and a 64-bit integer or double takes 8, whereas a string's stored size depends on its length). You can use findAndModify to emulate unique sequence generators (like auto_increment in some relational databases); see Mongoengine's SequenceField for an example of how this is done. You could also use ObjectIds, which are always 12 bytes but are virtually guaranteed to be unique without having to store any coordination information in the database.
You should use the _id field instead of id, as this field is always present in MongoDB and has a default unique index created on it. This means your _ids are always unique, and lookups by _id are very fast.
You are right that using this sort of schema will require multiple find()s, and will incur network round-trip overhead each time. However, for each of the queries you have suggested above, you need no more than 2 lookups, combined with some straightforward application code:
1. "Given a user, I need all books he is reading and number of mutual friends for the book"
a. Look up the user in question, then
b. query the books collection using db.books.find({_id: {$in: [list, of, books, for, the, user]}}), then
c. for each book, compute the set intersection of that book's readers and the user's friends; its size is the number of mutual friends
2. "Given a book, I need all the people who are reading it."
a. Look up the book in question, then
b. look up all the users who are reading that book, again using $in like db.users.find({_id: {$in: [list, of, users, reading, book]}})
3. "May be given a user A, this would return the intersection of people reading book and friends of user A."
a. Look up the user in question, then
b. look up the book in question, then
c. compute the set intersection of the user's friends and the book's readers
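The application-side step for mutual friends (the readers of a book who are also friends of the user, i.e. a set intersection) might look like the sketch below. It uses the `friend_ids` and `readers` fields from the question's schema; the function name is hypothetical.

```javascript
// Return the readers of `book` who are also friends of `user`.
// Building a Set of the friends makes each membership check cheap.
function mutualReaders(user, book) {
  const friends = new Set(user.friend_ids);
  return book.readers.filter(function (id) {
    return friends.has(id);
  });
}

// Sample documents taken from the question's example data.
const user = { _id: "000000", friend_ids: ["949583", "958694"] };
const book = { book: "da vinci code", readers: ["949583", "000000"] };

const mutual = mutualReaders(user, book);
console.log(mutual, mutual.length);
```

The length of the returned array is the "number of mutual friends for the book".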
I should note that $in can be slow if you have very long lists, as it is effectively equivalent to doing N number of lookups for a list of N items. The server does this for you, however, so it only requires one network round-trip rather than N.
As an alternative to using $in for some of these queries, you can create an index on the array fields, and query the collection for documents with a specific value in the array. For instance, for query #1 above, you could do:
// create an index on the array field "readers"
// (createIndex is the current name; older shells called this ensureIndex)
db.books.createIndex({readers: 1})
// now find all books for the user whose id is 1234
db.books.find({readers: 1234})
This is called a multikey index and can perform better than $in in some cases. Your exact experience will vary depending on the number of documents and the size of the lists.