MongoDB $nin otherDocument.array without downloading otherDocument - mongodb

I've been fighting with a query for a while now: I have to download a document from the database just to construct my query, and it's honestly slowing things down quite a bit as my service gets more and more requests. I was wondering if someone could help me optimize the query, or restructure things so I don't have to download the initial document.
I'm going to use Tinder as an example here, as it illustrates the number of items that could end up in this array, and why keeping a "swipedBy: []" array on each client to avoid the initial download seems inefficient: the array could end up hundreds of thousands of elements long, and will only grow over time.
So let's say that I have a field in my user documents called "swipes", which is an array of Firebase ids (strings) of the users that they have interacted with. An example can be found below:
{
  _id: 'firebaseUid',
  swipes: [ 'firebaseUid_1', 'firebaseUid_2', 'firebaseUid_3' ]
}
I have a query that is supposed to select a user from the database that is not already in MY swipes array. Currently this is how I have it done (JavaScript):
database.collection('users').findOne({ _id: myUserId }).then((document) => {
  const query = {
    ...,
    $and: [
      { _id: { $ne: myUserId } },
      { _id: { $nin: document.swipes } }
    ]
  };
});
This requires me to download the user document from the database, then pass the whole array back in as a query, which seems highly inefficient when we're talking about tens, if not hundreds, of thousands of array elements.
While the above query DOES work, I feel like there's a way this can be sped up, and my lack of MongoDB knowledge is really hurting me here. I know for a fact I can do this in MySQL, but I don't know of any good (and affordable) MySQL services like mLab.
I should add: my MongoDB database is remote, so this document and its massive array are being downloaded (per request) into my Google Cloud Functions call, then sent back to the server over the network. The data has to travel over the network twice, and since I'm charged by the millisecond, I'd like to minimize that.

You should refactor the swipes out of the user document and into a separate collection whose documents point to the user who swiped and the user who was swiped. This would also let you store additional data per swipe, such as whether it was a left or right swipe, a timestamp, and so on.
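A minimal sketch of what that could look like, in mongo-shell JavaScript. The collection and field names (swipes, swiperId, swipedId, direction) are illustrative assumptions, not something prescribed by the answer:

// One document per swipe event, instead of one ever-growing array per user.
db.swipes.insertOne({
  swiperId: myUserId,       // who swiped
  swipedId: candidateId,    // who was swiped
  direction: 'right',       // extra per-swipe data becomes trivial to store
  at: new Date()
});

// Unique compound index: makes "has A already swiped B?" an index lookup
// and prevents duplicate swipe records.
db.swipes.createIndex({ swiperId: 1, swipedId: 1 }, { unique: true });

// Candidate selection without shipping the whole swipes array anywhere:
// grab a small batch of users, then exclude the already-swiped ones with
// a single indexed $in query against the swipes collection.
const batch = db.users.find({ _id: { $ne: myUserId } }).limit(50).toArray();
const swiped = db.swipes
  .find({ swiperId: myUserId, swipedId: { $in: batch.map(u => u._id) } })
  .toArray()
  .map(s => s.swipedId);
const candidates = batch.filter(u => !swiped.includes(u._id));

The $in list here is bounded by the batch size (50), so the query stays small no matter how many total swipes a user accumulates.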

Related

query too large issue with mongodb

Let's say we have a collection of users, and each user is followed by other users. If I want to find the users that are NOT following me, I need to do something like:
db.users.find({ _id: { $nin: followers_ids } });
If the number of followers_ids is huge, say 100k users, MongoDB will start saying the query is too large, and sending that much data over the network just to make the query isn't good either. What are the best practices for accomplishing this query without sending all of these ids over the network?
I recommend that you limit the number of query results to reduce network demand. According to the docs:
MongoDB cursors return results in groups of multiple documents. If you know the number of results you want, you can reduce the demand on network resources by issuing the limit() method.
This is typically used in conjunction with sort operations. For example, if you need only 50 results from your query to the users collection, you would issue the following command:
db.users.find({ _id: { $nin: followers_ids } }).sort({ timestamp: -1 }).limit(50)
You can then use the cursor to retrieve more user documents as needed.
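A hedged sketch of paging through further results, using the last-seen timestamp as a range cursor rather than skip(), which stays efficient as you page deeper (field names are assumptions):

// First page: newest 50 users outside the excluded set.
var page = db.users.find({ _id: { $nin: followers_ids } })
  .sort({ timestamp: -1 })
  .limit(50)
  .toArray();

// Next page: continue from the last timestamp seen on the previous page.
var last = page[page.length - 1];
var nextPage = db.users.find({
  _id: { $nin: followers_ids },
  timestamp: { $lt: last.timestamp }
}).sort({ timestamp: -1 }).limit(50).toArray();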
Recommendation to Restructure Followers Schema
I would recommend that you restructure your user documents if the number of followers will grow large. Currently your user schema may look like this:
{
  _id: ObjectId("123"),
  username: "jobs",
  email: "stevej@apple.com",
  followers: [
    ObjectId("12345"),
    ObjectId("12375"),
    ObjectId("12395")
  ]
}
The good thing about this schema is that whenever the user does anything, all of the users you need to notify are right there inside the document. The downside is that if you need to find everyone a user is following, you have to query the entire users collection. Your user document will also become larger and more volatile as the followers grow.
You may want to normalize your followers further. You can keep a collection that matches each followee to their followers, with documents that look like this:
{
  _id: ObjectId("123"), // Followee's "_id"
  followers: [
    ObjectId("12345"),
    ObjectId("12375"),
    ObjectId("12395")
  ]
}
This will keep your user documents slender, but it will take an extra query to get the followers. As the "followers" array changes in size, you can enable the usePowerOf2Sizes allocation strategy to reduce fragmentation and document moves.
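For reference, a sketch of the two-query flow this schema implies, plus enabling the allocation strategy via collMod (relevant on the MMAPv1-era storage engine this answer targets; collection names are assumptions):

// Enable power-of-2 record allocation on the followers collection.
db.runCommand({ collMod: 'followers', usePowerOf2Sizes: true });

// Two-step read: fetch the follower ids, then the follower documents.
var doc = db.followers.findOne({ _id: followeeId });
var followerDocs = db.users.find({ _id: { $in: doc.followers } }).toArray();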

How to get a distinct list of values and, below each, all the documents that have that distinct value?

I'm getting my head around MongoDB document design and trying to figure out how to do something, or whether I'm barking up the wrong tree.
I'm creating a mini CMS. The site will contain either documents or URLs grouped by category, e.g. there's a category called 'shop' that has a list of links to items on another site, and there's a category called 'art' that has a list of works of art, each of which has a title, summary and images for a slideshow.
So one possible way to do this would be to have a collection that would look something like:
[{
  category: 'Products',
  title: 'Thong',
  href: 'http://www.thongs.com'
}, {
  category: 'Products',
  title: 'Incredible Sulk',
  href: 'http://www.sulk.com'
}, {
  category: 'Art',
  title: 'Cool art',
  summary: 'This is a summary to display',
  images: [...]
}]
But, and here's the question: when I'm building the webpage, this structure isn't much use to me. The homepage contains lists of 'things' grouped by their category: lists, menus, stuff like that. To do that easily, I need something that looks more like:
[
  {'Products': [
    {title: 'thong', href: 'http://www.thongs.com'},
    {title: 'Incredible Sulk'}
  ]},
  {'Art': [
    {title: 'Cool art', summary: 'This is a summary to display', images: [...]}
  ]}
]
So the question is: can I somehow do this transformation in MongoDB? If I can't, is it bad to do it in my app server layer (I'd get a grouped list of unique categories and then loop through them, querying Mongo for the documents of each category)? I'm guessing the app server layer is the bad option; after all, MongoDB has it all in memory if I'm lucky. If neither of these is good, am I doing it all wrong, and should I actually store the structure like this in the first place?
I need to make it easy for the user to create categories on the fly, and I have to consider what happens if they start to add lots of documents: I either need to restrict how many documents I pull back for each category, or limit the fields returned, so that when I query MongoDB it doesn't return a relatively big chunk of data, which is slow and wasteful, but instead returns the minimum I need to build the desired page.
I figured out a group query that gives me almost the structure I want; close enough to use for templates.
db.things.group({
  key: { category: true },
  initial: { articles: [] },
  reduce: function(doc, aggregator) {
    aggregator.articles.push(doc);
  }
})
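For what it's worth, a sketch of the same grouping done with the aggregation framework (the $group stage is available from MongoDB 2.2, and $$ROOT from 2.6), which avoids group()'s result-size and sharding limitations:

// Group documents by category; $push with $$ROOT collects each whole
// document into that category's "articles" array.
db.things.aggregate([
  { $group: { _id: '$category', articles: { $push: '$$ROOT' } } }
])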

MongoDB data structure with a large number of internal documents

I am relatively new to MongoDB, and so far I am really impressed. I am struggling with the best way to set up my document stores, though. I am trying to do some summary analytics using Twitter data, and I am not sure whether to put the tweets into the user document or to keep those as a separate collection. It seems like putting the tweets inside the user model would quickly hit the limit with regard to size. If that is the case, what is a good way to be able to run MapReduce across a group of users' tweets?
I hope I am not being too vague, but I don't want to get too specific and end up too far down the wrong path as far as setting up my domain model.
As I am sure you are all bored of hearing, I am used to RDB land, where I would lay out my schema like:
| USER  |
---------
| ID
| Name
| Etc.

| TWEET |
---------
| ID
| UserID
| Etc.
It seems like the logical schema in Mongo would be
User
|- Tweet (0..3000)
   |- Entities
      |- Hashtags (0..10+)
      |- urls (0..5)
      |- user_mentions (0..12)
   |- GeoData (0..20)
|- somegroupID
but wouldn't that quickly bloat the User document beyond capacity? I would like to run analysis on tweets belonging to users with a similar somegroupID. It conceptually makes sense to lay the model out as above, but at what point does that become too unwieldy? And what are viable alternatives?
You're right that you'll probably run into the 16MB MongoDB document limit here. You haven't said what sort of analysis you'd like to run, so it is difficult to recommend a schema; MongoDB schemas are designed with the data-query (and insertion) patterns in mind.
Instead of putting your tweets in a user, you can of course quite easily do the opposite: add a user-id and group-id to the tweet documents themselves. Then, if you need additional fields from the user, you can always pull those in with a second query upon display.
I mean a tweet document designed like this:
{
  'hashtags': [ '#foo', '#bar' ],
  'urls': [ 'http://url1.example.com', 'http://url2.example.com' ],
  'user_mentions': [ 'queen_uk' ],
  'geodata': { ... },
  'userid': 'derickr',
  'somegroupid': 40
}
And then for a user collection, the documents could look like:
{
  'userid': 'derickr',
  'realname': 'Derick Rethans',
  ...
}
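A hedged sketch of the two-query pattern this answer describes (collection names are assumptions):

// Analysis query: all tweets for one group, straight off the tweets collection.
var tweets = db.tweets.find({ somegroupid: 40 }).toArray();

// Optional second query to pull in user details for display.
var users = db.users.find({
  userid: { $in: tweets.map(function (t) { return t.userid; }) }
}).toArray();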
All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ
Chris Winslett @ MongoHQ
You will find this video interesting:
http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale
Essentially, in one document, store one day's worth of tweets for one person. The reasoning:
Querying typically consists of days and users
Therefore, you can have the following index:
{ user_id: 1, date: 1 }  # date needs to be last because you will range and sort on the date
Have fun!
Chris MongoHQ
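A sketch of the one-document-per-user-per-day bucketing described above (field and collection names are my assumptions, not from the talk):

// Upsert today's bucket for a user, appending the new tweet to its array;
// the bucket document is created automatically on the first tweet of the day.
db.tweet_buckets.updateOne(
  { user_id: 123123, date: 20120220 },
  { $push: { tweets: { text: 'MongoDB is pretty sweet' } } },
  { upsert: true }
);

// Supporting index: user first, date last, so queries can range and sort on date.
db.tweet_buckets.createIndex({ user_id: 1, date: 1 });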
I think it makes the most sense to implement the following:
user
{
  user_id: 123123,
  screen_name: 'cledwyn',
  misc_bits: {...},
  groups: ['123123_group_tall_people', '123123_group_techies'],
  groups_in: ['123123_group_tall_people']
}
tweet
{
  tweet_id: '98798798798987987987987',
  user_id: 123123,
  tweet_date: 20120220,
  text: 'MongoDB is pretty sweet',
  misc_bits: {...},
  groups_in: ['123123_group_tall_people']
}
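With that layout, group-level analysis becomes a straightforward indexed query; a hedged illustration:

// All tweets visible to one group, served by a multikey index on groups_in.
db.tweets.createIndex({ groups_in: 1 });
db.tweets.find({ groups_in: '123123_group_tall_people' });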

MongoDB - Query embedded documents

I have a collection named Events. Each Event document has a collection of Participants as embedded documents.
Now to my question: is there a way to query an Event and get only the Participants that match a condition, e.g. Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside of the event document called "Over18" or something. Insert them into that subdocument (and possibly the other one too, if you want), and then when you query the collection you can instruct the database to only return the "Over18" subdocument. The downside is that you store your participants in two different subdocuments, and you have to figure out their age before inserting. This may or may not be feasible, depending on your application. If you need to be able to check against arbitrary ages (i.e. sometimes it's 18, but sometimes it's 21 or 25, etc.) then this will not work.
Query the collection, retrieve the Participants subdocument, and then filter it in your application code. Despite what some people may believe, this isn't terrible, because you don't want your database to be doing too much work all the time. Offloading the computation to your application could actually benefit your database, because it can then spend more time querying and less time filtering. That leads to better scalability in the long run.
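A sketch of that second option in JavaScript (the age field and collection names are assumptions):

// Fetch only the participants subdocument for the event...
var event = db.events.findOne({ _id: eventId }, { participants: 1 });

// ...then filter in application code rather than in the database.
var adults = event.participants.filter(function (p) { return p.age > 18; });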
Short answer: no. I tried to do the same thing a couple of months back, but MongoDB does not support it (at least in versions <= 1.8). The same question has been asked in their Google Group, for sure. You can either store the participants as a separate collection or fetch the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
  { $unwind: '$participants' },
  { $match: { 'participants.age': { $gt: 18 } } },
  { $project: { participants: 1 } }
)
This will return a list of n documents, where n is the number of participants older than 18, and where each entry looks like this (note that the "participants" field now holds a single embedded document instead of an array):
{
  _id: objectIdOfTheEvent,
  participants: { firstName: 'only one', lastName: 'participant' }
}
It could probably even be flattened on the server to return a list of participants. See the official documentation for more information.
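For the server-side flattening the answer alludes to, a hedged sketch that reshapes each unwound result with $project so only the participant remains (this works on the 2.2-era aggregation framework; later versions could use $replaceRoot instead):

db.events.aggregate([
  { $unwind: '$participants' },
  { $match: { 'participants.age': { $gt: 18 } } },
  { $project: { _id: 0, participant: '$participants' } }
])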

MongoDB / NOSQL: Best approach to handling read/unread status on messages

Suppose you have a large number of users (M) and a large number of documents (N) and you want each user to be able to mark each document as read or unread (just like any email system). What's the best way to represent this in MongoDB? Or any other document database?
There are several questions on StackOverflow asking this question for relational databases but I didn't see any with recommendations for document databases:
What's the most efficient way to remember read/unread status across multiple items?
Implementing an efficient system of "unread comments" counters
Typically the answers involve a table listing everything a user has read (i.e. tuples of user id, document id), with a possible optimization of a cut-off date, allowing mark-all-as-read to wipe the table and start again, knowing that anything prior to that date counts as 'read'.
So, MongoDB / NOSQL experts, what approaches have you seen in practice to this problem and how did they perform?
{
  _id: messagePrefs_uniqueId,
  type: 'prefs',
  timestamp: unix_timestamp,
  ownerId: recipientId,
  messageId: messageId,
  read: true / false
}
{
  _id: message_uniqueId,
  timestamp: unix_timestamp,
  type: 'message',
  contents: 'this is the message',
  senderId: senderId,
  recipients: [recipientId1, recipientId2]
}
Say you have 3 messages you want to retrieve preferences for; you can get them via something like:
db.messages.find({
  messageId: { $in: [messageId1, messageId2, messageId3] },
  ownerId: recipientId,
  type: 'prefs'
})
If all you need is read/unread, you could combine this with MongoDB's upsert capabilities, so that you are not creating prefs for each message unless the user actually reads it: you create the prefs object with your own unique id and upsert it into MongoDB (see the sketch after this answer). If you want more flexibility (like, say, tags or folders), you'll probably want to create the prefs for each recipient of the message. For example, you could add:
tags: ['inbox','tech stuff']
to the prefs object and then to get all the prefs of all the messages tagged with 'tech stuff' you'd go something like:
db.messages.find({type: 'prefs', ownerId: recipientId, tags: 'tech stuff'})
You could then use the messageIds you find within the prefs to query and find all the messages that correspond:
db.messages.find({ type: 'message', _id: { $in: [array of messageIds from prefs] } })
It might be a little tricky if you want to do something like efficiently counting how many messages each 'tag' contains. If it's only a handful of tags, you can just add .count() to the end of each query. If it's hundreds or thousands, you might do better with a server-side map/reduce script, or with an object that keeps track of message counts per tag per user.
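To make the upsert idea above concrete, a hedged sketch (the query doubles as the document's identity, so a prefs document is only created the first time the user reads the message):

// Mark a message as read; creates the prefs document on first read,
// and is effectively a no-op on subsequent reads.
db.messages.updateOne(
  { type: 'prefs', ownerId: recipientId, messageId: messageId },
  { $set: { read: true, timestamp: Math.floor(Date.now() / 1000) } },
  { upsert: true }
);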
If you're only storing a simple boolean value, like read/unread, another method is to embed an array in each Document that contains a list of the Users who have read it.
{
  _id: 'document#42',
  ...
  read_by: ['user#83', 'user#2702']
}
You should then be able to index that field, making for fast queries for Documents-read-by-User and Users-who-read-Document.
db.documents.find({read_by: 'user#83'})
db.documents.find({_id: 'document#42'}, {read_by: 1})
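A short sketch of the index this implies (a hedged illustration):

// Indexing the read_by array gives a multikey index with one entry per
// user id, so the documents-read-by-user query is an index scan; the
// users-who-read-document query only needs the _id index plus the
// read_by projection.
db.documents.createIndex({ read_by: 1 });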
However, I find that I'm usually querying for all Documents that have not been read by a particular User, and I can't think of any solution that can make use of the index in this case. I suspect it's not possible to make this fast without having both read_by and unread_by arrays, so that every User is included in every Document (or a join table), but that would have a large storage cost.