Paginate sub document MongoDB - mongodb

I'm trying to make a paginate mechanism for our product documents stored in MongoDB. What makes this tricky, is that each document can have several colors, and I need to paginate by these instead of the document itself. E.g. the example below has two colors, and should then count as 2 in my paginate results.
How would anyone go around doing this the easiest / most affective way?
Thanks in advance!
{
"_id": ObjectId("4fdbaf608b446b0477000142"),
"created_at": newDate("14-10-2011 12:02:55"),
"modified_at": newDate("15-6-2012 23:55:43"),
"sku": "A1051g",
"name": {
"en": "Earrings - Celebrity"
},
"variants": [
{
color: {
en: "Blue"
}
},
{
color: {
en: "Yellow"
}
}
]
}

I like Sammaye's solution but another approach could just be pulling back more results than you need.
So for example, if you need 100 variants per page and each product has at least 1 variant, query with a limit of 100 to try and get 100 products, and therefore, at least 100 variants.
Chances are, you will have more than 100 variants (each product having more than 1) so build a list of products as you iterate over the cursor keeping track of the number variants.
When you have 100 variants, take note of how many products you have in the list, out of the 100 you retrieved, and use that as the skip for your next query.
This will eventually get expensive for large skips as you will have to seek over the number of documents you skip but could be a good solution for now.

Related

Query nested array performance, Mongo vs ElasticSearch

I have a music app that has a job to find music recommendations based on a tag id.
There are two entities involved:
Song - a song record contains its name and a list of music tag ids (genres) this song belongs to
MusicTag - the music tag itself, includes id, name etc.
Data is currently stored in MongoDB.
The Songs collections in mongo have millions of songs, and each song has an average of 7 tag ids.
The MusicTags has about 30K records.
The Songs collection looks like that:
[
{
name: "Metallica - one",
tags: [
"6018703624d8a5e8efa1b76e", // Rock
"601861cc8cef62ba86765017", // Heavy metal
"5fda07ac8db0615c1c503a46" // Hard Rock
]
},
{
name: "Metallica - unforgiven",
tags: [
"6018703624d8a5e8efa1b76e", // Rock
"5fda07ac8db0615c1c503a46", // Metal
]
},
{
name: "Lady Gaga - Bad Romance",
tags: [
"5fc7b9f95e38e17282896b64", // Pop
"5fc729be5e38e17282844eff", // Dance
]
}
]
Given the tag "6018703624d8a5e8efa1b76e" (Rock), I want to query the Songs collection and find all songs that have Rock tag in their tags array.
In Mongo this is the query i'm doing:
db.songs.find({ tags: { $in: [ObjectId("6018703624d8a5e8efa1b76e")] }});
The performance of it is very bad (between 10 to 40 seconds and getting worst as long as the collection grows), I tried to index Mongo in various ways (the table contains more data that involve in the search, such as score and duration, but it's not relevant for now) but my queries are still take too long, I can't explain it (and I read a lot of official and unofficial stuff) but I have a feeling that holding the data in this nested form makes the index worthless and somehow still make a full scan on the table each time - but I can't prove it (the Mongo "explain" not really explained me something :) )
I'm thinking of using ElasticSearch for it, sync all songs data, and query it instead of the Mongo that will stay as the data SSOT and other lightweight ops.
But then the question remains open and I want to make sure: is in Elastic I can hold the data in that form (nested array inside song) or I need to represent it differently (e.g. flat it so every record will be song_tag index etc?
Thanks.
Elasticsearch doesn't offer a dedicated array type so what you'd typically do is define the mapping based on the type of the individual array items -- in your case a keyword:
PUT songs
{
"mappings": {
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
Then you'd index the docs:
POST songs/_doc
{
"name": "Metallica - one",
"tags": [
"6018703624d8a5e8efa1b76e",
"601861cc8cef62ba86765017",
"5fda07ac8db0615c1c503a46"
]
}
and query the tags:
POST songs/_search
{
"query": {
"bool": {
"must": [
{ ... other queries },
{
"terms": {
"tags": [
"6018703624d8a5e8efa1b76e" // one or more
]
}
}
]
}
}
}
The tags are unique keywords but are not human-readable so you'd need to keep the map of them vs. the actual genres somewhere. Since the genres are probably set once and rarely, if ever, updated, you could use nested fields too. But your tags would then become an array of key-value pairs:
POST songs/_doc
{
"name": "Metallica - one",
"tags": [
{
"tag": "6018703624d8a5e8efa1b76e",
"genre": "Rock"
}
...
]
}
The mapping would be slightly different and so would be the queries but now you wouldn't need the translation map, plus you could query or aggregate by human-readable values -- tags.genre.

Mongo error 16996 during aggregation - too large document produced

I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
_id : "Tree",
id: "18955875",
linksFrom: " [
{
name: "Forest",
count: 6
},
[...]
],
categories: [
"Trees",
"Forest_ecology"
[...]
]
}
The linksFrom field stores all articles this article points to, and how many times that happens. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning, I went through the whole collection and updated every article, but since there's lots of them it takes too much time. I switched to aggregation for performance purposes and tried it on a smaller set - works like a charm and is super fast in comparison with the older method. The aggregation pipeline is as follows:
db.runCommand(
{
aggregate: "articles",
pipeline : [
{
$unwind: "$linksFrom"
},
{
$sort: { "linksFrom.count": -1 }
},
{
$project:
{
name: "$_id",
linksFrom: "$linksFrom"
}
},
{
$group:
{
_id: "$linksFrom.name",
linksTo: { $push: { name: "$name", count: { $sum : "$linksFrom.count" } } },
}
},
{
$out: "TEMPORARY"
}
] ,
allowDiskUse: true
}
)
However, on a large dataset being the english Wikipedia I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that there are too many articles, which link to United_States article and the corresponding document's size grows above 16MB, currently almost 24MB. Unfortunately, I cannot even check if that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationship between articles is stored with IDs rather than long names but I'm afraid that might not be enough - especially because my plan is to merge the two collections for every article later...
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
_id : "tree",
stemmedName: "tree",
targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
_id : "tree",
targetArticles: [
{
name: "Christmas_tree",
count: 1
},
{
name: "Tree",
count: 166
}
[...]
]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection links serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and then I can immiedately get all target articles with one query. Then I use the articles and labels collections in order to:
Look for label with a given name.
Get all articles it might
point to.
Compare the incoming and outcoming links for these
articles.
This is where the main question comes. I thought that it's better if I store all possible articles for a given phrase in one document rather than leave them scattered in the links collection. Only now did it occur to me, that - as long as the lookups are indexed - the overall performance might be the same for one big document or many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongoshell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Lets begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link to?
For a given article, how many articles link to it?
Optional: For a given article, to how many articles does it link to?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
"_id": new ObjectId(),
// optional, for recrawling or pointing out a given state
"date": new ISODate(),
"article": wikiUrl,
"linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to document the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTocould occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have use the following statements for creating a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks =[]
var cursor = db.links.find({"linksTo":"Royal_Navy"},{"_id":0,"article":1})
cursor.forEach(
function(doc){
toLinks.push(doc.article);
}
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({"article":"Royal_Navy"},{"_id":0,"linksTo":1})
cursor.forEach(
function(doc){
fromLinks.push(doc.linksTo)
}
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that in case you already have answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link to?
Again, you can answer this question by reading the length of the array from question 2 of use the .count()method. The aggregation again is simple
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indices on the fields is probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do no always ask for the first field. So this is probably as optimized as it can get.
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to get the questions answered you have to the data.

mongo operation speed : array $addToSet/$pull vs object $set/$unset

I have a index collection containing lots of terms, and a field items containing identifier from an other collection. Currently that field store an array of document, and docs are added by $addToSet, but I have some performance issues. It seems an $unset operation is executed faster, so I plan to change the array of document to a document of embed documents.
Am I right to think the $set/$unset fields are fatest than push/pull embed document into arrays ?
EDIT:
After small tests, we see the set/unset 4 times faster. On the other
hand, if I use object instead of array, it's a little harder to count
the number of properties (vs the length of the array), and we were
counting that a lot. But we can consider using $set everytime and
adding a field with the number of items.
This is a document of the current index :
{
"_id": ObjectId("5594dea2b693fffd8e8b48d3"),
"term": "clock",
"nbItems": NumberLong("1"),
"items": [
{
"_id": ObjectId("55857b10b693ff18948ca216"),
"id": NumberLong("123")
}
{
"_id": ObjectId("55857b10b693ff18948ca217"),
"id": NumberLong("456")
}
]
}
Frequent update operations are :
* remove item : {$pull:{"items":{"id":123}}}
* add item : {$addToSet:{"items":{"_id":ObjectId("55857b10b693ff18948ca216"),"id":123,}}}
* I can change $addToSet to $push and check duplicates before if performances are better
And this is what I plan to do:
{
"_id": ObjectId("5594dea2b693fffd8e8b48d3"),
"term": "clock",
"nbItems": NumberLong("1"),
"items": {
"123":{
"_id": ObjectId("55857b10b693ff18948ca216")
}
"456":{
"_id": ObjectId("55857b10b693ff18948ca217")
}
}
}
* remove item : {$unset:{"items.123":true}
* add item : {$set:{"items.123":{"_id":ObjectId("55857b10b693ff18948ca216"),"id":123,}}}
For information, theses operations are made with pymongo (or can be done with php if there is a good reason to), but I don't think this is relevant
As with any performance question, there are a number of factors which can come into play with an issue like this, such as indexes, need to hit disk, etc.
That being said, I suspect you are likely correct that adding a new field or removing an old field from a MongoDB document will be slightly faster than appending/removing from an array as the array types will be less easy to traverse when searching for duplicates.

Maintaining total count of a set in another collection

I got simple scenario of two entities: post; bumps (ie upvote).
Example of a post:
{_id: 'happy_days', 'title': 'Happy days', text: '...', bumps: 2}
Example of a bump:
{_id: {user: 'jimmy', post: 'happy_days'}}
{_id: {user: 'hans', post: 'happy_days'}}
Question: how do I maintain correct bumps count in post under all circumstances (and failures)?
The method I have come up with so far is:
To bump, upsert and check for existence. Only if inserted, increase bumps count.
To unbump, delete and check for existence. Only if deleted, decrease bumps count.
Above fails if the app crashes between the two ops and the only way to correct the bumps stats is to query all documents in bump collection and recalculate everything offline (ie there is no way to know which post have incorrect bumps count).
I suggest that you stick with what you already have. The worst that can happen if there is a failover/connection issue between your two operations is that you bump count is wrong. So what? This is not the end of the world, and nobody is going to care too much if a bump count is either 812 or 813. You can always recreate the count anyway by checking how many bumps you have for each post by running an aggregation query if something went wrong. Embrace eventual consistency!
As an alternative to updating the data in multiple places (which, for read performance, will probably be the best but as you noticed will complicate updates) it may be worth considering storing uid's of the bumps in an array (here called bump_uids) directly on the post, and just count the bumps when needed using aggregate framework;
> db.test.aggregate( [ { $match: { _id:'happy_days' } },
{ $project: { bump_uids: 1 } },
{ $unwind: '$bump_uids' },
{ $group: {_id:'$_id', bumps: { $sum:1 } } } ] )
>>> { "result" : [ { "_id" : "happy_days", "bumps" : 3 } ], "ok" : 1 }
Since MongoDB does not yet support triggers ( https://jira.mongodb.org/browse/SERVER-124 ) you have to do this the gritty way with application logic.
As a brief example:
db.follower.insert({fromId:u,toId:c});
db.user.update({_id:u},{$inc:{totalFollowing:1}});
db.user.update({_id:c},{$inc:{totalFollowers:1}});
Yes, it is not atomic etc etc however it is the way to do it. In reality many update counters like this, whether in MongoDB or not.

how do I do 'not-in' operation in mongodb?

I have two collections - shoppers (everyone in shop on a given day) and beach-goers (everyone on beach on a given day). There are entries for each day, and person can be on a beach, or shopping or doing both, or doing neither on any day. I want to now do query - all shoppers in last 7 days who did not go to beach.
I am new to Mongo, so it might be that my schema design is not appropriate for nosql DBs. I saw similar questions around join and in most cases it was suggested to denormalize. So one solution, I could think of is to create collection - activity, index on date, embed actions of user. So something like
{
user_id
date
actions {
[action_type, ..]
}
}
Insertion now becomes costly, as now I will have to query before insert.
A few of suggestions.
Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When new activity comes in you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date, activity then your inserts are much faster but your queries now have to do a LOT of work querying for both users, dates and activities.
With proposed schema, here is the insert/update statement:
db.coll.update({"user":"username", "date": "somedate"}, {"shopped":{$inc:1}}, true)
What that's saying is: "for username on somedate increment their shopped attribute by 1 and create it if it doesn't exist aka "upsert" (that's the last 'true' argument).
Here is the query for all users on a particular day who did activity1 more than once but didn't do any of activity2.
db.coll.find({"date":"somedate","shopped":0,"danced":{$gt:1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for explanation of this - and keep in mind that large documents will keep getting into your working data set and if they are huge and have a lot of useless (old) data in them, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes of that user where you track how many friends they have or other semi-stable information about them and also have a user_activity collection where you add records for each day per user what activities they did. The amount or normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figure out what those are is the first suggestion I made.
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with RDBMS, insertion can be (relatively) costly when there are indices in place on the table (ie, usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggest you can use the $nin operator to find everyone who didn't go to the beach. Eg:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
user_id
date
action
}
Then you could run a query like this:
var start = new Date(2012, 6, 3);
var end = new Date(2012, 5, 27);
db.activities.find({
date: {$gte: start, $lt: end },
action: { $in: ["beach", "shopping" ] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
One possible structure is to use an embedded array of documents (a users collection):
{
user_id: 1234,
actions: [
{ action_type: "beach", date: "6/1/2012" },
{ action_type: "shopping", date: "6/2/2012" }
]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days:
var start = new Date(2012, 6, 1);
db.people.find( {
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
}
});
Expanding on this, you can use the $and operator to find all people went shopping, but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);
db.people.find( {
$and: [
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
},
actions : {
$not: {
$elemMatch : {
action_type : { $in: ["beach"] },
date : { $gt : start }
}
}
}
]
});