Sorting hybrid bucketed schema in MongoDB

Our application allows users to create posts and comments. Data is growing fast, and we have already reviewed MongoDB scaling strategies. We like the approach presented in http://www.10gen.com/presentations/mongosf2011/schemascale , which uses a hybrid schema between embedded and non-embedded documents, bucketing comments so that they are saved in groups of 100 or 200 comments per document:
{
"_id" : '/post/2323423/1--1',
"comments" : [{
"author" : "peter",
"text" : "comment!",
"when" : "June 24 2012,
"votes": 43
},
{
"author" : "joe",
"text" : "hi!",
"when" : "June 25 2012,
"votes": 102
},
...
]
}
By bucketing comments, fewer disk reads are necessary to display thousands of comments, while at the same time documents are kept small, so writes are fast. It's perfect for paginating comments sorted by date.
We are very interested in this approach, but our application requires comments to be sorted by votes, and it also supports subcomments.
Currently we use a non-embedded approach with a separate collection for comments. It allows us to retrieve data sorted by any field, and subcommenting is easy (by reference), but performance is becoming an issue. We would like to use bucketing, but sorting by votes does not seem to fit into a bucket.
Sorting by date is trivial: just go to the next bucket as the user clicks 'next page', querying one document. But how do we manage this if we want to sort by votes? We'd have to retrieve all the buckets and then sort the comments, which is obviously inefficient.
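For concreteness, a minimal sketch of the date-ordered case, assuming (as the example above suggests) that the page number is encoded in the bucket _id; the collection name is illustrative:
// Hypothetical: fetch one page of comments with a single document read,
// assuming bucket _ids follow the '/post/<postId>/<page>--1' convention above.
var postId = "2323423", page = 1;  // illustrative values
db.commentbuckets.findOne({ _id: "/post/" + postId + "/" + page + "--1" });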
Any ideas about a proper schema design to accomplish this?

You should be able to sort on the embedded votes field:
db.collection.find({}, {_id: 0}).sort({'comments.votes': 1})
Just note that there is a bug where you can only sort ascending on such fields.
See this bug ticket

Have you tried an aggregation query?
db.commentbuckets.aggregate([
  { $match: { discussion_id: <discussion_id> } },
  { $unwind: "$comments" },
  { $sort: { "comments.votes": -1 } }
]);
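Building on that pipeline, a $skip/$limit pair can paginate the vote-sorted stream. A minimal sketch, with an illustrative discussion_id value and page size:
// Hypothetical: page 2 of the vote-sorted comments, 100 per page.
db.commentbuckets.aggregate([
  { $match: { discussion_id: "post2323423" } },  // illustrative id
  { $unwind: "$comments" },
  { $sort: { "comments.votes": -1 } },
  { $skip: 100 },
  { $limit: 100 }
]);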

Related

MongoDB collections design

I've got four tables: users, groups, surveys and questions.
The point is that users who joined a particular group have access to a survey for a time interval, from a start date to an end date. How should I organize the collection structure of such a DB in MongoDB?
For surveys and questions this will be a simple collection of surveys with an array of questions. But for this start/end behavior of a survey, it is not clear to me how to store the data.
What about something like:
Groups
{
_id : "group1",
"members" : [{"name":"A"...},{"name":"B"...}],
"surveys" : [{"surveyId":"survey1", "startDate": ISODate(),"endDate":ISODate()},{"surveyId":"survey2", "startDate": ISODate(),"endDate":ISODate()}]
}
Surveys
{
_id : "survey1",
questions : [{"text":"Atheist??"...},{....}]
}
Honestly, it depends on which pattern you want to use; you could also embed the groups inside the survey along with the registration details.
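As an illustration of querying the Groups document above, a sketch (the groups collection name is an assumption; the $elemMatch projection pulls back only the matching survey entry):
// Hypothetical: find the survey entries of group1 that are open right now.
var now = new Date();
db.groups.find(
  { _id: "group1", surveys: { $elemMatch: { startDate: { $lte: now }, endDate: { $gte: now } } } },
  { "surveys.$": 1 }  // positional projection returns the first matching entry
);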

MongoDB: modelling user-defined sort order

I am looking for a good way to implement a sort key that is completely user-definable. E.g., the user is presented with a list and may sort the elements by dragging them around; this order should be persisted.
One commonly used way is to just create an ascending integer sort field within each element:
{
"_id": "xxx1",
"sort": 2
},
{
"_id": "xxx2",
"sort": 3
},
{
"_id": "xxx3",
"sort": 1
}
While this will surely work, it might not be ideal: in case the user moves an element from the very bottom to the very top, all the sort values in between need to be updated. We are not talking about embedded documents here, so this will cause a lot of individual documents to be updated. This might be optimised by creating initial sort values with gaps in between (e.g. 100, 200, 300, 400). However, this creates the need for additional logic and re-sorting in case the space between two elements is exhausted.
Another approach comes to mind: Have the parent document contain a sorted array, which defines the order of the children.
{
"_id": "parent01",
"children": ["xxx3","xxx1","xxx2"]
}
This approach would certainly make it easier to change the order, but it has its own caveats: the parent document must always keep track of a valid list of its children. As adding children will update multiple documents, this still might not be ideal. And there needs to be complex validation of the input received from the client, as the length of this list and the elements contained in it may never be changed by the client.
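To make the trade-off concrete, persisting a reorder under this scheme is a single write on the parent. A sketch, assuming the document above and a hypothetical parents collection:
// Hypothetical: persist a new user-defined order in one update.
db.parents.update(
  { _id: "parent01" },
  { $set: { children: ["xxx1", "xxx3", "xxx2"] } }
);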
Is there a better way to implement such a use case?
Hard to say which option is better without knowing:
How often the sort order is usually updated
Which queries you're going to run against the documents, and how often
How many documents can be sorted at a time
I'm sure you're going to do many more queries than updates, so personally I would go with the first option. It's easy to implement and it's simple, which means it's going to be robust. I understand your concerns about updating multiple documents, but the updates will be done in place; no document shifting will occur, as you don't actually change the document size. Just create a simple test: generate 1k documents, then update each of them in a loop like this:
db.test.update({ '_id': arrIds[i] }, { $set: { 'sort' : i } })
You will see it will be a pretty instant operation.
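A self-contained version of that test might look like this in the shell (the test collection name and the timing approach are illustrative):
// Hypothetical benchmark: insert 1,000 documents, then rewrite every sort value.
var arrIds = [];
for (var i = 0; i < 1000; i++) {
  var id = ObjectId();
  arrIds.push(id);
  db.test.insert({ _id: id, sort: i });
}
var start = new Date();
for (var j = 0; j < arrIds.length; j++) {
  db.test.update({ _id: arrIds[j] }, { $set: { sort: arrIds.length - j } });
}
print("updated " + arrIds.length + " docs in " + (new Date() - start) + " ms");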
I like the second option as well; from a programming perspective it looks more elegant. But in practice you don't usually care much whether your update takes 10 milliseconds instead of 5 if you don't do it often, and I'm sure you don't; most applications are query oriented.
EDIT:
When you update multiple documents, even if it's an instant operation, you may run into an inconsistency issue when some documents are updated and some are not. In my case it wasn't really an issue. Let's consider an example; assume there's a list:
{ "_id" : 1, "sort" : 1 },{ "_id" : 2, "sort" : 4 },{ "_id" : 3, "sort" : 2 },{ "_id" : 4, "sort" : 3 }
so the ordered ids should look like this: 1, 3, 4, 2, according to the sort fields. Let's say we have a failure while moving id=2 to the top. The failure occurs when only two documents have been updated, so we end up with the following state, as we only managed to update ids 2 and 1:
{ "_id" : 1, "sort" : 2 },{ "_id" : 2, "sort" : 1 },{ "_id" : 3, "sort" : 2 },{ "_id" : 4, "sort" : 3 }
The data is in an inconsistent state, but we can still display the list so the user can fix the problem: the id order will be 2, 1, 3, 4 if we just order by the sort field. Why is this not a problem in my case? Because when a failure occurs, the user is redirected to an error page or given an error message; it is obvious to them that something went wrong and they should try again, so they just go back to the page and fix the order, which is only partially valid for them.
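If you would rather repair the data than just redisplay it, here is a sketch of a re-normalization pass (the items collection name and the _id tie-break are illustrative):
// Hypothetical repair: read the list in its (partially updated) order
// and rewrite dense sort values 1..N.
var i = 1;
db.items.find().sort({ sort: 1, _id: 1 }).forEach(function (doc) {
  db.items.update({ _id: doc._id }, { $set: { sort: i++ } });
});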
Just to sum it up: taking into account that this is a really rare case, plus the other benefits of the approach, I would go with it. Otherwise you will have to place everything in one document, both the elements and the array with their indexes. That might be a much bigger issue, especially when it comes to querying.
Hope it helps!

MongoDB design for scalability

We want to design a scalable database. If we have N users with 1 billion user responses, which of the 2 options below would be a good design? We want to query by userID as well as by responseID.
Having 2 Collections one for the user information and another to store the responses along with user ID. Each response is stored as a document so we will have 1 billion documents.
User Collection
{
"userid" : "userid1",
"password" : "xyz",
...
"City" : "New York"
},
{
"userid" : "userid2",
"password" : "abc",
...
"City" : "New York"
}
responses Collection
{
"userid": "userid1",
"responseID": "responseID1",
"response" : "xyz"
},
{
"userid": "userid1",
"responseID": "responseID2",
"response" : "abc"
},
{
"userid": "userid2",
"responseID": "responseID3",
"response" : "mno"
}
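For concreteness, the stated access patterns under this option map to two indexes. A sketch, where the unique constraint on responseID is an assumption:
// Hypothetical indexes for the responses collection above.
db.responses.ensureIndex({ userid: 1 });                        // all responses for a user
db.responses.ensureIndex({ responseID: 1 }, { unique: true });  // lookup by response id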
Having 1 collection to store both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
"userid" : "userid1",
"responseID1" : "xyz",
"responseID2" : "abc",
...
"responseIDN" : "mno",
"city" : "New York"
}
If you use your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage.
{
"userId": n,
"city": "foo"
"responses": {
"responseId1": "response message 1",
"responseId2": "response message 2"
}
}
As for which would render a better performance, run a few benchmark tests.
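For example, a single response under that shape can be fetched with a projection. A sketch, with an illustrative users collection and userId value:
// Hypothetical: return only responseId1 for one user.
db.users.find({ userId: 1 }, { "responses.responseId1": 1 });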
Between the two options you've listed - I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.
Embedded documents can be a boon to your schema design, but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth: as the document grows and outgrows the amount of space allocated for it on disk, MongoDB must move that document to a new location to accommodate the new document size. That can be expensive and can carry severe performance penalties when it happens often or in high-concurrency environments.
Also, querying those embedded documents can become troublesome when you want to selectively return only a subset of responses, especially across users: you cannot return only the matching embedded documents. Using the positional operator, it is possible to get the first matching embedded document, however.
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection. A document per day, per user, per ...whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and complement how you would query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any type of aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, the total response count for a user pre-aggregated is far more convenient than trying to get a count by running an aggregate query on the full dataset.
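A sketch of that pre-aggregation idea, with illustrative collection and field names: bump a per-user counter at write time instead of recomputing it later.
// Hypothetical: record a response and maintain a pre-aggregated count.
db.responses.insert({ userid: "userid1", responseID: "responseID4", response: "pqr" });
db.users.update({ userid: "userid1" }, { $inc: { responseCount: 1 } });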

Multiple documents update mongodb casbah scala

I have two MongoDB collections
promo collection:
{
"_id" : ObjectId("5115bedc195dcf55d8740f1e"),
"curr" : "USD",
"desc" : "durable bags.",
"endDt" : "2012-08-29T16:04:34-04:00",
origPrice" : 1050.99,
"qtTotal" : 50,
"qtClaimd" : 30,
}
claimed collection:
{
"_id" : ObjectId("5117c749195d62a666171968"),
"proId" : ObjectId("5115bedc195dcf55d8740f1e"),
"claimT" : ISODate("2013-02-10T16:14:01.921Z")
}
Whenever someone claims a promo, a new document is created inside the "claimed" collection, where proId is a (virtual) foreign key to the first (promo) collection. Every claim should increment the counter "qtClaimd" in the "promo" collection. What's the best way to increment a value in another collection in a transactional fashion? I understand MongoDB doesn't have isolation across multiple documents.
Also, the reason why I went with the "non-embedded" approach is as follows:
A promo gets created and published to users; claims then happen in the hundreds of thousands. I didn't think it was logical to embed claims inside the promo collection, given the number of writes that would hit a single document (Mongo has to resize and move the promo document as it grows with thousands of claims). The non-embedded approach keeps the promo collection unaffected and instead inserts a new document into the "claims" collection for each claim. Later, while generating reports, I'll have to display the "promo" details along with the "claims" details for that promo; with the non-embedded approach I'll have to query the "promo" collection first and then the "claims" collection with the "proId". It is also worth mentioning that there could be times when hundreds of "claims" happen simultaneously for the same "promo".
What's the best way to achieve a transactional effect with these two collections? I am using Scala, Casbah and Salat, all with Scala 2.10.
db.bar.update({ id: 1 }, { $inc: { 'some.counter': 1 } });
Just look at how to run this with SalatDAO; I'm not a Play user, so I wouldn't want to give you wrong advice about that. $inc is the Mongo way to increment.
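Applied to the collections above, the shell-level sequence might look like this. A sketch; note the two writes are separate operations, not an atomic pair, which is exactly the isolation caveat from the question:
// Hypothetical: record the claim, then atomically bump the promo's counter.
db.claimed.insert({ proId: ObjectId("5115bedc195dcf55d8740f1e"), claimT: new Date() });
db.promo.update({ _id: ObjectId("5115bedc195dcf55d8740f1e") }, { $inc: { qtClaimd: 1 } });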

MongoDB Table Design and Query Performance

I'm new to MongoDB. While creating a new table, a question came to my mind about how to design it for performance. My table structure looks like this:
{
"name" : string,
"data" : { "data1" : "xxx", "data2" : "yyy", "data3" : "zzz", .... }
}
The "data" field could grow until it reaches an amount of 100.000 elements ( "data100.000" : "aaaXXX"). However the number of rows in this table would be under control (between 500 and 1000).
This table will be accessed many times in my application, and I'd like to maximize the performance of any queries. I would do queries like this one (I'll put an example in Java):
new Query().addCriteria(Criteria.where("name").is(name).and("data.data3").is("zzz"));
I don't know if this would get slower as the number of "dataX" elements grows.
So the question is: Is this design correct? Should I change something?
I'll be pleased to read your advice, many thanks in advance
A document could be viewed like a table with columns, but you have to be careful: it has different usage characteristics. A document can be at most 16 MB in size, and you have to keep in mind that documents are held in memory by Mongo.
With your query the whole document will be returned. Ask yourself: do you need all the entries, or will you use a single entry on its own?
Using MongoDB for eCommerce
MongoDB Schema Design
MongoDB and eCommerce
MongoDB Transactions
This should be a good start.
What is data? I wouldn't store a single nested document with up to 100,000 fields, as you wouldn't be able to index it easily, so you would get performance issues.
You'd be better off storing the values as an array of strings; then you can index the array field, which indexes all the values:
{
"name" : string,
"data" : [ "xxx", "yyy", "zzz" ]
}
If, as in your query, you then wanted the value at a particular position in the array, instead of data.data3 you could do:
db.Collection.find( { "data.2" : "zzz" } )
Or, if you don't care about the position and just want all documents where the data array contains 'zzz' you can do:
db.Collection.find( { "data" : "zzz" } )
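To get the benefit described above, the array field needs a (multikey) index. A one-line sketch:
db.Collection.ensureIndex({ data: 1 });  // multikey: one index entry per array element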
100,000 strings is not going to get anywhere near 16MB, so you don't need to worry about that. But having 100,000 fields in a nested document or array suggests something is wrong with the design; without knowing what data is, I couldn't say for sure.