MongoDB design for scalability - mongodb

We want to design a scalable database. If we have N users with 1 billion user responses, which of the two options below is the better design? We want to query by user ID as well as by response ID.
Option 1: Two collections, one for the user information and another for the responses along with the user ID. Each response is stored as a document, so we will have 1 billion documents.
User Collection
{
  "userid" : "userid1",
  "password" : "xyz",
  ...
  "City" : "New York"
},
{
  "userid" : "userid2",
  "password" : "abc",
  ...
  "City" : "New York"
}
responses Collection
{
  "userid": "userid1",
  "responseID": "responseID1",
  "response" : "xyz"
},
{
  "userid": "userid1",
  "responseID": "responseID2",
  "response" : "abc"
},
{
  "userid": "userid2",
  "responseID": "responseID3",
  "response" : "mno"
}
Option 2: One collection that stores both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
  "userid" : "userid1",
  "responseID1" : "xyz",
  "responseID2" : "abc",
  ...
  "responseIDN" : "mno",
  "city" : "New York"
}

If you go with your first option, I'd use a relational database (such as MySQL) rather than MongoDB. If you're set on MongoDB, use it to your advantage:
{
  "userId": n,
  "city": "foo",
  "responses": {
    "responseId1": "response message 1",
    "responseId2": "response message 2"
  }
}
As for which performs better, run a few benchmark tests.

Of the two options you've listed, I think a separate collection would scale better, possibly combined with some use of embedded documents.
Embedded documents can be a boon to your schema design, but they do not work well when the set of embedded documents grows without bound (responses, in your case). The reason is document growth: when a document outgrows the space allocated for it on disk, MongoDB must move it to a new location to accommodate the new size. That can be expensive and carries severe performance penalties when it happens often or in high-concurrency environments.
Querying embedded documents also becomes troublesome when you want to return only a subset of responses, especially across users: you cannot return only the matching embedded documents. The positional operator does let you get the first matching embedded document, however.
So I would recommend using a separate collection for the responses.
That said, as mentioned above, I would also suggest experimenting with other ways to group the responses in that collection: a document per day, per user, or per whatever other dimension you might have.
Group them in ways that allow multiple embedded documents per document and that complement how you will query for them. If you can find the sweet spot between using embedded documents in that collection and minimizing document growth, you'll have fewer documents overall and smaller indexes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
Lastly (and optionally), with that type of data set, consider using increment counters up front, at write time, to supply any aggregated reporting you might need down the road. The Aggregation Framework in MongoDB is great, but having, say, the total response count for a user pre-aggregated is far more convenient than running an aggregate query over the full dataset to get a count.
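The grouping and counter suggestions above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name bucketUpsert, the per-user-per-day bucket key, the "day" and "count" fields, and the "responses" collection name are all hypothetical and not taken from the question. The function only builds the arguments for an upsert; it does not talk to a server.

```javascript
// Hypothetical helper: builds the arguments for an upsert that appends one
// response to a per-user, per-day bucket document.
function bucketUpsert(userId, responseId, response, when) {
  // Bucket key: the UTC calendar day of the response, e.g. "2013-02-10".
  const day = when.toISOString().slice(0, 10);
  return {
    filter: { userid: userId, day: day },
    update: {
      // Append the response to the bucket's embedded array.
      $push: { responses: { responseID: responseId, response: response } },
      // Pre-aggregated counter: number of responses in this bucket.
      $inc: { count: 1 }
    },
    // Create the bucket document if it does not exist yet.
    options: { upsert: true }
  };
}

const op = bucketUpsert("userid1", "responseID1", "xyz",
                        new Date("2013-02-10T16:14:01Z"));
console.log(JSON.stringify(op, null, 2));

// Shell usage (not run here):
//   db.responses.updateOne(op.filter, op.update, op.options)
```

With this layout, lookups by user would use an index on userid and day, and lookups by response ID would need an index on responses.responseID. Whether a day is the right bucket size depends on how many responses a user produces, which is what the benchmarking is for.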

Related

MongoDB find and iterate vs count

I have a peculiar problem with Mongo.
We have a collection of 800k documents with the following structure.
{
  "_id" : ObjectId("5bd844199114bab3b2c19fab"),
  "u" : 0,
  "c" : 0,
  "iden" : "343754856",
  "name" : "alan",
  "email" : "mkasd#abc.com",
  "mobile" : "987654321093456",
  "expires" : ISODate("2018-11-29T11:44:25.453Z"),
  "created" : ISODate("2018-10-30T11:44:25.453Z")
}
We have indexed iden and name on which we generally query.
We tried two types of queries.
db.Collection.find({"iden": "343754856", "name": "alan", "created": {"$gt": ....}}).count()
where "created" is an unindexed field.
db.Collection.find({"iden": "343754856", "name": "alan"})
and iterate over all records to filter based on created.
However, MongoDB seems to take an enormous amount of time to execute the second query, even though it was supposed to be an optimization over the first.
Any leads on what is going wrong here?
We are using the Go library.

How could the second version be an optimization over the first?
Your first query retrieves a single number from the MongoDB server: the overall count of the query result. Your second version fetches all matching documents, and you do the counting on the client side.
MongoDB can count the result documents internally just as fast as you could in your Go client. Having the server send the results, then fetching and unmarshaling them at the client, takes orders of magnitude more time (depending on a lot of factors).
Note that if you have a compound index containing "iden" and "name", the index may still be used even if you add more filters (like "created" in your example), but MongoDB has to iterate over the partial results to apply the rest of the query. To find out whether the index is used, execute the following command:
db.Collection.find(
  {"iden": "343754856", "name": "alan", "created": {"$gt": ....}}
).explain()

How to query two collections at the same time?

I am using MongoDB and I ended up with two Collections (unintentionally).
The first Collection (sample) has 100 million records (Tweets) with the following structure:
{
  "_id" : ObjectId("515af34297c2f607b822a54b"),
  "text" : "bla bla ",
  "id" : NumberLong("314965680476803072"),
  "user" : {
    "screen_name" : "TheFroooggie",
    "time_zone" : "Amsterdam"
  }
}
The second Collection (users) has 30 million records of unique users from the tweet collection, and it looks like this:
{ "_id" : "000000_n", "target" : 1, "value" : { "count" : 5 } }
where the _id in the users collection is the user.screen_name from the tweets collection, target is their status (spammer or not), and value.count is the number of times a user appeared in our first collection (sample), i.e. the number of captured tweets.
Now I'd like to make the following query:
I'd like to return all the documents from the sample collection (tweets) where the user has the target value = 1
In other words, I want to return all the tweets of all the spammers, for example.

As you receive the tweets, you could upsert them into a collection, using the author information as the key in the "query" document portion of the update. The update document could use the $addToSet operator to put the tweet into a tweets array. You'll end up with a collection that has the author and an array of tweets. You can then do your spammer classification for each author and have their associated tweets.
So, you would end up doing something like this:
db.samples.update({"author":"joe"},{$addToSet:{"tweets":{"tweet_id":2}}},{upsert:true})
This approach does have the likely drawback of growing the document past its initially allocated size on disk, which means it would have to be moved and expanded on disk. You would likely incur some penalty for index updating as well.
You could also take an approach of storing a spam rating with each tweet document and later pulling those based on user id.
As others have pointed out, there is nothing wrong with setting up the appropriate indexes and using a cursor to loop through your users pulling their tweets.
The approach you choose should be based on your intended access pattern. It sounds like you are in a good place where you can experiment with several different possible solutions.
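The "loop through your users pulling their tweets" approach could look roughly like the sketch below. It is an illustration only: the in-memory users array stands in for the users collection with made-up sample documents, the helper name inFilters and the batch size are hypothetical, and the code only builds the query filters rather than contacting a server. It assumes an index on "user.screen_name" in the sample collection.

```javascript
// Made-up stand-in for documents from the users collection.
const users = [
  { _id: "alice", target: 1, value: { count: 2 } },
  { _id: "bob",   target: 0, value: { count: 1 } },
  { _id: "carol", target: 1, value: { count: 1 } }
];

// Step 1: collect the screen names of all spammers. In the shell this
// would roughly be db.users.find({ target: 1 }, { _id: 1 }).
const spammerIds = users.filter(u => u.target === 1).map(u => u._id);

// Step 2: split the ids into batches so that each $in list stays small,
// and build one filter per batch for the sample (tweets) collection.
function inFilters(ids, batchSize) {
  const filters = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    filters.push({ "user.screen_name": { $in: ids.slice(i, i + batchSize) } });
  }
  return filters;
}

const filters = inFilters(spammerIds, 1);
console.log(JSON.stringify(filters));

// Shell usage (not run here), one query per batch:
//   filters.forEach(f => db.sample.find(f).forEach(printjson))
```

Batching matters here because a single $in list holding millions of screen names would make the query document itself very large.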

Multiple documents update mongodb casbah scala

I have two MongoDB collections
promo collection:
{
  "_id" : ObjectId("5115bedc195dcf55d8740f1e"),
  "curr" : "USD",
  "desc" : "durable bags.",
  "endDt" : "2012-08-29T16:04:34-04:00",
  "origPrice" : 1050.99,
  "qtTotal" : 50,
  "qtClaimd" : 30
}
claimed collection:
{
  "_id" : ObjectId("5117c749195d62a666171968"),
  "proId" : ObjectId("5115bedc195dcf55d8740f1e"),
  "claimT" : ISODate("2013-02-10T16:14:01.921Z")
}
Whenever someone claims a promo, a new document is created in the "claimedPro" collection, where proId is a (virtual) foreign key to the first (promo) collection. Every claim should increment the counter "qtClaimd" in the "promo" collection. What's the best way to increment a value in another collection in a transactional fashion? I understand MongoDB doesn't have isolation across multiple documents.
The reason I went with the "non-embedded" approach is as follows:
A promo gets created and published to users, and then claims come in by the hundreds of thousands. I didn't think it was logical to embed claims inside the promo document given the number of writes that would hit a single document (because Mongo resizes the promo document as it grows with thousands of claims). The non-embedded approach leaves the promo collection unaffected and instead inserts a new document into the "claims" collection. Later, while generating a report, I'll have to display the "promo" details along with the "claims" details for that promo, so with the non-embedded approach I'll have to query the "promo" collection first and then the "claims" collection by "proId". *It is also worth mentioning that there could be times when hundreds of "claims" happen simultaneously for the same "promo".*
What's the best way to achieve a transactional effect with these two collections? I am using Scala, Casbah and Salat, all with Scala 2.10.

db.bar.update({ id: 1 }, { $inc: { 'some.counter': 1 } });
Just look at how to run this with SalatDAO; I'm not a Play user, so I wouldn't want to give you wrong advice about that. $inc is the Mongo way to increment.
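Applied to the promo/claimed schema from the question, the $inc suggestion might look like the sketch below. The helper name claimOps is hypothetical, and the guard on qtClaimd is an added assumption (it presumes the caller already knows qtTotal from the promo document); neither comes from the answer above. The promo id is passed as a plain string for illustration, whereas the real documents use ObjectId. The two writes remain separate operations, so this shows an atomic single-document counter, not a multi-document transaction.

```javascript
// Hypothetical helper: describes the two writes needed for one claim.
function claimOps(promoId, qtTotal, now) {
  return {
    // 1) Atomically bump the counter, but only while claims remain.
    promoFilter: { _id: promoId, qtClaimd: { $lt: qtTotal } },
    promoUpdate: { $inc: { qtClaimd: 1 } },
    // 2) Claim document, to be inserted only if step 1 modified a promo.
    claimDoc: { proId: promoId, claimT: now }
  };
}

const ops = claimOps("5115bedc195dcf55d8740f1e", 50, new Date());
console.log(JSON.stringify(ops, null, 2));

// Shell usage (not run here):
//   const res = db.promo.updateOne(ops.promoFilter, ops.promoUpdate)
//   if (res.modifiedCount === 1) db.claimed.insertOne(ops.claimDoc)
```

Because the condition and the increment are evaluated together on a single document, simultaneous claims cannot push qtClaimd past qtTotal. If the insert in step 2 fails after the counter was bumped, the application has to compensate itself, for example with a matching $inc of -1.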

Sorting hybrid bucketed schema in MongoDB

Our application allows users to create posts and comments. Data is growing fast and we have already reviewed MongoDB scaling strategies. We like the approach presented in http://www.10gen.com/presentations/mongosf2011/schemascale , which uses a hybrid schema between embedded and non-embedded documents, bucketing comments so that they are saved in groups of 100 or 200 comments per document.
{
  "_id" : '/post/2323423/1--1',
  "comments" : [{
      "author" : "peter",
      "text" : "comment!",
      "when" : "June 24 2012",
      "votes": 43
    },
    {
      "author" : "joe",
      "text" : "hi!",
      "when" : "June 25 2012",
      "votes": 102
    },
    ...
  ]
}
By bucketing comments, fewer disk reads are necessary to display thousands of comments, while at the same time, documents are kept small so writes are fast. It's perfect to paginate comments sorted by date.
We are very interested in this approach, but our application requires sorting comments by votes and supporting subcomments.
Currently we use a non-embedded approach with a separate collection for comments. It allows us to retrieve data sorted by any field, and subcommenting is easy (by reference), but performance is becoming an issue. We would like to use bucketing, but sorting by votes does not seem to fit in a bucket.
Sorting by date is trivial: just fetch the next bucket as the user clicks 'next page', querying one document. But how do we manage this if we want to sort by votes? We'd have to retrieve all buckets and then sort the comments, which is obviously inefficient...
Any ideas about a proper schema design to accomplish this?

You should be able to sort by descending:
db.collection.find({},{_id:0}).sort({'comments.votes':1})
Just note that there is a bug where you can only sort by ascending.
See this bug ticket

Have you tried an aggregation query?
db.commentbuckets.aggregate([
  { $match: { discussion_id: <discussion_id> } },
  { $unwind: "$comments" },
  { $sort: { "comments.votes": -1 } }
]);
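The aggregation suggestion above could be extended with paging along these lines. This is a minimal sketch, assuming the bucket documents carry a discussion_id field as in that answer; the helper name commentsByVotesPipeline and the page and pageSize parameters are hypothetical. The function only builds the pipeline and does not contact a server.

```javascript
// Hypothetical helper: builds an aggregation pipeline that returns one page
// of comments for a discussion, highest votes first.
function commentsByVotesPipeline(discussionId, page, pageSize) {
  return [
    { $match: { discussion_id: discussionId } }, // only this discussion's buckets
    { $unwind: "$comments" },                    // one output document per comment
    { $sort: { "comments.votes": -1 } },         // most-voted comments first
    { $skip: page * pageSize },                  // page is zero-based
    { $limit: pageSize }
  ];
}

const pipeline = commentsByVotesPipeline("2323423", 2, 10);
console.log(JSON.stringify(pipeline, null, 2));

// Shell usage (not run here):
//   db.commentbuckets.aggregate(pipeline)
```

Note that this still unwinds and sorts every comment of the discussion on each request, so it moves the work to the server rather than avoiding it; whether that is fast enough is something to benchmark.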

MongoDB Table Design and Query Performance

I'm new to MongoDB. While creating a new table, a question came to mind about how to design it for performance. My table structure looks like this:
{
  "name" : string,
  "data" : { "data1" : "xxx", "data2" : "yyy", "data3" : "zzz", .... }
}
The "data" field could grow until it reaches 100,000 elements ( "data100.000" : "aaaXXX"). However, the number of rows in this table would stay under control (between 500 and 1000).
This table will be accessed many times in my application and I'd like to maximize the performance of any queries. I would run queries like this one (an example in Java):
new Query().addCriteria(Criteria.where("name").is(name).and("data.data3").is("zzz"));
I don't know whether this would get slower as the number of "dataX" elements grows.
So the question is: Is this design correct? Should I change something?
I'd be pleased to read your advice. Many thanks in advance.

A document could be viewed like a table with columns, but you have to be careful: it has different usage characteristics. The maximum document size is 16 MB, and you have to keep in mind that documents are held in memory by Mongo.
With your query the whole document will be returned. Ask yourself: do you need all the entries, or will you use a single entry on its own?
Using MongoDB for eCommerce
MongoDB Schema Design
MongoDB and eCommerce
MongoDB Transactions
This should be a good start.

What is data? I wouldn't store a single nested document with up to 100,000 fields, as you wouldn't be able to index it easily, so you would get performance issues.
You'd be better off storing it as an array of strings; then you can index the array field, which indexes all the values.
{
  "name" : string,
  "data" : [ "xxx", "yyy", "zzz" ]
}
If, as in your query, you then want the value at a particular position in the array, instead of data.data3 you could do:
db.Collection.find( { "data.2" : "zzz" } )
Or, if you don't care about the position and just want all documents where the data array contains 'zzz', you can do:
db.Collection.find( { "data" : "zzz" } )
100,000 strings is not going to get anywhere near 16MB, so you don't need to worry about that. Still, having 100,000 fields in a nested document or array indicates that something is wrong with the design, though without knowing what data is I couldn't say for sure.