How to query two collections at the same time?

I am using MongoDB and I ended up with two Collections (unintentionally).
The first Collection (sample) has 100 million records (Tweets) with the following structure:
{
    "_id" : ObjectId("515af34297c2f607b822a54b"),
    "text" : "bla bla ",
    "id" : NumberLong("314965680476803072"),
    "user" : {
        "screen_name" : "TheFroooggie",
        "time_zone" : "Amsterdam"
    }
}
The second collection (users) has 30 million records of unique users from the tweet collection, and it looks like this:
{ "_id" : "000000_n", "target" : 1, "value" : { "count" : 5 } }
where the _id in the users collection is the user.screen_name from the tweets collection, target is their status (spammer or not), and value.count is the number of times a user appeared in our first (sample) collection (i.e. the number of captured tweets).
Now I'd like to make the following query:
I'd like to return all the documents from the sample collection (tweets) where the user has the target value = 1
In other words, I want to return all the tweets of all the spammers for example.

As you receive the tweets, you could upsert them into a collection, using the author information as the key in the "query" portion of the update. The update document could use the $addToSet operator to put the tweet into a tweets array. You'd end up with a collection that has the author and an array of tweets, and you could then do your spammer classification for each author and have their associated tweets.
So, you would end up doing something like this:
db.samples.update({"author":"joe"},{$addToSet:{"tweets":{"tweet_id":2}}},{upsert:true})
This approach does have the likely drawback of growing the document past its initially allocated size, which means it would be moved and expanded on disk. You would likely incur some penalty for index updating as well.
You could also take an approach of storing a spam rating with each tweet document and later pulling those based on user id.
As others have pointed out, there is nothing wrong with setting up the appropriate indexes and using a cursor to loop through your users, pulling their tweets.
The approach you choose should be based on your intended access pattern. It sounds like you are in a good place where you can experiment with several different possible solutions.
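For illustration, a minimal shell sketch of that cursor approach (collection and field names are taken from the question; the indexes are an assumption):
// Assumes an index on users.target and on sample "user.screen_name"
db.users.find({ target: 1 }).forEach(function(spammer) {
    // _id in the users collection is the user's screen_name
    var tweets = db.sample.find({ "user.screen_name": spammer._id }).toArray();
    // process/classify the spammer's tweets here
});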


Sort collection permanently in MongoDB

Whenever we do db.collection.find().sort(), only the output is sorted, not the collection itself;
i.e. if I do db.collection.find() I see the original collection, not the sorted one.
Is there any way to sort the collection itself instead of just sorting the output?
Exporting the sorted result into an entirely new collection would also work,
if I have a numbered _id field (like _id:1, _id:2, _id:3 and so on).
While I do not see any reason for doing this (an index on the field you are going to sort on will get you this sort fast), here is a solution for your problem:
Say you have your test collection like this:
{ "_id" : ObjectId("5273f6987c6c502364ddfe94"), "n" : 5 }
{ "_id" : ObjectId("5273f6e57c6c502364ddfe95"), "n" : 14}
{ "_id" : ObjectId("5273f6ee7c6c502364ddfe96"), "n" : -5}
Then the following command will create a sorted collection for you
db.test.find().sort({ n : 1 }).forEach(function(e){
    db.testSorted.insert(e);
});
You can achieve completely the same with this (which I assume might perform faster, but I have not done any testing):
db.testSorted.insert(db.test.find().sort({n : 1}).toArray());
And just to make this answer complete (though I understand it may be overkill), you can also do this with the aggregation framework's $out option.
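For the example above, that could look like this (a sketch; $out requires MongoDB 2.6 or later, and a large sort may need allowDiskUse):
db.test.aggregate([
    { $sort: { n: 1 } },      // sort by n ascending
    { $out: "testSorted" }    // write the sorted result to a new collection
]);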
Just to highlight: with all this you can solve a bigger problem, namely saving some sort of modification/subset of a previous collection into another collection.
Documents in a collection are stored in natural order which is affected by document moves (when the document grows larger than the current record space allocated) and deletions (free space can be reused for inserted/moved documents). There is currently (as at MongoDB 2.4) no option to control the order of documents on disk aside from using a capped collection, which is a fixed-size collection that maintains insertion order but is subject to a number of restrictions.
An index is the appropriate way to efficiently return documents in an expected sort order. For more information see: Using Indexes to Sort Query Results in the MongoDB manual.
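For the test collection above, a minimal sketch would be:
// Create an index on n; queries sorting on n can then use it
db.test.ensureIndex({ n : 1 });
db.test.find().sort({ n : 1 });  // served by the index, no in-memory sort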
A related feature is a clustered index, which would store documents on disk to match an index ordering. This is not a current feature of MongoDB, although it has been requested (see SERVER-3294).

MongoDB design for scalability

We want to design a scalable database. If we have N users with 1 billion user responses, which of the 2 options below would be a good design? We would want to query based on userID as well as responseID.
Having 2 collections: one for the user information and another to store the responses along with the user ID. Each response is stored as a document, so we will have 1 billion documents.
User Collection
{
    "userid" : "userid1",
    "password" : "xyz",
    ...
    "City" : "New York"
},
{
    "userid" : "userid2",
    "password" : "abc",
    ...
    "City" : "New York"
}
responses Collection
{
    "userid": "userid1",
    "responseID": "responseID1",
    "response" : "xyz"
},
{
    "userid": "userid1",
    "responseID": "responseID2",
    "response" : "abc"
},
{
    "userid": "userid2",
    "responseID": "responseID3",
    "response" : "mno"
}
Having 1 collection to store both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
    "userid" : "userid1",
    "responseID1" : "xyz",
    "responseID2" : "abc",
    ...
    "responseIDN" : "mno",
    "city" : "New York"
}
If you use your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage.
{
    "userId": n,
    "city": "foo",
    "responses": {
        "responseId1": "response message 1",
        "responseId2": "response message 2"
    }
}
As for which would render a better performance, run a few benchmark tests.
Between the two options you've listed - I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.
Embedded documents can be a boon to your schema design - but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth - as the document grows and outgrows the amount of space allocated for it on disk, MongoDB must move that document to a new location to accommodate the new document size. That can be expensive and carry severe performance penalties when it happens often or in high-concurrency environments.
Also, querying on those embedded documents can become troublesome when you are looking to selectively return only a subset of responses, especially across users - you cannot return only the matching embedded documents. Using the positional operator, however, it is possible to get the first matching embedded document.
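As a sketch of that positional-operator behavior (this assumes the responses are stored in an array of subdocuments rather than as dynamic keys):
// Returns only the first array element whose responseID matches
db.users.find(
    { "responses.responseID": "responseID2" },
    { "responses.$": 1 }
);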
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection: a document per day, per user, per ...whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and complement how you would query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
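For instance, a per-user-per-day bucket might look like this (the responseBuckets collection name and the day field are assumptions, not part of the question):
// Upsert the response into the user's bucket for that day
db.responseBuckets.update(
    { userid: "userid1", day: "2014-01-15" },
    { $push: { responses: { responseID: "responseID1", response: "xyz" } } },
    { upsert: true }
);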
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any type of aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, the total response count for a user pre-aggregated is far more convenient than trying to get a count by running an aggregate query on the full dataset.
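A sketch of such a pre-aggregated counter, bumped as each response is written (the responseCount field name is an assumption):
// Increment the user's running response total alongside the response insert
db.users.update(
    { userid: "userid1" },
    { $inc: { responseCount: 1 } }
);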

MongoDB index within a document or only across a document collection

I am collecting data from a streaming API and I want to create a real-time analytics dashboard. Every time a new record appears at the end of the stream, I update a counter in the document below.
From a design perspective: am I correct to use only one document, as in the example below?
{
    "_id" : ObjectId("5238beb4d4bed9e444c99978"),
    "counts" : {
        "hours" : {
            "1" : 835,
            "2" : 1007,
            ...
            "3" : 174
        }
    }
}
The benefit of this approach is that only one document needs to be sent to the real-time analytics dashboard. Also, after a year this document would have only 365 * 24 = 8,760 fields, 1 for each hour in that year?
What about indexing? Can I create an index on counts.hours if I only have one document? Or do indexes only work across collections in MongoDB? Do indexes help with finding documents faster, or also with finding fields inside documents?
If I could create an index on counts.hours, then the counter increment process could find the correct hour to increment (per new document at the end of the stream) much more efficiently.
You can create indexes on fields embedded in a document. In the case above:
yourCollection.ensureIndex({ 'counts.hours':1 });
The index will help you optimize queries that return documents based on the 'counts.hours' field:
yourCollection.find({ 'counts.hours': 1 });
Your data structure design should depend on the kind of queries and updates you are planning to do. In the case you describe, I imagine you will be adding members to the 'hours' object; updates like that might be expensive, since MongoDB pads each collection record optimizing for the case where the record size is stable across updates.
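For reference, the per-hour increment itself can be done in place with $inc, something like this (the stats collection name is an assumption):
// Atomically bump the counter for hour 5 of the single document
db.stats.update(
    { _id: ObjectId("5238beb4d4bed9e444c99978") },
    { $inc: { "counts.hours.5": 1 } }
);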

Multiple documents update - MongoDB, Casbah, Scala

I have two MongoDB collections
promo collection:
{
    "_id" : ObjectId("5115bedc195dcf55d8740f1e"),
    "curr" : "USD",
    "desc" : "durable bags.",
    "endDt" : "2012-08-29T16:04:34-04:00",
    "origPrice" : 1050.99,
    "qtTotal" : 50,
    "qtClaimd" : 30
}
claimed collection:
{
    "_id" : ObjectId("5117c749195d62a666171968"),
    "proId" : ObjectId("5115bedc195dcf55d8740f1e"),
    "claimT" : ISODate("2013-02-10T16:14:01.921Z")
}
Whenever someone claims a promo, a new document is created inside the "claimed" collection, where proId is a (virtual) foreign key to the first (promo) collection. Every claim should increment a counter "qtClaimd" in the "promo" collection. What's the best way to increment a value in another collection in a transactional fashion? I understand MongoDB doesn't have isolation across multiple docs.
Also, the reason why I went with the "non-embedded" approach is as follows:
A promo gets created and published to users, and then claims happen in the hundreds of thousands. I didn't think it was logical to embed claims inside the promo collection, given the number of writes that would happen on a single document (because Mongo moves and rewrites the promo document when its size grows due to thousands of claims). The non-embedded approach keeps the promo document unaffected and just inserts a new document into the "claims" collection. Later, while generating a report, I'll have to display the "promo" details along with the "claims" details for that promo; with the non-embedded approach I'll have to first query the "promo" collection and then the "claims" collection with the proId. *Also worth mentioning that there could be times where hundreds of "claims" happen simultaneously for the same "promo".*
What's the best way to achieve a transactional effect with these two collections? I am using Scala, Casbah, and Salat, all with Scala 2.10.
db.bar.update({ id: 1 }, { $inc: { 'some.counter': 1 } });
Just look at how to run this with SalatDAO; I'm not a Play user, so I wouldn't want to give you wrong advice about that. $inc is the Mongo way to increment.
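In shell terms, the two steps would look something like this (a sketch using the field names from the question; note these are two separate operations, not one transaction, which is exactly the isolation caveat you raised):
// 1. Record the claim
db.claimed.insert({
    proId: ObjectId("5115bedc195dcf55d8740f1e"),
    claimT: new Date()
});
// 2. Atomically increment the counter on the promo document
db.promo.update(
    { _id: ObjectId("5115bedc195dcf55d8740f1e") },
    { $inc: { qtClaimd: 1 } }
);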

MongoDB Table Design and Query Performance

I'm new to MongoDB. When creating a new table, a question came to my mind related to how to design it and its performance. My table structure looks like this:
{
    "name" : string,
    "data" : { "data1" : "xxx", "data2" : "yyy", "data3" : "zzz", .... }
}
The "data" field could grow until it reaches an amount of 100.000 elements ( "data100.000" : "aaaXXX"). However the number of rows in this table would be under control (between 500 and 1000).
This table will be accessed many times in my application and I'd like to maximize the performance of any queries. I would do queries like this one (I'll put an example in Java):
new Query().addCriteria(Criteria.where("name").is(name).and("data.data3").is("zzz"));
I don't know if this would get slower when the number of "dataX" elements grows.
So the question is: Is this design correct? Should I change something?
I'll be pleased to read your advice, many thanks in advance
A document could be viewed like a table with columns, but you have to be careful: it has other usage characteristics. The maximum document size is 16 MB, and you have to keep in mind that documents are held in memory by Mongo.
With your query the whole document will be returned. Ask yourself: do you need all the entries, or will you use a single entry on its own?
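If a single entry is enough, a projection can keep the rest of the data subdocument out of the result (a sketch against the structure above):
// Return only name and data.data3 instead of the whole document
db.collection.find(
    { "name": "someName", "data.data3": "zzz" },
    { "name": 1, "data.data3": 1 }
);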
Using MongoDB for eCommerce
MongoDB Schema Design
MongoDB and eCommerce
MongoDB Transactions
This should be a good start.
What is data? I wouldn't store a single nested document with up to 100,000 fields, as you wouldn't be able to index it easily, so you would get performance issues.
You'd be better off storing the data as an array of strings; then you can index the array field, which would index all the values.
{
    "name" : string,
    "data" : [ "xxx", "yyy", "zzz" ]
}
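The array field can then be covered by a single multikey index (a sketch; the collection name follows the examples below):
// A multikey index indexes every value inside the data array
db.Collection.ensureIndex({ data: 1 });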
If, as in your query, you then wanted the value at a particular position in the array, instead of data.data3 you could do:
db.Collection.find( { "data.2" : "zzz" } )
Or, if you don't care about the position and just want all documents where the data array contains 'zzz', you can do:
db.Collection.find( { "data" : "zzz" } )
100,000 strings is not going to get anywhere near 16 MB, so you don't need to worry about that; but having 100,000 fields in a nested document or array indicates something is wrong with the design. Without knowing what data is, I couldn't say for sure.