I have a web server (40gig hd, 1 gig ram) that runs Mongo and a Rails application.
The Mongo DB is a document store of Twitter tweets and users, which has several million records. I perform map-reduce queries on the data to extract things like most common hashtags, words, mentions etc (very standard stuff). The meta-data of each tweet is already stored, so the map-reduce is really as efficient as a single collect.
However, since it is run on a (fairly) large dataset, it can't be done in real-time anymore - for example I have a report generator that works out a whole bunch of these map-reduces in a row and takes about 2 minutes for 20 thousand tweets.
What is the quickest, cheapest way to scale mongo, especially in map-reduce performance? I can set up an additional server and split the load, but wonder if I should use sharding, replication or both? Sharding may be overkill for this situation.
Would love some input on my MySQL-Mongo connection. MySQL contains Twitter profiles that store Twitter IDs for each profile. Each time a map-reduce is done, it collects all the IDs to be fed as options into the map-reduce, i.e.:
    @profile_tweet_ids = current_profile_tweet_ids # array of IDs
    @daily_trend = TwitterTweet.daily_trend :query => {:twitter_id => {"$in" => @profile_tweet_ids}}
The mapreduce function in TwitterTweet looks like:
    def daily_trend(options = {})
      options[:out] = "daily_trend"
      map = %Q(
        function() {
          if (this.created_at != null) {
            emit(this.created_at.toDateString(), 1);
          }
        }
      )
      result = collection.map_reduce(map, standard_reduce, options)
      normalize_results(result)
    end
Any advice is appreciated!
If you are doing simple counts, sums, uniques etc, you may be able to avoid using map-reduce completely. You can use the $inc operator to get most of the stuff that you need in real-time.
I have explained this in detail in my blog post on real-time analytics with MongoDB.
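For illustration only, here is a minimal mongo shell sketch of that pre-aggregated counter approach; the hashtag_counts collection and its field names are assumptions, not part of the original setup:

    // Increment a per-day counter for each hashtag as tweets arrive,
    // instead of recomputing totals with map-reduce afterwards.
    db.hashtag_counts.update(
        { tag: "mongodb", day: "2012-03-14" },   // hypothetical counter key
        { $inc: { count: 1 } },
        { upsert: true }                         // create the counter if it does not exist yet
    );

    // Real-time read: top hashtags for a day, no map-reduce required.
    db.hashtag_counts.find({ day: "2012-03-14" }).sort({ count: -1 }).limit(10);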
Sounds like your use case is more along the lines of online stream / event processing.
You can use Mongo or another database / caching product to store the reference data, and an event processing framework for receiving and processing the events. There are a few tools that can help you with that - off the top of my head: Twitter Storm, Apache S4, GigaSpaces XAP (disclaimer - I work for GigaSpaces) and GridGain.
Use one of the cloud services like MongoLab. Depends on your definition of cheap, though.
The answer regarding using operators rather than map-reduce has merit, and may be far more beneficial to your efforts to get real-time responses. Map-reduce on MongoDB does not lend itself to real-time responses.
Further to that, you may also benefit from the new aggregation framework (http://www.mongodb.org/display/DOCS/Aggregation+Framework), once that is available in the next release.
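As a hedged sketch of what the daily-trend map-reduce above could look like in the aggregation framework once it ships (assuming the tweets live in a twitter_tweets collection with a created_at date field):

    // Count tweets per calendar day with the aggregation framework (mongo shell).
    db.twitter_tweets.aggregate([
        { $match: { created_at: { $ne: null } } },
        { $group: {
            _id: {
                year:  { $year:       "$created_at" },
                month: { $month:      "$created_at" },
                day:   { $dayOfMonth: "$created_at" }
            },
            count: { $sum: 1 }
        } },
        { $sort: { "_id.year": 1, "_id.month": 1, "_id.day": 1 } }
    ]);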
To answer the more general question about how to scale out map-reduce: adding a new server may not help if you are simply going to add it as a secondary, since a secondary does not have the capability to store your M/R results in a collection, so inline output is your only option. If you do not need to store results in a collection then this is your easiest way forward. For more information, see the in-depth discussion here: http://groups.google.com/group/mongodb-user/browse_thread/thread/bd8f5734dc64117a
Sharding can help with scaling out, but bear in mind that you will need to run everything through a mongos process and run config servers, and that the mongos will need to finalize the result sets returned from each shard. So you add a new potential bottleneck, depending on your data, and you will need more than just one extra machine to have it working in a reliable manner.
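Purely as a rough illustration (the database, collection and shard-key names below are assumptions), the setup is driven from a mongos shell along these lines:

    // Run against mongos (not a mongod) once the config servers and shards are up.
    sh.enableSharding("tweetdb")                       // assumed database name
    sh.shardCollection("tweetdb.twitter_tweets",       // assumed collection name
                       { twitter_id: 1 })              // assumed shard key
    sh.status()                                        // verify shards and chunk distribution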
I am using MongoDB Atlas Free Tier hosted on GCP. I have documents which have arrays containing 300 KB of data. A simple get-by-ID query takes around 8-15 seconds. There are fewer than 50 records in the collection, so indexing is probably not the issue. Also, I have used my own custom IDs rather than the built-in ObjectIds in my collection. Is this much query time normal? If yes, what are some ways to address this issue, as I need fast real-time analytics on the frontend? I already have Redis in mind, but is there a better way to address this?
Ensure your operations are not throttled. https://docs.atlas.mongodb.com/reference/free-shared-limitations/
Test performance with a different driver (another language), and verify you are using the most recent driver release.
Test smaller documents to identify whether time is being expended on the server or over the network.
Test with mongo shell.
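For example, one quick way to see whether the time is being spent on the server or in transit is to run the same lookup from the mongo shell with explain; the collection name and ID here are placeholders:

    // Server-side execution time for the "get by ID" query (mongo shell).
    db.mycollection.find({ _id: "my-custom-id" }).explain("executionStats")
    // Compare executionStats.executionTimeMillis with the 8-15 s seen in the app:
    // a small server time combined with a large end-to-end time points at the
    // network or driver rather than the server.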
As for an answer, I recommend not using the M0 Atlas tier, or at least choosing it wisely: don't pick a US-based cluster if you are thousands of miles away from the States. Don't get me wrong, it's a good product, but it depends on your costs.
As for myself, I prefer MongoDB Community Edition deployed on my own VPS/VDS. Of course it doesn't give you the nice web interface you have seen in Atlas, and there is no support for the Realm (Stitch) functionality, but you could design that yourself. Also, every performance issue is then yours to handle.
As for me, I use MongoDB not for real-time data but for visual snapshots on the front end, and I have no problems with performance. If I do run into any, I deal with them myself: indexing, increasing VPS CPU/RAM, optimizing queries and so on.
Also, one more thing about your problem: «I have documents which have arrays containing 300kb data»
If you have an array field in your schema and it stores lots of data, especially embedded docs, are you sure you are using the right schema pattern?
You might want to take a look at the articles at MongoDB University about architecture patterns.
It will probably be much better for you to keep the embedded docs in a separate collection and pull them in via an aggregation $lookup when they are needed.
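A minimal sketch of that pattern, assuming hypothetical reports and report_items collections where the bulky array items have been moved into the child collection:

    // Fetch one parent document and join its items back in only when needed.
    db.reports.aggregate([
        { $match: { _id: "my-custom-id" } },       // parent looked up by its custom _id
        { $lookup: {
            from: "report_items",                  // child collection (assumed name)
            localField: "_id",
            foreignField: "reportId",              // assumed back-reference field on the children
            as: "items"
        } }
    ]);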
I started building an app and chose Meteor as a platform, but I stumbled upon a problem: I need to serve large collections of data to the user, let's say 2000-5000 records. Now, I understand that having such a large reactive collection is a problem for Meteor, but the thing is I don't need it to be reactive, I just need to display it statically to the user whenever he requests it. I have just started out with Meteor and don't know its capabilities, so I wonder if something like this is possible? For example, PHP queries ~3000 records from MySQL and prints them to the user in around 3 seconds.
But using Meteor, even for smaller collections of say 500 records, I have to wait a lot longer: ~1 min.
I have a hunch that this slow loading might be caused by Meteor's default MongoDB implementation, and that using an external database should increase performance, though I have not tried it yet. Anyway, the question is: can I achieve fast loading of large data collections in Meteor, and if so, how would I do that? What are the best practices for handling large collections in Meteor?
PS. I chose Meteor because I do need its reactivity for some cases with small collections, but I also need to serve larger static collections. Can I combine both in Meteor?
A couple of pointers, which may help with your static collections:
Use 'reactive: false' in the find queries that don't need to be reactive, as that will stop Meteor watching for updates.
http://docs.meteor.com/#/full/find
Figure out what fields you need where and only return the bare minimum. You can use session variables to filter based on the context, which will make your publications a lot more effective.
http://docs.meteor.com/#/full/meteor_publish
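As a small sketch of both pointers together (the collection, session variable and field names are made up for the example):

    // Client: a non-reactive, field-limited query - Meteor will not re-run it on changes.
    var rows = Records.find(
        { type: Session.get("currentType") },          // filter via a session variable
        {
            reactive: false,                           // stop Meteor watching for updates
            fields: { name: 1, createdAt: 1 }          // only the fields actually rendered
        }
    ).fetch();

    // Server: a publication that ships only the bare minimum fields.
    Meteor.publish("recordSummaries", function (type) {
        return Records.find({ type: type }, { fields: { name: 1, createdAt: 1 } });
    });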
Surely the user doesn't need to see all 2000-5000 records at once? Are you not able to implement some sort of paging mechanism?
Best pattern for pagination for Meteor
I am trying to pick MongoDB as my preferred database. I need help on the design of my table.
App background - an analytics app where contacts push their own events and related custom data. A contact can have many events, e.g. contact did this, did that, etc.
event_type, custom_data (json), epoch_time
eg:
event 1: event_type: page_visited, custom_data: {url: pricing, referrer: google}, current_time
event 2: event_type: video_watched, custom_data: {url: video_link}, current_time
event 3: event_type: paid, custom_data: {plan: lite, price: 35}
These events are custom and are defined by the user. Scalability is a concern.
These are the common use cases:
give me a list of users who have come to pricing page in the last 7 days
give me a list of users who watched the video and paid more than 50
give me a list of users who have visited pricing, watched video but NOT paid at least 20
What's the best way to design my table?
Is it a good idea to use embedded events in this case?
In Mongo they are called collections and not tables, since the data is not rows/columns :)
(1) I'd make an Event collection and a Users collections
(2) I'd do 1 document per Event which has a userId in it.
(3) If you need realtime data you will want an index on what you want to query by (i.e. never do a scan over the whole collection).
(4) If there are things which are needed for reporting only, I'd recommend setting up a reporting node (i.e. a different mongo instance) and using replication to copy data to it. You can put additional indexes for reporting on that node; that way the additional indexes and any expensive queries will not affect production performance.
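A rough sketch of points (2) and (3) in the mongo shell; the collection and field names are illustrative, not prescriptive:

    // One document per event, carrying the userId of the contact that produced it.
    db.events.insert({
        userId:     ObjectId("507f191e810c19729de860ea"),
        eventType:  "page_visited",
        customData: { url: "pricing", referrer: "google" },
        createdAt:  new Date()
    });

    // Index what you query by, so real-time reads never scan the whole collection.
    db.events.ensureIndex({ eventType: 1, "customData.url": 1, createdAt: 1 });
    db.events.ensureIndex({ userId: 1, createdAt: 1 });

    // e.g. "users who came to the pricing page in the last 7 days"
    db.events.distinct("userId", {
        eventType: "page_visited",
        "customData.url": "pricing",
        createdAt: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) }
    });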
Notes on sharding
If your events collection is going to become large, you may need to consider sharding, perhaps by user ID. However, that is more of a longer-term solution; I'd recommend not diving into it until you need it.
One thing to note is that Mongo currently (2.6) uses database-level write locking, which means only one write can be performed at a time per database, though it allows many concurrent reads. So if you want a high-write system AND have a lot of users, you will need to look into sharding at some point. However, in my experience so far, one primary node with a secondary (and a reporting node) is administratively easier to set up. We currently handle around 10,000 operations per second with that setup.
However, we have had issues with spikes in users coming to the system. You'll want to make sure you have enough memory for your indexes, and SSDs are recommended too, as a surge in users can cause cache misses (i.e. an index not in memory), which force reads from the hard disk.
One final note - there are a lot of NoSQL DBs and they all have their pros and cons. I personally found that high write, low read, and real-time analysis of lots of data is not really Mongo's strength, so it does depend on what you are doing. It sounds like you are still learning the fundamentals; it might be worth reading up on the available types to pick the right tool for the right job.
I am building a web-based system for my organization using MongoDB. I have gone through the documentation provided by MongoDB and came to the following conclusions:
find: Cannot pull data from a sub-array.
group: Does not work in a sharded environment.
aggregate: Best for sub-arrays, but has performance issues when the data set is large.
Map-reduce: Too risky to write the map and reduce functions.
So, can someone help me out with the best approach to working with sub-array documents in a production environment with a sharded cluster?
Example:
{"testdata":{"studdet":[{"id","name":"xxxx","marks",80}.....]}}
now my "studdet" is a huge collection of more than 1000, rows for each document,
So suppose my query is:
"Find all the "name" from "studdet" where marks is greater than 80"
It's definitely going to be an aggregate query, so is it feasible to go with aggregate in this case, given that "find" cannot do this and "group" will not work in a sharded environment? If I go with aggregate, what will be the performance impact? I need to run this query most of the time.
Please have a look at:
http://docs.mongodb.org/manual/core/data-modeling/
and
http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
These documents describe the decisions in creating a good document schema in MongoDB. That is one of the hardest things to do in MongoDB, and one of the most important. It will affect your performance etc.
In your case running a database that has a student collection with an array of grades looks to be the best bet.
{_id: ..., ...., grades: [{type: "test", grade: 80}, ....]}
In general, and given your sample data set, the aggregation framework is the best choice. The aggregation framework is faster than map-reduce in most cases (certainly in execution speed; it is C++ vs JavaScript for map-reduce).
If your data's working set becomes so large that you have to shard, then aggregation, and everything else, will be slower. Not, however, slower than putting everything on a single machine that has a lot of page faults. Generally you need a working set larger than the RAM available on a modern computer for sharding to be the correct way to go, so that you can keep everything in RAM. (At that point a commercial support contract for Mongo is going to cost less than the hardware, and it includes extensive help with schema design.)
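To make the aggregation route concrete for the sample question ("find all the "name" from "studdet" where marks is greater than 80"), a hedged sketch over the original embedded shape might look like this; the collection name students is an assumption:

    // Unwind the embedded array, filter by marks, and project just the names.
    db.students.aggregate([
        { $unwind: "$testdata.studdet" },
        { $match:  { "testdata.studdet.marks": { $gt: 80 } } },
        { $project: { _id: 0, name: "$testdata.studdet.name" } }
    ]);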
If you need anything else please don’t hesitate to ask.
Best,
Charlie
The aggregation framework on MongoDB has certain limitations as per this link.
I want to remove the restrictions 2, 3.
I really do not care what the resulting set's size is. I have a lot of RAM and resources.
And I do not care if it takes more than 10% system resources.
I expect both 2, 3 to be violated in my application. Mostly 2.
But I really need the aggregation framework. Is there anything that I can do to remove these limitations?
The reason:
The application I have been working on has these characteristics:
The user has the ability to upload a large dataset
We have a menu to let him sort, aggregate, etc.
The aggregation has no restrictions currently and the user can choose to do whatever he wants. Since the data is not known to the developer, and since it is possible to group by any number of columns, the application can error out.
Choosing something other than MongoDB is a no-go; we have already sunk too much into development with MongoDB.
Is it advisable to change the source code of Mongo?
1) Saving aggregated values directly to a collection (like you can with MapReduce) will be released in future versions, so the first solution is just to wait for a while :)
2) If you hit the 2nd or 3rd limitation, maybe you should redesign your data schema and/or aggregation pipeline. If you are working with large time series, you can reduce the number of aggregated docs and do the aggregation in several steps (like MapReduce does); there is a sketch of this idea after the list below. I can't say anything more concrete, because I don't know your data / use cases (give me a comment).
3) You can choose a different framework. If you are familiar with the MapReduce concept, you can try Hadoop (it can use MongoDB as a data source). I don't have experience with the MongoDB-Hadoop integration, but I must warn you NOT to use Mongo's own MapReduce - it performs very poorly on large datasets.
4) You can do the aggregation inside your code, but you should use some "low-level" language or library. For example, pymongo (http://api.mongodb.org/python/current/) is not suitable for such things, but you can try something like Monary (https://bitbucket.org/djcbeach/monary/wiki/Home) to efficiently extract the data, and NumPy or Pandas to aggregate it the way you want.
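For point 2, here is a very rough sketch of the multi-step idea for a time series. Every name here (events, daily_rollup, ts, key, value) is invented for the example, and it assumes the 2.4-era shell where aggregate() returns a { result: [...] } document:

    // Step 1: roll up one day of raw events at a time, so no single aggregation
    // has to process (or return) the whole data set.
    var day = ISODate("2013-06-01");
    var next = ISODate("2013-06-02");
    var step1 = db.events.aggregate([
        { $match: { ts: { $gte: day, $lt: next } } },
        { $group: { _id: "$key", total: { $sum: "$value" } } }
    ]);
    step1.result.forEach(function (doc) {
        db.daily_rollup.save({ _id: { day: day, key: doc._id }, total: doc.total });
    });

    // Step 2: aggregate the much smaller per-day documents.
    db.daily_rollup.aggregate([
        { $group: { _id: "$_id.key", grandTotal: { $sum: "$total" } } }
    ]);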