High Volume MongoDB with Twitter Streaming API, Ruby on Rails, Heroku setup - mongodb

I'm looking to re-code an application to better handle spikes in tweets. I'm moving to Heroku and MongoDB (either MongoLab or MongoHQ) for the database solution.
During certain news events, tweet volume might spike to 15,000 / second. Typically with each tweet, I parse the tweet and store various pieces of data such as user data, etc. My idea is to store the raw tweets in a separate collection, and have a separate process grab raw tweets and parse them. The goal here is when there is a massive spike in tweets, my application isn't trying to parse all of these, but is essentially backlogging the raw tweets in another collection. As the volume slows, the process can take care of the backlog over time.
My question is threefold:
1. Can MongoDB handle this volume, i.e. inserts into a collection at a rate of 15,000 tweets per second?
2. Any idea which is the better setup: MongoHQ or MongoLab?
3. Any feedback on the overall setup?
Thanks!

1. The write volume it can handle depends on many factors: hardware, indexes, the size of each document, etc. Your best bet is to test it in the environment you're planning to use. If the write load exceeds the capacity of a single mongo server, you can shard the collection across multiple servers.
2. They are very similar, but there are some differences in pricing, and the sites themselves are designed quite differently. There's a thread of discussion about it here: https://webmasters.stackexchange.com/questions/20782/mongodb-hosting-mongolab-vs-mongohq-vs-mongomachine
3. Overall it seems to make sense. You will probably want to flesh out some details about how you will process the backlog: will you poll it by querying periodically, delete tweets from the backlog as they are processed, etc.?
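One way to picture the backlog processing is a worker that repeatedly pulls a small batch of raw tweets, parses them, and deletes each one from the backlog only after parsing succeeds. The following is a minimal sketch in Python, not a definitive implementation: an in-memory deque stands in for the raw-tweet collection, and `parse_tweet`, `drain_backlog`, and the batch size are hypothetical names chosen for illustration.

```python
import json
from collections import deque

def parse_tweet(raw):
    """Hypothetical parser: pull out the fields we care about from raw JSON."""
    data = json.loads(raw)
    return {"id": data["id"], "user": data["user"]["screen_name"], "text": data["text"]}

def drain_backlog(backlog, parsed, batch_size=100):
    """Process up to batch_size raw tweets and return how many were handled.

    `backlog` stands in for the raw-tweet collection and `parsed` for the
    collection of parsed documents (both are assumptions for illustration).
    """
    handled = 0
    while backlog and handled < batch_size:
        raw = backlog[0]              # peek first so a failed parse loses nothing
        parsed.append(parse_tweet(raw))
        backlog.popleft()             # delete from the backlog only after success
        handled += 1
    return handled

# Simulate a spike of 250 raw tweets landing in the backlog.
backlog = deque(
    json.dumps({"id": i, "user": {"screen_name": "u%d" % i}, "text": "hello"})
    for i in range(250)
)
parsed = []
# A real poller would call this periodically; here we loop until nothing is left.
while drain_backlog(backlog, parsed, batch_size=100):
    pass
```

With a real database the peek and delete would become a query and a remove against the raw collection, and the batch size would be tuned by testing.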

Completely agree on the need to test this. In general, mongo can handle that many writes, but in practice it depends on the size of your setup, other operations, indexes, etc.
I had to do a similar approach for collecting tons of metrics data. I used a lightweight event-machine process to accept incoming requests in parallel, and store them in a simple format, then another process would take those requests and send them up to a central server. The main goal was to make sure no data was lost if the central server was down, but it also allowed me to put in some throttling logic so that the spikes in data wouldn't overwhelm the system.
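The buffering and throttling idea described above can be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not the original EventMachine code: `ThrottledForwarder`, `accept`, `tick`, and `max_per_tick` are hypothetical names, and `send` is whatever function delivers one event to the central server.

```python
from collections import deque

class ThrottledForwarder:
    """Accept events immediately, forward them at a bounded rate."""

    def __init__(self, send, max_per_tick):
        self.send = send                  # hypothetical delivery function
        self.max_per_tick = max_per_tick  # upper bound on events sent per tick
        self.buffer = deque()

    def accept(self, event):
        # Accepting is cheap: just store the event for later.
        self.buffer.append(event)

    def tick(self):
        """Forward at most max_per_tick events; return how many were sent."""
        sent = 0
        while self.buffer and sent < self.max_per_tick:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                break                     # server down: keep the event, retry next tick
            self.buffer.popleft()         # drop the event only once it was delivered
            sent += 1
        return sent

received = []
forwarder = ThrottledForwarder(received.append, max_per_tick=3)
for i in range(10):                       # a spike of 10 events arrives at once
    forwarder.accept(i)
counts = [forwarder.tick() for _ in range(4)]
```

Calling `tick` from a timer spreads the spike over several intervals, and events stay buffered while the central server is unreachable.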
I'd be interested to see how this works out for you price-wise vs. a VPS like Linode. (I'm a huge Heroku fan, but with certain architectures it can get pricey quickly.)

Related

design question: best way to aggregate data from several microservices and show in UI

We have a scenario where we need to aggregate data from several services and show it in the UI. Currently, when an agent logs in, we need to show the cases assigned to that agent. Case information needs to be aggregated from several microservices. There would be around 1K cases assigned to an agent at a time, and all of them need to be shown to the agent so that he can sort them by certain case data.
What would be the best approach to show data in this scenario? Should we make API calls to several services for each case, then aggregate and show the results? Or are there better approaches to achieve this?
No. You certainly shouldn't call multiple APIs to aggregate data at runtime. Even if you call the APIs in parallel, the latency will be huge.
You need to pre-aggregate the case details and cache them in a distributed caching system (e.g. Redis or memcached), using a streaming platform (e.g. Kafka) to deliver the updates. Also, store the pre-aggregated case details in a persistent database. Basically, this is a kind of materialized view.
Caching will enable you to serve the case details to the user quickly, without any noticeable latency, and streaming will help you keep the cache and the DB aggregations updated in near-real time. Storing the materialized view in a database saves you from having to keep everything in memory. You can use an LRU cache, so that only the recently used data is in the cache. If you need to show any case data that is not in the cache, you read it from the database and store it in the cache for future requests.
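The LRU read path described above can be sketched as follows. This is a minimal in-process illustration in Python, assuming a hypothetical `load_from_db` function that reads one pre-aggregated case from the persistent store; in a real deployment the cache would be the distributed one (e.g. Redis) rather than a local `OrderedDict`.

```python
from collections import OrderedDict

class LRUReadThroughCache:
    """Keep only the most recently used cases in memory; load misses from the DB."""

    def __init__(self, load_from_db, capacity):
        self.load_from_db = load_from_db   # hypothetical DB lookup function
        self.capacity = capacity
        self.entries = OrderedDict()       # oldest entry first, newest last

    def get(self, case_id):
        if case_id in self.entries:
            self.entries.move_to_end(case_id)     # mark as recently used
            return self.entries[case_id]
        value = self.load_from_db(case_id)        # cache miss: read the stored view
        self.entries[case_id] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict the least recently used
        return value

db = {1: "case one", 2: "case two", 3: "case three"}
db_reads = []

def load(case_id):
    db_reads.append(case_id)
    return db[case_id]

cache = LRUReadThroughCache(load, capacity=2)
cache.get(1)
cache.get(2)
cache.get(1)   # served from the cache
cache.get(3)   # evicts case 2, the least recently used
cache.get(2)   # miss again, reloaded from the DB
```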
I recommend you read these two Martin Kleppmann articles here and here

Meteor - Why not just publish all the collection data?

This may be quite an easy question to answer, as it may just be my lack of understanding, but if you have to run the query twice, once on the server and once on the client, why not just publish all the collection data and then run a single query on the client?
Obviously I don't mean doing this for the users collection, but if you have a blog Posts collection, wouldn't this be beneficial?
Publish all the post data, then subscribe to it and run whatever query is necessary on the client to get the data you need.
Publishing everything is fine in a 'development' environment, where Meteor adds autopublish by default, but it has some drawbacks in a 'production' environment. I find these two points to be important:
Security: The idea is to supply only as much data to the client as it requires. You can never trust the client, and you don't know what the client may use the data for. For your use case of simple blog posts this may not be a serious risk, but it may be a critical risk for an e-commerce application. The last thing you want is a hacker using the data and leveraging a bug in your code to do nasty stuff.
Data overhead: For subscriptions, waitOn is generally used, so the templates are not rendered until all the data has been made available to the client. If you have a very large amount of data, it will take considerable time to render. So it is advisable to keep the data to 'only what you need' to optimize this time too.

How to ensure that parallel queries to ext. system are executed only once and then cached

Server frameworks: Scala, Play 2.2, ReactiveMongo, Heroku
I think I have quite an interesting brain teaser for you:
In my trip-planning application I want to display a weather forecast on a map (similar to this). I'm using a paid REST service to query weather data. To speed up the user experience and reduce costs, I plan to cache weather data for each location for one hour.
There are a few not-so-obvious things to consider:
- It might require querying up to 100 locations for weather to display one weather map.
- Weather must be queried in parallel, because querying it serially would take too long considering network latency.
- However, launching 100 threads for each user request is not an option either (imagine just 5 users looking at a map at one time).
- The solution is to have, let's say, 50 workers that query weather for user requests.
- Multiple users might be viewing the same portion of the map.
- There is a possible race condition where one location is queried multiple times.
- However, it should be queried only once and then cached.
- The application is running in a clustered environment, meaning there will be several Play instances.
Coming from a Java EE background I can come up with a pretty good solution using the Java EE stack.
However, I wonder how to do this using something more natural to the Scala/Play stack: Akka. There is an example (google "heroku scala akka") for a similar problem, but it doesn't solve one issue: the race condition when multiple users query the same data at once.
How would you implement this?
EDIT: I have decided that the requirement to ensure that weather data is updated only once is not necessary. The situation would happen far too infrequently to be a real problem and all proposed solutions would bring too much overhead and complexity to the system to be viable.
Thanks everyone for your time and effort. I hope the answers to this question will help someone with a similar problem in the future.
In Akka you can choose from multiple routing strategies. ConsistentHashingRoutingLogic could serve you well in this situation. Since each actor processes one message at a time, you can easily maintain a cache in each actor. This routing logic ensures that two equal messages will always hit the same actor.
Each actor can work in the following way:
1. check local cache (for example apache commons LRUMap)
- if found, return
2. check global cache (distributed memcache or any other key-value store)
- if found, store the result in the local cache and return
3. query the REST service
4. store the result in the global and local caches
You can have a look at this question, which I based my answer on.
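The four steps above can be sketched outside Akka as well. The following Python sketch is illustrative only and makes several assumptions: plain hash-modulo routing stands in for Akka's consistent hashing, plain dicts stand in for the LRUMap and the distributed cache (with no one-hour expiry), and `WeatherWorker`, `route`, and `fetch_weather` are hypothetical names.

```python
import zlib

class WeatherWorker:
    """Stands in for one actor that handles its messages one at a time."""

    def __init__(self, global_cache, fetch_weather):
        self.local_cache = {}               # stand-in for a per-actor LRUMap
        self.global_cache = global_cache    # stand-in for memcache or similar
        self.fetch_weather = fetch_weather  # hypothetical call to the REST service

    def handle(self, location):
        if location in self.local_cache:             # 1. check local cache
            return self.local_cache[location]
        if location in self.global_cache:            # 2. check global cache
            result = self.global_cache[location]
        else:
            result = self.fetch_weather(location)    # 3. query the REST service
            self.global_cache[location] = result     # 4. store in the global cache...
        self.local_cache[location] = result          # ...and in the local cache
        return result

def route(workers, location):
    """Hash the key so that equal locations always reach the same worker."""
    index = zlib.crc32(location.encode("utf-8")) % len(workers)
    return workers[index]

calls = []

def fetch(location):
    calls.append(location)
    return "sunny in " + location

shared = {}
workers = [WeatherWorker(shared, fetch) for _ in range(4)]
results = [route(workers, "Prague").handle("Prague") for _ in range(3)]
```

Because every request for the same location lands on the same worker, and that worker handles requests one by one, the REST service is queried once per location within a single process. Across several instances the global cache narrows the window for duplicate queries but does not close it.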
I decided that I'll post my JMS solution as well.
The controller that processes the request for weather does the following:
Query the DB for weather data. If there are NO locations with out-of-date data, reply immediately. Otherwise continue:
Start listening on a topic (explained later).
For each location: check whether the weather for the location is already being updated.
If it is not, send a weather update request message to a queue.
A certain number of workers (50?) listen to that queue.
A worker first marks the location's weather as being updated.
The worker retrieves the updated weather and updates the DB.
The worker sends a message to the topic with the weather data for that location.
When the controller has received (via the topic) weather updates for all out-of-date locations, it combines them with the up-to-date locations and replies.

Caching repeating query results in MongoDB

I am going to build a page that is designed to be "viewed" a lot, while far fewer users will "write" to the database. For example, only 1 in 100 users may post news on my site, and the rest will just read the news.
In the above case, 100 IDENTICAL QUERIES will be performed when they visit my homepage, while the actual database changes little. Actually 99 of those queries are a waste of computer power. Are there any methods that can cache the result of the first query and, when the same query is detected again within a short time, deliver the cached result?
I use MongoDB and Tornado. However, some posts say that MongoDB does not do caching.
Making a static, cached HTML page with something like Nginx is not preferred, because I want Tornado to render a personalized page each time.
"I use MongoDB and Tornado. However, some posts say that MongoDB does not do caching."
I don't know who said that, but MongoDB does have a way to cache queries: it relies on the OS's LRU page cache, since it does not do memory management itself.
So long as your working set fits in that cache without the OS having to page it out or swap constantly, you should be reading this query from memory most of the time. So yes, MongoDB queries can be served from cache, but technically MongoDB doesn't do the caching; the OS does.
"Actually 99 of those queries are a waste of computer power."
The caching mechanisms that solve these kinds of problems are the same across most technologies, whether MongoDB or SQL. Of course, this only matters if it is actually a problem; if you ask me, you are probably micro-optimising unless you get Facebook, Google, or YouTube type traffic.
Caching is a huge subject that ranges from caching query results in pre-aggregated MongoDB collections, Memcache, Redis, etc. to caching HTML and other web resources so that the server does as little work as possible.
Personally, as I said, it sounds as though you are thinking about the wasted computer power in the wrong way. Even if you were to cache this query in another collection or technology, you would probably use about the same amount of power and resources retrieving the result from there as you would by just running the query. However, that assumption depends on you having the right indexes, schema, setup, etc.
I recommend you read some links on good schema design and index creation:
http://docs.mongodb.org/manual/core/indexes/
https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections
"Making a static, cached HTML with something like Nginx is not preferred, because I want to render a personalized page by Tornado each time."
Yeah, I think that by worrying about query caching you are prematurely optimising, especially if you don't want to take away what would be 90% of the load on your server each time: loading the page itself.
I would focus on your schema and indexes and then worry about caching if you really need it.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he needs the categories in his pages, he checks the list: if it exists, he uses it; if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts into the database, but depending on your usage you could create a global timeout variable to keep track of when you next need to re-query. If you're doing something complicated, this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.
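The idea of a global list that is invalidated on insert, or after a timeout, can be sketched like this. It is a simplified, synchronous illustration and not the code from the linked post: `CachedList`, `max_age_seconds`, and `recent_posts` are hypothetical names, and with Motor the query itself would be asynchronous.

```python
import time

class CachedList:
    """Hold one query result in memory until it is invalidated or too old."""

    def __init__(self, query, max_age_seconds, clock=time.monotonic):
        self.query = query              # hypothetical function that hits the database
        self.max_age = max_age_seconds
        self.clock = clock              # injectable so the timeout can be tested
        self.items = None
        self.loaded_at = 0.0

    def get(self):
        now = self.clock()
        if self.items is None or now - self.loaded_at > self.max_age:
            self.items = self.query()   # (re)fill the cached list from the database
            self.loaded_at = now
        return self.items

    def invalidate(self):
        # Call this right after inserting into the database.
        self.items = None

posts = ["first"]
query_count = []

def recent_posts():
    query_count.append(1)
    return list(posts)

fake_now = [0.0]
cache = CachedList(recent_posts, max_age_seconds=60, clock=lambda: fake_now[0])
cache.get()
cache.get()                 # second read is served from memory
posts.append("second")
cache.invalidate()          # a write happened, so drop the cached list
latest = cache.get()        # re-queries and sees the new post
```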

Does it make sense to use both redis and mongodb?

We have a lot of data, decided to use mongodb and it works great.
We started using redis to track the active users in our real-time app. We also started doing some pub/sub channel stuff with redis.
Our next move might be to use mongodb for dormant data and redis for active data. For example, all of our users are stored in mongodb, but when they log in we will move a copy of that data to redis for fast access. We also store things like their game activity in redis and use the data accordingly. When the user logs out, we will save anything needed to mongo, where it will live until it's needed again and loaded into redis.
One thing we have been looking into is preserving the redis data on a crash. User activity on the system is meaningful data that we wouldn't want to lose on a crash, and if we are only logging data after the fact, should we save a backup of important data to mongo after every event? Then, on a crash, redis could be restored from mongo?
Is there a better way to go about the things we are trying to achieve?
Thanks!
OK, so there are several angles from which to attack this question. The first thing to point out is that redis does have user-configurable persistence.
"User activity on the system is meaningful data that we wouldn't want to lose on crash, and if we are only logging data after the fact, should we save a back up of important data in mongo after every event?"
To be fair, the default setup with MongoDB is to flush to disk every 60 seconds. So you still have a 60 second window of data loss.
You can use journaling and drop that window to 100ms, but that will tax the IO more heavily.
You can also configure your writers to wait on that journal to flush (WriteConcern: fsync), but that's going to slow down writes significantly.
"Is there a better way to go about the things we are trying to achieve?"
Really depends on what you're trying to achieve.
What type of data loss can you handle?
Redis has replication, are you using that? Does that solve most of your data loss worries?
You say you're using PubSub features, how many nodes does this cover? Is your data adequately replicated just as a result of this?
Either way, it's a somewhat complex problem. MongoDB may kind of solve your problems, but replication may solve those problems just as well. Depends on your comfort level.