I am going to build a page that is designed to be "viewed" a lot, but far fewer users will "write" into the database. For example, only 1 in 100 users may post their news on my site; the rest will just read the news.
In that case, 100 identical queries will be performed when those users visit my homepage, while the actual database change is small. In effect, 99 of those queries are a waste of computing power. Are there any methods that can cache the result of the first query and, when the same query is detected within a short time, deliver the cached result?
I use MongoDB and Tornado. However, some posts say that MongoDB does not do caching.
Making static, cached HTML with something like Nginx is not preferred, because I want Tornado to render a personalized page each time.
I use MongoDB and Tornado. However, some posts say that MongoDB does not do caching.
I don't know who said that, but MongoDB does have a way to cache queries; in fact, it relies on the OS's LRU page cache, since it does not do memory management itself.
As long as your working set fits into that cache without the OS constantly having to page it out or swap, you should be reading this query from memory most of the time. So yes, MongoDB can cache, but technically it doesn't; the OS does.
Actually, 99 of those queries are a waste of computing power.
Caching mechanisms to solve this kind of problem are much the same across most technologies, whether MongoDB or SQL. Of course, this only matters if it is actually a problem; if you ask me, you are probably micro-optimising unless you get Facebook-, Google-, or YouTube-level traffic.
Caching is a huge subject in its own right, ranging from caching query results in pre-aggregated MongoDB collections, Memcached, Redis, etc., to caching HTML and other web resources so the server does as little work as possible.
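To make that concrete, here is a minimal sketch of the query-caching idea with Tornado, Motor, and Redis; the collection name, cache key, template, and 60-second TTL are my own assumptions, not anything from the question:

```python
import json

import redis
import tornado.ioloop
import tornado.web
from motor.motor_tornado import MotorClient

cache = redis.Redis(host="localhost", port=6379)   # a sync client keeps the sketch short
db = MotorClient()["news"]

class HomeHandler(tornado.web.RequestHandler):
    async def get(self):
        cached = cache.get("latest_news")
        if cached is not None:
            posts = json.loads(cached)              # cache hit: skip MongoDB entirely
        else:
            cursor = db.posts.find({}, {"_id": 0}).sort("created_at", -1).limit(20)
            posts = await cursor.to_list(length=20)
            cache.set("latest_news", json.dumps(posts, default=str), ex=60)
        self.render("home.html", posts=posts, user=self.current_user)

if __name__ == "__main__":
    tornado.web.Application([(r"/", HomeHandler)]).listen(8888)
    tornado.ioloop.IOLoop.current().start()
```

The page itself is still rendered per user by Tornado; only the MongoDB query result is reused between requests.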
Personally, as I said, your scenario sounds as though you are thinking about the wasted computing power the wrong way. Even if you were to cache this query in another collection or another technology, you would probably spend about as much power and as many resources retrieving the result from that cache as if you just didn't bother. That assumption, however, comes down to you having the right indexes, schema, set-up, etc.
I recommend you read some links on good schema design and index creation (a minimal index example follows the links):
http://docs.mongodb.org/manual/core/indexes/
https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections
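As a minimal illustration of the index point (a sketch only; the database, collection, and field names are made up, assuming a homepage that shows the newest posts):

```python
from pymongo import MongoClient, DESCENDING

db = MongoClient()["news"]

# A descending index on created_at lets a "latest posts" query be served from
# the index instead of scanning the whole collection.
db.posts.create_index([("created_at", DESCENDING)])

# explain() shows whether the query actually uses the index (look for IXSCAN).
plan = db.posts.find().sort("created_at", DESCENDING).limit(20).explain()
print(plan["queryPlanner"]["winningPlan"])
```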
Making static, cached HTML with something like Nginx is not preferred, because I want Tornado to render a personalized page each time.
Yeah, I think that by worrying about query caching you are prematurely optimising, especially if you are not willing to take off what would be 90% of the load on your server each time: rendering the page itself.
I would focus on your schema and indexes and then worry about caching if you really need it.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he needs the categories in his pages, he checks the list: if it is populated, he uses it; if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts into the database, but depending on your usage you could keep a global timeout variable to track when you need to re-query next. If you're doing something complicated, this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.
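A rough sketch of that pattern in modern Motor/async style (the linked post uses the older gen.engine style, and the collection name and five-minute timeout here are just placeholders):

```python
import time

from motor.motor_tornado import MotorClient

db = MotorClient()["news"]

_categories = None        # module-level (per-process) cache
_cached_at = 0.0
CACHE_SECONDS = 300       # re-query at most every five minutes

async def get_categories():
    global _categories, _cached_at
    if _categories is None or time.time() - _cached_at > CACHE_SECONDS:
        _categories = await db.categories.find().sort("name", 1).to_list(length=None)
        _cached_at = time.time()
    return _categories

def invalidate_categories():
    """Call after inserting or updating a category so the next read re-queries."""
    global _categories
    _categories = None
```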
I want to ask about lists and pagination in APIs.
I want to build a long list on the home screen, which means this request will get a lot of traffic because it's the main screen, and I want to build it in a way that handles that traffic well.
After searching for how to implement it, my questions are:
Can I depend on PostgreSQL for pagination, or do I need to use a search engine like Solr?
If I depend on the database and users start visiting the app, this request will send a lot of queries to the database. Is that going to kill the database?
Also, I'm using Redis to cache some data, which will absorb some traffic, but the problem with the home screen is that the response is too large and I can't cache the whole response under one key in Redis.
Can anyone explain the best way to implement pagination for this request? The only thing I want is pagination; I'm not looking to implement full-text search, but I've read that a search engine will handle the traffic so it doesn't affect or kill the database.
Thanks a lot :D
You can do this smoothly with the pagination features built into PostgreSQL; it has all the functions and capabilities you need (LIMIT, OFFSET, FETCH).
But let me give you a recommendation.
There are several types of pagination.
In the first type, the total number of pages must be known in advance. This approach is outdated and not recommended, because it requires knowing the count of records in the table, and counting records is a very slow operation, especially on large tables.
In the second type, the number of pages is not known in advance; the next page of results is fetched only when it is needed, which is how Google, LinkedIn, and other big companies do it. In this case there is no need to count the rows of any table.
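As a hedged sketch of the difference using psycopg2 (the table and column names are my own placeholders): the first function is the count/OFFSET style, the second is keyset-style "load more" pagination driven by the last id the client saw.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
PAGE_SIZE = 20

def page_by_offset(page):
    """First type: jump to an arbitrary page; OFFSET gets slower on deep pages."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, title FROM posts ORDER BY id DESC LIMIT %s OFFSET %s",
            (PAGE_SIZE, (page - 1) * PAGE_SIZE),
        )
        return cur.fetchall()

def page_by_keyset(last_seen_id=None):
    """Second type: 'load more' from the last id the client saw; no counting."""
    with conn.cursor() as cur:
        if last_seen_id is None:
            cur.execute("SELECT id, title FROM posts ORDER BY id DESC LIMIT %s",
                        (PAGE_SIZE,))
        else:
            cur.execute(
                "SELECT id, title FROM posts WHERE id < %s ORDER BY id DESC LIMIT %s",
                (last_seen_id, PAGE_SIZE),
            )
        return cur.fetchall()
```

The keyset query stays fast because it is just an index range scan on the primary key, no matter how deep the user scrolls.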
This may be quite an easy question to answer, as it may just be my lack of understanding, but if you are having to run the query twice - once on the server and once on the client - why not just publish all the collection data and then run one query on the client?
Obviously I don't mean doing this for the users collection, but if you have a blog Posts collection, wouldn't this be beneficial?
Publish all the post data, then subscribe to it and run whatever query is necessary on the client to get the data you need.
Publishing everything is fine in the development environment, since Meteor adds the autopublish package by default, but it has some pitfalls in a production environment. I find these two points to be of importance:
Security: the idea is to supply only as much data to the client as required. You can never trust the client, and you don't know what the client may use the data for. For your use case of simple blog posts this may not be a serious risk, but it can be a critical one for an e-commerce application. The last thing you want is a hacker using the data and leveraging a bug in your code to do nasty stuff.
Data overhead: subscriptions are generally used with waitOn, so the templates are not rendered until all the data has been made available to the client. If you have a very large amount of data, it will take considerable time to render. So it is advisable to keep the data to "only what you need" to optimise this time as well.
This question has been asked multiple times, here and here, and the answer to get this working is fairly straightforward: add an environment variable to your bash_profile and all Meteor instances on your localhost will share that MONGO_URL.
What I've noticed, however, is that while this may be the case, there's quite a bit of latency in the "reactivity" of Meteor. I've tested this with two very lean Meteor apps with empty collections. Inserting a document into a collection from one Meteor app, while my second app queries that same collection and prints out a field from the documents, does work, but there's a noticeable lag before it updates. I've ruled out the possibility of the collection insertion being the source of the lag (a simple console.log callback on the client of the first app logs the id of the newly inserted document).
My purpose for having multiple apps (two to be precise) sharing the same MongoDB is to separate an admin panel from a mobile app without going crazy regarding name-spacing and bloat. This configuration works, but I'm not sure it's the "proper" way of accomplishing the task, and it certainly seems to be causing a performance hit.
Any insight into this matter would be appreciated. Thank you!
EDIT: To clarify, the db URL I'm using is on my localhost, and isn't something hosted online.
When you use an external database, by default Meteor will use periodic polling (every few seconds) in order to observe any changes. The delay you are experiencing is a result of this polling process. You can remove the delay and reduce your app's CPU usage by taking advantage of Meteor's oplog tailing feature. In order to use it you will need to:
Get access to a MongoDB instance with the oplog turned on.
Set the environment variable MONGO_OPLOG_URL so your app(s) can read the oplog.
Personally, I'd recommend compose.io for this. They provide exactly this as part of their basic elastic deployment. See this post for detailed instructions.
If you want to connect to the oplog of the MongoDB instance Meteor creates locally for you, you can obtain the URL via:
MongoInternals.defaultRemoteCollectionDriver().mongo._oplogHandle._oplogUrl
It should end up looking something like mongodb://127.0.0.1:3001/local
I'm looking to re-code an application to better handle spikes in tweets. I'm moving to Heroku and MongoDB (either MongoLab or MongoHQ) for the database solution.
During certain news events, tweet volume might spike to 15,000 / second. Typically with each tweet, I parse the tweet and store various pieces of data such as user data, etc. My idea is to store the raw tweets in a separate collection, and have a separate process grab raw tweets and parse them. The goal here is when there is a massive spike in tweets, my application isn't trying to parse all of these, but is essentially backlogging the raw tweets in another collection. As the volume slows, the process can take care of the backlog over time.
My question is three fold:
Can MongoDB handle this type of volume with regards to inserts into a collection at a rate of 15,000 tweets per second?
Any idea on the better setup: MongoHQ or MongoLab?
Any feedback on the overall setup?
Thanks!
The write volume it can handle depends on many factors - hardware, indexes, the size of each document, etc. Your best bet is to test it in the environment you're planning to use. If the demands of the write load exceed the capacity of a single MongoDB server, you can always spread the writes across multiple shards.
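If it helps, here is a rough, hedged way to measure raw insert throughput against your own deployment with pymongo; the document shape, batch size, and collection name are arbitrary placeholders:

```python
import time

from pymongo import MongoClient

coll = MongoClient()["stream"]["raw_tweets"]

BATCH = 1000
ROUNDS = 100
template = {"text": "x" * 140, "user": "test", "entities": {}}

start = time.time()
for _ in range(ROUNDS):
    # fresh copies each round, since insert_many adds an _id to the dicts it is given
    coll.insert_many([dict(template) for _ in range(BATCH)], ordered=False)
elapsed = time.time() - start
print("inserts/sec:", int(BATCH * ROUNDS / elapsed))
```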
They are very similar, but there are some differences in pricing and the actual site design has a bunch of differences. There's a thread of discussion about it here: https://webmasters.stackexchange.com/questions/20782/mongodb-hosting-mongolab-vs-mongohq-vs-mongomachine
Overall it seems to make sense. It sounds like you will want to flesh out some details about how you will process the backlog: will you poll it by querying periodically, delete tweets from the backlog as they are processed, and so on?
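For example, one possible shape for that worker (a hedged sketch with pymongo; the collection names and the parse step are placeholders rather than anything from your description) is to poll a batch, parse it, and then delete what was processed:

```python
import time

from pymongo import MongoClient

db = MongoClient()["stream"]
raw, parsed = db["raw_tweets"], db["parsed_tweets"]

def parse(tweet):
    # stand-in for the real parsing logic described in the question
    return {"user": tweet.get("user"), "text": tweet.get("text")}

while True:
    batch = list(raw.find().limit(500))
    if not batch:
        time.sleep(1)                     # backlog drained; wait for more tweets
        continue
    parsed.insert_many([parse(t) for t in batch])
    raw.delete_many({"_id": {"$in": [t["_id"] for t in batch]}})
```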
Completely agree on the need to test this. In general, mongo can handle that many writes, but in practice it depends on the size of your set up, other operations, indexes, etc.
I had to take a similar approach for collecting tons of metrics data. I used a lightweight event-machine process to accept incoming requests in parallel and store them in a simple format, then another process would take those requests and send them up to a central server. The main goal was to make sure no data was lost if the central server was down, but it also allowed me to put in some throttling logic so that spikes in data wouldn't overwhelm the system.
I'd be interested to see how this works out for you price-wise versus a VPS like Linode. (I'm a huge Heroku fan, but with certain architectures it can get pricey quickly.)
We have a website at swalif.com, which is like a news website based on forums. We are currently using a MySQL database and things are getting slow. We decided to go with the Sphinx search server to speed things up, and it's been going quite well.
Recently we heard of something called "memcached", and after looking into it briefly we think we should consider it before moving to a search server completely.
My question is: what are the pros and cons of using memcached? It is a fairly new topic to us.
Thanking you
I just got my site set up with memcached a couple of months back and it's amazing. The pros are rather obvious: it can be used to cache information that is expensive to gather. The best example is an expensive MySQL query; check your slow query log, as that is a good starting point for finding parts to target. I had one main page that took 2.5 seconds to echo from the server (horrible, I know). I had thought about changing the way it was written, but it would have been very complicated. I put memcached in front of the "difficult" parts of that page and now it's down to 0.001 seconds. It's just amazing.
There is one main con that I've run into: if you update your content, you have to delete all of the keys associated with that content so your front end will re-fetch and cache the new data; otherwise you get stale content. I have tens of thousands of entries in my memcached and it's difficult to delete all the appropriate ones. One solution is to set your key expiration to something short (say, 24 hours). If you do that, you know your site will reflect the newest content at worst 24 hours after a change, so if you can live with that, this problem is rather moot.
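As a hedged sketch of both points (the cache-aside read and the delete-on-update), here is roughly what it looks like with the pymemcache client; the key name, the 24-hour TTL, and the stubbed-out query are my own illustrative choices:

```python
import json

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))
DAY = 24 * 60 * 60

def run_expensive_mysql_query():
    # stand-in for the 2.5-second query described above
    return [{"id": 1, "title": "example"}]

def get_front_page_rows():
    cached = mc.get("front_page_rows")
    if cached is not None:
        return json.loads(cached)                               # cache hit
    rows = run_expensive_mysql_query()
    mc.set("front_page_rows", json.dumps(rows), expire=DAY)     # 24-hour safety net
    return rows

def on_content_update():
    # delete the associated key so the next request re-fetches fresh data
    mc.delete("front_page_rows")
```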
Bottom line, it's one of the best tools I've ever seen. It took me less than a day to install it on the lion's share of my rather big site, and the impact was tremendous.