Memcached - Pros and Cons

We have a website at swalif.com, which is like a news website based on forums. We are currently using a MySQL database and things are getting slow. We decided to go with the Sphinx search server to speed things up, and it's been going quite well.
Recently we heard of something called 'memcached', and after a quick look we think we should investigate it before moving to a search server completely.
My question is: what are the pros and cons of using 'memcached'? It is a fairly new topic to us.
Thank you.

I just got my site set up with memcached a couple of months back and it's amazing. The pros are rather obvious: it can be used to cache information that is expensive to gather. The best example is a slow MySQL query; check your slow query log for a good list of targets. I had one main page that took 2.5 seconds to generate on the server (horrible, I know). I had thought about rewriting it, but that would have been very complicated. Instead I put memcached in front of the "difficult" parts of that page and now it's down to 0.001 seconds. It's just amazing.
There is one main con that I've run into: if you update your content, you have to delete all the keys associated with that content so that your front end will re-fetch and cache the new data; otherwise you serve stale content. I have tens of thousands of entries in my memcached, and it's difficult to delete all the appropriate ones. One solution is to set your key expiration to something short (say 24 hours). If you do that, you know that your site will reflect the newest content at worst 24 hours after a change. If you can live with that, this problem is rather moot.
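For illustration, here is a minimal read-through sketch of that pattern, assuming Python with the pymemcache client; load_front_page_from_mysql() is a hypothetical stand-in for whatever expensive query you are caching:

    # Read-through cache with a 24-hour expiry (sketch; pymemcache assumed,
    # load_front_page_from_mysql() is a hypothetical expensive query).
    import json
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))
    KEY = "front_page:v1"
    TTL = 24 * 60 * 60  # at worst, content is 24 hours stale

    def get_front_page():
        cached = cache.get(KEY)
        if cached is not None:
            return json.loads(cached)            # hit: skip the slow query
        data = load_front_page_from_mysql()      # miss: run the expensive query
        cache.set(KEY, json.dumps(data), expire=TTL)
        return data

    def on_front_page_update():
        # Invalidate the key whenever the underlying content changes,
        # otherwise readers see stale data until the TTL runs out.
        cache.delete(KEY)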
Bottom line, it's one of the best tools I've ever seen. It took me less than a day to add it to the lion's share of my rather big site, and the impact was tremendous.

Related

Possible big mistake: what exactly does "db.repairDatabase()" do? (MongoDB)

I have a MongoDB database with several million users.
I wanted to free up space, so I wrote a script to remove users who had been inactive for more than 6 months.
I watched the disk usage for several minutes
and saw it fluctuate, but it never released any significant space, not even 1 MB. That seemed strange.
I've read that "remove" does not actually free disk space; it just marks the space as reusable so it can be overwritten. Is that true?
That made a lot of sense to me, so I looked for something that forces the space to actually be freed...
I ran repairDatabase() and I think I made a mistake.
Everything is now blocked!
I took a chance and restarted the server.
The MongoDB service exists, but its status stays at "Starting" (never Running).
I've read on other sites that repairDatabase() requires twice as much free space as the original size of the database, and I do not have that.
I do not know what it is doing, or whether it will take hours or days...
Is the database lost? I am thinking of stopping all the services and deleting the database.
repairDatabase is similar to fsck. That is, it attempts to clean the database of any corrupt documents which may be preventing MongoDB from starting up. How it works in detail differs depending on your storage engine, but repairDatabase could potentially remove documents from the database.
The details of what the command does is outlined quite clearly (with all the warnings) in the MongoDB documentation page: https://docs.mongodb.com/manual/reference/command/repairDatabase/
I would suggest that next time it's better to read the official documentation first rather than relying on what people say in forums. Second-hand information like this can be outdated, or just plain wrong.
Having said that, you should leave the process running until completion, and perform any troubleshooting if the database cannot be started. It may require 2x the disk space of your data, but it's also possible that the command just needs time to finish.
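If you do decide to run it again deliberately, a rough sketch of checking the data size first and then issuing the command from a driver could look like this (assuming Python/pymongo against an older MongoDB release that still supports repairDatabase; the database name "mydb" is a placeholder):

    # Sketch only: check the free-space requirement, then run the repair and wait.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["mydb"]  # placeholder database name

    stats = db.command("dbstats")
    print("dataSize bytes:", stats["dataSize"])  # you want roughly 2x this free on disk

    # Blocks until the repair finishes; do not interrupt it part-way through.
    result = db.command("repairDatabase")
    print(result)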

Caching repeating query results in MongoDB

I am going to build a page that is designed to be "viewed" a lot, but far fewer users will "write" to the database. For example, only 1 in 100 users may post their news on my site; the rest will just read the news.
In the above case, the same query will be performed 100 times when users visit my homepage, while the actual database changes little. Actually, 99 of those queries are a waste of computing power. Is there any method that can cache the result of the first query and, when it detects the same query shortly afterwards, deliver the cached result?
I use MongoDB and Tornado. However, some posts say that MongoDB does not do caching.
Serving static, cached HTML with something like Nginx is not preferred, because I want Tornado to render a personalized page each time.
I use MongoDB and Tornado. However, some posts say that MongoDB does not do caching.
I don't know who said that, but MongoDB does effectively cache queries; in fact it relies on the OS's LRU page cache, since it does not do memory management itself.
As long as your working set fits into the LRU without the OS having to page it out or swap constantly, you should be reading this query from memory most of the time. So yes, MongoDB can cache, but technically it doesn't; the OS does.
Actually, 99 of those queries are a waste of computing power.
The caching mechanisms for solving this kind of problem are the same across most technologies, whether MongoDB or SQL. Of course, this only matters if it is actually a problem; you are probably micro-optimising if you ask me, unless you get Facebook, Google, or YouTube levels of traffic.
Caching is a huge subject in its own right, ranging from caching query results in pre-aggregated MongoDB collections, Memcache, Redis, etc., to caching HTML and other web resources so the server does as little work as possible.
Personally, as I said, your scenario sounds as though you are thinking about the wasted computing power the wrong way. Even if you were to cache this query in another collection or technology, you would probably use about as much power and as many resources retrieving the result from that store as if you just didn't bother. That assumption, however, depends on you having the right indexes, schema, set-up, etc.
I recommend you read some links on good schema design and index creation:
http://docs.mongodb.org/manual/core/indexes/
https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections
Serving static, cached HTML with something like Nginx is not preferred, because I want Tornado to render a personalized page each time.
Yes, I think that by worrying about query caching you are prematurely optimising, especially if you don't want to take off what would be 90% of the load on your server each time: rendering the page itself.
I would focus on your schema and indexes and then worry about caching if you really need it.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he needs the categories in his pages, he checks the list: if it exists, he uses it; if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts into the database, but depending on your usage you could create a global timeout variable to keep track of when you need to re-query next. If you're doing something complicated this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.
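A rough sketch of that timeout-based variant, using synchronous pymongo for brevity rather than Motor's async API (the "categories" collection and the field names are made up):

    # In-process cache with a timeout and insert-time invalidation (sketch).
    import time
    from pymongo import MongoClient

    db = MongoClient()["mysite"]     # placeholder database name
    _categories = None
    _fetched_at = 0.0
    CACHE_SECONDS = 300              # re-query at most every 5 minutes

    def get_categories():
        global _categories, _fetched_at
        if _categories is None or time.time() - _fetched_at > CACHE_SECONDS:
            _categories = list(db.categories.find().sort("name", 1))
            _fetched_at = time.time()
        return _categories

    def add_category(doc):
        global _categories
        db.categories.insert_one(doc)
        _categories = None           # force a re-query on the next read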

Cloudant / CouchDB "pull" replication for 600+ documents to iPhone

I'm using Cloudant and I'm struggling to pull/replicate 600 documents from the server to my iPhone. First, it's pretty slow because it has to go one document at a time; second, Cloudant was giving me "timeouts" after the 100th or so REST request. (I have a ticket with Cloudant for this one, as it's unacceptable!)
I was wondering if anyone has found a way / hack to "bulk" replicate when pulling. I was thinking, perhaps it's possible to "zip up" all of the changes, send them in one file, and fast-forward the iPhone database to the last-change seq.
Any help is greatly appreciated -- thanks!
Can you not hit _all_docs?include_docs=true to get everything in one shot? http://wiki.apache.org/couchdb/HTTP_Document_API#all_docs
I don't know CouchCocoa, but it looks like the API supports this: http://couchbaselabs.github.com/CouchCocoa/docs/interfaceCouchDatabase.html#a49d0904f438587b988860891e8049885
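For illustration, a sketch of that single round trip using Python and requests (the account, database name, and credentials are placeholders):

    # Pull every document in one request via _all_docs (sketch).
    import requests

    url = "https://ACCOUNT.cloudant.com/mydb/_all_docs"
    resp = requests.get(url, params={"include_docs": "true"},
                        auth=("USER", "PASSWORD"), timeout=60)
    resp.raise_for_status()
    docs = [row["doc"] for row in resp.json()["rows"]]
    print(len(docs), "documents fetched in one round trip")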
Actually, why not make a view? Make a view that gives you your list and make sure your id is in it. With the id, you can then go to the document and get the rest of the information you need in order to update it.
There really is no reason you should ever need to hit every document individually; they have views and search for that. Keep in mind you are using a cloud-based service. This stuff is not sitting in your basement; you can't hit it a million times per device in a few seconds and expect nobody to notice or get upset (an exaggeration, yes, I know).
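A sketch of querying such a view over HTTP, again with Python and requests (the design document, view name, and credentials are all hypothetical):

    # Ask a view for the list of ids, then fetch only the documents you need (sketch).
    import requests

    base = "https://ACCOUNT.cloudant.com/mydb"
    auth = ("USER", "PASSWORD")

    resp = requests.get(base + "/_design/app/_view/news_list", auth=auth, timeout=60)
    resp.raise_for_status()
    ids_to_refresh = [row["id"] for row in resp.json()["rows"]]

    for doc_id in ids_to_refresh:
        doc = requests.get(f"{base}/{doc_id}", auth=auth, timeout=60).json()
        # ...merge doc into the local store as needed...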
What I do not understand is why you are trying to replicate to an iPhone at all. Are you running Apache and CouchDB inside your app? Why not just read the JSON data and store it in a local database, or write it to a file if it updates that often and keep overwriting it? There are plenty of options that are a whole lot less messy.

How to limit the effect of client modifications to production systems

Our shop has developed a few WEB/SMS/DB solutions for a dozen client installations. The applications have some real-time performance requirements and are just good enough to function properly. The problem is that the clients (owners of the production servers) are using the same server/database for customizations that are hurting the performance of the applications we created and deployed.
A few examples of clients' customizations:
Adding large tables whose columns are all text datatypes that get cast to other data types in the queries
No primary keys, indexes, or FK constraints
Use of external scripts that run count(*) from table where id = x in a loop to decide how to construct further queries later in the same script (no bulk actions that the planner can optimize, and nothing done in a single pass)
All new code files on the server are created/owned by root, with 0777 permissions
The clients don't take suggestions or criticism well. If we just go ahead and try to port/change the scripts ourselves, the old code can come back, clobbering any changes that we made! Or, with our limited knowledge of their use cases, we break functionality while trying to optimize their changes.
My question is this: how can we limit the resources available to queries/applications other than the ones we create and deploy? Are there any pragmatic options in scenarios like this? We prided ourselves on having an OSS solution, but it seems it's become a liability.
We use PG 8.3 running on a range of Linux distros. The clients prefer PHP, but shell scripts, Perl, Python, and PL/pgSQL are all used on the system in one form or another.
This problem started about two minutes after the first client was given full access to the first computer, and it hasn't gone away since. Anytime someone whose priority is getting business-oriented work done quickly gets involved, they will be sloppy about it and screw things up for everyone. That's just how it works, because proper design and implementation are harder than cheap hacks. You're not going to solve this problem; all you can do is figure out how to make it easier for the client to work with you than against you. If you do it right, it will look like excellent service rather than nagging.
First off, the database side. There's no way to control query resources in PostgreSQL. The main difficulty is that tools like "nice" control CPU usage, but if the database doesn't fit in RAM it may very well be I/O usage that is killing you. See this developer message summarizing the issues here.
Now, if in fact it's CPU the clients are burning through, you can use two techniques to improve that situation:
Install a C function that changes the process priority (example 1, example 2) and make sure whenever they run something it gets called first (maybe put it into their psql config file, there are other ways).
Write a script that looks for postmaster backend processes spawned by their userid and renices them; run it regularly from cron or as a daemon (a sketch follows).
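A rough sketch of that renice script, assuming Linux, Python with psutil, and a made-up database role name; backend process titles normally include the role, so matching on it is a reasonable heuristic:

    # Renice PostgreSQL backends belonging to one database role (sketch; run from cron as root).
    import psutil

    TARGET_ROLE = "clientuser"   # made-up role name
    LOWER_PRIORITY = 15          # higher nice value = lower CPU priority

    for proc in psutil.process_iter(["name", "cmdline", "nice"]):
        cmd = " ".join(proc.info["cmdline"] or [])
        # Backend titles usually look like: "postgres: clientuser mydb [local] SELECT"
        if proc.info["name"] in ("postgres", "postmaster") and TARGET_ROLE in cmd:
            if (proc.info["nice"] or 0) < LOWER_PRIORITY:
                proc.nice(LOWER_PRIORITY)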
It sounds like your problem isn't the particular query processes they're running, but rather other modifications they're making to the larger structure. There's only one way to cope with that: you have to treat the client like they're an intruder and use the approaches of that portion of the computer security field to detect when they screw things up. Seriously! Install an intrusion detection system like Tripwire on the server (there are better tools, that's just the classic example), and have it alert you when they touch anything. New file that's 0777? Should jump right out of a proper IDS report.
On the database side, you can't directly and usefully detect the database being modified. You should do a pg_dump of the schema every day into a file (pg_dumpall -g and pg_dump -s), then diff that against the last one you delivered and again alert yourself when it has changed. If you manage this well, contact with the client turns into "we noticed you changed something on the server...what is it you're trying to accomplish with that?", which makes you look like you're really paying attention to them. That can turn into a sales opportunity, and they may stop fiddling with things as much just knowing you're going to catch it immediately.
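A sketch of that daily schema diff, assuming Python, pg_dump on the PATH, and credentials handled via .pgpass; the database name and baseline path are placeholders:

    # Dump the schema, diff it against the last delivered copy, report changes (sketch).
    import difflib
    import pathlib
    import subprocess

    current = subprocess.run(["pg_dump", "-s", "mydb"],
                             capture_output=True, text=True, check=True).stdout
    baseline_path = pathlib.Path("/var/backups/schema_baseline.sql")
    baseline = baseline_path.read_text() if baseline_path.exists() else ""

    diff = list(difflib.unified_diff(baseline.splitlines(), current.splitlines(),
                                     "delivered", "current", lineterm=""))
    if diff:
        print("\n".join(diff))   # in practice, mail this report to yourself
    else:
        print("schema unchanged")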
The other thing you should start doing immediately is install as much version control software as you can on each client box. You should be able to login to each system, run the appropriate status/diff tool for the install, and see what's changed. Get that mailed to you regularly too. Again, this works best if combined with something that dumps the schema as a component to what it manages. Not enough people use serious version control approaches on the code that lives in the database.
That's the main set of technical approaches useful here. The rest of what you've got is a classic consulting client management problem that's far more of a people problem than a computer one. Cheer up, it could be worse--FSM help you if you give them ODBC access and they discover they can write their own queries in Access or something simple like that.

What time should I deploy a build to production?

My users use the site pretty equally 24/7. Is there a rule of thumb for build timing?
International audience; a single cluster of servers on Eastern time, but it gets hit well into the morning by international clients.
There's one DB and several web servers, so if the DB isn't involved it's simple: deploy whenever.
But when the site has to come down, when would you, as a programmer, be least annoyed to see a site like SO go down for, say, 15 minutes?
If there's truly no good time from the users' perspective, then I'd suggest doing it when your team has the most time to recover from any build-related disaster.
Here's what I have done and it's worked well for me:
Get a site traffic analysis tool that will graph hourly user load
Select the low point in the graph for doing updates
If you're small, then yeah, find when your lowest usage period is and do it then (for us, usually around 1AM-3AM PST is the lowest dip...but it never drops to 0, of course). Once you grow a larger userbase, if you want people to take you seriously you'll need to design your application so that you can upgrade without downtime. This is not simple, and it often involves having multiple servers.
I've spent ages trying to get our application to this point. The best I've come up with so far is to run both the old version and the new version at the same time for a couple of hours. Users logged in at the time of the switchover stay on the old version until they log out; the next time they come in they get the new version. Any users arriving after the switchover go straight to the new version. It's still not foolproof, but it's pretty good.
What kind of an application is it? Most sites that I use tend to update around 2AM or 3AM.
Use a second site, and hotswap as needed.
The issue with hot-swapping is that the database would still be shared, and breaking changes would bring the stand-in down as well.
I guess you have to ask your clients.
In any case, there are the wee hours of the morning. If you're talking about a locally available website, I do not think users will mind if they get an "under maintenance" notice at 2 AM in their time zone.
Depends on your location: 4 AM East Coast / 1 AM West Coast is typically the lightest time.
Pick a few times that you'd like to do it and offer them as choices to the decider-types. Whatever you do, put up a "down for routine maintenance" page while you deploy.
Check the time of least usage
Clone/copy/update the latest production code into another directory
If there are any database migrations to be done, perform those that are required and do not conflict with the old code base
At the time of least usage, move the symlink to point to the latest code (see the sketch below)
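A minimal sketch of that symlink swap in Python (paths are placeholders); creating a temporary link and renaming it over the old one keeps the switch atomic:

    # Atomically repoint the "current" symlink at the new release directory (sketch).
    import os

    release_dir = "/var/www/releases/new-build"   # freshly cloned/updated code
    current_link = "/var/www/current"             # what the web server serves
    tmp_link = current_link + ".tmp"

    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current_link)            # renaming over the old link is atomic on POSIX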
First use an analysis tool to try to determine your typical "light" traffic times. Depending on the site and your location in the world relative to most of your users, it could be 4 AM, it could be 1 PM, who knows. Then, once you have a good timeframe nailed down, make your deployment process as automated as possible so that it happens quickly and minimizes the downtime of your site.