I'm building a site that I hope will achieve the same sort of traffic as SO.
I know premature optimisation is the root of all evil, but I keep telling myself that I have to have a caching strategy baked into the design.
I'm using the LAMP stack and to begin with I'll be doing everything on one dedicated server.
Do you think it's not worth the effort to incorporate Memcached into the project from day one?
Thanks
Definitely worth the effort!
Once your website starts scaling really fast, it will be very hard to start dealing with all the big scalability problems only at that point.
...Besides, Memcached is very easy to implement!
:)
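As a rough illustration of what "baking in" a cache looks like, here is a minimal cache-aside sketch. It is written in Python rather than PHP, and `TinyCache` is a hypothetical in-process stand-in for a memcached client, not a real library API. `cached_fetch` and `loader` are likewise names made up for this example.

```python
import time


class TinyCache:
    """Hypothetical in-process stand-in for a memcached client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def cached_fetch(cache, key, loader, ttl_seconds=60):
    """Cache-aside read: try the cache first, fall back to the slow loader."""
    value = cache.get(key)
    if value is None:
        # Miss: compute the value (e.g. run the database query) and store it.
        # Note that a loader returning None is never cached in this sketch.
        value = loader()
        cache.set(key, value, ttl_seconds)
    return value
```

If every expensive read goes through one helper like `cached_fetch` from the start, swapping the stand-in for a real cache later touches a single place in the code.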
I am about to start a project that I hope will some day require the capacity for scaling. The key question for me is: should I invest the effort to design for this eventuality now, or should I cross that bridge when I get there?
I know how these things go: everyone thinks their project might scale, but most projects never do.
Is there a set of best practices that will allow you to scale more easily at a later stage, while not investing too much in something that may never be needed?
Obviously I have no experience with scalable web sites. Currently I am leaning towards Scala / Play! / Akka. From what I can glean from the Akka website, it is very suitable for this kind of project (in that it provides a toolset that allows development on a single machine and scaling out to an arbitrary number of machines).
The project is a consumer-facing web site that involves some user interaction (comments, messages, voting etc.). The main focus is editorial, though. It's no Facebook :)
Not being experienced in any of these technologies (my experience mainly coming from PHP, AS3, Objective-C), I probably have a little too much on my plate right now. But then I am not going to go at it right away. I am aware that I need to get some experience using Scala and Play! first.
Any advice is greatly appreciated.
Start out by simply designing your model, build unit tests for it, and then set your presentation layer on top of it. As long as your model is sensible, it will be easy to scale Play out to any number of machines. If you go for the built-in JPA support, you can always deal with the question of which DB to go for later.
You have larger things on your plate for the time being. So just make sure your design is consistent and sensible, then scaling will not be a problem.
You've got a head start by using a stateless web framework like Play! that won't get in the way when you need to scale. As ExxKA says, keep your model clean and sensible. This will help you keep the advantage of complexity.
Don't consider it a failure when you need to refactor your code - or even rewrite important parts of it. This is a natural part of a growing project, like a snake shedding its skin.
There are inevitably things that you'll learn in the process of writing the project, so don't try to anticipate them all right now.
I watched Harry Heymann, from Foursquare, give a presentation on Lift to the BASE user group. In that video he mentioned something about how Lift, being stateful, isn't going to scale well.
Is that true? If so, why is that? Note: I know very little about stateful frameworks.
I can't seem to find the link on Google; I'll look for it later. Thank you in advance.
When this question comes up on the Lift mailing list, the author of the framework usually replies with a blog post he wrote some time ago, which explains why Lift is stateful, but also how you can use Lift as a stateless framework.
This is the link
David Pollak has a good answer to this in this Quora thread, in a comment on Jackson Davis's answer:
In practice, scaling a Lift site is much much easier than scaling a LAMP site. Why? Well, state exists someplace. If it exists in the JVM, you get a lot of performance benefits and stability as well as, in Lift's case, lots of security. Contrast that with sessions in memcached. "Whoops, memcached went down, there go a pile of sessions." "Whoops, we've got a new memcached hashing algorithm, there go all the sessions." "Whoops, Google just crawled us creating 200,000 new sessions pushing all but the active sessions out of cache." "Whoops, the Ruby runtime just went wild, ate all the VM on one of our boxes, memcached went down..." So, you try storing sessions in some wacky shared version of MySQL. This solution requires tons of hardware and a team to make sure that the sharing code is correct, etc. Contrast that to using Nginx, Jetty and session affinity. It's about 4 hours of setup time and it just works. See http://blog.harryh.org/post/7550... So, talk to a Facebook engineer about the challenges they go through to manage state between the front end, memcached, MySQL, etc. Compare that to Twitter with the famous fail whale. Compare that to Apple's store and the iTunes store which are written on WebObjects (which is highly stateful.) Lift apps running at scale typically require 7% of the front end resources of a LAMP app. The Lift apps that are running at scale (Foursquare and Novell Pulse are two) do not have the kind of scaling issues associated with LAMP sites that have similar traffic patterns. Scaling with Lift is neither tricky, nor risky. It's simple. It's known. It's proven. Scaling with LAMP is playing whack-a-mole with state and that only becomes a problem at scale. -David Pollak • Jul 20, 2010
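To make the "new hashing algorithm, there go all the sessions" point concrete, here is a small sketch. It assumes a naive client that picks a cache server with `hash(key) % server_count`; that scheme is an assumption for illustration only, and real memcached clients may use consistent hashing precisely to limit this effect. The function names are made up for the example.

```python
import hashlib


def server_for(key, server_count):
    """Pick a server index with naive modulo hashing (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % server_count


def fraction_remapped(keys, old_count, new_count):
    """Share of keys that land on a different server after resizing the pool."""
    moved = sum(
        1 for k in keys
        if server_for(k, old_count) != server_for(k, new_count)
    )
    return moved / len(keys)


session_keys = ["session:%d" % i for i in range(10000)]
# Going from 4 to 5 servers moves most keys, so most cached sessions
# would be looked up on a server that never stored them.
print(fraction_remapped(session_keys, 4, 5))
```

Under this scheme, growing from 4 to 5 servers remaps roughly four keys in five in expectation, which is why a change in the hashing setup can look like losing every session at once.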
I think it's pretty clear that Lift apps scale very well. Foursquare and the UK Guardian are both using Lift. Both sites are very highly trafficked and neither has had a material Lift-related outage. Please also see the link that Diego posted. It provides an in-depth discussion of scaling Lift-powered sites.
Good day,
Our school, a small high school in semi-rural New Zealand, is currently looking into online homework solutions. Being one of the IT guys, I have been asked to look into some of the options. We have checked around and there are no robust solutions that cover what we are looking for. So, we are considering development of our own system, either on our own or in collaboration with some other schools.
Before I put significant time into any one option, I thought I should ask for some expert advice.
Please keep in mind that one of our major obstacles is that around 20% of our students are on dial-up because broadband is not available in their area.
We are also not limited to the technologies listed, they just are the ones that we have been looking into up to this point.
With that in mind, here goes.
1. Is there a way to pre-determine the bandwidth needed for these technologies?
2. If bandwidth continued to be too limiting, could the final solution stand alone so we could distribute it to students on CD or USB stick?
3. What are some pros/cons of each for use with databases, specifically mysql or postgresql? (After all we do need to keep track of lots of data)
4. What are some pros/cons of each of these for RIA development?
I appreciate everyone for sharing their time and expertise on the matter.
Cheers,
Ben
1) If you write a full-AJAX application, such as in GWT, the bandwidth will be:
a) the size of the application's JavaScript, images, etc. You can assume that everything is loaded when the user logs in (the cache for images may seem big, but it is easily overloaded)
b) the size of the communication. In GWT this depends only on you! There is no magic full-frame reloading; only what YOU want to send is sent
2) I don't quite follow your point: stand-alone applications can be distributed that way, but applications that use databases generally can't
3) PostgreSQL has high compatibility with Oracle: the same transaction and SELECT FOR UPDATE behaviour, and PL/pgSQL is heavily inspired by PL/SQL (so stored procedures are easy to rewrite).
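For point 1, a back-of-the-envelope estimate can be sketched as below. The payload sizes (500 KB for the initial application download, 2 KB per request) and the nominal 56 kbit/s dial-up rate are assumptions picked for illustration, not measurements of any real GWT application, and the calculation ignores latency, compression and protocol overhead.

```python
def transfer_seconds(payload_bytes, link_kbps):
    """Idealised time to move payload_bytes over a link of link_kbps kilobits/s."""
    bits = payload_bytes * 8
    return bits / (link_kbps * 1000)


# Assumed sizes: one-off application download and a typical request/response.
initial_load = transfer_seconds(500 * 1024, 56)
per_request = transfer_seconds(2 * 1024, 56)
print(round(initial_load, 1), round(per_request, 2))
```

With these assumed numbers the one-off download takes a little over a minute on dial-up, while each later exchange takes a fraction of a second, so measuring the real size of both parts is the useful first step.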
I personally suggest MySQL for a school project for its simplicity. PostgreSQL is powerful but a bit complicated to configure, and the visual tool for optimizing queries is not good.
Bandwidth aside, I definitely suggest ZK since, again, it is much easier to learn, develop with and maintain (and also much more powerful). The bandwidth consumption and latency of GWT really depend on how much effort you want to invest and how familiar your people are with distributed computing, while ZK's network traffic is basically the state of the UI (not data), which is reasonably small. In short, you can get the best bandwidth and latency with GWT if you optimize it fully, while ZK gives you less to worry about but, if you want to improve it, you have to use jQuery (i.e., write JavaScript).
Thanks lechlukasz, I appreciate your comments and insight.
I will clarify my point about stand alone applications. We have a number of students, as high as 20%, who do not have access to broadband due to their geographic location. We are considering, as part of the design, how we may be able to distribute a stand alone version.
For instance, if we were to abstract all the database calls using a separate class in GWT, we could recompile a stand alone version that didn't make the database calls. The database would likely only be for tracking results and reporting.
In reality, we would likely implement the front end product first with references to empty methods for storing the results in a database and implement those methods at a later time.
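The idea of hiding the database behind one class, with empty methods at first, can be sketched as follows. This is Python rather than GWT's Java, purely to show the shape of the pattern, and every name here (`ResultStore`, `NullResultStore`, `InMemoryResultStore`, `submit_answer`) is hypothetical.

```python
class ResultStore:
    """Interface the front end talks to; it never touches a database directly."""

    def save_result(self, student_id, exercise_id, score):
        raise NotImplementedError


class NullResultStore(ResultStore):
    """Stand-alone build (CD / USB stick): results are simply not recorded."""

    def save_result(self, student_id, exercise_id, score):
        pass


class InMemoryResultStore(ResultStore):
    """Placeholder for a later database-backed implementation."""

    def __init__(self):
        self.rows = []

    def save_result(self, student_id, exercise_id, score):
        self.rows.append((student_id, exercise_id, score))


def submit_answer(store, student_id, exercise_id, answer, expected):
    """Mark an answer and hand the score to whichever store was configured."""
    score = 1 if answer == expected else 0
    store.save_result(student_id, exercise_id, score)
    return score
```

The front-end code calls `submit_answer` the same way in both builds; only the store passed in differs.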
For the record, we have started to code up some test cases using GWT/SmartGWT and are pleased with the results. Although we cannot comment on the other technologies considered because we didn't try them to the same extent, we are pleased with the results to this point of the project.
Cheers,
Ben
There are many ways to deploy Pylons apps.
- Proxying through apache or nginx to paste
- Embedding the app with mod_wsgi
- using some edgy nginx+uwsgi combo
- and probably more...
I've read a lot about the various approaches but failed to really decide which one to choose.
Proxying to paste through nginx seems to be the easiest method to set up, but is it efficient? Wouldn't paste be slower than mod_wsgi or uwsgi? If so, is the performance increase worth the hassle?
Need some experts to help me choose the best compromise...
I want simplicity, but I need decent (if not cutting edge) performance, and you, Obiwan Kenobi, are my only hope ;)
If performance is most important, look at some tests:
http://wiki.pylonshq.com/display/pylonscookbook/Some+performance+test+results
What I meant to say is that if the application depends more on the framework than on static content, the limiting factor is the webserver -> framework path. I've found negligible differences in performance between nginx -> uwsgi -> pylons and apache2/mpm-worker -> mod_wsgi -> pylons, as the limiting factor is Pylons. This isn't to say that Pylons is slow.
No matter which deployment method I used with repoze.who/what, I found it difficult to scale past 280 requests per second per CPU core.
@mkucharz, as for those performance results, they are three years old and don't come close to the configurations that exist today. Pylons 1.0 is about 10% faster than 0.9, flup is much more mature, and the test doesn't cover uwsgi or mod_wsgi. It also uses Myghty rather than Mako, which also points to the test's age.
The other hidden variables include the version of Python. In some distributions, I've found Python 2.5 to be a little faster than Python 2.6 depending on what the application does.
Disclaimers:
Pylons is not slow.
mod_wsgi and uwsgi performance differences are negligible in production settings.
Nginx's static file performance is better than Apache's.
Apache/mpm-worker is much faster than mpm-prefork if mod_php isn't needed.
Almost any deployment that you understand is probably enough for 99% of the webapps out there.
99% of the published benchmarks don't properly test an environment. Hitting a page 10000 times is not indicative of real world performance.
Trying to be helpful when posting late at night never works. I knew when I saw this come up on tweetdeck I should have just said nothing.
The best answer is, it depends.
From a pure simplicity standpoint, apache2/mod_wsgi is probably the easiest to manage since you have a much larger pool of people that understand apache.
From a performance standpoint, it depends.
If your application is very framework heavy and not very static content (css, images) intensive, the gateway between the webserver and pylons is more likely your bottleneck and almost any deployment can handle that.
Paste is fairly quick. I found nginx/uwsgi's interface to be slightly quicker than apache2/mod_wsgi. nginx's static file performance and memory requirements favor nginx as well.
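All of the options discussed here (paste, mod_wsgi, uwsgi) host the same kind of object: a WSGI callable, which is what a Pylons app exposes. The sketch below shows a bare WSGI callable and a helper that invokes it once without any server. `application` and `call_once` are illustrative names, not part of Pylons.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Minimal WSGI callable: the interface every deployment option serves."""
    body = b"Hello from a WSGI app\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_once(app):
    """Invoke a WSGI app with a synthetic request, with no web server involved."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a plausible GET / request
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Because the callable is the same in every setup, switching from one deployment method to another later is mostly a server configuration change rather than an application change.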
There are a few sites I've come across that talk about both:
tonylandis.com/python/deployment-howt-pylons-nginx-and-uwsgi/
cd34.com/blog/programming/python/pylons-and-facebook-application-layout/
code.google.com/p/modwsgi/wiki/IntegrationWithPylons
The comparisons I've done are with apache2/mpm-worker rather than mpm-prefork as I didn't need mod_php5 in my setup.
I am looking into caching solutions, for a multi webserver configuration. Thought of memcached as being cheap (free) and proven over the years. Microsoft is also developing a caching solution for webfarms, called Velocity, but this is still in CTP2.
There is a distributed caching model used in the configuration service that is part of the .NET Stocktrader sample application. This is a framework that allows you to run multiple nodes with centralised configuration management, load balancing and distributed caching. You can implement the configuration service as is or look through the code and grab what suits you. Worth a look.
When I listened to Scott Hanselman's podcast interview with the StackOverflow team, I was left with the impression that a. they did use some kind of caching and b. they knew almost nothing about what they were doing in this respect and had fiddled with a few options and then written a blog post or two.
They currently seem to use client-side caching rather half-heartedly (short expiry times on images, for example), and I think they use a lot of ASP.NET user-mode caching, and I can't tell if they use IIS kernel-mode caching. (They didn't seem to be able to tell Scott that, either.)
However, the podcast was a while back, and I was driving at the time, so my memory might be wrong and/or out of date.
You should think HARD before bringing in something like memcached.
Caching can hide performance issues from you ("got a slow running query? just cache it and don't worry about fixing it!")
Invalidating stale data is a nightmare.
You may spend days chasing bugs that get cleared up when you clear the cache, and it pollutes your code base.
I'm not saying don't do it, but think HARD before you do.
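The stale-data problem is easy to reproduce in a few lines. In this sketch two plain dictionaries stand in for the database and the cache, and all names and values are made up for illustration.

```python
database = {"user:1:name": "Ada"}
cache = {}


def read_name(user_id):
    """Read through the cache, filling it from the database on a miss."""
    key = "user:%d:name" % user_id
    if key not in cache:
        cache[key] = database[key]
    return cache[key]


def rename_without_invalidation(user_id, new_name):
    """Buggy write path: updates the database but forgets the cache."""
    database["user:%d:name" % user_id] = new_name


def rename_with_invalidation(user_id, new_name):
    """Correct write path: update the database, then drop the cached copy."""
    key = "user:%d:name" % user_id
    database[key] = new_name
    cache.pop(key, None)
```

After a `read_name(1)`, calling `rename_without_invalidation(1, "Grace")` leaves `read_name(1)` returning the old name until the cache is cleared. Every write path in the code base has to remember the invalidation step, which is where the bugs come from.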
If you can get enough performance by adding a couple* of extra machines (which I think Stack Overflow can), then do that and don't worry about caching. It'll be much cheaper in the long run.
*note I don't say 100 machines.