When reading about frameworks (.NET, Ruby on Rails, Django, Spring, etc.), I keep seeing claims that such-and-such does or doesn't scale well.
What does it mean when someone says that a framework "scales well" and what does it mean to say a framework "doesn't scale well"?
Thank you.
When you plot some resource use (memory, time, disk space, network bandwidth) against concurrent users, you get a function that describes how the application works at different scale factors.
Small-scale -- a few users -- uses a few resources.
Large-scale -- a large number of users -- uses a large number of resources.
The critical question is: how close to linear is the scaling? If it scales linearly, then serving 2,000 concurrent users costs 2 times as much as serving 1,000 users and 4 times as much as serving 500 users. That is a tool/framework/language/platform/OS that scales well: it's predictable, and the prediction is linear.
If it does not scale linearly, then serving 4,000 users costs 1,000 times as much as serving 2,000 users, which in turn cost 100 times as much as serving 500 users. That did not scale well. Something went wrong as usage went up; it does not appear predictable, and it is not linear.
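A minimal sketch of the difference, with completely made-up cost functions (only the shape of the curves matters, not the numbers):

    # Hypothetical cost curves: a linear one and a quadratic one. Doubling the users
    # doubles the linear cost (predictable), but quadruples the quadratic cost.

    def cost_linear(users, cost_per_user=0.01):
        """Resource cost grows in direct proportion to concurrent users."""
        return users * cost_per_user

    def cost_quadratic(users, base_cost=0.00001):
        """Resource cost grows with the square of concurrent users -- scales badly."""
        return base_cost * users ** 2

    for users in (500, 1000, 2000, 4000):
        print(f"{users:>5} users:  linear = {cost_linear(users):8.2f}   "
              f"non-linear = {cost_quadratic(users):10.2f}")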
It means that a particular framework does (or does not) meet the increased demand that more users put on it. If you have an application written in VBScript, it might not do a good job of handling the 40,000,000 users of Facebook, for example.
This blog post explains some of the scalability pains Twitter experienced a year or so ago. It could provide some more insight into the answer to your question.
Sometimes lack of scalability is used to denigrate a language or framework, so watch out for that. Stick to studies that show real metrics. This applies to my VBScript example in the previous paragraph as well.
If a framework or an application scales well, it means that it can handle larger loads. As your site becomes more popular, with more visitors and more hits per day, a framework that scales well handles the larger load the same way it handles a smaller one: it should behave the same when it receives 200,000 hits an hour as it does when it gets 1 hit an hour. And it isn't only about hits -- it also covers being deployed across multiple servers, possibly behind load balancing, possibly with several different database servers. A framework that scales well can handle these increasing demands.
For instance, Twitter exploded almost overnight last year. It was developed using Ruby on Rails, and it featured heavily in the ongoing debate about whether Rails scales well or not.
substitute the phrase "handle expansion" for "scale"
There are a few elements to it in my mind. The first is the obvious one -- performance scaling. Can your framework be used to build high-capacity, high-throughput systems, or can it only be used to build smaller applications? Will it scale vertically on hardware (parallel libraries, for example) and will it scale horizontally (web farms, for example)?
The second is whether it can scale to larger teams or the enterprise. That is, does it work well with large code bases? Large development teams? Does it have good tool support? How easy is it to deploy? Can you roll out to tens, hundreds, or even thousands of users? Right down to: is it easy to hire people who have this skill? Think of trying to put together a development team of 20 or 50 people who all work with this framework. Would it be easy or next to impossible?
IMHO, saying that a framework "scales well" usually means that someone in the hearsay chain was able to use it to handle lots of volume.
In parallel programming scalability is usually used to describe how an algorithm performs as it is parallelized. An algorithm that has a 1:1 speedup is a rare beast but will double in performance on twice the hardware/cpu, treble on three times the hardware/cpu, etc...
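One standard way to model this (my addition; the answer above doesn't name it) is Amdahl's law: if only a fraction p of the work can be parallelized, the best speedup on n processors is 1 / ((1 - p) + p / n). A quick sketch with illustrative numbers:

    # Amdahl's law sketch: speedup = 1 / ((1 - p) + p / n), where p is the fraction
    # of the work that can run in parallel and n is the number of processors.
    # p = 1.0 gives the rare 1:1 case described above; anything less flattens out
    # quickly. The values of p and n below are illustrative only.

    def amdahl_speedup(p, n):
        """Theoretical speedup for parallel fraction p on n processors."""
        return 1.0 / ((1.0 - p) + p / n)

    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} cpus:  p=0.95 -> {amdahl_speedup(0.95, n):5.2f}x   "
              f"p=1.00 -> {amdahl_speedup(1.00, n):5.2f}x")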
In my experience practically any framework can be made to scale given sufficient expertise.
The easier the framework is to use the greater the chance that a developer with insufficient expertise will run into scalability problems.
It means that some respected company is doing something serious with it and isn't having any problems with it.
Scaling means how easy it is to satisfy more demand by using more hardware.
Example: You have a website written in some language that gets 1,000 visits a day. You get featured in some prominent magazine and your number of users grows. Suddenly you have 1,000,000 visits a day -- that's 1,000 times as much. If you can satisfy the grown need for resources by simply adding 1,000 more servers, your website scales well. If, on the other hand, you add 2,000 servers but users still can't connect because your database can only handle 1,000 requests per day, then your website does not scale well.
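A back-of-the-envelope sketch of that example, with a made-up per-server capacity figure:

    # Rough capacity arithmetic: a site that scales well lets you meet roughly
    # 1000x the demand with roughly 1000x the hardware. The per-server capacity
    # here is hypothetical.
    import math

    visits_before = 1000          # visits per day before the magazine feature
    visits_after = 1_000_000      # visits per day afterwards
    capacity_per_server = 1000    # hypothetical visits one server can handle per day

    growth = visits_after // visits_before
    servers_needed = math.ceil(visits_after / capacity_per_server)
    print(f"Traffic grew {growth}x; at {capacity_per_server} visits/day per server "
          f"you need about {servers_needed} servers.")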
Related
We're launching an iPhone app soon, and if everything goes well, we might reach up to tens of millions of users each day.
What server solution would you use for this? I guess a small VPS isn't enough. Is a dedicated server a better choice? Is there any good hosting provider that can provide such servers?
I'm a newbie when it comes to servers, and I would like some basic info about how to handle this.
Thanks in advance
Unfortunately, you are not really going to know the app's requirements until the app is launched. It all depends on how much the app needs to communicate with the server and how often users are using it. Depending on those variables and more, a VPS might be enough, or you may need a dedicated box, or several. It also depends a lot on the performance of the VPS or dedicated boxes, and furthermore on how much access to the system you need.
Ultimately, it seems you may not even know how well the app is going to do, so I suggest you take the cheap/efficient route of using cloud computing. That way you limit your expenses initially, while your app has a small user base, and your capacity can ramp up as quickly as your app requires (of course, so will the price). That is the benefit of cloud computing: you will not be losing money in the beginning before you have the user base to push a server to its limit, and you won't have downtime when/if a single server is no longer enough.
Check out Google's Cloud Computing to get a hint of what is possible. I personally like Google's cloud experience, but you have many more options with varying degrees of freedom that you will have to check out. Amazon of course is another possibility.
I want to build a web app similar to Reddit.com, where you have multiple levels of comments and lots of reads and writes. I was wondering if NoSQL, and MongoDB in particular, is the right tool for this?
Comments are a natural fit for a NoSQL database, no doubt: you avoid multiple joins of the comments table to itself, and that means your system can scale out.
With MongoDB you can store the whole hierarchy within one document. Some people will say that this causes problems with atomic updates, but I don't think it's an issue, because you can load and save back the entire comment tree. In any case, you can easily redesign your system later to support atomic updates and avoid concurrency issues.
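A minimal sketch of that approach, assuming pymongo and with hypothetical database, collection, and field names:

    # Store the whole comment tree inside one post document, then load it, modify
    # it in memory, and save it back -- no joins involved. All names here are made up.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    posts = client.mysite.posts                      # hypothetical db/collection

    posts.insert_one({
        "title": "First post",
        "comments": [
            {"author": "alice", "text": "Nice post!", "replies": [
                {"author": "bob", "text": "Agreed.", "replies": []},
            ]},
        ],
    })

    post = posts.find_one({"title": "First post"})   # one read fetches the whole tree
    post["comments"][0]["replies"].append(
        {"author": "carol", "text": "Me too.", "replies": []}
    )
    posts.replace_one({"_id": post["_id"]}, post)    # save the entire tree back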
Reddit itself uses Cassandra. If you want something "similar to reddit.com," maybe you should look at their source -- https://github.com/reddit/reddit/wiki.
Here's what David King (ketralnis) said earlier this year about the Cassandra 0.7 release: "Running any large website is a constant race between scaling your user base and scaling your infrastructure to support it. Our traffic more than tripled this year, and the transparent scalability afforded to us by Apache Cassandra is in large part what allowed us to do it on our limited resources. Cassandra v0.7 represents the real-life operations lessons learned from installations like ours and provides further features like column expiration that allow us to scale even more of our infrastructure."
However, Rick Branson notes that Reddit doesn't take full advantage of Cassandra's features, so if you were to start from scratch, you'd want to do some things differently.
I am about to use Apache Hadoop, and its headline reads:
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
I can relate "scalability" to programming, but I just don't know how this "distributing" can help me in my development. According to wikipedia:
A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.
So does this mean I can deploy my web apps across multiple computers and do some sort of "intense computing"? The terms that come into my mind are Content Delivery Networks and Cloud Computing.
Web development has always been about distributed computing, since clients have been on different machines to the servers they talk to, web pages can pull in resources from many servers to build a page's content, and servers may talk to other machines to achieve their goals. CDNs make this more obvious than before, but really they're just an evolution, an introduction of a virtualization/indirection layer between what you ask for and the hardware used to provide it.
Clouds are about taking the concepts of virtualization and applying them to remote hosting, both of low-level OSes and higher-level software platforms. The really interesting thing about them is that this enables different business models on the part of customers (and with different risks too, but that's mostly not related to the fact that it's distributed computing but rather that it is not wholly under your control in your own jurisdiction).
I've found that the most effective use of distributed computing is when you think in terms of connecting together distinct services, each of which with different capabilities (which might be for technical reasons, or might not; sometimes, it's for business or legal reasons that things have to be divided up) and where each of those services may be provided by many components in multiple locations. There are, and continue to remain, issues with balancing the need for performance (which is a force that brings components together) and the need for robustness (which tends to lead to distribution and replication) within the overall context of the general capabilities map.
My goodness! That paragraph sounds like terrible piffle! What I'm trying to say is that it's all trade-offs, and you should be prepared for not getting it right first time.
(Hadoop is a mechanism for doing a distributed file store, and for efficiently applying certain classes of operation – those that fit well with MapReduce or other similar scatter-gather algorithms – across that whole dataset. If that shoe fits, use it. But it doesn't solve all problems, and thank goodness for that! Things that can do everything tend to look very much like things that can't actually do anything at all, and usefulness and comprehensibility come in the restrictions.)
Hadoop is typically used to process massive data sets by distributing the processing of that data set across multiple machines.
What this means is you probably don't want to use it to "deploy an application". You might use it to process stats on your application, however. For instance, you might have very large logs of user data. This would happen if your user data grows to become too large to fit on a single hard drive, and/or would take too long for one machine to process stats on (using standard methods like an SQL query).
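As a rough sketch of that log-stats use case via Hadoop Streaming (the log format and the assumption that each line starts with a user id are mine, not from the question):

    # stats.py -- minimal Hadoop Streaming sketch: count hits per user in large logs.
    # Hadoop splits the logs across machines, runs the mapper on every split, sorts
    # by key, and then feeds the grouped output to the reducer.
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.split()
            if fields:
                print(f"{fields[0]}\t1")             # emit: user_id <TAB> 1

    def reducer():
        current_user, count = None, 0
        for line in sys.stdin:
            user, _, value = line.rstrip("\n").partition("\t")
            if user != current_user:
                if current_user is not None:
                    print(f"{current_user}\t{count}")
                current_user, count = user, 0
            count += int(value)
        if current_user is not None:
            print(f"{current_user}\t{count}")

    if __name__ == "__main__":
        # Hadoop Streaming would run roughly "python stats.py map" as the mapper
        # command and "python stats.py reduce" as the reducer command.
        mapper() if sys.argv[1:] == ["map"] else reducer()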
Ygam, the traditional roles of "clients" and "servers" were pretty stable from about 1960 until about 2005.
I believe, with every fiber of my being, that the future of distributed computing lies in the processors we all carry around in our pockets.
Phones do computing work. Phones do NOT need centralized servers, but they DO benefit from them.
Phones, smartphones, and tablets are an example of where distributed computation is going.
You can make a Wi-Fi base station out of an Android device now. So a phone becomes a server of sorts, just for that instant in the coffee shop when you turn it on for the cute person next to you who has no internet... and now I digress.
Related question: What is the most efficient way to break up a centralised database?
I'm going to try and make this question fairly general so it will benefit others.
About 3 years ago, I implemented an integrated CRM and website. Because I wanted to impress the customer, I implemented the cheapest architecture I could think of, which was to host the central database and website on the web server. I created a desktop application which communicates with the web server via a web service (this application runs from their main office).
In hindsight this was rather foolish, as now that the company has grown, their internet connection seems slower and slower each month. Because of the speed issues, the desktop software now times out on a regular basis, and the customer is left with 3 options:
Purchase a faster internet connection.
Move the database (and website) to an in-house server.
Re-design the architecture so that the CRM and web databases are separate.
The first option is the "easiest", but certainly not the cheapest long term. Second option; if we move the website to in-house hosting, the client has to combat issues like overloaded/poor/offline internet connection, loss of power, etc. And the final option; the client is loathed to pay a whole whack of cash for me to re-design and re-code the architecture, and I can't afford to do this for free (I need to eat).
Is there any way to recover from when you've screwed up the design of a distributed system so bad, that none of the options work? Or is it a case of cutting your losses and just learning from the mistake? I feel terrible that there's no quick fix for this problem.
You didn't screw up. The customer wanted the cheapest option, you gave it to them, this is the cost that they put off. I hope you haven't assumed blame with your customer. If they're blaming you, it's a classic case of them paying for a Chevy while wanting a Mercedes.
Pursuant to that:
Your customer needs to make a business decision about what to do. Your job is to explain to them the consequences of each of the choices in as honest and professional a way as possible and leave the choice up to them.
Just remember, you didn't screw up! You provided for them a solution that served their needs for years, and they were happy with it until they exceeded the system's design basis. If they don't want to have to maintain the system's scalability again three years from now, they're going to have to be willing to pay for it now. Software isn't magic.
I wouldn't call it a screw up unless:
It was known how much traffic or performance requirements would grow. And
You deliberately designed the system to under-perform. And
You deliberately designed the system to be rigid and non adaptable to change.
A screw up would have been to over-engineer a highly complex system costing more than what the scale at the time demanded.
In fact it is good practice to only invest as much as can currently be leveraged by the business, using growth to fund further investment in scalability, should it be required. It is simple risk management.
Surely as the business has grown over time, presumably with the help of your software, they have also set aside something for the next level up. They should be thanking you for helping grow their business beyond expectations, and throwing money at you so you can help them carry through to the next level of growth.
All of those three options could be good. Which one is the best depends on cost benefits analysis, ROI etc. It is partially a technical decision but mostly a business one.
Congratulations on helping build a growing business up til now, and on to the future.
Are you sure that the cause of the timeouts is the internet connection, and not some performance issues in the web service / CRM system? By timeout I'm going to assume you mean something like ~30 seconds, in which case:
Either the internet connection is to blame, in which case you would see the same sorts of timeouts to other websites (e.g. Google); that is clearly unacceptable, and sorting out the internet connection is your only real option.
Or the timeout is caused by the desktop application, the web service, or excessively large amounts of information being passed backwards and forwards, in which case you should either address the performance issue as you would any other bug, or look into ways of optimising the desktop application so that less information is passed backwards and forwards.
In short: the architecture that you currently have seems (fundamentally) fine to me, on the basis that (performance problems aside) access for the company to the CRM system should be comparable to access for the public to the system - as long as your customers have reasonable response times, so should the company.
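One rough way to run that check, sketched in Python with placeholder URLs (my own suggestion, not part of the original answer): time a request to your own web service and to an unrelated site and compare.

    # If both requests are slow, suspect the internet connection; if only your own
    # service is slow, suspect the application or the amount of data transferred.
    import time
    import urllib.request

    def time_request(url, timeout=30):
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=timeout).read()
            return f"{time.monotonic() - start:.1f}s"
        except Exception as exc:
            return f"failed after {time.monotonic() - start:.1f}s ({exc})"

    for url in ("https://www.google.com", "https://example.com/your-web-service"):
        print(url, "->", time_request(url))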
Install a copy of the database on the local network. Then let the client software communicate with the local copy and let the database software do the synchronization between the local database server and the database on the webserver. It depends on which database you use, but some of them have tools to make that work. In MSSQL it is called replication.
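As a conceptual sketch of that local-copy-plus-synchronization idea (this is not how MSSQL replication works under the hood; the table, columns, and use of SQLite here are all invented for illustration):

    # Two in-memory SQLite databases stand in for the office server and the web
    # server; sync_changes pushes rows modified locally since the last sync.
    import sqlite3
    import time

    SCHEMA = "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, modified_at REAL)"

    def sync_changes(local, remote, last_sync):
        """Push rows changed locally since last_sync up to the remote database."""
        rows = local.execute(
            "SELECT id, name, modified_at FROM customers WHERE modified_at > ?",
            (last_sync,),
        ).fetchall()
        remote.executemany(
            "INSERT OR REPLACE INTO customers (id, name, modified_at) VALUES (?, ?, ?)",
            rows,
        )
        remote.commit()
        return time.time()

    local, remote = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    local.execute(SCHEMA)
    remote.execute(SCHEMA)
    local.execute("INSERT INTO customers VALUES (1, 'ACME Ltd', ?)", (time.time(),))
    last_sync = sync_changes(local, remote, last_sync=0.0)
    print(remote.execute("SELECT * FROM customers").fetchall())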
First things first: how much of the code do you really have to throw away? What language did you use for the desktop client? If it's something .NET, you may be able to salvage a good chunk of the system's logic and only need to redo the UI and some of the connections.
My thoughts are that 1 and 2 are out of the question. While 1 might be a good idea, it doesn't solve the real problem, and we as engineers should try to build solutions that don't depend on the client's connection whenever possible. And 2 gets them into something they aren't experts at; it is better to keep the hosting elsewhere.
Also, since you mention a web service, is the UI all you are really losing? You can always reuse the web services for the web server interface.
Lastly, you could look at using a framework to provide a simple web-based CRUD interface to start with, and then expand from there.
Are you sure the connection is saturated? You could be hitting all sorts of network, I/O, and database problems... Unless you've already done so, use Wireshark to analyze the traffic; measure the throughput and share the results with us.
I do a sort of integration/stress test on a very large product (think operating-system size), and recently my team and I have been discussing ways to better organize our test workloads. Up until now, we've been content to have all of our (custom) workload applications in a series of batch-type jobs, each of which represents a single stress-test run. Now that we're at a point where the average test run involves upwards of 100 workloads running across 13 systems, we think it's time to build something a little more advanced.
I've seen a lot out there about unit testing frameworks, but very little for higher-level stress type tests. Does anyone know of a common (or uncommon) way that the problem of managing large numbers of workloads is solved?
Right now we would like to keep a database of each individual workload and provide a front-end to mix and match them into test packages depending on what kind of stress we need on a given day, but we don't have any examples of the best way to do more advanced things like ranking the stress that each individual workload places on a system.
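For what it's worth, here is a minimal sketch of that workload-catalog idea, with invented workload names and a made-up stress score, just to show one way of mixing and matching workloads into a package:

    # A tiny in-memory catalogue of workloads with a made-up "stress" ranking, plus a
    # helper that assembles a test package until a target stress level is reached.
    import random

    WORKLOADS = [
        {"name": "disk_churn",   "targets": "io",      "stress": 7},
        {"name": "fork_storm",   "targets": "cpu",     "stress": 9},
        {"name": "socket_flood", "targets": "network", "stress": 6},
        {"name": "page_thrash",  "targets": "memory",  "stress": 8},
    ]

    def build_package(target_stress, rng=random):
        """Pick shuffled workloads until the summed stress score reaches the target."""
        package, total = [], 0
        pool = list(WORKLOADS)
        rng.shuffle(pool)
        for workload in pool:
            if total >= target_stress:
                break
            package.append(workload["name"])
            total += workload["stress"]
        return package, total

    print(build_package(target_stress=15))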
What are my fellow stress testers on large products doing? For us, a few handrolled scripts just won't cut it anymore.
My own experience has been initially on IIS platform with WCAT and latterly with JMeter and Selenium.
WCAT and JMeter both allow you to walk through a web site as a route, completing forms etc., and record the process as a script. The script can then be played back singly, or as multiple simulated clients on multiple threads, with the playback randomised to simulate lumpy and unpredictable use.
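To give a feel for what such a playback boils down to, here is a hand-rolled sketch (placeholder URLs and client counts, and nothing like the full feature set of WCAT or JMeter):

    # Replay a "recorded" route as many simulated clients, each on its own thread,
    # with randomised think time between passes to make the load lumpy.
    import random
    import threading
    import time
    import urllib.request

    SCRIPT = [  # the recorded route through the site (hypothetical URLs)
        "https://example.com/",
        "https://example.com/login",
        "https://example.com/search?q=widgets",
    ]

    def simulated_client(client_id, iterations=5):
        for _ in range(iterations):
            for url in SCRIPT:
                try:
                    urllib.request.urlopen(url, timeout=10).read()
                except Exception as exc:
                    print(f"client {client_id}: {url} failed: {exc}")
            time.sleep(random.uniform(0.5, 3.0))  # randomised think time

    threads = [threading.Thread(target=simulated_client, args=(i,)) for i in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()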
The scripts can be edited, or they can be written by hand once you know where you are going. WCAT will also let you play back from log files, allowing you to simulate real-world usage.
Both of the above are installed on a PC or server.
Selenium is a Firefox add-in but works in a similar way, recording and playing back scripts and allowing for scaling up.
Somewhat harder is working out the scenarios you are testing for and then designing the tests to fit them. Interaction with the database and other external resources also needs to be factored in. Expect to spend a lot of time looking at log files, so good graphical output is essential.
The most difficult thing for me was to collect and organize performance metrics from servers under load. Performance Monitor was my main tool. When the Visual Studio Tester edition came out, I was astonished at how easy it was to work with performance counters. It comes with pre-packaged lists of counters for a web server, SQL Server, an ASP.NET application, etc. I learned about a bunch of performance counters I did not even know existed. In addition, you can collect your own counters too, store metrics after each run, and connect to production servers to see how they are feeling today. When I see all those graphs in real time, I feel empowered! :)

If you need more load, you can get VS Load Agents and create a load-generating rig (or should I call it a botnet). Compared to other products on the market it is relatively inexpensive: the MS license goes per processor, not per concurrent request, which means you can produce as much load as your hardware can handle. On average, I was able to get about 3,000 concurrent web requests on a dual-core, 2GB-memory computer. In addition, you can incorporate performance tests into your builds.
Of course, this is all Windows-only. Besides, the price tag of about $6K for the tool could be a bit high, plus the same amount of money for each additional load agent.
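If you are not on the Windows/Visual Studio stack, a rough alternative (my own suggestion, not something the answer above uses) is to sample a few counters yourself during a load run, for example with the psutil library; the counters and interval below are arbitrary:

    # Sample CPU, memory, and network counters every few seconds during a load run
    # and keep the results for later comparison. psutil.cpu_percent(interval=...)
    # blocks for the interval, so each loop pass takes roughly interval_s seconds.
    import psutil

    def sample_counters(duration_s=30, interval_s=5):
        samples = []
        for _ in range(int(duration_s / interval_s)):
            samples.append({
                "cpu_percent": psutil.cpu_percent(interval=interval_s),
                "mem_percent": psutil.virtual_memory().percent,
                "net_bytes_sent": psutil.net_io_counters().bytes_sent,
            })
        return samples

    for sample in sample_counters():
        print(sample)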