How does "distributed computing" apply to web development or programming in general? - distributed-computing

I am about to use Apache Hadoop, the headlines read:
The Apache Hadoop project develops
open-source software for reliable,
scalable, distributed computing.
I can relate "scalability" to programming, but I just don't know how this "distributing" can help me in my development. According to Wikipedia:
A distributed system consists of
multiple autonomous computers that
communicate through a computer
network. The computers interact with
each other in order to achieve a
common goal
So does this mean I can deploy my web apps across multiple computers and do some sort of "intense computing"? The terms that come into my mind are Content Delivery Networks and Cloud Computing.

Web development has always been about distributed computing, since clients have been on different machines to the servers they talk to, web pages can pull in resources from many servers to build a page's content, and servers may talk to other machines to achieve their goals. CDNs make this more obvious than before, but really they're just an evolution, an introduction of a virtualization/indirection layer between what you ask for and the hardware used to provide it.
Clouds are about taking the concepts of virtualization and applying them to remote hosting, both of low-level OSes and higher-level software platforms. The really interesting thing about them is that this enables different business models on the part of customers (and with different risks too, but that's mostly not related to the fact that it's distributed computing but rather that it is not wholly under your control in your own jurisdiction).
I've found that the most effective use of distributed computing is when you think in terms of connecting together distinct services, each with different capabilities (which might be for technical reasons, or might not; sometimes it's for business or legal reasons that things have to be divided up), where each of those services may be provided by many components in multiple locations. There are, and will continue to be, issues with balancing the need for performance (a force that brings components together) against the need for robustness (which tends to lead to distribution and replication) within the overall context of the general capabilities map.
My goodness! That paragraph sounds like terrible piffle! What I'm trying to say is that it's all trade-offs, and you should be prepared for not getting it right first time.
(Hadoop is a mechanism for doing a distributed file store, and for efficiently applying certain classes of operation – those that fit well with MapReduce or other similar scatter-gather algorithms – across that whole dataset. If that shoe fits, use it. But it doesn't solve all problems, and thank goodness for that! Things that can do everything tend to look very much like things that can't actually do anything at all, and usefulness and comprehensibility come in the restrictions.)

Hadoop is typically used to process massive data sets by distributing the processing of that data set across multiple machines.
What this means is you probably don't want to use it to "deploy an application". You might use it to process stats on your application, however. For instance, you might have very large logs of user data. This would happen if your user data grows to become too large to fit on a single hard drive, and/or would take too long for one machine to process stats on (using standard methods like an SQL query).
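As a rough illustration of the log-statistics job described above, here is a minimal single-machine sketch of the map / shuffle / reduce pattern in plain Python. It does not use Hadoop's API; the log format (one `<user> <action>` entry per line) and all function names are assumptions made for the example. In Hadoop, the map and reduce steps would run on many machines, each over its own chunk of the files.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (key, 1) pairs for one log line; assumed format: '<user> <action>'."""
    user, _action = line.split(maxsplit=1)
    yield user, 1

def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def count_events_per_user(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample data standing in for very large log files.
sample_log = ["alice login", "bob login", "alice view page", "alice logout"]
print(count_events_per_user(sample_log))
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across machines, which is the property that makes this class of problem a good fit.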

Ygam, the traditional roles of "clients" and "servers" were pretty stable from 1960 until about 2005.
I believe with every fiber of my being that distributed computing today is about the fact that we all carry processors around in our pockets.
Phones do computing work. Phones do NOT need centralized servers, but they DO benefit from them.
Phones, smartphones, and tablets are examples of where distributed computation is going.
You can make a wifi base-station out of an Android device now. So now a phone becomes a server of sorts for just that instant in the coffee shop that you turn it on for that cute person next to you without internet ....and now I digress.......


What to do when you've really screwed up the design of a distributed system?

Related question: What is the most efficient way to break up a centralised database?
I'm going to try and make this question fairly general so it will benefit others.
About 3 years ago, I implemented an integrated CRM and website. Because I wanted to impress the customer, I implemented the cheapest architecture I could think of, which was to host the central database and website on the web server. I created a desktop application which communicates with the web server via a web service (this application runs from their main office).
In hindsight this was rather foolish: now that the company has grown, their internet connection gets slower and slower each month. Because of the speed issues, the desktop software times out on a regular basis, and the customer is left with 3 options:
1. Purchase a faster internet connection.
2. Move the database (and website) to an in-house server.
3. Re-design the architecture so that the CRM and web databases are separate.
The first option is the "easiest", but certainly not the cheapest long term. With the second option, if we move the website to in-house hosting, the client has to combat issues like an overloaded, poor, or offline internet connection, loss of power, etc. As for the final option, the client is loath to pay a whole whack of cash for me to re-design and re-code the architecture, and I can't afford to do this for free (I need to eat).
Is there any way to recover when you've screwed up the design of a distributed system so badly that none of the options work? Or is it a case of cutting your losses and just learning from the mistake? I feel terrible that there's no quick fix for this problem.
You didn't screw up. The customer wanted the cheapest option, you gave it to them, this is the cost that they put off. I hope you haven't assumed blame with your customer. If they're blaming you, it's a classic case of them paying for a Chevy while wanting a Mercedes.
Pursuant to that:
Your customer needs to make a business decision about what to do. Your job is to explain to them the consequences of each of the choices in as honest and professional a way as possible and leave the choice up to them.
Just remember, you didn't screw up! You provided for them a solution that served their needs for years, and they were happy with it until they exceeded the system's design basis. If they don't want to have to maintain the system's scalability again three years from now, they're going to have to be willing to pay for it now. Software isn't magic.
I wouldn't call it a screw up unless:
It was known how much traffic or performance requirements would grow. And
You deliberately designed the system to under-perform. And
You deliberately designed the system to be rigid and non adaptable to change.
A screw up would have been to over-engineer a highly complex system costing more than what the scale at the time demanded.
In fact it is good practice to only invest as much as can currently be leveraged by the business, using growth to fund further investment in scalability, should it be required. It is simple risk management.
Surely as the business has grown over time, presumably with the help of your software, they have also set aside something for the next level up. They should be thanking you for helping grow their business beyond expectations, and throwing money at you so you can help them carry through to the next level of growth.
All three of those options could be good. Which one is best depends on cost-benefit analysis, ROI, etc. It is partially a technical decision but mostly a business one.
Congratulations on helping build a growing business up until now, and on to the future.
Are you sure that the cause of the timeouts is the internet connection, and not some performance issues in the web service / CRM system? By timeout I'm going to assume you mean something like ~30 seconds, in which case:
Either the internet connection is to blame, in which case you would see these sorts of timeouts to other websites (e.g. Google), which is clearly unacceptable, and so sorting out the internet connection is your only real option.
Or the timeout is caused by the desktop application, the web service, or excessively large amounts of information being passed backwards and forwards, in which case you should either address the performance issue as you would any other bug, or look into ways of optimising the desktop application so that less information is passed backwards and forwards.
In short: the architecture that you currently have seems (fundamentally) fine to me, on the basis that (performance problems aside) access for the company to the CRM system should be comparable to access for the public to the system. As long as your customers have reasonable response times, so should the company.
Install a copy of the database on the local network. Then let the client software communicate with the local copy and let the database software do the synchronization between the local database server and the database on the webserver. It depends on which database you use, but some of them have tools to make that work. In MSSQL it is called replication.
First things first: how much of the code do you really have to throw away? What language did you use for the desktop client? If it is something .NET, you may be able to salvage a good chunk of the logic of the system and only need to redo the UI and some of the connections.
My thoughts are that 1 and 2 are out of the question. While 1 might be a good idea, it doesn't solve the real problem, and we as engineers should try to build solutions that do not depend on the client whenever possible. And 2 makes them get into something they aren't experts at; it is better to keep the hosting elsewhere.
Also, since you mention a web service, is all you are really losing the UI? You can always reuse the web services for the web server interface.
Lastly you could look at using a framework to help provide a simple web based CRUD to start and then expand from there.
Are you sure the connection is saturated? You could be hitting all sorts of network, I/O and database problems... Unless you've already done so, use Wireshark to analyze the traffic; measure the throughput and share the results with us.
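Before reaching for a packet capture, a crude first measurement can already help separate "the link is slow" from "the service is slow". The sketch below is only an illustration: the two request functions are hypothetical stand-ins (they just sleep), and the factor of 3 is an arbitrary threshold, not a recommended value. The idea is to time a call to the CRM web service and a call to some unrelated reference site over the same connection, and compare them.

```python
import statistics
import time

def median_latency(call, repeats=5):
    """Run call() several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def service_is_outlier(reference_s, service_s, factor=3.0):
    """True if the service is much slower than a reference site reached over
    the same connection, which points at the service rather than the link."""
    return service_s > factor * reference_s

# Stand-ins for real requests (hypothetical); replace with actual HTTP calls.
def fetch_reference_site():
    time.sleep(0.01)

def call_crm_web_service():
    time.sleep(0.1)

reference = median_latency(fetch_reference_site)
service = median_latency(call_crm_web_service)
print("service slower than the link explains:", service_is_outlier(reference, service))
```

If both numbers are bad, the connection itself is the likelier suspect; if only the service is slow, look at the web service, the database, or the payload size.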

Non-relational databases (NoSQL) for small to medium sized applications

The benefits of a non-relational database (such as a key-value pair storage) are evident when used in large scale datasets (google, facebook, linkedin). How do you think small to medium sized applications can benefit from using non-relational databases?
IBM mainframes have had "non-relational" databases since the 60s (hierarchical databases such as IMS and its variants). These databases are still in use because they are extremely fast and handle huge scale well.
The point of relational databases was to provide a regular, relatively abstract method for storing and retrieving data in which the tuning can be done relatively independently of the data model (not true for IMS). They were designed largely in reaction to the inability to reorganize hierarchical databases easily. The upside is nice organization; the downside is medium, not high, performance.
Google provides scalable storage and MapReduce to handle scale. It isn't relational.
There was a huge push early in the last decade to store data in XML, in essentially hierarchical form because XML is implicitly hierarchical. That was a huge mistake IMHO, because it repeated the inconvenience of hierarchical databases, but had none of the performance. I'm not very surprised this movement seems to have pretty much died.
Most of the practical push to non-relational seems to me to be towards performance and scale. I don't see how this helps "small" applications much.
People have proposed, but not done a lot of, practical data management using knowledge-based schemes. Doug Lenat's CYC comes to mind here. The ability of the database to help an application draw non-obvious conclusions strikes me as very interesting for "small" applications that are trying to be "smart". But there aren't a lot of these yet.
The sweet spot of using a NoSQL database at that scale is when the database model (key-value, document, etc.) is a good match to the application's needs and the advanced relational functionality is not needed.
At the small end of the spectrum, performance is a non-issue because just about everything is fast. Storage engines are a non-issue, and if you don't need a sophisticated query engine, the lack of SQL support is a non-issue.
You are left with how well it fits and how easy it is to use. Honestly though, tooling does become an issue. Relational database tooling is mature, NoSQL tooling is less feature rich and less battle hardened. Too often it is roll-your-own tooling. Definitely consider what tools you'd be giving up and how much you need them.
There is an additional slate of advantages for smaller projects when considering a NoSQL service (like Amazon SimpleDB and Microsoft Azure) as compared to a product. If you only have to pay for what you use and you don't use much, it can be cheaper than running a dedicated server, going all the way down to free for something like the SimpleDB free usage tier.
You also avoid some of the server and database maintenance costs. This can be a big win if you don't have a DBA, or when your DBAs are already over worked. Of course you'll still have admin work to do, but it is significantly reduced, and typically simpler.
When it comes to graph databases (like Neo4j, a project I'm involved in), they excel at scaling to complexity. This means they provide "better substrates for modeling business domains" (see The State of NoSQL, also by Ben Scofield). As I see it, this is very important in small to medium sized apps.
This may be better explained through examples, so here's some links to example apps/domain modeling:
Access control lists the graph database way
Social networks in the database: using a graph database
Domain modeling gallery
The question perhaps requires a bit more context... assuming a Python environment, consider the tutorial at the y_serial project: http://yserial.sourceforge.net/
NoSQL is not merely adopted for reasons of scalability. Serialization (of any arbitrary Python object) and persistence are very convenient at any scale -- so consider the key-value system as one approach.
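To make the "serialize any Python object under a key" idea concrete, here is a minimal sketch using only the standard library (`pickle` plus an in-memory SQLite table). This is not the y_serial API; the class name `PickleStore` and its methods are hypothetical and only illustrate the general approach.

```python
import pickle
import sqlite3

class PickleStore:
    """Tiny key-value store: keys are strings, values are any picklable object."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)"
        )

    def put(self, key, obj):
        blob = pickle.dumps(obj)  # serialize the object as-is, no schema mapping
        self.db.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, blob)
        )
        self.db.commit()

    def get(self, key, default=None):
        row = self.db.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        # Only unpickle data you wrote yourself; pickle is unsafe on untrusted input.
        return default if row is None else pickle.loads(row[0])

store = PickleStore()
store.put("settings", {"theme": "dark", "recent": ["a.txt", "b.txt"]})
print(store.get("settings")["theme"])
```

Passing a file path instead of `":memory:"` would make the data persistent, which is usually all a small application needs.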
Well one of the problems with a RDBMS is that you need to spend effort mapping your programming languages domain models to the relational schema of your RDBMS. This effort is usually spent configuring your ORM layer.
With NoSQL databases you are not forced to map your objects to a relational model and in most cases your objects are serialized as-is. Because of the lack of an intermediary schema, data migrations and versioning become easier.
Another benefit is scalability and performance. Since most of the time your data is retrieved by key, effectively everything uses an index. Trivial sharding is possible by doing a % (MOD) on the key against the number of available NoSQL instances, providing the natural data partitioning that is crucial for sharding.
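The modulo-based partitioning described above can be sketched in a few lines. In this illustration the "instances" are plain dictionaries and the helper names (`shard_for`, `put`, `get`) are hypothetical. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib

def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

# Hypothetical cluster of three stores, modelled here as plain dictionaries.
shards = [{} for _ in range(3)]

def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)

put("user:42", {"name": "Ann"})
print(get("user:42"))
```

One known limitation of plain modulo sharding is that changing the number of instances remaps most keys, so growing the cluster means moving a lot of data.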
If you're interested in seeing how developing with a NoSQL differs from a RDBMS, I have a tutorial where I show how to go about designing a simple blog application using Redis.
If you match up a few common PaaS cloud services like a Key-Value store, a BLOB store, and a Message Queue store you have some handy tools that can free small application developers from the tyranny of the DBA and the infrastructure folks.
Today small developers often resort to Jet MDBs. Why? Easy, shared access is as easy as storing the MDB file on a file share visible to the entire application community. When they can get away with it (i.e. get the necessary support from the gatekeepers) they might use SQL Server Express, MySQL, etc.
Sadly those gatekeepers can be pretty hostile to deal with in a large organization. Mention a "database" and suddenly you face the DBA gang and associated delays, application reviews, prioritization, etc. Mention needing a server and you face that other firing squad.
Using a NoSQL solution and related cloud services can eliminate a ton of this if you don't need an RDBMS.
For one thing, all that's really required is an account with a public cloud provider. This is something that becomes fairly easy once the concept has been approved. And easier for you as a developer once you've been approved and assigned an account, though of course there are the usual bookkeeping issues.
But let's even set that aside. What if your organization implemented a private cloud for such uses? Lots of the issues of outside billing go away, data insecurity worries go away, etc.
Such a thing could be implemented and provisioned in a semi-anonymous fashion, almost as easily as administering file shares. The anonymity comes in because once you've been approved to develop on the in-house cloud nobody needs to nitpick the details of your activities using it any more than they need to examine a request before you can create a file on an existing file share.
Obviously there would be storage and CPU quotas to manage. Nobody can afford to just keep scaling up indefinitely. Rogue applications might consume vast quantities of resources. So what you need is some sort of quota system to cap usage. Whether this is monitored by infrastructure folks is an implementation decision, or it might be treated just like file share use: run out and somebody yells at the programmer who in turn looks into it and requests more if appropriate (or fixes his bugs).
But you end up with "utility computing" and by "using no SQL" you don't incur the cost (and issues) of dealing with DBAs. They can still sit quietly surfing the Web in their big offices while you get some work done.
Amazon SimpleDB can be useful for those who need a non-relational database for storage of smaller, non-structural data. Amazon SimpleDB has restricted storage size to 10GB per domain. Amazon SimpleDB offers simplicity and flexibility. SimpleDB automatically indexes all data. Amazon SimpleDB pricing is based on your actual box usage. You can store any UTF-8 string data in Amazon SimpleDB.

Old concepts with new names (namely REST and Cloud computing)

It seems that SaaS and Cloud computing are old concepts with new names, and I am curious if I am wrong.
For cloud computing you can look at: Difference between cloud computing and distributed computing?
Basically, it seems that what we have been doing with hosting all along is cloud computing; it is just that now some companies have put in much greater resources to ensure better uptime than my local ISP. But it seems that there is nothing really new here.
For REST, it seems that it is what we have been doing with cgis for 15 years.
Here is a question on REST: What am I not understanding about REST?
It appears that REST is an old concept, and I am curious how it is different from what has been done since the early days of the web, and, to a large extent, the early days of using telnet (which http is on top of).
Am I mistaken in my simplification of these? I try to see how what is new is like what I know so I can see what more has to be learned in that topic, but for cloud computing and REST it seems that very little needs to be learned.
You are both right and wrong. You are right in the sense that new ideas are normally similar to old ideas, and indeed cloud computing is based significantly on distributed computing.
What is new in cloud computing is
virtualization
self-service
With virtualization, you can run multiple operating systems on a single piece of hardware. While that, in itself, isn't new either, it was never considered a relevant piece of the architecture in distributed systems. Using virtualization allows self-service: users can create their own clusters of nodes without the administrator of the hardware taking any action. This allows a significant acceleration of deployment, and a significant reduction of cost.
For ReST, what you are missing is the client API. It is true that on the server side, a ReST service can be implemented with CGI. What is new here is that it is not an end user which retrieves the URL, but a program.
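The point about a program, rather than an end user, retrieving the URL can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the `/books/<id>` URL shape, the `BOOKS` data, and the `fetch_book` helper. A tiny local server plays the role that a CGI script could play on the server side.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical resource collection served as JSON.
BOOKS = {"1": {"title": "Dune"}, "2": {"title": "Emma"}}

class BookHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The URL shape /books/<id> is an assumption for this example.
        book = BOOKS.get(self.path.rsplit("/", 1)[-1])
        body = json.dumps(book if book else {"error": "not found"}).encode()
        self.send_response(200 if book else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch_book(host, port, book_id):
    """The client side: a program dereferences the URL and parses the result."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", f"/books/{book_id}")
        response = conn.getresponse()
        return response.status, json.load(response)
    finally:
        conn.close()

server = HTTPServer(("127.0.0.1", 0), BookHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

status, book = fetch_book(host, port, "1")
missing_status, _ = fetch_book(host, port, "99")
print(status, book, missing_status)

server.shutdown()
server.server_close()
```

Note that the client relies only on the URL, the HTTP status code, and the media type; no session state is kept between the two requests.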
Saying that HTTP is on top of telnet ignores realities; this is like saying that we made no progress since the introduction of copper wires for communication. Strictly speaking, HTTP is not on top of telnet, but on top of TCP (which telnet is also on top of, these days).
Considering Roy's dissertation coined the term REST back in 2000, you can definitely argue that there is nothing new about REST. Additionally, the REST architectural style was synthesized from successful existing practices, so REST implementations pre-date the definition. Having said that, there is nothing simple about designing REST interfaces. Ever since Netscape first abused cookies to allow servers to maintain session state people have been swimming upstream against the web.
REST's recent resurrection has come mainly from people becoming disillusioned with SOAP-based Web Services. SOAP tried to hide HTTP instead of embracing it, and I think people are starting to realize how effective HTTP can be as a distributed application protocol that can do more than just deliver HTML to web browsers.
RESTful web applications don't use session state, so one could argue that by that virtue alone it is different than most web applications in existence at the moment.
As for Cloud Computing, I find myself agreeing with Larry Ellison for once in my life.
I'm in agreement on what you've posted. You might consider making this community wiki since it's likely to garner many answers based on opinion. Cloud computing seems to have taken off as a buzzword, and this is largely due to a decrease in cost for mass quantities of hardware. And then there is REST which is really just a formal name and definition for something that has been in place for a long time. Some people like to encapsulate ideas with buzzwords and acronyms. Sometimes it's useful to put a name to an idea though.
Not only this, the concept of things being old concepts with new names is old. It's hard to be original these days :P
You are right about REST: it's mostly old concepts with a lot of added pedantry and not much added substance.
Cloud computing has a small but fundamental difference from distributed computing. In distributed computing you had servers dedicated to particular functions, and usually some sort of directory service to locate the correct server. In cloud computing any server is capable of any task and usually the servers queue up for work which is distributed from a central point.
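The "any server can do any task, and servers queue up for work from a central point" model described above is essentially a shared work queue. The following is a minimal in-process sketch in which threads stand in for servers; the function `run_pool` and the sample jobs are hypothetical.

```python
import queue
import threading

def run_pool(tasks, worker_count=3):
    """Identical workers pull tasks from one central queue until it is empty."""
    work = queue.Queue()
    for task in tasks:
        work.put(task)
    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                func, arg = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = func(arg)
            with lock:
                results.append((name, outcome))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

# Any worker can run any task: here, three different kinds of job.
jobs = [(len, "hello"), (sum, [1, 2, 3]), (str.upper, "cloud")]
done = run_pool(jobs)
print(sorted((outcome for _, outcome in done), key=str))
```

Which worker handles which job is not determined in advance, which is the contrast with the dedicated-server model: there is no directory lookup for "the server that does X".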

Designing a Stress Testing Framework

I do a sort of integration/stress test on a very large product (think operating-system size), and recently my team and I have been discussing ways to better organize our test workloads. Up until now, we've been content to have all of our (custom) workload applications in a series of batch-type jobs, each of which represents a single stress-test run. Now that we're at a point where the average test run involves upwards of 100 workloads running across 13 systems, we think it's time to build something a little more advanced.
I've seen a lot out there about unit testing frameworks, but very little for higher-level stress type tests. Does anyone know of a common (or uncommon) way that the problem of managing large numbers of workloads is solved?
Right now we would like to keep a database of each individual workload and provide a front-end to mix and match them into test packages depending on what kind of stress we need on a given day, but we don't have any examples of the best way to do more advanced things like ranking the stress that each individual workload places on a system.
What are my fellow stress testers on large products doing? For us, a few handrolled scripts just won't cut it anymore.
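One possible shape for the workload catalogue and ranking idea described above is sketched here. It is only an illustration under stated assumptions: the three resource figures per workload, the weights in `stress_score`, and the greedy selection in `build_package` are all hypothetical choices, not an established method.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    name: str
    cpu: float     # assumed measured share of CPU, 0..1
    io: float      # assumed measured share of I/O bandwidth, 0..1
    memory: float  # assumed measured share of memory, 0..1

    def stress_score(self, weights=(0.5, 0.3, 0.2)):
        """Weighted sum of resource pressure; the weights are arbitrary."""
        w_cpu, w_io, w_mem = weights
        return w_cpu * self.cpu + w_io * self.io + w_mem * self.memory

def build_package(catalogue, target_score):
    """Greedily pick the heaviest workloads until the target score is reached."""
    package, total = [], 0.0
    for workload in sorted(catalogue, key=Workload.stress_score, reverse=True):
        if total >= target_score:
            break
        package.append(workload)
        total += workload.stress_score()
    return package, total

# Hypothetical catalogue entries.
catalogue = [
    Workload("db-churn", cpu=0.4, io=0.9, memory=0.5),
    Workload("cpu-spin", cpu=0.9, io=0.1, memory=0.3),
    Workload("idle-ping", cpu=0.05, io=0.05, memory=0.05),
]
package, total = build_package(catalogue, target_score=1.0)
print([w.name for w in package], round(total, 2))
```

In a real setup the per-workload figures would come from measurements stored in the database, and the front-end would call something like `build_package` with the stress level wanted on a given day.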
My own experience has been initially on the IIS platform with WCAT and latterly with JMeter and Selenium.
WCAT and JMeter both allow you to walk through a web site as a route, completing forms etc., and record the process as a script. The script can then be played back singly, or with multiple simulated clients and multiple threads, and the playback can be randomised to simulate lumpy and unpredictable use.
The scripts can be edited, or can be written by hand once you know where you are going. WCAT will let you play back from the log files as well, allowing you to simulate real-world usage.
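The record-and-replay idea can be sketched without any of these tools. The code below is not WCAT or JMeter script syntax; the recorded route, the `fake_send` stand-in for a real HTTP call, and the function names are all hypothetical. It replays one script from several simulated clients at once, with a random pause between steps to make the load lumpy.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def replay(script, send, rng, max_think_s=0.01):
    """Play one recorded script, pausing a random 'think time' before each step."""
    timings = []
    for step in script:
        time.sleep(rng.uniform(0, max_think_s))
        start = time.perf_counter()
        send(step)
        timings.append(time.perf_counter() - start)
    return timings

def run_load(script, send, clients=5, seed=0):
    """Replay the script from several simulated clients concurrently."""
    rngs = [random.Random(seed + i) for i in range(clients)]
    with ThreadPoolExecutor(max_workers=clients) as pool:
        return list(pool.map(lambda rng: replay(script, send, rng), rngs))

# Hypothetical recorded route and a stand-in for the real HTTP call.
recorded_script = ["GET /", "GET /login", "POST /login", "GET /account"]

def fake_send(step):
    time.sleep(0.001)

results = run_load(recorded_script, fake_send)
print(len(results), "clients,", len(results[0]), "timed steps each")
```

The per-step timings returned for each client are the raw material for the graphs and log analysis discussed in this thread.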
Both of the above are installed on a PC or server.
Selenium is a Firefox add-in but works in a similar way, recording and playing back scripts and allowing for scaling up.
Somewhat harder is working out the scenarios you are testing for and then designing the tests to fit them. Interaction with the database and other external resources also needs to be factored in. Expect to spend a lot of time looking at log files, so good graphical output is essential.
The most difficult thing for me was to collect and organize performance metrics from servers under load. Performance Monitor was my main tool. When Visual Studio Tester edition came out I was astonished how easy it was to work with performance counters. They pre-packaged a list of counters for a web server, SQL Server, ASP.NET application, etc. I learned about a bunch of performance counters I did not even know existed. In addition you can collect your own counters too. You can store metrics after each run. You can also connect to production servers and see how they are feeling today. When I see all those graphs in real time I feel empowered! :) If you need more load you can get VS Load Agents and create a load-generating rig (or should I call it a botnet). Compared to other products on the market it is relatively inexpensive. The MS license is per processor, not per concurrent request. That means you can produce as much load as your hardware can handle. On average, I was able to get about 3,000 concurrent web requests on a dual-core computer with 2GB of memory. In addition you can incorporate performance tests into your builds.
Of course, this applies only to Windows. Also, the price tag of about $6K for the tool could be a bit high, plus the same amount of money per additional load agent.

Cloud Computing need some regulations?

I was recently involved with a couple of cloud computing platforms.
First of all please note that I am not trying to criticize any platform.
Cloud computing is a large area, so to make my point simple and understandable, let me use a very simple scenario: data storage services hosted in the cloud.
Take any storage service, such as Amazon EC2, SQL Data Service (SDS), or Salesforce.com services.
Whichever platform you consume, the goal of all such services is the same: to serve the requested data on demand, without you worrying about how it is stored, where it is stored, who maintains it, etc. (all the cloud goodies).
Now my area of concern: in the same way that ANSI SQL regulated platform vendors to make sure they follow a similar language across all their products, can't a similar concept be regulated across service providers?
Why are there no such initiatives?
Any thoughts appreciated
It seems to me like you're worried about vendor lock-in with cloud computing. I may be naive but I would normally choose technologies and then go look for cloud vendors that'd be able to deliver these technologies. And if I was aiming for a "write once run anywhere approach" I'd have to select technology that'd make this as realistic as possible.
With the fairly rapid speed of development I really think standardization committees would struggle to keep up. ANSI SQL has had 20+ years of history. It seems to me like you're asking for standardization long before we even know what the cloud is up to...
I think that this emerging cloud computing initiative is just too young to have standards.
Service providers right now just worry about rushing into the market, rather than interoperability and standards.
Later on, when the situation is more established, some common guidelines may emerge. But there is still a long way to go.
You seem to be asking specifically about cloud storage services, rather than cloud computing in general. So your Amazon example would be S3, not EC2.
I think the field is a little young to be standardising on an API just yet. The services differentiate themselves in ways which rule this out. For example, S3 trades sophistication for scalability/reliability/performance: you can't do a complex SQL LIKE query. You can store and retrieve blobs of data based on a key, and that's about it.
I think as such services become more and more the mainstream way to do things, standards will emerge. Users will want the freedom to switch providers on a whim, move their data around, test against free local storage, etc.
The APIs used are all based on Web Standards already. Making an abstraction layer to make them look the same is fairly trivial.
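A minimal sketch of such an abstraction layer, under stated assumptions, might look like the following. The `BlobStore` interface and both backends are hypothetical; a backend for a real provider is deliberately left out rather than guessing at its API, and would implement the same two methods.

```python
import os
import tempfile
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Provider-neutral interface: store and retrieve blobs by key, nothing more."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class MemoryStore(BlobStore):
    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]

class DirectoryStore(BlobStore):
    """Local-disk backend, useful for testing against free local storage."""

    def __init__(self, root):
        self._root = root

    def _path(self, key):
        # Hex-encode the key so that any string is a safe file name.
        return os.path.join(self._root, key.encode("utf-8").hex())

    def put(self, key, data):
        with open(self._path(key), "wb") as handle:
            handle.write(data)

    def get(self, key):
        with open(self._path(key), "rb") as handle:
            return handle.read()

def save_report(store: BlobStore, text: str) -> None:
    """Application code depends only on the interface, not on a provider."""
    store.put("reports/latest", text.encode("utf-8"))

for store in (MemoryStore(), DirectoryStore(tempfile.mkdtemp())):
    save_report(store, "all good")
    print(type(store).__name__, store.get("reports/latest"))
```

The interface is intentionally limited to the key-based store and retrieve operations mentioned earlier in this thread; anything richer (queries, listings) is where providers differ and where a common layer stops being trivial.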