Parallelism behaviour of stream processing engines

Parallelism behaviour of stream processing engines - distributed-computing

I have been learning Storm and Samza in order to understand how stream processing engines work and realized that both of them are standalone applications and in order to process an event I need to add it to a queue that is also connected to stream processing engine. That means I need to add the event to a queue (which is also a standalone application, let's say Kafka), and Storm will pick the event from the queue and process it in a worker process. And If I have multiple bolts, each bolt will be processed by different worker processes. (Which is one of the things I don't really understand, I see that a company that uses more than 20 bolts in production and each event is transferred between bolts in a certain path)
However I don't really understand why I would need such complex systems. The processes involves too much IO operations (my program -> queue -> storm ->> bolts) and it makes much more harder to control and debug the them.
Instead, if I'm collecting the data from web servers, why not just use the same node for event processing? The operations will be already distributed over the nodes by load-balancers which I use for web servers. I can create executors on same JVM instances and send the events from web server to the executor asynchronously without involving any extra IO requests. I can also watch the executors in web servers and make sure that the executor processed the events (at-least-once or exactly-one processing guarantee). In this way, it will be a lot easier to manage my application and since not much IO operation is required, it will be faster compared to the other way which involves sending the data to another node over the network (which is also not reliable) and process it in that node.
Most probably I'm missing something here because I know that many companies actively uses Storm and many people I know recommend Storm or other stream processing engines for real-time event processing but I just don't understand it.

My understanding is that the goal of using a framework like Storm is to offload the heavy processing (whether cpu-bound, I/O-bound or both) from the application/web servers and keep them responsive.
Consider that each application server may have have to serve a large number of concurrent requests, not all of them having to do with stream processing. If the app server is already processing a significant load of events, then it could constitute a bottleneck for lighter requests, as the server resources (think cpu usage, memory, disk contention etc.) will already be tied to heavier processing requests.
If the actual load you need to face isn't that heavy, or if it can simply be handled by adding app server instances, then of course it doesn't make sense to complexify your architecture/topology, which could in fact slow the entire thing down. It really depends on your performance and load requirements, as well as on how much (virtual) hardware you can throw at the problem. As usual, benchmarking based on your load requirements will help make a decision of which way to go.

you are right to consider that sending data across the network will consume more time of the total processing time.
However, these frameworks (Storm, Spark, Samza, Flink) were created to process a lot of data that potentially does not fit in memory of one computer. So, if we use more than one computer to process the data we can achieve parallelism.
And, following your question about the network latency. Yes! this is a trade off to consider. The developer has to know that they are implementing programs to deploy in a parallel framework. The way that they build the application will influence how much data is transferred through the network as well.

Related

When to use polling and streaming in launch darkly

I have started using launch darkly(LD) recently. And I was exploring how LD updates its feature flags.
As mentioned Here, there are two ways.
Streaming
Polling
I was just thinking which implementation will be better in what cases. After a little research about streaming vs polling, It was found Streaming has the following advantages over polling.
Faster than polling
Receives only latest data instead of all the data which is same as before
Avoids periodic requests
I am pretty sure all of the above advantages comes at a cost. So,
Are there any downsides of using streaming over polling?
In what scenarios polling should be preferred? or the other way around?
On what factors should I decide whether to stream or poll?

Streaming
Streaming requires your application to be always alive. This might not be the case in a serverless environment. Furthermore, a streaming solution usually relies on a connection that is always open in the background. This might be costly, so feature flag providers tend to limit the number of concurrent connections you can keep open to their infrastructure. This might be not a problem if you use feature flags only in a few application instances. But you will easily reach the limit if you want to stream feature flag updates to mobile apps or a ton of microservices.
Polling
Polling sounds less fancy, but it's a reliable & robust old-school pattern that will work in almost all environments.
Webhooks
There is a third option too: webhooks. The basic idea is that you create an HTTP endpoint on your end and he feature flag service will call that endpoint whenever a feature flag value update happens. This way you get a "notification" about feature flag value changes. For example ConfigCat supports this model. ConfigCat can notify your infrastructure by calling your webhooks and (optionally) pushing new values to your end. Webhooks have the advantage over streaming that they are cheap to maintain, so feature flag service providers don't limit them as much (for example ConfigCat can give you unlimited webhooks).
How to decide
How I would use the above 3 option really depends on your use-case. A general rule of thumb is: use polling by default and add quasi real-time notifications (by streaming or by webhooks) to the components where it's critical to know about feature flag value updates.

In addition to #Zoltan's answer, I Found the following from LaunchDarkly's Effective Feature management E book (Page 36)
In any networked system there are two methods to distribute information.
Polling is the method by which the endpoints (clients or servers) periodically ask for updates. Streaming, the second method,is when the central authority pushes the new values to all the end‐points as they change.Both options have pros and cons.
However, in a poll-based system, you are faced with an unattractive trade-off: either you poll infrequently and run the risk of different parts of your application having different flag states, or you poll very frequently and shoulder high costs in system load, network bandwidth, and the necessary infra‐structure to support the high demands.
A streaming architecture, on the other hand, offers speed advantages and consistency guarantees. Streaming is a better fit for large-scale and distributed systems. In this design, each client maintains along-running connection to the feature management system, which instantly sends down any changes as they occur to all clients.
Polling Pros:
Simple
Easily cached
Polling Cons:
Inefficient. All clients need to connect momentarily, regardless of whether there is a change.
Changes require roughly twice the polling interval to propagate to all clients.
Because of long polling intervals, the system could create a “split brain” situation, in which both new flag and old flag states exist at the same time.
Streaming Pros:
Efficient at scale. Each client receives messages only when necessary.
Fast Propagation. Changes can be pushed out to clients in real time.
Streaming Cons:
Requires the central service to maintain connections for every client
Assumes a reliable network
For my use case, I have decided to use polling in places where I don't need to update the flags often(long polling interval) and doesn't care about inconsistencies (split-brain) .
And Streaming for applications that need immediate flag updates and consistency is important.

Distributed systems with large number of different types of jobs

I want to create a distributed system that can support around 10,000 different types of jobs. One single machine can host only 500 such jobs, as each job needs some data to be pre-loaded into memory, which can't be kept in a cache. Each job must have redundancy for availability.
I had explored open-source libraries like zookeeper, hadoop, but none solves my problem.
The easiest solution that I can think of, is to maintain a map of job type, with its hosted machine. But how can I support dynamic allocation of job type on my fleet? How to handle machine failures, to make sure that each job type must be available on atleast 1 machine, at any point of time.

Based on the answers that you mentioned in the comments, I propose you to go for a MQ-based (Message Queue) architecture. What I propose in this answer is to:
Get the input from users and push them into a distributed message queue. It means that you should set up a message queue (Such as ActiveMQ or RabbitMQ) on several servers. This MQ technology, helps you to replicate the input requests for fault tolerance issues. It also provides a full end-to-end asynchronous system.
After preparing this MQ layer, you can setup you computing servers layers. This means that some computing servers (~20 servers in your case) will read the requests from the message queue and start a job based on the request. Because this MQ is distributed, you can make sure that a good level of load balancing can happen in your computing servers. In addition, each server is capable of running as much as jobs that you want (~500 in your case) based on the requests that it reads from the MQ.
Regarding the failures, the computing servers may only pop from the MQ, if and only if the job is completed. If one server is crashing, the job is still in the MQ and another server can work on it. If the job is saving some state somewhere or updates something, you should manage its duplicate run then.
The good point about this approach is that it is very salable. It means that if in future you have more jobs to handle, by adding a computing server and connecting it to the MQ, you can process more requests on the servers without any change to the system. In addition, some nice features in the MQ like priority-based queuing, helps you to prioritize the requests and process them based on the job type.
p.s. Your Q does not provide any details about the type and parameters of the system. This is a draft solution that I can propose. If you provide more details, maybe the community can help you more.

How to message/work queues work and what's the justification for a dedicated/hosted queue service

I'm getting into utilising work queues for my app and can easily understand their benefit: computationally or time-intensive tasks (like sending an email) can be put in a queue and run asynchronously to the request that initiated the task so that it can get on with responding to the user.
For example in my Laravel app I am using redis queueing. I push a job into onto the queue and a separate artisan process which is running on the server (artisan queue:listen) is listening on the queue for new jobs for it to execute.
My confusion is when it comes to the actually queuing system. Surely the queue worker is just a list of jobs with any pertinent data serialised and passed through. The job itself is still computed by the application.
So this leads me to wonder about the benefit and cost of large-scale queue workers like Iron.io or Amazon SQS. These services cost a fair amount for what seems like a fairly straightforward and computationally minimal task. Even with thousands of jobs a minute a simple redis or beanstalkd queue on the same server will surely be handled far easier than the actual jobs themselves. Having a hosted system seems like it'll slow down the process more due to the latency between servers.
Can anyone explain the workings of a work queue and how these hosted services improve the performance of such an application. I imagine the benefit is only felt once an app has scaled sufficiently but more insight would be helpful.

design high volume MSMQ

We have many communication servers sending data packets. We would like to store these data packets coming from these server programs in MSMQ until an updater will process them. Data loss has been a concern and we would like to not lose any data packet coming from these server programs and want an efficient and performant solution.
What will be the best design approach?

Well, there are two basic things you need to do to get started. First, you'll want to modify the default installation to move the storage location to a drive that is mirrored and/or is not the same as the one that the operating system boots from on that server. Also you'll want to ensure there is enough space there to hold messages as they are queued, depending on the volume you're contemplating. This article covers that.
Second, you'll want to use transactions and journaling to ensure reliability. This is both a programming and infrastructure issue, so you can start by looking at this article, and then following up with a general guide on how to program against MSMQ correctly. This for example is a good starting point if you've never used MSMQ, although it's fairly basic. If you're going to use MSMQ as a binding/transport for WCF then you have the plumbing part pretty much covered; it's just a matter of configuring your services to handle the volume and traffic you think you're going to see.

We have many communication servers sending data packets.
When storing 'data packets', I would recommend writing [Serializable] .NET objects to WCF, mainly because WCF can read/write them transparently to MSMQ. This will be easier to work with, but if your data packets are say TCP/IP or binary packets, you will need to turn on 'Ordering', to ensure they go into the queue in the exact order they were placed.
MSMQ also has sessions, so if you want to group items together this is possible. WCF does not make this guarantee. You will need to write custom code for this, but it is only a case of assigning a unique ID to each message in a particular session.
Data loss has been a concern and we would like to not lose any data packet coming from these server programs
MSMQ can persist the data to disk, so if a server goes down, its queue is preserved. MSMQ can hold the queue in memory, which is more efficient but crashes/restarts will not retain the queue information.
and want an efficient( good performance )
MSMQ is fairly performant. The persistence to disk has a small overhead, but only due to the disk write. If performance includes multi-threading, MSMQ does not offer this feature as the queue is sequential, so must be processed in order. But this is typical of queue technologies.
MSMQ also have 4MB max message size, so keep in mind what you want to send across the network.
The only other thing is that MSMQ is not massively scalable. Its primary goal is guaranteed delivery. If you post millions of packets, they will get to their destination, but MSMQ does have a finite ability to push the messages to other machines. It operates a ThreadPool-like system, so it will not scale if this is also a requirement.
I have also added info to the #msmq-wcf wiki with a basic example of writing data.

Some fundamental but important questions about web development?

I've developed some web-based applications till now using PHP, Python and Java. But some fundamental but very important questions are still beyond my knowledge, so I made this post to get help and clarification from you guys.
Say I use some programming language as my backend language(PHP/Python/.Net/Java, etc), and I deploy my application with a web server(apache/lighttpd/nginx/IIS, etc). And suppose at time T, one of my page got 100 simultaneous requests from different users. So my questions are:
How does my web server handle such 100 simultaneous requests? Will web server generate one process/thread for each request? (if yes, process or thread?)
How does the interpreter of the backend language do? How will it handle the request and generate the proper html? Will the interpreter generate a process/thread for each request?(if yes, process or thread?)
If the interpreter will generate a process/thread for each request, how about these processes(threads)? Will they share some code space? Will they communicate with each other? How to handle the global variables in the backend codes? Or they are independent processes(threads)? How long is the duration of the process/thread? Will they be destroyed when the request is handled and the response is returned?
Suppose the web server can only support 100 simultaneous requests, but now it got 1000 simultaneous requests. How does it handle such situation? Will it handle them like a queue and handle the request when the server is available? Or other approaches?
I read some articles about Comet these days. And I found long connection may be a good way to handle the real-time multi-users usecase. So how about long connection? Is it a feature of some specific web servers or it is available for every web server? Long connection will require a long-existing interpreter process?
EDIT:
Recently I read some articles about CGI and fastcgi, which makes me know the approach of fastcgi should be a typical approach to hanlde request.
the protocol multiplexes a single transport connection between several independent FastCGI requests. This supports applications that are able to process concurrent requests using event-driven or multi-threaded programming techniques.
Quoted from fastcgi spec, which mentioned connection which can handle several requests, and can be implemented in mutli-threaded tech. I'm wondering this connection can be treated as process and it can generate several threads for each request. If this is true, I become more confused about how to handle the shared resource in each thread?
P.S thank Thomas for the advice of splitting the post to several posts, but I think the questions are related and it's better to group them together.
Thank S.Lott for your great answer, but some answers to each question are too brief or not covered at all.
Thank everyone's answer, which makes me closer to the truth.

Update, Spring 2018:
I wrote this response in 2010 and since then, a whole lot of things have changed in the world of a web backend developer. Namely, the advent of the "cloud" turning services such as one-click load balancers and autoscaling into commodities have made the actual mechanics of scaling your application much easier to get started.
That said, what I wrote in this article in 2010 still mostly holds true today, and understanding the mechanics behind how your web server and language hosting environment actually works and how to tune it can save you considerable amounts of money in hosting costs. For that reason, I have left the article as originally written below for anyone who is starting to get elbows deep in tuning their stack.
1. Depends on the webserver (and sometimes configuration of such). A description of various models:
Apache with mpm_prefork (default on unix): Process per request. To minimize startup time, Apache keeps a pool of idle processes waiting to handle new requests (which you configure the size of). When a new request comes in, the master process delegates it to an available worker, otherwise spawns up a new one. If 100 requests came in, unless you had 100 idle workers, some forking would need to be done to handle the load. If the number of idle processes exceeds the MaxSpare value, some will be reaped after finishing requests until there are only so many idle processes.
Apache with mpm_event, mpm_worker, mpm_winnt: Thread per request. Similarly, apache keeps a pool of idle threads in most situations, also configurable. (A small detail, but functionally the same: mpm_worker runs several processes, each of which is multi-threaded).
Nginx/Lighttpd: These are lightweight event-based servers which use select()/epoll()/poll() to multiplex a number of sockets without needing multiple threads or processes. Through very careful coding and use of non-blocking APIs, they can scale to thousands of simultaneous requests on commodity hardware, provided available bandwidth and correctly configured file-descriptor limits. The caveat is that implementing traditional embedded scripting languages is almost impossible within the server context, this would negate most of the benefits. Both support FastCGI however for external scripting languages.
2. Depends on the language, or in some languages, on which deployment model you use. Some server configurations only allow certain deployment models.
Apache mod_php, mod_perl, mod_python: These modules run a separate interpreter for each apache worker. Most of these cannot work with mpm_worker very well (due to various issues with threadsafety in client code), thus they are mostly limited to forking models. That means that for each apache process, you have a php/perl/python interpreter running inside. This severely increases memory footprint: if a given apache worker would normally take about 4MB of memory on your system, one with PHP may take 15mb and one with Python may take 20-40MB for an average application. Some of this will be shared memory between processes, but in general, these models are very difficult to scale very large.
Apache (supported configurations), Lighttpd, CGI: This is mostly a dying-off method of hosting. The issue with CGI is that not only do you fork a new process for handling requests, you do so for -every- request, not just when you need to increase load. With the dynamic languages of today having a rather large startup time, this creates not only a lot of work for your webserver, but significantly increases page load time. A small perl script might be fine to run as CGI, but a large python, ruby, or java application is rather unwieldy. In the case of Java, you might be waiting a second or more just for app startup, only to have to do it all again on the next request.
All web servers, FastCGI/SCGI/AJP: This is the 'external' hosting model of running dynamic languages. There are a whole list of interesting variations, but the gist is that your application listens on some sort of socket, and the web server handles an HTTP request, then sends it via another protocol to the socket, only for dynamic pages (static pages are usually handled directly by the webserver).
This confers many advantages, because you will need less dynamic workers than you need the ability to handle connections. If for every 100 requests, half are for static files such as images, CSS, etc, and furthermore if most dynamic requests are short, you might get by with 20 dynamic workers handling 100 simultaneous clients. That is, since the normal use of a given webserver keep-alive connection is 80% idle, your dynamic interpreters can be handling requests from other clients. This is much better than the mod_php/python/perl approach, where when your user is loading a CSS file or not loading anything at all, your interpreter sits there using memory and not doing any work.
Apache mod_wsgi: This specifically applies to hosting python, but it takes some of the advantages of webserver-hosted apps (easy configuration) and external hosting (process multiplexing). When you run it in daemon mode, mod_wsgi only delegates requests to your daemon workers when needed, and thus 4 daemons might be able to handle 100 simultaneous users (depends on your site and its workload)
Phusion Passenger: Passenger is an apache hosting system that is mostly for hosting ruby apps, and like mod_wsgi provides advantages of both external and webserver-managed hosting.
3. Again, I will split the question based on hosting models for where this is applicable.
mod_php, mod_python, mod_perl: Only the C libraries of your application will generally be shared at all between apache workers. This is because apache forks first, then loads up your dynamic code (which due to subtleties, is mostly not able to use shared pages). Interpreters do not communicate with each other within this model. No global variables are generally shared. In the case of mod_python, you can have globals stay between requests within a process, but not across processes. This can lead to some very weird behaviours (browsers rarely keep the same connection forever, and most open several to a given website) so be very careful with how you use globals. Use something like memcached or a database or files for things like session storage and other cache bits that need to be shared.
FastCGI/SCGI/AJP/Proxied HTTP: Because your application is essentially a server in and of itself, this depends on the language the server is written in (usually the same language as your code, but not always) and various factors. For example, most Java deployment use a thread-per-request. Python and its "flup" FastCGI library can run in either prefork or threaded mode, but since Python and its GIL are limiting, you will likely get the best performance from prefork.
mod_wsgi/passenger: mod_wsgi in server mode can be configured how it handles things, but I would recommend you give it a fixed number of processes. You want to keep your python code in memory, spun up and ready to go. This is the best approach to keeping latency predictable and low.
In almost all models mentioned above, the lifetime of a process/thread is longer than a single request. Most setups follow some variation on the apache model: Keep some spare workers around, spawn up more when needed, reap when there are too many, based on a few configurable limits. Most of these setups -do not- destroy a process after a request, though some may clear out the application code (such as in the case of PHP fastcgi).
4. If you say "the web server can only handle 100 requests" it depends on whether you mean the actual webserver itself or the dynamic portion of the webserver. There is also a difference between actual and functional limits.
In the case of Apache for example, you will configure a maximum number of workers (connections). If this number of connections was 100 and was reached, no more connections will be accepted by apache until someone disconnects. With keep-alive enabled, those 100 connections may stay open for a long time, much longer than a single request, and those other 900 people waiting on requests will probably time out.
If you do have limits high enough, you can accept all those users. Even with the most lightweight apache however, the cost is about 2-3mb per worker, so with apache alone you might be talking 3gb+ of memory just to handle the connections, not to mention other possibly limited OS resources like process ids, file descriptors, and buffers, and this is before considering your application code.
For lighttpd/Nginx, they can handle a large number of connections (thousands) in a tiny memory footprint, often just a few megs per thousand connections (depends on factors like buffers and how async IO apis are set up). If we go on the assumption most your connections are keep-alive and 80% (or more) idle, this is very good, as you are not wasting dynamic process time or a whole lot of memory.
In any external hosted model (mod_wsgi/fastcgi/ajp/proxied http), say you only have 10 workers and 1000 users make a request, your webserver will queue up the requests to your dynamic workers. This is ideal: if your requests return quickly you can keep handling a much larger user load without needing more workers. Usually the premium is memory or DB connections, and by queueing you can serve a lot more users with the same resources, rather than denying some users.
Be careful: say you have one page which builds a report or does a search and takes several seconds, and a whole lot of users tie up workers with this: someone wanting to load your front page may be queued for a few seconds while all those long-running requests complete. Alternatives are using a separate pool of workers to handle URLs to your reporting app section, or doing reporting separately (like in a background job) and then polling its completion later. Lots of options there, but require you to put some thought into your application.
5. Most people using apache who need to handle a lot of simultaneous users, for reasons of high memory footprint, turn keep-alive off. Or Apache with keep-alive turned on, with a short keep-alive time limit, say 10 seconds (so you can get your front page and images/CSS in a single page load). If you truly need to scale to 1000 connections or more and want keep-alive, you will want to look at Nginx/lighttpd and other lightweight event-based servers.
It might be noted that if you do want apache (for configuration ease of use, or need to host certain setups) you can put Nginx in front of apache, using HTTP proxying. This will allow Nginx to handle keep-alive connections (and, preferably, static files) and apache to handle only the grunt work. Nginx also happens to be better than apache at writing logfiles, interestingly. For a production deployment, we have been very happy with nginx in front of apache(with mod_wsgi in this instance). The apache does not do any access logging, nor does it handle static files, allowing us to disable a large number of the modules inside apache to keep it small footprint.
I've mostly answered this already, but no, if you have a long connection it doesn't have to have any bearing on how long the interpreter runs (as long as you are using external hosted application, which by now should be clear is vastly superior). So if you want to use comet, and a long keep-alive (which is usually a good thing, if you can handle it) consider the nginx.
Bonus FastCGI Question You mention that fastcgi can multiplex within a single connection. This is supported by the protocol indeed (I believe the concept is known as "channels"), so that in theory a single socket can handle lots of connections. However, it is not a required feature of fastcgi implementors, and in actuality I do not believe there is a single server which uses this. Most fastcgi responders don't use this feature either, because implementing this is very difficult. Most webservers will make only one request across a given fastcgi socket at a time, then make the next across that socket. So you often just have one fastcgi socket per process/thread.
Whether your fastcgi application uses processing or threading (and whether you implement it via a "master" process accepting connections and delegating or just lots of processes each doing their own thing) is up to you; and varies based on capabilities of your programming language and OS too. In most cases, whatever is the default the library uses should be fine, but be prepared to do some benchmarking and tuning of parameters.
As to shared state, I recommend you pretend that any traditional uses of in-process shared state do not exist: even if they may work now, you may have to split your dynamic workers across multiple machines later. For state like shopping carts, etc; the db may be the best option, session-login info can be kept in securecookies, and for temporary state something akin to memcached is pretty neat. The less you have reliant on features that share data (the "shared-nothing" approach) the bigger you can scale in the future.
Postscript: I have written and deployed a whole lot of dynamic applications in the whole scope of setups above: all of the webservers listed above, and everything in the range of PHP/Python/Ruby/Java. I have extensively tested (using both benchmarking and real-world observation) the methods, and the results are sometimes surprising: less is often more. Once you've moved away from hosting your code in the webserver process, You often can get away with a very small number of FastCGI/Mongrel/mod_wsgi/etc workers. It depends on how much time your application stays in the DB, but it's very often the case that more processes than 2*number of CPU's will not actually gain you anything.

How does my web server handle such 100 simultaneous requests? Does web server generate one process/thread for each request? (if yes, process or thread?)
It varies. Apache has both threads and processes for handling requests. Apache starts several concurrent processes, each one of which can run any number of concurrent threads. You must configure Apache to control how this actually plays out for each request.
How does the interpreter of the backend language do? How will it handle the request and generate the proper html? Will the interpreter generate a process/thread for each request?(if yes, process or thread?)
This varies with your Apache configuration and your language. For Python one typical approach is to have daemon processes running in the background. Each Apache process owns a daemon process. This is done with the mod_wsgi module. It can be configured to work several different ways.
If the interpreter will generate a process/thread for each request, how about these processes(threads)? Will they share some code space? Will they communicate with each other? How to handle the global variables in the backend codes? Or they are independent processes(threads)? How long is the duration of the process/thread? Will they be destroyed when the request is handled and the response is returned?
Threads share the same code. By definition.
Processes will share the same code because that's the way Apache works.
They do not -- intentionally -- communicate with each other. Your code doesn't have a way to easily determine what else is going on. This is by design. You can't tell which process you're running in, and can't tell what other threads are running in this process space.
The processes are long-running. They do not (and should not) be created dynamically. You configure Apache to fork several concurrent copies of itself when it starts to avoid the overhead of process creation.
Thread creation has much less overhead. How Apaches handles threads internally doesn't much matter. You can, however, think of Apache as starting a thread per request.
Suppose the web server can only support 100 simultaneous requests, but now it got 1000 simultaneous requests. How does it handle such situation? Will it handle them like a queue and handle the request when the server is available? Or other approaches?
This is the "scalability" question. In short -- how will performance degrade as the load increases. The general answer is that the server gets slower. For some load level (let's say 100 concurrent requests) there are enough processes available that they all run respectably fast. At some load level (say 101 concurrent requests) it starts to get slower. At some other load level (who knows how many requests) it gets so slow you're unhappy with the speed.
There is an internal queue (as part of the way TCP/IP works, generally) but there's no governor that limits the workload to 100 concurrent requests. If you get more requests, more threads are created (not more processes) and things run more slowly.

To begin with, requiring detailed answers to all your points is a bit much, IMHO.
Anyway, a few short answers about your questions:
#1
It depends on the architecture of the server. Apache is a multi-process, and optionally also, multi-threaded server. There is a master process which listens on the network port, and manages a pool of worker processes (where in the case of the "worker" mpm each worker process has multiple threads). When a request comes in, it is forwarded to one of the idle workers. The master manages the size of the worker pool by launching and terminating workers depending on the load and the configuration settings.
Now, lighthttpd and nginx are different; they are so-called event-based architectures, where multiple network connections are multiplexed onto one or more worker processes/threads by using the OS support for event multiplexing such as the classic select()/poll() in POSIX, or more scalable but unfortunately OS-specific mechanisms such as epoll in Linux. The advantage of this is that each additional network connection needs only maybe a few hundred bytes of memory, allowing these servers to keep open tens of thousands of connections, which would generally be prohibitive for a request-per-process/thread architecture such as apache. However, these event-based servers can still use multiple processes or threads in order to utilize multiple CPU cores, and also in order to execute blocking system calls in parallel such as normal POSIX file I/O.
For more info, see the somewhat dated C10k page by Dan Kegel.
#2
Again, it depends. For classic CGI, a new process is launched for every request. For mod_php or mod_python with apache, the interpreter is embedded into the apache processes, and hence there is no need to launch a new process or thread. However, this also means that each apache process requires quite a lot of memory, and in combination with the issues I explained above for #1, limits scalability.
In order to avoid this, it's possible to have a separate pool of heavyweight processes running the interpreters, and the frontend web servers proxy to the backends when dynamic content needs to be generated. This is essentially the approach taken by FastCGI and mod_wsgi (although they use custom protocols and not HTTP so perhaps technically it's not proxying). This is also typically the approach chosen when using the event-based servers, as the code for generating the dynamic content seldom is re-entrant which it would need to be in order to work properly in an event-based environment. Same goes for multi-threaded approaches as well if the dynamic content code is not thread-safe; one can have, say, frontend apache server with the threaded worker mpm proxying to backend apache servers running PHP code with the single-threaded prefork mpm.
#3
Depending on at which level you're asking, they will share some memory via the OS caching mechanism, yes. But generally, from a programmer perspective, they are independent. Note that this independence is not per se a bad thing, as it enables straightforward horizontal scaling to multiple machines. But alas, some amount of communication is often necessary. One simple approach is to communicate via the database, assuming that one is needed for other reasons, as it usually is. Another approach is to use some dedicated distributed memory caching system such as memcached.
#4
Depends. They might be queued, or the server might reply with some suitable error code, such as HTTP 503, or the server might just refuse the connection in the first place. Typically, all of the above can occur depending on how loaded the server is.
#5
The viability of this approach depends on the server architecture (see my answer to #1). For an event-based server, keeping connections open is no big issue, but for apache it certainly is due to the large amount of memory required for every connection. And yes, this certainly requires a long-running interpreter process, but as described above, except for classic CGI this is pretty much granted.

Web servers are multi-threaded environment; besides using application scoped variables, a user request doesn't interact with other threads.
So:
Yes, a new thread will be created for every user
Yes, HTML will be processed for every request
You'll need to use application scoped variables
If you get more requests than you can deal, they will be put on queue. If they got served before configured timeout period, user will get his response, or a "server busy" like error.
Comet isn't specific for any server/language. You can achieve same result by quering your server every n seconds, without dealing with other nasty threads issues.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse