I initially used Papa Parse with workers enabled to avoid locking the browser's main thread. However, now that I'm using streaming, the main thread doesn't lock even with workers disabled.
I'm confused as to why. When would someone want to use both workers and streaming? And why doesn't the documentation mention that streaming alone can keep the browser from locking up?
As the title states, have people used Kafka Streams' interactive queries for production-grade requests, or is interactive querying more of a debugging feature?
I imagine there may be issues around hot keys if you send too many queries per second at one of them, which you could solve with caching, but I'm curious whether there are more library-specific limitations.
It's used very frequently in production. In fact, tools like Confluent Control Center and ksqlDB rely heavily on the feature to work.
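For reference, here is a minimal sketch of what a local interactive-query lookup looks like; the store name "counts-store" and the key/value types are hypothetical:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class InteractiveQueryExample {
    // Serve a read request from the local state store of a running
    // Streams instance. "counts-store" is a hypothetical store name
    // materialized somewhere in the topology.
    static Long lookup(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType(
                "counts-store", QueryableStoreTypes.keyValueStore()));
        return store.get(key); // null if this instance doesn't hold the key
    }
}
```

In a multi-instance deployment you would also use KafkaStreams#queryMetadataForKey to route each request to the instance that actually hosts the key's partition.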
Yes, "hot keys" are a problem in any Kafka cluster, regardless of whether you use the Streams API. I'm not sure what you mean by caching, though, unless you're referring to RocksDB or the in-memory state store as a cache.
One main limitation is disk utilization for keyspaces with large cardinality, not "hot keys": if the Streams app ever loses its state, it can take hours to rebuild. I've heard of people working around this by writing their own state store implementations backed by something like Redis or Aerospike.
I am wondering why there is no non-blocking support via simple callbacks, Java's CompletableFuture, or Scala Futures in the Kafka Streams API.
I understand that ordering within a partition needs to be maintained, but across partitions I see no reason to achieve ordering by blocking an expensive resource: a thread.
For example: when I run my Kafka Streams app, which calls an external service (e.g. in mapValues), on one server with thousands of partitions, I will probably lock up the machine because all threads are blocked. An API method like mapValuesAsync() would be nice here, wouldn't it?
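Here is a minimal sketch of the blocking pattern I mean (topic names and the external call are just placeholders):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class BlockingEnrichment {
    // Stand-in for a synchronous call to an external service,
    // e.g. a blocking HTTP request; the stream thread waits here.
    static String callExternalService(String value) {
        return value.toUpperCase();
    }

    static Topology topology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")   // placeholder topic
               .mapValues(BlockingEnrichment::callExternalService)
               .to("output-topic");                     // placeholder topic
        return builder.build();
    }
}
```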
Also, imagine a Kafka Streams app that performs several blocking operations in its flow: it would take far fewer partitions per topic to run into the same problem. Wasting threads doesn't look like nice API design here.
Is there any support planned for this? Or am I overlooking something?
Async processing is generally hard in stream processing. It's not just about ordering, but also about fault tolerance, tracking progress, etc.
It's not impossible to support, though, and in fact there is already a design proposal for it: https://cwiki.apache.org/confluence/display/KAFKA/KIP-408%3A+Add+Asynchronous+Processing+To+Kafka+Streams
Feel free to help build this feature!
I have been learning Storm and Samza to understand how stream processing engines work, and I realized that both are standalone applications: to process an event, I first need to add it to a queue that the engine is connected to. That means I need to add the event to a queue (which is also a standalone application, say Kafka), and Storm will pick the event up from the queue and process it in a worker process. If I have multiple bolts, each bolt is processed by a different worker process. (This is one of the things I don't really understand: I have seen a company use more than 20 bolts in production, with each event transferred between bolts along a certain path.)
However, I don't really understand why I would need such complex systems. The process involves too many I/O operations (my program -> queue -> Storm -> bolts) and makes everything much harder to control and debug.
Instead, if I'm collecting the data from web servers, why not just use the same nodes for event processing? The work is already distributed over the nodes by the load balancers I use for the web servers. I can create executors on the same JVM instances and send events from the web server to an executor asynchronously, without any extra I/O requests. I can also watch the executors in the web servers and make sure each event is processed (an at-least-once or exactly-once processing guarantee). This way my application would be much easier to manage, and since little extra I/O is required, it would be faster than the alternative, which involves sending the data to another node over the (unreliable) network and processing it there.
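A rough sketch of what I have in mind (all names are mine, purely illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InProcessPipeline {
    // A bounded pool living in the same JVM as the web server.
    private final ExecutorService executor = Executors.newFixedThreadPool(8);

    // Called from a request handler: hand the event off and return immediately.
    public void submit(String event) {
        executor.submit(() -> process(event));
    }

    private void process(String event) {
        // The CPU- or I/O-bound work happens here, on the same node.
        System.out.println("processed: " + event);
    }
}
```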
Most probably I'm missing something here, because I know many companies actively use Storm, and many people I know recommend Storm or other stream processing engines for real-time event processing, but I just don't understand it.
My understanding is that the goal of using a framework like Storm is to offload the heavy processing (whether CPU-bound, I/O-bound, or both) from the application/web servers and keep them responsive.
Consider that each application server may have to serve a large number of concurrent requests, not all of them related to stream processing. If the app server is already processing a significant load of events, it could become a bottleneck for lighter requests, since server resources (CPU usage, memory, disk contention, etc.) would already be tied up by the heavier processing requests.
If the actual load you need to face isn't that heavy, or if it can simply be handled by adding app server instances, then of course it doesn't make sense to complicate your architecture/topology, which could in fact slow the whole thing down. It really depends on your performance and load requirements, as well as on how much (virtual) hardware you can throw at the problem. As usual, benchmarking based on your load requirements will help you decide which way to go.
You are right to consider that sending data across the network will add to the total processing time.
However, these frameworks (Storm, Spark, Samza, Flink) were created to process lots of data that potentially does not fit in the memory of one machine. So, by using more than one machine to process the data, we can achieve parallelism.
And regarding your question about network latency: yes, it is a trade-off to consider. Developers have to know that they are implementing programs to deploy on a parallel framework, and the way they build the application will also influence how much data is transferred over the network.
Reading the documentation about the Play Framework and ReactiveMongo leads me to believe that ReactiveMongo works in such a way that it uses few threads and never blocks.
However, it seems that the communication from the Play application to the Mongo server would have to happen on some thread somewhere. How is this implemented? Links to the source code for Play, ReactiveMongo, Akka, etc. would also be very appreciated.
The Play Framework includes some documentation about this on this page about thread pools. It starts off:
Play framework is, from the bottom up, an asynchronous web framework. Streams are handled asynchronously using iteratees. Thread pools in Play are tuned to use fewer threads than in traditional web frameworks, since IO in play-core never blocks.
It then talks a little bit about ReactiveMongo:
The most common place that a typical Play application will block is when it’s talking to a database. Unfortunately, none of the major databases provide asynchronous database drivers for the JVM, so for most databases, your only option is to use blocking IO. A notable exception to this is ReactiveMongo, a driver for MongoDB that uses Play’s Iteratee library to talk to MongoDB.
Following is a note about using Futures:
Note that you may be tempted to therefore wrap your blocking code in Futures. This does not make it non blocking, it just means the blocking will happen in a different thread. You still need to make sure that the thread pool that you are using there has enough threads to handle the blocking.
There is a similar note in the Play documentation on the page Handling Asynchronous Results:
You can’t magically turn synchronous IO into asynchronous by wrapping it in a Future. If you can’t change the application’s architecture to avoid blocking operations, at some point that operation will have to be executed, and that thread is going to block. So in addition to enclosing the operation in a Future, it’s necessary to configure it to run in a separate execution context that has been configured with enough threads to deal with the expected concurrency.
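To make that note concrete, the pattern it warns about looks roughly like this in Java (the pool size and the blocking call are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingInFuture {
    // A dedicated pool sized for the expected number of concurrent
    // blocking calls, so the default pool stays free for async work.
    private static final ExecutorService dbExecutor = Executors.newFixedThreadPool(20);

    static CompletableFuture<String> queryAsync() {
        // The call still blocks, but it blocks a thread from this
        // dedicated pool rather than from the default pool.
        return CompletableFuture.supplyAsync(BlockingInFuture::blockingQuery, dbExecutor);
    }

    static String blockingQuery() {
        return "row"; // stand-in for a blocking database call
    }
}
```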
The documentation seems to be saying that ReactiveMongo is non-blocking, so you don't have to worry about it eating up a lot of the threads in your thread pool. But ReactiveMongo has to communicate with the Mongo server somewhere.
How is this communication implemented so that Mongo doesn't use up threads from Play's default thread pool?
Once again, links to the specific files in Play, ReactiveMongo, Akka, etc., would be very much appreciated.
Yes, indeed, you still need to use threads to perform any kind of work, including communication with the database. What's important is how exactly this communication happens.
ReactiveMongo "does not use threads" in the sense that it does not use blocking I/O. The usual Java I/O facilities, like java.io.InputStream, are blocking: reading from such an InputStream, or writing to an OutputStream, blocks the thread until the "other side" provides the required data or is ready to accept it. For network communication, this means threads will sit blocked.
However, Java provides the NIO API, which supports non-blocking and asynchronous I/O. I don't want to get into its details right now, but the basic idea, naturally, is that non-blocking I/O allows threads that need to exchange data with the outside world not to block: for example, such a thread can poll the data source to check whether data is available, and if there is none, it returns to the thread pool and can be used for other tasks. Of course, underneath, these facilities are provided by the underlying OS.
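A bare-bones sketch of the non-blocking style (the address is a placeholder and error handling is omitted):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class NonBlockingReadSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);                // no I/O call below will block
        channel.connect(new InetSocketAddress("localhost", 27017)); // placeholder
        channel.register(selector, SelectionKey.OP_CONNECT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        // One thread multiplexes many channels: select() wakes up only when
        // some channel is ready, so no thread sits blocked on one connection.
        while (selector.select() > 0) {
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isConnectable() && channel.finishConnect()) {
                    key.interestOps(SelectionKey.OP_READ);   // now wait for data
                } else if (key.isReadable()) {
                    channel.read(buffer);  // returns immediately with whatever arrived
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```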
The exact implementation details of non-blocking I/O are usually hidden inside high-level libraries like Netty, because the raw API is not at all nice to use. Netty (which is exactly the library ReactiveMongo uses) provides a nice asynchronous, callback-style API that is easy to use, yet powerful and expressive enough to build complex I/O-heavy applications with high throughput.
So, ReactiveMongo uses Netty to talk to the MongoDB server, and because Netty is an implementation of asynchronous network I/O, ReactiveMongo really does not need to block threads for long.
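To give a feel for that callback style, here is a minimal Netty connect sketch; the address is a placeholder, and this is not what ReactiveMongo does internally, just the general shape of Netty's API:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.NioSocketChannel;
import io.netty.channel.socket.SocketChannel;

public class NettyConnectSketch {
    public static void main(String[] args) {
        NioEventLoopGroup group = new NioEventLoopGroup(); // small shared pool of I/O threads
        Bootstrap bootstrap = new Bootstrap()
            .group(group)
            .channel(NioSocketChannel.class)
            .handler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                    // protocol handlers would be added to ch.pipeline() here
                }
            });

        // connect() returns immediately; the listener fires when it completes.
        ChannelFuture future = bootstrap.connect("localhost", 27017); // placeholder
        future.addListener((ChannelFutureListener) f -> {
            if (f.isSuccess()) {
                System.out.println("connected, and no caller thread was blocked");
            }
            group.shutdownGracefully();
        });
    }
}
```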
I'm building a small web server for learning purposes.
For each incoming POST request I'm planning to append the content to a file.
I'm using ZeroMQ sockets for communicating with the file-append process. Do I need to take special care with the file operations (fopen, fseek)?
Considering a typical Amazon EC2 instance, and given that each request is at most 1 KB, how many file-append operations per second can my server handle?
Thanks!
The basic concerns apply: what happens if multiple processes are running and receiving messages? What happens if you run out of disk space, or a write fails?
Are you after synchronous writes to disk, or is buffered writing, with the potential for log corruption, acceptable? fopen and friends are buffered; consider open and friends for unbuffered writes.
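To illustrate the buffered-versus-synchronous distinction, here is a sketch (in Java for concreteness; the same idea applies to fopen/fwrite versus write plus fsync):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class AppendModes {
    public static void main(String[] args) throws IOException {
        byte[] record = "one ~1 KB request body...\n".getBytes(StandardCharsets.UTF_8);

        // Buffered append (analogous to fopen/fwrite): fast, but the tail of
        // the log can be lost or truncated mid-record if the process dies.
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("requests.log", true))) {
            out.write(record);
        }

        // Synchronous append: sync() forces the data to the device before
        // returning (like fsync(2)), durable but much slower per write.
        try (FileOutputStream out = new FileOutputStream("requests.log", true)) {
            out.write(record);
            out.getFD().sync();
        }
    }
}
```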
Performance is tied to whether you can batch writes, use buffering, or need synchronous writes to disk. I think Amazon provides some IOPS figures, and other developers have certainly published results:
http://www.thebitsource.com/featured-posts/rackspace-cloud-servers-versus-amazon-ec2-performance-analysis/
http://blog.dt.org/index.php/2010/06/amazon-ec2-io-performance-local-emphemeral-disks-vs-raid0-striped-ebs-volumes/
https://forums.aws.amazon.com/thread.jspa?messageID=132387