Streaming data between akka cluster nodes - scala

I need to stream payloads ranging from a couple hundred KB to many MB between Akka cluster nodes. The simplest approach would be to split the data into chunked messages, but that appears to be inadvisable because it might interfere with the cluster's housekeeping chatter.
Alternatively, I could use messages to communicate one time urls and use http.
However, I'd prefer a persistent connection approach, so I was thinking using zeromq and chunked messages.
But rather than rolling my own approach, I'd like to use an existing way of accomplishing this but I have not found one.
One more requirement: most of the time the stream is consumed by sending it straight out via Play, so an approach that creates an iteratee that could be used to proxy the stream to HTTP would be preferable.

Akka Streams 2.5.12 has StreamRefs for what I believe to be your use case.

Iteratees can't communicate across machine boundaries, so iteratees alone are probably not the tool you are looking for.
I would pursue one of the following approaches:
Using remote rpc akka Actors to send chunks of your data across the wire. Actors can be used to create iteratees and enumerators on either side (Enumerator.unicast and Iteratee.foreach) of the wire so that the fact that you are using Actors is just an implementation detail and not visible in your interface of these streams.
Use Akka Streams. This library has support for TCP connections, and while it is a different streaming library from iteratees, I have found it to be more robust in the stream operations it supports. It looks like Play is moving towards a tighter integration with Akka Streams, as they are looking at replacing their Netty HTTP backend with Akka HTTP.
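The chunked-message idea from the question can be sketched independently of the transport. The snippet below is a minimal, language-agnostic illustration written in Python, not Akka code: the `Chunk` type, its field names, and the 64 KB default chunk size are all assumptions made for the example. Each chunk carries a sequence number and a "last" flag so the receiving side can reassemble the payload even if chunks arrive out of order.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Chunk:
    """One message on the wire (hypothetical shape, not an Akka type)."""
    stream_id: str
    seq: int
    last: bool
    data: bytes


def split_into_chunks(stream_id: str, payload: bytes,
                      chunk_size: int = 64 * 1024) -> Iterator[Chunk]:
    """Yield the payload as numbered chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Ceiling division; an empty payload still produces one (empty) chunk.
    total = max(1, -(-len(payload) // chunk_size))
    for seq in range(total):
        start = seq * chunk_size
        yield Chunk(stream_id, seq, seq == total - 1,
                    payload[start:start + chunk_size])


def reassemble(chunks: Iterable[Chunk]) -> bytes:
    """Rebuild the payload; chunks may arrive in any order."""
    parts = {}
    last_seq = None
    for chunk in chunks:
        parts[chunk.seq] = chunk.data
        if chunk.last:
            last_seq = chunk.seq
    # Every sequence number from 0 to the last one must be present.
    if last_seq is None or set(parts) != set(range(last_seq + 1)):
        raise ValueError("incomplete stream")
    return b"".join(parts[i] for i in range(last_seq + 1))
```

In a real deployment the sender would also need to wait for acknowledgements or demand from the receiver before sending more chunks, which is the part that Akka Streams or StreamRefs would handle for you.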

Related

can vert.x event bus replace the need for Kafka?

I am evaluating the vert.x framework to see if I can reduce the Kafka based communications between my microservices developed using spring boot.
The question is:
Can I replace
Kafka with vert.x event bus and
spring boot microservices with vert.x based verticles
To answer quickly, I would say it depends on your needs.
Yes, the eventbus can be a good way to handle natively communication between microservices verticles using an asynchronous and non-blocking paradigm.
But in some cases you could need:
to handle some common enterprise patterns like replay mechanisms, persistence of messages, transactional reading
to be able to process some kinds of messages in chronological order
to handle communication between multiple kinds of microservices that aren't all written with the same framework/toolkit or even programming language
to handle reliability, resilience and failure recovery when all your consumers/microservices/verticles have died
to handle dynamic horizontal scalability and monitoring of your consumers/microservices/verticles
to be able to work with a single cluster deployed in multi-datacenters and multi-regions
In those cases I'd prefer to choose Apache Kafka over the native eventbus or an old-fashioned JMS-compliant system.
It's not forbidden to use both the eventbus and Kafka in the same microservices architecture, according to your real needs. For example, you could have one Kafka consumer group reading a Kafka topic to handle scaling, monitoring, failure recovery and the replay mechanism, and then handle communication between your sub-verticles through the eventbus.
I'll clarify the scalability and monitoring part a little and explain why I think it's simpler to handle that with Kafka than with the native eventbus and vert.x cluster mode: Kafka allows us to know in real time (through JMX metrics and the describe command):
the "lag" of a topic, which corresponds to the number of unread messages
the number of consumers in each group that are listening to a topic
the number of partitions of a topic assigned to each consumer
i/o metrics
So it's possible to use an ElasticStack or Prometheus+Grafana solution to monitor those metrics and use them to handle a dynamic scalability (when you know that there's a need to increase temporarily the number of consumers for example according to the lag metric and the number of partitions and the cpu/ram/swap metrics of your hosts).
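As a rough illustration of the lag-based scaling decision described above, here is a small Python sketch. It does not call Kafka at all: the offset dictionaries stand in for whatever your monitoring stack reports, and the function names and the `max_lag_per_consumer` threshold are hypothetical. Lag per partition is taken as the log end offset minus the group's committed offset, and the consumer count is capped at the number of partitions because extra consumers in a group would sit idle.

```python
import math
from typing import Dict


def partition_lag(end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset - committed offset.

    Assumption for this sketch: a partition without a committed
    offset is treated as if nothing had been read yet.
    """
    return {
        partition: max(0, end - committed_offsets.get(partition, 0))
        for partition, end in end_offsets.items()
    }


def desired_consumers(lag_by_partition: Dict[int, int],
                      max_lag_per_consumer: int) -> int:
    """Suggest a consumer count for the group from the total lag."""
    if max_lag_per_consumer <= 0:
        raise ValueError("max_lag_per_consumer must be positive")
    total_lag = sum(lag_by_partition.values())
    needed = max(1, math.ceil(total_lag / max_lag_per_consumer))
    # More consumers than partitions would not increase parallelism.
    return min(len(lag_by_partition), needed)


end = {0: 1500, 1: 900, 2: 400}
committed = {0: 1000, 1: 900, 2: 100}
lag = partition_lag(end, committed)
print(lag, desired_consumers(lag, max_lag_per_consumer=250))
```

A real autoscaler would combine this with the host CPU/RAM metrics mentioned above and smooth the signal over time rather than reacting to a single reading.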
To answer the second question (vert.x or Spring Boot), my answer will not be very objective, but I'd vote for vert.x for its performance on the JVM and especially for its simplicity. I'm a little tired of the Spring factory and its big layers of abstraction that hide a lot of issues under a mountain of annotations triggering a mountain of AOP.
Moreover, in the Java world of microservices, there are other alternatives to Spring Boot, like the different implementations of MicroProfile (the Thorntail project, for example).
The event-bus is not persistent. You should use it for fast verticle-to-verticle communications, and more generally to dispatch events where you know that you can afford to lose them if there is a crash.
Kafka streams are persistent, and you should send events there because either you want other (possibly non-Vert.x) applications to consume them, and/or because you want to ensure that these events are not being lost in case of failure.
A reactive (read "scalable and fault-tolerant") Vert.x application typically uses a combination of both the event-bus and some replicable messaging systems like AMQP / Kafka / etc.
On the question:
Can I replace spring boot microservices with vert.x based verticles?
Yes, definitely, although the 2 have different programming models.
If you want a more progressive approach and use Spring for structuring your application while using Vert.x for resource efficiency over your I/O and event processing then you can mix them, see https://github.com/vert-x3/vertx-examples/tree/master/spring-examples for examples.
Take a look at the Quarkus framework: in the workshop section you'll find Vert.x and Apache Kafka combined!

Kafka messages over rest api

We currently have a library which we use to interact with Kafka, but we are planning to turn this library into a separate application. Other applications will send Kafka messages through a REST endpoint. We plan to use vert.x in this application to make it non-blocking and fast. Is this a good strategy? My concerns: 1) HTTP will make it slower compared to Kafka's TCP protocol, 2) streaming may not be possible, 3) it becomes a single point of failure.
But as a separate application, release management, control and support will be a lot easier than they are currently.
Is this a good strategy, and has someone done something like this before? Any suggestions?
Your choice between HTTP and TCP will depend on the number of applications that will be talking to your service. Let's say there is an IoT device that is sending lots of messages continuously; then using HTTP will be expensive and will increase latency, since HTTP connection establishment is an expensive operation.
Now, consider the case where you have a transactional system that is sending transaction events as they commit to your database then the rate of messages will be lower I assume, then it makes sense to use HTTP there.
It will depend on the rate of messages that your service will receive, that will decide the way you want to take.
Now, for your current approach of maintaining a library, it is a good way to maintain consistency across the organisation as long as the library is maintained and users of your library constantly update as and when you make changes to your library. It also has the advantage of not maintaining separate infrastructure/servers since your code will run in your users' application.

What happens to messages that come to a server implements stream processing after the source reached its bound?

I'm learning Akka Streams, but obviously this is relevant to any streaming framework :)
quoting akka documentation:
Reactive Streams is just to define a common mechanism of how to move
data across an asynchronous boundary without losses, buffering or
resource exhaustion
Now, from what I understand, before streams (let's take an HTTP server as an example), a request would come in, and while the receiver wasn't finished with it, new incoming requests would be collected in a buffer of waiting requests. The problem is that this buffer has an unknown size, and at some point, if the server is overloaded, we can lose requests that were waiting.
So then stream processing came into play and bounded this buffer to make it controllable, so we can predefine the number of messages (requests in my example) we want to have in line and take care of them one at a time.
My question: if we set up a source in our server that can hold 3 messages at most, what happens to the 4th one when it arrives?
I mean, when another server calls us and we are already taking care of 3 requests, what will happen to its request?
What you're describing is not actually the main problem that Reactive Streams implementations solve.
Backpressure in terms of the number of requests is solved with regular networking tools. For example, in Java you can configure the thread pool of a networking library (for example Netty) to some parallelism level, and the library will take care of accepting as many requests as possible. Or, if you use the synchronous sockets API, it is even simpler: you can postpone calling accept() on the server socket until all of the currently connected clients are served. In either case, there is no "buffer" on either side; until the server accepts a connection, the client will simply be blocked (either inside a system call for blocking APIs, or in an event loop for async APIs).
What Reactive Streams implementations solve is how to handle backpressure inside a higher-level data pipeline. Reactive streams implementations (e.g. akka-streams) provide a way to construct a pipeline of data in which, when the consumer of the data is slow, the producer will slow down automatically as well, and this would work across any kind of underlying transport, be it HTTP, WebSockets, raw TCP connections or even in-process messaging.
For example, consider a simple WebSocket connection, where the client sends a continuous stream of information (e.g. data from some sensor), and the server writes this data to some database. Now suppose that the database on the server side becomes slow for some reason (networking problems, disk overload, whatever). The server now can't keep up with the data the client sends, that is, it cannot save it to the database in time before the new piece of data arrives. If you're using a reactive streams implementation throughout this pipeline, the server will signal to the client automatically that it cannot process more data, and the client will automatically tweak its rate of producing in order not to overload the server.
Naturally, this can be done without any Reactive Streams implementation, e.g. by manually controlling acknowledgements. However, like many other libraries, Reactive Streams implementations solve this problem for you. They also provide an easy way to define such pipelines, and usually they have interfaces for various external systems like databases. In particular, such libraries may implement backpressure on the lowest level, down to the TCP connection, which may be hard to do manually.
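To make the "slow consumer slows the producer" idea concrete, here is a minimal sketch using only a bounded in-process buffer. It is written in Python and is not the Reactive Streams API: a blocking `put` on a full queue merely stands in for the demand signal, and the names `run_pipeline` and `offer_or_reject` and the buffer size of 3 are assumptions chosen to match the question. It also shows the two usual answers to "what happens to the 4th message": the producer either waits, or the message is rejected explicitly.

```python
import queue
import threading
import time


def run_pipeline(items, buffer_size=3, consumer_delay=0.001):
    """Push items through a bounded buffer to a deliberately slow consumer."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream
    stored = []
    max_depth = 0

    def consumer():
        nonlocal max_depth
        while True:
            max_depth = max(max_depth, buffer.qsize())
            item = buffer.get()
            if item is done:
                return
            time.sleep(consumer_delay)  # simulates the slow database write
            stored.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        # Blocks while the buffer is full, so the producer is slowed
        # down to the consumer's pace instead of losing data.
        buffer.put(item)
    buffer.put(done)
    worker.join()
    return stored, max_depth


def offer_or_reject(buffer, item):
    """Alternative overflow strategy: refuse the item instead of waiting."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False
```

With `run_pipeline(range(20))` every item arrives in order and the buffer never holds more than three elements; with `offer_or_reject` on a full buffer of size 3, the fourth offer returns `False` and the caller decides what to do with the rejected item.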
As for Reactive Streams itself, it is just a description of an API which can be implemented by a library. It defines common terms and behavior and allows such libraries to be interchangeable or to interact easily, e.g. you can connect an akka-streams pipeline to a Monix pipeline using the interfaces from the specification, and the combined pipeline will work seamlessly and support all of the backpressure features of Reactive Streams.

Importance of Akka Routers

I have this lingering doubt in my mind about the importance of Akka Routers. I have used Akka Routers in the current project I am working on. However, I am a little confused about their importance. Out of the two methods below, which is more beneficial?
having routers and routees.
Creating as many actors as needed.
I understood that router will assign the incoming messages among its routees based on the strategy. Also, we can have supervisor strategy based on the router.
I have also understood that actors are lightweight and that it is not an overhead to create as many actors as needed. So, we can create an actor for each incoming message and kill it if necessary after the processing is completed.
So I want to understand which one of the above design is better? Or in other words, in which case (1) has advantage over (2) OR vice versa.
Good question. I had similar doubts before I read Akka documentation. Here are the reasons:
Efficiency. From docs:
On the surface routers look like normal actors, but they are actually
implemented differently. Routers are designed to be extremely
efficient at receiving messages and passing them quickly on to
routees.
A normal actor can be used for routing messages, but an actor's
single-threaded processing can become a bottleneck. Routers can
achieve much higher throughput with an optimization to the usual
message-processing pipeline that allows concurrent routing. This is
achieved by embedding routers' routing logic directly in their
ActorRef rather than in the router actor. Messages sent to a router's
ActorRef can be immediately routed to the routee, bypassing the
single-threaded router actor entirely.
The cost to this is, of course, that the internals of routing code are
more complicated than if routers were implemented with normal actors.
Fortunately all of this complexity is invisible to consumers of the
routing API. However, it is something to be aware of when implementing
your own routers.
Default implementation of multiple routing strategies. You can always write your own, but it might get tricky. You have to take into account supervision, recovery, load balancing, remote deployment, etc.
Akka Router patterns will be familiar to Akka users. If you roll out your own custom routing, then everyone will have to spend time understanding all the corner cases and implications (+ testing? :)).
TL;DR If you don't care about efficiency too much and if it's easier for you to spawn new actors then go for it. Otherwise use Routers.
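For readers unfamiliar with what a routing strategy does, here is a tiny round-robin sketch. It is plain Python and not Akka's implementation: it ignores supervision, remote deployment and the ActorRef-level optimization quoted above, and the `RoundRobinRouter` class here is a hypothetical stand-in whose routees are ordinary callables. It only shows how a fixed pool of routees receives messages in turn, as opposed to spawning one worker per message.

```python
from itertools import cycle
from typing import Any, Callable, Sequence


class RoundRobinRouter:
    """Hands each message to the next routee in a fixed pool."""

    def __init__(self, routees: Sequence[Callable[[Any], Any]]):
        if not routees:
            raise ValueError("at least one routee is required")
        self._next_routee = cycle(routees)

    def route(self, message: Any) -> Any:
        routee = next(self._next_routee)
        return routee(message)


def make_worker(worker_id: int) -> Callable[[Any], Any]:
    """Build a routee that tags each message with its own id."""
    def handle(message: Any):
        return (worker_id, message)
    return handle


router = RoundRobinRouter([make_worker(i) for i in range(3)])
print([router.route(m) for m in "abcde"])
```

The pool size bounds how much work runs concurrently, which is one practical difference from creating an unbounded number of per-message actors.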

Storm results visualization

I've spent hours to find the best way to visualize the results of my Storm system. It seems that there is an infinite combination of technologies and I'm getting completely lost.
I want to avoid the use of a database so from what I have understood my system should have the following features:
a queuing message system (such as Redis, Kafka, ActiveMQ,...) that could be connected to my bolts.
a server that establishes a websocket connection with the browser and stream the messages to it.
a javascript library that updates the front end in real-time.
Could you please correct me if I'm saying something wrong regarding the architecture? And I also would appreciate to know which combination of technologies is the best.
As @Lan said, your question is too broad.
For minimal use, I personally use Redis and Storm together, as Redis can be used as a basic queue (beware of persistence problems with Redis, and of clustering if you need it), a shared memory space for Storm bolts/spouts (storing configuration, intermediate results...) and a basic message broker (pub/sub support). It also has very good performance in latency and throughput.
You can then use a "classic" backend to plug Redis topics into websockets, using for instance Node.js with SockJS and a Redis client, but there are many more solutions to this problem in many languages.
For the front part, it should be defined by your server choice (for instance sockjs-client or socket.io with nodejs), as fallback strategies are embedded when websockets are not supported in browsers.
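The shape of that backend, a broker topic fanned out to connected browser sessions, can be sketched without any real infrastructure. The Python snippet below is an in-process stand-in only: the `Hub` class is hypothetical, its `publish` plays the role of a message arriving from a Redis topic, and each `send` callback plays the role of a websocket connection. No Redis or websocket library is used.

```python
from typing import Any, Callable, Dict, List


class Hub:
    """In-process stand-in for 'broker topic -> websocket clients'."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = {}

    def subscribe(self, topic: str,
                  send: Callable[[Any], None]) -> Callable[[], None]:
        """Register a client callback; returns a function to unsubscribe."""
        self._subscribers.setdefault(topic, []).append(send)

        def unsubscribe() -> None:
            self._subscribers[topic].remove(send)

        return unsubscribe

    def publish(self, topic: str, message: Any) -> int:
        """Deliver a message to every client of the topic; returns the count."""
        clients = list(self._subscribers.get(topic, []))
        for send in clients:
            send(message)
        return len(clients)


hub = Hub()
browser_a, browser_b = [], []
leave_a = hub.subscribe("word-counts", browser_a.append)
hub.subscribe("word-counts", browser_b.append)
hub.publish("word-counts", {"storm": 3})
leave_a()  # browser A disconnects
hub.publish("word-counts", {"storm": 4})
print(browser_a, browser_b)
```

In the real setup the bolts would publish to the broker, and the server process would own both the broker subscription and the websocket sessions.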
To conclude, the best architecture is the one that fits your usage, so it depends.
There are so many ways. I recently built a demo using Apache Storm + Kafka. For visualization I used JQuery --> Node.js (for restful web service) --> Redis. This is just one example. There are so many other combinations that you can consider based on your use case.