Vertx - multiple vs single verticle for event processing

Vertx - multiple vs single verticle for event processing - vert.x

Scenario
I receive a message in a specific "address" in Vertx eventbus - the message can be of four types. The handler should process the message and send the result to another eventbus "address", its handler posts it to an external-service api.
Problem
How to design the Verticle for this? I have described two approaches below - which one is efficient, faster and is able to scale well, considering this will be deployed in Kubernetes. How about worker verticles? Any other effective approach I am missing?
The approaches
Write a verticle for each type, with an eventbus consumer consuming and processing this type. Send the processed data to the "external-service-call" address.
Write only one verticle - the eventbus handler can decide and invoke appropriate method based on the type of message, finally publish it to an the "external-service-call" address.
To my understanding, I can scale the second approach by deploying multiple instances of that verticle. By scaling I mean this can accept and process much volume concurrently? How about the first approach?
Other approach you think I should know?

First approach is slightly more preferable for two reasons:
Doing less checks => Less CPU time => more concurrency
Less code in each verticle => easier to maintain
Having said that, that's not something that should concern you. Your external-service-call will be by order of magnitude slower than any micro-optimisation on EventBus.

Approach 1:
a. More cleaner. eventbus is designed to for communicating between the verticles of same vertx instance or within cluster verticles.
b. Performance - Since verticles are on the same host, so no network IO and not much performance impact.
Approach 2:
a. Less cleaner but will gain very small amount of perf improvement which is negligible compared to the external service calls
b. Not good from maintainability point of view.
I would probably choose approach 1.

Related

What is the most efficient way to implement a complicated reactive program in vertx? Should I depend on event bus?

I want to implement a complicated reactive program in vertx, which contains multiple blocking operation steps. There seems several ways to implement it AFAIK, there may be other ways as well, what is the most efficient way in terms of throughout and response time, in a multi-core computer?
Separate each operation step in different verticles, and use event bus to communicate with these verticles.
Make all operations in one verticle, chain all operations with Future composition
Make all operations in one verticle, chain all operations with RxJava 2
According to Vertx core document, "There is a single event bus instance for every Vert.x instance and it is obtained using the method eventBus", the 1st way seems less efficient than others because the data transmission between verticles is in a single event bus thread, while for others, multiple instances of the verticle can be created so that more cores are used as event loop thread. Do I understand correctly?

Is it possible to combine REST and messaging for microservices?

We have the first version of an application based on a microservice architecture. We used REST for external and internal communication.
Now we want to switch to AP from CP (CAP theorem)* and use a message bus for communication between microservices.
There is a lot of information about how to create an event bus based on Kafka, RabbitMQ, etc.
But I can't find any best practices for a combination of REST and messaging.
For example, you create a car service and you need to add different car components. It would make more sense, for this purpose, to use REST with POST requests. On the other hand, a service for booking a car would be a good task for an event-based approach.
Do you have a similar approach when you have a different dictionary and business logic capabilities? How do you combine them? Just support both approaches separately? Or unify them in one approach?
* for the first version, we agreed to choose consistency and partition tolerance. But now availability becomes more important for us.

Bottom line up front: You're looking for Command Query Responsibility Segregation; which defines an architectural pattern for breaking up responsibilities from querying for data to asking for a process to be run. The short answer is you do not want to mix the two in either a query or a process in a blocking fashion. The rest of this answer will go into detail as to why, and the three different ways you can do what you're trying to do.
This answer is a short form of the experience I have with Microservices. My bona fides: I've created Microservices topologies from scratch (and nearly zero knowledge) and as they say hit every branch on the way down.
One of the benefits of starting from zero-knowledge is that the first topology I created used a mixture of intra-service synchronous and blocking (HTTP) communication (to retrieve data needed for an operation from the service that held it), and message queues + asynchronous events to run operations (for Commands).
I'll define both terms:
Commands: Telling a service to do something. For instance, "Run ETL Batch job". You expect there to be an output from this; but it is necessarily a process that you're not going to be able to reliably wait on. A command has side-effects. Something will change because of this action (If nothing happens and nothing changes, then you haven't done anything).
Query: Asking a service for data that it holds. This data may have been there because of a Command given, but asking for data should not have side effects. No Command operations should need to be run because of a Query received.
Anyway, back to the topology.
Level 1: Mixed HTTP and Events
For this first topology, we mixed Synchronous Queries with Asynchronous Events being emitted. This was... problematic.
Message Buses are by their nature observable. One setting in RabbitMQ, or an Event Source, and you can observe all events in the system. This has some good side-effects, in that when something happens in the process you can typically figure out what events led to that state (if you follow an event-driven paradigm + state machines).
HTTP Calls are not observable without inspecting network traffic or logging those requests (which itself has problems, so we're going to start with "not feasible" in normal operations). Therefore if you mix a message based process and HTTP calls, you're going to have holes where you can't tell what's going on. You'll have spots where due to a network error your HTTP call didn't return data, and your services didn't continue the process because of that. You'll also need to hook up Retry/Circuit Breaker patterns for your HTTP calls to ensure they at least try a few times, but then you have to differentiate between "Not up because it's down", and "Not up because it's momentarily busy".
In short, mixing the two methods for a Command Driven process is not very resilient.
Level 2: Events define RPC/Internal Request/Response for data; Queries are External
In step two of this maturity model, you separate out Commands and Queries. Commands should use an event driven system, and queries should happen through HTTP. If you need the results of a query for a Command, then you issue a message and use a Request/Response pattern over your message bus.
This has benefits and problems too.
Benefits-wise your entire Command is now observable, even as it hops through multiple services. You can also replay processes in the system by rerunning events, which can be useful in tracking down problems.
Problems-wise now some of your events look a lot like queries; and you're now recreating the beautiful HTTP and REST semantics available in HTTP for messages; and that's not terribly fun or useful. As an example, a 404 tells you there's no data in REST. For a message based event, you have to recreate those semantics (There's a good Youtube conference talk on the subject I can't find but a team tried to do just that with great pain).
However, your events are now asynchronous and non-blocking, and every service can be refactored to a state-machine that will respond to a given event. Some caveats are those events should contain all the data needed for the operation (which leads to messages growing over the course of a process).
Your queries can still use HTTP for external communication; but for internal command/processes, you'd use the message bus.
I don't recommend this approach either (though it's a step up from the first approach). I don't recommend it because of the impurity your events start to take on, and in a microservices system having contracts be the same throughout the system is important.
Level 3: Producers of Data emit data as events. Consumers Record data for their use.
The third step in the maturity model (and we were on our way to that paradigm when I departed from the project) is for services that produce data to issue events when that data is produced. That data is then jotted down by services listening for those events, and those services will use that (could be?) stale data to conduct their operations. External customers still use HTTP; but internally you emit events when new data is produced, and each service that cares about that data will store it to use when it needs to. This is the crux of Michael Bryzek's talk Designing Microservices Architecture the Right way. Michael Bryzek is the CTO of Flow.io, a white-label e-commerce company.
If you want a deeper answer along with other issues at play, I'll point you to my blog post on the subject.

Kafka Streams and RPC: is calling REST service in map() operator considered an anti-pattern?

The naive approach for implementing the use case of enriching an incoming stream of events stored in Kafka with reference data - is by calling in map() operator an external service REST API that provides this reference data, for each incoming event.
eventStream.map((key, event) -> /* query the external service here, then return the enriched event */)
Another approach is to have second events stream with reference data and store it in KTable that will be a lightweight embedded "database" then join main event stream with it.
KStream<String, Object> eventStream = builder.stream(..., "event-topic");
KTable<String, Object> referenceDataTable = builder.table(..., "reference-data-topic");
KTable<String, Object> enrichedEventStream = eventStream
.leftJoin(referenceDataTable , (event, referenceData) -> /* return the enriched event */)
.map((key, enrichedEvent) -> new KeyValue<>(/* new key */, enrichedEvent)
.to("enriched-event-topic", ...);
Can the "naive" approach be considered an anti-pattern? Can the "KTable" approach be recommended as the preferred one?
Kafka can easily manage millions of messages per minute. Service that is called from the map() operator should be capable of handling high load too and also highly-available. These are extra requirements for the service implementation. But if the service satisfies these criteria can the "naive" approach be used?

Yes, it is ok to do RPC inside Kafka Streams operations such as map() operation. You just need to be aware of the pros and cons of doing so, see below. Also, you should do any such RPC calls synchronously from within your operations (I won't go into details here why; if needed, I'd suggest to create a new question).
Pros of doing RPC calls from within Kafka Streams operations:
Your application will fit more easily into an existing architecture, e.g. one where the use of REST APIs and request/response paradigms is common place. This means that you can make more progress quickly for a first proof-of-concept or MVP.
The approach is, in my experience, easier to understand for many developers (particularly those who are just starting out with Kafka) because they are familiar with doing RPC calls in this manner from their past projects. Think: it helps to move gradually from request-response architectures to event-driven architectures (powered by Kafka).
Nothing prevents you from starting with RPC calls and request-response, and then later migrating to a more Kafka-idiomatic approach.
Cons:
You are coupling the availability, scalability, and latency/throughput of your Kafka Streams powered application to the availability, scalability, and latency/throughput of the RPC service(s) you are calling. This is relevant also for thinking about SLAs.
Related to the previous point, Kafka and Kafka Streams scale very well. If you are running at large scale, your Kafka Streams application might end up DDoS'ing your RPC service(s) because the latter probably can't scale as much as Kafka. You should be able to judge pretty easily whether or not this is a problem for you in practice.
An RPC call (like from within map()) is a side-effect and thus a black box for Kafka Streams. The processing guarantees of Kafka Streams do not extend to such side effects.
Example: Kafka Streams (by default) processes data based on event-time (= based on when an event happened in the real world), so you can easily re-process old data and still get back the same results as when the old data was still new. But the RPC service you are calling during such reprocessing might return a different response than "back then". Ensuring the latter is your responsibility.
Example: In the case of failures, Kafka Streams will retry operations, and it will guarantee exactly-once processing (if enabled) even in such situations. But it can't guarantee, by itself, that an RPC call you are doing from within map() will be idempotent. Ensuring the latter is your responsibility.
Alternatives
In case you are wondering what other alternatives you have: If, for example, you are doing RPC calls for looking up data (e.g. for enriching an incoming stream of events with side/context information), you can address the downsides above by making the lookup data available in Kafka directly. If the lookup data is in MySQL, you can setup a Kafka connector to continuously ingest the MySQL data into a Kafka topic (think: CDC). In Kafka Streams, you can then read the lookup data into a KTable and perform the enrichment of your input stream via a stream-table join.

I suspect most of the advice you hear from the internet is along the lines of, "OMG, if this REST call takes 200ms, how wil I ever process 100,000 Kafka messages per second to keep up with my demand?"
Which is technically true: even if you scale your servers up for your REST service, if responses from this app routinely take 200ms - because it talks to a server 70ms away (speed of light is kinda slow, if that server is across the continent from you...) and the calling microservice takes 130ms even if you measure right at the source....
With kstreams the problem may be worse than it appears. Maybe you get 100,000 messages a second coming into your stream pipeline, but some kstream operator flatMaps and that operation in your app creates 2 messages for every one object... so now you really have 200,000 messages a second crashing through your REST server.
BUT maybe you're using Kstreams in an app that has 100 messages a second, or you can partition your data so that you get a message per partition maybe even just once a second. In that case, you might be fine.
Maybe your Kafka data just needs to go somewhere else: ie the end of the stream is back into a Good Ol' RDMS. In which case yes, there's some careful balancing there on the best way to deal with potentially "slow" systems, while making sure you don't DDOS yourself, while making sure you can work your way out of a backlog.
So is it an anti-pattern? Eh, probably, if your Kafka cluster is LinkedIn size. Does it matter for you? Depends on how many messages/second you need to drive, how fast your REST service really is, how efficiently it can scale (ie your new kstreams pipeline suddenly delivers 5x the normal traffic to it...)

Concurrency, how to create an efficient actor setup?

Alright so I have never done intense concurrent operations like this before, theres three main parts to this algorithm.
This all starts with a Vector of around 1 Million items.
Each item gets processed in 3 main stages.
Task 1: Make an HTTP Request, Convert received data into a map of around 50 entries.
Task 2: Receive the map and do some computations to generate a class instance based off the info found in the map.
Task 3: Receive the class and generate/add to multiple output files.
I initially started out by concurrently running task 1 with 64K entries across 64 threads (1024 entries per thread.). Generating threads in a for loop.
This worked well and was relatively fast, but I keep hearing about actors and how they are heaps better than basic Java threads/Thread pools. I've created a few actors etc. But don't know where to go from here.
Basically:
1. Are actors the right way to achieve fast concurrency for this specific set of tasks. Or is there another way I should go about it.
2. How do you know how many threads/actors are too many, specifically in task one, how do you know what the limit is on number of simultaneous connections is (Im on mac). Is there a golden rue to follow? How many threads vs how large per thread pool? And the actor equivalents?
3. Is there any code I can look at that implements actors for a similar fashion? All the code Im seeing is either getting an actor to print hello world, or super complex stuff.

1) Actors are a good choice to design complex interactions between components since they resemble "real life" a lot. You can see them as different people sending each other requests, it is very natural to model interactions. However, they are most powerful when you want to manage changing state in your application, which does not seem to be the case for you. You can achieve fast concurrency without actors. Up to you.
2) If none of your operations is blocking the best rule is amount of threads = amount of CPUs. If you use a non blocking HTTP client, and NIO when writing your output files then you should be fully non-blocking on IOs and can just safely set the thread count for your app to the CPU count on your machine.
3) The documentation on http://akka.io is very very good and comprehensive. If you have no clue how to use the actor model I would recommend getting a book - not necessarily about Akka.

1) It sounds like most of your steps aren't stateful, in which case actors add complication for no real benefit. If you need to coordinate multiple tasks in a mutable way (e.g. for generating the output files) then actors are a good fit for that piece. But the HTTP fetches should probably just be calls to some nonblocking HTTP library (e.g. spray-client - which will in fact use actors "under the hood", but in a way that doesn't expose the statefulness to you).
2) With blocking threads you pretty much have to experiment and see how many you can run without consuming too many resources. Worry about how many simultaneous connections the remote system can handle rather than hitting any "connection limits" on your own machine (it's possible you'll hit the file descriptor limit but if so best practice is just to increase it). Once you figure that out, there's no value in having more threads than the number of simultaneous connections you want to make.
As others have said, with nonblocking everything you should probably just have a number of threads similar to the number of CPU cores (I've also heard "2x number of CPUs + 1", on the grounds that that ensures there will always be a thread available whenever a CPU is idle).
With actors I wouldn't worry about having too many. They're very lightweight.

If you have really no expierience in Akka try to start with something simple like doing a one-to-one actor-thread rewriting of your code. This will be easier to grasp how things work in akka.
Spin two actors at the begining one for receiving requests and one for writting to the output file. Then when request is received create an actor in request-receiver actor that will do the computation and send the result to the writting actor.

Akka - How many instances of an actor should you create?

I'm new to the Akka framework and I'm building an HTTP server application on top of Netty + Akka.
My idea so far is to create an actor for each type of request. E.g. I would have an actor for a POST to /my-resource and another actor for a GET to /my-resource.
Where I'm confused is how I should go about actor creation? Should I:
Create a new actor for every request (by this I mean for every request should I do a TypedActor.newInstance() of the appropriate actor)? How expensive is it to create a new actor?
Create one instance of each actor on server start up and use that actor instance for every request? I've read that an actor can only process one message at a time, so couldn't this be a bottle neck?
Do something else?
Thanks for any feedback.

Well, you create an Actor for each instance of mutable state that you want to manage.
In your case, that might be just one actor if my-resource is a single object and you want to treat each request serially - that easily ensures that you only return consistent states between modifications.
If (more likely) you manage multiple resources, one actor per resource instance is usually ideal unless you run into many thousands of resources. While you can also run per-request actors, you'll end up with a strange design if you don't think about the state those requests are accessing - e.g. if you just create one Actor per POST request, you'll find yourself worrying how to keep them from concurrently modifying the same resource, which is a clear indication that you've defined your actors wrongly.
I usually have fairly trivial request/reply actors whose main purpose it is to abstract the communication with external systems. Their communication with the "instance" actors is then normally limited to one request/response pair to perform the actual action.

If you are using Akka, you can create an actor per request. Akka is extremely slim on resources and you can create literarily millions of actors on an pretty ordinary JVM heap. Also, they will only consume cpu/stack/threads when they actually do something.
A year ago I made a comparison between the resource consumption of the thread-based and event-based standard actors. And Akka is even better than the event-base.
One of the big points of Akka in my opinion is that it allows you to design your system as "one actor per usage" where earlier actor systems often forced you to do "use only actors for shared services" due to resource overhead.
I would recommend that you go for option 1.

Options 1) or 2) have both their drawbacks. So then, let's use options 3) Routing (Akka 2.0+)
Router is an element which act as a load balancer, routing the requests to other Actors which will perform the task needed.
Akka provides different Router implementations with different logic to route a message (for example SmallestMailboxPool or RoundRobinPool).
Every Router may have several children and its task is to supervise their Mailbox to further decide where to route the received message.
//This will create 5 instances of the actor ExampleActor
//managed and supervised by a RoundRobinRouter
ActorRef roundRobinRouter = getContext().actorOf(
Props.create(ExampleActor.class).withRouter(new RoundRobinRouter(5)),"router");
This procedure is well explained in this blog.

It's quite a reasonable option, but whether it's suitable depends on specifics of your request handling.
Yes, of course it could.
For many cases the best thing to do would be to just have one actor responding to every request (or perhaps one actor per type of request), but the only thing this actor does is to forward the task to another actor (or spawn a Future) which will actually do the job.

For scaling up the serial requests handling, add a master actor (Supervisor) which in turn will delegate to the worker actors (Children) (round-robin fashion).