Kafka Streams Architecture - apache-kafka

I am hoping to clarify a few ideas on Kafka Streams from an architectural standpoint.
I understand the stream processing and data enrichment uses, and that the data can be reused by other applications if pushed back into Kafka, but what is the correct implementation of a Streams Application?
My initial thoughts would be to create an application that pulls in a table, joins it to a stream, and then fires off an event for each entry rather than pushing it back into Kafka. If multiple services use this data, then each would materialize their own table, right?
And I haven't implemented a test application yet, which might answer some of these questions itself, but I think this is a good place to start planning. Basically, where should the event be triggered, in the streaming app or in a separate consumer app?

My initial thoughts would be to create an application that pulls in a table, joins it to a stream, and then fires off an event for each entry rather than pushing it back into Kafka.
In an event-driven architecture, where would the application send the events to (and how), if you think that Kafka topics shouldn't be the destination for sharing the events with other apps? Do you have other preferences?
If multiple services use this data, then each would materialize their own table, right?
Yes, that is one option.
Another option is to use the interactive queries feature in KStreams (aka queryable state), which allows your first application to expose its tables and state stores to other applications directly (e.g., via a REST API). Other apps would then not need to materialize their own tables. However, an architectural downside is that you now have a direct coupling between your first app and any downstream applications through request-response communication. While this pattern of direct inter-service communication is popular for a microservices architecture, a compelling alternative is to not use direct communication but instead let microservices/apps communicate indirectly with each other via Kafka (i.e., to use the previous option).
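To make the first option concrete, here is a minimal Kafka Streams sketch of "pull in a table, join it to a stream, and share the result via Kafka". The topic names and value types are placeholders, not anything from your actual setup:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichmentTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        KTable<String, String> customers = builder.table("customers");  // the "table" side
        KStream<String, String> orders = builder.stream("orders");      // the "stream" side

        // Join each order with its customer record and write the enriched event
        // back to Kafka, so any interested service can consume it with its own consumer group.
        orders.join(customers, (order, customer) -> order + " | " + customer)
              .to("enriched-orders");

        return builder.build();
    }
}

Each downstream service reading enriched-orders can then materialize its own table or simply react to every record, whichever fits its needs.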
Basically, where should the event be triggered, in the streaming app or in a separate consumer app?
This is a matter of preference, see above. To inform your thinking you may want to read the 4-part mini series about event-driven architectures with Kafka: https://www.confluent.io/blog/journey-to-event-driven-part-1-why-event-first-thinking-changes-everything (disclaimer: this blog series was written by a colleague of mine).

Related

What is the benefit of using Kafka Streams?

I am trying to work out what the benefit of using Kafka Streams is in my business model. Customers publish an order and instantly get offers from sellers who are online and interested in that order.
In this case, streams seem a good fit for joining the available (online) sellers to the order stream and filtering and sorting the offers by price. As a result, the customer should get the best offers by price on request.
I have found only one benefit: fewer server calls (all calculations happen in the stream).
My question is, why do streams matter in this case, given that I could implement these business steps using the standard approach with a single monolithic application?
I know this question is opinion based, but after reading some books about stream processing it is still hard to change my mind about this approach.
only one benefit: fewer server calls
Kafka Streams can still involve "server calls", especially when using Interactive Queries with an RPC layer. Fetching data from a remote table, such as ksqlDB, is also a "server call".
This is not the only benefit. Have you tried to write a join between topics using the regular consumer API? Or a filter/map in less than two lines of code (outside the config setup)?
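To illustrate the brevity point, here is a sketch with made-up topic names and an arbitrary price threshold; the filtering logic itself is the two DSL calls, everything else is boilerplate:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class OfferFilter {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, Double>stream("offers")
               .filter((orderId, price) -> price < 100.0) // keep only offers below the threshold
               .to("best-offers");
        return builder.build();
    }
}

Doing the equivalent with the plain consumer and producer APIs means writing the poll loop, the re-publish logic, and the offset handling yourself.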
could implement these business steps using the standard approach with a single monolithic application?
A Streams topology can still be embedded within a monolith, so I don't understand your point here. I assume you mean a fully synchronous application with a traditional database + API layer?
The books you say you've read should cover most benefits of stream processing, but you might want to check out "Kafka Streams in Action" for the advantages specific to Kafka Streams.

Is it me or are DynamoDb Streams just really lacking?

I have an application running in multiple regions in AWS; this application reads from global DynamoDB table(s). Updates occur in the background via another process, and I want to be able to monitor for these updates so the application can invalidate its cache (I'm not using DAX).
I was thinking I could use DynamoDB Streams for this. However, I have hit a number of roadblocks with the Spring Kinesis Streams Binder, e.g. the fact that it requires 2 tables [SpringIntegrationMetadataStore & SpringIntegrationLockRegistry] to be created, while my company doesn't allow dynamic creation of tables (that was fun to hunt down, as I couldn't find any mention of it in the docs - 🤷‍♀️ maybe I missed it). Now I think I have found that only one application can listen to a Kinesis stream at a time.
Is that true?
Is there a way for multiple applications that only read from DynamoDB to get notified when an update occurs? I was thinking I could use DynamoDB Streams so that each app would monitor the stream for updates and invalidate its cache. If the above is true, then I need to do something more involved or complex (SNS/SQS for updates, ElastiCache, Redis, Kafka), which seems like overkill for this scenario.
e.g. the fact that it requires 2 tables [SpringIntegrationMetadataStore & SpringIntegrationLockRegistry]
Well, that's how consumer group management is handled by the Spring Cloud Stream Kinesis Binder. Even if you used only the KCL, it would still require an extra table in DynamoDB. Therefore your concern sounds more like a lack of confidence in the cloud services you use.
Now I think I have found that only one application can listen to a Kinesis stream at a time
That's not true if all your consumer applications are configured for different consumer groups.
Please make yourself familiar with Spring Cloud Stream and its model: https://docs.spring.io/spring-cloud-stream/docs/3.1.1/reference/html/spring-cloud-stream.html#_main_concepts
Another option would be an AWS Lambda trigger for DynamoDB Streams: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.html
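For illustration, a minimal sketch of such a Lambda trigger using the aws-lambda-java-events library; the class name is made up, and how you fan the change notification out to the reader applications (SNS, a cache-invalidation endpoint, etc.) is left open:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

public class CacheInvalidationHandler implements RequestHandler<DynamodbEvent, Void> {

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbStreamRecord record : event.getRecords()) {
            // One record per INSERT/MODIFY/REMOVE on the table; forward the key to
            // whatever mechanism notifies the reader applications to invalidate their cache.
            String key = record.getDynamodb().getKeys().toString();
            context.getLogger().log("Change detected for key: " + key);
        }
        return null;
    }
}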

Solution architecture Kafka

I am working with a third-party vendor whom I asked to provide the events generated by a website.
The vendor proposed streaming the events using Kafka... why not.
On my side (the client), I am running a 100% MSSQL/Windows production environment, and internal business users want KPIs and dashboards on website activity.
Now the question: what would be the architecture to support a PoC, so that I can manage the inputs on the one hand and create data marts to deliver on business needs on the other?
It's not clear what you mean by "events from the website". Your Kafka producers are typically server-side components: as you handle API requests, you'd put Kafka event production between those requests and your database calls. I would be surprised if any third party could just do that immediately.
Maybe you're looking for something like https://divolte.io/
You can also use CDC products to stream events out of your database
The architecture could be like this: the app streams events to Kafka; you write a service to read the data from Kafka, transform it, and write it to a database; you then build dashboards on top of the DB (a sketch of such a consumer service is shown below).
Alternatively, you can populate indexes in Elasticsearch and build a Kibana dashboard.
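A minimal sketch of the "read from Kafka, transform, write to the database" service mentioned above; the topic name, connection string, and staging table are placeholders for whatever the vendor and your MSSQL environment actually provide:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class WebsiteEventLoader {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "website-event-loader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=analytics;user=...;password=...")) {
            consumer.subscribe(Collections.singletonList("website-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Transform as needed, then land the event in a staging table
                    // that the data marts / dashboards are built from.
                    try (PreparedStatement stmt = db.prepareStatement(
                            "INSERT INTO website_events (event_key, payload) VALUES (?, ?)")) {
                        stmt.setString(1, record.key());
                        stmt.setString(2, record.value());
                        stmt.executeUpdate();
                    }
                }
            }
        }
    }
}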
My suggestion would be to use the Lambda architecture to cater to both real-time and batch processing needs.
Architecture:
Lambda architecture is designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.
This architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.

Distributing events across different JVMs with Axon Server to Subscribing Event Processors (without Event Sourcing)

I'm using Axon Framework (4.1) with aggregates in one module (JVM, container) and projections/Sagas in another module. What I want to do is to have a distributed application taking advantage of CQRS but without Event Sourcing.
It is rather trivial to set up, and everything works as expected in a single application. The problem arises when several independent modules (across separate JVMs) are involved. Out of the box, the Axon starter uses tracking processors connected to the AxonServerEventStore, which allows "location transparency" when it comes to listening to events across different JVMs.
In my case, I don't want any infrastructure for persisting or tracking the events. I just want to distribute the events to any subscribing processors (SEPs) from my aggregates in a fire-and-forget style, just like AxonServerQueryBus is doing to distribute scatter-gather queries, for example.
If I just declare all processors as subscribing as follows:
@Autowired
public void configureEventSubscribers(EventProcessingConfigurer configurer) {
    configurer.usingSubscribingEventProcessors();
}
events reach all @EventHandler methods in the same JVM, but events no longer reach any handlers in other JVMs. If my understanding is correct, then Axon Server will distribute the events across JVMs for tracking processors (TEPs) only.
Obviously, what I can do is use an external message broker (RabbitMQ, Kafka) in combination with SpringAMQPMessageSource (as in the docs) to distribute events to all subscribers through something like a fanout exchange in RabbitMQ. This works, but it requires me to maintain the broker myself.
What would be nice is to have Axon Server taking care of this just like it takes care of distributing commands and queries (this would give me one less infrastructure piece to care about).
As a side note, I've actually managed to distribute the events to projections using the QueryBus, passing events as payloads to GenericQueryMessage sent as scatter-gather queries. Needless to say, this is not a robust solution. But it does demonstrate that there is nothing inherently impossible about Axon Server distributing events (just another type of message, after all) to SEPs or TEPs indifferently.
Finally, the questions:
1) What is the community's recommendation for pure CQRS (without Event Sourcing) using Axon when it comes to location transparency and distributing the events?
2) Is it possible to make Axon Server to distribute events to SEPs across JVMs (eliminating the need for an external message broker)?
Note on Event Sourcing
From Axon Framework's perspective, Event Sourcing is solely a concern of your Command Model. This stance is taken because Event Sourcing defines the recreation of a model through the events it has published. A Query Model, however, does not react to commands by publishing events that change its state; it simply listens to (distributed) events to update the state it exposes for querying by others.
As such, the framework only thinks about Event Sourcing when it recreates your Aggregates, by providing the EventSourcingRepository.
The Event Processor's job is to be the "mechanical aspect of providing events to your Event Handlers". This relates to the Q part in CQRS, to recreating the Query Model.
Thus, the Framework does not regard Event Processors to be part of the notion of Event Sourcing.
Answer to your scenario
I do want to emphasize that if you are distributing your application by running several instances of a given app, you will very likely need to have a way to ensure a given event is only handled once.
This is one of the concerns a Tracking Event Processor (TEP) addresses, and it does so by using a Tracking Token.
The Tracking Token essentially acts as a marker defining which events have been processed. Additionally, a given TEP's thread must hold a claim on a token to be able to work, which ensures a given event is not handled twice.
Concluding, you will need to define infrastructure to store Tracking Tokens to be able to distribute the event load, essentially ruling out the use of the SubscribingEventProcessor entirely.
However, whether the above is an issue does depend on your application landscape.
Maybe you aren't duplicating a given application at all, thus effectively not duplicating a given Tracking Event Processor.
In this case, you can fulfill your request to "not track events", whilst still using Tracking Event Processors.
All you have to do is ensure you are not storing them. The interface used for storing tokens is the TokenStore, for which an in-memory version exists.
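For illustration, a sketch of registering that in-memory version through the EventProcessingConfigurer; this assumes the registerTokenStore overload without a processing-group name is available in your Axon version (otherwise register it per processing group):

import org.axonframework.config.EventProcessingConfigurer;
import org.axonframework.eventhandling.tokenstore.inmemory.InMemoryTokenStore;
import org.springframework.beans.factory.annotation.Autowired;

// Tokens then live only in memory and are lost on every restart.
@Autowired
public void configureTokenStore(EventProcessingConfigurer configurer) {
    configurer.registerTokenStore(config -> new InMemoryTokenStore());
}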
Using the InMemoryTokenStore in a default Axon set-up will, however, mean you'll technically be replaying your events every time. This occurs due to the default "initial Tracking Token" process. This is, of course, also configurable, for which I'd suggest the following approach:
// Creating the configuration for a TEP
TrackingEventProcessorConfiguration tepConfig =
        TrackingEventProcessorConfiguration
                .forSingleThreadedProcessing() // Note: could also be multi-threaded
                .andInitialTrackingToken(StreamableMessageSource::createHeadToken);

// Registering it as the default TEP config; 'configurer' is the EventProcessingConfigurer
configurer.registerTrackingEventProcessorConfiguration(config -> tepConfig);
This should set you up to use TEP, without the necessity to set up infrastructure to store Tokens. Note however, this will require you not to duplicate the given application.
I'd like to end with the following question you've posted:
Is it possible to make Axon Server to distribute events to SEPs across JVMs (eliminating the need for an external message broker)?
As you have correctly noted, SEPs are (currently) only usable for subscribing to events which have been published within a given JVM. Axon Server does not (yet) have a mechanism to bridge events from one JVM to another for the purpose of allowing distributed Subscribing Event Processing. I am (as part of AxonIQ) however relatively sure we will look into this in the future. If such a feature is important to the successful conclusion of your project, I suggest contacting AxonIQ directly.
If you are considering Apache Kafka for this use case, you might want to look into kalium.alkal.io.
It will make your code much simpler:
MyObject myObject = ....

kalium.post(myObject); // sends POJOs/protobufs using a Kafka Producer

// On the consumer side, this will use a deserializer with the Kafka Consumer API
kalium.on(MyObject.class, myObject -> {
    // do something with the object
}, "consumer_group");

Kafka Microservice Proper Use Cases

In my new job's project, I discovered that instead of directly making POST/PUT API calls from one microservice to another, a microservice produces a message to Kafka, which is then consumed by a single other microservice.
For example, the Order microservice publishes a record to the "pending-order" topic, which is then consumed by the Inventory microservice (no other consumer). In turn, after consuming the record and doing some processing, the Inventory microservice produces a record to "processed-order", which is then consumed only by the Order microservice.
Is this a correct use case? Or is it better to just do API calls between microservices in this case?
There are two strong use cases of Kafka in a microservice based application:
You need to make a state change in multiple microservices as part of a single end-user activity. If you do this by calling all the appropriate microservice APIs sequentially or in parallel, there will be two issues:
Firstly, you lose atomicity, i.e. you cannot guarantee "all or nothing". It is quite possible that the call to microservice A succeeds but the call to service B fails, which would lead to permanently inconsistent data. Secondly, in a cloud environment unpredictable latency and network timeouts are not uncommon, so when you make multiple calls as part of a single call, the probability of one of them being delayed or failing is higher, hurting the user experience.
Hence, the general recommendation here is: write the user action atomically to a Kafka topic as an event and have multiple consumer groups - one for each interested microservice - consume the event and make the state change in their own database (see the consumer-group sketch at the end of this answer). If the action is triggered by the user from a UI, you also need to provide a "read your own writes" guarantee, where the user wants to see their data immediately after writing. Therefore, you'd write the event first to the local database of the first microservice and then do log-based event sourcing (using an appropriate Kafka connector) to transfer the event data to Kafka. This lets you show the data to the user from the local DB. You may also need to update a cache, a search index, a distributed file system, etc., and all of these can be done by consuming the Kafka events published by the individual microservices.
It is not uncommon to need to pull data from multiple microservices to perform some activity or to aggregate data and display it to the user. This, in general, is not recommended because of the latency and timeout issues mentioned above. Instead, it is usually recommended to precompute those aggregates in the microservice's local DB, based on the Kafka events published by the other microservices when they change their own state. This allows you to serve the aggregated data to the user much faster. This is called the materialized view pattern.
The only point to remember here is that writing to the Kafka log/broker and reading from it is asynchronous, so there may be a small time delay.
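To make the "one consumer group per interested microservice" idea concrete, here is a minimal sketch; the topic name, group id, and bootstrap address are made up:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UserActionConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "inventory-service"); // another microservice would use e.g. "billing-service"
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-actions"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Apply the state change in this service's own database.
                    System.out.printf("Applying %s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}

Because each service uses a different group.id, every service gets its own copy of every event and can update its own store independently.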
"Microservice as consumer" seems fishy to me. You probably mean that listeners on that topic consume the message and then perhaps call your second microservice, i.e. the Inventory microservice.
Yes, the model is fine, especially when you want asynchronous behaviour and lots of traffic handled through it.
Imagine a scenario where you have more than one microservice to call from one endpoint. Here you either need an aggregation layer that aggregates your services so you can call it once, or you publish several messages to Kafka, which then does the job.
Also think about read services: if you need to call a microservice to read some data from somewhere else, then you can't use Kafka.
It all depends on your requirements and design.