Is it me or are DynamoDb Streams just really lacking? - spring-cloud

I have an application running in multiple regions in AWS, this application reads from global DynamoDb table(s). Updates occur in the background via another process and I wanted to be able to be able to monitor for these updates so the application can invalidate its cache (I'm not using DAX).
I was thinking I could use DynamoDb streams for this, however; after going through a number of road blocks with Spring Kinesis Streams Binder (e.g. the fact that it requires 2 tables [SpringIntegrationMetadataStore & SpringIntegrationLockRegistry] be created, my company doesn't allow dynamic creation of tables (so that was fun to hunt down as I couldn't find any mention in the docs - 🤷‍♀️ maybe I missed it). Now I think I have found out that only 1 application can listen to a Kinesis stream at a time?
Is that true?
Is there a way
Is there a way for multiple applications, that only read from DynamoDb, to get notified when an update occurs? I was thinking that I could use DynamoDb Streams such that each app would monitor the stream for updates and be able to invalidate their cache. If the above is true, then I need to do something more involved or complex (use a SNS/SQS for updates, elasticache, Redis, Kafka) which just seems like overkill for this scenario.

e.g. the fact that it requires 2 tables [SpringIntegrationMetadataStore & SpringIntegrationLockRegistry]
Well, that's how consumer group management is handled by Spring Cloud Stream Kinesis Binder. Even if you would use only a KCL, it still would require from you extra table in DynamoDB. Therefore your concern sounds more like a lack of confidence in cloud services you use.
Now I think I have found out that only 1 application can listen to a Kinesis stream at a time?
That's not true if all your consumer applications are configured for different consumer groups.
Please, make yourself familiar with Spring Cloud Stream and its model: https://docs.spring.io/spring-cloud-stream/docs/3.1.1/reference/html/spring-cloud-stream.html#_main_concepts
Another way probably could be done via AWS Lambda trigger for DynamoDB Streams: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.html

Related

Recommendations to store streaming events

We're evaluating possible approaches to persist streaming events(user click events in a web browser from many different users) so that it allows us to build custom user dashboards to later analyse those click events. We're planning to use Kafka to serve as the intermediate layer to ingest the vast amounts of streaming data coming from various user browsers. However I am curious to know whether Kafka can also serve as a persistent database to store these events so that we can later build the dashboarding application and have it query the events via some backend web APIs that we design.
Essentially, this is what we're thinking as of now:
Dashboarding frontend --- API ---> backend service ----queries ----> Kafka(stores user click events)
This article mentions that Kafka can be used as a persistent DB that apps can query but it cannot "replace" the traditional databases. I can imagine the huge cost overhead if Kafka is used as a persistent DB but then Kafka tiered storage might be a possible solution to bring the storage costs down?
Overall, to be able to design a custom dashboard to query the ingested event streams, is it advisable to use Kafka as a DB replacement or should we consider integrating Kafka with a traditional SQL/noSQL database or some other type of database? Any recommendations on which persistent DBs go well with Kafka for these types of use-cases?
Yes and no.
RocksDB (or a custom state-store) will allow you to "query" Kafka data via KSQL or Kafka Streams; you wouldn't have a direct API replacement against Kafka directly. There is also a recent podcast from Confluent discussing GraphQL queries against Kafka and/or a database layer.
Regarding analysis, it would be far better to use tools like Elasticsearch (with Kibana), Apache Pinot, or Druid (along with Apache SuperSet) for such click-stream analytics and dashboarding, and using Kafka as a channel to get data into those locations.
In general, your approach of frontend -> backend -> kafka -> db is good. Assuming the throughput is at a point that warrants bringing in kafka.
is it advisable to use Kafka as a DB replacement
No
should we consider integrating Kafka with a traditional SQL/noSQL database or some other type of database?
Yes
Any recommendations on which persistent DBs go well with Kafka for these types of use-cases?
This depends more on the context, constraints, and requirements of your work place. Expected throughput? What DBs already exist? What programming language is preferred?
You can run olap style dashboard and analytics queries on oltp databases such as postgres. Many teams run their analytics on the read replicas.
The blue chip DBs for this would be elastic search, redash, or big query. The rocket ships are snowflake and clickhouse.
Another option is to allow the data science team [if there is a data science team] to ingest the kafka stream directly into spark or some other system and do their processing directly on the hose to provide the dashboards required

Solution architecture Kafka

I am working with a third party vendor who I asked to provide me the events generated by a website
The vendor proposed to stream the events using Kafka ... why not...
On my side (the client) I am running a 100% MSSQL/Windows production environment and internal business want to have kpi and dashboard on website activities
Now the question - what would be the architecture to support a PoC so I can manage the inputs on one hand and create datamarts to deliver business needs?
Not clear what you mean by "events from website". Your Kafka producers are typically server side components, as you make API requests, you'd put Kafka producing events between those requests and your databases calls. I would be surprised if any third-party would just be able to do that immediately
Maybe you're looking for something like https://divolte.io/
You can also use CDC products to stream events out of your database
The architecture could be like this. The app streams event to Kafka. You can write a service to read the data from Kafka, do transformation and write to Database. You can then build Dashboard on top of DB.
Alternatively, you can populate indexes in Elastic Search and build Kibana dashboard as well.
My suggestion would be to use Lambda architecture to cater both Real-time and Batch processing needs:
Architecture:
Lambda architecture is designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.
This architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
Another Solution:

Kafka Microservice Proper Use Cases

In my new work's project, i discovered that instead of directly making post/put API calls from one microservice to another microservice, a microservice would produce a message to kafka, which is then consumed by a single microservice.
For example, Order microservice would publish a record to "pending-order" topic, which would then be consumed by Inventory microservice (no other consumer). In turn, after consuming the record and done some processing, Inventory microservice would produce a record to "processed-order" which would then be consumed only by Order microservice.
Is this a correct use case? Or is it better to just do API calls between microservices in this case?
There are two strong use cases of Kafka in a microservice based application:
You need to do a state change in multiple microservices as part of a single end user activity. If you do this by calling all the appropriate microservice APIs sequentially or parallely, there will be two issues:
Firstly, you lose atomicity i.e. you canNot guarantee "all or nothing" . It's very well possible that the call to microservice A succeeds but call to service B fails and that would lead to inconsistent data permanently. Secondly, in a cloud environment unpredictable latency and network timeouts are not uncommon and so when you make multiple calls as part of a single call, the probability of one of these calls getting delayed or failed is higher impacting user experience. Hence, the general recommendation here is, you write the user action atomically in a Kafka topic as an event and have multiple consumer groups - one for each interested microservice consume the event and make the state change in their own database. If the action is triggered by the user from a UI, you would also need to provide a "read your own write" guarantee where the user would like to see his data immediately after writing. Therefore, you'd need to write the event first in the local database of the first microservice and then do log based event sourcing (using an aporopriate Kafka Connector) to transfer the event data to Kafka. This will enable you to show the data to the user from the local DB. You may also need to update a cache, a search index, a distributed file system etc. and all of these can be done by consuming the Kafka events published by the individual microservices.
It is not very uncommon that you need to pull data from multiple microservice to do some activity or to aggregate data and display to the user. This, in general, is not recommended because of the latency and timeout issue mentioned above. It is usually recommended that we precompute those aggregates in the microservice local DB based on Kafka events published by the other microservices when they were changing their own state. This will allow you to serve the aggregate data to the user much faster. This is called materialized view pattern.
The only point to remember here is writing to Kafka log or broker and reading from it us asynchronous and there maybe a little time delay.
Microservice as consumer, seems fishy to me. You might mean Listeners to that topic would consume the message and maybe they will call your second microservice i.e. Inventory Microservice.
Yes, the model is fine, specially when you want to have asynchronous behavior and loads of traffic handled through it.
Imaging a scenario when you have more than 1 microservice to call from 1 endpoint. Here you need either aggregation layer which aggregates your services and you call it once, or you would like to publish several messages to Kafka which then will do the job.
Also think about Read services, if you need to call a microservice to read some data from somewhere else, then you can't use Kafka.
It all depends on your requirements and design.

Kafka Streams Architecture

I am hoping to clarify a few ideas on Kafka Streams from an architectural standpoint.
I understand the stream processing and data enrichment uses, and that the data can be reused by other applications if pushed back into Kafka, but what is the correct implementation of a Streams Application?
My initial thoughts would be to create an application that pulls in a table, joins it to a stream, and then fires off an event for each entry rather than pushing it back into Kafka. If multiple services use this data, then each would materialize their own table, right?
And I haven't implemented a test application yet, which may answer some of these questions, but I think is a good place for planning. Basically, where should the event be triggered, in the streaming app or in a separate consumer app?
My initial thoughts would be to create an application that pulls in a table, joins it to a stream, and then fires off an event for each entry rather than pushing it back into Kafka.
In an event-driven architecture, where would the application send the events to (and how), if you think that Kafka topics shouldn't be the destination for sharing the events with other apps? Do you have other preferences?
If multiple services use this data, then each would materialize their own table, right?
Yes, that is one option.
Another option is to use the interactive queries feature in KStreams (aka queryable state), which allows your first application to expose its tables and state stores to other applications directly (e.g., via a REST API). Other apps would then not need to materialize their own tables. However, an architectural downside is that you have now a direct coupling between your first app and any other downstream applications through request-response communication. While this pattern of direct inter-service communication is popular for a microservices architecture, a compelling alternative is to not use direct communication but instead let microservices/apps communicate indirectly with each other via Kafka (i.e., to use the previous option).
Basically, where should the event be triggered, in the streaming app or in a separate consumer app?
This is a matter of preference, see above. To inform your thinking you may want to read the 4-part mini series about event-driven architectures with Kafka: https://www.confluent.io/blog/journey-to-event-driven-part-1-why-event-first-thinking-changes-everything (disclaimer: this blog series was written by a colleague of mine).

How to model topics and partitions for Kafka when used to store all business events?

We're considering using Kafka as a way to store all our business events forever. The purpose is to be able to spin up new "microservices" that we haven't yet thought of that will be able to leverage on all previous events to build up their projections/state. Another use case might be an existing service where we'd like to "replay" all events that is of interest to this service to recreate its state.
Note that we're not planning to use Kafka as an "event store" in the sense that events will be projected/loaded into an aggregate on "every request".
Also (as far as I can tell) we don't know how consumers will consume the events. A new microservice might need all sorts of different events in order to create its internal projection/state.
Is Kafka suitable for this or is there a better alternative?
If so, what's a good way to model this (topics/partitions)?
We're currently using RabbitMQ for messaging (business events are sent to RabbitMQ). It would be great if we could migrate away from RabbitMQ in the future and move entirely to Kafka. I assume that this could change the way topics and partitions are modelled since now we have a better understanding of how consumers will consume the events. Would this be compatible with the other use case (infinite retention and replay)?
This is very good that you are switching to KAFKA and Yes it is possible to keep data in KAFKA BROKERs but i would suggest rather than keeping all the data in KAFKA-BROKERs for all time why can't you dump this data into HDFS or S3(AWS) it will be cheaper and you will have all the features of HDFS available with your data.
Storing all data in Brokers will increase overhead on Zookeeper as well.