Recommendations to store streaming events - apache-kafka

We're evaluating possible approaches to persist streaming events(user click events in a web browser from many different users) so that it allows us to build custom user dashboards to later analyse those click events. We're planning to use Kafka to serve as the intermediate layer to ingest the vast amounts of streaming data coming from various user browsers. However I am curious to know whether Kafka can also serve as a persistent database to store these events so that we can later build the dashboarding application and have it query the events via some backend web APIs that we design.
Essentially, this is what we're thinking as of now:
Dashboarding frontend --- API ---> backend service ----queries ----> Kafka(stores user click events)
This article mentions that Kafka can be used as a persistent DB that apps can query but it cannot "replace" the traditional databases. I can imagine the huge cost overhead if Kafka is used as a persistent DB but then Kafka tiered storage might be a possible solution to bring the storage costs down?
Overall, to be able to design a custom dashboard to query the ingested event streams, is it advisable to use Kafka as a DB replacement or should we consider integrating Kafka with a traditional SQL/noSQL database or some other type of database? Any recommendations on which persistent DBs go well with Kafka for these types of use-cases?

Yes and no.
RocksDB (or a custom state-store) will allow you to "query" Kafka data via KSQL or Kafka Streams; you wouldn't have a direct API replacement against Kafka directly. There is also a recent podcast from Confluent discussing GraphQL queries against Kafka and/or a database layer.
Regarding analysis, it would be far better to use tools like Elasticsearch (with Kibana), Apache Pinot, or Druid (along with Apache SuperSet) for such click-stream analytics and dashboarding, and using Kafka as a channel to get data into those locations.

In general, your approach of frontend -> backend -> kafka -> db is good. Assuming the throughput is at a point that warrants bringing in kafka.
is it advisable to use Kafka as a DB replacement
No
should we consider integrating Kafka with a traditional SQL/noSQL database or some other type of database?
Yes
Any recommendations on which persistent DBs go well with Kafka for these types of use-cases?
This depends more on the context, constraints, and requirements of your work place. Expected throughput? What DBs already exist? What programming language is preferred?
You can run olap style dashboard and analytics queries on oltp databases such as postgres. Many teams run their analytics on the read replicas.
The blue chip DBs for this would be elastic search, redash, or big query. The rocket ships are snowflake and clickhouse.
Another option is to allow the data science team [if there is a data science team] to ingest the kafka stream directly into spark or some other system and do their processing directly on the hose to provide the dashboards required

Related

How to operate a Kafka cluster and a streaming application 24/7 on budget?

I want to stream financial data (trades, orderbook) from an exchange websocket endpoint and store that data somewhere to build up my own data history for backtesting purposes. Furthermore I might want to analyze the data in real time.
I found the idea of an event driven system very interesting so that I ended up building my own dockerized confluent Kafka cluster (with avro schema-registry) and a python producer that sends the streaming data into a Kafka topic. Then I set up a Faust app to stream process the data and store it as a new topic in Kafka.
It's working fine on my laptop, but now I'm wondering how I could put this to production? Obviously I cannot do it on my laptop, because I need this application to run 24/7 without interruptions.
When I look at the fully managed Kafka cloud solutions like confluent then I find it quite expensive, especially as I'm not running a business, it's rather a private hobby project. And maybe I don't even need that kind of highly scalable and professional service.
What could be a cost efficient approach for me to get my streaming and storage application to work?
Is there another Kafka cloud solution more reduced to my needs?
Should I set up my own server? Maybe raspberry pi?
Or should I use a different approach?
I'm sorry if my problem description might not be very specific, it's a reflection of me being overwhelmed with all these system architecture questions and cloud services.
Any advice and recommendation are appreciated!

Is it me or are DynamoDb Streams just really lacking?

I have an application running in multiple regions in AWS, this application reads from global DynamoDb table(s). Updates occur in the background via another process and I wanted to be able to be able to monitor for these updates so the application can invalidate its cache (I'm not using DAX).
I was thinking I could use DynamoDb streams for this, however; after going through a number of road blocks with Spring Kinesis Streams Binder (e.g. the fact that it requires 2 tables [SpringIntegrationMetadataStore & SpringIntegrationLockRegistry] be created, my company doesn't allow dynamic creation of tables (so that was fun to hunt down as I couldn't find any mention in the docs - 🤷‍♀️ maybe I missed it). Now I think I have found out that only 1 application can listen to a Kinesis stream at a time?
Is that true?
Is there a way
Is there a way for multiple applications, that only read from DynamoDb, to get notified when an update occurs? I was thinking that I could use DynamoDb Streams such that each app would monitor the stream for updates and be able to invalidate their cache. If the above is true, then I need to do something more involved or complex (use a SNS/SQS for updates, elasticache, Redis, Kafka) which just seems like overkill for this scenario.
e.g. the fact that it requires 2 tables [SpringIntegrationMetadataStore & SpringIntegrationLockRegistry]
Well, that's how consumer group management is handled by Spring Cloud Stream Kinesis Binder. Even if you would use only a KCL, it still would require from you extra table in DynamoDB. Therefore your concern sounds more like a lack of confidence in cloud services you use.
Now I think I have found out that only 1 application can listen to a Kinesis stream at a time?
That's not true if all your consumer applications are configured for different consumer groups.
Please, make yourself familiar with Spring Cloud Stream and its model: https://docs.spring.io/spring-cloud-stream/docs/3.1.1/reference/html/spring-cloud-stream.html#_main_concepts
Another way probably could be done via AWS Lambda trigger for DynamoDB Streams: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.html

Solution architecture Kafka

I am working with a third party vendor who I asked to provide me the events generated by a website
The vendor proposed to stream the events using Kafka ... why not...
On my side (the client) I am running a 100% MSSQL/Windows production environment and internal business want to have kpi and dashboard on website activities
Now the question - what would be the architecture to support a PoC so I can manage the inputs on one hand and create datamarts to deliver business needs?
Not clear what you mean by "events from website". Your Kafka producers are typically server side components, as you make API requests, you'd put Kafka producing events between those requests and your databases calls. I would be surprised if any third-party would just be able to do that immediately
Maybe you're looking for something like https://divolte.io/
You can also use CDC products to stream events out of your database
The architecture could be like this. The app streams event to Kafka. You can write a service to read the data from Kafka, do transformation and write to Database. You can then build Dashboard on top of DB.
Alternatively, you can populate indexes in Elastic Search and build Kibana dashboard as well.
My suggestion would be to use Lambda architecture to cater both Real-time and Batch processing needs:
Architecture:
Lambda architecture is designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.
This architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
Another Solution:

What do you use Apache Kafka for? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I would like to ask if my understanding of Kafka is correct.
For really really big data stream, conventional database is not adequate so people use things such as Hadoop or Storm. Kafka sits on top of said databases and provide ...directions where the real time data should go?
I don't think so.
Kafka is messaging system and it does not sit on top of database.
You can compare Kafka with messaging systems like ActiveMQ, RabbitMQ etc.
From Apache documentation page
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
Key takeaways:
Kafka maintains feeds of messages in categories called topics.
We'll call processes that publish messages to a Kafka topic producers.
We'll call processes that subscribe to topics and process the feed of published messages consumers..
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol.
Use Cases:
Messaging: Kafka works well as a replacement for a more traditional message broker. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ
Website Activity Tracking: The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds
Metrics: Kafka is often used for operational monitoring data, which involves aggregating statistics from distributed applications to produce centralized feeds of operational data
Log Aggregation
Stream Processing
Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records.
Commit Log: Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data
To fully understand Apache Kafka's role you should get a wider picture and know Kafka's use cases. Modern data processing systems try to break with the classic application architecture. You can start from the kappa architecture overview:
http://milinda.pathirage.org/kappa-architecture.com
In this architecture you don't store the current state of the world in any SQL or key-value database. All data is processed and stored as one or more series of events in an append-only immutable log. Immutable events are easier to replicate and store in a distributed environment. Apache Kafka is a system that is used storing these events and for brokering them between other system components.
Use cases on Apache Kafka's official site: http://kafka.apache.org/documentation.html#uses
More use cases :-
Kafka-Storm Pipeline -
Kafka can be used with Apache Storm to handle data pipeline for high speed filtering and pattern matching on the fly.
Apache Kafka is not just a message broker. It was initially designed and implemented by LinkedIn in order to serve as a message queue. Since 2011, Kafka has been open sourced and quickly evolved into a distributed streaming platform, which is used for the implementation of real-time data pipelines and streaming applications.
It is horizontally scalable, fault-tolerant, wicked fast, and runs in
production in thousands of companies.
Modern organisations have various data pipelines that facilitate the communication between systems or services. Things get a bit more complicated when a reasonable number of services needs to communicate with each other at real time.
The architecture becomes complex since various integrations are required in order to enable the inter-communication of these services. More precisely, for an architecture that encompasses m source and n target services, n x m distinct integrations need to be written. Also, every integration comes with a different specification, meaning that one might require a different protocol (HTTP, TCP, JDBC, etc.) or a different data representation (Binary, Apache Avro, JSON, etc.), making things even more challenging. Furthermore, source services might address increased load from connections that could potentially impact latency.
Apache Kafka leads to more simple and manageable architectures, by decoupling data pipelines. Kafka acts as a high-throughput distributed system where source services push streams of data, making them available for target services to pull them at real-time.
Also, a lot of open-source and enterprise-level User Interfaces for managing Kafka Clusters are available now. For more details refer to my answer to this question.
You can find more details about Apache Kafka and how it works in the blog post "Why Apache Kafka?"
Apache Kafka is an open-source software platform written in Scala and Java, mainly used for stream processing.
The use cases of Apache Kafka are:
Messaging
Website Activity Tracking
Metrics
Log Aggregation
Stream Processing
Event Sourcing
Commit Log
For more information use the official apache Kafka site.
https://kafka.apache.org/uses
Kafka is a pub-sub highly scalable messaging system. It acts as a transport layer guaranteeing exactly once semantics and Spark steaming does the processing. The next question that comes to my mind is even spark can poll directories to check for files and even read from a socket or port. How this Kafka and spark work in tandem ? I mean does an application written in some language instead of writing to a database for storage directly feds to the port (or places the files which would not really be tak time and would rather be some kind of batch processing) from which the data is then read by a Kafka producer and then via the Kafka consumer API is then read and processing by spark streaming?

How realtime data input to Druid?

I have analytic server (for example click counter). I want to send data to druid using some api. How should I do that?
Can I use it as replacement for google analytics?
As se7entyse7en said:
You can ingest your data to Kafka and then use druid's Kafka
firehose to ingest your data to druid through real-time ingestion.
After that you can interactively query druid using its api.
It must be said that firehoses can be setup only on Druid realtime nodes.
Here is a tutorial how to setup the Kafka firehose: Loading Streaming Data.
Beside Kafka firehose, you can setup other provided firehoses - Amazon S3 firehose, RabbitMQ firehose, etc... by including them and you can even write your own firehose as an extension, an example is here. Here are all druid extensions.
It must be said that Druid is shifting real-time ingestion from realtime nodes to the Indexing service, as explained here.
Right now the best practise is to run Realtime Index Task on Indexing Service and then you can use Druid's API to send data to this task. You can use the API directly but it's far more easier to use Tranquility. It's a library that will automatically create new Realtime Index Task for new segments and it'll allow you to send messages to the right task. You can also set replication and sharding level etc. Just run the indexing service, use Tranquility and you can start sending your messages to Druid.
You can ingest your data to Kafka and then use druid's Kafka firehose to ingest your data to druid through real-time ingestion. After that you can interactively query druid using its api.
The best way to use, considering your druid is a 0.9.x version is tranquility. The rest api is pretty solid and allows you to control your data schema. The druid.io quickstart page and hit the "Load streaming data" section.
I am loading in clickstream data for our website at real time and its been working very well. So, yes you can replace google analytics with druid (assuming, you have the required infrastructure).