I am trying to transfer Couchbase data (specifically, the incremental data added to it) to a Kafka topic.
How can I do this?
Check out the "Quickstart" section of the documentation. The sidebar there also links to more details about the Couchbase Kafka connector.
https://developer.couchbase.com/documentation/server/current/connectors/kafka-3.1/quickstart.html
Related
We're evaluating possible approaches to persist streaming events (user click events in a web browser, from many different users) so that we can build custom user dashboards to analyse those click events later. We're planning to use Kafka as the intermediate layer to ingest the vast amounts of streaming data coming from the various user browsers. However, I am curious whether Kafka can also serve as a persistent database for these events, so that we can later build the dashboarding application and have it query the events via some backend web APIs that we design.
Essentially, this is what we're thinking as of now:
Dashboarding frontend --- API ---> backend service ----queries ----> Kafka(stores user click events)
This article mentions that Kafka can be used as a persistent DB that apps can query but it cannot "replace" the traditional databases. I can imagine the huge cost overhead if Kafka is used as a persistent DB but then Kafka tiered storage might be a possible solution to bring the storage costs down?
Overall, to be able to design a custom dashboard to query the ingested event streams, is it advisable to use Kafka as a DB replacement or should we consider integrating Kafka with a traditional SQL/noSQL database or some other type of database? Any recommendations on which persistent DBs go well with Kafka for these types of use-cases?
Yes and no.
RocksDB (or a custom state store) will allow you to "query" Kafka data via KSQL or Kafka Streams, but you would not get a drop-in query API against Kafka itself. There is also a recent podcast from Confluent discussing GraphQL queries against Kafka and/or a database layer.
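To illustrate what "querying" a state store means here, below is a minimal sketch in plain Python. It is not Kafka Streams or KSQL code: the `ClickCountStore` class, the `userId` field, and the in-memory counter standing in for RocksDB are all assumptions made for illustration. The idea is that events are folded into a keyed table as they arrive, and the backend queries that table rather than the log itself.

```python
from collections import Counter


class ClickCountStore:
    """Hypothetical in-memory stand-in for a state store (e.g. RocksDB)
    that keeps a running click count per user."""

    def __init__(self):
        self._counts = Counter()

    def apply(self, event):
        """Fold one click event (a dict with an assumed 'userId' field) into the table."""
        self._counts[event["userId"]] += 1

    def get(self, user_id):
        """Point lookup, the kind of query a backend API would issue."""
        return self._counts.get(user_id, 0)


def build_store(events):
    """Replay a sequence of events into a fresh store."""
    store = ClickCountStore()
    for event in events:
        store.apply(event)
    return store
```

A dashboard backend would call something like `store.get("u1")` instead of scanning the topic, which is why the store, not Kafka, answers the query.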
Regarding analysis, it would be far better to use tools like Elasticsearch (with Kibana), Apache Pinot, or Druid (along with Apache Superset) for such click-stream analytics and dashboarding, and to use Kafka as a channel to get data into those systems.
In general, your approach of frontend -> backend -> Kafka -> DB is good, assuming the throughput is high enough to warrant bringing in Kafka.
is it advisable to use Kafka as a DB replacement
No
should we consider integrating Kafka with a traditional SQL/noSQL database or some other type of database?
Yes
Any recommendations on which persistent DBs go well with Kafka for these types of use-cases?
This depends more on the context, constraints, and requirements of your workplace. What is the expected throughput? What DBs already exist? What programming language is preferred?
You can run OLAP-style dashboard and analytics queries on OLTP databases such as Postgres. Many teams run their analytics on the read replicas.
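As a rough sketch of what an OLAP-style dashboard query over stored click events looks like, the example below uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs without a server. The `click_events` table, its columns, and the sample rows are assumptions made for illustration; the `GROUP BY` query itself is ordinary SQL that would work the same way on a Postgres read replica.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical click_events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE click_events (user_id TEXT, page TEXT, ts INTEGER)")
    conn.executemany(
        "INSERT INTO click_events VALUES (?, ?, ?)",
        [("u1", "/home", 1), ("u2", "/home", 2), ("u1", "/pricing", 3)],
    )
    conn.commit()
    return conn


def clicks_per_page(conn):
    """Aggregate clicks per page, most-clicked first (ties broken by page name)."""
    return conn.execute(
        "SELECT page, COUNT(*) AS clicks FROM click_events "
        "GROUP BY page ORDER BY clicks DESC, page ASC"
    ).fetchall()
```

In a real deployment the rows would be written by a Kafka consumer or sink connector, and the dashboard backend would run queries like `clicks_per_page` against the replica.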
The blue-chip options for this would be Elasticsearch, Redash (a dashboarding tool rather than a database), or BigQuery. The rocket ships are Snowflake and ClickHouse.
Another option is to let the data science team (if there is one) ingest the Kafka stream directly into Spark or some other system and do their processing directly on the hose to provide the required dashboards.
I'm working on a project where I have to consume data from a Kafka cluster, process it, and send it to MongoDB. The application should be deployable on Pivotal Cloud Foundry (PCF). After doing some research on the internet, I found the Spring Cloud Data Flow (SCDF) toolkit interesting since it can be deployed on PCF. I'm wondering how we can use it to create our real-time streaming pipeline. For the moment, I'm thinking about using Kafka Streams and Spring Cloud Stream to process and transform the topic streams, but I don't know how to integrate that into SCDF, or how to send those streams to MongoDB. I'm sorry if my question is not clear; I'm entirely new to these frameworks.
Thanks in advance
You could use the named-destination support in SCDF to directly consume events from Kafka or any other Spring Cloud Stream supported message broker implementations.
Now, for the write portion, you can use the out-of-the-box MongoDB-sink application that we build, maintain, and ship.
If you have to do some processing before you write to MongoDB, you can create a custom Spring Cloud Stream application with the desired binder implementation [see: dev-guide/docs].
To put this all together, assume you have events coming from a Kafka topic named Customers, a custom processor (let's call it CustomerTransformer) doing some transformation on each received payload, and finally the MongoDB sink doing the write.
Here's a take of this streaming data pipeline use-case designed from SCDF's Dashboard:
I am new to Kafka and data streaming and need some advice on the following requirement.
Our system is expecting close to 1 million incoming messages per day. Each message carries a project identifier and should be pushed only to the users of that project. For our case, let's say we have projects A, B and C. Users who open project A's dashboard only see (receive) the messages of project A.
This is my idea so far for implementing a solution:
The messages should be pushed to a Kafka topic as they arrive; let's call this the Root Topic. Once pushed to the Root Topic, the messages can be read by a Kafka consumer/listener which, based on the project identifier in the message, pushes each message to a project-specific topic. So any message can end up in Topic A, B or C. I am thinking of using websockets to update the project users' dashboards as messages arrive. There will be N consumers/listeners for the N project topics. These consumers will push the project-specific messages to the project-specific websocket endpoints.
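The routing step in this design (Root Topic to project-specific topics) can be sketched in plain Python as below. This is only a model of the logic, not Kafka client code: the JSON message format, the `projectId` field, and the `project-` topic prefix are all assumptions made for illustration. In a real system the returned topic name would be passed to a producer.

```python
import json
from collections import defaultdict


def project_topic(raw_message, prefix="project-"):
    """Return the project-specific topic name for one raw JSON message.

    Assumes the message is a JSON object with a 'projectId' field.
    """
    project_id = json.loads(raw_message)["projectId"]
    return f"{prefix}{project_id}"


def route(raw_messages):
    """Group raw messages by destination topic, preserving arrival order."""
    topics = defaultdict(list)
    for raw in raw_messages:
        topics[project_topic(raw)].append(raw)
    return dict(topics)
```

Because routing depends only on the project identifier, the same function could equally be used to pick a websocket endpoint directly, which would avoid one topic per project.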
Please advise if I can make any improvements to the above design.
I chose Kafka as the messaging system here as it is highly scalable and fault tolerant.
There is no complex transformation or data enrichment before a message gets sent to the client. Would it make sense to use Apache Flink or Hazelcast Jet for the streaming, or is Kafka Streams good enough for this simple requirement?
Also, when should I consider using Hazelcast Jet or Apache Flink in my project?
Should I use Flink when, say, I have to update a few properties in the message based on a web service call or database lookup before sending it to the users?
Should I use Hazelcast Jet only when I need the entire dataset in memory to arrive at a property value? Or will using Jet bring some benefits even for my simple use case specified above? Please advise.
Kafka Streams is a great tool for converting one Kafka topic into another Kafka topic.
What you need is a tool to move data from a Kafka topic to another system via web sockets.
A stream processor gives you convenient tooling to build this data pipeline (among other things, connectors to Kafka and web sockets, and a scalable, fault-tolerant execution environment). So you might want to use a stream processor even if you don't transform the data.
The benefit of Hazelcast Jet is its embedded, scalable caching layer. You might want to cache your database/web service calls so that the enrichment is performed locally, reducing remote service calls.
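The caching idea can be sketched in plain Python as follows. This is not Hazelcast Jet code: `functools.lru_cache` is an in-process cache used here as a stand-in for Jet's distributed caching layer, and `remote_lookup`, the `projectId` field, and the sample project names are hypothetical. The point is only that repeated lookups for the same key are answered locally.

```python
from functools import lru_cache


def remote_lookup(project_id):
    """Hypothetical stand-in for a slow database or web service call."""
    return {"A": "Alpha", "B": "Beta"}.get(project_id, "unknown")


@lru_cache(maxsize=1024)
def cached_lookup(project_id):
    """Call remote_lookup only on a cache miss; repeat keys are served locally."""
    return remote_lookup(project_id)


def enrich(message):
    """Return a copy of the message with the project name added."""
    return {**message, "projectName": cached_lookup(message["projectId"])}
```

`cached_lookup.cache_info()` reports hits and misses, which is a quick way to check how many remote calls the cache actually saved.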
See how to use Jet to read from Kafka and how to write data to a TCP socket (not websocket).
I would like to give you another option. I'm not a Spark/Jet expert at all, but I've been studying them for a few weeks.
I would use Pentaho Data Integration (Kettle) to consume from Kafka, and I would write a Kettle step (or a User Defined Java Class step) to write the messages to a Hazelcast IMap.
Then I would use this approach http://www.c2b2.co.uk/middleware-blog/hazelcast-websockets.php to provide the websockets for the end users.
I have a mobile app that generates events frequently and there are millions of users who will use this app.
What's the best way to capture these events and persist them into HDFS for later analysis?
As I assume from your tags, you are inclined to use Kafka, and Flume with a Kafka source and an HDFS sink. Your mobile app can publish data to a Kafka topic, and then, using the Kafka source or the Kafka channel (in case you do not need interceptors), you can consume these events and write them to the HDFS sink. Kafka is scalable, so you don't have to worry about handling a high rate of events. However, I would suggest you use HBase as the data store. It will let you access each event later in O(1) time. This can be done with the HBase sink. Check out this article from the Cloudera blog.
I have an analytics server (for example, a click counter). I want to send data to Druid using some API. How should I do that?
Can I use it as a replacement for Google Analytics?
As se7entyse7en said:
You can ingest your data to Kafka and then use druid's Kafka
firehose to ingest your data to druid through real-time ingestion.
After that you can interactively query druid using its api.
It must be said that firehoses can be set up only on Druid realtime nodes.
Here is a tutorial on how to set up the Kafka firehose: Loading Streaming Data.
Besides the Kafka firehose, you can set up the other provided firehoses (Amazon S3 firehose, RabbitMQ firehose, etc.) by including them, and you can even write your own firehose as an extension; an example is here. Here are all the Druid extensions.
It must be said that Druid is shifting real-time ingestion from realtime nodes to the Indexing service, as explained here.
Right now the best practice is to run a Realtime Index Task on the Indexing Service, and then use Druid's API to send data to this task. You can use the API directly, but it's far easier to use Tranquility. It's a library that will automatically create a new Realtime Index Task for new segments, and it lets you send messages to the right task. You can also set the replication and sharding level, etc. Just run the indexing service, use Tranquility, and you can start sending your messages to Druid.
You can ingest your data to Kafka and then use druid's Kafka firehose to ingest your data to druid through real-time ingestion. After that you can interactively query druid using its api.
Assuming your Druid is a 0.9.x version, the best way to ingest is Tranquility. The REST API is pretty solid and allows you to control your data schema. Go to the druid.io quickstart page and see the "Load streaming data" section.
I am loading clickstream data for our website in real time and it's been working very well. So yes, you can replace Google Analytics with Druid (assuming you have the required infrastructure).