How can I cache streaming data on Kafka?

I am using Kafka as a streaming data layer. A Node.js application consumes data from Kafka while a C++ application produces streaming data and writes it to Kafka. It works fine, but I'd like to know whether I can use Kafka to cache the streaming data and let the Node.js app query it.
I have a requirement to support basic request/response access for other clients. Otherwise I will have to save the streaming data in Redis from my Node.js app and build an endpoint that clients can query.
If Kafka supports caching and querying, I don't need to bring Redis into this architecture.

A Kafka Streams KTable can act as a cache and supports key-value lookups via its Interactive Queries feature. However, this API is only available in Java, and the RPC layer must be set up manually (for example, HTTP + JSON).
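For a rough picture of what that looks like on the Java side (the topic and store names below are just placeholders), a KTable materialized as a queryable state store can be read like this, and you would then put the lookup behind your own HTTP endpoint for the Node.js client to call:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class StreamCacheExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-cache");       // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Materialize the topic as a KTable backed by a queryable state store
            builder.table("streaming-data", Materialized.as("data-store"));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();

            // Interactive Queries: key-value lookup against the local state store.
            // In a real service you would wait until the store is ready (RUNNING state)
            // and expose this behind an HTTP endpoint.
            ReadOnlyKeyValueStore<String, String> store = streams.store(
                    StoreQueryParameters.fromNameAndType("data-store", QueryableStoreTypes.keyValueStore()));
            System.out.println(store.get("some-key"));
        }
    }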

Related

Apache Kafka vs. HTTP API

I am new to Apache Kafka and its streaming services and related APIs, but I was wondering whether there is any formal documentation on where to obtain the initial raw data required for ingestion.
In essence, I want to try my hand at building a rudimentary crypto trading bot, but I was under the impression that HTTP APIs may have more latency than APIs that integrate with Kafka Streams. For example, I know RapidAPI has a library of HTTP APIs that could help pull data, but I was unsure whether there is something similar if I wanted the data to be ingested through Kafka Streams. I guess I am under the impression that the two data sources will not be similar and will differ in some way, but I am also unsure whether that is the case.
I tried digging around on Google, but it's not very clear what APIs or source data Kafka Streams takes, or whether they are the same/similar and just handled differently.
Any insight or documentation would be greatly appreciated. Also, feel free to let me know if my understanding is completely wrong.
any formal documentation on where to obtain the initial raw data required for ingestion?
Kafka accepts binary data. You can feed in serialized data from anywhere (although you are restricted by configurable message size limits).
APIs that integrate with Kafka Streams
Kafka Streams is an intra-cluster library; it doesn't integrate with anything but Kafka.
If you want to periodically poll/fetch an HTTP/1 API, then you would use a regular HTTP client and a Kafka producer.
The answer is probably similar for streaming HTTP/2 or WebSocket sources, although you still cannot use Kafka Streams there, and you'd have to deal with batching records into Kafka producer requests.
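A bare-bones sketch of that poll-and-produce loop (the endpoint URL, topic name, and poll interval are placeholders) might look like:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class HttpToKafka {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            HttpClient http = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("https://api.example.com/ticker"))  // hypothetical REST endpoint
                    .GET()
                    .build();

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    // Poll the HTTP API and forward the raw JSON payload into Kafka
                    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
                    producer.send(new ProducerRecord<>("ticker-events", response.body()));
                    Thread.sleep(5_000);  // poll interval; tune to the API's rate limits
                }
            }
        }
    }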
You should instead look for Kafka Connect projects on the web that work with HTTP, or opt for something like Apache NiFi as a broader project with lots of different "processors" such as GetHTTP and PublishKafka.
Once the data is in Kafka, you are welcome to use Kafka Streams or KSQL to do some processing.

Kafka to MongoDB using Spring Cloud Data Flow

I'm working on a project where I have to take data coming from a Kafka cluster, process it, and send it to MongoDB. The application should be deployable on Pivotal Cloud Foundry. After doing some research on the internet, I found the Spring Cloud Data Flow toolkit interesting since it can be deployed on PCF. I'm wondering how we can use it to create our real-time streaming pipeline. For the moment, I'm thinking about using Kafka Streams and Spring Cloud Stream to process and transform the streams of topics, but I don't know how to integrate that into SCDF or how we can send those streams to MongoDB. I'm sorry if my question is not clear; I'm entirely new to those frameworks.
Thanks in advance
You could use the named-destination support in SCDF to directly consume events from Kafka or any other Spring Cloud Stream supported message broker implementations.
Now, for the write portion, you can use the out-of-the-box MongoDB sink application that we build, maintain, and ship.
If you have to do some processing before you write to MongoDB, you can create a custom Spring Cloud Stream application with the desired binder implementation [see: dev-guide/docs].
To put this all together, assume you have events coming from a Kafka topic named Customers, a custom processor doing some transformation on each of the received payloads (let's call the processor CustomerTransformer), and finally the MongoDB sink doing the writing.
This whole streaming data pipeline use case can also be designed visually from SCDF's Dashboard.
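As a rough sketch (the CustomerTransformer name, the Customer fields, and the functional-binding style are assumptions rather than anything SCDF prescribes), the custom processor could look something like:

    import java.util.function.Function;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.context.annotation.Bean;

    @SpringBootApplication
    public class CustomerTransformerApplication {

        public static void main(String[] args) {
            SpringApplication.run(CustomerTransformerApplication.class, args);
        }

        // Spring Cloud Stream binds this Function to the configured input/output
        // destinations (Kafka topics when using the Kafka binder).
        @Bean
        public Function<Customer, Customer> customerTransformer() {
            return customer -> {
                customer.setName(customer.getName().toUpperCase());  // example transformation
                return customer;
            };
        }
    }

    // Hypothetical payload type; the real schema depends on your Customers topic.
    class Customer {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

Registered as a processor application in SCDF, a stream definition along the lines of :Customers > customer-transformer | mongodb (names illustrative) then wires the named destination, the processor, and the MongoDB sink together.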

Ingest mobile event data into HDFS

I have a mobile app that generates events frequently and there are millions of users who will use this app.
What's the best way to capture these events and persist them into HDFS for later analysis?
As I gather from your tags, you are inclined to use Kafka and Flume with a Kafka source and an HDFS sink. Your mobile app can publish data to a Kafka topic, and then by using a Kafka source or a Kafka channel (in case you do not need interceptors) you can consume these events and write them through an HDFS sink. Kafka is scalable, so you don't have to worry about handling a high rate of events. However, I would suggest you use HBase as the data storage. It will later allow you to access each event in O(1) time. This can be done with the HBase sink. Check out this article from the Cloudera blog.
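To illustrate the O(1) lookup (the table name, column family, and row-key scheme below are assumptions), fetching a single event by row key with the HBase Java client looks roughly like:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table events = connection.getTable(TableName.valueOf("mobile_events"))) {  // hypothetical table
                // Row-key design is up to you, e.g. userId + event timestamp
                Get get = new Get(Bytes.toBytes("user42#1617000000000"));
                Result result = events.get(get);
                byte[] payload = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));  // assumed family/qualifier
                System.out.println(payload == null ? "not found" : Bytes.toString(payload));
            }
        }
    }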

Kafka Streams API vs Consumer API

I need to read from a specific Kafka topic, do some VERY short processing on each message, and pass it on to a different Kafka cluster.
Currently, I'm using a consumer that's also a producer on the other Kafka cluster.
However, the Streams API supposedly offers a more lightweight, high-throughput option.
So the questions are:
Assuming my processing code doesn't require much horsepower, is the Streams API better?
Does the Streams API support writing to a different Kafka cluster?
What are the Streams API's cons compared to the Consumer API?
Unfortunately, Kafka Streams doesn't currently support writing to a different Kafka cluster.
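So the consumer-plus-producer bridge you already run is the standard approach; a minimal sketch (cluster addresses and topic names are placeholders) would be:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ClusterBridge {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "source-cluster:9092");   // source cluster
            consumerProps.put("group.id", "bridge");
            consumerProps.put("key.deserializer", StringDeserializer.class.getName());
            consumerProps.put("value.deserializer", StringDeserializer.class.getName());

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "target-cluster:9092");   // target cluster
            producerProps.put("key.serializer", StringSerializer.class.getName());
            producerProps.put("value.serializer", StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("source-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // the "very short processing" happens here before forwarding
                        String transformed = record.value().trim();
                        producer.send(new ProducerRecord<>("target-topic", record.key(), transformed));
                    }
                }
            }
        }
    }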

How do I input real-time data to Druid?

I have an analytics server (for example, a click counter). I want to send data to Druid using some API. How should I do that?
Can I use it as a replacement for Google Analytics?
As se7entyse7en said:
You can ingest your data to Kafka and then use Druid's Kafka firehose to ingest your data to Druid through real-time ingestion. After that you can interactively query Druid using its API.
It must be said that firehoses can be set up only on Druid realtime nodes.
Here is a tutorial on how to set up the Kafka firehose: Loading Streaming Data.
Besides the Kafka firehose, you can set up the other provided firehoses (Amazon S3 firehose, RabbitMQ firehose, etc.) by including them, and you can even write your own firehose as an extension; an example is here. Here are all the Druid extensions.
It must be said that Druid is shifting real-time ingestion from realtime nodes to the Indexing Service, as explained here.
Right now the best practice is to run a Realtime Index Task on the Indexing Service, and then you can use Druid's API to send data to this task. You can use the API directly, but it's far easier to use Tranquility. It's a library that will automatically create a new Realtime Index Task for new segments, and it will route your messages to the right task. You can also set the replication and sharding level, etc. Just run the Indexing Service, use Tranquility, and you can start sending your messages to Druid.
You can ingest your data to Kafka and then use Druid's Kafka firehose to ingest your data to Druid through real-time ingestion. After that you can interactively query Druid using its API.
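To give an idea of that querying step (the broker address and datasource name here are placeholders), a native timeseries query POSTed to a Druid broker could look like:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DruidQueryExample {
        public static void main(String[] args) throws Exception {
            // Native Druid query: hourly event counts for one day (datasource and interval are made up)
            String query = "{"
                    + "\"queryType\": \"timeseries\","
                    + "\"dataSource\": \"clicks\","
                    + "\"granularity\": \"hour\","
                    + "\"intervals\": [\"2016-01-01/2016-01-02\"],"
                    + "\"aggregations\": [{\"type\": \"count\", \"name\": \"events\"}]"
                    + "}";

            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8082/druid/v2/"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());   // JSON array of timestamped aggregates
        }
    }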
Considering your Druid is a 0.9.x version, the best way to go is Tranquility. The REST API is pretty solid and allows you to control your data schema. Go to the druid.io quickstart page and hit the "Load streaming data" section.
I am loading clickstream data for our website in real time and it's been working very well. So yes, you can replace Google Analytics with Druid (assuming you have the required infrastructure).