Visualise data from kafka consumer - apache-kafka

I'm new to Kafka. I have written a simple producer script which writes JSON data in the form of latlong coordinates, and another consumer app that can read the latlong data as it is being produced. If I wish to plot the latlong data using D3.js or Highcharts, how do I do it? Any suggestions or links would be greatly appreciated. I did my research but I couldn't find any relevant tutorials yet.

D3.js is a front-end technology; Kafka is backend.
You need some mechanism to forward data from a Kafka consumer to a given browser. Websockets are one option, and there are several resources out there about it.
For example, https://github.com/sulthan309/Live-Dashboard-using-Kafka-and-Spring-Websocket
The alternative is to feed the data into a system like Druid, InfluxDB, Elasticsearch, or another storage system intended for time-series numerical data, from which you can use real BI tools to analyze it.
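The consumer-to-websocket bridge described above can be sketched as follows. This is a minimal sketch, not a production design: it assumes the `kafka-python` and `websockets` packages (`pip install kafka-python websockets`), a topic named `latlong`, and messages shaped like `{"lat": ..., "lon": ...}` — all placeholders for your own setup.

```python
import json

def to_plot_point(raw_bytes):
    """Decode one Kafka message (JSON bytes) into the {lat, lon} dict a
    D3.js/Highcharts front end can plot. Raises on malformed input."""
    data = json.loads(raw_bytes.decode("utf-8"))
    return {"lat": float(data["lat"]), "lon": float(data["lon"])}

def run_bridge(topic="latlong", kafka_servers="localhost:9092", ws_port=8765):
    """Consume from Kafka and broadcast each point to connected browsers.
    Third-party imports are kept inside the function so the pure helper
    above stays importable without Kafka installed."""
    import asyncio
    import websockets                 # third-party; recent (v10+) API assumed
    from kafka import KafkaConsumer   # third-party (kafka-python)

    clients = set()

    async def handler(ws):
        # Track each connected browser until its socket closes.
        clients.add(ws)
        try:
            await ws.wait_closed()
        finally:
            clients.discard(ws)

    async def pump():
        # consumer_timeout_ms makes the iterator return periodically so
        # the event loop is not starved forever between messages.
        consumer = KafkaConsumer(topic, bootstrap_servers=kafka_servers,
                                 consumer_timeout_ms=1000)
        while True:
            for msg in consumer:
                point = json.dumps(to_plot_point(msg.value))
                await asyncio.gather(*(c.send(point) for c in clients),
                                     return_exceptions=True)
            await asyncio.sleep(0.1)

    async def main():
        async with websockets.serve(handler, "localhost", ws_port):
            await pump()

    asyncio.run(main())

# To run against a live broker: run_bridge()
```

On the browser side, a plain `new WebSocket("ws://localhost:8765")` handler can push each incoming point into the D3/Highcharts series. For serious loads you would use an async Kafka client (e.g. aiokafka) rather than a blocking consumer inside the event loop.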

Related

Apache Kafka vs. HTTP API

I am new to Apache Kafka and its stream services related to APIs, but was wondering if there was any formal documentation on where to obtain the initial raw data required for ingestion?
In essence, I want to try my hand at building a rudimentary crypto trading bot, but was under the impression that HTTP APIs may have more latency than APIs that integrate with Kafka Streams. For example, I know RapidAPI has a library of HTTP APIs that would help pull data, but was unsure if there was something similar if I wanted the data to be ingested through Kafka Streams. I am under the impression that the two data sources will differ in some way, but am unsure whether that is actually the case.
I tried digging around on Google, but it's not very clear what APIs or source data Kafka Streams works with, or whether they are the same/similar and just handled differently.
If you have any insight or documentation, that would be greatly appreciated. Also, feel free to let me know if my understanding is completely wrong.
any formal documentation on where to obtain the initial raw data required for ingestion?
Kafka accepts binary data. You can feed in serialized data from anywhere (although you are restricted by (configurable) message size limits).
APIs that integrate with Kafka Streams
Kafka Streams is an intra-cluster library; it doesn't integrate with anything but Kafka.
If you want to periodically poll/fetch an HTTP/1 API, you would use a regular HTTP client and a Kafka producer.
The answer is probably similar for streaming HTTP/2 or websocket sources, although you're still not able to use Kafka Streams, and you'd have to deal with batching records into a Kafka producer request.
Instead, you should look for Kafka Connect projects on the web that operate with HTTP, or opt for something like Apache NiFi as a broader project with lots of different "processors" like GetHTTP and PublishKafka.
Once the data is in Kafka, you are welcome to use Kafka Streams/KSQL to do some processing.
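The poll-then-produce pattern above can be sketched like this. It's a sketch under stated assumptions, not a definitive implementation: the URL, topic name, and response shape (`{"symbol": ..., "price": ...}`) are placeholders, and the Kafka part assumes `pip install kafka-python`.

```python
import json

def to_kafka_record(ticker):
    """Turn one API response dict into a (key, value) pair of bytes.
    Keying by symbol sends each instrument's updates to one partition,
    which preserves per-symbol ordering."""
    key = ticker["symbol"].encode("utf-8")
    value = json.dumps(ticker).encode("utf-8")
    return key, value

def poll_loop(url, topic="prices", interval=5.0):
    """Poll an HTTP API forever and produce each response to Kafka.
    Third-party import kept inside the function so the helper above
    is usable without Kafka installed."""
    import time
    import urllib.request
    from kafka import KafkaProducer   # third-party (kafka-python)

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    while True:
        with urllib.request.urlopen(url) as resp:
            ticker = json.loads(resp.read())
        key, value = to_kafka_record(ticker)
        producer.send(topic, key=key, value=value)  # batched internally
        time.sleep(interval)
```

Note that the producer already batches sends internally (`linger.ms`/`batch.size`), so the loop itself can stay simple; the latency floor is still the HTTP poll interval, which is the point the answer makes about HTTP vs. native streaming.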

Solution architecture Kafka

I am working with a third-party vendor whom I asked to provide me the events generated by a website.
The vendor proposed to stream the events using Kafka ... why not...
On my side (the client) I am running a 100% MSSQL/Windows production environment, and internal business users want KPIs and dashboards on website activity.
Now the question - what would be the architecture to support a PoC, so I can manage the inputs on one hand and create data marts to deliver on business needs?
It's not clear what you mean by "events from a website". Your Kafka producers are typically server-side components: as you make API requests, you'd put Kafka-producing events between those requests and your database calls. I would be surprised if any third party could just do that immediately.
Maybe you're looking for something like https://divolte.io/
You can also use CDC products to stream events out of your database
The architecture could be like this: the app streams events to Kafka; you write a service that reads the data from Kafka, does the transformation, and writes to a database; you then build the dashboard on top of the DB.
Alternatively, you can populate indexes in Elasticsearch and build a Kibana dashboard as well.
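The consume-transform-load service described above can be sketched as follows. This is a minimal sketch: it uses sqlite3 for brevity (for the MSSQL environment mentioned in the question you would swap in pyodbc with the same pattern), assumes `pip install kafka-python`, and the event field names are placeholders for whatever schema the vendor actually sends.

```python
import json
from datetime import datetime, timezone

def to_row(raw_bytes):
    """Flatten one raw website event (JSON bytes) into the columns of a
    fact table, stamping the load time. Field names are illustrative."""
    event = json.loads(raw_bytes.decode("utf-8"))
    return (
        event["event_type"],
        event["page"],
        event.get("user_id"),                     # may be absent
        datetime.now(timezone.utc).isoformat(),   # load timestamp
    )

def run_loader(topic="website-events"):
    """Consume events from Kafka and insert one row per event.
    Third-party import kept inside the function so to_row() stays
    importable without Kafka installed."""
    import sqlite3
    from kafka import KafkaConsumer   # third-party (kafka-python)

    conn = sqlite3.connect("events.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS web_events
                    (event_type TEXT, page TEXT, user_id TEXT, loaded_at TEXT)""")
    consumer = KafkaConsumer(topic, bootstrap_servers="localhost:9092")
    for msg in consumer:
        conn.execute("INSERT INTO web_events VALUES (?, ?, ?, ?)",
                     to_row(msg.value))
        conn.commit()   # for a PoC; batch commits in production
```

The data marts for the business dashboards would then be views or aggregate tables built over `web_events` with ordinary SQL.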
My suggestion would be to use the Lambda architecture to cater to both real-time and batch processing needs:
Architecture:
Lambda architecture is designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.
This architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.

How can I process data from Kafka with PySpark?

I want to process log data streamed from Kafka with PySpark and save it to Parquet files, but I don't know how to read the data into Spark. Please help, thanks.
My answer is at a high level. You need to use Spark Streaming and have some basic understanding of messaging systems like Kafka.
The application that sends data into Kafka (or any messaging system) is called a "producer", and the application that reads data from Kafka is called a "consumer". When a producer sends data, it sends it to a specific "topic". Multiple producers can send data to the Kafka layer under different topics.
You basically need to create a consumer application. To do that, first identify the topic you are going to consume data from.
You can find many sample programs online. The following page can help you build your first application:
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
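For recent Spark versions, the same pipeline is usually written with Structured Streaming rather than the DStream API in that tutorial. A minimal sketch, assuming pyspark with the spark-sql-kafka package on the classpath; the broker address, topic, paths, and the `"LEVEL message"` log format are all placeholder assumptions:

```python
def parse_log_line(line):
    """Split one assumed 'LEVEL message' log line into fields; adjust
    to your real log format (this could be applied via a Spark UDF or
    replaced by from_json for JSON logs)."""
    level, _, message = line.partition(" ")
    return {"level": level, "message": message}

def start_kafka_to_parquet(servers="localhost:9092", topic="logs",
                           out_dir="/tmp/logs-parquet",
                           checkpoint="/tmp/logs-checkpoint"):
    """Read a Kafka topic with Spark Structured Streaming and sink it to
    Parquet files. Imports are inside the function so parse_log_line()
    stays usable without pyspark installed."""
    from pyspark.sql import SparkSession   # third-party
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", servers)
          .option("subscribe", topic)
          .load())

    # Kafka exposes binary key/value columns; cast value to a string line.
    lines = df.select(col("value").cast("string").alias("line"))

    query = (lines.writeStream
             .format("parquet")
             .option("path", out_dir)
             .option("checkpointLocation", checkpoint)  # required for file sinks
             .start())
    query.awaitTermination()
```

The checkpoint location is what gives the file sink exactly-once semantics across restarts, which is why the parquet sink refuses to start without it.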

Kafka Messages Processing

I am using Kafka for message processing in a Spring Boot application. My application produces messages, on an event basis, to three different topics. There is a separate Spring Boot application which will be used by a data analysis team to analyze the data. That application is a simple report-type application with only one filter, the topic.
Now I have to implement this, but I am a little confused about how to show the data in the UI. I have written listeners (consumers) that consume the messages, but how do I show the data in the UI in real time? Should I store it in some database like Redis and then show that data in the UI? Is this the correct way to deal with a consumer in Kafka? Won't it be slow, as messages can grow drastically over time?
In a nutshell, I want to know how we can show messages in any UI efficiently and in real time.
Thanks
You can write a consumer to forward to a websocket.
Or you can use Kafka Connect to write to a database, then put a REST API in front of it.
Or use the Kafka Streams Interactive Queries feature and add an RPC layer on top for JavaScript to call.
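One way to keep the "won't it be slow as messages grow" concern under control is to have the consumer keep only a bounded window of recent messages, since a live dashboard rarely needs more than the latest N. A minimal sketch, assuming `pip install kafka-python` and a placeholder topic name; the REST layer (Flask, Spring, etc.) would simply serve `latest_as_json()`:

```python
import json
from collections import deque

def make_buffer(maxlen=1000):
    """A bounded buffer: old messages fall off automatically, so memory
    stays constant no matter how much the topic grows."""
    return deque(maxlen=maxlen)

def latest_as_json(buffer, n=50):
    """JSON payload for the UI: the n most recent messages, newest first."""
    return json.dumps(list(buffer)[-n:][::-1])

def run_consumer(buffer, topic="reports"):
    """Fill the buffer from Kafka on a background thread; the web layer
    reads from the same buffer. Third-party import kept inside the
    function so the helpers above work without Kafka installed."""
    from kafka import KafkaConsumer   # third-party (kafka-python)
    consumer = KafkaConsumer(topic, bootstrap_servers="localhost:9092")
    for msg in consumer:
        buffer.append(json.loads(msg.value))
```

If the team also needs history (not just the live window), that's where the database option comes in: the bounded buffer serves the real-time view while Kafka Connect or a second consumer persists everything for later analysis.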

How realtime data input to Druid?

I have an analytics server (for example, a click counter). I want to send data to Druid using some API. How should I do that?
Can I use it as a replacement for Google Analytics?
As se7entyse7en said:
You can ingest your data to Kafka and then use Druid's Kafka firehose to ingest your data to Druid through real-time ingestion.
After that you can interactively query druid using its api.
It must be said that firehoses can be set up only on Druid realtime nodes.
Here is a tutorial on how to set up the Kafka firehose: Loading Streaming Data.
Besides the Kafka firehose, you can set up other provided firehoses - the Amazon S3 firehose, the RabbitMQ firehose, etc. - by including them, and you can even write your own firehose as an extension; an example is here. Here are all the Druid extensions.
It must be said that Druid is shifting real-time ingestion from realtime nodes to the Indexing service, as explained here.
Right now the best practice is to run a Realtime Index Task on the Indexing Service, and then you can use Druid's API to send data to this task. You can use the API directly, but it's far easier to use Tranquility. It's a library that automatically creates a new Realtime Index Task for new segments and allows you to send messages to the right task. You can also set the replication and sharding level, etc. Just run the Indexing Service, use Tranquility, and you can start sending your messages to Druid.
You can ingest your data to Kafka and then use Druid's Kafka firehose to ingest your data to Druid through real-time ingestion. After that you can interactively query Druid using its API.
Considering your Druid is a 0.9.x version, the best way is Tranquility. Its REST API is pretty solid and allows you to control your data schema. See the druid.io quickstart page and hit the "Load streaming data" section.
I am loading clickstream data for our website in real time and it has been working very well. So yes, you can replace Google Analytics with Druid (assuming you have the required infrastructure).
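For the clickstream case, sending events through Tranquility Server boils down to POSTing JSON with a timestamp column plus dimensions. A minimal sketch, with hedged assumptions throughout: the port, the `clicks` datasource name, and the event field names are all placeholders for your own Tranquility/Druid configuration.

```python
import json
from datetime import datetime, timezone

def build_click_event(page, user_id, ts=None):
    """One clickstream event in the shape Druid ingests: a timestamp
    column plus dimensions and a metric. Field names are illustrative
    and must match your ingestion spec."""
    ts = ts or datetime.now(timezone.utc).isoformat()
    return {"timestamp": ts, "page": page, "user_id": user_id, "clicks": 1}

def send_to_tranquility(events, url="http://localhost:8200/v1/post/clicks"):
    """POST a batch of events to an assumed Tranquility Server HTTP
    endpoint; batching amortizes the HTTP overhead per event."""
    import urllib.request
    req = urllib.request.Request(
        url,
        data=json.dumps(events).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Once events land, the `clicks` metric rolls up at ingestion time per your spec's granularity, which is what makes the "click counter" queries fast compared to scanning raw Google-Analytics-style hit logs.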