Calculating delta values using Kafka Streams

I have some metrics written to a Kafka topic with a timestamp. I need to calculate the delta between the current value and the previous value of each metric. I would like to do this via the Kafka Streams API or KSQL to scale better than my current solution.
What I have now is a simple Kafka producer/consumer in Python that reads one metric at a time and calculates the delta against the previous value stored in a Redis database.
Some example code to accomplish this via the Kafka Streams API would be highly appreciated.
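For reference, here is a minimal sketch of what this could look like with the Java Kafka Streams API. It keeps the previous value per metric key in a state store and emits the difference; the topic names ("metrics", "metric-deltas"), the store name and the use of Double values are assumptions for illustration, not details from the question.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

import java.util.Properties;

public class MetricDeltaApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metric-delta-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // State store that remembers the previous value for each metric key.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("previous-values"),
                Serdes.String(), Serdes.Double()));

        builder.stream("metrics", Consumed.with(Serdes.String(), Serdes.Double()))
               .transformValues(() -> new ValueTransformerWithKey<String, Double, Double>() {
                   private KeyValueStore<String, Double> store;

                   @SuppressWarnings("unchecked")
                   @Override
                   public void init(ProcessorContext context) {
                       store = (KeyValueStore<String, Double>) context.getStateStore("previous-values");
                   }

                   @Override
                   public Double transform(String metricName, Double value) {
                       Double previous = store.get(metricName);
                       store.put(metricName, value);
                       // No delta can be emitted for the first value of a metric.
                       return previous == null ? null : value - previous;
                   }

                   @Override
                   public void close() {}
               }, "previous-values")
               .filter((metricName, delta) -> delta != null)
               .to("metric-deltas", Produced.with(Serdes.String(), Serdes.Double()));

        new KafkaStreams(builder.build(), props).start();
    }
}

The persistent state store is backed by a changelog topic by default, so the previous values survive restarts and rebalances; on newer Kafka Streams versions, processValues() is the non-deprecated replacement for transformValues().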

Related

What happens internally when we run a KSQL query?

I am entirely new to Apache Kafka and KSQL, and I have a question I could not find the answer to.
My current understanding is that the events generated by a producer are stored internally in Kafka topics in serialized form (0s and 1s). If I create a Kafka stream to consume that data and then run a KSQL query, say one that uses the COUNT() function, will the output of that query be persisted in Kafka topics?
If that is the case, won't it incur a storage cost?
Behind the scenes, KSQL runs a Kafka Streams topology.
Any persisted streams or aggregated tables, such as the COUNT() result in your case, do indeed occupy storage in Kafka topics.
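To make the storage point concrete, a persistent COUNT() query compiles to roughly the kind of Kafka Streams topology sketched below (the topic and store names here are invented for illustration): the counts live in a local state store that Kafka Streams backs with an internal changelog topic, and any output stream or table is yet another topic, all of which occupy broker disk subject to the topics' retention/compaction settings.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class CountTopologySketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Roughly what "SELECT key, COUNT(*) FROM events GROUP BY key" boils down to.
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // The counts are kept in the "event-counts" state store; Kafka Streams
               // also writes them to an internal changelog topic, which is the part
               // that costs extra storage on the brokers.
               .count(Materialized.as("event-counts"))
               .toStream()
               .to("event-counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        // Prints the generated topology and its state stores.
        System.out.println(builder.build().describe());
    }
}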

How to interpret metric spring.cloud.stream.binder.kafka.offset from Actuator for Spring Cloud Stream Kafka consumer

I am exposing the Kafka offset metric through Spring Boot Actuator. When I query the following endpoint with curl:
http://server/actuator/metrics/spring.cloud.stream.binder.kafka.offset
I get the following:
{"name":"spring.cloud.stream.binder.kafka.offset",
"description":"Consumer lag for a particular group and topic",
"baseUnit":"seconds",
"measurements":[{"statistic":"VALUE","value":0.152}],
"availableTags":[{"tag":"topic","values":["kafka.topic.input"]},
{"tag":"group","values":["Consumer1"]}]
}
What does the measurement of 0.152 seconds mean?
Thanks
It was a bug, fixed in 2.1.
A TimeGauge (scaled to milliseconds) was used instead of a simple Gauge, so a raw lag of 152 was reported as 0.152 seconds.
In other words, the number means your total lag across all partitions was 152 messages.
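If it helps to see where the 0.152 comes from, here is a small standalone Micrometer sketch of the difference between the two meter types described above (the meter names and the SimpleMeterRegistry are just for illustration): a raw lag of 152 registered as a TimeGauge in milliseconds is reported in the registry's base unit of seconds as 0.152, while a plain Gauge reports 152 directly.

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.TimeGauge;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class LagGaugeSketch {
    public static void main(String[] args) {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();
        AtomicLong lag = new AtomicLong(152); // lag in messages, not a duration

        // A plain Gauge reports the raw value, 152.
        Gauge plain = Gauge.builder("lag.plain", lag, AtomicLong::doubleValue)
                           .register(registry);

        // What the buggy binder effectively did: a TimeGauge that treats 152 as
        // 152 milliseconds, which the registry reports in its base unit (seconds).
        TimeGauge timed = TimeGauge.builder("lag.timed", lag, TimeUnit.MILLISECONDS,
                                            AtomicLong::doubleValue)
                                   .register(registry);

        System.out.println(plain.value()); // 152.0
        System.out.println(timed.value()); // 0.152
    }
}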

Can we create a Kafka time-windowed stream from historical data?

I have some historical data, and each record has its own timestamp. I would like to read the records, feed them into a Kafka topic, and use Kafka Streams to process them in a time-windowed manner.
The question is: when I create a time-windowed aggregation in Kafka Streams, how can I tell it to build the windows from the timestamp field in the record instead of the actual wall-clock time?
You need to create a custom TimestampExtractor that extracts the value from the record itself; there is an example of this in the Kafka Streams documentation.
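A minimal sketch of such an extractor, assuming a hypothetical SensorReading value type that carries its own epoch-millisecond timestamp (adapt the field access to your actual record format):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class RecordFieldTimestampExtractor implements TimestampExtractor {

    // Hypothetical value type; replace with your own deserialized record class.
    public static class SensorReading {
        private final long timestampMillis;
        public SensorReading(long timestampMillis) { this.timestampMillis = timestampMillis; }
        public long getTimestampMillis() { return timestampMillis; }
    }

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof SensorReading) {
            // Use the timestamp embedded in the payload to drive windowing.
            return ((SensorReading) value).getTimestampMillis();
        }
        // Fall back to the record's own (broker/producer) timestamp.
        return record.timestamp();
    }
}

It is wired into the application through the streams configuration, e.g.:

props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, RecordFieldTimestampExtractor.class);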

Real-time event processing

I am looking for an architectural solution for the scenario below.
I have a source of events (say sensors in oil wells, around 50,000) that produces events to a server. On the server side I want to process all these events so that the information from the sensors about the latest humidity, temperature, pressure, etc. is stored/updated in a database.
I am unsure whether to use Flume or Kafka.
Can somebody please address this simple scenario in architectural terms?
I don't want to store the events anywhere else, since I am already updating the database with the latest values.
Do I really need Spark, i.e. (Flume/Kafka) + Spark, for the processing side?
Can we do any kind of processing using Flume without a sink?
It sounds like you need to use the Kafka producer API to publish the events to a topic, then read those events either with the Kafka consumer API and write to your database yourself, or with the Kafka JDBC sink connector.
Also, if you only need the latest data inside Kafka, take a look at log compaction.
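As a rough illustration of the producer half of that suggestion, here is a minimal Java sketch that keys each event by its sensor id; the topic name, sensor id and JSON payload are made up. Keying by sensor id is what lets a log-compacted topic retain only the latest reading per sensor, and a consumer or the JDBC sink connector can then mirror that latest state into the database.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SensorEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = sensor id, value = latest reading. With cleanup.policy=compact
            // on the topic, Kafka eventually keeps only the newest value per key.
            String sensorId = "well-42/sensor-7";
            String reading = "{\"humidity\":0.61,\"temperature\":88.2,\"pressure\":3012}";
            producer.send(new ProducerRecord<>("sensor-readings", sensorId, reading));
        }
    }
}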
Another option is to push all the messages to a Kafka topic and use Spark Streaming to ingest and process them; Spark Streaming can read directly from a Kafka topic.

What is the frequency with which partition offsets are queried by driver using the direct Kafka API in Spark Streaming?

Are the offsets queried for every batch interval or at a different frequency?
When you use the term offsets, I'm assuming you mean the offsets themselves and not the actual messages. Looking through the documentation, I was able to find two references to the direct approach.
The first one, from Apache Spark Docs
Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka (similar to read files from a file system).
This makes it sound like these are two independent actions: offsets are queried from Kafka and then assigned to be processed in a specific batch, and a single offset query can return offsets that cover multiple Spark batch jobs.
The second one, a blog post from databricks
Instead of receiving the data continuously using Receivers and storing it in a WAL, we simply decide at the beginning of every batch interval what is the range of offsets to consume. Later, when each batch’s jobs are executed, the data corresponding to the offset ranges is read from Kafka for processing (similar to how HDFS files are read).
This one makes it sound more like each batch interval itself determines the range of offsets to consume, and only when the batch's jobs run are those messages actually fetched from Kafka.
I have never worked with Apache Spark (I mainly use Apache Storm + Kafka), but since the first document suggests the two steps are independent, I would assume they can happen at different intervals, and the blog post simply doesn't get into that level of technical detail.
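For what it's worth, here is a minimal Java sketch of the direct approach (the topic name, group id and local master setting are placeholders), mainly to show where the batch interval lives: per the Databricks post quoted above, the offset range to consume is decided at the beginning of each such interval.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DirectStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("direct-stream-sketch").setMaster("local[2]");
        // The batch interval: at the start of each 5-second batch the driver
        // resolves the offset ranges to consume, per the docs quoted above.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "direct-stream-sketch");

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("my-topic"), kafkaParams));

        stream.foreachRDD(rdd -> System.out.println("records in this batch: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}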