Prometheus use case for high-scale clickstream data - apache-kafka

We are getting website clickstream data at a scale of about 100 million events per minute.
Our current pipeline takes all these events into Kafka, from there Spark does the aggregation, and the resulting
simple metrics are stored in Graphite.
What I am thinking of is ingesting into Prometheus directly from Kafka, taking the 100 million events in a distributed fashion across n consumer instances.
The Spark aggregations are simple, which is why I am considering this direct ingestion.
My question is: has anyone done such a use case at this scale with Prometheus?
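For reference, what this usually looks like in practice is not pushing raw events into Prometheus (it stores time series, not individual events) but running n Kafka consumer instances that aggregate events into counters and expose them for scraping. A minimal sketch in Scala, assuming the Prometheus Java simpleclient; the topic name, port, label, and extractEventType helper are made up for illustration:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import io.prometheus.client.Counter
import io.prometheus.client.exporter.HTTPServer

object ClickstreamExporter extends App {
  // One counter per event type; Prometheus scrapes the totals, never the raw events.
  val events = Counter.build()
    .name("clickstream_events_total")
    .help("Clickstream events consumed from Kafka, by event type.")
    .labelNames("event_type")
    .register()

  new HTTPServer(9091) // /metrics endpoint for the Prometheus scraper

  // Hypothetical helper: pull the event type out of the JSON payload.
  def extractEventType(json: String): String = "click"

  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "clickstream-prometheus-exporter")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("clickstream"))

  while (true) {
    consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
      events.labels(extractEventType(record.value)).inc()
    }
  }
}
```

Each instance only counts the partitions assigned to its consumer group member, so Prometheus scrapes n endpoints and the totals are combined at query time, e.g. sum(rate(clickstream_events_total[1m])).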

Related

Publishing custom application metrics from Spark streaming job

I am running a Spark streaming job in which a few million Kafka events are processed in a sliding window (24-hour window size, sliding every 3 minutes). The job has multiple steps, including aggregating and filtering the events based on certain fields, joining them with static RDDs loaded from files in S3, and finally running an MLlib transformation on each aggregated row of the window, publishing the results to a Kafka topic.
I need a way to publish a bunch of application metrics, starting with how much time it takes to complete processing for each window, how many raw events are processed, the data size in bytes being processed, etc.
I've searched through all the events that Spark publishes, and the executor-level events don't give me what I need. I'm trying out Kamon and Spark's MetricsSource and Sink for now.
Any suggestions on the best way to accomplish this?
Also, I'm using Spark 2.4 for now, as the original codebase is pretty old, but we will be migrating to Spark 3.x soon.
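Since Kamon is already on the list, here is a minimal sketch of the driver-side recording, assuming Kamon 2.x with a reporter module (e.g. kamon-prometheus or kamon-graphite) on the classpath; the metric names and the record(...) helper are made up:

```scala
import java.util.concurrent.TimeUnit
import kamon.Kamon

object StreamingJobMetrics {
  Kamon.init() // loads kamon.* config and starts whatever reporter module is on the classpath

  private val windowDuration = Kamon.timer("window.processing.time").withoutTags()
  private val rawEvents      = Kamon.counter("window.raw.events").withoutTags()
  private val bytesProcessed = Kamon.histogram("window.bytes.processed").withoutTags()

  // Call once per completed window, e.g. at the end of the foreachRDD/foreachBatch body.
  def record(durationMillis: Long, eventCount: Long, bytes: Long): Unit = {
    windowDuration.record(TimeUnit.MILLISECONDS.toNanos(durationMillis))
    rawEvents.increment(eventCount)
    bytesProcessed.record(bytes)
  }
}
```

Because the metrics are recorded on the driver, this sidesteps the executor-level Spark events entirely. The custom Spark MetricsSource route works too, but typically means placing the class under the org.apache.spark package tree, since the Source trait is package-private in Spark 2.4.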

ksqlDB for finding average last hour, and store results back to a kafka topic?

We have a Redpanda (Kafka-compatible) source with sensor data. Can we do the following:
Every hour, find the average sensor data last hour for each sensor
Store them back to a topic
You want to create a materialized view over the stream of events that can be queried by other applications. Your source publishes the individual events to Kafka/Redpanda, and another process observes the events and makes them available as queryable "tables" to other applications. Elaborating a few options:
ksqlDB is likely the default choice, as it comes "native" in the Kafka/Confluent stack. Be careful with running it against your production Kafka cluster; it has a heavy impact on cluster performance. See the basic tutorial or the advanced tutorial.
Use an out-of-the-box solution for materialized views such as Materialize. It is the easiest to set up and use and doesn't stress the Kafka broker. However, it is single-node only as of now (06/2022). See the tutorial.
Another popular option is using a stream processor and storing the hourly aggregates in an attached database (for example Flink storing data in Redis). This is a do-it-yourself approach; see the sketch below. Have a look at Hazelcast: it is one process running both the stream processing services and a queryable store.
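For the do-it-yourself route, a hedged sketch of the same hourly average written with the Kafka Streams Scala DSL (the engine underneath ksqlDB). It assumes Kafka Streams 3.x, a string sensor id as the record key, the reading as a plain numeric string, and made-up topic names:

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object HourlySensorAverage extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-sensor-average")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  builder.stream[String, String]("sensor-readings")
    .groupByKey
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    // Keep a running "sum|count" per sensor and hour window.
    .aggregate("0.0|0") { (_, reading, acc) =>
      val Array(sum, count) = acc.split('|')
      s"${sum.toDouble + reading.toDouble}|${count.toLong + 1}"
    }
    .toStream
    .map { (windowedKey, acc) =>
      val Array(sum, count) = acc.split('|')
      // Key the output record by "<sensorId>@<windowStartMillis>", value is the average.
      (s"${windowedKey.key}@${windowedKey.window.start}", (sum.toDouble / count.toLong).toString)
    }
    .to("sensor-hourly-averages")

  new KafkaStreams(builder.build(), props).start()
}
```

As written this emits an updated average every time a reading arrives within the hour; adding suppress(Suppressed.untilWindowCloses(...)) before toStream would restrict the output topic to one final record per sensor per hour.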

Calculating delta values using kafka streams

I have some metrics written to a Kafka topic with a timestamp. I need to calculate the delta between the current value and the previous value of the metric. I would like to do this via the Kafka Streams API or KSQL to scale better than my current solution.
What I have now is a simple Kafka producer/consumer in Python that reads one metric at a time and calculates the delta against the previous value stored in a Redis database.
Some example code to accomplish this via Kafka streams API would be highly appreciated.
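A hedged sketch of one way to do this with the Kafka Streams Scala DSL: keep the previous reading in the aggregate's state store (taking over the role of Redis) and emit the difference. The topic names and the "lastValue|delta" string encoding of the state are assumptions for illustration; the record key is assumed to be the metric name:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

object MetricDeltas extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metric-deltas")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Input: key = metric name, value = current reading as a numeric string.
  builder.stream[String, String]("metrics")
    .groupByKey
    // State per metric: "lastValue|delta". The first reading produces a delta of 0.
    .aggregate("") { (_, current, state) =>
      val previous = if (state.isEmpty) current.toDouble else state.split('|')(0).toDouble
      s"$current|${current.toDouble - previous}"
    }
    .toStream
    .mapValues(_.split('|')(1)) // forward only the delta
    .to("metric-deltas")

  new KafkaStreams(builder.build(), props).start()
}
```

The state store is partitioned by the record key, so throughput scales with the partition count of the metrics topic instead of being bounded by a single Python consumer plus Redis.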

Ingesting unique records in Kafka-Spark Streaming

I have a Kafka topic getting 10K events per minute and a Spark Streaming 2.3 consumer written in Scala that receives the events and ingests them into Cassandra. Incoming events are JSON with a 'userid' field, among others. However, if an event with the same userid comes along again (even with a different message body), I don't want it to be ingested into Cassandra. The Cassandra table grows every minute and day, so looking up all the userids encountered so far by loading the table into an in-memory Spark DataFrame is impossible, as the table becomes huge. How can I best ingest only unique records?
Can updateStateByKey work? And how long can state be maintained? Because if the same userid comes again after one year, I don't want to ingest it into Cassandra.
Use an external low-latency DB like Aerospike, or, if the rate of duplicates is low, use an in-memory Bloom/cuckoo filter (that is ~4 GB for 1 year at a 10K-per-minute rate), rechecking matches against Cassandra so as not to discard events on false positives.
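A rough sketch of the Bloom-filter variant in Scala, using Guava's BloomFilter; the existsInCassandra callback is hypothetical and stands for a single-partition point read on userid:

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

object UserIdDeduper {
  // Roughly a year of userids at 10K events/min; the false-positive rate drives the memory estimate above.
  private val expectedIds = 10000L * 60 * 24 * 365
  private val seen = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), expectedIds, 0.03)

  // existsInCassandra is only consulted on a possible hit, i.e. on the small fraction of lookups
  // that are either real duplicates or false positives.
  def shouldIngest(userId: String, existsInCassandra: String => Boolean): Boolean = {
    if (!seen.mightContain(userId) || !existsInCassandra(userId)) {
      seen.put(userId) // remember the id so the next occurrence is caught without touching Cassandra
      true
    } else {
      false // confirmed duplicate: skip the Cassandra insert
    }
  }
}
```

The filter has to sit where the check happens: either one filter per executor with the stream repartitioned by userid, or a single dedicated dedup stage in front of the Cassandra writes; otherwise each executor only sees part of the history.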

Use Kafka topics to store data for many years

I am looking for a way of collecting metrics data from multiple devices. The data should be aggregated by multiple "group by"-like functions. The list of aggregation functions is not complete: new aggregations will be added later, and they will need to be computed over all the data collected from the first days.
Is it fine to create a Kafka topic with a 100-year retention period and use it as a datastore for this purpose? That way new aggregations would be able to read from the topic's start while existing aggregations continue from their offsets.
In principle, yes you can use Kafka for long-term storage, exactly for the reason you outline - reprocessing of source data to derive additional aggregates/calculations.
A couple of references:
https://www.confluent.io/blog/okay-store-data-apache-kafka/
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
Yes, if you want to keep the data you can just increase the retention time to a large value.
I'd still recommend having a size-based retention policy to ensure you don't run out of disk space.
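For reference, a small sketch of setting that up through the Kafka AdminClient from Scala: retention.ms = -1 turns off time-based deletion, while retention.bytes caps each partition's size as the safety net (the topic name and the 500 GB cap are made up):

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.{ConfigResource, TopicConfig}

object TopicRetention extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  val admin = AdminClient.create(props)

  val topic = new ConfigResource(ConfigResource.Type.TOPIC, "device-metrics")
  val ops: java.util.Collection[AlterConfigOp] = Seq(
    // retention.ms = -1 means records are never deleted because of age
    new AlterConfigOp(new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "-1"), AlterConfigOp.OpType.SET),
    // cap each partition at ~500 GB so the brokers cannot silently run out of disk
    new AlterConfigOp(new ConfigEntry(TopicConfig.RETENTION_BYTES_CONFIG,
      (500L * 1024 * 1024 * 1024).toString), AlterConfigOp.OpType.SET)
  ).asJava

  admin.incrementalAlterConfigs(Map(topic -> ops).asJava).all().get()
  admin.close()
}
```

The same two settings can also be applied at topic creation time or with the kafka-configs tool.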