Ingesting unique records in Kafka-Spark Streaming - scala

I have a Kafka topic receiving 10K events per minute and a Spark Streaming 2.3 consumer written in Scala that receives them and ingests them into Cassandra. The incoming events are JSON with a 'userid' field among others. If an event with the same userid comes along again (even with a different message body), I don't want it to be ingested into Cassandra. The Cassandra table grows every minute and every day, so looking up all userids encountered so far by loading the table into an in-memory Spark dataframe is not feasible, as the table will become huge. How can I best ingest only unique records?
Can updateStateByKey work? And how long can state be maintained? Because if the same userid turns up after a year, I still don't want to ingest it into Cassandra.

Use an external low-latency DB like Aerospike, or, if the rate of duplicates is low, an in-memory Bloom/cuckoo filter (roughly 4 GB for one year at a rate of 10K per minute), rechecking matches against Cassandra so that events are not discarded on false positives.
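To make the Bloom-filter route concrete, here is a minimal Scala sketch. It assumes Guava's BloomFilter and takes the Cassandra check as a caller-supplied function (a hypothetical point lookup by the userid partition key); the sizing and false-positive rate are rough choices in line with the estimate above, not tuned values.

    import java.nio.charset.StandardCharsets
    import com.google.common.hash.{BloomFilter, Funnels}

    object Dedup {
      // 10K events/min for a year is roughly 5.3 billion userids.
      val expectedIds: Long = 10000L * 60 * 24 * 365

      // ~3% false-positive rate; at this size the filter needs several GB of heap,
      // roughly in line with the ~4 GB estimate above.
      val seen: BloomFilter[CharSequence] =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), expectedIds, 0.03)

      // `existsInCassandra` is a hypothetical point lookup by the userid partition key.
      def shouldIngest(userid: String, existsInCassandra: String => Boolean): Boolean = {
        if (!seen.mightContain(userid)) {
          seen.put(userid)                // definitely unseen: remember it and ingest
          true
        } else if (!existsInCassandra(userid)) {
          seen.put(userid)                // false positive: the row isn't actually there
          true
        } else {
          false                           // genuine duplicate: skip the write
        }
      }
    }

Note that an in-memory filter like this only works if a given userid is always handled by the same instance (e.g. the stream is partitioned by userid); otherwise the external low-latency store is the simpler option.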

Related

Design stream pipeline using spark structured streaming and databricks delta to handle multiple tables

I am designing a streaming pipeline where I need to consume events from a Kafka topic.
A single Kafka topic can carry data from around 1000 tables, with each record arriving as JSON. I now have the following problems to solve.
Reroute messages to a separate folder per table: this is done with Spark Structured Streaming using partitionBy on the table name.
Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. I am not able to find the best way to do this: inferring the JSON schema dynamically and writing to the Delta tables dynamically. How can this be done, given that the records arrive as JSON strings?
Since I have to process so many tables, do I need to write that many streaming queries? How can this be solved?
Thanks
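Not a definitive answer, but one pattern often suggested for this shape of problem is a single streaming query that fans out per table inside foreachBatch, which also avoids running one query per table. The sketch below assumes the JSON carries a "table" field naming the target table; the broker, topic, and paths are placeholders, and per-batch schema inference is a simplification (a schema registry would be more robust).

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()
    import spark.implicits._

    // Raw Kafka stream; each value is a JSON string assumed to carry a "table" field.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "ingest-topic")                // placeholder topic
      .load()
      .select(
        get_json_object($"value".cast("string"), "$.table").as("table_name"),
        $"value".cast("string").as("payload"))

    // One query fans out to per-table Delta tables inside foreachBatch.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) => {
      val tables = batch.select("table_name").distinct().as[String].collect()
      tables.foreach { t =>
        val payloads = batch.filter($"table_name" === t).select("payload").as[String]
        val parsed   = batch.sparkSession.read.json(payloads)   // infer this table's schema per batch
        parsed.write.format("delta").mode("append").save(s"/mnt/delta/$t")   // placeholder base path
      }
    }

    raw.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/mnt/checkpoints/kafka-to-delta")   // placeholder
      .start()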

Prometheus use case for high scale clickstream data

We are getting website clickstream data (at a scale of 100 million events per minute).
Our current pipeline puts all these events on Kafka, Spark then does the aggregation, and the resulting simple metrics are stored in Graphite.
What I am thinking of is ingesting into Prometheus directly from Kafka, taking the 100 million events in a distributed fashion across n instances.
The Spark aggregations are simple, which is why I am considering this direct ingestion.
My question is: has anyone used Prometheus for such a use case at this scale?

Use Session window in Kafka stream to order records and insert into MySQL database

As per the ksqlDB documentation, a session window can be used to order records by timestamp and do aggregation.
I have a use case where I want to insert records into MySQL in sequence.
I have a timestamp field in my record that I use as ROWTIME; I then tried a session window over it and inserted into an output stream that pushes into a topic and then to RDS. But in the output stream I was not able to reorder the messages by timestamp.
Example -
There are two records: Record 1 at 11:00 AM and Record 2 at 11:01 AM, and both have the same primary key. They are ingested into Kafka in the sequence Record 2, Record 1. But in MySQL I need Record 1 and then Record 2, because Record 1 has the lower timestamp. I tried a session window of 5 minutes in the stream, but in the output stream the order always comes out as Record 2, Record 1.
Is this scenario possible inside Kafka? Can I reorder the records inside Kafka and then push them into a stream using an INSERT INTO statement?
Currently I am trying to do this with KSQL queries, as I am using Confluent Kafka.
Session windows do not change the order of records, they GROUP records together that have the same key and are within some time period of each other.
Hence session windows are not going to allow you to reorder messages.
Reordering messages is not a use-case ksqlDB is suited for at present. You may have better luck if you tried to write a Kafka Streams based application.
Kafka Streams would allow you to use a state store to buffer input messages for some time, to allow for out-of-order messages. You should be able to use punctuation to trigger outputting the buffered messages after some time period. You will need to choose how long you are willing to buffer the input to allow for out-of-order arrivals.
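As a rough illustration of that buffer-and-punctuate idea in Kafka Streams (Scala, Transformer API), here is a sketch. The store name "reorder-buffer" and the extractTs function are assumptions (the store must be registered in the topology alongside the transformer), and it naively flushes everything buffered on each punctuation rather than only records older than the hold time.

    import java.time.Duration
    import org.apache.kafka.streams.KeyValue
    import org.apache.kafka.streams.kstream.Transformer
    import org.apache.kafka.streams.processor.{ProcessorContext, PunctuationType}
    import org.apache.kafka.streams.state.KeyValueStore
    import scala.collection.mutable.ArrayBuffer

    // Buffers incoming records in a state store and emits them in event-time order on a
    // wall-clock punctuation. `extractTs` pulls the event timestamp out of the value.
    class ReorderTransformer(holdTime: Duration, extractTs: String => Long)
        extends Transformer[String, String, KeyValue[String, String]] {

      private var ctx: ProcessorContext = _
      private var buffer: KeyValueStore[String, String] = _

      override def init(context: ProcessorContext): Unit = {
        ctx = context
        buffer = context.getStateStore("reorder-buffer").asInstanceOf[KeyValueStore[String, String]]
        // holdTime is how long you are willing to wait for out-of-order records.
        ctx.schedule(holdTime, PunctuationType.WALL_CLOCK_TIME, (_: Long) => flush())
      }

      // Nothing is forwarded immediately; the record waits under a zero-padded, timestamp-prefixed key.
      override def transform(key: String, value: String): KeyValue[String, String] = {
        buffer.put(f"${extractTs(value)}%020d#$key", value)
        null
      }

      // Drain the store, forward records sorted by the timestamp prefix, then clear it.
      private def flush(): Unit = {
        val pending = ArrayBuffer.empty[(String, String)]
        val it = buffer.all()
        while (it.hasNext) { val kv = it.next(); pending += ((kv.key, kv.value)) }
        it.close()
        pending.sortBy(_._1).foreach { case (storeKey, value) =>
          ctx.forward(storeKey.split("#", 2)(1), value)
          buffer.delete(storeKey)
        }
      }

      override def close(): Unit = ()
    }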

Combining data from multiple Kafka topics into a single Kafka topic

I have N Kafka topics, each with data and a timestamp, and I need to combine them into a single topic in sorted timestamp order, with the data sorted inside the partition. I have found one way to do it:
Combine all the Kafka topic data in Cassandra (because of its fast writes) with clustering order DESCENDING. This will combine them all, but the limitation is that if data arrives late, after the timed accumulation window, it won't be sorted.
Is there any other appropriate way to do this? If not, is there any way to improve my solution?
Thanks
It's not clear why you need Kafka to sort on timestamps. Typically this is done only at consumption time, for each batch of messages.
For example, create a Kafka Streams process that reads from all the topics, build a GlobalKTable, and enable interactive querying.
When you query, you sort the data on the client side, regardless of how it is ordered in the topic.
This way, you are not limited to a single, ordered partition.
Alternatively, I would write to something other than Cassandra (due to my lack of deep knowledge of it), for example Couchbase or CockroachDB.
Then, when you query those later, sort the results with an ORDER BY.
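To make the consumption-time option concrete, here is a minimal Scala sketch of a plain consumer that subscribes to all the source topics and sorts each polled batch by record timestamp before processing; the broker, topic names, and group id are placeholders. The ordering only covers one polled batch at a time, which matches the point above.

    import java.time.Duration
    import java.util.Properties
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.serialization.StringDeserializer
    import scala.jdk.CollectionConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")          // placeholder broker
    props.put("group.id", "merge-sort-consumer")           // placeholder group id
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Seq("topic-1", "topic-2", "topic-3").asJava)   // the N source topics

    while (true) {
      val batch = consumer.poll(Duration.ofSeconds(1)).asScala.toSeq
      // Sort on the consumer side, so no single ordered partition is required.
      batch.sortBy(_.timestamp()).foreach { record =>
        // process(record): hand the record to the sink in timestamp order
      }
    }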

Filter Repeated Messages In Kafka

PREFACE:
In our organization we're trying to use Kafka to solve a problem that involves capturing changes in an Oracle database and sending them through Kafka. It is in fact CDC; we are using a Kafka Connector for that.
We catch the changes in Oracle using Oracle Flashback queries, which give us the timestamp of the change and the operation involved (Insert, Delete, Update).
Once a change is made to a table we observe, the Kafka Connector publishes it to a topic, which we then read using Kafka Streams.
The problem is that sometimes identical rows appear in the Flashback query, either because of an update that didn't actually change anything (this triggers a Flashback change too), or because the table has, say, 100 columns and we watch only 20, so we end up seeing repeated rows in the query since none of those 20 fields changed.
We use Flashback to get the changed rows (including deleted ones). In the connector we set timestamp+increment mode (the timestamp is taken from the versions_starttime field of the Flashback query).
Important: we can't touch the DB beyond this; that is, we can't create triggers instead of using the existing Flashback scheme.
THE QUESTION
We're trying to filter records in Kafka: if some (key, value) pair is identical in content to one already seen, we want to discard it. Note that this is not exactly-once semantics; the record can be repeated with large timestamp differences.
If I use a KTable to check the last value of a record, how efficient will this be over a long period?
I mean, the internal state storage of consumers is handled by RocksDB and a backing Kafka topic, so if I use a non-windowed KTable this internal state could end up being very large.
What is considered a good approach in this scenario, so as not to overload the Kafka consumers' internal state storage, while still being able to tell whether the current record was already processed some time ago?
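No definitive answer, but one way to keep the state bounded while still detecting content-identical repeats over long periods is to store only a digest of the last value per key rather than the full record. Below is a hedged Kafka Streams sketch in Scala; the store name "last-seen-hash" is an assumption and must be registered in the topology alongside the transformer.

    import java.security.MessageDigest
    import org.apache.kafka.streams.KeyValue
    import org.apache.kafka.streams.kstream.Transformer
    import org.apache.kafka.streams.processor.ProcessorContext
    import org.apache.kafka.streams.state.KeyValueStore

    // Forwards a record only when its content differs from the last value seen for the same key.
    // Keeping an MD5 digest per key instead of the whole value bounds the RocksDB/changelog
    // footprint to roughly (key + 32 bytes) per distinct key.
    class ChangeFilter extends Transformer[String, String, KeyValue[String, String]] {

      private var lastSeen: KeyValueStore[String, String] = _

      override def init(context: ProcessorContext): Unit =
        lastSeen = context.getStateStore("last-seen-hash").asInstanceOf[KeyValueStore[String, String]]

      override def transform(key: String, value: String): KeyValue[String, String] = {
        val digest = MessageDigest.getInstance("MD5")
          .digest(value.getBytes("UTF-8")).map("%02x".format(_)).mkString
        if (digest == lastSeen.get(key)) {
          null                        // same key, same content as last time: drop the duplicate
        } else {
          lastSeen.put(key, digest)   // content changed (or key unseen): remember and forward
          KeyValue.pair(key, value)
        }
      }

      override def close(): Unit = ()
    }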