Join streaming based on key - Spark/Kafka - scala

Suppose we have two streams in Spark, and one of them is not 100% in sync with the other, so records may arrive at different times. We need to join the streams by key. Is there any way to do this without any persistence?

I don't think it is possible. Kafka Streams ships with built-in support for interpreting the data in a Kafka topic as a continuously updated table. In the Kafka Streams DSL this is achieved via the so-called KTable.
These KTables are backed by state stores in Kafka Streams. The state stores are local to your application (more precisely: they are local to the instances of your application, of which there can be one or many), which means that interacting with them does not require talking over the network, so read and write operations are very fast. If you decide not to persist data, you might start losing information that you would rather keep.
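To illustrate why a keyed join of unsynchronized streams needs some state, here is a toy sketch in plain Python. It is not Spark or Kafka Streams code; the class name KeyedJoinBuffer and its method are hypothetical. Each side buffers its records per key until the other side catches up:

```python
from collections import defaultdict


class KeyedJoinBuffer:
    """Toy inner join of two unsynchronized streams, keyed by record key."""

    def __init__(self):
        # Values seen so far, per side and per key. This is the "state".
        self.pending = {"left": defaultdict(list), "right": defaultdict(list)}

    def on_record(self, side, key, value):
        """Buffer the record and return the joined tuples it completes."""
        other = "right" if side == "left" else "left"
        self.pending[side][key].append(value)
        matches = self.pending[other].get(key, [])
        if side == "left":
            return [(key, value, m) for m in matches]
        return [(key, m, value) for m in matches]


join = KeyedJoinBuffer()
join.on_record("left", "k1", "a")            # nothing to join with yet
joined = join.on_record("right", "k1", "b")  # the late record completes the pair
```

If the buffer lives only in memory, a restart between the two calls drops the first record and the pair is never emitted, which is the kind of loss described above.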

Related

KTable initialization and persistence

This is more of an architectural question. I'm learning about Event-Driven Architecture and Streaming Systems with Apache Kafka. I've learned about Event Sourcing and CQRS and have some basic questions regarding implementation.
For example, consider a streaming application where we are monitoring vehicular events of drivers registered in our system. These events will be coming in as a KStream. The drivers registered in the system will be in a KTable, and we need to join the events and drivers to derive some output.
Assume that we insert a new driver in the system by a microservice, which pushes the data in a Cassandra table and then to the KTable topic by Change Data Capture.
Since Kafka topics have a TTL (retention period) associated with them, how do we make sure that the driver records are not dropped?
I understand that Kafka has a persistent state store that can maintain the required state, but can I depend on it like a Cassandra table? Is there a size consideration?
If the whole application, and all Kafka brokers and consumer nodes, are terminated, can the application be restarted without loss of driver records in the KTable?
If the streaming application is Kubernetes based, how would I maintain the persistent disk volumes of each container and correctly attach them as containers come and go?
Would it be preferable to join the event stream with the driver table in Cassandra using Spark Streaming or Flink? Can Spark and Flink still maintain data locality as their streaming consumers will be distributed by Kafka partitions, and the Cassandra data by I don't know what?
EDIT: I realized Spark and Flink would be pulling data from Cassandra on the respective nodes depending on what keys they have. Kafka Streams has the advantage that the stream and the KTable to join will already be data local.
KTables are not subject to a TTL as long as they are built from compacted topics, which keep the latest record for each key indefinitely instead of expiring records by age.
Yes, you need to maintain storage directories for persistent Kafka Streams state stores. Since those stores are on disk, no records should be dropped from them on broker or app restarts unless you actively clear the state directories on the app instance hosts.
Spark and Flink do not integrate with Kafka Streams stores, and they have their own locality considerations. I believe Flink offers RocksDB-backed state, and both can broadcast data for remote joins. Otherwise, joining on Kafka record keys requires both topics to have matching partition counts, so that matching partitions are assigned to the same instances/executors, similar to Kafka Streams joins.
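The point about matching partition counts can be sketched as follows. This is a toy model in plain Python: the function names are hypothetical, and crc32 is only a stand-in hash (Kafka's default partitioner uses murmur2 on the serialized key).

```python
import zlib


def partition_for(key, num_partitions):
    # Stand-in hash for illustration; not the hash Kafka actually uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def co_partitioned(keys, left_partitions, right_partitions):
    """True if every key lands on the same partition number in both topics."""
    return all(
        partition_for(k, left_partitions) == partition_for(k, right_partitions)
        for k in keys
    )


keys = [f"driver-{n}" for n in range(1000)]
same_counts = co_partitioned(keys, 6, 6)   # events topic and drivers topic both have 6
mixed_counts = co_partitioned(keys, 6, 4)  # drivers topic has only 4
```

With equal counts, a given key has the same partition number in both topics, so one instance can own both sides of the join. With different counts, the two sides of a key can end up on different instances.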

Can compacted Kafka topic be used as key-value database?

In many articles, I've read that compacted Kafka topics can be used as a database. However, when looking at the Kafka API, I cannot find methods that allow me to query a topic for a value based on a key.
So, can a compacted Kafka topic be used as a (high performance, read-only) key-value database?
In my architecture I want to feed a component with a compacted topic. And I'm wondering whether that component needs to have a replica of that topic in its local database, or whether it can use that compacted topic as a key value database instead.
Compacted Kafka topics themselves and the basic Consumer/Producer APIs are not suitable as a key-value database. They are, however, widely used as a backing store to persist KV database or cache data, for example in a write-through approach. If you need to re-warm your cache for some reason, just replay the entire topic to repopulate it.
In the Kafka world you have the Kafka Streams API, which allows you to expose the state of your application (for your KV use case it could be the latest state of an order) by means of queryable state stores. A state store is an abstraction of a KV database and is by default implemented using a fast KV database called RocksDB. In case of disaster it is fully recoverable because its full data is persisted in a Kafka topic, so it is resilient enough to serve as a source of data for your use case.
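Replaying a compacted topic to warm a cache boils down to "last value per key wins". A minimal sketch in plain Python, where the topic is modeled as a list of (key, value) pairs and a None value is treated as a tombstone (delete marker); replay_to_cache is a hypothetical name, not a Kafka API:

```python
def replay_to_cache(log):
    """Rebuild a key-value cache by replaying records in offset order."""
    cache = {}
    for key, value in log:
        if value is None:
            cache.pop(key, None)  # tombstone: the key was deleted
        else:
            cache[key] = value    # a later record overwrites an earlier one
    return cache


topic = [
    ("d1", "Ann"),
    ("d2", "Bob"),
    ("d1", "Ann B."),  # update for d1
    ("d2", None),      # delete d2
    ("d3", "Cy"),
]
cache = replay_to_cache(topic)
```

The cache supports lookups by key; the topic itself does not, which is why the topic serves as the backing store rather than as the database.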
Imagine that this is your Kafka Streams Application architecture:
To be able to query these Kafka Streams state stores you need to bundle an HTTP Server and REST API in your Kafka Streams applications to query its local or remote state store (Kafka distributes/shards data across multiple partitions in a topic to enable parallel processing and high availability, and so does Kafka Streams). Because Kafka Streams API provides the metadata for you to know in which instance the key resides, you can surely query any instance and, if the key exists, a response can be returned regardless of the instance where the key lives.
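The routing idea can be sketched as a toy model in plain Python. The classes and hosts below are hypothetical, and crc32 stands in for Kafka's partitioner; in a real application the lookup of the owning instance would come from the Kafka Streams metadata API (for example KafkaStreams#queryMetadataForKey) and the forwarding would be an HTTP call.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key):
    # Stand-in hash for illustration only.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class Instance:
    """One application instance holding the store shards for its partitions."""

    def __init__(self, host, partitions):
        self.host = host
        self.partitions = set(partitions)
        self.store = {}


class QueryLayer:
    def __init__(self, instances):
        self.instances = {i.host: i for i in instances}

    def owner_of(self, key):
        p = partition_for(key)
        return next(i for i in self.instances.values() if p in i.partitions)

    def put(self, key, value):
        self.owner_of(key).store[key] = value

    def get(self, entry_host, key):
        """Return (value, host that served it, whether the call was forwarded)."""
        owner = self.owner_of(key)
        forwarded = owner.host != entry_host
        return owner.store.get(key), owner.host, forwarded


a = Instance("app-1:8080", [0, 1])
b = Instance("app-2:8080", [2, 3])
layer = QueryLayer([a, b])
layer.put("order-42", "SHIPPED")
value, served_by, forwarded = layer.get("app-1:8080", "order-42")
```

Whichever instance the client contacts, the answer is the same; only the forwarded flag differs.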
With this approach, you can kill two birds with one stone:
Do stateful stream processing at scale with Kafka Streams
Expose its state to external clients in a KV Database query pattern style
All in a real-time, highly performant, distributed and resilient architecture.
The images were sourced from a wider article by Robert Schmid where you can find additional details and a prototype to implement queryable state stores with Kafka Streams.
Notable mention:
If you are not in the mood to implement all of this using the Kafka Streams API, take a look at ksqlDB from Confluent which provides an even higher level abstraction on top of Kafka Streams just using a cool and simple SQL dialect to achieve the same sort of use case using pull queries. If you want to prototype something really quickly, take a look at this answer by Robin Moffatt or even this blog post to get a grip on its simplicity.
While ksqlDB is not part of the Apache Kafka project, it's open-source, free and is built on top of the Kafka Streams API.

Kafka Streams - Best way to do lookups in remote store via interactive queries?

I have a bit of confusion and I would like some clarification. I have something I'm working on. I want to have one Kafka Streams topology that will have five separate KStreams reading from their own respective topic and dumping that data into a large monolithic topic. Next I'll have a GlobalKTable that will read from that monolithic topic and materialize a global store let's say called lookupStore. I want to have this materialized global store as basically a "lookup" table for other Kafka Streams applications. I've done some reading on exposing this with an RPC layer with the application.server configuration which will be in the form of some unique host:port.
Now I want to have however many separate microservices, each a Kafka Streams application, that will process events from a KStream and then do a lookup on lookupStore via an interactive query, for instance a .filter() operation based on whether the lookup on lookupStore returned a value or not. So here's my confusion: assuming I hardcode that exposed RPC layer on host:port, how do I actually query lookupStore? If this were in the same topology/local instance you could just do something like lookupStore.get("key"), but how do you do this against a remote Kafka Streams instance?
Or does connecting to that RPC layer expose that state store to the remote application so that it "knows" of it and you can query the lookupStore like as if it was a local instance? Is this feasible or am I going down the wrong path?
If your microservices (which are streams applications) share the same Kafka cluster as the main streaming app (the one that generates the GlobalKTable), then they can read the topic backing the table themselves and do a KTable join or lookupStore.get("key") locally. Also, it is not recommended to make remote API calls within a streams application to do lookups, because of latency. If the two Kafka clusters are different, then you could explore replicating the topics (the GlobalKTable topic and the state store changelog topics) using something like MirrorMaker.
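As a rough picture of the local-lookup alternative, here is a toy sketch in plain Python, with a dict standing in for the locally materialized lookup table. The function name and the sample data are hypothetical.

```python
def filter_known(events, lookup_store):
    """Keep only the events whose key has a value in the local lookup store."""
    return [(key, value) for key, value in events if lookup_store.get(key) is not None]


# Local copy of the lookup table, as if materialized from the shared topic.
lookup_store = {"driver-1": "Ann", "driver-3": "Cy"}
events = [("driver-1", "speeding"), ("driver-2", "braking"), ("driver-3", "idle")]
kept = filter_known(events, lookup_store)
```

Every lookup here is an in-process read, so no network round trip is added per event.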

Kafka streams state store for what?

As I understand it from the book, a Kafka Streams state store is an in-memory key/value store for holding data on its way to Kafka or after filtering.
I am confused by some theoretical questions.
What is the difference between Kafka Streams state and other in-memory storage such as Redis?
What is a real use case for state storage in Kafka Streams?
Why is a topic not an alternative to state storage?
Why is a topic not an alternative to state storage?
A topic contains messages in a sequential order that typically represents a log.
Sometimes, we would want to aggregate these messages, group them and perform an operation, like sum, for example and store it in a place which we can retrieve later using a key. In this case, an ideal solution would be to use a key-value store rather than a topic that is a log-structure.
What is a real use case for state storage in Kafka Streams?
A simple use-case would be word count where we have a word and a counter of how many times it has occurred. You can see more examples at kafka-streams-examples on github.
What is the difference between Kafka Streams state and other in-memory storage such as Redis?
State can be considered a savepoint from which you can resume your data processing, or it might contain information needed for further processing (like the previous word count, which we need to increment), so it can be stored using Redis, RocksDB, Postgres, etc.
Redis can be plugged in as the state storage for Kafka Streams; however, the default persistent state storage for Kafka Streams is RocksDB.
Therefore, Redis is not an alternative to Kafka Streams state but an alternative to Kafka Streams' default RocksDB.
Why is a topic not an alternative to state storage?
A topic is the final storage for a state store under the hood (everything is a topic in Kafka).
If you create a microservice named "myStream" with a state store named "MyState", you will see a topic called myStream-MyState-changelog appear, which holds a history of all changes to the state store.
RocksDB is only a local cache to improve performance, with a first layer of local backup on the local disk; in the end, the real high availability and exactly-once processing guarantee is provided by the underlying changelog topic.
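The relationship between the local store and its changelog can be sketched with a toy model in plain Python, using word count as the workload. ChangeloggedStore and count_words are hypothetical names; a dict stands in for RocksDB and a list stands in for the changelog topic.

```python
class ChangeloggedStore:
    """Local key-value state whose every write is also appended to a changelog."""

    def __init__(self, app_id, store_name, changelog=None):
        self.changelog_topic = f"{app_id}-{store_name}-changelog"
        self.changelog = changelog if changelog is not None else []
        self.local = {}
        # Restore: replaying the changelog rebuilds the local state.
        for key, value in self.changelog:
            self.local[key] = value

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))

    def get(self, key, default=None):
        return self.local.get(key, default)


def count_words(store, lines):
    for line in lines:
        for word in line.lower().split():
            store.put(word, store.get(word, 0) + 1)


store = ChangeloggedStore("myStream", "MyState")
count_words(store, ["hello kafka", "hello streams"])

# Simulate losing the local disk: a new instance restores from the changelog.
restored = ChangeloggedStore("myStream", "MyState", changelog=list(store.changelog))
```

Reads are served from the local state; the changelog is only read again when the state has to be rebuilt.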
What is the difference between Kafka Streams state and other in-memory storage such as Redis?
What is a real use case for state storage in Kafka Streams?
It is not a storage system; it is just local, efficient, guaranteed in-memory state for handling a business case in a fully streamed way.
As an example:
For each incoming order (Topic1), I want to find any previous order (Topic2) to the same location in the last 6 hours.
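A toy sketch of that example in plain Python follows. The class and method names are hypothetical, and it assumes that orders for a given location arrive in timestamp order; a real windowed store would also have to deal with out-of-order records.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class RecentOrdersByLocation:
    """Per-location state holding only the orders from the last `window`."""

    def __init__(self, window=timedelta(hours=6)):
        self.window = window
        self.by_location = defaultdict(deque)

    def add(self, location, order_id, ts):
        self.by_location[location].append((ts, order_id))

    def previous_orders(self, location, ts):
        orders = self.by_location[location]
        # Evict everything older than the window before answering.
        while orders and ts - orders[0][0] > self.window:
            orders.popleft()
        return [order_id for t, order_id in orders if t <= ts]


t0 = datetime(2024, 1, 1, 0, 0)
state = RecentOrdersByLocation()
state.add("A", "o1", t0)
state.add("A", "o2", t0 + timedelta(hours=3))
state.add("B", "o3", t0 + timedelta(hours=3))
recent_a = state.previous_orders("A", t0 + timedelta(hours=7))
```

Seven hours after t0, order o1 has fallen out of the 6-hour window for location A, while o2 is still inside it.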

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark Streaming, but so far I have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming source (I've been using Kafka), you do not need to use checkpoints etc.
Since Spark 1.3 there has been a direct approach, which has several benefits:
Simplified Parallelism: No need to create multiple input Kafka streams
and union-ing them. With directStream, Spark Streaming will create as
many RDD partitions as there is Kafka partitions to consume, which
will all read data from Kafka in parallel. So there is one-to-one
mapping between Kafka and RDD partitions, which is easier to
understand and tune.
Efficiency: Achieving zero-data loss in the first approach required
the data to be stored in a Write Ahead Log, which further replicated
the data. This is actually inefficient as the data effectively gets
replicated twice - once by Kafka, and a second time by the Write Ahead
Log. This second approach eliminate the problem as there is no
receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka’s high level API
to store consumed offsets in Zookeeper. This is traditionally the way
to consume data from Kafka. While this approach (in combination with
write ahead logs) can ensure zero data loss (i.e. at-least once
semantics), there is a small chance some records may get consumed
twice under some failures. This occurs because of inconsistencies
between data reliably received by Spark Streaming and offsets tracked
by Zookeeper. Hence, in this second approach, we use simple Kafka API
that does not use Zookeeper and offsets tracked only by Spark
Streaming within its checkpoints. This eliminates inconsistencies
between Spark Streaming and Zookeeper/Kafka, and so each record is
received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more here (see Approach 2):
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
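The one-to-one mapping between Kafka partitions and RDD partitions described in the quoted text can be pictured as one offset range per partition per batch. This is a toy model in plain Python, not the Spark API: OffsetRange is a hypothetical NamedTuple loosely modeled on Spark's class of the same name, and plan_batch is a hypothetical helper.

```python
from typing import NamedTuple


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(topic, processed, latest):
    """One range per Kafka partition, from the last processed offset to the latest."""
    return [
        OffsetRange(topic, p, processed.get(p, 0), latest[p])
        for p in sorted(latest)
    ]


processed = {0: 10, 1: 5}     # offsets tracked by the streaming job itself
latest = {0: 15, 1: 5, 2: 3}  # current end offsets reported by Kafka
ranges = plan_batch("events", processed, latest)
```

Because each range is fully determined by its two offsets, a failed batch can be recomputed by reading exactly the same records from Kafka again, without a receiver or a write ahead log.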