Beam / Cloud Dataflow: How to Add Kafka (or PubSub) Topics to a Running Stream

(How) is it possible to dynamically add or remove topics (Kafka or PubSub) as sources or sinks of a running pipeline? Or to use a dynamic naming pattern for a sink, as is possible with BigQuery table names?
Some background: we have one topic per customer, both to simplify downstream aggregations and to be able to clean up/add topics on the fly. Kafka is used so we can backfill calculations over periods longer than is possible with PubSub.
The options I have in mind right now are either extending KafkaIO to support this, or updating the pipeline each time a topic is added or removed (meaning there will be some lag in the stream while it's updated). Or maybe I have the wrong design pattern in my head and there is another solution for this.

You are correct that right now the easiest solution is updating the pipeline.
However, a new API called Splittable DoFn (SDF) is currently in active development; it is already available in the Cloud Dataflow runner in streaming mode and in the Direct runner, and implementation is in progress in Flink and Apex runners.
It makes it possible to do things like "create a PCollection of Kafka topic names and read each of those topics". One pipeline stage can produce the names of the topics to be read (e.g. the names themselves could arrive over Kafka or PubSub every time a customer is added, or an SDF could watch the result of a database query returning the list of customers and emit new ones), and another stage can read those topics.
See http://s.apache.org/splittable-do-fn for the design doc of the API, and http://s.apache.org/textio-sdf for an example proposed refactoring of TextIO using this API - you may want to try to modify KafkaIO yourself in a similar fashion.
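To give a feel for the shape of the API, here is a heavily simplified sketch of an SDF that reads every Kafka topic name it receives. It is a sketch only, under stated assumptions: the annotation style of the current proposal, a single partition per topic, a hypothetical broker address, and none of the watermark or deserialization handling a real KafkaIO refactoring would need:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
    import org.apache.beam.sdk.values.KV;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // Sketch: each input element is a topic name; reads partition 0 of that topic forever.
    @DoFn.UnboundedPerElement
    class ReadKafkaTopicFn extends DoFn<String, KV<String, String>> {

      @GetInitialRestriction
      public OffsetRange initialRestriction(@Element String topic) {
        // Unbounded restriction: all offsets from the beginning onwards.
        return new OffsetRange(0L, Long.MAX_VALUE);
      }

      @ProcessElement
      public ProcessContinuation process(
          @Element String topic,
          RestrictionTracker<OffsetRange, Long> tracker,
          OutputReceiver<KV<String, String>> out) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumption: your brokers
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          TopicPartition partition = new TopicPartition(topic, 0);
          consumer.assign(Collections.singletonList(partition));
          consumer.seek(partition, tracker.currentRestriction().getFrom());
          for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
            if (!tracker.tryClaim(rec.offset())) {
              return ProcessContinuation.stop(); // runner checkpointed; it will resume later
            }
            out.output(KV.of(rec.key(), rec.value()));
          }
        }
        return ProcessContinuation.resume(); // ask the runner to call us again
      }
    }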

Related

Filtering in Kafka and other streaming technologies

I am currently doing some research about which stream processing technology to use. So far I have looked at message queueing technologies and streaming frameworks. I am now leaning towards Apache Kafka or Google Pub/Sub.
The requirements I have:
Deliver, read and process messages/events in real time.
Persistence in the messages/events.
Ability to filter messages/events in real time without having to read the entire topic. For example: if I have a topic called ‘details’, I want to be able to filter out the messages/events in that topic where an attribute of an event equals a certain value.
Ability to see if the producer to a certain topic or queue is finished.
Ability to delete messages/events in a topic based on an attribute within an event equaling a certain value.
Ordering in messages/events.
My question is: what is the best framework/technology for these use cases? From what I have read so far, Kafka doesn’t provide out-of-the-box filtering for messages/events in topics, whereas Google Pub/Sub does have a filtering feature.
Any suggestions and experience would be welcome.
As per the requirements you mentioned, Kafka seems a nice fit. Using Kafka Streams or KSQL you can perform filtering in real time; here is an example: https://kafka-tutorials.confluent.io/filter-a-stream-of-events/confluent.html
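For illustration, a minimal Kafka Streams filter could look like the sketch below; the topic names ("details", "details-filtered"), the broker address, and the attribute check are all assumptions:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class DetailsFilter {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "details-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumption

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("details", Consumed.with(Serdes.String(), Serdes.String()))
            // keep only events whose attribute equals the desired value
            .filter((key, value) -> value.contains("\"status\":\"ACTIVE\""))
            .to("details-filtered", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
      }
    }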
What you need is more than just integration and data transfer; you need something similar to what is known as an ETL tool. You can find more about ETL and ETL tools in GCP here: https://cloud.google.com/learn/what-is-etl

How to implement short-lived queues for ETL

I am looking for a suggestion on how we can implement short-lived queues (topics) to perform an ETL; after the ETL is completed, the queue (topic) and its data are not needed anymore.
Here is the scenario: when a particular job runs, it has to run a query to extract data from a database (assume Teradata) and load it into a topic. Then a Spark job is kicked off to process all the records in that topic, after which the Spark job stops. At that point, the topic and the data in it are not needed anymore.
For this I see Kafka and Redis Streams as the two options. Redis Streams looks like the more appropriate tool to me because of the ease of creating and destroying topics; with Kafka it seems to require additional custom handlers for creating and dropping topics, and I also don't want to overload Kafka with too many topics.
I am open and happy to hear from you if there is an alternative and better solution out there.
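For what it's worth, the custom handling for creating and dropping topics is fairly small with Kafka's AdminClient. A sketch, where the broker address, topic name, and partition/replication settings are illustrative:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class ShortLivedTopic {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumption

        try (AdminClient admin = AdminClient.create(props)) {
          // create the short-lived topic before the ETL run
          admin.createTopics(
              Collections.singletonList(new NewTopic("etl-run-42", 3, (short) 1)))
              .all().get();

          // ... extract from Teradata, load into "etl-run-42", run the Spark job ...

          // drop the topic once the Spark job has finished
          admin.deleteTopics(Collections.singletonList("etl-run-42")).all().get();
        }
      }
    }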

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams. More specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case whether to use the Kafka Streams API or the plain KafkaProducer/KafkaConsumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of KafkaProducers/KafkaConsumers, so everything that is possible with Kafka Streams is also possible with plain Consumers/Producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When developing applications with plain Consumers/Producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can use to build your Kafka applications.
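To make the difference concrete, below is a sketch of a simple filter written against the plain Consumer/Producer API; with the Streams DSL the same logic collapses to a single filter() call. The broker address, topics, and predicate are assumptions:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PlainConsumerFilter {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumption
        props.put("group.id", "plain-filter");

        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
             KafkaProducer<String, String> producer =
                 new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
          consumer.subscribe(Collections.singletonList("details"));
          while (true) {
            // all the plumbing the Streams DSL hides: poll, iterate, send, commit
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
              if (rec.value().contains("\"status\":\"ACTIVE\"")) {
                producer.send(new ProducerRecord<>("details-filtered", rec.key(), rec.value()));
              }
            }
            consumer.commitSync(); // offset management is now our responsibility
          }
        }
      }
    }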
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with ease. Let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data to a separate Kafka topic (using KStream.to() or KTable.to()), and finally, using Kafka Connect, the messages will be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API, but it would take much more coding.
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates, such as the count of orders per individual product; in such scenarios duplicates will always give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume in the above example you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams, and since the KTable is backed by a Kafka topic, the data in your streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
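A sketch of that enrichment, assuming String serdes and illustrative topic names; the processing.guarantee setting also shows the exactly-once behaviour from the previous point:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class OrderEnrichment {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumption
        // consume-process-produce in one transaction (see the previous point)
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        // "customers" is a compacted topic; the KTable is materialized in local RocksDB
        KTable<String, String> customers =
            builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
            // local lookup, no per-record REST call over the network
            .join(customers, (order, customerEmail) -> order + "|" + customerEmail)
            .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
      }
    }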
Let's say you want to join two different streams of data. In the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event arrives before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events arrive within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB state store of Kafka Streams.
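And a sketch of the one-hour windowed join, again with assumed topic names and serdes:

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.JoinWindows;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.StreamJoined;

    public class PaidOrders {
      public static void build(StreamsBuilder builder) {
        KStream<String, String> orders =
            builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> payments =
            builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));

        orders.join(
                payments,
                (order, payment) -> order + "|" + payment, // emitted only if both sides arrive
                JoinWindows.of(Duration.ofHours(1)),       // window state lives in RocksDB
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
            .to("paid-orders", Produced.with(Serdes.String(), Serdes.String()));
      }
    }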

Streaming CDC changes with Kafka and Spark still processes them in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The condition is that the PostgreSQL database is not empty, and I want to catch any CDC change in the database. In the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis of what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of the data is an API we want to monitor, and there are few examples of database-to-database stream processing. I have done this before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent and rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka, in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
If you want to process per record, then use the Spring Boot Kafka setup for consuming Kafka messages; it can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be written in Scala and has a lot of support, obviating extra work elsewhere.
This article discusses the single-message processing approach: https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b
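To illustrate the first point: even though Structured Streaming delivers data in micro-batches, you can still attach per-record logic, for example with foreach(). A sketch, where the broker address and the Debezium-style topic name are assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.ForeachWriter;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CdcPerRecord {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("cdc-stream").getOrCreate();

        Dataset<Row> cdc = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  // assumption
            .option("subscribe", "dbserver1.public.orders")    // Debezium-style topic, assumed
            .load();

        cdc.selectExpr("CAST(value AS STRING) AS json")
            .writeStream()
            .foreach(new ForeachWriter<Row>() {
              @Override public boolean open(long partitionId, long epochId) { return true; }
              @Override public void process(Row row) {
                // called once per record, even inside a micro-batch
                System.out.println(row.getString(0));
              }
              @Override public void close(Throwable error) {}
            })
            .start()
            .awaitTermination();
      }
    }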

JDBC sink connector insert/upsert based on max timestamp?

I'm very new to Kafka Connect.
I am inserting records from multiple sources into one table.
In some cases, it may be possible for some records to reach before others.
Since I cannot control which source will pull which record first, I want to add a check on the timestamp key of the record.
I have a key called "LastModified_timestamp" in my schema where I store the timestamp of the latest state of my record.
I want to add a check to my JDBC sink connector where I can upsert a record based on comparing the value of LastModified_timestamp.
I want to ignore records that have an older timestamp and only upsert/insert the latest one. I couldn't find any configuration to achieve this.
Is there any way by which I can achieve this?
Will writing a custom query help in this case?
The JDBC Sink connector does not support this kind of feature. You have two options to consider:
Single Message Transform (SMT) - these apply logic to records as they pass through Kafka Connect. SMTs are great for things like dropping columns, changing datatypes, etc., BUT not appropriate for more complex processing and logic, including logic that needs to span multiple records, as yours does here
Process the data in the source Kafka topic first, to apply the necessary logic. You can do this with Kafka Streams, KSQL, and several other stream processing frameworks (e.g. Spark, Flink, etc.). You'd need some kind of stateful logic that could work out whether a record is older than one already processed (see the sketch below).
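A minimal sketch of that stateful logic with Kafka Streams: group by key and keep whichever record carries the newer LastModified_timestamp. The topic names and the extractTimestamp helper (which would parse the field out of your record format) are assumptions:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Produced;

    public class LatestByTimestamp {
      // hypothetical helper: assumes JSON-ish values containing
      // "LastModified_timestamp":<epoch millis>; adapt to your real format
      static long extractTimestamp(String value) {
        Matcher m = Pattern.compile("\"LastModified_timestamp\":(\\d+)").matcher(value);
        return m.find() ? Long.parseLong(m.group(1)) : Long.MIN_VALUE;
      }

      public static void build(StreamsBuilder builder) {
        builder.stream("records", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            // per key, keep whichever record has the newer LastModified_timestamp
            .reduce((current, incoming) ->
                extractTimestamp(incoming) >= extractTimestamp(current) ? incoming : current)
            .toStream()
            // the JDBC sink connector then upserts from this "latest state" topic
            .to("records-latest", Produced.with(Serdes.String(), Serdes.String()));
      }
    }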
Can you describe more about your upstream source for the data? It might be there's a better way to orchestrate the data coming through to enforce the ordering.
A final idea would be to land all records in your target DB and then use logic in the database query that consumes it to select the most recent record (based on LastModified_timestamp) for a given key.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.