Spark structured streaming PostgreSQL updateStateByKey - postgresql

How do I update the state of an OUTPUT table from a Spark structured streaming computation triggered by changes in an INPUT PostgreSQL table?
As a real-life scenario: the USERS table has been updated for user_id = 0002. How do I trigger the Spark computation for that user only and write/update the results to another table?

Although there is no out-of-the-box solution, you can implement it the following way.
You can use LinkedIn's Databus or a similar tool that mines the database logs and produces corresponding events to Kafka; such tools track the changes in the database's transaction logs. You can write a Kafka connector to transform and filter the data. You can then consume the events from Kafka and process them into any sink format you want.
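A minimal sketch of the Spark side, assuming the CDC tool publishes USERS change events to a Kafka topic; the topic name, event schema, connection details and output table below are illustrative placeholders, not part of the question:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object UsersCdcToOutputTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("users-cdc").getOrCreate()

    // Assumed shape of the change events the CDC tool publishes to Kafka.
    val changeSchema = new StructType()
      .add("user_id", StringType)
      .add("name", StringType)
      .add("email", StringType)

    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "users-cdc")                  // assumed topic name
      .load()
      .select(from_json(col("value").cast("string"), changeSchema).as("c"))
      .select("c.*")

    // Each micro-batch contains only the users that actually changed, so the
    // computation is naturally limited to e.g. user_id = 0002 when that row is
    // updated. foreachBatch exposes a plain DataFrame for writing the results.
    val writeResults: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/mydb") // assumed connection
        .option("dbtable", "user_results")                      // assumed output table
        .option("user", "spark")
        .option("password", "secret")
        .mode("append")                                         // swap in real upsert logic
        .save()

    changes.writeStream
      .foreachBatch(writeResults)
      .start()
      .awaitTermination()
  }
}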

Related

Design stream pipeline using spark structured streaming and databricks delta to handle multiple tables

I am designing a streaming pipeline where my need is to consume events from a Kafka topic.
A single Kafka topic can have data from around 1000 tables, with the data coming in as JSON records. Now I have the below problems to solve.
Reroute messages based on their table into separate folders: this is done using Spark structured streaming's partitionBy on the table name (a sketch of this step appears below).
Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. I am not able to find the best solution for this: inferring the JSON schema dynamically and writing to a Delta table dynamically. How can this be done, given that the records arrive as JSON strings?
As I have to process so many tables, do I need to write that many streaming queries? How can this be solved?
Thanks
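A rough sketch of the rerouting step described in the question, under the assumption that each JSON record carries a "table" field; the topic, path and field names are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

object RouteByTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("route-by-table").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "all-tables")                         // assumed topic name
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Assumes each JSON record carries a "table" field naming its source table.
    val withTable = raw.withColumn("table", get_json_object(col("json"), "$.table"))

    withTable.writeStream
      .format("json")
      .option("path", "/data/landing")                           // assumed landing folder
      .option("checkpointLocation", "/data/checkpoints/route-by-table")
      .partitionBy("table")                                      // one sub-folder per source table
      .start()
      .awaitTermination()
  }
}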

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The condition is that the PostgreSQL database is not empty, and I want to catch any CDC change in the database. In the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis of what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data for this case comes from an API that is being monitored, and there are few examples of database-to-database stream processing. I have done this process before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent and rely on generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
If you want to process per record, then use a Spring Boot Kafka setup for consuming the Kafka messages; that can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
https://medium.com/#contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single message processing approach.
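As a rough illustration of the point that individual records can still be handled inside Spark's micro-batches (separate from the Spring Boot route suggested above), here is a ForeachWriter sketch with a placeholder topic name and a println standing in for real per-record logic:

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

object PerRecordProcessing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-record").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "cdc-events")                 // assumed topic name
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // The data still arrives in micro-batches, but ForeachWriter.process is
    // called once per record, so per-record handling remains possible.
    events.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, epochId: Long): Boolean = true   // e.g. open a DB connection
        def process(record: Row): Unit =
          println(s"key=${record.getString(0)} value=${record.getString(1)}") // replace with real handling
        def close(errorOrNull: Throwable): Unit = ()                 // e.g. close the connection
      })
      .start()
      .awaitTermination()
  }
}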

JDBC sink connector insert/upsert based on max timestamp?

I'm very new to Kafka Connect.
I am inserting records from multiple sources into one table.
In some cases, it may be possible for some records to reach before others.
Since I cannot control which source will pull which record first, I want to add a check on the timestamp key of the record.
I have a key called "LastModified_timestamp" in my schema where I store the timestamp of the latest state of my record.
I want to add a check to my JDBC sink connector where I can upsert a record based on comparing the value of LastModified_timestamp
I want to ignore records which have an older timestamp and only want to upsert/insert the latest one. I couldn't find any configuration to achieve this.
Is there any way by which I can achieve this?
Will writing a custom query help in this case?
The JDBC Sink connector does not support this kind of feature. You have two options to consider:
Single Message Transforms (SMTs) - these apply logic to records as they pass through Kafka Connect. SMTs are great for things like dropping columns, changing datatypes, etc., BUT they are not appropriate for more complex processing and logic, including logic that needs to span multiple records as yours does here.
Process the data in the source Kafka topic first, to apply the necessary logic. You can do this with Kafka Streams, KSQL, and several other stream processing frameworks (e.g. Spark, Flink, etc.). You'd need some kind of stateful logic that could work out whether a record is older than one already processed.
Can you describe more about your upstream source for the data? It might be that there's a better way to orchestrate the data coming through to enforce the ordering.
A final idea would be to land all records in your target DB and then use logic in the database query that consumes it to select the most recent record (based on LastModified_timestamp) for a given key.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.
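As one possible shape for the "process the topic first" option, here is a rough Kafka Streams sketch (Scala DSL, recent kafka-streams-scala) that keeps only the newest record per key based on LastModified_timestamp before the JDBC sink connector reads it; the topic names, the numeric-timestamp assumption and the regex-based field extraction are illustrative placeholders:

import java.util.Properties

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object LatestByTimestamp extends App {

  // Hypothetical helper: pull LastModified_timestamp out of the JSON value.
  // Swap in a real JSON parser for your actual record schema.
  private val tsPattern = """\"LastModified_timestamp\"\s*:\s*(\d+)""".r
  private def lastModified(json: String): Long =
    tsPattern.findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(0L)

  val builder = new StreamsBuilder()

  builder
    .stream[String, String]("records-raw")          // assumed input topic
    .groupByKey
    .reduce { (current, incoming) =>
      // Keep whichever version of the record carries the newer timestamp.
      if (lastModified(incoming) >= lastModified(current)) incoming else current
    }
    .toStream
    .to("records-latest")                           // point the JDBC sink connector here

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-by-timestamp")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  new KafkaStreams(builder.build(), props).start()
}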

Druid.io: update/override existing data via streams from Kafka (Druid Kafka indexing service)

I'm loading streams from Kafka using the Druid Kafka indexing service.
But the data I upload keeps changing, so I need to reload it again while avoiding duplicates and collisions if the data was already loaded.
I researched the docs about Updating Existing Data in Druid.
But all the information is about Hadoop batch ingestion and lookups.
Is it possible to update existing Druid data during Kafka streams?
In other words, I need to rewrite the old values with new ones using Kafka indexing service (streams from Kafka).
Maybe there is some kind of setting to rewrite duplicates?
Druid is in a way a time-series database where the data gets "finalised" and written to a log every time-interval. It does aggregations and optimises columns for storage and easy queries when it "finalises" the data.
By "finalising", what I mean is that Druid assumes the data for the specified interval is already present and it can safely do its computations on top of it. So this in effect means that there is no support for you to update the data (like you do in a database). Any data that you write is treated as new data and it keeps adding to its computations.
But Druid is different in the sense it provides a way to upload historical data for the same time period the real-time indexing has already taken place. This batch upload will overwrite any segments with the new ones and further queries will reflect the latest uploaded batch data.
So I am afraid the only option would be to do batch ingestion. Maybe you could still send the data to Kafka, but have a Spark/Gobblin job that does the de-duplication and writes to Hadoop. Then have a simple cron job to re-index these as a batch into Druid.
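To make that batch route concrete, a rough sketch of a Spark de-duplication job; the paths, the "id" key and the "updated_at" column are assumptions, and the cleaned output would then be batch-(re)indexed into Druid:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DedupForDruidReindex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dedup-for-druid").getOrCreate()

    // Assumed layout: raw events landed from Kafka, keyed by "id", with an
    // "updated_at" column saying which version of a record is the newest.
    val raw = spark.read.json("hdfs:///data/raw/events")

    val newestFirst = Window.partitionBy(col("id")).orderBy(col("updated_at").desc)

    raw
      .withColumn("rn", row_number().over(newestFirst))
      .filter(col("rn") === 1)                 // keep only the latest version of each record
      .drop("rn")
      .write
      .mode("overwrite")
      .parquet("hdfs:///data/clean/events")    // batch (re-)index this path into Druid
  }
}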

Read data from KSQL tables

Maybe this is a beginner question, but what is the recommended way to read data produced by KSQL?
Let's assume I do some stream processing and write the data to a KSQL table. Now I want to access this data via a Spring application (e.g. fan-out some live data via a websocket). My first guess here was to use Spring Kafka and just subscribe to the underlying topic. Or should I use Kafka Streams?
Another use-case could be to do stream processing and write the results to a Redis store (e.g. for a webservice which always returns current values). What would be the approach here?
Thanks!
The results of KSQL queries are stored in Kafka topics, so you can access the results from third-party applications by reading from the result topic.
If the query result is a table, the resulting Kafka topic is a changelog topic, meaning that you can read it into a table in a third-party system such as Cassandra or Redis. This table will always have the latest result, and you can query it from web services.
Check out our clickstream demo, where we push the results into Elastic for visualization. The visualized values are the latest values in the corresponding tables.
https://github.com/confluentinc/ksql/tree/master/ksql-clickstream-demo#clickstream-analysis
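For the Spring/websocket use case above, here is a sketch of what consuming the changelog topic looks like with a plain Kafka consumer (the topic name is a placeholder for whatever topic backs your KSQL table); a Spring Kafka listener would apply the same upsert-on-key / delete-on-tombstone pattern:

import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object ReadKsqlTableTopic extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "ksql-table-reader")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("MY_KSQL_TABLE_TOPIC"))  // assumed result topic

  // Materialise the changelog: the latest value seen per key is the current table state.
  val state = scala.collection.mutable.Map.empty[String, String]
  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    for (r <- records.asScala) {
      if (r.value() == null) state.remove(r.key())   // tombstone means the row was deleted
      else state.update(r.key(), r.value())          // upsert the latest value for the key
      // push the update onwards here, e.g. to Redis or out over a websocket
    }
  }
}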