Design a streaming pipeline using Spark Structured Streaming and Databricks Delta to handle multiple tables - apache-kafka

I am designing a streaming pipeline where I need to consume events from a Kafka topic.
A single Kafka topic can carry data from around 1000 tables, with the data arriving as JSON records. I now have the following problems to solve.
Reroute messages to a separate folder per table: this is done using Spark Structured Streaming with partitionBy on the table name (a sketch of this step follows below).
Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. I have not been able to find a good solution for inferring the JSON schema dynamically and writing to the Delta tables dynamically. How can this be done, given that the records arrive as JSON strings?
Since I have to process so many tables, do I need to write that many streaming queries? How can this be solved?
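For context, the routing step currently looks roughly like this (a sketch only; the broker address, topic name and paths are placeholders, and I assume each JSON record carries a field holding the table name):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, get_json_object}

    val spark = SparkSession.builder().appName("route-by-table").getOrCreate()

    // Raw Kafka source: one topic carrying JSON records for ~1000 tables.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "ingest-topic")                 // placeholder
      .load()

    // Pull the table name out of the JSON payload ("table" is an assumed field name).
    val routed = raw
      .selectExpr("CAST(value AS STRING) AS json")
      .withColumn("table_name", get_json_object(col("json"), "$.table"))

    // One sub-folder per table name under the landing path.
    routed.writeStream
      .partitionBy("table_name")
      .format("delta")
      .option("checkpointLocation", "/chk/route")          // placeholder
      .start("/landing/by_table")                          // placeholder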
Thanks

Related

Flink Table and Hive Catalog storage

I have a Kafka topic and a Hive Metastore. I want to join the incoming events from the Kafka topic with records from the metastore. I saw that Flink offers the possibility of using a catalog to query the Hive Metastore.
So I see two ways to handle this:
using the DataStream API to consume the Kafka topic and query the Hive catalog one way or another in a processFunction or something similar
using the Table API, creating a table from the Kafka topic and joining it with the Hive catalog
My biggest concerns are storage related.
In both cases, what is stored in memory and what is not? Does the Hive catalog store anything on the Flink cluster side?
In the second case, how is the table handled? Does Flink create a copy?
Which solution seems the best? (Maybe both or neither are good choices.)
Different methods are suitable for different scenarios, sometimes depending on whether your Hive table is a static table or a dynamic table.
If your Hive table is only a dimension table, you can try the approach described in this chapter:
joins-in-continuous-queries
It will automatically pick up the latest partition of the Hive table, and it is suitable for scenarios where the dimension data is updated slowly.
Note, however, that this feature is not supported by the legacy planner.
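For illustration, a minimal Table API sketch of the second option, assuming Flink's Hive connector is on the classpath; the catalog name, Hive conf path, topic, and column names are placeholders, and constructor/option details vary slightly between Flink versions:

    import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
    import org.apache.flink.table.catalog.hive.HiveCatalog

    val tEnv = TableEnvironment.create(
      EnvironmentSettings.newInstance().inStreamingMode().build())

    // Register the Hive Metastore as a catalog. Flink only reads metadata from the
    // metastore and data from the underlying storage; it does not copy the table
    // onto the Flink cluster.
    tEnv.registerCatalog("hive", new HiveCatalog("hive", "default", "/etc/hive/conf"))

    // Kafka-backed table: events stream through, nothing is materialized here.
    tEnv.executeSql(
      """CREATE TABLE kafka_events (
        |  user_id STRING,
        |  payload STRING,
        |  proc_time AS PROCTIME()
        |) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'events',
        |  'properties.bootstrap.servers' = 'broker:9092',
        |  'format' = 'json'
        |)""".stripMargin)

    // Processing-time temporal join against the Hive dimension table; Flink can be
    // configured to reload only the latest partition into its lookup cache.
    // In a real job this SELECT would be wrapped in an INSERT INTO some sink.
    tEnv.executeSql(
      """SELECT e.user_id, e.payload, d.some_attribute
        |FROM kafka_events AS e
        |JOIN hive.`default`.dim_users FOR SYSTEM_TIME AS OF e.proc_time AS d
        |ON e.user_id = d.user_id""".stripMargin)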

JDBC sink connector insert/upsert based on max timestamp?

I'm very new to Kafka Connect.
I am inserting records from multiple sources into one table.
In some cases, it may be possible for some records to arrive before others.
Since I cannot control which source will pull which record first, I want to add a check on the timestamp key of the record.
I have a key called "LastModified_timestamp" in my schema where I store the timestamp of the latest state of my record.
I want to add a check to my JDBC sink connector so that I can upsert a record based on a comparison of the LastModified_timestamp value.
I want to ignore records that have an older timestamp and only upsert/insert the latest one. I couldn't find any configuration to achieve this.
Is there any way by which I can achieve this?
Will writing a custom query help in this case?
The JDBC Sink connector does not support this kind of feature. You have two options to consider:
Single Message Transform (SMT) - these apply logic to records as they pass through Kafka Connect. SMTs are great for things like dropping columns, changing datatypes, etc., BUT they are not appropriate for more complex processing and logic, including logic that needs to span multiple records, as yours does here.
Process the data in the source Kafka topic first, to apply the necessary logic. You can do this with Kafka Streams, KSQL, and several other stream processing frameworks (e.g. Spark, Flink, etc.). You'd need some kind of stateful logic that could work out whether a record is older than one already processed; a rough sketch of this follows below.
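For illustration only, a Kafka Streams sketch of that stateful logic in Scala, assuming the record key is the row's primary key, the value is a JSON string, and LastModified_timestamp is an epoch-millis number; topic names are placeholders, and the Scala serde import path varies between Kafka versions:

    import java.util.Properties
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.serialization.Serdes._

    // Crude extraction of the timestamp field; use a proper JSON library in practice.
    def extractTs(json: String): Long =
      "\"LastModified_timestamp\"\\s*:\\s*(\\d+)".r
        .findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(0L)

    val builder = new StreamsBuilder()

    builder
      .stream[String, String]("source-topic")       // placeholder
      .groupByKey
      .reduce { (current, incoming) =>
        // Keep whichever record carries the newer LastModified_timestamp.
        if (extractTs(incoming) >= extractTs(current)) incoming else current
      }
      .toStream
      .to("deduped-topic")                           // the JDBC sink reads this topic in upsert mode

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-by-timestamp")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    new KafkaStreams(builder.build(), props).start()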
Can you describe more about the upstream source of your data? There might be a better way to orchestrate the data coming through to enforce the ordering.
A final idea would be to land all records in your target DB and then use logic in the database query that consumes it to select the most recent record (based on LastModified_timestamp) for a given key.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.

Read data from KSQL tables

Maybe this is a beginner question, but what is the recommended way to read data produced by KSQL?
Let's assume I do some stream processing and write the data to a KSQL table. Now I want to access this data via a Spring application (e.g. fan-out some live data via a websocket). My first guess here was to use Spring Kafka and just subscribe to the underlying topic. Or should I use Kafka Streams?
Another use-case could be to do stream processing and write the results to a Redis store (e.g. for a webservice which always returns current values). What would be the approach here?
Thanks!
The results of KSQL queries are stored in Kafka topics, so you can access them from third-party applications by reading from the result topic.
If the query result is a table, the resulting Kafka topic is a changelog topic, meaning that you can read it into a table in a third-party system such as Cassandra or Redis. That table will always hold the latest result, and you can query it from web services.
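For illustration, a plain Kafka consumer sketch (Scala 2.13) that reads a table's backing changelog topic and keeps only the latest value per key; the topic name and broker are placeholders, and an in-memory map stands in for Redis or Cassandra:

    import java.time.Duration
    import java.util.{Collections, Properties}
    import scala.collection.mutable
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.serialization.StringDeserializer

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")                       // placeholder
    props.put("group.id", "ksql-table-reader")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("MY_KSQL_TABLE"))      // the table's result topic

    // Latest value per key; a real service would write to Redis instead.
    val latest = mutable.Map.empty[String, String]
    while (true) {
      for (rec <- consumer.poll(Duration.ofMillis(500)).asScala) {
        if (rec.value == null) latest.remove(rec.key)   // tombstone = row deleted
        else latest.update(rec.key, rec.value)
      }
    }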
Check out our clickstream demo, where we push the results into Elastic for visualization. The visualized values are the latest values in the corresponding tables.
https://github.com/confluentinc/ksql/tree/master/ksql-clickstream-demo#clickstream-analysis

Using Kafka Streams to create a table based on Elasticsearch events

Is it possible to use Kafka Streams to create a pipeline that reads JSON from a Kafka topic, applies some logic to it, and sends the results to another Kafka topic or somewhere else?
For example, I populate my topic using logs from Elasticsearch. That is pretty easy using a simple Logstash pipeline.
Once I have my logs in the Kafka topic, I want to extract some pieces of information from each log and put them in a sort of "table" with N columns (is Kafka capable of this?) and then put the table somewhere else (another topic or a DB).
I didn't find any example that satisfies my criteria.
thanks
Yes, it's possible.
There is no concept of columns in Kafka or Kafka Streams. However, you typically just define a plain old Java object of your choice with the fields that you want (fields being the equivalent of columns in this case). You produce the output in that format to an output topic (using an appropriately chosen serializer). Finally, if you want to store the result in a relational database, you map the fields to columns, typically using a Kafka Connect JDBC sink:
http://docs.confluent.io/current/connect/connect-jdbc/docs/sink_connector.html
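As a rough Scala sketch of that idea (topic names and fields are placeholders, Jackson stands in for whatever JSON handling you prefer, and in practice the output would usually be Avro with a schema so the JDBC sink knows the columns):

    import java.util.Properties
    import com.fasterxml.jackson.databind.ObjectMapper
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.serialization.Serdes._

    // The "columns" we want to keep from each log event.
    case class LogRow(host: String, level: String, message: String)

    val mapper = new ObjectMapper()

    def parseLog(json: String): LogRow = {
      val node = mapper.readTree(json)
      LogRow(node.path("host").asText(""), node.path("level").asText(""), node.path("message").asText(""))
    }

    def toJson(row: LogRow): String =
      mapper.createObjectNode()
        .put("host", row.host)
        .put("level", row.level)
        .put("message", row.message)
        .toString

    val builder = new StreamsBuilder()

    builder
      .stream[String, String]("es-logs")             // placeholder input topic
      .mapValues(json => toJson(parseLog(json)))     // keep only the chosen fields
      .to("structured-logs")                         // a JDBC sink connector maps these fields to columns

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "logs-to-table")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    new KafkaStreams(builder.build(), props).start()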

Spark Structured Streaming PostgreSQL updateStateByKey

How can the state of an OUTPUT TABLE be updated by a Spark Structured Streaming computation that is triggered by changes in an INPUT PostgreSQL table?
As a real-life scenario: the USERS table has been updated for user_id = 0002. How do I trigger the Spark computation for that user only and write/update the results to another table?
Although there is no out-of-the-box solution, you can implement it the following way.
You can use LinkedIn's Databus or a similar change-data-capture tool that mines the database logs and produces corresponding events to Kafka; such tools track the changes in the database's bin logs. You can write a Kafka connector to transform and filter the data. You can then consume the events from Kafka and process them into any sink format you want.
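For illustration, a Spark Structured Streaming sketch (Spark 3.x) of the last step, assuming the CDC tool publishes row-change events for the USERS table as JSON to a Kafka topic; the topic, connection details and column names are placeholders:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, get_json_object}

    val spark = SparkSession.builder().appName("users-cdc").getOrCreate()

    // Change events for the USERS table, produced by the CDC tool.
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "users-cdc")                    // placeholder
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(get_json_object(col("json"), "$.user_id").as("user_id"), col("json"))

    // Recompute/update results only for the users that actually changed.
    changes.writeStream
      .option("checkpointLocation", "/chk/users-cdc")      // placeholder
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://db:5432/app")  // placeholder
          .option("dbtable", "user_results")
          .option("user", "app")
          .option("password", "secret")
          .mode("append")   // a true upsert needs MERGE / ON CONFLICT logic on the DB side
          .save()
      }
      .start()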