Flink Table and Hive Catalog storage

I have a Kafka topic and a Hive Metastore. I want to join the incoming events from the Kafka topic with records from the Metastore. I saw that Flink offers the possibility to use a catalog to query the Hive Metastore.
So I see two ways to handle this:
using the DataStream API to consume the Kafka topic and query the Hive Catalog one way or another in a ProcessFunction or something similar
using the Table API, I would create a table from the Kafka topic and join it with the Hive Catalog
My biggest concerns are storage related.
In both cases, what is stored in memory and what is not? Does the Hive catalog store anything on the Flink cluster side?
In the second case, how is the table handled? Does Flink create a copy?
Which solution seems the best? (Maybe both or neither are good choices.)

Different methods are suitable for different scenarios, sometimes depending on whether your Hive table is a static table or a dynamic table.
If your Hive table is only a dimension table, you can try the approach described in the joins-in-continuous-queries chapter of the Flink documentation.
It will automatically pick up the latest partition of the Hive table, and it is suitable for scenarios where the dimension data is updated slowly.
But note that this feature is not supported by the legacy planner.
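As a rough sketch of that Table API route (catalog, database, table, topic, and column names here are all assumptions, and the hint options follow the Flink Hive connector documentation), a temporal join against the latest Hive partition could look like the following. It also speaks to the storage concern above: in this mode Flink caches only the latest dimension partition on the task managers and refreshes it, rather than copying the whole Hive table.

-- Register the Hive Metastore as a catalog (path is an assumed example).
CREATE CATALOG hive_catalog WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive-conf'
);

-- Kafka-backed fact table; the processing-time attribute is needed
-- for the temporal join below.
CREATE TABLE orders (
  order_id STRING,
  product_id STRING,
  proctime AS PROCTIME()
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

-- Temporal join against the latest Hive partition; the hint tells Flink
-- to monitor the Hive table and reload only the newest partition.
SELECT o.order_id, dim.product_name
FROM orders AS o
JOIN hive_catalog.mydb.products
  /*+ OPTIONS('streaming-source.enable'='true',
              'streaming-source.partition.include'='latest') */
  FOR SYSTEM_TIME AS OF o.proctime AS dim
  ON o.product_id = dim.product_id;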

Related

Can we use kafka JDBC source connector to pull data from multiple databases and put it into one input topic?

We have a use case wherein the business logic requires us to join tables from different databases and push the end result to an input topic.
table1 from schema1 in database1
table2 from schema2 in database2
Business logic
SELECT a,b FROM table1 INNER JOIN table2 ON table1.c = table2.d;
Here, a is from table1 and b is from table2, and the value of the message in the input topic looks like { "payload": { "a": xyz, "b": xyz } }
Is there any way to achieve this requirement with a single jdbc source connector?
PS:
I have referred to Can the JDBC Kafka Connector pull data from multiple databases?, but in the accepted answer messages are pushed to the input topic without implementing any business logic. With that implementation we won't be able to push the message to the input topic as per our requirement.
An alternative would be to use Kafka Streams, i.e., push the messages from each table to the input topic and handle the join logic at the Kafka Streams application level. But we are looking for a solution where we could implement the logic at the connector level itself.
Short answer: No, you cannot use the JDBC Source connector in this way.
Longer answer: The JDBC source connector can connect to one database per connector instance. You have a few options:
Stream the contents of both tables into Kafka, and use ksqlDB (or Kafka Streams if you prefer) to join them and push the resulting data to a new Kafka topic (sketched below).
Write a new connector plugin yourself that connects to both databases and does the join (this sounds like an awful idea)
If the database supports it, use a remote join (e.g. Oracle's DB Link) and the JDBC source connector's query option.
Depending on data volumes and query complexity I'd personally go for option 1. ksqlDB is a perfect fit here.
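A minimal sketch of option 1, assuming the two tables have already been streamed into the hypothetical topics db1_table1 and db2_table2 (e.g. via JDBC or Debezium source connectors) with messages keyed on the join columns; the PRIMARY KEY syntax shown is from recent ksqlDB versions:

-- Register the two source topics as tables keyed on the join columns.
CREATE TABLE t1 (c VARCHAR PRIMARY KEY, a VARCHAR)
  WITH (KAFKA_TOPIC='db1_table1', VALUE_FORMAT='JSON');

CREATE TABLE t2 (d VARCHAR PRIMARY KEY, b VARCHAR)
  WITH (KAFKA_TOPIC='db2_table2', VALUE_FORMAT='JSON');

-- The join result is continuously written to the Kafka topic backing
-- JOINED, which can then serve as the "input topic" from the question.
CREATE TABLE joined AS
  SELECT t1.c AS c, t1.a AS a, t2.b AS b
  FROM t1
  JOIN t2 ON t1.c = t2.d;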
If both databases are on the same database server, you can use a custom query (the connector's query option) and write the SQL joining the tables across databases.
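For that same-server case, the connector's query option can be given a single statement that joins across databases with fully qualified names (the exact qualification syntax depends on the RDBMS; the identifiers below are taken from the question):

-- One cross-database join executed on the database server itself;
-- every returned row ends up in the connector's single output topic.
SELECT t1.a, t2.b
FROM database1.schema1.table1 AS t1
INNER JOIN database2.schema2.table2 AS t2
  ON t1.c = t2.d;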

KSQL query and tables storage

I was looking for documentation about where KSQL stores queries and tables. For example, since KSQL was built to work with Kafka, when I create a table from a topic, or when I write a query, where are the tables or the query results stored? More specifically, does KSQL use some kind of pointers to events inside segments inside the topic partitions, or does it duplicate the events when I create a table from a topic, for example?
The queries that have been run or are active are persisted back into a Kafka topic.
A SELECT statement has no persistent state; it acts as a consumer.
A CREATE STREAM/TABLE command will potentially create many topics, resulting in duplication, manipulation, and filtering of the input topic out to a given destination topic. For any stateful operations, results would be stored in a RocksDB instance on the KSQL server(s).
Since KSQL is built on Kafka Streams, you can refer to the wiki on Kafka Streams Internal Data Management
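To make that concrete, here is a hedged sketch (topic, stream, and column names are assumptions, and the exact syntax varies a bit across KSQL/ksqlDB versions):

-- Registers metadata only: nothing is copied, the statement just points
-- at the existing pageviews topic.
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- A persistent query: the aggregation state lives in RocksDB on the KSQL
-- server(s), and the derived result is written to a new changelog topic
-- (roughly PAGEVIEWS_PER_USER); the source topic itself is not duplicated
-- wholesale, only the derived result is materialized.
CREATE TABLE pageviews_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  GROUP BY user_id;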

JDBC Confluent kafka Connector and Topic per schema

We recently started using Confluent Kafka-JDBC connector to import RDBMS data.
As part of the default configuration settings, it seems that one topic is created for every table in the schema.
I would like to know if there is any way to
Create a topic per schema rather than per table. And if topic-per-schema is enabled, can schema evolution (with Schema Registry) be supported on a per-table basis?
If topic-per-schema is not possible, are there any guidelines on how to manage hundreds or thousands of topics, considering that there will be a one-to-one mapping between the number of tables and the number of topics?
Thanks in advance,
Create a topic per schema rather than per table.
No - it's either n tables -> n topics, or 1 query -> 1 topic.
any guidelines on how to manage hundreds or thousands of topics?
Adopt a standard naming pattern for them. Use topic-specific configuration as required.
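Purely as an illustration of the "1 query -> 1 topic" route, not a recommendation: the connector's query mode takes one SQL statement and writes every row it returns to a single topic, so combining several tables of a schema means flattening them into one result shape (which is also why per-table schema evolution would not apply). All identifiers below are hypothetical:

-- One statement, one output topic; the columns of each branch must line up.
SELECT id, name AS payload, updated_at, 'customers' AS source_table
FROM myschema.customers
UNION ALL
SELECT id, status AS payload, updated_at, 'orders' AS source_table
FROM myschema.orders;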

Read data from KSQL tables

Maybe this is a beginner question, but what is the recommended way to read data produced in KSQL?
Let's assume I do some stream processing and write the data to a KSQL table. Now I want to access this data via a Spring application (e.g. fan-out some live data via a websocket). My first guess here was to use Spring Kafka and just subscribe to the underlying topic. Or should I use Kafka Streams?
Another use-case could be to do stream processing and write the results to a Redis store (e.g. for a webservice which always returns current values). What would be the approach here?
Thanks!
The results of KSQL queries are stored in Kafka topics, so you can access the results from third-party applications by reading from the result topic.
If the query result is a table, the resulting Kafka topic is a changelog topic, meaning that you can read it into a table in a third-party system such as Cassandra or Redis. This table will always have the latest result, and you can query it from web services.
Check out our clickstream demo, where we push the results into Elastic for visualization. The visualized values are the latest values in the corresponding tables.
https://github.com/confluentinc/ksql/tree/master/ksql-clickstream-demo#clickstream-analysis
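As a sketch of that pattern (stream, table, and column names are assumptions): the table below is backed by a changelog topic, whose name DESCRIBE EXTENDED will show, and that topic can be consumed directly by a Spring Kafka listener or pushed into Redis/Cassandra with a Kafka Connect sink.

-- Continuously maintained result; the latest value per key ends up in the
-- table's backing Kafka topic.
CREATE TABLE orders_per_user AS
  SELECT user_id, COUNT(*) AS order_count
  FROM orders_stream
  GROUP BY user_id;

-- Shows, among other things, the name of the backing Kafka topic to
-- subscribe to from the Spring application.
DESCRIBE EXTENDED orders_per_user;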

Querying MySQL tables using Apache Kafka

I am trying to use Kafka Streams for achieving a use-case.
I have two tables in MySQL - User and Account. And I am getting events from MySQL into Kafka using a Kafka MySQL connector.
I need to get all user-IDs within an account from within Kafka itself.
So I was planning to use a KStream on the MySQL output topic, process it to form an output, and publish it to a topic with the key as the account ID and the value as the user IDs separated by commas (,).
Then I can use interactive queries to get all user IDs for an account ID, with the get() method of the ReadOnlyKeyValueStore class.
Is this the right way to do this? Is there a better way?
Can KSQL be used here?
You can use Kafka Connect to stream data in from MySQL, e.g. using Debezium. From there you can use Kafka Streams, or KSQL, to transform the data, including re-keying, which I think is what you're looking to do here, as well as joining it to other streams.
If you ingest the data from MySQL into a topic with log compaction set then you are guaranteed to always have the latest value for every key in the topic.
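Since KSQL was asked about explicitly, here is a hedged sketch of the re-keying and aggregation (topic, stream, and column names are assumptions; COLLECT_LIST needs a reasonably recent KSQL/ksqlDB version and yields an array rather than a comma-separated string):

-- Stream over the topic produced by the MySQL connector for the User table.
CREATE STREAM users_raw (user_id VARCHAR, account_id VARCHAR)
  WITH (KAFKA_TOPIC='mysql.users', VALUE_FORMAT='JSON');

-- Re-keys by account_id and collects the user IDs; the result is a table
-- backed by a changelog topic, so the latest list per account is always
-- available to consumers of that topic.
CREATE TABLE users_per_account AS
  SELECT account_id, COLLECT_LIST(user_id) AS user_ids
  FROM users_raw
  GROUP BY account_id;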
I would take a look at Striim if you want built-in CDC and interactive continuous SQL queries on the streaming data in one UI. More info here:
http://www.striim.com/blog/2017/08/making-apache-kafka-processing-preparation-kafka/