I have records containing multiple nested JSON objects (UDTs), and based on those I have multiple tables in Redshift: one for the main JSON and others for the nested JSON.
Now these input records need to be delivered from Kinesis Firehose to Redshift. When configuring Redshift as the destination through the AWS console, there is an option to provide only one table.
Is there any way of configuring more than one Redshift table as a destination for Kinesis Firehose?
It seems only one Redshift table can be configured as the Firehose destination.
A single delivery stream can only deliver data to one Amazon Redshift cluster and one table currently. If you want to have data delivered to multiple Redshift clusters or tables, you can create multiple delivery streams.
Reference:
https://aws.amazon.com/kinesis/data-firehose/faqs/
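If you do go the multiple-delivery-streams route (one stream per Redshift table), the routing has to happen on the producer side. A minimal sketch with boto3, assuming hypothetical stream names and that each record already carries the name of its target table:

```python
import json

import boto3

# Hypothetical mapping from a record's target table to its delivery stream;
# one Firehose stream per Redshift table, as described above.
STREAMS = {
    "main": "firehose-main-table",
    "nested_a": "firehose-nested-a-table",
}

firehose = boto3.client("firehose")


def route_record(record: dict) -> None:
    """Send a record to the delivery stream that feeds its Redshift table."""
    stream_name = STREAMS[record["table"]]        # illustrative routing key
    payload = json.dumps(record["data"]) + "\n"   # newline-delimited JSON for COPY
    firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload.encode("utf-8")},
    )
```

For higher volume, the same idea works with put_record_batch by grouping records per stream before sending.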
Related
I am looking for a way to read a Kinesis stream with Structured Streaming in Databricks. I am able to use a Spark cluster to continuously read the streaming data. However, I now need a way to take those records as they arrive and do CRUD operations against Postgres or any other DBMS tables. Is there a way I can do this via Databricks alone?
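One pattern that seems to fit this on Databricks is to read the Kinesis stream with Structured Streaming and push each micro-batch to Postgres over JDBC inside foreachBatch. A minimal sketch, assuming the Databricks Kinesis connector and placeholder stream, table and connection details:

```python
from pyspark.sql import functions as F

# Read the Kinesis stream (Databricks' "kinesis" source); names are placeholders.
raw = (
    spark.readStream
    .format("kinesis")
    .option("streamName", "my-stream")
    .option("region", "us-east-1")
    .option("initialPosition", "latest")
    .load()
)

# Kinesis payloads arrive as binary in the `data` column; decode to JSON strings.
events = raw.select(F.col("data").cast("string").alias("json_value"))


def write_to_postgres(batch_df, batch_id):
    # Plain appends over JDBC; real updates/deletes (the "CRUD" part) would need
    # custom SQL inside this function, e.g. writing to a staging table and merging.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/mydb")  # placeholder URL
        .option("dbtable", "events_raw")                    # placeholder table
        .option("user", "db_user")
        .option("password", "db_password")
        .mode("append")
        .save())


(events.writeStream
    .foreachBatch(write_to_postgres)
    .option("checkpointLocation", "/tmp/checkpoints/kinesis-to-postgres")
    .start())
```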
I am designing a streaming pipeline where I need to consume events from a Kafka topic.
A single Kafka topic can carry data from around 1000 tables, with each message arriving as a JSON record. Now I have the problems below to solve.
Reroute messages to a separate folder per table: this is done using Spark Structured Streaming with partitionBy on the table name.
Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. I am not able to find the best solution for this: inferring the JSON schema dynamically and writing to the Delta table dynamically. How can this be done, given that the records arrive as JSON strings?
Since I have to process so many tables, do I need to write that many streaming queries? How can this be solved?
Thanks
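One way to cover both points with a single streaming query is to split each micro-batch by table name inside foreachBatch, infer that table's schema from its JSON strings, and append to the matching Delta path. A minimal sketch, assuming each message carries a table field and using placeholder broker, topic and path names:

```python
from pyspark.sql import functions as F

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "all-tables-topic")            # placeholder topic
    .load()
    .select(F.col("value").cast("string").alias("json_value"))
    .withColumn("table_name", F.get_json_object("json_value", "$.table"))
)


def write_per_table(batch_df, batch_id):
    # One streaming query overall; inside each micro-batch, split by table,
    # infer that table's schema from its JSON strings, and append to its Delta path.
    for (table_name,) in batch_df.select("table_name").distinct().collect():
        subset = batch_df.filter(F.col("table_name") == table_name)
        parsed = spark.read.json(
            subset.select("json_value").rdd.map(lambda r: r.json_value)
        )
        (parsed.write
            .format("delta")
            .mode("append")
            .option("mergeSchema", "true")        # tolerate slow schema drift
            .save(f"/mnt/delta/{table_name}"))    # placeholder base path


(raw.writeStream
    .foreachBatch(write_per_table)
    .option("checkpointLocation", "/mnt/checkpoints/all-tables")
    .start())
```

Schema inference per micro-batch is convenient but can be slow and can drift; if the schemas are known up front, keeping a table-name-to-schema map and using from_json is usually safer.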
I have a Kafka topic and a Hive Metastore. I want to join the incoming events from the Kafka topic with records from the metastore. I saw that Flink offers the possibility of using a catalog to query the Hive Metastore.
So I see two ways to handle this:
using the DataStream API to consume the Kafka topic and query the Hive catalog one way or another in a ProcessFunction or something similar
using the Table API, I would create a table from the Kafka topic and join it with the Hive catalog
My biggest concerns are storage related.
In both cases, what is stored in memory and what is not? Does the Hive catalog store anything on the Flink cluster's side?
In the second case, how is the table handled? Does Flink create a copy?
Which solution seems the best? (Maybe both or neither are good choices.)
Different methods are suitable for different scenarios, sometimes depending on whether your Hive table is a static table or a dynamic table.
If your Hive table is only a dimension table, you can try the approach in this chapter:
joins-in-continuous-queries
It will automatically pick up the latest Hive partition, and it is suitable for scenarios where the dimension data is updated slowly.
But you need to note that this feature is not supported by the Legacy planner.
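For what it's worth, here is a minimal PyFlink sketch of the Table API route, assuming a Hive dimension table and a Kafka fact stream; the catalog name, fields and connection details are placeholders, and the table options follow Flink's Hive temporal-join documentation, so they are worth checking against your Flink version:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.catalog import HiveCatalog

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register the Hive Metastore as a catalog: only metadata lives in the metastore,
# the dimension data itself is read from storage and cached on the TaskManagers
# for the duration of the join.
hive = HiveCatalog("myhive", "default", "/opt/hive-conf")  # placeholder conf dir
t_env.register_catalog("myhive", hive)

# Kafka fact stream with a processing-time attribute for the temporal join.
t_env.execute_sql("""
    CREATE TABLE orders (
        user_id STRING,
        amount DOUBLE,
        proc_time AS PROCTIME()
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Temporal join against the latest Hive partition; Flink re-reads the partition
# on the configured monitor interval, so slowly updated dimensions stay fresh.
t_env.execute_sql("""
    SELECT o.user_id, o.amount, d.user_segment
    FROM orders AS o
    JOIN myhive.`default`.user_dim
      /*+ OPTIONS('streaming-source.enable'='true',
                  'streaming-source.partition.include'='latest',
                  'streaming-source.monitor-interval'='1 h') */
      FOR SYSTEM_TIME AS OF o.proc_time AS d
    ON o.user_id = d.user_id
""").print()
```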
My organisation has MongoDB, which stores application-based time-series data. Now we are trying to create a data pipeline for analytics and visualisation. Because the data is time-series, we plan to use Druid as intermediate storage, where we can do the required transformations, and then use Apache Superset to visualise it. Is there any way to migrate the required data (not only updates) from MongoDB to Druid?
I was thinking about Apache Kafka, but from what I have read, I understood that it works best for streaming changes to topics (a topic associated with each table) that already exist in MongoDB and Druid. But what if there is a table of at least 100,000 records that exists only in MongoDB and I first wish to push the whole table to Druid? Will Kafka work in this scenario?
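Kafka can still cover the initial load: a one-off job can read the existing MongoDB collection and publish every document to the same topic that a Druid Kafka ingestion supervisor consumes, and CDC can then take over for ongoing changes. A minimal sketch with pymongo and kafka-python, using placeholder connection strings, collection and topic names:

```python
import json

from kafka import KafkaProducer
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
collection = client["mydb"]["sensor_readings"]      # placeholder collection

producer = KafkaProducer(
    bootstrap_servers="broker:9092",                # placeholder broker
    value_serializer=lambda d: json.dumps(d, default=str).encode("utf-8"),
)

# Stream the existing 100,000+ documents through the topic once; Druid's Kafka
# supervisor indexes them like any other incoming events.
for doc in collection.find({}).batch_size(1000):
    doc["_id"] = str(doc["_id"])                    # make the ObjectId JSON-safe
    producer.send("druid-ingest-topic", value=doc)  # placeholder topic

producer.flush()
```

For very large collections, Druid's native batch ingestion (reading an export from S3/HDFS) may be a better fit than pushing everything through Kafka.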
I am passing some JSON data to a Firehose delivery stream, which in the end gets saved into a Redshift table. For my use case, I want the data to be stored in different tables.
Do I create a different delivery stream for each table?
If I create them that way, there will be data duplication in S3, as the data must go through S3 in order to be pushed to Redshift by a Firehose delivery stream.
From the Kinesis Firehose FAQ:
Q: Can a single delivery stream deliver data to multiple Amazon Redshift clusters or tables?
A single delivery stream can only deliver data to one Amazon Redshift cluster and one table currently. If you want to have data delivered to multiple Redshift clusters or tables, you can create multiple delivery streams.
You will need multiple streams.
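If you create the extra streams with boto3 (or CloudFormation/Terraform), each one can point at its own Redshift table and use its own S3 prefix, so the staged objects for each table land under a separate prefix rather than colliding. A minimal sketch, with placeholder ARNs, JDBC URL and credentials:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="firehose-orders-table",     # one stream per Redshift table
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift",  # placeholder
        "ClusterJDBCURL": "jdbc:redshift://my-cluster:5439/mydb",       # placeholder
        "CopyCommand": {
            "DataTableName": "orders",              # the Redshift table for this stream
            "CopyOptions": "json 'auto'",
        },
        "Username": "firehose_user",                # placeholder credentials
        "Password": "replace-me",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3",    # placeholder
            "BucketARN": "arn:aws:s3:::my-staging-bucket",
            "Prefix": "orders/",                    # per-table staging prefix
        },
    },
)
```

The data each stream stages in S3 is only what you send to that stream, so as long as each record is routed to exactly one stream there is no duplication, just a separate intermediate copy per table.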