Reading a Delta Table with no Manifest File using Redshift - amazon-redshift

My goal is to read a Delta Table on AWS S3 using Redshift. I've read through the Redshift Spectrum to Delta Lake Integration documentation and noticed that it says to generate a manifest using Apache Spark, either with:
GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`
or
DeltaTable deltaTable = DeltaTable.forPath(<path-to-delta-table>);
deltaTable.generate("symlink_format_manifest");
However, there doesn't seem to be any support for generating these manifest files from Apache Flink or the Delta Standalone library it uses, which is the software that actually writes data to the Delta table.
How can I get around this limitation?
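For context, the only workaround I can see right now is to run a small Spark job on a schedule purely to regenerate the manifest against the table that Flink writes to. A rough PySpark sketch of that (the bucket path and package coordinates are placeholders):

# Rough sketch: a scheduled PySpark job whose only purpose is to (re)generate
# the symlink manifest for a Delta table written by another engine (Flink here).
# The table path and the Delta package version are placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .appName("generate-delta-manifest")
    # Version is an assumption; match it to your Spark/Delta versions.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

delta_table = DeltaTable.forPath(spark, "s3://my-bucket/path/to/delta-table")
delta_table.generate("symlink_format_manifest")

That works, but it means maintaining a Spark job alongside the Flink pipeline just for manifest generation.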

This functionality seems to now be supported on AWS:
With today’s launch, Glue crawler is adding support for creating AWS Glue Data Catalog tables for native Delta Lake tables and does not require generating manifest files. This improves customer experience because now you don’t have to regenerate manifest files whenever a new partition becomes available or a table’s metadata changes.
https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/
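If you manage this in code, the crawler can be created and started with boto3. A rough sketch (the crawler name, IAM role, database, and S3 path are placeholders, and the DeltaTargets fields should be checked against the current Glue API):

# Rough sketch: create and start a Glue crawler that registers a native
# Delta Lake table (no symlink manifest). All names, the role ARN, and the
# S3 path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="delta-table-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_delta_db",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://my-bucket/path/to/delta-table/"],
                "WriteManifest": False,          # no symlink manifest needed
                "CreateNativeDeltaTable": True,  # register as a native Delta table
            }
        ]
    },
)

glue.start_crawler(Name="delta-table-crawler")

Setting CreateNativeDeltaTable avoids the manifest requirement described in the linked post; setting WriteManifest to True instead would reproduce the older symlink-manifest layout.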

Related

How to copy Druid data source data from prod server to QA server (like hive distcp action)

I wanted to check if there is a way to copy Druid datasource data (segments) from one server to another. Our requirement is to load new data to the prod Druid cluster (using SQL queries) and copy the same data to the QA Druid cluster. We are using the Hive Druid storage handler to load the data, and HDFS as deep storage.
I have read the Druid documentation but did not find any useful information.
There is currently no way to do this cleanly in Druid.
If you really need this feature, please request it by opening a GitHub issue at https://github.com/apache/druid/issues.
A workaround is documented here: https://docs.imply.io/latest/migrate/#the-new-cluster-has-no-data-and-can-access-the-old-clusters-deep-storage
Full disclosure: I work for Imply.

Pyspark Structured Streaming error with delta file verison

I have a job that streams data from a Delta table (Parquet files) to an output table in JSON format. Both tables live in an Azure Data Lake container.
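The job is roughly the following shape (the paths and checkpoint location below are placeholders, not the real ones):

# Rough shape of the streaming job; all paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-to-json-stream").getOrCreate()

stream = (
    spark.readStream
    .format("delta")
    .load("abfss://container@account.dfs.core.windows.net/input/delta-table")
)

query = (
    stream.writeStream
    .format("json")
    .option("checkpointLocation",
            "abfss://container@account.dfs.core.windows.net/checkpoints/delta-to-json")
    .option("path",
            "abfss://container@account.dfs.core.windows.net/output/json-table")
    .start()
)

query.awaitTermination()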
I get the following error, which I can't make sense of:
java.lang.IllegalArgumentException: requirement failed: Did not get the first delta file version: 921 to compute Snapshot
What does this mean? I don't want to delete my checkpoint files or the transaction log etc.
Thanks in advance
Note: Restoring Azure Data Lake Storage Gen2 flat and hierarchical namespaces is not supported.
For more details, refer to the Microsoft document “Point-in-time restore”:
Point-in-time restore allows you to recover data from actions that only affected block blobs. Any activities that acted on containers are irreversibly lost. For example, if you use the Delete Container action to delete a container from the storage account, that container cannot be restored using a point-in-time restore operation. If you wish to restore individual blobs later, remove individual blobs rather than a whole container.

Google Cloud SQL PostgreSQL replication?

I want to make sure that there's not a better (easier, more elegant) way of emulating what I think is typically referred to as "logical replication" ("logical decoding"?) within the PostgreSQL community.
I've got a Cloud SQL instance (PostgreSQL v9.6) that contains two databases, A and B. I want B to mirror A, as closely as possible, but don't need to do so in real time or anything near that. Cloud SQL does not offer the capability of logical replication where write-ahead logs are used to mirror a database (or subset thereof) to another database. So I've cobbled together the following:
1. A Cloud Scheduler job publishes a message to a topic in Google Cloud Platform (GCP) Pub/Sub.
2. A Cloud Function kicks off an export (a sketch of this function follows the list). The exported file is in the form of a pg_dump file.
3. The dump file is written to a named bucket in Google Cloud Storage (GCS).
4. Another Cloud Function (the import function) is triggered by the writing of this export file to GCS.
5. The import function makes an API call to delete database B (the pg_dump file created by the export API call does not contain initial DROP statements and there is no documented facility for adding them via the API).
6. It creates database B anew.
7. It makes an API call to import the pg_dump file.
8. It deletes the old pg_dump file.
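For reference, the export function (step 2 above) is roughly this shape; the project, instance, database, and bucket names are placeholders:

# Rough sketch of the Pub/Sub-triggered Cloud Function that kicks off the export
# (step 2 above). Project, instance, database, and bucket names are placeholders.
import datetime

import googleapiclient.discovery


def start_export(event, context):
    """Pub/Sub-triggered entry point: calls the Cloud SQL Admin API export."""
    sqladmin = googleapiclient.discovery.build("sqladmin", "v1beta4")

    timestamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    body = {
        "exportContext": {
            "fileType": "SQL",  # produces a pg_dump-style SQL file
            "uri": f"gs://my-export-bucket/db-a-export-{timestamp}.sql",
            "databases": ["A"],
        }
    }

    operation = sqladmin.instances().export(
        project="my-project", instance="my-cloudsql-instance", body=body
    ).execute()
    print(f"Export operation started: {operation.get('name')}")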
That's five different objects across four GCP services, just to obtain already existing, native functionality in PostgreSQL.
Is there a better way to do this within Google Cloud SQL?

How to continuously populate a Redshift cluster from AWS Aurora (not a sync)

I have a number of MySQL databases (OLTP) running on an AWS Aurora cluster. I also have a Redshift cluster that will be used for OLAP. The goal is to replicate inserts and changes from Aurora to Redshift, but not deletes. Redshift in this case will be an ever-growing data repository, while the Aurora databases will have records created, modified and destroyed; Redshift records should never be destroyed (at least, not as part of this replication mechanism).
I was looking at DMS, but it appears that DMS doesn't have the granularity to exclude deletes from the replication. What is the simplest and most effective way of setting up the environment I need? I'm open to third-party solutions, as well, as long as they work within AWS.
I currently have DMS continuous sync set up.
You could consider using DMS to replicate to S3 instead of Redshift, then use Redshift Spectrum (or Athena) against that S3 data.
S3 as a DMS target is append only, so you never lose anything.
See https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html and https://aws.amazon.com/blogs/database/replicate-data-from-amazon-aurora-to-amazon-s3-with-aws-database-migration-service/
With this approach, things get a bit more complex and you may need some ETL to process that data (depending on your needs).
You will still get the deletes coming through with a record type of "D", but you can ignore or process these depending on your needs.
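For example, if you land the CDC files on S3 and run a Spark step before loading, dropping the deletes can be a one-line filter. This sketch assumes the DMS S3 target is configured to write Parquet, where the change indicator lands in an "Op" column; both paths are placeholders:

# Minimal sketch: drop DMS delete ("D") change records before loading downstream.
# Assumes the DMS S3 target writes Parquet with the standard "Op" column;
# both S3 paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dms-cdc-filter").getOrCreate()

cdc = spark.read.parquet("s3://my-dms-bucket/my-schema/my-table/")

# Keep inserts ("I") and updates ("U"); discard deletes ("D").
inserts_and_updates = cdc.filter(F.col("Op") != "D")

inserts_and_updates.write.mode("append").parquet("s3://my-curated-bucket/my-table/")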
A simple and effective way to capture inserts and updates from Aurora to Redshift may be the following approach:
Aurora Trigger -> Lambda -> Firehose -> S3 -> RedShift
The AWS blog post below walks through this implementation and is very close to your use case.
It also provides sample code for getting the changes from an Aurora table to S3 through AWS Lambda and Firehose. In Firehose, you can set the destination to Redshift, which will seamlessly copy the data from S3 into Redshift.
Capturing Data Changes in Amazon Aurora Using AWS Lambda
AWS Firehose Destinations
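The Lambda piece of that pipeline might look roughly like this; the delivery stream name and event payload shape are placeholders, and the linked blog post's sample code is the authoritative version:

# Rough sketch of the Lambda invoked from the Aurora trigger: it forwards the
# changed row to a Firehose delivery stream whose destination is Redshift (via S3).
# The delivery stream name and the event payload shape are placeholders.
import json

import boto3

firehose = boto3.client("firehose")

DELIVERY_STREAM = "aurora-changes-to-redshift"  # placeholder name


def handler(event, context):
    # Assumes the Aurora trigger passes the changed row's columns as the event payload.
    record = json.dumps(event) + "\n"
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": record.encode("utf-8")},
    )
    return {"status": "ok"}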

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post on how to trigger Dataflow pipelines from App Engine or Cloud Functions would help:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
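If you go the Cloud Functions route, one pattern is a GCS-triggered function that launches a Dataflow template through the Dataflow REST API. A minimal Python sketch (project, region, template path, and pipeline parameters are placeholders; the node.js googleapis client exposes the same API):

# Minimal sketch: a GCS-triggered Cloud Function that launches a Dataflow
# template job for the newly arrived file. Project, region, template path,
# and pipeline parameters are placeholders.
import googleapiclient.discovery


def launch_dataflow(event, context):
    """Entry point for a google.storage.object.finalize trigger."""
    bucket = event["bucket"]
    name = event["name"]

    dataflow = googleapiclient.discovery.build("dataflow", "v1b3")

    body = {
        "jobName": "process-new-gcs-file",  # placeholder; should be unique per launch
        "parameters": {
            "inputFile": f"gs://{bucket}/{name}",
            "outputTable": "my-project:my_dataset.my_table",
        },
    }

    response = dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://my-bucket/templates/my-template",
        body=body,
    ).execute()
    print(f"Launched Dataflow job: {response.get('job', {}).get('id')}")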