PySpark Structured Streaming error with delta file version - pyspark

I have a job that streams data from a Delta table with Parquet files to an output table in JSON format. Both tables live in an Azure Data Lake container.
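A minimal sketch of that kind of job, assuming a standard Delta-source-to-JSON-sink setup (the paths below are placeholders, not the actual locations):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Delta table as a stream (placeholder path).
source = (
    spark.readStream
    .format("delta")
    .load("abfss://data@account.dfs.core.windows.net/source_table")
)

# Write the stream out as JSON, tracking progress in a checkpoint directory (placeholder paths).
query = (
    source.writeStream
    .format("json")
    .option("checkpointLocation", "abfss://data@account.dfs.core.windows.net/_checkpoints/json_sink")
    .option("path", "abfss://data@account.dfs.core.windows.net/output_table")
    .start()
)

query.awaitTermination()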
I get the following error, which I can't make sense of:
java.lang.IllegalArgumentException: requirement failed: Did not get the first delta file version: 921 to compute Snapshot
What does this mean? I don't want to delete my checkpoint files or the transaction log etc.
Thanks in advance

Note: Restoring Azure Data Lake Storage Gen2 flat and hierarchical namespaces is not supported. For more details, refer to the Microsoft documentation on "Point-in-time restore".
Point-in-time restore allows you to recover data from actions that affected only block blobs. Any operations that acted on containers cannot be reverted. For example, if you use the Delete Container operation to delete a container from the storage account, that container cannot be restored with a point-in-time restore operation. If you want to be able to restore individual blobs later, delete individual blobs rather than a whole container.

Related

CDF pipeline throws error: GCS bucket doesn't exist

While moving data to the sink, I'm getting this error in a GCP Data Fusion pipeline.
Can someone help?
GCS path cdap-job/dd5d2bba-9cce-11ed-8666-56bac137a1c0 was not cleaned up for bucket gs://df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq due to The specified bucket does not exist..
I tried recreating the temp bucket as it appeared in the log, but the name keeps changing.
I had deleted a few temp buckets from the list, and I suspect that caused this issue.
As of today, every Data Fusion instance is created with an accompanying bucket whose name looks similar to the one you shared in your message, i.e. df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq. This bucket is visible in GCS in the project you created your instance in.
You are probably getting this error because you deleted this bucket from GCS. You can either recreate the bucket or recreate your Data Fusion instance.
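For example, a quick sketch of recreating the bucket with the Python client library (the project ID and location are placeholders; the bucket name is taken from the error message, and Data Fusion normally keeps this bucket in the same project and region as the instance):

from google.cloud import storage

# Recreate the staging bucket named in the error (placeholder project and location).
client = storage.Client(project="your-project-id")
client.create_bucket(
    "df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq",
    location="us-central1",
)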

Reading a Delta Table with no Manifest File using Redshift

My goal is to read a Delta table on AWS S3 using Redshift. I've read through the Redshift Spectrum to Delta Lake integration documentation and noticed that it says to generate a manifest using Apache Spark with:
GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`
or
DeltaTable deltaTable = DeltaTable.forPath(<path-to-delta-table>);
deltaTable.generate("symlink_format_manifest");
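For reference, the same call from PySpark looks roughly like this (a sketch assuming the delta-spark package is installed and the session is configured with the Delta extensions; the path is a placeholder):

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Generate the symlink manifest for an existing Delta table (placeholder path).
delta_table = DeltaTable.forPath(spark, "<path-to-delta-table>")
delta_table.generate("symlink_format_manifest")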
However, there doesn't seem to be support to generate these manifest files for Apache Flink and the respective Delta Standalone Library that it uses. This is the underlying software that writes data to the Delta Table.
How can I get around this limitation?
This functionality seems to now be supported on AWS:
With today’s launch, Glue crawler is adding support for creating AWS Glue Data Catalog tables for native Delta Lake tables and does not require generating manifest files. This improves customer experience because now you don’t have to regenerate manifest files whenever a new partition becomes available or a table’s metadata changes.
https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/

Access local storage for druid

How do I access or view the local storage for Druid? I would like to view the segments or copy them to a file. I am running the Druid operator on Kubernetes. I have tried exec commands on the Historical and MiddleManager pods, but I am unable to get into any of the Druid pods.
Have you tried looking at the location your deep storage configuration points to?
Deep storage is where segments are stored. It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data; as long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.
Source: Deep Storage on Druid documentation
For example, you need to know which directory druid.storage.storageDirectory points to.
Remember that the data is saved in segments, as described here: Segments in the Apache Druid documentation.
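As a rough illustration, local deep storage is driven by the common runtime properties, so the directory to look in is whatever these are set to in your deployment (the values below are only example settings, not your actual configuration):

# common.runtime.properties (example values only)
druid.storage.type=local
druid.storage.storageDirectory=/opt/druid/var/druid/segments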
Useful Documentation:
Ingestion troubleshooting FAQ
HDFS as Deep-Storage: Druid is not storing the historical data on hdfs
Druid Setup with HDFS
Change Local Storage to S3 as deepstorage

Can I use GCP snapshots for data or database backup? Can I trust a snapshot to restore my DB data?

I am using Google Cloud Platform and Compute Engine for my project and have installed MongoDB on the instance. What is the best way to take a backup of my database and restore it?
My options:
Create a custom script to dump -> zip -> upload to Cloud Storage.
Use a GCP snapshot to back up the data and restore it while creating a new instance.
Let me know the best way of handling this.
I got a bit confused because the snapshot sizes don't seem reliable when scheduling snapshots.
Yes, you can use snapshots. Also consider the suggestions below:
Use an additional data disk attached to the Compute Engine instance for the database.
In case of any failure, detach the disk and attach it to a new VM instance.
Snapshots will help you in a regional failure or when you need to restore a backup in another region. Also, the snapshot size seems confusing to you because snapshots are incremental in nature.
For example, snapshot-1 taken at 2:00 AM contains 10 GB of data.
Now you add 1 GB of data to the disk at 2:30 AM and take another snapshot at 3:00 AM.
snapshot-2 at 3:00 AM will only contain 1 GB of data (not 11 GB).
But the real magic is when you restore snapshot-2: it will restore your complete 11 GB of data, because it contains a reference to snapshot-1 with the 10 GB of data.
Yes, snapshots are reliable. You can use snapshots to create a point-in-time image of your disk. In your case, you can take data disk snapshots. To recover data from a snapshot, create a disk from the snapshot and attach it to the existing VM.
To automate this, you can run the gcloud commands in a bash/Python script. https://cloud.google.com/compute/docs/disks/restore-snapshot#create-disk-from-snapshot
For taking regular snapshots, you can set up a snapshot schedule and choose how many times a day it should run. To save costs, you can set a retention policy too. https://cloud.google.com/compute/docs/disks/scheduled-snapshots
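As a rough sketch of scripting that restore path from Python (all names and the zone below are placeholders):

import subprocess

disk = "restored-data-disk"
snapshot = "mongodb-data-snapshot"
zone = "us-central1-a"
instance = "my-mongodb-vm"

# Create a new disk from the snapshot.
subprocess.run(
    ["gcloud", "compute", "disks", "create", disk,
     "--source-snapshot", snapshot, "--zone", zone],
    check=True,
)

# Attach the restored disk to the existing VM; it still has to be mounted inside the guest OS.
subprocess.run(
    ["gcloud", "compute", "instances", "attach-disk", instance,
     "--disk", disk, "--zone", zone],
    check=True,
)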

How to continuously populate a Redshift cluster from AWS Aurora (not a sync)

I have a number of MySql databases (OLTP) running on an AWS Aurora cluster. I also have a Redshift cluster that will be used for OLAP. The goal is to replicate inserts and changes from Aurora to Redshift, but not deletes. Redshift in this case will be an ever-growing data repository, while the Aurora databases will have records created, modified and destroyed — Redshift records should never be destroyed (at least, not as part of this replication mechanism).
I was looking at DMS, but it appears that DMS doesn't have the granularity to exclude deletes from the replication. What is the simplest and most effective way of setting up the environment I need? I'm open to third-party solutions, as well, as long as they work within AWS.
I currently have DMS continuous sync set up.
You could consider using DMS to replicate to S3 instead of Redshift, then use Redshift Spectrum (or Athena) against that S3 data.
S3 as a DMS target is append only, so you never lose anything.
see
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
and
https://aws.amazon.com/blogs/database/replicate-data-from-amazon-aurora-to-amazon-s3-with-aws-database-migration-service/
This way, things get a bit more complex, and you may need some ETL to process that data (depending on your needs).
You will still get the deletes coming through with a record type of "D", but you can ignore or process these depending on your needs.
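For example, if the S3 output is later read with Spark (an equivalent WHERE clause works in Spectrum or Athena), the delete markers can simply be filtered out. The path, file format, and the assumption that the operation flag is exposed as an Op column are all placeholders that depend on your DMS task settings:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the CDC files DMS wrote to S3 (placeholder path; format depends on the task settings).
cdc = spark.read.parquet("s3://my-dms-bucket/my_schema/my_table/")

# Keep inserts and updates, drop the delete markers.
upserts_only = cdc.filter(cdc["Op"] != "D")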
A simple and effective way to capture inserts and updates from Aurora to Redshift may be to use the approach below:
Aurora Trigger -> Lambda -> Firehose -> S3 -> RedShift
The AWS blog post below walks through this implementation and looks very similar to your use case.
It also provides sample code to get the changes from an Aurora table to S3 through AWS Lambda and Firehose. In Firehose, you can set the destination to Redshift, which will seamlessly copy the data from S3 into Redshift.
Capturing Data Changes in Amazon Aurora Using AWS Lambda
AWS Firehose Destinations
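A minimal sketch of the Lambda piece of that pipeline, assuming the Aurora trigger invokes the function with the changed row as the event payload and that the Firehose delivery stream (placeholder name below) has Redshift configured as its destination:

import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "aurora-changes-to-redshift"  # placeholder stream name

def lambda_handler(event, context):
    # Forward the change record passed by the Aurora trigger to Firehose.
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )
    return {"status": "ok"}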