Maintaining Generation Data when Backing Up Versioned Google Cloud Storage Buckets - google-cloud-storage

My use cases:
I would like to store multiple versions of text files in Cloud Storage and save the createdAt timestamp at which each version was created.
I'd like to be able to retrieve a list of all versions and createdAt times without opening and reading the files.
I'd also like to create nightly backups of the bucket with all the versions intact, and for each file to keep a record of its original createdAt time.
My solutions:
I have a Cloud Storage bucket with versioning enabled. Every time I save a file test, I get a new version test#generation_number.
I can list all versions and fetch an older version with test#generation_number (see the sketch after this list).
I can back up all versions of test in the bucket, using gsutil cp -A gs://my-original-bucket/test gs://my-backup-bucket.
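A minimal sketch of point #2 with the google-cloud-storage Python client (the bucket name is a placeholder); each generation exposes its own timeCreated, so no file contents need to be read:

```python
# Minimal sketch: list every generation of "test" and its creation time,
# without downloading any file contents. Bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("my-original-bucket", prefix="test", versions=True):
    # Each generation carries its own server-side timeCreated timestamp.
    print(blob.name, blob.generation, blob.time_created)
```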
The issue is with point #3: the #generation_number of each backed-up version changes to the time at which that backup copy was created, not the time of the original file's creation. I understand this is working as intended, and the order of the versions is still intact.
But where should I stash these original createdAt values? I tried to store them in metadata, but metadata seems not to be versioned; rather, it appears to be global for the file object as a whole.
What is the best way to achieve my use case? Is there any way to do so directly with Google Cloud Storage? Or should I maintain a separate database with this information, instead?
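One possibility I am considering (a hypothetical sketch; the manifest name and buckets are placeholders): before each nightly backup, snapshot every generation's original timeCreated into a small JSON manifest stored next to the backup, so the timestamps survive even though the backup copies receive fresh generation numbers.

```python
# Hypothetical sketch: record each version's original createdAt in a JSON
# manifest before running the nightly gsutil cp -A backup.
import json

from google.cloud import storage

client = storage.Client()
manifest = [
    {
        "name": blob.name,
        "generation": blob.generation,
        "createdAt": blob.time_created.isoformat(),
    }
    for blob in client.list_blobs("my-original-bucket", versions=True)
]

# Store the manifest in the backup bucket, keyed by backup date if desired.
client.bucket("my-backup-bucket").blob("backup-manifest.json").upload_from_string(
    json.dumps(manifest), content_type="application/json"
)
```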

Related

Write Apache Iceberg table to Azure ADLS / S3 without using an external catalog

I'm trying to create an Iceberg table on cloud object storage.
The Iceberg table format needs a catalog. This catalog stores the current metadata pointer, which points to the latest metadata file. The Iceberg quick start doc lists JDBC, Hive Metastore, AWS Glue, Nessie, and HDFS as the catalogs that can be used.
My goal is to store the current metadata pointer (version-hint.text) along with the rest of the table data (metadata, manifest lists, manifests, Parquet data files) in the object store itself.
With HDFS as the catalog, there's a file called version-hint.text in the table's metadata folder whose contents is the version number of the current metadata file.
Looking at HDFS as one of the possible catalogs, I should be able to use ADLS or S3 to store the current metadata pointer along with the rest of the data. For example: Spark connecting to ADLS using the ABFSS interface and creating an Iceberg table along with the catalog.
My question is
Is it safe to use the version hint file as the current metadata pointer in ADLS/S3? Will I lose any of the Iceberg features if I do this? This comment from one of the contributors suggests that it's not ideal for production:
The version hint file is used for Hadoop tables, which are named that way because they are intended for HDFS. We also use them for local FS tests, but they can't be safely used concurrently with S3. For S3, you'll need a metastore to enforce atomicity when swapping table metadata locations. You can use the one in iceberg-hive to use the Hive metastore.
Looking at comments on this thread: is the version-hint.text file optional?
we iterate through the possible metadata locations and stop only if no new snapshot is available
Could someone please clarify?
I'm trying to do a POC with Iceberg. At this point the requirement is to be able to write new data from Databricks to the table at least every 10 minutes. This frequency might increase in the future.
The data, once written, will be read by Databricks and Dremio.
I would definitely try to use a catalog other than the HadoopCatalog / hdfs type for production workloads.
As somebody who works on Iceberg regularly (I work at Tabular), I can say that we do think of the hadoop catalog as being more for testing.
The major reason for that, as mentioned in your threads, is that the catalog provides an atomic compare-and-swap operation for the current top-level metadata.json file. This compare-and-swap allows the query that's updating the table to grab a lock for the table after doing its work (optimistic locking), write out the new metadata file, update the state in the catalog to point to the new metadata file, and then release that lock.
That lock isn't something that really works out of the box with the HDFS / hadoop type catalog. It then becomes possible for two concurrent actions to write out a metadata file; one sets it, and the other's work gets erased or undefined behavior occurs, as ACID compliance is lost.
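To make the compare-and-swap concrete, here is a toy illustration in Python (not Iceberg code; the names are invented): the pointer only moves if it still has the value the writer started from, so one of two concurrent committers must fail and retry.

```python
# Toy illustration of the catalog's compare-and-swap; not Iceberg code.
import threading

_lock = threading.Lock()
current_pointer = "metadata/v1.metadata.json"

def commit(expected: str, new: str) -> bool:
    """Swap the table's metadata pointer only if nobody else already did."""
    global current_pointer
    with _lock:
        if current_pointer != expected:
            return False  # another writer won; re-read state and retry
        current_pointer = new
        return True

# Two writers both start from v1; only the first commit can succeed.
assert commit("metadata/v1.metadata.json", "metadata/v2.metadata.json")
assert not commit("metadata/v1.metadata.json", "metadata/v2b.metadata.json")
```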
If you have an RDS instance or some sort of JDBC database, I would suggest you consider using that, at least temporarily. There's also the DynamoDB catalog, and if you're using Dremio then Nessie can be used as your catalog as well.
In the next version of Iceberg -- the next major version after 0.14, which will likely be 1.0.0 -- there is a procedure to register tables into a catalog, which makes it easy to move a table from one catalog to another in a very efficient metadata-only operation, such as CALL catalog.system.register_table('$new_table_name', '$metadata_file_location');
So you're not locked into one catalog if you start with something simple like the JDBC catalog and then move on to something else. If you're just working out a POC, you could start with the Hadoop catalog and then move to something like the JDBC catalog once you're more familiar, but it's important to be aware of the potential pitfalls of the hadoop type catalog, which does not have the atomic compare-and-swap operation for the metadata file that represents the current table state.
There's also an option to provide a locking mechanism to the hadoop catalog, such as ZooKeeper or etcd, but that's a somewhat advanced feature and would require that you write your own custom lock implementation.
So I still stand by the JDBC catalog as the easiest to get started with, as most people can get an RDBMS from their cloud provider or spin one up pretty easily. Especially now that you will be able to efficiently move your tables to a new catalog with the code in the current master branch (or in the next major Iceberg release), it's not something to worry about too much.
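For concreteness, a minimal PySpark sketch of wiring up the JDBC catalog (the catalog name, JDBC URI, credentials, warehouse path, and JAR versions are all placeholders to adapt):

```python
# Minimal sketch: an Iceberg JDBC catalog in PySpark, backed by Postgres.
# All names, URIs, credentials, and versions below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-jdbc-catalog")
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,"
        "org.postgresql:postgresql:42.5.0",
    )
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog")
    .config("spark.sql.catalog.my_catalog.uri", "jdbc:postgresql://db-host:5432/iceberg")
    .config("spark.sql.catalog.my_catalog.jdbc.user", "iceberg")
    .config("spark.sql.catalog.my_catalog.jdbc.password", "<password>")
    # Data and metadata files still live in object storage; only the
    # current-metadata pointer lives in Postgres.
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE my_catalog.db.events (id bigint, ts timestamp) USING iceberg")
```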
Looking at comments on this thread, is the version-hint.text file optional?
Yes. The version-hint.text file is used by the hadoop type catalog to attempt to provide an authoritative location for the table's current top-level metadata file, so version-hint.text is only found with the hadoop catalog; other catalogs store the pointer via their own specific mechanisms. When using the JDBC catalog, a table in the RDBMS instance stores all of the catalog's "version hints"; the same goes for the Hive catalog, which is backed by the Hive Metastore (and very typically an RDBMS). Other catalogs include the DynamoDB catalog.
If you have more questions, the Apache Iceberg slack is very active.
Feel free to check out the docker-spark-iceberg getting started tutorial (which I helped create), which includes Jupyter notebooks and a docker-compose setup.
It uses the JDBC catalog backed by Postgres. With that, you can get a feel for what the catalog is doing by ssh'ing into the containers and running psql commands, as well as looking at the table data on your local machine. There are also some nice tutorials with sample data!
https://github.com/tabular-io/docker-spark-iceberg

Azure Data Factory: Data Lifecycle Management and cleaning up stale data

I'm working on a requirement to reduce the cost of data storage. It includes the following tasks:
Being able to remove files from File Share and blobs from Blob Storage, based on their last modified date.
Being able to change the tier of individual blobs, based on their last modified date.
Does Azure Data Factory have built-in activities to take care of these tasks? What's the best approach for automating the clean-up process?
1. Being able to remove files from File Share and blobs from Blob Storage, based on their last modified date.
This requirement can be implemented with ADF's built-in Delete Activity.
Please create a blob storage dataset, and refer to this example to configure the range of last modified dates: https://learn.microsoft.com/en-us/azure/data-factory/delete-activity#clean-up-the-expired-files-that-were-last-modified-before-201811
Please also consider a backup strategy for accidents, because files deleted this way cannot be restored.
2. Being able to change the tier of individual blobs, based on their last modified date.
There is no built-in ADF feature for this. However, since your profile shows you are a .NET developer, see this case: Azure Java SDK - set block blob to cool storage tier on upload, which shows that the tier can be changed in SDK code. It's easy to create an Azure Function for such a simple task, and ADF supports the Azure Function Activity.
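As a rough sketch of what such a function could do (not code from the answer; the connection string, container name, and 30-day cutoff are placeholders), using the azure-storage-blob Python SDK:

```python
# Rough sketch: move blobs not modified in 30 days to the Cool tier.
# Connection string, container name, and cutoff are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient, StandardBlobTier

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("my-container")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for blob in container.list_blobs():
    if blob.last_modified < cutoff:
        # Re-tier the individual blob based on its last modified date.
        container.get_blob_client(blob.name).set_standard_blob_tier(StandardBlobTier.Cool)
```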

How to create an external table in Redshift Spectrum when the file location will change every day?

We are planning to source data from another AWS account's S3 bucket by using AWS Redshift Spectrum. But the source informed us that the bucket key will change every day, and the latest data will be available under the key with the latest timestamp.
Can anyone suggest what is the best way to create this external table?
An external table in Spectrum can either be configured to point to a prefix in S3 (kind of like a folder in a normal filesystem), or you can use a manifest file to specify the exact list of files the table should comprise (they can even reside in different S3 buckets).
So you will have to re-create or re-point the table every day to the correct location; a sketch follows below. If all the files end up in the same S3 prefix, you will have to use a manifest file to specify the current one.
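A minimal sketch of the daily re-pointing, assuming the external schema and table already exist and that the daily key embeds the date (the DSN, schema/table names, and prefix layout are all placeholders):

```python
# Minimal sketch: re-point a Spectrum external table at today's S3 prefix.
# DSN, schema/table names, and the prefix layout are placeholders.
from datetime import date

import psycopg2

todays_prefix = f"s3://source-bucket/data-{date.today():%Y-%m-%d}/"

conn = psycopg2.connect("host=my-cluster.example.com port=5439 dbname=dev user=admin password=<password>")
conn.autocommit = True  # DDL on external tables cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute(f"ALTER TABLE spectrum_schema.daily_data SET LOCATION '{todays_prefix}'")
conn.close()
```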
A hint not directly related to the question: what you could also do is create tables daily with a timestamp in the name, and every day create a view pointing to the latest table. This way it is easy to look at the historical data, or, if you use the data for e.g. machine learning, to pin the input to an immutable version of the data so that you can reproducibly fetch training data. But this of course depends on your requirements.

Data streaming to Google Cloud ML Engine

I found that Google ML Engine expects data in Cloud Storage, BigQuery, etc. Is there any way to stream data to ML Engine? For example, imagine that I need to use the data in a WordPress or Drupal site to create a TensorFlow model, say a spam detector. One way is to export the whole dataset as CSV and upload it to Cloud Storage using the google-cloud-php library. The problem here is that, for every minor change, we have to upload the whole dataset. Is there a better way?
By minor change, do you mean "when you get new data, you have to upload everything -- the old and new data -- again to GCS"? One idea is to export just the new data to GCS on some schedule, creating many CSV files over time. You can write your trainer to take a file pattern and expand it using get_matching_files/Glob, or to take multiple file paths.
You can also modify your training code to start from an old checkpoint and train over just the new data (which is in its own file) for a few steps.
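A minimal sketch of the file-pattern idea (bucket and file names are placeholders; tf.io.gfile.glob is the modern spelling of get_matching_files):

```python
# Minimal sketch: expand a GCS pattern and build one dataset over all
# incremental CSV exports. Bucket and file names are placeholders.
import tensorflow as tf

# Each scheduled export adds another part file; the next training run
# picks it up automatically via the glob.
files = tf.io.gfile.glob("gs://my-bucket/exports/part-*.csv")

# Read all matched files line by line; CSV parsing is left out for brevity.
dataset = tf.data.TextLineDataset(files)
```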

Using Data Compare to copy one database over another

I've used the Data Compare tool to update schema between the same DBs on different servers, but what if so many things have changed (including data) that I simply want to REPLACE the target database?
In the past I've just used T-SQL: take a backup, then restore onto the target with the REPLACE option, and/or MOVE if the data & log files are on different drives. I'd rather have an easier way to do this.
You can use Schema Compare (also by Red Gate) to compare the schema of your source database to a blank target database (and update), then use Data Compare to compare the data in them (and update). This should leave you with the target the same as the source. However, it may well be easier to use the backup/restore method in that instance.
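For reference, a rough sketch of the backup/restore route the question mentions, driven from Python via pyodbc (the server, database names, logical file names, and paths are all placeholders):

```python
# Rough sketch: back up the source DB and restore it over the target with
# REPLACE and MOVE. All names and paths below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=target-server;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,  # BACKUP/RESTORE cannot run inside a transaction
)
cur = conn.cursor()

cur.execute("BACKUP DATABASE SourceDb TO DISK = N'C:\\backups\\SourceDb.bak'")
while cur.nextset():  # drain progress messages so the backup completes
    pass

# MOVE relocates the data and log files if they live on different drives;
# the logical names ('SourceDb', 'SourceDb_log') are assumptions.
cur.execute(
    "RESTORE DATABASE TargetDb FROM DISK = N'C:\\backups\\SourceDb.bak' "
    "WITH REPLACE, "
    "MOVE N'SourceDb' TO N'D:\\data\\TargetDb.mdf', "
    "MOVE N'SourceDb_log' TO N'E:\\logs\\TargetDb_log.ldf'"
)
while cur.nextset():
    pass
```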