Separate Spark AWS Glue Metastore entries by environment (test vs prod) - pyspark

I plan to run my Spark SQL jobs on AWS's EMR, and I plan to use AWS's Glue Metastore to persist tables' schema and file location metadata. The problem I'm facing is I'm not sure how to isolate our test vs prod environments. There are times when I might add a new column to a table, and I want to test that logic in the test environment before making the change to production. It seems that the Glue Metastore only supports one entry per database-table pair, which means that test and prod would point to the same Glue Metastore record, so whatever change I make to the test environment would also immediately impact prod. How have others tackled this issue?
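One common way to get that isolation (a sketch, not the only option) is to keep one Glue database per environment and parameterize the database name in the job. Everything below uses hypothetical names, and it assumes EMR is configured to use the Glue Data Catalog as its Hive metastore:

# Sketch: one Glue database per environment, so test and prod resolve to
# different Data Catalog entries. Database/table names are hypothetical.
import sys
from pyspark.sql import SparkSession

env = sys.argv[1] if len(sys.argv) > 1 else "test"   # "test" or "prod"
database = f"analytics_{env}"                        # analytics_test / analytics_prod

spark = (
    SparkSession.builder
    .appName(f"etl-{env}")
    .enableHiveSupport()   # on EMR this picks up the Glue Data Catalog
    .getOrCreate()
)

spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")

# A schema change applied to analytics_test.events never touches analytics_prod.events.
df = spark.table(f"{database}.events")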

Related

Reading a Delta Table with no Manifest File using Redshift

My goal is to read a Delta Table on AWS S3 using Redshift. I've read through the Redshift Spectrum to Delta Lake Integration and noticed that it says to generate a manifest using Apache Spark, either with:
GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`
or
DeltaTable deltaTable = DeltaTable.forPath(<path-to-delta-table>);
deltaTable.generate("symlink_format_manifest");
However, there doesn't seem to be support for generating these manifest files in Apache Flink or the Delta Standalone library it uses, which is the software that actually writes the data to the Delta Table.
How can I get around this limitation?
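For reference, the PySpark equivalent of the snippets above, which could be run as a separate Spark job against the same table (a sketch; it assumes the delta-spark package is available on the cluster, and the S3 path is a placeholder):

# Sketch: generate the symlink manifest from a Spark job, since the
# Flink / Delta Standalone writer cannot produce it itself.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("generate-delta-manifest")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

delta_table = DeltaTable.forPath(spark, "s3://my-bucket/path-to-delta-table")  # placeholder path
delta_table.generate("symlink_format_manifest")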
This functionality seems to now be supported on AWS:
With today’s launch, Glue crawler is adding support for creating AWS Glue Data Catalog tables for native Delta Lake tables and does not require generating manifest files. This improves customer experience because now you don’t have to regenerate manifest files whenever a new partition becomes available or a table’s metadata changes.
https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/
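As a rough sketch of what that looks like with boto3 (the crawler name, role ARN, database name and S3 path are placeholders, and the DeltaTargets field names are my reading of the Glue create_crawler API, so double-check them against the docs):

# Sketch: register a native Delta Lake table via a Glue crawler, no manifest files.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="delta-table-crawler",                              # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
    DatabaseName="analytics",                                # placeholder
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://my-bucket/path-to-delta-table/"],  # placeholder
                "WriteManifest": False,          # no symlink manifest needed
                "CreateNativeDeltaTable": True,  # create a native Delta table entry
            }
        ]
    },
)

glue.start_crawler(Name="delta-table-crawler")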

Should I use AWS Glue or Spark on EMR for processing binary data to parquet format

I have a work requirement to read binary data from sensors and produce Parquet output for analytics.
For storage I have chosen S3 and DynamoDB.
For the processing engine, I'm not sure how to choose between AWS EMR and AWS Glue.
The data processing code base will be maintained in Python, coupled with Spark.
Please post your suggestions on choosing between AWS EMR and AWS Glue.
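For context, the binary-to-Parquet step itself would look roughly the same on either service; a minimal PySpark sketch (bucket paths and the decoding logic are placeholders for whatever the real sensor format is):

# Sketch: read raw sensor files with Spark's binaryFile source and write Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType, LongType

spark = SparkSession.builder.appName("sensor-binary-to-parquet").getOrCreate()

raw = spark.read.format("binaryFile").load("s3://my-bucket/raw-sensor-data/")  # placeholder

# Hypothetical decoder for the sensor payload; replace with the real binary layout.
reading_schema = StructType([
    StructField("sensor_id", LongType()),
    StructField("value", DoubleType()),
])

@udf(returnType=reading_schema)
def decode(payload):
    sensor_id = int.from_bytes(payload[:8], "big")            # placeholder layout
    value = int.from_bytes(payload[8:16], "big") / 1000.0
    return (sensor_id, value)

decoded = raw.select(decode("content").alias("r")).select("r.*")
decoded.write.mode("append").parquet("s3://my-bucket/parquet-output/")  # placeholder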
Using Glue / EMR depends on your use-case.
EMR is a managed cluster of servers and costs less than Glue, but it also requires more maintenance and set-up overhead. You can run not only Spark but also other frameworks on EMR, such as Flink.
Glue is serverless Spark / Python and really easy to use. It does not run the latest Spark version and abstracts a lot of Spark away, which is good in some ways and bad in others: you cannot set specific configurations very easily.
It's an opinion-based question, and nowadays you also have AWS EMR Serverless to consider.
AWS Glue is 1) more managed and thus comes with restrictions, 2) has (imho) issues with crawling for schema changes that you need to consider, 3) has its own interpretation of DataFrames, 4) offers less run-time configuration, and 5) fewer options for serverless scalability. There also seem to be a few bugs that keep popping up.
AWS EMR is 1) an AWS platform that is easy enough to configure, 2) comes with the AWS flavour of what they think is the best way of running Spark, 3) has some limitations in terms of subsequently scaling down resources when using dynamic scale-out, 4) a platform that uses standard Spark, so there is a bigger pool of people to hire, and 5) allows bootstrapping of software not supplied as standard, as well as selection of standard software such as, say, HBase.
So the two are comparable to an extent, and divergent in other ways: AWS Glue is ETL/ELT, while AWS EMR is that plus more capabilities.

How to continuously populate a Redshift cluster from AWS Aurora (not a sync)

I have a number of MySql databases (OLTP) running on an AWS Aurora cluster. I also have a Redshift cluster that will be used for OLAP. The goal is to replicate inserts and changes from Aurora to Redshift, but not deletes. Redshift in this case will be an ever-growing data repository, while the Aurora databases will have records created, modified and destroyed — Redshift records should never be destroyed (at least, not as part of this replication mechanism).
I was looking at DMS, but it appears that DMS doesn't have the granularity to exclude deletes from the replication. What is the simplest and most effective way of setting up the environment I need? I'm open to third-party solutions, as well, as long as they work within AWS.
I currently have DMS continuous sync set up.
You could consider using DMS to replicate to S3 instead of Redshift, then use Redshift Spectrum (or Athena) against that S3 data.
S3 as a DMS target is append only, so you never lose anything.
See https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html and https://aws.amazon.com/blogs/database/replicate-data-from-amazon-aurora-to-amazon-s3-with-aws-database-migration-service/
With this approach, things get a bit more complex, and you may need some ETL to process that data (depending on your needs).
You will still get the deletes coming through with a record type of "D", but you can ignore or process these depending on your needs.
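For example, if the DMS CDC output is written to S3 as Parquet with the usual "Op" column ("I"/"U"/"D"), a small PySpark step could strip the deletes before the data is exposed to Redshift Spectrum (paths are placeholders; adjust accordingly if the task writes CSV instead):

# Sketch: drop DMS delete records before exposing the data downstream.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("strip-dms-deletes").getOrCreate()

cdc = spark.read.parquet("s3://my-bucket/dms-cdc-output/")    # placeholder

kept = cdc.filter(col("Op") != "D")   # keep inserts and updates, ignore deletes

kept.write.mode("append").parquet("s3://my-bucket/spectrum-staging/")  # placeholder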
A simple and effective way to capture inserts and updates from Aurora to Redshift may be to use the approach below:
Aurora Trigger -> Lambda -> Firehose -> S3 -> RedShift
The AWS blog post below walks through this implementation and looks very similar to your use case. It also provides sample code for getting the changes from an Aurora table to S3 through AWS Lambda and Firehose. In Firehose, you can set the destination to Redshift, which will copy the data from S3 into Redshift seamlessly.
Capturing Data Changes in Amazon Aurora Using AWS Lambda
AWS Firehose Destinations
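A hedged sketch of the Lambda piece of that pipeline (the delivery stream name and the event shape are hypothetical; the blog post above has the full version, including the Aurora trigger that invokes the function):

# Sketch: Lambda invoked by an Aurora trigger, forwarding the changed row to a
# Kinesis Data Firehose delivery stream whose destination is Redshift (via S3).
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # The Aurora trigger passes the inserted/updated row in the event payload.
    record = json.dumps(event) + "\n"
    firehose.put_record(
        DeliveryStreamName="aurora-changes-to-redshift",   # hypothetical stream name
        Record={"Data": record.encode("utf-8")},
    )
    return {"status": "ok"}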

How can I create daily partitions in Postgres using Flyway

The Flyway FAQ says I can't make structural changes to the DB outside of Flyway. Does that include creating new partitions for an existing table?
If so, is there any way to use Flyway to automatically create daily partitions as required? Bear in mind that the process will be running for more than one day so it's not something that can be just triggered on start-up.
We're stuck with Postgres 9.6 at the moment, so the partitions have to be created manually.

spark-jobserver - managing multiple EMR clusters

I have a production environment that consists of several (persistent and ad-hoc) EMR Spark clusters.
I would like to use one instance of spark-jobserver to manage the job JARs for this environment in general, and to be able to specify the intended master when I POST /jobs, rather than permanently in the config file (via the master = "local[4]" configuration key).
Obviously I would prefer to have spark-jobserver running on a standalone machine, and not on any of the masters.
Is this somehow possible?
You can write a SparkMasterProvider
https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/spark.jobserver/util/SparkMasterProvider.scala
A complex example is here https://github.com/spark-jobserver/jobserver-cassandra/blob/master/src/main/scala/spark.jobserver/masterLocators/dse/DseSparkMasterProvider.scala
I think all you have to do is write one that returns the config input as the Spark master; that way you can pass it as part of the job config.
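If the provider reads the master out of the submitted job config, the call could look roughly like this (a sketch: the host, appName, classPath, and the spark.master key that the custom provider would look for are all assumptions, not spark-jobserver defaults):

# Sketch: POST a job to spark-jobserver with the intended master in the job config,
# for a custom SparkMasterProvider to pick up.
import requests

job_config = """
spark.master = "spark://emr-master-host:7077"   # read by the custom SparkMasterProvider
input.path = "s3://my-bucket/input/"
"""

resp = requests.post(
    "http://jobserver-host:8090/jobs",
    params={
        "appName": "my-etl-jar",           # hypothetical uploaded app binary
        "classPath": "com.example.MyJob",  # hypothetical job class
    },
    data=job_config,
)
print(resp.json())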