PySpark - Perform Merge in Synapse using Databricks Spark - pyspark

We are having a tricky situation while performing ACID operations using Databricks Spark.
We want to perform an UPSERT on an Azure Synapse table over a JDBC connection using PySpark. We are aware that Spark provides only two modes for writing data, APPEND and OVERWRITE (and only these two are useful in our case). Based on these two modes, we thought of the options below:
Option 1: We write the whole dataframe into a stage table and use this stage table to perform a MERGE operation (~ UPSERT) with the final table. The stage table is truncated / dropped after that.
Option 2: We bring the target table data into Spark as well. Inside Spark we perform the MERGE using Delta Lake and generate a final dataframe. This dataframe is written back to the target table in OVERWRITE mode.
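For illustration, a minimal sketch of the Delta Lake MERGE step in option 2 (the Delta path, the join key id, and the dataframe names are assumptions, not part of the original setup):

    from delta.tables import DeltaTable

    # Stage the current target data (already read from Synapse) as a Delta table.
    target_df.write.format("delta").mode("overwrite").save("/mnt/stage/target_delta")

    delta_target = DeltaTable.forPath(spark, "/mnt/stage/target_delta")

    # MERGE the incoming changes into the staged copy of the target.
    (delta_target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Read the merged result back; this is what gets written to Synapse in OVERWRITE mode.
    final_df = spark.read.format("delta").load("/mnt/stage/target_delta")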
Considering the downsides:
In option 1, we have to use two tables just to write the final data. And in case both the stage and target tables are big, performing the MERGE operation inside Synapse is another herculean task and may take time.
In option 2, we have to bring the target table into Spark memory. Even though network IO is not much of a concern, since both Databricks and Synapse will be in the same Azure availability zone, it may lead to memory issues on the Spark side.
Are there any other feasible options? Or any recommendations?

The answer would depend on many factors not listed in your question. It's a very open-ended question.
(Given the way your question is phrased, I'm assuming you're using Dedicated SQL Pools and not serverless / on-demand Synapse.)
Here are some thoughts:
You'll be using Synapse's compute in option 1 and the Spark cluster's compute in option 2. Compare the costs and pick the lower one.
Reads and writes between Spark and Synapse through their connector use the data lake as a staging area. I.e., while reading a table from Synapse into a dataframe in Spark, the connector will first make Synapse export the data to the data lake (as Parquet, IIRC) and then read those files from the data lake to create the dataframe. This scales nicely when you're talking about tens of millions or billions of rows, but the staging step can become a performance overhead when row counts are low (tens to hundreds of thousands).
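As a rough sketch of how that staged read/write looks with the Databricks Synapse (formerly SQL DW) connector; the JDBC URL, storage account, and table names below are placeholders:

    # Read a Synapse table into a dataframe; the connector exports it to the
    # data lake location given by tempDir first, then reads those files.
    df = (spark.read
          .format("com.databricks.spark.sqldw")
          .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>")
          .option("tempDir", "abfss://staging@<storageaccount>.dfs.core.windows.net/tmp")
          .option("forwardSparkAzureStorageCredentials", "true")
          .option("dbTable", "dbo.target_table")
          .load())

    # Writing goes through the same staging area in the other direction.
    (df.write
       .format("com.databricks.spark.sqldw")
       .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>")
       .option("tempDir", "abfss://staging@<storageaccount>.dfs.core.windows.net/tmp")
       .option("forwardSparkAzureStorageCredentials", "true")
       .option("dbTable", "dbo.stage_table")
       .mode("overwrite")
       .save())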
Test and pick the faster one.
Remember that Synapse is not like a traditional MySQL or SQL Server. It's an MPP DB.
"Performing the MERGE operation inside Synapse is another herculean task and may take time" is a wrong statement. It scales just like a Spark cluster does.
"It may lead to memory issues on the Spark side": yes and no. On one hand, all the data isn't going to be loaded into a single worker node. On the other hand, yes, you do need enough memory on each node to do its own part.
Although Synapse can be scaled up and down dynamically, I've seen it take up to 40 minutes to complete a scale-up. Databricks, on the other hand, is fully on-demand and you can probably get away with turning on the cluster, doing the upsert, and shutting the cluster down. With Synapse you'll probably have other clients using it, so you may not be able to shut it down.
So with Synapse you'll either have to live with 40-80 minutes of downtime for each upsert (scale up, upsert, scale down), or pay a high DWU flat rate all the time, even though your usage is high only when you upsert and is pretty low otherwise.
Lastly, remember that MERGE is in preview at the time of writing this. That means no Sev-A support cases / immediate support if something breaks in your prod because you're using MERGE.
You can always use DELETE + INSERT instead. This assumes the delta you receive has all the columns of the target table and not just the updated ones.
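One possible way to wire DELETE + INSERT up from Databricks is to write the delta to a stage table and let the connector run the clean-up SQL as a post-action. This is only a sketch: it assumes your connector version supports the postActions option, that id is the key column, and the table names and connection details are placeholders.

    # T-SQL run in Synapse after the stage table has been written successfully.
    post_sql = """
    DELETE FROM dbo.target_table
    WHERE id IN (SELECT id FROM dbo.stage_table);
    INSERT INTO dbo.target_table
    SELECT * FROM dbo.stage_table
    """

    (delta_df.write
        .format("com.databricks.spark.sqldw")
        .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>")
        .option("tempDir", "abfss://staging@<storageaccount>.dfs.core.windows.net/tmp")
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", "dbo.stage_table")
        .option("postActions", post_sql)
        .mode("overwrite")
        .save())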

Did you try creating a checksum so the merge/upsert touches only the rows that have an actual data change?
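For what it's worth, a sketch of that checksum idea in PySpark, assuming id is the key and that the target already carries a row_hash column from the previous load:

    from pyspark.sql import functions as F

    value_cols = [c for c in incoming_df.columns if c != "id"]

    # Hash all non-key columns into a single checksum per row.
    incoming_hashed = incoming_df.withColumn(
        "row_hash",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in value_cols]), 256))

    # Keep only new rows or rows whose checksum changed; only these get upserted.
    changed = (incoming_hashed.alias("s")
               .join(target_df.select("id", "row_hash").alias("t"),
                     F.col("s.id") == F.col("t.id"), "left")
               .where(F.col("t.row_hash").isNull() |
                      (F.col("s.row_hash") != F.col("t.row_hash")))
               .select("s.*"))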

Related

Spark data pipeline initial load impact on production DB

I want to write a Spark pipeline to perform aggregation on my production DB data and then write the data back to the DB. My goal in writing the pipeline is to perform the aggregation without impacting the production DB while it runs, meaning I don't want users experiencing lag, nor the DB sustaining heavy IOPS, while the aggregation is performed. For example, an equivalent aggregation query just run as SQL would take a long time and also use up the RDS IOPS, which results in users not being able to get data - I'm trying to avoid this. A few questions:
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)? For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
When writing data back to DB, does that incur load on DB as well?
I'm using a PostgreSQL database in case this matters.
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
By default there will be a single partition in Glue into which the whole table is read. But you can configure parallel reads (as sketched below); just make sure to choose a partition column that will not hurt DB performance.
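A sketch of such a parallel read using the plain Spark JDBC options (Glue exposes similar knobs); the connection details, partition column, and bounds are placeholders:

    # numPartitions parallel queries are issued, each covering a slice of
    # sale_id between lowerBound and upperBound.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://<host>:5432/<db>")
          .option("dbtable", "public.sales")
          .option("user", "<user>")
          .option("password", "<password>")
          .option("partitionColumn", "sale_id")   # numeric/date column, ideally indexed
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "10")
          .load())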
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)?
Yes. When you pass a query instead of a table, you will only read its result set from the DB, reducing the large network and IO transfer. This means you are delegating the computation of the result to the DB engine (see the sketch below).
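A sketch of pushing the 30-day filter down to the database by passing a query instead of a table name (table and column names are assumptions):

    # The database evaluates the filter; Spark only reads the result set.
    last_30_days = """
        (SELECT * FROM public.sales
         WHERE sale_date >= current_date - INTERVAL '30 days') AS sales_30d
    """

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://<host>:5432/<db>")
          .option("dbtable", last_30_days)   # Spark 2.4+ also accepts .option("query", ...)
          .option("user", "<user>")
          .option("password", "<password>")
          .load())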
For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
Yes, depending on the table size and query complexity this might affect DB performance; if you have a read replica, you can simply use that instead.
When writing data back to DB, does that incur load on DB as well?
Yes, and it depends on how you write the result back to the DB. A moderate number of partitions is always good, i.e. not too many and not too few (see the sketch below).
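For example, a sketch of capping the write parallelism before the JDBC write (the partition count and batch size are arbitrary values, not recommendations):

    # 8 partitions -> at most 8 concurrent connections writing to the DB.
    (result_df.coalesce(8)
        .write.format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<db>")
        .option("dbtable", "public.sales_agg")
        .option("user", "<user>")
        .option("password", "<password>")
        .option("batchsize", "10000")   # rows per INSERT batch
        .mode("append")
        .save())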

AWS Redshift: How to run copy command from Apache NiFi without using firehose?

I have flow files with data records in them, and I'm able to place them in an S3 bucket. From there I want to run a COPY command and an UPDATE command with joins to achieve a MERGE / UPSERT operation. Can anyone suggest ways to solve this? Firehose only executes the COPY command, and I can't perform the UPSERT / MERGE operation directly as prescribed by the AWS docs, so I have to copy into a staging table and then update or insert using some conditions.
There are a number of ways to do this, but I usually go with a Lambda function, run every 5 minutes or so, that takes the data Firehose has put into Redshift and merges it with the existing data. Redshift likes to work on larger "chunks" of data and is most efficient if you build up some volume before performing these operations. The best practice is to move the data out of the Firehose target table in an atomic operation like ALTER TABLE APPEND and use this new table as the source for merging, so Firehose can keep adding data while the merge is in progress.
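A hedged sketch of such a Lambda in Python using psycopg2; the table names, the id merge key, and the connection details are assumptions. ALTER TABLE APPEND moves the landed rows out of the Firehose target atomically so loading can continue, and it cannot run inside an explicit transaction block:

    import psycopg2

    def handler(event, context):
        conn = psycopg2.connect(host="<cluster-endpoint>", port=5439,
                                dbname="<db>", user="<user>", password="<password>")
        conn.autocommit = True
        try:
            with conn.cursor() as cur:
                # Atomically move newly landed rows out of the Firehose target table.
                cur.execute("ALTER TABLE merge_staging APPEND FROM firehose_landing;")
                # Classic Redshift upsert: delete matching rows, then insert everything.
                cur.execute("BEGIN;")
                cur.execute("DELETE FROM target_table USING merge_staging "
                            "WHERE target_table.id = merge_staging.id;")
                cur.execute("INSERT INTO target_table SELECT * FROM merge_staging;")
                cur.execute("COMMIT;")
                cur.execute("TRUNCATE merge_staging;")
        finally:
            conn.close()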

Move data from PostgreSQL to AWS S3 and analyze with RedShift Spectrum

I have a large number of PostgreSQL tables with different schemas and a massive amount of data inside them.
I'm unable to do the data analytics right now because the data volume is quite large - a few TB - and PostgreSQL is not able to process queries in a reasonable amount of time.
I'm thinking about the following approach: I'll process all of my PostgreSQL tables with Apache Spark, load the DataFrames and store them as Parquet files in AWS S3. Then I'll use Redshift Spectrum to query the information stored inside these Parquet files.
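Roughly, the first half of that pipeline could look like the sketch below (connection details, table and column names, and the S3 path are placeholders):

    # Read one PostgreSQL table and land it on S3 as Parquet.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://<host>:5432/<db>")
          .option("dbtable", "public.events")
          .option("user", "<user>")
          .option("password", "<password>")
          .load())

    (df.write
       .mode("overwrite")
       .partitionBy("event_date")   # partitioning lets Spectrum/Athena prune files
       .parquet("s3a://<bucket>/warehouse/events/"))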
First of all, I'd like to ask - will this solution work at all?
And second - will Redshift Spectrum be able to automatically create EXTERNAL tables from these Parquet files without an additional schema specification (even when the original PostgreSQL tables contain data types unsupported by AWS Redshift)?
Redshift Spectrum supports pretty much the same data types as Redshift itself.
Redshift Spectrum creates a cluster of compute nodes behind the scenes. The size of that cluster is based on the number of actual Redshift cluster nodes, so if you plan to create a 1-node Redshift cluster, Spectrum will run pretty slowly.
As you noted in the comments, you can use Athena to query the data, and it will be a better option in your case than Spectrum. But Athena has several limitations, like a 30-minute run time, memory consumption, etc. So if you plan to do complicated queries with several joins, it may simply not work.
Redshift Spectrum can't create external tables without a provided structure.
The best solution in your case will be to use Spark (on EMR, or Glue) to transform the data, Athena to query it, and if Athena can't handle a specific query, use SparkSQL on the same data. You can use Glue, but running jobs on EMR on Spot Instances will be more flexible and cheaper. EMR clusters come with EMRFS, which gives you the ability to use S3 almost transparently instead of HDFS.
AWS Glue might be interesting as an option for you. It is both a hosted version of Spark, with some AWS-specific add-ons, and a data crawler + data catalogue.
It can crawl unstructured data such as Parquet files and figure out the structure, which then allows you to export it to AWS Redshift in structured form if needed.
See this blog post on how to connect it to a Postgres database using JDBC to move data from Postgres to S3.

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in the AWS Glue forum; here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports the pushdown predicates feature, however it currently works only with partitioned data on S3. There is a feature request to support it for JDBC connections though.
It's not possible to specify the names of the output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of the output files. There is an option to reduce the number of output files using coalesce, though. Also, starting from Spark 2.2 it's possible to cap the number of records per file by setting the config spark.sql.files.maxRecordsPerFile (see the sketch below).
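A sketch of those two knobs together (the values are arbitrary):

    # Cap the number of records written to any single output file (Spark 2.2+).
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 100000)

    # coalesce() caps the number of partitions, and therefore the number of
    # output files (maxRecordsPerFile may still split very large partitions).
    (dynamic_frame.toDF()        # Glue DynamicFrame -> Spark DataFrame
        .coalesce(4)
        .write.mode("overwrite")
        .parquet("s3://<bucket>/output/"))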

How do I use Redshift Database for Transformation and Reporting?

I have 3 tables in my Redshift database, and data is coming in from 3 different CSV files on S3 every few seconds. One table has ~3 billion records and the other 2 have ~100 million records each. For near-realtime reporting purposes, I have to merge these tables into 1 table. How do I achieve this in Redshift?
Near Real Time Data Loads in Amazon Redshift
I would say that the first step is to consider whether Redshift is the best platform for the workload you are considering. Redshift is not an optimal platform for streaming data.
Redshift's architecture is better suited for batch inserts than streaming inserts. "COMMIT"s are "costly" in Redshift.
You need to consider the performance impact of VACUUM and ANALYZE if those operations are going to compete for resources with streaming data.
It might still make sense to use Redshift for your project depending on the entire set of requirements and workload, but bear in mind that in order to use Redshift you are going to have to engineer around it, and probably change your workload from "near-real-time" to a micro-batch architecture.
This blog post details all the recommendations for micro-batch loads in Redshift. Read the micro-batch article here.
In order to summarize it:
Break input files --- Break your load files into several smaller files that are a multiple of the number of slices.
Column encoding --- Have column encoding pre-defined in your DDL.
COPY settings --- Ensure COPY does not attempt to evaluate the best encoding for each load (see the COPY sketch after this list).
Load in SORT key order --- If possible, your input files should have the same "natural order" as your sort key.
Staging tables --- Use multiple staging tables and load them in parallel.
Multiple time series tables --- This is the documented approach for dealing with time series data in Redshift.
ELT --- Do transformations in-database using SQL to load into the main fact table.
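To make the COPY settings item concrete, here is a hedged sketch of a micro-batch COPY issued from Python with psycopg2; the staging table, S3 prefix, IAM role, file format, and connection details are placeholders:

    import psycopg2

    # COMPUPDATE OFF / STATUPDATE OFF stop COPY from re-evaluating column
    # encodings and statistics on every small load.
    COPY_SQL = """
        COPY staging_table
        FROM 's3://<bucket>/incoming/batch-0001/'
        IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
        FORMAT AS CSV
        COMPUPDATE OFF
        STATUPDATE OFF;
    """

    conn = psycopg2.connect(host="<cluster-endpoint>", port=5439,
                            dbname="<db>", user="<user>", password="<password>")
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)
    conn.close()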
Of course all the recommendations for data loading in Redshift still apply. Look at this article here.
Last but not least, enable Workload Management to ensure the online queries can access the proper amount of resources. Here is an article on how to do it.