Can I merge a Delta Lake table into an RDBMS table directly? What is the preferred way in Databricks? - pyspark

I am dealing with updating master data. I'll do the UPSERT operations on the Delta Lake table, but after my UPSERT is complete I'd like to update the master data in the RDBMS table as well. Is there any support from Databricks to perform this operation effectively, in a highly performant way? There are PySpark SQL ways, but I don't see a merge option.
Appreciate any help on this.
Thanks
Krishna
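
For illustration, here is a minimal PySpark sketch of one pattern that matches what is described above: run the MERGE (upsert) on the Delta table first, then push the result to the RDBMS over JDBC. Spark's JDBC writer has no merge mode, so this sketch simply overwrites the target table; the table names, JDBC URL, and credentials are placeholders, not anything Databricks-specific.

    # Hypothetical sketch: upsert into the Delta master table, then mirror it to an RDBMS via JDBC.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    updates_df = spark.table("staging_master_updates")    # placeholder staging data

    # 1) MERGE (upsert) into the Delta Lake master table
    master = DeltaTable.forName(spark, "master_data")      # placeholder Delta table
    (master.alias("m")
           .merge(updates_df.alias("u"), "m.id = u.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

    # 2) Propagate the refreshed table to the RDBMS over JDBC.
    #    Spark JDBC supports append/overwrite only; a true server-side MERGE would need
    #    a staging table in the RDBMS plus a MERGE statement executed there.
    (master.toDF().write
           .format("jdbc")
           .option("url", "jdbc:postgresql://<host>:5432/<db>")   # placeholder URL
           .option("dbtable", "public.master_data")
           .option("user", "<user>")
           .option("password", "<password>")
           .mode("overwrite")
           .save())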

Related

SQL vs PySpark/Spark SQL

Could someone please help me understand why we need to use PySpark or Spark SQL etc. if the source and target of my data is the same DB?
For example, let's say I need to load data to table X in PostgresDB from tables X and Y. Would it not be simpler and faster to just do it in Postgres instead of using Spark SQL or PySpark etc.?
I understand the need for these solutions if the data is from multiple sources, but if it is from the same source, do I need to use PySpark?
You can use Spark when you want to do heavy data transformations; its distributed processing makes the data easier to load and process.
It totally depends on how large the data is and how you want to transform it.
Using Postgres will be a good idea if the data is relatively small and no transformation is required.
It is not necessary to use PySpark. Both PySpark and Spark SQL have their value in managing/manipulating large volumes of data (a few hundred GBs, TBs, or PBs) in a distributed computing setup. If this is your case, please use PySpark; it will be more efficient to load, manipulate, and process/shape the data before inserting it into another table.
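For what it's worth, the Spark pattern being discussed looks roughly like the sketch below; the Postgres tables, JDBC URL, credentials, and the aggregation itself are hypothetical.

    # Hypothetical sketch: read two Postgres tables over JDBC, transform in Spark, write back.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("postgres-etl").getOrCreate()

    jdbc_url = "jdbc:postgresql://<host>:5432/<db>"   # placeholder connection details
    props = {"user": "<user>", "password": "<password>", "driver": "org.postgresql.Driver"}

    x = spark.read.jdbc(jdbc_url, "public.x", properties=props)
    y = spark.read.jdbc(jdbc_url, "public.y", properties=props)

    # The heavy transformation runs on Spark's distributed engine, not inside Postgres
    result = (x.join(y, "id")
               .groupBy("category")
               .agg(F.sum("amount").alias("total_amount")))

    result.write.jdbc(jdbc_url, "public.x_summary", mode="overwrite", properties=props)

If the data really does fit comfortably in Postgres, a plain INSERT ... SELECT there avoids the round trip entirely.
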
Thank you all for the feedback. I think I will use Glue PySpark if the source and destination are different. Otherwise I will use Glue Python with a JDBC connection and have one session do the tasks without bringing the data into dataframes.

ADF: How do I clear a table in SQL?

I have a pipeline that ingests data from Kusto, does some simple transformation, and flows the data to SQL. It will be run once per day, and needs to clear the sink tables in SQL. I thought this would be straightforward (and probably is) but I can't figure out how to do it. Thanks for any assistance!
As @wBob said, if you are using the Copy activity in ADF, you can enter TRUNCATE TABLE <your-table-name> as the Pre-copy script. It will execute that T-SQL script before copying.
You can also write a stored procedure that runs prior to the transformation and deletes your staging data:
stored procedure -> do transformation

Migrate data from NoSQL to an RDBMS

We have data existing in HBase and we want to move to AWS Aurora (MySQL) and we need to use the existing data so have to somehow load the NoSQL data into Aurora.
It's not a very big database, just a few tables.
Are there any best practices/tools to migrate data from NoSQL to a relational DB? I saw a lot of questions on the internet that ask about the reverse (relational DB -> NoSQL), but my requirement is a bit different and I can't find any helpful information.
Can someone please help? Where do I even start?
One simple way to do this without writing too much custom code would be to use the Spark-HBase Connector from Hortonworks (SHC) to read data from an HBase table into a Spark dataframe and write that dataframe into a MySQL table. The key challenge is getting SHC to work, because in my experience it's extremely version-sensitive, so the trick is to correctly coordinate your versions of Spark, HBase, and SHC (and finding the right combination is trickier than you may think).
However, if you manage to get all the dependencies right, doing the above is a matter of a few lines of code in a Jupyter notebook or the PySpark shell. You could run this on YARN to parallelize the workload in case it's large. It should work; give it a try.
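As a rough sketch (assuming spark-submit is launched with a compatible shc-core package, a MySQL JDBC driver, and hbase-site.xml on the classpath), the read-then-write flow could look like this; the HBase catalog, Aurora endpoint, and table names are made up:

    # Hypothetical sketch: HBase -> Spark dataframe via SHC, then dataframe -> Aurora MySQL via JDBC.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hbase-to-aurora").getOrCreate()

    # SHC catalog mapping the HBase table/columns to dataframe columns (hypothetical schema)
    catalog = """{
      "table": {"namespace": "default", "name": "customers"},
      "rowkey": "key",
      "columns": {
        "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
        "name":  {"cf": "info",   "col": "name",  "type": "string"},
        "email": {"cf": "info",   "col": "email", "type": "string"}
      }
    }"""

    df = (spark.read
               .options(catalog=catalog)
               .format("org.apache.spark.sql.execution.datasources.hbase")
               .load())

    (df.write
       .format("jdbc")
       .option("url", "jdbc:mysql://<aurora-endpoint>:3306/<db>")   # placeholder endpoint
       .option("dbtable", "customers")
       .option("user", "<user>")
       .option("password", "<password>")
       .mode("append")
       .save())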

How can we handle Data validations in snowpipe in Snowflake

My scenario: I have data in flat files in AWS S3.
I am using SNS to trigger Snowpipe when a new file arrives in S3.
To load the data from the flat files in S3 into a Snowflake table, I am using Snowpipe.
While loading data from the flat files into the Snowflake table with Snowpipe, can I handle data validation and a couple of calculations on the source data?
Please help me if there is any way to do this.
Thanks in Advance.
The VALIDATION_MODE copy option is not yet supported by Snowpipe. However, Snowpipe does support simple transformations such as column reordering, casts, etc. The best way to perform calculations and transform your data would be to load the data into a staging table and process it downstream into target tables, as sketched after the references below.
Reference:
https://docs.snowflake.net/manuals/sql-reference/sql/create-pipe.html#usage-notes
https://docs.snowflake.net/manuals/user-guide/data-load-transform.html
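A hedged sketch of that staging-table pattern, wrapped in the snowflake-connector-python client; the stage, pipe, and table names are hypothetical. The pipe itself does only the simple transformations Snowpipe allows (column selection and casts), while validation and calculations run downstream against the staging table.

    # Hypothetical sketch: Snowpipe loads raw S3 files into a staging table; a downstream
    # MERGE applies validation rules before data reaches the target table.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="<warehouse>", database="<db>", schema="<schema>",
    )
    cur = conn.cursor()

    # Pipe with simple column selection/casting only (SNS/SQS notification setup not shown)
    cur.execute("""
        CREATE OR REPLACE PIPE raw_orders_pipe AUTO_INGEST = TRUE AS
        COPY INTO staging_orders (order_id, amount, order_date)
        FROM (SELECT s.$1, s.$2::NUMBER(10,2), s.$3::DATE FROM @s3_orders_stage s)
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
    """)

    # Validation and calculations happen downstream, e.g. from a scheduled task
    cur.execute("""
        MERGE INTO orders t
        USING (SELECT * FROM staging_orders WHERE amount IS NOT NULL AND amount >= 0) s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.order_date = s.order_date
        WHEN NOT MATCHED THEN INSERT (order_id, amount, order_date)
                              VALUES (s.order_id, s.amount, s.order_date)
    """)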

How do I efficiently migrate the BigQuery Tables to On-Prem Postgres?

I need to migrate the tables from the BigQuery to the on-prem Postgres database.
How can I efficiently achieve that?
Some thoughts that come to mind:
I will use Google APIs to export the data from the tables
Store it locally
And finally, import it into Postgres
But I am not sure if that can be done for a huge amount of data in TBs. Also, how can I automate this process? Can I use Jenkins for that?
Exporting the data from BigQuery, storing it, and importing it into PostgreSQL is a good approach. Here are two other alternatives that you can consider:
1) There's a PostgreSQL wrapper for BigQuery that allows you to query BigQuery directly. Depending on your scenario this might be the easiest way to transfer the data, although for TBs it might not be the best approach. This suggestion was made by @David in this SO question.
2) Using Dataflow. You can create an ETL process using Apache Beam to make the transfer. Take a look at this how-to for transferring data from BigQuery to Cloud SQL. You would need to adapt it for your on-prem PostgreSQL, but the idea remains the same.
Here's another SO answer that gives more context on this approach.
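As a rough sketch of the export -> store -> import flow from the top of this answer, the script below uses the google-cloud-bigquery, google-cloud-storage, and psycopg2 packages; the project, bucket, dataset, and table names are placeholders. For TB-scale tables the export produces many sharded files (hence the wildcard URI), which are looped over exactly as shown, and the whole script can be scheduled from Jenkins or cron.

    # Hypothetical sketch: export a BigQuery table to GCS as CSV, download the shards,
    # and COPY them into an on-prem Postgres table.
    from google.cloud import bigquery, storage
    import psycopg2

    bq = bigquery.Client(project="<project>")
    gcs = storage.Client(project="<project>")

    # 1) Export the BigQuery table to GCS as sharded CSV files (header row included by default)
    extract_job = bq.extract_table(
        "<project>.<dataset>.<table>",
        "gs://<bucket>/exports/<table>-*.csv",
    )
    extract_job.result()  # wait for the export to finish

    # 2) Download the shards locally and 3) COPY each one into Postgres
    pg = psycopg2.connect(host="<host>", dbname="<db>", user="<user>", password="<password>")
    cur = pg.cursor()
    for blob in gcs.list_blobs("<bucket>", prefix="exports/"):
        local_path = "/tmp/" + blob.name.split("/")[-1]
        blob.download_to_filename(local_path)
        with open(local_path) as f:
            cur.copy_expert("COPY <schema>.<table> FROM STDIN WITH CSV HEADER", f)
    pg.commit()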