Load SQL script in PySpark notebook

In Azure Synapse Analytics, I want to keep my SQL queries separate from my PySpark notebook.
So I have created some SQL scripts, and I would like to use them in my PySpark notebook.
Is it possible?
And what is the Python code to load a SQL script into a variable?

As I understand it, the ask here is whether we can read the SQL scripts we have already created from a PySpark notebook. I looked at the storage account mapped to my Synapse Analytics studio (ASA) and I do not see the notebooks or SQL scripts stored there, so I don't think you can read the existing SQL scripts directly from within ASA. However, if you export the SQL scripts and upload them to storage, you can then read them from the notebook.
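In case it helps, here is a minimal sketch of that last step in a Synapse PySpark notebook, assuming the script has been exported to an ADLS Gen2 container; the storage account, container and file names below are placeholders.
%%pyspark
# Read the exported .sql file from ADLS Gen2 into a single string.
raw = spark.read.text('abfss://scripts@<storageaccount>.dfs.core.windows.net/queries/my_query.sql')
query_text = '\n'.join(row.value for row in raw.collect())

# Run the loaded query with Spark SQL (this works for a single SELECT statement).
df = spark.sql(query_text)
display(df)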

Related

How to read tables from synapse database tables using pyspark

I am a newbie to Azure Synapse and I have to work in an Azure Spark notebook. One of my colleagues connected the on-prem database using an Azure linked service. Now I have written a test framework for comparing the on-prem data and the data-lake (curated) data, but I don't understand how to read those tables using PySpark.
Here is my linked service data structure, along with my linked service names and database name (screenshots omitted).
You can read any file stored in a Synapse linked location as a table by using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
First you need to read the file that you want to expose as a table in Synapse. Use the code below to read the file.
%%pyspark
df = spark.read.load('abfss://sampleadls2@sampleadls1.dfs.core.windows.net/business.csv', format='csv', header=True)
Then convert this DataFrame into a table using the code below:
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS business")
df.write.mode("overwrite").saveAsTable("business.data")
Now you can run any Spark SQL command on this table as shown below:
%%pyspark
data = spark.sql("SELECT * FROM business.data")
display(data)
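If the data you need to compare lives in a dedicated SQL pool table rather than a file, the Dedicated SQL Pool Connector mentioned above can also read it directly. A minimal sketch, assuming a Synapse Spark 3 pool where the connector is built in; the database, schema and table names are placeholders:
%%pyspark
# Read a dedicated SQL pool table through the built-in Synapse connector,
# using the three-part name <database>.<schema>.<table>.
pool_df = spark.read.synapsesql("sampledb.dbo.curated_business")
display(pool_df)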

Error when trying to import with CSV file format in Cloud SQL

HTTPError 400: Unknow export file type was thrown when I tried to import a CSV file from my Cloud Storage bucket into my Cloud SQL database. Any idea what I missed?
Reference:
gcloud sql import csv
CSV import is not supported in Cloud SQL for SQL Server. As mentioned here,
In Cloud SQL, SQL Server currently supports importing databases using SQL and BAK files.
It is, however, supported for the MySQL and PostgreSQL versions of Cloud SQL.
You could try one of the following solutions:
Change the database engine to either PostgreSQL or MySQL (where CSV files are supported).
If the data in your CSV file came from an on-premises SQL Server table, you can create a SQL file from it, then use that to import into Cloud SQL for SQL Server (see the sketch below).
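For the second option, here is a rough Python sketch of turning a CSV into a .sql file of INSERT statements; the file paths and table name are placeholders, the first CSV row is assumed to be a header, and every value is emitted as a quoted string literal for simplicity.
import csv

csv_path = 'business.csv'          # placeholder input file
sql_path = 'business_inserts.sql'  # placeholder output file
table_name = 'dbo.business'        # placeholder target table

with open(csv_path, newline='') as src, open(sql_path, 'w') as dst:
    reader = csv.reader(src)
    columns = next(reader)  # first row is assumed to be the header
    col_list = ', '.join(columns)
    for row in reader:
        # Escape single quotes and quote every value as a string literal.
        values = ', '.join("'" + v.replace("'", "''") + "'" for v in row)
        dst.write(f'INSERT INTO {table_name} ({col_list}) VALUES ({values});\n')
The resulting .sql file can then be uploaded to Cloud Storage and imported with gcloud sql import sql.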

Using AWS Glue Python jobs to run ETL on redshift

We have a setup to sync RDS Postgres changes into S3 using DMS. Now I want to run ETL on this S3 data (in Parquet) using Glue as the scheduler.
My plan is to build SQL queries to do the transformation, execute them on Redshift Spectrum and unload the data back into S3 in Parquet format. I don't want to use Glue Spark as my data loads do not require that kind of capacity.
However, I am facing some problems connecting to Redshift from Glue, primarily library version issues and finding the right whl files for pg8000/psycopg2. Wondering if anyone has experience with such an implementation and how you were able to manage the DB connections from a Glue Python Shell job.
I'm doing something similar in a Python Shell Job but with Postgres instead of Redshift.
This is the whl file I use
psycopg_binary-2.9.2-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
An updated version can be found here.
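For the Redshift variant, a rough sketch of what the job script could look like once the psycopg2-binary whl is attached to the Glue Python Shell job; the connection details, IAM role, bucket and query below are placeholder assumptions.
import psycopg2

# Placeholder connection details - in practice read these from a Glue connection or Secrets Manager.
conn = psycopg2.connect(
    host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='analytics',
    user='etl_user',
    password='...',
)
conn.autocommit = True

# Transform via Redshift Spectrum and unload the result back to S3 as Parquet.
transform_and_unload = """
UNLOAD ('SELECT * FROM spectrum_schema.events WHERE event_date = CURRENT_DATE')
TO 's3://my-bucket/curated/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
FORMAT AS PARQUET;
"""

with conn.cursor() as cur:
    cur.execute(transform_and_unload)

conn.close()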

How to execute SQL scripts using azure databricks

I have one SQL script file.
In that file there are several SQL queries.
I want to upload it to DBFS, read it from Azure Databricks, and execute the queries from the script on Azure Databricks.
Databricks does not directly support executing .sql files. However, you can read one into a string and execute it:
with open("/dbfs/path/to/query.sql") as queryFile:
    queryText = queryFile.read()

results = spark.sql(queryText)
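Note that spark.sql only runs one statement at a time, so if the script contains several queries you could split it first. A naive sketch, assuming statements are separated by semicolons and there are no semicolons inside string literals or comments:
with open("/dbfs/path/to/query.sql") as queryFile:
    queryText = queryFile.read()

# Split on ';' and skip empty fragments, then run each statement in order.
statements = [s.strip() for s in queryText.split(';') if s.strip()]
results = [spark.sql(s) for s in statements]

# For example, display the result of the last statement.
display(results[-1])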

how to run a shell script from Azure data factory

How to run a shell script from Azure Data Factory? Inside the shell script I am trying to execute an .hql file like below:
/usr/bin/hive -f "wasbs://hivetest@batstorpdnepetradev01.blob.core.windows.net/hivetest/extracttemp.hql" >
wasbs://hivetest@batstorpdnepetradev01.blob.core.windows.net/hivetest/extracttemp.csv
My .hql file is stored in Blob Storage and I want to execute it, collect the result into a CSV file, and store it back in Blob Storage. The whole command lives in a shell script, which is also in Blob Storage. Now I want to execute it from Azure Data Factory using a Hive activity. Help will be appreciated.
You could use the Hadoop Hive activity in ADF. Please take a look at this doc. You can build your pipeline with the ADF V2 UI.
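If you would rather define the pipeline in code than in the V2 UI, a rough sketch using the azure-mgmt-datafactory Python SDK might look like the following; the subscription, resource group, factory, linked service names and script path are placeholder assumptions, and exact model signatures can vary between SDK versions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity, LinkedServiceReference, PipelineResource
)

# Placeholder identifiers - replace with your own subscription, resource group and factory.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), '<subscription-id>')

# Hive activity that runs the .hql script stored in Blob Storage on an HDInsight cluster.
hive_activity = HDInsightHiveActivity(
    name='RunExtractTemp',
    linked_service_name=LinkedServiceReference(
        type='LinkedServiceReference', reference_name='HDInsightLinkedService'),
    script_path='hivetest/extracttemp.hql',
    script_linked_service=LinkedServiceReference(
        type='LinkedServiceReference', reference_name='BlobStorageLinkedService'),
)

pipeline = PipelineResource(activities=[hive_activity])
adf_client.pipelines.create_or_update(
    '<resource-group>', '<factory-name>', 'RunHiveScriptPipeline', pipeline)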