How to read Synapse database tables using PySpark

I am a newbie to Azure Synapse and I have to work in an Azure Spark notebook. One of my colleagues connected the on-premises database using an Azure linked service. I have now written a test framework for comparing the on-premises data with the data lake (curated) data, but I don't understand how to read those tables using PySpark.
Here is my linked service data structure, along with my linked service names and database name.

You can read any file stored in a Synapse linked location as a table by using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
First, you need to read the file that you want to use as a table in Synapse. Use the code below to read the file:
%%pyspark
df = spark.read.load('abfss://sampleadls2@sampleadls1.dfs.core.windows.net/business.csv', format='csv', header=True)
Then convert this file into a table using the code below:
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS business")
df.write.mode("overwrite").saveAsTable("business.data")
Now you can run any Spark SQL command on this table as shown below:
%%pyspark
data = spark.sql("SELECT * FROM business.data")
display(data)
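Once both the curated table and the on-premises extract are registered as Spark tables, a simple way to compare them for the test framework is a symmetric difference. Below is a minimal sketch, assuming a hypothetical second table business.onprem_data that holds the on-premises copy:
%%pyspark
# business.onprem_data is a hypothetical table name used only for illustration.
curated = spark.sql("SELECT * FROM business.data")
onprem = spark.sql("SELECT * FROM business.onprem_data")

# Rows present in one table but not the other (exact, column-wise comparison).
only_in_curated = curated.exceptAll(onprem)
only_in_onprem = onprem.exceptAll(curated)

print("Rows only in curated:", only_in_curated.count())
print("Rows only in on-prem:", only_in_onprem.count())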

Related

Best practice for importing bulk data to AWS RDS PostgreSQL database

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a Ruby script to parse the JSON files and generate a CSV file matching the table in the database
Connect to RDS using psql
Use \copy command to append the data to the table
I would like to switch this to an automated approach (maybe using an AWS Lambda). What would be the best practices?
Approach 1:
Run a script (Ruby / JS) that parses all folders in the past period (e.g., week) and within the parsing of each file, connect to the RDS db and execute an INSERT command. I feel this is a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.
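For the bulk-import step in Approach 2, one option is to stream the generated CSV from S3 and append it with PostgreSQL's COPY, which is the server-side equivalent of psql's \copy. A minimal sketch (e.g., as a Lambda handler body), assuming hypothetical bucket, key, table, and connection details:
import boto3
import psycopg2

# Hypothetical names and credentials, for illustration only.
S3_BUCKET = "my-bucket"
S3_KEY = "exports/latest/data.csv"
TABLE = "my_table"

def bulk_append_from_s3():
    # Stream the CSV straight from S3; no local download is needed.
    body = boto3.client("s3").get_object(Bucket=S3_BUCKET, Key=S3_KEY)["Body"]

    conn = psycopg2.connect(host="my-rds-host", dbname="mydb",
                            user="user", password="secret")
    try:
        with conn.cursor() as cur:
            # COPY ... FROM STDIN appends the whole file in one round trip.
            cur.copy_expert(
                "COPY {} FROM STDIN WITH (FORMAT csv, HEADER true)".format(TABLE),
                body,
            )
        conn.commit()
    finally:
        conn.close()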

o110.pyWriteDynamicFrame. null

I have created a visual job in AWS Glue where I extract data from Snowflake and my target is a PostgreSQL database in AWS.
I have been able to connect to both Snowflake and Postgres, and I can preview data from both.
I have also been able to get data from Snowflake, write it to S3 as CSV, and then take that CSV and upload it to Postgres.
However, when I try to get data from Snowflake and push it directly to Postgres, I get the error below:
o110.pyWriteDynamicFrame. null
So it means that you can get the data from Snowflake into a DataFrame, but writing the data from this DataFrame to Postgres is failing.
You need to check the AWS Glue logs to get a better understanding of why the write to Postgres is failing.
Please check that you have the right version of the JDBC jars (needed by Postgres) compatible with Scala on the AWS Glue side.
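For reference, a minimal sketch of the write step that is failing, assuming a hypothetical Glue connection named postgres-conn and a DynamicFrame dyf already read from Snowflake:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# dyf is assumed to be the DynamicFrame read from Snowflake earlier in the job.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="postgres-conn",  # hypothetical Glue connection to the RDS instance
    connection_options={"dbtable": "public.my_table", "database": "mydb"},
    transformation_ctx="write_postgres",
)
If the jars or the connection options are wrong, the error surfaces in this call, and the CloudWatch logs for the job run usually show the underlying JDBC exception.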

Error when trying to import with CSV file format in Cloud SQL

HTTPError 400: Unknown export file type was thrown when I tried to import a CSV file from my Cloud Storage bucket into my Cloud SQL db. Any idea what I missed?
Reference:
gcloud sql import csv
CSV files are not supported in Cloud SQL for SQL Server. As mentioned here,
In Cloud SQL, SQL Server currently supports importing databases using
SQL and BAK files.
However, CSV import is supported for the MySQL and PostgreSQL versions of Cloud SQL.
You could use one of the following solutions:
Change the database engine to either PostgreSQL or MySQL (where CSV files are supported).
If the data in your CSV file came from an on-premises SQL Server DB table, you can create a SQL file from it, then use it to import into Cloud SQL for SQL Server.
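For reference, the two import paths look roughly like this with gcloud (instance, bucket, database, and table names are placeholders):
# CSV import -- only Cloud SQL for MySQL and PostgreSQL support this
gcloud sql import csv INSTANCE_NAME gs://BUCKET_NAME/file.csv \
    --database=DATABASE_NAME --table=TABLE_NAME

# SQL dump import -- supported on Cloud SQL for SQL Server
gcloud sql import sql INSTANCE_NAME gs://BUCKET_NAME/file.sql \
    --database=DATABASE_NAME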

AWS Glue ETL Job Missing collection name

I have Data Catalog tables generated by crawlers: one is a data source from MongoDB, and the second is a data source from PostgreSQL (RDS). The crawlers run successfully and the connection tests work.
I am trying to define an ETL job from MongoDB to PostgreSQL (a simple transform).
In the job I defined the source as AWS Glue Data Catalog (MongoDB) and the target as Data Catalog PostgreSQL.
When I run the job I get this error:
IllegalArgumentException: Missing collection name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.collection' property
It looks like this is related to the MongoDB part. I tried to set the 'database' and 'collection' parameters in the Data Catalog tables, and it didn't help.
The script generated for the source is:
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056",
)
What could be missing?
I had the same problem; just add the additional_options parameter as shown below.
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056",
    additional_options={"database": "data-catalog-db",
                        "collection": "data-catalog-table"},
)
Additional parameters can be found on the AWS page
https://docs.aws.amazon.com/glue/latest/dg/connection-mongodb.html

How Do i read the Lake database in Azure Synapse in a PySpark notebook

Hi, I created a database in Azure Synapse Studio and I can see the database and the table in there. Now I have created a notebook where I have added the required libraries, but I am unable to read the table with the code below. Can anyone point out what I am doing wrong here?
My database name is Utilities_66_Demo. It gives me the error:
AnalysisException: Path does not exist:
abfss://users@stcdmsynapsedev01.dfs.core.windows.net/Utilities_66_Demo.parquet
From where should I take the path? I tried to follow the MS article. Where do I read the path? If I click on Edit database, I get this:
%%pyspark
df = spark.read.load('abfss://users@stcdmsynapsedev01.dfs.core.windows.net/Utilities_66_Demo.parquet', format='parquet')
display(df.limit(10))
Trying to access the created Lake database table:
I selected Azure Synapse Analytics.
I select my workspace, and in the dropdown there is no table shown.
I select Edit and put in my DB name and table name, and it says invalid details.
Now I select Azure Synapse Dedicated SQL Pool from the linked service.
I get no option to select a SQL pool or table, and without a SQL pool I am unable to create a linked service just by inserting the table name.
You can go directly to your ADLS, right-click the parquet file, and select Properties. There, you will be able to find the ABFSS path, which is in the format:
abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path_to_file>
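Since the table was created as a Lake database in Synapse Studio, it is also shared with the Spark pool's metastore, so it can usually be queried by name rather than by file path. A minimal sketch, assuming a hypothetical table called MyTable inside Utilities_66_Demo:
%%pyspark
# MyTable is a hypothetical table name; replace it with the table you created.
df = spark.sql("SELECT * FROM Utilities_66_Demo.MyTable")
display(df.limit(10))
Alternatively, load the underlying parquet file directly with spark.read.load, using the ABFSS path copied from the file's properties in ADLS (in the format shown above).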