Delete redshift table from within databricks using pyspark - amazon-redshift

I tried to connect to a Redshift system table called stv_sessions, and I can read the data into a DataFrame.
This stv_sessions table is a Redshift system table that holds the process IDs of all the queries that are currently running.
To kill a running query we can do this:
select pg_terminate_backend(pid)
While this works for me if I connect to Redshift directly (using Aginity), it gives me insufficient privilege issues when I try to run it from Databricks.
Simply put, I don't know how to run the query from a Databricks notebook.
I have tried this so far,
kill_query = "select pg_terminate_backend('12345')"
some_random_df_i_created.write \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("dbtable", "stv_sessions") \
    .option("tempdir", temp_dir_loc) \
    .option("forward_spark_s3_credentials", True) \
    .options("preactions", kill_query) \
    .mode("append") \
    .save()
Please let me know if the methodology I'm following is correct.
Thank you

Databricks purposely does not pre-include this driver. You need to download the official Amazon Redshift JDBC driver, upload it to Databricks, and attach the library to your cluster (v1.2.12 or lower is recommended with Databricks clusters). Then use a JDBC URL of the form:
val jdbcUsername = "REPLACE_WITH_YOUR_USER"
val jdbcPassword = "REPLACE_WITH_YOUR_PASSWORD"
val jdbcHostname = "REPLACE_WITH_YOUR_REDSHIFT_HOST"
val jdbcPort = 5439
val jdbcDatabase = "REPLACE_WITH_DATABASE"
val jdbcUrl = s"jdbc:redshift://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}"
This evaluates to a URL of the form:
jdbcUrl: String = jdbc:redshift://REPLACE_WITH_YOUR_REDSHIFT_HOST:5439/REPLACE_WITH_DATABASE?user=REPLACE_WITH_YOUR_USER&password=REPLACE_WITH_YOUR_PASSWORD
Then try putting jdbcUrl in place of your redshift_url.
That may be the only reason you are getting privilege issues.
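Once the driver is attached and the URL is built, a minimal PySpark sketch of the original write with a preaction could look like the following (jdbcUrl, temp_dir_loc and the pid are placeholders; note also that the connector's preactions option is a single key/value pair, so it is set with .option, not .options):

# Sketch only: "preactions" runs SQL in Redshift before the write happens.
# The Redshift user behind jdbcUrl still needs permission to terminate sessions.
kill_query = "select pg_terminate_backend(12345)"   # placeholder pid

(some_random_df_i_created.write
    .format("com.databricks.spark.redshift")
    .option("url", jdbcUrl)                      # the URL built above, as a Python string
    .option("dbtable", "stv_sessions")           # mirrors the question; a scratch table is a safer dummy target
    .option("tempdir", temp_dir_loc)
    .option("forward_spark_s3_credentials", True)
    .option("preactions", kill_query)
    .mode("append")
    .save())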
Link 1: https://docs.databricks.com/_static/notebooks/redshift.html
Link 2: https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#installation
Another possibility is that the Redshift-Databricks connector only uses SSL (encryption in flight), and IAM roles may have been set on your Redshift cluster so that only some users are allowed to delete tables.
Apologies if none of this helps your case.

Related

How to read tables from a Synapse database using pyspark

I am a newbie to Azure Synapse and I have to work in an Azure Spark notebook. One of my colleagues connected the on-prem database using an Azure linked service. Now I have written a test framework for comparing the on-prem data and the data lake (curated) data, but I don't understand how to read those tables using PySpark.
Here is my linked service data structure, and here are my linked service names and database name (screenshots not included).
You can read any file stored in the Synapse linked storage location as a table by using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
First, read the file that you want to turn into a table in Synapse. Use the code below to read the file:
%%pyspark
df = spark.read.load('abfss://sampleadls2@sampleadls1.dfs.core.windows.net/business.csv', format='csv', header=True)
Then save this DataFrame as a table using the code below:
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS business")
df.write.mode("overwrite").saveAsTable("business.data")
Now you can run any Spark SQL command on this table as shown below:
%%pyspark
data = spark.sql("SELECT * FROM business.data")
display(data)
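For instance, an aggregate query works the same way (a small sketch; COUNT(*) is used so no column names from business.csv need to be assumed):
%%pyspark
# Any Spark SQL works against the saved table, e.g. a simple row count.
row_count = spark.sql("SELECT COUNT(*) AS row_count FROM business.data")
display(row_count)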

o110.pyWriteDynamicFrame. null

I have created a visual job in AWS Glue where I extract data from Snowflake, and my target is a PostgreSQL database in AWS.
I have been able to connect to both Snowflake and Postgres, and I can preview data from both.
I have also been able to get data from Snowflake, write it to S3 as CSV, and then take that CSV and upload it to Postgres.
However, when I try to get data from Snowflake and push it to Postgres, I get the error below:
o110.pyWriteDynamicFrame. null
So it means that you can get the data from Snowflake into a DataFrame, and the job fails while writing the data from this DataFrame to Postgres.
You need to check the AWS Glue logs to get a better understanding of why the write to Postgres is failing.
Please also check that you have the right version of the JDBC jars (needed by Postgres) compatible with the Scala/Spark version on the AWS Glue side.
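As a starting point, a minimal sketch of the write step using a Glue connection could look like this (the catalog database/table names, the connection name "postgres-connection", the target table "public.target_table" and the database "targetdb" are all placeholders for whatever your job actually uses):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# DynamicFrame produced by the Snowflake source (placeholder catalog names).
snowflake_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="snowflake-catalog-db",
    table_name="snowflake-source-table",
    transformation_ctx="read_snowflake",
)

# Write it to Postgres through the Glue connection defined for the target.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=snowflake_dyf,
    catalog_connection="postgres-connection",               # placeholder Glue connection name
    connection_options={"dbtable": "public.target_table",   # placeholder target table
                        "database": "targetdb"},             # placeholder database
    transformation_ctx="write_postgres",
)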

AWS Glue ETL Job Missing collection name

I have Data Catalog tables generated by crawlers: one data source is MongoDB and the second is PostgreSQL (RDS). The crawlers run successfully and the connection tests work.
I am trying to define an ETL job from MongoDB to PostgreSQL (a simple transform).
In the job I defined the source as AWS Glue Data Catalog (MongoDB) and the target as Data Catalog Postgres.
When I run the job I get this error:
IllegalArgumentException: Missing collection name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.collection' property
It looks like this is related to the MongoDB part. I tried to set the 'database' and 'collection' parameters in the Data Catalog tables and it didn't help.
The script generated for the source is:
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056",
)
What could be missing?
I had the same problem; just add the parameter below:
AWSGlueDataCatalog_node1653400663056 = glueContext.create_dynamic_frame.from_catalog(
    database="data-catalog-db",
    table_name="data-catalog-table",
    transformation_ctx="AWSGlueDataCatalog_node1653400663056",
    additional_options={"database": "data-catalog-db",
                        "collection": "data-catalog-table"},
)
Additional parameters can be found on the AWS page
https://docs.aws.amazon.com/glue/latest/dg/connection-mongodb.html

Azure databricks - Do we have postgres connector for spark

Azure databricks - Do we have postgres connector for spark
Also, how do I upsert/update a record in Postgres using Spark on Databricks?
I am using Spark 3.1.1.
When trying to write using mode=overwrite, it truncates the table but the record is not getting inserted.
I am new to this. Please help.
You don't need a separate connector for PostgreSQL; it works via the standard JDBC connector, and the PostgreSQL JDBC driver should be included in the Databricks runtime (check the release notes for your specific runtime). So you just need to form a correct JDBC URL as described in the documentation (the Spark documentation also has examples of URLs for PostgreSQL).
Something like this:
df.write \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.save()
Regarding the UPSERT, it's not so simple, and not only for PostgreSQL but for other databases as well:
Either you do a full join, keep the entries that exist only in your dataset plus the rest taken from the database, and then overwrite the table. This is very expensive because you are reading the full database and writing it back.
Or you do a left join against the database (you need to read it again), go down to the RDD level with .foreachPartition/.foreach, and form a series of INSERT/UPDATE operations depending on whether the data already exists in the database. It's doable, but you need more experience.
Specifically for PostgreSQL you can turn this (the foreach approach) into INSERT ... ON CONFLICT, so you won't need to read the full database (see the PostgreSQL wiki for more information about this operation).
Another approach is to write your data into a temporary (staging) table, and then via JDBC issue a MERGE (or INSERT ... ON CONFLICT) command to incorporate your changes into the target table. This is the more lightweight method from my point of view; a sketch of it follows below.
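A minimal sketch of that staging-table approach, assuming a hypothetical target table target_table(id, value) with id as the primary key, a staging table target_table_staging, and the psycopg2 driver available on the cluster (all names and connection details are placeholders):

import psycopg2

# 1) Write the DataFrame to a staging table over JDBC (rebuilt on every run).
(df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbserver:5432/mydb")   # placeholder host/db
    .option("dbtable", "target_table_staging")
    .option("user", "username")
    .option("password", "password")
    .mode("overwrite")
    .save())

# 2) Upsert from the staging table into the target table.
upsert_sql = """
    INSERT INTO target_table (id, value)
    SELECT id, value FROM target_table_staging
    ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value;
"""
conn = psycopg2.connect(host="dbserver", dbname="mydb",
                        user="username", password="password")
with conn, conn.cursor() as cur:   # commits on success, rolls back on error
    cur.execute(upsert_sql)
conn.close()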

Running complex SQL queries on Cassandra tables using Spark SQL

I have set up Cassandra and Spark with the Cassandra-Spark connector. I am able to create RDDs using Scala, but I would like to run complex SQL queries (aggregations / analytical functions / window functions) using Spark SQL on Cassandra tables. Could you help with how I should proceed? I am getting an error like the one below.
The following is the query used:
sqlContext.sql(
"""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test",
| cluster "Test Cluster",
| pushdown "true"
|)""".stripMargin)
Below is the error (screenshot not included).
New error (screenshot not included).
The first thing I noticed from your post is that sqlContext.sql(...) is used in your query, but your screenshot shows sc.sql(...).
I take the screenshot content as your actual issue. In the Spark shell, once the shell has loaded, both the SparkContext (sc) and the SQLContext (sqlContext) are already created and ready to go. sql(...) does not exist on SparkContext, so you should try sqlContext.sql(...).
Most probably your spark-shell started with a SparkSession, and the value for that is spark. Try your commands with spark instead of sqlContext; a PySpark sketch of the same pattern is shown below.
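For reference, the same pattern expressed in PySpark (a sketch; it assumes the spark-cassandra-connector package is on the classpath and reuses the keyspace and table names from the question):

# Register the Cassandra table as a temporary view via the SparkSession (spark),
# then run any Spark SQL (aggregations, window functions, etc.) against it.
words_df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="words", keyspace="test")
    .load())
words_df.createOrReplaceTempView("words")

spark.sql("SELECT * FROM words").show()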