I am working with PySpark in AWS Glue.
I want to execute a stored procedure/function on a PostgreSQL database.
Is it possible?
What is the syntax? Is there any special package needed?
You can try using a module like pg8000 to run this function
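As a minimal sketch of that approach, assuming pg8000 has been made available to the Glue job (e.g. via the --additional-python-modules job parameter), and with placeholder host, credentials, and procedure name:
import pg8000

# Placeholder connection details -- replace with your own.
conn = pg8000.connect(
    host="your-postgres-host",
    port=5432,
    database="your_db",
    user="your_user",
    password="your_password",
)
cursor = conn.cursor()
# CALL for a stored procedure, or SELECT for a function; my_procedure is a placeholder name.
cursor.execute("CALL my_procedure()")
conn.commit()
cursor.close()
conn.close()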
You can also try calling the Postgres function the same way you would select data from a table, using the Spark read function with jdbc as the format. Since Glue uses PySpark under the hood, passing the function call in place of a table name should do the trick. Just remember to add the JDBC driver to your Glue job.
e.g. you can do this in Spark:
jdbcDF = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("driver", "org.postgresql.Driver")
    .option("query", "SELECT * FROM function()")
    .option("user", "user")
    .option("password", "password")
    .load()
)
In the Databricks documentation it is stated:
all tables created in Databricks are Delta tables, by default.
I create a table with
df.write.saveAsTable("table_name")
With the sql api I can time-travel:
%sql
SELECT * FROM table_name VERSION AS OF 0
How can I now time-travel with Python?
I am searching for something like
spark.table("mytab2").versionAsOf(3)
Simplest way:
spark.table("mytab2#v3") # as of version
or
spark.table("mytab2#20221012093243000") # as of timestamp
Reference: Table batch reads and writes / @ syntax. On the same page there's also an option for the DataFrameReader API, although for this you need to provide an explicit DBFS path to the Delta table, so it's a bit less convenient.
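A rough sketch of that DataFrameReader variant, where the DBFS path below is just a placeholder for wherever your Delta table actually lives:
df_v3 = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("dbfs:/user/hive/warehouse/mytab2")  # placeholder path
)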
This syntax also works:
spark.read.format("delta").option("versionAsOf", "0").table("mytab2")
I am trying the code below, but it is throwing an error that I am unable to understand:
df.registerTempTable("Temp_table")
spark.sql("Update Temp_table set column_a='1'")
Currently, Spark SQL does not support UPDATE statements on a temporary view. The workaround is to create a Delta Lake / Iceberg table from your Spark dataframe and execute your SQL query directly on that table.
For the Iceberg implementation, refer to:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
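A rough sketch of the Delta Lake variant of this workaround, assuming Delta Lake is available in your environment (the table name below is a placeholder):
# Persist the dataframe as a Delta table, then run the UPDATE against it.
df.write.format("delta").mode("overwrite").saveAsTable("temp_table_delta")
spark.sql("UPDATE temp_table_delta SET column_a = '1'")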
I can read a Snowflake table into a PySpark dataframe using sqlContext:
sql = "select * from table1"
df = (
    sqlContext.read
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .option("query", sql)
    .load()
)
How do I create a temporary table in Snowflake (using PySpark code) and insert values from this PySpark dataframe (df)?
Just save as usual, with the Snowflake format:
snowflake_options = {
    ...
    'sfDatabase': 'dbabc',
    'dbtable': 'tablexyz',
    ...
}

(df
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .save()
)
I don't believe this can be done. At least not the way you want.
You can, technically, create a temporary table; but persisting it is something I have had a great deal of difficulty figuring out how to do (i.e. I haven't). If you run the following:
spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery(snowflake_options, 'create temporary table tmp_table (id int, value text)')
you'll notice that it returns a Java object indicating the temp table was created successfully; but once you try to run any further statements against it, you'll get errors indicating it no longer exists. Somehow we mere mortals would need to find a way to access and persist the Snowflake session through the JVM API. That being said, I also think that would run contrary to the Spark paradigm.
If you really need the special-case performance boost of running transformations on Snowflake instead of bringing it all into Spark, just keep everything in Snowflake to begin with by either
Using CTEs in the query, or
Using the runQuery API described above to create "temporary" permanent/transient tables, designing Snowflake queries that insert directly into those, and then cleaning them up (DROP them) when you are done; a sketch of this follows below.
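A minimal sketch of that second option, reusing the Utils.runQuery helper shown above; the staging table, columns, and source query are placeholders:
run_query = spark.sparkContext._jvm.net.snowflake.spark.snowflake.Utils.runQuery

# Create a transient staging table, fill it, use it, then drop it.
run_query(snowflake_options, "CREATE TRANSIENT TABLE staging_xyz (id INT, value TEXT)")
run_query(snowflake_options, "INSERT INTO staging_xyz SELECT id, value FROM some_source_table")
# ... run whatever Snowflake queries need staging_xyz here ...
run_query(snowflake_options, "DROP TABLE IF EXISTS staging_xyz")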
I am trying to create some Spark SQL queries for different tables that I have collected as a list. I want to create SQL queries for all the tables present in the Hive database. The Hive context has been initialized. Following is my approach:
tables = spark.sql("show tables in survey_db")
# registering dataframe as temp view with 2 columns - tableName and db name
tables.createOrReplaceTempView("table_list")
# collecting my table names in a list
table_array= spark.sql("select collect_list(tableName) from table_list").collect()[0][0]
# array values(list)
table_array= [u'survey',u'market',u'customer']
I want to create Spark SQL queries for the table names stored in table_array, for example:
for i in table_array:
    spark.sql("select * from survey_db.'i'")
I can't use shell scripting as I have to write a PySpark script for this. Please advise whether spark.sql queries can be created using a loop/map. Thanks everyone.
You can achieve the same as follows:
sql_list = [f"select * from survey_db.{table}" for table in table_array]
for sql in sql_list:
    df = spark.sql(sql)
    df.show()
PostgreSQL offers a way to query a remote database through dblink.
Similarly (sort-of), Exasol provides a way to connect to a remote Postgres database via the following syntax:
CREATE CONNECTION JDBC_PG
    TO 'jdbc:postgresql://...'
    IDENTIFIED BY '...';

SELECT * FROM (
    IMPORT FROM JDBC AT JDBC_PG
    STATEMENT 'SELECT * FROM MY_POSTGRES_TABLE;'
)

-- one can even write direct joins such as
SELECT
    t.COLUMN,
    r.other_column
FROM MY_EXASOL_TABLE t
LEFT JOIN (
    IMPORT FROM JDBC AT JDBC_PG
    STATEMENT 'SELECT key, other_column FROM MY_POSTGRES_TABLE'
) r ON r.key = t.KEY
This is very convenient to import data from PostgreSQL directly into Exasol without having to use a temporary file (csv, pg_dump...).
Is it possible to achieve the same thing from Snowflake (querying a remote PostgreSQL database from Snowflake with a direct live connection)? I couldn't find any mention of it in the documentation.
Have you looked into using external functions? It's not exactly what you're looking for (Snowflake doesn't have that capability yet) but it can be used as a workaround in some use cases. For instance, you could create a Python function on AWS Lambda that queries PostgreSQL for small amounts of data (due to Lambda limits) or have it trigger a PostgreSQL process that dumps to S3 to trigger Snowpipe for the bulk import use case.
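To make the shape of that workaround concrete, here is a rough sketch of the Lambda handler side, assuming the external function is wired up through API Gateway and that all connection details and table/column names below are placeholders. Snowflake posts a JSON body of the form {"data": [[row_number, arg1, ...], ...]} and expects the same shape back:
import json
import pg8000  # any PostgreSQL driver packaged with the Lambda would do

def handler(event, context):
    # Snowflake sends one row per function call, prefixed with its row number.
    rows_in = json.loads(event["body"])["data"]
    conn = pg8000.connect(host="your-postgres-host", port=5432,
                          database="your_db", user="your_user",
                          password="your_password")
    cursor = conn.cursor()
    rows_out = []
    for row_number, lookup_key in rows_in:
        cursor.execute("SELECT other_column FROM my_postgres_table WHERE key = %s",
                       (lookup_key,))
        result = cursor.fetchone()
        rows_out.append([row_number, result[0] if result else None])
    conn.close()
    return {"statusCode": 200, "body": json.dumps({"data": rows_out})}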