Can anyone suggest a method for executing queries on Redshift tables using PySpark?
Using PySpark DataFrames is one option to read/write Redshift tables, but executing arbitrary queries through a DataFrame write is limited to the connector's preactions/postactions options.
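For example, a statement can be attached to a DataFrame write through the postactions option of the spark-redshift connector (a minimal sketch; the URL, table names and tempdir are placeholders, and it assumes the connector is available on your cluster):
# df is an existing DataFrame; postactions runs after the data is written
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshift_host_url:5439/database_on_redshift?user=redshift_username&password=p") \
    .option("dbtable", "sample_table") \
    .option("tempdir", "s3a://your-bucket/tmp/") \
    .option("forward_spark_s3_credentials", True) \
    .option("postactions", "insert into sample_table select * from table1") \
    .mode("append") \
    .save()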
If you need to execute multiple queries, one way is to use the psycopg2 module.
First, install psycopg2 on your server:
sudo python -m pip install psycopg2
Then open the pyspark shell and execute the following:
import psycopg2

# connection details for the Redshift cluster
conn = {'db_name': {'hostname': 'redshift_host_url', 'database': 'database_on_redshift',
                    'username': 'redshift_username', 'password': 'p', 'port': your_port}}
db = 'db_name'
hostname, username, password, database, portnumber = (
    conn[db]['hostname'], conn[db]['username'], conn[db]['password'],
    conn[db]['database'], conn[db]['port'])

con = psycopg2.connect(host=hostname, port=portnumber, user=username,
                       password=password, dbname=database)

query = "insert into sample_table select * from table1"
cur = con.cursor()
cur.execute(query)  # execute() returns None; fetch results only for SELECTs
con.commit()
con.close()
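If you have several statements to run, one connection can execute them in sequence and commit them together, for example (a sketch; the statements are placeholders):
queries = ["truncate table sample_table",
           "insert into sample_table select * from table1"]
con = psycopg2.connect(host=hostname, port=portnumber, user=username,
                       password=password, dbname=database)
cur = con.cursor()
for q in queries:
    cur.execute(q)
con.commit()   # all statements are committed as one transaction
con.close()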
You can also refer to https://www.psycopg.org/docs/usage.html
I'm trying to use the boto3 redshift-data client to execute transactional SQL for an external table (Redshift Spectrum) with the following statement:
ALTER TABLE schema.table ADD IF NOT EXISTS
PARTITION(key=value)
LOCATION 's3://bucket/prefix';
After submitting it using execute_statement, I received the error "ALTER EXTERNAL TABLE cannot run inside a transaction block".
I tried using VACUUM and COMMIT commands before the statement, but they just report that VACUUM or COMMIT cannot run inside a transaction block.
How can I successfully execute such a statement?
This has to do with the settings of your bench (SQL client). You have an open transaction at the start of every statement you run. Just add "END;" before the statement that needs to run outside of a transaction and things should work. Just make sure you launch both commands at the same time from your bench.
Like this:
END; VACUUM;
It does not seem easy to run transactional SQL through boto3. However, I found a workaround using the redshift_connector library.
import redshift_connector

connection = redshift_connector.connect(
    host=host, port=port, database=database, user=user, password=password
)
connection.autocommit = True              # run the DDL outside of a transaction block
connection.cursor().execute(transactional_sql)
connection.autocommit = False
Reference - https://docs.aws.amazon.com/redshift/latest/mgmt/python-connect-examples.html#python-connect-enable-autocommit
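Put together with the ALTER statement from the question, the workaround might look like this (a sketch; host and credentials are placeholders):
import redshift_connector

ddl = """ALTER TABLE schema.table ADD IF NOT EXISTS
         PARTITION(key=value)
         LOCATION 's3://bucket/prefix'"""

connection = redshift_connector.connect(
    host="your-cluster.region.redshift.amazonaws.com", port=5439,
    database="dev", user="awsuser", password="your_password"
)
connection.autocommit = True   # ALTER EXTERNAL TABLE must run outside a transaction block
cursor = connection.cursor()
cursor.execute(ddl)
connection.close()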
When running this query on Db2 in DBeaver:
reorg table departments
I got this error (only on the external channel):
DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=table;reorg ;JOIN <joined_table>, DRIVER=4.19.49
What does this query mean?
How can I fix the error?
Appreciate any help.
Try call sysproc.admin_cmd('reorg table db2inst1.departments'), since you are using DBeaver, which is a JDBC application.
If you do not qualify the table name (for example, with db2inst1), then Db2 assumes that the qualifier (schema name) is the same as the user ID you used when connecting to the database.
DBeaver runs SQL statements, but it cannot directly run Db2 commands. Instead, any JDBC app can run Db2 commands indirectly via a stored procedure that you CALL; the CALL itself is an SQL statement.
reorg table is a command, not an SQL statement, so it needs to be run via the admin_cmd stored procedure, or from the operating system command line (or the db2 CLP) after connecting.
So if you have db2cmd.exe on MS Windows, or bash on Linux/UNIX, you can connect to the database and run commands via the db2 command.
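If you prefer Python over a bench, the same CALL can be issued through the ibm_db driver (a sketch, not part of the original answer; the connection-string values are placeholders):
import ibm_db

# standard Db2 connection string; adjust database, host, port, user and password
conn_str = ("DATABASE=sample;HOSTNAME=db2-host;PORT=50000;"
            "PROTOCOL=TCPIP;UID=db2inst1;PWD=secret;")
conn = ibm_db.connect(conn_str, "", "")

# REORG is a command, so wrap it in ADMIN_CMD and CALL it as SQL
ibm_db.exec_immediate(conn, "CALL SYSPROC.ADMIN_CMD('REORG TABLE db2inst1.departments')")
ibm_db.close(conn)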
I connected to a Redshift system table called stv_sessions and I can read the data into a DataFrame.
This stv_sessions table is a Redshift system table that holds the process IDs of all the queries that are currently running.
To terminate a running query we can do this:
select pg_terminate_backend(pid)
While this works for me if I connect to Redshift directly (using Aginity), it gives me insufficient privilege issues when I try to run it from Databricks.
Simply put, I don't know how to run the query from a Databricks notebook.
I have tried this so far:
kill_query = "select pg_terminate_backend('12345')"
some_random_df_i_created.write.format("com.databricks.spark.redshift") \
    .option("url", redshift_url).option("dbtable", "stv_sessions").option("tempdir", temp_dir_loc) \
    .option("forward_spark_s3_credentials", True).option("preactions", kill_query).mode("append").save()
Please let me know if the methodology I follow is correct.
Thank you
Databricks purposely does not pre-install this driver. You need to download the official Amazon Redshift JDBC driver, upload it to Databricks, and attach the library to your cluster (v1.2.12 or lower is recommended with Databricks clusters). Then build a JDBC URL of the form:
val jdbcUsername = "REPLACE_WITH_YOUR_USER"
val jdbcPassword = "REPLACE_WITH_YOUR_PASSWORD"
val jdbcHostname = "REPLACE_WITH_YOUR_REDSHIFT_HOST"
val jdbcPort = 5439
val jdbcDatabase = "REPLACE_WITH_DATABASE"
val jdbcUrl = s"jdbc:redshift://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}"
Then try putting jdbcUrl in place of your redshift_url.
That may be the reason you are getting privilege issues.
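With the driver attached, the original PySpark write might then look like this (a sketch; it reuses the placeholders above, and the scratch table and S3 tempdir are hypothetical values you would replace):
jdbc_url = ("jdbc:redshift://REPLACE_WITH_YOUR_REDSHIFT_HOST:5439/REPLACE_WITH_DATABASE"
            "?user=REPLACE_WITH_YOUR_USER&password=REPLACE_WITH_YOUR_PASSWORD")
kill_query = "select pg_terminate_backend('12345')"

# the write itself is only a vehicle for the preaction, which runs first
some_random_df_i_created.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbc_url) \
    .option("dbtable", "public.scratch_table") \
    .option("tempdir", "s3a://your-bucket/tmp/") \
    .option("forward_spark_s3_credentials", True) \
    .option("preactions", kill_query) \
    .mode("append") \
    .save()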
Link 1: https://docs.databricks.com/_static/notebooks/redshift.html
Link 2: https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#installation
Another reason could be that the Redshift-Databricks connector only uses SSL (encryption in flight), and IAM roles may have been set on your Redshift cluster to only allow certain users to delete tables.
Apologies if none of this helps your case.
I want to execute the following query on a remote Postgres server from a PySpark application using the JDBC connector:
SELECT id, postgres_function(some_column) FROM my_database GROUP BY id
The problem is that I can't execute this kind of query in PySpark using spark.sql(QUERY), because postgres_function is not part of the ANSI SQL support Spark has provided since 2.0.0.
I'm using Spark 2.0.1 and Postgres 9.4.
The only option you have is to use a subquery:
table = """
(SELECT id, postgres_function(some_column) FROM my_database GROUP BY id) AS t
"""
sqlContext.read.jdbc(url=url, table=table)
but this will execute the whole query, including the aggregation, on the database side and fetch the result.
In general it doesn't matter whether a function is an ANSI SQL function or has an equivalent in the source database; all functions called in spark.sql are executed in Spark after the data is fetched.
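For reference, the same pushdown can also be written with DataFrameReader options (a sketch; the URL, credentials and driver class are placeholders you would adjust):
# the subquery is passed as dbtable, so Postgres evaluates postgres_function and the GROUP BY
query = "(SELECT id, postgres_function(some_column) FROM my_database GROUP BY id) AS t"

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://pg-host:5432/mydb") \
    .option("dbtable", query) \
    .option("user", "postgres_user") \
    .option("password", "postgres_password") \
    .option("driver", "org.postgresql.Driver") \
    .load()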
My product needs to support Oracle, SQL Server, and DB2 v9. We are trying to figure out the most efficient way to periodically load data into the database. This currently takes 40+ minutes with individual insert statements, but just a few minutes when we use SQLLDR or BCP. Is there an equivalent in DB2 that allows CSV data to be loaded into the database quickly?
Our software runs on Windows, so we need to assume that the database is running on a remote system.
load:
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/core/r0008305.htm
If the data is in CSV format, try importing it with the delimiter set to a comma (,):
db2 import from <filename> of del modified by coldel, insert into <table_name>
Or else you can use the LOAD command, which is much faster than IMPORT because it writes data pages directly instead of issuing inserts; the client keyword makes it read the file from the client machine, which suits a remote database:
db2 load client from /u/user/data.del of del
modified by coldel, insert into mytable