Querying a PostgreSQL database from Snowflake

PostgreSQL offers a way to query a remote database through dblink.
Similarly (sort of), Exasol provides a way to connect to a remote PostgreSQL database via the following syntax:
CREATE CONNECTION JDBC_PG
    TO 'jdbc:postgresql://...'
    IDENTIFIED BY '...';

SELECT * FROM (
    IMPORT FROM JDBC AT JDBC_PG
    STATEMENT 'SELECT * FROM MY_POSTGRES_TABLE;'
);

-- one can even write direct joins, such as:
SELECT
    t.COLUMN,
    r.other_column
FROM MY_EXASOL_TABLE t
LEFT JOIN (
    IMPORT FROM JDBC AT JDBC_PG
    STATEMENT 'SELECT key, other_column FROM MY_POSTGRES_TABLE'
) r ON r.key = t.KEY
This is very convenient for importing data from PostgreSQL directly into Exasol without having to go through a temporary file (CSV, pg_dump...).
Is it possible to achieve the same thing from Snowflake (querying a remote PostgreSQL database from Snowflake with a direct live connection)? I couldn't find any mention of it in the documentation.

Have you looked into using external functions? It's not exactly what you're looking for (Snowflake doesn't have that capability yet), but it can serve as a workaround in some use cases. For instance, you could create a Python function on AWS Lambda that queries PostgreSQL for small amounts of data (due to Lambda limits), or have it kick off a PostgreSQL process that dumps to S3 and triggers Snowpipe for the bulk-import use case.
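For illustration, here is a minimal sketch of that route (all names, ARNs, and URLs below are hypothetical placeholders; the Lambda behind the API Gateway endpoint would run the actual PostgreSQL query):

create or replace api integration pg_lambda_integration
    api_provider = aws_api_gateway
    api_aws_role_arn = 'arn:aws:iam::<account_id>:role/<role_name>'
    api_allowed_prefixes = ('https://<api_id>.execute-api.<region>.amazonaws.com/prod/')
    enabled = true;

create or replace external function query_pg(sql_text varchar)
    returns variant
    api_integration = pg_lambda_integration
    as 'https://<api_id>.execute-api.<region>.amazonaws.com/prod/query-pg';

-- every call round-trips through API Gateway and Lambda, so keep result sets small
select query_pg('SELECT count(*) FROM my_postgres_table');

Keep in mind this gives you scalar calls over HTTP, not a live federated join.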

Related

Execute Postgresql Stored Procedure in PySpark

I am working with PySpark in AWS Glue and I want to execute a stored procedure/function on a PostgreSQL database.
Is it possible?
What is the syntax? Is there any special package needed?
You can try using a module like pg8000 to run this function.
You can also try calling the Postgres function the way you would select data from a table, using the Spark read function with jdbc as the format. Since Glue uses PySpark under the hood, passing a query that calls the function instead of a table name should do the trick. Just remember to add the JDBC driver to your Glue job.
E.g., you can do this in Spark:
jdbcDF = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("driver", "org.postgresql.Driver")
    .option("query", "SELECT * FROM function()")
    .option("user", "user")
    .option("password", "password")
    .load()
)

Is there a way to describe an external/spectrum table via redshift?

In AWS Athena you can write
SHOW CREATE TABLE my_table_name;
and see a SQL-like query that describes how to build the table's schema. It works for tables whose schemas are defined in AWS Glue. This is very useful for creating tables in a regular RDBMS, for loading and exploring data views.
Interacting with Athena in this way is manual, and I would like to automate the process of creating regular RDBMS tables that have the same schema as those in Redshift Spectrum.
How can I do this through a query that can be run via psql? Or is there another way to get this via the aws-cli?
Redshift Spectrum does not support SHOW CREATE TABLE syntax, but there are system tables that can deliver the same information. I have to say, it's not as useful as the ready-to-use SQL returned by Athena, though.
The tables are:
svv_external_schemas - gives you information about the Glue database mapping and the IAM roles bound to it
svv_external_tables - gives you the location information, plus the data format and serdes used
svv_external_columns - gives you the column names, types, and order information
Using that data, you could reconstruct the table's DDL.
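As a quick sketch, here is how to pull the location and serde info for a single table:

select schemaname, tablename, location, input_format, serialization_lib
from svv_external_tables
where schemaname = '<YOUR_SCHEMA_NAME>';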
For example, to get the list of columns and their types in CREATE TABLE format, one can do:
select distinct
    listagg(columnname || ' ' || external_type, ',\n')
        within group (order by columnnum) over ()
from svv_external_columns
where tablename = '<YOUR_TABLE_NAME>'
  and schemaname = '<YOUR_SCHEMA_NAME>'
The query gives you output similar to:
col1 int,
col2 string,
...
*) I am using the listagg window function and not the aggregate function, as apparently the listagg aggregate function can only be used with user-defined tables. Bummer.
I had been doing something similar to @botchniaque's answer in the past, but recently stumbled across a solution in the AWS Labs amazon-redshift-utils code package that seems more reliable than my hand-spun queries:
amazon-redshift-utils: v_generate_external_tbl_ddl
If you don't have the ability to create a view backed by the DDL listed in that package, you can run it manually by removing the CREATE statement from the start of the query. Assuming you can create it as a view, usage would be:
SELECT ddl
FROM admin.v_generate_external_tbl_ddl
WHERE schemaname = '<external_schema_name>'
-- Optionally include specific table references:
-- AND tablename IN ('<table_name_1>', '<table_name_2>', ..., '<table_name_n>')
ORDER BY tablename, seq
;
Redshift has since added SHOW EXTERNAL TABLE:
SHOW EXTERNAL TABLE external_schema.table_name [ PARTITION ]
SHOW EXTERNAL TABLE my_schema.my_table;
https://docs.aws.amazon.com/redshift/latest/dg/r_SHOW_EXTERNAL_TABLE.html

Query for PostgreSQL Server status variable?

In my project I want to collect the PostgreSQL server's performance counters, and for that I need a query to collect them from the database. I am new to PostgreSQL. While searching, I found something like:
SELECT * FROM pg_stat_database
but when I use it from Java in the following manner (here Map_PostgreSQL is a HashMap):
while (rs.next()) {
    Counter_Name.add(rs.getString(1).trim());
    Map_PostgreSQL.put(rs.getString(1).trim(), rs.getString(2));
}
I got output like:
{12024=template0, 1=template1, 12029=postgres}
What is the actual query to collect its status variables, like "SHOW GLOBAL STATUS" in MySQL?
Thanks in advance.
First, try launching the SQL query in your PostgreSQL shell to see exactly which data are returned and how they are organized in rows and columns.
You'll see that the hashmap keys are your datid (database ids) and the values are your databases names.
I think you assumed that statistics were structured in "rows" whereas they are structured in columns.
Don't forget: PostgreSQL is a database server, which means it can handle several databases (and in fact it already has several, such as the 'postgres' database itself, which Postgres (the server) uses internally, or 'template0').
By launching:
SELECT * FROM pg_stat_database;
you're asking the server to return statistics for every database (provided you're allowed to see them).
If you want stats for your own database only, do:
SELECT * FROM pg_stat_database WHERE datname = 'your_database_name';
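The counters you're after are the remaining columns of that row. For example (a minimal sketch; the column list is trimmed and the database name is a placeholder):

SELECT datname,
       numbackends,   -- currently connected backends
       xact_commit,   -- transactions committed
       xact_rollback, -- transactions rolled back
       blks_read,     -- disk blocks read
       blks_hit       -- blocks found in the buffer cache
FROM pg_stat_database
WHERE datname = 'your_database_name';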
Hope this helped

How do I save results from a pass through query to a local table?

I need to submit a series of queries to an Oracle server over ODBC from an MS SQL server and store the results as a table on the MS SQL server.
It has to be a pass-through because the query requires a server-side function defined on the Oracle server.
I can't save the table on the Oracle server and then access it via ODBC because of licensing restrictions from the vendor of the db running on Oracle.
Here's the code that returns the correct results, but I don't know how to save them:
DECLARE @BibID AS bigint

DECLARE BibList CURSOR FOR
SELECT BIB_ID FROM tblActiveSerialsThatHave740s

OPEN BibList
FETCH NEXT FROM BibList INTO @BibID
WHILE @@FETCH_STATUS = 0
BEGIN
    EXECUTE
    ('SELECT
          AMDB.BIB_DATA.BIB_ID As BIB_ID,
          AMDB.GetAllBibTag(AMDB.BIB_DATA.BIB_ID, ''740'', ''2'') As F740_All
      FROM
          AMDB.BIB_DATA
      WHERE
          AMDB.BIB_DATA.BIB_ID = ' + CAST(@BibID AS varchar(20)) + '
      GROUP BY BIB_ID '
    )
    AT REPORT
    FETCH NEXT FROM BibList INTO @BibID
END
CLOSE BibList
DEALLOCATE BibList
You need to use INSERT INTO ... EXECUTE to capture the results of the EXECUTE ... AT statement.
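For example (a sketch only; dbo.BibResults is a hypothetical local table you would create first with matching columns):

INSERT INTO dbo.BibResults (BIB_ID, F740_All)
EXECUTE
('SELECT
      AMDB.BIB_DATA.BIB_ID As BIB_ID,
      AMDB.GetAllBibTag(AMDB.BIB_DATA.BIB_ID, ''740'', ''2'') As F740_All
  FROM AMDB.BIB_DATA
  WHERE AMDB.BIB_DATA.BIB_ID = ' + CAST(@BibID AS varchar(20)) + '
  GROUP BY BIB_ID')
AT REPORT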
Because you are executing a pass-through query, the Distributed Transaction Coordinator comes into play. You might need to ensure either that a distributed transaction is not created (it's unlikely to be necessary in your case) or that the Distributed Transaction Coordinator service is running:
http://blogs.msdn.com/b/sqlprogrammability/archive/2008/08/22/how-to-create-an-autonomous-transaction-in-sql-server-2008.aspx
http://technet.microsoft.com/en-us/library/ms178532.aspx
http://www.sqlservercentral.com/Forums/Topic861249-392-1.aspx
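If the distributed transaction does get in the way, one option worth knowing (a sketch; REPORT is the linked server name from the question) is to disable transaction promotion for that linked server:

-- stop SQL Server from promoting local transactions to
-- distributed (MSDTC) transactions for remote proc calls
EXEC sp_serveroption @server = 'REPORT',
                     @optname = 'remote proc transaction promotion',
                     @optvalue = 'false';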

Creating a connection from Microsoft SQL server to an AS/400

I'm trying to connect from Microsoft SQL Server to an AS/400 so I can pull data from the AS/400 and then flag the data as having been pulled.
I've successfully created an OLE DB "IBMDASQL" connection and am able to pull some data, but I'm running into an issue when I try to pull data from a very large table.
This runs fine, and returns a count of 170 million:
select count(*)
from transactions
This query executed for 15 hours before I gave up on it. (It should return zero, since I haven't flagged anything as 'In process' yet.)
select count(*)
from transactions
where processed = 'In process'
I'm a Microsoft guy, but my AS/400 guy says that there is an index on the 'processed' column and that, locally, the query runs instantaneously.
Any thoughts on what I might be doing wrong? I found a table with only 68 records in it and was able to run this query in about a second:
select count(*)
from smallTable
where RandomColumn = 'randomValue'
So I know that the AS/400 is at least able to understand that type of query.
I have had to fight this battle many times.
There are two ways of approaching this.
1) Stage your data from the AS400 into SQL Server, where you can optimize your indexes.
2) Ask the AS400 folks to create logical views to speed up data retrieval. Your AS400 programmer is correct that an index will help; the AS400 term is "physical" vs. "logical" files, and logical is what you want.
Third, 170 million is a lot of records, even for a relational database like SQL Server. Have you considered running an SSIS package nightly that stages your data into your own SQL table, to see if it improves performance? A sketch of that staging idea follows below.
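If you go the staging route, the simplest form is a periodic pull through the linked server into a local table you control (a sketch; the table and column names are hypothetical):

-- land the remote rows locally, then index dbo.transactions_stage
-- however you like and run the real queries against it
TRUNCATE TABLE dbo.transactions_stage;
INSERT INTO dbo.transactions_stage (tx_id, processed)
SELECT tx_id, processed
FROM OPENQUERY(AS400, 'SELECT tx_id, processed FROM MYLIB.TRANSACTIONS');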
I would suggest the following approach for good performance. I assume you have at least SQL Server 2005; I haven't tested this yet, but here is the tip:
Let the AS400 perform the select natively by creating stored procedures on the AS400:
open an AS400 session
launch STRSQL
create an AS400 stored procedure in this way to get the recordset
CREATE PROCEDURE MYSELECT (IN PARAM CHAR(10))
LANGUAGE SQL
DYNAMIC RESULT SETS 1
BEGIN
    -- WITH RETURN makes the open cursor come back to the caller
    DECLARE C1 CURSOR WITH RETURN FOR
        SELECT * FROM MYLIB.MYFILE WHERE MYFIELD = PARAM;
    OPEN C1;
    RETURN;
END
create an AS400 stored procedure to update the recordset
CREATE PROCEDURE MYUPDATE (IN PARAM CHAR(10))
LANGUAGE SQL
DYNAMIC RESULT SETS 0
BEGIN
    UPDATE MYLIB.MYFILE SET MYFIELD = 'newvalue' WHERE MYFIELD = PARAM;
END
Call those AS400 stored procedures from SQL Server:
declare @myParam char(10)
set @myParam = 'In process'
-- get the recordset
EXEC ('CALL NAME_AS400.MYLIB.MYSELECT(?)', @myParam) AT AS400 -- AS400 = name of the linked server
-- update
EXEC ('CALL NAME_AS400.MYLIB.MYUPDATE(?)', @myParam) AT AS400
Hope it helps
I recommend following the suggestions in the IBM Redbook SQL Performance Diagnosis on IBM DB2 Universal Database for iSeries to determine what's really happening.
IBM technical support can also be extremely helpful in diagnosing issues such as these. Don't be afraid to get in touch with them as the software support is generally included as part of the maintenance contract and there is no charge to talk to them.
I've seen OLE DB connections eat up 100% CPU for hours, while the same query run through Visual Explain (query analyzer) is estimated to take mere seconds to execute.
We found that running the query like this performed as expected:
SELECT *
FROM OpenQuery( LinkedServer,
'select count(*)
from transactions
where processed = ''In process''')
GO
Could this be a collation problem? Your WHERE clause tests a text field, and if the collations of the two servers don't match, the clause will be applied client-side rather than server-side, so you are first pulling all 170 million records down to the client and then applying the WHERE clause there.
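A quick way to start checking that theory (a sketch) is to compare the SQL Server side's collation with whatever the AS/400 uses, and to force the comparison to run remotely via OPENQUERY as shown above:

-- returns the default collation of the local SQL Server instance
SELECT SERVERPROPERTY('Collation');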
Based on past interactions I have had, the query should take about the same amount of time no matter how you access the data. Another thought would be to create a view on the table to get the data you need, or use a stored procedure.