How to unload pg_table_def table to s3 - amazon-redshift

I'd like to take a dump of the Redshift schema and do some comparisons across different environments.
unload('select * from pg_table_def')
to 's3://blah/blah/stage.txt'
credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=XXXXXX'
parallel off allowoverwrite manifest;
But the code above throws the following error.
INFO: Function "format_type(oid,integer)" not supported.
INFO: Function "pg_table_is_visible(oid)" not supported.
ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
Any idea how to make this work? Or is there any other way to get the schema? I need to know sort key and dist key information as well.

Redshift keeps certain information in a special area on the leader node, whereas UNLOAD commands are processed on each slice (AFAIK) and therefore can't use leader-node-only functions.
You would probably need to extract this from an external machine using psql.
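For example, a minimal sketch of that external extraction, run with psql from a client machine (the endpoint, output file, and my_schema are placeholders; note that pg_table_def only lists tables in schemas on your search_path, and it already exposes the distkey and sortkey columns you need):
-- Run via something like: psql -h <cluster-endpoint> -p 5439 -d <dbname> -A -F ',' -o schema_dump.csv -f dump_schema.sql
SET search_path TO '$user', public, my_schema;
SELECT schemaname, tablename, "column", type, encoding, distkey, sortkey, "notnull"
FROM pg_table_def
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');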

Specified types or functions (one per INFO message) not supported on Redshift tables

select distinct(table_name) from svv_all_columns ;
SELECT distinct(source_table_name) FROM "ctrl_stg_cdr_rules" order by source_table_name ;
I want to take the intersection of the above two queries but am getting this error:
ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables. [ErrorId: 1-63eb4c35-4b45b94c02210a19663d78db]
SELECT table_name FROM svv_all_columns WHERE database_name = 'singh_sandbox' AND schema_name = 'new_sources'
INTERSECT
SELECT source_table_name FROM ctrl_stg_cdr_rules
ORDER BY table_name;
I expect to get a list of all missing tables.
Oh, this error again. This has to be one of the worst-written error messages in existence. What it means is that you are trying to use leader-node-only data in a query being run on the compute nodes.
You see, Redshift is a cluster with a leader node that has a different purpose than the other nodes (the compute nodes). When a query runs, the compute nodes execute on data they have direct access to, then the results from the compute nodes are passed to the leader for any final actions and passed on to the connected client. In this model the data flows only one way during query execution. This error happens when data accessible only from the leader node is needed by the compute nodes - this includes results from leader-only functions and/or leader-node-only tables. That is what is happening when you perform INTERSECT between these two selects.
To resolve this you need to produce the leader-only data as a separate select and route the data back to the compute nodes through a supported process. There are two classes of methods to do this: have an external system route the data back, or use a cursor and route the results back. I wrote up how to perform the cursor approach in this answer: How to join System tables or Information Schema tables with User defined tables in Redshift
The bottom line is that you cannot do what you intend simply because of the architecture of Redshift. You need a different approach.
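As an illustration of the first class (an external system routing the leader-only data back), here is a rough sketch; the staging table name is made up, and the COPY/INSERT step is whatever loading mechanism you already use:
-- One-time staging table on the compute nodes:
CREATE TABLE stg_source_tables (table_name VARCHAR(128));
-- 1) From an external client (psql, a Python script, etc.) run the leader-only query:
--      SELECT table_name FROM svv_all_columns
--      WHERE database_name = 'singh_sandbox' AND schema_name = 'new_sources';
-- 2) Write that result to S3 and COPY it into stg_source_tables (or INSERT the rows directly).
-- 3) Both sides of the set operation now live on the compute nodes:
SELECT table_name FROM stg_source_tables
INTERSECT
SELECT source_table_name FROM ctrl_stg_cdr_rules
ORDER BY table_name;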

Is there a way for PySpark to give user warning when executing a query on Apache Hive table without specifying partition keys?

We are using Spark SQL with Apache Hive tables (via the AWS Glue Data Catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, it gives us (the user) no warning that it will proceed to load all partitions and thus will likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.

How to pass variable in Load command from IBM Object Storage file to Cloud DB2

I am using the below command to load an Object Storage file into DB2 table NLU_TEMP_2:
CALL SYSPROC.ADMIN_CMD('load from "S3::s3.jp-tok.objectstorage.softlayer.net::
<s3-access-key-id>::<s3-secret-access-key>::nlu-test::practice_nlu.csv"
OF DEL modified by codepage=1208 coldel0x09 method P (2) WARNINGCOUNT 1000
MESSAGES ON SERVER INSERT into DASH12811.NLU_TEMP_2(nlu)');
The above command inserts the 2nd column from the Object Storage file into the nlu column of DASH12811.NLU_TEMP_2.
I want to insert request_id from a variable as an additional column, i.e. into DASH12811.NLU_TEMP_2(request_id, nlu).
I read in an article that statement concentrator literals can be used to dynamically pass a value. Please let me know if anyone has an idea of how to use it.
Note: I would be using this query in DB2, not DB2 Warehouse. External tables won't work in DB2.
LOAD does not have any ability to include extra values that are not part of the load file. You can try to do things with columns that are generated by default in Db2, but it is not a good solution.
Really, you need to wait until Db2 on Cloud supports external tables.
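As a rough sketch of the workaround-style approach (not something LOAD itself supports): run the ADMIN_CMD LOAD exactly as in the question, then stamp the freshly loaded rows from the calling application, assuming the request_id column already exists on the table; the literal below is a placeholder for your variable:
-- Run after the LOAD; '<request-id>' is a placeholder supplied by the application.
UPDATE DASH12811.NLU_TEMP_2
   SET REQUEST_ID = '<request-id>'
 WHERE REQUEST_ID IS NULL;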

DB2 Tables Not Loading when run in Batch

I have been working on a reporting database in DB2 for a month or so, and I have it set up to a pretty decent degree of what I want. I am, however, noticing small inconsistencies that I have not been able to work out.
Less important, but still annoying:
1) Users claim it takes two login attempts to connect; the first always fails, the second succeeds. (Is there a recommendation for what to check for this?)
More importantly:
2) Whenever I want to refresh the data (which will be nightly), I have a script that drops and then recreates all of the tables. There are 66 tables, each ranging from tens of records to just under 100,000 records. The data is not massive, and it takes about 2 minutes to run all 66 tables.
The issue is that once it says it completed, there are usually at least 3-4 tables that did not load any data. So the table is dropped and then created, but is empty. The log shows that the command completed successfully, and if I run the statements independently they populate just fine.
If it helps, 95% of the commands are just CAST functions.
While I am sure I am not doing it the recommended way, is there a reason why a number of my tables are not populating? Are the commands executing too fast? Should I delay the CREATE until after the DROP?
(This is DB2 Express-C 11.1 on Windows 2012 R2, The source DB is remote)
Example of my SQL:
DROP TABLE TEST.TIMESHEET;
CREATE TABLE TEST.TIMESHEET AS (
SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER(34))TIMESHEET_ID ....
.. (for 5-50 more columns)
FROM REMOTE_DB.TIMESHEET
)WITH DATA;
It is possible to configure DB2 to tolerate certain SQL errors in nested table expressions.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyfqetnint.html
When the federated server encounters an allowable error, the server allows the error and continues processing the remainder of the query rather than returning an error for the entire query. The result set that the federated server returns can be a partial or an empty result.
However, I assume that your REMOTE_DB.TIMESHEET is simply a nickname, and not a view with nested table expressions, and so any errors when pulling data from the source should be surfaced by DB2. Taking a look at the db2diag.log is likely the way to go - you might even be hitting a Db2 issue.
It might be useful to change your script to TRUNCATE and INSERT into your local tables and see if that helps avoid the issue.
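A sketch of that TRUNCATE-and-INSERT variant, reusing the table from the question (the column list is abbreviated as in the original, and the cast drops the (34) because Db2's INTEGER type takes no precision):
-- TRUNCATE must be the first statement in its unit of work in Db2.
TRUNCATE TABLE TEST.TIMESHEET IMMEDIATE;
COMMIT;
INSERT INTO TEST.TIMESHEET
SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER) AS TIMESHEET_ID
       -- ... remaining columns as in the original script ...
FROM REMOTE_DB.TIMESHEET;
COMMIT;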
As you say, you are maybe not doing things the most efficient way. You could consider using cache tables to take a periodic copy of your remote data: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyvfed_tuning_cachetbls.html

Execute function returning data in remote PostgreSQL database from local PostgreSQL database

Postgres version: 9.3.4
I have the need to execute a function which resides in a remote database. The function returns a table of statistic data based on the parameters given.
I am in effect only mirroring the function in my local database to lock down access to this function using my database roles and grants.
I have found the following, which seem to provide only table-based access.
http://www.postgresql.org/docs/9.3/static/postgres-fdw.html
http://multicorn.org/foreign-data-wrappers/#idsqlalchemy-foreign-data-wrapper
First question: is that correct or are there ways to use these libraries for non-table based operations?
I have found the following which seems to provide me with any SQL operation on the foreign database. The negative seems to be increased complexity and reduced performance due to manual connection and error handling.
http://www.postgresql.org/docs/9.3/static/dblink.html
Second question: are these assumptions correct, and are there any ways to bypass these concerns or libraries/samples one can begin from?
The FDW interface provides a way to make a library which allows a PostgreSQL database to query an external data source as though it were a table. From that point of view, it could do what you want.
The built-in postgres_fdw driver, however, does not allow you to specify a function as a remote table.
You could write your own FDW driver, possibly using the multicorn library or some other language. That is likely to be a bit of work, though, and would have some specific disadvantages; in particular, I don't know how you would pass parameters to the function.
dblink is probably going to be the easiest solution. It allows you to execute arbitrary SQL on the remote server, returning a set of records.
SELECT *
FROM dblink('dbname=mydb', 'SELECT * FROM thefunction(1,2,3)')
AS t1(col1 INTEGER, col2 INTEGER);
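Since the stated goal is to lock access down with local roles and grants, one further option (a sketch; the wrapper name, connection string, remote function signature, and role name are all placeholders) is to hide the dblink call inside a local wrapper function and grant EXECUTE only to the roles that should use it:
-- dblink ships as an extension and must be installed in the local database.
CREATE EXTENSION IF NOT EXISTS dblink;
-- Local wrapper around the remote function; adjust the connection string and column list.
CREATE OR REPLACE FUNCTION local_stats(p1 integer, p2 integer, p3 integer)
RETURNS TABLE (col1 integer, col2 integer)
LANGUAGE sql
AS $$
    SELECT col1, col2
    FROM dblink(
        'host=remotehost dbname=mydb user=stats_reader password=secret',
        format('SELECT * FROM thefunction(%s, %s, %s)', p1, p2, p3)
    ) AS t(col1 integer, col2 integer);
$$;
-- Only the reporting role may call it.
REVOKE ALL ON FUNCTION local_stats(integer, integer, integer) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION local_stats(integer, integer, integer) TO reporting_role;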
There are other potential solutions but they would all be more effort to set up.