Specified types or functions (one per INFO message) not supported on Redshift tables - amazon-redshift

select distinct(table_name) from svv_all_columns ;
SELECT distinct(source_table_name) FROM "ctrl_stg_cdr_rules" order by source_table_name ;
I want to take the intersection of the above two queries, but I am getting this error:
ERROR: Specified types or functions (one per INFO message) not
supported on Redshift tables. [ErrorId:
1-63eb4c35-4b45b94c02210a19663d78db]
SELECT table_name FROM svv_all_columns WHERE database_name = 'singh_sandbox' AND schema_name = 'new_sources'
INTERSECT
SELECT source_table_name FROM ctrl_stg_cdr_rules
ORDER BY table_name;
I expect to get a list of all missing tables.

Oh, this error again. This has to be one of the worst-written error messages in existence. What it means is that you are trying to use leader-node-only data in a query that runs on the compute nodes.
You see, Redshift is a cluster whose leader node has a different purpose than the other nodes (the compute nodes). When a query runs, the compute nodes execute on data they have direct access to, then the results from the compute nodes are passed to the leader node for any final actions and passed on to the connected client. In this model, data flows only one way during query execution. This error happens when data accessible only from the leader node is needed by the compute nodes - this includes results from leader-only functions and/or leader-node-only tables. That is exactly what happens when you perform an INTERSECT between these two SELECTs.
To resolve this you need to produce the leader-only data with a separate SELECT and route the data back to the compute nodes through a supported process. There are two classes of methods for doing this - have an external system route the data back, or use a cursor and route the results back. I wrote up how to perform the cursor approach in this answer: How to join System tables or Information Schema tables with User defined tables in Redshift. A sketch of that idea is shown below.
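A minimal sketch of the cursor idea, assuming a Redshift stored procedure is acceptable; the procedure and staging-table names are illustrative, and the row-by-row loop can be slow for large catalogs:

-- Copy the leader-node-only result into a regular (compute-node) table
-- via a stored-procedure loop, then run the INTERSECT against that table.
CREATE OR REPLACE PROCEDURE stage_leader_tables()
AS $$
DECLARE
    rec RECORD;
BEGIN
    DROP TABLE IF EXISTS stg_leader_tables;
    CREATE TABLE stg_leader_tables (table_name varchar(128));
    FOR rec IN SELECT DISTINCT table_name
               FROM svv_all_columns
               WHERE database_name = 'singh_sandbox'
                 AND schema_name = 'new_sources'
    LOOP
        INSERT INTO stg_leader_tables VALUES (rec.table_name);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CALL stage_leader_tables();

-- Both sides of the INTERSECT now live on the compute nodes:
SELECT table_name FROM stg_leader_tables
INTERSECT
SELECT source_table_name FROM ctrl_stg_cdr_rules
ORDER BY table_name;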
The bottom line is that you cannot do what you intend simply because of the architecture of Redshift. You need a different approach.

Related

Execute Multiple SQL Insert Statements in parallel in Snowflake

I have a question about how it works when several SQL statements are executed in parallel in Snowflake.
For example, if I execute 10 insert statements on 10 different tables with the same base table - will the tables be loaded in parallel?
Since COPY and INSERT statements only write new partitions, they can run in parallel with other COPY or INSERT statements.
https://docs.snowflake.com/en/sql-reference/transactions.html#transaction-commands-and-functions states that "Most INSERT and COPY statements write only new partitions. Those statements often can run in parallel with other INSERT and COPY operations, ..."
I assume that statements cannot run in parallel when they want to insert into the same micro-partition. Is that correct, or is there another explanation for why locks on INSERTs can happen?
I execute 10 insert statements on 10 different tables
with the same base table - will the tables be loaded in parallel?
YES!
Look for multi-table INSERT in Snowflake: https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table.html
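A minimal sketch of a multi-table INSERT in its unconditional (INSERT ALL) form; the table and column names here are illustrative:

-- Load two target tables from a single pass over the base table.
INSERT ALL
    INTO target_table_1 (id, val) VALUES (id, val)
    INTO target_table_2 (id, val) VALUES (id, val)
SELECT id, val
FROM base_table;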
With snowsql you can submit queries in parallel by simply adding a ">" symbol between them.
For example, the statement below will submit all of the listed queries to Snowflake in parallel. Note, though, that it will not exit if an error is encountered in any of the queries.
snowsql -o log_level=DEBUG -o exit_on_error=true -q "select 1;>select * from SNOWSQLTABLE;>select 2;>select 3;>insert into TABLE values (1)>;select * from SNOWLTABLE;>select 5;"
The below statement will cause the queries to run one at a time and exit if any error is found.
snowsql -o log_level=DEBUG -o exit_on_error=true -q "select 1;select * from SNOWSQLTABLE;select 2;select 3;insert into SNOQSQLTABLE values (1);select * from SNOWLTABLE;select 5;"
Concurrency in Snowflake is managed either with multiple warehouses (separate compute resources) or by enabling multi-clustering on a warehouse (one virtual warehouse with more than one cluster of servers).
https://docs.snowflake.com/en/user-guide/warehouses-multicluster.html
I'm working with a customer today that runs millions of SQL commands a day; they have many different warehouses, and most of these warehouses are set to multi-cluster "auto-scale" mode.
Specifically, for your question, it sounds like you have ten sessions connected, running inserts into ten tables by querying a single base table. I'd probably begin testing this with one virtual warehouse configured with a minimum of one cluster and a maximum of three or four, and then run the tests and review the results.
The warehouse size I would use is mostly determined by how large the query is (the SELECT portion); you can start with something like a Medium and review the performance and query plans of the inserts to see whether that size is appropriate.
When reviewing the plans, check the queuing time to see whether three or four clusters is enough; it probably will be fine. A warehouse configured along these lines is sketched below.
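A minimal sketch of such a warehouse, assuming auto-scaling between one and four clusters; the warehouse name and exact settings are illustrative:

-- Medium, multi-cluster, auto-scaling warehouse for the insert workload.
CREATE WAREHOUSE IF NOT EXISTS load_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'STANDARD'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE;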
Your query history will also indicate which "cluster_number" your query ran on within the virtual warehouse. This is one way to check how many clusters were running (the maximum cluster_number); another is to view the Warehouses tab in the web UI or to execute the "show warehouses;" command.
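A minimal sketch of that check against the ACCOUNT_USAGE query history, reusing the illustrative warehouse name from above:

-- Which cluster did each recent query run on?
SELECT query_id, warehouse_name, cluster_number, start_time
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'LOAD_WH'
ORDER BY start_time DESC
LIMIT 100;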
Some additional links that might help you:
https://www.snowflake.com/blog/auto-scale-snowflake-major-leap-forward-massively-concurrent-enterprise-applications/
https://community.snowflake.com/s/article/Putting-Snowflake-s-Automatic-Concurrency-Scaling-to-the-Test
https://support.snowflake.net/s/question/0D50Z00009T2QTXSA3/what-is-the-difference-in-scale-out-vs-scale-up-why-is-scale-out-for-concurrency-and-scale-up-is-for-large-queries-

PostgreSQL data warehouse: create separate databases or different tables within the same database?

We seek to run cross-table queries and perform different types of merges. For cross-database queries we would need to establish a connection every time.
So, to answer your question: you should create multiple (different) tables within the same database, because cross-database operations are not supported (in short, you can't join two tables that live in different databases).
But if you want to segregate your data within the same database, you can create different schemas/layers and create your tables under them.
For example:
1st load: landingLayer.tablename
2nd transformation: goodDataLayer.tablename
3rd transformation: widgetLayer.tablename
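A minimal sketch of this layered layout in a single database, with illustrative schema and table names:

-- One database, several schemas acting as layers.
CREATE SCHEMA IF NOT EXISTS landinglayer;
CREATE SCHEMA IF NOT EXISTS gooddatalayer;
CREATE SCHEMA IF NOT EXISTS widgetlayer;

CREATE TABLE landinglayer.orders_raw (id int, payload jsonb);
CREATE TABLE gooddatalayer.orders (id int, customer_id int, total numeric);

-- Cross-schema joins work within one database, unlike cross-database joins:
SELECT g.id, g.total
FROM gooddatalayer.orders AS g
JOIN landinglayer.orders_raw AS r ON r.id = g.id;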

Storing relational data in Apache Flink as State and querying by a property

I have a database with Tables T1(id, name, age) and T2(id, subject).
Flink receives all updates from the database as an event stream, using something like Debezium. The tables are related to each other, and the required data can be extracted by joining T1 with T2 on id. Currently the whole state of the database is stored in Flink MapState with id as the key. The problem is that I now need to select rows from T1 based on name, without using id. It seems like I need an index on T1(name) to make this faster. Is there any way I can index it automatically, without having to manually create an index for each table? What is the recommended way of doing this? I know about SQL streaming on tables, but I require support for updates to the tables. By the way, I use Flink with Scala. Any pointers/suggestions would be appreciated.
My understanding is that you are connecting T1 and T2, and storing some representation (in MapState) of the data from these two streams in keyed state, keyed by id. It sounds like T1 and T2 are evolving over time, and you want to be able to interactively query the join at any time by specifying a name.
One idea would be to broadcast in the name(s) you want to select, and use a KeyedBroadcastProcessFunction to process them. In its processBroadcastElement method you could use ctx.applyToKeyedState to compute the results by extracting data from the MapState records (which would have to be held in this operator). I suspect you will want to use the names as the keys in these MapState records, so that you don't have to iterate over all of the entries in each map to find the items of interest.
You will find a somewhat similar example of this pattern in https://training.data-artisans.com/exercises/ongoingRides.html.

How does pglogical-2 handle logical replication on same table while allowing it to be writeable on both databases?

Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across; the rest of the database can be replicated. Basically, the actual price columns in the prices table cannot be replicated across - they should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and picking items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writeable (in terms of data: INSERTs, UPDATEs, and DELETEs) at both the external database and the internal database. Hence, at both databases, the data in the quotations and quotation_line_items tables is writeable and should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data to be consistent for the replicated tables on both databases. I am okay with synchronous replication, so a bit of delay (say, a couple of seconds for the write operations) is fine.
I came across pglogical https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE
and they have the concept of PUBLISHER and SUBSCRIBER.
I cannot tell from the README which database would act as the publisher and which as the subscriber, or how to configure it for my situation.
That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you intend to prevent the data from being modified in conflicting ways in the two databases? If you say that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
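A minimal sketch of that single-database approach, with illustrative role, table, and column names:

-- One database, two roles: internal users see raw prices, vendors do not.
CREATE ROLE internal_app LOGIN;
CREATE ROLE vendor_app LOGIN;

-- Vendors reach prices only through a view that omits the raw price column;
-- a simple single-table view like this is automatically updatable.
CREATE VIEW vendor_prices AS
    SELECT id, item_name, currency   -- raw price column deliberately left out
    FROM prices;

GRANT SELECT, UPDATE ON vendor_prices TO vendor_app;
GRANT SELECT, INSERT, UPDATE, DELETE ON quotations, quotation_line_items TO vendor_app;
GRANT SELECT, INSERT, UPDATE, DELETE ON prices, quotations, quotation_line_items TO internal_app;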
It is possible with BDR replication, which uses pglogical. On a basic level it works by allocating ranges of key IDs to each node, so writes are possible in both locations without conflict. However, BDR is now a commercial, paid-for product.

How to unload pg_table_def table to s3

I'd like to take a dump of the Redshift schema and do some comparisons among different environments.
unload('select * from pg_table_def')
to 's3://blah/blah/stage.txt'
credentials 'aws_access_key_id=XXXXX;aws_secret_access_key=XXXXXX'
parallel off allowoverwrite manifest;
But the code above throws the following error.
INFO: Function "format_type(oid,integer)" not supported.
INFO: Function "pg_table_is_visible(oid)" not supported.
ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
Any idea how to make this work? Or is there any other way to get the schema? I need to know sort key and dist key information as well.
Redshift keeps certain information in a special area on the leader node, whereas UNLOAD commands are processed on each slice (AFAIK) and therefore can't use leader-node-only functions.
You would probably need to extract this from an external machine using psql.
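A minimal sketch of that extraction, assuming psql is installed on the external machine and the cluster endpoint is reachable; the host, database, user, and output file names are illustrative:

psql -h my-cluster.xxxx.us-east-1.redshift.amazonaws.com -p 5439 -d dev -U admin \
     -A -F ',' -o pg_table_def.csv \
     -c "SELECT * FROM pg_table_def WHERE schemaname = 'public'"

The SELECT runs on the leader node and psql formats the result on the client, so the leader-node-only functions are no longer a problem; the resulting file can then be compared across environments or copied to S3 from the same machine.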