Using redshift system views with select pivot - amazon-redshift

I'd like to pivot the values in the system views svv_*_privileges to show the privileges assigned to security principals by object. (For the complete solution I will need to union the results for all object types and pivot across all privileges.)
As an example, for the default privileges:
select * from
    (select object_type, grantee_name, grantee_type, privilege_type, 1 as is_priv
     from pg_catalog.svv_default_privileges
     where grantee_name = 'abc' and grantee_type = 'role')
pivot (max(is_priv) for privilege_type in
       ('EXECUTE', 'INSERT', 'SELECT', 'UPDATE', 'DELETE', 'RULE', 'REFERENCES', 'TRIGGER', 'DROP'));
This gives an error (is the view only valid on the leader node?):
[Amazon](500310) Invalid operation: Query unsupported due to an internal error.
Then I thought of trying a temp table, so the pivot would run against a regular Redshift table:
select * into temp schema_default_priv from pg_catalog.svv_default_privileges where grantee_name = 'abc' and grantee_type = 'role'
... same error as above :-(
Is there a way I can work with SQL on the system tables to accomplish this in Redshift SQL?
While I could do the pivot in Python... why should I, it's supposedly a SQL database!

On rereading your question the issue became clear. You are using a leader-node-only system view and looking to apply compute-node data and/or functions. This path of data flow is not supported on Redshift. I do have some question as to which part of the query requires compute-node action, but that isn't crucial and digging in would take time.
If you need to get leader-node data to the compute nodes there are a few ways, and none of them are trivial. I find that the best method for moving the needed data is to use a cursor. This previous answer outlines how to do this:
How to join System tables or Information Schema tables with User defined tables in Redshift
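As a loose, untested sketch of that cursor idea using a stored procedure (the procedure name and target table are placeholders, the filter values are the ones from the question, and whether the implicit cursor loop gets around the leader/compute restriction on your cluster needs testing):
CREATE OR REPLACE PROCEDURE copy_default_privs()
AS $$
DECLARE
    rec RECORD;
BEGIN
    -- regular (compute-node) table that the pivot can later run against
    DROP TABLE IF EXISTS schema_default_priv;
    CREATE TABLE schema_default_priv (
        object_type    varchar(64),
        grantee_name   varchar(128),
        grantee_type   varchar(64),
        privilege_type varchar(64));
    -- loop over the leader-node-only view and write each row out
    FOR rec IN SELECT object_type, grantee_name, grantee_type, privilege_type
               FROM pg_catalog.svv_default_privileges
               WHERE grantee_name = 'abc' AND grantee_type = 'role'
    LOOP
        INSERT INTO schema_default_priv
        VALUES (rec.object_type, rec.grantee_name, rec.grantee_type, rec.privilege_type);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CALL copy_default_privs();
-- the PIVOT from the question can then be pointed at schema_default_priv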

Related

postgres optimisation: run view query before policy

I have a postgres table which has a policy enforced on it, like so (extra columns redacted for brevity):
create table live_specs (
catalog_name catalog_name not null,
spec_type catalog_spec_type not null,
);
create policy "Users must be read-authorized to the specification catalog name"
on live_specs as permissive for select
using (auth_catalog(catalog_name, 'read'));
create index idx_live_specs_spec_type on live_specs (spec_type);
create index idx_live_specs_catalog_name on live_specs (catalog_name);
The auth_catalog function cannot be indexed because it's not immutable, so it's hard to optimise this function.
I have a view that I query, which in turn queries this table with the policy:
create view live_specs_ext as
select
l.*,
c.id as connector_id,
from live_specs l
left outer join connectors c on c.image_name = l.connector_image_name;
Now I'm running a query against this view, filtering on spec_type, which is an indexed field. However, I can see that Postgres seems to do a full table scan when enforcing the policy and doesn't utilise the index on spec_type (extra lines of EXPLAIN output omitted for brevity):
EXPLAIN SELECT * FROM live_specs_ext WHERE spec_type = 'capture' LIMIT 10;
...
Filter: (auth_catalog((catalog_name)::text, 'read'::grant_capability) AND (spec_type = 'capture'::catalog_spec_type))
...
From reading the CREATE POLICY page of the Postgres docs I understand that:
Generally, the system will enforce filter conditions imposed using security policies prior to qualifications that appear in user queries, in order to prevent inadvertent exposure of the protected data to user-defined functions which might not be trustworthy. However, functions and operators marked by the system (or the system administrator) as LEAKPROOF may be evaluated before policy expressions, as they are assumed to be trustworthy.
However, if I understand this correctly, it means the spec_type = 'capture' qual, which uses the built-in = function, is not being run before the policy because = is not leakproof. Is that a correct understanding?
Is there any way for me to ask Postgres to run my spec_type = 'capture' qual before the policy?
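For what it's worth, whether the planner may evaluate a user qual before the policy hinges on the leakproof flag of the function behind the operator, and that flag can be inspected in the catalogs. A minimal sketch (assuming catalog_spec_type is the enum type used in the qual):
-- which function implements = for the enum type, and is it marked leakproof?
SELECT o.oprname, p.proname, p.proleakproof
FROM pg_operator o
JOIN pg_proc p ON p.oid = o.oprcode
WHERE o.oprname = '='
  AND o.oprleft = 'catalog_spec_type'::regtype
  AND o.oprright = 'catalog_spec_type'::regtype;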

Observation of DBMS_STATS global preferences for autonomous database; and is there any risk in changing to desired values or overriding at table level

I am interested in understanding the rationale behind two DBMS_STATS global preferences in the autonomous database / data warehouse, and what the risk/downside of changing them is compared to the risk in a non-autonomous database.
In the autonomous database:
SELECT
    DBMS_STATS.GET_PREFS( PNAME => 'METHOD_OPT' ) AS METHOD_OPT
FROM DUAL;
Yields: FOR ALL COLUMNS SIZE 254
And
SELECT
    DBMS_STATS.GET_PREFS( PNAME => 'INCREMENTAL_LEVEL' ) AS INCREMENTAL_LEVEL
FROM DUAL;
Yields: TABLE
In the non-autonomous database those two queries yield
FOR ALL COLUMNS SIZE AUTO and PARTITION.
I would like to understand the rationale for the difference, and the negatives of either changing the global setting or overriding it at the table level in the autonomous database.
With respect to METHOD_OPT, the autonomous database setting seems to be wasting resources (time, CPU, and space) for no gain (unless one is talking about when data is loaded before there is any usage).
With respect to INCREMENTAL_LEVEL, the autonomous setting seems to be beneficial for non-partitioned tables that are involved in partition exchanges. But for partitioned tables it seems to force whole-table work, because the TABLE setting requests a table-level synopsis to be created even when only a single partition is modified. The following command is used to gather the statistics:
DBMS_STATS.GATHER_SCHEMA_STATS
( USER
, CASCADE => TRUE
, OPTIONS => 'GATHER AUTO'
, DEGREE => DBMS_STATS.AUTO_DEGREE
);
The objective was that the schema stats gathering would only do partition-level statistics gathering for stale partitions and use that information to incrementally maintain the global table statistics. But the observed behavior seems to be full-table scans due to the INCREMENTAL_LEVEL setting.
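For context, one way to see which partitions the gather actually re-analyzed is to look at the per-partition statistics timestamps (a sketch; the table name is a placeholder):
-- when statistics were last gathered for each partition (placeholder table name)
SELECT partition_name, partition_position, num_rows, last_analyzed, stale_stats
FROM user_tab_statistics
WHERE table_name = 'MY_PART_TABLE'
ORDER BY partition_position;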
And yes, the following query does return TRUE, and it is not overridden at the schema/table level.
SELECT
    DBMS_STATS.GET_PREFS( PNAME => 'INCREMENTAL' ) AS INCREMENTAL
FROM DUAL;
So to cut to the questions at hand:
Why might the autonomous database set these global DBMS_STATS preferences in this manner?
Is there any prohibition against either changing them as indicated or overriding them at the table level?
What are the possible downsides?
Any insights are appreciated. Thanks in advance.
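For reference, the table-level override being asked about would look roughly like this (a sketch; the table name is a placeholder and the values are the non-autonomous defaults discussed above):
BEGIN
    -- override the global preferences for one partitioned table (placeholder name)
    DBMS_STATS.SET_TABLE_PREFS(USER, 'MY_PART_TABLE', 'METHOD_OPT', 'FOR ALL COLUMNS SIZE AUTO');
    DBMS_STATS.SET_TABLE_PREFS(USER, 'MY_PART_TABLE', 'INCREMENTAL_LEVEL', 'PARTITION');
END;
/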

Data from FDW loads only once and subscription status = "synchronized"

I'm facing a problem with Postgres 12 FDW/replication and can't understand where the issue is.
In the DWH we are replicating tables from two different sources (for example source1 and source2), but there is one table for which replication from source1 keeps failing. Table structures and data types are identical, except that source2 has an additional column, which also exists in the target DB (DWH) with a default value of "0" so that data from source1 can be replicated as well. As far as I know it is not a problem for the target table to have more columns than the source; the issue is that the subscription process synchronizes only with source2. With source1 it synchronizes once, on ALTER SUBSCRIPTION source1 REFRESH PUBLICATION, and then it stops and data is not replicated anymore (the subscriptions keep working for other tables; the problem is only with this particular table).
There are no error messages in the log file or anything else that could help me resolve this myself. I tried to drop the table in the DWH and recreate it, but it didn't help. There are no duplicate entries or anything else that could crash the replication.
select
b.subname
,c.relname
,case
when a.srsubstate = 'i' then 'initialize'
when a.srsubstate = 'd' then 'data is being copied'
when a.srsubstate = 's' then 'synchronized'
when a.srsubstate = 'r' then 'ready (normal replication)'
else null
end srsubstate
from pg_subscription_rel a
left join pg_subscription b on a.srsubid = b.oid
left join pg_class c on a.srrelid = c.oid
where c.relname ='table_name';
RESULT:
"source2" "table_name" "ready (normal replication)"
"source1" "table_name" "synchronized"
REPLICA IDENTITY for tables in source and target = INDEX
INDEX in DWH and source db's are the same: "CREATE UNIQUE INDEX table_name_idx ON public.table_name USING btree (id, country)"
Also altered table: alter table table_name replica identity using index table_name_idx;
I guess the DB links work correctly, as other tables from both sources are replicated correctly.
PROBLEM: data in the DWH from source1 is synchronized only once - on ALTER SUBSCRIPTION source1 REFRESH PUBLICATION - and then never again.
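A few diagnostic queries that may help narrow this down (a sketch; it assumes you can query both the source1 publisher and the subscriber):
-- on the source1 publisher: is the table actually part of the publication?
SELECT * FROM pg_publication_tables WHERE tablename = 'table_name';
-- on the publisher: state of the replication slots used by the subscriptions
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
-- on the subscriber (DWH): are the apply/tablesync workers running and advancing?
SELECT subname, pid, relid, received_lsn, latest_end_lsn FROM pg_stat_subscription;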

PostgreSQL Database size is not equal to sum of size of all tables

I am using an AWS RDS PostgreSQL instance. I am using the query below to get the size of all databases.
SELECT datname, pg_size_pretty(pg_database_size(datname))
from pg_database
order by pg_database_size(datname) desc
One database's size is 23 GB, but when I ran the query below to get the sum of the sizes of all individual tables in that database, it came to around 8 GB.
select pg_size_pretty(sum(pg_total_relation_size(table_schema || '.' || table_name)))
from information_schema.tables
As it is an AWS RDS instance, I don't have rights on the pg_toast schema.
How can I find out which database objects are consuming the space?
Thanks in advance.
The documentation says:
pg_total_relation_size ( regclass ) → bigint
Computes the total disk space used by the specified table, including all indexes and TOAST data. The result is equivalent to pg_table_size + pg_indexes_size.
So TOAST tables are covered, and so are indexes.
One simple explanation could be that you are connected to a different database than the one that is shown to be 23GB in size.
Another likely explanation would be materialized views, which consume space, but do not show up in information_schema.tables.
Yet another explanation could be that there have been crashes that left some garbage files behind, for example after an out-of-space condition during the rewrite of a table or index.
This is of course harder to debug on a hosted platform, where you don't have shell access...
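To test the materialized view explanation, one option is to sum relation sizes straight from pg_class (which includes materialized views; indexes and TOAST are folded in by pg_total_relation_size) and compare that with pg_database_size. A minimal sketch:
-- run while connected to the 23 GB database
SELECT pg_size_pretty(sum(pg_total_relation_size(oid)))
FROM pg_class
WHERE relkind IN ('r', 'm');  -- ordinary tables and materialized views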

How to avoid skewing in redshift for Big Tables?

I want to load a table of more than 1 TB in size from S3 into Redshift.
I cannot use DISTSTYLE ALL because it is a big table.
I cannot use DISTSTYLE EVEN because I want to use this table in joins, which then cause performance issues.
The columns in my table are:
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our Redshift cluster has 20 nodes.
So I tried a distribution key on workday, but the table is badly skewed.
There are 7 unique workdays and 24 unique work hours.
How do I avoid the skew in such cases?
How do we avoid skewing of the table when the row counts per unique key value are uneven (let's say hour1 has 1 million rows, hour2 has 1.5 million rows, hour3 has 2 million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either a SORTKEY or a COMPOUND SORTKEY. The sort key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised to choose a column that causes the least amount of skew as the DISTKEY. A column which has many distinct values, such as a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
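As a rough sketch of such a recreation (hypothetical table name; the column list is taken from the question), distributing on the high-cardinality id column rather than workday:
-- distribute on id and sort on the columns most often used in filters/joins
CREATE TABLE work_facts (
    id              INTEGER,
    name            VARCHAR(10),
    another_id      INTEGER,
    workday         INTEGER,
    workhour        INTEGER,
    worktime_number INTEGER
)
DISTSTYLE KEY
DISTKEY (id)
COMPOUND SORTKEY (workday, workhour);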
Here is the architecture that I recommend:
1) Load to a staging table with DISTSTYLE EVEN and a sort key matching how your loaded S3 data is already sorted - this means you will not have to vacuum the staging table.
2) Set up a production table with the sort/dist you need for your queries. After each copy from S3, load that new data into the production table and vacuum.
3) You may wish to have 2 mirror production tables and flip-flop between them using a late-binding view.
It's a bit complex to do this; you may need some professional help. There may be specifics to your use case.
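A rough sketch of steps 1) and 2) (the table names, S3 path, IAM role, and file format are all placeholders):
-- 1) load into the staging table straight from S3
COPY staging_work_facts
FROM 's3://my-bucket/work-facts/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS PARQUET;

-- 2) move the new rows into the production table, then clean up
INSERT INTO work_facts SELECT * FROM staging_work_facts;
TRUNCATE staging_work_facts;
VACUUM work_facts;
ANALYZE work_facts;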
As of writing this (just after re:Invent 2018), Redshift has automatic distribution available, which is a good starting point.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in the answers posted earlier, try a few combinations by replicating the same table with different DIST keys, if you don't like what automatic distribution is doing. After the tables are created, run the admin utility from the git repo (preferably create a view from the SQL script in the Redshift DB).
Also, if you have good clarity on your query usage patterns, you can use the following SQLs to check how well the sort keys are performing.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
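Another check that pairs well with the above is the row skew reported by svv_table_info (skew_rows is the ratio of rows on the slice with the most rows to rows on the slice with the fewest):
--check row skew per table, worst first
SELECT "schema", "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;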
Of course, there is always room for improvement in the SQLs above, depending on the specific stats that you may want to look at or drill down into.
Hope this helps.