Audit tables on Redshift

Is there a way to get statistics on a table in Redshift, like the way we can get them on a dataframe in Python by using df.describe(), as follows:
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
|summary|           col1            |      col2       |col3                  |col4                |col5           |
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
|  count|                      26716|              869|                 26716|               26716|          26716|
|   mean|                        0.0|          49409.0|                  null|                null|           null|
| stddev|                        0.0|24096.28685088223|                  null|                null|           null|
|    min|                          0|             7745|  pqr                 |xyz                 |abcd           |
|    max|                          0|            91073|  pqr                 |xyz                 |abcd           |
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
I have a use case to run statistics like the above on tables in Redshift on a regular basis. I can get column names and data types for a table from PG_TABLE_DEF, and I am looking to run Redshift built-in aggregate functions such as MIN(), MAX(), COUNT(), AVG(), etc. over the columns identified from the table. Not sure if this is the right approach, but if there is a better approach, please share your thoughts.
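A minimal sketch of that approach, assuming a hypothetical table public.my_table with a numeric column col1 and a text column col3 (the table and column names are placeholders):

-- Step 1: look up the columns and types for the target table
SELECT "column", type
FROM pg_table_def
WHERE schemaname = 'public'
  AND tablename = 'my_table';

-- Step 2: aggregate each column; AVG()/STDDEV() only apply to numeric columns
SELECT
    COUNT(col1)  AS col1_count,
    AVG(col1)    AS col1_mean,
    STDDEV(col1) AS col1_stddev,
    MIN(col1)    AS col1_min,
    MAX(col1)    AS col1_max,
    COUNT(col3)  AS col3_count,
    MIN(col3)    AS col3_min,
    MAX(col3)    AS col3_max
FROM public.my_table;

Note that PG_TABLE_DEF only returns tables in schemas that are on your search_path, so you may need to SET search_path accordingly before step 1.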

Related

How to get all external tables in Db2 using system tables information

I need to get a complete list of external tables in a Db2 database. I have a schema called DB2INST1. How can I get a complete list of external tables information using the system tables?
For Db2-LUW databases that support external tables, the information lives in SYSCAT.TABLES: such tables have the value 'Y' at position 27 of their PROPERTY column.
This example query returns the fully qualified names of the external tables:
select trim(tabschema)||'.'||rtrim(tabname)
from syscat.tables
where substr(property,27,1)='Y'
with ur;
In general, the best and most reliable way to retrieve all information and to recreate the DDL statements for tables is to use the db2look tool. If you want to extract the metadata on your own, there are some catalog views to start with:
SYSCAT.TABLES holds the table information. Look for the PROPERTY column and check there if it is an external table.
SYSCAT.COLUMNS has the basic column information. But there are more related tables depending on types and attributes.
SYSCAT.EXTERNALTABLEOPTIONS shows the actual options for an external table, i.e., the settings that distinguish it from a regular table (see the sketch after this list).
There are many more tables to hold table properties, depending on the complexity of the table and column definitions.
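For example, a hedged sketch that pulls the recorded options for a single external table; the table name MY_EXT_TABLE is a placeholder, and SELECT * is used so as not to assume the exact column layout of that catalog view:

-- List the external-table options recorded for one table
SELECT *
FROM syscat.externaltableoptions
WHERE tabschema = 'DB2INST1'
  AND tabname = 'MY_EXT_TABLE';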

Delete tables in batches (PySpark)

I have a database that has many tables in it. I want to drop all tables in that database that have "oct" in the name in a batch. Is there a way to do this? I can't find a clear answer online and I don't want to make a mistake and delete tables I shouldn't. Thanks for any help in advance!
I assume you are talking about Hive tables and that the metastore is configured. Then you can use spark.sql to achieve it with the usual SQL commands: list the tables using LIKE (with pattern matching), iterate over the resulting DataFrame, and drop them one by one.
# Pick all tables in the 'agg' schema whose names contain 'customer'.
# Usual pattern matching applies (in your case, the pattern would be '*oct*').
df = spark.sql("show tables in agg like '*customer*'")

# Iterate over the DataFrame of matching tables and drop them one by one.
for row in df.collect():
    print(f'Dropping table {row.tableName}')
    spark.sql(f'drop table agg.{row.tableName}')

distkey and sortkey on temporary tables - Redshift

I am starting to do some research on query tuning, and have been experimenting with using distkey and sortkey. From what I've read if I set the distkey to the joining column, the query planner will use a merge join instead of a hash join, which should be faster in Redshift. I was wondering if this also applies to temporary tables? Our production tables are actually views, so they do not have any keys already set. I'm not sure why we don't use the actual warehouse tables.
Yes, keys can be set for temporary tables:
create temp table fred DISTKEY (1) as ...
This is easily done with the column position - the first column in this example. You can also set the distribution style on temp tables if you so desire. Doing this can force data to stay "on node" for intermediate results in very large and complex queries. Redshift does a good job of making reasonable decisions on how to distribute intermediate results, but it isn't perfect and doesn't understand the nature of the data. I've done this with good results when large intermediate data sets are in play.
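A hedged sketch of both variants; the table and column names (stage_orders, customer_id, small_dim, etc.) are placeholders:

-- Distribute the temp table by a named join column
create temp table stage_orders distkey(customer_id) as
select customer_id, order_id, amount
from orders;

-- Or set a distribution style instead, e.g. ALL for a small dimension
create temp table small_dim diststyle all as
select dim_id, dim_name
from dim_table;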
As to your second point about using views instead of tables - in Redshift, standard views are basically SQL macros that are flattened and optimized by the Redshift query compiler, so the use of views instead of tables is not bad in itself. Use of views, especially complex ones, can hide what is being done by the query, and this can add unneeded and unexpected complexity to the query. The keys are set in the tables referenced by the views. (I'm assuming that the views are not referencing external/Spectrum tables.)
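For illustration, a hedged sketch of that arrangement, with all table, column, and view names as placeholders: the distribution and sort keys live on the base table, while the view remains a simple macro over it.

-- Keys are defined on the base table
create table fact_sales (
    sale_id     bigint,
    customer_id bigint,
    amount      decimal(12,2)
)
distkey(customer_id)
sortkey(customer_id);

-- The view carries no keys of its own; it inherits the base table's layout
create view v_sales as
select customer_id, sum(amount) as total_amount
from fact_sales
group by customer_id;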
Lastly, you state you are looking to achieve merge join behavior to improve performance. While it is true that this is the fastest type of join, in my experience the time and work required to get merge joins to happen on temp tables is not offset by the performance gain. Redshift will only use a merge join when it is sure that the data being joined will "zipper" together without issue; if it isn't completely sure this is the case, it has to perform a hash join, which is a more general process. To get Redshift to do the merge join you will need to sort and analyze your temp tables, which will cost much more time than the savings you will get. It is far more important to have your joins be "DIST NONE" - no network distribution of the data - than to move from a hash join to a merge join.
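For reference, a hedged sketch of what "sort and analyze" would look like for two temp tables joined on id (all names are placeholders); both sides need to be sorted on the join key and have fresh statistics before a merge join is even considered:

-- Sort both temp tables on the join key, then refresh statistics
create temp table t1 distkey(id) sortkey(id) as
select id, val from big_table_a;

create temp table t2 distkey(id) sortkey(id) as
select id, other_val from big_table_b;

analyze t1;
analyze t2;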
Yes, it can be done. Just put the distkey clause before the start of the table query:
create temp table a distkey(column_name) as
(select query .....)

Find unused columns in Redshift

I'm trying to find columns that are not queried in Redshift tables, in order to drop them.
I tried doing this by analyzing the stl_query table, but that doesn't provide very accurate results (because of views).
Is there some kind of utility that can help with that?
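One hedged starting point along the stl_query lines, not a complete solution: STL_QUERY keeps the (truncated) query text, so you can at least count how often a column name shows up in recent queries. The column name my_column is a placeholder, and this will miss usage that goes through views or through SELECT *:

-- Count recent queries whose text mentions the column
select count(*) as queries_mentioning_column
from stl_query
where querytxt ilike '%my_column%'
  and starttime > dateadd(day, -30, getdate());

Also note that the STL system tables only retain a few days of log history, so you would need to snapshot these counts regularly to build up a reliable picture.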

select all columns except two in q kdb historical database

In the output, I want to select all columns except two from a table in a q/kdb+ historical database.
I tried running the below query, but it does not work on the HDB:
delete colid,coltime from table where date=.z.d-1
but it is failing with the below error:
ERROR: 'par
(trying to update a physically partitioned table)
I referred to https://code.kx.com/wiki/Cookbook/ProgrammingIdioms#How_do_I_select_all_the_columns_of_a_table_except_one.3F but it was no help.
How can we display all columns except for two in a kdb+ historical database?
The reason you are getting the par error is that it is a partitioned table.
The error is documented here:
trying to update a partitioned table
You cannot directly update or delete anything on a partitioned table (there is a separate db maintenance script for that).
The query you used as a fix works because it first selects the data into memory (temporarily) and then deletes the columns:
delete colid,coltime from select from table where date=.z.d-1
You can try the following functional form:
c:cols[t] except `colid`coltime
?[t;enlist(=;`date;.z.d-1);0b;c!c]
Could try a functional select:
?[table;enlist(=;`date;.z.d);0b;{x!x}cols[table]except`colid`coltime]
Here the last argument is a dictionary mapping result column names to the columns to extract, which tells the query what to return. Instead of deleting the columns you specified, this selects all but those two, which is more or less the same query.
To see what the functional form of a query is you can run something like:
parse"select colid,coltime from table where date=.z.d"
And it will output the arguments to the functional select.
You can read more on functional selects at code.kx.com.
Only select queries work on partitioned tables; you resolved this by structuring your query so that you first selected the table into memory and then deleted the columns you did not want.
If you have a large number of columns and don't want to create a bulky select query you could use a functional select.
?[table;();0b;{x!x}((cols table) except `colid`coltime)]
This shows all columns except the given subset. The column clause expects a dictionary, hence the function {x!x} is used to convert the column list to a dictionary. See more information here:
https://code.kx.com/q/ref/funsql/
As nyi mentioned, if you want to permanently delete columns from a historical database you can use the deletecol function in the dbmaint tools: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md