Delete tables in batches (PySpark)

I have a database that has many tables in it. I want to drop all tables in that database that have "oct" in the name in a batch. Is there a way to do this? I can't find a clear answer online and I don't want to make a mistake and delete tables I shouldn't. Thanks for any help in advance!

I assume you are talking about Hive tables, for simplicity, and that the metastore is configured. In that case you can use spark.sql to do this with the usual SQL commands: list the tables using LIKE with a pattern, iterate over the resulting DataFrame, and drop them one by one.
# Pick all tables in the 'agg' schema whose names contain the word 'customer'
# (usual pattern matching; in your case the pattern would be '*oct*').
df = spark.sql("show tables in agg like '*customer*'")
# Iterate over the DataFrame of matching tables and drop them one by one.
for row in df.collect():
    print(f'Dropping table {row.tableName}')
    spark.sql(f'drop table agg.{row.tableName}')
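If you are worried about dropping the wrong tables, a simple safeguard (a small sketch; the agg schema and the '*oct*' pattern are placeholders for your own database and pattern) is to collect the matching names first, inspect the list, and only then issue the DROP statements:
# Dry run: collect the matching table names and review them before dropping.
tables = [row.tableName for row in spark.sql("show tables in agg like '*oct*'").collect()]
print(tables)  # inspect this list manually before proceeding
# Once the list looks right, drop the tables one by one.
for name in tables:
    spark.sql(f"drop table agg.{name}")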

Related

Querying a parent table's data (PostgreSQL)

I have a parent table (parent_table) and a few child tables that inherit from it (child_one_table and child_two_table).
I want to query data using only the columns that belong to the parent table (the data itself is inserted into the child tables). When I run EXPLAIN on my query I see sequential scans on all the tables (parent_table, child_one_table and child_two_table).
Is there a more efficient way to do this? When I try to query parent_table using ONLY I get back 0 results, since the data was inserted into the child tables.
Thanks in advance!
Thanks for clarifying your question!
When I query the data with SELECT * FROM parent_table WHERE name = 'joe'; I see that it actually goes over all 3 tables to answer the query.
This seems to be standard behaviour for PostgreSQL. According to the Caveats section (5.10.1) of https://www.postgresql.org/docs/current/ddl-inherit.html:
Note that not all SQL commands are able to work on inheritance hierarchies. Commands that are used for data querying, data modification, or schema modification (e.g., SELECT, UPDATE, DELETE, most variants of ALTER TABLE, but not INSERT or ALTER TABLE ... RENAME) typically default to including child tables and support the ONLY notation to exclude them.
So what you really want to do is explore the use of ONLY to specify the scope of your query.
I hope this can help!
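To make the difference concrete, here is a minimal sketch (using psycopg2 against a placeholder connection string; parent_table and the name filter are taken from the question) that runs the same query with and without ONLY:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn.cursor() as cur:
    # Default behaviour: the parent table and all child tables are scanned.
    cur.execute("SELECT * FROM parent_table WHERE name = %s", ("joe",))
    print(len(cur.fetchall()), "rows including child tables")

    # ONLY restricts the query to rows stored in parent_table itself,
    # which is why it returns 0 rows when the data lives in the children.
    cur.execute("SELECT * FROM ONLY parent_table WHERE name = %s", ("joe",))
    print(len(cur.fetchall()), "rows from the parent table alone")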

Adding a table to HDB by using dbmaint function

I would like to backfill a table to all dates in the HDB, but the table has around 100 columns. What's the fastest way to backfill using the existing table?
I tried to get the schema from the current table and use that schema to backfill, but it doesn't work.
This is what I tried:
oldTable:0#newTable;
addtable[dbdir;`table;oldTable]
but this doesn't work. Is there a good way to do this?
Does the table exist within the latest date partition of the HDB?
If so, .Q.chk will add tables to the partitions in which they are missing.
https://code.kx.com/q/ref/dotq/#qchk-fill-hdb
And with regard to addtable, what specific error are you getting when you try the above?

select all columns except two in q kdb historical database

In the output I want to select all columns except two from a table in a q/kdb+ historical database.
I tried running the query below, but it does not work on an HDB.
delete colid,coltime from table where date=.z.d-1
It fails with the below error:
ERROR: 'par
(trying to update a physically partitioned table)
I referred to https://code.kx.com/wiki/Cookbook/ProgrammingIdioms#How_do_I_select_all_the_columns_of_a_table_except_one.3F but it didn't help.
How can we display all columns except for two in kdb historical database?
The reason you are getting the 'par error is that the table is partitioned.
The error is documented here
trying to update a partitioned table
You cannot directly update or delete anything on a partitioned table (there is a separate DB maintenance script for that).
The query you used as a fix works because it first selects the data into memory (temporarily) and then deletes the columns from that in-memory result:
delete colid,coltime from select from table where date=.z.d-1
You can try the following functional form:
c:cols[t] except `p
?[t;enlist(=;`date;2015.01.01);0b;c!c]
You could try a functional select:
?[table;enlist(=;`date;.z.d);0b;{x!x}cols[table]except`colid`coltime]
Here the last argument is a dictionary mapping column names to the expressions to extract, which tells the query which columns to return. Instead of deleting the two columns you specified, this selects all but those two, which amounts to more or less the same query.
To see what the functional form of a query is, you can run something like:
parse"select colid,coltime from table where date=.z.d"
And it will output the arguments to the functional select.
You can read more on functional selects at code.kx.com.
Only select queries work on partitioned tables, which is why your fix works: you first select the table into memory and then delete the columns you do not want.
If you have a large number of columns and don't want to create a bulky select query you could use a functional select.
?[table;();0b;{x!x}((cols table) except `colid`coltime)]
This shows all columns except the specified subset. The column clause expects a dictionary, hence the function {x!x} is used to convert the list of column names into a dictionary. See more information here:
https://code.kx.com/q/ref/funsql/
As nyi mentioned, if you want to permanently delete columns from a historical database you can use the deleteCol function in the dbmaint tools: https://github.com/KxSystems/kdb/blob/master/utils/dbmaint.md

Audit tables on Redshift

Is there a way to get statistics on a table in Redshift, like the summary we can get on a DataFrame in Python by using df.describe(), as follows:
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
|summary|           col 1           |      col2       |col3                  |           col4     |col5           |
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
|  count|                      26716|              869|                 26716|               26716|          26716|
|   mean|                        0.0|          49409.0|                  null|                null|           null|
| stddev|                        0.0|24096.28685088223|                  null|                null|           null|
|    min|                          0|             7745|  pqr                 |xyz                 |abcd           |
|    max|                          0|            91073|  pqr                 |xyz                 |abcd           |
+-------+---------------------------+-----------------+----------------------+--------------------+---------------+
I have a use case to run statistics like the above on tables in Redshift on a regular basis. I can get the column names and data types for a table from PG_TABLE_DEF, and I am looking to run Redshift's built-in functions such as min(), max(), count(), avg(), etc. over the columns identified from the table. Not sure if this is the right approach; if there is a better approach, please share your thoughts.
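One way to approach this, sketched below under a few assumptions (psycopg2 or any other Postgres-compatible driver for the connection, placeholder cluster details and credentials, aliases derived from the column names), is exactly what the question describes: read the column list from PG_TABLE_DEF and generate the aggregate query dynamically, running count/min/max on every column and avg/stddev only on numeric ones.
import psycopg2

# Placeholder connection details; Redshift can be queried with any
# Postgres-compatible driver such as psycopg2.
conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="mydb", user="me", password="...")

NUMERIC_TYPES = ("smallint", "integer", "bigint", "numeric", "decimal",
                 "real", "double precision")

def describe_table(schema, table):
    with conn.cursor() as cur:
        # Column names and types come from PG_TABLE_DEF (note: it only
        # lists schemas that are on the current search_path).
        cur.execute('SELECT "column", type FROM pg_table_def '
                    'WHERE schemaname = %s AND tablename = %s',
                    (schema, table))
        exprs = []
        for name, ctype in cur.fetchall():
            exprs.append(f'count("{name}") AS "count_{name}"')
            exprs.append(f'min("{name}") AS "min_{name}"')
            exprs.append(f'max("{name}") AS "max_{name}"')
            if ctype.startswith(NUMERIC_TYPES):  # numeric columns only
                exprs.append(f'avg("{name}") AS "mean_{name}"')
                exprs.append(f'stddev("{name}") AS "stddev_{name}"')
        cur.execute(f'SELECT {", ".join(exprs)} FROM "{schema}"."{table}"')
        return dict(zip([d[0] for d in cur.description], cur.fetchone()))
This produces one wide row of statistics per table, so running it on a regular basis is just a matter of looping over the tables you care about.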

Querying across multiple tables with identical schemas

I'm trying to run the same query over multiple tables in my Postgres database, that all have the same schema.
This question, Select from multiple tables without a join?, shows that this is possible; however, the set of tables is hard-coded there.
I have another query that returns the five specific tables I would like my main query to run on. How can I go about using the result of this with the UNION approach?
In short, I want my query to see the five specific tables (determined by the outcome of another query) as one large table when it runs the query.
I understand that in many cases similar to my scenario you'd simply want to merge the tables. I cannot do this.
One way of doing this that may satisfy your constraints is table inheritance. In short, you will need to create a parent table with the same schema, and for each child you want to query run ALTER TABLE that_table INHERIT parent_table. Any query against the parent table will then also query all of the child tables. If you need to query different tables in different circumstances, I think the best way would be to add a column named type or some such, and filter on certain values of that column.
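A minimal sketch of that setup (all table and column names below are hypothetical, and the same DDL could be run directly in psql): create the parent with the shared schema, attach each existing table with ALTER TABLE ... INHERIT, and then query the parent as if it were one large table.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Parent table defining only the shared schema; it stores no rows itself.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS all_measurements (
            id       bigint,
            recorded timestamptz,
            value    double precision
        )
    """)
    # Each child must already contain all of the parent's columns
    # (same names and types) before it can be attached.
    for child in ("measurements_a", "measurements_b"):  # hypothetical tables
        cur.execute(f"ALTER TABLE {child} INHERIT all_measurements")

    # A query against the parent now also scans every attached child;
    # use SELECT ... FROM ONLY all_measurements to exclude them again.
    cur.execute("SELECT count(*) FROM all_measurements")
    print(cur.fetchone()[0])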