Databricks: Cache Select on Temp Table - pyspark

How might I cache a temp table?
The documentation suggests it is possible: https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-cache.html
Consider the following on DBR 10.5 and Spark 3.2.1:
%python
df.createOrReplaceTempView("changeset")

CACHE SELECT * from changeset

This fails with:
CACHE supports only SELECT queries with optional WHERE clause, e.g. CACHE SELECT <columns> FROM <table> [ WHERE <predicate> ]
Edit
I have found that if you transform your data in any way before creating the view, for example:
spark.sql("select distinct * from table")  # CACHE SELECT will fail
# instead of
spark.sql("select * from table")  # CACHE SELECT will work
your attempts to CACHE SELECT will fail.

Despite that documentation having "latest" in its URL, there is other documentation showing that the query should instead be structured like this, which works:
CACHE TABLE testCache OPTIONS ('storageLevel' 'DISK_ONLY') SELECT * FROM testData;
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-aux-cache-cache-table.html
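Putting it together with the question's temp view, a minimal sketch (the cache name changeset_cached and the DISK_ONLY storage level are illustrative assumptions, not required choices):

-- Cache the contents of the temp view under a named cache table.
CACHE TABLE changeset_cached OPTIONS ('storageLevel' 'DISK_ONLY') SELECT * FROM changeset;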

Pivot function without manually typing values in `for in`?

The documentation provides an example of using the pivot() function.
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
    AVG(price) FOR partname IN ('prop', 'rudder', 'wing')
);
I would like to use pivot() without having to manually specify each value of partname. I want all parts. I tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
    AVG(price) FOR partname
);
That gave an error. Then tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
    AVG(price) FOR partname IN (SELECT DISTINCT partname FROM part)
);
That also threw an error.
How can I tell Redshift to include all values of partname in the pivot?
I don't think this can be done in a single query: the query compiler would have to plan the statement without knowing how many output columns it will produce, and Redshift cannot do that.
You can do this with multiple queries: use one query to build the list of partnames, then use that list to generate a second query that populates the IN clause. Something has to issue the first query and generate the second; that can be code external to Redshift (lots of options) or a stored procedure in Redshift. Wherever this code lives, it should be aware that Redshift limits tables to 1,600 columns.
The Redshift docs are fairly good on dynamic SQL for stored procedures; the EXECUTE statement is what fires off the generated query. See: https://docs.aws.amazon.com/redshift/latest/dg/c_PLpgSQL-statements.html
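A hedged sketch of that stored-procedure approach (the procedure name pivot_all_parts and the result table part_pivot are illustrative assumptions; the table and column names come from the question):

CREATE OR REPLACE PROCEDURE pivot_all_parts()
AS $$
DECLARE
    in_list VARCHAR(MAX);
BEGIN
    -- Build a quoted, comma-separated IN list from the distinct partname values.
    -- Note: Redshift caps tables at 1,600 columns, so the distinct count must fit.
    SELECT LISTAGG(DISTINCT QUOTE_LITERAL(partname), ', ')
      INTO in_list
      FROM part;
    -- Generate the PIVOT query and materialize its result.
    EXECUTE 'DROP TABLE IF EXISTS part_pivot';
    EXECUTE 'CREATE TABLE part_pivot AS '
         || 'SELECT * FROM (SELECT partname, price FROM part) '
         || 'PIVOT (AVG(price) FOR partname IN (' || in_list || '))';
END;
$$ LANGUAGE plpgsql;

CALL pivot_all_parts();
SELECT * FROM part_pivot;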

postgresql: check if there are any rows at all before using LIMIT 1

I have a huge table of data, and I am looking for something which may or may not exist yet.
So when I do
SELECT *
FROM tablename
WHERE col='test'
I get
No rows found.
Total runtime: 70.147 ms
SQL executed.
Whereas when I try
SELECT *
FROM tablename
WHERE col='test'
ORDER BY id DESC
LIMIT 1;
I see it hang, maybe because the table is too big.
How can I avoid this situation?
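For the existence check the title asks about, a minimal sketch using the question's table and column names: EXISTS lets the planner stop at the first matching row instead of scanning further.

-- Returns true/false; can stop as soon as one matching row is found.
SELECT EXISTS (
    SELECT 1 FROM tablename WHERE col = 'test'
);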

Aginity Netezza macro containing a list

I would like to put a list of names in my Aginity Netezza macro. For instance, I would like to be able to repeatedly use the list ("Adam", "Bill", "Cynthia", "Dick", "Ella", "Fanny") in my future queries, e.g. in WHERE clauses.
My questions are:
(1) Is there a limit to how many characters I can put inside the "Value" window of the Query Parameters Editor?
(2) Is there a way to make this work without using a macro? For instance, predefining this list somewhere?
I would put the list into a (temporary) table, and simply join to it when necessary:
CREATE TEMP TABLE names AS
SELECT 'Adam'::varchar(50)
UNION ALL SELECT 'Bill'::varchar(50)
UNION ALL SELECT 'Cynthia'::varchar(50)
UNION ALL SELECT 'Dick'::varchar(50)
UNION ALL SELECT 'Ella'::varchar(50)
UNION ALL SELECT 'Fanny'::varchar(50)
;
SELECT x.a, x.b
FROM x
WHERE x.name IN (SELECT * FROM names)
;
SELECT
    CASE
        WHEN x.name IN (SELECT * FROM names)
        THEN 'Special'
        ELSE 'Other'
    END AS NameGrp,
    COUNT(*) AS size,
    SUM(income) AS TotalIncome
FROM x
GROUP BY NameGrp
ORDER BY size DESC
;
Alternatively, Netezza has an extension toolkit that enables ARRAY data types, but the first query in particular will not perform well if you use arrays for that purpose. Interested? See here: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.sqltk.doc/c_sqlext_array.html or google for examples.

Select physically last record without ORDER BY

An application I inherited relies on what you might call the "natural record flow" of a PostgreSQL table, with Delphi code like this:
query.Open('SELECT * FROM TheTable');
query.Last();
The task is to get all the fields of the last table record, so I decided to rewrite this query in a more efficient way, something like this:
SELECT * FROM TheTable ORDER BY ReportDate DESC LIMIT 1
but it broke the whole workflow: some ReportDate values turned out to be NULL, and the application really did depend on the "natural" order of records in the table.
How can I select the physically last record efficiently, without ORDER BY?
To select the physically last record, use ctid, the tuple ID; to get the last one, just select max(ctid). Something like:
t=# select ctid,* from t order by ctid desc limit 1;
ctid | t
--------+-------------------------------
(5,50) | 2017-06-13 11:41:04.894666+00
(1 row)
and to do it without order by:
t=# select t from t where ctid = (select max(ctid) from t);
t
-------------------------------
2017-06-13 11:41:04.894666+00
(1 row)
It's worth knowing that ctid can only be located by a sequential scan, so fetching the physically last row will be costly on large data sets.
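That cost is easy to confirm with EXPLAIN; a minimal sketch against the same table t as in the session above:

-- The inner max(ctid) forces a Seq Scan over t (ctid is not indexed);
-- the outer lookup by ctid is then a cheap Tid Scan.
EXPLAIN SELECT t FROM t WHERE ctid = (SELECT max(ctid) FROM t);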

Can a database table partition name be used as a part of WHERE clause for IBM DB2 9.7 SELECT statement?

I am trying to select all data out of the same specific table partition for 100+ tables using the DB2 EXPORT utility. The partition name is constant across all of my partitioned tables, which makes this method preferable to other approaches.
I cannot detach the partitions as they are in a production environment.
In order to script this for semi-automation, I need to be able to run a query like:
SELECT * FROM MYTABLE
WHERE PARTITION_NAME = MYPARTITION;
I am not able to find the correct syntax for utilizing this type of logic in my SELECT statement passed to the EXPORT utility.
You can do something like this by looking up the partition number first:
SELECT SEQNO
FROM SYSCAT.DATAPARTITIONS
WHERE TABNAME = 'YOURTABLE' AND DATAPARTITIONNAME = 'WHATEVER'
then using the SEQNO value in the query:
SELECT * FROM MYTABLE
WHERE DATAPARTITIONNUM(anycolumn) = <SEQNO value>
Edit:
Since it does not matter what column you reference in DATAPARTITIONNUM(), and since each table is guaranteed to have at least one column, you can automatically generate queries by joining SYSCAT.DATAPARTITIONS and SYSCAT.COLUMNS:
select
    'select * from', p.tabname,
    'where datapartitionnum(', colname, ') = ', seqno
from syscat.datapartitions p
inner join syscat.columns c
    on p.tabschema = c.tabschema and p.tabname = c.tabname
where colno = 1
    and datapartitionname = '<your partition name>'
    and p.tabname in (<your table list>)
However, building a dependency on database metadata into your application is, in my view, not very reliable. You can instead simply specify the appropriate partitioning key range to extract the data, which will be just as efficient.
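A hedged sketch of that alternative, filtering on the partition's key range rather than its name (the column part_key and the date bounds are illustrative assumptions; the real bounds come from the table's PARTITION BY clause, or from LOWVALUE/HIGHVALUE in SYSCAT.DATAPARTITIONS):

SELECT *
FROM MYTABLE
WHERE part_key >= DATE('2017-01-01')   -- hypothetical lower bound of the partition
  AND part_key <  DATE('2017-02-01');  -- hypothetical upper bound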