How can I get the total run time of a query in Redshift, with a query? - amazon-redshift

I'm in the process of benchmarking some queries in Redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need.
The GUI dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with SQL selects?

To add to Alex's answer: the stl_query table has the inconvenience that if the query waited in a queue before running, the queue time is included in its run time, so that run time is not a very good indicator of the query's performance.
To get the actual execution time of the query, look at total_exec_time in stl_wlm_query.
select total_exec_time
from stl_wlm_query
where query = <query_id>  -- substitute the query id you found in stl_query
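To see how much of the wall-clock time was spent queueing versus executing, you can join stl_query to stl_wlm_query; a minimal sketch (total_queue_time and total_exec_time are reported in microseconds):
select q.query,
       datediff(milliseconds, q.starttime, q.endtime) as wall_clock_ms,
       w.total_queue_time / 1000.0 as queue_ms,
       w.total_exec_time / 1000.0 as exec_ms
from stl_query q
join stl_wlm_query w on w.query = q.query
where q.query = <query_id>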

There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of those scripts stripped down to give you query run times in milliseconds. Play with the filters, ordering, etc. to show the results you are looking for:
select userid, label, stl_query.query, trim(database) as database, trim(querytxt) as qrytext, starttime, endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds, aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (select query, trim(split_part(event, ':', 1)) as event
                 from stl_alert_event_log group by query, trim(split_part(event, ':', 1))) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%')
-- and database = ''
order by starttime desc
limit 100

Related

When to use a direct SQL query vs. type-safe Slick

I am not sure when Slick runs the SQL query and returns the result set. I want to run different queries, but my table is very big. The SQL command itself takes a few seconds to run. I am afraid that if I use Slick to return the entire table and then do the selection or group by, it will consume a lot of memory and make everything slow.
Here is one example to get the count of customers per day:
sql"""
select timestamp::date, count(distinct id) from tenant GROUP BY timestamp::date ORDER BY timestamp::date asc;
""".as[(Date, Int)]
My table has multiple columns other than date and id. If I bring in the entire table and then do the group by and map in Scala, it takes a lot of memory.
If I use the above command, it is prone to errors.
Is there a way to return only id and date from the SQL database? And then what is the best way to do the group by?
I have tried the above SQL command but am not sure how to use Slick better.

Selecting 10000 records takes too long in PostgreSQL

My table contains 1 billion records and is partitioned by month. Id and datetime form the primary key for the table. When I run the following select:
select col1, col2, .., col8
from mytable t
inner join cte on t.Id = cte.id and dtime > '2020-01-01' and dtime < '2020-10-01'
It uses an index scan, but it takes more than 5 minutes to return the results. Any suggestions?
Note: I have set work_mem to 1GB. The cte results come back within 3 seconds.
Well, it's the nature of a join; it is usually a time-consuming operation.
First of all, I recommend using IN rather than JOIN where possible. Of course they have different meanings, but in some cases you can use them interchangeably. Check this question out.
Secondly, in terms of relational algebra, a join conceptually combines each row of mytable with each row of the second table and then discards the rows that do not satisfy the join condition, which can produce a huge intermediate result. All of those steps take time. Before using the join operation, it's better to filter your tables (for example, mytable by date) to make them smaller, and then join the reduced sets.
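For illustration, a sketch of those two suggestions, reusing the names from the question (cte, mytable, dtime); the column list is abbreviated as in the original:
-- Filter mytable by date first, then restrict by id with IN instead of a join
select col1, col2, .., col8        -- column list abbreviated as in the question
from mytable t
where t.dtime > '2020-01-01'
  and t.dtime < '2020-10-01'
  and t.id in (select id from cte)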

Caching various statistics in a special one-row table?

We are expecting hundreds of millions of rows and a write-heavy application.
We need to return SELECT COUNT(*) FROM orders and SELECT SUM(amount) FROM orders quite frequently, and both are too slow to be run on every request.
We are thinking about adding a special table called stats with just a single row. It would have total_orders and total_amount columns, which we would increment every time we add a new order. Is this kind of SQL "cache" table a practical solution? What does it mean in terms of write performance?
Another option is to use Memcached or Redis, but they can get out of sync and are not persistent. Any other ideas?
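For reference, a minimal sketch of the single-row stats table described above (PostgreSQL-style syntax assumed; the orders schema beyond the amount column is not from the question):
create table stats (
    id           int primary key check (id = 1),  -- enforce a single row
    total_orders bigint  not null default 0,
    total_amount numeric not null default 0
);
insert into stats (id) values (1);

-- in the same transaction that inserts an order
begin;
insert into orders (amount) values (42.00);
update stats
   set total_orders = total_orders + 1,
       total_amount = total_amount + 42.00
 where id = 1;
commit;
Note that every order insert now also updates the same row, so concurrent writers will serialize on that row's lock; that contention is the main write-performance cost of this approach.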

Execution Plan on a View looking at Partitioned Tables

I currently have tables that are partitioned out by year & month for our sales transactions. For example, we have sales tables that would look something like this:
factdailysales_201501
factdailysales_201502
factdailysales_201503 etc ...
Generally, I've always used dynamic SQL to capture a start date and an end date, figure out which partitions those fall into, and then loop through each of those partitions ... but it's starting to become such a hassle, and I've learned that this is probably not the best approach in terms of maintenance, troubleshooting, and performance.
I decided to build a view that UNION ALLs my sales partitions together. However, I don't want selecting from the view to scan all of the partitions on execution; that would defeat the whole purpose of partitioning the tables. Because of this, I added check constraints on the date column to each of my sales tables. That way, when I select from the view, it knows which tables to access instead of scanning every table.
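For illustration, the constrained tables and the UNION ALL view described above might look roughly like this (a sketch; the constraint name and the view's column list are assumptions, only Sales_Orig, [Date], and [retail] come from the post):
ALTER TABLE factdailysales_201503
ADD CONSTRAINT CK_factdailysales_201503_Date
CHECK ([Date] >= '20150301' AND [Date] < '20150401');
GO

CREATE VIEW Sales_Orig
AS
SELECT [Date], [retail] FROM factdailysales_201501
UNION ALL
SELECT [Date], [retail] FROM factdailysales_201502
UNION ALL
SELECT [Date], [retail] FROM factdailysales_201503;
GO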
Here are the examples below:
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= '2015-03-01'
This query's execution plan only pulls from the partitions that I need.
The problem I'm facing right now is that most of the time, when my team writes stored procedures, they will more than likely write their queries so that a date variable is passed into the WHERE clause.
DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
However, when a variable is passed in, the execution plan scans ALL of the partitions in the view, causing the query to take far longer than when I hard-coded the date.
I suppose I could use dynamic SQL again and insert the date string into the SELECT statement, but that would bring me right back to where I started: trying to get rid of dynamic SQL for this simple sales query in the first place.
So my question is: am I setting this up wrong? Am I on the right track? It seems that the view can't apply the variable against the check constraints and ends up scanning every table. Is there another approach anyone would recommend? Maybe my original solution of just looping through partitions via dynamic SQL is the best way to do it?
** EDIT **
http://sqlsunday.com/2014/08/31/partitioned-views/
This article is actually where I initially saw the idea! It seems when using that exact same solution, I'm still experiencing the same struggle!
Thanks!!
Okay, this might work. It's a table-valued function that only accesses tables according to your @Start and @End parameters, so it touches only the "partitions" it needs. I figured you could take this concept and write some dynamic SQL to create all the IF statements.
Now, of course, new tables are added every day, so how does that tie in? I think the best way would be to alter the function every day, adding the next sales table. That way querying it stays simple, and you could use the same dynamic SQL you used to create the function to alter it, which should be relatively straightforward.
Note: I added default values that are the min and max of the DATE data type. That way you can query something like everything from 20140101 onward, or vice versa.
Your tables
SELECT CAST('20150101' AS DATE) datesVal INTO factDailySales_20150101;
SELECT CAST('20150102' AS DATE) datesVal INTO factDailySales_20150102;
SELECT CAST('20150103' AS DATE) datesVal INTO factDailySales_20150103;
The Function
CREATE FUNCTION ufn_factTotalSales (@Start DATE = '17530101', @End DATE = '99991231')
RETURNS @factTotalSales TABLE
(
    datesVal DATE
)
AS
BEGIN
    IF(CAST('20150101' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150101
    END
    IF(CAST('20150102' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150102
    END
    IF(CAST('20150103' AS DATE) BETWEEN @Start AND @End)
    BEGIN
        INSERT INTO @factTotalSales
        SELECT datesVal
        FROM factDailySales_20150103
    END
    RETURN;
END
GO
All tables
SELECT *
FROM ufn_factTotalSales(default,default)
All tables greater than or equal to 20150102
SELECT *
FROM ufn_factTotalSales('20150102',default)
All tables less than or equal to 20150102
SELECT *
FROM ufn_factTotalSales(default,'20150102')
All tables between specific range
SELECT *
FROM ufn_factTotalSales('20150101','20150102')
Is this the ideal solution? No. The ideal would be to combine all the tables into one with good indexes. I know you said that wouldn't work because of the way other code has been written, but hear me out. Perhaps this is off the wall, but let's say you do combine the tables; obviously there are old scripts looking for the specific daily sales tables. Maybe you could create views with the dailySales names that select from factTotalSales, or you could create synonyms for factTotalSales that correspond to each factDailySales.
Maybe you could look into that. It wouldn't be easy, but I think letting SQL Server optimize your queries the way it was designed to is a better approach than forcing it with dynamic SQL.
Just my two cents. Hope this helps. At the very least, I hope it gives you some ideas.
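For example, a compatibility view with one of the old daily names might look like this (a sketch; it assumes the daily tables have been merged into a combined factTotalSales table that keeps the datesVal column from the example above, and that the original daily tables have been dropped):
CREATE VIEW factDailySales_20150101
AS
SELECT *
FROM factTotalSales
WHERE datesVal = '20150101';
GO
-- or, alternatively, a synonym per old name pointing at the combined table:
-- CREATE SYNONYM factDailySales_20150102 FOR dbo.factTotalSales;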
5 years later: OPTION (RECOMPILE).
The planner needs access to the constants to eliminate a table entirely from the query plan. With a variable, and without a forced recompile, a generic plan is used. (Related: parameter sniffing.)
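Applied to the example from the question, that looks like:
DECLARE @SD DATE = '2015-03-01'

SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
OPTION (RECOMPILE)  -- compile with the runtime value of @SD so the unneeded partitions are eliminated from the plan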
While this means the query plan is larger, as it has to include all the tables, it does not mean that all the tables are actually scanned: look at the IO stats, because table elimination happens at execution time even if the scans still show up in the query plan.
The 'Number of Executions' in the query plan will be 0 when the tables are not scanned. Unfortunately, these branches are still reported as "Table Scan" nodes with a non-zero percentage cost in the query plan and UI, which will look proportionally high if the query is trivially fast. The displayed percentage cost of these extra "Table Scan" nodes approaches zero as the amount of data returned from the base tables that are actually used increases.
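One way to verify this (a sketch, using the query from the question; the STATISTICS IO output appears on the Messages tab in SSMS):
SET STATISTICS IO ON

DECLARE @SD DATE = '2015-03-01'
SELECT SUM([retail])
FROM Sales_Orig
WHERE [Date] >= @SD
-- Base tables eliminated at execution time report zero (or no) logical reads here,
-- even though their scan operators still appear in the query plan.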
This same optimization/elimination occurs when the view is not a partitioned view (e.g. the base tables are missing the partition column in their primary key), as long as the underlying tables have a suitable check constraint on the filtered column. It also occurs when the view selects a constant value, not otherwise stored in the table, to establish the partition. With a constant in the query, or a recompiled plan, the tables are eliminated entirely; with a variable, the tables are still not actually scanned and are thus eliminated logically during query execution.
The use of a proper partitioned view is only really beneficial in that it allows direct INSERTs and UPDATEs, with the major caveat that it requires the partition column to be in each table's primary key and disallows the use of an identity column (which makes a partitioned view largely useless, IMHO). SQL Server handles the optimizations very similarly for the other quasi-partitioned-view cases.
(This is on SQL Server 2014; earlier versions might not have optimized the different patterns as efficiently.)

Amazon redshift query planner

I'm facing a situation with Amazon Redshift that I haven't been able to explain to myself yet. The query planner seems unable to handle the same table appearing in the subqueries of two derived tables in a join.
I have four tables, Source_A, Source_B, Target_1, and Target_2, and a query like
SELECT sa.a, sa.b, sb.c, sb.d FROM
(
    SELECT a, b FROM Source_A WHERE date > (SELECT max(date) FROM Target_1)
) AS sa
INNER JOIN
(
    SELECT c, d FROM Source_B WHERE date > (SELECT max(date) FROM Target_2)
) AS sb
ON sa.a = sb.c
The query works fine as long as Target_1 and Target_2 are different tables. If I change the query so that Target_2 = Target_1, something happens: the query starts to take about 10 times longer, and in the performance monitor I can see that all of this extra time is spent with only the leader node active.
When I EXPLAIN both variants I see practically no difference in the output; all the steps are the same. The difference is that EXPLAIN takes seconds for one, and almost half an hour for the other, where the Target tables are the same.
So, to summarise what I think I have observed: if the same table is used in a subquery of each derived table in a join, the query planner goes nuts.