I'm facing a situation with Amazon Redshift that I haven't been able to explain to myself yet. The query planner seems unable to handle the same table appearing in the subqueries of two derived tables in a join.
I have essentially four tables, Source_A, Source_B, Target_1, and Target_2, and a query like
SELECT sa.a, sa.b, sb.c, sb.d FROM
(
  SELECT a, b FROM Source_A WHERE date > (SELECT MAX(date) FROM Target_1)
) AS sa
INNER JOIN
(
  SELECT c, d FROM Source_B WHERE date > (SELECT MAX(date) FROM Target_2)
) AS sb
ON sa.a = sb.c
The query works fine as long as Target_1 and Target_2 are different tables. If I change the query so that Target_2 is the same table as Target_1, something happens: the query starts to take about 10 times longer. And when I look at the performance monitor I can see that all of this extra time is spent with only the leader node active.
When I take an EXPLAIN of both options I see practically no difference in the output; all the steps are the same. But there is one difference: the EXPLAIN itself takes seconds in one case and almost half an hour in the other, where the Target tables are the same.
So to summarise what I think I have observed: on a join, if I use the same table in a subquery of each derived table, the query planner goes nuts.
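A workaround I plan to try (an untested sketch, not a confirmed fix): compute the MAX(date) once in a CTE, so the shared Target table appears in only one subquery for the planner to expand:

WITH cutoff AS (
    SELECT MAX(date) AS max_date FROM Target_1
)
SELECT sa.a, sa.b, sb.c, sb.d
FROM (
    SELECT a, b FROM Source_A, cutoff WHERE Source_A.date > cutoff.max_date
) AS sa
INNER JOIN (
    SELECT c, d FROM Source_B, cutoff WHERE Source_B.date > cutoff.max_date
) AS sb
ON sa.a = sb.c;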
So I have a complicated query; to simplify, let it be like
SELECT
t.*,
SUM(a.hours) AS spent_hours
FROM (
SELECT
person.id,
person.name,
person.age,
COUNT(contacts.id) AS contact_count
FROM
person
JOIN contacts ON contacts.person_id = person.id
) AS t
JOIN activities AS a ON a.person_id = t.id
GROUP BY t.id
Such a query works fine in MySQL, but Postgres needs to know that the GROUP BY field is unique, and even though it actually is, in this case I would need to GROUP BY all of the fields returned from the derived table t.
I can do that, but I don't believe that will work efficiently with big data.
I can't JOIN with activities directly in the first query, as a person can have several contacts, which would lead the query to count the hours of an activity several times, once for every joined contact.
Is there a Postgres way to make this query work? Maybe force Postgres to treat t.id as unique, or some other solution that achieves the same thing the Postgres way?
This query will not work on either database system as written: there is an aggregate function in the inner query, but you are not grouping by anything (unless you use window functions). Of course, there is a special case for MySQL: you can make it run by removing ONLY_FULL_GROUP_BY from sql_mode. So MySQL allows this usage because of a server configuration parameter, but you cannot do that in PostgreSQL.
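For what it's worth, here is a sketch of one PostgreSQL-legal rewrite, under the assumptions that person.id is the primary key (so name and age are functionally dependent on it) and that contact_count was meant to be a count; pre-aggregating activities in its own derived table also avoids the double-counting you describe:

SELECT t.*, a.spent_hours
FROM (
    SELECT person.id, person.name, person.age,
           COUNT(contacts.id) AS contact_count
    FROM person
    JOIN contacts ON contacts.person_id = person.id
    GROUP BY person.id              -- legal: name and age depend on the PK
) AS t
JOIN (
    SELECT person_id, SUM(hours) AS spent_hours
    FROM activities
    GROUP BY person_id
) AS a ON a.person_id = t.id;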
I knew MySQL allowed indeterminate grouping, but I honestly never knew how it implemented it... it always seemed imprecise to me, conceptually.
So depending on what that means (I'm too lazy to look it up), you might need one of two possible solutions, or maybe a third.
If your intent is to see all rows (perform the aggregate function but not consolidate/group rows), then you want a window function, invoked with PARTITION BY. Here is a really dumbed-down version using your query:
SELECT
t.*,
SUM(a.hours) OVER (PARTITION BY t.id) AS spent_hours
FROM t
JOIN activities AS a ON a.person_id = t.id
This means you want all records in table t, not one record per t.id. But each row will also contain the sum of the hours across all rows with that value of id.
For example the sum column would look like this:
Name Hours Sum Hours
----- ----- ---------
Smith 20 120
Jones 30 30
Smith 100 120
Whereas a group by would have had Smith once and could not have displayed the hours column in detail.
If you really did only want one row per t.id, then Postgres will require you to tell it how to determine which row. In the example above for Smith, do you want to see the 20 or the 100?
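If, say, you wanted the 100 (the row with the most hours per id), a sketch using Postgres's DISTINCT ON would state that rule explicitly:

SELECT DISTINCT ON (t.id)
    t.*,
    a.hours
FROM t
JOIN activities AS a ON a.person_id = t.id
ORDER BY t.id, a.hours DESC;        -- keep the highest-hours row per id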
There is another possibility, but I think I'll let you reply first. My gut tells me option 1 is what you're after and you want the analytic function.
My table contains 1 billion records. It is also partitioned by month. Id and datetime together form the primary key for the table. When I run:
select col1, col2, ..., col8
from mytable t
inner join cte on t.Id = cte.id
    and dtime > '2020-01-01' and dtime < '2020-10-01'
It uses an index scan, but takes more than 5 minutes to return. Please suggest what I can improve.
Note: I have set work_mem to 1GB. The cte results come back within 3 seconds.
Well, that's the nature of join; it is generally known as a time-consuming operation.
First of all, I recommend using IN rather than JOIN. Of course they have different meanings, but in some cases you can technically use them interchangeably. Check this question out.
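For example, a sketch of the IN rewrite using the names from your question (the column list is shortened for illustration, and cte is assumed to be defined in your WITH clause):

select col1, col2
from mytable
where id in (select id from cte)
  and dtime > '2020-01-01' and dtime < '2020-10-01';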
Secondly, in terms of relational algebra, a join conceptually combines each row of the mytable table with each row of the second table; the DBMS has to build a large intermediate result and finally ignore the unsuitable rows. All of those steps take time. Before the join operation, it's better to filter your tables (for example, mytable based on date) to make them smaller, and then perform the join.
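And a sketch of the filter-first approach under the same assumptions (your cte body is elided here, as in the question):

with cte as (
    ...                             -- your existing CTE, elided in the question
),
filtered as (
    select id, col1, col2           -- shortened column list, for illustration
    from mytable
    where dtime > '2020-01-01' and dtime < '2020-10-01'
)
select f.col1, f.col2
from filtered f
inner join cte on f.id = cte.id;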
I'm in the process of benchmarking some queries in Redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summary I need!
The GUI dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with SQL SELECTs?
To add to Alex's answer: the stl_query table has the inconvenience that if the query sat in a queue before running, the queue time is included in the elapsed time, and therefore the elapsed time won't be a very good indicator of the query's performance.
To get the actual runtime of the query, check stl_wlm_query for total_exec_time.
select total_exec_time      -- reported in microseconds, excluding queue time
from stl_wlm_query
where query = <query_id>    -- the numeric query id from stl_query
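If you want queue time and execution time side by side with the overall elapsed time, a sketch like this should work (stl_wlm_query times are in microseconds):

select q.query,
       datediff(milliseconds, q.starttime, q.endtime) as elapsed_ms,
       w.total_queue_time / 1000 as queue_ms,
       w.total_exec_time / 1000 as exec_ms
from stl_query q
join stl_wlm_query w on w.query = q.query
where q.userid <> 1
order by q.starttime desc
limit 20;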
There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of those scripts, stripped down to give you query run times in milliseconds. Play with the filters, ordering, etc. to show the results you are looking for:
select userid, label, stl_query.query, trim(database) as database,
       trim(querytxt) as qrytext, starttime, endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
       aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (select query, trim(split_part(event, ':', 1)) as event
                 from stl_alert_event_log
                 group by query, trim(split_part(event, ':', 1))) as alrt
    on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%')
-- and database = ''
order by starttime desc
limit 100
I am trying to get the date of the last insert performed on a table (on Amazon Redshift). Is there any way to do this using the metadata? The tables do not store any timestamp column, and even if they did, we need to find this out for roughly 3k tables, so checking each one would be impractical; that's why a metadata approach is our strategy. Any tips?
All insert execution steps for queries are logged in STL_INSERT. This query should give you the information you're looking for:
SELECT sti.schema, sti.table, sq.endtime, sq.querytxt
FROM (
    SELECT tbl, MAX(query) AS query, MAX(endtime) AS last_insert
    FROM stl_insert
    GROUP BY tbl
) inserts
JOIN stl_query sq ON sq.query = inserts.query
JOIN svv_table_info sti ON sti.table_id = inserts.tbl
ORDER BY inserts.last_insert DESC;
Note: The STL tables only retain approximately two to five days of log history.
I am running two SQL queries, say:
select obname from table1 where obid = 12
select modname from table2 where modid = 12
Both take very little time, say 300 ms each.
But when I am running:
select obname, modname
from (select obname from table1 where obid = 12) as alias1,
(select modname from table2 where modid = 12) as alias2
It takes 3500 ms. Why is that?
In general, putting two scalar queries in the from clause is not going to affect performance. In fact, from an application perspective, one query may be faster because there is less overhead going back and forth to the database. A scalar query returns one column and one row.
However, if the queries return multiple rows, then your little comma is doing a massive Cartesian product (which is why I always use CROSS JOIN rather than a comma in a FROM clause). In that case, all bets are off, because all of that data has to be processed as the results come back.
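For reference, your comma syntax is equivalent to an explicit CROSS JOIN; written this way, the Cartesian product is visible in the query text:

select obname, modname
from (select obname from table1 where obid = 12) as alias1
cross join (select modname from table2 where modid = 12) as alias2;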