Selecting all rows that belong to a group with some properties - PostgreSQL

I use PostgreSQL for a web application, and I've run into a type of query I can't think of a way to write efficiently.
What I'm trying to do is select all rows from a table that, when grouped a certain way, belong to a group meeting some criteria. For example, the naive way to structure this query might be something like this:
SELECT *
FROM table T
JOIN (
    SELECT iT.a, iT.b, SUM(iT.c) AS sum
    FROM table iT
    GROUP BY iT.a, iT.b
) TG ON (TG.a = T.a AND TG.b = T.b)
WHERE TG.sum > 100;
The problem I'm having is that this effectively doubles the time it takes the query to execute, since it's essentially selecting the rows from that table twice.
How can I structure queries of this type efficiently?

You can try a window function, although I don't know whether it is more efficient. I would guess it is, since it avoids the join. Test this and your original query with EXPLAIN:
select *
from (
    select
        a, b,
        sum(c) over (partition by a, b) as sum
    from t
) s
where "sum" > 100;

Related

Postgres in-filter complexity

I have a query like this:
SELECT var_2
FROM table_a
WHERE var_1 IN (<Large list of values>);
Let's say table_a has n rows and the large list of values has length t, which is less than n. What, then, is the complexity of this query: O(n), O(n*log(t)), or O(n*t)? I'm using Postgres 10.
Sometimes, rewriting a large IN list to a JOIN against a VALUES list creates a better execution plan.
So a query like this:
select column_2
from the_table
where column_1 in (1,2,3,4);
If the list does not contain duplicate values, the above can be rewritten to:
select t.column_2
from the_table t
join (
    values (1),(2),(3),(4)
) as v(c1) on v.c1 = t.column_1;
To find out whether that improves the query, you will have to check the execution plan.
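If the list might contain duplicates, deduplicate the VALUES list before joining, since the join would otherwise multiply matching rows. A sketch of that variant:

select t.column_2
from the_table t
join (
    select distinct c1                        -- remove duplicates before joining
    from (values (1),(2),(3),(3)) as v(c1)
) dv on dv.c1 = t.column_1;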

PostgreSQL CTE records as parameters to function

I have a function that accepts two integers as parameters: my_function(input_a, input_b). Is there an easy way to pass the results of a CTE (that returns records of input_a, input_b) into the function?
Should I be looking into writing a custom function with a for loop or is there a better approach?
If the function returns a single record then:
WITH cte AS (SELECT 1 a, 2 b)
SELECT my_function(a, b) FROM cte;
will work. However, if the function is an SRF (set-returning function), you need to use LATERAL to let the database know that you want to feed each row produced by the preceding tables in the FROM clause into the function. This is accomplished like so:
WITH cte AS (SELECT 1 a, 2 b)
SELECT * FROM cte, LATERAL my_function(a, b);
The LATERAL will cause PostgreSQL to take each row from the CTE and run "my_function" with the values from that row, returning the results of that function to the overall SELECT statement.
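For a self-contained test, here is a hypothetical definition of my_function as an SRF; the body is an assumption, made up just so the LATERAL example above is runnable:

-- placeholder SRF: returns one row of derived values per input pair
CREATE FUNCTION my_function(input_a int, input_b int)
RETURNS TABLE (total int, product int) AS $$
    SELECT input_a + input_b, input_a * input_b;
$$ LANGUAGE sql;

WITH cte AS (SELECT 1 AS a, 2 AS b UNION ALL SELECT 3, 4)
SELECT * FROM cte, LATERAL my_function(a, b);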

Optimize SQL "WITH IS_MEMBER SELECT FROM" query

I have a query which runs extremely slowly when checking IS_MEMBER, compared to just loading the whole dataset. This view acts as a security check: it checks whether you are a member of a particular group, e.g. group 1, and the next column then states what access that group has, e.g. division 2.
This view is then joined with the fact table, so that it will only retrieve division 2 rows.
The question is: does IS_MEMBER execute for each row of fact data? That is just my theory, because the query runs 1000 times faster without this view. Can anyone suggest an alternative structure?
WITH group_security AS (
    SELECT DISTINCT division_code
    FROM dbo.dim_group_security_division AS gsd
    WHERE IS_MEMBER(group_name) = 1
)
SELECT dbo.dim_division.dim_division_key, dbo.dim_division.division_ID,
       dbo.dim_division.division_code, dbo.dim_division.division_name
FROM dbo.dim_division
INNER JOIN group_security
    ON dbo.dim_division.division_code = group_security.division_code
    OR group_security.division_code = 'ALL'
Since you JOIN on dbo.dim_division.division_code, do you have an index on this column?
Alternatively you could give this a try:
SELECT dim.dim_division_key,
       dim.division_ID,
       dim.division_code,
       dim.division_name
FROM dim_division dim
WHERE EXISTS (SELECT *
              FROM dbo.dim_group_security_division gsd
              WHERE gsd.division_code IN ('ALL', dim.division_code)
                AND IS_MEMBER(gsd.group_name) = 1)
This way the system can stop at the first match in dim_group_security_division, instead of having to find all matches and then deduplicate the result because of the DISTINCT.
In this case, it might also be useful to have an index on gsd.division_code to speed things up a bit.
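A sketch of such an index; the index name is made up, and including group_name is an assumption so the EXISTS probe can be answered from the index alone:

-- hypothetical name; INCLUDE keeps group_name in the leaf pages
CREATE NONCLUSTERED INDEX IX_gsd_division_code
    ON dbo.dim_group_security_division (division_code)
    INCLUDE (group_name);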

Hive: How to do a SELECT query to output a unique primary key using HiveQL?

I have the following dataset which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key so it needs to be unique. The output schema should be
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when I use the keyword DISTINCT in the Hive query, the DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to call_id, something along the lines of:
SELECT DISTINCT(call_id), stat2, stat3 from intable;
However, this is not valid in Hive (I am not well-versed in SQL either).
The only legal query seems to be:
SELECT DISTINCT call_id, stat2, stat3 from intable;
But this returns multiple rows with the same call_id, as the other columns are different and the row as a whole is distinct.
NOTE: There is no arithmetic relation between a, b, c, x, y, z, etc., so any trick of averaging or summing is not viable.
Any ideas how I can do this?
One quick idea, not the best one, but it will do the job:
hive> create table temp1(a int, b string);
hive> insert overwrite table temp1
      select call_id, max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
-- note: split() takes a regex, so the pipe delimiter must be escaped
hive> insert overwrite table intable
      select a, split(b,'\\|')[0], split(b,'\\|')[1], split(b,'\\|')[2] from temp1;
"I want to apply the DISTINCT operation only to the call_id"
But how would Hive then know which row to eliminate?
Without knowing the amount of data / size of the stat fields you have, the following query can do the job:
select distinct i1.call_id, i1.stat2, i1.stat3
from (
    select call_id, min(concat(stat1, stat2, stat3)) as smin
    from intable
    group by call_id
) i2
join intable i1
    on i1.call_id = i2.call_id
    and concat(i1.stat1, i1.stat2, i1.stat3) = i2.smin;
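On Hive 0.11 and later, a windowed row_number() is another way to keep exactly one row per call_id. This is a sketch of that alternative, not part of the answers above; the order by column is an arbitrary tie-breaker:

-- keep one row per call_id; "order by stat1" just picks a deterministic winner
select call_id, stat2, stat3
from (
    select call_id, stat2, stat3,
           row_number() over (partition by call_id order by stat1) as rn
    from intable
) ranked
where rn = 1;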

Way to reduce the cost in DB2 for count(*)

Hi, I have a DB2 query as below:
select count(*) as count
from table_a,
     table_b b,
     table_c c
where b.xxx = 234
  and b.yyy = c.wedf
Result set:
Count
618543562
For the above query I even tried count(1), but when I looked at the access plan, the cost is the same.
select count(1) as count
from table_a,
     table_b b,
     table_c c
where b.xxx = 234
  and b.yyy = c.wedf
Result set:
Count
618543562
Is there any other way to reduce the cost?
PS: b.xxx, b.yyy and c.wedf are indexed.
Thanks in advance.
I think one of the problems is the statistics on the tables. Did you execute RUNSTATS? Probably the data distribution, or the quantity of rows that has to be read, is such that DB2 concludes it is better to read the whole table instead of processing an index and then fetching the rows from the table.
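For example, a minimal RUNSTATS invocation, run from the DB2 command line processor; the schema and table names are placeholders:

-- refresh table and index statistics, including value distributions
RUNSTATS ON TABLE myschema.table_b
    WITH DISTRIBUTION AND DETAILED INDEXES ALL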
It seems that both queries take the same access plan, and I think they are doing table scans.
Are the three columns part of the same index, or are they indexed separately? If they are part of different indexes, is there any index ANDing in the access plan? If there is no ANDing between different indexes, the columns have to be read from the table in order to process the predicates.
The reason count(1) and count(*) give the same cost is that both have to do a table scan.
Please take a look at the access plan, not only the results in timerons, but also the steps. Does the access plan use the indexes? How many sorts does it execute?
Try to change the optimization level, and you will see that the access plans change. I think you are executing with the default one (5).
If you want to force the query to take an index into account, you can create an optimization profile.
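Changing the optimization level for the current session is a one-liner; a sketch, where 7 is just an example level:

-- raise the optimizer's effort from the default class 5
SET CURRENT QUERY OPTIMIZATION = 7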
What is the relation between tables B and C and table A? In your query you just use a CROSS JOIN between A and (B, C), so that is the MAIN performance issue.
If you really need this count, just multiply the counts for A and (B, C):
select
    (select count(*) from a)
    *
    (select count(*) from b, c where b.xxx = 234 and b.yyy = c.wedf)
For DB2, use this:
select a1.cnt1 *
       (select count(*) as cnt2 from b, c where b.xxx = 234 and b.yyy = c.wedf)
from (select count(*) as cnt1 from a) a1