How to write proper/efficient query - postgresql

I have a question about the right way to write a query.
I have an employees table with, let's say, 4 columns: employee_id, department, salary, email.
Some records have no email address. I'd like to find the most efficient way to write a SQL query, using a window function, that returns each department's salary sum divided by the total salary of all employees without an email address.
I have 2 solutions, and presumably only one of them is efficient. Can anyone give any advice about it?
select department, sum(salary) as total
from employees
where email is null
group by 1
option 1
select a.department , a.total/(select sum(salary) from employees where email is null)
from (
select department, sum(salary) as total
from employees
where email is null
group by 1
) a
option 2
select a.department , a.total/sum(a.total) over()
from (
select department, sum(salary) as total
from employees
where email is null
group by 1
) a
I guess that query 2 is more efficient, but is it the right way? And is it valid to leave the OVER clause empty?
Just started using PostgreSQL instead of MySQL 5.6.

Your second query is better.
The first query has to scan employees twice, while the second query only scans the (hopefully smaller) result set of the subquery to calculate the sum.
It is perfectly valid to leave the OVER clause empty; that just means the window is the entire result set, so all result rows will get the same value (which is what you want).
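As a rough illustration of the empty OVER clause, here is a minimal sketch that runs the second query against an in-memory SQLite database through Python's sqlite3 module (an assumption made only so the example runs without a PostgreSQL server; it needs SQLite 3.25 or later for window functions). The sample rows are made up, and the `1.0 *` factor is an addition that guards against integer division when both operands are integers.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
rows = [
    (1, "sales", 100, None),
    (2, "sales", 300, None),
    (3, "it", 600, None),
    (4, "it", 900, "d@example.com"),  # has an email, so the WHERE clause drops it
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, department TEXT, salary INTEGER, email TEXT)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)

# SUM(a.total) OVER () uses the whole result set as its window,
# so every row is divided by the same grand total.
query = """
SELECT a.department,
       a.total,
       1.0 * a.total / SUM(a.total) OVER () AS share
FROM (
    SELECT department, SUM(salary) AS total
    FROM employees
    WHERE email IS NULL
    GROUP BY department
) a
ORDER BY a.department
"""
shares = conn.execute(query).fetchall()
for department, total, share in shares:
    print(department, total, round(share, 3))
```

With this data the two departments have totals of 600 and 400, so the shares add up to 1.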

Related

How to get latest data for a column when using grouping in postgres

I am using Postgres alongside Sequelize. I have encountered a case where I need to write a custom query that groups the records by a particular field. I know that for the remaining columns that are not used for grouping, I need to use an aggregate function like SUM. But the problem is that for some columns I need the value from the latest row (sorted by created_at DESC). I see no function in SQL to do so. Is my only option to write subqueries, or is there a better way? Thanks.
For better understanding: if you look at the picture below, I want to group the records by address. So after the query there should be only two records, one with sydney and the other with new york. But when it comes to the distance, I want the result to contain the distance from the row that was most recently created, i.e. the one with the latest created_at.
So the final two query results should be:
sydney 100 2022-09-05 18:14:53.492131+05:45
new york 40 2022-09-05 18:14:46.23328+05:45
select address, distance, created_at
from (
    select address, distance, created_at,
           row_number() over (partition by address order by created_at desc) as rn
    from your_table  -- placeholder for the actual table name (table is a reserved word)
) x
where rn = 1
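To see the row_number() approach work end to end, here is a small sketch using Python's sqlite3 module with an in-memory database (chosen only so it runs without a server; the SQL is the same shape as the query above). The table name `trips` and the two older rows are invented for illustration; only the two expected result rows come from the question, with the timestamps shortened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'trips' is a hypothetical table name; the columns follow the question.
conn.execute("CREATE TABLE trips (address TEXT, distance INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("sydney", 50, "2022-09-05 18:14:40"),    # older row, made up
        ("sydney", 100, "2022-09-05 18:14:53"),   # latest row for sydney
        ("new york", 70, "2022-09-05 18:14:30"),  # older row, made up
        ("new york", 40, "2022-09-05 18:14:46"),  # latest row for new york
    ],
)

# Number the rows inside each address from newest to oldest, then keep row 1.
latest = conn.execute(
    """
    SELECT address, distance, created_at
    FROM (
        SELECT address, distance, created_at,
               ROW_NUMBER() OVER (PARTITION BY address ORDER BY created_at DESC) AS rn
        FROM trips
    ) x
    WHERE rn = 1
    ORDER BY created_at DESC
    """
).fetchall()
for row in latest:
    print(row)
```

Each address keeps only the row with the newest created_at, whatever the other columns contain.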

T-SQL "partition by" results not as expected

What I'm trying to do is get a total count of "EmailAddresses" using partitioning logic. As you can see in the result set spreadsheet, the first record is correct: this particular email address exists 109 times. But for the second record, with the same email address, the numberOfEmailAddresses column shows 108. And so on; it just keeps decreasing by 1 for the same email address. Clearly, I'm not writing this SQL right, and I was hoping to get some feedback as to what I might be doing wrong.
What I would like to see is the number 109 consistently down the column numberOfEmailAddresses for this particular email address. What might I be doing wrong?
Here's my code:
select
Q1.SubscriberKey,
Q1.EmailAddress,
Q1.numberOfEmailAddresses
from
(select
sub.SubscriberKey as SubscriberKey,
sub.EmailAddress as EmailAddress,
count(*) over (partition by sub.EmailAddress order by sub.SubscriberKey asc) as numberOfEmailAddresses
from
ent._Subscribers sub) Q1
And here's my result set, ordered by "numberOfEmailAddresses":
select distinct
Q1.SubscriberKey,
Q1.EmailAddress,
(select count(*) from ent._Subscribers sub where sub.EmailAddress = Q1.EmailAddress) as numberOfEmailAddress
from ent._Subscribers Q1
will get you what you want. The ORDER BY inside your OVER clause is what causes the changing count: when a window has an ORDER BY and no explicit frame, the default frame runs from the start of the partition up to the current row (and its peers), so count(*) becomes a running count rather than a count of the whole partition.
select
Q1.SubscriberKey,
Q1.EmailAddress,
Q1.numberOfEmailAddresses
from
(select
sub.SubscriberKey as SubscriberKey,
sub.EmailAddress as EmailAddress,
count(*) over (partition by sub.EmailAddress) as numberOfEmailAddresses
from
ent._Subscribers sub) Q1
May also work but I can't find a suitable dataset to test.
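The difference between the two OVER clauses can be checked on a tiny dataset. The sketch below uses Python's sqlite3 module with an in-memory database rather than SQL Server (an assumption made so it can run anywhere; the default window frame behaves the same way here). The `subscribers` table and its four rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for ent._Subscribers with made-up rows.
conn.execute("CREATE TABLE subscribers (SubscriberKey INTEGER, EmailAddress TEXT)")
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "a@example.com"),
        (3, "a@example.com"),
        (4, "b@example.com"),
    ],
)

# With ORDER BY the frame ends at the current row, giving a running count.
# Without ORDER BY the frame is the whole partition, giving the total.
counts = conn.execute(
    """
    SELECT SubscriberKey,
           COUNT(*) OVER (PARTITION BY EmailAddress ORDER BY SubscriberKey) AS running_count,
           COUNT(*) OVER (PARTITION BY EmailAddress) AS total_count
    FROM subscribers
    ORDER BY SubscriberKey
    """
).fetchall()
for row in counts:
    print(row)
```

The running_count column climbs 1, 2, 3 within the first address, while total_count stays at 3 on every one of its rows.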

Postgres query filter by non column in table

I have a challenge: I need to filter a query not by a value stored in a table, but by a value returned by a function.
Consider a table that contains all sales in the database:
id, description, category, price, col1 , ..... col n
I have a function that returns the sales similar to a given one (based on rules and business logic). This function queries all records in the sales table again and matches on some fields.
similar_sales (sale_id integer) - > returns a integer[]
Now I need to list all similar sales for each sale in the sales table.
select s.id, similar_sales (s.id)
from sales s
But similar_sales can return null, and I am only interested in sales that have at least one similar sale.
select id, similar
from (
select s.id, similar_sales (s.id) as similar
from sales s
) q
where #similar > 1 (Pseudocode)
limit x
I can't put the limit in the subquery because I don't know in advance which sales have similar ones.
I just want the subquery to run over a small set of rows rather than the entire table, to improve query performance (pagination strategy).
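You can try this:
select s.id, similar
from sales s
cross join lateral similar_sales (s.id) as similar
-- cardinality() counts array elements and is null for a null array; isempty() only applies to range types
where cardinality(similar) > 0
limit x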

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (1 German, 1 French, 1 UK). Is there a simple way to do that?
Normally, a simple GROUP BY would suffice for this type of solution. However, as you have specified that you want to include ALL of the columns in the result, we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule, it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window. Please update that (or the question) with any other field you would like; the PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
    SELECT *
    , ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Name) AS _rn
    FROM Customers
    WHERE Country IN ('Germany', 'France', 'UK')
) AS ranked  -- PostgreSQL versions before 16 require an alias on a subquery in FROM
WHERE _rn = 1
The PARTITION BY forces the ROW_NUMBER to be counted across all records with the same Country value, starting at 1, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The Country filter (the WHERE ... IN clause) could have been placed in the outer query instead if you really wanted to, but ROW_NUMBER() can only be specified in the SELECT or ORDER BY clauses of a query, so to use it as a filter criterion we are forced to wrap the results in some way.

count max values in postgresql

I have a problem formulating an SQL query in PostgreSQL, and I am hoping to get some help here.
I have three tables: employee, visitor, and visit. I want to find out which employee (fk_employee_id) has been responsible for the most visits that haven't been checked out.
I want to write an SQL query that returns just the number one result (by a max function, maybe?) instead of my current one, which returns a ranked list (the ranked list doesn't work either if the number one position is shared by two persons).
This is my current SQL query:
select visitor.fk_employee_id, count(visitor.fk_employee_id)
From Visit
Inner Join visitor on visit.fk_visitor_id = visitor.visitor_id
WHERE check_out_time IS NULL
group by visitor.fk_employee_id, visitor.fk_employee_id
Limit 1
Does anyone know how to do this?
To avoid confusion, I will change the column names to:
visitor table, the FK to employee id : employee_in_charge_id
visit table, the FK to employee id : employee_to_meet_id
From your explanation in the comments, you are looking for the employee who has the most visits that are not checked out.
If more than one employee shares the same maximum number of visits that are not checked out, this query lists all of them:
SELECT * FROM
(
SELECT
r.employee_in_charge_id,
count(*) cnt,
rank() over (ORDER BY count(*) DESC)
FROM visit v
JOIN visitor r ON v.visitor_id = r.id
WHERE v.check_out_time IS NULL
GROUP BY r.employee_in_charge_id
) a
WHERE rank = 1;
See the SQL Fiddle: http://sqlfiddle.com/#!17/423d9/2
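To check how rank() handles a shared first place, here is a minimal sketch run against an in-memory SQLite database via Python's sqlite3 module (an assumption made so it runs without PostgreSQL). The rows are made up, and the aggregation is moved into its own subquery before ranking, which is a portability choice for this sketch rather than something the query above needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the renamed schema in this answer; the data is invented.
conn.execute("CREATE TABLE visitor (id INTEGER, employee_in_charge_id INTEGER)")
conn.execute("CREATE TABLE visit (visitor_id INTEGER, check_out_time TEXT)")
conn.executemany("INSERT INTO visitor VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
conn.executemany(
    "INSERT INTO visit VALUES (?, ?)",
    [
        (1, None), (1, None),     # employee 10: two open visits
        (2, None), (2, None),     # employee 20: two open visits (a tie)
        (3, None),                # employee 30: one open visit
        (3, "2022-01-01 10:00"),  # checked out, so it is ignored
    ],
)

# Count open visits per employee, rank the counts, keep everyone ranked first.
top = conn.execute(
    """
    SELECT employee_in_charge_id, cnt
    FROM (
        SELECT employee_in_charge_id, cnt,
               RANK() OVER (ORDER BY cnt DESC) AS rnk
        FROM (
            SELECT r.employee_in_charge_id, COUNT(*) AS cnt
            FROM visit v
            JOIN visitor r ON v.visitor_id = r.id
            WHERE v.check_out_time IS NULL
            GROUP BY r.employee_in_charge_id
        ) counts
    ) ranked
    WHERE rnk = 1
    ORDER BY employee_in_charge_id
    """
).fetchall()
print(top)
```

Both tied employees come back, whereas a plain ORDER BY ... LIMIT 1 would return only one of them.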
Side Note:
To me, it sounds more correct if employee_in_charge_id is part of the visit table rather than the visitor table. My assumption is that for each visit there is 1 employee (A) who is responsible for handling the visit, and the visitor is meeting 1 employee (B). So 1 visitor can make multiple visits, which are handled by different employees.
Anyway, my answer above is based on your original schema design.
Assuming a standard n:m implementation like detailed here, this would be one way to do it:
SELECT fk_employee_id
FROM visit
WHERE check_out_time IS NULL
GROUP BY fk_employee_id
ORDER BY count(*) DESC
LIMIT 1;
Assuming referential integrity, you do not need to include the table visitor in the query at all.
count(*) is a bit faster than count(fk_employee_id) and returns the same result in this case (assuming fk_employee_id is NOT NULL). See:
PostgreSQL: running count of rows for a query 'by minute'