Postgres: averaging a column on distinct of another column in already grouped query - postgresql

Is there a way to average a column only on a distinct of another column when the query is already grouped for another purpose without using a subquery? I know it can be done through subqueries, but trying to avoid restructuring an old query unless it is absolutely necessary.
The existing query, while complex, has more or less the same structure as the example below. As you can see, a library has any number of books, a book has any number of chapters, and a chapter has any number of paragraphs while the query returns the total numbers of books and paragraphs for each library.
SELECT libraries.name,
COUNT(DISTINCT books.id) AS num_books,
COUNT(paragraphs.id) AS num_paragraphs
FROM libraries
LEFT JOIN books ON books.library_id = libraries.id
LEFT JOIN chapters ON chapters.book_id = books.id
LEFT JOIN paragraphs ON paragraphs.chapter_id = chapters.id
GROUP BY libraries.name
Now suppose the table books has a column publish_year and I want the average year books in the library were published. Obviously I can't simply add AVERAGE(books.publish_year) since books with more chapters and paragraphs would skew the average.
Is there a good way of averaging books.publish_year based upon distinct books.id again without restructuring the query or is restructuring the query inevitable?

A window function before joining
select
l.name,
count(distinct b.id) as num_books,
count(p.id) as num_paragraphs,
min(year_avg) as year_avg
from
libraries l
left join (
select *, avg(publish_year) over(partition by library_id) as year_avg
from books
) b on b.library_id = l.id
left join chapters c on c.book_id = b.id
left join paragraphs p on p.chapter_id = c.id
group by l.name

Related

Grouping by attributes and counting, postgreSQL

I have written the following code that counts how many instances of each book_id there are in the table soldBooks.
SELECT book_id, sum(counter) AS no_of_books_sold, sum(retail_price) AS generated_revenue
FROM(
SELECT book_id,1 AS counter, retail_price
FROM shipments
LEFT JOIN editions ON (shipments.isbn = editions.isbn)
LEFT JOIN stock ON (shipments.isbn = stock.isbn)
) AS soldBooks
GROUP BY book_id
As you can see, I used a "counter" in order to solve my problem. But I am sure there must be a better, more built in way of achieving the same result! There must be some way to group a table together by a given attribute, and to create a new column displaying the count of EACH attribute. Can somebody share this with me?
Thanks!
SELECT book_id,
COUNT(book_id) AS no_books_sold,
SUM(retail_price) AS gen_rev
FROM shipments
JOIN editions ON (shipments.isbn=editions.isbn)
JOIN stock ON (shipments.isbn=stock.isbn)
GROUP BY book_id

INNER JOIN, LEFT/RIGHT OUTER JOIN

Apology in advance for a long question, but doing this just for the sake of learning:
i'm new to SQL and researching on JOIN for now. I'm getting two different behaviors when using INNER and OUTER JOIN. What I know is, INNER JOIN gives an intersection kind of result while returning only common rows among tables, and (LEFT/RIGHT) OUTER JOIN is outputting what is common and remaining rows in LEFT or RIGHT tables, depending upon LEFT/RIGHT clause respectively.
While working with MS Training Kit and trying to solve this practice: "Practice 2: In this practice, you identify rows that appear in one table but have no matches in another. You are given a task to return the IDs of employees from the HR.Employees table who did not handle orders (in the Sales.Orders table) on February 12, 2008. Write three different solutions using the following: joins, subqueries, and set
operators. To verify the validity of your solution, you are supposed to return employeeIDs: 1, 2, 3, 5, 7, and 9."
I'm successful doing this with subqueries and set operators but with JOIN is returning something not expected. I've written the following query:
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;
I'm expecting results as: 1, 2, 3, 5, 7, and 9 (6 rows)
But what i'm getting is: 1,1,1,2,2,2,3,3,3,4,4,5,5,5,6,6,7,7,7,8,8,9,9,9 (24 rows)
I tried some videos but could not understand this side of INNER/OUTER JOIN. I'll be grateful if someone could help this side of JOIN, why is it so and what should I try to understand while working with JOIN.
you can also use left outer join to get not matching
*** The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.
SELECT
H.empid
FROM
HR.Employees AS H
LEFT OUTER JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
WHERE O.empid IS NULL
Above script will return emp id who did not handle orders on specify date
here you can see all kind of join
Diagram taken from: http://dsin.wordpress.com/2013/03/16/sql-join-cheat-sheet/
adjust your query to be like this
USE TSQL2012;
SELECT
E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
where O.orderdate = '2008-02-12' AND O.empid IN null
ORDER BY
E.empid
;
USE TSQL2012;
SELECT
distinct E.empid
FROM
HR.Employees AS H
JOIN Sales.Orders AS O
ON H.empid = O.empid
AND O.orderdate = '20080212'
JOIN HR.Employees AS E
ON E.empid <> H.empid
ORDER BY
E.empid
;
Primary things to always remind yourself when working with SQL JOINs:
INNER JOINs require a match in the join in order for result set rows produced prior to the INNER JOIN to remain in the result set. When no match is found for a row, the row is discarded from the result set.
For a row fed to an INNER JOIN that matches to ONLY one row, only one copy of that row fed to the result set is delivered.
For a row fed to an INNER JOIN that matches to multiple rows, the row will be delivered multiple times, once for each row match from the INNER JOIN table.
OUTER JOINs will not discard rows fed to them in the result set, whether or not the OUTER JOIN results in a match or not.
Just like INNER JOINs, if an OUTER JOIN matches to more than one row, it will increase the number of rows in the result set by duplicating rows equal to the number of rows matched from the OUTER JOIN table.
Ask yourself "if I get NO match on the JOIN, do I want the row discarded or not?" If the answer is NO, use an OUTER JOIN. If the answer is YES, use an INNER JOIN.
If you don't need to reference any of the columns from a JOIN table, don't perform a JOIN at all. Instead, use a WHERE EXISTS, WHERE NOT EXISTS, WHERE IN, WHERE NOT IN, etc. or similar, depending on your database engine in use. Don't rely on the database engine to be smart enough to discard unreferenced columns resulting from JOINs from the result set. Some databases may be smart enough to do that, some not. There's no reason to pull columns into a result set only to not reference them. Doing so increases chance of reduced performance.
Your JOIN of:
JOIN HR.Employees AS E
ON E.empid <> H.empid
...is matching to all Employees rows with a DIFFERENT EMPID to all rows fed to that join. Use of NOT EQUAL on an INNER JOIN is a very rare thing to do or need, especially if the JOIN predicate is testing only ONE condition. That is why your getting duplicate rows in the result set.
On DB2, we could perform an EXCEPTION JOIN to accomplish that using a JOIN alone. Normally, on DB2, I would use a WHERE NOT EXISTS for that. On SQL Server you could do a JOIN to a query where the query set is all employees without orders in SALES.ORDERS on the specified date, but I don't know if that violates the rules of your tutorial.
Naveen posted the solution it appears your tutorial is looking for!

Postgres: left join with order by and limit 1

I have the situation:
Table1 has a list of companies.
Table2 has a list of addresses.
Table3 is a N relationship of Table1 and Table2, with fields 'begin' and 'end'.
Because companies may move over time, a LEFT JOIN among them results in multiple records for each company.
begin and end fields are never NULL. The solution to find the latest address is use a ORDER BY being DESC, and to remove older addresses is a LIMIT 1.
That works fine if the query can bring only 1 company. But I need a query that brings all Table1 records, joined with their current Table2 addresses. Therefore, the removal of outdated data must be done (AFAIK) in LEFT JOIN's ON clause.
Any idea how I can build the clause to not create duplicated Table1 companies and bring latest address?
Use a dependent subquery with max() function in a join condition.
Something like in this example:
SELECT *
FROM companies c
LEFT JOIN relationship r
ON c.company_id = r.company_id
AND r."begin" = (
SELECT max("begin")
FROM relationship r1
WHERE c.company_id = r1.company_id
)
INNER JOIN addresses a
ON a.address_id = r.address_id
demo: http://sqlfiddle.com/#!15/f80c6/2
Since PostgreSQL 9.3 there is JOIN LATERAL (https://www.postgresql.org/docs/9.4/queries-table-expressions.html) that allows to make a sub-query to join, so it solves your issue in an elegant way:
SELECT * FROM companies c
JOIN LATERAL (
SELECT * FROM relationship r
WHERE c.company_id = r.company_id
ORDER BY r."begin" DESC LIMIT 1
) r ON TRUE
JOIN addresses a ON a.address_id = r.address_id
The disadvantage of this approach is the indexes of the tables inside LATERAL do not work outside.
I managed to solve it using Windows Function:
WITH ranked_relationship AS(
SELECT
*
,row_number() OVER (PARTITION BY fk_company ORDER BY dt_start DESC) as dt_last_addr
FROM relationship
)
SELECT
company.*
address.*,
dt_last_addr as dt_relationship
FROM
company
LEFT JOIN ranked_relationship as relationship
ON relationship.fk_company = company.pk_company AND dt_last_addr = 1
LEFT JOIN address ON address.pk_address = relationship.fk_address
row_number() creates an int counter for each record, inside each window based to fk_company. For each window, the record with latest date comes first with rank 1, then dt_last_addr = 1 makes sure the JOIN happens only once for each fk_company, with the record with latest address.
Window Functions are very powerful and few ppl use them, they avoid many complex joins and subqueries!

Mysql Multiple Left join logic

Can someone explain multiple left join logic?
For example i have 3 tables: Company, company_text, company_rank.
Company has 4 records (id's: 1,2,3,4), company_text has 4 records with company names (1-a,2-b,3-c, 4-d), company_rank has following (1-1st, 2-2nd, 3-3rd). Note that company_rank table is not having 4th record. Now i want records of all companies using LEFT join. In case if there is no rank, display as 'zzzz' and sort in the descending order of rank and when rank is null sort by descending order of id.
select * from company
LEFT JOIN company_text ON company.id = company_text.id
LEFT JOIN company_rank ON company_text.id = company_rank.id
order by isnull(company_rank.rank,'zzzz'), rank desc
Will this work?
Basically i am trying to understand how LEFT JOIN works if there are many left joins? This doc. has good info on joins: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html but it don't have information on how multiple LEFT JOINS work? In case if multiple joins are present, how records data will be pulled?

Intersect of Select Statements based on a particular column

I have a Q about INTERSECT clause between two select statements in Sql server 2008.
Select 1 a,b,c ..... INTERSECT Select 2 a,b,c....
Here, the datasets of the two queries should exactly match to return the common elements.
But, I want only column a of both select statements to match.
If the values of column a in both the queries have same values, the entire row should appear in the result set.
Can i Do that and How ??
Thanks,
Marcus..
The best thing to do is to look at the queries itself. DO they need an INTERSECT, of is it possible to make a join with it
for example.
An INTERSECT looks like this
select columnA
from tableA
INTERSECT
select columnAreference
from tableB
Your result would have all columns that are in BOTH tables.. so a join would be more usefull
select columnA
from tableA a
inner join tableB b
on b.columnAReference = a.columnA
If you look into the execution plan you'll see that the INTERSECT will do a left semi join and the inner join will do a, like expected, an inner join. A left semi join isn't something you can tell the query optimizer to do, BUT IT IS FASTER!!!! A left semi join will only return 1 row from the left table, where a normal join will return them all. In this particular case it will be faster.
So an INTERSECT isn't a bad thing which should be eliminated with an INNER JOIN construction, sometimes it will perform even better.
However, to give you the best answer, i will need some more details about your query :)
select * from table1 t1 inner join Table2 t2
on t1.col1=t2.col1