How to join 2 tables without value duplication in PostgreSQL

I am joining two tables using:
select table1.date, table1.item, table1.qty, table2.anotherQty
from table1
INNER JOIN table2
on table1.date = table2.date
table1
date  | item  | qty
------+-------+----
july1 | itemA | 20
july1 | itemB | 30
july2 | itemA | 20
table2
date  | anotherQty
------+-----------
july1 | 200
july2 | 300
Expected result should be:
date  | item  | qty | anotherQty
------+-------+-----+-----------
july1 | itemA | 20  | 200
july1 | itemB | 30  | null or 0
july2 | itemA | 20  | 300
So that when I sum(anotherQty) it totals 500 only, instead of:
date  | item  | qty | anotherQty
------+-------+-----+-----------
july1 | itemA | 20  | 200
july1 | itemB | 30  | 200
july2 | itemA | 20  | 300
That is 200 + 200 + 300 = 700.

SQL DEMO
Number the rows within each "date" in both tables, then join on both "date" and that row number, so each Table2 row matches at most one Table1 row:
WITH T1 as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY "date" ORDER BY "item") as rn
    FROM Table1
), T2 as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY "date" ORDER BY "anotherQty") as rn
    FROM Table2
)
SELECT *
FROM t1
LEFT JOIN t2
    ON t1."date" = t2."date"
    AND t1.rn = t2.rn
OUTPUT
Filter the columns you want, and change the order if needed.
| date  | item  | qty | rn | date   | anotherQty | rn     |
|-------|-------|-----|----|--------|------------|--------|
| july1 | itemA | 20  | 1  | july1  | 200        | 1      |
| july1 | itemB | 30  | 2  | (null) | (null)     | (null) |
| july2 | itemA | 20  | 1  | july2  | 300        | 1      |
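With that pairing in place, summing anotherQty over the join counts each Table2 row at most once. A quick check (this is just the demo above with the SELECT list swapped for a SUM):
WITH T1 as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY "date" ORDER BY "item") as rn
    FROM Table1
), T2 as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY "date" ORDER BY "anotherQty") as rn
    FROM Table2
)
SELECT SUM(t2."anotherQty") AS total  -- 200 + 300 = 500
FROM t1
LEFT JOIN t2
    ON t1."date" = t2."date"
    AND t1.rn = t2.rn;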

Try the following code, but be aware that as long as the qty values differ across rows, the anotherQty field will still break out into distinct values:
select
table1.date,
table1.item,
table1.qty,
SUM(table2.anotherQty)
from table1
INNER JOIN table2
on table1.date = table2.date
GROUP BY
table1.item,
table1.qty,
table1.date
If you need it to always aggregate down to a single line per item/date, then you will need to add a SUM() to table1.qty as well. Alternatively, you could run a common table expression (a WITH statement) for each quantity that you want, summing it within the common table expression, and then joining the expressions back together in your final SELECT statement; a sketch of that approach follows.
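For instance, a minimal sketch of that CTE approach, assuming the table1/table2 layout above (note it aggregates per date, so item is dropped):
WITH qty_per_date AS (
    -- total qty per date from table1
    SELECT date, SUM(qty) AS qty
    FROM table1
    GROUP BY date
), another_per_date AS (
    -- total anotherQty per date from table2
    SELECT date, SUM(anotherQty) AS anotherQty
    FROM table2
    GROUP BY date
)
SELECT q.date, q.qty, a.anotherQty
FROM qty_per_date q
JOIN another_per_date a ON q.date = a.date;
Because each table is aggregated before the join, summing anotherQty over this result gives 200 + 300 = 500, not 700.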
Edit:
Based on the comment from @Juan Carlos Oropeza, I'm not sure there is a way to get the summed value of 500 while including table1.date in your query, because you would have to group the output by date, which causes the aggregation to split into distinct lines. The following query gets you the sum of anotherQty, at the cost of not displaying date:
select
table1.item,
SUM(table1.qty),
SUM(table2.anotherQty)
from table1
INNER JOIN table2
on table1.date = table2.date
GROUP BY
table1.item
If you need date to persist, you can get the sum to show up by using a window function, but note that this is essentially a running sum and may throw off any subsequent summation you do on this query's output in post-processing:
select
table1.item,
table1.date,
SUM(table1.qty),
SUM(table2.anotherQty) OVER (Partition By table1.item)
from table1
INNER JOIN table2
on table1.date = table2.date
GROUP BY
table1.item,
table1.date,
table2.anotherQty

Related

How do I get PostgreSQL to recognize repeating dates for a mathematical operation?

Very new to SQL querying. Using PostgreSQL.
I am trying to build a query that tells me what percentage of the time a unique customer id makes multiple transactions on the same day.
I have a query built that gets me the customer ids and transaction dates (if there are multiple on the same day, the date repeats).
Below is my query:
SELECT customer.customer_id, rental_date::date FROM customer
FULL OUTER JOIN rental
ON customer.customer_id = rental.customer_id
FULL OUTER JOIN inventory
ON rental.inventory_id = inventory.inventory_id
FULL OUTER JOIN film
ON inventory.film_id = film.film_id
ORDER BY customer.customer_id, rental_date
Update:
Query now reads:
SELECT customer.customer_id, rental_date::date, COUNT(*)
FROM customer
JOIN rental ON customer.customer_id = rental.customer_id
JOIN inventory ON rental.inventory_id = inventory.inventory_id
JOIN film ON inventory.film_id = film.film_id
GROUP BY customer.customer_id, rental_date
ORDER BY customer.customer_id, rental_date
Output:
+-------------+-------------+-------+
| customer_id | rental_date | count |
+-------------+-------------+-------+
| 1           | 2005-05-25  | 1     |
| 1           | 2005-05-28  | 1     |
| 1           | 2005-06-15  | 1     |
| 1           | 2005-06-15  | 1     |
| 1           | 2005-06-15  | 1     |
| 2           | 2005-06-16  | 1     |
+-------------+-------------+-------+
Desired output:
+-------------+-------------+-------+
| customer_id | rental_date | count |
+-------------+-------------+-------+
| 1           | 2005-05-25  | 1     |
| 1           | 2005-05-28  | 1     |
| 1           | 2005-06-15  | 3     |
| 2           | 2005-06-16  | 1     |
+-------------+-------------+-------+
What you are looking for is COUNT and HAVING. Grouping by the cast rental_date::date (rather than the raw timestamp) collapses same-day rentals into one row, COUNT(*) gets you the number of rentals per day, and HAVING eliminates the days with only a single rental.
select customer.customer_id, rental_date::date, count(*)
from customer
join rental on customer.customer_id = rental.customer_id
join inventory on rental.inventory_id = inventory.inventory_id
join film on inventory.film_id = film.film_id
group by customer.customer_id, rental_date::date
having count(*) > 1
order by customer.customer_id, rental_date::date;
Also, I doubt you want FULL OUTER JOIN. That returns all rows from both joined tables even when there is no match in the other. I changed it to an inner join (you only want customers that have rentals, and inventory that has rentals), even though now the HAVING would eliminate the extras. Try removing the HAVING clause and running it with both full and inner joins to see the difference.
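To get from there to the percentage the question actually asks about, one hedged sketch (assuming PostgreSQL 9.4+ for the FILTER clause) wraps the per-day counts in a CTE and compares multi-rental days against all customer-days:
WITH per_day AS (
    -- one row per customer per calendar day, with that day's rental count
    SELECT customer.customer_id, rental_date::date AS rental_day, COUNT(*) AS cnt
    FROM customer
    JOIN rental ON customer.customer_id = rental.customer_id
    GROUP BY customer.customer_id, rental_date::date
)
SELECT 100.0 * COUNT(*) FILTER (WHERE cnt > 1) / COUNT(*) AS pct_multi_rental_days
FROM per_day;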

Accomplishing what I need without a CROSS JOIN

I have a query that pulls from a table. With this table, I would like to build a query that allows me to make projections into the future.
SELECT
b.date,
a.id,
SUM(CASE WHEN a.date = b.date THEN a.sales ELSE 0 END) sales,
SUM(CASE WHEN a.date = b.date THEN a.revenue ELSE 0 END) revenue
FROM
table_a a
CROSS JOIN table_b b
WHERE a.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY 1,2
table_b is a table with literally only one column that contains dates going deep into the future. This returns results like this:
+----------+--------+-------+---------+
| date     | id     | sales | revenue |
+----------+--------+-------+---------+
| 11/4/18  | 113972 | 0     | 0       |
| 11/4/18  | 111218 | 0     | 0       |
| 11/3/18  | 111218 | 0     | 0       |
| 11/3/18  | 113972 | 0     | 0       |
| 11/2/18  | 111218 | 0     | 0       |
| 11/2/18  | 113972 | 0     | 0       |
| 11/1/18  | 111218 | 89    | 2405.77 |
| 11/1/18  | 113972 | 265   | 3000.39 |
| 10/31/18 | 111218 | 64    | 2957.71 |
| 10/31/18 | 113972 | 120   | 5650.91 |
+----------+--------+-------+---------+
Now there's more to the query after this, where I get into the projections and whatnot, but for the purposes of this question this is all you need, as it's where the CROSS JOIN exists.
How can I recreate these results without using a CROSS JOIN? In reality, this query covers a much larger date range with far more data; it takes hours and a lot of resources to run, and I know CROSS JOINs should be avoided if possible.
Use the table of all dates as the "from" table and LEFT JOIN the data; this still returns every date.
SELECT
d.date
, t.id
, COALESCE(SUM(t.sales),0) sales
, COALESCE(SUM(t.revenue),0) revenue
FROM all_dates d
LEFT JOIN table_data t
ON d.date = t.date
WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY
d.date
, t.id
Another alternative (to avoid the cross join) could be to use generate_series(), though for Redshift I would point you to this former answer instead. I'm a fan of generate_series(), but if you already have a dates table I would probably stay with that (based on what little I know about your query). A sketch of the generate_series() version follows.
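For plain PostgreSQL, a hedged sketch of that generate_series() version, reusing the table_data name and date range from the answer above (generate_series() is not usable this way in Redshift):
SELECT d.dt::date AS date, t.id,
       COALESCE(SUM(t.sales), 0)   AS sales,
       COALESCE(SUM(t.revenue), 0) AS revenue
FROM generate_series('2018-10-31'::date, '2018-11-04'::date, interval '1 day') AS d(dt)
LEFT JOIN table_data t ON t.date = d.dt::date
GROUP BY d.dt, t.id;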

Postgresql use more than one row as expression in sub query

As the title says, I need to create a query where I SELECT all items from one table and use those items as expressions in another query. Suppose I have the main table that looks like this:
main_table
-------------------------------------
id | name | location | //more columns
---|------|----------|---------------
1  | me   | pluto    | //
2  | them | mercury  | //
3  | we   | jupiter  | //
And the sub query table looks like this:
some_table
---------------
id | item
---|-----------
1 | sub-col-1
2 | sub-col-2
3 | sub-col-3
where each item in some_table has a price which is in an amount_table like so:
amount_table
--------------
1 | 1000
2 | 2000
3 | 3000
So that the query returns results like this:
name | location | sub-col-1 | sub-col-2 | sub-col-3 |
-----|----------|-----------|-----------|-----------|
me   | pluto    | 1000      |           |           |
them | mercury  |           | 2000      |           |
we   | jupiter  |           |           | 3000      |
My query currently looks like this:
SELECT name, location, (SELECT item FROM some_table)
FROM main_table
INNER JOIN amount_table WHERE //match the id's
But I'm running into the error more than one row returned by a subquery used as an expression
How can I formulate this query to return the desired results?
You should decide on the expected result.
To get a one-to-many relation:
SELECT name, location, some_table.item
FROM main_table
JOIN some_table on true -- or id if they match
INNER JOIN amount_table --WHERE match the id's
To get one-to-one with all rows:
SELECT name, location, (SELECT array_agg(item) FROM some_table)
FROM main_table
INNER JOIN amount_table --WHERE //match the id's
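Neither query produces the pivoted layout shown in the question. A hedged sketch of that shape using conditional aggregation, assuming main_table.id lines up with some_table.id and that the price column in amount_table is named amount (neither is confirmed in the question):
SELECT m.name,
       m.location,
       MAX(a.amount) FILTER (WHERE s.item = 'sub-col-1') AS "sub-col-1",
       MAX(a.amount) FILTER (WHERE s.item = 'sub-col-2') AS "sub-col-2",
       MAX(a.amount) FILTER (WHERE s.item = 'sub-col-3') AS "sub-col-3"
FROM main_table m
JOIN some_table s ON s.id = m.id     -- assumed id match
JOIN amount_table a ON a.id = s.id   -- assumed id match
GROUP BY m.name, m.location;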

Updating multiple rows with a certain value from the same table

So, I have the following table:
time     | name   | ID   |
12:00:00 | access | 1    |
12:05:00 | select | null |
12:10:00 | update | null |
12:15:00 | insert | null |
12:20:00 | out    | null |
12:30:00 | access | 2    |
12:35:00 | select | null |
The table is bigger (approximately 1 to 1.5 million rows); there will be IDs equal to 2, 3, 4, and so on, with more rows in between.
The following should be the result:
time     | name   | ID |
12:00:00 | access | 1  |
12:05:00 | select | 1  |
12:10:00 | update | 1  |
12:15:00 | insert | 1  |
12:20:00 | out    | 1  |
12:30:00 | access | 2  |
12:35:00 | select | 2  |
What is the simplest method to update the rows without filling up the log? For example, one ID at a time.
You can do it with a subquery (in PostgreSQL, LIMIT takes the place of TOP, and the column in SET is not prefixed with the alias):
UPDATE YourTable t
SET ID = (SELECT s.ID
          FROM YourTable s
          WHERE s.time < t.time AND s.name = 'access'
          ORDER BY s.time DESC
          LIMIT 1)
WHERE t.name <> 'access';
An index on (name, time) will help the subquery.
You can do it using CTE as below:
;WITH myCTE
AS ( SELECT time
, name
, ROW_NUMBER() OVER ( PARTITION BY name ORDER BY time ) AS [rank]
, ID
FROM YourTable
)
UPDATE myCTE
SET myCTE.ID = myCTE.rank
SELECT *
FROM YourTable ORDER BY ID
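If the IDs on the 'access' rows increase with time (as in the sample), a hedged single-statement alternative is to carry the last seen ID forward with a running MAX() window, since MAX() ignores NULLs; rows are matched back by ctid, which is PostgreSQL-specific:
UPDATE YourTable t
SET ID = f.fill_id
FROM (
    SELECT ctid AS row_id,
           -- running maximum of ID so far carries each 'access' ID down
           MAX(ID) OVER (ORDER BY time ROWS UNBOUNDED PRECEDING) AS fill_id
    FROM YourTable
) f
WHERE t.ctid = f.row_id
  AND t.ID IS NULL;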

How can I get the sum(value) on the latest gather_time per group(name,col1) in PostgreSQL?

Actually, I got a good answer about a similar issue in the thread below, but I need one more solution for a different data set.
How to get the latest 2 rows ( PostgreSQL )
The data set has historical data, and I just want to get sum(value) for each group at its latest gather_time.
The final result should be as following:
name | col1 | gather_time | sum
-------+------+---------------------+-----
first | 100 | 2016-01-01 23:12:49 | 6
first | 200 | 2016-01-01 23:11:13 | 4
However, with the query below I can only see data for one group (first-100); there is no data for the second group (first-200).
The thing is that I need to get one row per group, and the number of groups can vary.
select name,col1,gather_time,sum(value)
from testtable
group by name,col1,gather_time
order by gather_time desc
limit 2;
name | col1 | gather_time | sum
-------+------+---------------------+-----
first | 100 | 2016-01-01 23:12:49 | 6
first | 100 | 2016-01-01 23:11:19 | 6
(2 rows)
Can you advise me on how to accomplish this requirement?
Data set
create table testtable
(
name varchar(30),
col1 varchar(30),
col2 varchar(30),
gather_time timestamp,
value integer
);
insert into testtable values('first','100','q1','2016-01-01 23:11:19',2);
insert into testtable values('first','100','q2','2016-01-01 23:11:19',2);
insert into testtable values('first','100','q3','2016-01-01 23:11:19',2);
insert into testtable values('first','200','t1','2016-01-01 23:11:13',2);
insert into testtable values('first','200','t2','2016-01-01 23:11:13',2);
insert into testtable values('first','100','q1','2016-01-01 23:11:11',2);
insert into testtable values('first','100','q1','2016-01-01 23:12:49',2);
insert into testtable values('first','100','q2','2016-01-01 23:12:49',2);
insert into testtable values('first','100','q3','2016-01-01 23:12:49',2);
select *
from testtable
order by name,col1,gather_time;
name | col1 | col2 | gather_time | value
-------+------+------+---------------------+-------
first | 100 | q1 | 2016-01-01 23:11:11 | 2
first | 100 | q2 | 2016-01-01 23:11:19 | 2
first | 100 | q3 | 2016-01-01 23:11:19 | 2
first | 100 | q1 | 2016-01-01 23:11:19 | 2
first | 100 | q3 | 2016-01-01 23:12:49 | 2
first | 100 | q1 | 2016-01-01 23:12:49 | 2
first | 100 | q2 | 2016-01-01 23:12:49 | 2
first | 200 | t2 | 2016-01-01 23:11:13 | 2
first | 200 | t1 | 2016-01-01 23:11:13 | 2
One option is to join your original table to a table containing only the records with the latest gather_time for each name, col1 group. Then you can take the sum of the value column for each group to get the result set you want.
SELECT t1.name, t1.col1, MAX(t1.gather_time) AS gather_time, SUM(t1.value) AS sum
FROM testtable t1
INNER JOIN
(
    SELECT name, col1, MAX(gather_time) AS maxTime
    FROM testtable
    GROUP BY name, col1
) t2
    ON t1.name = t2.name AND t1.col1 = t2.col1 AND t1.gather_time = t2.maxTime
GROUP BY t1.name, t1.col1
If you wanted to use a subquery in the WHERE clause, as you attempted in your OP, to restrict to only the records with the latest gather_time, then you could try the following:
SELECT name, col1, gather_time, SUM(value) AS sum
FROM testtable t1
WHERE gather_time =
(
SELECT MAX(gather_time)
FROM testtable t2
WHERE t1.name = t2.name AND t1.col1 = t2.col1
)
GROUP BY name, col1
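Another hedged sketch of the same idea: compute each group's latest gather_time with a window function in a single pass over the table, then keep only the rows at that time and sum them:
SELECT name, col1, gather_time, SUM(value) AS sum
FROM (
    SELECT *,
           -- latest gather_time within each (name, col1) group
           MAX(gather_time) OVER (PARTITION BY name, col1) AS latest_time
    FROM testtable
) s
WHERE gather_time = latest_time
GROUP BY name, col1, gather_time;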