How do I get PostgreSQL to recognize repeating dates for a mathematical operation? - postgresql

Very new to SQL querying. Using PostgreSQL.
I am trying to build a query that tells me what percentage of the time a unique customer id makes multiple transactions on the same day.
I have a query built that gets me the customer ids and transaction dates (if there are multiple on the same day, the date repeats).
Below is my query:
SELECT customer.customer_id, rental_date::date FROM customer
FULL OUTER JOIN rental
ON customer.customer_id = rental.customer_id
FULL OUTER JOIN inventory
ON rental.inventory_id = inventory.inventory_id
FULL OUTER JOIN film
ON inventory.film_id = film.film_id
ORDER BY customer.customer_id, rental_date
Update:
Query now reads:
SELECT customer.customer_id, rental_date::date, COUNT(*)
FROM customer
JOIN rental ON customer.customer_id = rental.customer_id
JOIN inventory ON rental.inventory_id = inventory.inventory_id
JOIN film ON inventory.film_id = film.film_id
GROUP BY customer.customer_id, rental_date
ORDER BY customer.customer_id, rental_date
Output:
+-------------+-------------+-------+
| customer_id | rental_date | count |
+-------------+-------------+-------+
| 1 | 2005-05-25 | 1 |
| 1 | 2005-05-28 | 1 |
| 1 | 2005-06-15 | 1 |
| 1 | 2005-06-15 | 1 |
| 1 | 2005-06-15 | 1 |
| 2 | 2005-06-16 | 1 |
+-------------+-------------+-------+
Desired output:
+-------------+-------------+-------+
| customer_id | rental_date | count |
+-------------+-------------+-------+
| 1 | 2005-05-25 | 1 |
| 1 | 2005-05-28 | 1 |
| 1 | 2005-06-15 | 3 |
| 2 | 2005-06-16 | 1 |
+-------------+-------------+-------+

What you are looking for is COUNT and HAVING: COUNT gets you the number of purchases per day, and HAVING eliminates the groups with only a single purchase on a given day. Note the ::date cast in the GROUP BY: grouping by the raw rental_date timestamp puts every rental in its own group, which is why all of your counts came out as 1.
select customer.customer_id, rental_date::date, count(*)
from customer
join rental on customer.customer_id = rental.customer_id
join inventory on rental.inventory_id = inventory.inventory_id
join film on inventory.film_id = film.film_id
group by customer.customer_id, rental_date::date
having count(*) > 1
order by customer.customer_id, rental_date::date;
Also, I doubt you want FULL OUTER JOIN. That returns all rows from both joined tables even when there is no match in the other. I changed it to an inner join: you only want customers that have rentals, and inventory that has rentals (though the HAVING clause would now eliminate the extras anyway). Try removing the HAVING clause and running the query with both full and inner joins to see the difference.
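The GROUP BY / HAVING idea can be sketched in miniature using SQLite through Python's sqlite3 module (the table and rows below are invented to mirror the example output; SQLite's date() function stands in for PostgreSQL's ::date cast on a timestamp):

```python
import sqlite3

# In-memory toy table shaped like the rental data in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (customer_id INTEGER, rental_date TEXT)")
conn.executemany(
    "INSERT INTO rental VALUES (?, ?)",
    [
        (1, "2005-05-25 10:00:00"),
        (1, "2005-05-28 11:00:00"),
        (1, "2005-06-15 09:00:00"),  # three rentals on the same day
        (1, "2005-06-15 14:00:00"),
        (1, "2005-06-15 18:00:00"),
        (2, "2005-06-16 12:00:00"),
    ],
)

# Group by the *day*, not the full timestamp, then keep only multi-rental days.
rows = conn.execute(
    """
    SELECT customer_id, date(rental_date) AS day, COUNT(*) AS cnt
    FROM rental
    GROUP BY customer_id, date(rental_date)
    HAVING COUNT(*) > 1
    ORDER BY customer_id, day
    """
).fetchall()
print(rows)  # [(1, '2005-06-15', 3)]
```

Grouping by the raw timestamp instead of date(rental_date) would return six groups of count 1, which is exactly the symptom in the question's update.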

Related

Distinct Count Dates by timeframe

I am trying to find the daily count of frequent visitors from a very large data-set. Frequent visitors in this case are visitor IDs used on 2 distinct days in a rolling 3 day period.
My data set looks like the below:
ID | Date | Location | State | Brand |
1 | 2020-01-02 | A | CA | XYZ |
1 | 2020-01-03 | A | CA | BCA |
1 | 2020-01-04 | A | CA | XYZ |
1 | 2020-01-06 | A | CA | YQR |
1 | 2020-01-06 | A | WA | XYZ |
2 | 2020-01-02 | A | CA | XYZ |
2 | 2020-01-05 | A | CA | XYZ |
This is the result I am going for. The count in the Visits column equals the number of distinct days from the date column within the current day and the two preceding days, for each ID. So for ID 1 on 2020-01-05, there were visits on the 3rd and the 4th, so the count is 2.
Date | ID | Visits | Frequent Prior 3 Days
2020-01-01 |Null| Null | Null
2020-01-02 | 1 | 1 | No
2020-01-02 | 2 | 1 | No
2020-01-03 | 1 | 2 | Yes
2020-01-03 | 2 | 1 | No
2020-01-04 | 1 | 3 | Yes
2020-01-04 | 2 | 1 | No
2020-01-05 | 1 | 2 | Yes
2020-01-05 | 2 | 1 | No
2020-01-06 | 1 | 2 | Yes
2020-01-06 | 2 | 1 | No
2020-01-07 | 1 | 1 | No
2020-01-07 | 2 | 1 | No
2020-01-08 | 1 | 1 | No
2020-01-09 | 1 | null | Null
I originally tried to use the following line to get the result for the visits column, but ended up with 3 in every successive row once the count first reached 3 for that ID.
,
count(ID) over (Partition by ID order by Date ASC rows between 3 preceding and current row) as visits
I've scoured the forum, but every somewhat similar question seems to involve counting the values rather than the dates and haven't been able to figure out how to tweak to get what I need. Any help is much appreciated.
You can aggregate the dataset by user and date, then use window functions with a range frame to look over the three-day window ending on each date.
You did not say which database you are running, and not all databases support range frames or use the same syntax for interval literals. In standard SQL, you would write:
select
id,
date,
count(*) cnt_visits,
case
when sum(count(*)) over(
partition by id
order by date
range between interval '2' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from mytable
group by id, date
On the other hand, if you want a record for every user and every day (even when there is no visit), then it is a bit different. You can generate the full user/date grid first, then bring in the table with a left join:
select
i.id,
d.date,
count(t.id) cnt_visits,
case
when sum(count(t.id)) over(
partition by i.id
order by d.date
rows between 2 preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from (select distinct id from mytable) i
cross join (select distinct date from mytable) d
left join mytable t
on t.date = d.date
and t.id = i.id
group by i.id, d.date
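The grid-plus-window-frame approach above can be exercised end to end with SQLite via Python's sqlite3 module (table name and rows are invented to match the question's sample; SQLite supports window functions since 3.25). Over the dense calendar produced by the cross join, ROWS BETWEEN 2 PRECEDING AND CURRENT ROW covers the three-day window ending on each day:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id INTEGER, date TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [
    (1, "2020-01-02"), (1, "2020-01-03"), (1, "2020-01-04"),
    (1, "2020-01-06"), (1, "2020-01-06"),  # duplicate visit on the same day
    (2, "2020-01-02"), (2, "2020-01-05"),
])

# Cross join distinct ids with distinct dates, left join the (deduplicated)
# visits back in, then count non-null ids over a 3-day trailing window.
rows = conn.execute("""
    SELECT i.id, d.date,
           COUNT(t.id) OVER w AS cnt_visits,
           CASE WHEN COUNT(t.id) OVER w >= 2 THEN 'Yes' ELSE 'No' END
             AS is_frequent_visitor
    FROM (SELECT DISTINCT id FROM visits) i
    CROSS JOIN (SELECT DISTINCT date FROM visits) d
    LEFT JOIN (SELECT DISTINCT id, date FROM visits) t
      ON t.date = d.date AND t.id = i.id
    WINDOW w AS (PARTITION BY i.id ORDER BY d.date
                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    ORDER BY d.date, i.id
""").fetchall()
for r in rows:
    print(r)
```

Note the DISTINCT in the joined subquery: without it, the two visits by ID 1 on 2020-01-06 would be counted twice, inflating the window count.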
I would be inclined to approach this by expanding out the days and visitors with a cross join and then using window functions. Assuming all dates appear in the data:
select i.id, d.date,
count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) as cnt_visits,
(case when count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) >= 2
then 'Yes' else 'No'
end) as is_frequent_visitor
from (select distinct id from t) i cross join
(select distinct date from t) d left join
(select distinct id, date from t) t
on t.date = d.date and
t.id = i.id;

Accomplishing what I need without a CROSS JOIN

I have a query that pulls from a table. With this table, I would like to build a query that allows me to make projections into the future.
SELECT
b.date,
a.id,
SUM(CASE WHEN a.date = b.date THEN a.sales ELSE 0 END) sales,
SUM(CASE WHEN a.date = b.date THEN a.revenue ELSE 0 END) revenue
FROM
table_a a
CROSS JOIN table_b b
WHERE a.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY 1,2
table_b is a table with literally only one column that contains dates going deep into the future. This returns results like this:
+----------+--------+-------+---------+
| date | id | sales | revenue |
+----------+--------+-------+---------+
| 11/4/18 | 113972 | 0 | 0 |
| 11/4/18 | 111218 | 0 | 0 |
| 11/3/18 | 111218 | 0 | 0 |
| 11/3/18 | 113972 | 0 | 0 |
| 11/2/18 | 111218 | 0 | 0 |
| 11/2/18 | 113972 | 0 | 0 |
| 11/1/18 | 111218 | 89 | 2405.77 |
| 11/1/18 | 113972 | 265 | 3000.39 |
| 10/31/18 | 111218 | 64 | 2957.71 |
| 10/31/18 | 113972 | 120 | 5650.91 |
+----------+--------+-------+---------+
Now there's more to the query after this where I get into the projections and what not, but for the purposes of this question, this is all you need, as it's where the CROSS JOIN exists.
How can I recreate these results without using a CROSS JOIN? In reality, this query covers a much larger date range with far more data, takes hours and a great deal of compute to run, and I know CROSS JOINs should be avoided if possible.
Use the table of all dates as the "from" table and LEFT JOIN the data; this still returns each date.
SELECT
d.date
, t.id
, COALESCE(SUM(t.sales),0) sales
, COALESCE(SUM(t.revenue),0) revenue
FROM all_dates d
LEFT JOIN table_data t
ON d.date = t.date
WHERE d.date BETWEEN '2018-10-31' AND '2018-11-04'
GROUP BY
d.date
, t.id
Another alternative (to avoid the cross join) could be to use generate_series, but for this, in Redshift, I suggest this former answer. I'm a fan of generate_series, but if you already have a date table I would probably stay with that (based on what little I know about your query).
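The date-spine LEFT JOIN pattern from the answer above can be sketched with SQLite through Python's sqlite3 module (the table names follow the answer; the rows are invented). Dates with no facts survive the join with NULLs, which COALESCE turns into zeros:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_dates (date TEXT)")
conn.execute(
    "CREATE TABLE table_data (date TEXT, id INTEGER, sales INTEGER, revenue REAL)"
)
conn.executemany("INSERT INTO all_dates VALUES (?)",
                 [("2018-10-31",), ("2018-11-01",), ("2018-11-02",)])
conn.executemany("INSERT INTO table_data VALUES (?, ?, ?, ?)", [
    ("2018-10-31", 111218, 64, 2957.71),
    ("2018-11-01", 111218, 89, 2405.77),
])  # note: no data for 2018-11-02

rows = conn.execute("""
    SELECT d.date, t.id,
           COALESCE(SUM(t.sales), 0)   AS sales,
           COALESCE(SUM(t.revenue), 0) AS revenue
    FROM all_dates d
    LEFT JOIN table_data t ON d.date = t.date
    GROUP BY d.date, t.id
    ORDER BY d.date
""").fetchall()
print(rows)
```

One caveat worth knowing: on dates with no data, t.id itself is NULL, so the zero-filled row is not broken out per id the way the original cross join was.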

How to join 2 tables without value duplication in PostgreSql

I am joining two tables using:
select table1.date, table1.item, table1.qty, table2.anotherQty
from table1
INNER JOIN table2
on table1.date = table2.date
table1
date | item | qty
july1 | itemA | 20
july1 | itemB | 30
july2 | itemA | 20
table2
date | anotherQty
july1 | 200
july2 | 300
Expected result should be:
date | item | qty | anotherQty
july1 | itemA | 20 | 200
july1 | itemB | 30 | null or 0
july2 | itemA | 20 | 300
So that when I sum(anotherQty) it will be 500 only, instead of:
date | item | qty | anotherQty
july1 | itemA | 20 | 200
july1 | itemB | 30 | 200
july2 | itemA | 20 | 300
That is 200+200+300 = 700
SQL DEMO
WITH T1 as (
SELECT *, ROW_NUMBER() OVER (PARTITION BY "date" ORDER BY "item") as rn
FROM Table1
), T2 as (
SELECT *, ROW_NUMBER() OVER (PARTITION BY "date" ORDER BY "anotherQty") as rn
FROM Table2
)
SELECT *
FROM t1
LEFT JOIN t2
ON t1."date" = t2."date"
AND t1.rn = t2.rn
OUTPUT
Filter the columns you want, and change the order if needed.
| date | item | qty | rn | date | anotherQty | rn |
|-------|-------|-----|----|--------|------------|--------|
| july1 | itemA | 20 | 1 | july1 | 200 | 1 |
| july1 | itemB | 30 | 2 | (null) | (null) | (null) |
| july2 | itemA | 20 | 1 | july2 | 300 | 1 |
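The ROW_NUMBER pairing trick above can be reproduced with SQLite via Python's sqlite3 module (rows are the question's sample data). Because anotherQty joins to only one row per date, summing it now yields 500 rather than 700:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (date TEXT, item TEXT, qty INTEGER)")
conn.execute("CREATE TABLE table2 (date TEXT, anotherQty INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", [
    ("july1", "itemA", 20), ("july1", "itemB", 30), ("july2", "itemA", 20),
])
conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                 [("july1", 200), ("july2", 300)])

# Number the rows within each date on both sides, then join on (date, rn)
# so each table2 row attaches to exactly one table1 row.
rows = conn.execute("""
    WITH t1 AS (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY date ORDER BY item) AS rn
      FROM table1
    ), t2 AS (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY date ORDER BY anotherQty) AS rn
      FROM table2
    )
    SELECT t1.date, t1.item, t1.qty, t2.anotherQty
    FROM t1
    LEFT JOIN t2 ON t1.date = t2.date AND t1.rn = t2.rn
""").fetchall()
total = sum(r[3] or 0 for r in rows)
print(rows)
print(total)  # 500, not 700
```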
Try the following code, but be aware that as long as the qty values differ across rows, the anotherQty field will still break out into distinct values:
select
table1.date,
table1.item,
table1.qty,
SUM(table2.anotherQty)
from table1
INNER JOIN table2
on table1.date = table2.date
GROUP BY
table1.item,
table1.qty,
table1.date
If you need it to always aggregate down to a single line per item/date, then you will need to add a SUM() to table1.qty as well. Alternately, you could run a common table expression (WITH() statement) for each quantity that you want, summing them within the common table expression, and then rejoining the expressions to your final SELECT statement.
Edit:
Based on the comment from @Juan Carlos Oropeza, I'm not sure there is a way to get the summed value of 500 while including table1.date in your query, because grouping the output by date causes the aggregation to split into distinct lines. The following query gets you the sum of anotherQty at the cost of dropping date:
select
table1.item,
SUM(table1.qty),
SUM(table2.anotherQty)
from table1
INNER JOIN table2
on table1.date = table2.date
GROUP BY
table1.item
If you need date to persist, you can get the sum to show up by using a WINDOW function, but note that this is essentially doing a running sum, and may throw off any subsequent summation you're doing on this query's output in terms of post-processing:
select
table1.item,
table1.date,
SUM(table1.qty),
SUM(table2.anotherQty) OVER (Partition By table1.item)
from table1
INNER JOIN table2
on table1.date = table2.date
GROUP BY
table1.item,
table1.date,
table2.anotherQty

Postgresql use more than one row as expression in sub query

As the title says, I need to create a query where I SELECT all items from one table and use those items as expressions in another query. Suppose I have the main table that looks like this:
main_table
-------------------------------------
id | name | location | //more columns
---|------|----------|---------------
1 | me | pluto | //
2 | them | mercury | //
3 | we | jupiter | //
And the sub query table looks like this:
some_table
---------------
id | item
---|-----------
1 | sub-col-1
2 | sub-col-2
3 | sub-col-3
where each item in some_table has a price which is in an amount_table like so:
amount_table
--------------
1 | 1000
2 | 2000
3 | 3000
So that the query returns results like this:
name | location | sub-col-1 | sub-col-2 | sub-col-3 |
----------------------------------------------------|
me | pluto | 1000 | | |
them | mercury | | 2000 | |
we | jupiter | | | 3000 |
My query currently looks like this
SELECT name, location, (SELECT item FROM some_table)
FROM main_table
INNER JOIN amount_table WHERE //match the id's
But I'm running into the error more than one row returned by a subquery used as an expression
How can I formulate this query to return the desired results?
You should decide on the expected result.
To get a one-to-many relation:
SELECT name, location, some_table.item
FROM main_table
JOIN some_table on true -- or id if they match
INNER JOIN amount_table --WHERE match the id's
To get one-to-one, with all rows collected into an array:
SELECT name, location, (SELECT array_agg(item) FROM some_table)
FROM main_table
INNER JOIN amount_table --WHERE //match the id's
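The "more than one row returned by a subquery" error and the aggregate fix can be illustrated with SQLite through Python's sqlite3 module (tables trimmed from the question's sample; SQLite has no array_agg, so group_concat plays the analogous role of collapsing the subquery to a single value per outer row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_table (id INTEGER, name TEXT, location TEXT)")
conn.execute("CREATE TABLE some_table (id INTEGER, item TEXT)")
conn.executemany("INSERT INTO main_table VALUES (?, ?, ?)",
                 [(1, "me", "pluto"), (2, "them", "mercury")])
conn.executemany("INSERT INTO some_table VALUES (?, ?)",
                 [(1, "sub-col-1"), (2, "sub-col-2")])

# A bare (SELECT item FROM some_table) returns two rows and fails as a
# scalar expression; aggregating first makes it a single value per row.
rows = conn.execute("""
    SELECT name, location,
           (SELECT group_concat(item) FROM some_table) AS items
    FROM main_table
""").fetchall()
print(rows)
```

In PostgreSQL, array_agg(item) would return a proper array instead of a comma-joined string, but the shape of the fix is the same.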

Postgresql : Filtering duplicate pair

I am asking this from mobile, so apologies for bad formatting. For the following table.
Table players
| ID | name |matches_won|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
| 1 | bob | 3 |
| 2 | Paul | 2 |
| 3 | John | 4 |
| 4 | Jim | 1 |
| 5 | hal | 0 |
| 6 | fin | 0 |
I want to pair two players together in a query, matching players who have won a similar or nearly similar number of matches. The query should display the following result.
| ID | NAME | ID | NAME |
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
| 3 | John | 1 | bob |
| 2 | paul | 4 | Jim |
| 5 | hal | 6 | fin |
Until now I have tried this query. But it gives repeat pairs.
Select player1.ID,player1.name,player2.ID,player2.name
From player as player1,
player as player2
Where
player1.matches_won >= player2.matches_won
And player1.ID != player2.ID;
The query pairs the player with the most won matches with every other player, while I want each player to appear only once in the result, paired with the player nearest to him in wins.
I have tried subqueries, but I don't know how to go about it, since a subquery only returns one result. Also, aggregates don't work in the WHERE clause, so I am not sure how to achieve this.
An easier way, IMHO, to achieve this would be to order the players by their number of wins, divide these ranks by two to create match groups, and self join. CTEs (WITH expressions) allow you to do this relatively elegantly:
WITH wins AS (
SELECT id, name, ROW_NUMBER() OVER (ORDER BY matches_won DESC) AS rn
FROM players
)
SELECT w1.id, w1.name, w2.id, w2.name
FROM (SELECT id, name, rn / 2 AS rn
FROM wins
WHERE rn % 2 = 1) w1
LEFT JOIN (SELECT id, name, (rn - 1) / 2 AS rn
FROM wins
WHERE rn % 2 = 0) w2 ON w1.rn = w2.rn
Add row numbers in descending order by won matches to the table and join odd row numbers with adjacent even row numbers:
with players as (
select *, row_number() over (order by matches_won desc) rn
from player)
select a.id, a.name, b.id, b.name
from players a
join players b
on a.rn = b.rn - 1
where a.rn % 2 = 1
id | name | id | name
----+------+----+------
3 | John | 1 | bob
2 | Paul | 4 | Jim
5 | hal | 6 | fin
(3 rows)
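The adjacent-row-number pairing from the second answer can be checked with SQLite via Python's sqlite3 module (rows are the question's sample data; the two players tied at 0 wins may pair in either order, since ROW_NUMBER has no tiebreaker):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (id INTEGER, name TEXT, matches_won INTEGER)")
conn.executemany("INSERT INTO player VALUES (?, ?, ?)", [
    (1, "bob", 3), (2, "Paul", 2), (3, "John", 4),
    (4, "Jim", 1), (5, "hal", 0), (6, "fin", 0),
])

# Rank players by wins, then join each odd rank to the next even rank,
# so every player appears exactly once with their nearest neighbor.
rows = conn.execute("""
    WITH ranked AS (
      SELECT *, ROW_NUMBER() OVER (ORDER BY matches_won DESC) AS rn
      FROM player
    )
    SELECT a.id, a.name, b.id, b.name
    FROM ranked a
    JOIN ranked b ON a.rn = b.rn - 1
    WHERE a.rn % 2 = 1
""").fetchall()
for r in rows:
    print(r)
```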