How to achieve matching with same table in postgresql?

I want to count the number of sites, partitioned by country, using the criteria below.
For each TrialSite with Trial_Start_Date = X and TrialSite_Activation_Date = Y, count all rows that meet these conditions:
Trial_Start_Date <= Y, AND
TrialSite_Activation_Date >= X
i.e. all rows whose period from Trial Start Date to TrialSite Activation Date overlaps with that row's period.
sample data example:
Trial id start_date country site_id trialSite_activation_date
Trial A 01-01-2017 India site1 01-02-2017 ----> 2 (only overlaps with itself, and with 2nd row)
Trial A 01-01-2017 India site 2 01-04-2017 ----> 4 (overlaps with all rows, including itself)
Trial B 02-03-2017 India site3 01-04-2017 ----> 3 (does not overlap with first row, since Trial_Start_Date > 01-02-2017)
Trial B 02-03-2017 India site4 01-04-2017 ----> 3
The data can contain multiple countries, and this logic must be applied only among records from the same country.

You could use the “overlaps” operator && on dateranges:
SELECT t1, count(*)
FROM trial t1
JOIN trial t2
ON t1.country = t2.country -- compare only rows from the same country
AND daterange(t1.trial_start_date, t1.trialsite_activation_date, '[]')
&& daterange(t2.trial_start_date, t2.trialsite_activation_date, '[]')
GROUP BY t1;
t1 | count
-----------------------------------------------+-------
("Trial A",2017-01-01,India,site1,2017-02-01) | 2
("Trial A",2017-01-01,India,site2,2017-04-01) | 4
("Trial B",2017-03-02,India,site3,2017-04-01) | 3
("Trial B",2017-03-02,India,site4,2017-04-01) | 3
(4 rows)
Instead of using the whole-row reference t1 in the SELECT list, you can specify individual columns there, but then you have to list them in the GROUP BY clause as well.
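If range types are not available, the same overlap test can be written with the two plain comparisons from the question. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, stores the dates as ISO strings (an assumption made so that string comparison orders them correctly), and restricts the join to rows from the same country.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; ISO date strings compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trial (
        trial_id TEXT, trial_start_date TEXT, country TEXT,
        site_id TEXT, trialsite_activation_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO trial VALUES (?, ?, ?, ?, ?)",
    [
        ("Trial A", "2017-01-01", "India", "site1", "2017-02-01"),
        ("Trial A", "2017-01-01", "India", "site2", "2017-04-01"),
        ("Trial B", "2017-03-02", "India", "site3", "2017-04-01"),
        ("Trial B", "2017-03-02", "India", "site4", "2017-04-01"),
    ],
)

overlap_counts = conn.execute("""
    SELECT t1.site_id, COUNT(*) AS overlapping_sites
    FROM trial AS t1
    JOIN trial AS t2
      ON t2.country = t1.country                              -- same country only
     AND t2.trial_start_date <= t1.trialsite_activation_date  -- t2 starts on or before t1 ends
     AND t2.trialsite_activation_date >= t1.trial_start_date  -- t2 ends on or after t1 starts
    GROUP BY t1.site_id
    ORDER BY t1.site_id
""").fetchall()

for site_id, n in overlap_counts:
    print(site_id, n)
```

Each row joins to itself, which is why every count is at least 1, matching the expected output in the question.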

Related

T-SQL vlookup with fake calendar table?

I am rather new to T-SQL and I have to create a view whose output looks like this:
[image: expected output]
But my sales table doesn't have any data about sales in February and May for customer ABC and no data in January for customer XYZ, but I really want to have 0 for these months. How to do it in T-SQL?

This is a great question about a very important topic that even many experienced developers need to brush up on. Since you are rather new to T-SQL, I won't just offer a solution; I'll explain the key concepts involved.
The Auxiliary Table Numbers
First, let's learn what a tally table (aka numbers table) is all about.
What does this do?
SELECT N = 1 ;
It returns the number 1.
N
-----
1
How about this?
SELECT N = 1 FROM (VALUES(0)) AS e(N);
Same thing:
N
-----
1
What does this return?
SELECT N = 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n);
Here I'm leveraging the VALUES table constructor, which allows a list of values to be treated like a table. This returns:
N
-------
1
1
1
1
1
We don't need the ones, we need the rows. This will make more sense in a moment. Now, what does this do?
WITH e(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n))
SELECT N = 1 FROM e e1;
It returns the same thing, five 1's, but I've wrapped the code in a CTE named e. Think of CTEs as inline named views that you can reference multiple times within a query. Now let's CROSS JOIN e to itself. This returns 25 dummy rows (5*5).
WITH e(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n))
SELECT N = 1 FROM e e1, e e2;
Next we leverage ROW_NUMBER() over our set of dummy values.
WITH E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n))
SELECT N = ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a;
Returns (truncated for brevity):
N
--------------------
1
2
3
...
24
25
Using as an auxiliary numbers table
@OneToTen is a table with random numbers from 1 to 10. I need to count how many times each number occurs, returning 0 when a number is missing. NOTE MY COMMENTS:
;--== 2. Simple Use Case - Counting all numbers, including missing ones (missing = 0)
DECLARE @OneToTen TABLE (N INT);
INSERT @OneToTen VALUES(1),(2),(2),(2),(4),(8),(8),(10),(10),(10);
WITH E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)),
iTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a)
SELECT
N = i.N,
Wrong = COUNT(*), -- WRONG!!! Don't do THIS, this counts ALL rows returned
Correct = COUNT(t.N) -- Correct, this counts numbers from @OneToTen AKA "t.N"
FROM iTally AS i -- Aux Table of numbers
LEFT JOIN @OneToTen AS t -- Table to evaluate
ON i.N = t.N -- LEFT JOIN @OneToTen numbers to our Aux table of numbers
WHERE i.N <= 10 -- We only need the numbers 1 to 10
GROUP BY i.N; -- Group by with no Sort!!!
This returns:
N Wrong Correct
----- ----------- -----------
1 1 1
2 3 3
3 1 0
4 1 1
5 1 0
6 1 0
7 1 0
8 2 2
9 1 0
10 3 3
Note that I show both the wrong and the right way to do this. COUNT(*) is wrong here because the LEFT JOIN produces a row for every number even when there is no match; you need COUNT(whatever you are counting), which ignores NULLs.
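The COUNT(*) versus COUNT(column) behaviour is standard SQL rather than anything specific to T-SQL, so it can be reproduced elsewhere. Here is a rough sketch using Python's built-in sqlite3 module; the table name one_to_ten and the recursive CTE that generates 1..10 are stand-ins chosen for this illustration, not part of the T-SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE one_to_ten (n INTEGER)")
conn.executemany(
    "INSERT INTO one_to_ten VALUES (?)",
    [(v,) for v in (1, 2, 2, 2, 4, 8, 8, 10, 10, 10)],
)

counts = conn.execute("""
    WITH RECURSIVE tally(n) AS (      -- auxiliary numbers 1..10
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM tally WHERE n < 10
    )
    SELECT i.n,
           COUNT(*)   AS wrong,       -- counts the joined row even when t.n is NULL
           COUNT(t.n) AS correct      -- counts only non-NULL matches
    FROM tally AS i
    LEFT JOIN one_to_ten AS t ON t.n = i.n
    GROUP BY i.n
    ORDER BY i.n
""").fetchall()

for row in counts:
    print(row)
```

The missing numbers (3, 5, 6, 7, 9) come back with wrong = 1 and correct = 0, the same pattern as in the T-SQL result above.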
Auxiliary table of Dates (AKA calendar table)
Now we use our numbers table to create a calendar table.
;--== 3. Auxiliary Month/Year Calendar Table
DECLARE @Start DATE = '20191001',
@End DATE = '20200301';
WITH E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)),
iTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a)
SELECT TOP(DATEDIFF(MONTH,@Start,@End)+1)
TheDate = f.Dt,
TheYear = YEAR(f.Dt),
TheMonth = MONTH(f.Dt),
TheWeekday = DATEPART(WEEKDAY,f.Dt),
DayOfTheYear = DATEPART(DAYOFYEAR,f.Dt),
LastDayOfMonth = EOMONTH(f.Dt)
FROM iTally AS i
CROSS APPLY (VALUES(DATEADD(MONTH, i.N-1, @Start))) AS f(Dt);
This returns:
TheDate TheYear TheMonth TheWeekday DayOfTheYear LastDayOfMonth
---------- ----------- ----------- ----------- ------------ --------------
2019-10-01 2019 10 3 274 2019-10-31
2019-11-01 2019 11 6 305 2019-11-30
2019-12-01 2019 12 1 335 2019-12-31
2020-01-01 2020 1 4 1 2020-01-31
2020-02-01 2020 2 7 32 2020-02-29
2020-03-01 2020 3 1 61 2020-03-31
You will only need the YEAR and MONTH.
The Auxiliary Customer table
Because you are performing aggregations (SUM,COUNT,etc.) against multiple customers we will also need an Auxiliary table of customers, more commonly known as a lookup or dimension.
SAMPLE DATA:
;--== Sample Data
DECLARE @sale TABLE
(
Customer VARCHAR(10),
SaleYear INT,
SaleMonth TINYINT,
SaleAmt DECIMAL(19,2),
INDEX idx_cust(Customer)
);
INSERT @sale
VALUES('ABC',2019,12,410),('ABC',2020,1,668),('ABC',2020,1,50), ('ABC',2020,3,250),
('CDF',2019,10,200),('CDF',2019,11,198),('CDF',2020,1,333),('CDF',2020,2,5000),
('CDF',2020,2,325),('CDF',2020,3,1105),('FRED',2018,11,1105);
Distinct list of customers for an "Auxiliary Table of Customers"
SELECT DISTINCT s.Customer FROM @sale AS s;
For my sample data we get:
Customer
----------
ABC
CDF
FRED
Putting it all together
Here I'm going to:
Create a numbers table
Use my numbers table to create a calendar table
Create an auxiliary Customer table from @sale
CROSS JOIN (combine) both tables for a "junk dimension"
LEFT JOIN our sales data to our calendar/customer auxiliary tables/junk dimension
Group by the auxiliary table values
SOLUTION:
;--==== SAMPLE DATA
DECLARE @sale TABLE
(
Customer VARCHAR(10),
SaleYear INT,
SaleMonth TINYINT,
SaleAmt DECIMAL(19,2),
INDEX idx_cust(Customer)
);
INSERT @sale
VALUES('ABC',2019,12,410),('ABC',2020,1,668),('ABC',2020,1,50), ('ABC',2020,3,250),
('CDF',2019,10,200),('CDF',2019,11,198),('CDF',2020,1,333),('CDF',2020,2,5000),
('CDF',2020,2,325),('CDF',2020,3,1105),('FRED',2018,11,1105);
;--==== START/END DATEs
DECLARE @Start DATE = '20191001',
@End DATE = '20200301';
;--==== FINAL SOLUTION
WITH -- 6.1. Auxiliary Table of numbers:
E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)),
iTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a),
-- 6.2. Use numbers table to create an "Auxiliary Date Table" (Calendar Table):
MonthYear(SaleYear,SaleMonth) AS
(
SELECT TOP(DATEDIFF(MONTH,@Start,@End)+1) YEAR(f.Dt), MONTH(f.Dt)
FROM iTally AS i
CROSS APPLY (VALUES(DATEADD(MONTH, i.N-1, @Start))) AS f(Dt)
)
SELECT
Customer = cust.Customer,
MonthYear = CONCAT(cal.SaleYear,'-',cal.SaleMonth),
Sales = ISNULL(SUM(s.SaleAmt),0)
-- Auxiliary Table of Customers
FROM (SELECT DISTINCT s.Customer FROM @sale AS s) AS cust -- 6.3. Aux Customer Table
CROSS JOIN MonthYear AS cal -- 6.4. Cross join to create Calendar/Customer Junk Dimension
LEFT JOIN @sale AS s -- 6.5. Join @sale to Junk Dimension on Year,Month and Customer
ON s.SaleYear = cal.SaleYear
AND s.SaleMonth = cal.SaleMonth
AND s.Customer = cust.Customer
GROUP BY cust.Customer, cal.SaleYear, cal.SaleMonth -- 6.6. Group by Junk Dim values
ORDER BY cust.Customer, cal.SaleYear, cal.SaleMonth; -- Order by not required
RESULTS:
Customer MonthYear Sales
---------- ------------ ------------
ABC 2019-10 0.00
ABC 2019-11 0.00
ABC 2019-12 410.00
ABC 2020-1 718.00
ABC 2020-2 0.00
ABC 2020-3 250.00
CDF 2019-10 200.00
CDF 2019-11 198.00
CDF 2019-12 0.00
CDF 2020-1 333.00
CDF 2020-2 5325.00
CDF 2020-3 1105.00
FRED 2019-10 0.00
FRED 2019-11 0.00
FRED 2019-12 0.00
FRED 2020-1 0.00
FRED 2020-2 0.00
FRED 2020-3 0.00

How to find MAX(date) from BETWEEN(dates) in column 2 with DUPLICATES in column 1?

I have a Database that has product names in column 1 and product release dates in column 2. I want to find 'old' products by their release date. However, I'm only interested in finding 'old' products that released a minimum of 1 year ago. I cannot make any edits to the original database infrastructure.
The table looks like this:
Product| Release_Day
A | 2018-08-23
A | 2017-08-23
A | 2019-08-21
B | 2018-08-22
B | 2016-08-22
B | 2017-08-22
C | 2018-10-25
C | 2016-10-25
C | 2019-08-19
I have already tried multiple versions of DISTINCT, MAX, BETWEEN, >, <, etc.
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2015-08-22' and '2018-08-22'
and release_day not between '2018-08-23' and '2019-08-22'
GROUP BY 1
ORDER BY MAX(release_day) DESC
The expected results should not contain any products found by this query:
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2018-08-23' and '2019-08-22'
AND product = 'A'
GROUP BY 1
However, every check I complete returns a product from this date range.
This is the output of the initial query:
Product|Most_Recent_Release
A | 2018-08-23
B | 2018-08-22
C | 2015-10-25
And, for example, if I run the check query on Product A, I get this:
Product|Most_Recent_Release
A | 2019-08-21

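Use HAVING to filter on the most recent release date per product:
SELECT product, MAX(release_day) AS most_recent_release
FROM Product_Release
GROUP BY product
HAVING MAX(release_day) < '2018-08-23'
ORDER BY most_recent_release DESC
A WHERE clause filters individual rows before grouping, so it can never exclude a product based on its latest release; HAVING filters after the aggregate has been computed. Repeating MAX(release_day) in HAVING is portable; MySQL also accepts the alias most_recent_release there. There's no need to use DISTINCT when you use GROUP BY: you can't get duplicates if there's only one row per product.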
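To check the HAVING approach against the sample rows from the question, here is a small sketch using Python's built-in sqlite3 module as a stand-in database (the question does not say which engine is in use). The cutoff date '2018-08-23' is taken from the question; the HAVING clause repeats the aggregate so that it does not depend on alias support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_release (product TEXT, release_day TEXT)")
conn.executemany(
    "INSERT INTO product_release VALUES (?, ?)",
    [
        ("A", "2018-08-23"), ("A", "2017-08-23"), ("A", "2019-08-21"),
        ("B", "2018-08-22"), ("B", "2016-08-22"), ("B", "2017-08-22"),
        ("C", "2018-10-25"), ("C", "2016-10-25"), ("C", "2019-08-19"),
    ],
)

old_products = conn.execute("""
    SELECT product, MAX(release_day) AS most_recent_release
    FROM product_release
    GROUP BY product
    HAVING MAX(release_day) < '2018-08-23'   -- filter on the aggregate, after grouping
    ORDER BY most_recent_release DESC
""").fetchall()

print(old_products)
```

Only product B survives, because A and C each have a release inside the last year even though they also have older releases.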

Fill in missing rows when aggregating over multiple fields in Postgres

I am aggregating sales for a set of products per day using Postgres and need to know not just when sales do happen, but also when they do not for further processing.
SELECT
sd.date,
COUNT(sd.sale_id) AS sales,
sd.product
FROM sales_data sd
-- sales per product, per day
GROUP BY sd.product, sd.date
ORDER BY sd.product, sd.date
This produces the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-17 | 2 | shower gel
2017-08-21 | 1 | shower gel
As you can see, the date ranges per product are not continuous, because sales_data doesn't contain any rows for these products on some days.
What I'm aiming to do is add a sales = 0 row for each product on every day in a range when it was not sold (for example here, between 2017-08-17 and 2017-08-21), to give something like the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-18 | 0 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-21 | 0 | soap
2017-08-17 | 2 | shower gel
2017-08-18 | 0 | shower gel
2017-08-19 | 0 | shower gel
2017-08-20 | 0 | shower gel
2017-08-21 | 1 | shower gel
In a simpler case where there was only a single product, it seems like the solution would be to use generate_series() i.e.:
create a full range of dates using generate_series
LEFT JOIN the already aggregated sales data onto the date series
COALESCE any NULL counts to 0 in the missing rows
The problem I have is that this approach does not seem to work when dates repeat in the aggregated data, as I'm grouping over not just multiple dates but multiple products as well.
It feels like I should be able to do something cunning with window functions here to solve this e.g. joining onto the full date range over partitions defined by the product name - but I can't see a way of actually getting this to work.

You could use:
WITH cte AS (
SELECT date, s.product
FROM ... -- some way to generate date series
CROSS JOIN (SELECT DISTINCT product FROM sales_data) s
)
SELECT
c.date,
c.product,
COUNT(sd.sale_id) AS sales
FROM cte c
LEFT JOIN sales_data sd
ON c.date = sd.date AND c.product= sd.product
GROUP BY c.date, c.product
ORDER BY c.date, c.product;
First create the Cartesian product of dates and products, then LEFT JOIN it to the actual data and do the calculations.
Oracle has a great feature for this scenario called Partitioned Outer Joins:
SELECT times.time_id, product, quantity
FROM inventory PARTITION BY (product)
RIGHT OUTER JOIN times ON (times.time_id = inventory.time_id)
WHERE times.time_id BETWEEN TO_DATE('01/04/01', 'DD/MM/YY')
AND TO_DATE('06/04/01', 'DD/MM/YY')
ORDER BY 2,1;
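As a rough end-to-end illustration of the "dates x products, then LEFT JOIN" idea, the sketch below uses Python's built-in sqlite3 module in place of Postgres. The raw rows in sales_data are invented for the example (the question only shows aggregated output), and the recursive CTE days is a stand-in for generate_series, which SQLite does not provide by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (sale_id INTEGER, product TEXT, date TEXT)")
# Invented sample rows, one per sale (assumption for this sketch only).
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?)",
    [
        (1, "soap", "2017-08-17"),
        (2, "soap", "2017-08-17"),
        (3, "soap", "2017-08-19"),
        (4, "shower gel", "2017-08-17"),
        (5, "shower gel", "2017-08-21"),
    ],
)

filled = conn.execute("""
    WITH RECURSIVE
    bounds(lo, hi) AS (SELECT MIN(date), MAX(date) FROM sales_data),
    days(day, hi) AS (                       -- one row per calendar day in [lo, hi]
        SELECT lo, hi FROM bounds
        UNION ALL
        SELECT date(day, '+1 day'), hi FROM days WHERE day < hi
    ),
    grid AS (                                -- every (day, product) combination
        SELECT d.day, p.product
        FROM days AS d
        CROSS JOIN (SELECT DISTINCT product FROM sales_data) AS p
    )
    SELECT g.day, g.product, COUNT(sd.sale_id) AS sales
    FROM grid AS g
    LEFT JOIN sales_data AS sd
      ON sd.date = g.day AND sd.product = g.product
    GROUP BY g.product, g.day
    ORDER BY g.product, g.day
""").fetchall()

for row in filled:
    print(row)
```

COUNT(sd.sale_id) yields 0 for combinations with no matching sale, so no COALESCE is needed when counting; a SUM over the same join would need one.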

select
date,
count(sale_id) as sales,
product
from
sales_data
right join (
(
select d::date as date
from generate_series (
(select min(date) from sales_data),
(select max(date) from sales_data),
'1 day'
) gs (d)
) gs
cross join
(select distinct product from sales_data) p
) cj using (product, date)
group by product, date
order by product, date

How can 'brand new, never before seen' IDs be counted per month in redshift?

A fair amount of material is available detailing methods that use dense_rank() and the like to count distinct somethings per month. However, I've been unable to find anything that gives a distinct count per month while also discounting any ids that have already been seen in prior months.
The data can be imagined like so:
id (int8 type) | observed time (timestamp utc)
------------------
1 | 2017-01-01
2 | 2017-01-02
1 | 2017-01-02
1 | 2017-02-02
2 | 2017-02-03
3 | 2017-02-04
1 | 2017-03-01
3 | 2017-03-01
4 | 2017-03-01
5 | 2017-03-02
The process of the count can be seen as:
1: in 2017-01 we saw devices 1 and 2 so the count is 2
2: in 2017-02 we saw devices 1, 2 and 3. We know already about devices 1 and 2, but not 3, so the count is 1
3: in 2017-03 we saw devices 1, 3, 4 and 5. We already know about 1 and 3, but not 4 or 5, so the count is 2.
with the desired output being something like:
observed time | count of new id
--------------------------
2017-01 | 2
2017-02 | 1
2017-03 | 2
Explicitly, I am looking to have a new table, with an aggregated month per row, with a count of how many new ids occur within that month that have not been seen at all before.
The IRL case allows devices to be seen more than once in a month, but this shouldn't impact the count. It also uses integer for storage (both positive and negative) of the id, and time periods will be to the second in true timestamps. The size of the data set is also significant.
My initial attempt is along the lines of:
WITH records_months AS (
SELECT *,
date_trunc('month', observed_time) AS month_group
FROM my_table
WHERE observed_time > '2017-01-01'),
id_months AS (
SELECT DISTINCT
month_group,
id
FROM records_months
GROUP BY month_group, id)
SELECT *
FROM id_months
However, I'm stuck on the next part, i.e. counting the number of new IDs that were not seen in prior months. I believe the solution might be a window function, but I'm having trouble working out which or how.

First thing I thought of. The idea is to
(innermost query) calculate the earliest month that each id was seen,
(next level up) join that back to the main my_table dataset, and then
(outer query) count distinct ids by month after nulling out the already-seen ids.
I tested it out and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (vs. a window function). Hopefully this is performant enough for your Redshift!
select observed_month,
-- Null out the id if the observed_month that we're grouping by
-- is NOT the earliest month that the id was seen.
-- Then count distinct id
count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
select t.id,
date_trunc('month', t.observed_time) as observed_month,
earliest.earliest_month
from my_table t
join (
-- What's the earliest month an id was seen?
select id,
date_trunc('month', min(observed_time)) as earliest_month
from my_table
group by 1
) earliest
on t.id = earliest.id
) observed -- alias for the derived table (required by some PostgreSQL-family engines)
group by 1
order by 1;
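Since every id has exactly one earliest month, a shorter variant is to compute that month per id and then count ids per month, skipping the join back to the main table. This is a different formulation from the query above, sketched here with Python's built-in sqlite3 module standing in for Redshift; strftime('%Y-%m', ...) plays the role of date_trunc('month', ...). One caveat: a month in which no new id appears produces no row at all, whereas the query above would return that month with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, observed_time TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        (1, "2017-01-01"), (2, "2017-01-02"), (1, "2017-01-02"),
        (1, "2017-02-02"), (2, "2017-02-03"), (3, "2017-02-04"),
        (1, "2017-03-01"), (3, "2017-03-01"), (4, "2017-03-01"),
        (5, "2017-03-02"),
    ],
)

new_ids_per_month = conn.execute("""
    SELECT first_month, COUNT(*) AS num_new_ids
    FROM (
        -- earliest month in which each id was observed (one row per id)
        SELECT id, strftime('%Y-%m', MIN(observed_time)) AS first_month
        FROM my_table
        GROUP BY id
    ) AS first_seen
    GROUP BY first_month
    ORDER BY first_month
""").fetchall()

print(new_ids_per_month)
```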

Postgres aggregate sum conditional on row comparison

So, I have data that looks something like this
User_Object | filesize | created_date | deleted_date
row 1 | 40 | May 10 | Aug 20
row 2 | 10 | June 3 | Null
row 3 | 20 | Nov 8 | Null
I'm building statistics on user data usage, to graph as time-based data points. However, I'm having difficulty developing a query that computes, for each row, the sum of all rows before it, but only for the rows that still existed at the time of that row's creation. Before taking this step to incorporate deleted values, I had a simple naive query like this:
SELECT User_Object.id, User_Object.created, SUM(filesize) OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
However, I want to alter this so that the window function sums only the rows created before this User Object that don't also have a deleted date before this User Object.
This incorrect syntax illustrates what I want to do:
SELECT User_Object.id, User_Object.created,
SUM(CASE WHEN NOT window_function_row.deleted
OR window_function_row.deleted > User_Object.created
THEN filesize ELSE 0)
OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
When this function runs on the data that I have, it should output something like
id | created | sum_data_used|
1 | May 10 | 40
2 | June 3 | 50
3 | Nov 8 | 30

Something along these lines may work for you:
SELECT a.user_id
,MIN(a.created_date) AS created_date
,SUM(b.filesize) AS sum_data_used
FROM user_object a
JOIN user_object b ON (b.user_id <= a.user_id
AND COALESCE(b.deleted_date, a.created_date) >= a.created_date)
GROUP BY a.user_id
ORDER BY a.user_id
For each row, self-join to the rows with a lower or equal id that had not been deleted by the time it was created. It will be expensive because each row needs to look through the entire table to calculate the file size total. There is no cumulative operation taking place here, but I'm not sure there is a way around that.
Example table definition:
create table user_object(user_id int, filesize int, created_date date, deleted_date date);
Data:
1;40;2016-05-10;2016-08-29
2;10;2016-06-03;<NULL>
3;20;2016-11-08;<NULL>
Result:
1;2016-05-10;40
2;2016-06-03;50
3;2016-11-08;30
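The query above relies on user_id increasing with created_date. If that cannot be assumed, the join can compare creation dates directly instead. The sketch below is one possible variant, not the answer's exact query; it runs on Python's built-in sqlite3 module as a stand-in for Postgres, with the same three sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_object (
        user_id INTEGER, filesize INTEGER, created_date TEXT, deleted_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO user_object VALUES (?, ?, ?, ?)",
    [
        (1, 40, "2016-05-10", "2016-08-29"),
        (2, 10, "2016-06-03", None),
        (3, 20, "2016-11-08", None),
    ],
)

usage = conn.execute("""
    SELECT a.user_id, a.created_date, SUM(b.filesize) AS sum_data_used
    FROM user_object AS a
    JOIN user_object AS b
      ON b.created_date <= a.created_date          -- b already existed when a was created
     AND (b.deleted_date IS NULL                   -- b was never deleted, or
          OR b.deleted_date > a.created_date)      -- b was deleted only after a was created
    GROUP BY a.user_id, a.created_date
    ORDER BY a.created_date
""").fetchall()

for row in usage:
    print(row)
```

It is still a self-join that compares every row with every earlier row, so the cost concern raised above applies here too.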