Optimized querying in PostgreSQL

Assume you have a table named tracker with the following records:
issue_id | ingest_date         | verb,status
10       | 2015-01-24 00:00:00 | 1,1
10       | 2015-01-25 00:00:00 | 2,2
10       | 2015-01-26 00:00:00 | 2,3
10       | 2015-01-27 00:00:00 | 3,4
11       | 2015-01-10 00:00:00 | 1,3
11       | 2015-01-11 00:00:00 | 2,4
I need the following results:
10 | 2015-01-26 00:00:00 | 2,3
11 | 2015-01-11 00:00:00 | 2,4
I am trying out this query:
select *
from etl_change_fact
where ingest_date = (select max(ingest_date)
                     from etl_change_fact);
However, this gives me only this record:
10 | 2015-01-26 00:00:00 | 2,3
But I want all unique records (change_id) with
(a) the max(ingest_date), AND
(b) the verb column prioritized as 2 (first preferred), 1 (second preferred), 3 (last preferred).
Hence, I need the following results:
10 | 2015-01-26 00:00:00 | 2,3
11 | 2015-01-11 00:00:00 | 2,4
Please help me to query this efficiently.
P.S.:
I am not supposed to index ingest_date because I am going to set it as the "distribution key" in a distributed computing setup.
I am a newbie to data warehousing and querying, so please help me with an optimized way to hit my TB-sized DB.

This is a typical "greatest-n-per-group" problem. If you search for that tag here, you'll find plenty of solutions, including ones for MySQL.
For Postgres the quickest way to do it is using distinct on (which is a Postgres proprietary extension to the SQL language):
select distinct on (issue_id) issue_id, ingest_date, verb, status
from etl_change_fact
order by issue_id,
         case verb
             when 2 then 1
             when 1 then 2
             else 3
         end,
         ingest_date desc;
You can enhance your original query with a correlated sub-query to achieve the same thing:
select f1.*
from etl_change_fact f1
where f1.ingest_date = (select max(f2.ingest_date)
                        from etl_change_fact f2
                        where f1.issue_id = f2.issue_id);
Edit
For an outdated and unsupported Postgres version, you can probably get away with using something like this:
select f1.*
from etl_change_fact f1
where f1.ingest_date = (select f2.ingest_date
                        from etl_change_fact f2
                        where f1.issue_id = f2.issue_id
                        order by case verb
                                     when 2 then 1
                                     when 1 then 2
                                     else 3
                                 end,
                                 ingest_date desc
                        limit 1);
SQLFiddle example: http://sqlfiddle.com/#!15/3bb05/1
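If you ever need a variant that also runs on engines without distinct on, the same greatest-n-per-group logic can be written with row_number(). This is only a sketch using the column names from the question; the ranked alias and rn column are illustrative names, not part of the original schema:
select issue_id, ingest_date, verb, status
from (
    select f.*,
           row_number() over (
               partition by issue_id
               order by case verb when 2 then 1 when 1 then 2 else 3 end,
                        ingest_date desc
           ) as rn
    from etl_change_fact f
) ranked
where rn = 1;  -- rn = 1 keeps, per issue_id, the preferred verb with the latest ingest_date
It returns the same rows as the distinct on query, just with a portable (and usually slightly more expensive) plan.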

Related

How to apply partition by in lag function using postgresql

I have a table like the one shown below:
subject_id | date_inside         | value
1          | 2110-02-12 19:41:00 | 1.3
1          | 2110-02-15 01:40:00 | 1.4
1          | 2110-02-15 02:40:00 | 1.5
2          | 2110-04-15 04:07:00 | 1.6
2          | 2110-04-15 08:00:00 | 1.7
2          | 2110-04-15 18:30:00 | 1.8
I would like to compute the date difference between consecutive rows for each subject
I tried the query below:
select a.subject_id, a.date_inside, a.value,
       a.date_inside - lag(a.date_inside) over (order by a.date_inside) as difference
from table1 a
While the above works, I am not able to apply partition by for each subject, so it ends up calculating the difference across all the rows (without considering the subject_id). Basically, the last row of each subject has to be NULL because that is the subject's last row (and should not be subtracted from the next subject's first record).
I expect my output to be like the one shown below:
subject_id | date_inside         | difference
1          | 2110-02-12 19:41:00 | 66 hours
1          | 2110-02-15 01:40:00 | 1 hour
1          | 2110-02-15 02:40:00 | NULL
2          | 2110-04-15 04:07:00 | 3 hours, 53 minutes
2          | 2110-04-15 08:00:00 | 10 hours, 30 minutes
2          | 2110-04-15 18:30:00 | NULL
Just add a PARTITION BY clause, and also your expected output seems to want LEAD, not LAG:
SELECT subject_id, date_inside, value,
LEAD(date_inside) OVER (PARTITION BY subject_id ORDER BY date_inside)
- date_inside AS difference
FROM table1
ORDER BY
subject_id,
date_inside;
Think of "partition by" as similar to how you would use "group by". In this case the logical boundaries are determined by subject_id, so just include it as part of the over clause:
select a.subject_id, a.date_inside, a.value,
       a.date_inside - lag(a.date_inside) over (partition by a.subject_id order by a.date_inside) as difference
from table1 a
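To see where the NULLs land with each function, here is a small self-contained sketch (the inline VALUES rows and the ts column name are made up purely for illustration):
SELECT subject_id, ts,
       ts - LAG(ts)  OVER (PARTITION BY subject_id ORDER BY ts) AS diff_from_previous, -- NULL on each subject's first row
       LEAD(ts) OVER (PARTITION BY subject_id ORDER BY ts) - ts AS diff_to_next        -- NULL on each subject's last row
FROM (VALUES (1, TIMESTAMP '2110-02-12 19:41'),
             (1, TIMESTAMP '2110-02-15 01:40'),
             (2, TIMESTAMP '2110-04-15 04:07'),
             (2, TIMESTAMP '2110-04-15 08:00')) AS t(subject_id, ts)
ORDER BY subject_id, ts;
The expected output in the question (NULL on the last row per subject) matches the LEAD column, which is why the first answer switched functions.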

Getting ranking based on a number from CTE

I have a complex situation in PostgreSQL 11 where I need to generate a numbering based on a single figure which I get from a CTE.
Below is the CTE:
WITH pending_orders_to_be_processed_details AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY so.create_date) AS queue_no,
           name,
           so.create_date::TIMESTAMP
    FROM picking sp
    LEFT JOIN order so ON so.name = sp.origin
    WHERE sp.state IN ('assigned', 'confirmed')
)
, orders_which_can_be_processed_today AS
(
    -- This CTE will give me a count of orders and its hourly average.
    -- Let's say the count is 400 and the hourly average is 3.
)
Now I need to number the details according to the hourly average. That means the first 3 orders need to be ranked as 1, the next 3 as 2, and so on, so that I can identify which orders can be processed based on this ranking.
The input will be:
name | queue_no | create_date
so1  | 1        | 2021-03-11 12:00:00
so2  | 2        | 2021-03-11 13:00:00
so3  | 3        | 2021-03-11 14:00:00
so4  | 4        | 2021-03-11 15:00:00
so5  | 5        | 2021-03-11 16:00:00
so6  | 6        | 2021-03-11 17:00:00
so7  | 7        | 2021-03-11 18:00:00
so8  | 8        | 2021-03-11 19:00:00
so9  | 9        | 2021-03-11 20:00:00
The expected output will be:
name | rank
so1  | 1
so2  | 1
so3  | 1
so4  | 2
so5  | 2
so6  | 2
so7  | 3
so8  | 3
so9  | 3
Any help/suggestions?
Edit: I recently learned about a function, which fits well here:
demo:db<>fiddle
You can use the ntile() window function for that:
SELECT
*,
ntile(3) OVER (ORDER BY create_date)
FROM mytable
demo:db<>fiddle
Since you already created a cumulative row count, you can use this to create your expected rank:
SELECT
*,
floor((queue_no - 1) / 3) + 1 as rank
FROM my_cte
queue_no - 1 (so 1 to 3 is shifted to 0 to 2)
Divide by 3: 0 to 2 becomes 0.x and 3 to 5 becomes 1.x, ...
Round these results down to 0, 1, 2, ...
If you want to start with 1 instead of 0, add 1
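If the divisor really comes out of the second CTE rather than being the literal 3, you can join it in and derive the rank arithmetically. A minimal self-contained sketch, where pending, capacity and hourly_avg are made-up stand-ins for the question's CTEs:
WITH pending AS (
    SELECT *
    FROM (VALUES ('so1', 1), ('so2', 2), ('so3', 3),
                 ('so4', 4), ('so5', 5), ('so6', 6),
                 ('so7', 7), ('so8', 8), ('so9', 9)) AS t(name, queue_no)
),
capacity AS (
    SELECT 3 AS hourly_avg            -- stand-in for orders_which_can_be_processed_today
)
SELECT p.name,
       ceil(p.queue_no::numeric / c.hourly_avg) AS rank   -- 1,1,1,2,2,2,3,3,3
FROM pending p
CROSS JOIN capacity c
ORDER BY p.queue_no;
ceil(queue_no / hourly_avg) is the same idea as floor((queue_no - 1) / n) + 1, just written with the divisor taken from a column instead of a constant.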

Running Count Total with PostgreSQL

I'm fairly close to this solution, but I just need a little help getting over the end.
I'm trying to get a running count of the occurrences of client_ids regardless of the date, however I need the dates and ids to still appear in my results to verify everything.
I found part of the solution here but have not been able to modify it enough for my needs.
Here is what the answer should be, counting the occurrences of the client_ids sequentially:
id | client_id | deliver_on | running_total
1  | 138       | 2017-10-01 | 1
2  | 29        | 2017-10-01 | 1
3  | 138       | 2017-10-01 | 2
4  | 29        | 2013-10-02 | 2
5  | 29        | 2013-10-02 | 3
6  | 29        | 2013-10-03 | 4
7  | 138       | 2013-10-03 | 3
However, here is what I'm getting:
id | client_id | deliver_on | running_total
1  | 138       | 2017-10-01 | 1
2  | 29        | 2017-10-01 | 1
3  | 138       | 2017-10-01 | 1
4  | 29        | 2013-10-02 | 3
5  | 29        | 2013-10-02 | 3
6  | 29        | 2013-10-03 | 1
7  | 138       | 2013-10-03 | 2
Rather than counting the times the client_id appears sequentially, the code counts the times the id appeared in the previous date range.
Here is my code and any help would be greatly appreciated.
Thank you,
SELECT n.id, n.client_id, n.deliver_on, COUNT(n.client_id) AS "running_total"
FROM orders n
LEFT JOIN orders o
ON (o.client_id = n.client_id
AND n.deliver_on > o.deliver_on)
GROUP BY n.id, n.deliver_on, n.client_id
ORDER BY n.deliver_on ASC
* EDIT WITH ANSWER *
I ended up solving my own question. Here is the solution with comments:
-- Set "1" for counting to be used later
WITH DATA AS (
SELECT
orders.id,
orders.client_id,
orders.deliver_on,
COUNT(1) -- Creates a column of "1" for counting the occurrences
FROM orders
GROUP BY 1
ORDER BY deliver_on, client_id
)
SELECT
id,
client_id,
deliver_on,
SUM(COUNT) OVER (PARTITION BY client_id
ORDER BY client_id, deliver_on
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- Counts the sequential client_ids based on the number of times they appear
FROM DATA
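For what it's worth, the intermediate CTE isn't strictly needed: a running per-client count can also be taken directly with a window aggregate. A sketch against the same orders table, assuming id is unique so it can break ties within a day:
SELECT id, client_id, deliver_on,
       COUNT(*) OVER (PARTITION BY client_id
                      ORDER BY deliver_on, id) AS running_total  -- count of this client's rows so far
FROM orders
ORDER BY deliver_on, id;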

SELECT record based upon dates

Assuming data such as the following:
ID | EffDate    | Rate
1  | 12/12/2011 | 100
1  | 01/01/2012 | 110
1  | 02/01/2012 | 120
2  | 01/01/2012 | 40
2  | 02/01/2012 | 50
3  | 01/01/2012 | 25
3  | 03/01/2012 | 30
3  | 05/01/2012 | 35
How would I find the rate for ID 2 as of 1/15/2012?
Or, the rate for ID 1 for 1/15/2012?
In other words, how do I do a query that finds the correct rate when the date falls between the EffDate for two records? (Rate should be for the date prior to the selected date).
Thanks,
John
How about this:
SELECT Rate
FROM Table1
WHERE ID = 1 AND EffDate = (
SELECT MAX(EffDate)
FROM Table1
WHERE ID = 1 AND EffDate <= '2012-01-15');
Here's an SQL Fiddle to play with. I assume here that the ID/EffDate pair is unique across the whole table (at least the opposite doesn't make sense).
SELECT TOP 1 Rate
FROM the_table
WHERE ID = whatever AND EffDate <= 'whatever'
ORDER BY EffDate DESC
if I read you right.
(edited to suit my idea of ms-sql which I have no idea about).
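For completeness, since the rest of this page is Postgres-flavoured: TOP isn't available there, but the same idea can be written with LIMIT. A sketch reusing the Table1/ID/EffDate names from the first answer and the 1/15/2012 date from the question:
SELECT Rate
FROM Table1
WHERE ID = 2
  AND EffDate <= DATE '2012-01-15'
ORDER BY EffDate DESC   -- latest effective date on or before the target date
LIMIT 1;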

tsql PIVOT function

Need help with the following query:
Current Data format:
StudentID | EnrolledStartTime | EnrolledEndTime
1         | 7/18/2011 1.00 AM | 7/18/2011 1.05 AM
2         | 7/18/2011 1.00 AM | 7/18/2011 1.09 AM
3         | 7/18/2011 1.20 AM | 7/18/2011 1.40 AM
4         | 7/18/2011 1.50 AM | 7/18/2011 1.59 AM
5         | 7/19/2011 1.00 AM | 7/19/2011 1.05 AM
6         | 7/19/2011 1.00 AM | 7/19/2011 1.09 AM
7         | 7/19/2011 1.20 AM | 7/19/2011 1.40 AM
8         | 7/19/2011 1.10 AM | 7/18/2011 1.59 AM
I would like to calculate the time difference between EnrolledEndTime and EnrolledStartTime, group it into 15-minute buckets, and get the count of students that enrolled in each bucket.
Expected Result:
Count(StudentID) | Date      | 0-15Mins | 16-30Mins | 31-45Mins | 46-60Mins
4                | 7/18/2011 | 3        | 1         | 0         | 0
4                | 7/19/2011 | 2        | 1         | 0         | 1
Can I use a combination of the PIVOT function to achieve the required result? Any pointers would be helpful.
Create a table variable/temp table that includes all the columns from the original table, plus one column that marks the row as 0, 16, 31 or 46. Then
SELECT * FROM <temp table name> PIVOT (Count(StudentID) FOR <new column name> IN (0, 16, 31, 46)).
That should put you pretty close.
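Fleshing that pseudo-query out very roughly (every name here, #buckets, Bucket and EnrolledDate, is a placeholder, and #buckets is assumed to hold only StudentID, EnrolledDate and Bucket):
-- #buckets: one row per student, with Bucket set to 0, 16, 31 or 46
SELECT EnrolledDate, [0], [16], [31], [46]
FROM #buckets
PIVOT (COUNT(StudentID) FOR Bucket IN ([0], [16], [31], [46])) AS p;
Each bracketed value becomes an output column holding the count of students whose row was marked with that bucket number.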
It's possible (just see the basic pivot instructions here: http://msdn.microsoft.com/en-us/library/ms177410.aspx), but one problem you'll have using pivot is that you need to know ahead of time which columns you want to pivot into.
E.g., you mention 0-15, 16-30, etc., but actually you have no idea how long some students might take -- some might take 24 hours, or your full session timeout, or what have you.
So to alleviate this problem, I'd suggest having a final column as a catch-all, labeled something like '>60'.
Other than that, just do a select on this table, selecting the student ID, the date, and a CASE statement, and you'll have everything you need to work the pivot on.
CASE WHEN date2 - date1 < 15 THEN '0-15' WHEN date2-date1 < 30 THEN '16-30'...ELSE '>60' END.
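Spelled out a bit more, that select could look roughly like this (the Enrollment table name and the EnrolledDate/Bucket aliases are made up, and CONVERT(date, ...) assumes SQL Server 2008 or later):
SELECT StudentID,
       CONVERT(date, EnrolledStartTime) AS EnrolledDate,            -- the day the enrollment started
       CASE
           WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 15 THEN '0-15Mins'
           WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 30 THEN '16-30Mins'
           WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 45 THEN '31-45Mins'
           WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 60 THEN '46-60Mins'
           ELSE '>60Mins'                                            -- catch-all bucket
       END AS Bucket
FROM Enrollment;
The output of this select is exactly the shape the PIVOT (or the manual grouping in the next answer) needs.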
I have an old version of MS SQL Server that doesn't support PIVOT. I wrote the SQL for getting the data, but I couldn't test the pivot part, so I tried my best there. The rest of the SQL will give you the exact data for the pivot table. If you accept NULL instead of 0, it can be written a lot more simply: you can skip the "a" subselect defined in "with a ...".
create table #t (EnrolledStartTime datetime, EnrolledEndTime datetime)
insert #t values('2011/7/18 01:00', '2011/7/18 01:05')
insert #t values('2011/7/18 01:00', '2011/7/18 01:09')
insert #t values('2011/7/18 01:20', '2011/7/18 01:40')
insert #t values('2011/7/18 01:50', '2011/7/18 01:59')
insert #t values('2011/7/19 01:00', '2011/7/19 01:05')
insert #t values('2011/7/19 01:00', '2011/7/19 01:09')
insert #t values('2011/7/19 01:20', '2011/7/19 01:40')
insert #t values('2011/7/19 01:10', '2011/7/19 01:59')
;with a
as
(select * from
(select distinct dateadd(day, cast(EnrolledStartTime as int), 0) date from #t) dates
cross join
(select '0-15Mins' t, 0 group1 union select '16-30Mins', 1 union select '31-45Mins', 2 union select '46-60Mins', 3) i)
, b as
(select (datediff(minute, EnrolledStartTime, EnrolledEndTime )-1)/15 group1, dateadd(day, cast(EnrolledStartTime as int), 0) date
from #t)
select count(b.date) count, a.date, a.t, a.group1 from a
left join b
on a.group1 = b.group1
and a.date = b.date
group by a.date, a.t, a.group1
-- PIVOT(max(date)
-- FOR group1
-- in(['0-15Mins'], ['16-30Mins'], ['31-45Mins'], ['46-60Mins'])AS p