Generate time slots based on duration and quantity in SQL - tsql

I have a table that for some reason stores rotas like this:
Rota | Date_Start | Position | Duration | Quantity | Rota_Slot_Type
----------+-------------------------+----------+----------+----------+---------------
372387412 | 2020-04-12 08:00:00.000 | 1 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2 | 15 | 1 | Support Slot
372387412 | 2020-04-12 08:00:00.000 | 3 | 30 | 1 | Lunch Break
372387412 | 2020-04-12 08:00:00.000 | 4 | 15 | 13 | Not available
372387412 | 2020-04-12 08:00:00.000 | 5 | 15 | 1 | Support Slot
372387412 | 2020-04-12 08:00:00.000 | 6 | 30 | 1 | Lunch Break
372387412 | 2020-04-12 08:00:00.000 | 7 | 15 | 12 | Not available
372387412 | 2020-04-12 08:00:00.000 | 8 | 15 | 1 | Support Slot
372387412 | 2020-04-12 08:00:00.000 | 9 | 15 | 1 | Not available
Changing the table is not an option.
I have generated this:
ID_Rota | Date_Start | RowNumber | Position | Number | Duration | Quantity | Rota_Slot_Type
----------+-------------------------+-----------+----------+--------+----------+----------+---------------
372387412 | 2020-04-12 08:00:00.000 | 1 | 1 | 1 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2 | 1 | 2 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 3 | 1 | 3 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 4 | 1 | 4 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 5 | 1 | 5 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 6 | 1 | 6 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 7 | 1 | 7 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 8 | 1 | 8 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 9 | 1 | 9 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 10 | 1 | 10 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 11 | 1 | 11 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 12 | 1 | 12 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 13 | 1 | 13 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 14 | 1 | 14 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 15 | 1 | 15 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 16 | 2 | 1 | 15 | 1 | Support Slot
372387412 | 2020-04-12 08:00:00.000 | 17 | 3 | 1 | 30 | 1 | Lunch Break
372387412 | 2020-04-12 08:00:00.000 | 18 | 4 | 1 | 15 | 13 | Not available
372387412 | 2020-04-12 08:00:00.000 | 19 | 4 | 2 | 15 | 13 | Not available
372387412 | 2020-04-12 08:00:00.000 | 20 | 4 | 3 | 15 | 13 | Not available
(Top 20 rows, there are 46 in total)
This was generated using the following SQL:
select
    rs.ID_Rota,
    rs.Date_Start,
    row_number() over (partition by rs.Date_Start, rs.ID_Rota order by rs.Position, n.Number) as [RowNumber],
    rs.Position,
    n.Number,
    rs.Duration,
    rs.Quantity,
    rs.Rota_Slot_Type
from dbo.RotaSlots as [rs]
cross apply
(
    select top (rs.Quantity)
        n.Number + 1 as [Number]
    from dbo.Numbers as [n]
    order by n.Number -- without an ORDER BY, TOP may return arbitrary rows
) as [n]
The part I'm struggling with is generating a Slot_Start column for each row. My anticipated output is:
ID_Rota | Date_Start | Slot_Start | RowNumber | Position | Number | Duration | Quantity | Rota_Slot_Type
----------+-------------------------+-------------------------+-----------+----------+--------+----------+----------+---------------
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 08:00:00.000 | 1 | 1 | 1 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 08:15:00.000 | 2 | 1 | 2 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 08:30:00.000 | 3 | 1 | 3 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 08:45:00.000 | 4 | 1 | 4 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 09:00:00.000 | 5 | 1 | 5 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 09:15:00.000 | 6 | 1 | 6 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 09:30:00.000 | 7 | 1 | 7 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 09:45:00.000 | 8 | 1 | 8 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 10:00:00.000 | 9 | 1 | 9 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 10:15:00.000 | 10 | 1 | 10 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 10:30:00.000 | 11 | 1 | 11 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 10:45:00.000 | 12 | 1 | 12 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 11:00:00.000 | 13 | 1 | 13 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 11:15:00.000 | 14 | 1 | 14 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 11:30:00.000 | 15 | 1 | 15 | 15 | 15 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 11:45:00.000 | 16 | 2 | 1 | 15 | 1 | Support Slot
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 12:00:00.000 | 17 | 3 | 1 | 30 | 1 | Lunch Break
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 12:30:00.000 | 18 | 4 | 1 | 15 | 13 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 12:45:00.000 | 19 | 4 | 2 | 15 | 13 | Not available
372387412 | 2020-04-12 08:00:00.000 | 2020-04-12 13:00:00.000 | 20 | 4 | 3 | 15 | 13 | Not available
The first block is relatively straightforward: (RowNumber - 1) * Duration gives me the Slot_Start offset for all rows under Position 1. It breaks down once the block changes to Position 2, Position 3, and so on, because the durations differ between blocks.
Any help is gratefully received.

The simplest way to get the Slot_Start time is to:
1. Order the rows (as you have done with RowNumber).
2. Compute a cumulative sum (running total) of the Duration, in minutes, of all preceding rows.
3. Add this cumulative sum to Date_Start.
In other words, instead of calculating (RowNumber - 1) * Duration, you sum the Durations of all the preceding rows in the same rota.
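You can use an expression like the following to calculate Slot_Start. The running total includes the current row, so the current row's own Duration is subtracted again: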
DATEADD(minute, SUM(Duration) OVER (PARTITION BY ID_Rota, Date_Start ORDER BY RowNumber) - Duration, Date_Start) AS Slot_Start
Note you may need to put your code into a sub-query or CTE, so that RowNumber is calculated, or you could incorporate the ROW_NUMBER expression etc into your same set of calculations.
You could also specify the ROWS frame explicitly in the window function, but then you need to handle the first row, whose frame is empty and therefore sums to NULL, e.g.:
ISNULL(DATEADD(minute, SUM(Duration) OVER (PARTITION BY ID_Rota, Date_Start ORDER BY RowNumber ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), Date_Start), Date_Start) AS Slot_Start2
Here is a db<>fiddle with data similar to what you have above (I started down a path before realising you were doing something different; however, the cumulative sum of Durations should still work).
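A minimal sketch of the running-total approach, run against SQLite from Python so it can be tried without a SQL Server instance. The table name rota_slots, the recursive numbers CTE (standing in for dbo.Numbers), and SQLite's datetime() (standing in for DATEADD) are assumptions made for this illustration; only the first four positions of the sample data are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rota_slots (id_rota INTEGER, date_start TEXT, position INTEGER,
                         duration INTEGER, quantity INTEGER, slot_type TEXT);
INSERT INTO rota_slots VALUES
  (372387412, '2020-04-12 08:00:00', 1, 15, 15, 'Not available'),
  (372387412, '2020-04-12 08:00:00', 2, 15, 1,  'Support Slot'),
  (372387412, '2020-04-12 08:00:00', 3, 30, 1,  'Lunch Break'),
  (372387412, '2020-04-12 08:00:00', 4, 15, 13, 'Not available');
""")

query = """
WITH RECURSIVE numbers(n) AS (          -- hypothetical stand-in for dbo.Numbers
    SELECT 1 UNION ALL SELECT n + 1 FROM numbers WHERE n < 100
),
expanded AS (                           -- one row per slot (Quantity copies)
    SELECT rs.*, numbers.n AS number
    FROM rota_slots AS rs
    JOIN numbers ON numbers.n <= rs.quantity
),
offsets AS (                            -- minutes used by all preceding slots
    SELECT *,
           SUM(duration) OVER (PARTITION BY id_rota, date_start
                               ORDER BY position, number
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           - duration AS offset_minutes
    FROM expanded
)
SELECT position, number, slot_type,
       datetime(date_start, '+' || offset_minutes || ' minutes') AS slot_start
FROM offsets
ORDER BY position, number
"""
rows = conn.execute(query).fetchall()
for row in rows[14:18]:
    print(row)
```

Sixteen 15-minute slots precede the lunch break, so its offset is 240 minutes and it starts at 12:00; the slot after it starts 30 minutes later, at 12:30.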

postgrest retrieve ranked results

I made a game, with levels and scores saved in an SQL table like this:
create table if not exists api.scores (
id serial primary key,
pseudo varchar(50),
level int,
score int,
created_at timestamptz default CURRENT_TIMESTAMP
);
I want to display the scores in the UI with the rank of each score, based on the score column in descending order.
Here is some sample data:
id | pseudo | level | score | created_at
----+----------+-------+-------+-------------------------------
1 | test | 1 | 1 | 2020-05-01 11:25:20.446402+02
2 | test | 1 | 1 | 2020-05-01 11:28:11.04001+02
3 | szef | 1 | 115 | 2020-05-01 15:45:06.201135+02
4 | erg | 1 | 115 | 2020-05-01 15:55:19.621372+02
5 | zef | 1 | 115 | 2020-05-01 16:14:09.718861+02
6 | aa | 1 | 115 | 2020-05-01 16:16:49.369718+02
7 | zesf | 1 | 115 | 2020-05-01 16:17:42.504354+02
8 | zesf | 2 | 236 | 2020-05-01 16:18:07.070728+02
9 | zef | 1 | 115 | 2020-05-01 16:22:23.406013+02
10 | zefzef | 1 | 115 | 2020-05-01 16:23:49.720094+02
Here is what I want:
id | pseudo | level | score | created_at | rank
----+----------+-------+-------+-------------------------------+------
31 | zef | 7 | 730 | 2020-05-01 18:40:42.586224+02 | 1
50 | Cyprien | 5 | 588 | 2020-05-02 14:08:39.034112+02 | 2
49 | cyprien | 4 | 438 | 2020-05-01 23:35:13.440595+02 | 3
51 | Cyprien | 3 | 374 | 2020-05-02 14:13:41.071752+02 | 4
47 | cyprien | 3 | 337 | 2020-05-01 23:27:53.025475+02 | 5
45 | balek | 3 | 337 | 2020-05-01 19:57:39.888233+02 | 5
46 | cyprien | 3 | 337 | 2020-05-01 23:25:56.047495+02 | 5
48 | cyprien | 3 | 337 | 2020-05-01 23:28:54.190989+02 | 5
54 | Cyzekfj | 2 | 245 | 2020-05-02 14:14:34.830314+02 | 9
8 | zesf | 2 | 236 | 2020-05-01 16:18:07.070728+02 | 10
13 | zef | 1 | 197 | 2020-05-01 16:28:59.95383+02 | 11
14 | azd | 1 | 155 | 2020-05-01 17:53:30.372793+02 | 12
38 | balek | 1 | 155 | 2020-05-01 19:08:57.622195+02 | 12
I want to retrieve the rank computed over the full table, whatever the result set.
I'm using the PostgREST web server.
How do I do that?
You are describing the window function rank(), which gives tied scores the same rank and leaves a gap after them:
select t.*, rank() over(order by score desc) rnk
from mytable t
order by score desc
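A small sketch of how the rank can stay tied to the full table even when the result set is filtered: the ranking is computed inside a view, and the filter is applied outside it. It runs on SQLite from Python purely for illustration; the view name ranked_scores and the column alias rnk are made up here, and the rows are a subset of the sample data. With PostgREST, one option would be to expose a similar view in the API schema, but that part is an assumption and is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (id INTEGER PRIMARY KEY, pseudo TEXT, level INTEGER, score INTEGER);
INSERT INTO scores (id, pseudo, level, score) VALUES
  (31, 'zef', 7, 730), (50, 'Cyprien', 5, 588), (49, 'cyprien', 4, 438),
  (51, 'Cyprien', 3, 374), (47, 'cyprien', 3, 337), (45, 'balek', 3, 337),
  (54, 'Cyzekfj', 2, 245);
-- hypothetical view: the rank is computed over every row of the table
CREATE VIEW ranked_scores AS
  SELECT s.*, rank() OVER (ORDER BY score DESC) AS rnk FROM scores AS s;
""")

# Filtering the view does not renumber the ranks.
filtered = conn.execute(
    "SELECT pseudo, score, rnk FROM ranked_scores WHERE pseudo = 'balek'"
).fetchall()
all_ranks = [r[0] for r in conn.execute(
    "SELECT rnk FROM ranked_scores ORDER BY rnk, id")]
print(filtered, all_ranks)
```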

If a user's record on column x is not null, how do I count how many records that user has after the first time it is not null?

I would like to create a count per user of number of records after the first time that x is not null for that user.
I have a table that is similar to the following:
id | user_id | completed_at | x
----+---------+--------------+---
1 | 1001 | 2017-06-01 | 1
20 | 1001 | 2017-06-01 | 2
21 | 1001 | 2017-06-02 | 4
22 | 1001 | 2017-06-03 |
24 | 1001 | 2017-06-03 |
25 | 1001 | 2017-06-04 |
23 | 1001 | 2017-06-04 |
12 | 1001 | 2017-06-06 |
13 | 1001 | 2017-06-07 |
14 | 1001 | 2017-06-08 |
2 | 1002 | 2017-06-02 | 3
27 | 1002 | 2017-06-02 | 7
15 | 1002 | 2017-06-09 |
3 | 1003 | 2017-06-03 |
4 | 1004 | 2017-06-04 |
5 | 1005 | 2017-06-05 |
33 | 1005 | 2017-06-20 | 8
34 | 1006 | 2017-07-10 | 9
6 | 1006 | 2017-10-06 |
7 | 1007 | 2017-10-07 |
8 | 1008 | 2017-10-08 |
9 | 1009 | 2017-10-09 |
10 | 1010 | 2017-10-10 |
16 | 1011 | 2017-06-01 |
11 | 1011 | 2017-07-01 | 5
17 | 1012 | 2017-06-02 |
26 | 1012 | 2017-07-02 | 6
18 | 1013 | 2017-06-03 |
19 | 1014 | 2017-06-04 |
31 | 1014 | 2017-06-24 |
32 | 1014 | 2017-06-24 |
30 | 1014 | 2017-06-24 |
29 | 1014 | 2017-06-24 |
28 | 1014 | 2017-06-24 |
The expected output would look like this:
+------+------------+---------------+
| user | first_x | records_after |
+------+------------+---------------+
| 1001 | 2017-06-01 | 9 |
| 1002 | 2017-06-02 | 2 |
| 1005 | 2017-06-20 | 0 |
| 1011 | 2017-07-01 | 0 |
| 1012 | 2017-07-02 | 0 |
+------+------------+---------------+
Use a running count of the non-null x values per user, then count the rows where that running count is greater than 0 (minus one for the first non-null row itself).
Sample:
WITH flags AS (
    SELECT
        user_id,
        completed_at,
        sum(CASE WHEN x IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY user_id ORDER BY completed_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS flag
    FROM users
),
completed AS (
    SELECT DISTINCT ON (user_id)
        user_id,
        completed_at AS first_x
    FROM flags
    WHERE flag > 0
    ORDER BY user_id, completed_at
)
SELECT
    user_id AS user,
    first_x,
    count(flag) FILTER (WHERE flag > 0) - 1 AS records_after
FROM flags
NATURAL JOIN completed
GROUP BY 1, 2
ORDER BY 1
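As a cross-check of the logic, here is the same computation sketched in plain Python on a subset of the sample rows. The function name records_after_first_x is made up for this example, and rows that share a completed_at date keep their input order, which is an assumption (the SQL above does not define an order for such ties either).

```python
from collections import defaultdict

# (id, user_id, completed_at, x); None stands for NULL
rows = [
    (1, 1001, "2017-06-01", 1), (20, 1001, "2017-06-01", 2),
    (21, 1001, "2017-06-02", 4), (22, 1001, "2017-06-03", None),
    (24, 1001, "2017-06-03", None), (25, 1001, "2017-06-04", None),
    (23, 1001, "2017-06-04", None), (12, 1001, "2017-06-06", None),
    (13, 1001, "2017-06-07", None), (14, 1001, "2017-06-08", None),
    (2, 1002, "2017-06-02", 3), (27, 1002, "2017-06-02", 7),
    (15, 1002, "2017-06-09", None),
    (3, 1003, "2017-06-03", None),
    (5, 1005, "2017-06-05", None), (33, 1005, "2017-06-20", 8),
]

def records_after_first_x(rows):
    by_user = defaultdict(list)
    for _id, user_id, completed_at, x in rows:
        by_user[user_id].append((completed_at, x))
    result = {}
    for user_id, records in by_user.items():
        records.sort(key=lambda r: r[0])  # ISO dates sort correctly as strings
        # index of the first record whose x is not null, if any
        first = next((i for i, (_, x) in enumerate(records) if x is not None), None)
        if first is None:
            continue  # users that never have a non-null x are left out
        result[user_id] = (records[first][0], len(records) - first - 1)
    return result

summary = records_after_first_x(rows)
print(summary)
```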

Tibco Spotfire - Calculate average only if there are minimum 3 values in a column - see desc

I want to calculate an average in Spotfire only when there are at least 3 values. If there are no values, or only 1 or 2, the average should be blank.
Raw data:
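+---------+-----+---------+
| Product | Age | Average |
+---------+-----+---------+
| 1       |     |         |
| 2       |     |         |
| 3       | 10  |         |
| 4       | 12  |         |
| 5       | 13  | 11      |
| 6       |     |         |
| 7       | 18  |         |
| 8       | 19  |         |
| 9       | 20  | 19      |
| 10      | 21  | 20      |
+---------+-----+---------+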
The only way I could really do this is with 3 calculated columns. Insert these calculated columns in this order:
If(Min(If([Age] IS NULL,0,[Age])) over (LastPeriods(3,[Product]))<>0,1) as [BitFlag]
Avg([Age]) over (LastPeriods(3,[Product])) as [TempAvg]
If([BitFlag]=1,[TempAvg]) as [Average]
This will give you the following results. You can ignore / hide the two columns you don't care about.
RESULTS
+---------+-----+---------+------------------+------------------+
| Product | Age | BitFlag | TempAvg | Average |
+---------+-----+---------+------------------+------------------+
| 1 | | | | |
| 2 | | | | |
| 3 | 10 | | 10 | |
| 4 | 12 | | 11 | |
| 5 | 13 | 1 | 11.6666666666667 | 11.6666666666667 |
| 6 | | | 12.5 | |
| 7 | 18 | | 15.5 | |
| 8 | 19 | | 18.5 | |
| 9 | 20 | 1 | 19 | 19 |
| 10 | 21 | 1 | 20 | 20 |
| 11 | | | 20.5 | |
| 12 | 22 | | 21.5 | |
| 13 | 36 | | 29 | |
| 14 | | | 29 | |
| 15 | 11 | | 23.5 | |
| 16 | 23 | | 17 | |
| 17 | 14 | 1 | 16 | 16 |
+---------+-----+---------+------------------+------------------+
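To make the intended rule concrete outside Spotfire, here is a short Python sketch of a trailing three-row average that is only emitted when all three rows have a value. The function name rolling_average is hypothetical. Unlike the BitFlag expression above, it tests for missing values directly rather than mapping them to 0, so a genuine Age of 0 would still count as a value; it also returns nothing until a full window of three rows exists.

```python
def rolling_average(values, window=3):
    """Average of the last `window` values, or None if any of them is missing."""
    result = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        if len(recent) == window and all(v is not None for v in recent):
            result.append(sum(recent) / window)
        else:
            result.append(None)
    return result

# Age column for Products 1 to 10 from the question
ages = [None, None, 10, 12, 13, None, 18, 19, 20, 21]
averages = rolling_average(ages)
print(averages)
```

Only Products 5, 9 and 10 get an average, which matches the rows flagged with BitFlag = 1 in the results table.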

Division with more than one result from postgresql query

I am using PostgreSQL and I have a table called accidents (state, total accidents) and another table called population. I want to get the three states with the highest total accidents, and then, for each of those states, the population divided by the total accidents. How can I write this query?
Explanation:
Population Table
rank| state | population
---+-----------------------------+------------
1 | Uttar Pradesh | 199581477
2 | Maharashtra | 112372972
3 | Bihar | 103804630
4 | West Bengal | 91347736
5 | Madhya Pradesh | 72597565
6 | Tamil Nadu | 72138958
7 | Rajasthan | 68621012
8 | Karnataka | 61130704
9 | Gujarat | 60383628
10 | Andhra Pradesh | 49665533
11 | Odisha | 41947358
12 | Telangana | 35193978
13 | Kerala | 33387677
14 | Jharkhand | 32966238
15 | Assam | 31169272
16 | Punjab | 27704236
17 | Haryana | 25753081
18 | Chhattisgarh | 25540196
19 | Jammu and Kashmir | 12548926
20 | Uttarakhand | 10116752
21 | Himachal Pradesh | 6856509
22 | Tripura | 3671032
23 | Meghalaya | 2964007
24 | Manipur*β* | 2721756
25 | Nagaland | 1980602
26 | Goa | 1457723
27 | Arunachal Pradesh | 1382611
28 | Mizoram | 1091014
29 | Sikkim | 607688
30 | Delhi | 16753235
31 | Puducherry | 1244464
32 | Chandigarh | 1054686
33 | Andaman and Nicobar Islands | 379944
34 | Dadra and Nagar Haveli | 342853
35 | Daman and Diu | 242911
36 | Lakshadweep | 64429
accident table:
state | eqto8 | eqto10 | mrthn10 | ntknwn | total
-----------------------------+-------+--------+---------+--------+--------
Andhra Pradesh | 6425 | 8657 | 8144 | 19298 | 42524
Arunachal Pradesh | 88 | 76 | 87 | 0 | 251
Assam | 0 | 0 | 0 | 6535 | 6535
Bihar | 2660 | 3938 | 3722 | 0 | 10320
Chhattisgarh | 2888 | 7052 | 3571 | 0 | 13511
Goa | 616 | 1512 | 2184 | 0 | 4312
Gujarat | 4864 | 7864 | 7132 | 8089 | 27949
Haryana | 3365 | 2588 | 4112 | 0 | 10065
Himachal Pradesh | 276 | 626 | 977 | 1020 | 2899
Jammu and Kashmir | 1557 | 618 | 434 | 4100 | 6709
Jharkhand | 1128 | 701 | 1037 | 2845 | 5711
Karnataka | 11167 | 14715 | 18566 | 0 | 44448
Kerala | 5580 | 13271 | 17323 | 0 | 36174
Madhya Pradesh | 15630 | 16226 | 19354 | 0 | 51210
Maharashtra | 4117 | 5350 | 10538 | 46311 | 66316
Manipur | 147 | 453 | 171 | 0 | 771
Meghalaya | 210 | 154 | 119 | 0 | 483
Mizoram | 27 | 58 | 25 | 0 | 110
Nagaland | 11 | 13 | 18 | 0 | 42
Odisha | 1881 | 3120 | 4284 | 0 | 9285
Punjab | 1378 | 2231 | 1825 | 907 | 6341
Rajasthan | 5534 | 5895 | 5475 | 6065 | 22969
Sikkim | 6 | 144 | 8 | 0 | 158
Tamil Nadu | 8424 | 18826 | 29871 | 10636 | 67757
Tripura | 290 | 376 | 222 | 0 | 888
Uttarakhand | 318 | 305 | 456 | 393 | 1472
Uttar Pradesh | 8520 | 10457 | 10995 | 0 | 29972
West Bengal | 1494 | 1311 | 974 | 8511 | 12290
Andaman and Nicobar Islands | 18 | 104 | 114 | 0 | 236
Chandigarh | 112 | 39 | 210 | 58 | 419
Dadra and Nagar Haveli | 40 | 20 | 17 | 8 | 85
Daman and Diu | 11 | 6 | 8 | 25 | 50
Delhi | 0 | 0 | 0 | 6937 | 6937
Lakshadweep | 0 | 0 | 0 | 3 | 3
Puducherry | 154 | 668 | 359 | 0 | 1181
All India | 88936 | 127374 | 152332 | 121741 | 490383
So that result should be
21.57
81.03
107.44
Explanation:
The states with the most accidents are Tamil Nadu, Maharashtra and Madhya Pradesh.
Tamil Nadu population/accidents = 21213/983 = 21.57 (assumed values)
Maharashtra population/accidents = 10000/123 = 81.03
Madhya Pradesh population/accidents = 34812/324 = 107.44
My query is:
SELECT population /
       (SELECT total
        FROM accidents
        WHERE state NOT LIKE 'All %'
        ORDER BY total DESC
        LIMIT 3) AS avg
FROM population
WHERE state IN
      (SELECT state
       FROM accidents
       WHERE state NOT LIKE 'All %'
       ORDER BY total DESC
       LIMIT 3);
This throws ERROR: more than one row returned by a subquery used as an expression.
How can I modify the query to get the required result, or is there another way to get it in PostgreSQL?
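This ought to do it. The derived table picks the top three states first, and the join then brings in each state's population, so every output row divides one population by one total. If both columns are integers, cast one of them (here with ::numeric), because PostgreSQL integer division discards the fractional part.
SELECT a.state, p.population::numeric / a.total AS population_per_accident
FROM (SELECT total, state FROM accidents WHERE state <> 'All India' ORDER BY total DESC LIMIT 3) AS a
INNER JOIN population AS p ON a.state = p.state
ORDER BY a.total DESC;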
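A runnable sketch of the join approach, using SQLite from Python and a handful of rows copied from the two tables in the question. SQLite has no ::numeric cast, so CAST(... AS REAL) is used instead; that substitution, and the alias population_per_accident, are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE population (state TEXT, population INTEGER);
CREATE TABLE accidents (state TEXT, total INTEGER);
INSERT INTO population VALUES
  ('Uttar Pradesh', 199581477), ('Maharashtra', 112372972),
  ('Madhya Pradesh', 72597565), ('Tamil Nadu', 72138958),
  ('Karnataka', 61130704);
INSERT INTO accidents VALUES
  ('Uttar Pradesh', 29972), ('Maharashtra', 66316),
  ('Madhya Pradesh', 51210), ('Tamil Nadu', 67757),
  ('Karnataka', 44448), ('All India', 490383);
""")

query = """
SELECT a.state,
       CAST(p.population AS REAL) / a.total AS population_per_accident
FROM (SELECT state, total
      FROM accidents
      WHERE state <> 'All India'      -- skip the summary row
      ORDER BY total DESC
      LIMIT 3) AS a
INNER JOIN population AS p ON a.state = p.state
ORDER BY a.total DESC
"""
result = conn.execute(query).fetchall()
for state, ratio in result:
    print(state, round(ratio, 2))
```

Because the top-three selection happens in the derived table, each joined row carries exactly one total, which avoids the "more than one row returned by a subquery" error.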

How can I disaggregate rows of a data frame in Spark?

I have a Spark dataframe containing data similar to the following:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 |
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 |
+----+---------------------+-------+----------+-------------+
I'm looking to turn this into something like the following:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 | ? |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 | ? |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 | ? |
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 | ? |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 | ? |
+----+---------------------+-------+----------+-------------+------------+
More specifically I want to turn this:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
+----+---------------------+-------+----------+-------------+
Into this:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
+----+---------------------+-------+----------+-------------+------------+
I want to take the rows with an Interval greater than 1 out of the original table, interpolate Values for the missing intervals, and reinsert the newly created rows into the initial table in place of the original rows. I have ideas of how to achieve this (in PostgreSQL, for example, I would simply use the generate_series() function to create the required Timestamps and calculate new Values), but implementing these in Spark/Scala is proving troublesome.
Assuming I've created a new dataframe containing only rows with Interval > 1, how could I replicate those rows 'n' times with 'n' being the value of Interval? I believe that would give me enough to get going using a Counter function partitioned by some row reference I can create.
If there's a way to replicate the behavior of generate_series() that I've missed, even better.
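To pin down the expansion being asked for, here is a sketch of the per-row logic in plain Python rather than Spark. It assumes one interval is 15 minutes (as in the sample data) and that Value rises linearly across the gap; the function name disaggregate is made up, and the Estimation column is left out because the question does not define it. In Spark, a similar per-row expansion could presumably be applied with flatMap on a Dataset, or by building an array column and calling explode, but that part is not shown or tested here.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=15)  # assumption: one interval = 15 minutes

def disaggregate(row):
    """Replace one row covering n intervals with n single-interval rows."""
    n = row["Interval"]
    per_step = row["Consumption"] / n              # consumption spread evenly
    start_value = row["Value"] - row["Consumption"]  # reading before the gap
    expanded = []
    for k in range(1, n + 1):
        expanded.append({
            "ID": row["ID"],
            "Timestamp": row["Timestamp"] - (n - k) * STEP,
            "Value": start_value + k * per_step,   # linear interpolation
            "Interval": 1,
            "Consumption": per_step,
        })
    return expanded

row = {"ID": 1, "Timestamp": datetime(2012, 5, 2, 14, 0),
       "Value": 578, "Interval": 4, "Consumption": 24}
expanded = disaggregate(row)
for r in expanded:
    print(r["Timestamp"], r["Value"], r["Consumption"])
```

For the 14:00 row this produces four rows at 13:15, 13:30, 13:45 and 14:00 with Values 560, 566, 572 and 578 and a Consumption of 6 each, matching the target table in the question.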