Islands and Gaps Issue - tsql

Backstory: I have a database of data points for drivers in trucks. While in a truck, the driver can have a 'driverstatus'. What I'd like to do is group these statuses by driver and truck.
So far I've tried using LAG/LEAD to help, so that I can tell when a driverstatus change occurs and then mark that row as carrying the last datetime of that status.
That by itself is insufficient, because I also need to group the statuses by status and date. For this I've reached for DENSE_RANK, but I can't get the ORDER BY clause right.
Here is my test data, along with one of my many attempts at ranking.
/****** Script for SelectTopNRows command from SSMS ******/
DECLARE @SomeTable TABLE
(
loginId VARCHAR(255),
tractorId VARCHAR(255),
messageTime DATETIME,
driverStatus VARCHAR(2)
);
INSERT INTO @SomeTable (loginId, tractorId, messageTime, driverStatus)
VALUES('driver35','23533','2018-08-10 8:33 AM','2'),
('driver35','23533','2018-08-10 8:37 AM','2'),
('driver35','23533','2018-08-10 8:56 AM','2'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 8:57 AM','1'),
('driver35','23533','2018-08-10 9:07 AM','1'),
('driver35','23533','2018-08-10 9:04 AM','1'),
('driver35','23533','2018-08-12 8:07 AM','3'),
('driver35','23533','2018-08-12 8:37 AM','3'),
('driver35','23533','2018-08-12 9:07 AM','3'),
('driver35','23533','2018-06-12 8:07 AM','2'),
('driver35','23533','2018-06-12 8:37 AM','2'),
('driver35','23533','2018-06-12 9:07 AM','2')
;
SELECT *, DENSE_RANK() OVER(PARTITION BY
loginId, tractorId, driverStatus
ORDER BY messageTime) FROM @SomeTable
;
My end result would ideally look something like this:
loginId   tractorId  startTime           endTime             driverStatus
driver35  23533      2018-08-10 8:33 AM  2018-08-10 8:56 AM  2
driver35  23533      2018-08-10 8:57 AM  2018-08-10 9:07 AM  1
driver35  23533      2018-08-12 8:07 AM  2018-08-12 9:07 AM  3
Any help on this is greatly appreciated.

WITH drivers_data AS
(
SELECT *,
row_num = ROW_NUMBER()
OVER (PARTITION BY loginId,
tractorId,
CAST(messageTime AS date),
driverStatus
ORDER BY messageTime),
row_num_all = ROW_NUMBER()
OVER (PARTITION BY loginId,
tractorId
ORDER BY messageTime),
first_date = FIRST_VALUE (messageTime)
OVER (PARTITION BY loginId,
tractorId,
CAST(messageTime AS date),
driverStatus
ORDER BY messageTime),
last_date = LAST_VALUE (messageTime)
OVER (PARTITION BY loginId,
tractorId,
CAST(messageTime AS date),
driverStatus
ORDER BY messageTime
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM @SomeTable
)
SELECT loginId, tractorId, first_date, last_date, driverStatus
FROM drivers_data
WHERE row_num = 1
ORDER BY row_num_all;
OUTPUT:
+==========+===========+=====================+=====================+==============+
| loginId | tractorId | first_date | last_date | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-10-08 08:57:00 | 2018-10-08 09:07:00 | 1 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-12-06 08:07:00 | 2018-12-06 09:07:00 | 2 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-12-08 08:07:00 | 2018-12-08 09:07:00 | 3 |
+----------+-----------+---------------------+---------------------+--------------+
I will try to explain what's going on here:
row_num: This numbers the rows within each date and driver status. We need the cast because we want the date part without the time.
row_num_all: This is the key attribute, since it lets us sort the rows by occurrence at the end. This window is not restricted by status, because we need numbering across the driver's whole data.
first_date: FIRST_VALUE is a handy function for our purpose; it simply retrieves the first datetime occurrence.
last_date: It's natural to assume that for the last date we need the LAST_VALUE window function. But using it is tricky and requires more explanation. As you can see, I explicitly use special framing: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. But why? Let me explain. Let's take the part of the output for date 10/8/2018 and status 2 with default framing. We get the following results:
+==========+===========+=====================+=====================+==============+
| loginId | tractorId | first_date | last_date | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2 |
+----------+-----------+---------------------+---------------------+--------------+
As you can see, the last date is incorrect! This happens because LAST_VALUE uses the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means the last row of the window is always the current row. Here's what happens under the hood: three windows get created, each row gets its own window, and then the last row is retrieved from that window:
Window for 1st row
+==========+===========+=====================+=====================+==============+
| loginId | tractorId | first_date | last_date | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2 |
+----------+-----------+---------------------+---------------------+--------------+
Window for 2nd row
+==========+===========+=====================+=====================+==============+
| loginId | tractorId | first_date | last_date | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2 |
+----------+-----------+---------------------+---------------------+--------------+
Window for 3rd row
+==========+===========+=====================+=====================+==============+
| loginId | tractorId | first_date | last_date | driverStatus |
|==========|===========|=====================|=====================|==============|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:33:00 | 2 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:37:00 | 2 |
|----------|-----------|---------------------|---------------------|--------------|
| driver35 | 23533 | 2018-10-08 08:33:00 | 2018-10-08 08:56:00 | 2 |
+----------+-----------+---------------------+---------------------+--------------+
So the solution is to change the framing: we need to run not from the beginning to the current row, but from the current row to the end. UNBOUNDED FOLLOWING means exactly that: the last row of the current window.
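To see the framing difference in isolation, here is a minimal sketch (my addition, run against @SomeTable from the question, restricted to the status-2 rows on 2018-08-10):
SELECT messageTime,
       last_default  = LAST_VALUE(messageTime) OVER (ORDER BY messageTime),
       last_explicit = LAST_VALUE(messageTime)
                       OVER (ORDER BY messageTime
                             ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM @SomeTable
WHERE driverStatus = '2'
  AND CAST(messageTime AS date) = '2018-08-10';
-- last_default just echoes each row's own time (the frame ends at the current
-- row); last_explicit is 8:56 AM on every row, as desired.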
Next is WHERE row_num = 1. This is simple: since all rows carry the same information about the first date and last date, we just need the first row.
The final part is ORDER BY row_num_all. This is where you get your correct ordering.
P.S.
Your desired output in the question is incorrect:
For 8/10/18 8:57 AM and status 1, the last date must be 10/8/2018 9:07 AM, not 10/8/2018 9:04 AM as you wrote.
Also, the output for date 12/6/2018 and status 2 is missing.
UPDATE:
Here are illustrations of how FIRST_VALUE and LAST_VALUE work.
All three figures have the following parts:
Query data: the result of the query.
Original query: the original source data.
Windows: intermediate steps of the calculations.
Frame: which frame is used.
Green cell: the window specification.
Here's what's happening under the hood:
First, SQL Server creates partitions by all the mentioned fields. On the figures this is the partition column.
Each partition can have a frame, either default or custom. The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that each row gets a window between the start of the partition and the current row. If you don't specify a frame, the default frame comes into play.
Each frame creates a window for each row. On the figures these windows are in the columns row 1 to row 2 and are marked with color. The row number corresponds to the row_num_all field.
A row operates only within the bounds of its window.
1. FIRST_VALUE
To get the first date, we can use the handy FIRST_VALUE window function.
As you can see, we use the default frame here. This means that for each row the window spans from the start of the partition to the current row, which is just what we need for the first date: each row fetches the value from the first row. The first date ends up in the first_date field.
2. LAST_VALUE - incorrect frame
Now we need to calculate the last date. The last date is in the last row of the partition, so we can use the LAST_VALUE window function.
As I mentioned earlier, if we don't specify a frame, the default frame is used. As you can see in the figure, the frame always ends at the current row. That is wrong for our purpose, because we need the date from the last row of the window; the last_date field shows incorrect results, reflecting the date from the current row.
3. LAST_VALUE - correct frame
To fix fetching the last date, we need to change the frame that LAST_VALUE operates on: ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. As you can see, the window for each row now spans from the current row to the end of the partition. With this frame LAST_VALUE correctly fetches the date from the last row of the window, and the result in the last_date field is correct.
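As a side note (my addition, not part of the original answer): an aggregate window function such as MAX sidesteps the framing pitfall entirely, because without an ORDER BY the window covers the whole partition. A minimal sketch against @SomeTable:
SELECT *,
       last_date = MAX(messageTime)
                   OVER (PARTITION BY loginId,
                                      tractorId,
                                      CAST(messageTime AS date),
                                      driverStatus)  -- no ORDER BY: frame is the whole partition
FROM @SomeTable;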

The solution below identifies each time an island starts (when driverStatus changes) within each loginId / tractorId combination, and then assigns an "id" number to that island.
After that, it's a simple MIN/MAX to find when each island starts and ends.
Answer:
select b.loginId
, b.tractorId
, min(b.messageTime) as startTime
, max(b.messageTime) as endTime
, b.driverStatus
from (
select a.loginId
, a.tractorId
, a.messageTime
, a.driverStatus
, a.is_island_start_flg
, sum(a.is_island_start_flg) over (partition by a.loginID, a.tractorID order by a.messageTime asc) as island_nbr --assigning the "id" number to the island
from (
select st.loginId
, st.tractorId
, st.messageTime
, st.driverStatus
, iif(lag(st.driverStatus, 1, st.driverStatus) over (partition by st.loginID, st.tractorId order by st.messageTime asc) = st.driverStatus, 0, 1) as is_island_start_flg --identifying start of island
from @SomeTable as st
) as a
) as b
group by b.loginId
, b.tractorId
, b.driverStatus
, b.island_nbr --purposefully in the group by, to make sure each occurrence of a status is in final results
order by b.loginId asc
, b.tractorId asc
, min(b.messageTime) asc
When you leave off the last three records of the sample data (they are not in the expected output of the question, just as JohnyL said), this query produces the exact output from the question.

SELECT
t.loginId,
t.tractorId,
startTime = MIN(messageTime),
endTime = MAX(messageTime),
driverStatus
FROM @SomeTable t
GROUP BY loginId, tractorId, driverStatus
ORDER BY MIN(messageTime);
Results (note that this simple GROUP BY only works because each status forms a single contiguous block here; with the three June status-2 rows included, the two separate status-2 islands would be merged into one range):
loginId tractorId startTime endTime driverStatus
-------------- ---------- ----------------------- ----------------------- ------------
driver35 23533 2018-10-08 08:33:00.000 2018-10-08 08:56:00.000 2
driver35 23533 2018-10-08 08:57:00.000 2018-10-08 09:07:00.000 1
driver35 23533 2018-12-08 08:07:00.000 2018-12-08 09:07:00.000 3

Related

Using PostgreSQL, how can I count the number of individuals that opened a message in the previous 30 days from the Monday of each week?

Scenario:
I have a table, events_table, that consists of records that are inserted by a webhook based on messages I send to my users:
"column_name" (type)
- "time_stamp" (timestamp with time zone)
- "username" (varchar)
- "delivered" (int)
- "action" (int)
Sample Data:
| time_stamp | username | delivered | action |
|:----------------|:---------|:----------|:-------|
|1349733421.460000| user1 | 1 | null |
|1549345346.460000| user3 | 1 | 1 |
|1524544421.460000| user1 | 1 | 1 |
|1345444421.570000| user7 | 1 | null |
|1756756761.980000| user9 | 1 | null |
|1234343421.460000| user171 | 1 | 1 |
|1843455621.460000| user5 | 1 | 1 |
| ... | ... | ... | ... |
The "delivered" column is null by default and 1 when delivered. The "action" column is null by default and is 1 when opened.
Problem:
Using PostgreSQL, how can I count the number of individuals that opened an email in the previous 30 days from the Monday of each week?
Ideal query results:
| date | count |
|:----------------|:----------|
| 02/24/2020 | 1,234,123 |
| 02/17/2020 | 234,123 |
| 02/10/2020 | 1,234,123 |
| 02/03/2020 |12,341,213 |
| ... | ... |
My attempt:
This is the extent of what I've tried which gives me count of the previous week:
SELECT
date_trunc('week', to_timestamp("time_stamp")) as date,
count("username") as count,
lag(count(1), 1) over (order by "date") as "count_previous_week"
FROM events_table
WHERE "delivered" = 1
and "action" = 1
GROUP BY 1 ORDER BY 1 DESC
This is my attempt at writing the query.
First I get the lowest and highest dates from the data set, adding 7 days to the highest date to make sure I include data up to today.
I then run generate_series against these two values with an interval of 7 days, which gives me every single Monday between the two points (we can't rely on just the Mondays within your data set, in case there's an empty week).
Then I simply subquery and aggregate the data based on the generate_series output.
select
__weeks.week_begins,
(
select
count(distinct "username")
from
events_table
where
to_timestamp("time_stamp")::date between week_begins - '30 days'::interval and week_begins
and "delivered" = 1
and "action" = 1
    ) as count
from
(
select
generate_series(_.min_date, _.max_date + interval '7 days', '7 days'::interval)::date as week_begins
from
(
select
min(date_trunc('week', to_timestamp("time_stamp"))::date) as min_date,
max(date_trunc('week', to_timestamp("time_stamp"))::date) as max_date
from
events_table
where
"delivered" = 1
and "action" = 1
) as _
) as __weeks
order by
__weeks.week_begins
I'm not particularly keen on this query because the query planner visits the same table twice, but I can't think of another way to structure it.
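One possible restructuring (a sketch, untested against real data, using the column and table names from the question): filter once into a CTE and LEFT JOIN the generated weeks to it, so the predicates are written only once. Whether the planner truly scans the table once depends on the plan it chooses:
with opened as (
    select "username",
           to_timestamp("time_stamp")::date as open_date
    from events_table
    where "delivered" = 1
      and "action" = 1
)
select w.week_begins,
       count(distinct o."username") as count   -- 0 for weeks with no matches
from (
    select generate_series(
               min(date_trunc('week', open_date))::date,
               max(date_trunc('week', open_date))::date + interval '7 days',
               '7 days'::interval
           )::date as week_begins
    from opened
) as w
left join opened as o
       on o.open_date between w.week_begins - interval '30 days'
                          and w.week_begins
group by w.week_begins
order by w.week_begins;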

Update column with correct daterange using generate_series

I have a column with incorrect dateranges (a day is missing). The code
to generate these dateranges was written by a previous employee and
cannot be found.
The dateranges look like this, notice the missing day:
+-------+--------+-------------------------+
| id | client | date_range |
+-------+--------+-------------------------+
| 12885 | 30 | [2016-01-07,2016-01-13) |
| 12886 | 30 | [2016-01-14,2016-01-20) |
| 12887 | 30 | [2016-01-21,2016-01-27) |
| 12888 | 30 | [2016-01-28,2016-02-03) |
| 12889 | 30 | [2016-02-04,2016-02-10) |
| 12890 | 30 | [2016-02-11,2016-02-17) |
| 12891 | 30 | [2016-02-18,2016-02-24) |
+-------+--------+-------------------------+
And should look like this:
+-------------------------+
| range |
+-------------------------+
| [2016-01-07,2016-01-14) |
| [2016-01-14,2016-01-21) |
| [2016-01-21,2016-01-28) |
| [2016-01-28,2016-02-04) |
| [2016-02-04,2016-02-11) |
| [2016-02-11,2016-02-18) |
| [2016-02-18,2016-02-25) |
| [2016-02-25,2016-03-03) |
+-------------------------+
The code I've written to generate correct dateranges looks like this:
create or replace function generate_date_series(startsOn date, endsOn date, frequency interval)
returns setof date as $$
select (startsOn + (frequency * count))::date
from (
select (row_number() over ()) - 1 as count
from generate_series(startsOn, endsOn, frequency)
) series
$$ language sql immutable;
select DATERANGE(
generate_date_series(
'2016-01-07'::date, '2024-11-07'::date, interval '7days'
)::date,
generate_date_series(
'2016-01-14'::date, '2024-11-13'::date, interval '7days'
)::date
) as range;
However, I'm having trouble trying to update the column with the
correct dateranges. I initially executed this UPDATE query on a test
database I created:
update factored_daterange set date_range = dt.range from (
select daterange(
generate_date_series(
'2016-01-07'::date, '2024-11-07'::date, interval '7days'
)::date,
generate_date_series(
'2016-01-14'::date, '2024-11-14'::date, interval '7days'
)::date ) as range ) dt where client_id=30;
But that is not correct, it simply assigns the first generated
daterange to each row. I want to essentially update the dateranges
row-by-row since there is no other join or condition I can match the
dates up to. Any assistance in this matter is greatly appreciated.
You're working too hard. Just update the upper range value:
update your_table_name
set date_range = daterange(lower(date_range), (upper(date_range) + interval '1 day')::date);
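If you want to eyeball the change first, a hypothetical preview query (against the factored_daterange table from the question; adjust client vs. client_id to match your actual column name) could look like:
-- Preview old vs. new ranges side by side before running the UPDATE.
select id,
       client,
       date_range,
       daterange(lower(date_range),
                 (upper(date_range) + interval '1 day')::date) as fixed_range
from factored_daterange
where client = 30
order by id;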

How to check if there's a record in every hour in specified time-frame and then count it?

I'm using PostgreSQL and this is my table measurement_archive:
+-----------+------------------------+------+-------+
| sensor_id | time | type | value |
+-----------+------------------------+------+-------+
| 123 | 2017-11-26 01:53:11+00 | PM25 | 34.32 |
+-----------+------------------------+------+-------+
| 123 | 2017-11-26 02:15:11+00 | PM25 | 32.1 |
+-----------+------------------------+------+-------+
| 123 | 2017-11-26 04:32:11+00 | PM25 | 75.3 |
+-----------+------------------------+------+-------+
I need a query that will take records from specified timeframe (eg. from 2017-01-01 00:00:00 to 2017-12-01 23:59:59) and then check if in every hour there is at least 1 record - if there is, then add 1 to result.
So, if I make that query from 2017-11-26 01:00:00 to 2017-11-26 04:59:59+00 for sensor_id == 123 on above table then the result should be 3.
select count(*)
from (
select date_trunc('hour', time) as time
from measurement_archive
where
time >= '2017-11-26 01:00:00' and time < '2017-11-26 05:00:00'
and
sensor_id = 123
group by 1
) s
An alternative solution would be to use DISTINCT:
select count(*) from (select distinct date_trunc('hour', time) from measurement_archive where time >= '2017-11-26 01:00:00' and time < '2017-11-26 05:00:00' and sensor_id = 123) t;
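If you also want to see which hours are missing, a hedged variant built on generate_series (assuming the same measurement_archive columns) flags every hour in the window; summing the flags reproduces the count of 3:
select h.hr,
       exists (select 1
               from measurement_archive m
               where m.sensor_id = 123
                 and m.time >= h.hr
                 and m.time <  h.hr + interval '1 hour') as has_record
from generate_series(timestamptz '2017-11-26 01:00:00+00',
                     timestamptz '2017-11-26 04:00:00+00',
                     interval '1 hour') as h(hr);
-- sum(has_record::int) over this result gives 3 for the sample data.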

How to query just the last record of every second within a period of time in postgres

I have a 'prices' table with hundreds of millions of records and only four columns: uid, price, unit, dt. dt is a datetime in standard format like '2017-05-01 00:00:00.585'.
I can quite easily select a period using
SELECT uid, price, unit from prices
WHERE dt > '2017-05-01 00:00:00.000'
AND dt < '2017-05-01 02:59:59.999'
What I can't understand is how to select the price for the last record of each second. (I also need the very first one of each second, but I guess that will be a similar, separate query.) There are some similar examples (here), but they did not work for me when I tried to adapt them to my needs; they generated errors.
Could someone please help me crack this nut?
Let's say that there is a table which has been generated with the help of this command:
CREATE TABLE test AS
SELECT timestamp '2017-09-16 20:00:00' + x * interval '0.1' second As my_timestamp
from generate_series(0,100) x
This table contains an increasing series of timestamps; each timestamp differs by 100 milliseconds (0.1 second) from its neighbors, so there are 10 records within each second.
| my_timestamp |
|------------------------|
| 2017-09-16T20:00:00Z |
| 2017-09-16T20:00:00.1Z |
| 2017-09-16T20:00:00.2Z |
| 2017-09-16T20:00:00.3Z |
| 2017-09-16T20:00:00.4Z |
| 2017-09-16T20:00:00.5Z |
| 2017-09-16T20:00:00.6Z |
| 2017-09-16T20:00:00.7Z |
| 2017-09-16T20:00:00.8Z |
| 2017-09-16T20:00:00.9Z |
| 2017-09-16T20:00:01Z |
| 2017-09-16T20:00:01.1Z |
| 2017-09-16T20:00:01.2Z |
| 2017-09-16T20:00:01.3Z |
.......
The query below determines and prints the first and the last timestamp within each second:
SELECT my_timestamp,
CASE
WHEN rn1 = 1 THEN 'First'
WHEN rn2 = 1 THEN 'Last'
ELSE 'Somewhere in the middle'
END as Which_row_within_a_second
FROM (
select *,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp
) rn1,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp DESC
) rn2
from test
) xx
WHERE 1 IN (rn1, rn2 )
ORDER BY my_timestamp
;
| my_timestamp | which_row_within_a_second |
|------------------------|---------------------------|
| 2017-09-16T20:00:00Z | First |
| 2017-09-16T20:00:00.9Z | Last |
| 2017-09-16T20:00:01Z | First |
| 2017-09-16T20:00:01.9Z | Last |
| 2017-09-16T20:00:02Z | First |
| 2017-09-16T20:00:02.9Z | Last |
| 2017-09-16T20:00:03Z | First |
| 2017-09-16T20:00:03.9Z | Last |
| 2017-09-16T20:00:04Z | First |
| 2017-09-16T20:00:04.9Z | Last |
| 2017-09-16T20:00:05Z | First |
| 2017-09-16T20:00:05.9Z | Last |
You can find a working demo here.
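Applied to the original prices table (column names uid, price, unit, dt as given in the question), a hedged sketch using DISTINCT ON returns exactly one row per second, keeping the last row of that second:
select distinct on (date_trunc('second', dt))
       uid, price, unit, dt
from prices
where dt >= '2017-05-01 00:00:00.000'
  and dt <  '2017-05-01 03:00:00.000'
order by date_trunc('second', dt), dt desc;  -- use "dt asc" to keep the first row of each second instead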

Order by created_date if less than 1 month old, else sort by updated_date

SQL Fiddle: http://sqlfiddle.com/#!15/1da00/5
I have a table that looks something like this:
products
+-----------+-------+--------------+--------------+
| name | price | created_date | updated_date |
+-----------+-------+--------------+--------------+
| chair | 50 | 10/12/2016 | 1/4/2017 |
| desk | 100 | 11/4/2016 | 12/27/2016 |
| TV | 500 | 12/1/2016 | 1/2/2017 |
| computer | 1000 | 12/28/2016 | 1/1/2017 |
| microwave | 100 | 1/3/2017 | 1/4/2017 |
| toaster | 20 | 1/9/2017 | 1/9/2017 |
+-----------+-------+--------------+--------------+
I want to order this table so that products created less than 30 days ago show first (ordered by updated date), and products created 30 or more days ago show after that (again ordered by updated date within that group).
This is what the result should look like:
products - desired results
+-----------+-------+--------------+--------------+
| name | price | created_date | updated_date |
+-----------+-------+--------------+--------------+
| toaster | 20 | 1/9/2017 | 1/9/2017 |
| microwave | 100 | 1/3/2017 | 1/4/2017 |
| computer | 1000 | 12/28/2016 | 1/1/2017 |
| chair | 50 | 10/12/2016 | 1/4/2017 |
| TV | 500 | 12/1/2016 | 1/2/2017 |
| desk | 100 | 11/4/2016 | 12/27/2016 |
+-----------+-------+--------------+--------------+
I've started writing this query:
SELECT *,
CASE
WHEN created_date > NOW() - INTERVAL '30 days' THEN 0
ELSE 1
END AS order_index
FROM products
ORDER BY order_index, created_date DESC
but that only brings the rows with created_date less than 30 days old to the top, ordered by created_date. I also want to sort the rows where order_index = 1 by updated_date.
Unfortunately, in version 9.3 only positional column numbers or expressions involving table columns can be used in ORDER BY, so order_index is not available to the CASE at all, and its position is not well defined because it comes after * in the column list.
This will work.
order by
created_date <= ( current_date - 30 ) , case
when created_date > ( current_date - 30 ) then created_date
else updated_date end desc
Alternatively, a common table expression can be used to wrap the result, which can then be ordered by any column.
WITH q AS(
SELECT *,
CASE
WHEN created_date > NOW() - INTERVAL '30 days' THEN 0
ELSE 1
END AS order_index
FROM products
)
SELECT * FROM q
ORDER BY
order_index ,
CASE order_index
WHEN 0 THEN created_date
WHEN 1 THEN updated_date
END DESC;
A third approach is to exploit nulls:
order by
case
when created_date > ( current_date - 30 ) then created_date
end desc nulls last,
updated_date desc;
This approach can be useful when the ordering columns are of different types.
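To illustrate the type issue (using price as a purely hypothetical second sort key): a single CASE must resolve to one common type, so mixing a date branch with a numeric branch raises a "CASE types ... cannot be matched" error, while separate ORDER BY keys never need a common type:
SELECT *
FROM products
ORDER BY
  CASE WHEN created_date > (current_date - 30) THEN created_date
  END DESC NULLS LAST,  -- date-typed key: only recent rows get a value
  price DESC;           -- numeric key orders the remaining (older) rows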