Concatenate multiple contiguous rows to single row - tsql

I have a huge table with IoT data from a lot of IoT devices. Every device sends data once per minute, but only if its counter input received some signals; if not, no data is sent. So in my database the data looks like this:
Today I'm loading all this data into my application and aggregating it by iterating and checking row by row, collapsing each run of contiguous rows into a single row. Contiguous rows are all rows where the next row is one minute later. It works, but it doesn't feel smart or nice.
Does it make sense to do this aggregation on SQL Server instead, especially to increase performance?
How would you start?

This is a classic Islands and Gaps problem. I'm still mastering Islands and Gaps so I'd love any feedback on my solution from others in the know (please be gentle). There are at least a couple different ways to solve Islands and Gaps but this is the one that is easiest on my brain. Here's how I got it to work:
DDL to set up data:
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL
DROP TABLE #tmp;
CREATE TABLE #tmp
(IoT_Device INT,
Count INT,
TimeStamp DATETIME);
INSERT INTO #tmp
VALUES
(1, 5, '2021-10-27 14:03'),
(1, 4, '2021-10-27 14:04'),
(1, 7, '2021-10-27 14:05'),
(1, 8, '2021-10-27 14:06'),
(1, 5, '2021-10-27 14:07'),
(1, 4, '2021-10-27 14:08'),
(1, 7, '2021-10-27 14:12'),
(1, 8, '2021-10-27 14:13'),
(1, 5, '2021-10-27 14:14'),
(1, 4, '2021-10-27 14:15'),
(1, 5, '2021-10-27 14:21'),
(1, 4, '2021-10-27 14:22'),
(1, 7, '2021-10-27 14:23');
Islands and Gaps Solution:
;WITH CTE_TIMESTAMP_DATA AS (
SELECT
IoT_Device,
Count,
TimeStamp,
LAG(TimeStamp) OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS previous_timestamp,
LEAD(TimeStamp) OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS next_timestamp,
ROW_NUMBER() OVER
(PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_location
FROM #tmp
)
,CTE_ISLAND_START AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_number,
IoT_Device,
TimeStamp AS island_start_timestamp,
island_location AS island_start_location
FROM CTE_TIMESTAMP_DATA
WHERE DATEDIFF(MINUTE, previous_timestamp, TimeStamp) > 1
OR previous_timestamp IS NULL
)
,CTE_ISLAND_END AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_number,
IoT_Device,
TimeStamp AS island_end_timestamp,
island_location AS island_end_location
FROM CTE_TIMESTAMP_DATA
WHERE DATEDIFF(MINUTE, TimeStamp, next_timestamp) > 1
OR next_timestamp IS NULL
)
SELECT
S.IoT_Device,
(SELECT SUM(Count)
FROM CTE_TIMESTAMP_DATA
WHERE IoT_Device = S.IoT_Device
AND TimeStamp BETWEEN S.island_start_timestamp AND E.island_end_timestamp) AS Count,
S.island_start_timestamp,
E.island_end_timestamp
FROM CTE_ISLAND_START AS S
INNER JOIN CTE_ISLAND_END AS E
ON E.IoT_Device = S.IoT_Device
AND E.island_number = S.island_number;
The CTE_TIMESTAMP_DATA query pulls the IoT_Device, Count, and TimeStamp along with the TimeStamp before and after each record using LAG and LEAD, and assigns a row number to each record ordered by TimeStamp.
The CTE_ISLAND_START query gets the start of each island.
The CTE_ISLAND_END query gets the end of each island.
The main SELECT at the bottom then uses this data to sum the Count within each island.
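For the sample data above, the query should return something like this (three islands for device 1):
IoT_Device Count island_start_timestamp island_end_timestamp
---------- ----- ---------------------- --------------------
1          33    2021-10-27 14:03       2021-10-27 14:08
1          24    2021-10-27 14:12       2021-10-27 14:15
1          16    2021-10-27 14:21       2021-10-27 14:23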
This will work with multiple IoT_Devices.
You can read more about Islands and Gaps in numerous places online.
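As mentioned, there are a couple of ways to solve Islands and Gaps. For comparison, here is a minimal sketch of the other common approach (the row-number difference trick) against the same #tmp table; contiguous minutes keep a constant difference between the timestamp (in minutes) and the row number, so they group into the same island. This is my own addition, not part of the solution above, and should produce the same three islands:
;WITH CTE_GROUPED AS (
SELECT
IoT_Device,
Count,
TimeStamp,
-- contiguous minutes share the same (minutes - row_number) value
DATEDIFF(MINUTE, '2000-01-01', TimeStamp)
- ROW_NUMBER() OVER (PARTITION BY IoT_Device ORDER BY TimeStamp) AS island_group
FROM #tmp
)
SELECT
IoT_Device,
SUM(Count) AS Count,
MIN(TimeStamp) AS island_start_timestamp,
MAX(TimeStamp) AS island_end_timestamp
FROM CTE_GROUPED
GROUP BY IoT_Device, island_group
ORDER BY IoT_Device, island_start_timestamp;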


Best way to avoid duplicates in table?

I've been given a task that requires writing a script to mass change items in a table (ProductArea):
ProductID int
SalesareaID int
One ProductID can only exist once in each SalesareaID so there can't be any duplicates in this table. But one ProductID can be sold in multiple SalesareaID.
So an example would look something like:
ProductID SalesareaID
1 1
1 2
1 3
2 2
3 1
Now, some areas have merged. So, if I try to run a straightforward UPDATE to fix this like:
UPDATE ProductArea SET SalesareaID = 4 where SalesareaID IN (2, 3)
it will find (1, 2) and change that to (1, 4). Then it will find (1, 3) and try to change that to (1, 4). But that already exists, so it will crash with a "Cannot insert duplicate key..." error.
Is there a best/recommended way to tell my UPDATE to only update if the resulting (ProductID, SalesareaID) doesn't already exist?
This should work. It uses a window function to delete the rows that would become duplicates before running the update.
declare @T table (prodID int, salesID int, primary key (prodID, salesID));
insert into @T values
(1, 1)
, (1, 2)
, (1, 3)
, (2, 2)
, (3, 1);
with cte as
( select t.*
, row_number() over (partition by t.prodID order by t.salesID) as rn
from @T t
where t.salesID in (2, 3)
)
delete cte where rn > 1;
update @T set salesID = 4 where salesID in (2, 3);
select * from @T;
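For the sample rows above, the CTE deletes (1, 3) (the second row for prodID 1 among the merged areas), so the final select should return something like this, in no particular order:
prodID salesID
------ -------
1      1
1      4
2      4
3      1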
If you are creating a new merged region from existing regions then I think the easiest thing to do would be to treat the merge as two separate operations.
First you insert entries for the new area based on the existing areas.
INSERT INTO ProductArea (ProductID, SalesareaID)
SELECT DISTINCT ProductID, 4 FROM ProductArea
WHERE SalesareaID IN (2, 3)
Then you remove the entries for the existing areas.
DELETE FROM ProductArea WHERE SalesareaID IN (2, 3)
The SalesareaID of 4 would need to be replaced by the id of the new Salesarea. The 2 and 3 would also need to be replaced by the ids of the areas you are merging to create the new Salesarea.
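If it helps, here is a minimal sketch of those two steps wrapped in a single transaction so the merge is all-or-nothing, with a NOT EXISTS guard added in case some products are already assigned to the new area (the ids 2, 3 and 4 are the same placeholders as above, and the guard is my own addition):
BEGIN TRANSACTION;
-- create rows for the new merged area (4 = id of the new Salesarea)
INSERT INTO ProductArea (ProductID, SalesareaID)
SELECT DISTINCT pa.ProductID, 4
FROM ProductArea AS pa
WHERE pa.SalesareaID IN (2, 3)
AND NOT EXISTS (SELECT 1 FROM ProductArea AS x
WHERE x.ProductID = pa.ProductID AND x.SalesareaID = 4);
-- remove the rows for the areas that were merged away
DELETE FROM ProductArea WHERE SalesareaID IN (2, 3);
COMMIT;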

How to filter a query based on a jsonb data?

I'm not even sure if it's possible to do this kind of query in Postgres. At least, I'm stuck.
I have two tables: a product recommendation list, containing multiple products to be recommended to a particular customer; and a transaction table indicating the product bought by customer and transaction details.
I'm trying to track the performance of my recommendations by plotting all the transactions that match the recommendations (both customer and product).
Below is my test case.
Kindly help
create table if not exists productRec( --Product Recommendation list
task_id int,
customer_id int,
detail jsonb);
truncate productRec;
insert into productRec values (1, 2, '{"1":{"score":5, "name":"KitKat"},
"4":{"score":2, "name":"Yuppi"}
}'),
(1, 3, '{"1":{"score":3, "name":"Yuppi"},
"4":{"score":2, "name":"GoldenSnack"}
}'),
(1, 4, '{"1":{"score":3, "name":"Chickies"},
"4":{"score":2, "name":"Kitkat"}
}');
drop table if exists txn;
create table if not exists txn( --Transaction table
customer_id int,
item_id text,
txn_value numeric,
txn_date date);
truncate txn;
insert into txn values (1, 'Yuppi', 500, DATE '2001-01-01'), (2, 'Kitkat', 2000, DATE '2001-01-01'),
(3, 'Kitkat', 2000, DATE '2001-02-01'), (4, 'Chickies', 200, DATE '2001-09-01');
--> Query must plot:
--Transaction value vs date where the item_id is inside the recommendation for that customer
--ex: (2000, 2001-01-01), (200, 2001-09-01)
We can get each recommendation as its own row with jsonb_each. I don't know what to do with the keys so I just take the value (still jsonb) and then the name inside it (the ->> outputs text).
select
customer_id,
(jsonb_each(detail)).value->>'name' as name
from productrec
So now we have a list of customer_ids and item_ids they were recommended. Now we can just join this with the transactions.
select
txn.txn_value,
txn.txn_date
from txn
join (
select
customer_id,
(jsonb_each(detail)).value->>'name' as name
from productrec
) p ON (
txn.customer_id = p.customer_id AND
lower(txn.item_id) = lower(p.name)
);
In your example data you spelled Kitkat differently in the recommendation table for customer 2. I added lowercasing in the join condition to counter that but it might not be the right solution.
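As a side note (my own variation, not part of the answer above): the same expansion can also be written with a LATERAL join, which keeps the set-returning function out of the SELECT list:
select
txn.txn_value,
txn.txn_date
from txn
join productrec p on p.customer_id = txn.customer_id
cross join lateral jsonb_each(p.detail) rec -- one row per recommended product
where lower(txn.item_id) = lower(rec.value->>'name');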
txn_value | txn_date
-----------+------------
2000 | 2001-01-01
200 | 2001-09-01
(2 rows)

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
The task: of the payment cards used in the last day, what is the maximum amount charged on that card in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
The results are correct:
card_id max
------- ---
1 30
I want to rewrite the query using SQL window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match. How can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id|the_max
1| 30
As far as I know, PostgreSQL's window functions (at least through version 9.x) don't support a bounded RANGE frame with an interval offset, so RANGE BETWEEN '90 days' PRECEDING won't work. They do support a bounded ROWS frame such as ROWS BETWEEN 90 PRECEDING, but then you would need to assemble a time-series query similar to the following so that the window function operates on one row per card per day:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
For what you need (based on your question description), I would stick to using group by.
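That said, if you do want to see the window-function route through in PostgreSQL 9.x, here is a rough sketch built on the time-series query above; it assumes at most one transaction per card per day (the ROWS frame counts rows, not days), and the exact boundary of the 90-day window may need adjusting by a day:
WITH series AS (
SELECT c.card_id, t.amount, g.d AS d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN (SELECT DISTINCT card_id FROM test) c
LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
),
windowed AS (
SELECT card_id, amount, d_series,
MAX(amount) OVER (
PARTITION BY card_id
ORDER BY d_series
ROWS BETWEEN 90 PRECEDING AND CURRENT ROW
) AS the_max
FROM series
)
SELECT card_id, the_max
FROM windowed
WHERE d_series = '2017-07-06'::timestamp -- the last day
AND amount IS NOT NULL -- only cards actually used that day
ORDER BY card_id;
For the sample data this should again return card_id 1 with the_max 30.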

Postgres how can I merge 2 separate select queries into 1

I am using Postgres 9.4 and I would like to merge 2 separate queries into one statement. I've been looking at the "How to merge these queries into 1 using subquery" post but still can't figure out how to make it work. These 2 queries do work independently. Here they are:
# 1: select * from votes v where v.user_id=32 and v.stream_id=130;
#2: select city,state,post,created_on,votes,id as Voted from streams
where latitudes >=28.0363 AND 28.9059>= latitudes order by votes desc limit 5 ;
I would like query #2 to be limited to 5 rows, but I don't want query #1 to be included in that limit, so that up to 6 rows could be returned in total. This works like a suggestion engine where query #1 is the main thread and query #2 gives up to 5 different suggestions, which are obviously located in a different table.
Having neither your model nor your data, I simulated this problem with dummies of both in an SQL Fiddle.
CREATE TABLE votes
(
id smallint
, user_id smallint
);
CREATE TABLE streams
(
id smallint
, foo boolean
);
INSERT INTO votes
VALUES (1, 42), (2, 32), (3, 17), (4, 37), (5, 73), (6, 69), (7, 21), (8, 18), (9, 11), (10, 15), (11, 28);
INSERT INTO streams
VALUES (1, true), (2, true), (3, true), (4, true), (5, true), (6, true), (7, false), (8, false), (9, false), (10, false), (11, false);
SELECT
id
FROM
(SELECT id, 1 AS sort FROM votes WHERE user_id = 32) AS query_1
FULL JOIN (SELECT id FROM streams WHERE NOT foo LIMIT 5) AS query_2 USING (id)
ORDER BY
sort
LIMIT 6;
Also I have to point out that this isn't entirely my own work, but an adaptation of an answer I came across the other day. Maybe it is an approach here too.
So, what's going on? Column id stands for any column your tables and sub-queries have in common. I made up votes.user_id to have something to select on in the one sub-query, and streams.foo in the other.
As you wanted 6 rows at the most, I used the LIMIT clause twice: first in the sub-query, in case there is a huge number of rows in your table you don't want to select, and again in the outer query to finally restrict the number of rows. Fiddle about a little with the two limits and toggle WHERE foo and WHERE NOT foo and you'll see why.
In the first sub-query I added a sort column like it is done in that answer. That's because I guess you want the result of the first sub-query always on top too.
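With the dummy data above, the first sub-query contributes id 2 (with sort = 1) and the second contributes ids 7 through 11, so the statement should return six ids, with 2 on top thanks to the sort column and the remaining rows in no guaranteed order:
id
--
2
7
8
9
10
11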

PostgreSQL Get holes in index column

I suppose it is not easy to query a table for data that doesn't exist, but maybe there is some trick to find the holes in an integer column (rowindex).
Here is a small table illustrating the concrete situation:
DROP TABLE IF EXISTS examtable1;
CREATE TABLE examtable1
(rowindex integer primary key, mydate timestamp, num1 integer);
INSERT INTO examtable1 (rowindex, mydate, num1)
VALUES (1, '2015-03-09 07:12:45', 1),
(3, '2015-03-09 07:17:12', 4),
(5, '2015-03-09 07:22:43', 1),
(6, '2015-03-09 07:25:15', 3),
(7, '2015-03-09 07:41:46', 2),
(10, '2015-03-09 07:42:05', 1),
(11, '2015-03-09 07:45:16', 4),
(14, '2015-03-09 07:48:38', 5),
(15, '2015-03-09 08:15:44', 2);
SELECT rowindex FROM examtable1;
With the query shown I get all used indexes listed.
But I would like to get (say) the first five indexes which are missing, so I can use them to insert new data at the desired rowindex.
In this concrete example the result would be: 2, 4, 8, 9, 12, which are the indexes that are not used.
Is there any trick to build a query which will give the first n missing indexes?
In reality, such a table may contain many rows and the "holes" can be anywhere.
You can do this by generating a list of all numbers using generate_series() and then checking which numbers don't exist in your table.
This can be done either with an outer join:
select nr.i as missing_index
from (
select i
from generate_series(1, (select max(rowindex) from examtable1)) i
) nr
left join examtable1 t1 on nr.i = t1.rowindex
where t1.rowindex is null;
or with a NOT EXISTS query:
select i
from generate_series(1, (select max(rowindex) from examtable1)) i
where not exists (select 1
from examtable1 t1
where t1.rowindex = i.i);
I have used a hardcoded lower bound for generate_series() so that you would also detect a missing rowindex that is smaller than the lowest number.
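If you only need the first n missing indexes (five in your example), it should be enough to add an ORDER BY and a LIMIT to either variant, for example:
select i
from generate_series(1, (select max(rowindex) from examtable1)) i
where not exists (select 1
from examtable1 t1
where t1.rowindex = i.i)
order by i
limit 5;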