Finding groups of clustered data points - tsql

I have a table which records events:
create table #events
( intRowId int identity(1,1),
intItemId int,
intUserId int,
datEvent datetime)
It's a big table with many millions of rows, recording events against several thousand items and tens of thousands of users.
There's a select group of ten itemIDs I want to look for, but only when they occur in a certain pattern: I'm trying to find rows where all ten of these items have events registered against them for the same userID and close together in time, say 5 minutes.
I have absolutely NO IDEA how to go about this. One would assume partitioning is involved somewhere, but help, even just somewhere to get started, would be much appreciated.
Cheers,
Matt

Lets say you want statistics about items id: 1, 2, ... , 10.
First create a table EventByItems:
CREATE TABLE EventByItems
(
intRowId int identity(1,1),
intUserId int,
datEvent datetime,
intItem1 int,
intItem2 int,
intItem3 int,
...
intItem10 int
)
Then use query to populate this table:
SELECT intUserId, datEvent,
SUM(pvt.[1]), SUM(pvt.[2]), SUM(pvt.[3]), ... , SUM(pvt.[10])
FROM #events
PIVOT
(
COUNT(intItemId)
FOR intItemId IN ([1], [2], [3], ... , [10])
) AS pvt
GROUP BY intUserId, datEvent
Now we can do some work with that table. For example we can update it to fill gaps according your logic. Or we can do queries like that:
SELECT
intRowId,
intUserId,
datEvent
FROM
EventByItems AS E
WHERE
((intItem1 > 0) OR EXISTS(SELECT *
FROM EventByItems
WHERE intUserId = E.intUserId
AND intItem1 > 0
AND DATEDIFF(MINUTE, datEvent, E.datEvent) <= 5
AND intRowId != E.intRowId ))
AND
...
AND
((intItem10 > 0) OR EXISTS(SELECT *
FROM EventByItems
WHERE intUserId = E.intUserId
AND intItem10 > 0
AND DATEDIFF(MINUTE, datEvent, E.datEvent) <= 5
AND intRowId != E.intRowId ))

Ok, so below you'll find a working example which does what you want. I'm assuming that events don't have to appear in tens.
But the solution is very crude, and will run slow, especially if you increase number of items/users.
Temp table with pre-selected events would help with performance in my solution, but what you really need are windowed functions like in Oracle..
DROP TABLE #events
GO
create table #events
( intRowId int identity(1,1),
intItemId int,
intUserId int,
datEvent datetime)
GO
insert into #events (intUserId,intItemId, datEvent)
select '1','1','2013-05-01 10:25' union all --group1
select '1','2','2013-05-01 10:25' union all --group1
select '1','3','2013-05-01 10:26' union all --group1
select '1','7','2013-05-01 10:25' union all
select '1','8','2013-05-01 10:25' union all
select '1','9','2013-05-01 10:26' union all
select '1','1','2013-05-01 10:50' union all --group2
select '1','2','2013-05-01 10:52' union all --group2
select '1','3','2013-05-01 10:59' union all
select '1','1','2013-05-01 11:10' union all --group3
select '1','1','2013-05-01 11:12' union all --group3
select '1','3','2013-05-01 11:17' union all --group3
select '1','2','2013-05-01 11:25' union all
select '1','1','2013-05-01 11:31' union all
select '1','7','2013-05-01 11:32' union all
select '1','2','2013-05-01 11:50' union all --group4
select '1','2','2013-05-01 11:50' union all --group4
select '1','3','2013-05-01 11:50' union all --group4
select '1','1','2013-05-01 11:56'
GO
DROP TABLE #temp
GO
select
e1.intRowId as intRowId_1, e1.intItemId as intItemId_1, e1.intUserId as intUserId_1, e1.datEvent as datEvent_1
,e2.intRowId as intRowId_2, e2.intItemId as intItemId_2, e2.intUserId as intUserId_2, e2.datEvent as datEvent_2
into #temp
from #events e1
join #events e2
on e1.intUserId=e2.intUserId
and e1.datEvent<=e2.datEvent
and e1.intRowId<>e2.intRowId
where 1=1
and e1.intUserId=1
and e2.intUserId=1
and e1.intItemId in (1,2,3)
and e2.intItemId in (1,2,3)
and datediff(minute,e1.datevent,e2.datevent)<6
order by
e1.intRowId, e2.intRowId
GO
select distinct
*
from (
select
intRowId_1 as intRowId, intItemId_1 as intItemId, intUserId_1 as intUserId, datEvent_1 as datEvent
from #temp
UNION ALL
select
intRowId_2 as intRowId, intItemId_2 as intItemId, intUserId_2 as intUserId, datEvent_2 as datEvent
from #temp
) x
order by
datEvent, intRowId

Related

Select Date and Count, Group By Date -- How to show Dates with NULL Counts?

SELECT
CAST(c.DT AS DATE) AS 'Date'
, COUNT(p.PatternID) AS 'Count'
FROM CalendarMain c
LEFT OUTER JOIN Pattern p
ON c.DT = p.PatternDate
INNER JOIN Result r
ON p.PatternID = r.PatternID
INNER JOIN Detail d
ON p.PatternID = d.PatternID
WHERE r.Type = 7
AND d.Panel = 501
AND CAST(c.DT AS DATE)
BETWEEN '20190101' AND '20190201'
GROUP BY CAST(c.DT AS DATE)
ORDER BY CAST(c.DT AS DATE)
The query above isn't working for me. It still skips days where the COUNT is NULL for it's c.DT.
c.DT and p.PatternDate are both time DateTime, although c.DT can't be NULL. It is actually the PK for the table. It is populated as DateTimes for every single day from 2015 to 2049, so the records for those days exist.
Another weird thing I noticed is that nothing returns at all when I join C.DT = p.PatternDate without a CAST or CONVERT to a Date style. Not sure why when they are both DateTimes.
There are a few things to talk about here. At this stage it's not clear what you're actually trying to count. If it's the number of "patterns" per day for the month of Jan 2019, then:
Your BETWEEN will also count any activity occurring on 1 Feb,
It looks like one pattern could have multiple results, potentially causing a miscount
It looks like one pattern could have multiple details, potentially causing a miscount
If one pattern has say 3 eligible results, and also 4 details, you'll get the cross product of them. Your count will be 12.
I'm going to assume:
you only want the distinct number of patterns, regardless of the number of details and results.
You only want January's activity
--Set up some dummy data
DROP TABLE IF EXISTS #CalendarMain
SELECT cast('20190101' as datetime) as DT
INTO #CalendarMain
UNION ALL SELECT '20190102' as DT
UNION ALL SELECT '20190103' as DT
UNION ALL SELECT '20190104' as DT
UNION ALL SELECT '20190105' as DT --etc etc
;
DROP TABLE IF EXISTS #Pattern
SELECT cast('1'as int) as PatternID
,cast('20190101 13:00' as datetime) as PatternDate
INTO #Pattern
UNION ALL SELECT 2,'20190101 14:00'
UNION ALL SELECT 3,'20190101 15:00'
UNION ALL SELECT 4,'20190104 11:00'
UNION ALL SELECT 5,'20190104 14:00'
;
DROP TABLE IF EXISTS #Result
SELECT cast(100 as int) as ResultID
,cast(1 as int) as PatternID
,cast(7 as int) as [Type]
INTO #Result
UNION ALL SELECT 101,1,7
UNION ALL SELECT 102,1,8
UNION ALL SELECT 103,1,9
UNION ALL SELECT 104,2,8
UNION ALL SELECT 105,2,7
UNION ALL SELECT 106,3,7
UNION ALL SELECT 107,3,8
UNION ALL SELECT 108,4,7
UNION ALL SELECT 109,5,7
UNION ALL SELECT 110,5,8
;
DROP TABLE IF EXISTS #Detail
SELECT cast(201 as int) as DetailID
,cast(1 as int) as PatternID
,cast(501 as int) as Panel
INTO #Detail
UNION ALL SELECT 202,1,502
UNION ALL SELECT 203,1,503
UNION ALL SELECT 204,1,502
UNION ALL SELECT 205,1,502
UNION ALL SELECT 206,1,502
UNION ALL SELECT 207,2,502
UNION ALL SELECT 208,2,503
UNION ALL SELECT 209,2,502
UNION ALL SELECT 210,4,502
UNION ALL SELECT 211,4,501
;
-- create some variables
DECLARE #start_date as date = '20190101';
DECLARE #end_date as date = '20190201'; --I assume this is an exclusive end date
SELECT cal.DT
,isnull(patterns.[count],0) as [Count]
FROM #CalendarMain cal
LEFT JOIN ( SELECT cast(p.PatternDate as date) as PatternDate
,COUNT(DISTINCT p.PatternID) as [Count]
FROM #Pattern p
JOIN #Result r ON p.PatternID = r.PatternID
JOIN #Detail d ON p.PatternID = d.PatternID
WHERE r.[Type] = 7
and d.Panel = 501
GROUP BY cast(p.PatternDate as date)
) patterns ON cal.DT = patterns.patternDate
WHERE cal.DT >= #start_date
and cal.DT < #end_date --Your code would have included 1 Feb, which I assume was unintentional.
ORDER BY cal.DT
;

Pivot Table By One Column And Multiple Aggregates

I prepared the following report. It displays three columns (AccruedInterest, Tip & CardRevenue) by month for a given year. Now I want to "Rotate" the result so that StartDate value turn into 12 columns.
I have this:
I need this:
I have tried pivoting the table but I need multiple columns to be aggregated as you see.
You have to unpivot your data and the pivot.
Note: I put in my own values in your table as to make each unique so you know the data is correct.
--Create YourTable
SELECT * INTO YourTable
FROM
(
SELECT CAST('2015-01-01' AS DATE) StartDate,
607.834 AS AccruedInterest,
1 AS Tip,
3 AS CardRevenue
UNION ALL
SELECT CAST('2015-02-01' AS DATE) StartDate,
643.298 AS AccruedInterest,
16.8325 AS Tip,
5 AS CardRevenue
) A;
GO
--This pivots your data
SELECT *
FROM
(
--This unpivots your data using cross apply
SELECT col,val,StartDate
FROM YourTable
CROSS APPLY
(
SELECT 'AccruedInterest', CAST(AccruedInterest AS VARCHAR(100))
UNION ALL
SELECT 'Tip', CAST(Tip AS VARCHAR(100))
UNION ALL
SELECT 'CardRevenue', CAST(CardRevenue AS VARCHAR(100))
) A(col,val)
) B
PIVOT
(
MAX(val) FOR startdate IN([2015-01-01],[2015-02-01])
) pvt
Results:
col 2015-01-01 2015-02-01
AccruedInterest 607.834 643.298
CardRevenue 3 5
Tip 1.0000 16.8325

T-SQL how to count the number of duplicate rows then print the outcome?

I have a table ProductNumberDuplicates_backups, which has two columns named ProductID and ProductNumber. There are some duplicate ProductNumbers. How can I count the distinct number of products, then print out the outcome like "() products was backup." ? Because this is inside a stored procedure, I have to use a variable #numrecord as the distinct number of rows. I put my codes like this:
set #numrecord= select distinct ProductNumber
from ProductNumberDuplicates_backups where COUNT(*) > 1
group by ProductID
having Count(ProductNumber)>1
Print cast(#numrecord as varchar)+' product(s) were backed up.'
obviously the error was after the = sign as the select can not follow it. I've search for similar cases but they are just select statements. Please help. Many thanks!
Try
select #numrecord= count(distinct ProductNumber)
from ProductNumberDuplicates_backups
Print cast(#numrecord as varchar)+' product(s) were backed up.'
begin tran
create table ProductNumberDuplicates_backups (
ProductNumber int
)
insert ProductNumberDuplicates_backups(ProductNumber)
select 1
union all
select 2
union all
select 1
union all
select 3
union all
select 2
select * from ProductNumberDuplicates_backups
declare #numRecord int
select #numRecord = count(ProductNumber) from
(select ProductNumber, ROW_NUMBER()
over (partition by ProductNumber order by ProductNumber) RowNumber
from ProductNumberDuplicates_backups) p
where p.RowNumber > 1
print cast(#numRecord as varchar) + ' product(s) were backed up.'
rollback

SQL Running Subtraction

Just a brief of business scenario is table has been created for a good receipt. So here we have good expected line with PurchaseOrder(PO) in first few line. And then we receive each expected line physically and that time these quantity may be different, due to business case like quantity may damage and short quantity like that. So we maintain a status for that eg: OK, Damage, also we have to calculate short quantity based on total of expected quantity of each item and total of received line.
if object_id('DEV..Temp','U') is not null
drop table Temp
CREATE TABLE Temp
(
ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
Item VARCHAR(32),
PO VARCHAR(32) NULL,
ExpectedQty INT NULL,
ReceivedQty INT NULL,
[STATUS] VARCHAR(32) NULL,
BoxName VARCHAR(32) NULL
)
Please see first few line with PO data will be the expected lines,
and then rest line will be received line
INSERT INTO TEMP (Item,PO,ExpectedQty,ReceivedQty,[STATUS],BoxName)
SELECT 'ITEM01','PO-01','30',NULL,NULL,NULL UNION ALL
SELECT 'ITEM01','PO-02','20',NULL,NULL,NULL UNION ALL
SELECT 'ITEM02','PO-01','40',NULL,NULL,NULL UNION ALL
SELECT 'ITEM03','PO-01','50',NULL,NULL,NULL UNION ALL
SELECT 'ITEM03','PO-02','30',NULL,NULL,NULL UNION ALL
SELECT 'ITEM03','PO-03','20',NULL,NULL,NULL UNION ALL
SELECT 'ITEM04','PO-01','30',NULL,NULL,NULL UNION ALL
SELECT 'ITEM01',NULL,NULL,'20','OK','box01' UNION ALL
SELECT 'ITEM01',NULL,NULL,'25','OK','box02' UNION ALL
SELECT 'ITEM01',NULL,NULL,'5','DAMAGE','box03' UNION ALL
SELECT 'ITEM02',NULL,NULL,'38','OK','box04' UNION ALL
SELECT 'ITEM02',NULL,NULL,'2','DAMAGE','box05' UNION ALL
SELECT 'ITEM03',NULL,NULL,'30','OK','box06' UNION ALL
SELECT 'ITEM03',NULL,NULL,'30','OK','box07' UNION ALL
SELECT 'ITEM03',NULL,NULL,'30','OK','box08' UNION ALL
SELECT 'ITEM03',NULL,NULL,'10','DAMAGE','box09' UNION ALL
SELECT 'ITEM04',NULL,NULL,'25','OK','box10'
Below Table is my expected result based on above data.
I need to show those data following way.
So I appreciate if you can give me an appropriate query for it.
Note: first row is blank and it is actually my table header. :)
SELECT '' as 'ITEM', '' as 'PO#', '' as 'ExpectedQty',
'' as 'ReceivedQty','' as 'DamageQty' ,'' as 'ShortQty' UNION ALL
SELECT 'ITEM01','PO-01','30','30','0' ,'0' UNION ALL
SELECT 'ITEM01','PO-02','20','15','5' ,'0' UNION ALL
SELECT 'ITEM02','PO-01','40','38','2' ,'0' UNION ALL
SELECT 'ITEM03','PO-01','50','50','0' ,'0' UNION ALL
SELECT 'ITEM03','PO-02','30','30','0' ,'0' UNION ALL
SELECT 'ITEM03','PO-03','20','10','10','0' UNION ALL
SELECT 'ITEM04','PO-01','30','25','0' ,'5'
Note : we don't received more than expected.
solution should be based on SQL 2000
You should reconsider how you store this data. Separate Expected and Received+Damaged in different tables (you have many unused (null) cells). This way any query should become more readable.
I think what you try to do can be achieved more easily with a stored procedure.
Anyway, try this query:
SELECT Item, PO, ExpectedQty,
CASE WHEN [rec-consumed] > 0 THEN ExpectedQty
ELSE CASE WHEN [rec-consumed] + ExpectedQty > 0
THEN [rec-consumed] + ExpectedQty
ELSE 0
END
END ReceivedQty,
CASE WHEN [rec-consumed] < 0
THEN CASE WHEN DamageQty >= -1*[rec-consumed]
THEN -1*[rec-consumed]
ELSE DamageQty
END
ELSE 0
END DamageQty,
CASE WHEN [rec_damage-consumed] < 0
THEN DamageQty - [rec-consumed]
ELSE 0
END ShortQty
FROM (
select t1.Item,
t1.PO,
t1.ExpectedQty,
st.sum_ReceivedQty_OK
- (sum(COALESCE(t2.ExpectedQty,0))
+t1.ExpectedQty)
[rec-consumed],
st.sum_ReceivedQty_OK + st.sum_ReceivedQty_DAMAGE
- (sum(COALESCE(t2.ExpectedQty,0))
+t1.ExpectedQty)
[rec_damage-consumed],
st.sum_ReceivedQty_DAMAGE DamageQty
from #tt t1
left join #tt t2 on t1.Item = t2.Item
and t1.PO > t2.PO
and t2.PO is not null
join (select Item
, sum(CASE WHEN status = 'OK' THEN ReceivedQty ELSE 0 END)
sum_ReceivedQty_OK
, sum(CASE WHEN status != 'OK' THEN ReceivedQty ELSE 0 END)
sum_ReceivedQty_DAMAGE
from #tt where PO is null
group by Item) st on t1.Item = st.Item
where t1.PO is not null
group by t1.Item, t1.PO, t1.ExpectedQty,
st.sum_ReceivedQty_OK,
st.sum_ReceivedQty_DAMAGE
) a
order by Item, PO

Query to get row from one table, else random row from another

tblUserProfile - I have a table which holds all the Profile Info (too many fields)
tblMonthlyProfiles - Another table which has just the ProfileID in it (the idea is that this table holds 2 profileids which sometimes become monthly profiles (on selection))
Now when I need to show monthly profiles, I simply do a select from this tblMonthlyProfiles and Join with tblUserProfile to get all valid info.
If there are no rows in tblMonthlyProfile, then monthly profile section is not displayed.
Now the requirement is to ALWAYS show Monthly Profiles. If there are no rows in monthlyProfiles, it should pick up 2 random profiles from tblUserProfile. If there is only one row in monthlyProfiles, it should pick up only one random row from tblUserProfile.
What is the best way to do all this in one single query ?
I thought something like this
select top 2 * from tblUserProfile P
LEFT OUTER JOIN tblMonthlyProfiles M
on M.profileid = P.profileid
ORder by NEWID()
But this always gives me 2 random rows from tblProfile. How can I solve this ?
Try something like this:
SELECT TOP 2 Field1, Field2, Field3, FinalOrder FROM
(
select top 2 Field1, Field2, Field3, FinalOrder, '1' As FinalOrder from tblUserProfile P JOIN tblMonthlyProfiles M on M.profileid = P.profileid
UNION
select top 2 Field1, Field2, Field3, FinalOrder, '2' AS FinalOrder from tblUserProfile P LEFT OUTER JOIN tblMonthlyProfiles M on M.profileid = P.profileid ORDER BY NEWID()
)
ORDER BY FinalOrder
The idea being to pick two monthly profiles (if that many exist) and then 2 random profiles (as you correctly did) and then UNION them. You'll have between 2 and 4 records at that point. Grab the top two. FinalOrder column is an easy way to make sure that you try and get the monthly's first.
If you have control of the table structure, you might save yourself some trouble by simply adding a boolean field IsMonthlyProfile to the UserProfile table. Then it's a single table query, order by IsBoolean, NewID()
In SQL 2000+ compliant syntax you could do something like:
Select ...
From (
Select TOP 2 ...
From tblUserProfile As UP
Where Not Exists( Select 1 From tblMonthlyProfile As MP1 )
Order By NewId()
) As RandomProfile
Union All
Select MP....
From tblUserProfile As UP
Join tblMonthlyProfile As MP
On MP.ProfileId = UP.ProfileId
Where ( Select Count(*) From tblMonthlyProfile As MP1 ) >= 1
Union All
Select ...
From (
Select TOP 1 ...
From tblUserProfile As UP
Where ( Select Count(*) From tblMonthlyProfile As MP1 ) = 1
Order By NewId()
) As RandomProfile
Using SQL 2005+ CTE you can do:
With
TwoRandomProfiles As
(
Select TOP 2 ..., ROW_NUMBER() OVER ( ORDER BY UP.ProfileID ) As Num
From tblUserProfile As UP
Order By NewId()
)
Select MP.Col1, ...
From tblUserProfile As UP
Join tblMonthlyProfile As MP
On MP.ProfileId = UP.ProfileId
Where ( Select Count(*) From tblMonthlyProfile As MP1 ) >= 1
Union All
Select ...
From TwoRandomProfiles
Where Not Exists( Select 1 From tblMonthlyProfile As MP1 )
Union All
Select ...
From TwoRandomProfiles
Where ( Select Count(*) From tblMonthlyProfile As MP1 ) = 1
And Num = 1
The CTE has the advantage of only querying for the random profiles once and the use of the ROW_NUMBER() column.
Obviously, in all the UNION statements the number and type of the columns must match.