Daily counts with TSQL?

I have a site where I record client metrics in a SQL Server 2008 db on every link clicked. I have already written the query to get the daily total clicks, but now I want to find out how many times a user clicked within a given timespan (i.e. within 5 seconds).
The idea here is to lock out incoming IP addresses that are trying to scrape content. The assumption is that if more than 5 "clicks" are detected within 5 seconds, or the number of daily clicks from a given IP address exceeds some value, this is a scraping attempt.
I have tried a few variations of the following:
-- when a user clicked more than 5 times in 5 seconds
SELECT DATEADD(SECOND, DATEDIFF(SECOND, 0, ClickTimeStamp), 0) as ClickTimeStamp, COUNT(UserClickID) as [Count]
FROM UserClicks
WHERE DATEDIFF(SECOND, 0, ClickTimeStamp) = 5
GROUP BY IPAddress, ClickTimeStamp
This one in particular returns the following error:
Msg 535, Level 16, State 0, Line 3 The datediff function resulted in
an overflow. The number of dateparts separating two date/time
instances is too large. Try to use datediff with a less precise
datepart.
So once again, I want to use the SECOND datepart. I believe I'm on the right track, but I'm not quite getting it.
Help appreciated. Thanks.
-- UPDATE --
Great suggestions that helped me see the approach is wrong. The check is going to be made on every click. What I should do is, for a given timestamp, check whether 5 clicks have been recorded from the same IP address in the last 5 seconds. So it would be something like: count the number of clicks where the timestamp is greater than GETDATE() minus 5 seconds.
Trying the following still isn't giving me an accurate figure.
SELECT COUNT(*)
FROM UserClicks
WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate())

Hoping my syntax is good; I only have Oracle to test this on. I'm going to assume you have an ID column called user_id that is unique to that user (is it user_click_id? It's helpful to include table create statements in these questions when you can).
You'll have to perform a self join on this one. The logic: join userclicks onto userclicks on user_id = user_id where the difference between the clicktimestamps is between 0 and 5 seconds, then count from the subselect.
select uc1.user_id, uc1.clicktimestamp, uc2.clicktimestamp
from userclicks uc1
left join userclicks uc2
  on uc2.user_id = uc1.user_id
  and datediff(second, uc1.ClickTimeStamp, uc2.ClickTimeStamp) <= 5
  and datediff(second, uc1.ClickTimeStamp, uc2.ClickTimeStamp) > 0
This select statement should give you the user_id/clicktimestamp plus one row for every record from the same user that falls within 0 to 5 seconds of that clicktimestamp. Now it's just a matter of counting all user_id/clicktimestamp combinations and highlighting the ones with 5 or more. Take the above query, turn it into a subselect, and pull counts from it:
select a.user_id, a.clicktimestamp, count(1)
from
    (select uc1.user_id, uc1.clicktimestamp
     from userclicks uc1
     left join userclicks uc2
       on uc2.user_id = uc1.user_id
       and datediff(second, uc1.ClickTimeStamp, uc2.ClickTimeStamp) <= 5
       and datediff(second, uc1.ClickTimeStamp, uc2.ClickTimeStamp) > 0) a
group by a.user_id, a.clicktimestamp
having count(1) >= 5
Wish I could verify my syntax on an MS machine... there might be some typos in there, but the logic should be good.

An answer for your UPDATE: the problem is in the third line of
SELECT COUNT(*)
FROM UserClicks
WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate())
GetDate() - DATEADD(SECOND, -5, GetDate()) is saying "take the current date time and subtract (the current date time minus five seconds)". I'm not entirely sure what kind of value this produces, but it won't be the one you want.
You still want some kind of time period, perhaps like so:
SELECT count(*)
from UserClicks
where IPAddress = @IPAddress
  and ClickTimeStamp between dateadd(second, -5, getdate()) and getdate()
I'm a bit uncomfortable using getdate() there--if you have a specific datetime value (accurate to the second), you should probably use it.

Assuming log entries are only entered for current activity -- that is, whenever a new row is inserted, the logged time is for that point in time and never for any prior point in time -- then you should only need to review data for a set period of time, and not have to review "all data" as you are doing now.
Next question is: how frequently do you make this check? If you are concerned with clicks per second, then something between "once per hour" and "once every 24 hours" seems reasonable.
Next up: define your interval. "All clicks per IPAddress within 5 seconds" could go two ways: a fixed window (00-04, 05-09, 10-14, etc.) or a sliding window (00-04, 01-05, 02-06, etc.). Probably irrelevant with a 5-second window, but perhaps more relevant for longer periods (clicks per "day").
With that, the general approach I'd take is (a T-SQL sketch follows the list):
Start with earliest point in time you care about (1 hour ago, 24 hours ago)
Set up "buckets", means by which time windows can be identified (00:00:00 - 00:00:04, 00:00:05 - 00:00:09, etc.). This could be done as a temp table.
For all events, calculate number of elapsed seconds since your earliest point
For each bucket, count number of events that hit that bucket, grouped by IPAddress (inner join on the temp table on seconds between lowValue and highValue)
Identify those that exceed your threshold (having count(*) > X), and defenestrate them.
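Below is a minimal sketch of the fixed-window variant, assuming a UserClicks table with IPAddress and ClickTimeStamp columns (names taken from the question). Instead of a physical temp table of buckets, it derives the bucket number with integer division, which serves the same purpose:
-- Fixed 5-second buckets over the last hour; IPs exceeding the threshold are blocking candidates.
DECLARE @start DATETIME = DATEADD(HOUR, -1, GETDATE());

SELECT IPAddress,
       DATEDIFF(SECOND, @start, ClickTimeStamp) / 5 AS BucketNo, -- 0 = first 5 seconds, 1 = next 5, ...
       COUNT(*) AS Clicks
FROM UserClicks
WHERE ClickTimeStamp >= @start
GROUP BY IPAddress, DATEDIFF(SECOND, @start, ClickTimeStamp) / 5
HAVING COUNT(*) > 5;
A sliding window would instead need something like the self-join approach from the first answer.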


I need to add up lots of values between date ranges as quickly as possible using PostgreSQL, what's the best method?

Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
factor_date date,
factor_value numeric(3,1));
CREATE TABLE customer_date_ranges (
customer_id int,
date_from date,
date_to date);
INSERT INTO
daily_factors
SELECT
t.factor_date,
(random() * 10 + 30)::numeric(3,1)
FROM
generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);
WITH customer_id AS (
SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
SELECT
customer_id,
(timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
FROM
customer_id)
INSERT INTO
customer_date_ranges
SELECT
d.customer_id,
d.date_from,
(d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers", all of whom have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
cd.customer_id,
AVG(df.factor_value) AS average_value
FROM
customer_date_ranges cd
INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28s with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). Also my real world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. Also, the number of values to store keeps growing every day (roughly quadratically in the number of days);
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?
The classic way to deal with this is to store running sums of factor_value, not (or in addition to) the individual values. Then you just look up the running sum at the two end points (actually at the end, and one day before the start) and take the difference. And of course divide by the count to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
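As a hedged sketch of that idea against the sample schema above (the daily_factor_sums table name is illustrative, not from the original post):
-- Precompute running sums and day numbers once; refresh as new days arrive.
CREATE TABLE daily_factor_sums AS
SELECT
    factor_date,
    SUM(factor_value) OVER (ORDER BY factor_date) AS running_sum,
    ROW_NUMBER() OVER (ORDER BY factor_date) AS day_number
FROM daily_factors;

CREATE UNIQUE INDEX ON daily_factor_sums (factor_date);

-- Average over a range = (sum at the end - sum the day before the start) / number of days.
-- This turns the non-equi join into two equi joins on factor_date.
SELECT
    cd.customer_id,
    (s_end.running_sum - COALESCE(s_start.running_sum, 0))
        / (s_end.day_number - COALESCE(s_start.day_number, 0)) AS average_value
FROM customer_date_ranges cd
JOIN daily_factor_sums s_end ON s_end.factor_date = cd.date_to
LEFT JOIN daily_factor_sums s_start ON s_start.factor_date = cd.date_from - 1;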

A T-SQL process to identify the total duration or number of days of all "cases" within a specified time period. This is a challenge

I could really do with some help, and I intend to be active in this community and help others in return. I have been a SQL developer using MS SQL Server for the last two years, but I've hit a roadblock on this one. Imagine the scenario: you have a number of "Accommodation Providers", each with a certain "Service Capacity". We have a dataset with a number of concurrent "Placements", which can be any duration from a day to several years. We would like to know the "Occupancy Rate", calculated as
Occupancy = Placement Days (all days in all placements within the period) / (Capacity x Days in Period) x 100
I have changed names of fields/tables and am showing some made-up sample data here.
We have one dataset in a table (tPL) for "Placements". There are many thousands of records, going back 7 years, e.g.:
tbl_Placements tPL:
[Provider Name]   [Name of Client]   [Vacancy Filled Date]   [Vacancy End Date]   [Placement_Length_in_Days]
Accommodation1    John Smith         2018-08-04              2018-08-12           8
Accommodation1    Jane Smith         2019-01-28              2019-04-09           294
tbl_Month_Year tMY:
Month_Year
2018-03-01
2018-04-01
2018-05-01
2018-06-01
2018-07-01
2018-08-01
2018-09-01
2018-10-01
2018-11-01
2018-12-01
2019-01-01
2019-02-01
2019-03-01
2019-04-01
2019-05-01
and lastly
tbl_Service_Capacity tSC:
[Provider Name] [Service Capacity]
Accommodation 1 12
Accommodation 2 4
Dividing by the service capacity is the easy part. Where I'm struggling is calculating the total number of "Placement Days" in a given period such as a month or quarter.
If you consider that Accommodation1, 2 and 3 can have multiple concurrent and overlapping placements of different lengths, which can start and finish at any time, how can I calculate the total number of days across all placements that fall within a given time period (e.g. a quarter or a month), in order to then calculate the occupancy percentage? The code below is an attempt. I'm treating every month as 30 days here, which I know is wrong, and I know the logic for counting the days is wrong too. To be honest, I'm almost totally fried and I just can't seem to get this done, hence I'm asking for help.
Am I going about this the wrong way by joining on a date table? Has anyone come up against this before? Also, if you would like me to give more information or clarify, I'm happy to do so.
Any help you can give will be hugely appreciated!
Please see the code below. I've tried it a few different ways, but sadly did not save the older versions to show; they didn't work, though. I've done something similar in the past to see how many "open cases" there were at any given point in time. That inspired the code here and went like this:
SELECT TOP (1000) tMY.Month_Year, COUNT(*) AS ActiveCases
FROM tbl_Casework AS tblCW
LEFT OUTER JOIN tbl_Month_Year AS tMY
    ON tMY.Month_Year >= tblCW.Start_Date
    AND tMY.Month_Year <= DATEADD(day, 31 - DATEPART(day, ISNULL(tblCW.End_Date, GETDATE())), ISNULL(tblCW.End_Date, GETDATE()))
GROUP BY tMY.Month_Year
This definitely worked, but was just a count of "how many cases were open at some point during each month?"
SELECT tMY.Month_Year
    ,tPL.[Accommodation Provider]
    ,tSC.[Service Capacity]
    ,(
        -- if started before month began and closed at or after end of month / or still open
        sum(case when datediff(day, tPL.[Vacancy Filled Date], tMY.[Month_Year]) < 0
                  and (datediff(day, tMY.[Month_Year], tPL.[Vacancy End Date]) >= 30 or tPL.[Vacancy End Date] is null)
                 then 30 end)
        -- if started after month began and closed during month
      + sum(case when datediff(day, tPL.[Vacancy Filled Date], tMY.[Month_Year]) >= 0
                  and datediff(day, tMY.[Month_Year], tPL.[Vacancy End Date]) <= 30
                 then tPL.[Placement_Length_in_Days] end)
        -- if started before and closed after month - take filled date to end of month
      + sum(case when datediff(day, tMY.[Month_Year], tPL.[Vacancy End Date]) >= 30
                  and datediff(day, tPL.[Vacancy Filled Date], tMY.[Month_Year]) < 0
                 then datediff(day, tPL.[Vacancy Filled Date], DATEADD(DAY, 30, tMY.Month_Year)) end)
     ) / (tSC.[Service Capacity] * 30) * 100 AS [Occupancy Rate]
FROM [tbl_Placements] tPL
inner join tbl_Service_Capacity tSC on tSC.[Provider Name] = tPL.[Accommodation Provider]
left outer join tbl_Month_Year tMY on tMY.Month_Year >= tPL.[Vacancy Filled Date]
    and tMY.Month_Year <= DATEADD(day, 30, tPL.[Vacancy Filled Date])
WHERE tPL.[Vacancy Filled Date] >= '20160501'
    and tMY.Month_Year < (getdate() - 30)
    and tSC.[Service Capacity] is not null
group by tMY.Month_Year, tPL.[Accommodation Provider], tSC.[Service Capacity] --, tPL.[Client Name]
order by tMY.Month_Year asc
The code runs, but I get crazy occupancy rates of 300% or 3%, so the figures must be incorrect. The only part I'm sure of is taking [Placement_Length_in_Days] when the placement starts and finishes within the time period. The calculations here are wrong, I'm sure of that.
To give you a quick shot, you might try this:
DECLARE @tbl_Placements TABLE
(
[Provider Name] VARCHAR(100),
[Name of Client] VARCHAR(100),
[Vacancy Filled Date] DATE,
[Vacancy End Date] DATE
);
INSERT INTO @tbl_Placements
VALUES ('Accommodation1', 'John Smith', '2018-08-04', '2018-08-12'),
('Accommodation1', 'Jane Smith ', '2019-01-28', '2019-04-09');
SELECT
p.[Provider Name], p.[Name of Client],
DATEADD(DAY, A.Nmbr - 1, p.[Vacancy Filled Date]) AS OccupiedAt
FROM
@tbl_Placements p
CROSS APPLY
(SELECT TOP (DATEDIFF(DAY, p.[Vacancy Filled Date], p.[Vacancy End Date]) + 1)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM
master..spt_values) A(Nmbr);
The idea in short:
We use CROSS APPLY to create a joined set per row.
We use a computed TOP clause to get the right count of rows back
We create a numbers table on the fly, simply by querying any table with enough rows (here I took master..spt_values). We do not need the actual table's content, just a counter, which we get from ROW_NUMBER().
We return the set together with a running day starting with the first day of occupation and ending with the last day of occupation.
Hint: This would be much easier if you had an existing physical numbers/dates table in your database. You would simply inner join that table with a BETWEEN in the ON clause.
You might read this.
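Building on that per-day expansion, the aggregation into the occupancy formula from the question could look roughly like the sketch below. It reuses the sample table and column names above (tbl_Month_Year, tbl_Service_Capacity) and is an assumption-laden illustration, not part of the original answer:
;WITH OccupiedDays AS
(
    SELECT
        p.[Provider Name],
        DATEADD(DAY, A.Nmbr - 1, p.[Vacancy Filled Date]) AS OccupiedAt
    FROM @tbl_Placements p
    CROSS APPLY
        (SELECT TOP (DATEDIFF(DAY, p.[Vacancy Filled Date], p.[Vacancy End Date]) + 1)
                ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
         FROM master..spt_values) A(Nmbr)
)
SELECT
    m.Month_Year,
    d.[Provider Name],
    COUNT(*) AS PlacementDays,
    -- Occupancy = placement days / (capacity x actual days in the month) x 100
    100.0 * COUNT(*)
        / (c.[Service Capacity] * DATEDIFF(DAY, m.Month_Year, DATEADD(MONTH, 1, m.Month_Year))) AS OccupancyRate
FROM OccupiedDays d
JOIN tbl_Month_Year m
    ON d.OccupiedAt >= m.Month_Year
   AND d.OccupiedAt < DATEADD(MONTH, 1, m.Month_Year)
JOIN tbl_Service_Capacity c
    ON c.[Provider Name] = d.[Provider Name]
GROUP BY m.Month_Year, d.[Provider Name], c.[Service Capacity];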

Count Until A Specific Value?

Say you've got a table, ordered by date, that captures the speed of vehicles that have a device in them. Imagine you get around 30 speed updates per day, though it's not always 30 per vehicle. The data has the vehicle, the timestamp, and the speed.
What I want to do is count how many days have passed since each vehicle last went over 10 mph, in order to find inactive vehicles. Is something like that possible in PostgreSQL?
Or is there a way to get back the row number of the table, if it's sorted, where the speed last went past 10, and then select the date in that row to subtract from the current date?
SELECT DISTINCT ON (vessel) vessel, now() - date
FROM your_table
WHERE speed > 10
ORDER BY vessel, date DESC
This will tell you, for every vehicle, how long ago its speed field was last over 10.
SELECT vessel, now() - max(date)
FROM your_table
WHERE speed > 10
GROUP BY vessel;

kdb/q building NBBO from TAQ data

I have a table with bid/ask for for every stock/venue. Something like:
taq:`time xasc ([] time:10:00:00+(100?1000);bid:30+(100?20)%30;ask:30.8+(100?20)%30;stock:100?`STOCK1`STOCK2;exchange:100?`NYSE`NASDAQ)
How can I get the best bid/offer across all exchanges as of a time (in one-minute buckets) for every stock?
My initial thought is to build a table that has a row for every minute/exchange/stock and do an asof join on the taq data. However, that sounds like a brute-force solution - since this is a solved problem, I figured I'd ask if there is a better way.
select max bid, min ask by stock,1+minute from 0!select by 1 xbar time.minute,stock,exchange from taq
This will give you the max bid, min ask across exchanges at the 1-minute interval in column minute.
The only tricky thing is the select by 1 xbar time.minute. When you do select by with no aggregation, it will just return the last row per group. So really what this means is select last time, last bid, last ask ... by 1 xbar time.minute etc.
So after we get the last values by minute and exchange, we just get the min/max across exchanges for that minute.

Web analytics schema with postgres

I am building a web analytics tool and use PostgreSQL as the database. I will not insert a row into Postgres for each user visit, but only aggregated data every 5 seconds:
time country browser num_visits
========================================
0 USA Chrome 12
0 USA IE 7
5 France IE 5
As you can see, every 5 seconds I insert multiple rows (one per combination of dimensions).
In order to reduce the number of rows that need to be scanned by queries, I am thinking of having multiple tables with the above schema based on their resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when the user asks about the last day, I will go to the hour-resolution table, which is smaller than the 5-second-resolution table (although I could have used that one too - it's just more rows to scan).
Now what if the hour-resolution table has data on hours 0,1,2,3,... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on. As far as I understand, I have traded one query against a huge table (with many relevant rows to scan) for multiple queries against medium tables, plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?
Now what if the hour-resolution table has data on hours 0,1,2,3,... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
You could use inheritance/partitioning: one master table per resolution and many hourly-resolution child tables (and, perhaps, minute- and second-resolution child tables).
Thus you only have to select from the master table, and let the constraints on the child tables decide which is which.
Of course you have to add a trigger function to route inserts into the appropriate child tables.
It's a trade-off: complexity at insert time versus complexity at display time.
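A minimal sketch of that inheritance-based layout, assuming a 5-second-resolution fact table; the table, column, and partition names are illustrative, not from the original posts:
-- Parent ("master") table that queries are run against.
CREATE TABLE visits_5s (
    bucket_time timestamptz NOT NULL,
    country     text,
    browser     text,
    num_visits  integer
);

-- One child per day; the CHECK constraint lets the planner skip irrelevant children.
CREATE TABLE visits_5s_20240101 (
    CHECK (bucket_time >= '2024-01-01' AND bucket_time < '2024-01-02')
) INHERITS (visits_5s);

-- Route inserts on the parent to the right child.
CREATE OR REPLACE FUNCTION visits_5s_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.bucket_time >= '2024-01-01' AND NEW.bucket_time < '2024-01-02' THEN
        INSERT INTO visits_5s_20240101 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for %', NEW.bucket_time;
    END IF;
    RETURN NULL;  -- the row has already been stored in the child
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER visits_5s_route
BEFORE INSERT ON visits_5s
FOR EACH ROW EXECUTE PROCEDURE visits_5s_insert();

-- Queries hit the parent; children are pruned by their CHECK constraints.
SELECT country, sum(num_visits)
FROM visits_5s
WHERE bucket_time >= '2024-01-01 01:00' AND bucket_time < '2024-01-01 02:00'
GROUP BY country;
On modern PostgreSQL, declarative partitioning (PARTITION BY RANGE) replaces the trigger, but the idea is the same.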