SQL Server: indexes and null parameters - tsql

I have the stored procedure below, which runs against 2,304,697 records:
@startdate DATETIME = NULL,
@enddate DATETIME = NULL,
@drilldown VARCHAR(20) = NULL
AS
BEGIN
SELECT
DATENAME(YEAR, ReceivingTime) as Year,
MAX(DATENAME(MONTH, ReceivingTime)) AS Month,
ProductionLocation,
CAST(COUNT(*) * 100.0 / SUM(COUNT(*) * 100) OVER (PARTITION BY DATENAME(YEAR, ReceivingTime), MONTH(ReceivingTime)) AS DECIMAL(10,2)) AS TotalsByMonth,
CAST(COUNT(*) * 100.0 / SUM(COUNT(*) * 100) OVER (PARTITION BY DATENAME(YEAR, ReceivingTime)) AS DECIMAL(10, 2)) AS TotalsByYear
FROM
Jobs_analytics
WHERE
ProductionLocation IS NOT NULL
AND ((ReceivingTime BETWEEN dbo.cleanStartDate(@startdate) AND dbo.cleanEndDate(@enddate))
AND @startdate IS NULL)
OR ((YEAR(ReceivingTime) = @drilldown) AND @drilldown IS NULL)
GROUP BY
DATENAME(YEAR, ReceivingTime),
DATEPART(MONTH, ReceivingTime), ProductionLocation
ORDER BY
DATENAME(YEAR, ReceivingTime),
DATEPART(MONTH, ReceivingTime)
The query works well in that it returns a data set in about 8 seconds, but I'd like to improve the speed, so I added the index below:
CREATE INDEX RecDateTime
ON Jobs_analytics(RecDateTime, ProductionLocation)
go
However, that really didn't improve anything. So I looked at the execution plan and noticed that my index is being used; its cost was 35% and my sort was at 6%.
So I reworked my where clause from this:
WHERE ProductionLocation IS NOT NULL AND
((ReceivingTime BETWEEN dbo.cleanStartDate(@startdate) AND dbo.cleanEndDate(@enddate)) AND @drilldown IS NULL)
OR ((YEAR(ReceivingTime) = @drilldown) AND @startdate IS NULL)
to this:
WHERE ProductionLocation IS NOT NULL AND
ReceivingTime BETWEEN dbo.cleanStartDate('2018-07-01') and dbo.cleanEndDate('2019-08-25')
and I got the query to run in a second. As you can see, there is no more filter operator, and the cost on the clustered index is at 3% (something I had not realized).
The NULL parameter checks are for a report that will sometimes have null values set, so that I don't have to maintain two stored procedures. I could write a second stored procedure and just remove the WHERE clause items, but I'd rather not. Is there any index or change to my query that anyone could suggest that might help?
Thanks
Mike

Okay, if anyone comes across this, this is what I found out:
I was testing the wrong parameter for null values within the OR clause.
I have a function that adds the hh:mm:ss to a date, and that was also causing me problems.
I fixed both of those items and the query now runs in about a second.
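For anyone with the same optional-parameter pattern: a common way to keep a single procedure is the "catch-all" WHERE clause plus OPTION (RECOMPILE), which lets SQL Server build a plan for the actual NULL-ness of the parameters on each call. A sketch using the names from the question (the dbo.clean* functions are the OP's own, the select list is trimmed, and the branch pairing follows the reworked WHERE clause above):
DECLARE @s DATETIME = dbo.cleanStartDate(@startdate);  -- run the cleanup UDFs once
DECLARE @e DATETIME = dbo.cleanEndDate(@enddate);

SELECT DATENAME(YEAR, ReceivingTime) AS [Year],
       MAX(DATENAME(MONTH, ReceivingTime)) AS [Month],
       ProductionLocation,
       COUNT(*) AS Jobs
FROM Jobs_analytics
WHERE ProductionLocation IS NOT NULL
  AND (   (@drilldown IS NULL AND ReceivingTime BETWEEN @s AND @e)    -- date-range branch
       OR (@startdate IS NULL AND YEAR(ReceivingTime) = @drilldown)) -- drilldown branch
GROUP BY DATENAME(YEAR, ReceivingTime), DATEPART(MONTH, ReceivingTime), ProductionLocation
OPTION (RECOMPILE); -- fresh plan per call, so the unused branch is pruned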

Related

How to get timestamp associated with percentile(x) value using timescale db time_bucket

I need to find the percentile(50) value and its timestamp using the TimescaleDB time_bucket. Finding P50 is easy, but I don't know how to get the timestamp.
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
I think what you're looking for can be done by selecting, in a LATERAL, the rows where int_val is equal to the median value (percentile_disc does ensure that there is a value exactly equal to the result; there may be more than one, depending on what you want there, and you could deal with the more-than-one case in different ways). Building on a previous answer and making it work a bit better, I think it would look something like this:
WITH p50 AS (
Select time_bucket('120 sec',timestamp_utc) as interval_size,
first(timestamp_utc,int_val) as minTime,
min(int_val) as minVal,
last(timestamp_utc,int_val) as maxTime,
max(int_val) as maxVal,
-- timestamp of percentile value below.
percentile_disc(0.5) within group (order by int_val) as medianVal
from timeseries.raw
where timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
group by interval_size
order by interval_size desc
) SELECT p50.*, rmed.*
FROM p50, LATERAL (SELECT * FROM timeseries.raw r
-- copy over the same where clause from above so we're dealing with the same subset of data
WHERE timestamp_utc > NOW() - INTERVAL '10 min'
AND tag_id = 59560544877390423
-- add a where clause on the median value
AND r.int_val = p50.medianVal
-- now add a where clause to account for the time bucket
AND r.timestamp_utc >= p50.interval_size
AND r.timestamp_utc < p50.interval_size + '120 sec'::interval
-- Can add an order by something desc limit 1 if you want to avoid ties
) rmed;
Note that this will do a second scan of the table. It should be reasonably efficient, especially if you have an index on that column, but it will cause another scan; there isn't a great way that I know of to do it without a second scan.
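If that second scan shows up as a cost, a btree index matching the lateral's WHERE clause should help. A sketch (the index name is made up, and this assumes you are free to add an index to the hypertable):
-- equality on tag_id and int_val, then a range on timestamp_utc
CREATE INDEX raw_tag_val_time_idx
    ON timeseries.raw (tag_id, int_val, timestamp_utc);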

Using min/max values from a CTE in a later query, instead of using a subquery in Postgres

I've got a remedial question about pulling results out of a CTE in a later part of the query. For the example code, below are the relevant, stripped down tables:
CREATE TABLE print_job (
created_dts timestamp not null default now(),
status text not null
);
CREATE TABLE calendar_day (
date_actual date not null
);
In the current setup, there are gaps in the dates in the print_job data, and we would like to have a gapless result. For example, there are 87 days from the first to last date in the table, and only 77 days in there have data. We've already got a calendar_day dimension table to join with to get the 87 rows for the 87-day range. It's easy enough to figure out the min and max dates in the data with a subquery or in a CTE, but I don't know how to use those values from a CTE. I've got a full query below, but here are the relevant fragments with comments:
-- Get the date range from the data.
date_range AS (
select min(created_dts::date) AS start_date,
max(created_dts::date) AS end_date
from print_job),
-- This CTE does not work because it doesn't know what date_range is.
complete_date_series_using_cte AS (
select date_actual
from calendar_day
where date_actual >= date_range.start_date
and date_actual <= date_range.end_date
),
-- Subqueries are fine, because the FROM is specified in the subquery condition directly.
complete_date_series_using_subquery AS (
select date_actual
from calendar_day
where date_actual >= (select min(created_dts::date) from print_job)
and date_actual <= (select max(created_dts::date) from print_job)
)
I run into this regularly, and finally figured I'd ask. I've hunted around for an answer already, but I'm not clear how to summarize it well. And while there's nothing wrong with the subqueries in this case, I've got other situations where a CTE is nicer/more readable.
If it helps, I've listed the complete query below.
-- Get some counts and give them names.
WITH
daily_status AS (
select created_dts::date as created_date,
count(*) AS daily_total,
count(*) FILTER (where status = 'Error') AS status_error,
count(*) FILTER (where status = 'Processing') AS status_processing,
count(*) FILTER (where status = 'Aborted') AS status_aborted,
count(*) FILTER (where status = 'Done') AS status_done
from print_job
group by created_dts::date
),
-- Get the date range from the data.
date_range AS (
select min(created_dts::date) AS start_date,
max(created_dts::date) AS end_date
from print_job),
-- There are gaps in the data, and we want a row for dates with no results.
-- Could use generate_series on a timestamp & convert that to dates. But,
-- in our case, we've already got dimension tables for days. All that's needed
-- here is the actual date.
-- This CTE does not work because it doesn't know what date_range is.
-- complete_date_series_using_cte AS (
--     select date_actual
--     from calendar_day
--     where date_actual >= date_range.start_date
--     and date_actual <= date_range.end_date
-- ),
complete_date_series_using_subquery AS (
select date_actual
from calendar_day
where date_actual >= (select min(created_dts::date) from print_job)
and date_actual <= (select max(created_dts::date) from print_job)
)
-- The final query joins the complete date series with whatever daily summaries exist in the print_job data.
select date_actual,
coalesce(daily_total,0) AS total,
coalesce(status_error,0) AS errors,
coalesce(status_processing,0) AS processing,
coalesce(status_aborted,0) AS aborted,
coalesce(status_done,0) AS done
from complete_date_series_using_subquery
left join daily_status
on daily_status.created_date =
complete_date_series_using_subquery.date_actual
order by date_actual
I said it was a remedial question... I remembered where I'd seen this done before:
https://tapoueh.org/manual-post/2014/02/postgresql-histogram/
In my example, I need to list the CTE in the FROM list. That's obvious in retrospect, and I realize that I automatically don't think to do that because I habitually avoid CROSS JOIN. The fragment below shows the slight change needed:
WITH
date_range AS (
select min(created_dts)::date as start_date,
max(created_dts)::date as end_date
from print_job
),
complete_date_series AS (
select date_actual
from calendar_day, date_range
where date_actual >= date_range.start_date
and date_actual <= date_range.end_date
),
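For what it's worth, the comma in from calendar_day, date_range is an implicit CROSS JOIN, and since date_range is always a single row it cannot multiply the calendar rows. Spelling it out gives the equivalent:
complete_date_series AS (
    select date_actual
    from calendar_day
    cross join date_range
    where date_actual >= date_range.start_date
    and date_actual <= date_range.end_date
),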

postgresql find preceding and following timestamp to arbitrary timestamp

Given an arbitrary timestamp such as 2014-06-01 12:04:55-04, I can find the timestamps just before and just after it in sometable, and then calculate the elapsed number of seconds between those two with the following query:
SELECT EXTRACT (EPOCH FROM (
(SELECT time AS t0
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1) -
(SELECT time AS t1
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1)
)) as elapsedNegative;
It works, but I was wondering if there is a more elegant or astute way to achieve the same result? I am using 9.3. Here is a toy database.
CREATE TABLE sometable (
id serial,
time timestamp
);
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 11:59:37-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:02:22-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:04:49-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:07:35-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:09:53-04');
Thanks for any tips...
Update: Thanks to both Joe Love and Clément Prévost for interesting alternatives. Learned a lot on the way!
Your original query can't get much more efficient. Given that the sometable.time column is indexed, your execution plan should show only 2 index scans, which is very efficient (index-only scans if you have pg 9.2 and above).
Here is a more readable way to write it
WITH previous_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1
),
next_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1
)
SELECT EXTRACT (EPOCH FROM (
(SELECT * FROM next_timestamp)
- (SELECT * FROM previous_timestamp)
))as elapsedNegative;
Using a CTE allows you to give meaning to a subquery by naming it. Explicit naming is a well-known and recognised coding best practice (use explicit names, don't abbreviate, and don't use overly generic names like "data" or "value").
Be warned that CTEs are optimisation "fences" and sometimes get in the way of planner optimisation.
Here is the SQLFiddle.
Edit: Moved the extract from the CTE to the final query so that PostgreSQL can use an index-only scan.
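One more note: the toy schema in the question doesn't declare an index on time, so to actually get the two index(-only) scans, something like this is needed first (the index name is my own):
CREATE INDEX sometable_time_idx ON sometable (time);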
This solution will likely perform better if the timestamp column does not have an index. When 9.4 comes out we will be able to make it a little shorter by using aggregate filters.
It should be a bit faster because it runs 1 full table scan instead of 2; however, it may perform worse if your timestamp column is indexed and you have a large dataset.
Here's the example without the epoch conversion, to make it easier to read.
select
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
),
max(
case when t1.start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)
from my_table as t1
And here's the example including the math and epoch extraction:
select
extract (EPOCH FROM (
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
)-
max(
case when start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)))
from my_table
Please let me know if you need further details-- I'd recommend trying my code vs yours and seeing how it performs.
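For that comparison, EXPLAIN ANALYZE on each candidate query shows the actual plan and timing. For example, a sketch of the single-scan approach rewritten against the toy sometable from the question:
EXPLAIN ANALYZE
SELECT extract(EPOCH FROM (
         min(CASE WHEN time > '2014-06-01 12:04:55-04' THEN time
                  ELSE 'infinity'::timestamp END)
       - max(CASE WHEN time < '2014-06-01 12:04:55-04' THEN time
                  ELSE '-infinity'::timestamp END)))
FROM sometable;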

TSQL - Control a number sequence

I'm new to TSQL.
I have a table with a field called ODOMETER for a vehicle. I have to get the quantity of km in a period of time, from the 1st of the month to the end.
SELECT MAX(Odometer) - MIN(Odometer) as TotalKm FROM Table
This would work in an ideal test scenario, but the odometer can be reset to 0 at any time.
Can someone help me solve my problem? Thank you.
I'm working with MS SQL 2012
EXAMPLE of records:
Date           Odometer value
datetime var,  37210
datetime var,  37340
datetime var,  0
datetime var,  220
For these records the total should presumably be (37340 - 37210) + (220 - 0) = 350 km, i.e. the sum of the increases, since the distance driven between the last reading and the reset is unknown.
Try something like this using LAG. There are other ways, but this should be easy.
EDIT: Changed the sample data to include records outside of the desired month range, and simplified the Reading values for easy hand calculation. Also showing a second option, as suggested by the OP.
DECLARE @tbl TABLE (stamp DATETIME, Reading INT)
INSERT INTO @tbl VALUES
('02/28/2014',0)
,('03/01/2014',10)
,('03/10/2014',20)
,('03/22/2014',0)
,('03/30/2014',10)
,('03/31/2014',20)
,('04/01/2014',30)
--Original solution with WHERE on the "outer" SELECT.
--This gives a result of 40, as it includes the change of 10 between 2/28 and 3/01.
;WITH cte AS (
SELECT Reading
,LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) LastReading
,Reading - LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) ChangeSinceLastReading
,CONVERT(date, stamp) stamp
FROM @tbl
)
SELECT SUM(CASE WHEN Reading = 0 THEN 0 ELSE ChangeSinceLastReading END)
FROM cte
WHERE stamp BETWEEN '03/01/2014' AND '03/31/2014'
--Second option with WHERE on the "inner" SELECT (within the CTE).
--This gives a result of 30, as the change of 10 between 2/28 and 3/01 is excluded by the filtered LAG.
;WITH cte AS (
SELECT Reading
,LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) LastReading
,Reading - LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) ChangeSinceLastReading
,CONVERT(date, stamp) stamp
FROM @tbl
WHERE stamp BETWEEN '03/01/2014' AND '03/31/2014'
)
SELECT SUM(CASE WHEN Reading = 0 THEN 0 ELSE ChangeSinceLastReading END)
FROM cte
I think Karl's solution using LAG is better than mine, but anyway:
;WITH [Rows] AS
(
SELECT o1.stamp, o1.Reading AS CurrentValue,
(SELECT TOP 1 o2.Reading
FROM @tbl o2
WHERE o1.stamp < o2.stamp
ORDER BY o2.stamp ASC) AS NextValue -- TOP 1 needs an ORDER BY to reliably pick the *next* reading
FROM @tbl o1
)
SELECT SUM(CASE WHEN NextValue IS NULL OR NextValue < CurrentValue THEN 0 ELSE NextValue - CurrentValue END)
FROM [Rows]
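Note that this version has no date filter, so it sums over the whole sample table. To answer the original first-to-end-of-month question, the range can be applied before pairing the readings, mirroring Karl's "inner" filter option. A sketch reusing the @tbl sample data (it also yields 30 for March):
;WITH MarchRows AS
(
    SELECT stamp, Reading
    FROM @tbl
    WHERE stamp BETWEEN '03/01/2014' AND '03/31/2014'
),
[Rows] AS
(
    SELECT o1.Reading AS CurrentValue,
           (SELECT TOP 1 o2.Reading
            FROM MarchRows o2
            WHERE o1.stamp < o2.stamp
            ORDER BY o2.stamp ASC) AS NextValue
    FROM MarchRows o1
)
SELECT SUM(CASE WHEN NextValue IS NULL OR NextValue < CurrentValue
                THEN 0 ELSE NextValue - CurrentValue END) AS TotalKm
FROM [Rows];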

Stored proc date filtering causing no records to be returned

I have the following stored procedure:
ALTER PROCEDURE [dbo].[GetErrorsByDate]
(
@p_skip INT
,@p_take INT
,@p_orderBy VARCHAR(50) = 'TimestampDesc'
,@p_startDate SMALLDATETIME = NULL
,@p_endDate SMALLDATETIME = NULL
)
AS
BEGIN
WITH pathAuditErrorLogCT AS
(
SELECT
CASE
WHEN @p_orderBy = 'TimestampAsc' THEN ROW_NUMBER() OVER (ORDER BY E.[Timestamp])
WHEN @p_orderBy = 'TimestampDesc' THEN ROW_NUMBER() OVER (ORDER BY E.[Timestamp] DESC)
WHEN @p_orderBy = 'LogIdAsc' THEN ROW_NUMBER() OVER (ORDER BY E.LogId)
WHEN @p_orderBy = 'LogIdDesc' THEN ROW_NUMBER() OVER (ORDER BY E.LogId DESC)
WHEN @p_orderBy = 'ReferrerUrlAsc' THEN ROW_NUMBER() OVER (ORDER BY E.ReferrerUrl)
WHEN @p_orderBy = 'ReferrerUrlDesc' THEN ROW_NUMBER() OVER (ORDER BY E.ReferrerUrl DESC)
END AS RowNum
,E.Id
FROM pathAuditErrorLog AS E
WHERE
(E.[Timestamp] >= @p_startDate OR @p_startDate IS NULL)
AND
(E.[Timestamp] <= @p_endDate OR @p_endDate IS NULL)
)
SELECT
E.Id
,E.Node
,E.HttpCode
,E.[Timestamp]
,E.[Version]
,E.LogID
,E.IsFrontEnd
,E.ReferrerUrl
,E.[Login]
,E.BrowserName
,E.BrowserVersion
,E.ErrorDetails
,E.ServerVariables
,E.StackTrace
FROM pathAuditErrorLog AS E
INNER JOIN pathAuditErrorLogCT AS pct ON pct.Id = E.Id
WHERE pct.RowNum BETWEEN @p_skip + 1 AND (@p_skip + @p_take)
ORDER BY RowNum
END
The idea is that the procedure returns data from an error table, but allows dynamic column ordering, paging and date filtering. My problem is the date filtering that's part of the common table expression. I'm having trouble getting this to work.
If I remove the date filtering logic then the proc works fine. With it included, I often get no rows returned even though rows were expected. For example, if I try:
exec GetErrorsByDate 0, 10, 'TimestampDesc', '2013-02-05'
Giving only the start date, I get a bunch of records back. However, if I do this:
exec GetErrorsByDate 0, 10, 'TimestampDesc', '2013-02-05', '2013-02-05'
Passing both a start and an end date, I end up with no records returned. This is odd, since I expected some of the records from the first query to appear in the second.
Can someone spot what I'm doing wrong please?
EDIT:
I've applied what AdaTheDev has suggested and it looks closer to what I need. However, I have found one case where the suggested method does not return what I expected. If I run the following:
exec GetErrorsByDate 0, 10, 'TimestampDesc', '2013-01-01', '2013-01-03'
I get one row whose timestamp is 2013-01-02 13:29:00. If I run this:
exec GetErrorsByDate 0, 10, 'TimestampDesc', '2013-01-01', '2013-01-02'
I get no rows returned. I was hoping the one row from the previous query would be returned, since its timestamp does fall on the 2nd of Jan 2013. Have I misunderstood something here?
That will only return records where the Timestamp is exactly 2013-02-05 00:00:00 (i.e. midnight), so records during that day will not be included (I'm assuming they all have times associated).
If you want to include them I would change the clause to:
WHERE
(e.[Timestamp] >= @p_startDate OR @p_startDate IS NULL)
AND
(e.[Timestamp] < @p_endDate OR @p_endDate IS NULL)
(the change is that the @p_endDate comparison is now just <)
And then do:
exec GetErrorsByDate 0, 10, 'TimestampDesc', '2013-02-05', '2013-02-06'
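Since EXEC parameters can't take expressions directly, the exclusive upper bound can also be computed into a variable first. For example, to get everything logged on the 5th:
DECLARE @startDate SMALLDATETIME = '2013-02-05';
DECLARE @endDate SMALLDATETIME = DATEADD(DAY, 1, @startDate); -- exclusive upper bound
EXEC GetErrorsByDate 0, 10, 'TimestampDesc', @startDate, @endDate;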