Finding average and moving average at any level - tsql

I have a table with 3 columns:
Product_ID
Scheduled_date_time
Arrival_date_time
Write a query to get the total number of delayed products, the average delay in minutes, and the moving average, all aggregated at week level, using Scheduled_date_time as the reference for the aggregation.
A delay is when a product arrives after its Scheduled_date_time.
I used this query to get total and average at week level:
select week_num,
sum(Isdelayed),
avg(Timediff_Inmins)
from
(
select product_id,
Scheduled_date_time,
Arrival_date_time,
case when Scheduled_date_time<Arrival_date_time then 1 else 0 end Isdelayed,
datediff(MINUTE,Scheduled_date_time,Arrival_date_time) as Timediff_Inmins,
datepart(week,Scheduled_date_time) as week_num
from products
) x
group by x.week_num
--Query below gives me moving average at week level of aggregation
select x.week_num,AVG(Timediff_Inmins) OVER (ORDER BY week_num ASC ROWS 6 PRECEDING) as Moving_Average_PerWeek
from
(
select product_id,
Scheduled_date_time,
Arrival_date_time,
case when Scheduled_date_time<Arrival_date_time then 1 else 0 end Isdelayed,
datediff(MINUTE,Scheduled_date_time,Arrival_date_time) as Timediff_Inmins,
datepart(week,Scheduled_date_time) as week_num
from products
) x
group by x.week_num,x.Timediff_Inmins
I'm not sure if I'm going in the right direction as I do not have sample data.
I came across this question somewhere.
Getting the total and average at week level isn't an issue for me, but getting the total, average, and moving average all aggregated at week level, using Scheduled_date_time as the reference for the aggregation, is the real challenge. The total and average can be put in one query grouped by week_num, but I can't put the moving average in the same query, so I end up with two queries, and again I'm not sure they are exactly right.
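For reference, one sketch of combining all three in a single query (not tested against real data; the aliases Total_Delayed and Avg_Delay_Inmins are made up here, and the 7-week frame mirrors the ROWS 6 PRECEDING window above) is to aggregate to week level in a derived table first and then apply the window function over the weekly rows:
select week_num,
Total_Delayed,
Avg_Delay_Inmins,
-- moving average computed over the already-aggregated weekly rows
avg(Avg_Delay_Inmins) over (order by week_num asc rows 6 preceding) as Moving_Average_PerWeek
from
(
select datepart(week, Scheduled_date_time) as week_num,
sum(case when Scheduled_date_time < Arrival_date_time then 1 else 0 end) as Total_Delayed,
avg(datediff(minute, Scheduled_date_time, Arrival_date_time)) as Avg_Delay_Inmins
from products
group by datepart(week, Scheduled_date_time)
) x
order by week_num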

Related

How to get the top 99% values in postgresql?

Seemingly similar to How to get the top 10 values in postgresql?, yet somehow very different.
We'll set it up similar to that question:
I have a postgresql table: Scores(score integer).
How would I get the highest 99% of the scores? We can't assume we know beforehand how many rows there are, so we can't use the trick of limiting to a fixed integer. SQL Server has an easy SELECT TOP syntax -- is there anything similarly simple in the postgresql world?
This should be doable with percent_rank()
select score
from (
select score, percent_rank() over (order by score desc) as pct_rank
from scores
) t
where pct_rank <= 0.99
You can use the ntile function to partition the rows into percentiles and then select the rows where tile > 99.
Example:
-- following query generates 1000 rows with random
-- scores and selects the 99th percentile using the ntile function.
-- because the chance of the same random value appearing twice is extremely
-- small, the result should in most cases yield 10 rows.
with scores as (
select
id
, random() score
from generate_series(1, 1000) id
)
, percentiles AS (
select
*
, ntile(100) over (order by score) tile
from scores
)
select
id
, score
from percentiles
where tile > 99

How to aggregate table calculation in tableau

This is my workbook.
In that workbook, I calculate the time difference between each transaction for each user. The first thing I built is the PUL filter with this calculation:
{Fixed [User Id]: sum(
if [Created At]<=[END_DATE] then 1 else 0 end)}>=2
AND
{FIXED [User Id]: sum(
IF [Created At]<=[END_DATE] AND
[Created At] >= [START_DATE] THEN 1 ELSE 0 END)}>=1
This formula finds the users who match the conditions (made at least 2 transactions before the END_DATE parameter, and at least 1 transaction between the START_DATE and END_DATE parameters). I add this filter to context so those users are found first.
I also made a date_range filter with this calculation:
lookup(min(([Created At])),0) >= [START_DATE] and
lookup(min(([Created At])),0) <= [END_DATE]
This visualizes only the transactions in the range (START_DATE as the first boundary, END_DATE as the last) and also the last transaction date before the start of the range (if any).
After that I made a calculation called datediff:
DATEDIFF('day',LOOKUP(MIN([Created At]),-1), MIN([Created At]))
I put it on the Label shelf so it shows the difference in days, and I also put the date on Detail and in Rows as ATTR.
My question is: how do I find the max, min, median, and average values of this calculation?
I tried a calculated max:
MAX({FIXED [User Id]:DATEDIFF('day',INT(LOOKUP(MIN([Created At]),-1)), INT(MIN([Created At])))})
but it returns an error: DATEDIFF being called with (string, integer, integer).
For max and min you can proceed as in the solution I presented on your previous question (for max, create a rank calculation sorted descending; for min, create a second rank calculation sorted ascending).
However, as far as my knowledge of table calculations in Tableau goes, Tableau doesn't let you hard-code these table-calculated fields, and therefore you cannot:
further aggregate these results
perform LOD calculations on them
For calculations such as average and median, it is advised that you create a hard-coded column/field which gives the time difference between an order and that customer's previous order. You can do this in any programming language of your choice, such as R or Python (or others).
Moreover, Tableau's integration with R/Python is through SCRIPT_REAL-type functions, which are again table calculations, so the above restrictions still apply.
Good Luck.
EDIT: as Alex Blakemore suggested on a different question/answer, you can use window functions with a slight tweak. Let's assume your calculated field for the datediff is named [CF]; then create four calculated fields with the following calculations:
window_max([CF])
window_min([CF])
window_avg([CF])
window_median([CF])
and name them [CF Max], [CF Min], [CF Avg], [CF Median] respectively.
Now edit the table calculation, with nesting, in each of these four, as follows:
Click the nested calculations drop-down arrow. CF will be listed there. Change its calculation to Specific Dimensions, at the deepest level, restarting every User Id. The screenshot is:
Thereafter, click the nested calculations drop-down arrow again, select CF Max/Min/Med/Avg (as the case may be), and compute the table calculation using Table (down).
You'll get a view like this, as desired.

Execute IF-THEN-ELSE before Execute Calculation - Tableau

I have a graph below.
I would like to calculate the lapsed rate, which is the sum of the lapsed value divided by the sum of the inforce value. I use the formula below in a calculated field:
abs(sum(if [Status]='lapsed' then [TotalAmount] end)) / abs(sum(if [Status]='inforce' then [TotalAmount] end))
However, that formula also picks up the value from Q2 (quarter 2) 2016. What I want is to tell Tableau to first check whether a quarter does not contain both an inforce value and a lapsed value, and if so, skip that quarter. In this case I need to calculate a lapsed rate which does not include Q2 2016. How do I do this?
I'm using Tableau v.10.
Thanks.
This is just a quick approach and may not be the most efficient, but it seems to work.
You need to use a combination of a row-level calculation and a level of detail calculation.
The level of detail calculation can be used to flag the quarters which have both a Lapsed and an Inforce status. Once those quarters are flagged, you can calculate the lapsed rate at row level, which can then be rolled up using a sum.
Create a calculated field as follows:
if
avg(
// Calculate the number of Inforce/Lapsed occurences per Quarter
IF
[Status] = 'Inforce'
or
[Status] = 'Lapsed'
then
{ FIXED
DATEPART('quarter', [Date]):
countd([Status])
}
else
0
end)
//
= 2
then
// Calculate the Lapsed Rate as both statuses exist in the quarter
sum((if
[Status] = 'Lapsed'
then [Total Amount]
END))
/
sum([Total Amount])
END
Let me know if you have any questions or need any tweaks.

Count number of points within certain distance ranges from another set of points

I have the following, which gives me the number of customers within 10,000 meters of any store location:
SELECT COUNT(*) as customer_count FROM customer_table c
WHERE EXISTS(
SELECT 1 FROM locations_table s
WHERE ST_Distance_Sphere(s.the_geom, c.the_geom) < 10000
)
What I need is for this query to return not only the number of customers within 10,000 meters, but also the following. The number of customers within...
10,000 meters
more than 10,000, but less than 50,000
more than 50,000, but less than 100,000
more than 100,000
...of any location.
I'm open to this working a couple of ways. For a given customer, only count them one time (at the shortest distance to any store), which would count everyone exactly once. I realize this is probably pretty complex. I'm also open to having people counted multiple times, which really gives the more accurate values anyway, and I think should be much simpler.
Thanks for any direction.
You can do both types of queries relatively easily. But an issue here is that you do not know which customers are associated with which store locations, which seems like an interesting thing to know. If you want that, use the PK and store_name of the locations_table in the query. See both options with location id and store_name below. To emphasize the difference between the two options:
The first option indicates, for every store location, how many customers are within each distance class, counting a customer in every class whose threshold they fall under.
The second option indicates, for every store location, how many customers are in each distance class, counting each customer only in the smallest (nearest) class they fall into for that store.
Both are queries of O(n x m) running order (implemented with the CROSS JOIN between customer_table and locations_table) and likely to become rather slow with increasing numbers of rows in either table.
Count customers in all distance classes
Compute the distances of customers from store locations with a CROSS JOIN, then group them by store location id, name, and the maximum-distance classes that you define. You can create a "table" from your distance classes with the VALUES command, which you can then simply use in any query:
SELECT loc_dist.id, loc_dist.store_name, grps.grp, count(*)
FROM (
SELECT s.id, s.store_name, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
FROM customer_table c, locations_table s) AS loc_dist
JOIN (
VALUES(1, 10000.), (2, 50000.), (3, 100000.), (4, 1000000.)
) AS grps(grp, dist) ON loc_dist.dist < grps.dist
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
Count customers in the nearest distance class
If you want customers listed in the nearest distance class only, then you should make the same CROSS JOIN on customer_table and locations_table as in the previous case, but then simply assign each customer to its lowest matching distance class using a CASE expression in the query and GROUP BY store location id, name and distance class as before:
SELECT
id, store_name,
CASE
WHEN dist < 10000. THEN 1
WHEN dist < 50000. THEN 2
WHEN dist < 100000. THEN 3
ELSE 4
END AS grp,
count(*)
FROM (
SELECT s.id, s.store_name, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
FROM customer_table c, locations_table s) AS loc_dist
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
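If you would rather count each customer exactly once, at their distance to the nearest store (the first option mentioned in the question), one possible sketch is to take the per-customer minimum distance first and classify that. This assumes customer_table has an id primary key, which is not shown in the question:
SELECT
CASE
WHEN min_dist < 10000. THEN 1
WHEN min_dist < 50000. THEN 2
WHEN min_dist < 100000. THEN 3
ELSE 4
END AS grp,
count(*) AS customer_count
FROM (
-- distance from each customer to their nearest store
SELECT c.id, min(ST_Distance_Sphere(s.the_geom, c.the_geom)) AS min_dist
FROM customer_table c, locations_table s
GROUP BY c.id) AS nearest
GROUP BY 1
ORDER BY 1;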

how to calculate percentile in postgres

I have a table called timings where we are storing 1 million response timings for load testing. Now we need to divide this data into 100 groups, i.e. the first 500 records as one group and so on, and calculate the percentile of each group, rather than the average.
So far I have tried this query:
Select quartile
, avg(data)
, max(data)
FROM (
SELECT data
, ntile(500) over (order by data) as quartile
FROM data
) x
GROUP BY quartile
ORDER BY quartile
But how do I find the percentile?
Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:
=# select v,
ntile(100) OVER (ORDER BY v),
cume_dist() OVER (ORDER BY v)
FROM (VALUES (1), (2), (4), (4)) x(v);
v | ntile | cume_dist
---+-------+-----------
1 | 1 | 0.25
2 | 2 | 0.5
4 | 3 | 1
4 | 4 | 1
You can see that ntile only uses the first 4 out of 100 buckets, where cume_dist always gives you a number from 0 to 1. So if you want to find out the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.
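For example, a sketch of that idea against the table from the question (assuming the response times are stored in a column called data in a table called timings; adjust the names to match your schema):
-- smallest value whose cumulative distribution reaches 0.99, i.e. the 99th percentile
SELECT min(data) AS percentile_99
FROM (
SELECT data, cume_dist() OVER (ORDER BY data) AS cd
FROM timings
) t
WHERE cd >= 0.99;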
If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.
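For instance, something along these lines (again assuming a timings table with a data column):
-- percentile_disc returns an actual value from the data set;
-- percentile_cont interpolates between the two nearest values
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY data) AS p99_disc,
percentile_cont(0.99) WITHIN GROUP (ORDER BY data) AS p99_cont
FROM timings;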
Edit:
Please note that since I originally answered this question, Postgres has gotten additional aggregate functions to help with this. See percentile_disc and percentile_cont here. These were introduced in 9.4.
Original Answer:
ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).
ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.
For percentile, you would set the number of buckets to be 100.
I'm not sure where the 500 comes in here... if you want to determine which percentile your data is in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and the groups would have more than 500 entries.
If you don't care about avg or max, you can drop a bunch from your query. It would look something like this:
SELECT data, ntile(100) over (order by data) AS percentile
FROM data
ORDER BY data