How to get an average that ignores outliers? - PostgreSQL

Say I have a PostgreSQL table with the following values:
id | value
----------
1 | 4
2 | 8
3 | 100
4 | 5
5 | 7
If I use PostgreSQL to calculate the average, it gives me 24.8, because the high value of 100 has a great impact on the calculation, while in fact I would like to find an average somewhere around 6 and eliminate the extreme(s).
I am looking for a way to eliminate extremes, and I want to do this in a "statistically correct" way. The cutoff for extremes cannot be fixed; I cannot say: if a value is over X, it has to be eliminated.
I have been poring over the PostgreSQL aggregate functions but cannot put my finger on which one is right for me to use. Any suggestions?

PostgreSQL can also calculate the standard deviation.
You could keep only the data points that fall within avg() +/- 2*stddev(), which for a normal distribution would cover roughly the 95% of data points closest to the average.
Of course the 2 can also be 3 (about 99.7%) or some other multiplier, but do not get hung up on the exact numbers, because in the presence of a collection of outliers you are no longer dealing with a normal distribution anyway.
Be very careful and validate that it works as expected.
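A minimal sketch of that idea, assuming the question's table is called mytable (note that with only five rows and one extreme value, you may need a multiplier smaller than 2 before the 100 actually gets dropped):
SELECT avg(t.value) AS trimmed_avg
FROM mytable t,
     (SELECT avg(value) AS mean, stddev(value) AS sd FROM mytable) s
WHERE t.value BETWEEN s.mean - 2 * s.sd
                  AND s.mean + 2 * s.sd;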

I cannot say: if a value is over X, it has to be eliminated.
Well, you could use having and a subselect to eliminate outliers, something like:
HAVING value < (
    SELECT 2 * avg(value)
    FROM mytable
    GROUP BY ...
)
(Or, for that matter, use a more complex version to eliminate anything above 2 or 3 standard deviations if you want something that will be better at eliminating only outliers.)
The other option is to look at generating a median value, which is a fairly statistically sound way of accounting for outliers; happily there are three reasonable examples of just that: one from the PostgreSQL wiki, one built as an Oracle compatibility layer, and another from the PostgreSQL Journal. Note the caveats around how precisely/accurately they implement medians.
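For reference, since PostgreSQL 9.4 a median can also be computed directly with the built-in ordered-set aggregate percentile_cont, without installing any of the linked implementations; for the sample data in the question it returns 7:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM mytable;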

Here's an aggregate function which will calculate the trimmed mean for a set of values, excluding values outside N standard deviations from the mean.
Example:
DROP TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo (x FLOAT);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);
INSERT INTO foo VALUES (3);
INSERT INTO foo VALUES (4);
INSERT INTO foo VALUES (100);
SELECT avg(x), tmean(x, 2), tmean(x, 1.5) FROM foo;
-- avg | tmean | tmean
-- -----+-------+-------
-- 22 | 22 | 2.5
Code:
DROP TYPE IF EXISTS tmean_stype CASCADE;
CREATE TYPE tmean_stype AS (
    deviations FLOAT,   -- N: how many standard deviations to keep
    count      INT,     -- number of values seen so far
    acc        FLOAT,   -- running sum of values
    acc2       FLOAT,   -- running sum of squared values
    vals       FLOAT[]  -- all values, needed for the second pass in the final function
);
-- State transition function: $1 is the running state, $2 the next value, $3 the N-deviations parameter
CREATE OR REPLACE FUNCTION tmean_sfunc(tmean_stype, float, float)
RETURNS tmean_stype AS $$
    SELECT $3, $1.count + 1, $1.acc + $2, $1.acc2 + ($2 * $2), array_append($1.vals, $2);
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION tmean_finalfunc(tmean_stype)
RETURNS float AS $$
DECLARE
    fcount INT;
    facc   FLOAT;
    mean   FLOAT;
    stddev FLOAT;
    lbound FLOAT;
    ubound FLOAT;
    val    FLOAT;
BEGIN
    -- mean and (population) standard deviation from the running sums
    mean   := $1.acc / $1.count;
    stddev := sqrt(($1.acc2 / $1.count) - (mean * mean));
    lbound := mean - stddev * $1.deviations;
    ubound := mean + stddev * $1.deviations;
    -- RAISE NOTICE 'mean: % stddev: % lbound: % ubound: %', mean, stddev, lbound, ubound;

    -- average only the values that fall inside [lbound, ubound]
    fcount := 0;
    facc   := 0;
    FOR i IN array_lower($1.vals, 1) .. array_upper($1.vals, 1) LOOP
        val := $1.vals[i];
        IF val >= lbound AND val <= ubound THEN
            fcount := fcount + 1;
            facc   := facc + val;
        END IF;
    END LOOP;

    IF fcount = 0 THEN
        RETURN NULL;
    END IF;
    RETURN facc / fcount;
END;
$$ LANGUAGE plpgsql;
CREATE AGGREGATE tmean(float, float)
(
SFUNC = tmean_sfunc,
STYPE = tmean_stype,
FINALFUNC = tmean_finalfunc,
INITCOND = '(-1, 0, 0, 0, {})'
);
Gist (which should be identical): https://gist.github.com/4458294

Consider using the ntile window function. It allows you to easily isolate extreme values from the result set.
Let's say you want to cut 10% from both sides of the result set. Then passing the value 10 to ntile and keeping only the rows whose tile is between 2 and 9 gives you the desired result. Also keep in mind that if you have fewer than 10 records, you might accidentally cut more than 20%, so be sure to check the total number of records as well.
WITH yyy AS (
    SELECT
        id,
        value,
        NTILE(10) OVER (ORDER BY value) AS ntiled,
        COUNT(*) OVER () AS counted
    FROM
        xxx)
SELECT
    *
FROM
    yyy
WHERE
    counted < 10 OR ntiled BETWEEN 2 AND 9;
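Since the goal is an average, the outer query can aggregate the surviving rows instead of listing them, e.g. with the same CTE as above:
SELECT avg(value) AS trimmed_avg
FROM yyy
WHERE counted < 10 OR ntiled BETWEEN 2 AND 9;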

You can use the IQR (interquartile range) to filter outliers. PL/pgSQL code (a fragment from inside a function body; q1, q3, iqr, min, max and result are declared variables, and my_table stands for your table):
select percentile_cont(0.25) WITHIN GROUP (ORDER BY value)
into q1
from my_table;

select percentile_cont(0.75) WITHIN GROUP (ORDER BY value)
into q3
from my_table;

iqr := q3 - q1;
min := q1 - 1.5 * iqr;
max := q3 + 1.5 * iqr;

select avg(value)
into result
from my_table
where value >= min and value <= max;

return result;
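The same IQR filter can also be written as a single statement rather than a PL/pgSQL fragment (again using mytable as a stand-in for your table); for the sample data in the question it returns 6:
WITH bounds AS (
    SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY value) AS q1,
           percentile_cont(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM mytable
)
SELECT avg(t.value) AS trimmed_avg
FROM mytable t, bounds b
WHERE t.value BETWEEN b.q1 - 1.5 * (b.q3 - b.q1)
                  AND b.q3 + 1.5 * (b.q3 - b.q1);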

Related

write a query to calculate cumulative performance based on daily percent change in postgresql?

I have daily changes in a table like below.
Table: performance
date       | percent_change
-----------+---------------
2022/12/01 | 2
2022/12/02 | -1
2022/12/03 | 3
I want to assume an initial value of 100 and show the cumulative value up to each date, like below.
Expected output:
date       | percent_change | cumulative value
-----------+----------------+-----------------
2022/12/01 | 2              | 102
2022/12/02 | -1             | 100.98
2022/12/03 | 3              | 104.0094
A product of values, like the one you want to make, is nothing more than EXP(SUM(LN(...))). It results in a slightly verbose query but does not require new functions to be coded and can be ported as is to other DBMS.
In your case, as long as none of your percentages is below -100%:
SELECT date,
       percent_change,
       100 * EXP(SUM(LN(1 + percent_change / 100)) OVER (ORDER BY Date)) AS cumulative_value
FROM T
The SUM(...) OVER (ORDER BY ...) is what makes it a cumulative sum.
If you need to account for percentages lower than -100%, you need a bit more complexity.
SELECT date,
       percent_change,
       100 * -1 ^ SUM(CASE WHEN percent_change < -100 THEN 1 ELSE 0 END) OVER (ORDER BY Date)
           * EXP(SUM(LN(ABS(1 + percent_change / 100))) OVER (ORDER BY Date))
           AS cumulative_value
FROM T
WHERE NOT EXISTS (SELECT FROM T T2 WHERE T2.percent_change = -100 AND T2.date <= T.date)
UNION ALL
SELECT Date, percent_change, 0
FROM T
WHERE EXISTS (SELECT FROM T T2 WHERE T2.percent_change = -100 AND T2.date <= T.date)
Explanation:
An ABS(...) has been added to account for the values not supported in the previous query. It effectively strips the sign of 1 + percentage_value / 100
In front of the EXP(SUM(LN(ABS(...)))), the -1 ^ SUM(...) is where the sign is put back into the calculation. Read it as: -1 to the power of how many times we encountered a negative value.
The WHERE EXISTS(...) / WHERE NOT EXISTS(...) part handles the special case of percentage_value = -100%. When we encounter -100, we cannot calculate the logarithm even with a call to ABS(...). However, this does not matter much, as the products you want to calculate are going to be 0 from that point onward.
Side note:
You can save yourself some of the complexity of the above queries by changing how you store the changes.
Storing 0.02 to represent 2% removes the multiplications/divisions by 100.
Storing 0.0198026272961797 (LN(1 + 0.02)) removes the need to call for a logarithm in your query.
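To illustrate that last point: if the table stored a hypothetical column log_change containing LN(1 + percent_change / 100), the cumulative value would reduce to a plain windowed sum:
SELECT date,
       100 * EXP(SUM(log_change) OVER (ORDER BY date)) AS cumulative_value
FROM performance;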
I assume that the date in the 3rd row is 2022/12/03. Otherwise you need to add an id or some other column to order percent changes that occurred on the same day.
Solution
To calculate value after percent_change, you need to multiply your current value by (100 + percent_change) / 100
For day n cumulative value is 100 multiplied by product of coefficients (100 + percent_change) / 100 up to day n.
In PostgreSQL "up to day n" can be implemented with window functions.
Since there is no built-in aggregate function for multiplication, let's create one.
CREATE AGGREGATE PRODUCT(DOUBLE PRECISION) (
    SFUNC = float8mul,
    STYPE = FLOAT8
);
The final query will look like this:
SELECT
    date,
    percent_change,
    100 * product((100 + percent_change)::float / 100) OVER (ORDER BY date) cumulative_value
FROM performance;

DB2 - use field as a labeled duration for time calculation

Given a table that looks like this:
id task scheduled_date reminder
-- ----------------------- ---------------- --------
1 mail january newsletter 2022-01-01 15 days
I had planned on executing a query to mimic date addition as in
SELECT TASK, SCHEDULED_DATE + 15 DAYS FROM ...
==> 2022-01-16
Unfortunately, using the REMINDER field gives an error:
SELECT TASK, (SCHEDULED_DATE + REMINDER) FROM ...
==>[Code: -182, SQL State: 42816] [SQL0182] A date, time, or timestamp expression not valid.
Is there any way to accomplish using the reminder field as a labeled duration? (I'm using IBMi DB2)
You'll need to convert the string "15 days" into an actual duration.
A date duration is a decimal(8,0) number representing YYYYMMDD.
So 15 days would be 00000015,
1 year would be 00010000,
and 1 year, 1 month, 1 day would be 00010101.
create table testdur (
    datedur decimal(8,0)
);
insert into testdur
values (15), (10000), (10101), (90), (300);

select current_date as curDate
     , dateDur
     , current_date + dateDur
from testdur;
Results
Here is an attempt to implement the interval function available in Db2 for LUW. It supports a string expression as a parameter, not just a string constant like the built-in one.
The result of this function can participate in whatever date arithmetic is allowed.
This works on Db2 for LUW v11.1+ and Db2 for IBM i v7.5+ at least.
create or replace function interval_d (p_interval varchar (100))
returns dec (8)
contains sql
deterministic
no external action
begin atomic
    declare v_sign dec (1) default 0;
    declare v_pattern varchar (100) default '([+-]? *[0-9]+) *(\w+)';
    declare v_y int default 0;
    declare v_m int default 0;
    declare v_d int default 0;
    declare v_occ int default 1;
    declare v_num int;
    declare v_kind varchar (10);

    l1: while 1=1 do
        -- unit (2nd capture group) of the v_occ-th "<number> <unit>" pair
        set v_kind = lower (regexp_substr (p_interval, v_pattern, 1, v_occ, '', 2));
        if v_kind is null then leave l1; end if;

        -- number (1st capture group) of the same pair
        set v_num = int (replace (regexp_substr (p_interval, v_pattern, 1, v_occ, '', 1), ' ', ''));

        if sign (v_num) * v_sign < 0 then
            signal sqlstate '75001' set message_text = 'Sign of all operands must be the same';
        end if;
        if v_sign = 0 then set v_sign = sign (v_num); end if;

        if v_kind in ('d', 'day', 'days')
            then set v_d = v_d + v_num;
        elseif v_kind in ('mon', 'mons', 'month', 'months')
            then set v_m = v_m + v_num;
        elseif v_kind in ('y', 'year', 'years')
            then set v_y = v_y + v_num;
        else
            signal sqlstate '75000' set message_text = 'wrong duration';
        end if;

        set v_occ = v_occ + 1;
    end while l1;

    -- carry excess days/months so the result fits the YYYYMMDD duration format
    if abs (v_d) > 99 then
        set v_m = v_m + v_d / 30, v_d = mod (v_d, 30);
    end if;
    if abs (v_m) > 99 then
        set v_y = v_y + v_m / 12, v_m = mod (v_m, 12);
    end if;

    return v_y * 10000 + v_m * 100 + v_d;
end
select interval_d (i) as d
from
(
    values
      ('4 years 2 months 3 days')
    , ('3 day 4 year 2 month')
    , ('-4y -2mon -3d')
) t (i)

     D
------
 40203
 40203
-40203

Postgres: difference between two timestamps (hours:minutes:seconds)

I'm creating a select that calculates the difference between two timestamps.
Here is the code (it isn't necessary to understand the tables below, just follow the thread):
((select value from demo.data where id=q.id and key='timestampend')::timestamp
 - (select value from demo.data where id=q.id and key='timestampstart')::timestamp) as durata
Or look at this simpler example:
select timestamp_end::timestamp - timestamp_start as duration
Here is the result:
// "durata" is duration
The problem is that the first timestamp is 2017-06-21 and the second is 2017-06-22, so we have 1 day and some hours of difference.
How can I show the result not as "1 day 02:06:41.993657" but as "26:06:41.993657", and without milliseconds (26:06:41)?
Update
I'm testing this query:
select id as ticketid,
(select value from demo.data where id=q.id and key = 'timestampstart')::timestamp as TEnd,
(select value from demo.data where id=q.id and key = 'timestampend')::timestamp as TStart,
(select
make_interval
(
0,0,0,0, -- years, months, weeks, days
extract(days from duration1)::int * 24 + extract(hours from duration1)::int, -- calculated hours (days * 24 + hours)
extract(mins from duration1)::int, -- minutes
floor(extract(secs from duration1))::int -- seconds, without miliseconds, thus FLOOR()
) as duration1
from
(
(select value from demo.data where id=q.id and key='timestampstart')::timestamp - (select value from demo.data where id=q.id and key='timestampend')::timestamp
) t(duration) as dur
from (select distinct id from demo.data) q
error is the same: [Err] ERROR: syntax error at or near "::"
there is an error on id = q.id
data table is like this:
You could use the EXTRACT function and wrap it up with MAKE_INTERVAL and some math. It's pretty straightforward, since you pass each part of the timestamp to it:
select
    make_interval(
        0, 0, 0, 0, -- years, months, weeks, days
        extract(days from durdata)::int * 24 + extract(hours from durdata)::int, -- calculated hours (days * 24 + hours)
        extract(mins from durdata)::int, -- minutes
        floor(extract(secs from durdata))::int -- seconds, without milliseconds, thus FLOOR()
    ) as durdata
from (
    select '2017-06-22 02:06:41.993657'::timestamp - '2017-06-21'::timestamp
) t(durdata);
Output:
durdata
----------
26:06:41
You could wrap it up within a function to make it easy to work with.
There is no need to worry about timestamp - timestamp producing units larger than days (and thus losing information): even a calculation spanning different years still returns the result as days plus a time part.
Example:
postgres=# select ('2019-06-22 01:03:05.993657'::timestamp - '2017-06-21'::timestamp) as durdata;
durdata
------------------------
731 days 01:03:05.993657
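As mentioned above, this can be wrapped in a small helper; here is a minimal sketch (the function name hhmmss is just for illustration, not part of the original answer):
CREATE OR REPLACE FUNCTION hhmmss(durdata interval) RETURNS interval AS $$
    SELECT make_interval(
        0, 0, 0, 0, -- years, months, weeks, days
        extract(days from durdata)::int * 24 + extract(hours from durdata)::int, -- days folded into hours
        extract(mins from durdata)::int,
        floor(extract(secs from durdata))::int -- drop fractional seconds
    );
$$ LANGUAGE sql IMMUTABLE;
-- usage: SELECT hhmmss('2017-06-22 02:06:41.993657'::timestamp - '2017-06-21'::timestamp);  --> 26:06:41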
In Postgres, although the interval data type allows hour values greater than 23 (see https://www.postgresql.org/docs/9.6/static/functions-formatting.html), the to_char() function will cut out the days and keep only the "hours within a day" part if you pass it a delta value and ask for 'HH24'.
So I ended up with this trick, combining to_char(...) with extract('epoch' from ...) and then putting the concatenated value through another to_char():
with timestamps(ts1, ts2) as (
    select
        '2017-06-21'::timestamptz,
        '2017-06-22 01:03:05.1212'::timestamptz
), res as (
    select
        floor(extract('epoch' from ts2 - ts1) / 3600) as hours,  -- floor, not round, so 30+ minutes don't bump the hour up
        to_char(ts2 - ts1, 'MI:SS') as min_sec
    from timestamps
)
select hours, min_sec, to_char(format('%s:%s', hours, min_sec)::interval, 'HH24:MI:SS')
from res;
The result is:
hours | min_sec | to_char
-------+---------+----------
25 | 03:05 | 25:03:05
(1 row)
You can define an SQL function to make using it easier:
create or replace function extract_hhmmss(timestamptz, timestamptz) returns interval as $$
with delta(i) as (
    select
        case when $2 > $1 then $2 - $1
             else $1 - $2
        end
), res as (
    select
        floor(extract('epoch' from i) / 3600) as hours,  -- floor, not round, to keep whole hours
        to_char(i, 'MI:SS') as min_sec
    from delta
)
select
    (
        case when $2 < $1 then '-' else '' end
        || to_char(format('%s:%s', hours, min_sec)::interval, 'HH24:MI:SS')
    )::interval
from res;
$$ language sql stable;
Example of usage:
[local]:5432 nikolay#test=# select extract_hhmmss('2017-06-21'::timestamptz, '2017-06-22 01:03:05.1212'::timestamptz);
extract_hhmmss
----------------
25:03:05
(1 row)
Time: 0.882 ms
[local]:5432 nikolay#test=# select extract_hhmmss('2017-06-22 01:03:05.1212'::timestamptz, '2017-06-21'::timestamptz);
extract_hhmmss
----------------
-25:03:05
(1 row)
Notice that it will give an error if timestamps are provided in reverse order, but it's not really hard to fix. // Update: already fixed.

Converting Daily Snapshots to Ranges in PostgreSQL

I have a very large table with years' worth of daily snapshots, showing what the data looks like each day. For the sake of illustration the table looks something like this:
Part Qty Snapshot
---- ---- --------
A 5 1/1/2015
B 10 1/1/2015
A 5 1/2/2015
B 10 1/2/2015
A 6 1/3/2015
B 10 1/3/2015
A 5 1/4/2015
B 10 1/4/2015
I would like to implement a slowly changing data methodology and collapse this data into a form that would look like this (assume current date is 1/4/15)
Part Qty From Thru Active
---- ---- -------- -------- ------
A 5 1/1/2015 1/2/2015 I
B 10 1/1/2015 1/4/2015 A
A 6 1/3/2015 1/3/2015 I
A 5 1/4/2015 1/4/2015 A
I have a function that runs daily so when I capture the latest snapshot, I convert it to this methodology. This function runs once the data is actually loaded into the table with an active flag of 'C' (current), from the giant table (which is actually in DB2).
This works for me going forward (once I have all past dates loaded), but I'd like a way to do this in one fell swoop for all existing dates, converting the individual snapshot dates into ranges.
For what it's worth, my current method is to run this function for every possible date value. While it works, it's quite slow, and I have several years' worth of history to process as I loop one day at a time.
Tables:
create table main.history (
    part varchar(25) not null,
    qty integer not null,
    from_date date not null,
    thru_date date not null,
    active_flag char(1)
);

create table stage.history as select * from main.history where false;

create table partitioned.history_active (
    constraint history_active_ck1 check (active_flag in ('A', 'C'))
) inherits (main.history);

create table partitioned.history_inactive (
    constraint history_active_ck1 check (active_flag = 'I')
) inherits (main.history);
Function to process a day's worth of new data:
CREATE OR REPLACE FUNCTION main.capture_history(new_date date)
RETURNS void AS
$BODY$
DECLARE
    rowcount integer := 0;
BEGIN
    -- partitioned.history_active already has a current snapshot for new_date
    truncate table stage.history;
    insert into stage.history
    select
        part, qty,
        min (from_date), max (thru_date),
        case when max (thru_date) = new_date then 'A' else 'I' end
    from
        partitioned.history_active
    group by
        part, qty;

    truncate table partitioned.history_active;
    insert into partitioned.history_active
    select * from stage.history
    where active_flag = 'A';

    insert into partitioned.history_inactive
    select * from stage.history
    where active_flag = 'I';
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
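For the "one fell swoop" backfill of all historical snapshots, one set-based option is a gaps-and-islands query. This is only a sketch against the illustrative Part/Qty/Snapshot table from the question (here called snapshots), not the real schema:
SELECT part, qty,
       min(snapshot) AS from_date,
       max(snapshot) AS thru_date,
       CASE WHEN max(snapshot) = current_date THEN 'A' ELSE 'I' END AS active_flag
FROM (
    SELECT part, qty, snapshot,
           -- consecutive days with the same (part, qty) share the same grouping value
           snapshot - (ROW_NUMBER() OVER (PARTITION BY part, qty ORDER BY snapshot))::int AS grp
    FROM snapshots
) s
GROUP BY part, qty, grp
ORDER BY part, min(snapshot);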

Weighted moving average in Amazon Redshift

Is there a way to calculate a weighted moving average with a fixed window size in Amazon Redshift? In more detail, given a table with a date column and a value column, for each date compute the weighted average value over a window of a specified size, with weights specified in an auxiliary table.
My search attempts so far have yielded plenty of examples of doing this with window functions for a simple average (without weights), for example here. There are also some related suggestions for Postgres, e.g. this SO question; however, Redshift's feature set is quite sparse compared with Postgres, and it doesn't support many of the advanced features that are suggested.
Assuming we have the following tables:
create temporary table _data (ref_date date, value int);
insert into _data values
('2016-01-01', 34)
, ('2016-01-02', 12)
, ('2016-01-03', 25)
, ('2016-01-04', 17)
, ('2016-01-05', 22)
;
create temporary table _weight (days_in_past int, weight int);
insert into _weight values
(0, 4)
, (1, 2)
, (2, 1)
;
Then, if we want to calculate a moving average over a window of three days (including the current date), where values closer to the current date are assigned a higher weight than those further in the past, we'd expect the weighted average for 2016-01-05 (based on the values from 2016-01-05, 2016-01-04 and 2016-01-03) to be:
(22*4 + 17*2 + 25*1) / (4+2+1) = 147 / 7 = 21
And the query could look as follows:
with _prepare_window as (
    select
        t1.ref_date
        , datediff(day, t2.ref_date, t1.ref_date) as days_in_past
        , t2.value * weight as weighted_value
        , weight
        , count(t2.ref_date) over(partition by t1.ref_date rows between unbounded preceding and unbounded following) as num_values_in_window
    from
        _data t1
    left join
        _data t2 on datediff(day, t2.ref_date, t1.ref_date) between 0 and 2
    left join
        _weight on datediff(day, t2.ref_date, t1.ref_date) = days_in_past
    order by
        t1.ref_date
        , datediff(day, t2.ref_date, t1.ref_date)
)
select
    ref_date
    , round(sum(weighted_value)::float/sum(weight), 0) as weighted_average
from
    _prepare_window
where
    num_values_in_window = 3
group by
    ref_date
order by
    ref_date
;
Giving the result:
ref_date | weighted_average
------------+------------------
2016-01-03 | 23
2016-01-04 | 19
2016-01-05 | 21
(3 rows)