Esper: Take the last value from each id and take the mean of all but the most extreme - complex-event-processing

I have 5 temperature sensors. I want to calculate the mean temperature of 4 - excluding the most extreme value (high or low).
Firstly: will std:unique(id) create a window of the last temperature readings for each id 1-5?
select
avg(tempEvent.temp) as meantemp
from
Event(id in (1, 2, 3, 4, 5)).std:unique(id) as tempEvent
Secondly: how could I change the select statement (using an expression if necessary) to calculate the mean of only four values, excluding the most extreme?
The background is, I want to know the deviations of each temperature from the average, but I don't want the average to include an anomalous id. Otherwise all temperatures will look like they are deviating from the average but really only one is.

Finding the average of the middle values (dropping the lowest and the highest) is simple enough, though not as elegant as your solution. The code below will work for any number of temps.
SELECT
    AVG(temp) AS meantemp
FROM (
    SELECT
        temp,
        COUNT(*) OVER () AS c,                  -- total number of readings
        ROW_NUMBER() OVER (ORDER BY temp) AS r  -- position when sorted by temp
    FROM
        [table]
) ranked
WHERE
    r > 1      -- drop the lowest
    AND r < c  -- drop the highest
;
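As a quick sanity check, here is the same query against a throwaway set of five readings (hypothetical data):
SELECT AVG(temp) AS meantemp
FROM (
    SELECT
        temp,
        COUNT(*) OVER () AS c,
        ROW_NUMBER() OVER (ORDER BY temp) AS r
    FROM (VALUES (10.0), (11.0), (12.0), (13.0), (50.0)) AS t(temp)
) ranked
WHERE r > 1 AND r < c;
-- drops 10 and 50, leaving AVG(11, 12, 13) = 12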
As for your second question, I'm not sure I understand. Do you want the value from among the four middle values that has the greatest absolute deviation from the mean of those four values?

Related

Get percent rank for a given value in a given table column

I have a Postgres table with about 500k rows. One of the columns called score has values ranging from 0-1. The data is not normally distributed.
Say I have an observation of 0.25. I'd like to find out where this would fall in the distribution of the score column. This is sometimes referred to as the percent rank.
E.g. a value of 0.25 is in the 40th percentile. This would mean that a value of 0.25 is larger than 40% of the observations in the table.
I know I can calculate a frequency distribution with something like the query below, but this feels like overkill when all I want is a percentile value.
select k, percentile_disc(k) within group (order by mytable.score)
from mytable, generate_series(0.01, 1, 0.01) as k
group by k
Sounds like you want the hypothetical-set aggregate function percent_rank():
SELECT percent_rank(0.25) WITHIN GROUP (ORDER BY score)
FROM mytable;
The manual:
Computes the relative rank of the hypothetical row, that is (rank - 1) / (total rows - 1). The value thus ranges from 0 to 1 inclusive.
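You can sanity-check that formula on a throwaway set (hypothetical data): with the five values 1 through 5, a hypothetical 3.5 ranks 4th among the six rows (five real plus the hypothetical one), so the result should be (4 - 1) / (6 - 1) = 0.6:
SELECT percent_rank(3.5) WITHIN GROUP (ORDER BY v) AS pr
FROM (VALUES (1.0), (2.0), (3.0), (4.0), (5.0)) AS t(v);
-- pr = 0.6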

Dividing AVG of column1 by AVG of column2

I am trying to divide the average value of column1 by the average value of column2, which will give me an average price from my data. I believe there is a problem with the syntax/structure of my code, or I am making a rookie mistake.
I have searched Stack Overflow and cannot find many examples of dividing two averaged columns, and I have checked the Postgres documentation.
The individual average query is working fine (as shown here)
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) FROM table1
But when I combine two of them in an attempt to divide, it simply does not work.
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) / (AVG(CAST("Column2" AS numeric(4,2))),2) FROM table1
I am receiving the following error: "ERROR: row comparison operator must yield type boolean, not type numeric". I have tried a few other variations, which have mostly given me syntax errors.
I don't know what you are trying to do with your current approach. However, if you want to take the ratio of two averages, you could also just take the ratio of the sums:
SELECT SUM(CAST("Column1" AS numeric(4,2))) / SUM(CAST("Column2" AS numeric(4,2)))
FROM table1;
Note that SUM() takes a single input, not two; the stray ", 2" in your query turns each parenthesized expression into a row constructor, which is what triggers the row-comparison error. The reason we can use the sums is that the average would normalize both the numerator and the denominator by the same amount, the number of rows in table1, so this factor simply cancels out.
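If the trailing ", 2" was meant to round the result to two decimal places (an assumption on my part), the direct ratio of the two averages would be written with ROUND() instead:
SELECT ROUND(AVG(CAST("Column1" AS numeric(4,2))) / AVG(CAST("Column2" AS numeric(4,2))), 2)
FROM table1;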

MATLAB-How can I randomly select smaller values with higher probabilities?

I have a column vector "distances", and I want to select a value randomly from this vector such that smaller values have a higher probability of being selected. So far I am using the following, where "possible_cells" is the randomly selected value:
w=(fliplr(1:numel(distances)))/100
possible_cells=randsample((sort(distances)),1,true,w)
Basically, I flipped the distance vector to create selection probabilities "w" (if I am understanding randsample correctly), so that the smallest value has the highest probability of being selected. To check how well this works, I randomly drew 50 values and, looking at a histogram, I see that the values are higher than I would expect. Does anyone have any idea how else to do what I described above?
How about something like this?
Let's start with 10 sample distances with lengths no greater than 20, just to demonstrate:
d = randi(20,10,1);
Next, since we want smaller values to be more likely, let's take the reciprocal of those distances:
d_rec = 1./d;
Now, let's normalize so we can create a distribution from which to select our distance:
d_rec_norm = d_rec ./ sum(d_rec);
This new variable reflects the probability with which to select each given distance. Now comes a little trick... we choose the distance like this:
d_i = find(rand < cumsum(d_rec_norm),1);
This will give us the index of our chosen distance. The logic behind this is that when cumulatively summing the normalized values associated with each distance (d_rec_norm) we create "bins" whose widths are proportional to the likelihood of selecting each distance. All that is left is to pick a random number between 0 and 1 (rand) and see which "bin" it falls in.
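As a quick empirical check (a sketch, using the variables defined above), you can draw a large number of indices and compare the observed frequencies against d_rec_norm:
% Draw many samples and compare the empirical frequencies
% against the target distribution d_rec_norm.
n_draws = 1e5;
idx = zeros(n_draws, 1);
for k = 1:n_draws
    idx(k) = find(rand < cumsum(d_rec_norm), 1);
end
emp = histcounts(idx, 0.5:1:numel(d)+0.5) / n_draws; % empirical frequencies
disp([d_rec_norm(:), emp(:)]) % the two columns should roughly agree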
I'm a new poster here, so let me know if this is unclear and I can try to improve my explanation.

Calculating Running Balances

I am struggling with how to calculate a running balance that is dynamic. Here's what I have.
A range of periods with an intersection between total duration and each month in the duration.
A rate that is applied at each intersection.
An initial amount
A need to calculate a running balance based on the initial amount less the rate applied in that period.
For example, I have a project worth $2,500,000 that is 8 months long. The rates for each period are: period 1: 8.10%, period 2: 14.04%, period 3: 26.8%, period 4: 29.1%, period 5: 33.4%, period 6: 30.4%, period 7: 47.4%, period 8: 100%.
For period 1 I have $202,500 (8.10% × $2.5 million); for period 2, I have $322,500 (14.04% × $2,297,500, i.e. $2.5 million less the $202,500 from period 1); for period 3, I have $530,000 (26.8% × $1,975,000, i.e. $2.5 million less the $525,000 from the first two periods). At the end of period 8, my balance is $0 and my earned amount = $2.5 million.
Can I use something like RunningTotal = SUM(MonthlyAmts) OVER (ORDER BY Period ROWS UNBOUNDED PRECEDING)? Or is this a candidate for a cursor?
Thanks in advance!
It depends on your table structure, but it sounds like you need a function. If you want the balance for every period, use a table-valued function; if you are only interested in the final result, a scalar function will do. A better way still is a recursive CTE.
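To make the recursive-CTE idea concrete, here is a sketch (T-SQL, assuming a hypothetical Rates table with one row per period and the $2,500,000 initial amount from the example); each step earns Rate × the remaining balance and carries the rest forward:
-- Assumed table: Rates(Period int, Rate decimal(9,6)), Rate as a fraction (0.081, 0.1404, ...)
WITH Balances AS (
    -- anchor: period 1 starts from the full $2,500,000
    SELECT r.Period,
           CAST(2500000.00 * r.Rate AS decimal(18,2))       AS EarnedAmt,
           CAST(2500000.00 * (1 - r.Rate) AS decimal(18,2)) AS Balance
    FROM Rates AS r
    WHERE r.Period = 1
    UNION ALL
    -- each later period applies its rate to the prior remaining balance
    SELECT r.Period,
           CAST(b.Balance * r.Rate AS decimal(18,2)),
           CAST(b.Balance * (1 - r.Rate) AS decimal(18,2))
    FROM Balances AS b
    JOIN Rates AS r ON r.Period = b.Period + 1
)
SELECT Period, EarnedAmt, Balance
FROM Balances
ORDER BY Period;
For the sample rates this yields $202,500 earned in period 1 (balance $2,297,500), and a $0 balance after period 8, where the 100% rate consumes whatever remains.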

Select rows randomly distributed around a given mean

I have a table that has a value field. The records have values somewhat evenly distributed between 0 and 100.
I want to query this table for n records, given a target mean, x, so that I'll receive a weighted random result set where avg(value) will be approximately x.
I could easily do something like
SELECT TOP n * FROM table ORDER BY abs(x - value)
... but that would give me the same result every time I run the query.
What I want to do is to add weighting of some sort, so that any record may be selected, but with diminishing probability as the distance from x increases, so that I'll end up with something like a normal distribution around my given mean.
I would appreciate any suggestions as to how I can achieve this.
Why not use the RAND() function?
SELECT TOP n * FROM table ORDER BY abs(x - value) + RAND()
EDIT
Using RAND() won't work because SQL Server evaluates RAND() once per query, so it produces the same number for every row. Heximal was right to use NEWID(), but it needs to go directly in the ORDER BY:
SELECT TOP n value
FROM table
ORDER BY
    abs(X - value) + CAST(CAST(NEWID() AS varbinary) AS int) / 10000000000.0
The large divisor 10000000000.0 keeps the random term small, so avg(value) stays close to X while still letting any row be selected (the .0 matters: with an integer divisor the jitter term would be truncated to zero by integer division).
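To get a feel for the scale of that jitter term on its own (a quick check using the same NEWID() cast trick), note that the int cast lands roughly in ±2.1 billion, so dividing by 1e10 yields values in about [-0.21, 0.21]:
SELECT CAST(CAST(NEWID() AS varbinary) AS int) / 10000000000.0 AS jitter;
-- different on every run, roughly between -0.21 and 0.21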
With all that said, maybe asking the question (without the SQL bits) on https://stats.stackexchange.com/ will get you better results.
Try
SELECT TOP n * FROM table ORDER BY abs(x - value), newid()
or
SELECT * FROM (
    SELECT TOP n * FROM table ORDER BY abs(x - value)
) a ORDER BY newid()
(The first returns the n closest rows with ties broken at random; the second returns the n closest rows in random order.)