Postgres: Get percentile of number not necessarily in table column

Imagine I have a column my_variable of floats in a table my_table. I know how to convert each of the rows in the my_variable column into percentiles, but my question is: I have a number x that is not necessarily in the table. Let's call it 7.67. How do I efficiently compute where 7.67 falls in the distribution of my_variable? I would like to be able to say "7.67 is in the 16.7th percentile" or "7.67 is larger than 16.7% of the rows in my_variable." Note that 7.67 is not taken from the column; I'm supplying it in the SQL query itself.
I was thinking about ordering my_variable in ascending order, counting the number of rows that fall below the number I specify, and dividing by the total number of rows, but is there a more computationally efficient way of doing this?

If your data does not change too often, you can use a materialized view or a separate table, call it percentiles, in which you store 100 or 1,000 rows (depending on the precision you need). This table should have a descending index on the value column.
Each row contains the minimum value needed to reach a certain percentile, together with the percentile itself.
Then you just need to find the row with the largest value that does not exceed the given number and read off its percentile.
In your example the table would contain 1,000 rows, and you could have something like:
 percentile | value
------------+-------
       16.9 |  7.71
       16.8 |  7.69
       16.7 |  7.66
       16.6 |  7.65
       16.5 |  7.62
And your query could be something like:
SELECT percentile FROM percentiles WHERE value <= 7.67 ORDER BY value DESC LIMIT 1
This is a valid solution if the number of SELECTs you run is much larger than the number of updates to my_table.
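A sketch of how such a percentiles table could be built as a materialized view, reusing the my_table / my_variable names from the question; the 1,000-step granularity and the percentiles name are illustrative:
CREATE MATERIALIZED VIEW percentiles AS
SELECT k / 10.0 AS percentile,   -- 0.1, 0.2, ... 100.0
       percentile_disc(k / 1000.0) WITHIN GROUP (ORDER BY my_variable) AS value
FROM my_table, generate_series(1, 1000) AS k
GROUP BY k;

CREATE INDEX ON percentiles (value DESC);

-- re-run whenever my_table has changed enough to matter
REFRESH MATERIALIZED VIEW percentiles;
With the index in place, the lookup above becomes a cheap top-1 probe regardless of the size of my_table.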

I ended up doing:
SELECT avg(dummy_var::float)
FROM (
    SELECT CASE WHEN var_name < 3.14 THEN 1 ELSE 0 END AS dummy_var
    FROM table_name
    WHERE var_name IS NOT NULL
) sub
Where var_name was the variable of interest, table_name was the table of interest, and 3.14 was the number of interest.
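The same fraction can also be obtained with the hypothetical-set aggregate covered in the first related answer below; a sketch with the same illustrative names, noting that its definition, (rank - 1) / (total rows - 1), differs slightly from the count-below-divided-by-total above:
SELECT percent_rank(3.14) WITHIN GROUP (ORDER BY var_name)
FROM table_name
WHERE var_name IS NOT NULL;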

Related

Get percent rank for a given value in a given table column

I have a Postgres table with about 500k rows. One of the columns called score has values ranging from 0-1. The data is not normally distributed.
Say I have an observation of 0.25. I'd like to find out where this would fall in the distribution of the score column. This is sometimes referred to as the percent rank.
E.G. a value of 0.25 is in the 40th percentile. This would mean that a value of 0.25 is larger than 40% of the observations in the table.
I know I can calculate a frequency distribution with something like the query below, but this feels like overkill when all I want is a single percentile value.
select k, percentile_disc(k) within group (order by mytable.score)
from mytable, generate_series(0.01, 1, 0.01) as k
group by k
Sounds like you want the hypothetical-set aggregate function percent_rank():
SELECT percent_rank(0.25) WITHIN GROUP (ORDER BY score)
FROM mytable;
The manual:
Computes the relative rank of the hypothetical row, that is (rank - 1) / (total rows - 1). The value thus ranges from 0 to 1 inclusive.

How do I rank the column on percentile and remove the bottom 0.1% percentile data from a column in PostgreSQL?

At work I need to remove the bottom 0.1% of the data, i.e. I only need data with a percentile rank <= .99.
How can I rank the column I want the percentile for so that it gives me a percentile rank from .01 to .99, letting me eliminate the data I don't want?
The column I want the percentile for also needs to be partitioned by another column: there are X values in one column, each with Y values that I want the percentile for.
I used the percent_rank function, but it doesn't give accurate results. The examples on the internet show it ranking the data from 0 to 1, but while mine does start at 0, it ends at .57 for one column and .93 for another, and never reaches .99 or 1.
I wrote percent_rank() over (partition by ColX order by ColY). Am I missing something here? If this works properly, it's exactly what I am looking for.
I also tried the functions shown here, but I didn't quite understand what's happening with the ntile function, and generate_series returned an error saying that the numbers in the brackets (0.01, 1, 0.01) are out of range. The hosted cloud tool my company uses doesn't fully support Postgres (for example, Postgres accepts -1 indexing, but the tool still insists the index must be a positive number), so I don't know exactly why the error is occurring.
I feel like I am missing something obvious here; there must be a very simple function that will do the job, but I just can't find it.
It looks like the values in the ColY column are not unique within each ColX group.
In that case the last records in each group share the same percent rank.
In the examples below, ident is unique within each group.
Query without a unique "order by":
select ident, ColX, ColY, percent_rank() over (partition by ColX order by ColY) from table
Result:
 ident | ColX | ColY | percent_rank
-------+------+------+--------------
     1 |    1 |    1 |            0
     2 |    1 |    2 |          0.5
     3 |    1 |    2 |          0.5
Query with a unique "order by":
select ident, ColX, ColY, percent_rank() over (partition by ColX order by ColY, ident) from table
Result:
 ident | ColX | ColY | percent_rank
-------+------+------+--------------
     1 |    1 |    1 |            0
     2 |    1 |    2 |          0.5
     3 |    1 |    2 |            1
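Since window functions are not allowed in WHERE, applying that rank as a filter takes a subquery; a sketch keeping the ident, ColX, ColY names from the answer, with src standing in for the real table name:
SELECT ident, ColX, ColY
FROM (
    SELECT ident, ColX, ColY,
           percent_rank() OVER (PARTITION BY ColX ORDER BY ColY, ident) AS pr
    FROM src   -- src is a placeholder for the actual table
) sub
WHERE pr <= 0.99;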

how to calculate percentile in postgres

I have a table called timings where we store 1 million response timings from load testing. Now we need to divide this data into 100 groups, i.e. the first 500 records as one group and so on, and calculate the percentile of each group rather than the average.
So far I have tried this query:
SELECT quartile
     , avg(data)
     , max(data)
FROM (
    SELECT data
         , ntile(500) over (order by data) as quartile
    FROM data
) x
GROUP BY quartile
ORDER BY quartile
But how do I find the percentile?
Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:
=# select v,
ntile(100) OVER (ORDER BY v),
cume_dist() OVER (ORDER BY v)
FROM (VALUES (1), (2), (4), (4)) x(v);
v | ntile | cume_dist
---+-------+-----------
1 | 1 | 0.25
2 | 2 | 0.5
4 | 3 | 1
4 | 4 | 1
You can see that ntile only uses the first 4 out of 100 buckets, whereas cume_dist always gives you a number from 0 to 1. So if you want the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.
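That last step, sketched against the same toy data (the smallest v whose cume_dist reaches 0.99 is the 99th percentile):
SELECT min(v) AS p99
FROM (
    SELECT v, cume_dist() OVER (ORDER BY v) AS cd
    FROM (VALUES (1), (2), (4), (4)) x(v)
) sub
WHERE cd >= 0.99;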
If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.
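For example, keeping the data table and column names from the query in the question, the 99th percentile comes out in one pass (a sketch; percentile_disc returns an actual value from the set, percentile_cont interpolates):
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY data) AS p99_disc,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY data) AS p99_cont
FROM data;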
Edit:
Please note that since I originally answered this question, Postgres has gotten additional aggregate functions to help with this. See percentile_disc and percentile_cont here. These were introduced in 9.4.
Original Answer:
ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).
ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.
For percentile, you would set the number of buckets to be 100.
I'm not sure where the 500 comes in here... if you want to determine which percentile your data is in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and each group would have roughly 10,000 entries, not 500.
If you don't care about avg or max, you can drop a chunk of your query. So it would look something like this:
SELECT data, ntile(100) over (order by data) AS percentile
FROM data
ORDER BY data

Cumulative count cut with postgresql

Is it possible to do a cumulative count cut of 2.0 - 98.0% in PostgreSQL?
That means selecting only the data that falls between the 2nd and 98th percentile.
I suggest the window function ntile() in a subquery:
SELECT *
FROM (
SELECT *, ntile(50) OVER (ORDER BY ts) AS part
FROM tbl
WHERE ts >= $start
AND ts < $end
) sub
WHERE part NOT IN (1, 50);
ts being your date column.
ntile() assigns integer numbers to the rows dividing them into the number of partitions specified. Per documentation:
integer ranging from 1 to the argument value, dividing the partition as equally as possible
By using 50 partitions, the first partition matches the first 2 % as closely as possible and the last partition the last 2 % (> 98 %).
Note that if there are fewer than 50 rows, the assigned numbers never go up to 50. In that case the first row would be trimmed but not the last. Since the first and last 2 % are not well defined with such a low number of rows, this may be considered correct or incorrect. Check the total number of rows and adapt to your needs, for instance by trimming the last row too, or none at all.
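An alternative that sidesteps the bucket-count issue is to trim on percent_rank() instead of ntile(); a sketch with the same tbl / ts / $start / $end placeholders, which differs slightly at the boundaries from the ntile version:
SELECT *
FROM (
    SELECT *, percent_rank() OVER (ORDER BY ts) AS pr
    FROM tbl
    WHERE ts >= $start
    AND ts < $end
) sub
WHERE pr BETWEEN 0.02 AND 0.98;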

SQL - 5% random sample by group

I have a table with about 10 million rows and 4 columns, no primary key. The data in columns 2, 3, and 4 (x2, x3, and x4) are grouped into 50 groups identified in column 1 (x1).
To get a random sample of 5% from table, I have always used
SELECT TOP 5 PERCENT *
FROM thistable
ORDER BY NEWID()
The result returns about 500,000 rows, but some groups get unequal representation in the sample (relative to their original size) when sampled this way.
This time, to get a better sample, I want a 5% sample from each of the 50 groups identified in column x1, so that in the end I get a random sample of 5% of the rows in each of the 50 groups (instead of 5% of the entire table).
How can I approach this problem? Thank you.
You need to be able to count each group and then coerce the data out in a random order. Fortunately, we can do this with a CTE-style query. Although a CTE isn't strictly needed, it helps break the solution down into small pieces rather than lots of sub-selects and the like.
I assume you've already got a column that groups the data, and that the value in this column is the same for all items in the group. If so, something like this might work (columns and table names to be changed to suit your situation):
WITH randomID AS (
-- First assign a random ID to all rows. This will give us a random order.
SELECT *, NEWID() as random FROM sourceTable
),
countGroups AS (
-- Now we add row numbers for each group. So each group will start at 1. We order
-- by the random column we generated in the previous expression, so you should get
-- different results in each execution
SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt FROM randomID
)
-- Now we get the data
SELECT *
FROM countGroups c1
WHERE rowcnt <= (
SELECT MAX(rowcnt) / 20 FROM countGroups c2 WHERE c1.groupcolumn = c2.groupcolumn
)
The two CTE expressions allow you to randomly order and then count each group. The final select should then be fairly straightforward: for each group, find out how many rows there are in it, and only return 5% of them (total_row_count_in_group / 20).
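The query above is T-SQL (NEWID() is SQL Server). A PostgreSQL translation sketch of the same idea, with random() in place of NEWID() and a window count instead of the correlated MAX; sourceTable and groupcolumn remain placeholders:
WITH counted AS (
    SELECT *,
           row_number() OVER (PARTITION BY groupcolumn ORDER BY random()) AS rowcnt,
           count(*) OVER (PARTITION BY groupcolumn) AS group_total
    FROM sourceTable
)
SELECT *
FROM counted
WHERE rowcnt <= group_total / 20;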