I am looking for possible ways of doing random sampling in PostgreSQL. I have found a couple of methods, each with different advantages and disadvantages. The naive way to do it is:
select * from Table_Name
order by random()
limit 10;
Another faster method is:
select * from Table_Name
where random() <= 0.01
order by random()
limit 10;
(Although that 0.01 depends on the table size and the sample size; this is just an example.)
In both of these queries, a random number is generated for each row, and the rows are sorted by those random numbers. The first 10 rows of the sorted result are then selected, so I think these should be sampling without replacement.
Now what I want to do is to somehow turn these sampling methods into sampling with replacement. How is that possible? Or is there any other random sampling method with replacement in PostgreSQL?
I do have an idea of how this might be possible, but I don't know how to implement it in PostgreSQL. Here is my idea:
If, instead of generating one random value per row, we generate S random values, where S is the sample size, and then order all of the generated values, it would be sampling with replacement. (I don't know if I am right.)
At this point I am not concerned about the performance of the query.
This can be achieved by mapping the random values to row numbers: the same row will be sampled N times if the same corresponding random number comes up N times. Here's a CTE implementation:
with numbered as ( -- "rows" is a reserved word in PostgreSQL, hence the rename
    select *, row_number() over () as rn from tablename
),
w(num) as (
    -- floor() keeps the draw uniform over 1..N; a bare ::int would round,
    -- skewing the endpoints and occasionally yielding N+1 (which matches nothing)
    select floor(random() * (select count(*) from numbered))::int + 1
    from generate_series(1, 10)
)
select numbered.* from numbered join w on numbered.rn = w.num;
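To see that this really samples with replacement, here is a quick sanity check (a sketch using a throwaway three-row table): drawing 10 numbers from only 3 rows is guaranteed to repeat some row numbers, and the join then returns those rows multiple times.
with t(id) as (values (1), (2), (3)),
w(num) as (
    select floor(random() * 3)::int + 1
    from generate_series(1, 10)
)
select t.id, count(*) as times_drawn
from t join w on t.id = w.num
group by t.id
order by t.id;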
I have a Postgres table with about 500k rows. One of the columns, called score, has values ranging from 0 to 1. The data is not normally distributed.
Say I have an observation of 0.25. I'd like to find out where this would fall in the distribution of the score column. This is sometimes referred to as the percent rank.
E.g., a value of 0.25 is in the 40th percentile; this would mean that a value of 0.25 is larger than 40% of the observations in the table.
I know I can calculate a frequency distribution with something like the query below, but this feels like overkill when all I want is a single percentile value.
select k, percentile_disc(k) within group (order by mytable.score)
from mytable, generate_series(0.01, 1, 0.01) as k
group by k
Sounds like you want the hypothetical-set aggregate function percent_rank():
SELECT percent_rank(0.25) WITHIN GROUP (ORDER BY score)
FROM mytable;
The manual:
Computes the relative rank of the hypothetical row, that is (rank - 1) / (total rows - 1). The value thus ranges from 0 to 1 inclusive.
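For intuition, the same number can be computed by hand: with the hypothetical row inserted, rank - 1 is the count of existing scores below 0.25, and total rows - 1 is the table's row count. So this sketch (reusing the question's table and column names) should agree with percent_rank(0.25):
SELECT (count(*) FILTER (WHERE score < 0.25))::numeric / count(*) AS manual_percent_rank
FROM mytable;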
I am trying to divide the average value of column1 by the average value of column2, which will give me an average price from my data. I believe there is a problem with the syntax or structure of my code, or I am making a rookie mistake.
I have searched Stack Overflow and cannot find many examples of dividing two averaged columns, and I have checked the Postgres documentation.
The individual average query is working fine (as shown here):
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) FROM table1
But when I combine two of them in an attempt to divide, it simply does not work.
SELECT (AVG(CAST("Column1" AS numeric(4,2))),2) / (AVG(CAST("Column2" AS numeric(4,2))),2) FROM table1
I am receiving the following error: "ERROR: row comparison operator must yield type boolean, not type numeric". I have tried a few other variations, which have mostly given me syntax errors.
The stray ", 2" is the problem: in PostgreSQL, a parenthesized, comma-separated list such as (AVG(...), 2) is a row constructor, so your query tries to divide one row by another, which is what the error is complaining about. (It looks like a leftover from round(expr, 2).) If you want the ratio of two averages, you can also just take the ratio of the sums:
SELECT SUM(CAST("Column1" AS numeric(4,2))) / SUM(CAST("Column2" AS numeric(4,2)))
FROM table1;
Note that SUM() takes just a single input, not two. The reason we can use the sums is that each average normalizes its sum by the same amount, the number of rows in table1, so this factor cancels out of the ratio. For example, with two rows where Column1 is 2 and 4 and Column2 is 1 and 1: the AVG ratio is 3 / 1 = 3, and the SUM ratio is 6 / 2 = 3.
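If you specifically want the averages themselves, for instance to keep the round(..., 2) that the stray ", 2" was presumably meant to be, the direct form also works once the row constructors are gone. A sketch with the question's column names:
SELECT round(AVG(CAST("Column1" AS numeric(4,2))), 2)
     / round(AVG(CAST("Column2" AS numeric(4,2))), 2)
FROM table1;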
I have a query which checks whether the data is inside a given set, then orders by random() with a seed value.
Example
set seed to 0.1;
select * from table where id in (1,2,3) order by random() limit 10;
The purpose of the seed value is to give me the exact same result every time, but I couldn't get the same result with the above query.
However, if I run the query below, it gives me exactly the same result every time:
set seed to 0.1;
select * from table order by random() limit 10;
Is there anything wrong with using the IN condition together with a random seed value?
Thanks in advance.
I have a table with about 10 million rows and 4 columns, and no primary key. The data in columns 2, 3, and 4 (X2, X3, and X4) is grouped into 50 groups identified in column 1 (X1).
To get a random sample of 5% of the table, I have always used:
SELECT TOP 5 PERCENT *
FROM thistable
ORDER BY NEWID()
The result returns about 500,000 rows. But some groups get unequal representation in the sample (relative to their original size) when sampled this way.
This time, to get a better sample, I want to take a 5% sample from each of the 50 groups identified in column X1, so that at the end I have a random sample of 5% of the rows in each of the 50 groups (instead of 5% of the entire table).
How can I approach this problem? Thank you.
You need to count each group and then pull the data out in a random order. Fortunately, we can do this with a CTE-style query. Although a CTE isn't strictly needed, it helps break the solution down into little pieces, rather than lots of sub-selects and the like.
I assume you've already got a column that groups the data, and that the value in this column is the same for all items in the group. If so, something like this might work (change the column and table names to suit your situation):
WITH randomID AS (
-- First assign a random ID to all rows. This will give us a random order.
SELECT *, NEWID() as random FROM sourceTable
),
countGroups AS (
-- Now we add row numbers for each group. So each group will start at 1. We order
-- by the random column we generated in the previous expression, so you should get
-- different results in each execution
SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt FROM randomID
)
-- Now select the data: for each group, keep the rows numbered up to 5% of
-- that group's size (MAX(rowcnt) within the group, integer-divided by 20)
SELECT *
FROM countGroups c1
WHERE rowcnt <= (
SELECT MAX(rowcnt) / 20 FROM countGroups c2 WHERE c1.groupcolumn = c2.groupcolumn
)
The two CTE expressions allow you to randomly order and then count each group. The final SELECT should then be fairly straightforward: for each group, find out how many rows it contains, and return only 5% of them (total_row_count_in_group / 20).
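A more compact variant of the same idea replaces the correlated subquery with a window COUNT (a sketch, using the same placeholder names groupcolumn and sourceTable):
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY NEWID()) AS rn,
           COUNT(*) OVER (PARTITION BY groupcolumn) AS grp_size
    FROM sourceTable
) t
WHERE rn <= grp_size / 20  -- integer division: roughly 5% of each group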
I have a table that has a value field. The records have values somewhat evenly distributed between 0 and 100.
I want to query this table for n records, given a target mean, x, so that I'll receive a weighted random result set where avg(value) will be approximately x.
I could easily do something like
SELECT TOP n * FROM table ORDER BY abs(x - value)
... but that would give me the same result every time I run the query.
What I want to do is add weighting of some sort, so that any record may be selected, but with a probability that diminishes as the distance from x increases, so that I end up with something like a normal distribution around my given mean.
I would appreciate any suggestions as to how I can achieve this.
Why not use the RAND() function?
SELECT TOP n * FROM table ORDER BY abs(x - value) + RAND()
EDIT
Using RAND() won't work, because SQL Server evaluates RAND() once per query, so it produces the same number for every row. Heximal was right to use NEWID(), but it needs to be used directly in the ORDER BY:
SELECT TOP N value
FROM table
ORDER BY
abs(X - value) + cast(cast(NEWID() as varbinary) as int) / 10000000000.0
The large divisor keeps the random term small (roughly ±0.2 here), so avg(value) stays close to X while rows at similar distances are still shuffled. Note the decimal point on the divisor: with a plain integer divisor, the division is integer division, the random term is always 0, and the weighting disappears.
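If you want to control the spread explicitly, the same idea can be written with a tunable noise scale (a sketch; @x, @noise, and myTable are hypothetical names, and CHECKSUM(NEWID()) is a common way to get a per-row random integer):
DECLARE @x float = 50.0;      -- target mean (example value)
DECLARE @noise float = 5.0;   -- larger = wider spread around @x
SELECT TOP (10) value
FROM myTable
ORDER BY abs(@x - value) + @noise * (ABS(CHECKSUM(NEWID()) % 1000) / 1000.0);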
With all that said, maybe asking the question (without the SQL bits) on https://stats.stackexchange.com/ will get you better results.
Try newid() as a random tiebreaker:
SELECT TOP n * FROM table ORDER BY abs(x - value), newid()
or select the n closest rows first and then shuffle them:
select * from (
SELECT TOP n * FROM table ORDER BY abs(x - value)
) a order by newid()