How to get the top 99% values in postgresql? - postgresql

Seemingly similar to How to get the top 10 values in postgresql?, yet somehow very different.
We'll set it up similar to that question:
I have a postgresql table: Scores(score integer).
How would I get the highest 99% of the scores? We can't assume we know beforehand how many rows there are, so we can't just use LIMIT with a fixed integer. SQL Server has an easy SELECT TOP syntax -- is there anything similarly simple in the PostgreSQL world?

This should be doable with percent_rank()
select score
from (
    select score, percent_rank() over (order by score desc) as pct_rank
    from scores
) t
where pct_rank <= 0.99
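
For intuition, here's a minimal smoke test with throwaway generated data (the scores CTE below is purely hypothetical, not from the original answer). With order by score desc, percent_rank() returns 0 for the highest score and approaches 1 for the lowest, so the filter drops only the bottom sliver:

-- hypothetical check: roughly 99% of the 1000 rows should survive the filter
-- (ties at the bottom can shift the exact count a little)
with scores as (
    select (random() * 100)::int as score
    from generate_series(1, 1000)
)
select score
from (
    select score, percent_rank() over (order by score desc) as pct_rank
    from scores
) t
where pct_rank <= 0.99;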

You can use the ntile function to partition the rows into 100 percentile buckets and then select the rows where tile > 1, which drops only the lowest bucket and keeps the highest 99%.
Example:
-- The following query generates 1000 rows with random scores and keeps
-- everything above the bottom percentile using the ntile function.
-- Because the chance of the same random value appearing twice is extremely
-- small, the result will in virtually all cases be the 990 highest scores.
with scores as (
    select
        id
        , random() score
    from generate_series(1, 1000) id
)
, percentiles as (
    select
        *
        , ntile(100) over (order by score) tile
    from scores
)
select
    id
    , score
from percentiles
where tile > 1
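
A different option, using the percentile_disc ordered-set aggregate rather than either window function above: compute the cutoff score once and filter on it. A minimal sketch against the same Scores(score integer) table:

-- sketch only: find the 1st-percentile cutoff, then keep everything at or above it
select score
from scores
where score >= (
    select percentile_disc(0.01) within group (order by score)
    from scores
);
-- ties at the cutoff are kept, so the result can be slightly more than 99% of the rows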

Related

Finding average and moving average at any level

I have a table with 3 columns:
Product_ID
Scheduled_date_time
Arrival_date_time
Write a query to get the total number of delayed products, the average delay in minutes and the moving average all aggregated at week level, using the Scheduled_date_time as reference for the aggregation.
Delay is when a product arrives after its Scheduled_date_time.
I used this query to get total and average at week level:
select week_num,
sum(Isdelayed),
avg(Timediff_Inmins)
from
(
select product_id,
Scheduled_date_time,
Arrival_date_time,
case when Scheduled_date_time<Arrival_date_time then 1 else 0 end Isdelayed,
datediff(MINUTE,Scheduled_date_time,Arrival_date_time) as Timediff_Inmins,
datepart(week,Scheduled_date_time) as week_num
from products
) x
group by x.week_num
--Query below gives me moving average at week level of aggregation
select x.week_num,AVG(Timediff_Inmins) OVER (ORDER BY week_num ASC ROWS 6 PRECEDING) as Moving_Average_PerWeek
from
(
select product_id,
Scheduled_date_time,
Arrival_date_time,
case when Scheduled_date_time<Arrival_date_time then 1 else 0 end Isdelayed,
datediff(MINUTE,Scheduled_date_time,Arrival_date_time) as Timediff_Inmins,
datepart(week,Scheduled_date_time) as week_num
from products
) x
group by x.week_num,x.Timediff_Inmins
Not sure if I'm going in the right direction, as I do not have sample data.
I came across this question somewhere.
Getting the total and average at week level isn't an issue for me, but getting the total, average, and moving average all aggregated at week level, using Scheduled_date_time as the reference for the aggregation, is the real challenge. The total and average can go in one query grouped by week_num, but I can't put the moving average in the same query, so I end up with two queries, and I'm not sure either is exactly right.
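
One way to approach it: aggregate to week level once, then apply the moving average as a window function over the already-grouped rows, since a window function evaluated after GROUP BY sees one row per week. A rough, unverified sketch in the same SQL Server dialect, reusing the question's table and column names:

-- sketch only: weekly totals/averages plus a 7-week moving average in one query
select week_num,
       sum(Isdelayed) as total_delayed,
       avg(Timediff_Inmins) as avg_delay_mins,
       avg(avg(Timediff_Inmins)) over (order by week_num
            rows between 6 preceding and current row) as moving_avg_delay_mins
from (
    select case when Scheduled_date_time < Arrival_date_time then 1 else 0 end as Isdelayed,
           datediff(minute, Scheduled_date_time, Arrival_date_time) as Timediff_Inmins,
           datepart(week, Scheduled_date_time) as week_num
    from products
) x
group by week_num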

Optimizing two queries which aim at getting only the min row on a grouping key

I'm not sure I can come up with a clear question title...
What I want is to calculate the distance between points and polygons (step 1) and then, for each point, get only the closest polygon (NB: one polygon can have many points attached, but one point must be attached to only one polygon).
What I'm currently doing is the following:
CREATE TABLE temp_table AS
SELECT
    areas.*,
    points.*, -- includes a points_id column
    ST_DistanceSphere(areas.geometry, points.geometry) AS distance_sphere
FROM points
INNER JOIN areas
    ON ST_DWithin(areas.geometry, points.geometry, 25);

SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY temp_table.points_id ORDER BY distance_sphere ASC) AS rownumber, *
    FROM temp_table
) X
WHERE rownumber = 1
I have a feeling it's quite inefficient (the first query ran all night on a 4,000,000-row table; it took 29 min with a LIMIT 10 at the end), as it computes many useless rows.
Would nesting the first query inside the second one be faster?
SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY points_id ORDER BY distance_sphere ASC) AS rownumber, *
    FROM (
        SELECT
            areas.*,
            points.*, -- includes a points_id column
            ST_DistanceSphere(areas.geometry, points.geometry) AS distance_sphere
        FROM areas
        INNER JOIN points
            ON ST_DWithin(areas.geometry, points.geometry, 25)
    ) sub
) X
WHERE rownumber = 1
If not, how could I optimize what I'm doing?
What EPSG/SRID do you use (degrees or meters)? For example:
- 4326 is in degrees
- 3857 is in meters
If you use a metric SRID then you should use ST_Distance, not ST_DistanceSphere. If you use a degree-based SRID then be careful with ST_DWithin, as it works in the units of the SRID, so 25 means 25 degrees, and that is a HUGE distance (around 3,000 km).
So if you use 4326 (degrees), use a much smaller value than 25 for your ST_DWithin.
Create GiST indexes on both geometry columns:
CREATE INDEX ON points USING gist(geometry);
CREATE INDEX ON areas USING gist(geometry);
And just use your query with the proposed changes (change ST_DistanceSphere to ST_Distance, or use ST_DWithin with a much smaller value).
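
For reference, a rough sketch of what the combined query could look like after those changes, using DISTINCT ON (a PostgreSQL idiom not mentioned in the answer) instead of ROW_NUMBER to keep one nearest polygon per point; the ST_DWithin threshold below is a placeholder, not a recommended value:

-- sketch only: one nearest area per point, able to use the GiST indexes above
SELECT DISTINCT ON (points.points_id)
    points.points_id,
    areas.*,
    ST_DistanceSphere(areas.geometry, points.geometry) AS distance_sphere
FROM points
JOIN areas
    ON ST_DWithin(areas.geometry, points.geometry, 0.00025) -- placeholder in degrees for SRID 4326
ORDER BY points.points_id, distance_sphere;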

Count number of points within certain distance ranges from another set of points

I have the following, which gives me the number of customers within 10,000 meters of any store location:
SELECT COUNT(*) as customer_count FROM customer_table c
WHERE EXISTS(
SELECT 1 FROM locations_table s
WHERE ST_Distance_Sphere(s.the_geom, c.the_geom) < 10000
)
What I need is for this query to return not only the number of customers within 10,000 meters, but also the following. The number of customers within...
10,000 meters
more than 10,000, but less than 50,000
more than 50,000, but less than 100,000
more than 100,000
...of any location.
I'm open to this working a couple of ways. For a given customer, only count them one time (the shortest distance to any store), which would count everyone exactly once; I realize this is probably pretty complex. I'm also open to having people counted multiple times, which is really the more accurate picture anyway and, I think, should be much simpler.
Thanks for any direction.
You can do both types of queries relatively easily. But an issue here is that you do not know which customers are associated with which store locations, which seems like an interesting thing to know. If you want that, use the PK and store_name of the locations_table in the query. See both options with location id and store_name below. To emphasize the difference between the two options:
The first option indicates how many customers are in every distance class for every store location, for all customers for every store location.
The second option indicates how many customers are in every distance class for every store location, for the nearest store location for each customer only.
Both are queries with O(n × m) running time (implemented with the CROSS JOIN between customer_table and locations_table) and likely to become rather slow as the number of rows in either table grows.
Count customers in all distance classes
You should make a CROSS JOIN between the distances of customers from store locations and then group them by the store location id, name and classes of maximum distance that you define. You can create a "table" from your distance classes with the VALUES command which you can then simply use in any query:
SELECT loc_dist.id, loc_dist.store_name, grps.grp, count(*)
FROM (
SELECT s.id, s.store_name, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
FROM customer_table c, locations_table s) AS loc_dist
JOIN (
VALUES(1, 10000.), (2, 50000.), (3, 100000.), (4, 1000000.)
) AS grps(grp, dist) ON loc_dist.dist < grps.dist
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
Count customers in the nearest distance class
If you want customers listed in the nearest distance class only, then you should make the same CROSS JOIN on customer_table and locations_table as in the previous case, but then simply select the lowest group (i.e. the closest store) using a CASE clause in the query and GROUP BY store location id, name and distance class as before:
SELECT
id, store_name,
CASE
WHEN dist < 10000. THEN 1
WHEN dist < 50000. THEN 2
WHEN dist < 100000. THEN 3
ELSE 4
END AS grp,
count(*)
FROM (
SELECT s.id, s.store_name, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
FROM customer_table c, locations_table s) AS loc_dist
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
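
If you also want the first behaviour described in the question (each customer counted exactly once, in the class of their nearest store), a sketch of one way to do it that is not part of the answer above: collapse to each customer's minimum distance first, then classify. This assumes customer_table has an id primary key, and it drops the store id/name since only the counts were asked for:

-- sketch only: each customer counted once, in the distance class of the nearest store
SELECT CASE
         WHEN dist < 10000.  THEN 1
         WHEN dist < 50000.  THEN 2
         WHEN dist < 100000. THEN 3
         ELSE 4
       END AS grp,
       count(*) AS customer_count
FROM (
    SELECT c.id, min(ST_Distance_Sphere(s.the_geom, c.the_geom)) AS dist
    FROM customer_table c, locations_table s
    GROUP BY c.id
) nearest
GROUP BY 1
ORDER BY 1;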

SQL - 5% random sample by group

I have a table with about 10 million rows and 4 columns, and no primary key. The data in columns 2, 3, and 4 (x2, x3, and x4) are grouped into 50 groups identified by column 1 (x1).
To get a random sample of 5% from table, I have always used
SELECT TOP 5 PERCENT *
FROM thistable
ORDER BY NEWID()
The result returns about 500,000 rows, but some groups get unequal representation in the sample (relative to their original size) when sampled this way.
This time, to get a better sample, I wanted to get 5% sample from each of the 50 groups identified in column X1. So, at the end, I can get a random sample of 5% of rows in each of the 50 groups in X1 (instead of 5% of entire table).
How can I approach this problem? Thank you.
You need to be able to count each group and then pull the data out in a random order. Fortunately, we can do this with a CTE-style query. Although CTEs aren't strictly needed, they help break the solution down into small pieces rather than lots of sub-selects and the like.
I assume you've already got a column that groups the data, and that the value in this column is the same for all items in the group. If so, something like this might work (columns and table names to be changed to suit your situation):
WITH randomID AS (
-- First assign a random ID to all rows. This will give us a random order.
SELECT *, NEWID() as random FROM sourceTable
),
countGroups AS (
-- Now we add row numbers for each group. So each group will start at 1. We order
-- by the random column we generated in the previous expression, so you should get
-- different results in each execution
SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt FROM randomID
)
-- Now we get the data
SELECT *
FROM countGroups c1
WHERE rowcnt <= (
SELECT MAX(rowcnt) / 20 FROM countGroups c2 WHERE c1.groupcolumn = c2.groupcolumn
)
The two CTE expressions allow you to randomly order and then count each group. The final select should then be fairly straightforward: for each group, find out how many rows there are in it, and only return 5% of them (total_row_count_in_group / 20).
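
Since this page is otherwise about PostgreSQL, the same idea ports directly; a rough sketch, keeping the answer's hypothetical sourceTable and groupcolumn names, with random() in place of NEWID() and a windowed count instead of the correlated MAX:

-- sketch only: ~5% of each group; integer division by 20 truncates, like the original
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random()) AS rowcnt,
           COUNT(*) OVER (PARTITION BY groupcolumn) AS group_size
    FROM sourceTable
) t
WHERE rowcnt <= group_size / 20;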

Select rows randomly distributed around a given mean

I have a table that has a value field. The records have values somewhat evenly distributed between 0 and 100.
I want to query this table for n records, given a target mean, x, so that I'll receive a weighted random result set where avg(value) will be approximately x.
I could easily do something like
SELECT TOP n * FROM table ORDER BY abs(x - value)
... but that would give me the same result every time I run the query.
What I want to do is to add weighting of some sort, so that any record may be selected, but with diminishing probability as the distance from x increases, so that I'll end up with something like a normal distribution around my given mean.
I would appreciate any suggestions as to how I can achieve this.
why not use the RAND() function?
SELECT TOP n * FROM table ORDER BY abs(x - value) + RAND()
EDIT
Using RAND() won't work because calls to RAND() in a SELECT tend to produce the same number for most of the rows. Heximal was right to use NEWID(), but it needs to be used directly in the ORDER BY:
SELECT Top N value
FROM table
ORDER BY
abs(X - value) + (cast(cast(Newid() as varbinary) as integer))/10000000000
The large divisor 10000000000 is used to keep the avg(value) closer to X while keeping the AVG(x-value) low.
With that all said maybe asking the question (without the SQL bits) on https://stats.stackexchange.com/ will get you better results.
try
SELECT TOP n * FROM table ORDER BY abs(x - value), newid()
or
select * from (
SELECT TOP n * FROM table ORDER BY abs(x - value)
) a order by newid()