Select random rows from Postgres table weighted towards value in column

Users are presented with 2 random items from an assets table to vote on.
There is a votes_count column in the assets table that counts each time users vote.
When choosing the 2 random items, I'd like to weight that more towards a lower value in votes_count. So items with a lower vote count have a higher probability of being selected randomly.
How can I do this with Postgres?
I've used various methods for selecting random records (ORDER BY random(), TABLESAMPLE BERNOULLI, TABLESAMPLE SYSTEM), but those don't have the weighting that I'm after.
I'm running PostgreSQL 13, FWIW.

I suggest the following query:
SELECT *
FROM assets
ORDER BY random() / power(votes_count + 1, 1) DESC
LIMIT 2
Each row gets a random number divided by its vote count, so rows with fewer votes tend to get larger sort keys and land nearer the top of the DESC ordering. The + 1 keeps rows with votes_count = 0 from raising a division-by-zero error.
Replacing the exponent 1 with any real number n strengthens the weighting towards low vote counts when n > 1 and weakens it when 0 < n < 1.
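To get a feel for how strongly this ordering favours low vote counts, here is a rough Python simulation of the same sort key. It is only a sketch: the asset names and vote counts are made up, and the + 1 offset is an assumption that keeps zero-vote rows from dividing by zero.

```python
import random
from collections import Counter

def pick_two(assets, n=1.0, rng=random):
    """Mimic ORDER BY random() / power(votes_count + 1, n) DESC LIMIT 2.

    `assets` maps an asset id to its votes_count. The +1 offset is an
    assumption so that zero-vote rows do not divide by zero.
    """
    ranked = sorted(
        assets,
        key=lambda a: rng.random() / (assets[a] + 1) ** n,
        reverse=True,
    )
    return ranked[:2]

def selection_counts(assets, n=1.0, trials=20_000, seed=0):
    """Count how often each asset lands in the top two over many draws."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(pick_two(assets, n, rng))
    return counts

# Hypothetical data: asset id -> votes_count
assets = {"fresh": 0, "some": 5, "popular": 50}
counts_n1 = selection_counts(assets, n=1.0)
counts_n2 = selection_counts(assets, n=2.0)
```

Comparing counts_n1 with counts_n2 shows how raising the exponent pushes the heavily voted asset further out of the top two.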

Related

Randomly sampling n rows in impala using random() or tablesample system()

I would like to randomly sample n rows from a table using Impala. I can think of two ways to do this, namely:
SELECT * FROM TABLE ORDER BY RANDOM() LIMIT <n>
or
SELECT * FROM TABLE TABLESAMPLE SYSTEM(1) limit <n>
In my case I set n to 10000 and sample from a table of over 20 million rows. If I understand correctly, the first option essentially creates a random number between 0 and 1 for each row and orders by this random number.
The second option creates many different 'buckets' and then randomly samples at least 1% of the data (in practice this always seems to be much greater than the percentage provided). In both cases I then select only the 10000 first rows.
Is the first option reliable to randomly sample the 10K rows in my case?
Edit: some additional context. The structure of the data is why random sampling or shuffling of the entire table seems quite important to me. Additional rows are added to the table daily. For example, one of the columns is country, and the incoming rows usually arrive first all from country A, then all from country B, etc. For this reason I am worried that the second option might sample too many rows from a single country rather than sampling randomly. Is that a justified concern?
Related thread that reveals the second option: What is the best query to sample from Impala for a huge database?

I beg to differ, OP. I prefer the second option.
With the first option, you assign a random value between 0 and 1 to every row and then pick the first 10000 records. Impala therefore has to process and sort all rows in the table, so the operation will be slow on a 20 million row table.
With the second option, Impala randomly picks rows from the underlying data files based on the percentage you provide. Because it works at the file level, the number of rows returned may differ from the percentage you asked for. Impala also uses this method to compute statistics. Performance is therefore much better, but the quality of the randomness can be a problem.
Final thought:
If you are worried about the randomness and correctness of your sample, go for option 1. If you are less worried about randomness and just want sample data quickly, pick the second option. Since Impala uses this for COMPUTE STATS, I pick this one :)
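To illustrate the trade-off, here is a small Python sketch of the two options on data that arrives grouped by country. The block model is a deliberate simplification and an assumption on my part (real Impala samples data files, not fixed-size lists), and the table contents are hypothetical.

```python
import random
from collections import Counter

def shuffle_sample(rows, n, rng):
    """Option 1: give every row a random key, sort, keep the first n."""
    return sorted(rows, key=lambda _: rng.random())[:n]

def block_sample(rows, n, block_size, fraction, rng):
    """Option 2 (simplified): keep whole blocks at random, then apply LIMIT n."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    kept = [row for block in blocks if rng.random() < fraction for row in block]
    return kept[:n]

rng = random.Random(42)
# Hypothetical table: rows arrive grouped by country, as in the question.
rows = [country for country in "ABCD" for _ in range(5000)]

by_shuffle = Counter(shuffle_sample(rows, 1000, rng))
by_block = Counter(block_sample(rows, 1000, block_size=1000, fraction=0.3, rng=rng))
```

In this toy setup the shuffled sample contains all four countries in roughly equal shares, while the block sample followed by a limit comes from a single block and therefore a single country, which is the concern raised in the question.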
EDIT: After looking at your requirement, here is a method to sample over a particular field or fields.
We use a window function to assign row numbers in random order within each country group, then pick 1% (or whatever percentage you want) from each group.
This ensures that each country contributes the same percentage of its rows to the result set.
select * from
(
select
row_number() over (partition by country order by random()) rn, -- random position within each country
count(*) over (partition by country) cntpartition, -- total rows in that country
tab.*
from dat.mytable tab
) rs
where rs.rn <= rs.cntpartition * 1/100 -- keeps 1% of each country
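If it helps to see the idea outside SQL, the same per-country sampling can be sketched in Python. The function name, the percent parameter, and the sample rows are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, percent, rng):
    """Shuffle each group independently, then keep `percent` % of it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        rng.shuffle(members)                   # random row_number within the group
        keep = len(members) * percent // 100   # mirrors rn <= cntpartition * percent/100
        sample.extend(members[:keep])
    return sample

# Hypothetical table with very uneven country sizes.
rows = (
    [{"country": "A", "id": i} for i in range(2000)]
    + [{"country": "B", "id": i} for i in range(500)]
    + [{"country": "C", "id": i} for i in range(300)]
)
sample = stratified_sample(rows, "country", percent=1, rng=random.Random(7))
per_country = Counter(row["country"] for row in sample)
```

Each country ends up with 1% of its own rows, regardless of how the rows were ordered in the table.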
HTH

Tableau: how to make a count if in a for loop?

I'm just starting off Tableau and would like to do a count if in a for loop.
I have the following variables:
City
User
Round: takes values of either A or B
Amount
I would like to have a countif function that shows how many users received any positive amount in both round A and round B in a given city.
In my dashboard, each row represents a city, and I would like to have a column that shows the total number of users in each city that received amounts in both rounds.
Thanks!

You can go for a simple solution that works.
Create a calculated field called "Positive Rounds per User" using the below formula:
// counts the number of unique rounds that had positive amounts per user in a city
{ FIXED [User], [City]: COUNTD(IIF([Amount]>0, [Round], NULL))}
You can use the above to create another calculated field called "unique users":
// unique number of users that have 2 in "Positive Rounds per User" field
COUNTD(IIF([Positive Rounds per User]=2, [User], NULL))
You can combine the two calculations into one, but the result is harder to read, so it is better to keep them separate.
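For anyone who wants to check the logic outside Tableau, here is a rough Python equivalent of the two calculated fields. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

def users_with_both_rounds(rows):
    """rows: iterable of (city, user, round, amount) tuples.

    Returns {city: number of users with a positive amount in two distinct rounds}.
    """
    # Step 1, like the FIXED [User], [City] field: distinct positive rounds per user.
    positive_rounds = defaultdict(set)
    for city, user, rnd, amount in rows:
        if amount > 0:
            positive_rounds[(city, user)].add(rnd)
    # Step 2, like the "unique users" field: count users whose value is 2.
    per_city = defaultdict(int)
    for (city, _user), rounds in positive_rounds.items():
        if len(rounds) == 2:
            per_city[city] += 1
    return dict(per_city)

# Hypothetical data
rows = [
    ("Lyon", "ann", "A", 10), ("Lyon", "ann", "B", 5),
    ("Lyon", "bob", "A", 7),  ("Lyon", "bob", "B", 0),
    ("Oslo", "cat", "A", 3),  ("Oslo", "cat", "B", 4),
    ("Oslo", "dan", "A", 1),  ("Oslo", "dan", "B", 2),
    ("Oslo", "dan", "A", 6),
]
result = users_with_both_rounds(rows)
```

Note that a city with no qualifying users is simply absent from the returned dictionary, so use result.get(city, 0) when reading it.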

to group my results by varying distances from an object

I wrote a query that counts the objects and averages their prices. The query looks like this:
select count (product_id), avg (price)
from current_table
Now I want to group the results by the DISTANCE column, which already exists in the table, but in specific distance bands like this:
groups of 0-10 km, 10-20 km, 20-30 km, etc., for about 10 different distance groups.
How can I do it?
thanks!!!!

Using RANK function in Tableau on a Combined column. However, display individual columns as well

I am using RANK function in Tableau and I am displaying the Rank of calculated measure (Eg: 1 to 50)
The calculated Measure I have is Total Amount for Combined Periods.
When there is no Period displayed on the dashboard, the Total Amount is the sum of both periods and this is exactly what I want. I am good in this case.
However, When I want to display the Period in the Rows, the Total Amount changes to "Total Amount for Period 1 and Total Amount for Period 2".
How can I add a different axis to show Individual Periods as well as Rank of Total Amount for Combined Period?
I guess this might come down to Dual axis in Tableau and I believe this is not available yet and users are voting for this in Ideas.

Different Total Types in Tableau

I am trying to use Tableau's row total function but am running into a challenge. In the same widget I have Rows 1 - 4 with Numbers. Row 5 is a percentage.
What I would like to do is have Rows 1 - 4 use a Sum Total and Row 5 use an Average total.
Any suggestions on how I can do this?
Thanks,

I don't believe you can use different total metrics on the same worksheet.
What you can do is create two different worksheets and place them side by side on a dashboard, then use the appropriate total in each.
But beware of averaging percentages, because the result can be misleading. Usually a weighted average is required to accurately express the "average" of a percentage.
A better approach is to compute the percentage itself (in a calculated field) as the division of two aggregated measures. That way, when you add Totals, you will actually get a valid value for the "average" of the percentage.
As an exercise, suppose you have sales (in $) in the first row and # of clients in the second row. Now I create a calculated field called ticket, that is
SUM(sales) / sum([# of clients])
That way I can add it as a third row, and for each column I'll have the right ticket value. If I add a Row Grand Total, I'll get the actual average ticket value (that is, total sales / total # of clients), because Tableau sums all sales, sums all # of clients, and then performs the division.
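A quick numeric check in Python shows why the two kinds of "average" differ. The sales and client figures are invented for the example.

```python
# Hypothetical figures for three columns of the worksheet
sales = [1000.0, 300.0, 200.0]   # sales in $
clients = [10, 10, 20]           # number of clients

# Ticket per column: SUM(sales) / SUM(# of clients) evaluated column by column
tickets = [s / c for s, c in zip(sales, clients)]

# Naive total: plain average of the three ticket values
average_of_tickets = sum(tickets) / len(tickets)

# Grand total as described above: total sales / total clients
overall_ticket = sum(sales) / sum(clients)
```

Here the plain average of the tickets is about 46.67, while the ratio of the totals is 37.5, because the column with the most clients has the smallest ticket and the plain average ignores that weight.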
