Selecting a random value from a table with multiple entries to insert into another table in Hive

I need to select random values from the above table when there are multiple entries for a number (e.g. 3333, 4444, 6666). Currently I am using the code below, which biases the final result.
insert into com_n3
select number, min(district)  -- min() always picks the alphabetically first district
from com_n2
group by number
The result gives too many rows with "A" as the district. I need an unbiased, random way to select one of the multiple entries.

You can get random records using the following query.
select number, district
from
(
  select *, row_number() over (partition by number order by rand()) as rank
  from temp.com_n2
) a
where a.rank = 1
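To feed com_n3 as in the question, the same query can be wrapped in the insert; a sketch assuming the table names from the question:
insert into table com_n3
select number, district
from
(
  select *, row_number() over (partition by number order by rand()) as rank
  from temp.com_n2
) a
where a.rank = 1
Because rand() gives every row within a number an independent random sort key, each of the duplicate districts is equally likely to end up with row_number 1.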

Related

Rank() window function on Redshift over multiple columns, not 1 or 2

I want to use the rank() window function on a Redshift database to rank over specific, multiple columns. The code should check those multiple columns for each row and assign the same rank to rows that have identical values in ALL of those columns.
Example image found in link below:
https://ibb.co/GJv1xQL
There are 18 distinct rows, but the ranking I wish to apply should yield only 3 distinct rank values.
I tried :
select tbl.*
, dense_rank() over (partition by secondary_id order by created_on, type1, type2, money, amount nulls last ) as rank
from table tbl
where secondary_id='92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
But the ranks assigned were wrong, and then I tried:
select tbl.*
, dense_rank() over (partition by secondary_id,created_on, type1, type2, money, amount ) as rank
from table tbl
where secondary_id='92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
But this assigned rank=1 to every row.
I found how to solve this.
The reason that ordering by all the columns of interest was failing is that the timestamp column contained different values at the millisecond level, which was not obvious from viewing the data. So I only took the timestamp into account up to seconds and it worked: I converted the created_on column with date_trunc('s', created_on).
select tbl.*
, dense_rank() over (partition by secondary_id order by date_trunc('s',created_on), type1, type2, money, amount nulls last ) as rank
from table tbl
where secondary_id='92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'

Postgres random using setseed

I would like to add a column with a random number using setseed to a table.
The original table structure (test_input): col_a, col_b, col_c
Desired output (test_output): col_a, col_b, col_c, random_id
The following returns the same random_id on all rows instead of a different value in each row.
select col_a, col_b, col_c, setseed(0.5), (
  select random() from generate_series(1,100) limit 1
) as random_id
from test_input
Could you help me modify the query that uses setseed and returns a different random_id in each row?
You have to use setseed differently. Also, generate_series() is misused in your example. You need to use something like:
select setseed(0.5);
select col_a,col_b,col_c, random() as random_id from test_input;
If you want to get the same random number assigned to the same row each time, you will have to sort the rows first. Quoting the documentation:
If the ORDER BY clause is specified, the returned rows are sorted in the specified order. If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.
You can use:
select setseed(0.5);
select *, random() as random_id from (
select col_a,col_b,col_c from test_input order by col_a, col_b, col_c) a;
Here I assume that the combination of col_a, col_b, col_c is unique. If that's not the case, you will have to first add another column with a unique ID to the table and sort by that column in the query above, as sketched below.
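A minimal sketch of that workaround, assuming you may alter test_input (the column name row_id is illustrative):
-- sequence-backed unique ID; existing rows are filled automatically
alter table test_input add column row_id bigserial;

select setseed(0.5);
select *, random() as random_id from (
  select col_a, col_b, col_c from test_input order by row_id) a;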

Postgres: Distinct but only for one column

I have a table in pgsql with names (more than 1 million rows), but I also have many duplicates. I select 3 fields: id, name, metadata.
I want to select them randomly with ORDER BY RANDOM() and LIMIT 1000, so I do this in many steps to save some memory in my PHP script.
But how can I do that so it only gives me a list with no duplicate names?
For example [1,"Michael Fox","2003-03-03,34,M,4545"] will be returned but not [2,"Michael Fox","1989-02-23,M,5633"]. The name field is the most important, must be unique in the list every time I do the select, and must be random.
I tried GROUP BY name, but then it expects me to have id and metadata in the GROUP BY as well or in an aggregate function, and I don't want to have them filtered that way.
Does anyone know how to fetch many columns but do a distinct on only one column?
To do a distinct on only one (or n) column(s):
select distinct on (name)
name, col1, col2
from names
This will return, for each name, an arbitrary one of its rows. If you want to control which of the rows is returned, you need to order:
select distinct on (name)
name, col1, col2
from names
order by name, col1
This will return, for each name, the first row when ordered by col1.
distinct on:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
Does anyone know how to fetch many columns but do a distinct on only one column?
You want the DISTINCT ON clause.
You didn't provide sample data or a complete query so I don't have anything to show you. You want to write something like:
SELECT DISTINCT ON (name) id, name, metadata FROM the_table;
This will return an unpredictable (but not "random") set of rows. If you want to make it predictable add an ORDER BY per Clodaldo's answer. If you want to make it truly random, you'll want to ORDER BY random().
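A sketch of that truly random variant; since the DISTINCT ON expression must match the leftmost ORDER BY expression, name has to come first:
SELECT DISTINCT ON (name) id, name, metadata
FROM the_table
ORDER BY name, random();
Each execution picks one random row per name.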
To do a distinct on n columns:
select distinct on (col1, col2) col1, col2, col3, col4 from names
Alternatively, collapse the duplicates with GROUP BY and pick representative values for the other columns with aggregates:
SELECT NAME, MAX(ID) as ID, MAX(METADATA) as METADATA
FROM SOMETABLE
GROUP BY NAME

ROWID equivalent in postgres 9.2

Is there any way to get the rowid of a record in Postgres?
In Oracle I can use something like:
SELECT MAX(BILLS.ROWID) FROM BILLS
Yes, there is the ctid column, which is the equivalent of rowid. But it is useless for you: rowid and ctid are physical row/tuple identifiers, so they can change after a rebuild or VACUUM.
See: Chapter 5. Data Definition > 5.4. System Columns
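A quick illustration, using the bills table from the question; the values are physical (page, tuple) locations, not stable identifiers:
select ctid, * from bills;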
The PostgreSQL row_number() window function can be used for most purposes where you would use rowid. Whereas in Oracle the rowid is an intrinsic numbering of the result data rows, in Postgres row_number() computes a numbering within a logical ordering of the returned data. Normally if you want to number the rows, it means you expect them in a particular order, so you would specify which column(s) to order the rows by when numbering them:
select client_name, row_number() over (order by date) from bills;
If you just want the rows numbered arbitrarily you can leave the over clause empty:
select client_name, row_number() over () from bills;
If you want to calculate an aggregate over the row number you'll have to use a subquery:
select max(rownum) from (
select row_number() over () as rownum from bills
) r;
If all you need is the last item from a table, and you have a column to sort sequentially, there's a simpler approach than using row_number(). Just reverse the sort order and select the first item:
select * from bills
order by date desc limit 1;
Use a Sequence. You can choose 4 or 8 byte values.
http://www.neilconway.org/docs/sequences/
Add any unique column to your table (named, for example, rowid).
Then prevent changes to it by creating a BEFORE UPDATE trigger that raises an exception if someone tries to update it; a sketch follows below.
You may populate this column from a sequence as #JohnMudd mentioned.
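A minimal sketch of that setup on Postgres 9.2, assuming the bills table from the question (the column, function, and trigger names are illustrative):
alter table bills add column rowid bigserial unique;

create function forbid_rowid_update() returns trigger as $$
begin
  -- reject any change to the rowid column
  if new.rowid is distinct from old.rowid then
    raise exception 'rowid is immutable';
  end if;
  return new;
end;
$$ language plpgsql;

create trigger bills_rowid_guard
before update on bills
for each row execute procedure forbid_rowid_update();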

Calculate Mode - "Highest frequency row" DB2

What would be the most efficient way of calculating the mode across tables with joins in DB2?
I am trying to get the value with the highest frequency (count) for a given column (ID, a candidate key for the joined table) on a given date.
The idea is to get the most common value from the table, which has different values for some accounts (for the same ID and date). We need to make it unique for use in another table.
You can use common table expressions (CTEs), indicated by WITH, to break the logic down into logical steps. First we'll build the summary rows, then we'll assign a ranking to the rows within each group, then pick out the ones with the highest count of records.
Let's say we want to know which flavor of each item sells the most frequently on each date (perhaps assuming a record is quantity one).
WITH s as
(
  SELECT itemID, saleDate, flavor, count(*) as tally
  FROM sales
  GROUP BY itemID, saleDate, flavor
), r as
(
  SELECT itemID, saleDate, flavor, tally,
    RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
  FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1
Here the names "s" and "r" refer to the result sets of their respective CTEs. These names can then be used to represent a table in another part of the statement.
The pri column holds the RANK() of the tally value on each summary row from the first section "s", within the window of itemID and saleDate. Tally is ordered descending because we want the largest value first, which gets a RANK() of 1. Then in the main SELECT we simply pick those summary records which were first in their partition.
By using RANK() or DENSE_RANK() we could get back multiple flavors for an itemID and saleDate if they are tied for first place. This could be eliminated by replacing RANK() with ROW_NUMBER(), but it would arbitrarily pick one of the tied flavors as the winner, and that may not be the correct answer for the problem at hand.
If we had a sales quantity column in the table, we could replace COUNT(*) with SUM(salesqty) and find what had sold the most units.
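A sketch of that variant; the salesqty column is hypothetical, as in the sentence above, and only the aggregate changes:
WITH s as
(
  SELECT itemID, saleDate, flavor, SUM(salesqty) as tally  -- total units instead of record count
  FROM sales
  GROUP BY itemID, saleDate, flavor
), r as
(
  SELECT itemID, saleDate, flavor, tally,
    RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
  FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1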