Syntax error when trying to populate column with count of unique values in another column - postgresql

I'm trying to count the number of unique pool operators for every permit # in a table but am having trouble putting this value in a new column dedicated to that count.
So I have 2 tables: doh_analysis; doh_pools.
Both of these tables have a "permit" column (TEXT), but doh_analysis has about 1000 rows with duplicates in the permit column but occasional unique values in the operator column (TEXT).
I'm trying to fill a column "operator_count" in the table "doh_pools" with a count of unique values in "pooloperator" for each permit #.
So I tried the following code but am getting a syntax error at or near "(":
update doh_pools
set operator_count = select count(distinct doh_analysis.pooloperator)
from doh_analysis
where doh_analysis.permit ilike doh_pools.permit;
When I remove the "select" from before the "count" I get "SQL Error [42803]: ERROR: aggregate functions are not allowed in UPDATE".
I can successfully query a list of distinct permit-pooloperator pairs using:
select distinct permit, pooloperator
from doh_analysis;
And I can query the # of unique pooloperators per permit 1 at a time using:
select count(distinct pooloperator)
from doh_analysis
where permit ilike '52-60-03054';
But I'm struggling to insert a count of unique pairs for each permit # in the operatorcount column.
Is there a way to do this?

There is certainly a better way of doing this but I accomplished my goal by creating 2 intermediary tables and the updating the target table with values from the 2nd intermediate table like so:
select distinct permit, pooloperator
into doh_pairs
from doh_analysis;
select permit, count(distinct pooloperator)
into doh_temp
from doh_pairs
group by permit;
select count(distinct permit)
from doh_temp;
update doh_pools
set operator_count = doh_temp.count
from doh_temp
where doh_pools.permit ilike doh_temp.permit
and doh_pools.permit is not NULL
returning count;

Related

Selecting an entry from PostgreSQL table based on time and id using psycopg2

I have the following table in PostgreSQL DB:
DB exempt
I need a PostgreSQL command to get a specific value from tbl column, based on time_launched and id columns. More precisely, I need to get a value from tbl column which corresponds to a specific id and latest (time-wise) value from time_launched column. Consequently, the request should return "x" as an output.
I've tried those requests (using psycopg2 module) but they did not work:
db_object.execute("SELECT * FROM check_ids WHERE id = %s AND MIN(time_launched)", (id_variable,))
db_object.execute(SELECT DISTINCT on(id, check_id) id, check_id, time_launched, tbl, tbl_1 FROM check_ids order by id, check_id time_launched desc)
Looks like a simple ORDER BY with a LIMIT 1 should do the trick:
SELECT tbl
FROM check_ids
WHERE id = %s
ORDER BY time_launched DESC
LIMIT 1
The WHERE clause filters results by the provided id, the ORDER BY clause ensures results are sorted in reverse chronological order, and LIMIT 1 only returns the first (most recent) row

How to get distinct results with max of a field

I have a query :
select distinct(donorig_cdn),cerhue_num_rfa,max(cerhue_dt) from t_certif_hue
group by donorig_cdn,cerhue_num_rfa
order by donorig_cdn
it returns me some repeated ids with different cerhue_num_rfa
how do i return only one line for the repeated ids with cerhue_num_rfa that matches the max of date (cerhue_dt) .. and have at the end only 10 results instead of 15 ?
Postgres has SELECT DISTINCT ON to the rescue. It only returns the first row found for each value of the given column. So, all you need is an order that ensures the latest entry comes first. No need for grouping.
SELECT DISTINCT ON (donorig_cdn) donorig_cdn,cerhue_num_rfa,cerhue_dt
FROM t_certif_hue
ORDER BY donorig_cdn, cerhue_dt DESC;

selecting a distinct column with alias table not working in postgres

SELECT pl_id,
distinct ON (store.store_ID),
in_user_id
FROM plan1.plan_copy_levl copy1
INNER JOIN plan1._PLAN_STORE store
ON copy1.PLAN_ID = store .PLAN_ID;
while running this query in postgres server i am getting the below error..How to use the distinct clause..in above code plan 1 is the schema name.
ERROR: syntax error at or near "distinct" LINE 2: distinct ON
(store.store_ID),
You are missing an order by where the first set of rows should be the ones specified in the distinct on clause. Also, the distinct on clause should be at start of the selection list.
Try this:
SELECT distinct ON (store_ID) store.store_ID, pl_id,
in_user_id
FROM plan1.plan_copy_levl copy1
INNER JOIN plan1._PLAN_STORE store
ON copy1.PLAN_ID = store .PLAN_ID
order by store_ID, pl_id;

Postgres: Distinct but only for one column

I have a table on pgsql with names (having more than 1 mio. rows), but I have also many duplicates. I select 3 fields: id, name, metadata.
I want to select them randomly with ORDER BY RANDOM() and LIMIT 1000, so I do this is many steps to save some memory in my PHP script.
But how can I do that so it only gives me a list having no duplicates in names.
For example [1,"Michael Fox","2003-03-03,34,M,4545"] will be returned but not [2,"Michael Fox","1989-02-23,M,5633"]. The name field is the most important and must be unique in the list everytime I do the select and it must be random.
I tried with GROUP BY name, bu then it expects me to have id and metadata in the GROUP BY as well or in a aggragate function, but I dont want to have them somehow filtered.
Anyone knows how to fetch many columns but do only a distinct on one column?
To do a distinct on only one (or n) column(s):
select distinct on (name)
name, col1, col2
from names
This will return any of the rows containing the name. If you want to control which of the rows will be returned you need to order:
select distinct on (name)
name, col1, col2
from names
order by name, col1
Will return the first row when ordered by col1.
distinct on:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
Anyone knows how to fetch many columns but do only a distinct on one column?
You want the DISTINCT ON clause.
You didn't provide sample data or a complete query so I don't have anything to show you. You want to write something like:
SELECT DISTINCT ON (name) fields, id, name, metadata FROM the_table;
This will return an unpredictable (but not "random") set of rows. If you want to make it predictable add an ORDER BY per Clodaldo's answer. If you want to make it truly random, you'll want to ORDER BY random().
To do a distinct on n columns:
select distinct on (col1, col2) col1, col2, col3, col4 from names
SELECT NAME,MAX(ID) as ID,MAX(METADATA) as METADATA
from SOMETABLE
GROUP BY NAME

Hive: How to do a SELECT query to output a unique primary key using HiveQL?

I have the following schema dataset which i want to transform into a table that can be exported to SQL. I am using HIVE. Input as follows
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key so it needs to be unique. The output schema should be
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when i use the keyword DISTINCT in the HIVE query, the DISTINCT applies to the all the colums combined. I want to apply the DISTINCT operation only to the call_id. Something on the lines of
SELECT DISTINCT(call_id), stat2,stat3 from intable;
However this is not valid in HIVE(I am not well-versed in SQL either).
The only legal query seems to be
SELECT DISTINCT call_id, stat2,stat3 from intable;
But this returns multiple rows with same call_id as the other columns are different and the row on the whole is distinct.
NOTE: There is no arithmetic relation between a,b,c,x,y,z, etc. So any trick of averaging or summing is not viable.
Any ideas how i can do this?
One quick idea,not the best one, but will do the work-
hive>create table temp1(a int,b string);
hive>insert overwrite table temp1
select call_id,max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
hive>insert overwrite table intable
select a,split(b,'|')[0],split(b,'|')[1],split(b,'|')[2] from temp1;
,,I want to apply the DISTINCT operation only to the call_id"
But how will then Hive know which row to eliminate?
Without knowing the amount of data / size of the stat fields you have, the following query can the job:
select distinct i1.call_id, i1.stat2, i1.stat3 from (
select call_id, MIN(concat(stat1, stat2, stat3)) as smin
from intable group by call_id
) i2 join intable i1 on i1.call_id = i2.call_id
AND concat(i1.stat1, i1.stat2, i1.stat3) = i2.smin;