My table is somethingg like
CREATE TABLE table1
(
_id text,
name text,
data_type int,
data_value int,
data_date timestamp -- insertion time
);
Now due to a system bug, many duplicate entries are created and I need to remove those duplicated and keep only unique entries excluding data_date because it is a system generated date.
My query to do that is something like:
DELETE FROM table1 A
USING ( SELECT _id, name, data_type, data_value, MIN(data_date) min_date
FROM table1
GROUP BY _id, name, data_type, data_value
HAVING count(data_date) > 1) B
WHERE A._id = B._id
AND A.name = B.name
AND A.data_type = B.data_type
AND A.data_value = B.data_value
AND A.data_date != B.min_date;
However this query works, having millions of records in the table, I want a faster way for it. My idea is to create a new column with value as partition by [_id, name, data_type, data_value] or columns which are in group by. However, I could not find the way to create such column.
I would appretiate if any one may suggest a way to create such column.
Edit 1:
There is another thing to add, I don't want to use CTE or subquery for updating this new column because it will be same as my existing query.
The best way is simply creating a new table without duplicated records:
CREATE...
SELECT _id, name, data_type, data_value, MIN(data_date) min_date
FROM table1
GROUP BY _id, name, data_type, data_value;
Alternatively, you can create a rank and then filter, but a subquery is needed.
RANK() OVER (PARTITION BY your_variables ORDER BY data_date ASC) r
And then filter r=1.
Related
I need to write a sql code that probably is very simple but I am very new to it.
I need to find all the records from one table that have matching id (but no more than one) from the other table. eg. one table contains records of the employees and the second one with employees' telephone numbers. i need to find all employees with only one telephone no
Sample data would be nice. In absence of:
SELECT
employees.employee_id
FROM
employees
LEFT JOIN
(SELECT distinct on(employee_id) employee_id FROM emp_phone) AS phone
ON
employees.employee_id = phone.employee_id
WHERE
phone.employee_id IS NOT NULL;
You need a join of the 2 tables, group by employee and the condition in the having clause:
SELECT e.employee_id, e.name
FROM employees e INNER JOIN numbers n
ON e.employee_id = n.employee_id
GROUP BY e.employee_id, e.name
HAVING COUNT(*) = 1;
If there can be more than a few numbers per employee in the table with the employees' telephone numbers (calling it tel), then it's cheaper to avoid GROUP BY and HAVING which has to process all rows. Find employees with "unique" numbers using a self-anti-join with NOT EXISTS.
While you don't need more than the employee_id and their unique phone number, you don't even have to involve the employee table at all:
SELECT *
FROM tel t
WHERE NOT EXISTS (
SELECT FROM tel
WHERE employee_id = t.employee_id
AND tel_number <> t.tel_number -- or use PK column
);
If you need additional columns from the employee table:
SELECT * -- or any columns you need
FROM (
SELECT employee_id AS id, tel_number -- or any columns you need
FROM tel t
WHERE NOT EXISTS (
SELECT FROM tel
WHERE employee_id = t.employee_id
AND tel_number <> t.tel_number -- or use PK column
)
) t
JOIN employee e USING (id);
The column alias in the subquery (employee_id AS id) is just for convenience. Then the outer join condition can be USING (id), and the ID column is only included once in the result, even with SELECT * ...
Simpler with a smart naming convention that uses employee_id for the employee ID everywhere. But it's a widespread anti-pattern to use employee.id instead.
Related:
JOIN table if condition is satisfied, else perform no join
I have a query like this:
SELECT
table1.*,
sum(table2.amount) as totalamount
FROM table1
join table2 on table1.key = table2.key
GROUP BY table1.*;
I got the error: column "table1.key" must appear in the GROUP BY clause or be used in an aggregate function.
Are there any way to group "all" field?
There is no shortcut syntax for grouping by all columns, but it's probably not necessary in the described case. If the key column is a primary key, it's enough when you use it:
GROUP BY table1.key;
You have to specify all the column names in group by that are selected and are not part of aggregate function ( SUM/COUNT etc)
select c1,c2,c4,sum(c3) FROM totalamount
group by c1,c2,c4;
A shortcut to avoid writing the columns again in group by would be to specify them as numbers.
select c1,c2,c4,sum(c3) FROM t
group by 1,2,3;
I found another way to solve, not perfect but maybe it's useful:
SELECT string_agg(column_name::character varying, ',') as columns
FROM information_schema.columns
WHERE table_schema = 'your_schema'
AND table_name = 'your_table
Then apply this select result to main query like this:
$columns = $result[0]["columns"];
SELECT
table1.*,
sum(table2.amount) as totalamount
FROM table1
join table2 on table1.key = table2.key
GROUP BY $columns;
I have two table like this:
create table product (
id serial primary key,
name text
);
create table selectedattribute (
id serial primary key,
product integer references product,
attribute text,
val text
);
and I'm creating a materialized view with this select query
select product.name,
jsonb_build_object(
'color', COALESCE(jsonb_agg(val) FILTER (WHERE attribute='color'), '[]'),
'diameter', COALESCE(jsonb_agg(val) FILTER (WHERE attribute='diameter'), '[]')
)
from product
left join selectedattribute on product.id = selectedattribute.product
group by product.id;
the problem with this select query is when I add a new attribute, I have to add it to select query in order to create an up-to-date materialized view.
Is there a way to write an aggregate expression that dynamically gets attributes without all these hard-coded attribute names?
You can try my code in SQL Fiddle: http://sqlfiddle.com/#!17/c4150/4
You need to nest the aggregation. First collect all values for an attribute then aggregate that into a JSON:
select id, name, jsonb_object_agg(attribute, vals)
from (
select p.id, p.name, a.attribute, jsonb_agg(a.val) vals
from product p
left join selectedattribute a on p.id = a.product
group by p.id, a.attribute
) t
group by id, name;
Updated SQLFiddle: http://sqlfiddle.com/#!17/c4150/5
I'm dealing with a pretty unique record matching problem within postgres right now. Essentially I have a table (A) with a lot of records in it, including a key value that I need to match on and the date of the record. Then I have this other table (B) that I want to match the first table on that key value. However, there can be multiple of the same 'key values' in both tables. To get around this I need to match the earliest key value from table A to the earliest key value to table B, the second earliest to the second earliest, and so on... However, if table B runs out of key value matches in table B then I want to default to the latest key value match in A, even though something else already matched on it.
My initial thought is to use a something like this on both tables:
ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
And then join on the rank and key_value field. However, I'm not exactly sure how to get that default scenario to work with this method. And if records are added to one table and not the other and I try the join again, I feel like it might get out of sync.
My other thought was to use a cursor, but I'm really struggling to see how I'd implement that.
Any help would be greatly appreciated!
first you need number all your rows, the find the one with matching ranks.
After that match the one without matching to the latest_date
with cteA as (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
FROM tableA
), cteB as (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
FROM tableB
), ranked_match as (
SELECT ctA.*, cteB.*
FROM cteA
LEFT JOIN cteB
ON cteA.key_value = cteB.key_value
AND cteA.rank = cteB.rank
), latest_row as (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date DESC) AS rank
FROM tableB
)
SELECT *
FROM ranked_match
WHERE cteB.key_value IS NOT NULL
UNION ALL
SELECT *
FROM ranked_match
JOIN latest_row
ON ranked_match.key_value = latest_row .key_value
WHERE cteB.key_value IS NULL
AND latest_row .rank = 1
I am trying to removing duplicate values which, for some reason, was imported in a specific Table.
There is no Primary Key in this table.
There is 27797 unique records.
Select distinct txdate, plunumber from itemaudit
Give me the correct records, but only displays the txdate, plunumber of course.
If it was possible to select all the fields but only select the distinct of txdate,plunumber I could export the values, delete the duplicated ones and re-import.
Or if its possible to delete the distinct values from the entire table.
If you select the distinct of all fields the value is incorrect.
To get all information on the duplicates, you simply need to query all information for the duplicate rows using a JOIN:
SELECT b.*
FROM (SELECT COUNT(*) as cnt, txdate, plunumber
FROM itemaudit
GROUP BY txdate, plunumber
HAVING COUNT(*) > 1) a
INNER JOIN itemaudit b ON a.txdate = b.txdate AND a.plunumber = b.plunumber
DELETE FROM itemaudit t1
WHERE EXISTS (
SELECT 1 FROM itemaudit t2
WHERE t1.txdate = t2.txdate and t1.plunumber = t2.plunumber
AND t1.RDB$DB_KEY < t2.RDB$DB_KEY
);