Copy data with new IDs - postgresql

Is there any way to copy some rows into the same table with new IDs?
My table is like this:
ID | data
1 | SOMETHING
2 | SOMETHING
3 | SOMETHING
I have old IDs: '{1,513,3,4,5}', and new ones: '{1338,7,512,9,10}', and I need to add row 1338 with the data from row 1, row 7 with the data from row 513, and so on: new[i] gets the data of old[i].
Currently I am using a loop:
SELECT old_ids INTO oIds FROM vars_table WHERE sid = id;

FOR i IN 0..array_length(new_ids, 1) LOOP
    INSERT INTO ids(ID, data)
    SELECT new_ids[i], data
    FROM ids
    WHERE id = oIds[i]
      AND NOT EXISTS (SELECT 1 FROM ids WHERE id = new_ids[i]);
END LOOP;
Is there a better way to do this? Maybe in one query?

There is no need for a loop:
insert into the_table (id, data)
select id + 5, data
from the_table;
However, the above requires you to know the current maximum ID (the offset of 5 above). To take the current maximum into account automatically, you can do:
insert into the_table (id, data)
select id + (select max(id) from the_table), data
from the_table;
Attention: the above is NOT safe in a multi-user environment. It should only be used if you are the only one doing this.
The best way to deal with this kind of data duplication is to define the ID column as serial and let Postgres deal with creating new values:
create table the_table (id serial not null, data text);
The initial data would then be inserted like this:
insert into the_table (data)
values ('foo'), ('bar'), ('foobar');
Duplicating the data is then as easy as:
insert into the_table (data)
select data
from the_table;
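If you need the exact old-to-new mapping from the question's arrays in a single statement, unnest can pair the two arrays element by element. A minimal sketch, assuming the old_ids/new_ids arrays are stored in vars_table as in the question, with :sid standing for the question's id variable:
-- pair old and new IDs positionally (multi-argument unnest, PostgreSQL 9.4+)
insert into ids (id, data)
select m.new_id, i.data
from vars_table v
cross join lateral unnest(v.old_ids, v.new_ids) as m(old_id, new_id)
join ids i on i.id = m.old_id
where v.sid = :sid  -- the "id" variable from the question
  and not exists (select 1 from ids where id = m.new_id);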

Related

Postgresql group into predefined groups where group names come from a database table

I have a database table with data similar to this.
create table DataTable (
    name text,
    value numeric
);
insert into DataTable values
('A', 1), ('A', 2), ('B', 3), ('Other', 5), ('C', 1);
And I have another table:
create table "group" (
name text,
default boolean
)
insert into "group" values
('A', false),('B', false),('Other', true);
I want to group the data in the first table based on the defined groups in the second table.
Expected output
Name | sum
A | 3
B | 3
Other | 6
Right now I'm using this query:
select coalesce(g.name, (select name from "group" where "default" = true)) as name,
       sum(dt.value)
from DataTable dt
left join "group" g on dt.name = g.name
group by 1
This works but can cause performance issues in some situations. Is there a better way to do this?
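One possible alternative (a sketch, not from the original thread) is to look up the default group name once instead of once per output group, assuming exactly one row in "group" has "default" = true:
select coalesce(g.name, d.name) as name,
       sum(dt.value) as sum
from DataTable dt
left join "group" g on dt.name = g.name
cross join (select name from "group" where "default" = true) d  -- single default row assumed
group by 1;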

Multiple UPDATE ... FROM same row is not working

I'm trying to do multiple updates, but it works only for the first row.
I have a table "users" with 2 records:
create table users
(
uid serial not null
constraint users_pkey
primary key,
balance numeric default 0 not null
);
INSERT INTO public.users (uid, balance) VALUES (2, 100);
INSERT INTO public.users (uid, balance) VALUES (1, 100);
I try to UPDATE user "1" twice with the query below, but it updates only once: the balance for user "1" becomes "105", not "115".
update users as u
set balance = balance + c.bal
from (values (1, 5),
(1, 10)
) as c(uid, bal)
where c.uid = u.uid;
Why is it not updated for all rows of the subquery?
The PostgreSQL documentation gives no reason for this behaviour, but it does specify it.
Relevant quote
When a FROM clause is present, what essentially happens is that the
target table is joined to the tables mentioned in the from_list, and
each output row of the join represents an update operation for the
target table. When using FROM you should ensure that the join produces
at most one output row for each row to be modified. In other words, a
target row shouldn't join to more than one row from the other
table(s). If it does, then only one of the join rows will be used to
update the target row, but which one will be used is not readily
predictable.
Use a SELECT with a GROUP BY to combine the rows before performing the update.
You need to aggregate in the inner query before joining:
update users as u
set balance = balance + d.bal
from (
select uid, sum(bal) bal
from ( values (1, 5), (1, 10) ) as c(uid, bal)
group by uid
) d
where d.uid = u.uid;
Demo on DB Fiddle:
| uid | balance |
| --- | ------- |
| 2 | 100 |
| 1 | 115 |

Optimal approach to bulk insert of pandas dataframe into PostgreSQL table

I need to upload multiple Excel files to a PostgreSQL table, but they can overlap each other in several records, so I need to watch for IntegrityErrors. I'm following two approaches:
cursor.copy_from: the fastest approach, but I don't know how to catch and handle all IntegrityErrors caused by duplicate records
streamCSV = StringIO()
streamCSV.write(invoicing_info.to_csv(index=None, header=None, sep=';'))
streamCSV.seek(0)

with conn.cursor() as c:
    c.copy_from(streamCSV, "staging.table_name", columns=dataframe.columns, sep=';')
    conn.commit()
cursor.execute: I can count and handle each exception, but it is very slow.
data = invoicing_info.to_dict(orient='records')

with cursor as c:
    for entry in data:
        try:
            c.execute(DLL_INSERT, entry)
            successful_inserts += 1
            connection.commit()
            print('Successful insert. Operation number {}'.format(successful_inserts))
        except psycopg2.IntegrityError as duplicate:
            duplicate_registers += 1
            connection.rollback()
            print('Duplicate entry. Operation number {}'.format(duplicate_registers))
At the end of the routine, I need to determine the following info:
print("Initial shape: {}".format(invoicing_info.shape))
print("Successful inserts: {}".format(successful_inserts))
print("Duplicate entries: {}".format(duplicate_registers))
How can I modify the first approach to control all exceptions? How can I optimize the second approach?
Since you have duplicate IDs across different Excel sheets, you first have to decide for yourself which sheet's data you trust.
As long as you are fine with keeping at least one row from each conflicting pair, you can always do the following:
create a temporary table for each Excel sheet
upload the data of each sheet into its temporary table (in bulk, like you do now)
insert from a SELECT that picks DISTINCT ON (id), in this manner:
INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON(id)
id, col1, col2
FROM
(
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet1
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet2
UNION
SELECT id, col1, col2 ...
FROM staging.temp_table_for_excel_sheet3
) as data
With such an insert, PostgreSQL will take an arbitrary row out of each set of rows sharing an id.
In case you would like to trust a particular sheet first, you can add an ordering column:
INSERT INTO staging.table_name(id, col1, col2 ...)
SELECT DISTINCT ON (id)
       id, col1, col2
FROM
(
    SELECT id, 1 as ordering_column, col1, col2 ...
    FROM staging.temp_table_for_excel_sheet1
    UNION
    SELECT id, 2 as ordering_column, col1, col2 ...
    FROM staging.temp_table_for_excel_sheet2
    UNION
    SELECT id, 3 as ordering_column, col1, col2 ...
    FROM staging.temp_table_for_excel_sheet3
) as data
ORDER BY id, ordering_column
For the initial count of objects (UNION ALL, so that sheets with identical counts are not collapsed):
SELECT sum(count)
FROM
(
    SELECT count(*) as count FROM temp_table_for_excel_sheet1
    UNION ALL
    SELECT count(*) as count FROM temp_table_for_excel_sheet2
    UNION ALL
    SELECT count(*) as count FROM temp_table_for_excel_sheet3
) as data
After finishing these bulk inserts you can run select count(*) FROM staging.table_name to get the total number of inserted records.
For the duplicate count you can run:
SELECT sum(count)
FROM
(
    SELECT count(*) as count
    FROM temp_table_for_excel_sheet2 WHERE id in (select id FROM temp_table_for_excel_sheet1)
    UNION ALL
    SELECT count(*) as count
    FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet1)
    UNION ALL
    SELECT count(*) as count
    FROM temp_table_for_excel_sheet3 WHERE id in (select id FROM temp_table_for_excel_sheet2)
) as data
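An alternative sketch (not part of the original answer) counts the surplus rows in one pass over the unioned ids, assuming the same temporary table names:
-- total rows minus distinct ids = number of extra rows sharing an id
SELECT count(*) - count(DISTINCT id) AS duplicate_rows
FROM (
    SELECT id FROM staging.temp_table_for_excel_sheet1
    UNION ALL
    SELECT id FROM staging.temp_table_for_excel_sheet2
    UNION ALL
    SELECT id FROM staging.temp_table_for_excel_sheet3
) AS all_ids;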
If the Excel sheets contain duplicate records, Pandas seems a likely choice for identifying and eliminating dupes: https://33sticks.com/python-for-business-identifying-duplicate-data/. Or is the issue that different records in different sheets have the same id/index? If so, a similar approach could work where you use Pandas to isolate the ids used multiple times and then correct them with unique identifiers before attempting to upload to the SQL db.
For a bulk upload, I'd use an ORM. SQLAlchemy has some great info on bulk uploads: http://docs.sqlalchemy.org/en/rel_1_0/orm/persistence_techniques.html#bulk-operations, and there's a related discussion here: Bulk insert with SQLAlchemy ORM

How to remove duplicates in postgres (no unique id) [duplicate]

This question already has answers here:
How to delete duplicate rows without unique identifier
(10 answers)
Closed 5 years ago.
I have some difficulty removing duplicate rows. I thought user_id and time_id together would act as an identifier, but there are even duplicates for those.
user_id (text), time_id(bigint), value1 (numeric)
user_id | time_id | value1
aaa     | 1       | 3
aaa     | 1       | 3
aaa     | 2       | 4
baa     | 3       | 1
In this case how do I remove duplicates?
Since I have 16 distinct values in time_id and 15,000 distinct ones in user_id, I tried something like this, but I do not have a unique id:
DELETE FROM tablename a
USING tablename b
WHERE a.unique_id < b.unique_id
  AND a.user_id = b.user_id
  AND a.time_id = 1  -- (repeat till time_id 16)
Each table in Postgres has a few hidden system columns. One of them (ctid) is unique by definition and can be used in cases when a primary key is missing.
DELETE FROM tablename a
USING tablename b
WHERE a.ctid < b.ctid
AND a.user_id = b.user_id
AND a.time_id = b.time_id;
The problem is due to the lack of a primary key. Using hidden columns should not become a systematic habit, though. Once you have deleted the duplicates, you should create a primary key on (user_id, time_id) or add a new unique column for this purpose.
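For example (a sketch with the question's column names, assuming the duplicates have already been removed):
ALTER TABLE tablename ADD PRIMARY KEY (user_id, time_id);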
Please use any advice on deletions with care, and make sure you have a way to "undo it" if needed. I think you need to add an auto-numbered column to assist in this endeavor:
alter table tablename add column is_uniq serial
Then I'd suggest using row_number() to help identify the rows you do want to retain (where rn=1) and those to be deleted (where rn>1). Use the following as a guide:
select *
, ROW_NUMBER()over(partition by user_id, time_id, value1 order by is_uniq) as rn from tablename
I'm not sure if there are any other columns to use for ORDER BY, but if there are, you can include them in the OVER clause as well.
Once you have the "is_uniq" column and the rn>1 rows you should be able to safely delete the unwanted rows.
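That final delete could look like this (a sketch reusing the names above; it is not part of the original answer):
-- keep the first row (rn = 1) of each duplicate group and delete the rest
DELETE FROM tablename t
USING (
    SELECT is_uniq,
           ROW_NUMBER() OVER (PARTITION BY user_id, time_id, value1
                              ORDER BY is_uniq) AS rn
    FROM tablename
) d
WHERE t.is_uniq = d.is_uniq
  AND d.rn > 1;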
If you don't want to rely on ctid (personally, I do), you can add a unique column (such as a serial) and use that for identity purposes:
CREATE TABLE lutser
( user_id text not null
, time_i integer not null
, value integer not null
);
INSERT INTO lutser(user_id,time_i,value) VALUES
('aaa', 1, 3)
,('aaa', 1, 3)
,('aaa', 2, 4)
,('baa', 3, 1)
;
SELECT * FROM lutser;

ALTER TABLE lutser
    ADD COLUMN seq serial NOT NULL UNIQUE;

SELECT * FROM lutser;

DELETE FROM lutser del
WHERE EXISTS (
    SELECT * FROM lutser x
    WHERE x.user_id = del.user_id
      AND x.time_i = del.time_i
      AND x.seq < del.seq
);

ALTER TABLE lutser
    ADD PRIMARY KEY (user_id, time_i);

SELECT * FROM lutser;

PostgreSQL Removing duplicates

I am working on a Postgres query to remove duplicates from a table. The following table is dynamically generated, and I want to write a select query which removes a record when the first column has duplicate values.
The table looks something like this
1st col | 2nd col
4       | 62
6       | 34
5       | 26
5       | 12
I want to write a select query which removes either row 3 or row 4.
There is no need for an intermediate table:
delete from df1
where ctid not in (select min(ctid)
from df1
group by first_column);
If you are deleting many rows from a large table, the approach with an intermediate table is probably faster.
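That intermediate-table approach might look like this (a sketch; df1_dedup is a made-up name and first_column stands for the real column name):
-- keep the row with the smallest ctid per first_column value
CREATE TABLE df1_dedup AS
SELECT DISTINCT ON (first_column) *
FROM df1
ORDER BY first_column, ctid;

DROP TABLE df1;
ALTER TABLE df1_dedup RENAME TO df1;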
If you just want to get unique values for one column, you can use:
select distinct on (first_column) *
from the_table
order by first_column;
Or simply
select first_column, min(second_column)
from the_table
group by first_column;
select count(first) as cnt, first, second
from df1
group by first
having(count(first) = 1)
If you want to keep one of the rows (sorry, I initially missed that requirement):
select first, min(second)
from df1
group by first
Where the table's name is df1 and the columns are named first and second.
You can actually leave off the count(first) as cnt if you want.
At the risk of stating the obvious, once you know how to select the data you want (or don't want), deleting the records in any of a dozen ways is simple.
If you want to replace the table or make a new table you can just use create table as for the deletion:
create table tmp as
select count(first) as cnt, first, second
from df1
group by first
having(count(first) = 1);
drop table df1;
create table df1 as select * from tmp;
or using DELETE FROM:
DELETE FROM df1 WHERE first NOT IN (SELECT first FROM tmp);
You could also use select into, etc, etc.
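The SELECT INTO variant mentioned above might look like this (a sketch; tmp2 is a made-up table name):
SELECT first, min(second) AS second
INTO tmp2  -- creates tmp2 from the query result
FROM df1
GROUP BY first;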
if you want to SELECT unique rows:
SELECT * FROM ztable u
WHERE NOT EXISTS ( -- There is no other record
SELECT * FROM ztable x
WHERE x.id = u.id -- with the same id
AND x.ctid < u.ctid -- , but with a different(lower) "internal" rowid
); -- so u.* must be unique
if you want to SELECT the other rows, which were suppressed in the previous query:
SELECT * FROM ztable nu
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = nu.id -- with the same id
AND x.ctid < nu.ctid -- , but with a different(lower) "internal" rowid
);
if you want to DELETE records, making the table unique (but keeping one record per id):
DELETE FROM ztable d
WHERE EXISTS ( -- another record exists
SELECT * FROM ztable x
WHERE x.id = d.id -- with the same id
AND x.ctid < d.ctid -- , but with a different(lower) "internal" rowid
);
So basically I did this:
create temp table t1 as
select first, min(second) as second
from df1
group by first;

select *
from df1
inner join t1 on t1.first = df1.first and t1.second = df1.second;
It's a satisfactory answer. Thanks for your help #Hack-R