Insert from select misses some rows - Postgres

We have a table with around 7M records of type ferrari and want to do a schema migration. We used this script:
insert into new_car (id, name, type, colorType)
select id, name, type, 'red'
from old_car
where type = 'ferrari'
order by id asc
The script took around 50 minutes to execute, and after it completed we realised that the new_car table has about 2M fewer records than the old_car table.
While the script was executing, the old_car table was still receiving concurrent inserts, updates, and so on.
Could this concurrency cause some sort of problem? What are the possible causes of the missing ~2M rows? (The old_car table didn't get 2M deletes while the query was running; maybe something like 100 or 200 deletes.)
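One thing worth noting: a single INSERT ... SELECT statement only sees the snapshot of old_car taken when it started, so ferrari rows inserted during the 50-minute run would not be copied. As a hedged sketch (using only the table and column names from the question), a follow-up pass like this could copy whatever arrived in the meantime, skipping ids that are already present:
insert into new_car (id, name, type, colorType)
select o.id, o.name, o.type, 'red'
from old_car o
where o.type = 'ferrari'
and not exists (select 1 from new_car n where n.id = o.id);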

Related

PostgreSQL 9.6 deletes suddenly became slow

I have a database table where debug log entries are recorded. There are no foreign keys - it is a single standalone table.
I wrote a utility to delete a number of entries starting with the oldest.
There are 65 million entries so I deleted them 100,000 at a time to give some progress feedback to the user.
There is a primary key column called id
All was going fine until it got down to about 5,000,000 records remaining. Then it started taking over 1 minute to execute.
What is more, if I use pgAdmin and type the query in myself, using an id that I know is less than the minimum id, it still takes over one minute to execute!
I.e: delete from public.inettklog where id <= 56301001
And I know the min(id) is 56301002
Here is the result of an explain analyze
Your stats are way out of date. It thinks it will find 30 million rows, but instead finds zero. ANALYZE the table.
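A minimal sketch of that fix, using the table name from the question:
ANALYZE public.inettklog;
-- or, to also clean up after the bulk deletes and refresh statistics in one go:
VACUUM ANALYZE public.inettklog;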

I found inconsistent data on Postgres

I have a table of data on Postgres. This table, call it 'table1', has a unique constraint on the field 'id'. The table also has 3 other fields: 'write_date', 'state', and 'state_detail'.
Until now, I had no problem accessing and joining this table with another table using 'id' as the join field. But this time, I got a strange result when querying table1.
When I run this query (call it Query1):
SELECT id, write_date, state, state_detail
FROM table1
WHERE write_date = '2019-07-30 19:42:49.314' or write_date = '2019-07-30 14:29:06.945'
it gives me 2 rows with the same id, but different values for the other fields:
id || write_date || state || state_detail
168972 2019-07-30 14:29:06.945 1 80
168972 2019-07-30 19:42:49.314 2 120
But when I run this query (call it Query2):
SELECT id, write_date, state, state_detail
FROM table1
WHERE id = 168972
it gives me just 1 row:
id || write_date || state || state_detail
168972 2019-07-30 19:42:49.314 2 120
How come it gives different results? I checked table1: it has the unique constraint on 'id' as the primary key. How could this happen?
I have restarted the Postgres service and run those 2 queries again, and it still gives me the same results as above.
This looks like a case of index corruption, specifically on the unique index on the id column. Could you run the following query:
SELECT ctid, id, write_date, state, state_detail FROM table1
WHERE write_date = '2019-07-30 19:42:49.314' or write_date = '2019-07-30 14:29:06.945'
You will likely receive 2 rows back for the id, with two different ctids. The ctid represents the physical location on disk for each row. Presuming you get two rows back, you will need to pick a row and delete the other one in order to "fix" the data. In addition, you'll want to recreate the unique index on the table, in order to prevent this from happening again.
Oh, and don't be surprised if you find other rows in the table like this. Depending on the source of the corruption (bad disks, bad memory, recent crash, upgrade to glibc?), this could be the sign of future trouble to come. I'd probably double-check all my tables for any issues, recreate any unique indexes, and look at the OS level for any signs of disk corruption or I/O errors. And if you aren't on the latest version of Postgres, you should upgrade.
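Concretely, the repair could look something like the sketch below; the ctid value is a placeholder for whichever duplicate you decide to discard, and the index name is assumed (use the actual name of the unique index on id):
DELETE FROM table1 WHERE ctid = '(1234,5)';  -- hypothetical ctid of the stale row
REINDEX INDEX table1_id_key;                 -- one way to rebuild the (assumed) unique index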

How to select unique records from a table with a large number of records

I use postgresql and I have a database table with more than 5 million records. The structure of the table is as follows:
A lot of records are inserted every day. There are many records with the same reference.
I want to select all records, but without duplicates, i.e. without the records with the same reference.
I tried a query as follows:
SELECT DISTINCT ON (reference) reference_url, reference FROM daily_run_vehicle WHERE handled = False and retries < 5 ORDER BY reference DESC;
It executes and gives me the correct result, but it takes too long to execute.
Is there any better way to do this?
Create indexes on the columns you used in the WHERE condition.
After a large data movement into the table, run the VACUUM command, and after that analyze the table with the ANALYZE command; this rebuilds the table's statistics.
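As a hedged sketch of that advice against the query shown above (the index name is made up; the columns come from the question), a partial index matching the WHERE clause and the DISTINCT ON ordering may help, followed by a VACUUM ANALYZE:
CREATE INDEX daily_run_vehicle_reference_idx
ON daily_run_vehicle (reference DESC)
WHERE handled = false AND retries < 5;
VACUUM ANALYZE daily_run_vehicle;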

Best way to delete a large number of random rows in PostgreSQL

I have a table which contains about 900K rows. I want to delete about 90% of the rows. I tried using TABLESAMPLE to select them randomly but didn't get much performance improvement. Here are the queries I have tried and their times:
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users ORDER BY random() LIMIT 5000
)
[2017-11-22 11:35:39] 5000 rows affected in 1m 11s 55ms
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE BERNOULLI (5)
)
[2017-11-22 11:55:07] 5845 rows affected in 1m 13s 666ms
sql> DELETE FROM users WHERE id IN (
SELECT id FROM users TABLESAMPLE SYSTEM (5)
)
[2017-11-22 11:57:59] 5486 rows affected in 1m 4s 574ms
Deleting only 5% of the data takes about a minute, so this is going to take very long for the full amount. Please suggest whether I am doing things right or if there is a better way to do this.
Deleting a large number of rows is always going to be slow. The way you identify them won't make much difference.
Instead of deleting a large number of rows, it's usually a lot faster to create a new table that contains the rows you want to keep, e.g.:
create table users_to_keep
as
select *
from users
tablesample system (10);
then truncate the original table and insert the rows that you stored away:
truncate table users;
insert into users
select *
from users_to_keep;
If you want, you can do that in a single transaction.
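A sketch of the same keep-and-swap done in a single transaction (TRUNCATE is transactional in PostgreSQL, so the swap is all-or-nothing, though it takes an exclusive lock on the table while it runs):
begin;
create table users_to_keep as
select * from users tablesample system (10);
truncate table users;
insert into users
select * from users_to_keep;
drop table users_to_keep;
commit;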
As a_horse_with_no_name pointed out, the random selection itself is a relatively minor factor. And much of the cost associated with a deletion (e.g. foreign key checks) is not something you can avoid.
The only thing which stands out as an unnecessary overhead is the id-based lookup in the DELETE statement; you just visited the row during the random selection step, and now you're looking it up again, presumably via an index on id.
Instead, you can perform the lookup using the row's physical location, represented by the hidden ctid column:
DELETE FROM users WHERE ctid = ANY(ARRAY(
SELECT ctid FROM users TABLESAMPLE SYSTEM (5)
))
This gave me a ~6x speedup in an artificial test, though it will likely be dwarfed by other costs in most real-world scenarios.

SQLite - a smart way to remove and add new objects

I have a table in my database and I want each row to have a unique id, with the rows numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. Afterwards I add more data, but the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every row so that I can access my table at arbitrary positions.
Is there a way to do this in SQLite, or do I have to manage the removing and adding of data manually?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
  UPDATE table_name SET id = id - 1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you re-think your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in order of id, just define this field using the PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
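For example, to fetch what you would think of as the 5th row in id order (1-based), you pass OFFSET 4:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET 4;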
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum
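A short sketch of the pragma route (switching auto_vacuum on for an existing database only takes effect after a VACUUM is run):
PRAGMA auto_vacuum = FULL;
VACUUM;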