How to select unique records from table with big number of records - postgresql

I use PostgreSQL and I have a table, daily_run_vehicle, with more than 5 million records.
A lot of records are inserted every day, and there are many records with the same reference.
I want to select all records, but without duplicates on reference.
I tried with query as follows:
SELECT DISTINCT ON (reference) reference_url, reference FROM daily_run_vehicle WHERE handled = false AND retries < 5 ORDER BY reference DESC;
It executes and gives me the correct result, but it takes too long.
Is there any better way to do this?

Create indexes on the columns you use in the WHERE condition.
After a large data load into the table, run the VACUUM command to clean up dead rows, then run ANALYZE on the table to rebuild its statistics.
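A partial index that matches both the filter and the DISTINCT ON ordering is one way to speed this up. A minimal sketch, assuming the table and columns from the query above (the index name is illustrative):
CREATE INDEX daily_run_vehicle_unhandled_idx
    ON daily_run_vehicle (reference DESC)
    WHERE handled = false AND retries < 5;
VACUUM ANALYZE daily_run_vehicle;
With the index in place, DISTINCT ON (reference) ... ORDER BY reference DESC can walk the index instead of sorting millions of rows.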

Related

Get latest rows in PostgreSQL table ordered by Date: Index or Sort table?

I had a hard time titling this question, but I hope it's appropriate.
I have a table of transactions, and each transaction has a Date column (of type Date).
I want to run a query that gets the latest 100 transactions by date (simple enough with an ORDER BY query).
My question is: in order to make this an extremely cheap operation, would it make sense to sort my entire table so that I just need to select the top 100 rows every time, or do I simply create an index on the date column? I'm not sure if the first option is even possible and/or good SQL database practice.
You would add an index on the column with the date and query:
SELECT * FROM tab
ORDER BY datecol DESC
LIMIT 100;
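The index itself could look like the following; a minimal sketch using the tab and datecol names from the query above (a plain ascending index also works, since PostgreSQL can scan an index backwards):
CREATE INDEX tab_datecol_idx ON tab (datecol DESC);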
The problem with your other idea is that there is no well-defined order in a table. Every UPDATE changes this "order", and even if you don't modify anything, a sequential scan need not start at the beginning of the table.

Why is counting by i not working but selecting from hdb works just fine?

I have a partitioned hdb and the following query works fine :
select from tableName where date within(.z.d-8;.z.d)
but the following query breaks :
select count i by date from tableName where date within(.z.d-8;.z.d)
with the following error :
"./2017.10.14/:./2017.10.15/tableName. OS reports: No such file or directory"
Any idea why this might happen?
As the error indicates, there's no table called tableName in the partition for 2017.10.15. For partitioned databases, kdb+ caches table counts; the caching happens when it runs the first query with the following properties:
the "select" part of the query is either count i or the partition field itself (in your example that would be date)
the where clause is either empty or constrains the partition field only.
(.Q.ps -- partitioned select -- is where all this magic happens, see the definition of it if you need all the details.)
You have several options to avoid the error you're getting.
Amend the query so that it no longer matches both conditions, i.e. avoid having either count i on its own or a where clause that constrains only the partition field.
Any of the following will work; the first is the simplest, while the others are useful if you're writing a query for the general case and don't know the field names in advance.
select count sym by date from tableName where date within (.z.d-8;.z.d) / any field except i
select count i by date from tableName where date within (.z.d-8;.z.d),i>=0
select `dummy, count i by date from tableName where date within (.z.d-8;.z.d)
select {count x}i by date from tableName where date within (.z.d-8;.z.d)
Use .Q.view to define a sub-view that excludes partitions with missing tables; kdb+ won't cache or otherwise access them.
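A minimal sketch, assuming a date-partitioned HDB and the missing 2017.10.15 partition from the error above:
.Q.view date except 2017.10.15  / expose every partition except the incomplete one
Afterwards the count i query will neither access nor cache counts for that date.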
The previous solutions will not work if the date range in your select includes partitions with missing tables. In this case you can either
Run .Q.chk to create empty tables where they are missing; or
Run .Q.bv to construct the dictionary of table schemas for tables with missing partitions.
You probably need to create the missing tables. I believe that when you run count i on a partitioned table, as you have done, kdb+ counts every single partition (not just the ones in your query) and caches these counts in .Q.pn.
If you run .Q.chk[HDB root location], it should create the missing tables and your query should work.
https://code.kx.com/q/ref/dotq/#qchk-fill-hdb
count i will scan each partition regardless of what is specified in the where clause, so it's likely those two partitions are incomplete.
It's better to pick an actual column for things like that; otherwise, something like
select count i>0 by date from tableName where date within(.z.d-8;.z.d)
will prevent the scanning of all partitions.

INSERT INTO .. SELECT causing possible race condition?

INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
or, written differently:
WITH selection AS
(SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A))
INSERT INTO A SELECT * FROM selection;
If these queries run multiple times simultaneously, is it possible that I will end up with duplicated rows in A?
How does Postgres process these queries? As one operation or as multiple?
If it is multiple operations (find max(timestamp) [1], select [2], then insert [3]), I can imagine this causing duplicated rows.
If that is correct, would wrapping it in BEGIN/END (a transaction) help?
Yes, that might result in duplicate values.
A single statement sees a consistent view of the data in all tables as of the point in time when the statement started.
Wrapping that single statement in a transaction won't change that (a single statement is always executed atomically, regardless of the number of sub-queries involved).
The statement will never see uncommitted data from other transactions, which is the root cause of why you can wind up with duplicate values.
The only safe way to avoid duplicate values, is to create a unique constraint (or index) on that column. In that case the INSERT would result in an error if such a value already exists.
If you want to avoid the error, use insert ... on conflict
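A minimal sketch of that approach, assuming the A and B tables and the timestamp column from the question (the constraint name is illustrative):
ALTER TABLE a ADD CONSTRAINT a_timestamp_unique UNIQUE (timestamp);
INSERT INTO a
SELECT * FROM b WHERE timestamp > (SELECT max(timestamp) FROM a)
ON CONFLICT (timestamp) DO NOTHING;
Concurrent runs may still compute the same candidate rows, but only one of them will insert a given timestamp; the other's conflicting rows are silently skipped.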
This depends on the isolation level set in your database (see the PostgreSQL documentation on transaction isolation).
By default, PostgreSQL runs at Read Committed, which means each statement sees a snapshot of the data taken when that statement began. If two of these queries read max(timestamp) before either one commits its insert, both will compute the same cutoff and you will get duplicate data in the table.
If you want to avoid having duplicate entries, you have a few options.
Try using the isolation level Serializable (a sketch follows this list).
Apply a unique index to a column of table A. The timestamp is not a great candidate, as you might legitimately have 2 rows with the same timestamp; the id carried over from B is probably a good option.
Take a lock at the application level before performing such a query.
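A minimal sketch of the Serializable option, using the same tables as above; be prepared to retry the transaction when PostgreSQL aborts it with a serialization failure:
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO a
SELECT * FROM b WHERE timestamp > (SELECT max(timestamp) FROM a);
COMMIT;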

Oracle order by query very slow

I am facing a problem when doing ORDER BY on a table.
My select query works fine, but when I add ORDER BY (even on the primary key) it just goes on and on with no results, and finally I have to kill the session. The table has 20K records.
Any suggestions for this?
Query is as:
SELECT * FROM Users ORDER BY ID;
I do not know about the query plan, as I am new to Oracle.
For the unordered query, is SQL Developer retrieving and displaying all 20K rows, or just the first 50? Your comparison might not be fair.
What is the size of those 20K rows? select bytes/1024/1024 MB from user_segments where segment_name = 'USERS'; I've seen many cases where a few megabytes of data use many gigabytes of storage. Maybe the data was much larger before and somebody just deleted it (deleting does not reclaim the space). Or maybe somebody inserted those rows 1 at a time with an APPEND hint, and each row is taking up an entire block.
Your query might be waiting for more temp tablespace for sorting; look at DBA_RESUMABLE.
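Since you mention not knowing about the query plan, this is the standard way to get one in Oracle (using the Users table from the question):
EXPLAIN PLAN FOR SELECT * FROM Users ORDER BY ID;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
If the plan shows a full table scan of a bloated segment followed by a large sort, that would match the symptoms described above.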

SQLite - a smart way to remove and add new objects

I have a table in my database, and I want each row to have a unique id with the rows numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears; and when I afterwards add more data, the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every row number, so that I can access my table arbitrarily by position.
Is there a way in SQLite to do this, or do I have to manually manage the removing and adding of data?
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
    -- Shift every higher id down by one to close the hole left by the delete
    UPDATE table_name SET id = id - 1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you re-think your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum
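For reference, a minimal sketch based on those pages (note that VACUUM may renumber the implicit rowids of tables that lack an explicit INTEGER PRIMARY KEY, but it will not keep ids sequential as you delete rows later):
VACUUM;                    -- rebuild the database file and reclaim free pages
PRAGMA auto_vacuum = FULL; -- takes effect on a new database, or after a full VACUUM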