How to efficiently select first or last rows from a kdb+ table stored on disk

For an in-memory table, I can use sublist or take (#) syntax to retrieve the first or last x rows.
How can I do this efficiently for an on-disk table which may be very large? The constraint is that I don't want to load all of the table's data into memory to run the query.

.Q.ind takes a table and (long!) indices into the table, and returns the appropriate rows.
http://code.kx.com/q/ref/dotq/#qind-partitioned-index
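For example, a minimal sketch, assuming a date-partitioned trade table already mapped into the session:

n:count trade               / computed from the partition counts, not by reading all rows
.Q.ind[trade;til 5]         / first 5 rows
.Q.ind[trade;(n-5)+til 5]   / last 5 rows

Because the columns are memory-mapped, only the rows at the requested indices actually need to be read from disk.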

I suppose you can use the virtual i column, which is the row number (per partition!) on a historical database.
So the first row would be select from trade where date=first date, i=0
The last row would, I guess, be select from trade where date=last date, i=max i
This assumes the normal partitioned-by-date setup. If you have a non-partitioned table, select from trade where i=0 should be fine.
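Generalising to the first or last x rows, a hedged sketch under the same date-partitioned assumption:

select from trade where date=first date, i<3          / first 3 rows
select from trade where date=last date, i>max[i]-3    / last 3 rows

Since i resets per partition, this only works when the rows you want fall inside a single date partition.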

Related

Get latest rows in PostgreSQL table ordered by Date: Index or Sort table?

I had a hard time titling this question, but I hope it's appropriate.
I have a table of transactions, and each transaction has a Date column (of type Date).
I want to run a query that gets the latest 100 transactions by date (simple enough with an ORDER BY query).
My question is: in order to make this an extremely cheap operation, would it make sense to sort my entire table so that I just need to select the top 100 rows every time, or do I simply create an index on the date column? Not sure if the first option is even possible and/or good SQL practice.
You would add an index on the column with the date and query:
SELECT * FROM tab
ORDER BY datecol DESC
LIMIT 100;
The problem with your other idea is that there is no well-defined order in a table. Every UPDATE changes this "order", and even if you don't modify anything, a sequential scan need not start at the beginning of the table.
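For reference, the suggested index might be created like this (the index name is illustrative):

CREATE INDEX tab_datecol_idx ON tab (datecol);

A b-tree index can be scanned backwards, so the same index also serves the ORDER BY datecol DESC ... LIMIT 100 query.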

How to select unique records from a table with a big number of records

I use PostgreSQL and I have a database table with more than 5 million records. The structure of the table is as follows:
A lot of records are inserted every day, and there are many records with the same reference.
I want to select all records, but I do not want duplicates (records with the same reference).
I tried the following query:
SELECT DISTINCT ON (reference) reference_url, reference FROM daily_run_vehicle WHERE handled = False and retries < 5 ORDER BY reference DESC;
It executes and gives me the correct result, but it takes too long to execute.
Is there any better way to do this?
Create sort keys on the columns you use in the WHERE condition.
After a large data movement into the table, run the VACUUM command; it will refresh all the keys. After that, analyze the table with the ANALYZE command; it will help to rebuild the table's stats.
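In PostgreSQL terms, that maintenance step looks like this, and an index matching the filter (names taken from the question; the index name is illustrative) may be worth testing as well:

VACUUM daily_run_vehicle;
ANALYZE daily_run_vehicle;

CREATE INDEX daily_run_vehicle_ref_idx
    ON daily_run_vehicle (reference DESC)
    WHERE handled = false AND retries < 5;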

PostgreSQL different index creation time for same datatype

I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all hold exactly 20 bytes of data; C sometimes contains NULLs.
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while the indexes on B and C each ran for over 10 hours, after which I aborted them. I ran every CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second and third index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = '<Your table name here>';
Basically, the database has more work to do to create an index when:
The number of distinct values gets higher.
The correlation (i.e. whether values in the field are physically stored in order) is close to 0.
I suspect you will see that field A differs in the number of distinct values and/or has a higher correlation than the other 2 fields.
Edit: Basically, creating an index = a FULL SCAN of the table, creating entries in the index as you progress. With the stats you have shared, that means:
Column A: it was detected as unique.
A single scan is enough, as the DB knows 1 record = 1 index entry.
Columns B & C: they were detected as having very few distinct values + abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact a low correlation means the indexes will probably not be used anyway (an index is only useful when the matching entries are not scattered across all the table's blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order AS SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B, C, A;
DROP TABLE Transactions_order;
The tricky part comes next: as records are inserted, updated, and deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution 3:
Create partitions and enjoy partition pruning.
Quite a lot of work has gone into partitioning in recent PostgreSQL releases. It could be worth having a look into it.
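A hypothetical sketch of declarative partitioning (PostgreSQL 10+; hash partitioning since 11), reusing the column names from the question:

CREATE TABLE transactions_part (
    A bytea,
    B bytea,
    C bytea
) PARTITION BY HASH (A);

CREATE TABLE transactions_part_0 PARTITION OF transactions_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...and likewise for remainders 1, 2 and 3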

Count from KDB table with where clause

I know count table tells you how many rows are in a table, but how do you count from a table with a where clause as a filter? I tried count table where PERIOD=x, but I am getting the error 'PERIOD even though PERIOD is a column of the table.
Use qsql to filter and then count the result:
count select from table where PERIOD=x
If you only need the count, do
exec sum PERIOD=x from table
If the table has many columns, this can be much faster than
count select from table where PERIOD=x
Please note that this computes a sum of booleans as a 32-bit int, so if your table has more than a billion rows, you may want to add a cast:
exec sum "j"$PERIOD=x from table
The following will be the most efficient:
select count i from table where PERIOD=x
jomahony's solution will require all columns to be read from disk (if the table is on disk) before doing the count, so it can be inefficient.
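As a quick sanity check, the three approaches agree on a small hypothetical in-memory table:

t:([] PERIOD:1 1 2 3 1; px:5?100.)
count select from t where PERIOD=1      / 3, but materialises every column first
exec sum PERIOD=1 from t                / 3i, scans only the PERIOD column
select count i from t where PERIOD=1    / a one-row table containing the count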

SQLite - a smart way to remove and add new objects

I have a table in my database and I want each row in the table to have a unique id, with the rows numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" appears. And when I afterwards add more data, the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every row, so that I can access my table by arbitrary index.
Is there a way in SQLite to do this, or do I have to manage the removing and adding of data manually?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
    UPDATE table_name SET id = id - 1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
If you then delete the row where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (the time grows with the number of rows you delete and with the number of rows remaining in the table). On my computer, deleting 15 rows at the beginning of a 1000-row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you re-think your design. In my opinion you're asking for trouble down the road (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
SQLite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
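For example, paging through rows in id order might look like this (purely illustrative):

SELECT * FROM test ORDER BY id LIMIT 10 OFFSET 20;   -- the 21st through 30th rows by id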
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
If you want to reclaim deleted row ids, the VACUUM command or the auto_vacuum pragma may be what you seek:
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum
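As a hedged sketch of the pragma route (for an existing database, the setting only takes effect after a subsequent VACUUM):

PRAGMA auto_vacuum = FULL;
VACUUM;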