historical result of SELECT statement - postgresql

I want to query a large number of rows and display them to the user, but the user will only see, for example, 10 rows at a time. I will use LIMIT and OFFSET, and when he presses the 'next' button in the user interface the next 10 rows will be fetched.
The database is updated all the time. Is there any way to guarantee that the user sees the next 10 rows as the data was at the time of the first select, so that any changes to the data are not reflected if he chooses to see the next 10 rows of the result?
This is like using a SELECT statement as a snapshot of the past: any updates made after the first SELECT should not be visible to the subsequent SELECT ... LIMIT ... OFFSET queries.

You can use cursors, example:
drop table if exists test;
create table test(id int primary key);
insert into test
select i from generate_series(1, 50) i;
declare cur cursor with hold for
select * from test order by id;
move absolute 0 cur; -- get 1st page
fetch 5 cur;
id
----
1
2
3
4
5
(5 rows)
truncate test;
move absolute 5 cur; -- get 2nd page
fetch 5 cur; -- works even though the table is already empty
id
----
6
7
8
9
10
(5 rows)
close cur;
Note that this is a rather expensive solution. A cursor declared WITH HOLD stores the complete result set (a snapshot) on the server, which can cause significant server load. Personally, I would rather look for alternative approaches (with dynamic results).
Read the documentation: DECLARE, FETCH.
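One such dynamic alternative is keyset pagination: instead of OFFSET, remember the last id of the current page and fetch the next page relative to it. A minimal sketch against the test table above (not a snapshot: it does see concurrent changes, which is the trade-off mentioned):
-- first page
select * from test order by id limit 10;
-- next page: the application passes the last id it saw on the previous page (10 is assumed here)
select * from test where id > 10 order by id limit 10;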

Related

How to ensure sum of amounts in one table is less than the amount of another?

Say I have a table of marbles:
 id | color  | total
----+--------+-------
  1 | blue   |     5
  2 | red    |    10
  3 | swirly |     3
and I need to put them into bags with a unique constraint on (bag_id, marble_id):
 bag_id | marble_id | quantity
--------+-----------+----------
      1 | 1 (blue)  |        2
      1 | 2 (red)   |        3
      2 | 1 (blue)  |        2
I have a query for bagging at most the number of remaining marbles:
WITH unbagged AS (
  SELECT marble.total - COALESCE( SUM( bag.quantity ), 0 ) AS quantity
  FROM marble
  LEFT JOIN bag ON marble.id = bag.marble_id
  WHERE marble.id = :marble_id
  GROUP BY marble.id )
INSERT INTO bag (bag_id, marble_id, quantity)
SELECT
  :bag_id,
  :marble_id,
  LEAST( :quantity, unbagged.quantity )
FROM unbagged
ON CONFLICT (bag_id, marble_id) DO UPDATE SET
  quantity = bag.quantity
    + LEAST(
        EXCLUDED.quantity,
        (SELECT quantity FROM unbagged) )
which works great until one day, it gets called twice at exactly the same time with the same item and I end up with 6 swirly marbles in a bag (or maybe 3 each in 2 bags), even though there are only 3 total.
I think I understand why, but I don't know how to prevent it from happening.
Your algorithm isn't exactly clear to me, but the core issue is concurrency.
Manual locking
Your query processes a single given row in table marble at a time. The cheapest solution is to take an exclusive lock on that row (assuming that's the only query writing to marble and bag). Then the next transaction trying to mess with the same kind of marble has to wait until the current one has committed (or rolled back).
BEGIN;
SELECT FROM marble WHERE id = :marble_id FOR UPDATE; -- row level lock
WITH unbagged AS ( ...
COMMIT;
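To illustrate how the row lock serializes two concurrent calls, here is a sketch of two sessions bagging the same marble (id = 3 is just an example value; the comments describe the intended timeline):
-- session A
BEGIN;
SELECT FROM marble WHERE id = 3 FOR UPDATE;  -- takes the row lock
-- ... run the INSERT ... ON CONFLICT statement from the question ...

-- session B, concurrently, for the same marble
BEGIN;
SELECT FROM marble WHERE id = 3 FOR UPDATE;  -- blocks here, waiting for session A

-- session A
COMMIT;  -- releases the lock; session B's SELECT now returns and its INSERT sees A's bag rows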
SERIALIZABLE
Or use serializable transaction isolation; that's the more expensive "catch-all" solution. Be prepared to repeat the transaction if you get a serialization error. Like:
BEGIN ISOLATION LEVEL SERIALIZABLE;
WITH unbagged AS ( ...
COMMIT;
Related:
How to atomically replace a subset of table data
Atomic UPDATE .. SELECT in Postgres

Would it be possible to select random rows with a little preference for a specific column?

I would like to get a random selection of records from my table, but I wonder if it would be possible to give a better chance to items that are newly created. I also have pagination, which is why I'm using setseed.
Currently I'm only retrieving items randomly and it works quite well, but I need to give a certain "preference" to newly created items.
Here is what I'm doing for now:
SELECT SETSEED(0.16111981), RANDOM();
I don't know what to do, and I can't figure out a good solution that isn't an absolute performance disaster.
First, I want to explain how we can select random records from a table. In PostgreSQL, we can use the random() function in the ORDER BY clause. Example:
select * from test_table
order by random()
limit 1;
I am using LIMIT 1 to select only one record. But with this method query performance will be very bad for large tables (over 100 million rows).
The second way: you can select records manually using random() if the table has an id field. This approach performs very well.
Let's first write our own randomizing function so that it is easy to use in our queries.
CREATE OR REPLACE FUNCTION random_between(low integer, high integer)
RETURNS integer
LANGUAGE plpgsql
STRICT
AS $function$
BEGIN
  -- returns a random integer between low and high, inclusive
  RETURN floor(random() * (high - low + 1) + low);
END;
$function$;
This function returns a random integer value in the range given by its arguments. Then we can write a query using our random function. Example:
select * from test_table
where id = (select random_between(min(id), max(id)) from test_table);
I tested this query on a table with 150 million rows and it gets the best performance: duration 12 ms. If you need many rows rather than one, you can write WHERE id > instead of WHERE id =.
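For example, a sketch of the multi-row variant (this assumes the ids have no large gaps; otherwise the random starting point skews which rows are returned):
select * from test_table
where id > (select random_between(min(id), max(id)) from test_table)
order by id
limit 10;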
Now, for your little preference: I don't know your detailed business logic or the conditions you want to apply to the randomization, so I will write some sample queries to illustrate the mechanism. PostgreSQL has no built-in function for randomizing data with preferences; we must write this logic manually. I created a sample table for testing our queries.
CREATE TABLE test_table (
  id serial4 NOT NULL,
  is_created bool NULL,
  action_date date NULL,
  CONSTRAINT test_table_pkey PRIMARY KEY (id)
);
CREATE INDEX test_table_id_idx ON test_table USING btree (id);
For example, say I want to give more preference to rows whose action_date is closest to today. Sample query:
select
  id,
  is_created,
  action_date,
  (extract(day from (now() - action_date))) as dif_days
from
  test.test_table
where
  id > (select random_between(min(id), max(id)) from test.test_table)
  and (extract(day from (now() - action_date))) = random_between(0, 6)
limit 1;
In this query, (extract(day from (now() - action_date))) as dif_days returns the difference in days between action_date and today. In the WHERE clause I first select rows whose id is greater than the resulting random value. Then, with (extract(day from (now() - action_date))) = random_between(0, 6), I keep only rows whose action_date is at most 6 days ago (it might end up being 4 days ago or 2 days ago, up to a maximum of 6 days).
You can write many variations of this logic (for example, giving more preference based on boolean fields such as closed/open status, and so on).
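Another option, not from the answer above and only practical on a small or pre-filtered candidate set (it has to scan and sort every candidate row): weight the random ordering by recency, so that newer rows tend to sort first while every row still has a chance.
-- the exponent grows with age in days, pushing older rows towards the end of the ordering
select *
from test_table
where action_date is not null
order by random() ^ (1 + extract(day from (now() - action_date))) desc
limit 10;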

Convert query with cursor to more optimized methodology

I have a table with many columns, but two of them are of interest in this case. One of the columns represents a Subversion commit number, and the other represents a timestamp of when an automated process ran using the data from the aforementioned commit number. There are many rows with the same commit number, and one or more timestamps for each. I need to get a list of distinct commit numbers and the earliest timestamp in the table for each one.
I can do this with a cursor that iterates over the distinct commit numbers and finds the top 1 timestamp for each commit, but this is very slow because there are 56 million rows in the table. I feel certain there must be a more efficient way.
Below you can see my T-SQL.
DECLARE @CommitDates TABLE (CommitNumber int, LastUpdate date)
DECLARE @commit int

DECLARE db_cursor CURSOR FOR
SELECT DISTINCT [CommitNumber] FROM ProcessHistory ORDER BY [CommitNumber] DESC

OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @commit
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO @CommitDates ([CommitNumber], [LastUpdate])
    SELECT TOP 1 [CommitNumber], LastUpdate FROM ProcessHistory WHERE [CommitNumber] = @commit ORDER BY LastUpdate ASC

    FETCH NEXT FROM db_cursor INTO @commit
END
CLOSE db_cursor
DEALLOCATE db_cursor

SELECT * FROM @CommitDates
Expected results: to be able to quickly find the first date on which a given commit number appears in the table, without having to pull up the Subversion log viewer. In this case, I would define "quickly" as executing in no more than 60 seconds.
Actual results: it takes more than 7 minutes 30 seconds to execute this code, which returns only 176 rows as of today.
Well, I feel silly; I just figured it out:
SELECT [CommitNumber],MIN([LastUpdate]) FROM ProcessHistory GROUP BY [CommitNumber]
Executes in literally 00:00:02
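If you still want the results in the original table variable (for further processing in the same script), the same aggregate can feed it directly; a minimal sketch:
DECLARE @CommitDates TABLE (CommitNumber int, LastUpdate date)

INSERT INTO @CommitDates (CommitNumber, LastUpdate)
SELECT [CommitNumber], MIN([LastUpdate])
FROM ProcessHistory
GROUP BY [CommitNumber]

SELECT * FROM @CommitDates ORDER BY CommitNumber DESC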

lock the rows until next select postgres

Is there a way in Postgres to lock rows until the next select query execution from the same system? And one more thing: there will be no update process on the locked rows.
The scenario is something like this. If table1 contains data like
id | txt
-------------------
1 | World
2 | Text
3 | Crawler
4 | Solution
5 | Nation
6 | Under
7 | Padding
8 | Settle
9 | Begin
10 | Large
11 | Someone
12 | Dance
If sys1 executes
select * from table1 order by id limit 5;
then it should lock the rows with id 1 to 5 against the other systems that are executing select statements concurrently.
Later, if sys1 executes another select query like
select * from table1 where id>10 order by id limit 5;
then the previously locked rows should be released.
I don't think this is possible. You cannot block read-only access to a table (unless that select is done FOR UPDATE).
As far as I can tell, the only chance you have is to use the pg_advisory_lock() function.
http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
But this requires a "manual" release of the locks obtained through it. You won't get an automatic unlocking with that.
To lock the rows you would need something like this:
select pg_advisory_lock(id), *
from
(
select * from table1 order by id limit 5
) t
(Note the use of the derived table for the LIMIT part. See the manual link I posted for an explanation)
Then you need to store the retrieved IDs and later call pg_advisory_unlock() for each ID.
If each process is always releasing all IDs at once, you could simply use pg_advisory_unlock_all() instead. Then you will not need to store the retrieved IDs.
Note that this will not prevent others from reading the rows using "normal" selects. It will only work if every process that accesses that table uses the same pattern of obtaining the locks.
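For example, when the next batch is requested, the previously obtained locks could be released like this (assuming the application remembered that ids 1 to 5 were locked):
select pg_advisory_unlock(id)
from table1
where id between 1 and 5;
-- or, if the session should simply drop every advisory lock it holds:
select pg_advisory_unlock_all();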
It looks like you really have a transaction that transcends the borders of your database, and all the changes happen in another system.
My idea is to use select ... for update nowait to lock the relevant rows, then offload the data into the other system, then roll back to unlock the rows. No two select ... for update queries will select the same row, and the second select will fail immediately rather than wait and then proceed.
But you don't seem to mark offloaded records in any way, so I don't see why two non-consecutive selects wouldn't happily select overlapping ranges. I would therefore still update the records with a flag and/or a target user name, and only select records with the flag unset, as sketched below.
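A minimal sketch of that flag approach (claimed_by is an assumed column that would have to be added; FOR UPDATE SKIP LOCKED requires PostgreSQL 9.5+ and keeps two concurrent sessions from claiming the same rows):
begin;
update table1
set claimed_by = 'sys1'
where id in (
  select id
  from table1
  where claimed_by is null
  order by id
  limit 5
  for update skip locked  -- rows already being claimed by another session are skipped
)
returning id, txt;
commit;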
I tried both select ... for update and pg_try_advisory_lock and managed to get close to my requirement.
/* rows are locked, but limit is the problem */
select * from table1 where pg_try_advisory_lock( id) limit 5;
.
.
$_SESSION['rows'] = $rowcount; // no of row to process
.
.
/* after each word is processed */
$_SESSION['rows'] -=1;
.
.
/*and finally unlock locked rows*/
if ($_SESSION['rows']===0)
select pg_advisory_unlock_all() from table1
But there are two problems with this:
1. Since LIMIT is applied before the lock, every instance ends up trying to lock the same rows each time.
2. I am not sure whether pg_advisory_unlock_all will unlock only the rows locked by the current instance, or those locked by all instances.

SQLite - a smart way to remove and add new objects

I have a table in my database and I want each row to have a unique id, with the rows numbered sequentially.
For example: I have 10 rows, each with an id, starting from 0 and ending at 9. When I remove a row from the table, let's say row number 5, a "hole" occurs. And afterwards I add more data, but the "hole" is still there.
It is important for me to know the exact number of rows and to have data at every position so that I can access my table arbitrarily.
Is there a way in SQLite to do this? Or do I have to manage the removing and adding of data manually?
Thank you in advance,
Ilya.
It may be worth considering whether you really want to do this. Primary keys usually should not change through the lifetime of the row, and you can always find the total number of rows by running:
SELECT COUNT(*) FROM table_name;
That said, the following trigger should "roll down" every ID number whenever a delete creates a hole:
CREATE TRIGGER sequentialize_ids AFTER DELETE ON table_name FOR EACH ROW
BEGIN
UPDATE table_name SET id=id-1 WHERE id > OLD.id;
END;
I tested this on a sample database and it appears to work as advertised. If you have the following table:
id name
1 First
2 Second
3 Third
4 Fourth
And delete where id=2, afterwards the table will be:
id name
1 First
2 Third
3 Fourth
This trigger can take a long time and has very poor scaling properties (it takes longer for each row you delete and each remaining row in the table). On my computer, deleting 15 rows at the beginning of a 1000 row table took 0.26 seconds, but this will certainly be longer on an iPhone.
I strongly suggest that you rethink your design. In my opinion you're asking for trouble in the future (e.g. if you create another table and want to have some relations between the tables).
If you want to know the number of rows just use:
SELECT count(*) FROM table_name;
If you want to access rows in the order of id, just define this field using PRIMARY KEY constraint:
CREATE TABLE test (
id INTEGER PRIMARY KEY,
...
);
and get rows using ORDER BY clause with ASC or DESC:
SELECT * FROM table_name ORDER BY id ASC;
Sqlite creates an index for the primary key field, so this query is fast.
I think that you would be interested in reading about LIMIT and OFFSET clauses.
The best source of information is the SQLite documentation.
If you don't want to take Stephen Jennings's very clever but performance-killing approach, just query a little differently. Instead of:
SELECT * FROM mytable WHERE id = ?
Do:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET ?
Note that OFFSET is zero-based, so you may need to subtract 1 from the variable you're indexing in with.
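For example, to fetch what the application considers "row number 5" (1-based) in id order:
SELECT * FROM mytable ORDER BY id LIMIT 1 OFFSET 4;  -- position 5; OFFSET is 0-based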
If you want to reclaim deleted row ids the VACUUM command or pragma may be what you seek,
http://www.sqlite.org/faq.html#q12
http://www.sqlite.org/lang_vacuum.html
http://www.sqlite.org/pragma.html#pragma_auto_vacuum