Are there any conditions under which records created in a table using a typical auto-increment field would be available for read out of sequence?
For instance, could a record with value 10 ever appear in the result of a select query when the record with value 9 is not yet visible to a select query?
The purpose for my question is… I want to know if it is reliable to use the maximum value retrieved from one query as the lower bound to identify previously unretrieved values in a later query, or could that potentially miss a row?
If that kind of race condition is possible under some circumstances, then are any of the isolation levels that can be used for the select queries that are immune to that problem?
Yes, and good on you for thinking about it.
You can trivially demonstrate this with three concurrent psql sessions, given some table
CREATE TABLE x (
seq serial primary key,
n integer not null
);
then
SESSION 1 SESSION 2 SESSION 3
BEGIN;
BEGIN;
INSERT INTO x(n) VALUES(1)
INSERT INTO x(n) VALUES (2);
COMMIT;
SELECT * FROM x;
COMMIT;
SELECT * FROM x;
It is not safe to assume that for any generated value n, all generated values n-1 have been used by already-committed or already-aborted xacts. They might be in progress and commit after you see n.
I don't think isolation levels really help you here. There's no mutual dependency for SERIALIZABLE to detect.
This is partly why logical decoding was added, so you can get a consistent stream in commit order.
Related
INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
or, written differently:
WITH selection AS
(SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A))
INSERT INTO A SELECT * FROM selection;
If these queries run multiple times simultaneously, is it possible that I will end up with duplicated rows in A?
How does Postgres process these queries? Is it one or multiple?
If it is multiple queries (find max(timestamp)[1], select[2] then insert[3]) I can imagine this will cause duplicated rows.
If that is correct, would wrapping it in BEGIN/END (a transaction) help?
Yes, that might result in duplicate values.
A single statement sees a consistent view of the data in all tables as of the point in time when the statement started.
Wrapping that single statement into a transaction won't change that (a single statement is always executed as an atomic statement regardless of the number of sub-query involved).
The statement will never see uncommitted data from other transactions (which is the root cause why you can wind up with duplicate values).
The only safe way to avoid duplicate values, is to create a unique constraint (or index) on that column. In that case the INSERT would result in an error if such a value already exists.
If you want to avoid the error, use insert ... on conflict
This depends on the isolation level set in your database.
This is from the postgres documentation
By default, this is set to Repeatable read, which means that each query will get the output based on when the transaction first attempted to read the data. If 2 queries read before any one writes, then you will get duplicate data in these tables.
If you want to avoid having duplicate entries, you have a few options.
Try using the isolation level Serializable
Apply a unique index on a field of A in table B. Timestamp is not a great contender as you might legitimately have 2 rows with the same timestamp. Probably id of the table A is a good option.
Take a lock at the application level before performing such a query.
I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all have exactly 20 bytes of data, C sometimes contains NULLs
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while B and C are taking over 10 hours after which I aborted them. I ran every CREATE INDEX on their own, so no indices were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
from pg_stats
where tablename = '<Your table name here>'
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (= are values in the field physically stored in order) is close to 0.
I suspect you will see field A is different in terms of distinct values and/or a higher correlation than the other 2 fields.
Edit: Basically, creating an index = FULL SCAN of the table and create entries in the index as you progress. With the stats you have shared below that means:
Column A: it was detected as unique
A single scan is enough as the DB knows 1 record = 1 index entry.
Columns B & C : it was detected as having very few distinct values + abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid but in fact and as explained here, a small correlation means the indexes will probably not be used (an index is useful only when entries are not scattered in all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
The tricky part comes next: with insert/update/delete records, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution3:
Create partitions and enjoy partition pruning.
There are quite a lot of efforts being made for partitioning recently in postgresql. It could be worth having a look into it.
I am solving an performance issue on PostgreSQL 9.6 dbo based system. Intro:
12yo system, similar to banking system, with most queried primary table called transactions.
CREATE TABLE jrn.transactions (
ID BIGSERIAL,
type_id VARCHAR(200),
account_id INT NOT NULL,
date_issued DATE,
date_accounted DATE,
amount NUMERIC,
..
)
In the table transactions we store all transactions within a bank account. Field type_id determines the type of a transaction. Servers also as C# EntityFramework Discriminator column. Values are like:
card_payment, cash_withdrawl, cash_in, ...
14 types of transaction are known.
In generally, there are 4 types of queries (no. 3 and .4 are by far most frequent):
select single transaction like: SELECT * FROM jrn.transactions WHERE id = 3748734
select single transaction with JOIN to other transaction like: SELECT * FROM jrn.transactions AS m INNER JOIN jrn.transactions AS r ON m.refund_id = r.id WHERE m.id = 3748734
select 0-100, 100-200, .. transactions of given type like: SELECT * FROM jrn.transactions WHERE account_id = 43784 AND type_id = 'card_payment' LIMIT 100
several aggregate queries, like: SELECT SUM(amount), MIN(date_issued), MAX(date_issued) FROM jrn.transactions WHERE account_id = 3748734 AND date_issued >= '2017-01-01'
In last few month we had unexpected row count growth, now 120M.
We are thinking of table partitioning, following to PostgreSQL doc: https://www.postgresql.org/docs/10/static/ddl-partitioning.html
Options:
partition table by type_id into 14 partitions
add column year and partition table by year (or year_month) into 12 (or 144) partitions.
I am now restoring data into out test environment, I am going to test both options.
What do you consider the most appropriate partitioning rule for such situation? Any other options?
Thanks for any feedback / advice etc.
Partitioning won't be very helpful with these queries, since they won't perform a sequential scan, unless you forgot an index.
The only good reason I see for partitioning would be if you want to delete old rows efficiently; then partitioning by date would be best.
Based on your queries, you should have these indexes (apart from the primary key index):
CREATE INDEX ON jrn.transactions (account_id, date_issued);
CREATE INDEX ON jrn.transactions (refund_id);
The following index might be a good idea if you can sacrifice some insert performance to make the third query as fast as possible (you might want to test):
CREATE INDEX ON jrn.transactions (account_id, type_id);
What you have here is almost a perfect case for column-based storage as you may get it using a SAP HANA Database. However, as you explicitly have asked for a Postgres answer and I doubt that a HANA database will be within the budget limit, we will have to stick with Postgres.
Your two queries no. 3 and 4 go quite into different directions, so there won't be "the single answer" to your problem - you will always have to balance somehow between these two use cases. Yet, I would try to use two different techniques to approach each of them individually.
From my perspective, the biggest problem is the query no. 4, which creates quite a high load on your postgres server just because it is summing up values. Moreover, you are just summing up values over and over again, which most likely won't change often (or even at all), as you have said that UPDATEs nearly do not happen at all. I furthermore assume two more things:
transactions is INSERT-only, i.e. DELETE statements almost never happen (besides perhaps in cases of some exceptional administrative intervention).
The values of column date_issued when INSERTing typically are somewhere "close to today" - so you usually won't INSERT stuff way in the past.
Out of this, to prevent aggregating values over and over again unnecessarily, I would introduce yet another table: let's call it transactions_aggr, which is built up like this:
create table transactions_aggr (
account_id INT NOT NULL,
date_issued DATE,
sumamount NUMERIC,
primary key (account_id, date_issued)
)
which will give you a table of per-day preaggregated values.
To determine which values are already preaggregated, I would add another boolean-typed column to transactions, which indicates to me, which of the rows are contained in transactions_aggr and which are not (yet). The query no. 4 then would have to be changed in such a way that it reads only non-preaggregated rows from transactions, whilst the rest could come from transactions_aggr. To facilitate that you could define a view like this:
select account_id, date_issued, sum(amount) as sumamount from
(
select account_id, date_issued, sumamount as amount from transactions_aggr as aggr
union all
select account_id, date_issued, sum(amount) as amount from transactions as t where t.aggregated = false
)
group by account_id, date_issued
Needless to say that putting an index on transactions.aggregated (perhaps in conjunction with the account_id) could greatly help to improve the performance here.
Updating transactions_aggr can be done using multiple approaches:
You could use this as a one-time activity and only pre-aggregate the current set of ~120m rows once. This would at least reduce the load on your machine doing aggregations significantly. However, over time you will run into the same problem again. Then you may just re-execute the entire procedure, simply dropping transactions_aggr as a whole and re-create it from scratch (all the original data still is there in transactions).
You have a nice period somewhere during the week/month/in the night, where you have little or no queries are coming in. Then you can open a transaction, read all transactions WHERE aggregated = false and add them with UPDATEs to transactions_aggr. Keep in mind to then toggle aggregated to true (should be done in the same transaction). The tricky part of this, however, is that you must pay attention to what reading queries will "see" of this transaction: Depending on your requirements of accuracy during that timeframe of this "update job", you may have to consider switching the transaction isolation level to "READ_COMMITED" to prevent ghost reads.
On the matter of your query no. 3 you then could try to really go for the approach of partitioning based on type_id. However, I perceive your query as a little strange, as you are performing a LIMIT/OFFSET without ordering (e.g. there is no ORDER BY statement in place) having specified (NB: You are not saying that you would be using database cursors). This may lead to the effect that the implicit order, which is currently used, is changed, if you enable partitioning on the table. So be careful on side-effects which this may cause on your program.
And one more thing: Before really doing the partition split, I would first check on the data distribution concerning type_id by issuing
select type_id, count(*) from transactions group by type_id
Not that it turns out that, for example, 90% of your data is with card_payment - so that you will have a heavily uneven distribution amongst your partitions and the biggest performance hogging queries are those which would still go into this single "large partition".
Hope this helps a little - and good luck!
I'm currently working on a benchmark (which is part of my bachelor thesis) that compares SQL and NoSQL Databases based on an abstract data model an abstract queries to achieve fair implementation on all systems.
I'm currently working on the implementation of a query that is specified as follows:
I have a table in Cassandra that is specified as follows:
CREATE TABLE allocated(
partition_key int,
financial_institution varchar,
primary_uuid uuid,
report_name varchar,
view_name varchar,
row_name varchar,
col_name varchar,
amount float,
PRIMARY KEY (partition_key, report_name, primary_uuid));
This table contains about 100,000,000 records (~300GB).
We now need to calculate the sum for the field "amount" for every possible combination of report_name, view_name, col_name and row_name.
In SQL this would be quite easy, just select sum (amount) and group it by the fields you want.
However, since Cassandra does not support these operations (which is perfectly fine) I need to achieve this on another way.
Currently I achieve this by doing a full-table walk, processing each record and storing the sum in a HashMap in Java for each combination.
The prepared statement I use is as follows:
SELECT
partition_key,
financial_institution,
report_name,
view_name,
col_name,
row_name,
amount
FROM allocated;
That works partially on machines with lots on RAM for both, cassandra and the Java app, but crashes on smaller machines.
Now I'm wondering whether it's possible to achieve this on a faster way?
I could imagine using the partition_key, which serves also as the cassandra partition key and do this for every partition (I have 5 of them).
Also I though of doing this multithreaded by assigning every partition and report to a seperate thread and running it parallel. But I guess this would cause a lot of overhead on the application side.
Now to the actual question: Would you recommend another execution strategy to achieve this?
Maybe I still think too much in a SQL-like way.
Thank you for you support.
Here are two ideas that may help you.
1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.
First query (Q1):
select whatever_columns from allocated limit 1000;
Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk = sk_val and tk > tk_val limit 1000;
The above query will look for records for the same pk and sk, but for the next values of tk. Keep repeating as long as you keep getting 1000 records. When get anything less, you ignore the tk, and do greater on sk. Here is the query (Q3):
select whatever_columns from allocated where token(pk) = token(pk_val) and sk > sk_val limit 1000;
Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):
select whatever_columns from allocated where token(pk) > token(pk_val) limit 1000;
Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....
You are done when Q4 returns less than 1000.
2) I am assuming that 'report_name, view_name, col_name and row_name' are not unique and that's why you maintain a hashmap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in cassandra where key is a combination of these four values (maybe delimited). If there were three, you could have simply used a composite key for those three. Now, you also need a column called amounts which is a list. As you are scanning the allocate table (using the approach above), for each row, you do the following:
update amounts_table set amounts = amounts + whatever_amount where my_primary_key = four_col_values_delimited;
Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).
Sorry if my description is confusing. Please let me know if this helps.
Is there a way to generate some kind of in-order identifier for a table records?
Suppose that we have two threads doing queries:
Thread 1:
begin;
insert into table1(id, value) values (nextval('table1_seq'), 'hello');
commit;
Thread 2:
begin;
insert into table1(id, value) values (nextval('table1_seq'), 'world');
commit;
It's entirely possible (depending on timing) that an external observer would see the (2, 'world') record appear before the (1, 'hello').
That's fine, but I want a way to get all the records in the 'table1' that appeared since the last time the external observer checked it.
So, is there any way to get the records in the order they were inserted? Maybe OIDs can help?
No. Since there is no natural order of rows in a database table, all you have to work with is the values in your table.
Well, there are the Postgres specific system columns cmin and ctid you could abuse to some degree.
The tuple ID (ctid) contains the file block number and position in the block for the row. So this represents the current physical ordering on disk. Later additions will have a bigger ctid, normally. Your SELECT statement could look like this
SELECT *, ctid -- save ctid from last row in last_ctid
FROM tbl
WHERE ctid > last_ctid
ORDER BY ctid
ctid has the data type tid. Example: '(0,9)'::tid
However it is not stable as long-term identifier, since VACUUM or any concurrent UPDATE or some other operations can change the physical location of a tuple at any time. For the duration of a transaction it is stable, though. And if you are just inserting and nothing else, it should work locally for your purpose.
I would add a timestamp column with default now() in addition to the serial column ...
I would also let a column default populate your id column (a serial or IDENTITY column). That retrieves the number from the sequence at a later stage than explicitly fetching and then inserting it, thereby minimizing (but not eliminating) the window for a race condition - the chance that a lower id would be inserted at a later time. Detailed instructions:
Auto increment table column
What you want is to force transactions to commit (making their inserts visible) in the same order that they did the inserts. As far as other clients are concerned the inserts haven't happened until they're committed, since they might roll back and vanish.
This is true even if you don't wrap the inserts in an explicit begin / commit. Transaction commit, even if done implicitly, still doesn't necessarily run in the same order that the row its self was inserted. It's subject to operating system CPU scheduler ordering decisions, etc.
Even if PostgreSQL supported dirty reads this would still be true. Just because you start three inserts in a given order doesn't mean they'll finish in that order.
There is no easy or reliable way to do what you seem to want that will preserve concurrency. You'll need to do your inserts in order on a single worker - or use table locking as Tometzky suggests, which has basically the same effect since only one of your insert threads can be doing anything at any given time.
You can use advisory locking, but the effect is the same.
Using a timestamp won't help, since you don't know if for any two timestamps there's a row with a timestamp between the two that hasn't yet been committed.
You can't rely on an identity column where you read rows only up to the first "gap" because gaps are normal in system-generated columns due to rollbacks.
I think you should step back and look at why you have this requirement and, given this requirement, why you're using individual concurrent inserts.
Maybe you'll be better off doing small-block batched inserts from a single session?
If you mean that every query if it sees world row it has to also see hello row then you'd need to do:
begin;
lock table table1 in share update exclusive mode;
insert into table1(id, value) values (nextval('table1_seq'), 'hello');
commit;
This share update exclusive mode is the weakest lock mode which is self-exclusive — only one session can hold it at a time.
Be aware that this will not make this sequence gap-less — this is a different issue.
We found another solution with recent PostgreSQL servers, similar to #erwin's answer but with txid.
When inserting rows, instead of using a sequence, insert txid_current() as row id. This ID is monotonically increasing on each new transaction.
Then, when selecting rows from the table, add to the WHERE clause id < txid_snapshot_xmin(txid_current_snapshot()).
txid_snapshot_xmin(txid_current_snapshot()) corresponds to the transaction index of the oldest still-open transaction. Thus, if row 20 is committed before row 19, it will be filtered out because transaction 19 will still be open. When the transaction 19 is committed, both rows 19 and 20 will become visible.
When no transaction is opened, the snapshot xmin will be the transaction id of the currently running SELECT statement.
The returned transaction IDs are 64-bits, the higher 32 bits are an epoch and the lower 32 bits are the actual ID.
Here is the documentation of these functions: https://www.postgresql.org/docs/9.6/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT
Credits to tux3 for the idea.