In Postgres, I am starting a REPEATABLE READ transaction in order to obtain a consistent view of the database at the moment the transaction begins. I'd like to know the LSN from the point of view of this transaction so that I can also create a replication slot at this LSN at the same time; that way, once I finish with the transaction, I can set up logical replication from that LSN and receive all updates to the database that happened after the transaction began.
My expectation was that the LSN wouldn't change once inside the transaction (even as other connections were making updates, etc.), however multiple calls to pg_current_wal_lsn() within the transaction returned a different LSN each time.
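For illustration, this is roughly what I'm doing (the LSN values are just made up to show the effect):
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_current_wal_lsn();   -- e.g. 0/16B1970
-- other sessions keep writing ...
SELECT pg_current_wal_lsn();   -- e.g. 0/16B3218, a different value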
Is there a way to determine the last LSN as seen from the POV of the transaction?
For a bit more context, I'd like to set up logical replication on a database, but must first operate on the data that already exists in the database prior to creating the replication slot. I must assume that previous WAL segments have been purged, so I don't expect to see all of the existing data via logical replication; as a result I need a way to first operate on the existing data and then stream everything past that point. Hopefully that makes sense.
Thanks in advance.
What is fixed for a REPEATABLE READ transaction is not the WAL position, since such a transaction can perform data modifications. It is the snapshot, which determines which row versions it can see.
A snapshot consists of a minimal transaction ID (the transaction can see any rows created by older transactions), a maximal transaction ID (the transaction can see nothing newer than that) and the list of the IDs of all concurrent transactions.
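You can inspect that structure directly; for example, on PostgreSQL 13 or later (the value shown is just illustrative; on older versions the function is txid_current_snapshot()):
SELECT pg_current_snapshot();   -- e.g. 10:20:10,14,15  (xmin : xmax : in-progress XIDs)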
Now if you have a REPEATABLE READ READ ONLY transaction, it makes sense to ask for the position where WAL is inserted at the time the snapshot is taken, so you could query
SELECT pg_current_wal_insert_lsn();
as the first statement in your transaction.
However, there is a race condition: first PostgreSQL takes the snapshot, then it executes the query. In between, a concurrent transaction could perform data modifications that are not visible to the snapshot, but that are recorded in WAL before the LSN you get from the function.
The solution is to use logical decoding. As the documentation says:
When a new replication slot is created using the streaming replication interface (see CREATE_REPLICATION_SLOT), a snapshot is exported (see Section 9.27.5), which will show exactly the state of the database after which all changes will be included in the change stream. This can be used to create a new replica by using SET TRANSACTION SNAPSHOT to read the state of the database at the moment the slot was created. This transaction can then be used to dump the database's state at that point in time, which afterwards can be updated using the slot's contents without losing any changes.
So you do it the other way around: first you create a logical replication slot, then you start a REPEATABLE READ transaction and set its snapshot, so that it sees the correct data.
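A minimal sketch of that order of operations, assuming a slot named my_slot and the pgoutput plugin (the exact CREATE_REPLICATION_SLOT syntax and the returned snapshot name vary by version; keep the replication connection open and idle until the snapshot has been imported):

-- on a replication connection, e.g. psql "dbname=mydb replication=database"
CREATE_REPLICATION_SLOT my_slot LOGICAL pgoutput EXPORT_SNAPSHOT;
-- returns slot_name, consistent_point (an LSN) and snapshot_name

-- on a regular connection, while the replication connection stays idle
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-00000002-1';  -- the snapshot_name returned above
-- read and process the existing data here
COMMIT;

-- afterwards, stream changes from my_slot; everything after consistent_point is included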
I don't understand this comment:
My understanding is that read-write transactions carry some overhead, but that you don't incur this overhead until you actually write something. In other words, in terms of performance, a READ ONLY transaction should be the same as a READ WRITE transaction which only contains reads. This stems from the way Postgres handles XID assignment (some info on this here).
The link in the comment states:
"Transactions and subtransactions are assigned permanent XIDs only when/if they first do something that requires one --- typically, insert/update/delete a tuple, though there are a few other places that need an XID assigned."
Is this the key point? That is, if a READ/WRITE transaction only has reads, then an XID isn't assigned, and assigning an XID would otherwise account for the overhead difference between a READ/WRITE transaction with no writes and a READ ONLY transaction.
Does this mean that other databases assign an XID even if no rows are changed, removed, or added?
The overhead difference is related to how read-only and read-write transactions are defined, and how permanent XIDs are assigned in PostgreSQL.
A transaction is defined as virtual transaction (aka read-only) and does not get assigned a true XID until it does a data modification operation on the database. Virtual transactions do not affect the visibility of tuples (nevertheless, they do trigger the pruning of dead tuples in the page depending on the free page-size left, different topic). No impact on visibility means no impact on snapshot isolation. And no need for the assignment of a true XID -- which normally requires synchronization of internal processes, page writes (xmin, xmax, and hint-bits related to those), additional IO, etc. This is the extra overhead. You can self-experiment with this by starting a transaction block and observing no-permanent_XID assignment until a DML statement is executed by using a built-in function (for details, https://pgpedia.info/p/pg_current_xact_id_if_assigned.html):
postgres=# begin;
BEGIN
postgres=*# select pg_current_xact_id_if_assigned();
 pg_current_xact_id_if_assigned
--------------------------------

(1 row)
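Continuing the same session, this is roughly what you would expect after the first data modification (some_table is a hypothetical existing table and the XID value 748 is just an illustrative assumption):
postgres=*# insert into some_table values (1);
INSERT 0 1
postgres=*# select pg_current_xact_id_if_assigned();
 pg_current_xact_id_if_assigned
--------------------------------
                            748
(1 row)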
A virtual XID is still assigned for read-only transactions, but those IDs are memory-only, local to the backend process, and temporary, which makes them much less expensive.
As for other DBMSs: MS SQL Server also differentiates between different types of transactions, and if I am not mistaken the way IDs are assigned is similar: https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-tran-active-transactions-transact-sql?view=azuresqldb-current
Yes, that is the case: if you commit a transaction that did not request a new transaction ID, hardly anything happens (unless you created a WITH HOLD cursor). If the transaction got a transaction ID, a COMMIT record is written to WAL, and WAL is flushed to disk.
I'm trying to figure out the logical replication protocol. My attention was drawn to the message "standby status update" (byte('r')). Unfortunately, the documentation does not seem to describe the expected server behavior for this command.
If I send an LSN from the past, will the server resend transactions that were committed after this LSN? Or, as far as I could find out by experimenting, does this message only affect the metadata of the replication slot (from the pg_replication_slots table)?
That information is used to move the replication slot ahead and to provide the data that can be seen in pg_stat_replication (sent_lsn, write_lsn, flush_lsn and replay_lsn). With synchronous replication, it also provides the data that the primary has to wait for before it can report a transaction as committed.
Sending old, incorrect log sequence numbers will not change the data that the primary streams to the standby, but it might make the primary retain old WAL, because the replication slot is not moved ahead. It will also confuse monitoring. With synchronous replication, it will lead to hanging DML transactions.
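For example, on the primary you can watch what the feedback messages are reported as (a small sketch; these views and columns exist on current versions):
SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;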
I'm getting the following error when running a query on a PostgreSQL database in standby mode. The query works fine when it covers one month of data, but querying more than one month results in the error.
ERROR: canceling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed
Any suggestions on how to resolve? Thanks
No need to touch hot_standby_feedback. As others have mentioned, setting it to on can bloat the master. Imagine opening a transaction on a slave and not closing it.
Instead, set max_standby_archive_delay and max_standby_streaming_delay to some sane value:
# /etc/postgresql/10/main/postgresql.conf on a slave
max_standby_archive_delay = 900s
max_standby_streaming_delay = 900s
This way queries on slaves with a duration less than 900 seconds won't be cancelled. If your workload requires longer queries, just set these options to a higher value.
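If you want to check how often queries are actually being cancelled on the standby, you can look at the per-database conflict counters there (just a sketch using the built-in statistics view):
SELECT datname, confl_snapshot, confl_lock, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();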
Running queries on a hot-standby server is somewhat tricky: they can fail, because during the query some rows it needs might be updated or deleted on the primary. Since the primary does not know that a query has started on the secondary, it thinks it can clean up (vacuum) old versions of its rows. The secondary then has to replay this cleanup, and has to forcibly cancel all queries which might use those rows.
Longer queries will be canceled more often.
You can work around this by starting a repeatable read transaction on the primary which runs a dummy query and then sits idle while the real query runs on the secondary. Its presence will prevent vacuuming of old row versions on the primary.
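A sketch of that workaround (run on the primary in a separate session; the dummy query just takes a snapshot, which holds back vacuum):
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;   -- dummy query; the snapshot now pins old row versions
-- keep this session idle while the long query runs on the secondary
COMMIT;     -- release the snapshot once the query on the secondary is done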
More on this subject and other workarounds are explained in Hot Standby — Handling Query Conflicts section in documentation.
There's no need to start idle transactions on the master. In PostgreSQL 9.1 the most direct way to solve this problem is by setting
hot_standby_feedback = on
This will make the master aware of long-running queries. From the docs:
The first option is to set the parameter hot_standby_feedback, which prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur.
Why isn't this the default? This parameter was added after the initial implementation and it's the only way that a standby can affect a master.
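If you are on PostgreSQL 9.4 or later, here is a quick sketch of turning it on on the standby without editing the configuration file by hand (the parameter only needs a reload, not a restart):
-- run on the standby
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
SHOW hot_standby_feedback;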
As stated here about hot_standby_feedback = on:
Well, the disadvantage of it is that the standby can bloat the master, which might be surprising to some people, too
And here:
With what setting of max_standby_streaming_delay? I would rather default that to -1 than default hot_standby_feedback on. That way what you do on the standby only affects the standby
So I added
max_standby_streaming_delay = -1
And no more pg_dump error for us, nor master bloat :)
For AWS RDS instance, check http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.html
The table data on the hot standby slave server is modified while a long running query is running. A solution (PostgreSQL 9.1+) to make sure the table data is not modified is to suspend the replication and resume after the query (on PostgreSQL 10 and later these functions are named pg_wal_replay_pause() and pg_wal_replay_resume()):
select pg_xlog_replay_pause(); -- suspend
select * from foo; -- your query
select pg_xlog_replay_resume(); --resume
I'm going to add some updated info and references to #max-malysh's excellent answer above.
In short, if you do something on the master, it needs to be replicated on the slave. Postgres uses WAL records for this, which are sent after every logged action on the master to the slave. The slave then executes the action and the two are again in sync. In one of several scenarios, you can be in conflict on the slave with what's coming in from the master in a WAL action. In most of them, there's a transaction happening on the slave which conflicts with what the WAL action wants to change. In that case, you have two options:
Delay the application of the WAL action for a bit, allowing the slave to finish its conflicting transaction, then apply the action.
Cancel the conflicting query on the slave.
We're concerned with #1, and two values:
max_standby_archive_delay - this is the delay used after a long disconnection between the master and slave, when the data is being read from a WAL archive, which is not current data.
max_standby_streaming_delay - delay used for cancelling queries when WAL entries are received via streaming replication.
Generally, if your server is meant for high availability replication, you want to keep these numbers short. The default setting of 30000 (milliseconds if no units given) is sufficient for this. If, however, you want to set up something like an archive, reporting- or read-replica that might have very long-running queries, then you'll want to set this to something higher to avoid cancelled queries. The recommended 900s setting above seems like a good starting point. I disagree with the official docs on setting an infinite value -1 as being a good idea--that could mask some buggy code and cause lots of issues.
The one caveat about long-running queries and setting these values higher is that other queries running on the slave in parallel with the long-running one which is causing the WAL action to be delayed will see old data until the long query has completed. Developers will need to understand this and serialize queries which shouldn't run simultaneously.
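One way to keep an eye on how stale those parallel queries can get is to check the replay lag on the slave itself (a small sketch using built-in functions):
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;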
For the full explanation of how max_standby_archive_delay and max_standby_streaming_delay work and why, go here.
It might be too late for an answer, but we faced the same kind of issue in production.
Earlier we had only one RDS instance, and as the number of users on the app side increased, we decided to add a read replica. The read replica worked properly on staging, but once we moved to production we started getting the same error.
We solved this by enabling the hot_standby_feedback property in the Postgres parameters.
We referred to the following link:
https://aws.amazon.com/blogs/database/best-practices-for-amazon-rds-postgresql-replication/
I hope it will help.
Likewise, here's a second caveat to #Artif3x's elaboration of #max-malysh's excellent answer, both above.
With any delayed application of transactions from the master, the follower(s) will have an older, stale view of the data. Therefore, while providing time for the query on the follower to finish by setting max_standby_archive_delay and max_standby_streaming_delay makes sense, keep both of these caveats in mind:
the value of the follower as a standby / backup diminishes
any other queries running on the follower may return stale data.
If the value of the follower for backup ends up being too much in conflict with hosting queries, one solution would be multiple followers, each optimized for one or the other.
Also, note that several queries in a row can cause the application of wal entries to keep being delayed. So when choosing the new values, it’s not just the time for a single query, but a moving window that starts whenever a conflicting query starts, and ends when the wal entry is finally applied.
In my case, I have connected to another Greenplum (GP) database to import data into my PostgreSQL tables, and I have written Java schedulers to refresh it daily. But when I try to fetch the records every day using SQL functions, I get the error "Greenplum Database does not support REPEATABLE READ transactions". Can anyone suggest how I can load the data frequently from GP to Postgres without this isolation hassle?
I know I can refresh the tables by executing
START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
But I'm not able to use the same statement inside the functions because of their transaction blocks.
Unlike Oracle database, which uses locks and latches for concurrency control, Greenplum Database (as does PostgreSQL) maintains data consistency by using a multiversion model (Multiversion Concurrency Control, MVCC). This means that while querying a database, each transaction sees a snapshot of data which protects the transaction from viewing inconsistent data that could be caused by (other) concurrent updates on the same data rows. This provides transaction isolation for each database session. In a nutshell, readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.
Transaction Isolation Levels
The SQL standard defines four transaction isolation levels. In Greenplum Database, you can request any of the four standard transaction isolation levels. But internally, there are only two distinct isolation levels — read committed and serializable:
read committed — When a transaction runs on this isolation level, a SELECT query sees only data committed before the query began. It never sees either uncommitted data or changes committed during query execution by concurrent transactions. However, the SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. In effect, a SELECT query sees a snapshot of the database as of the instant that query begins to run. Notice that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes during execution of the first SELECT. UPDATE and DELETE commands behave the same as SELECT in terms of searching for target rows. They will only find target rows that were committed as of the command start time. However, such a target row may have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. The partial transaction isolation provided by read committed mode is adequate for many applications, and this mode is fast and simple to use. However, for applications that do complex queries and updates, it may be necessary to guarantee a more rigorously consistent view of the database than the read committed mode provides.
serializable — This is the strictest transaction isolation. This level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently. Applications using this level must be prepared to retry transactions due to serialization failures. When a transaction is on the serializable level, a SELECT query sees only data committed before the transaction began. It never sees either uncommitted data or changes committed during transaction execution by concurrent transactions. However, the SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. Successive SELECT commands within a single transaction always see the same data. UPDATE and DELETE commands behave the same as SELECT in terms of searching for target rows. They will only find target rows that were committed as of the transaction start time. However, such a target row may have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the serializable transaction will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the serializable transaction can proceed with updating the originally found row. But if the first updater commits (and actually updated or deleted the row, not just locked it) then the serializable transaction will be rolled back.
read uncommitted — Treated the same as read committed in Greenplum Database.
repeatable read — Treated the same as serializable in Greenplum Database.
The default transaction isolation level in Greenplum Database is read committed. To change the isolation level for a transaction, you can declare the isolation level when you BEGIN the transaction, or else use the SET TRANSACTION command after the transaction is started.
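For example, both of the following forms are accepted (standard PostgreSQL syntax, which Greenplum follows; remember that repeatable read is treated as serializable here):
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... statements ...
COMMIT;

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- must come before any query in the transaction
-- ... statements ...
COMMIT;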
I've been reading a lot about MongoDB recently, but one topic I can't find any clear material on is how data is written to the journal and oplog.
So this is what I understand of the process so far; please correct me where I'm wrong:
A client connects to mongod and performs a write. The write is stored in the socket buffer.
When Mongo is available (not sure what available means at this point), data is written to the journal?
The MongoDB docs then say that writes are flushed from the journal onto disk every 60 seconds. By this I can only assume this means written to the primary and the oplog. If this is the case, how do writes appear earlier than the 60-second sync interval?
Some time later, secondaries suck data from the primary or their sync source and update their oplog and databases. It seems very vague about when exactly this happens and what delays it.
I'm also wondering: if journaling were disabled (I understand that's a really bad idea), at what point do the oplog and database get updated?
Lastly, I'm a bit stumped about the points in this process at which the write locks get created. Is this just when the database and oplog are updated, or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Simon
Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongo. No writes are done so far, and no connection is torn down, because what happens now really depends on the write concern. Let's assume that we go with the (at the time of this writing) default, "acknowledged".
The client sends its write operation. Here is where I am really not sure: either after this step or the next one, the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. It is here that the acknowledgment is sent, because with an acknowledged write concern you may be returned a duplicate key error. It is possible that this was checked in the previous step. If I had to bet, I'd say it is after this one.
The output of the query optimizer is then applied to the data in memory: actually to the data of the memory-mapped datafiles, to the memory-mapped oplog and to the journal's memory-mapped files. Queries are answered from these memory-mapped parts, or the corresponding data is mapped into memory to answer the query. The oplog is read from memory if present, too.
Every 100ms in general the journal is synced to disk. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now.
Every syncDelay seconds, the current state of the memory-mapped files is synced to disk. I think the journal is then truncated to the entries which weren't applied to the data yet, but I am not too sure of that, since it should basically never happen that data in the journal isn't yet applied to the current data.
If you have read carefully, you noticed that the data is ready for the oplog as soon as it has been run through the query optimizer and applied to the files mapped into memory. When an oplog entry is pulled by one of the secondaries, it is immediately applied to its data in the memory-mapped files and synced to disk the same way as on the primary.
Some things to note: as soon as the relatively small data is written to the journal, it is quite safe. If a node goes down between two syncs to the datafiles, both the datafiles and the oplog can be restored from their last state in the datafiles and the journal. In general, the maximum data loss you can have is the operations recorded into the journal after the last commit, 50 ms in the median.
As for the locks: if you have read carefully, there aren't locks imposed at the database level when the data is synced to disk. Write locks may be created in order to ensure that only one thread at any given point in time modifies a given document. There are other write locks possible, but in general they should be rather rare.
Write locks on the filesystem layer are created once, though only implicitly, iirc. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those datafiles while a valid lock exists. And you shouldn't either ;)
Hope this helps.