I've been reading through the WAL chapter of the Postgres manual and was confused by a portion of the chapter:
Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.
How is it that continuous writing to WAL more performant than simply writing to the table/index data itself?
As I see it (forgetting for now the resiliency benefits of WAL) postgres need to complete two disk operations; first pg needs to commit to WAL on disk and then you'll still need to change the table data to be consistent with WAL. I'm sure there's a fundamental aspect of this I've misunderstood but it seems like adding an additional step between a client transaction and the and the final state of the table data couldn't actually increase overall performance. Thanks in advance!
You are fundamentally right: the extra writes to the transaction log will per se not reduce the I/O load.
But a transaction will normally touch several files (tables, indexes etc.). If you force all these files out to storage (“sync”), you will incur more I/O load than if you sync just a single file.
Of course all these files will have to be written and sync'ed eventually (during a checkpoint), but often the same data are modified several times between two checkpoints, and then the corresponding files will have to be sync'ed only once.
Related
As per my understanding
checkpoint write all dirty buffer(data) periodically into disk and
background writer writes some specific dirty buffer(data) into disk
It looks both do almost same work.
But what are the specific dirty buffer(data) writes into disk?
How frequently checkpoint and bgwriter it is calling?
I want to know what are the difference between them.
Thanks in advance
It looks both do almost same work.
Looking at the source code link given by Adrian, you can see these words in the comments for the background writer:
As of Postgres 9.2 the bgwriter no longer handles checkpoints.
...which means in the past, the background writer and checkpointer tasks were handled by one component, which explains the similarity that probably led you to ask this question. The two components were split on 1/Nov/2011 in this commit and you can learn more about the checkpointer here.
From my own understanding, they are doing the same task from different perspectives. The task is making sure we use a limited amount of resources:
For the background writer, that resource is RAM and it writes dirty buffers to disk so the buffers can be reused to store other data hence limiting the amount of RAM required.
For the checkpointer, that resource is DISK and it writes all dirty buffers to disk so it can add a checkpoint record to the WAL, which allows all segments of the WAL prior to that record to be removed/recycled hence limiting the amount of DISK required to store the WAL files. You can confirm this in the docs which say ...after a checkpoint, log segments preceding the one containing the redo record are no longer needed and can be recycled or removed.
It may be helpful to read more about the WAL (Write-Ahead Log) in general.
I have a script that performs a bunch of updates on a moderately large (approximately 6 million rows) table, based on data read from a file.
It currently begins and then commits a transaction for each row it updates and I wanted to improve its performance somehow. I wonder if starting a single transaction at the beginning of the script's run and then rollbacking to individual savepoints in case any validation error occurs would actually result in a performance increase.
I looked online but haven't had much luck finding any documentation or benchmarks.
COMMIT is mostly an I/O problem, because the transaction log (WAL) has to be synchronized to disk.
So using subtransactions (savepoints) will verylikely boost performance. But beware that using more than 64 subtransactions per transaction will again hurt performance if you have concurrent transactions.
If you can live with losing some committed transactions in the event of a database server crash (which is rare), you could simply set synchronous_commit to off and stick with many small transactions.
Another, more complicated method is to process the rows in batches without using subtransactions and repeating the whole batch in case of a problem.
Having a single transaction with only 1 COMMIT should be faster than having multiple single row update transactions because each COMMIT must synchronize WAL writing to disk. But how really faster it is in a given environment depends a lot of the environment (number of transactions, table structure, index structure, UPDATE statement, PostgreSQL configuration, system configuration etc.): only you can benchmark in your environment.
I am using kafka_2.10-0.10.0.1. I have two questions:
- I want to know how I can modify the default configuration of Kafka to process large amount of data with good performance.
- Is it possible to configure Kafka to process the records in memory without storing in disk?
thank you
Is it possible to configure Kafka to process the records in memory without storing in disk?
No. Kafka is all about storing records reliably on disk, and then reading them back quickly off of disk. In fact, its documentation says:
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
You can read more about its design here: https://kafka.apache.org/documentation/#design. The implementation section is also quite interesting: https://kafka.apache.org/documentation/#implementation.
That said, Kafka is also all about processing large amounts of data with good performance. In 2014 it could handle 2 million writes per second on three cheap instances: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines. More links about performance:
https://docs.confluent.io/current/kafka/deployment.html
https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
I've been read a lot about MongoDB recently, but one topic I can't find any clear material on, is how data is written to the journal and oplog.
So this is what I understand of the process so far, please correct me where I'm wrong
A client connect to mongod and performs a write. The write is stored in the socket buffer
When Mongo is available (not sure what available means at this point), data is written to the journal?
The mongoDB docs then say that writes every 60 seconds are flushed from the journal onto disk. By this I can only assume this mean written to the primary and the oplog. If this is the case, how to writes appear earlier than the 60 seconds sync interval?
Some time later, secondaries suck data from the primary or their sync source and update their oplog and databases. It seems very vague about when exactly this happens and what delays it.
I'm also wondering if journaling was disabled (I understand that's a really bad idea), at what point does the oplog and database get updated?
Lastly I'm a bit stumpted at which points in this process, the write locks get created. Is this just when the database and oplog are updated or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Simon
Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongo. No writes done so far, and no connection torn down, because it really depends on the write concern what happens now.Let's assume that we go with the (by the time of this writing) default "acknowledged".
The client sends it's write operation. Here is where I am really not sure. Either after this step or the next one the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. It is here where the acknowledgment is sent because with in an acknowledged write concern, you may be returned a duplicate key error. It is possible that this was checked in the last step. If I should bet, I'd say it is after this one.
The output of the query optimizer is then applied to the data in memory Actually to the data of the memory mapped datafiles, to the memory mapped oplog and to the journal's memory mapped files. Queries are answered from this memory mapped parts or the according data is mapped to memory for answering the query. The oplog is read from memory if present, too.
Every 100ms in general the journal is synced to disk. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now.
Every syncDelay seconds, the current state of the memory mapped files is synced to disk I think the journal is truncated to the entries which weren't applied to the data yet, but I am not too sure of that since that it should basically never happen that data in the journal isn't yet applied to the current data.
If you have read carefully, you noticed that the data is ready for the oplog as early as it has been run through the query optimizer and was applied to the files mapped into memory. When the oplog entry is pulled by one of the secondaries, it is immediately applied to it's data of the memory mapped files and synced in the disk the same way as on the primary.
Some things to note: As soon as the relatively small data is written to the journal, it is quite safe. If a node goes down between two syncs to the datafiles, both the datafiles and the oplog can be restored from their last state in the datafiles and the journal. In general, the maximum data loss you can have is the operations recorded into the log after the last commit, 50ms in median.
As for the locks. If you have written carefully, there aren't locks imposed on a database level when the data is synced to disk. Write locks may be created in order to assure that only one thread at any given point in time modifies a given document. There are other write locks possible , but in general, they should be rather rare.
Write locks on the filesystem layer are created once, though only implicitly, iirc. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those datafiles while a valid lock exists. And you shouldn't either ;)
Hope this helps.
We're using MongoDB 2.2.0 at work. The DB contains about 51GB of data (at the moment) and I'd like to do some analytics on the user data that we've collected so far. Problem is, it's the live machine and we can't afford another slave at the moment. I know MongoDB has a read lock which may affect any writes that happen especially with complex queries. Is there a way to tell MongoDB to treat my (particular) query with the lowest priority?
In MongoDB reads and writes do affect each other. Read locks are shared, but read locks block write locks from being acquired and of course no other reads or writes are happening while a write lock is held. MongoDB operations yield periodically to keep other threads waiting for locks from starving. You can read more about the details of that here.
What does that mean for your use case? Because there is no way to tell MongoDB to access the data without a read lock, nor is there a way to prioritize the requests (at least not yet) whether the reads significantly affect the performance of your writes depends on how much "headroom" you have available while write activity is going on.
One suggestion I can make is when figuring out how to run analytics, rather than scanning the entire data set (i.e. doing an aggregation query over all historical data) try running smaller aggregation queries on short time slices. This will accomplish two things:
reads jobs will be shorter lived and therefore will finish quicker, this will give you a chance to assess what impact the queries have on your "live" performance.
you won't be pulling all old data into RAM at once - by spacing out these analytical queries over time you will minimize the impact it will have on current write performance.
Depending on what it is you can't afford about getting another server - you might consider getting a short lived AWS instance which may be not very powerful but would be available to run a long analytical query against a copy of your data set. Just be careful when making it a copy of your data - doing a full sync off of the production system will place a heavy load on it (more effective way would be to use a recent backup/file snapshot to resume from).
Such operations are best left for slaves of a replica set. For one thing, read locks can be shared to allow many reads at once, but write locks will block reads. And, while you can't prioritize queries, mongodb yields long running read/write queries. Their concurrency docs should help
If you can't afford another server, you can setup a slave on the same machine, provided you have some spare RAM/Disk headroom, and you use the slave lightly/occasionally. You must be careful though, your disk I/O will increase significantly.