I monitor the number of WAL files in pg_wal. Over time, it decreases by itself. I don't have clustering, just a single server with logical replication.
My parameters:
archive_timeout = 3600
min_wal_size = 2GB
max_wal_size = 16GB
wal_keep_segments = 4000
archive_mode = on
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
wal_level = logical
What are the reasons the number of WAL files decreases? I've tried to look for articles but never found one. Please point me to one, or maybe answer this here.
Thanks
WAL segments are automatically deleted if they are no longer needed. Also, PostgreSQL automatically creates new WAL segments for future use, and the number of such segments depends on the amount of data modification activity. So it is totally normal for the size of pg_wal to vary with the amount of data modification activity.
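If you want to watch this behaviour yourself, here is a minimal sketch, assuming PostgreSQL 10 or later (where the directory is pg_wal and pg_ls_waldir() is available):
-- Count and total size of the segments currently in pg_wal;
-- rerun it during busy and quiet periods to see the numbers vary.
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();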
I have done a full import of the planet from the OSM website and scheduled the updates to run daily via a cron job.
But I have noticed that the disk usage is growing very fast in size on a daily basis.
When running the command df -h, I have noticed that the disk usage grows by about 1GB every day. I am not sure whether this command rounds up, but even so this growth seems very large.
I have a disk with 1TB free, but this would mean that the disk would be full in about 3 years.
I have tried to inspect the folders under /var/lib/postgresql/<version>/<cluster>, and it seems that the folders contributing to this size increase are pg_wal and base/16390.
The folder base/16390 has many files with 1GB each and the folder pg_wal has about 40 something files of 16MB each.
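For what it's worth, I assume the 16390 in base/16390 is a database OID; a query like this (just a sketch) should show which database it is and its total size:
-- Map the base/<oid> directory name to a database and report its size
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
WHERE oid = 16390;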
I don't know which files are safe to remove, or whether there are settings in postgresql.conf that would prevent this large increase in size each day.
I also don't know whether this has to do with backups or logs that Postgres keeps by default, but I would like to reduce those to a minimum.
Any help on this would be appreciated.
Thanks in advance.
I am trying to understand the behaviour of WAL files. The WAL-related settings of the database are as follows:
"min_wal_size" "2GB"
"max_wal_size" "20GB"
"wal_segment_size" "16MB"
"wal_keep_segments" "0"
"checkpoint_completion_target" "0.8"
"checkpoint_timeout" "15min"
The number of wal files is always 1281 or higher:
SELECT COUNT(*) FROM pg_ls_dir('pg_xlog') WHERE pg_ls_dir ~ '^[0-9A-F]{24}';
-- count 1281
As I understand it, this means the WAL files currently never fall below max_wal_size (1281 * 16 MB = 20496 MB ≈ 20 GB = max_wal_size)?
I would expect the number of WAL files to drop below the maximum right after a checkpoint is reached and data is synced to disk. But this is clearly not the case. What am I missing?
As per the documentation (emphasis added):
The number of WAL segment files in pg_xlog directory depends on min_wal_size, max_wal_size and the amount of WAL generated in previous checkpoint cycles. When old log segment files are no longer needed, they are removed or recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of log output rate, max_wal_size is exceeded, the unneeded segment files will be removed until the system gets back under this limit. Below that limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and removes the rest
So you are probably seeing the "recycle" effect -- the old WAL files are getting renamed instead of removed. This saves some I/O, especially on busy systems.
Bear in mind that once a particular file has been recycled, it will not be reconsidered for removal/recycle again until it has been used (i.e., the relevant LSN is reached and checkpointed). That may take a long time if your system suddenly becomes less active.
If your server is very busy and then abruptly becomes mostly idle, you can get into a situation where the log files remain at max_wal_size for a very long time. At the time it was deciding whether to remove or recycle the files, it was using them up quickly and so decided to recycle up to max_wal_size for predicted future use, rather than remove them. Once recycled, they will never get removed until they have been used (you could argue that that is a bug), and if the server is now mostly idle it will take a very long time for them to be used and thus removed.
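If you want to see this on your own system, here is a sketch using the pre-10 pg_xlog function names that match the question (on newer versions substitute pg_ls_waldir(), pg_walfile_name() and pg_current_wal_lsn()). Segments whose names sort after the current segment are the recycled, pre-allocated ones:
-- List segments that have been recycled ahead of the current insert position
-- (assumes a single timeline, so name order matches LSN order)
SELECT pg_ls_dir AS segment
FROM pg_ls_dir('pg_xlog')
WHERE pg_ls_dir ~ '^[0-9A-F]{24}'
  AND pg_ls_dir > pg_xlogfile_name(pg_current_xlog_location())
ORDER BY 1;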
I would like to keep at least 12 hours' worth of WAL segments around to keep replication going through extended network outages (namely long DR tests that my database is not a part of).
I've estimated that I will need to raise my wal_keep_segments from 64 to 1000+
Are there any drawbacks to doing this other than the space it would require, e.g. performance?
I'm considering the archive option as a backup plan for now.
Apart from the disk space, there is no problem with a high wal_keep_segments setting.
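To put a rough number on that disk space: with the default 16 MB segment size, a setting of 1000 reserves on the order of 16 GB. A trivial sketch of the arithmetic:
-- Approximate space kept in pg_xlog by wal_keep_segments = 1000
SELECT pg_size_pretty(1000 * 16 * 1024 * 1024::bigint) AS kept_wal;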
I am having trouble setting up a PostgreSQL hot standby. When attempting to start the database after running pg_basebackup, I receive FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000006440000008D has already been removed in postgresql.log. After a brief discussion on IRC, I came to understand that the error likely originates from a too-low wal_keep_segments setting for my write-intensive database.
How might I calculate, if possible, the proper setting for wal_keep_segments? What is an acceptable value for this setting?
What I am working with:
Postgresql 9.3
Debian 7.6
wal_keep_segments can be estimated as the average number of new WAL segments per minute in the pg_xlog directory, multiplied by the number of minutes you want to be safe for. Bear in mind that the rate is expected to increase after wal_level is changed from its default value of minimal to either archive or hot_standby. The only cost is disk space, which as you know is by default 16 MB per segment.
I typically use powers of 2 as values. At the rate of about 1 segment per minute, a value of 256 gives me about 4 hours in which to set up the standby.
You could alternatively consider using WAL streaming with pg_basebackup, via its --xlog-method=stream option. Unfortunately, at least as of 2013, per a discussion on a PostgreSQL mailing list, setting wal_keep_segments to a nonzero value may still be recommended, to avoid the risk of the stream being unable to keep up. If you do use pg_basebackup, also don't forget --checkpoint=fast.
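If you prefer to measure the WAL rate directly rather than count files, here is a sketch using the 9.3 function names (pg_current_xlog_location() and pg_xlog_location_diff(); on 10+ they are pg_current_wal_lsn() and pg_wal_lsn_diff()):
-- Run this, wait a known interval (say 10 minutes), run it again,
-- then divide the difference in bytes by 16 MB to get segments per interval.
SELECT pg_current_xlog_location() AS current_location,
       pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS wal_bytes_written;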
We have recently started using the Cassandra database in production. We have a single cross-colo cluster of 24 nodes: 12 nodes in the PHX colo and 12 nodes in the SLC colo. We have a replication factor of 4, which means two copies in each datacenter.
Below is how the keyspace and column families were created by our production DBAs.
create keyspace profile with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy' and
strategy_options = {slc:2,phx:2};
create column family PROFILE_USER
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and gc_grace = 86400;
We are running Cassandra 1.2.2 with org.apache.cassandra.dht.Murmur3Partitioner, key caching, SizeTieredCompactionStrategy and virtual nodes enabled.
Machine specifications for the Cassandra production nodes:
16 cores, 32 threads
128GB RAM
4 x 600GB SAS in Raid 10, 1.1TB usable
2 x 10GbaseT NIC, one usable
Below is the result I am getting.
Read latency (95th percentile): 9 milliseconds
Number of threads: 10
Duration the program was running: 30 minutes
Throughput: 1977 requests/second
Total number of ids requested: 3558701
Total number of columns requested: 65815867
I am not sure what else I should try with Cassandra to get much better read performance. I am assuming it is hitting the disk in my case. Should I try increasing the replication factor to some higher number? Any other suggestions?
I believe reading data from an HDD takes around 6-12 ms, compared to SSDs? In my case I guess it is hitting the disk every time, and enabling the key cache is not helping here. I cannot enable the row cache because it's more efficient to use the OS page cache; maintaining the row cache in the JVM is very expensive, so the row cache is recommended only for a small number of rows, like <100K rows.
Is there any way I can verify whether keycaching is working fine in my case or not?
This is what I get when I do show schema for the column family:
create column family PROFILE
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = false
and gc_grace = 86400
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
Is there anything I should change to get better read performance?
I am assuming it is hitting the disk in my case. Should I try increasing the replication factor to some higher number? Any other suggestions?
If your data is much larger than memory and your access is close to random you will be hitting disk. This is consistent with latencies of ~10ms.
Increasing the replication factor might help, although it will make your cache less efficient since each node will store more data. It is probably only worth doing if your read pattern is mostly random, your data is very large, you have low consistency requirements and your access is read heavy.
If you want to decrease read latency, you can use a lower consistency level. Reading at consistency level CL.ONE generally gives the lowest read latency at a cost of consistency. You will only get consistent reads at CL.ONE if writes are at CL.ALL. But if consistency is not required it is a good tradeoff.
If you want to increase read throughput, you can decrease read_repair_chance. This number specifies the probability that Cassandra performs a read repair on each read. Read repair involves reading from available replicas and updating any that have old values.
If reading at a low consistency level, read repair incurs extra read I/O so decreases throughput. It doesn't affect latency (for low consistency levels) since read repair is done asynchronously. Again, if consistency isn't important for your application, decrease read_repair_chance to maybe 0.01 to improve throughput.
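For example, to lower read_repair_chance on the column family from the question, something like this from cassandra-cli should work on 1.2 (the 0.01 is just the illustrative value from above, not a recommendation for your workload):
update column family PROFILE_USER with read_repair_chance = 0.01;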
Is there any way I can verify whether keycaching is working fine in my case or not?
Look at the output of 'nodetool info' and it will output a line like:
Key Cache : size 96468768 (bytes), capacity 96468992 (bytes), 959293 hits, 31637294 requests, 0.051 recent hit rate, 14400 save period in seconds
This gives you the key cache hit rate, which is quite low in the example above.
Old post, but in case someone else comes across this:
Don't use an even RF. Your RF of 4 requires a quorum of 3 nodes, which is no different from an RF of 5.
Your key cache is probably working fine; it only tells Cassandra where on disk a row is located, so it only reduces seek time.
You have a rather large amount of RAM for a pre-3.0 version, and you are likely not leveraging all of it. Try G1GC on newer Cassandra versions.
Row key cache: make sure your partitions are ordered in the way you intend to access them. For example, if you're picking up only recent data, make sure you order by timestamp ASC instead of timestamp DESC, as it will cache from the START of the partition.
Parallelize and bucket queries. Use nodetool cfhistograms to evaluate the size of your partitions, then try to break partitions that exceed 100 MB into smaller chunks. From there, change your queries to SELECT x FROM table WHERE id = X AND bucket IN (1,2,3) if you need to scan. Significant performance can then be gained by dropping the IN and issuing separate queries, e.g. SELECT ... WHERE id = X AND bucket = 1, then SELECT ... WHERE id = X AND bucket = 2, and doing the aggregation at the application layer, as in the sketch below.
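As a sketch of that bucketing idea in CQL (the table, column and bucket names here are made up for illustration, and it assumes a schema whose partition key includes the bucket):
-- Hypothetical bucketed layout:
-- CREATE TABLE profile_user (id text, bucket int, ts timestamp, data text,
--                            PRIMARY KEY ((id, bucket), ts));
-- Instead of: SELECT data FROM profile_user WHERE id = 'X' AND bucket IN (1, 2, 3);
-- issue one query per bucket, in parallel, and aggregate in the application:
SELECT data FROM profile_user WHERE id = 'X' AND bucket = 1;
SELECT data FROM profile_user WHERE id = 'X' AND bucket = 2;
SELECT data FROM profile_user WHERE id = 'X' AND bucket = 3;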