We recently moved a db(1.2TB) cluster from mirrored SSD to ZFS pool build-up of SSD. After the move, I have seen a massive drop in performance on large write operations (Alter table types, vacuums, index creation etc.).
To Isolate the problem I did the following, copied the 361 GB table and ensured no triggers are active, try to run the following command, original type as timestamp
ALTER TABLE table_log_test ALTER COLUMN date_executed TYPE timestamptz;
This takes about 3 hours to complete, make sense it needs to touch every one of the 60 mil rows, however, this takes about 10 min on the test system only on SSD
Comparing the alter commands zpool iostat outputs to fio I get the following results
Alter command
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.33T 5.65T 6.78K 5.71K 31.9M 191M
raidz1 1.33T 5.65T 6.78K 5.71K 31.9M 191M
sda - - 1.94K 1.34K 9.03M 48.6M
sdb - - 1.62K 1.45K 7.66M 48.5M
sdc - - 1.62K 1.46K 7.66M 48.3M
sdd - - 1.60K 1.45K 7.59M 45.5M
FIO
fio --ioengine=libaio --filename=tank --size=10G --time_based --name=fio --group_reporting --runtime=10 --direct=1 --sync=1 --iodepth=1 --rw=randrw --bs=1MB --numjobs=32
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 1.34T 5.65T 14 27.5K 59.8K 940M
raidz1 1.34T 5.65T 14 27.5K 59.8K 940M
sda - - 5 7.14K 23.9K 235M
sdb - - 1 7.02K 7.97K 235M
sdc - - 4 7.97K 19.9K 235M
sdd - - 1 5.33K 7.97K 235M
So it seems to me the zfs is working fine, it's just an interaction with PostgreSQL that's slow.
What settings have I played with
ZFS
recordsize = 16KB changed from 128KB
logbias = Latency , throughput preformed worse
compression = lz4
primarycache = all , we have large write and reads
NO ARC or ZIL enabled
Postgres settings
full_page_writes=off
shared_buffers = 12GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
random_page_cost = 1.2
effective_io_concurrency = 200
work_mem = 256MB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
and tried
synchronous_commit = off , didn't see any performance increase
As a note synchronous_commit and full_page_writes I only did a Postgres config reload as this is a production site. I see some guys do restarts while some documentation states it only requires to reload. Reloads shows in psql if I SHOW setting.
At this point, I am a bit lost on what to try next. I am also unsure if the reload vs restart may be the reason I don't see the gains others are talking about.
As a side note. Vacuum full analyze didn't help either, not that I expected it to on a new copied table.
Thanks in advance for your help
UPDATE 1
I amended my fio commands as suggested by jjanes, Here are the outputs
First one is based on jjanes suggestion.
fio --ioengine=psync --filename=tank --size=10G --time_based --name=fio --group_reporting --runtime=10 --rw=rw --rwmixread=50 --bs=8KB
fio: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
fio: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=91.6MiB/s,w=90.2MiB/s][r=11.7k,w=11.6k IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=3406394: Tue Dec 28 08:11:06 2021
read: IOPS=16.5k, BW=129MiB/s (135MB/s)(1292MiB/10001msec)
clat (usec): min=2, max=15165, avg=25.87, stdev=120.57
lat (usec): min=2, max=15165, avg=25.94, stdev=120.57
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
| 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 6], 60.00th=[ 9],
| 70.00th=[ 43], 80.00th=[ 48], 90.00th=[ 57], 95.00th=[ 68],
| 99.00th=[ 153], 99.50th=[ 212], 99.90th=[ 457], 99.95th=[ 963],
| 99.99th=[ 7504]
bw ( KiB/s): min=49392, max=209248, per=99.76%, avg=131997.16, stdev=46361.80, samples=19
iops : min= 6174, max=26156, avg=16499.58, stdev=5795.23, samples=19
write: IOPS=16.5k, BW=129MiB/s (135MB/s)(1291MiB/10001msec); 0 zone resets
clat (usec): min=5, max=22574, avg=33.29, stdev=117.32
lat (usec): min=5, max=22574, avg=33.40, stdev=117.32
clat percentiles (usec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 8], 20.00th=[ 9],
| 30.00th=[ 10], 40.00th=[ 11], 50.00th=[ 13], 60.00th=[ 14],
| 70.00th=[ 17], 80.00th=[ 22], 90.00th=[ 113], 95.00th=[ 133],
| 99.00th=[ 235], 99.50th=[ 474], 99.90th=[ 1369], 99.95th=[ 2073],
| 99.99th=[ 3720]
bw ( KiB/s): min=49632, max=205664, per=99.88%, avg=132066.26, stdev=46268.55, samples=19
iops : min= 6204, max=25708, avg=16508.00, stdev=5783.26, samples=19
lat (usec) : 4=16.07%, 10=30.97%, 20=23.77%, 50=15.29%, 100=7.37%
lat (usec) : 250=5.94%, 500=0.30%, 750=0.10%, 1000=0.07%
lat (msec) : 2=0.08%, 4=0.03%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=3.47%, sys=72.13%, ctx=19573, majf=0, minf=28
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=165413,165306,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1292MiB (1355MB), run=10001-10001msec
WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1291MiB (1354MB), run=10001-10001msec
Second one is from https://subscription.packtpub.com/book/big-data-and-business-intelligence/9781785284335/1/ch01lvl1sec14/checking-iops
fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=tank --bs=8k --iodepth=32 --size=10G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
test_seq_mix_rw: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [M(1)][100.0%][r=158MiB/s,w=157MiB/s][r=20.3k,w=20.1k IOPS][eta 00m:00s]
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=3484893: Tue Dec 28 08:13:31 2021
read: IOPS=17.7k, BW=138MiB/s (145MB/s)(5122MiB/36990msec)
slat (usec): min=2, max=33046, avg=31.73, stdev=95.75
clat (nsec): min=1691, max=34831k, avg=878259.94, stdev=868723.61
lat (usec): min=6, max=34860, avg=910.14, stdev=883.09
clat percentiles (usec):
| 1.00th=[ 306], 5.00th=[ 515], 10.00th=[ 545], 20.00th=[ 586],
| 30.00th=[ 619], 40.00th=[ 652], 50.00th=[ 693], 60.00th=[ 742],
| 70.00th=[ 807], 80.00th=[ 955], 90.00th=[ 1385], 95.00th=[ 1827],
| 99.00th=[ 2933], 99.50th=[ 3851], 99.90th=[14877], 99.95th=[17433],
| 99.99th=[23725]
bw ( KiB/s): min=48368, max=205616, per=100.00%, avg=142130.51, stdev=34694.67, samples=73
iops : min= 6046, max=25702, avg=17766.29, stdev=4336.81, samples=73
write: IOPS=17.7k, BW=138MiB/s (145MB/s)(5118MiB/36990msec); 0 zone resets
slat (usec): min=6, max=18233, avg=22.24, stdev=85.73
clat (usec): min=6, max=34848, avg=871.98, stdev=867.03
lat (usec): min=15, max=34866, avg=894.36, stdev=898.46
clat percentiles (usec):
| 1.00th=[ 302], 5.00th=[ 515], 10.00th=[ 545], 20.00th=[ 578],
| 30.00th=[ 611], 40.00th=[ 644], 50.00th=[ 685], 60.00th=[ 734],
| 70.00th=[ 807], 80.00th=[ 955], 90.00th=[ 1385], 95.00th=[ 1811],
| 99.00th=[ 2868], 99.50th=[ 3687], 99.90th=[15008], 99.95th=[17695],
| 99.99th=[23987]
bw ( KiB/s): min=47648, max=204688, per=100.00%, avg=142024.70, stdev=34363.25, samples=73
iops : min= 5956, max=25586, avg=17753.07, stdev=4295.39, samples=73
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.16%, 500=3.61%, 750=58.52%, 1000=19.22%
lat (msec) : 2=14.79%, 4=3.25%, 10=0.25%, 20=0.19%, 50=0.02%
cpu : usr=4.36%, sys=85.41%, ctx=28323, majf=0, minf=447
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=655676,655044,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=5122MiB (5371MB), run=36990-36990msec
WRITE: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=5118MiB (5366MB), run=36990-36990msec
Conclusions
So it turns out the major issue for poor performance was write amplification. The below post has a good comment on this by Dunuin https://www.linuxbabe.com/mail-server/setup-basic-postfix-mail-sever-ubuntu
In short summary
4K writes where the primary writes for the alter command
Adding dedicated SLOG helped
Adding dedicated ARC helped
Moving WAHL files to separate tanks helped
Changing record size to 16Kb helped
Disabling sync writes on WAHL helped.
One thing I didn't try was recompling Postgres in 32Kb pages. Based on what I have seen this could have a significant performance impact and is worth investigating if you are installing a new cluster.
Thanks to everyone for their input into this problem. Hope this info helps someone else.
I wonder how you created the zfs pool. Once I forgot the ashift=12 option when creating the zfs pool.
Maybe check this option with zdb. (https://charsiurice.wordpress.com/2016/05/30/checking-ashift-on-existing-pools/)
I block on my problem that I will wrote in details below. During 3 days I tried a lot of differents things, none worked..
If anyone have an idea of what to do !
Here is my message error :
the call to "ft_selectdata" took 0 seconds
preprocessing
Out of memory. Type "help memory" for your options.
Error in ft_preproc_dftfilter (line 187)
tmp = exp(2*1i*pi*freqs(:)*time); % complex sin and cos
Error in ft_preproc_dftfilter (line 144)
filt = ft_preproc_dftfilter(filt, Fs, Fl(i), 'dftreplace', dftreplace, 'dftbandwidth',
dftbandwidth(i), 'dftneighbourwidth', dftneighbourwidth(i)); % enumerate all options
Error in preproc (line 464)
dat = ft_preproc_dftfilter(dat, fsample, cfg.dftfreq, optarg{:});
Error in ft_preprocessing (line 375)
[dataout.trial{i}, dataout.label, dataout.time{i}, cfg] = preproc(data.trial{i}, data.label,
data.time{i}, cfg, begpadding, endpadding);
Error in EEG_Prosocial_script (line 101)
data_intpl = ft_preprocessing(cfg, allData_preprosses);
187 tmp = exp(2*1i*pi*freqs(:)*time); % complex sin and cos
There is some informations from matlab about my computer and about the caracteristics of the calcul
Maximum possible array: 7406 MB (7.766e+09 bytes)*
Memory available for all arrays: 7406 MB (7.766e+09 bytes) *
Memory used by MATLAB: 4195 MB (4.398e+09 bytes)
Physical Memory (RAM): 12206 MB (1.280e+10 bytes)
Limited by System Memory (physical + swap file) available.
K>> whos Name
Size Bytes Class Attributes
freqs 7753x1 62024 double
li 1x1 16 double complex
time 1x1984512 15876096 double
So there the config of the computer which failed to run the script (Alienware aurora R4) :
Ram : 4gb free / 12 # 1,6Ghz --> 2x (4Gb 1600Mhz) - 2x (2Gb 1600 MHz)
Intel core i7-3820 4 core 8 threads 3,7 GHz 1 CPU
NVIDIA GeForce GTX 690 2gb
RAM : Kingston KVT8FP HYC
Hard disk : SSD kingston 250Go SATA 3"
This code work on this computer (Dell inspiron 14-500) : config
Ram 4 Go of memory DDR4 2 666 MHz (4 Go x 1)
Intel® Core™ i5-8265U 8e generatio, (6 Mo memory, 3,9 GHz)
Intel® UHD Graphics 620
Hard disk SATA 2,5" 500 Go 5 400 tr/min
Thank you
Kind regards,
by doing freqs(:)*time you are trying to create a 7753*1984512 size array, and you dont have memory for that... (you will need ~123 gigabytes for that, or 1.2309e+11 bytes, where your computer has ~7e+09)
see for example the case for :
f=rand(7,1);
t=rand(1,19);
size(f(:)*t)
ans =
7 19
what you want do to is probably a for loop per time element etc.
According to Intel's optimization manual, the L1 data cache is 32 KiB and 8-way associative with 64-byte lines. I have written the following micro-benchmark to test memory read performance.
I hypothesize that if we access only blocks that can fit in the 32 KiB cache, each memory access will be fast, but if we exceed that cache size, the accesses will suddenly be slower. When skip is 1, the benchmark accesses every line in order.
void benchmark(int bs, int nb, int trials, int skip)
{
printf("block size: %d, blocks: %d, skip: %d, trials: %d\n", bs, nb, skip, trials);
printf("total data size: %d\n", nb*bs*skip);
printf("accessed data size: %d\n", nb*bs);
uint8_t volatile data[nb*bs*skip];
clock_t before = clock();
for (int i = 0; i < trials; ++i) {
for (int block = 0; block < nb; ++block) {
data[block * bs * skip];
}
}
clock_t after = clock() - before;
double ns_per_access = (double)after/CLOCKS_PER_SEC/nb/trials * 1000000000;
printf("%f ns per memory access\n", ns_per_access);
}
Again with skip = 1, the results match my hypothesis:
~ ❯❯❯ ./bm -s 64 -b 128 -t 10000000 -k 1
block size: 64, blocks: 128, skip: 1, trials: 10000000
total data size: 8192
accessed data size: 8192
0.269054 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 256 -t 10000000 -k 1
block size: 64, blocks: 256, skip: 1, trials: 10000000
total data size: 16384
accessed data size: 16384
0.278184 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 1
block size: 64, blocks: 512, skip: 1, trials: 10000000
total data size: 32768
accessed data size: 32768
0.245591 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 1024 -t 10000000 -k 1
block size: 64, blocks: 1024, skip: 1, trials: 10000000
total data size: 65536
accessed data size: 65536
0.582870 ns per memory access
So far, so good: when everything fits in L1 cache, the inner loop runs about 4 times per nanosecond, or a bit more than once per clock cycle. When we make the data too big, it takes significantly longer. This is all consistent with my understanding of how the cache should work.
Now let's accessing every other block by letting skip be 2.
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 2
block size: 64, blocks: 512, skip: 2, trials: 10000000
total data size: 65536
accessed data size: 32768
0.582181 ns per memory access
This violates my understanding! It would make sense for a direct-mapped cache, but since our cache is associative, I can't see why the lines should be conflicting with each other. Why is accessing every other block slower?
But if I set skip to 3, things are fast again. In fact, any odd value of skip is fast; any even value is slow. For example:
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 7
block size: 64, blocks: 512, skip: 7, trials: 10000000
total data size: 229376
accessed data size: 32768
0.265338 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 12
block size: 64, blocks: 512, skip: 12, trials: 10000000
total data size: 393216
accessed data size: 32768
0.616013 ns per memory access
Why is this happening?
For completeness: I am on a mid-2015 MacBook Pro running macOS 10.13.4. My full CPU brand string is Intel(R) Core(TM) i7-4980HQ CPU # 2.80GHz. I am compiling with cc -O3 -o bm bm.c; the compiler is the one shipped with Xcode 9.4.1. I have omitted the main function; all it does is parse the command-line options and call benchmark.
The cache is not fully-associative, it's set-associative, meaning that each address maps to a certain set, and the associativity only works among lines that map to the same set.
By making the step equal 2, you keep half of the sets out of the game, so the fact you access effectively 32K doesn't matter - you only have 16k available (even sets, for example), so you still exceed your capacity and start thrashing (end fetching data from the next level).
When the step is 3, the problem is gone since after wrapping around you can use all the sets. Same would go for any prime number (which is why it's sometimes used for address hashing)
We're running OrientDB 2.1.11 (Community Edition) along with JDK 1.8.0.74.
We're noticing memory consumption by 'orientdb' java slowly creeping up and in a few days, the database becomes un-responsive (we have to stop/start Orientdb in order to release memory).
We also noticed this kind of behavior in a few hours when we index the database.
The total size of the database is only 60 GB and not more than 200 million records!
As you can see below, it already consumes VIRT(11.44 GB) RES(8.62 GB).
We're running CentOS 7.1.x.
Even change heap from 512 to 256M and modified diskcache.bufferSize to 8GB
MAXHEAP=-Xmx256m
ORIENTDB MAXIMUM DISKCACHE IN MB, EXAMPLE, ENTER -Dstorage.diskCache.bufferSize=8192 FOR 8GB
MAXDISKCACHE="-Dstorage.diskCache.bufferSize=8192"
top output:
Tasks: 155 total, 1 running, 154 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.1 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16269052 total, 229492 free, 9510740 used, 6528820 buff/cache
KiB Swap: 8257532 total, 8155244 free, 102288 used. 6463744 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2367 nmx 20 0 11.774g 8.620g 14648 S 0.3 55.6 81:26.82 java
ps aux output:
nmx 2367 4.3 55.5 12345680 9038260 ? Sl May02 81:28 /bin/java
-server -Xmx256m -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError
-Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9
-Dprofiler.enabled=true -Dstorage.diskCache.bufferSize=8192
How do I control memory usage?
Is there a CB memory leak?
Could you set following setting for JVM -XX:+UseLargePages -XX:LargePageSizeInBytes=2m .
This should sovle your issue.
this page, solved my issue.
in nutshell:
add this configuration to your database:
static {
OGlobalConfiguration.NON_TX_RECORD_UPDATE_SYNCH.setValue(true); //Executes a synch against the file-system at every record operation. This slows down records updates but guarantee reliability on unreliable drives
OGlobalConfiguration.STORAGE_LOCK_TIMEOUT.setValue(300000);
OGlobalConfiguration.RID_BAG_EMBEDDED_TO_SBTREEBONSAI_THRESHOLD.setValue(-1);
OGlobalConfiguration.FILE_LOCK.setValue(false);
OGlobalConfiguration.SBTREEBONSAI_LINKBAG_CACHE_SIZE.setValue(5000);
OGlobalConfiguration.INDEX_IGNORE_NULL_VALUES_DEFAULT.setValue(true);
OGlobalConfiguration.MEMORY_CHUNK_SIZE.setValue(32);
long maxMemory = Runtime.getRuntime().maxMemory();
long maxMemoryGB = (maxMemory / 1024L / 1024L / 1024L);
maxMemoryGB = maxMemoryGB < 1 ? 1 : maxMemoryGB;
long cacheSizeMb = 2 * 65536 * maxMemoryGB / 1024;
long maxDirectMemoryMb = VM.maxDirectMemory() / 1024L / 1024L;
String cacheProp = System.getProperty("storage.diskCache.bufferSize");
if (cacheProp==null) {
long maxDirectMemoryOrientMb = maxDirectMemoryMb / 3L;
long cachSizeMb = cacheSizeMb > maxDirectMemoryOrientMb ? maxDirectMemoryOrientMb : cacheSizeMb;
cacheSizeMb = (long)Math.pow(2, Math.ceil( Math.log(cachSizeMb)/ Math.log((2))));
System.setProperty("storage.diskCache.bufferSize",Long.toString(cacheSizeMb));
// the command below generates a NullPointerException in Orient 2.2.15-snapshot
// OGlobalConfiguration.DISK_CACHE_SIZE.setValue(cacheSizeMb);
}
}