Citus write activity after ingestion is completed

I have a Citus cluster of 1 coordinator node (32 vCores, 64 GB RAM) and 6 worker nodes (4 cores, 32 GB RAM each).
After ingesting data using the following command, where the chunks_0 directory contains 300 files of 1M records each:
find chunks_0/ -type f | time xargs -n1 -P24 sh -c "psql -d citus_testing -c \"\\copy table_1 from '/home/postgres/\$0' with csv HEADER;\""
I notice that after the ingestion is done, there is still write activity on the worker nodes at a lower rate (around 800 MB/sec overall during ingestion versus around 80-100 MB/sec afterwards) for some time.
I'm wondering: what is Citus doing during this time?

If you do not run any queries in that time period, I do not think Citus is responsible for any writes. It is possible that PostgreSQL ran autovacuum. You can check the PostgreSQL logs on the worker nodes and see for yourself.
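To check the autovacuum theory, here is a minimal sketch you could run on each worker (an assumption: that you can connect with psql to the workers' citus_testing database directly; pg_stat_progress_vacuum needs PostgreSQL 9.6 or later):
# on each worker: recent autovacuum/autoanalyze per shard table, plus any vacuum currently running
psql -d citus_testing -c "SELECT relname, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables ORDER BY greatest(last_autovacuum, last_autoanalyze) DESC NULLS LAST LIMIT 20;"
psql -d citus_testing -c "SELECT * FROM pg_stat_progress_vacuum;"
If those timestamps line up with the post-ingestion write burst, the extra 80-100 MB/sec is most likely just autovacuum/autoanalyze catching up on the freshly loaded shards.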

Related

Postgres analyze takes time after upgrade 11.12 to 13.4

Issue: ANALYZE running for hours.
I kept maintenance work memory at 1 GB and maintenance workers at 4. DB size is 40 GB.
We upgraded Postgres from 11.12 to 13.4.
Post upgrade, I am running ANALYZE with the statement below, and the job has been running for hours (4 hours and still going).
Any input on these unusually long run times?
Note:
Command I used -> VACUUM (VERBOSE, ANALYZE, PARALLEL 4)
Tracking via the statement below:
select * from pg_stat_progress_analyze
From this view, I can see that about 250 blocks are scanned per second.
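For what it's worth, a slightly richer version of that tracking query can turn the block counters into a rough percentage (a sketch assuming the PostgreSQL 13 pg_stat_progress_analyze columns; the database name is a placeholder):
psql -d <your_db> -c "SELECT pid, relid::regclass AS table_name, phase, sample_blks_scanned, sample_blks_total, round(100.0 * sample_blks_scanned / nullif(sample_blks_total, 0), 1) AS pct_sampled FROM pg_stat_progress_analyze;"
Note that the VACUUM part of the command reports progress in pg_stat_progress_vacuum, not here. As a rough sanity check: 250 blocks per second with the default 8 kB block size is only about 2 MB/s, so a pass over a 40 GB database at that rate would indeed take on the order of 5-6 hours.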

faster mongoimport, in parallel in airflow?

tl;dr: there seems to be a limit on how fast data is inserted into our mongodb atlas cluster. Inserting data in parallel does not speed this up. How can we speed this up? Is our only option to get a larger mongodb atlas cluster with more Write IOPS? What even are write IOPS?
We replace and re-insert more than 10 GB of data daily into our MongoDB Atlas cluster. We have the following two bash commands, wrapped in Python functions to help parameterize them, that we use with the BashOperator in Airflow:
upload single JSON to mongo cluster
def mongoimport_file(mongo_table, file_name):
    # upload single file from /tmp directory into Mongo cluster
    # cleanup: remove .json in /tmp at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && mongoimport --uri "{uri}" --collection {mongo_table} --drop --file /tmp/{file_name}.json \
    && echo AND REMOVE LOCAL FILEs... \
    && rm /tmp/{file_name}.json
    """
upload directory of JSONs to mongo cluster
def mongoimport_dir(mongo_table, dir_name):
    # upload directory of JSONs into mongo cluster
    # cleanup: remove directory at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && cat /tmp/{dir_name}/*.json | mongoimport --uri "{uri}" --collection {mongo_table} --drop \
    && echo AND REMOVE LOCAL FILEs... \
    && rm -rf /tmp/{dir_name}
    """
These are called in Airflow using the BashOperator:
import_to_mongo = BashOperator(
    task_id=f'mongo_import_v0__{this_table}',
    bash_command=mongoimport_file(mongo_table='tname', file_name='fname')
)
Both of these work, although with varying performance:
mongoimport_file with one 5 GB file: takes ~30 minutes to mongoimport
mongoimport_dir with 100 x 50 MB files: takes ~1 hour to mongoimport
There is currently no parallelization with mongoimport_dir, and in fact it is slower than importing just a single file.
Within Airflow, is it possible to parallelize the mongoimport of our directory of 100 JSONs to achieve a major speedup? If there's a parallel solution using Python's pymongo that cannot be done with mongoimport, we're happy to switch (although we'd strongly prefer to avoid loading these JSONs into memory).
What is the current bottleneck with importing to mongo? Is it (a) CPUs in our server / docker container, or (b) something in our mongo cluster configuration (cluster RAM, cluster vCPU, cluster max connections, or cluster read/write IOPS, whatever those even are)? For reference, here is our mongo config. I assume we can speed up our import by getting a much bigger cluster, but MongoDB Atlas becomes very expensive very fast: 0.5 vCPUs doesn't sound like much, but this already runs us $150 / month...
First of all, regarding "What is the current bottleneck with importing to mongo?" and "Is it (a) CPUs in our server / docker container": don't believe anyone who claims they can answer that from the screenshot you provided.
Atlas has monitoring tools that will tell you whether the bottleneck is CPU, RAM, disk or network, or any combination of those, on the database side.
On the client side (Airflow), use the system monitor of your host OS to answer the question, and test disk I/O inside Docker: some combinations of host OS and Docker storage drivers have performed quite poorly in the past.
Next, "What even are write IOPS" - random
write operations per second
https://cloud.google.com/compute/docs/disks/performance
IOPS calculations differ depending on the cloud provider. Try AWS and Azure to compare cost vs speed. An M10 on AWS gives you 2 vCPUs, though again I doubt you can compare vendors 1:1. The good thing is it's on-demand, and it will cost you less than a cup of coffee to test and then delete the cluster.
Finally, "If there's a parallel solution using python's pymongo": I doubt it. mongoimport sends batches of 100,000 documents, so essentially it sends data as fast as the stream is consumed on the receiving end. The limitations on the client side could be network, disk, or CPU. If it is network or disk, a parallel import won't improve a thing. A multi-core system could benefit from a parallel import if mongoimport were using a single CPU and that were the limiting factor, but by default mongoimport uses all available CPUs: https://github.com/mongodb/mongo-tools/blob/cac1bfbae193d6ba68abb764e613b08285c6f62d/common/options/options.go#L302. You can hardly beat it with pymongo.
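If client-side CPU ever does turn out to be the limit, a parallel variant of the directory import could look like the sketch below (the URI, collection and directory are placeholders; --drop is intentionally omitted because every parallel worker would otherwise wipe the collection, so drop it once up front instead):
# hypothetical sketch: import the JSON files four at a time
find /tmp/dir_name -name '*.json' | xargs -P4 -I{} \
    mongoimport --uri "mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb" --collection tname --file {}
Whether this helps at all depends on the bottleneck described above: if the limit is the network, the disk, or the cluster's write IOPS, four imports at once will just split the same throughput.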

How does one monitor Postgres memory usage?

We are running Postgres in Kubernetes. Our Prometheus pod monitoring always shows that Postgres fills up the entire pod's memory in a shared or cached state.
free -h
total used free shared buff/cache available
Mem: 62G 1.7G 13G 11G 47G 48G
ps -u postgres o pid= | sed 's#.*#/proc/&/smaps#' | sed 's/ //g' | \
xargs grep -s ^Pss: | awk '{A+=$2} END{print "PSS: " A}'
PSS: 13220013 kb
ps -u postgres o pid,rss:8,cmd | awk 'NR>1 {A+=$2} {print} END{print "RSS: " A}' | tail -1
RSS: 38794236 kb
Correct me if I am wrong, but since the memory displayed by top and ps (RSS/RES) is shared memory, this means that Postgres isn't actually using all of that memory; it is only reserved for when Postgres needs it, and other processes can also use it. Some articles say that one needs to cat /proc/<PID>/smaps and check the PSS to find the actual memory usage of Postgres.
We recently got OOM errors, but we were unable to pick them up in our monitoring, because our pod memory monitoring always displays around 90% usage: it only monitors the RSS/RES memory, which includes the shared/cached memory as well. So we didn't see any increase in RAM when the OOM errors happened and our database went down. The errors were caused by a new query we introduced in our backend which used a large amount of memory per execution.
We have prometheus-postgres-exporter installed, which gives us good Postgres metrics, but it didn't show us that we had queries using large amounts of memory; maybe our Grafana dashboard is just missing that graph?
Because we are running Postgres in Kubernetes, exporting the PSS memory is a hassle, so it feels like I am missing something.
So how does one monitor the memory usage of Postgres? How does one pick up on queries using too much memory, or on Postgres using too much memory due to load?
I recommend that you disable memory overcommit by setting vm.overcommit_memory = 2 in /etc/sysctl.conf and running sysctl -p. Don't forget to set vm.overcommit_ratio appropriately, based on the RAM and swap that you have.
That should keep the OOM killer at bay.
Then you can examine /proc/meminfo to find if you are getting tight on memory:
CommitLimit shows how much memory the kernel is willing to hand out
Committed_AS shows how much memory is allocated by processes
Look at the kernel documentation for details.
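As a concrete illustration (a minimal sketch; the 80% ratio is only an assumed starting point you would tune to your actual RAM and swap, and in Kubernetes these sysctls apply to the node, not the pod):
# disable overcommit; tune vm.overcommit_ratio to your actual RAM/swap
cat >> /etc/sysctl.conf <<'EOF'
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
EOF
sysctl -p
# watch the headroom: Committed_AS creeping up towards CommitLimit means you are getting tight
grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo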

How to limit pg_dump's memory usage?

I have a ~140 GB Postgres DB on Heroku / AWS. I want to create a dump of it on an Azure Windows Server 2012 R2 virtual machine, as I need to move the DB into the Azure environment.
The DB has a couple of smaller tables, but mainly consists of a single table taking ~130 GB, including indexes. It has ~500 million rows.
I've tried to use pg_dump for this, with:
./pg_dump -Fc --no-acl --no-owner --host * --port 5432 -U * -d * > F:/051418.dump
I've tried various Azure virtual machine sizes, including some fairly large ones (D12_v2: 28 GB RAM, 4 vCPUs, 12000 max IOPS, etc.), but in all cases pg_dump stalls completely due to memory swapping.
On the above machine it is currently using all available memory and has spent the past 12 hours swapping to disk. I don't expect it to complete, due to the swapping.
From other posts I've understood it could be an issue with the network speed being much faster than the disk I/O speed, causing pg_dump to suck up all available memory and more, so I've tried using the Azure machine with the most IOPS. This hasn't helped.
So is there another way I can force pg_dump to cap its memory usage, or to wait to pull more data until it has written to disk and freed memory?
Looking forward to your help!
Krgds.
Christian

performance issue during mongodump

We operate a server for our customer with a single mongo instance, Gradle, Postgres and nginx running on it. The problem is that we have massive performance problems while mongodump is running: the mongo queue grows and no data can be queried. The other problem is that the customer does not want to invest in a replica set or a software update (mongod 3.x).
Does anybody have an idea how I could improve the performance?
command to create the dump:
mongodump -u ${MONGO_USER} -p ${MONGO_PASSWORD} -o ${MONGO_DUMP_DIR} -d ${MONGO_DATABASE} --authenticationDatabase ${MONGO_DATABASE} > /backup/logs/mongobackup.log
tar cjf ${ZIPPED_FILENAME} ${MONGO_DUMP_DIR}
System:
6 Cores
36 GB RAM
1TB SATA HDD
+ 2TB (backup NAS)
MongoDB 2.6.7
Thanks
Best regards,
Markus
As you have a heavy load, adding a replica set is a good solution, since the backup could then be taken on a secondary node. Be aware, though, that a replica set needs at least three servers (you can have a primary/secondary/arbiter setup, where the arbiter needs only a small amount of resources).
mongodump takes a general query lock, which will have an impact if there are a lot of writes to the dumped database.
Hint: try to take the backup when there is light load on the system.
Try volume snapshots. Check with your cloud provider what options are available for taking snapshots. It is super fast and cheaper if you compare it with the actual resources used in taking a backup (RAM and CPU, plus transaction cost on an HDD, even if it is small).
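Building on the light-load hint, one cheap mitigation (not from this thread, just a hedged sketch: the script path and schedule are placeholders, and ionice/nice only lower the dump's local I/O and CPU priority, they do not remove mongodump's query load on mongod):
# hypothetical cron entry: run the existing dump script at 03:00 with low CPU and I/O priority
0 3 * * * ionice -c3 nice -n19 /backup/scripts/mongo_backup.sh >> /backup/logs/mongobackup.log 2>&1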