Ballpark value for `--jobs` in `pg_restore` command - postgresql

I'm using pg_restore to restore a database to its original state for load testing. I see it has a --jobs=number-of-jobs option.
How can I get a ballpark estimate of what this value should be on my machine? I know this is dependent on a bunch of factors, e.g. machine, dataset, but it would be great to get a conceptual starting point.
I'm on a MacBook Pro so maybe I can use the number of physical CPU cores:
sysctl -n hw.physicalcpu
# 10

If there is no concurrent activity at all, don't exceed the number of parallel I/O requests your disk can handle. This applies to the COPY part of the restore (loading the table data), which is likely I/O bound. But a restore also creates indexes, which uses CPU time (in addition to I/O). So you should also not exceed the number of CPU cores available.
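As a concrete starting point on a 10-core machine like that, you could begin with one job per physical core and then adjust up or down while watching CPU and disk utilization (the database name and dump file here are placeholders):
pg_restore --jobs="$(sysctl -n hw.physicalcpu)" --dbname=loadtest_db /path/to/backup.dump
Note that parallel restore only works with custom-format (-Fc) or directory-format (-Fd) archives; a plain SQL dump cannot be restored in parallel.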

Related

PostgreSQL use of Linux hugepages

I am having some trouble understanding Linux huge pages within PostgreSQL. From what I have googled:
Configured huge pages will be allocated and not swapped out of RAM.
Huge pages may improve performance due to a lower number of pages to be managed by the kernel.
Individual huge pages are contiguous blocks of memory.
My Postgres cluster conf:
shared_buffers set to 4GB
max_connections 30
all other conf is set as default
Huge pages are configured, but on startup PostgreSQL defaulted to not using them. After setting huge_pages = on in PostgreSQL and vm.nr_hugepages = 4096/2 in Linux (applied with sysctl -p), PostgreSQL wouldn't start because it couldn't allocate enough memory. After trying several vm.nr_hugepages values, the lowest that seemed to work was 3500.
My questions are:
How does PostgreSQL calculate the amount of memory it will need in advance?
I read that a memory limit should be set for PostgreSQL after setting huge pages. If it is only going to use huge pages, wouldn't this already be a limit? (Assuming no other process uses them.)
Related to the previous question: what would happen if it ran out of huge pages?
Will huge pages be used by all PostgreSQL processes, for a sort operation for instance? Having to allocate 3500 pages for PostgreSQL to start OK is a bit confusing, because I thought they would mainly be used by shared_buffers.
Thanks in advance!
Edit:
The system I am testing on is an Intel 8-core with 32GB RAM. The main purpose of the setup is ETL: receiving files that will be loaded with COPY (several GB per file), transforming them with more or less complex SQL, and persisting the results in several tables with a design similar to a DWH (star schemas). There is 40TB of total storage for PostgreSQL (4x10TB drives) and a 512GB SSD for Ubuntu Server 18.04. Some of the transformations will require table scans or scans of a big part of the tables; in DB2 I have the option of using a block-based buffer pool (https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.perf.doc/doc/c0009651.html). I thought I could achieve something "similar" with huge pages in PostgreSQL. Would this be possible?
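For reference, a minimal sketch of the sizing steps described above, assuming 2 MB huge pages and the 4 GB shared_buffers from the question (the data directory path is illustrative). PostgreSQL's total shared memory request is somewhat larger than shared_buffers alone, which is part of why 4096/2 = 2048 pages was not enough:
grep Hugepagesize /proc/meminfo          # typically 2048 kB
# with huge_pages = try, start PostgreSQL and check the postmaster's peak memory:
grep ^VmPeak /proc/$(head -1 /var/lib/postgresql/data/postmaster.pid)/status
# divide VmPeak by the huge page size to get a minimum vm.nr_hugepages,
# then set it (e.g. vm.nr_hugepages = 3500 in /etc/sysctl.conf) and apply:
sysctl -p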

Mongodb in Docker: numactl --interleave=all explanation

I'm trying to create a Dockerfile for an in-memory MongoDB based on the official repo at https://hub.docker.com/_/mongo/.
In dockerfile-entrypoint.sh I've encountered:
numa='numactl --interleave=all'
if $numa true &> /dev/null; then
    set -- $numa "$@"
fi
Basically it prepends numactl --interleave=all to the original docker command, when numactl exists.
But I don't really understand this NUMA policy thing. Can you please explain what NUMA really means, and what --interleave=all stands for?
And why do we need to use it to create MongoDB instance?
The man page mentions:
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
This isn't available for all architectures, which is why issue 14 made sure to call numactl only on a NUMA machine.
As explained in "Set default numa policy to “interleave” system wide":
It seems that most applications that recommend explicit numactl definition either make a libnuma library call or incorporate numactl in a wrapper script.
The interleave=all policy alleviates the kind of issue met by apps like Cassandra (a distributed database for managing large amounts of structured data across many commodity servers):
By default, Linux attempts to be smart about memory allocations such that data is close to the NUMA node on which it runs. For big database type of applications, this is not the best thing to do if the priority is to avoid disk I/O. In particular with Cassandra, we're heavily multi-threaded anyway and there is no particular reason to believe that one NUMA node is "better" than another.
Consequences of allocating unevenly among NUMA nodes can include excessive page cache eviction when the kernel tries to allocate memory - such as when restarting the JVM.
For more, see "The MySQL “swap insanity” problem and the effects of the NUMA architecture"
Without NUMA
In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward.
The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node.
With NUMA:
An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:
# numactl --interleave all command
I mentioned in the comments that NUMA enumerates the hardware to understand the physical layout, and then divides the processors (not cores) into “nodes”.
With modern PC processors, this means one node per physical processor, regardless of the number of cores present.
That is a bit of an over-simplification, as Hristo Iliev points out:
AMD Opteron CPUs with larger number of cores are actually 2-way NUMA systems on their own with two HT (HyperTransport)-interconnected dies with own memory controllers in a single physical package.
Also, Intel Haswell-EP CPUs with 10 or more cores come with two cache-coherent ring networks and two memory controllers and can be operated in a cluster-on-die mode, which presents itself as a two-way NUMA system.
It is wiser to say that a NUMA node is some cores that can reach some memory directly without going through a HT, QPI (QuickPath_Interconnect), NUMAlink or some other interconnect.
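To see how a particular box is divided into nodes, numactl can dump the layout; the two-node output shown below is just illustrative:
numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 65536 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 65536 MB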

How to keep 32 bit mongodb memory usage down on changing dataset

I'm using MongoDB on a 32 bit production system, which sucks but it's out of my control right now. The challenge is to keep the memory usage under ~2.5GB since going over this will cause 32 bit systems to crash.
According to the MongoDB team, the best way to track the memory usage is to use your operating system's process tracking system (i.e. ps or htop on Unix systems, Process Explorer on Windows) for virtual memory size.
The DB mainly consists of one table which is continually cycling data, i.e. receiving data at regular intervals from sensors, and every day a cron job wipes all data from before the last 3 days. Over a period of time, the memory usage slowly increases. I took some notes over time using db.serverStats(), db.lectura.totalSize() and ps, shown in the chart below. Note that the size of the table in question has reduced in the last month but the memory usage increased nonetheless.
Now, there is some scope for adjustment in how many days of data I store. Today I deleted basically half of the data, and then restarted mongodb, and yet the mem virtual / mem mapped and most importantly memory usage according to ps have hardly changed! Why do these not reduce when I wipe data (and restart)? I read some other questions where people said that mongo isn't really using all the memory that it might appear to be using, and that you can't clear the cache or limit memory use. But then how can I ensure I stay under the 2.5GB limit?
Unless there is a way to stem this gradual increase in memory usage, irrespective of dataset size, it seems to me that the 32-bit version of Mongo is unusable. Note: I don't mind losing a bit of performance if it solves the problem.
To answer regarding why the mapped and virtual memory usage does not decrease with the deletes, the mapped number is actually what you get when you mmap() the entire set of data files. This does not shrink when you delete records, because although the space is freed up inside the data files, they are not themselves reduced in size - the files are just more empty afterwards.
Virtual will include journal files, and connections, and other non-data related memory usage also, but the same principle applies there. This, and more, is described here:
http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
So, the 2GB storage size limitation on 32-bit will actually apply to the data files whether or not there is data in them. To reclaim deleted space, you will have to run a repair. This is a blocking operation and will require the database to be offline/unavailable while it is run. It will also need up to 2x the original size in free disk space to be able to run, since it essentially represents writing out the files again from scratch.
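For reference, the repair can be run offline against the data directory, or per database from the mongo shell (the dbpath below is an assumption):
mongod --repair --dbpath /data/db
# or, with the server running and the database otherwise idle:
db.repairDatabase()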
This limitation, and the problems it causes, are why the 32-bit version should not be run in production; it is just not suitable. I would recommend getting onto a 64-bit version as soon as possible.
By the way, neither of these figures (mapped or virtual) actually represents your resident memory usage, which is what you really want to look at. The best way to do this over time is via MMS, which is the free monitoring service provided by 10gen - it will graph virtual, mapped and resident memory for you over time as well as plenty of other stats.
If you want an immediate view, run mongostat and check out the corresponding memory columns (res, mapped, virtual).
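For example, sampling once per second and watching those columns:
mongostat 1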
In general, when using 64-bit builds with essentially unlimited storage, the data will usually greatly exceed the available memory. Therefore, mongod will use all of the available memory it can in terms of resident memory (which is why you should always have swap configured, so the OOM killer does not come into play).
Once that is used, the OS does not stop allocating memory, it will just have the oldest items paged out to make room for the new data (LRU). In other words, the recycling of memory will be done for you, and the resident memory level will remain fairly constant.
Your options for stretching 32-bit are limited, but you can try some things. The thing that you run out of is address space, and the increases in the sizes of additional database files mean that you would like to avoid crossing over the boundary from "n" files to "n+1". It may be worth structuring your data into more or fewer databases so that you can get the maximum amount of actual data into memory and as little as possible "dead space".
For example, if your database named "mydatabase" consists of the files mydatabase.ns (the namespace file) at 16 MB, mydatabase.0 at 64 MB, mydatabase.1 at 128 MB and mydatabase.2 at 256 MB, then the next file created for this database will be mydatabase.3 at 512 MB. If instead of adding to mydatabase you instead created an additional database "mynewdatabase" it would start life with mynewdatabase.ns at 16 MB and mynewdatabase.0 at 64 MB ... quite a bit smaller than the 512 MB that adding to the original database would be. In fact, you could create 4 new databases for less space than would be consumed by adding a new file to the original database, and because the files are smaller they would be easier to fit into contiguous blocks of memory.
It is well known that the 32-bit version should not be used in production.
Use 64-bit systems.
Period.

MongoDB Sharding On One Machine

Does it make sense to implement MongoDB sharding with, say, 100 shards on one beefier machine just to achieve higher concurrent writes into the database, since I am told there is a global lock for each mongod.exe process? Assuming that is possible, will that approach give me higher write concurrency?
Running multiple mongods on a machine is not a good idea. Every one of the mongod processes will try to use all the available memory, forcing the other mongods' memory-mapped pages out of memory. This will create an enormous amount of swapping in most cases.
The global database lock is generally not a problem as is demonstrated in: http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
Only use one mongod per machine (but it's fine to add a mongos or config server as well), unless it's for some simple testing.
cheers,
Derick
I totally disagree. We run 8 shards per box in our setup. It consists of two head nodes, each with two other machines for replication; 6 boxes in total. These are beefy boxes with about 120GB of RAM, 32 cores and 2TB each. By having 8 shards per box (we could go higher, by the way; this is set at 8 for historical reasons) we make sure we utilize the CPU efficiently. The RAM sorts itself out. You do have to watch the metrics and make sure you aren't paging too much, but with SSD drives (which we have), if you do spill onto the disk drives it isn't too bad.
The only use case where I found running several mongod processes on the same server useful was to increase replication speed over a high-latency connection.
As highlighted by Derick, the write lock is not really your issue when running mongodb.
To answer your question: yes, you can demonstrate Mongo scaling with several instances per machine (4 instances per server seems to be enough), as long as your test does not involve too much data (otherwise page-outs will dramatically decrease your performance; I have already tested it).
However, instances will still compete for resources. All you will manage to do is to shift the database lock issue to a resource lock issue.
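If you do want to try several instances per machine for a simple test, a minimal sketch looks like this (ports and paths are assumptions, and newer MongoDB versions additionally require the config servers to run as a replica set):
mongod --shardsvr --port 27018 --dbpath /data/shard1 --fork --logpath /data/shard1.log
mongod --shardsvr --port 27019 --dbpath /data/shard2 --fork --logpath /data/shard2.log
mongod --configsvr --port 27020 --dbpath /data/config --fork --logpath /data/config.log
mongos --configdb localhost:27020 --port 27017 --fork --logpath /data/mongos.log
# then, from the mongos shell:
# sh.addShard("localhost:27018"); sh.addShard("localhost:27019")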
Yes, you can, and in fact that's what we do for a 50+ mil write-heavy database. Just make sure all your indexes per mongod fit into the RAM and there's room for growth and maintenance.
However, there's a small trade-off: depending on what your target QPS is, this kind of sharding requires machines with more horsepower, whereas sharding across multiple machines does not, and in most cases you can make do with commodity, cheaper hardware.
Whatever the case is, do a series of performance tests (against IO, network, QPS, etc.) and establish your baseline carefully, and consider SSD drives for storage. This may sound biased, but Linux XFS storage is also something to consider.

Can memcached make full use of multi-core?

Is memcached capable of making full use of multi-core? Or is there any way of tuning this?
memcached has "-t" option:
-t <threads>
Number of threads to use to process incoming requests. This option is only meaningful
if memcached was compiled with thread support enabled. It is typically not useful to
set this higher than the number of CPU cores on the memcached server. The default is
4.
So, I believe it can use all your CPU cores, provided of course it was compiled with the corresponding option.
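For example, on an 8-core box you could start it with 8 worker threads (the memory size and port below are just illustrative defaults):
memcached -d -m 1024 -p 11211 -t 8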
memcached is multi-threaded by default and has no problem saturating many cores. It's a bit harder to saturate all cores on more massively parallel boxes (e.g. a 256-core CMT box) just because it gets harder to get the data in and out of the network.
If you find areas where some sort of contention is preventing you from saturating cores, file a bug or start a discussion.
Based on this research by Intel, Memcached v1.6 beta cannot scale well on a multicore system. Their experiments show that as core counts increase from 1 to 8, maximum throughput (with a median RTT < 1ms SLA) only doubles.
CAREFUL. This terminology is quite confusing. The memcached man page says the -t option is only good up to the number of cores. However, this is odd because threads and processes are VERY different. Threads have NOTHING to do with the number of cores. Processes can definitely run on more than one core, while threads cannot (unless they call into an OS routine; then they can thread-switch and pack in more than 100% CPU usage). Threads share memory and just depend on an instruction pointer to differentiate who is who. Processes share nothing unless it is explicitly declared as shared ahead of time, and sharing occurs via the OS.
Overall, I want MORE CLARITY from the memcached people about whether their app is multiprocessing or multithreaded, and thus whether it can use more than 100% of a CPU.