We use distributed index:
index ind_mm {
type = distributed
agent = 192.168.0.11:9318:ind_1
agent = 192.168.0.11:9319:ind_2
agent = 192.168.0.22:9317:ind_3
agent = 192.168.0.22:9317:ind_4
}
Are any limitations on count of agents? When the overhead well be too high? If we will use about 100 agnets, will be good performance? 1000 angetns?
Well there probably isnt much point having more agents that CPU cores have on the machine.
If have more agents, they pretty much guaranteed to queue up on cores. Ie the queries wont actually run in parallel. (well they might, due to process switching)
There is no guarantee that having events equal number to number of cores, will mean one part per core (its upto the OS to distribute processes to cores) - but in theory it COULD be.
The instance running the distributed index, will use a core, so in an ideal situation have one less agent than cores.
Related
Yet I can not find any reliable recommendation regarding the optimal value for max_worker_processes.
Some sources suppose that the value should not be higher than the number of available cores, but is that correct taking that server threads do a lot of IO?
Say I have 8 cores for PG container and plan to handle about 100 clients in parallel. Is that feasible, especially with the default max_worker_processes=8 ?
Any trusted reference would be much appreciated.
The reasonable limit dies not depend on the number of client connections, but on the actual upper limit on concurrent queries.
If it is guaranteed that only one of these clients will ever be active at the same time, you could set max_worker_processes, max_parallel_workers and max_parallel_workers_per_gather one less than the number of cores or parallel I/O operations that your storage can handle, whatever of the two is smaller. In essence, one query can then consume all the available resources.
On the other hand, if many of these clients are likely to run queries concurrently, you should disable parallel query by setting max_parallel_workers_per_gather to 0 to avoid overloading your database.
Concerning your comments to my answer: if your goal is to limit the number of active queries, use a connection pool.
Vert.x seems to create upto 2 * NUM_OF_CORES event loop threads by default.
And this seems to be a fairly old change (7 years+)
On a machine with 4 physical cores (8 logical cores with hyper-threading), it creates 16 event loop threads.
Shouldn't NUM_OF_CORES (i.e., 8 in above example) number of event loop threads be ideal?
Only explaination I could find was from Tim Fox (original author of vertx):
we use 2 * number of cores by default - in practice this gives better results as OSes don't always distribute threads evenly across cores.
But a few load tests I did gave better results when I used 8 instead of 16. So want to understand under what conditions should the default give better results?
In optimal CPU bound calculations having about the same number of threads and logical cores is a good practice because we want out thread to use more CPU power as possible without interfering with other threads.
Usually Vert.x is not used for CPU intensive computations; for the most common usecases of Vert.x it might be helpful to have some more threads ready for beign used when needed, rather than having to create new ones on the go.
Why not using 10 * NUM_OF_CORES threads then? Because of the thread creation overhead and the risk of creating too many unused threads (that would lower the system performace). So this choise is (probably) the result of the tradeoff between thread responsiveness and waste of system resources.
Your benchmarks can produce bad results with 2 * NUM_OF_CORES for a variety or reasons, such as:
OS thread management (allocation time and context switches);
lack of system resouces (a lot of programs running with the one you are testing);
misuration issues (did the measure start before the thread allocation? did the test last for an amount of time that makes the thread creation time negligible?);
probably something else I can't figure out rn 😅
Hope it helped!
Let's say I'm running a Service Fabric cluster on 5 D1 class (1 core, 3.5GB RAM, 50GB SSD) VMs. and that I'm running 2 reliable services on this cluster, one stateless and one stateful. Let's assume that the replica target is 3.
How to calculate how much can my reliable collections hold?
Let's say I add one or more stateful services. Since I don't really know how the framework distributes services do I need to take most conservative approach and assume that a node may run all of my stateful services on a single node and that their cumulative memory needs to be below the RAM available on a single machine?
TLDR - Estimating the expected capacity of a cluster is part art, part science. You can likely get a good lower bound which you may be able to push higher, but for the most part deploying things, running them, and collecting data under your workload's conditions is the best way to answer this question.
1) In general, the collections on a given machine are bounded by the amount of available memory or the amount of available disk space on a node, whichever is lower. Today we keep all data in the collections in memory and also persist it to disk. So the maximum amount that your collections across the cluster can hold is generally (Amount of available memory in the cluster) / (Target Replica Set Size).
Note that "Available Memory" is whatever is left over from other code running on the machines, including the OS. In your above example though you're not running across all of the nodes - you'll only be able to get 3 of them. So, (unrealistically) assuming 0 overhead from these other factors, you could expect to be able to put about 3.5 GB of data into that stateful service replica before you ran out of memory on the nodes on which it was running. There would still be 2 nodes in the cluster left empty.
Let's take another example. Let's say that it is about the same as your example above, except in this case you set up the stateful service to be partitioned. Let's say you picked a partition count of 5. So now on each node, you have a primary replica and 2 secondary replicas from other partitions. In this case, each partition would only be able to hold a maximum of around 1.16 GB of state, but now overall you can pack 5.83 GB of state into the cluster (since all nodes can now be utilized fully). Incidentally, just to prove out the math works, that's (3.5 GB of memory per node * 5 nodes in the cluster) [17.5] / (target replica set size of 3) = 5.83.
In all of these examples, we've also assumed that memory consumption for all partitions and all replicas is the same. A lot of the time that turns out to not be true (at least temporarily) - some partitions can end up with more or less work to do and hence have uneven resource consumption. We also assumed that the secondaries were always the same as the primaries. In the case of the amount of state, it's probably fair to assume that these will track fairly evenly, though for other resource consumption it may not (just something to keep in mind). In the case of uneven consumption, this is really where the rest of Service Fabric's Cluster Resource Management will help, since we can come to know about the consumption of different replicas and pack them efficiently into the cluster to make use of the available space. Automatic reporting of consumption of resources related to state in the collections is on our radar and something we want to do, so in the future, this would be automatic but today you'd have to report this consumption on your own.
2) By default, we will balance the services according to the default metrics (more about metrics is here). So by default, the different replicas of those two different services could end up on the machine, but in your example, you'll end up with 4 nodes with 1 replica from a service on it and then 1 node with two replicas from the two different services. This means that each service (each with 1 partition as per your example) would only be able to consume 1.75 GB of memory in each service for a total of 3.5 GB in the cluster. This is again less than the total available memory of the cluster since there are some portions of nodes that you're not utilizing.
Note that this is the maximum possible consumption, and presuming no consumption outside the service itself. Taking this as your maximum is not advisable. You'll want to reduce it for several reasons, but the most practical reason is to ensure that in the presence of upgrades and failures that there's sufficient available capacity in the cluster. As an example, let's say that you have 5 Upgrade Domains and 5 Fault Domains. Now let's say that a fault domain's worth of nodes fails while you have an upgrade going on in an upgrade domain. This means that (a little less than) 40% of your cluster capacity can be gone at any time, and you probably want enough room left over on the remaining nodes to continue. This means that if your cluster previously could hold 5.83 GB of state (from our prior calculations), in reality you probably don't want to put more than about 3.5 GB of state in it since with more of that the service may not be able to get back to 100% healthy (note also that we don't build replacement replicas immediately so the nodes would have to be down for your ReplicaRestartWaitDuration before you ran into this case). There's a bunch more information about metrics, capacity, buffered capacity (which you can use to ensure that room is left on nodes for the failure cases) and fault and upgrade domains are covered in this article.
There are some other things that practically will limit the amount of state you'll be able to store. You'll want to do several things:
Estimate the size of your data. You can make a reasonable estimate up-front of how big your data is by calculating the size of each field your objects hold. Be sure to take into consideration 64-bit references. This will give you a lower-bound starting point.
Storage overhead. Each object you store in a collection will come with some overhead for storing that object. In the reliable collections depending on the collection and the operations currently in flight (copy, enumerations, updates, etc.) this overhead can range from between 100 and around 700 bytes per item (row) stored in the collections. Do know also that we're always looking for ways to reduce the amount of overhead we introduce.
We also strongly recommend running your service over some period of time and measuring actual resource consumption via performance counters. Simulating some sort of real workload and then measuring the actual usage of the metrics you care about will serve you pretty well. The reason we recommend this in particular is that you will be able to see consumption from things like which CLR object heap your objects end up placed in, how often GC is running, if there's leaks, or other things like this which will impact the amount of memory you can actually utilize.
I know that this has been a long answer but I hope you find it helpful and complete.
I have a few questions about HPC. I have a code with serial and parallel sections. Parallel sections work on different chunks of memory and at some point they communicate with each other. For this I used MPI on our cluster. SLURM is the resource manager. Below is the specifications of a node in a cluster.
Specifications of a node:
Processor: 2x Intel Xeon E5-2690 (totally 16 cores 32 thread)
Memory : 256 GB 1600MHz ECC
Disk : 2 x 600 GB 2.5" SAS (configured with raid 1)
Questions:
1) Do all cores on a node share the same memory (RAM)? If yes, do all of cores access memory at the same speed?
2) Consider a case:
--nodes = 1
--ntasks-per-node = 1
--cpus-per-task = 16 (all cores on a node)
If all cores share the same memory (depends on answer to question 1) will all cores be used or 15 of them sleep since OpenMP (for shared memory) is not used?
3) If required memory is less total memory of a node, isn't it much better to use a single node, use OpenMP to achieve core-level parallelism, and avoid time loss due to communication between nodes? That is, use this
--nodes = 1
--ntasks-per-core = 1
instead of this:
--nodes = 16
--ntasks-per-node = 1
Rest of the questions are related to statements in this link.
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
Does this statement mean that --ntasks-per-core is good when cores don't access RAM too often?
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
I just don't get this. What I know is all sockets and cores on sockets share the same memory. This is why I don't get why --ntasks-per-socket option is available?
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
Does this mean that, if memory required is more than total RAM of a single node then its better to use multiple nodes?
In order:
Yes, all cores share the same memory. But not usually at the same speed. Usually, each Processor (in your configuration, you have 2 processors or sockets) has memory that is 'closer' to it. Usually the Linux kernel will attempt to allocate memory on nearby memory. This is not something that a user application usually has to worry about.
If it is a serial job, then yes, 15 cores will sit idle. If your job uses MPI, then it can use the other cores on the same node. Actually, MPI on the same node usually much faster than MPI stretched across multiple nodes.
You can use OpenMP or MPI on a single node. I'm not sure about the speed difference, but if you are already familiar with MPI, I would just stick with that. The difference probably isn't that big. But, the difference between running MPI on a single node vs. multiple nodes is going to be large. Running MPI on a single node will be significantly faster than across multiple nodes.
Use core allocation if your application is CPU bound; the more processors you can throw at it the better!
This is likely targeting OpenMP or single node parallel jobs.
Use socket allocation if memory access is what bottlenecks your application’s performance. Since how much data can come in from memory is what limits the speed of the job, running more tasks on the same memory bus won’t result in speed-up since all of those tasks are fighting over the path to memory.
See the answer to 1. Though it is the same memory, cores usually have separate bus's to memory.
Use node allocation if some node-wide resource is what bottlenecks your application. This is the case with applications that are relying heavily on access to disk or to networks resources. Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
If you need more RAM than a single node can provide, then you have no choice but to divide your program and use MPI.
What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to be always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question - does your Scala code capable to fully utilize all cores available. Probabbly if you have good intrinsic synchronization between parts of the document to be processed or some other way to parralelyze algorithm without lock contention - then the "B"" is the way. If so - configure one mapper per node and let your mapper to utilize cores in a best way.
If your gain from the parralelization is not that good, and adding more threads (cores) to the processing does not improve performance in a linear way - then the "A" can be better way. Efficiency of "A" also depends on the size of your RAM - you will need enough ram for 10 mappers per node.
I can suspect that ideal solution can be somewhere in between. So my suggestion is to develop mapper which takes number of threads used as a parameter and then do a few tests increasing number of threads per mapper and decreasing number of mappers per node.
Hadoop is not very good for processing a lot of small files, but for processing a small amount of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. And also i don't think you should use hadoop only as a distribution mechanism, that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but not try to bend hadoop to your will.