I recently successfully experimented with Scala futures. I'm pleased as punch with the gains I'm seeing from the parallelism, but I'm only seeing 4 worker threads.
I've been looking all over for how I can crank up the number of threads to 11, but no luck. How can I do this?
Check out the source at http://www.scala-lang.org/api/current/index.html#scala.actors.scheduler.ResizableThreadPoolScheduler
There are some system properties: actors.corePoolSize and actors.maxPoolSize. Looks you can adjust those, and the scheduler for daemon-like things (including futures) will get a larger number. By default it picks up on the number of processors, I guess.
Related
Concurrency means the ability to allow more than one tasking process at a time
But where does threading fit in it?
What's the relation between threading and concurrency?
What is the important link between these two which will fully clear all the confusion?
Threads are one way to achieve concurrency. Concurrency can be achieved at many levels and in many ways. Here are some of them from low to high level to give you a rough idea:
CPU pipelines: at a hardware level, multiple instructions are executed in parallel (each instruction is at a different stage in the pipeline)
Duplication of ALU and FPU CPU units. There are more arithmetic-logic units and floating point units in a processor that can execute instructions in parallel.
vectorized instructions. Instructions which execute for multiple data.
hyperthreading/SMT. Duplication of the process context.
threads. Streams of instructions which can be executed in parallel.
processes. You run both a browser and a word processor on your system.
tasks. Higher abstraction over threads and async work.
multiple computers. Run your program on multiple computers
I'm new here but I don't really understand the down votes? Could someone explain it to me? Is it just because this question has (likely) been answered or because it's considered obvious?
Now that that's out of the way...
Nothing being executed on the CPU is from a "process" or anything else. They're all threads, scheduled and entirely managed by the kernel using a variety of algorithms to reach expected performance for any given application. The CPU only allows n threads, where n equals (cores * hyperthreads). In most cases hyperthreads will be 2 so you have double the core count to get logical CPU count. What this really means is that instead of 4 (for example) threads being run at once, it can support up to 8. Now the OS may have hundreds of threads at any given time, how is that possible? Well the kernel uses a variety of checks such as how frequently and long the thread sleeps to assign it a priority. Whenever the CPU triggers a timer interrupt the OS will swap out threads appropriately if they've reached their alotted time slice based on the OS determination of its priority.
How can i configure mongodb's pool connection for support 1100 threads per seconds?
I tried some configurations like bellow without sucess.
connectionsPerHost = 200
threadsAllowedToBlockForConnectionMultiplier = 5
Can someone help me?
Thanks.
It won't.
That number of threads may be prejudicial, there's a lot of techniques to calculate some ideal number, and none of them get even close to 1100. If you're looking to attend a large number of users you should work with server redundancy. You won't get speed because 99.9% (really) of your threads will be locked waiting for a resource become available.
I've worked with java in fast processing, using distributed systems and threads, we used 0mq(tcp alternative) to acelerate communication and get more use of the threads, but we found that moderate number of threads was the ideal (if I remember correctly, 12).
Instead of letting hundreds of threads do the job, try to keep a limited number of workers threads, you won't have more resources anyway. The ideal for this kind of application would be have many servers attending your users.
My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual core VMs, but performs poorly on a single 64 core machine. I presume this is because the common data object resides in one NUMA cell and many threads concurrently accessing from other cells is too much for the interconnects.
If I run 64 separate JVM applications each containing 1 actor then performance is is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
Update:
I'm using Oracle JDK 1.7.0_05, and Akka 2.1.4
I've now tried with the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on slow performance when using one or few JVMs. I've also tried using a PinnedDispatcher and the thre-pool-executor with no effect. I'm not sure if the configuration is having an effect though, since there seems nothing different in the startup logs.
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of min) before the FailureDector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase the ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.
I think if you sure that problems not in message processing algorithms then you should take in account not only NUMA option but whole env. configuration, starting from JVM version (latest is better, Oracle JDK also mostly performs better than OpenJDK) then JVM options (including GC, memory, concurrency options etc.) then Scala and Akka versions (latest release candidates and milestones can be much better) and also Akka configuration.
From here you can borrow all things that matter to got 50M messages per second of total throughput for Akka actors on contemporary laptops.
Never had chance to run these benchmarks on 64-core server - so any feedback will be greatly appreciated.
From my findings, which can help, current implementations of ForkJoinPool increases message send latency when number of threads in pool increases. It is greatly noticeable for cases when rate of response-request call between actors is high, e. g. on my laptop when increasing pool size from 4 to 64 message send latency of Akka actors for such cases grows up to 2-3x times for most executor services (Scala's ForkJoinPool, JDK's ForkJoinPool, ThreadPoolExecutor).
You can check if there are any differences by running mvnAll.sh with the benchmark.parallelism system variable set to different values.
I'm looking for any concrete info related to the number of background threads an NSOperationQueue with create given the NSOperationQueueDefaultMaxConcurrentOperationCount maximum concurrency setting.
I had assumed that some sort of load monitoring is employed to determine the most appropriate number of threads to spawn, plus this setting is recommended in the docs. What I'm finding is that the queue spawns roughly 100 background threads and my app (running on iPad 3 with iOS 5.1.1) crashes with SIGABRT. I've reduced this to a more acceptable number like 3 and everything is working fine.
Any comments or insight would be appreciated.
My experience matches yours (though not to 100 threads; do put in some instrumenting to make sure that you really have that many running simultaneously. I've never seen it go quite that high). Unless you manually manage the number of concurrent operations, NSOperationQueue will tend to generate too many concurrent operations. (I have yet to see anyone refute this with testable code rather than inferences from the documentation.) For anything that may generate a large number of potentially concurrent operations, I recommend setMaxConcurrentOperations. While not ideal, I often wind up using a function like this one to assist (this of course doesn't help you balance between queues, so is very sub-optimal):
unsigned int countOfCores() {
unsigned int ncpu;
size_t len = sizeof(ncpu);
sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0);
return ncpu;
}
I eagerly await anyone posting real code demonstrating NSOperationQueue automatically performing correct load balancing for CPU-bound operations. I've posted a sample gist demonstrating what I'm talking about. Without calling setMaxConcurrentOperations:, it will spawn about 6 parallel processes on a 2-core iPad 3. In this very simplistic case with no contention or shared resources, this adds about a 10%-15% overhead. In more complicated code with contention (and particularly if operations might be cancelled), it can slow things down by an order of magnitude.
assuming your threads are busy working, 100 active threads in one process on a dual-core iPad is unreasonable. each thread consumes a good amount of time and memory. having that many busy threads is going to slow things down on a dual-core.
regardless of whether you're doing something silly (like sleeping them all or adding run loops or just giving them nothing to do), this would be a bug.
From the documentation:
The default maximum number of operations is determined dynamically by the NSOperationQueue object based on current system conditions.
The iPad 3 has a powerful processor and 1Gb of ram. Since NSOperationQueue calculates the amount of thread based on system conditions, it's very likely that it determined to be able to run a large number of NSOperation based on the power available on that device. The reason why it crashed might not have to do with the amount of threads running simultaneously, but on the code being executed inside those threads. Check the backtrace and see if there is some condition or resource being shared among these treads.
Can anyone tell me if there is a way to find out the maximum number of threads that can run on different windows systems?
For example - (Assumption)A windows 32-bit system can run maximum 4000 threads.
I doubt there is a maximum number. Well, since we're using a finite amount of memory, it would be as many threads as you can fit into memory or as many as you can keep track of. Each system is different and I know Java and C don't have a function to provide this. C# can tell you how much memory a specific object/app needs so you could go calculate the estimate.
You could test this on your system. Write a sample app which spawns threads and see when you run out of memory. Use a counter to count them. This will give you roughly the range for your system.
In Java, you can use an ExecutorService with a thread pool.. Depending on which executor service you use, it can keep spawning threads if you submit more jobs.
A similar technique exists in C#.
A better question is what the maximum number of threads to spawn and avoid thrashing is.
Are you trying to take over the OS and do your own process/thread management? You should not be doing this.