Performance of a Sinatra app on JRuby

I have a Sinatra app which is performing considerably slower than I would like. My first suspicion was that my own code was the bottleneck, so I extracted it into a standalone benchmarking script:
require 'benchmark'

THREADS = 100
ITERATIONS = 1

def make_calls
  ITERATIONS.times do
    # ... my stuff here
  end
end

1.upto(THREADS) do |n|
  Benchmark.bm do |bm|
    threads = []
    n.times do
      threads << Thread.new { make_calls }
    end
    bm.report("#{n} threads:") { threads.each { |t| t.value } }
  end
end
Here make_calls invokes my own code. I'm pleased to say that by the time we reach 100 threads, the cumulative time of make_calls across all threads is 0.6 seconds, which is fast enough for my purposes. The reason I wrap make_calls in threads above is that my own code already uses threads (native Java threads via a java.util.concurrent fixed thread pool of 500, i.e. an ExecutorService), and I wanted to make sure this behaves nicely in an environment that potentially uses other threading models. A single iteration in a single thread runs in about 0.02 seconds once JRuby has warmed up.
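For reference, driving work through a fixed Java thread pool from JRuby looks roughly like the sketch below; do_work and the sizes are placeholders standing in for my actual code.
require 'java'

java_import java.util.concurrent.Executors
java_import java.util.concurrent.TimeUnit

pool = Executors.new_fixed_thread_pool(500)

# Each block is wrapped by JRuby as a java.util.concurrent task.
futures = 100.times.map do
  pool.submit { do_work }          # do_work is a placeholder for the real work
end

futures.each(&:get)                # block until every task has finished
pool.shutdown
pool.await_termination(30, TimeUnit::SECONDS)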
So the above is good, but when I add this to a Sinatra web server with the following:
require 'sinatra'

get '/' do
  # ... my stuff here
end
The response time for a request to this endpoint is approximately 0.5 seconds, and if I increase the number of concurrent requests the response time goes up linearly. I've tried this with both jetty-rackup and Trinidad, using JRuby 1.7 on both Linux and Solaris.
I have tried to tune the Trinidad instance to no avail (min/max runtimes etc.). The best performance we have seen comes from running either server in threadsafe mode (a single shared JRuby runtime), and both servers perform comparably in that mode.
Can anyone explain to me where the time is being consumed or how to improve this setup?

It doesn't sound like the server is the limiting factor. Perhaps it's a problem with Rack or Sinatra, but
# ... my stuff here
doesn't really give much away. Try using a profiling tool (VisualVM is fine to start with) to check whether you're allocating masses of objects you don't need, whether all your threads are waiting, etc., then change or eliminate what you think is the bottleneck.
Repeat this until you think it's fast enough ;)
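As a crude first pass before reaching for VisualVM, you could also time the work inside the route itself, to see whether the half second is spent in your own code or in the request handling around it. A minimal sketch, where do_work is a placeholder for "my stuff here":
require 'sinatra'

get '/' do
  started = Time.now
  result  = do_work                              # placeholder for "my stuff here"
  puts format('work took %.3fs', Time.now - started)
  result
end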

Related

Using libevent together with GCD (libdispatch) in Swift

I'm creating a server-side app in Swift 3. I've chosen libevent for implementing the networking code because it's cross-platform and doesn't suffer from the C10k problem. Libevent implements its own event loop, but I want to keep CFRunLoop and GCD (DispatchQueue.main.after etc.) functional as well, so I need to glue them together somehow.
This is what I've come up with:
var terminated = false

DispatchQueue.main.after(when: DispatchTime.now() + 3) {
    print("Dispatch works!")
    terminated = true
}

while !terminated {
    switch event_base_loop(eventBase, EVLOOP_NONBLOCK) { // libevent
    case 1:
        break // No events were processed
    case 0:
        print("DEBUG: Libevent processed one or more events")
    default: // -1
        print("Unhandled error in network backend")
        exit(1)
    }
    RunLoop.current().run(mode: RunLoopMode.defaultRunLoopMode,
                          before: Date(timeIntervalSinceNow: 0.01))
}
This works, but introduces a latency of 0.01 sec. While RunLoop is sleeping, libevent won't be able to process events. Lowering this timeout increases CPU usage significantly when the app is idle.
I was also considering using only libevent, but third party libs in the project can use dispatch_async internally, so this can be problematic.
Running libevent's loop in a different thread makes synchronization more complex, is this the only way of solving this latency issue?
LINUX UPDATE. The above code does not work on Linux (2016-07-25-a Swift snapshot); RunLoop.current().run exits with an error. Below is a working Linux version reimplemented with a timer and dispatch_main. It suffers from the same latency issue:
let queue = dispatch_get_main_queue()
let timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue)
let interval = 0.01

let block: () -> () = {
    guard !terminated else {
        print("Quitting")
        exit(0)
    }
    switch server.loop() {
    case 1: break  // Just idling
    case 0: break  // print("Libevent: processed event(s)")
    default: // -1
        print("Unhandled error in network backend")
        exit(1)
    }
}
block()

let fireTime = dispatch_time(DISPATCH_TIME_NOW, Int64(interval * Double(NSEC_PER_SEC)))
dispatch_source_set_timer(timer, fireTime, UInt64(interval * Double(NSEC_PER_SEC)), UInt64(NSEC_PER_SEC) / 10)
dispatch_source_set_event_handler(timer, block)
dispatch_resume(timer)
dispatch_main()
A quick search of the Open Source Swift Foundation libraries on GitHub reveals that the support in CFRunLoop is (perhaps obviously) implemented differently on different platforms. This means, in essence, that RunLoop and libevent, with respect to cross-platform-ness, are just different ways to achieve the same thing. I can see the argument that libevent is probably better suited to server implementations, since CFRunLoop didn't grow up with that specific goal, but as far as being cross-platform goes, they're both barking up the same tree.
That said, the underlying synchronization primitives used by RunLoop and libevent are inherently private implementation details and, perhaps more importantly, different between platforms. From the source, it looks like RunLoop uses epoll on Linux, as does libevent, but on macOS/iOS/etc, RunLoop is going to use Mach ports as its fundamental primitive, but libevent looks like it's going to use kqueue. You might, with enough effort, be able to make a hybrid RunLoopSource that ties to a libevent source for a given platform, but this would likely be very fragile, and generally ill-advised, for a couple of reasons: Firstly, it would be based on private implementation details of RunLoop that are not part of the public API, and therefore subject to change at any time without notice. Second, assuming you didn't go through and do this for every platform supported by both Swift and libevent, you would have broken the cross-platform-ness of it, which was one of your stated reasons for going with libevent in the first place.
One additional option you might not have considered would be to use GCD by itself, without RunLoops. Look at the docs for dispatch_main. In a server application, there's (typically) nothing special about a "main thread," so dispatching to the "main queue" should be good enough (if needed at all). You can use dispatch "sources" to manage your connections, etc. I can't personally speak to how dispatch sources scale up to the C10K/C100K/etc. level, but they've seemed pretty lightweight and low-overhead in my experience. I also suspect that using GCD like this would likely be the most idiomatic way to write a server application in Swift. I've written up a small example of a GCD-based TCP echo server as part of another answer here.
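To make the dispatch-source idea a little more concrete, here is a minimal sketch (not production code) of accepting connections with a read source on an already-created, non-blocking listening socket. Socket setup and error handling are omitted, and the method names follow the current Swift overlay (they differed slightly in the 2016 snapshots):
import Dispatch
#if os(Linux)
import Glibc
#else
import Darwin
#endif

// Accept connections with a dispatch "read" source on a listening socket.
func serve(on listenFD: Int32) -> DispatchSourceRead {
    let acceptSource = DispatchSource.makeReadSource(fileDescriptor: listenFD,
                                                     queue: DispatchQueue.global())
    acceptSource.setEventHandler {
        let clientFD = accept(listenFD, nil, nil)
        guard clientFD >= 0 else { return }
        // Hand clientFD to another read source or a worker queue here.
        close(clientFD)
    }
    acceptSource.resume()
    return acceptSource   // keep a reference so the source stays alive
}
// dispatchMain() then parks the main thread and lets GCD drive the sources.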
If you were bound and determined to use both RunLoop and libevent in the same application, it would, as you guessed, be best to give libevent its own separate thread, but I don't think it's as complex as you might think. You should be able to dispatch_async from libevent callbacks freely, and similarly marshal replies from GCD-managed threads to libevent using libevent's multi-threading mechanisms fairly easily (i.e. either by running with locking on, or by marshaling your calls into libevent as events themselves.) Similarly, third-party libraries using GCD should not be an issue even if you chose to use libevent's loop structure. GCD manages its own thread pools and would have no way of stepping on libevent's main loop, etc.
You might also consider architecting your application such that it didn't matter what concurrency and connection handling library you used. Then you could swap out libevent, GCD, CFStreams, etc. (or mix and match) depending on what worked best for a given situation or deployment. Choosing a concurrency approach is important, but ideally you wouldn't couple yourself to it so tightly that you couldn't switch if circumstances called for it.
When you have such an architecture, I'm generally a fan of the approach of using the highest level abstraction that gets the job done, and only driving down to lower level abstractions when specific circumstances require it. In this case, that would probably mean using CFStreams and RunLoops to start, and switching out to "bare" GCD or libevent later, if you hit a wall and also determined (through empirical measurement) that it was the transport layer and not the application layer that was the limiting factor. Very few non-trivial applications actually get to the C10K problem in the transport layer; things tend to have to scale "out" at the application layer first, at least for apps more complicated than basic message passing.

OS system calls in x86

While working on an educational, simplistic RISC processor, I was wondering how system calls work when implementing my software interrupt function. For example, hypothetically, let's say our program calls sys_end, which ends the current process. Now I know this would go to a vector table and then to the code that ends the current process.
My question is: does the code that ends the process run in supervisor mode or in user mode? Nowhere I look seems to specify this. I'm assuming that if it runs in normal user mode, that could pose a very significant problem, since a user-mode process could do something evil like:
for (i = 0; i < 10000; i++) {
    sys_fork();  // creates a child process
}
which could be very bad. I thought the OS would have some say in how many times a process could repeat itself, not to mention what other harmful things a process could do by changing the code of the system call itself.
System calls run in supervisor mode for the duration of the call. Supervisor mode is necessary for accessing hardware (the screen, the keyboard) and for keeping user processes isolated from each other.
There are (or can be configured) limits on the amount of cpu, number of processes, etc. a user process may use or request, which can offer some protection against the kind of runaway program you describe.
But the default Linux configuration allows 10k processes to be created in a tight loop; I've done it myself (both intentionally and accidentally).
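By way of illustration, one of those configurable limits is the per-user process cap, RLIMIT_NPROC on Linux and the BSDs; the sketch below simply reads it:
#include <stdio.h>
#include <sys/resource.h>

/* Read the per-user process limit. Once a user hits this cap, further
 * fork() calls fail with EAGAIN instead of exhausting the machine, which
 * is the kind of protection described above. */
int main(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NPROC, &lim) == 0)
        printf("process limit: soft=%llu hard=%llu\n",
               (unsigned long long)lim.rlim_cur,
               (unsigned long long)lim.rlim_max);
    return 0;
}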

Perl fork queue for n-Core processor

I am writing an application similar to what was suggested here. Essentially, I am using Perl to manage the execution of multiple CPU intensive processes in parallel via fork and wait. However, I am running on a 4-core machine, and I have many more processes, all with very dissimilar expected run-times which aren't known a priori.
It would take more effort to estimate the run times and gang the jobs appropriately than to simply use a queue for each core. Ultimately I want each core to be processing, with as little downtime as possible, until everything is done. Is there a preferred algorithm or mechanism for doing this? I would assume this is a common problem/use case, so I don't want to re-invent the wheel, as my wheel will probably be inferior to the 'right way'.
As a minor aside, I would prefer to not have to import additional modules (like Parallel::ForkManager) to accomplish this, but if that is the best way to go, then I will consider it.
~Thanks!
EDIT: Fixed the 'here' link. Thanks to ikegami.
EDIT: P::FM is too easy to use, not to... Today I Learned.
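For anyone landing here later, the basic Parallel::ForkManager pattern looks roughly like the sketch below; the job list and the per-job work are placeholders, not my real code.
use strict;
use warnings;
use Parallel::ForkManager;

# At most 4 jobs run at once; as soon as one child exits, the next queued
# job starts, keeping all cores busy until the list is drained.
my @jobs = (1 .. 20);                 # placeholder work items
my $pm   = Parallel::ForkManager->new(4);

for my $job (@jobs) {
    $pm->start and next;              # parent: fork a child, move on to the next job
    sleep 1 + int rand 5;             # child: placeholder for the real CPU-intensive work
    $pm->finish;                      # child exits
}
$pm->wait_all_children;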
Forks::Super has some features that are good for this sort of task.
extended syntax, but not a lot of new syntax: if you already have a program with fork and wait calls, you can still use the features of Forks::Super without too many changes. That is, your new code will still have fork and wait calls.
job throttling: like Parallel::ForkManager, you can control how many jobs you run simultaneously. When one job completes, the module can start another one, keeping your system fully utilized. You can also specify more complex logic like "run at most 6 background jobs on the weekends or between midnight and 6:00 am, but 2 background jobs the rest of the time"
timing utilities: Forks::Super keeps track of the start time and end time of every job, letting you log and analyze how long each job took:
fork { cmd => "some command" };
...
$pid = wait;
$elapsed = $pid->{end} - $pid->{start};
print LOG "That job took ${elapsed}s\n";
CPU affinity control: I can't tell whether this is something you need, but Guarav seemed to think it mattered. You can assign background jobs to specific cores
# restrict the job to cores #0 and #2
$job = fork { sub => \&background_process, args => \@args,
              cpu_affinity => 0x05 };

NoMethodError when running sinatra on jruby with sinatra-synchrony

I'm trying to integrate a basic 'hello world' JRuby Sinatra application with sinatra-synchrony and keep running into errors.
app.rb:
require 'sinatra/synchrony'

class App < Sinatra::Base
  register Sinatra::Synchrony

  get '/' do
    'Hello world!'
  end
end
config.ru:
require 'sinatra'
require 'app.rb'
run App
I've tried running this on a few different web servers and get varying errors to do with threads or memory leaks.
Synchrony libraries for Ruby are designed around using Fibers inside an event loop, a la EventMachine. For this particular case, you should consider using MRI and Goliath.io as your Rack server.
However, JRuby is growing by leaps and bounds. I've been using it for the last few months and avoiding the event-loop paradigm altogether. Try removing the Synchrony library from your example and running it with puma.io.
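A minimal sketch of that suggestion, with the Synchrony pieces removed (the server choice and thread counts are just examples):
app.rb:
require 'sinatra/base'

class App < Sinatra::Base
  get '/' do
    'Hello world!'
  end
end
config.ru:
require './app'
run App
and then start it with something like puma -t 8:32 config.ru.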
Keep in mind the JVM takes a bit to "warm up". Hit it a few thousand times to optimize speed.

iOS Threads Wait for Action

I have a processing thread that I use to fill a data buffer. Elsewhere a piece of hardware triggers a callback which reads from this data buffer. The processing thread then kicks in and refills the buffer.
When the buffer fills up I am currently telling the thread to wait by:
while ([self FreeWriteSpace] < mProcessBufferSize && InActive) {
    [NSThread sleepForTimeInterval:.0001];
}
However when I profile I am getting a lot of CPU time spent in sleep. Is there a better way to wait? Do I even care if the profiler says time is spent in sleep?
Time spent in sleep is effectively free. In Instruments, look at "running samples" rather than "all samples." But this still isn't an ideal solution.
First, your sleep interval is crazy. Do you really need 0.1 ms granularity? The system almost certainly isn't giving you that, because the processor isn't that fast. I have to believe you could raise this to 0.1 or 0.01 seconds. But that's still busy-waiting, which is not ideal if you can help it.
The better solution is to use an NSCondition. In the thread that fills the buffer, wait on the condition; in the code that drains it (your hardware callback), signal the condition when there's room to write.
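A minimal sketch of that shape (the names here are illustrative, not taken from your code):
NSCondition *bufferHasSpace = [[NSCondition alloc] init];

// Writer (processing) thread: block until the reader signals free space.
[bufferHasSpace lock];
while ([self freeWriteSpace] < processBufferSize && self.active) {
    [bufferHasSpace wait];
}
[bufferHasSpace unlock];
// ... refill the buffer ...

// Reader (hardware callback): after draining some data, wake the writer.
[bufferHasSpace lock];
[bufferHasSpace signal];
[bufferHasSpace unlock];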
Do be careful with your naming. Do not name methods with leading caps (that indicates a class name), and avoid accessing ivars directly (InActive) like this. "InActive" is also a very confusing name: does it mean the system is active ("In Active") or not active ("inactive")? Naming in Objective-C is extremely important. The compiler will not protect you the way it does in C# and C++. Good naming is how you keep your programs working, and many parts of ObjC rely on it.
You may also want to investigate Grand Central Dispatch, which is particularly designed for these kinds of problems. Look at dispatch_async() to run things when new data comes in.
However when I profile I am getting a lot of CPU time spent in sleep. Is there a better way to wait? Do I even care if the profiler says time is spent in sleep?
Yes -- never, never poll. Polling eats CPU, makes your app less responsive, eats battery, and is an all around waste.
Notify instead.
The easiest way is to use one of the variants of "perform selector on main thread" (see NSThread's documentation). Or dispatch to a queue (including something like dispatch_async(dispatch_get_main_queue(), ^{ ... yo, data be ready ...});).