JAX pmap with multi-core CPU

What is the correct method for using multiple CPU cores with jax.pmap?
The following example sets an environment variable to expose multiple CPU devices to XLA for SPMD, checks that JAX recognises the devices, and then attempts to tie up both devices with an infinite loop.
import os
os.environ["XLA_FLAGS"] = '--xla_force_host_platform_device_count=2'
import jax as jx
import jax.numpy as jnp
jx.local_device_count()
# WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
# 2
jx.devices("cpu")
# [CpuDevice(id=0), CpuDevice(id=1)]
def sfunc(x):
    while True:
        pass
jx.pmap(sfunc)(jnp.arange(2))
Executing this from a Jupyter kernel and observing htop shows that only one core is locked.
I receive the same output from htop when omitting the first two lines and running:
$ env XLA_FLAGS=--xla_force_host_platform_device_count=2 python test.py
Replacing sfunc with
def sfunc(x): return 2.0*x
and calling
jx.pmap(sfunc)(jnp.arange(2))
# ShardedDeviceArray([0., 2.], dtype=float32, weak_type=True)
does return a ShardedDeviceArray.
Clearly I am not correctly configuring JAX/XLA to use two cores. What am I missing and what can I do to diagnose the problem?

As far as I can tell, you are configuring the cores correctly (see e.g. Issue #2714). The problem lies in your test function:
def sfunc(x):
    while True:
        pass
This function gets stuck in an infinite loop at trace-time, not at run-time. Tracing happens in your host Python process on a single CPU (see How to think in JAX for an introduction to the idea of tracing within JAX transformations).
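One way to see this for yourself (a small sketch I'm adding here, not part of the original answer): a Python-level side effect such as print executes on the host while the function is being traced, and it receives an abstract tracer rather than a concrete value.
def show_trace(x):
    print("tracing, x =", x)   # runs at trace time; x is an abstract tracer, not 0 or 1
    return x + 1

jx.pmap(show_trace)(jnp.arange(2))   # the print fires during tracing, on the host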
If you want to observe CPU usage at runtime, you'll have to use a function that finishes tracing and begins running. For that you could use any long-running function that actually produces results. Here is a simple example:
def sfunc(x):
    for i in range(100):
        x = (x @ x)
    return x

jx.pmap(sfunc)(jnp.zeros((2, 1000, 1000)))
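Note that JAX dispatches work asynchronously, so if you want to be sure you are watching the execution itself in htop (and not just the dispatch returning), you can block on the result; a small addition along those lines:
out = jx.pmap(sfunc)(jnp.zeros((2, 1000, 1000)))
out.block_until_ready()   # wait until both CPU devices have finished the computation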

Related

Memory leak (?) using IO::Socket::Async (on FreeBSD 13.1)

In processing a stream of logs (via UDP) in a raku (v2022.07) app, I'm
hitting what appears to be a memory leak using IO::Socket::Async.
I pulled the code out into a simpler program which I've included below
(~ identical to code at https://docs.raku.org/type/IO::Socket::Async):
#!/usr/bin/env raku
#
my $socket = IO::Socket::Async.bind-udp('localhost', 24225);
react {
    whenever $socket.Supply -> $v {
        print $v if $v.chars > 0;
    };
};
It leaks substantial RAM - I let it run about 12 hours and, when I checked -- still running (on a 1 TB RAM machine) -- with
ps auwwx [pid]
it showed 314974456 and 20739784 for VSZ and RSS (so, roughly 300 GB virtual size and 20 GB resident).
[btw, the UDP traffic is fairly light - average of 350 (~100 byte) packets/sec (spikes to ~1000/sec)]
So .. I rewrote the above in Perl 5 (after similarly leaky results with a couple of Raku variants), which stabilizes quickly at about 8 MB resident - that's fine/stable/etc. -
but I'd prefer this process to feed a Raku channel (without a separate Perl process, file tailing, etc.).
My environment: FreeBSD 13.1-RELEASE-p2 GENERIC amd64 and raku:
v2022.07 built on MoarVM 2022.07 (installed with rakubrew).
I'm guessing this is unique to Raku on FreeBSD, but I'm not sure.
I did attempt to upgrade (rakubrew) to v2022.12 to see if the problem was resolved there -
but in rebuilding modules (zef), too many failed (some issue with
Digest/Digest::HMAC) - so I had to revert to 2022.07.
I'll sure be grateful for any suggestions for addressing the leak or alternative
methods to address reading from a UDP port.
Not exactly a solution to your problem, but you can monitor memory usage from within your Raku code using a built-in feature:
use Telemetry;
say T{"max-rss"};
Also remember that Supply by default decodes Unicode characters. If your protocol is binary, you may add :bin to the socket parameters to avoid treating binary data as text.

pytest-xdist indirect fixtures with class scope

I have some complicated and heavy logic to build a test object, and the tests are very long running. They are integration tests and I wanted to try to parallelize them a bit, so I found the pytest-xdist library.
Because of the heavy nature of building the test object, I am using pytest's indirect parametrization on fixtures to build them at test time rather than at collection time. Some code I am using for testing can be found below.
# run.py
import pytest

@pytest.mark.parametrize("attribute", (
    ["pid1", ["pod1", "pod2", "pod3"]],
    ["pid2", ["pod2", "pod4", "pod5"]]
), indirect=True)
class TestSampleWithScenarios(object):

    @pytest.fixture(scope="class")
    def attribute(request):
        # checkout the pod here
        # build the device object and yield
        device = {}
        yield device
        # teardown the device object
        # release pod

    def test_demo1(self, attribute):
        assert isinstance(attribute, str)

    def test_demo2(self, attribute):
        assert isinstance(attribute, str)
My run command is currently pytest run.py -n 4 --dist=loadscope
When I do not use loadscope, each test is sent to its own worker. I do not want this because I would like to build the device object only once and use it for all related tests.
When I use loadscope, all the tests are executed against gw0 and I am not getting any parallelism.
I am wondering if there are any tweaks I am missing, or whether this functionality is simply not implemented currently.
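For what it's worth (background I'm adding, not something stated in the original question): --dist=loadscope groups tests by module for test functions and by class for test methods, and each whole group is sent to a single worker, so a single parametrized test class will always land on one gateway such as gw0. A rough sketch of a layout that loadscope can actually spread across workers, assuming the scenarios can be split into separate classes:
# test_scenarios.py -- illustrative layout only
import pytest

class TestScenarioPid1:
    @pytest.fixture(scope="class")
    def attribute(self):
        device = {"pid": "pid1"}   # heavy setup would happen here, once per class
        yield device
        # teardown / release the pod here

    def test_demo1(self, attribute):
        assert attribute["pid"] == "pid1"

class TestScenarioPid2:
    @pytest.fixture(scope="class")
    def attribute(self):
        device = {"pid": "pid2"}
        yield device

    def test_demo2(self, attribute):
        assert attribute["pid"] == "pid2"
With pytest -n 2 --dist=loadscope, each class (and its class-scoped fixture) stays on one worker, while the two classes can run in parallel on different workers.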

akka custom fork-join-executor dispatcher behaves differently on OSX and RHEL

When I deploy a Play framework application using the Akka framework to a production machine, it behaves differently than on my development workstation.
This is a system that receives a batch of device IP addresses, it performs some processing on each device and aggregates the results after all devices in the batch have been processed. This processing isn't very CPU intensive.
I basically have 2 types of actors, a BatchActor and a DeviceActor. For the devices, I've created an actor backed by a RoundRobinPool router and a custom dispatcher. I'm attempting to process ~500 devices at a time (in parallel).
The issue is that when I run this code on my OS X machine, it runs as I would expect.
For instance, if I submit a batch of 200 device IP addresses, the application running on my workstation processes all the devices in parallel.
However when I copy this application to the production machine, Red Hat Enterprise Linux (RHEL), and run it submitting the same list of devices, it only processes 1 to 2 devices at a time.
What do I need to do to fix this issue?
The relevant code is as follows:
object Application extends Controller {
  ...
  val numberOfWorkers = 500
  val workers = Akka.system.actorOf(Props[DeviceActor]
    .withRouter(RoundRobinPool(nrOfInstances = numberOfWorkers))
    .withDispatcher("my-dispatcher")
  )

  def batchActor(config: BatchConfig) =
    Akka.system.actorOf(BatchActor.props(workers, config), s"batch-${config.batchId}")

  ...

  def batch = Action(parse.json) { request =>
    request.body.validate[BatchConfig] match {
      case config: BatchConfig => {
        ...
        val batch = batchActor(config)
        batch ! BatchActorProtocol.Start
        Ok(Json.toJson(status))
      }
      ...
    }
  }
The application.conf configuration section looks like the following:
my-dispatcher {
  # Dispatcher is the name of the event-based dispatcher
  type = Dispatcher
  # What kind of ExecutionService to use
  executor = "fork-join-executor"
  # Configuration for the fork join pool
  fork-join-executor {
    # Min number of threads to cap factor-based parallelism number to
    parallelism-min = 1000
    # Parallelism (threads) ... ceil(available processors * factor)
    parallelism-factor = 100.0
    # Max number of threads to cap factor-based parallelism number to
    parallelism-max = 5000
  }
  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 500
}
Inside the BatchActor I'm simply parsing the list of devices and feeding it to the workers:
class BatchActor(val workers: ActorRef, val config: BatchConfig) extends Actor
  ...
  def receive = {
    case Start => start
    ...
  }

  private def start = {
    ...
    devices.map { devices =>
      results(devices.host) = None
      workers ! DeviceWork(self, config, devices, steps)
    }
    ...
  }
after which the WorkerActor submits a result object back to the BatchActor.
My workstation: OS X - v10.9.3
java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
production machine: Red Hat Enterprise Linux Server release 6.5 (Santiago)
java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Software:
Scala: v2.11.2
SBT: v0.13.6
Play: v2.3.5
Akka: v2.3.4
I'm using typesafe activator/sbt to start the application. The command is as follows:
cd <project dir>
./activator run -Dhttp.port=6600
Any help appreciated. I've been stuck on this issue for a couple of days now.
I believe you have too much parallelism in your code, i.e., you are creating too many threads in your dispatcher. How many cores do you have on your Red Hat box? I've never seen such a high value used. A lot of threads in a fork-join pool may result in a large number of context switches. Try just using the default dispatcher and see whether that fixes your issue. You can also change the min and max parallelism values to 2 or 3 times the number of cores you have.
fork-join-executor {
  # Min number of threads to cap factor-based parallelism number to
  parallelism-min = 1000
  # Parallelism (threads) ... ceil(available processors * factor)
  parallelism-factor = 100.0
  # Max number of threads to cap factor-based parallelism number to
  parallelism-max = 5000
}
Another thing to try is to create an uber jar using sbt-assembly and then deploy that instead of deploying with activator.
Finally, you can look inside your JVMs using something like VisualVM or YourKit.
After hours spent trying different things, including:
- doing research on different threading implementations on Linux (pthreads vs NPTL)
- reading through all the VM documentation on threading
- ulimits
- trying various changes in the Play and Akka framework configurations
- and finally a complete re-write of the thread management using Scala futures, etc.
Nothing seemed to work. Then I did a detailed comparison and the only thing that was different was that I used the Oracle Hotspot implementation on my laptop, and the OpenJDK implementation on the production machine.
So I installed the Oracle JVM on the production machine and that seemed to fix the issue. Even though I couldn't determine what the ultimate solution was, it seems that the default installation of OpenJDK on RHEL is compiled or configured differently enough to not allow spawning of ~500 threads at a time.
I'm sure I'm missing something, but after ~ 3 days of searching I couldn't find it.

Buildbot slaves priority

Problem
I have set up a latent slave in buildbot to help avoid congestion.
I've set up my builds to run on either the permanent slave or the latent one. The idea is that the latent slave is woken up only when needed, but the result is that buildbot randomly selects one slave or the other, so sometimes I have to wait for the latent slave to wake even if the permanent one is idle.
Is there a way to prioritize buildbot slaves?
Attempted solutions
1. Custom nextSlave
Following @david-dean's suggestion, I've created a nextSlave function as follows (updated to a working version):
from twisted.python import log
import traceback
import random

def slave_selector(builder, builders):
    try:
        host = None
        support = None
        for builder in builders:
            if builder.slave.slavename == 'host-slave':
                host = builder
            elif builder.slave.slavename == 'support-slave':
                support = builder
        if host and support and len(support.slave.slave_status.runningBuilds) < len(host.slave.slave_status.runningBuilds):
            log.msg('host-slave has many running builds, launching build in support-slave')
            return support
        if not support:
            log.msg('no support slave found, launching build in host-slave')
            return host
        elif not host:
            log.msg('no host slave found, launching build in support-slave')
            return support
        else:
            log.msg('launching build in host-slave')
            return host
    except Exception as e:
        log.err(str(e))
        log.err(traceback.format_exc())
        log.msg('Selecting random slave')
        return random.choice(builders)
And then passed it to BuilderConfig.
The result is that I get this in twistd.log:
2014-04-28 11:01:45+0200 [-] added buildset 4329 to database
But the build never starts; in the web UI it always appears as Pending, and none of the log messages I've added appear in twistd.log.
2. Trying to mimic default behavior
I've been having a look at the buildbot code to see how it is done by default.
In file ./master/buildbot/process/buildrequestdistributor.py, class BasicBuildChooser, you have:
self.nextSlave = self.bldr.config.nextSlave
if not self.nextSlave:
    self.nextSlave = lambda _, slaves: random.choice(slaves) if slaves else None
So I've set exactly that lambda function in my BuilderConfig, and I get exactly the same result: the build never starts.
You can set up a nextSlave function to assign slaves to a builder in a custom manner; see: http://docs.buildbot.net/current/manual/cfg-builders.html#builder-configuration
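As a rough sketch of the shape this takes (the slave names, builder name, and factory below are placeholders, not taken from your config), the function is passed to the builder via the nextSlave argument of BuilderConfig:
from buildbot.config import BuilderConfig

def prefer_host_slave(builder, slave_builders):
    # Prefer the permanent slave whenever it is offered; otherwise fall
    # back to whatever remains (e.g. the latent slave).
    for sb in slave_builders:
        if sb.slave.slavename == 'host-slave':
            return sb
    return slave_builders[0] if slave_builders else None

c['builders'].append(BuilderConfig(
    name='example-builder',
    slavenames=['host-slave', 'support-slave'],
    factory=build_factory,   # assumed to be defined elsewhere in master.cfg
    nextSlave=prefer_host_slave,
))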

Can I register event callbacks using the libvirt Python module with a QEMU backend?

I would like to write some code to monitor events for domains running under QEMU, managed by libvirt. However, trying to register an event handler yields the following error:
>>> import libvirt
>>> conn = libvirt.openReadOnly('qemu:///system')
>>> conn.domainEventRegister(callback, None)
libvir: Remote error : this function is not supported by the connection driver: no event support
("callback" in this case is a stub function that simply prints its arguments.)
The examples I've been able to find regarding libvirt's event handling don't seem to be specific as to which backend hypervisors support which features. Is this expected to work for QEMU backends?
I'm running a Fedora 16 system, which includes libvirt 0.9.6 and qemu-kvm 0.15.1.
For folks finding themselves here via <searchengine>:
UPDATE 2013-10-04
Many months and a few Fedora releases later, the event-test.py code in the libvirt git repository runs correctly on Fedora 19.
Make sure you have registered in the libvirt event loop (or set up your own) before registering for events.
There is a nice example of event handling shipped with the libvirt source (the file is called event-test.py). I'm attaching an example based on that code:
import libvirt
import time
import threading

def callback(conn, dom, event, detail, opaque):
    print "EVENT: Domain %s(%s) %s %s" % (dom.name(),
                                          dom.ID(),
                                          event,
                                          detail)

eventLoopThread = None

def virEventLoopNativeRun():
    # Run libvirt's default event loop forever; events are dispatched
    # to registered callbacks from here.
    while True:
        libvirt.virEventRunDefaultImpl()

def virEventLoopNativeStart():
    global eventLoopThread
    # The default event loop implementation must be registered before
    # opening the connection and registering for domain events.
    libvirt.virEventRegisterDefaultImpl()
    eventLoopThread = threading.Thread(target=virEventLoopNativeRun,
                                       name="libvirtEventLoop")
    eventLoopThread.setDaemon(True)
    eventLoopThread.start()

if __name__ == '__main__':
    virEventLoopNativeStart()
    conn = libvirt.openReadOnly('qemu:///system')
    conn.domainEventRegister(callback, None)
    conn.setKeepAlive(5, 3)
    while conn.isAlive() == 1:
        time.sleep(1)
Good luck!
//Seto