How does the phase mechanism work in UVM? - system-verilog

I'm trying to understand the UVM phasing mechanism, especially connect_phase(). My simulation prints:
UVM_INFO testbench.sv(14) @ 0: e2.c1.gc1 [connect] phase
UVM_INFO testbench.sv(14) @ 0: e2.c1.gc2 [connect] phase
UVM_INFO testbench.sv(39) @ 0: e2.c1 [connect] phase
UVM_INFO testbench.sv(14) @ 0: e2.c2.gc1 [connect] phase
UVM_INFO testbench.sv(14) @ 0: e2.c2.gc2 [connect] phase
UVM_INFO testbench.sv(39) @ 0: e2.c2 [connect] phase
UVM_INFO testbench.sv(62) @ 0: e2 [connect] phase
Unlike the build phase, the connect phase executes bottom-up.
Someone said that the order doesn't matter once build_phase() is done, but in every single simulation I can see the bottom-up order.
I think there is some special or inevitable reason for this. Would you please help me understand why connect_phase and the other phases in UVM execute bottom-up, except for the build and final phases?

Technically, the build_phase uses a top-down, breadth-first ordering. This ordering is dictated by the fact that the parent's build_phase creates its children, so naturally the parent must execute its phase before its children can execute theirs.
You should only concern yourself with the ordering between phases, not with the ordering between components within the same phase. The connect_phase only requires that the components you are connecting have been constructed first; it does not care about the order in which you make the connections.
You might want to read this discussion about phase ordering with an attached example for more details.


JAX pmap with multi-core CPU

What is the correct method for using multiple CPU cores with jax.pmap?
The following example sets an environment variable to create multiple SPMD devices on the CPU backend, checks that JAX recognises the devices, and then attempts to tie up each device with an infinite loop.
import os
os.environ["XLA_FLAGS"] = '--xla_force_host_platform_device_count=2'
import jax as jx
import jax.numpy as jnp
jx.local_device_count()
# WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
# 2
jx.devices("cpu")
# [CpuDevice(id=0), CpuDevice(id=1)]
def sfunc(x):
    while True:
        pass
jx.pmap(sfunc)(jnp.arange(2))
Executing this from a Jupyter kernel and observing htop shows that only one core is locked.
I receive the same output from htop when omitting the first two lines and running:
$ env XLA_FLAGS=--xla_force_host_platform_device_count=2 python test.py
Replacing sfunc with
def sfunc(x): return 2.0*x
and calling
jx.pmap(sfunc)(jnp.arange(2))
# ShardedDeviceArray([0., 2.], dtype=float32, weak_type=True)
does return a ShardedDeviceArray.
Clearly I am not correctly configuring JAX/XLA to use two cores. What am I missing and what can I do to diagnose the problem?
As far as I can tell, you are configuring the cores correctly (see e.g. Issue #2714). The problem lies in your test function:
def sfunc(x):
    while True:
        pass
This function gets stuck in an infinite loop at trace-time, not at run-time. Tracing happens in your host Python process on a single CPU (see How to think in JAX for an introduction to the idea of tracing within JAX transformations).
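As a quick illustration of trace-time versus run-time (a minimal sketch, using the same jx alias as above): the Python print executes while the function is being traced in the host process, not each time the compiled computation runs.
import jax as jx
import jax.numpy as jnp

@jx.jit
def f(x):
    print("tracing f")  # runs in the host Python process, at trace time only
    return x + 1

f(jnp.arange(3))  # prints "tracing f" once, then compiles and runs
f(jnp.arange(3))  # no print: the cached, compiled computation runs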
If you want to observe CPU usage at runtime, you'll have to use a function that finishes tracing and begins running. For that you could use any long-running function that actually produces results. Here is a simple example:
def sfunc(x):
    for i in range(100):
        x = (x @ x)
    return x

jx.pmap(sfunc)(jnp.zeros((2, 1000, 1000)))
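With a function like this, which finishes tracing and actually executes, htop should now show both CPU devices busy while the pmap call runs.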

Using 'caffe time' for benchmarking alexnet testing

I am new to Caffe, and presently I am trying to use it with AlexNet. When I say use, I mean I don't want to train the network; I got the '.caffemodel' for AlexNet as mentioned here.
Now, I want to use caffe's time feature to look at the time each layer's execution takes during the TEST phase (i.e., the per-layer execution time during inference).
As per caffe's options:
usage: caffe <command> <args>
commands:
train train or finetune a model
test score a model
------------
time benchmark model execution time
collect collects layer data on specified device
compare collects layer data using inputs from other device
Flags from tools/caffe.cpp:
---------------------
-phase (Optional; network phase (TRAIN or TEST). Only used for 'time'.)
type: string default: ""
-sampling (Optional; Caffe test with sampling mode) type: bool
default: false
-------------------------
I can run the following command to benchmark Alexnet during TEST phase:
build/tools/caffe time -model models/bvlc_alexnet/train_val.prototxt -iterations 1000 -engine MKLDNN -phase TEST
But when I do that, I get the following error:
I0304 17:37:26.183619 29987 net.cpp:409] label_data_1_split does not need backward computation.
I0304 17:37:26.183625 29987 net.cpp:409] data does not need backward computation.
I0304 17:37:26.183629 29987 net.cpp:451] This network produces output accuracy
I0304 17:37:26.183635 29987 net.cpp:451] This network produces output loss
I0304 17:37:26.183647 29987 net.cpp:491] Network initialization done.
I0304 17:37:26.183732 29987 caffe.cpp:556] Performing Forward
I0304 17:37:26.287747 29987 caffe.cpp:561] Initial loss: 6.92452
I0304 17:37:26.287784 29987 caffe.cpp:563] Performing Backward
F0304 17:37:26.385227 29987 mkldnn_pooling_layer.cpp:464] Check failed: poolingBwd_pd
*** Check failure stack trace: ***
@ 0x7fe03e3980cd google::LogMessage::Fail()
@ 0x7fe03e399f33 google::LogMessage::SendToLog()
@ 0x7fe03e397c28 google::LogMessage::Flush()
@ 0x7fe03e39a999 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fe03ead741c caffe::MKLDNNPoolingLayer<>::InitPoolingBwd()
@ 0x7fe03eac4ec2 caffe::MKLDNNPoolingLayer<>::Backward_cpu()
@ 0x7fe03e8f9b19 caffe::Net<>::Backward()
@ 0x5622d81a2530 (unknown)
@ 0x5622d8199353 (unknown)
@ 0x7fe03ab09b97 __libc_start_main
@ 0x5622d8198e1a (unknown)
I am guessing there is some problem with the way I am using the command, and I may have to change the .prototxt file for this.
I would appreciate it if somebody could point me in the right direction as to how to get the benchmark numbers for AlexNet in the TEST phase.
P.S.: I could not find out what happens if you just run caffe time without specifying the phase. Does it benchmark both the TEST and TRAIN phases?

boofuzz - Target connection reset, skip error

I am using boofuzz to try to fuzz a specific application. While creating the blocks etc. and doing some testing, I noticed that the target sometimes closes the connection. This causes procmon to terminate the target process and restart it. However, this is totally unnecessary for this target.
Can I somehow tell boofuzz not to handle this as an error (so the target is not restarted)?
[2017-11-04 17:09:07,012] Info: Receiving...
[2017-11-04 17:09:07,093] Check Failed: Target connection reset.
[2017-11-04 17:09:07,093] Test Step: Calling post_send function:
[2017-11-04 17:09:07,093] Info: No post_send callback registered.
[2017-11-04 17:09:07,093] Test Step: Sleep between tests.
[2017-11-04 17:09:07,094] Info: sleeping for 0.100000 seconds
[2017-11-04 17:09:07,194] Test Step: Contact process monitor
[2017-11-04 17:09:07,194] Check: procmon.post_send()
[2017-11-04 17:09:07,196] Check OK: No crash detected.
Excellent question! There isn't (wasn't) any way to do this, but there really should be. A reset connection does not always mean a failure.
I just added ignore_connection_reset and ignore_connection_aborted options to the Session class to ignore ECONNRESET and ECONNABORTED errors respectively. Available in version 0.0.10.
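For example, a minimal sketch of enabling the new option (the host, port, and request definition here are placeholders, not taken from the original question):
from boofuzz import Session, SocketConnection, Target, s_get, s_initialize, s_string

session = Session(
    target=Target(connection=SocketConnection("10.0.0.5", 9999, proto="tcp")),
    ignore_connection_reset=True,  # log ECONNRESET as info rather than a failure
)

s_initialize("request")  # placeholder message definition
s_string("FUZZME")

session.connect(s_get("request"))
session.fuzz()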
Description of arguments available in the docs: http://boofuzz.readthedocs.io/en/latest/source/Session.html
You may find the commit that added these arguments informative for how some of the boofuzz internals work (relevant lines 182-183, 213-214, 741-756): https://github.com/jtpereyda/boofuzz/commit/a1f08837c755578e80f36fd1d78401f21ccbf852
Thank you for the solid question.

JVM crashes frequently

The JVM crashes unexpectedly and frequently in our prod environment, bringing JBoss (EAP 6.3) down. We have Java 7u72 installed.
The crash logs all have the same output, where the current thread is:
Current thread (0x00000000d1d99000): JavaThread "Lucene Merge Thread #0" daemon [_thread_in_Java, id=1144, stack(0x00000000f6a00000,0x00000000f6b00000)]
and the whole log is full of:
JavaThread "elasticsearch[Node BD852E44][search][T#68]" daemon [_thread_blocked, id=14396, stack(0x00000000f7b30000,0x00000000f7c30000)]
Elasticsearch is somehow related to indexing, and as far as I understand it uses Lucene under the hood, but we have a number of applications deployed. How can I check what is responsible? Can someone please help? The complete crash logs are at: http://pastebin.com/845LU9iK
Looks like it didn't manage to record stack traces for the affected thread.
If that's the same for all crashes, then it doesn't seem to match any known Lucene or JBoss bugs.
# guarantee(result == EXCEPTION_CONTINUE_EXECUTION) failed: Unexpected result from topLevelExceptionFilter
AIUI this indicates an error in native exception handling, so it's one error masking another, probably making this crash log fairly useless.
So I can only provide really generic advice:
You're using an older JVM version; update to the latest Java 7, Java 8, or possibly even a Java 9 dev build and see if the crash goes away. Even if the newer versions still crash, they might provide different/more useful error reports.
To diagnose potential compiler bugs, you can try running with the following flags:
-XX:-TieredCompilation [1] should disable the C1 compiler
-XX:+TieredCompilation -XX:TieredStopAtLevel=1 should disable the C2 compiler
-Xint disables all JIT compilation and is very slow
Ask on the hotspot-dev mailing list for further guidance.
[1]: Tiered compilation is a new Java 7 feature; it basically combines the interpreter, C1, and C2 JIT compilers (which formerly were used separately in the client and server VMs) into different optimizing stages.
Each of them can have optimization bugs. Turning off individual stages helps isolate them as a potential cause.
Edit: The new crash report is more useful since it at least has Java frames. The interesting part is the following:
J 1559 sun.misc.Unsafe.getByte(J)B (0 bytes) @ 0x000000000178e99b [0x000000000178e960+0x3b]
j java.nio.DirectByteBuffer.get()B+11
j org.apache.lucene.store.ByteBufferIndexInput.readByte()B+4
J 9447 C2 org.apache.lucene.store.DataInput.readVInt()I (114 bytes) @ 0x000000000348cc00 [0x000000000348cbc0+0x40]
DataInput.readVInt seems to be an ongoing source of grief; see this SO answer for possible solutions.

Buildbot slaves priority

Problem
I have set up a latent slave in buildbot to help avoid congestion.
I've set up my builds to run on either the permanent slave or the latent one. The idea is that the latent slave is woken up only when needed, but in practice buildbot randomly selects one slave or the other, so sometimes I have to wait for the latent slave to wake up even though the permanent one is idle.
Is there a way to prioritize buildbot slaves?
Attempted solutions
1. Custom nextSlave
Following @david-dean's suggestion, I've created a nextSlave function as follows (updated to a working version):
from twisted.python import log
import traceback
import random  # needed for the random fallback below

def slave_selector(builder, builders):
    try:
        host = None
        support = None
        for builder in builders:
            if builder.slave.slavename == 'host-slave':
                host = builder
            elif builder.slave.slavename == 'support-slave':
                support = builder
        if host and support and len(support.slave.slave_status.runningBuilds) < len(host.slave.slave_status.runningBuilds):
            log.msg('host-slave has many running builds, launching build in support-slave')
            return support
        if not support:
            log.msg('no support slave found, launching build in host-slave')
            return host
        elif not host:
            log.msg('no host slave found, launching build in support-slave')
            return support
        else:
            log.msg('launching build in host-slave')
            return host
    except Exception as e:
        log.err(str(e))
        log.err(traceback.format_exc())
        log.msg('Selecting random slave')
        return random.choice(builders)
And then I passed it to BuilderConfig.
The result is that I get this in twistd.log:
2014-04-28 11:01:45+0200 [-] added buildset 4329 to database
But the build never starts; in the web UI it always appears as Pending, and none of the log messages I've added appear in twistd.log.
2. Trying to mimic default behavior
I've had a look at the buildbot code to see how it is done by default.
In ./master/buildbot/process/buildrequestdistributor.py, class BasicBuildChooser, you have:
self.nextSlave = self.bldr.config.nextSlave
if not self.nextSlave:
    self.nextSlave = lambda _, slaves: random.choice(slaves) if slaves else None
So I've set exactly that lambda function in my BuilderConfig, and I'm getting exactly the same result: the build does not start.
You can set up a nextSlave function to assign slaves to a builder in a custom manner; see: http://docs.buildbot.net/current/manual/cfg-builders.html#builder-configuration
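For example, a minimal sketch of such a function wired into a BuilderConfig (the slave and builder names are placeholders, and factory is assumed to be defined elsewhere in master.cfg):
from buildbot.config import BuilderConfig

def prefer_permanent(builder, slave_builders):
    # Prefer the permanent slave whenever it is among the available ones,
    # so the latent slave only gets work when the permanent one is busy.
    for sb in slave_builders:
        if sb.slave.slavename == 'permanent-slave':
            return sb
    return slave_builders[0] if slave_builders else None

c['builders'] = [
    BuilderConfig(name='my-builder',
                  slavenames=['permanent-slave', 'latent-slave'],
                  factory=factory,
                  nextSlave=prefer_permanent),
]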