Using 'caffe time' for benchmarking alexnet testing - neural-network

I am new to caffe and presently I am trying to use it with AlexNet. By "use" I mean I don't want to train the network; therefore, I got the '.caffemodel' for AlexNet as mentioned here.
Now I want to use caffe's time feature to look at how long each layer's execution takes during the TEST phase (I am doing this to get the execution time per layer during inference).
As per caffe's options:
usage: caffe <command> <args>
commands:
  train           train or finetune a model
  test            score a model
------------
  time            benchmark model execution time
  collect         collects layer data on specified device
  compare         collects layer data using inputs from other device
Flags from tools/caffe.cpp:
---------------------
  -phase (Optional; network phase (TRAIN or TEST). Only used for 'time'.)
    type: string default: ""
  -sampling (Optional; Caffe test with sampling mode) type: bool
    default: false
-------------------------
I can run the following command to benchmark AlexNet during the TEST phase:
build/tools/caffe time -model models/bvlc_alexnet/train_val.prototxt -iterations 1000 -engine MKLDNN -phase TEST
But when I do that, I get the following error:
I0304 17:37:26.183619 29987 net.cpp:409] label_data_1_split does not need backward computation.
I0304 17:37:26.183625 29987 net.cpp:409] data does not need backward computation.
I0304 17:37:26.183629 29987 net.cpp:451] This network produces output accuracy
I0304 17:37:26.183635 29987 net.cpp:451] This network produces output loss
I0304 17:37:26.183647 29987 net.cpp:491] Network initialization done.
I0304 17:37:26.183732 29987 caffe.cpp:556] Performing Forward
I0304 17:37:26.287747 29987 caffe.cpp:561] Initial loss: 6.92452
I0304 17:37:26.287784 29987 caffe.cpp:563] Performing Backward
F0304 17:37:26.385227 29987 mkldnn_pooling_layer.cpp:464] Check failed: poolingBwd_pd
*** Check failure stack trace: ***
@ 0x7fe03e3980cd google::LogMessage::Fail()
@ 0x7fe03e399f33 google::LogMessage::SendToLog()
@ 0x7fe03e397c28 google::LogMessage::Flush()
@ 0x7fe03e39a999 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fe03ead741c caffe::MKLDNNPoolingLayer<>::InitPoolingBwd()
@ 0x7fe03eac4ec2 caffe::MKLDNNPoolingLayer<>::Backward_cpu()
@ 0x7fe03e8f9b19 caffe::Net<>::Backward()
@ 0x5622d81a2530 (unknown)
@ 0x5622d8199353 (unknown)
@ 0x7fe03ab09b97 __libc_start_main
@ 0x5622d8198e1a (unknown)
I am guessing there is some problem with how I am using the command, and that I may have to change the .prototxt file for this.
I would appreciate it if somebody could point me in the right direction on how to get the benchmark numbers for AlexNet in the TEST phase.
P.S.: I could not find out what happens if you just run caffe time without specifying the phase. Does it benchmark both the TEST and TRAIN phases?
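For reference, this is roughly what I am trying to measure. The pycaffe sketch below times each layer's forward pass one at a time; it is only a rough sanity check (the file paths are just what I have locally, and the timings include Python overhead, so it is no substitute for caffe time):

import time
import caffe

caffe.set_mode_cpu()
# deploy.prototxt has no data or loss layers, so only inference is exercised
net = caffe.Net('models/bvlc_alexnet/deploy.prototxt',
                'models/bvlc_alexnet/bvlc_alexnet.caffemodel',
                caffe.TEST)

for name in net._layer_names:
    t0 = time.time()
    net.forward(start=name, end=name)  # run just this one layer
    print('%-15s %.3f ms' % (name, (time.time() - t0) * 1000.0))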

Related

How does the phase mechanism work in UVM?

I'm trying to understand the UVM phasing mechanism, especially connect_phase().
UVM_INFO testbench.sv(14) @ 0: e2.c1.gc1 [connect] phase
UVM_INFO testbench.sv(14) @ 0: e2.c1.gc2 [connect] phase
UVM_INFO testbench.sv(39) @ 0: e2.c1 [connect] phase
UVM_INFO testbench.sv(14) @ 0: e2.c2.gc1 [connect] phase
UVM_INFO testbench.sv(14) @ 0: e2.c2.gc2 [connect] phase
UVM_INFO testbench.sv(39) @ 0: e2.c2 [connect] phase
UVM_INFO testbench.sv(62) @ 0: e2 [connect] phase
Unlike the build phase, the connect phase executes bottom-up.
Someone said that the ordering does not matter once build_phase() is done, but in every single simulation I can see the bottom-up order.
I think there is some special or unavoidable reason for this. Could you please help me understand why the connect phase (and the other phases) in UVM execute bottom-up, except for the build and final phases?
Technically the build_phase is a breadth-first ordering. The ordering of the build_phase is dictated by the fact that the parent's build_phase creates its children, so naturally the parent must execute its phase before the children's phase.
You should only concern yourself with the ordering between phases, not ordering between components within the same phase. The connect_phase only requires that the components you are connecting have been constructed first and does not care about the order you make connections.
You might want to read this discussion about phase ordering with an attached example for more details.

JAX pmap with multi-core CPU

What is the correct method for using multiple CPU cores with jax.pmap?
The following example creates an environment variable for SPMD on CPU core backends, tests that JAX recognises the devices, and attempts a device lock.
import os
os.environ["XLA_FLAGS"] = '--xla_force_host_platform_device_count=2'
import jax as jx
import jax.numpy as jnp
jx.local_device_count()
# WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
# 2
jx.devices("cpu")
# [CpuDevice(id=0), CpuDevice(id=1)]
def sfunc(x):
    while True:
        pass
jx.pmap(sfunc)(jnp.arange(2))
Executing this from a Jupyter kernel and observing htop shows that only one core is locked.
I receive the same output from htop when omitting the first two lines and running:
$ env XLA_FLAGS=--xla_force_host_platform_device_count=2 python test.py
Replacing sfunc with
def sfunc(x): return 2.0*x
and calling
jx.pmap(sfunc)(jnp.arange(2))
# ShardedDeviceArray([0., 2.], dtype=float32, weak_type=True)
does return a ShardedDeviceArray.
Clearly I am not correctly configuring JAX/XLA to use two cores. What am I missing and what can I do to diagnose the problem?
As far as I can tell, you are configuring the cores correctly (see e.g. Issue #2714). The problem lies in your test function:
def sfunc(x):
    while True:
        pass
This function gets stuck in an infinite loop at trace-time, not at run-time. Tracing happens in your host Python process on a single CPU (see How to think in JAX for an introduction to the idea of tracing within JAX transformations).
If you want to observe CPU usage at runtime, you'll have to use a function that finishes tracing and begins running. For that you could use any long-running function that actually produces results. Here is a simple example:
def sfunc(x):
    for i in range(100):
        x = (x @ x)
    return x

jx.pmap(sfunc)(jnp.zeros((2, 1000, 1000)))
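To make the trace-time vs. run-time distinction concrete, here is a small sketch (it assumes the same XLA_FLAGS setup as in the question, so two CPU devices are visible). The Python-level print fires once, while the function is being traced in the host process with an abstract tracer value; the actual multiplication then runs on the devices:

import os
os.environ["XLA_FLAGS"] = '--xla_force_host_platform_device_count=2'
import jax as jx
import jax.numpy as jnp

def traced(x):
    # Runs at trace time, in the host Python process, with a tracer value.
    print("tracing with", x)
    return 2.0 * x

print(jx.pmap(traced)(jnp.arange(2)))
# "tracing with Traced<...>" is printed once during tracing;
# the resulting values [0., 2.] are computed on the two CPU devices.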

How to count cache-misses in mmap-ed memory (using eBPF)?

I would like to get a time series
t0, misses
...
tN, misses
where tN is a timestamp (second resolution) and misses is the number of times the kernel performed disk I/O for my PID to load a missing page of the mmap()-ed memory region when the process accessed that memory. OK, maybe the connection between disk I/O and memory access is harder to track, so let's assume my program cannot do any disk I/O for any reason other than accessing missing mmapped memory. I THINK I need to track something called node-load-misses in the perf world.
Any ideas how eBPF can be used to collect such data? What probes should I use?
I tried to use perf record for a similar purpose, but I dislike how much data perf records. As I recall, the attempt was something like the following (I also don't remember how I parsed that output.data file):
perf record -p $PID -a -F 10 -e node-loads -e node-load-misses -o output.data
I thought eBPF could provide some facility to implement such a thing with less overhead.
Loading mmapped pages which are not present in memory is not a hardware event like perf's cache-misses, node-loads, or node-load-misses. When your program accesses a memory address that is not present, a page-fault exception is generated by the hardware and handled in software by the Linux kernel. For a first access to anonymous memory, a physical page is allocated and mapped to that virtual address; for access to an mmapped file, disk I/O is initiated. There are two kinds of page faults in Linux, minor and major; the ones that require disk I/O are major page faults.
You should try trace-cmd, ftrace, or perf trace. Support for fault tracing was planned for the perf tool in 2012, and patches were proposed in https://lwn.net/Articles/602658/
There is a tracepoint for page faults from userspace code, and this command prints some events with the memory address of the page fault:
echo 2^123456%2 | perf trace -e 'exceptions:page_fault_user' bc
With a recent perf tool (https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/) there is perf trace record, which can record both mmap syscalls and page_fault_user events into perf.data; perf script will then print all events, and they can be counted by some awk or Python script.
Some useful links on perf and tracing: http://www.brendangregg.com/perf.html http://www.brendangregg.com/ebpf.html https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
And some bcc tools may be used to trace disk I/O, like https://github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py or https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
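If you specifically want to do the counting with eBPF, a minimal bcc sketch along the lines below should work; treat it as an untested, assumption-level example rather than a finished tool. It attaches to the same exceptions:page_fault_user tracepoint (x86-specific), filters by your PID, and prints one count per second. Note that it counts all user page faults, not only major faults in your mmapped region:

from bcc import BPF
import sys
import time

pid = int(sys.argv[1])

# BPF program: count user page faults for the target PID.
prog = """
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(exceptions, page_fault_user) {
    u32 p = bpf_get_current_pid_tgid() >> 32;
    if (p != TARGET_PID)
        return 0;
    counts.increment(p);
    return 0;
}
""".replace("TARGET_PID", str(pid))

b = BPF(text=prog)
counts = b["counts"]
while True:
    time.sleep(1)
    total = sum(v.value for v in counts.values())
    print("%d, %d" % (int(time.time()), total))
    counts.clear()

Run it as root, e.g. sudo python page_fault_count.py $PID.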
And for a simple time-series statistic you can use the perf stat -I 1000 command with the appropriate software events:
perf stat -e cpu-clock,page-faults,minor-faults,major-faults -I 1000 ./program
...
# time counts unit events
1.000112251 413.59 msec cpu-clock # 0.414 CPUs utilized
1.000112251 5,361 page-faults # 0.013 M/sec
1.000112251 5,301 minor-faults # 0.013 M/sec
1.000112251 60 major-faults # 0.145 K/sec
2.000490561 16.32 msec cpu-clock # 0.016 CPUs utilized
2.000490561 1 page-faults # 0.005 K/sec
2.000490561 1 minor-faults # 0.005 K/sec
2.000490561 0 major-faults # 0.000 K/sec
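If you want to turn that output into the t, misses series from your question, a small parsing sketch like the following should be enough (untested, and it assumes the plain perf stat -I text format shown above; perf stat writes its interval output to stderr, hence the 2>&1 in the usage comment, and the counts column may contain thousands separators):

# usage: perf stat -e major-faults -I 1000 -p $PID 2>&1 | python3 parse_faults.py
import sys

for line in sys.stdin:
    fields = line.split()
    # interval lines look like: "<time> <count> major-faults ..."
    if len(fields) >= 3 and fields[2] == "major-faults":
        t = float(fields[0])
        misses = int(fields[1].replace(",", ""))
        print("%.3f, %d" % (t, misses))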

Nvidia digits on TX2 Error code 1

I am new to Digits and TX2. I am trying to create object detection model using the tutorial from: https://github.com/dusty-nv/jetson-inference
I created the dataset successfully. The issue is with the model.
While creating a model, I am getting the following error.
Memory required for data: 3268934784
creating layer bbox_loss
Creating Layer bbox_loss
bbox_loss <- bboxes-obj-masked-norm
bbox_loss <- bbox-obj-label-norm
bbox_loss -> loss_bbox
Setting up bbox_loss
Top shape: (1)
with loss weight 2
Memory required for data: 3268934788
Creating layer coverage_loss
Creating Layer coverage_loss
coverage_loss <- coverage_coverage/sig_0_split_0
coverage_loss <- coverage-label_slice-label_4_split_0
coverage_loss -> loss_coverage
Setting up coverage_loss
Top shape: (1)
with loss weight 1
Memory required for data: 3268934792
Creating layer cluster
The job directory information on the left is:
Job Directory: /home/nvidia/DIGITS/digits/jobs/20180816-161051-e67a
Disk Size: 0 B
Network (train/val): train_val.prototxt
Network (deploy): deploy.prototxt
Network (original): original.prototxt
Solver: solver.prototxt
Raw caffe output: caffe_output.log
Pretrained Model: /home/nvidia/bvlc_googlenet.caffemodel.4
Visualizations: Tensorboard
The error on the server is
2018-08-16 16:10:53 [20180816-161051-e67a] [INFO ] Task subprocess args: "/home/nvidia/Caffe/caffe/build/tools/caffe train --solver=/home/nvidia/DIGITS/digits/jobs/20180816-161051-e67a/solver.prototxt --gpu=0 --weights=/home/nvidia/bvlc_googlenet.caffemodel.4"
2018-08-16 16:11:00 [20180816-161051-e67a] [ERROR] Train Caffe Model task failed with error code 1
I have no idea how to free up memory, as I have more than 2 GB available in the job directory.
Please help me. Thanks in advance.
I had the same issue for the last few days; maybe this will help someone in the future. First, make sure that you have the right version of protobuf. You can check it with:
protoc --version
If it's 2.*, you have to update to 3.*, for example by building it as described at https://github.com/NVIDIA/DIGITS/blob/digits-6.0/docs/BuildProtobuf.md, and then rebuild Caffe. Also, make sure that you have a compatible version of the protobuf pip package. For me, the following version is working well right now for DIGITS and the Caffe from the tutorial https://github.com/dusty-nv/jetson-inference:
pip install --user --upgrade protobuf==3.1.0.post1
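As an extra sanity check (just a convenience, not an official step), you can confirm which protobuf version Python actually picks up:

import google.protobuf
print(google.protobuf.__version__)  # should print a 3.x version, e.g. 3.1.0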

Matlab fails to validate parallel environment

When I run Parallel >> Manage Configurations..., Matlab fails to pass the Distributed Job, the Parallel Job, and the Matlabpool tests. My system has a dual core: Intel Core i5 CPU M520 @ 2.40GHz, 2GB RAM, Win7 64-bit, Matlab R2011b. After the failed validation, I get the following output:
Validation Details
Configuration: "local" Type: local
-------------------------------------- Stage: Find Resource
Status: Passed Description: Validation passed
Command Line Output: (none)
-------------------------------------- Stage: Distributed Job
Status: Failed Description: The given stage reached the default or
user-specified timeout.
Command Line Output: (none)
Error Report: (none)
Debug Log: LOG FILE OUTPUT:
-------------------------------------- Stage: Parallel Job
Status: Failed Description: The given stage reached the default or
user-specified timeout.
Command Line Output: (none)
Error Report: (none)
Debug Log: LOG FILE OUTPUT:
-------------------------------------- Stage: Matlabpool
Status: Failed Description: A MATLAB pool is already open and might
interfere with further testing. To avoid this, before the next test
run try executing "matlabpool close".
Command Line Output: (none)
Error Report: (none)
Debug Log: (none)
This is pretty much what I get if I've called matlabpool prior to running the validation checks. You did pay attention to the advice given in the status report of the Matlabpool stage, didn't you, about closing an open matlabpool before the next test run?