Porting word2vec to RISC-V.. potential proxy kernel issue? - porting

We are trying to port word2vec to RISC-V. Towards this end, we have compiled word2vec with a cross compiler and are trying to run it on Spike.
The cross compiler compiles the standard RISC-V benchmarks and they run without failure on Spike, but when we use the same setup for word2vec, it fails with "bad syscall #179!". We tried two different versions, both fail around the same place a minute or two into the run while executing these instructions. After going through the loop several 100k times, we see C1, C2 printed an then the crash. We are thinking this is more of a spike/pk issue than a word2vec issue.
Has anyone had similar experiences when porting code to RISC-V? Any ideas on how we might track down whether it's the proxy kernel?
A related question is about getting gdb working with Spike.. will post that separately.
Thank you.

The riscv-pk does not support all possible syscalls. You'll need to track down which syscall it is and whether you can implement it in riscv-pk or if you need to move to running it on a different kernel. For example, riscv-pk does not support any threading-related syscalls as multithreaded kernel support is an explicitly riscv-pk non-goal.
I would also be wary of using riscv-pk in general. It's a very simple, thin kernel which is great for running newlib user applications in the beginning, but it lacks rigorous testing and validation efforts against it, so running applications that stress virtual memory systems, rely on lots of syscalls (iotcl and friends), or are expecting more glibc-like environments may prove problematic.

Related

indexing problem when calling fit() function

I'm currently working on a project of a nn to play a game similar to atari games (more details in the link). I'm having trouble with the indexing. perhaps anyone knows what could be the problem? because I cant seem to find it. Thank you for your time. here's my code (click on the link) and here's the full traceback. the problem starts from the way I call
history = network.fit(state, epochs=10, batch_size=10) // in line 82
See this post: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
As said in the correct answer,
Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. From the Wikipedia:
The warning states that your CPU does support AVX (hooray!).
Pretty much, AVX speeds up your training, etc. Sadly, tensorflow is saying that they aren't going to use it... Why?
Because tensorflow default distribution is built without CPU extensions, such as SSE4.1, SSE4.2, AVX, AVX2, FMA, etc. The default builds (ones from pip install tensorflow) are intended to be compatible with as many CPUs as possible. Another argument is that even with these extensions CPU is a lot slower than a GPU, and it's expected for medium- and large-scale machine-learning training to be performed on a GPU.
What should yo do?
If you have a GPU, you shouldn't care about AVX support, because most expensive ops will be dispatched on a GPU device (unless explicitly set not to). In this case, you can simply ignore this warning by:
# Just disables the warning, doesn't enable AVX/FMA
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
If you don't have a GPU and want to utilize CPU as much as possible, you should build tensorflow from the source optimized for your CPU with AVX, AVX2, and FMA enabled if your CPU supports them. It's been discussed in this question and also this GitHub issue. Tensorflow uses an ad-hoc build system called bazel and building it is not that trivial, but is certainly doable. After this, not only will the warning disappear, tensorflow performance should also improve.
You can find all the details and comments in this StackOverflow question.
NOTE: This answer is a product of my professional copy-and-pasting.
Happy coding,
Bobbay
Has the code been debugged line by line ? as this would trace to the line causing error.
I assume the index error crops up from the below one - where "i" and further targets[i] , outs[i] can be checked for the values they have -
per_sample_losses = loss_fn.call(targets[i], outs[i])

Measuring the number of JVM instructions in a class

I need to count the number of JVM instructions for executing a method with multiple parameters. I give various arguments to the method to measure the number of JVM instructions executed for each case.
What tools can I use? The source is written in Scala, but it will produce class file anyway, so any JVM aware tool will work fine. I was thinking about profiler, but I think profiler may be used for different purposes.
There is an excellent tool for understanding the behavior of the Java HotSpot Just-In-Time (JIT) compiler during the execution of your program - JITWatch.
Warning: Modern JVMs are too complex, and do all kinds of optimization so the number of JVM's instructions means nothing from performance point of view. If you try to measure some small piece of code, it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing. Please be aware(http://shipilev.net/blog/2014/java-scala-divided-we-fail/) and use it on your own risk:
Always include a warmup phase which runs your method all the way through, enough to trigger all initializations and compilations before timing phase(s).
Be aware of the difference between -client and -server, and OSR and regular compilations.

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
wherease for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the intel compiler I should be able to many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instructions for writes
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is exclusively used by the calling thread, it shouldn't matter how you bring the line, both reads and writes would be able to use it. The benefit for prefetchw mentioned above is that it will bring the line and give you ownership on it, which may take a while if the line was also used by another core. The hint level on the other hand is orthogonal with the MESI states, and only affects how long would the prefetched line survive. This matters if you prefetch long ahead of the actual access and don't want to prefetch to get lost in that duration, or alternatively - prefetch right before the access, and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating - perhaps the larger caches and aggressive memory BW are more vulnerable to bad prefetching and you'd want to reduce the impact through the non-temporal hint. Consider that your prefetcher is suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that would through away lots of useful cachelines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course this may also be just a bug, I can't tell for sure, only whoever developed the compiler, but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be those that are non-temporal aligned, where a write can bypass the cache but as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

Speed improvements for Perl's chameneos-redux in the Computer Language Benchmarks Game

Ever looked at the Computer Language Benchmarks Game (formerly known as the Great Language Shootout)?
Perl has some pretty healthy competition there at the moment. It also occurs to me that there's probably some places that Perl's scores could be improved. The biggest one is in the chameneos-redux script right now—the Perl version runs the worst out of any language: 1,626 times slower than the C baseline solution!
There are some restrictions on how the programs can be made and optimized, and there is Perl's interpreted runtime penalty, but 1,626 times? There's got to be something that can get the runtime of this program way down.
Taking a look at the source code and the challenge, how can the speed be improved?
I ran the source code through the Devel::SmallProf profiler. The profile output is a little too verbose to post here, but you can see the results yourself using $ perl -d:SmallProf chameneos.pl 10000 (no need to run it for 6000000 meetings unless you really want to!) See perlperf for more details on some profiling tools in Perl.
It turns out that using semaphores is the major bottleneck. The lion's share of total CPU time is spent on checking whether a semaphore is locked or not. Although I haven't had enough time to look at why the source code uses semaphores, it may be that you can work around having to use semaphores altogether. That's probably your best shot at improving the code's performance.
As Zaid posted, Thread::Semaphore is rather slow. One optimization could be to use the implicit locks on shared variables instead of them. It should be faster, though I suspect it won't be faster by much.
In general, Perl's threading implementation sucks for any kind of usage that requires a lot of interthread communication. It's very suitable for tasks with little communication (as unlike CPython's threads and CRuby's threads they are actually preemptive).
It may be possible to improve that situation, we need better primitives.
I have a version based on another version from Jesse Millikian, which I think was never published.
I think it may run ~ 7x faster than the current entry, and uses standard modules all around. I'm not sure if it actually complies with all the rules though.
I've tried the forks module on it, but I think it slows it down a bit.
Anyone tried s/threads/forks/ on the Perl entry? Or Coro / Coro::MP, though the latter would probably trigger the 'interesting alternative implementations' clause.

A trivial SYSENTER/SYSCALL question

If a Windows executable makes use of SYSENTER and is executed on a processor implementing AMD64 ISA, what happens? I am both new and newbie to this topic (OSes, hardware/software interaction) but from what I've read I have understood that SYSCALL is the AMD64 equivalent to Intel's SYSENTER. Hopefully this question makes sense.
If you try to use SYSENTER where it is not supported, you'll probably get an "invalid opcode" exception.
Note that this situation is unusual - generally, Windows executables do not directly contain instructions to enter kernel mode.
As far as i know AM64 processors using different type of modes to handle such issues.
SYSENTER works fine but is not that fast.
A very useful site to get started about the different modes:
Wikipedia
They got rid of a bunch of unused functionality when they developed AMD64 extensions. One of the main ones is the elimination of the cs, ds, es, and ss segment registers. Normally loading segment registers is an extremely expensive operation (the CPU has to do permission checks, which could involve multiple memory accesses). Entering kernel mode requires loading new segment register values.
The SYSENTER instruction accelerates this by having a set of "shadow registers" which is can copy directly to the (internal, hidden) segment descriptors without doing any permission checks. The vast majority of the benefit is lost with only a couple of segment registers, so most likely the reasoning for removing the support for the instructions is that using regular instructions for the mode switch is faster.