indexing problem when calling fit() function - neural-network

I'm currently working on a neural network that plays a game similar to Atari games (more details in the link). I'm having trouble with indexing. Does anyone know what the problem could be? I can't seem to find it. Thank you for your time. Here's my code (click on the link), and here's the full traceback. The problem starts from the way I call
history = network.fit(state, epochs=10, batch_size=10)  # in line 82

See this post: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
As said in the accepted answer:
Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. (see Wikipedia for details).
The warning states that your CPU does support AVX (hooray!).
Pretty much, AVX speeds up your training, etc. Sadly, TensorFlow is saying that it isn't going to use it... Why?
Because TensorFlow's default distribution is built without CPU extensions, such as SSE4.1, SSE4.2, AVX, AVX2, FMA, etc. The default builds (the ones from pip install tensorflow) are intended to be compatible with as many CPUs as possible. Another argument is that even with these extensions a CPU is a lot slower than a GPU, and medium- and large-scale machine-learning training is expected to be performed on a GPU.
What should you do?
If you have a GPU, you shouldn't care about AVX support, because most expensive ops will be dispatched on a GPU device (unless explicitly set not to). In this case, you can simply ignore this warning by:
# Just disables the warning, doesn't enable AVX/FMA
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
If you don't have a GPU and want to utilize the CPU as much as possible, you should build TensorFlow from source, optimized for your CPU, with AVX, AVX2, and FMA enabled if your CPU supports them. This has been discussed in this question and also in this GitHub issue. TensorFlow uses an ad-hoc build system called Bazel, and building it is not that trivial, but it is certainly doable. After this, not only will the warning disappear, TensorFlow performance should also improve.
You can find all the details and comments in this StackOverflow question.
NOTE: This answer is a product of my professional copy-and-pasting.
Happy coding,
Bobbay

Has the code been debugged line by line? That would trace it to the line causing the error.
I assume the index error crops up from the line below, where "i" (and in turn targets[i] and outs[i]) can be checked for the values they hold:
per_sample_losses = loss_fn.call(targets[i], outs[i])

Related

Porting word2vec to RISC-V: potential proxy kernel issue?

We are trying to port word2vec to RISC-V. Towards this end, we have compiled word2vec with a cross compiler and are trying to run it on Spike.
The cross compiler compiles the standard RISC-V benchmarks and they run without failure on Spike, but when we use the same setup for word2vec, it fails with "bad syscall #179!". We tried two different versions; both fail around the same place, a minute or two into the run, while executing these instructions. After going through the loop several hundred thousand times, we see C1, C2 printed and then the crash. We think this is more of a Spike/pk issue than a word2vec issue.
Has anyone had similar experiences when porting code to RISC-V? Any ideas on how we might track down whether it's the proxy kernel?
A related question is about getting gdb working with Spike; I'll post that separately.
Thank you.
The riscv-pk does not support all possible syscalls. You'll need to track down which syscall it is and decide whether you can implement it in riscv-pk or whether you need to move to running on a different kernel. For example, riscv-pk does not support any threading-related syscalls, as multithreading support is explicitly a riscv-pk non-goal.
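As a first step, it can help to confirm which call #179 actually is. Below is a hedged sketch in C: it assumes a Linux-flavoured RISC-V toolchain is at hand, since riscv-pk and newlib's RISC-V port reuse the Linux generic syscall numbering (where, if I recall the generic table correctly, 179 is sysinfo; that guess is mine, not from the original post):

/* Print the syscall numbers the toolchain assigns to a few candidates,
 * so "bad syscall #179" can be matched to a name. Assumes a
 * riscv64-linux-gnu (or similar) toolchain whose <asm/unistd.h> uses
 * the generic table. */
#include <asm/unistd.h>
#include <stdio.h>

int main(void)
{
#ifdef __NR_sysinfo
    printf("__NR_sysinfo = %d\n", (int)__NR_sysinfo);
#endif
#ifdef __NR_gettid
    printf("__NR_gettid  = %d\n", (int)__NR_gettid);
#endif
    return 0;
}

If the number does map to something like sysinfo, a small stub linked into word2vec (returning plausible fixed values) may be enough to get past it without touching riscv-pk at all.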
I would also be wary of using riscv-pk in general. It's a very simple, thin kernel, which is great for running newlib user applications in the beginning, but it lacks rigorous testing and validation efforts, so running applications that stress virtual memory systems, rely on lots of syscalls (ioctl and friends), or expect a more glibc-like environment may prove problematic.

MATLAB error: cannot open with static TLS

For a couple of days now, I have constantly been receiving the same error while using MATLAB, which at some point happens with dlopen. I am pretty new to MATLAB, which is why I don't know what to do. Google doesn't seem to be helping me either. When I try to compute an eigenvector, I get this:
Error using eig
LAPACK loading error:
dlopen: cannot load any more object with static TLS
I also get this while doing a multiplication:
Error using *
BLAS loading error:
dlopen: cannot load any more object with static TLS
I did of course look for solutions to this problem, but I don't understand them well enough to know what to do. These are the threads I found:
How do I use the BLAS library provided by MATLAB?
http://www.mathworks.de/de/help/matlab/matlab_external/calling-lapack-and-blas-functions-from-mex-files.html
Can someone help me please?
Examples of function calls demonstrating this error
>> randn(3,3)
ans =
2.7694 0.7254 -0.2050
-1.3499 -0.0631 -0.1241
3.0349 0.7147 1.4897
>> eig(ans)
Error using eig
LAPACK loading error:
dlopen: cannot load any more object with static TLS
That's MATLAB bug no. 961964, known since R2012b (8.0). MATLAB dynamically loads some libs with static TLS (thread-local storage, e.g. see the gcc compiler flag -ftls-model). Loading too many such libs => no space left.
Until now MathWorks' only workaround is to load the important(!) libs first by using them early (they suggest putting "ones(10)*ones(10);" in startup.m). I'd better not comment on this "solution strategy".
Since R2013b (8.2.0.701) with Linux x86_64, my experience is: don't use "doc" (the graphical help system)! I think this doc utility (libxul, etc.) uses a lot of static TLS memory.
Here is an update (2013/12/31):
All the following tests were done with Fedora 20 (with glibc-2.18-11.fc20) and Matlab 8.3.0.73043 (R2014a Prerelease).
For more information on TLS, see
Ulrich Drepper, ELF handling For Thread-Local Storage, Version 0.21, 2013,
currently available at Akkadia and Redhat.
What happens exactly?
MATLAB dynamically (with dlopen) loads several libraries that need TLS initialization. All those libs need a slot in the DTV (dynamic thread vector). Because MATLAB loads several of these libs dynamically at runtime, the linker (at MathWorks) had no chance at compile/link time to count the slots needed (that's the important part). Now it's the task of the dynamic lib loader to handle such a case at runtime. But this is not easy. To cite dl-open.c:
For static TLS we have to allocate the memory here and
now. This includes allocating memory in the DTV. But we
cannot change any DTV other than our own. So, if we
cannot guarantee that there is room in the DTV we don't
even try it and fail the load.
There is a compile-time constant (called DTV_SURPLUS, see glibc-source/sysdeps/generic/ldsodefs.h) in glibc's dynamic lib loader that reserves a number of additional slots for such a mess (dynamically loading libs with static TLS in a multithreaded program). In the glibc version of Fedora 20 this value is 14.
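To see the mechanism outside of MATLAB, here is a minimal sketch of my own (not from the bug report) that exhausts those surplus slots. First build a shared object whose TLS uses the initial-exec model, then dlopen distinct copies of it until the loader gives up:

/* tlslib.c -- build with:
 *   gcc -shared -fPIC -ftls-model=initial-exec -o libtls0.so tlslib.c
 * and make copies libtls1.so, libtls2.so, ... (dlopen caches by path). */
static __thread int counter;
int bump(void) { return ++counter; }

/* loader.c -- build with: gcc -o loader loader.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    char name[64];
    for (int i = 0; i < 64; i++) {
        snprintf(name, sizeof name, "./libtls%d.so", i);
        if (!dlopen(name, RTLD_NOW)) {
            /* after roughly DTV_SURPLUS successful loads this prints:
               "cannot load any more object with static TLS" */
            fprintf(stderr, "load %d failed: %s\n", i, dlerror());
            return 1;
        }
    }
    return 0;
}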
Here are the first libs (running MATLAB) that needed dtv slots in my case:
matlabroot/bin/glnxa64/libut.so
/lib64/libstdc++.so.6
/lib64/libpthread.so.0
matlabroot/bin/glnxa64/libunwind.so.8
/lib64/libuuid.so.1
matlabroot/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so
matlabroot/sys/java/jre/glnxa64/jre/lib/amd64/libfontmanager.so
matlabroot/sys/java/jre/glnxa64/jre/lib/amd64/libt2k.so
matlabroot/bin/glnxa64/mkl.so
matlabroot/sys/os/glnxa64/libiomp5.so
/lib64/libasound.so.2
matlabroot/sys/jxbrowser/glnxa64/xulrunner/xulrunner-linux-64/libxul.so
/lib64/libselinux.so.1
/lib64/libpixman-1.so.0
/lib64/libEGL.so.1
/lib64/libGL.so.1
/lib64/libglapi.so.0
Yes, more than 14 => too many => no slot left in the DTV. That's what the error message tries to tell us, and especially MathWorks.
For the record: in order not to violate MATLAB's license, I didn't debug, decompile or disassemble any part of the binaries shipped with MATLAB. I only debugged the free and open glibc binaries of Fedora 20 that MATLAB was using to dynamically load the libs.
What can be done to solve this problem?
There are 3 options:
(a)
Rebuild MATLAB so that it does not dynamically load those libs (with the initial-exec TLS model) but instead links against them (then the linker can count the required slots!).
(b)
Rebuild those libs and ensure they are NOT using the initial-exec TLS model (see the sketch after this list).
(c)
Rebuild glibc and increase DTV_SURPLUS in
glibc/sysdeps/generic/ldsodefs.h
Obviously options (a) and (b) can only be done by MathWorks.
For option (c), no MATLAB source is needed, and thus it can be done without MathWorks.
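For illustration, option (b) boils down to something like the following sketch (my assumption of the build change, not MathWorks' actual one): the library author picks a TLS model that does not claim a static DTV slot at load time, at the price of slower access through __tls_get_addr.

/* In the library source, per variable: */
__thread int per_thread_state __attribute__((tls_model("global-dynamic")));

/* ... or for the whole translation unit via a compiler flag:
 *   gcc -shared -fPIC -ftls-model=global-dynamic -o libfoo.so foo.c
 * A lib built this way can be dlopen'ed without eating one of the
 * DTV_SURPLUS slots described above. */
int get_state(void) { return per_thread_state; }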
What is the status at MathWorks?
I really tried to explain this to the "MathWorks Technical Support Department", but my impression is: they don't understand me. They closed my support ticket and suggested a telephone(!) conversation in January 2014 with a technical support manager.
I'll do my very best to explain this, but to be honest: I'm not very confident.
Update (2014/01/10): Currently MathWorks is trying option (b).
Update (2014/03/19): For the file libiomp5.so you can download a newly compiled version (without static TLS) from MathWorks, bug report 961964. And the other libs? No improvement there. So don't be surprised to get "dlopen: cannot load any more object with static TLS" with "doc", e.g. see bug report 1003952.
Restarting Matlab solved the problem for me.
Long story short: in the directory that you start MATLAB from, create a file
startup.m with the content ones(10)*ones(10);. Restart MATLAB and it will be taken care of.
This is, as I find, an age-old problem that MathWorks has yet to solve.
Here are my two cents, which worked for me (when I wanted IT++ external libraries, with MEX).
Suppose the library you found to be the cause of the problem is "libXYZ.so", and that you know where it lies on your system.
The solution is to inform MATLAB to load that specific library at the very start of its startup. The reason for this error is apparently the lack of slots for thread-local storage (aka TLS), because they have already been filled up.
Because the latest compilation suddenly requires a new library that was not loaded earlier during startup, MATLAB throws this error.
It's a pity that MathWorks never cared to resolve this problem in all this time.
Fortunately, the solution is a single, very simple terminal command.
Typical steps on a Linux machine should be as follows:
Open command prompt (Ctrl+Alt+T in Ubuntu)
Issue the following command
export LD_PRELOAD=<PATH-TO-libxyz.so>
e.g.: export LD_PRELOAD=/usr/local/lib/libitpp.so
Start matlab from the same terminal
matlab &
Running your program now should resolve the issue, as it did in my case.
Good luck!
Reference:
[1] http://au.mathworks.com/matlabcentral/answers/125117-openmp-mex-files-static-tls-problem
http://www.mathworks.de/support/bugreports/961964 was updated on 30/01/2014.
There is a zip file attached containing libiomp5.so.
I tested it on Mageia 4 x86_64 with MATLAB R2013b.
I can now use the MATLAB documentation to open a demo without any problem.
I had the same problem and I think I just solved it.
When installing MATLAB, use the custom installation (I did not do this the first time). Choose to create symbolic links to MATLAB scripts in the predefined folder (/usr/local/bin). This did the trick for me!
I had the same problem with both MATLAB 2013b and MATLAB 2014a. The fix provided by MathWorks for libiomp5.so only removed the problem of LAPACK not working. However, I could not use external libraries that use OpenMP (such as VL_FEAT): I still get the error
"dlopen: cannot load any more object with static TLS."
The only thing that worked for me was downgrading to MATLAB 2012b.
I came across this problem when "bar" (for bar plots) with an array gave me just a single blue block, with no errors thrown. At first a reboot solved the problem, but after a memory error (from processing a very large file), I just could not get past the blue-block problem.
Using "hist" on a matrix input gave me the "BLAS loading error" problem and led me to this thread. The MathWorks workaround fixed both the hist and bar problems.
Just wanted to bring recognition to the extent of this bug's influence.
I had the same problem and solved it by increasing my Java heap memory. Go to Preferences > General > Java Heap Memory, and increase the allocated memory.
Increasing the Java heap memory (to 512 MB) also worked for me on R2013b/Ubuntu 12.04. The "BLAS loading error" began when I processed an 11 GB file (with 16 GB RAM) and has not recurred since increasing the Java heap memory and restarting MATLAB.

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that attempts a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on PowerPC semantics initially, with read and write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
whereas for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously, the ia32/amd64 code in question is using
prefetchnta
not
prefetcht1
as it would be if that code were consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now-dead Windows and Linux ia64 ports).
Since we are building with the Intel compiler, I should be able to make many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instruction for writes.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is used exclusively by the calling thread, it shouldn't matter how you bring the line in; both reads and writes would be able to use it. The benefit of prefetchw mentioned above is that it will bring the line and give you ownership of it, which may take a while if the line is also used by another core. The hint level, on the other hand, is orthogonal to the MESI states and only affects how long the prefetched line survives. This matters if you prefetch long ahead of the actual access and don't want the prefetch to get lost in that duration, or, alternatively, if you prefetch right before the access and don't want the prefetches to thrash your cache too much.
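To make the read/write distinction concrete, here is a small hedged sketch using GCC/Clang's __builtin_prefetch (the function name and the lookahead distance of 16 are mine, purely illustrative): rw = 0 requests a read prefetch, rw = 1 a write prefetch (which can emit prefetchw when building with -mprfchw), and the locality argument maps onto the T0/T1/T2/NTA hints.

#include <stddef.h>

void scale_into(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Prefetch a few iterations ahead; the distance is a tunable
         * guess, not a measured value. */
        if (i + 16 < n) {
            __builtin_prefetch(&src[i + 16], 0, 1); /* read prefetch, ~_MM_HINT_T2 */
            __builtin_prefetch(&dst[i + 16], 1, 1); /* write prefetch (ownership)  */
        }
        dst[i] = 2.0 * src[i];
    }
}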
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating: perhaps the larger caches and aggressive memory bandwidth are more vulnerable to bad prefetching, and you'd want to reduce the impact through the non-temporal hint. Consider that if your prefetcher is suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that would throw away lots of useful cache lines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course this may also just be a bug; I can't tell for sure (only whoever developed it can), but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part, on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be the non-temporal aligned ones, where a write can bypass the cache; but as far as I can tell, a read will always get cached.
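As a hedged illustration of that write-side exception (the function name is mine; this requires SSE2 and a 16-byte-aligned destination), a streaming store sends data to memory through write-combining buffers without polluting the cache:

#include <emmintrin.h>
#include <stddef.h>

void fill_nontemporal(int *dst /* 16-byte aligned */, int value, size_t nvecs)
{
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < nvecs; i++)
        _mm_stream_si128((__m128i *)dst + i, v); /* MOVNTDQ: bypasses the cache */
    _mm_sfence(); /* order the streamed stores before later loads/stores */
}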
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

A trivial SYSENTER/SYSCALL question

If a Windows executable makes use of SYSENTER and is executed on a processor implementing the AMD64 ISA, what happens? I am quite new to this topic (OSes, hardware/software interaction), but from what I've read I understand that SYSCALL is the AMD64 equivalent of Intel's SYSENTER. Hopefully this question makes sense.
If you try to use SYSENTER where it is not supported, you'll probably get an "invalid opcode" exception.
Note that this situation is unusual - generally, Windows executables do not directly contain instructions to enter kernel mode.
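If you want to check the CPU's capability up front rather than catch the exception, SYSENTER support is advertised by the SEP flag (CPUID leaf 1, EDX bit 11). Here is a minimal sketch of mine using GCC/Clang's <cpuid.h>; keep in mind that, as I understand it, AMD processors raise the invalid-opcode exception for SYSENTER in long mode even though they set the flag for legacy-mode use.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;
    /* SEP = SYSENTER/SYSEXIT present */
    printf("SEP flag: %s\n", (edx >> 11) & 1 ? "set" : "clear");
    return 0;
}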
As far as I know, AMD64 processors use different modes to handle such issues.
SYSENTER works fine but is not that fast.
A very useful site to get started about the different modes:
Wikipedia
They got rid of a bunch of unused functionality when they developed the AMD64 extensions. One of the main changes is the elimination of most of the functionality of the cs, ds, es, and ss segment registers. Normally, loading segment registers is an extremely expensive operation (the CPU has to do permission checks, which could involve multiple memory accesses). Entering kernel mode requires loading new segment register values.
The SYSENTER instruction accelerates this by having a set of "shadow registers" which it can copy directly to the (internal, hidden) segment descriptors without doing any permission checks. The vast majority of the benefit is lost with only a couple of segment registers, so most likely the reasoning for removing support for the instruction is that using regular instructions for the mode switch is faster.

How to determine SSE prefetch instruction size?

I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-byte prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU. I understand that this is the cache line size. Is this information obtainable automatically? It doesn't seem to be explicitly present in /proc/cpuinfo.
I think your question is related to this question or this one. It is clear that, unless you can rely on an OS or library function, you will want to use the CPUID instruction, but the question then becomes exactly what information you are looking for. And of course, AMD's and Intel's implementations don't need to agree. This page suggests using CPUID.1.EBX[15:8] (i.e., BH) for finding out on Intel, and function 80000005h on AMD. In addition, on Intel, CPUID.2... seems to contain the relevant information, but it looks like a real pain to parse out.
I think, from what I've read, both AMD and Intel CPUID implementations support CPUID.1.EBX[15:8], which returns the size of one cache line in QUADWORDs, as used by the CLFLUSH instruction (which isn't present on all processors, so I don't know whether you'll always find something there). So, after executing CPUID.1, you'd have to multiply BH by 8 to get the cache line size in bytes. This hinges on my implicit assumption (can anyone say whether it is really valid?) that the cache line size is always the same for the CLFLUSH and PREFETCHh instructions.
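Here is that query spelled out as a small sketch in C, using GCC/Clang's <cpuid.h> (my wrapping, not from the cited page); it checks the CLFSH feature flag (EDX bit 19) first, since the field is only meaningful when CLFLUSH exists:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;
    if (!(edx & (1u << 19))) {          /* CLFSH feature flag */
        puts("CLFLUSH not supported; line-size field not meaningful");
        return 1;
    }
    /* EBX bits 15:8 hold the CLFLUSH line size in 8-byte quadwords */
    printf("cache line size: %u bytes\n", ((ebx >> 8) & 0xff) * 8);
    return 0;
}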
Also, Intel's manuals state that PREFETCHh is only a hint, but that, if it prefetches anything, it will always fetch a minimum of 32 bytes.
EDIT1:
Another useful resource (even if not directly answering your question) for the optimised use of PREFETCHh is Intel's optimisation manual here.