Memory usage when transforming fine-tuned GPT-J-6B to HuggingFace format - TPU

Following this tutorial, fine-tuning GPT-J on TPUs has worked well.
https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md
Why would the step that transforms the checkpoint to HuggingFace format using to_hf_weights.py run into a memory problem at 256 MB, even after slimming has been applied?
The issue I filed is here:
https://github.com/kingoflolz/mesh-transformer-jax/issues/209

Resolved by running this step on a standard machine (not a TPU) with lots of memory.
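For what it's worth, here is a quick sanity check after the conversion: a minimal sketch, assuming the transformers library with GPT-J support, and with "./gptj-hf-output" as a placeholder for wherever to_hf_weights.py wrote the weights. Note that loading the full-precision 6B model itself takes roughly 24 GB of RAM, which is consistent with needing a high-memory machine.

    from transformers import AutoTokenizer, GPTJForCausalLM

    # "./gptj-hf-output" is a placeholder for the conversion output directory
    model = GPTJForCausalLM.from_pretrained("./gptj-hf-output")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

    inputs = tokenizer("Once upon a time", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0]))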

Related

What version of MATLAB should I install to make dsearch() work?

I have some code in MATLAB that I need to run once to get some quick reference results. This code uses dsearch(), and I realized that this function has been deprecated and substituted by dsearchn(). I tried to follow some suggestions to make it work, but I didn't succeed (it's my absolute first time with MATLAB), and I just need to run this code once to get some numbers. Which MATLAB version should I install to make it work?
dsearch was removed in MATLAB R2012a, so any version of MATLAB before R2012a will work.
If you have an active MathWorks account, you may access the documentation archive here.
According to MathWorks' release notes, this function started printing a warning in release R2010a. So you should be fine using that release or an earlier one.
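For reference, the migration itself is usually a one-line change, so it may be easier than installing an old release. A minimal MATLAB sketch, assuming the classic dsearch(x, y, TRI, xi, yi) call with illustrative variable names:

    % old call, removed in R2012a:
    k = dsearch(x, y, TRI, xi, yi);
    % equivalent on current releases: pack points and queries as N-by-2 arrays
    k = dsearchn([x(:) y(:)], TRI, [xi(:) yi(:)]);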

How can we improve our compilation flow with Specman?

We are working on a large design with a complex verification environment. It contains five internal VIPs (three of which we own and debug, making minor changes and tweaks), the Cadence UniPro VIP, and a low-level services package we use for all of our environments. Our e compilation flow is long and tedious, and for every change we make in our code base, the turnaround time for a fix is 10 minutes.
How can we improve our compilation flow to increase our team's effectiveness?
Work in compiled mode.
Compile your code in parallel.
Use the Specman advanced options, which let you save and restore, reseed, and dynamically load.
Use multiple cores for much faster compilation time (the -mc switch to sn_compile.sh). Requires the advanced-options license.
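For example (a sketch only: the file name is a placeholder, and the exact switch set varies by release, so check the sn_compile.sh documentation for your installation):

    sn_compile.sh -mc my_env_top.e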
Compile your code using multi-core compilation. It will reduce the compilation time significantly.
You can also use this compiled mode for debugging, instead of the interpreted mode.
This capability is already included in the latest hotfix of your installed release.
You can compile your code and use parallel compilation. Another thing you can do is use reseed and dynamic load.
Use SAO (Specman Advanced Options): multi-process compilation.
Download the latest fix; as of version 13.1 you don't need a special version.
You can also use compiled code and recompile only the modules you changed (multi-stage compilation).
Starting with version 14.1 you can compile the code to an elib file.
In addition to multi-core compilation, from 14.1 you can use elibs to prevent recompilation of modules that were not changed.
What we do for normal development is compile only the code that we will normally never change (base libraries, VIPs from other vendors, code reused from previous projects, etc.). Any code that we develop for that particular project is loaded interpreted on top. This gives a smaller turnaround time when we have to change something, because you just do a quick "reload" (see the sketch after the next paragraph).
For regression testing, we compile everything up to the testbench top and load the tests on top.
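A rough sketch of that split (file names are illustrative; load and reload are the standard Specman commands):

    # compile the stable base once into a Specman executable:
    sn_compile.sh base_libs_top.e
    # then, at the Specman prompt, load project code interpreted on top:
    load project_top.e
    # after each edit, pick up the changes with a quick:
    reload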

Scicos Code generation for rt_preempt system

According to http://hart.sourceforge.net/, code generation should work for rt_preempt kernels when using Scilab 5.3.2 and the hart toolbox.
I installed both on Ubuntu 12.04 LTS successfully, but I'm kind of lost with the code generation. I use one of the hart toolbox examples (realtime_demo) and try to compile and generate code.
First of all: are these samples supposed to work with rt_preempt, or only with RTAI? What code-generation commands do I have to use for rt_preempt kernels?
If anybody has managed code generation for rt_preempt kernels, I would appreciate every hint I can get!
Code generation for rt_preempt is automatically enabled if the hart toolbox did not detect RTAI during installation. Thus, if you don't have RTAI, compiling your schematic should give you code that runs using rt_preemption (if that is also not available, your code will run as a normal Linux process). However, at the moment there is no way to get data in or out of the real-time process, as the RTAI scopes and meters cannot be used. To overcome this communication issue, and also other limitations of Xcos concerning the implementation of real-time systems, you could have a look at OpenRTDynamics as an alternative.
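As a generic (not hart-specific) first check, it is worth confirming that the running kernel is actually an rt_preempt build before debugging the generated code; for example:

    uname -v | grep -i preempt
    # on PREEMPT_RT kernels this file typically exists and contains "1":
    cat /sys/kernel/realtime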

rake-pipeline performance

We're using Rake::Pipeline::Middleware to serve a rake-pipeline project with Rack. It seems incredibly slow, like it's rebuilding everything whenever a file has changed.
Are we doing something wrong? Is there something we can do to speed it up?
If you are compressing the files, you should put a conditional in your Assetfile so that it doesn't compress in development. The concat filters are not that slow; the uglify and yui_css filters are, and you don't need the compression for dev.
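A minimal sketch of that conditional, assuming the uglify and yui_css helpers from rake-pipeline-web-filters and an illustrative RACK_ENV check (names and paths are placeholders):

    # Assetfile
    require "rake-pipeline-web-filters"

    output "public"

    input "assets" do
      match "*.js" do
        concat "app.js"
        uglify if ENV["RACK_ENV"] == "production"   # skip minification in dev
      end
      match "*.css" do
        concat "app.css"
        yui_css if ENV["RACK_ENV"] == "production"
      end
    end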
Adding the therubyracer gem has helped as well, cutting total compile time by a factor of 3. We are compiling a lot of CoffeeScript, and having therubyracer available avoids shelling out to Node.

Xcode (10.7) -- clGetProgramBinaries results unreadable

I have an OpenCL kernel that runs well, but I want to look at the intermediate code. I use clGetProgramInfo to pull out the binary and save it to a text file. I've tried this with NVIDIA, AMD, an i7, and a Xeon.
In all of these cases the binary is unreadable.
I understand that on OS X the chunk of data returned is actually a binary plist. I've found instructions for using plutil to convert it to xml, and they work.
It's still unreadable ... though I've seen instructions online that this is where you find the PTX code (in the case of my AMD 5870). There's the expected clBinaryData key but the data under that key is still one big chunk of stuff, not readable IL instructions in text form.
I'd really like to examine the intermediate language to assess inefficiencies in my use of the gpu. Is this simply not possible under Xcode? Or, what am I doing wrong?
Thanks for any information!
If you run your program with the following environment variable set, you should see .IL and .ISA files in your directory.
$ GPU_DUMP_DEVICE_KERNEL=3 ./my-program
Another way is to use the AMD APP Kernel Analyzer (which comes along with the AMD APP SDK) to look at the intermediate files, i.e. the IL and ISA.
(I am not sure whether the AMD APP SDK is available for Mac or not.)
One more option, according to the APP SDK documentation: put the line below in your host code.
putenv("GPU_DUMP_DEVICE_KERNEL=3");
References
AMD OpenCL Programming Guide
AMD Devgurus forum
(Making this a top-level answer so I can do some formatting.)
ocluser's answer was very helpful, in that it was enlightening and caused great learning, though it did not, alas, solve the problem.
I've verified that the environment variable described is being set and is available to my application when run from within Xcode. However, under OS X it does not have the highly desirable effect it has under Linux.
But I now know how to set environment variables in 7 of 8 different ways. I also set "tracer" environment variables to tell me which methods are effective within the scope of my application. From the output below, you can see that both the "edit scheme" method of adding arguments and the putenv suggested by ocluser work. What didn't set the variable in that scope: ~/.MacOSX/environment.plist, an app-specific plist, .profile, and adding a build phase to run a custom script. (I found at least one other way within Xcode to set one, but forgot what I called the tracer and can't find it now; maybe it's on another machine....)
GPU_DUMP_DEVICE_KERNEL is 3
GPU_DUMP_TRK_ENVPLIST is (null)
GPU_DUMP_TRK_APPPLIST is (null)
GPU_DUMP_TRK_DOTPROFILE is (null)
GPU_DUMP_TRK_RUNSCRIPT is (null)
GPU_DUMP_TRK_SCHARGS is 1
GPU_DUMP_TRK_PUTENV is 1
... so, no, this doesn't really answer the question, but it expands on it a bit. Sorry if poor form. Thanks!
Have not given up and shall provide an actual problem-solver if I find one.