~/.felix folder contains massive number of files - matlab

On one of the accounts we use on a cluster there is a hidden folder in the home directory:
/home/user/.felix/
This contains a huge number of directories:
[user@gateway .felix]$ ls | head -10
osgi-cache1050e0f4_15774cb91f4_-7ffe
osgi-cache-1063880a_15289337854_-7ffe
osgi-cache-10716929_155ac249b99_-7ffe
osgi-cache-1076af32_1567b76f77c_-7ffe
osgi-cache10fdd858_15288297a76_-7ffe
osgi-cache1145761a_1567b157a97_-7ffe
osgi-cache-1158de5c_15775794758_-7ffe
osgi-cache-117b5c79_1577655ca87_-7ffe
osgi-cache-1188faa3_154532959fc_-7fff
osgi-cache11bf2822_1528906f443_-7ffe
In each of these folders:
osgi-cache-37166e7_1545cb3b7e0_-7ffe/bundle10
[user@gateway bundle4]$ cat bundle.location
reference:file:/gpfs22/local/centos6/matlab/2013a/java/jar/toolbox/bioinfo.jar
So I'm thinking these files are created by matlab somehow.
This .felix folder contains about 150k files, which is causing us to go over our quota of 300k files. Is there a way to:
disable the creation of these files
clean them up in a safe way (maybe a cron)
possibly move the location where these files are created?

Technically it's the Apache Felix bundle cache (http://felix.apache.org/documentation/subprojects/apache-felix-framework/apache-felix-framework-usage-documentation.html) and I'm afraid there's no safe way to remove any of these without contacting the user (even when migrating the path).

I noticed that Matlab is creating about 7k files in /tmp/.felix. The space usage is pretty minimal (184k). I was able to delete them by:
find /tmp/.felix -user <my username> -exec rm -r {} \;
But when I run my Matlab code it recreates many (all?) of the files. So at least in the Matlab usage case it seems relatively safe to delete them, but I could imagine there being problems if this info is actively being updated.
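If you do want a cron-based cleanup, a conservative sketch (the 30-day threshold and the nightly schedule are assumptions, not anything MATLAB documents) would be to remove only cache directories that have not been touched in a while, since anything in active use should have a recent modification time:
# Hypothetical crontab entry: purge osgi-cache directories untouched for 30+ days, nightly at 03:00
0 3 * * * find "$HOME/.felix" -maxdepth 1 -type d -name 'osgi-cache*' -mtime +30 -exec rm -rf {} +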
Digging into the Felix docs a bit (mentioned in the answer above), I googled "Felix bundle cache" and found that this cache is used to store pointers to Java jar files, and perhaps some state as well. There are indeed parameters you can configure to control the location and flushing of this cache; see "Configuring the Felix bundle cache" in the Felix documentation.
MathWorks also has MATLAB-specific suggestions. In the case mentioned there, this seemed to be triggered by plotting. Names in the stack trace there suggest it may have to do with the implementation of key bindings (keyboard shortcuts).
Rob

Kubernetes object size limitations

I am dealing with CRDs and creating Custom Resources. I need to keep a lot of information about my application in the Custom Resource. As per the official doc, etcd works with requests up to 1.5MB. I am hitting errors something like
"error": "Request entity too large: limit is 3145728"
I believe the limit specified in the error is 3MB. Any thoughts around this? Any way out of this problem?
The "error": "Request entity too large: limit is 3145728" is probably the default response from kubernetes handler for objects larger than 3MB, as you can see here at L305 of the source code:
expectedMsgFor1MB := `etcdserver: request is too large`
expectedMsgFor2MB := `rpc error: code = ResourceExhausted desc = trying to send message larger than max`
expectedMsgFor3MB := `Request entity too large: limit is 3145728`
expectedMsgForLargeAnnotation := `metadata.annotations: Too long: must have at most 262144 bytes`
etcd does indeed have a 1.5MB limit for processing a file, and you will find in the etcd documentation a suggestion to try the --max-request-bytes flag, but it would have no effect on a GKE cluster because you don't have such permission on the master node.
But even if you did, it would not be ideal because usually this error means that you are consuming the objects instead of referencing them which would degrade your performance.
I highly recommend that you consider instead these options:
Determine whether your object includes references that aren't used;
Break up your resource;
Consider a volume mount instead;
There's a request for a new API Resource: File (or BinaryData) that could apply to your case. It's very fresh, but it's good to keep an eye on.
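As a quick sanity check, you can measure how close an existing object already is to that limit by checking its serialized size (the resource and object names below are placeholders):
# Prints the size in bytes of the object as served by the API server
kubectl get mycustomresource my-instance -n my-namespace -o json | wc -c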
If you still need help let me know.
This happened to me when I put some large files in my Helm chart directory. Removing those files helped me resolve my issue.
Check the size of the files in the directory that contains the templates and values.yaml of your chart's release (the directory name is usually the same as the chart's name).
du <directory-path> --max-depth=1
# if you want it to be more readable add -h switch
du -h <directory-path> --max-depth=1
Make sure you do not have any irrelevant files there if the total size exceeds 3145728 bytes.
If you are using Helm, check whether you have large files such as log files. Add a .helmignore file:
.DS_Store
# Common VCS dirs
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
*.log
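To see how big the chart actually is once .helmignore is applied, one option is to package it locally and inspect the resulting archive (the chart path and name below are placeholders):
# Build the chart archive and check its size
helm package ./mychart
ls -lh mychart-*.tgz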

How to compress a list of files into a single gzip file using elasticluster, grid-engine-tools, and google cloud

I want to start by thanking you all for your help ahead of time, as this will help clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip file; however, the guide only shows how to compress a list of files as individual gzipped files. Again, I appreciate any help, as there are very few resources and little documentation for this setup. (If there is some extra info, please include links to sources.)
After I had set up the grid engine, I ran through the samples in the guide.
Am I right in assuming there is not a script for combining multiple files into one gzip using grid-computing-tools?
Are there any solutions on the Elasticluster Grid Engine setup to compress multiple files into 1 gzip?
What changes can be made to the grid-engine-tools to make it work?
EDIT
The reason we are considering a cluster is that we expect multiple operations to occur simultaneously, with files zipped up per order; this will happen systematically, so that a vendor can download a single compressed file per order.
May I state the definition of the problem so you can let me know if I understood it correctly, as both Matt and I provided the exact same solution and somehow it doesn't seem sufficient.
Problem Definition
You have an Order defining the start of a task to process some data.
The processing of data would be split among several compute nodes, each producing a resulting file stored on GS directories.
The goal is:
Collect the files from GS bucket (that were produced by each of the nodes),
Archive the collection of files as one file,
Then compress that archive, and
Push it back to a different GS location.
Let me know if I summarized it properly,
Thanks,
Paul
Are the files in question in Cloud Storage?
Are the files in question on a local or network drive?
In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this. It sounds more like you just want to use tar along with gzip.
The tar utility will create an archive file and can compress it as well. For example:
$ # Create a directory with a few input files
$ mkdir myfiles
$ echo "This is file1" > myfiles/file1.txt
$ echo "This is file2" > myfiles/file2.txt
$ # (C)reate a compressed archive
$ tar cvfz archive.tgz myfiles/*
a myfiles/file1.txt
a myfiles/file2.txt
$ # (V)erify the archive
$ tar tvfz archive.tgz
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file1.txt
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file2.txt
To extract the contents use:
$ # E(x)tract the archive contents
$ tar xvfz archive.tgz
x myfiles/file1.txt
x myfiles/file2.txt
UPDATE:
In your updated problem description, you have indicated that you may have multiple orders processed simultaneously. If the frequency at which results need to be tarred is low, and providing the tarred results is not extremely time-sensitive, then you could likely do this with a single node.
However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.
Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.
A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.
There is an example that does something similar here:
https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress
This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.
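For instance, the per-file gzip step could be swapped for a single tar invocation along these lines (the directory variables here are placeholders, not the example's actual names):
# Archive all downloaded inputs into one compressed file instead of gzipping each one
tar czf "${OUTPUT_DIR}/inputs.tar.gz" -C "${INPUT_DIR}" .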
Take a look at the https://github.com/googlegenomics/pipelines-api-examples github repository for more information and examples.
-Matt
So there are many ways to do it, but the thing is that you cannot directly compress a collection of files - or a directory - into one file on Google Storage; you would need to perform the tar/gzip combination locally before transferring it.
If you want you can have the data compressed automatically via:
gsutil cp -Z
Which is detailed at the following link:
https://cloud.google.com/storage/docs/gsutil/commands/cp#changing-temp-directories
And the nice thing is that you retrieve uncompressed results from compressed data on Google Storage, because it has the ability to perform Decompressive Transcoding:
https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
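For example (the bucket and file names are placeholders):
# Upload with gzip compression applied on the fly; downloads can be served decompressed
gsutil cp -Z myfile.txt gs://my-bucket/myfile.txt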
You will notice on the last line in the following script:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
The following line will basically copy the current compressed file to Google Cloud Storage:
gcs_util::upload "${WS_OUT_DIR}/*" "${OUTPUT_PATH}/"
What you will need is to first perform the tar/zip on the files in the local scratch directory, and then gsutil copy the compressed file over to Google Storage, but make sure that all the files that need to be compressed are in the scratch directory before starting to compress them. Most likely you would need to SSH copy (scp) them to one of the nodes (i.e. master), and then have the master tar/gzip the whole directory before sending it over to Google Storage. I am assuming each GCE instance has its own scratch disk, but the "gsutil cp" transfer is very fast when working on GCE.
Since Google Storage is fast at data transfers with Google Compute instances, the second easiest option to pursue is to comment out lines 66-69 in the do_compress.sh file:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
This way no compression happens, but the copy on the last line via gcs_util::upload still happens, so all the uncompressed files are transferred to the same Google Storage bucket. Then, using "gsutil cp" from the master node, you would copy them back locally, compress them locally via tar/gzip, and copy the compressed archive back to the bucket using "gsutil cp".
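A rough sketch of that round trip from the master node (the bucket, order ID, and directory names are all placeholders):
# Pull the uncompressed per-node outputs into a local scratch directory
mkdir -p ./scratch
gsutil -m cp "gs://my-bucket/order-123/outputs/*" ./scratch/
# Archive and compress them into a single file
tar czf order-123.tar.gz -C ./scratch .
# Push the single compressed archive to its destination
gsutil cp order-123.tar.gz gs://my-bucket/order-123/archive/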
Hope it helps but it's tricky,
Paul

Version Control a file with per machine dependent stuff

I have a config file in my project that includes some info that is machine-dependent (db username, password, path). I understand that in this particular case I could require everybody to use the same username, db path, and password to keep this simple, but there must be another way to deal with this problem.
I use mercurial, if you care, but I am ok with just a theoretical answer if you are unfamiliar with hg specifics.
A common way to handle this is to put a config.example or similar under version control and force the user to copy it and make any necessary changes. That way the user can pull down the overall structure of the file from your repository without overwriting local changes.
Alternatively, you could make your config file provide only defaults, with the option to source a subset of variables from a higher-priority custom config file (in the same format) which the user may or may not provide.
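As a minimal sketch of that pattern (the file names are hypothetical), a shell-style config could load the versioned defaults and then source an optional, unversioned local override:
# config.defaults.sh is under version control; config.local.sh is ignored and per-machine
. ./config.defaults.sh
[ -f ./config.local.sh ] && . ./config.local.sh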
You'll want to use the .hgignore file to not include the config file in the repository.
This will allow everyone to have their own version of the config file.
Basically, you just want to add the relative path to the config file and Mercurial commands will ignore it. So the file would look like this:
config/dbconfig.ext
Edit
I just realized you still want to be able to version control the config file (misunderstood the question). So I suggest moving the parts of the config file that are dependent into their own config file and then applying the fix above. That way, you can still have the regular config information under version control and keep part of it separate for each person's machine.
I have per machine databases for my PHP projects. What I do is check the hostname at runtime. If it is one host, I feed it certain credentials. If another, feed it different credentials.
On some systems I create a list of credentials and then just go down the line trying them until one of the connections works. If the list is exhausted, the connection cannot be made.
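Sketched in shell rather than PHP (the hostnames and credentials are placeholders), the idea looks like this:
# Pick credentials based on the machine we are running on
case "$(hostname)" in
  dev-box)  DB_USER=dev;  DB_PASS=dev ;;
  live-box) DB_USER=live; DB_PASS=secure ;;
  *)        echo "no credentials for $(hostname)" >&2; exit 1 ;;
esac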
I've never found a solid method for handling this type of configuration files. My final solution was to just maintain a version of each file and use symbolic links. That way each server has the same file path, but different root file.
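For example (the paths are placeholders), each machine keeps its own real config file and the project path is just a link to it:
# The project sees the same path on every server, but each server links to its own file
ln -s /etc/myproject/config.local /var/www/myproject/config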
Without knowing exactly what is in your config file, I'm going to assume your file has some stuff that is machine-dependent (e.g., db password, paths) and other stuff that is not (db hostname, maybe some paths relative to a path that is configured on a per-machine basis, etc.)
If that's the case, what you want to do is re-factor your config file so that you have two config files---one for the common stuff, one for the machine-specific stuff. Check the common one in, and add the machine-specific configuration to the ignore file.

Merging 2 clearcase views on different Servers?

I'm in a bit of a pickle...
I work on a project that is multi-site. Unfortunately, the VOB sync between the two sites is not working properly right now, and our Clearcase Admins are too busy doing other work to get it fixed.
I need to take code from a Dynamic View on one server and merge it to a Dynamic View on another server.
Usually we check everything in, label it, and then once the VOB syncs merge from the label on the other side.
Any tips or tricks on how to do this merge?
Ok, here's what I've got so far:
- I made sure that my source view & my target view were based on the same (slightly older) label that had synced properly.
Running the following command tells me what files have changed in my branch on the 1st server:
ct find . -version 'version (.../branch-name/LATEST)' -nxn -print
Running this command will give me a GNU style diff against the labeled version:
ct diff -diff FILENAME `cleartool find FILENAME -version 'lbtype(LABEL)' -print`
Now I need to chain these together to create a patchset file that I can then use GNU merge to merge into the 2nd view that's based on the same label.
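A rough way to chain the two (assuming ct is an alias for cleartool, and that branch-name and LABEL are the same placeholders as above) is to loop over the changed files and append each diff to one patch file:
# Build a single patch file from every element changed on the branch
ct find . -version 'version (.../branch-name/LATEST)' -nxn -print | while read -r f; do
  ct diff -diff "$f" "$(cleartool find "$f" -version 'lbtype(LABEL)' -print)" >> branch.patch
done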
You need to get the data back somehow from the other site of the replicated environment.
If the mkreplica did work but the ship process failed, you could try to ask for a shared file replica, which could then be imported (see the mkreplica help, section Imports).
multitool mkreplica -export -workdir /tmp/ms_workdir -c "make a new replica for sanfran_hub" -out /tmp/sanfran_hub_packet
multitool mkreplica -import -workdir /tmp/ms_workdir -tag /vobs/dev -vob /net/goldengate/vobstg/dev.vbs -preserve -c "create sanfran_hub replica" /tmp/sanfran_hub_packet
But if your CC admins are that busy, all that is left is the poor man's replica: some kind of zip, and a merge with a third-party tool between your local view and said zip.
I am sure you could extract any relevant data from a source dynamic view which would not be up-to-date anyway.
Admins finally got around to cleaning it up before I could finish my solution, so don't need this anymore. Hopefully they will keep it up and running.

What's the best Perl module for hierarchical and inheritable configuration?

If I have a greenfield project, what is the best practice Perl based configuration module to use?
There will be a Catalyst app and some command line scripts. They should share the same configuration.
Some features I think I want ...
Hierarchical Configurations to cleanly maintain different development and live settings.
I'd like to define "global" configurations once (eg, results_per_page => 20), have those inherited but override-able by my dev/live configs.
Global:
results_per_page: 20
db_dsn: DBI:mysql;
db_name: my_app
Dev:
inherit_from: Global
db_user: dev
db_pass: dev
Dev_New_Feature_Branch:
inherit_from: Dev
db_name: my_app_new_feature
Live:
inherit_from: Global
db_user: live
db_pass: secure
When I deploy a project to a new server, or branch/fork/copy it somewhere new (eg, a new development instance), I want to (one time only) set which configuration set/file to use, and then all future updates are automatic.
I'd envisage this could be achieved with a symlink:
git clone example.com:/var/git/my_project . # or any equiv vcs
cd my_project/etc
ln -s live.config to_use.config
Then in the future
git pull # or any equiv vcs
I'd also like something akin to FindBin, so that my configs can use either absolute paths or paths relative to the current deployment. Given
/home/me/development/project/
bin
lib
etc/config
where /home/me/development/project/etc/config contains:
tmpl_dir: templates/
when my perl code looks up the tmpl_dir configuration it'll get:
/home/me/development/project/templates/
But on the live deployment:
/var/www/project/
bin
lib
etc/config
The same code would magically return
/var/www/project/templates/
Absolute values in the config should be honoured, so that:
apache_config: /etc/apache2/httpd.conf
would return "/etc/apache2/httpd.conf" in all cases.
Rather than a FindBin style approach, an alternative might be to allow configuration values to be defined in terms of other configuration values?
tmpl_dir: $base_dir/templates
I'd also like a pony ;)
Catalyst::Plugin::ConfigLoader supports multiple overriding config files. If your Catalyst app is called MyApp, then it has three levels of override: 1) MyApp.pm can have a __PACKAGE__->config(...) directive, 2) it next looks for MyApp.yml in the main directory of the app, 3) it looks for MyApp_local.yml. Each level may override settings in each other level.
In a Catalyst app I built, I put all of my immutable settings in MyApp.pm, my debug settings in MyApp.yml, and my production settings in MyApp_<servertype>.yml and then symlinked MyApp_local.yml to point at MyApp_<servertype>.yml on each deployed server (they were all a little different...).
That way, all of my config was in SVN and I just needed one ln -s step to manually config a server.
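Concretely, that one step was just something like (using the same placeholder file name as above):
# One-time, per-server configuration step
ln -s MyApp_<servertype>.yml MyApp_local.yml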
Perl Best Practices warns against exactly what you want. It states that config files should be simple and avoid the sort of baroque features you desire. It goes on to recommend three modules (none of which are Core Perl): Config::General, Config::Std, and Config::Tiny.
The general rationale behind this is that the editing of config files tends to be done by non-programmers, and the more complicated you make your config files, the more likely they are to screw them up.
All of that said, you might take a look at YAML. It provides a full-featured, human-readable* serialization format. I believe the currently recommended parser in Perl is YAML::XS. If you do go this route, I would suggest writing a configuration tool for end users to use instead of having them edit the files directly.
ETA: Based on Chris Dolan's answer it sounds like YAML is the way to go for you since Catalyst is already using it (.yml is the de facto extension for YAML files).
* I have heard complaints that blind people may have difficulty with it
YAML is hateful for config - it's not non-programmer friendly, partly because YAML in POD is by definition broken, as they're both whitespace-dependent in different ways. This addresses the main problem with Config::General. I've written some quite complicated config files with C::G in the past and it really keeps out of your way in terms of syntax requirements, etc. Other than that, Chris' advice seems on the money.