Why is the first run of grep several times slower than consecutive runs? - operating-system

I run this command every time I build my project from the project directory:
egrep -r -n --include=*.java <my regex> .
I cannot understand why consecutive runs are up to 10 times faster than the first one. I have seen the same behavior in other disk I/O operations involving large directories (calculating a directory's size, code commits, etc.).
I think it is related to the operating system's disk I/O internals, probably caching at some level. Can somebody point my nose in the right direction?

Because recently accessed files are cached by the operating system.
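If you want to see the effect for yourself on Linux, you can drop the page cache between runs and time the command; this is only a quick demonstration (drop_caches is Linux-specific and needs root), not something to do in normal use:
sync                                              # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches        # drop page cache, dentries and inodes
time egrep -r -n --include=*.java <my regex> .    # cold cache: slow
time egrep -r -n --include=*.java <my regex> .    # warm cache: much faster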

Related

Could not create shared memory segment. Failed system call was shmget. PostgreSQL MacOS Mojave. Symlinked postgres data directory

This is a common error message and there are many general answers that have not worked for me.
I think I have isolated this particular problem to the PostgreSQL data directory being symlinked to an external hard drive.
FATAL: could not create shared memory segment: No space left on device
DETAIL: Failed system call was shmget(key=5432001, size=56, 03600).
HINT: This error does *not* mean that you have run out of disk space. It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached.
$ sysctl -a | grep sysv
kern.sysv.shmmax: 412316860416
kern.sysv.shmmin: 8
kern.sysv.shmmni: 64
kern.sysv.shmseg: 128
kern.sysv.shmall: 100663296
$ sudo cat /etc/sysctl.conf
kern.sysv.shmmax=412316860416
kern.sysv.shmmin=8
kern.sysv.shmmni=64
kern.sysv.shmseg=128
kern.sysv.shmall=100663296
PostgreSQL version 9.4.15. From my PostgreSQL config:
shared_buffers = 128MB
Don't know what other settings would be relevant.
Other environment details:
The external hard drive with the data directory is at only 50% capacity. My RAM usage when this happens is ~60% capacity.
I have not been able to determine an exact set of steps that reproduces the bug. I have an external hard drive with a PostgreSQL data directory and a local folder with another data directory. In my project, I'll symlink to one or the other depending on which copy of data I want to use. As far as I have noticed, the problem only appears when I've been working off the symlinked hard drive and when I unplug it without stopping the server and then plug it back in. But it doesn't happen every time when I perform those steps.
I don't expect anyone to be able to point to the specific problem given the above description.
But how can I get more useful information next time I'm in a bugged state? Are there any system commands that would help identify the exact problem?
...It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached.
How can I check whether all available shared memory IDs have been taken or whether the system's overall limit for shared memory has been reached, and what do I do with the answer?
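Would something along these lines be the right way to check? (ipcs and ipcrm exist on macOS; whether comparing their output against the sysctl limits above is the correct diagnosis is exactly what I'm unsure about.)
ipcs -m              # list existing shared memory segments with their IDs and owners
ipcs -m -a           # same list with sizes, to compare totals against kern.sysv.shmall/shmmax
ipcrm -m <shmid>     # remove a stale segment by ID, e.g. one left behind by a killed postgres
If the number of segments reported by ipcs -m were close to kern.sysv.shmmni (64 here), exhausted segment IDs would seem to be the likely culprit.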

Bulk file restore from Google Cloud Storage

I accidentally ran a delete command on the wrong bucket. Object versioning is turned on, but I don't really understand what steps I should take to restore the files or, more importantly, how to do it in bulk, as I've deleted a few hundred of them.
Will appreciate any help.
To restore hundreds of objects you could do something as simple as:
gsutil cp -AR gs://my-bucket gs://my-bucket
This will copy all objects (including deleted ones) to the live generation using metadata-only copying, i.e., it does not require copying the actual bytes. Caveats:
It will leave the deleted generations in place, so they will keep costing you extra storage.
If your bucket isn't empty, this command will re-copy any live objects on top of themselves (ending up with an extra archived version of each of those as well, also costing you extra storage).
If you want to restore a large number of objects, this simplistic approach would run too slowly; you'd want to parallelize the individual gsutil cp operations. You can't use the gsutil -m option in this case, because gsutil prevents that in order to preserve generation ordering (e.g., if there were several generations of objects with the same name, copying them in parallel would end up with the live generation coming from an unpredictable generation). If you only have one generation of each, you could parallelize the copying by doing something like:
gsutil ls -a gs://my-bucket/** | sed 's/\(.*\)\(#[0-9]*\)/gsutil cp \1\2 \1 \&/' > gsutil_script.sh
This generates a listing of all objects (including deleted ones), and transforms it into a sequence of gsutil cp commands to copy those objects (by generation-specific name) back to the live generation in parallel. If the list is long you'll want to break it into parts so you don't (for example) try to fork 100k processes to do the parallel copying (which would overload your machine).
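One way to do that chunking (a rough sketch; the chunk size and file names are arbitrary choices, not part of the original answer) is to split the generated script and run one chunk at a time, letting each chunk's background copies finish before starting the next:
split -l 100 gsutil_script.sh chunk_     # 100 parallel copies per chunk
for f in chunk_*; do
  ( . "$f"; wait )                       # run the chunk's copies (each line ends in &), then wait for them
done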

Matlab/Simulink: run batch of simulations in parallel?

I have to run a series of simulations and save the results. Since by default Matlab only uses one core, I wonder if it is possible to open multiple worker tasks and assign different simulation runs to them?
You could run each simulation in a separate MATLAB instance and let the OS handle the process-to-core assignment.
One master MATLAB instance could synchronize the child instances, for example by checking whether the simulation result files exist yet.
I also had the same problem, but I did not manage to figure out how to do this in MATLAB; the documentation was too advanced for me to work it out.
Since I am working on Ubuntu, I found a way to do it by calling the unix command from MATLAB and using GNU parallel.
This way I managed to run my simulations in parallel on 4 cores:
unix('parallel --progress -j4 flow > /dev/null :::: Pool.txt','-echo')
You can find more info in this linked question:
Shell, run four processes parallel
Details of the syntax can be found at https://www.gnu.org/software/parallel/, but briefly I can tell you:
--progress shows the status of the progress
-j4 tells how many jobs you want to run in parallel
flow is the name of my simulator
> /dev/null is just to keep the simulator's screen output from showing up
Pool.txt is a file I made with the required simulator input, which is basically the path and the main simulator file
'-echo': I do not remember now what it was for :D
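To make the setup concrete, here is what it looks like outside MATLAB. The Pool.txt contents below are made-up paths, purely for illustration; the command itself is the same one passed to unix() above:
Pool.txt (one simulator input per line):
/home/user/sims/case01/main.inp
/home/user/sims/case02/main.inp
parallel --progress -j4 flow > /dev/null :::: Pool.txt
This runs flow once per line of Pool.txt, 4 jobs at a time; the jobs' collected output goes to /dev/null, while the --progress display is printed to stderr and so still shows in the terminal.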

Fast Mass File Copying in Scala with Akka

I am wondering what would be the best way to copy a file from src to dest within Scala, wrapped in an Akka Actor and possibly using a RemoteActor across several machines.
I have a tremendous number of image files I have to copy from one directory to an NFS-mounted directory.
I haven't done much file handling in Java or Scala, but I know there is the NIO library and some others out there that have been worked on since Scala 2.7. I'm looking for something that would be the safest and quickest.
I should probably give some idea of my infrastructure as well. The connection is 1000 Mbps and goes through a Cisco 3560 from an Isilon node to a Windows 2003 Server. The Isilon node is the NFS mount and the Windows 2003 Server is a highly configured Samba (CIFS) mount.
You probably can't beat the underlying OS file copy speed, so if the files are large or you can batch them, you're probably best off writing a shell script with Scala and then calling it with bash or somesuch. Chances are that one thread can saturate the disk IO, so there really isn't anything fancy to do. If the images are large, you'll be waiting for the 50ish MB/s limit on your disk (or 10ish MB/s limit on your 100 Mbps ethernet); if they're small, you'll be waiting for the quite-some-dozens of ms overhead on file seeks and network ping times and so on.
That said, you can use Apache Commons IO, which has a file copy utility, and a previous question has a high-performance answer among the top rated entries. You can have one actor handle all the copying tasks, and that should be as fast as if you have a bunch of actors all trying to compete for the same limited IO bandwidth.
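As a rough sketch of the "let the OS do the copying" route from the first paragraph (the paths and flags here are assumptions, not details from the question):
SRC=/data/images             # hypothetical source directory
DST=/mnt/isilon/images       # hypothetical NFS-mounted destination
rsync -a "$SRC"/ "$DST"/     # one rsync process is usually enough to saturate the disk or the link
A single Scala/Akka actor could simply shell out to a command like this for each batch of files, which matches the "one actor handling all the copying" suggestion above.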

How to grab a full memory dump of a large memory usage

I am hosting IIS-based web service applications on a Windows 2008 64-bit system running on a quad-core 8 GB machine. I ran into a couple of instances when w3wp was running at 7.6 GB of memory usage and nothing else on the system, including RDP, was responding. Right-clicking the process in Task Manager and creating a dump froze the system and all its threads for a long time (close to 30 minutes). When the freeze-up occurred during off hours, we let the dump run for a while (close to 1 hour), but the dump still didn't complete. In the interest of getting the system up, we had to kill IIS.
I tried other tools like procexp, DebugDiag, etc. to create a full memory dump, and all had the same result.
So, what tool does the community use to grab dump files quickly, or without freezing all the threads? I realize the latter might be a rhetorical question. But what are the options for generating such a large dump file without locking up the system for a long time?
IMO you shouldn't have to wait until the process memory grows to 8 GB. I am sure that at something like 3-4 GB you should already be able to detect the memory leak.
ProcDump has an option based on a memory threshold:
-m   Memory commit threshold in MB at which to create a dump of the process.
I would use this option to dump the memory of the process.
Writing the dump to an SSD would also make it faster.
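As a sketch of that suggestion (the 4 GB threshold and the output path are assumptions, not values from the question):
procdump -ma -m 4096 w3wp C:\dumps\w3wp_highmem.dmp
-ma writes a full dump, and -m 4096 tells ProcDump to trigger it automatically once the process's memory commit crosses about 4 GB, so you capture the leak well before the machine becomes unresponsive at 7-8 GB.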
WPA, a.k.a. xperf (http://msdn.microsoft.com/en-us/performance/cc825801.aspx), is a powerful tool for diagnosing applications. You will get the call stack of the culprit allocation. You don't have to collect a dump, it is non-invasive, and it does not put much load on production systems.
Complete step-by-step information is available here: http://msdn.microsoft.com/en-us/library/ff190906(v=VS.85).aspx