gnu sort - default buffer size

I have read the full documentation for GNU sort and searched online, but I cannot find what the default for the --buffer-size option is (which determines how much system memory the program uses when it runs). I am guessing it is somehow determined based on total system memory? (Or perhaps on memory available at the time the program begins execution?) How can I determine this?
Update: I've experimented a bit, and it seems that when I don't specify a particular --buffer-size value, it ends up using very little RAM and thus going very slowly. It would be nice, though, to better understand what exactly determines this behavior.

I went digging through the coreutils sort source code and found these functions: default_sort_size and sort_buffer_size.
It turns out that --buffer-size (sort_size in the source code) isn't the target buffer size but rather the maximum buffer size. If no --buffer-size value is specified, the default_sort_size function is used to determine a safe maximum buffer size. It does this based on resource limits, available memory, and total memory. A summary of the function is as follows:
size = MIN(SIZE_MAX, resource_limit) / 2;
mem  = MAX(available_memory, total_memory / 8);
if (size > total_memory * 0.75)
    size = total_memory * 0.75;
buffer_max = MIN(mem, size);
buffer_max = MAX(buffer_max, MIN_SORT_SIZE);
The other function, sort_buffer_size, is used to determine exactly how much memory to allocate for the given input files. A summary of the function is as follows:
if (sort_size is set)
    size_bound = sort_size;
else
    size_bound = default_sort_size();

size = line_bytes + 2;
for each input_file
    if (input_file is regular)
        file_size = input_file_size;
    else if (sort_size is set)
        return sort_size;
    else
        file_size = guess;

    worst_case = file_size * worst_case_per_input_byte + 1;
    if (worst_case overflows || size + worst_case >= size_bound)
        return size_bound;
    else
        size += worst_case;
return size;
Possibly the most important point about sort_buffer_size is that if you're sorting data from stdin or a pipe, it will simply use sort_size (i.e. --buffer-size) if one was provided. Otherwise, for regular files it makes a rough calculation based on the file sizes and uses sort_size only as an upper limit.

To summarize in English, the defaults are:
Reading from a real file:
Use all free memory, up to 3/4 and not less than 1/8 of total memory.
(If there is a process (rusage) memory limit in effect, sort will not use more than half of that.)
Reading from a pipe:
Use a small, fixed amount (tens of MB).
You will probably want -S.
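For example, when feeding sort from a pipe you can raise the limit explicitly (the file names here are only illustrative):
# allow sort to use up to 2 GiB of RAM for its buffer
zcat big.log.gz | sort -S 2G > sorted.log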
Current for GNU coreutils 8.29, Jan 2018.

Related

C++ AMP Division Quirk When Stressing GPU

Wrote a program that uses C++ AMP to run on my laptop's GPU (Intel HD Graphics 520). The GPU kernel is long so I will give a high level description (but let me know if more is needed).
Note that I fall into the "know enough to be dangerous" category of programmer.
parallel_for_each(accelerator_view, number_of_runs.extent, [data](index<1> idx) restrict(amp)
{
    double total = data.starting_total[idx];
    // these "working variables" are used for a variety of things in the code
    double working_variable = 0.0;
    double working_variable2 = 0.0;
    for (int i = 0; i < 20; i++)
    {
        // ...do lots of stuff. "total" is changed by various factors...
        // total is still a positive number that is greater than zero
        // working_variable now has a positive non-zero value, and I want to find what %
        // of the remaining total that value is
        working_variable2 = 1.0 / total;
        working_variable2 = working_variable * working_variable2;
        // Note that if I write it like this the same issue will happen:
        working_variable2 = working_variable / total;
        // ...keep going and doing more things, write some values to data...
        if (total == 0)
            break;
    }
});
When I run this without doing much else on my computer, it runs just fine and I get the results I expect.
Where it gets really tricky is when I am stressing the system (or think I am stressing the system). I stress-test the system by:
1) Kicking off my program
2) Opening up Chrome
3) Going to Youtube and starting a video
When I do that I get unexpected results (either when I am opening a program or running a video). I traced it back to the "1.0 / total" calculation returning infinity (inf), even though "total" is greater than zero. Here is an example of what I output to the console when this issue happens:
total = 51805.6
1.0 / total = inf
precise_math::pow(total, -1) = 1.93029e-05
I am running the kernel about 1.6 million times and I'll see between 0 and 15 of those 1.6 million hit this issue. The number of issues varies and which threads hit the issue varies.
So I feel confident that "total" is not zero and this is not a divide-by-zero situation. What am I missing? What could be causing this issue? Is there any way to prevent it? I am thinking of replacing all division in the kernel with pow(num, -1).
P.S. Yes, I am aware that part of the answer is "don't watch videos while running". It is the opening of programs during execution that I am most concerned about.
Thank you!

How to do a median projection of a large image stack in Matlab

I have a large stack of 800 16-bit grayscale images, each 2048x2048 px. They are read from a single BigTIFF file, and the whole stack barely fits into my RAM (8GB).
Now I need to do a median projection. That means I want to compute the median of each pixel across all 800 frames. The Matlab median function fails because there is not enough memory left to make a copy of the whole array for the function call. What would be an efficient way to compute the median?
I have tried using a for loop to compute the median one pixel at a time, but this is still terribly slow.
Iterating over blocks, as @Shai suggests, may be the most straightforward solution. If you do have this problem frequently, you may want to consider converting the image to a mat-file, so that you can access the pixels as an n-d array directly from disk.
%# convert to mat file
matObj = matfile('dest.mat', 'Writable', true);
matObj.data(2048, 2048, numSlices) = 0;   % pre-allocate the array on disk
for t = 1:numSlices
    matObj.data(:,:,t) = imread(tiffFile, 'Index', t);
end
%# load a block of the matfile to take median (run as part of a loop)
medianOfBlock = median(matObj.data(1:128, 1:128, :), 3);
I bet that the distributions of the individual pixel values over the stack (i.e. the histograms of the pixel jets) are sparse.
If that's the case, the amount of memory needed to keep all the pixel histograms is much less than 2K x 2K x 64k: you can use a compact hash map to represent each histogram and update them while loading the images one at a time. When all updates are done, you go through your histograms and compute the median of each.
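A rough sketch of that idea for a single pixel, using a sparse vector as the compact histogram (jet is my own name for the 800 values of one pixel):
h = sparse(double(jet) + 1, 1, 1, 65536, 1);     % counts only the levels that actually occur
[lev, ~, cnt] = find(h);                         % occurring levels (1-based) and their counts
cum = cumsum(cnt);
n = numel(jet);                                  % 800 samples
lo = lev(find(cum >= n/2, 1, 'first')) - 1;      % value of the 400th smallest sample
hi = lev(find(cum >= n/2 + 1, 1, 'first')) - 1;  % value of the 401st smallest sample
med = (lo + hi) / 2;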
If you have access to the Image Processing Toolbox, Matlab has a set of tools to handle large images, called blockproc.
From the docs :
To avoid these problems, you can process large images incrementally: reading, processing, and finally writing the results back to disk, one region at a time. The blockproc function helps you with this process.
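As a rough illustration of the calling pattern only (the file name and tile size are placeholders, and this computes a per-tile median of a single image rather than a median across frames):
fun = @(block) median(block.data(:));                      % median of each tile
tileMedians = blockproc('bigimage.tif', [1024 1024], fun); % reads the file tile by tile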
I will try my best to help, although I don't have an 800-frame TIFF stack or an 8GB computer; I just want to see if my thinking can form a solution.
First, 800 x 2048 x 2048 x 16 bit is about 6.4GB, not including the headers. With your 8GB RAM it could in principle be stored at once; there might just be too many programs running and fragmenting the contiguous memory. Anyway, let's treat the problem as if Matlab can't load it into memory as a whole.
As Jonas suggests, imread supports loading a TIFF image by index. It also supports a PixelRegion parameter, so you can also consider accessing parts of the image by this parameter if you want to utilize Shai's idea.
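For example (tiffFile, t, and the region bounds are placeholders), one block of one frame can be read on its own:
% read only rows 1..128 and columns 1..128 of frame t
blk = imread(tiffFile, 'Index', t, 'PixelRegion', {[1 128], [1 128]});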
I came up with a median algorithm that doesn't need all the data at the same time; it merely scans through a sequence of unordered data, one value at a time, but it does keep a memory of 256 counters.
data = randi([0,255], 1, 800);              % one pixel's 800 values (demo data)
bins = num2cell(zeros(256,1,'uint16'));     % one counter per grayscale level
for ii = 1:800
    bins{data(ii)+1} = bins{data(ii)+1} + 1;
end
% clearvars data
s = cumsum(cell2mat(bins));                 % cumulative histogram
if find(s==400)
    % the 400th and 401st smallest values straddle a bin boundary
    med = ( find(s==400, 1, 'first') + ...
            find(s>400, 1, 'first') ) / 2 - 1;
else
    med = find(s>400, 1, 'first') - 1;
end
It's not very efficient, not least because it uses a for loop. But the benefit is that instead of keeping 800 raw values in memory, only 256 counters are kept; the counters need uint16, so they take about as much space as 512 uint8 values. But if you are confident that for any pixel the same grayscale level won't occur more than 255 times among the 800 samples, you can choose uint8 and hence reduce the memory by half.
The above code is for one pixel. I'm still thinking about how to expand it to a 2048x2048 version, such as the loop below (a rough vectorized sketch follows at the end of this answer):
for ii = 1:800
    img_data = randi([0,255], 2048, 2048);
    % (do stats stuff)
end
By doing so, for each iteration, you only need these kept in memory:
One frame of image;
A set of counters;
A few supplemental variables, with size comparable to one frame of image.
I use a cell array to store the counters. According to this post, a cell array can be pre-allocated while its elements can still be stored in memory non-contiguously. That means the 256 counters (512 x 2048 x 2048 bytes) can be stored separately, which is quite reasonable for your 8GB RAM. But obviously my sample code does not take advantage of this, since bins = num2cell(zeros(...)) allocates them contiguously first.
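A minimal sketch of one way to do the full-frame version, assuming (as in the 1-D example above) that the data fits in 256 grayscale levels and that a 2048x2048x256 uint16 counter array (about 2GB) fits in RAM; tiffFile is a placeholder:
counts = zeros(2048, 2048, 256, 'uint16');     % one histogram per pixel
npix = 2048 * 2048;
for t = 1:800
    frame = imread(tiffFile, 'Index', t);      % one 2048x2048 frame, values 0..255
    % linear index of bin (frame value + 1) for every pixel in the 3-D counter array
    lin = double(frame(:)) * npix + (1:npix)';
    counts(lin) = counts(lin) + 1;
end
% walk the 256 levels once, tracking where each pixel's cumulative count passes 400/401
cum = zeros(2048, 2048);
below400 = zeros(2048, 2048);
below401 = zeros(2048, 2048);
for b = 1:256
    cum = cum + double(counts(:,:,b));
    below400 = below400 + (cum < 400);   % levels below the 400th smallest sample
    below401 = below401 + (cum < 401);   % levels below the 401st smallest sample
end
% median of 800 samples = average of the 400th and 401st smallest values
medianProjection = (below400 + below401) / 2;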

Matlab read heterogeneous binary data

I'd like to read heterogeneous binary data into Matlab. I know from the beginning how much data there is and which datatype each segment has. For example:
%double %double %int32 ...
and then this gets repeated about a million times. Easy enough to handle with fread, since I know the number of bytes for each segment and can therefore calculate the skip value for each row.
But now the data segment looks something like this :
%double %int32 %*char %double %double ...
Whereby the int prior to the *char is the length of the said string. This brings the problem that I cannot calculate the skip anymore, and I'm stuck reading in the whole file line by line, therefore needing many more file accesses and slowing everything down.
In order to get at least some speed-up, I want to read in all the %double %double ... segments (around 30 elements) at a time and then use those from a buffer to fill up the arrays. In C this would be a rather easy task, but here, without memcpy and without direct access to pointers...
Do you know any way to achieve this without using mex files?
You can't solve the problem that the record size is unknown, and thus you don't know how much to read ahead of time. But you can batch up the reads, and if you have a reasonable max size for the string, you can always read that amount, and ignore the unneeded bytes at the end. typecast is the trick:
readlen = 1024;
buf = fread(fid, readlen, '*uint8');              % the asterisk keeps the returned array as uint8
rec.val1 = typecast(buf(1:8), 'double');          % first %double
string_len = typecast(buf(9:12), 'int32');        % %int32 string length
rec.str1 = char(buf(13:13+string_len-1))';        % the *char string itself
pos = 13 + string_len;
rec.val2 = typecast(buf(pos:pos+8-1), 'double');  % next %double
You might wrap a simple function around this technique to track the current offset automatically.
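A minimal sketch of such a wrapper (takeBytes and the state struct are names I made up), keeping the buffer and the current offset together; strings would still need a char() conversion as above:
function [val, state] = takeBytes(state, n, type)
% consume the next n bytes of state.buf and reinterpret them as the given type
    raw = state.buf(state.pos : state.pos + n - 1);
    val = typecast(raw, type);
    state.pos = state.pos + n;
end

% usage:
state = struct('buf', buf, 'pos', 1);
[rec.val1, state] = takeBytes(state, 8, 'double');
[string_len, state] = takeBytes(state, 4, 'int32');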

How to run the code for large variables

I have code like the below:
W = 3;
i = 4;
s = fullfact(ones(1,i)*(W + 1)) - 1;
p2 = unique(sort(s(sum(s,2) == i,:),2),'rows');
I can run this code only up to i = 11, but I want to run it for up to i = 25. When I run this code for i = 12, it shows the error message "Out of Memory".
I need to keep this code as it is. How can I modify it for larger values of i?
Matlab experts, I need your valuable suggestions.
Just wanting to do silly things is not enough. You are generating arrays that are simply too large to fit into memory.
See that the size of the matrix s is a function of i here: size(s) will be 2^(2*i) by i. (By the way, some will argue it is a bad idea to use i, which normally denotes sqrt(-1), as a name for such variables.)
So when i = 4, s is only 256x4.
When i = 11, s is 4194304x11. This array takes 369098752 bytes of space, so 370 megabytes.
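You can check that figure yourself (a quick sanity check; it needs a few hundred megabytes of RAM):
W = 3;
s = fullfact(ones(1,11) * (W + 1)) - 1;
whos s     % should report a 4194304x11 double of 369098752 bytes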
When i = 25, the array will be of size
2^50*25
ans =
2.8147e+16
Multiply that by 8 to get the memory needed. Something like 224 petabytes of memory! If you have that much memory, then send me a few terabytes of RAM. You won't miss them.
Yes, there are times when MATLAB runs out of memory. You can get the amount of memory available at any point in time by executing the following:
memory
However, I would suggest following one of the strategies to reduce memory usage available here. Also, you might want to clear the variables which are not required in every iteration by
clear variable_name

Matlab - Assignment into array causes error: "Maximum variable size allowed by the program is exceeded"

I am running a script which creates a lot of big arrays. Everything runs fine until the following lines:
%dist is a sparse matrix
inds=dist~=0;
inserts=exp(-dist(inds).^2/2*sig_dist);
dist(inds)=inserts;
The last line causes the error: ??? Maximum variable size allowed by the program is exceeded.
I don't understand how could the last line increase the variable size - notice I am inserting into the matrix dist only in places which were non-zero to begin with. So what's happening here?
I'm not sure why you are seeing that error. However, I suggest you use the Matlab function spfun to apply a function to the nonzero elements in a sparse matrix. For example:
>> dist = sprand(10000,20000,0.001);
>> f = @(x) exp(-x.^2/2*sig_dist);
>> dist = spfun(f,dist)
MATLAB implements a "lazy copy-on-write" model. Let me explain with an example.
First, create a really large vector
x = ones(5*1e7,1);
Now, say we wanted to create another vector of the same size:
y = ones(5*1e7,1);
On my machine, this will fail with the following error
??? Out of memory. Type HELP MEMORY for your options.
We know that y will require 5*1e7*8 = 400000000 bytes ~ 381.47 MB (which is also confirmed by whos x), but if we check the amount of free contiguous memory left:
>> memory
Maximum possible array: 242 MB (2.540e+008 bytes) *
Memory available for all arrays: 965 MB (1.012e+009 bytes) **
Memory used by MATLAB: 820 MB (8.596e+008 bytes)
Physical Memory (RAM): 3070 MB (3.219e+009 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
we can see that it exceeds the 242 MB available.
On the other hand, if you assign:
y = x;
it will succeed almost instantly. This is because MATLAB is not actually allocating another memory chunk of the same size as x, instead it creates a variable y that shares the same underlying data as x (in fact, if you call memory again, you will see almost no difference).
MATLAB will only try to make another copy of the data once one of the variables changes, thus if you try this rather innocent assignment statement:
y(1) = 99;
it will throw an error, complaining that it ran out of memory, which I suspect is what is happening in your case...
EDIT:
I was able to reproduce the problem with the following example:
%# a large enough sparse matrix (you may need to adjust the size)
dist = sparse(1:3000000,1:3000000,1);
First, let's check the memory status:
» whos
Name Size Bytes Class Attributes
dist 3000000x3000000 48000004 double sparse
» memory
Maximum possible array: 394 MB (4.132e+008 bytes) *
Memory available for all arrays: 1468 MB (1.539e+009 bytes) **
Memory used by MATLAB: 328 MB (3.440e+008 bytes)
Physical Memory (RAM): 3070 MB (3.219e+009 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
say we want to apply a function to all non-zero elements:
f = @(X) exp(-X.^2 ./ 2);
strangely enough, if you try to slice/assign, then it will fail:
» dist(dist~=0) = f( dist(dist~=0) );
??? Maximum variable size allowed by the program is exceeded.
However the following assignment does not throw an error:
[r,c,val] = find(dist);
dist = sparse(r, c, f(val));
I still don't have an explanation why the error is thrown in the first case, but maybe using the FIND function this way will solve your problem...
In general, reassigning nonzero elements does not change the memory footprint of a sparse matrix. Call whos before and after assignment to check.
dist = sparse(10, 10);
dist(1,1) = 99;
dist(6,7) = exp(1);
inds = dist ~= 0;
whos
dist(inds) = 1;
whos
Without a reproducible example it is hard to determine the cause of the problem. It may be that some intermediate assignment is taking place that isn't sparse. Or you have something specific to your problem that we can't see here.