How to make volume range calculation more RAM friendly? - kdb

I am trying to find out the mean, median and percentile ranges of price movements for a given volume to be filled, using trade data. The code is attached below. The problem is that it gives me a wsfull error when I run it on ~80k records. I am using a 4GB Linux box. At the moment I can only run it for ~30k records, and even then q uses >70% of my RAM.
Is there any way to make it more memory friendly?
rangeForVol : {[symIn; vol; dt]
data: select from table where sym=symIn, date=dt;
data: update cumVol: sums quantity, cVol: sums quantity from data;
data: update cumVolTgt: cumVol + vol from data;
data: update pxLst: price[where each ((cumVol>=/:cVol) and (cumVol<=/:cumVolTgt))=1] from data;
.Q.gc[];
data: update minPx: min each pxLst, maxPx: max each pxLst from data;
data: update range: maxPx - minPx from data;
data
};
select count i by floor range%0.5 from rangeForVol[`ABC; 2500; 2012.06.04]

The code you quote above almost certainly does not do what you were trying to achieve.
The columns cumVol and cVol are identical (both contain a running total of that day's volume). Later you calculate cumVol>=/:cVol. The each-right adverb /: means that for every element in cVol, the entire vector cumVol is compared against it. As the two columns are identical, you will get the identity matrix (plus extra 1b entries wherever values are not distinct).
q)(til 4)=\:til 4
1000b
0100b
0010b
0001b
It seems you wanted to perform an element-wise comparison between the two vectors (though comparing a vector to itself also doesn't make sense); if you want to do this explicitly, each-both would be the correct adverb (='). However, in q, comparison operators such as = and >= implicitly apply item-wise to two vectors of the same length (or to a vector and a scalar, which is what happens inside each iteration of the each-left example above), making any adverb unnecessary.
The fact that you are creating two n x n matrices, when you almost certainly intended a length-n vector, is the most likely reason you are running out of memory.

Related

Why do I get different results in different versions of MATLAB (2016 vs 2021)?

Why do I get different results when running the same code, sum(b.*x1), where b is single and x1 is double, in different versions of MATLAB (2016 vs 2021)? How can I avoid such differences between MATLAB versions?
MATLAB 2021:
sum(b.*x1)
ans =
single
-0.0013286
MATLAB 2016:
sum(b.*x1)
ans =
single
-0.0013283
In R2017b, they changed the behavior of sum for single-precision floats, and in R2020b they made the same changes for other data types too.
The change speeds up the computation and improves accuracy by reducing rounding errors. Simply put, previously the algorithm would just run through the array in sequence, adding up the values. The new behavior computes the sum over smaller portions of the array, and then adds up those partial results. This is more precise because the running total can become a very large number, and adding small numbers to it causes more rounding in those small numbers. The speed improvement comes from loop unrolling: the loop now steps over, say, 8 values at a time, and in the loop body, 8 running totals are computed (they don't specify the number they use; the 8 here is an example).
Thus, your newer result is a better approximation to the sum of your array than the old one.
For more details (a better explanation of the new algorithm and the reason for the change), see this blog post.
Regarding how to avoid the difference: you could implement your own sum function, and use that instead of the builtin one. I would suggest writing it as a MEX-file for efficiency. However, do make sure you match the newer behavior of the builtin sum, as that is the better approximation.
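Purely as a rough illustration (this is not MathWorks' actual implementation; the block size of 8 and the function name blockedSum are arbitrary choices), a blocked summation could look something like this:
% Sketch of a blocked summation (illustration only, not the builtin algorithm).
% Keeping several partial sums means each running total stays smaller, so
% adding small values to it loses less precision.
function s = blockedSum(x)
    blockSize = 8;                              % arbitrary block size for illustration
    n = numel(x);
    nBlocks = floor(n / blockSize);
    partial = zeros(1, blockSize, 'like', x);   % 8 running totals, same type as x
    for k = 1:nBlocks
        idx = (k-1)*blockSize + (1:blockSize);
        partial = partial + reshape(x(idx), 1, blockSize);
    end
    s = sum(partial) + sum(x(nBlocks*blockSize+1:n));  % combine totals and leftovers
end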
Here is an example of the problem. Let's create an array with N+1 elements, where the first one has a value of N and the rest have a value of 1.
N = 1e8;
a = ones(N+1,1,'single');
a(1) = N;
The sum over this array is expected to be 2*N. If we set N large enough w.r.t. the data type, I see this in R2017a (before the change):
>> sum(a)
ans =
single
150331648
And I see this in R2018b (after the change for single-precision sum):
>> sum(a)
ans =
single
199998976
Both implementations make rounding errors here, but one is obviously much, much closer to the expected result (2e8, or 200000000).

Combination Operators on Sensors Arrays Data

Are there any standardized operators over data from arrays of sensors?
I am normally dealing with sensors data in the form of time + channels. The time is a timestamp, and the channels are the available data for these timestamps. All these fields are numeric, no strings involved.
Normally I have to mix those data objects in different ways. Let's suppose M1 is size m1xn1 and M2 is size m2xn2:
Combine rows of data from the same channels and different timestamps (i.e. n1 == n2). This leads to a vertical concatenation [M1; M2].
Combine columns of data from the same timestamps and different channels, (i.e. m1 == m2). This leads to a horizontal concatenation [M1 M2].
These operators are trivial and well defined.
When there are slight differences, for example a few additional samples in M1 or M2, everything becomes complicated and I have to devise workarounds to perform such operations, such as these:
Cleaning out the excess samples from M1 or M2 so that the dimensions match.
Calculating an aggregated timestamp, obtaining a unique(sort()) over the timestamps, and then applying a union as in a SQL JOIN statement.
Aggregating the data on M1 or M2, that is, reducing m1 or m2 to a smaller figure by resampling the timescale, and then applying an aggregation as in a SQL GROUP BY statement.
I cannot think of a unique and definite function to combine this sort of data. How can I do this?
Let's say you have an m1-element vector of time values t1 and an n1-element vector of channel values c1 for your m1-by-n1 matrix M1 (and likewise for M2). First and foremost, you will likely need to convert your time and channel values into equivalent index values. You can do this by expanding your time and channel values into grids using ndgrid, then converting them to index values using unique:
[t1, c1] = ndgrid(t1, c1);
[t2, c2] = ndgrid(t2, c2);
[tUnion, ~, tIndex] = unique([t1(:); t2(:)]);
[cUnion, ~, cIndex] = unique([c1(:); c2(:)]);
Now there are two approaches you can take for aggregating the data using the above indices. If you know for certain that the matrices M1 and M2 will never contain repeated measurements (i.e. the same combination of time and channel will not appear in both), then you can build the final joined matrix by creating a linear index from tIndex and cIndex and combining the values from M1 and M2 like so:
MUnion = zeros(numel(tUnion), numel(cUnion));
MUnion(tIndex+numel(tUnion).*(cIndex-1)) = [M1(:); M2(:)];
If the matrices M1 and M2 could contain repeated measurements at the same combination of time and channel values, then accumarray will be the way to go. You will have to decide how you want to combine the repeated measurements, such as taking the mean as shown here:
MUnion = accumarray([tIndex cIndex], [M1(:); M2(:)], [], @mean);
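As a small self-contained illustration with made-up numbers (the grid variables are renamed t1g, c1g, etc. here so the original time and channel vectors are not overwritten), the whole join might look like this:
% Hypothetical example data: two small recordings that overlap at (t=3, channel=20).
t1 = [1; 2; 3];   c1 = [10 20];   M1 = [1 2; 3 4; 5 6];    % 3 timestamps x 2 channels
t2 = [3; 4];      c2 = [20 30];   M2 = [7 8; 9 10];        % 2 timestamps x 2 channels

% Expand the time/channel values into grids and map them to union indices.
[t1g, c1g] = ndgrid(t1, c1);
[t2g, c2g] = ndgrid(t2, c2);
[tUnion, ~, tIndex] = unique([t1g(:); t2g(:)]);
[cUnion, ~, cIndex] = unique([c1g(:); c2g(:)]);

% Aggregate, taking the mean where both recordings measured the same point.
MUnion = accumarray([tIndex cIndex], [M1(:); M2(:)], ...
                    [numel(tUnion) numel(cUnion)], @mean);
% MUnion(3,2) is 6.5: the mean of M1's 6 and M2's 7 at (t=3, channel=20).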

MATLAB spending an incredible amount of time writing a relatively small matrix

I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code is taking a truly incredible amount of time (hours) to do what should be achievable in at most a few seconds. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on the input but in all cases is smaller than 1000x1000.
The code is as follows
function [data] = DataHandler(D)
n = size(D,1);
s = max(D,1);
data = zeros(s,s);
for i = 1:n
data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
end
It's the data = zeros(s,s); line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there are better ways to do this, but since I just bashed the code together to quickly format some data I wasn't too concerned. As I said, I fixed it by replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. The new code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. It's a little disconcerting, and it would be good to be confident that I can initialize matrices freely without killing MATLAB.
Your call to zeros is incorrect. Looking at your code, D appears to be an n x 2 array, yet your call s = max(D,1) actually generates another n x 2 array rather than a scalar. Consulting the documentation for max, this is what happens when you call max the way you did:
C = max(A,B) returns an array the same size as A and B with the largest elements taken from A or B. Either the dimensions of A and B are the same, or one can be a scalar.
Therefore, because you used max(D,1), you are essentially comparing every value in D with the value 1, so what you actually get is just a copy of D in the end. Using this as input to zeros has rather undefined behaviour: for each row of s, a temporary zeros matrix of that size is allocated and then tossed, and only the dimensions given by the last row of s are recorded. Because you have a very large matrix D, this is probably why the profiler hangs here at 100% utilization. Each size parameter to zeros must be a scalar, yet your expression for s produces a matrix.
What I believe you intended should have been:
s = max(D(:));
This finds the overall maximum of the matrix D by unrolling D into a single vector and taking the maximum of that. If you do this, your code should run faster.
As a side note, this post may interest you:
Faster way to initialize arrays via empty matrix multiplication? (Matlab)
It was shown in this post that zeros(n,n) is in fact slow and that there are several neat tricks for initializing an array of zeros. One way is to use empty matrix multiplication:
data = zeros(n,0)*zeros(0,n);
One of my personal favourites is that if you assume that data was not declared / initialized, you can do:
data(n,n) = 0;
If I can also comment: that for loop is quite inefficient. What you are doing is calculating a 2D histogram / accumulation of the data. You can replace that for loop with a more efficient call to accumarray. This also avoids allocating an array of zeros, as accumarray will do that under the hood for you.
As such, your code would basically become this:
function [data] = DataHandler(D)
data = accumarray([D(:,1) D(:,2)+1], 1);
accumarray in this case takes all pairs of row and column coordinates, stored in D(i,1) and D(i,2) + 1 for i = 1, 2, ..., size(D,1), and places all pairs that share the same row and column coordinates into the same 2D bin. It then adds up the occurrences in each bin, so the output at each bin is the total tally of how many rows of D mapped to that row and column coordinate.
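As a quick made-up illustration of that 2D tally:
% Hypothetical tiny input: each row of D holds one observation (col 1 >= 1, col 2 >= 0).
D = [1 0; 1 0; 2 1; 3 0];

% Count how many rows of D map to each (D(i,1), D(i,2)+1) coordinate.
data = accumarray([D(:,1) D(:,2)+1], 1);

% data(1,1) is 2 because the pair (1,0) appears twice in D;
% data(2,2) and data(3,1) are each 1, and every other entry is 0.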

MATLAB massive multi-assignment technique

I'm trying to assign ~1 Million values to a 100x100 logical matrix like this:
CC(Labels,LabelsXplusOne) = true;
where CC is 100x100 logical and Labels, LabelsXplusOne are 1024x768 int32.
The problem is that the above statement takes about 5 minutes to complete on a modern CPU.
Obviously it is badly implemented in MATLAB, so how can we make the above run faster without resorting to loops?
In case you are wondering, I need this statement to compute blobs in an integer (not binary) image.
And also:
max(max(Labels)) = 100
max(max(LabelsXplusOne)) = 100
EDIT:
OK, I got it. Maybe this will help others in the future:
tic; CC(sub2ind(size(CC),Labels,LabelsXplusOne)) = true; toc;
Elapsed time is 0.026414 seconds.
Much better now.
There are a couple of issues that jump out at me...
I have the feeling you are doing the matrix indexing wrong. As it stands now, what will happen is every value in Labels will be paired with every value in LabelsXplusOne, producing (1024*768)^2 total index pairs for your rows and columns of CC. That's likely what's taking so long.
What you probably want is to only use each pair of values as indices, like Labels(1,1),LabelsXplusOne(1,1), Labels(1,2),LabelsXplusOne(1,2), etc. To do this, you should convert your indices into linear indices using the function SUB2IND.
Additionally, your matrix CC only contains 10,000 entries, yet your index matrices each contain 786,432 integer values. This means you will end up assigning the value true to the same entry in CC many times over. You should first remove redundant sets of indices using the function UNIQUE, then use them to assign values to CC.
This is what I think you want:
CC(unique(sub2ind(size(CC), Labels, LabelsXplusOne))) = true;
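For instance, with some made-up small label matrices (purely illustrative values), the conversion works like this:
% Hypothetical small inputs, purely to illustrate the indexing.
CC = false(100, 100);
Labels         = [1 2 2; 3 1 2];   % row coordinates into CC
LabelsXplusOne = [4 5 5; 6 4 5];   % column coordinates into CC

% Convert each (row, column) pair to one linear index, drop duplicates,
% and set only those entries of CC to true.
CC(unique(sub2ind(size(CC), Labels, LabelsXplusOne))) = true;
% Only CC(1,4), CC(2,5) and CC(3,6) become true, each assigned once.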

Extract large Matlab dataset subsets

Referencing and assigning a subset of a MATLAB dataset appears to be extremely inefficient, and possibly scales like rows^2.
Example:
alldata is a large dataset of mixed data - say 150,000 rows by 25 columns (integer, boolean and string).
The format for the dataset is:
'format', '%s%u%u%u%u%u%s%s%s%s%s%s%s%u%u%u%u%s%u%s%s%u%s%s%s%s%u%s%u%s%s%s%u%s'
I then convert 2 integer columns into type boolean.
the following subset assignment:
somedata = alldata(1:m,:)
takes >7 sec for m = 10,000 and a ridiculous amount of time for larger values of m. Plotting time vs m shows an m^2-type relationship, which is strange given that copying alldata is nearly instantaneous, as is using functions like sortrows and find. In fact, reading in the original .csv data file is faster than the above assignment for large values of m.
Using the profiler, it appears there is a function subsref that includes a very slow line that performs string comparisons to determine unique values within the dataset. Is this related to how the dataset type is stored (i.e. a reference table)? The dataset includes a large number of unique string values.
Are there any solutions for extracting a subset of a dataset in MATLAB? Such as preallocation (how?), or copying the dataset and deleting rows rather than assigning an extract/subset.
I am using a dual core machine with 1.5GB of RAM, but Task Manager reports less than 1GB of RAM in use.
I have previously worked with MATLAB's dataset arrays for large data, and unfortunately it's true that they suffer from performance issues. One thing I found that helps speed things up is to clear the observation names (ObsNames) property.
Try the following fix:
%# I assume you have a 'dataset' object
ds = dataset(...);
%# clear the observation names property (it's simply a label for each record)
ds.Properties.ObsNames = [];
Amro suggested clearing the observation names:
ds.Properties.ObsNames = [];
This results in a massive performance benefit: the subset assignment goes from quadratic in the number of rows (linear when plotted against rows^2) to linear in the number of rows, at the minor cost of losing the ObsNames.
Copying the dataset is near instantaneous, so copying it and then deleting the unneeded rows also gives a massive performance improvement, though it is a slightly less optimal solution (roughly 2x slower than dropping ObsNames) with no loss of ObsNames. It improves by only about 2% when ObsNames are also dropped.
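For reference, a minimal sketch of that copy-and-delete approach (assuming alldata is the dataset and m is the number of rows to keep):
% Sketch: copy the dataset, then delete the rows we don't need.
% The copy is near instantaneous, and row deletion avoids the slow
% subscripted-extraction path.
somedata = alldata;
somedata(m+1:end, :) = [];   % keep only rows 1..m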
Supporting data
I used a small script that assigns a subset of the rows of a 150,000 x 25 mixed string/integer/boolean dataset; it generated the following time measurements (in seconds).
The memory heap size made no significant difference in performance and was left at 128 MB.
subsref means the standard function for subset assignment was used
ObsNames=[] means the ObsNames were dropped
Delete means the dataset was copied and the unneeded rows deleted
Rows, subsref, subsref & ObsNames=[], Delete, Delete & ObsNames=[]
8000, 4.19, 2.06, 4.81, 4.72
32000, 57.61, 2.49, 5.26, 5.21
72000, 390.72, 3.21, 6.09, 6.03
128000, ?(*), 4.21, 7.25, 7.19
(*) I gave up on evaluating this value. Based on linear extrapolation against rows^2, I would guess around 2000 sec, or roughly half an hour.
Script
clear
load('data'); % load 'alldata' dataset
% alldata.Properties.ObsNames = []; % drop obsnames
tic;
x = ((1:4).^2.*8000);
for h = 1:length(x)
start = toc;
somedata = alldata(1:x(h),:);
% somedata = alldata;
% somedata(x(h):end,:) = []; % drop unrequired obs
t(h) = toc - start;
clear somedata
disp([x(h), t(h)]);
end