Is there any standardized operators over data from arrays of sensors?
I am normally dealing with sensors data in the form of time + channels. The time is a timestamp, and the channels are the available data for these timestamps. All these fields are numeric, no strings involved.
Normally I have to mix those data objects in different ways. Let's suppose M1 is size m1xn1 and M2 is size m2xn2:
Combine rows of data from the same channels and different timestamps (i.e. n1 == n2). This leads to a vertical concatenation [M1; M2].
Combine columns of data from the same timestamps and different channels, (i.e. m1 == m2). This leads to a horizontal concatenation [M1 M2].
These operators are trivial and well defined.
When I have slight differences, for example, a few additional samples in M1 or M2, everything turns complicated and I have to think in weird schemes to perform such operations, such as these:
Cleaning the exceeding samples on M1 or M2, for matching the dimensions.
Calculate an aggregated timestamp, obtaining a unique(sort()) on the timestamps, and then apply a union like in a SQL JOIN sentence.
Aggregate the data on M1 or M2, this is, reducing m1 or m2 to a smaller figure, resampling the timescale, and then apply an aggregation like in a SQL GROUP sentence.
I cannot think of a unique and definite function to combine this sort of data. How can I do this?
Let's say you have an m1-element vector of time values t1 and and n1-element vector of channel values c1 for your m1-by-n1 matrix M1 (and likewise for M2). First and foremost, you will likely need to convert your time and channel values into equivalent index values. You can do this by expanding your time and channel values into grids using ndgrid, then converting them to index values using unique:
[t1, c1] = ndgrid(t1, c1);
[t2, c2] = ndgrid(t2, c2);
[tUnion, ~, tIndex] = unique([t1(:); t2(:)]);
[cUnion, ~, cIndex] = unique([c1(:); c2(:)]);
Now there are two approaches you can take for aggregating the data using the above indices. If you know for certain that the matrices M1 and M2 will never contain repeated measurements (i.e. the same combination of time and channel will not appear in both), then you can build the final joined matrix by creating a linear index from tIndex and cIndex and combining the values from M1 and M2 like so:
MUnion = zeros(numel(tUnion), numel(cUnion));
MUnion(tIndex+numel(tUnion).*(cIndex-1)) = [M1(:); M2(:)];
If the matrices M1 and M2 could contain repeated measurements at the same combination of time and channel values, then accumarray will be the way to go. You will have to decide how you want to combine the repeated measurements, such as taking the mean as shown here:
MUnion = accumarray([tIndex cIndex], [M1(:); M2(:)], [], #mean);
Related
The key idea of Locality sensitive hashing (LSH) is that neighbor points, v are more likely
mapped to the same bucket but points far from each other are more likely mapped to different buckets. In using Random projection, if the the database contains N samples each of higher dimension d, then theory says that we must create k randomly generated hash functions, where k is the targeted reduced dimension denoted as g(**v**) = (h_1(v),h_2(v),...,h_k(v)). So, for any vector point v, the point is mapped to a k-dimensional vector with a g-function. Then the hash code is the vector of reduced length /dimension k and is regarded as a bucket. Now, to increase probability of collision, theory says that we should have L such g-functions g_1, g_2,...,g_L at random. This is the part that I do not understand.
Question : How to create multiple hash tables? How many buckets are contained in a hash table?
I am following the code given in the paper Sparse Projections for High-Dimensional Binary Codes by Yan Xia et. al Link to Code
In the file Coding.m
dim = size(X_train, 2);
R = randn(dim, bit);
% coding
B_query = (X_query*R >= 0);
B_base = (X_base*R >=0);
X_query is the set of query data each of dimension d and there are 1000 query samples; R is the random projection and bit is the target reduced dimensionality. The output of B_query and B_base are N strings of length k taking 0/1 values.
Does this way create multiple hash tables i.e. N is the number of hash tables? I am confused as to how. A detailed explanation will be very helpful.
How to create multiple hash tables?
LSH creates hash-table using (amplified) hash functions by concatenation:
g(p) = [h1(p), h2(p), · · · , hk (p)], hi ∈R H
g() is a hash function and it corresponds to one hashtable. So we map the data, via g() to that hashtable and with probability, the close ones will fall into the same bucket and the non-close ones will fall into different buckets.
We do that L times, thus we create L hashtables. Note that every g() is/should most likely to be different that the other g() hash functions.
Note: Large k ⇒ larger gap between P1, P2. Small P1 ⇒ larer L so as to find neighbors. A practical choice is L = 5 (or 6). P1 and P2 are defined in the image below:
How many buckets are contained in a hash table?
Wish I knew! That's a difficult question, how about sqrt(N) where N is the number of points in the dataset. Check this: Number of buckets in LSH
The code of Yan Xia
I am not familiar with that, but from what you said, I believe that the query data you see are 1000 in number, because we wish to pose 1000 queries.
k is the length of the strings, because we have to hash the query to see in which bucket of a hashtable it will be mapped. The points inside that bucket are potential (approximate) Nearest Neighbors.
I have a bunch of 2 sets of time series data in cellarrays (2x1 cell) to which I want to apply corrcoef.
When both sets of data encompass the same amount of years and are the same size, I have no problem applying corrcoef after removing NaNs by interpolation.
But some of the sets are different sizes.
For example the first cell in set1 is 1x552 and the second cell is 1x576 (2 more years of monthly data at the beginning of the series). Because the data are time series I need to ensure that the relationship between data and data year is maintained when I resize. The year data is in another array.
I would like to be able to check what years are missing from the smaller cellarray and add these (maybe as means) in the right spot so that it becomes the same size as the larger cellarray.
Can anyone help?
You want to have a look at interp1. Say you have a time-series with times t and values v, where:
t is a Nt x 1 column vector with the sampling times sorted in ascending chronological order;
v is a Nt x Nc matrix of associated values across Nc channels at each time;
then you can interpolate new values at timepoints new_t simply with
new_v = interp1( t, v, new_t );
I am trying to find out the mean, media and percentile ranges of price movements for a given volume to be filled using trade data. Attaching the code below. The problem is that the code gives me wsfull error when i run it on ~80k records. I am using a 4g linux box. At the moment I can only run it for ~30k records and even then q uses >70% of my ram.
Is there any way to make it more memory friendly?
rangeForVol : {[symIn; vol; dt]
data: select from table where sym=symIn, date=dt;
data: update cumVol: sums quantity, cVol: sums quantity from data;
data: update cumVolTgt: cumVol + vol from data;
data: update pxLst: price[where each ((cumVol>=/:cVol) and (cumVol<=/:cumVolTgt))=1] from data;
.Q.gc[];
data: update minPx: min each pxLst, maxPx: max each pxLst from data;
data: update range: maxPx - minPx from data;
data
};
select count i by floor range%0.5 from rangeForVol[`ABC; 2500; 2012.06.04]
The code you quote above almost certainly does not do what you were trying to achieve.
The column cumVol and cVol are both identical (in that they contain a running total of that day's volume). Later you calculate cumVol>=/:cVol. /: means that for every element in cVol you will compare it to the entire vector cumVol. As they are identical, you will get the identity matrix (plus some extra 1b for any non-distinct values).
q)(til 4)=\:til 4
1000b
0100b
0010b
0001b
It seems you wanted to perform an element-wise comparison between the two vectors (though comparing a vector to itself also doesn't make sense), and if you want to do this explicitly, each-both would be the correct adverb (='). However, in q, the = operator will implicitly apply item-wise to two vectors of the same length (or a vector and a scalar, as is happening in your each-left example), making any adverb unnecessary.
The fact you are creating two n x n matrices when you probably intended a length n vector is probably the reason you're running out of memory.
I have a huge simulation data which needs to be post processed in MATLAB.
Say my matrix is A and its columns are named as variables ID, X, Y, Z, s1, s2 and s3. Actually my requirement is I want to find out rows with repeated X (here I mean that I am having many points for one value of x-coordinate) and add all the corresponding elements of columns s1 and s2, and divide each by no. of occurrences of X. Finally I want s1, s2 and s3 averaged over their frequency of occurrences.
It may be very trivial question, but, as a beginner I searched & tried a lot in this web, but cud not advance much. I know we can find out the repeated rows and their frequency by using commands like mode or unique etc. but iam not able to add the corresponding column elements and do averaging.
Finally when I want to plot say X vs. s1, I should have only one value of s1 for each value of x1. (i.e. s1 needs to be averaged over all repeating X)
Do we have any direct matlab command for this or we need to use some loop?
Please help me.
There is a function in matlab named grpstats that solves your very problem.
It computes groupwise summary statistics, for data in a matrix or dataset array.
Example:
data = [1,2,3,4];
group = [1,1,1,3];
[name,mean] = grpstats(data, group,{'gname','mean'})
would output:
name =
'1'
'3'
mean =
2
4
You may type help grpstats in Matlab for more information.
I currently implementing an optimization algorithm that requires me to sample without replacement from several sets. Although I am coding in MATLAB, this is essentially a CS question.
The situation is as follows:
I have a finite number of sets (A, B, C) each with a finite but possibly different number of elements (a1,a2...a8, b1,b2...b10, c1, c2...c25). I also have a vector of probabilities for each set which lists a probability for each element in that set (i.e. for set A, P_A = [p_a1 p_a2... p_a8] where sum(P_A) = 1). I normally use these to create a probability generating function for each set, which given a uniform number between 0 to 1, can spit out one of the elements from that set (i.e. a function P_A(u), which given u = 0.25, will select a2).
I am looking to sample without replacement from the sets A, B, and C. Each "full sample" is a sequence of elements from each of the different sets i.e. (a1, b3, c2). Note that the space of full samples is the set of all permutations of the elements in A, B, and C. In the example above, this space is (a1,a2...a8) x (b1,b2...b10) x (c1, c2...c25) and there are 8*10*25 = 2000 unique "full samples" in my space.
The annoying part of sampling without replacement with this setup is that if my first sample is (a1, b3, c2) then that does not mean I cannot sample the element a1 again - it just means that I cannot sample the full sequence (a1, b3, c2) again. Another annoying part is that the algorithm I am working with requires me do a function evaluation for all permutations of elements that I have not sampled.
The best method at my disposal right now is to keep track the sampled cases. This is a little inefficient since my sampler is forced to reject any case that has been sampled before (since I'm sampling without replacement). I then do the function evaluations for the unsampled cases, by going through each permutation (ax, by, cz) using nested for loops and only doing the function evaluation if that combination of (ax, by, cz) is not included in the sampled cases. Again, this is a little inefficient since I have to "check" whether each permutation (ax, by, cz) has already been sampled.
I would appreciate any advice in regards to this problem. In particular, I am looking a method to sample without replacement and keep track of unsampled cases that does not explicity list out the full sample space (I usually work with 10 sets with 10 elements each so listing out the full sample space would require a 10^10 x 10 matrix). I realize that this may be impossible, though finding efficient way to do it will allow me to demonstrate the true limits of the algorithm.
Do you really need to keep track of all of the unsampled cases? Even if you had a 1-by-1010 vector that stored a logical value of true or false indicating if that permutation had been sampled or not, that would still require about 10 GB of storage, and MATLAB is likely to either throw an "Out of Memory" error or bring your entire machine to a screeching halt if you try to create a variable of that size.
An alternative to consider is storing a sparse vector of indicators for the permutations you've already sampled. Let's consider your smaller example:
A = 1:8;
B = 1:10;
C = 1:25;
nA = numel(A);
nB = numel(B);
nC = numel(C);
beenSampled = sparse(1,nA*nB*nC);
The 1-by-2000 sparse matrix beenSampled is empty to start (i.e. it contains all zeroes) and we will add a one at a given index for each sampled permutation. We can get a new sample permutation using the function RANDI to give us indices into A, B, and C for the new set of values:
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
We can then convert these three indices into a single unique linear index into beenSampled using the function SUB2IND:
index = sub2ind([nA nB nC],indexA,indexB,indexC);
Now we can test the indexed element in beenSampled to see if it has a value of 1 (i.e. we sampled it already) or 0 (i.e. it is a new sample). If it has been sampled already, we repeat the process of finding a new set of indices above. Once we have a permutation we haven't sampled yet, we can process it:
while beenSampled(index)
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
index = sub2ind([nA nB nC],indexA,indexB,indexC);
end
beenSampled(index) = 1;
newSample = [A(indexA) B(indexB) C(indexC)];
%# ...do your subsequent processing...
The use of a sparse array will save you a lot of space if you're only going to end up sampling a small portion of all of the possible permutations. For smaller total numbers of permutations, like in the above example, I would probably just use a logical vector instead of a sparse vector.
Check the matlab documentation for the randi function; you'll just want to use that in conjunction with the length function to choose random entries from each vector. Keeping track of each sampled vector should be as simple as just concatenating it to a matrix;
current_values = [5 89 45]; % lets say this is your current sample set
used_values = [used_values; current_values];
% wash, rinse, repeat