Vectorizing set operations for string-valued cell arrays in MATLAB/Octave

I have a large data set, X, comprising demographic information of survey respondents. The data is largely categorical, so each row in X contains a bunch of string-valued features such as gender, race, interests, etc for a single respondent. Each column of X is a single response category. I have loaded this data set into a big cell array in MATLAB/Octave (testing on both). I would like to measure the Jaccard distance between each sample and every other sample in the data set. Basically what I want to do is this:
dist = zeros(size(X,1)); % Initialize my distance matrix
for ii = 1:size(X,1)
for jj = ii:size(X,1) % Only need the upper triangle since dist is symmetric
% Find the Jaccard distance between the ii-th and jj-th respondent
dist(ii,jj) = 1 - numel(intersect(X(ii,:), X(jj,:))) / numel(union(X(ii,:), X(jj,:)));
end
end
Except obviously I want to vectorize the code. I have tried using cellfun and bsxfun to vectorize, but when I do something like:
res = cellfun(@intersect, X, X, 'UniformOutput', false);
I get a cell array the same size as X, wherein the (i,j)-element is equivalent to intersect(X{i,j}, X{i,j}); basically the unique characters of the string in the (i,j)-cell. This does not help me. When I try:
res = bsxfun(@intersect, X, X);
I get one long cell array containing (I think) all of the unique values that any cell in X takes. This does not help me either.
I would like a solution that enables me to vectorize the code at the beginning of this discussion. If it is easier, code that finds the subset of X with the minimum (or maximum) Jaccard distance from any given row of X would be exactly what I need.
Thanks in advance!
EDIT: Changed the loop code to only calculate the upper triangle of dist. Still takes far too long, and the fact that it is non-vectorized bugs me on a philosophical level.
EDIT: The first row of X, given by typing X(1,:), is:
ans =
{
[1,1] = Non - U.S. Citizen
[1,2] = Denied
[1,3] = M
[1,4] = CHINA
[1,5] = Full Time
[1,6] = D-Asian American or Pacific Islander
[1,7] =
[1,8] =
[1,9] = MSME
[1,10] =
}
This is just testing data for developing the algorithm while I wait on my actual survey results, but the survey results will have a similar form.
EDIT: More data from X, but in CSV form, is as follows:
Non - U.S. Citizen,Denied,M,INDIA,Full Time,E-Other,,,MSME,
Non - U.S. Citizen,Denied,F,INDIA,Full Time,D-Asian American or Pacific Islander,,,MSME,DESIGN
Non - U.S. Citizen,Denied,M,INDIA,Full Time,E-Other,,,MS,
Non - U.S. Citizen,Denied,M,IRAN,Full Time,B-Caucasian American Non-Hispanic,,,PhD,NANO
Non - U.S. Citizen,Left Without Degree,M,JORDAN,Full Time,E-Other,,,,
Non - U.S. Citizen,Denied,F,IRAN,Full Time,E-Other,,,PhD,BIOENG
,Not Attending,M,,Full Time,,,,PhD,
Non - U.S. Citizen,Not Attending,F,IRAN,Full Time,I-International Student,,,PhD,
Non - U.S. Citizen,Denied,M,BANGLADESH,Full Time,E-Other,,,PhD,NANO
Non - U.S. Citizen,Denied,M,BANGLADESH,Full Time,E-Other,,,MS,

This might be a workaround; I'll illustrate it on a single row of data:
a={'Non - U.S. Citizen','Denied','M','INDIA','Full Time','E-Other','','','MSME',''}
Sum each cell element; this casts the strings to doubles and sums their values. It will work assuming the odds of a non-unique sum are slim (if not, there's a trick you can implement, but I doubt it will actually happen):
b = cellfun(@sum, a, 'un', 0)
Now you have a single number per cell element; you can use cell2mat to get a numeric matrix and/or pdist, etc.
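Applied to the whole cell array, a minimal sketch might look like the following (this assumes every entry of X is a char array, so empty strings hash to 0, and that hash collisions are negligible, as noted above); set operations on numeric rows are far cheaper than on cell arrays of strings:
Xnum = cellfun(@sum, X);    % n-by-p numeric matrix of string "hashes" ('' maps to 0)
n = size(Xnum, 1);
dist = zeros(n);
for ii = 1:n
    for jj = ii:n           % intersect/union on numeric rows is much faster than on cellstr rows
        dist(ii,jj) = 1 - numel(intersect(Xnum(ii,:), Xnum(jj,:))) / numel(union(Xnum(ii,:), Xnum(jj,:)));
    end
end
dist = dist + triu(dist, 1).';    % mirror the upper triangle to get the full symmetric matrix
Alternatively, pdist(Xnum, 'jaccard') gives a related (per-column, zero-aware) distance in one fully vectorized call, if an approximation of the set-based Jaccard distance is acceptable.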

Related

Change the random number generator in Matlab function

I have a task to complete that requires quasi-random numbers as input, but I notice that the Matlab function I want to use does not have an option to select any of the quasi-random generators I want to use (e.g. Halton, Sobol, etc.). Matlab has them as stand-alone functions and not as options in the ubiquitous 'randn' and 'rng' functions. What Matlab uses is the Mersenne Twister, a pseudorandom generator. So for instance copularnd uses 'randn'/'rng', which is based on pseudorandom numbers....
Is there a way to incorporate them into the rand or rng functions embedded in other code (e.g. copularnd)? Any pointers would be much appreciated. Note: 'copularnd' calls 'mvnrnd', which in turn uses 'randn' and then pulls 'rng'...
First you need to initialize the haltonset using the leap, skip, and scramble properties.
You can check the documentation, but the short description is as follows:
Scramble - is used for shuffling the points
Skip - omits a given number of initial points from the sequence
Leap - is the size of jump from the current selected point to the next one. The points in between are ignored.
Now you can build a haltonset object:
p = haltonset(2,'Skip',1e2,'Leap',1e1);
p = scramble(p,'RR2');
This makes a 2D Halton number set by skipping the first 100 numbers and leaping over 10 numbers. The scramble method 'RR2' is applied in the second line. You can see that many points are generated:
p =
Halton point set in 2 dimensions (818836295885536 points)
Properties:
Skip : 100
Leap : 10
ScrambleMethod : RR2
When you have your haltonset object, p, you can access the values by just selecting them:
x = p(1:10,:)
Notice:
So, you need to create the object first and then use the generated points. To get different results, you can play with Leap and Scramble properties of the function. Another thing you can do is to use a uniform distribution such as randi to select numbers each time from the generated points. That makes sure that you are accessing uniformly random parts of the dataset each time.
For instance, you can generate a random index vector (4 points in this example) and then use those indices to select points from the Halton set.
>> idx = randi(size(p,1),1,4)
idx =
1.0e+14 *
3.1243 6.2683 6.5114 1.5302
>> p(idx,:)
ans =
0.5723 0.2129
0.8918 0.6338
0.9650 0.1549
0.8020 0.3532
'qrandstream' may be the answer I am looking for, with 'qrand' instead of 'rand'.
e.g., from the Matlab documentation:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
q = qrandstream(p);
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
X = qrand(q,sampSize);
[h,pval] = kstest(X,[X,X]);
PVALS(test) = pval;
end
I will post my solution once I am done :)
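The thread leaves the copularnd wiring open; as a rough sketch (not from the thread, and only for the Gaussian-copula case), one can draw quasi-random uniforms directly and push them through norminv and a Cholesky factor instead of going through copularnd. Rho below is an example correlation matrix, purely an assumption:
Rho = [1 0.7; 0.7 1];             % example correlation matrix (assumption)
p = haltonset(2,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
u0 = net(p, 1000);                % first 1000 quasi-random points in [0,1]^2
z = norminv(u0);                  % map the uniforms to standard normals
u = normcdf(z * chol(Rho));       % correlate, then map back to copula uniforms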

A moving average with different functions and varying time-frames

I have a matrix of time-series data for 8 variables with about 2500 points (~10 years of Mon-Fri) and would like to calculate the mean, variance, skewness and kurtosis on a 'moving average' basis.
Let's say frames = [100 252 504 756] - I would like to calculate the four functions above over each of the (time-)frames, on a daily basis - so the result for day 300 in the case of the 100-day frame would be [mean variance skewness kurtosis] from the period day 201 to day 300 (100 days in total)... and so on.
I know this means I would get an array output, and the first frame-length of days would be NaNs, but I can't figure out the required indexing to get this done...
This is an interesting question because I think the optimal solution is different for the mean than it is for the other sample statistics.
I've provided a simulation example below that you can work through.
First, choose some arbitrary parameters and simulate some data:
%#Set some arbitrary parameters
T = 100; N = 5;
WindowLength = 10;
%#Simulate some data
X = randn(T, N);
For the mean, use filter to obtain a moving average:
MeanMA = filter(ones(1, WindowLength) / WindowLength, 1, X);
MeanMA(1:WindowLength-1, :) = nan;
I had originally thought to solve this problem using conv as follows:
MeanMA = nan(T, N);
for n = 1:N
MeanMA(WindowLength:T, n) = conv(X(:, n), ones(WindowLength, 1), 'valid');
end
MeanMA = (1/WindowLength) * MeanMA;
But as @PhilGoddard pointed out in the comments, the filter approach avoids the need for the loop.
Also note that I've chosen to make the dates in the output matrix correspond to the dates in X so in later work you can use the same subscripts for both. Thus, the first WindowLength-1 observations in MeanMA will be nan.
For the variance, I can't see how to use either filter or conv or even a running sum to make things more efficient, so instead I perform the calculation manually at each iteration:
VarianceMA = nan(T, N);
for t = WindowLength:T
VarianceMA(t, :) = var(X(t-WindowLength+1:t, :));
end
We could speed things up slightly by exploiting the fact that we have already calculated the mean moving average. Simply replace the within loop line in the above with:
VarianceMA(t, :) = (1/(WindowLength-1)) * sum((bsxfun(@minus, X(t-WindowLength+1:t, :), MeanMA(t, :))).^2);
However, I doubt this will make much difference.
If anyone else can see a clever way to use filter or conv to get the moving window variance I'd be very interested to see it.
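For what it is worth, here is one possible filter-based route (not from the thread) that reuses MeanMA and the running-sum identity sum((x-m).^2) = sum(x.^2) - W*m.^2; it can be less numerically stable than calling var directly, and the variable names are mine:
SumSqMA = filter(ones(1, WindowLength), 1, X.^2);                       % moving sum of squares
VarFiltMA = (SumSqMA - WindowLength * MeanMA.^2) / (WindowLength - 1);  % sample variance per window
VarFiltMA(1:WindowLength-1, :) = nan;                                   % no full window yet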
I leave the case of skewness and kurtosis to the OP, since they are essentially just the same as the variance example, but with the appropriate function.
A final point: if you were converting the above into a general function, you could pass in an anonymous function as one of the arguments, then you would have a moving average routine that works for arbitrary choice of transformations.
Final, final point: For a sequence of window lengths, simply loop over the entire code block for each window length.
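A minimal sketch of such a generic routine (the names movingStat and statFun are illustrative, not from this thread); save it as movingStat.m:
function Out = movingStat(X, WindowLength, statFun)
    % Apply statFun to each trailing window of WindowLength rows, column-wise.
    [T, N] = size(X);
    Out = nan(T, N);                         % first WindowLength-1 rows stay nan
    for t = WindowLength:T
        Out(t, :) = statFun(X(t-WindowLength+1:t, :));
    end
end
Looping over the window lengths from the question is then a one-liner per statistic, e.g.:
frames = [100 252 504 756];
KurtMA = cell(numel(frames), 1);
for f = 1:numel(frames)
    KurtMA{f} = movingStat(X, frames(f), @kurtosis);   % or @skewness, @var, @mean, ...
end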
I have managed to produce a solution, which only uses basic functions within MATLAB and can also be expanded to include other functions, (for finance: e.g. a moving Sharpe Ratio, or a moving Sortino Ratio). The code below shows this and contains hopefully sufficient commentary.
I am using a time series of Hedge Fund data, with ca. 10 years worth of daily returns (which were checked to be stationary - not shown in the code). Unfortunately I haven't got the corresponding dates in the example so the x-axis in the plots would be 'no. of days'.
% start by importing the data you need - here it is a selection out of an
% excel spreadsheet
returnsHF = xlsread('HFRXIndices_Final.xlsx','EquityHedgeMarketNeutral','D1:D2742');
% two years to be used for the moving average. (250 business days in one year)
window = 500;
% create zero-matrices to fill with the MA values at each point in time.
mean_avg = zeros(length(returnsHF)-window+1,1);
st_dev = zeros(length(returnsHF)-window+1,1);
skew = zeros(length(returnsHF)-window+1,1);
kurt = zeros(length(returnsHF)-window+1,1);
% Now work through the time-series with each of the functions (one can add
% any other functions required), assigning the values to the zero-matrices
for count = window:length(returnsHF)
% This is the most tricky part of the script, the indexing in this section
% The TwoYearReturn is what is shifted along one period at a time with the
% for-loop.
TwoYearReturn = returnsHF(count-window+1:count);
mean_avg(count-window+1) = mean(TwoYearReturn);
st_dev(count-window+1) = std(TwoYearReturn);
skew(count-window+1) = skewness(TwoYearReturn);
kurt(count-window +1) = kurtosis(TwoYearReturn);
end
% Plot the MAs
subplot(4,1,1), plot(mean_avg)
title('2yr mean')
subplot(4,1,2), plot(st_dev)
title('2yr stdv')
subplot(4,1,3), plot(skew)
title('2yr skewness')
subplot(4,1,4), plot(kurt)
title('2yr kurtosis')

Optimizing code, removing "for loop"

I'm trying to remove outliers from a tick data series, following Brownlees & Gallo 2006 (if you are interested).
The code works fine, but given that I'm working on really long vectors (the biggest has 20 million observations, and after 20 hours it was still not done computing) I was wondering how to speed it up.
What I did until now is:
I changed the time and date format to numeric double and I saw that it saves quite some time in processing and A LOT OF MEMORY.
I allocated memory for the vectors:
[n] = size(price);
x = price;
score = nan(n,'double'); %using tic and toc I saw that nan requires less time than zeros
trimmed_mean = nan(n,'double');
sd = nan(n,'double');
out_mat = nan(n,'double');
Here is the loop I'd love to remove. I read that vectorizing would speed up a lot, especially using long vectors.
for i = k+1:n-k % stop at n-k so the i+k right-hand neighbours stay in bounds
    trimmed_mean(i) = trimmean(x([i-k:i-1, i+1:i+k]),10,'round'); %trimmed mean computed on the 'k' closest observations on each side of 'i' (i is excluded)
    score(i) = x(i) - trimmed_mean(i);
    sd(i) = std(x([i-k:i-1, i+1:i+k])); %same window as the mean
    tmp = abs(score(i)) > (alpha .* sd(i) + gamma);
    out_mat(i) = tmp*1;
end
Here is what I was trying to do
trimmed_mean = trimmean(regroup_matrix,10,'round',2);
score = bsxfun(@minus,x,trimmed_mean);
sd = std(regroup_matrix,0,2); % standard deviation along the rows
temp = abs(score) > (alpha .* sd + gamma);
out_mat = temp*1;
But given that I'm totally new to Matlab, I don't know how to properly construct the matrix of neighbouring observations. I just think it should be shaped like regroup_matrix = nan(n, 2*k).
EDIT: To be specific, what I am trying to do (and I am not able to) is:
Given a column vector "x" (n,1) for each observation "i" in "x" I want to take the "k" neighbouring observations to "i" (from i-k to i-1 and from i+1 to i+k) and put these observations as rows of a matrix (n, 2*k).
EDIT 2: I made a few changes to the code and I think I am getting closer to the solution. I posted another question specific to what I think is the problem now:
Matlab: Filling up matrix rows using moving intervals from a column vector without a for loop
What I am trying to do now is:
[n] = size(price,1);
x = price;
[j1]=find(x);
matrix_left=zeros(n, k,'double');
matrix_right=zeros(n, k,'double');
toc
matrix_left(j1(k+1:end),:)=x(j1-k:j1-1);
matrix_right(j1(1:end-k),:)=x(j1+1:j1+k);
matrix_group=[matrix_left matrix_right];
trimmed_mean=trimmean(matrix_group,10,'round',2);
score = bsxfun(@minus,x,trimmed_mean);
sd = std(matrix_group,0,2);
temp = abs(score) > (alpha .* sd + gamma);
outmat = temp*1;
I have problems with the matrix_left and matrix_right creation.
j1, which I am using for indexing, is a column vector with the indices of price's observations. The output is simply
j1=[1:1:n]
price is a column vector of double with size(n,1)
For your reshape, you can do the following:
idxArray = bsxfun(@plus,(k+1:n-k)',[-k:-1,1:k]); % valid only where a full window of k neighbours exists on each side
reshapedArray = x(idxArray);
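For example, with k = 2 and n = 7 this produces the index rows [1 2 4 5], [2 3 5 6] and [3 4 6 7], so x(idxArray) stacks the 2k neighbours of observations 3 to 5 row by row.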
Thanks to Jonas that showed me the way to go I came up with this:
idxArray_left = bsxfun(@plus,(k+1:n)',[-k:-1]); %matrix with indices of the left neighbouring observations
idxArray_fill_left = bsxfun(@plus,(1:k)',[1:k]); %for observations 1:k I take the right neighbouring observations instead, so that computing the mean and standard deviation causes no problems
matrix_left = [idxArray_fill_left; idxArray_left]; %just join the two matrices and I have the complete matrix of left neighbours
idxArray_right = bsxfun(@plus,(1:n-k)',[1:k]); %same thing as left but opposite
idxArray_fill_right = bsxfun(@plus,(n-k+1:n)',[-k:-1]);
matrix_right = [idxArray_right; idxArray_fill_right];
idx_matrix = [matrix_left matrix_right]; %complete index matrix, joining left and right indices
neigh_matrix = x(idx_matrix); %exactly as proposed by Jonas, I fill up a matrix of observations from 'x', following the idx_matrix indexing
trimmed_mean = trimmean(neigh_matrix,10,'round',2);
score = bsxfun(@minus,x,trimmed_mean);
sd = std(neigh_matrix,0,2);
temp = abs(score) > (alpha .* sd + gamma);
outmat = temp*1;
Again, thanks a lot to Jonas. You really made my day!
Thanks also to everyone who had a look at the question and tried to help!

find indices in array, use indices as lookup, plot w/r/t time

I'm looking to find the n largest values in an array, then to use the indices of those found values as a lookup into another array representing time. But I am wondering how I can plot this if I want time to display as a continuous variable. Do I need to zero out data? That wouldn't be preferable for my use case, as I'm looking to save memory.
Let's say that I have array A, which is where I am looking for the max values. Then I have array T, which represents timestamps. I want my plot to display continuous time and plot() doesn't like arguments of differing size. How do most people deal with this?
Here's what I've got so far:
numtofind = 4;
A = m{:,10};
T = ((m{:,4} * 3600.0) + (m{:,5} * 60.0) + m{:,6});
[sorted, sortindex] = sort(A(:), 'descend');
maxvalues = sorted(1:numtofind);
maxindex = sortindex(1:numtofind);
corresponding_timestamps = T(maxindex);
%here i plot the max values against time/corresponding timestamps,
%but i want to place them in the right timestamp and display time as continuous
%rather than the filtered set:
plot(time_values, maxvalues);
When you say "time as continuous", do you mean you want time going from minimum to maximum? If so, you can just sort corresponding_timestamps and use that to reorder maxvalues. Even if you don't do that, you can still do plot(time_values, maxvalues, '.') to get a scatter plot which won't mess up your graph with lines.
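A short sketch of that suggestion, using the variables from the question:
[time_sorted, order] = sort(corresponding_timestamps);   % put the selected timestamps in time order
plot(time_sorted, maxvalues(order), '.');                % markers only, so no lines join the sparse points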

Adjacency matrix from edge list (preferrably in Matlab)

I have a list of triads (vertex1, vertex2, weight) representing the edges of a weighted directed graph. Since the prototype implementation is being done in Matlab, these are imported as an Nx3 matrix, where N is the number of edges. So the naive implementation of this is
id1 = L(:,1);
id2 = L(:,2);
weight = L(:,3);
m = max(max(id1, id2)); % to find the necessary size
V = zeros(m,m);
for i = 1:size(L,1) % loop over the N edges
    V(id1(i),id2(i)) = weight(i);
end
The trouble with tribbles is that "id1" and "id2" are nonconsecutive; they're codes. This gives me two problems: (1) huge matrices with way too many "phantom", spurious vertices, which distorts the results of algorithms to be used with that matrix, and (2) I need to recover the codes in the results of said algorithms (suffice to say this would be trivial if the id codes were consecutive 1:m).
Answers in Matlab are preferrable, but I think I can hack back from answers in other languages (as long as they're not pre-packaged solutions of the kind "R has a library that does this").
I'm new to StackOverflow, and I hope to be contributing meaningfully to the community soon. For the time being, thanks in advance!
Edit: This would be a solution, if we didn't have vertices that are the origin of multiple edges. (This implies a 1:1 match between the list of edge origins and the list of identities.)
for i=1:n
    for j=1:n
        if id1(i) > 0 && id2(j) > 0
            V(i,j) = weight(i);
        end
    end
end
You can use the function sparse:
sparse(id1,id2,weight,m,m)
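A usage sketch with the variables already defined in the question; note that sparse sums the weights of duplicate (id1,id2) pairs:
m = max(max(id1, id2));              % largest vertex id, as in the question
V = sparse(id1, id2, weight, m, m);
Vfull = full(V);                     % only if a dense matrix is really needed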
If your problem is that the node ID numbers are nonconsecutive, why not re-map them onto consecutive integers? All you need to do is create a dictionary of all unique node ID's and their correspondence to new IDs.
This is really no different to the case where you're asked to work with named nodes (Australia, Britain, Canada, Denmark...) - you would map these onto consecutive integers first.
You can use the GRP2IDX function to convert your id codes to consecutive numbers; the ids can be numeric or not, it does not matter. Just keep the mapping information.
[idx1, gname1, gmap1] = grp2idx(id1);
[idx2, gname2, gmap2] = grp2idx(id2);
You can recover the original ids with gmap1(idx1).
If your id1 and id2 are from the same set you can apply grp2idx to their union:
[idx, gname,gmap] = grp2idx([id1; id2]);
idx1 = idx(1:numel(id1));
idx2 = idx(numel(id1)+1:end);
For the reordering see a recent question - how to assign a set of coordinates in Matlab?
You can use ACCUMARRAY or SUB2IND to solve this problem.
V = accumarray([idx1 idx2], weight);
or
V = zeros(max(idx1),max(idx2)); %# or V = zeros(max(idx));
V(sub2ind(size(V),idx1,idx2)) = weight;
Check whether you have non-unique combinations of id1 and id2; you will have to take care of those.
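For instance, accumarray sums the weights of repeated (idx1,idx2) pairs by default; if that is not what you want, pass a reduction function explicitly (the choice of @max here is only an illustration):
V = accumarray([idx1 idx2], weight, [], @max);   % keep the largest weight per repeated edge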
Here is another solution:
First put together all your vertex ids, since there might be a sink vertex in your graph:
v_id_from = edge_list(:,1);
v_id_to = edge_list(:,2);
v_id_all = [v_id_from; v_id_to];
Then find the unique vertex ids:
v_id_unique = unique(v_id_all);
Now you can use the ismember function to get the mapping between your vertex ids and their consecutive index mappings:
[~,from] = ismember(v_id_from, v_id_unique);
[~,to] = ismember(v_id_to, v_id_unique);
Now you can use sub2ind to populate your adjacency matrix:
adjacency_matrix = zeros(length(v_id_unique)); % one row/column per unique vertex
linear_ind = sub2ind(size(adjacency_matrix), from, to);
adjacency_matrix(linear_ind) = edge_list(:,3);
You can always go back from the mapped consecutive id to the original vertex id:
original_vertex_id = v_id_unique(mapped_consecutive_id);
Hope this helps.
Your first solution is close to what you want. However it is probably best to iterate over your edge list instead of the adjacency matrix.
edge_indexes = edge_list(:, 1:2);
max_vertex_id = max(edge_indexes(:));
adj_matrix = zeros(max_vertex_id);
for local_edge = edge_list' %transpose in order to iterate by edge
    adj_matrix(local_edge(1), local_edge(2)) = local_edge(3);
end