Matlab categorical table variables: Speed? Use in join keys? - matlab

I've been dipping my toe into Matlab's categorical variable pool in the context of Matlab tables. Actually, I may have wandered into that territory in the past, but if so, it would have been in a relatively superficial manner.
These days, I want to use Matlab code patterns to do what I normally would do in MS Access, e.g., various types of joins and filtering. Much of my data is categorical, and I've read up on the advantages of using categorical variables in tables. However, they mostly centre around descriptiveness (over enumerated types) and memory efficiency. I haven't run across mention of speed. Do categorical variables offer a speed advantage?
I also wonder how advisable it is to use categorical variables when doing various types of joins. The categorical variables will occupy different tables, so it's not clear to me how equivalence in values is established if such variables are involved in the SQL ON clause (which Matlab refers to as a keys parameter).
From the dearth of relevant Google hits, it almost seems like I'm in new territory, which to me would be a scary thing. Lack of documentation of best practices, and the resulting need for trial/error and reverse engineering, requires more time than I can devote, so I'll sadly revert back to using strings.
If anyone can point to online guidance information, I'd appreciate it.

A partial answer only....
The following test indicates that categorical data behaves sensibly when used as join keys:
BigList = {'dog' 'cat' 'mouse' 'horse' 'rat'}'
SmallList = BigList( 1 : end-2 )
Nrows = 20;
% Create tables for innerjoin using strings
tBig = table( ...
(1:Nrows)' , ...
BigList( ceil( length(BigList) * rand( Nrows , 1 ) ) ) , ...
'VariableNames' , {'B_ID' 'Animal'} )
tSmall = table( ...
(1:Nrows)' , ...
SmallList( ceil( length(SmallList) * rand( Nrows , 1 ) ) ) , ...
'VariableNames' , {'S_ID' 'Animal'} )
tBigSmall = innerjoin( tBig , tSmall , 'Keys','Animal' );
tBig = sortrows( tBig , {'Animal','B_ID'} );
tSmall = sortrows( tSmall, {'Animal','S_ID'} );
tBigSmall = sortrows( tBigSmall, {'Animal' 'B_ID' 'S_ID'} );
% Now innerjoin the same tables using categorized strings
tcBig = tBig;
tcBig.cAnimal = categorical( tcBig.Animal );
tcBig.Animal = [];
tcSmall = tSmall;
tcSmall.cAnimal = categorical( tcSmall.Animal );
tcSmall.Animal = [];
tcBigSmall = innerjoin( tcBig , tcSmall , 'Keys','cAnimal' );
tcBig = sortrows( tcBig , {'cAnimal','B_ID'} );
tcSmall = sortrows( tcSmall, {'cAnimal','S_ID'} );
tcBigSmall = sortrows( tcBigSmall, {'cAnimal' 'B_ID' 'S_ID'} );
% Check if the join results are the same
if all( tBigSmall.Animal == tcBigSmall.cAnimal )
disp('categorical vs string key: inner joins MATCH.')
else
disp('categorical vs string key: inner joins DO NOT MATCH.')
end % if
So the only question now is about speed. This is a general question, not just for joins, so I'm not sure what would be a good test. There are many possibilities, e.g., number of table rows, number of categories, whether it's a join or a filtering, etc.
In any case, I believe that the answers to both questions should be better documented.
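One way to get a first impression of the speed question is to time the same inner join with cellstr keys and with categorical keys. The following is only a rough sketch: the table sizes, the variable names (tA, tB, tcA, tcB) and the category list are assumptions made up for illustration, and the outcome will depend on the Matlab release, the number of rows and the number of categories.

```matlab
% Hypothetical benchmark; sizes and names are illustrative assumptions.
rng(0);                                        % reproducible random keys
animals = {'dog' 'cat' 'mouse' 'horse' 'rat'}';
nRows   = 2000;
keyA = animals( randi( numel(animals), nRows, 1 ) );
keyB = animals( randi( numel(animals), 50, 1 ) );

% Tables keyed on cell arrays of character vectors
tA = table( (1:nRows)', keyA, 'VariableNames', {'A_ID' 'Animal'} );
tB = table( (1:50)',    keyB, 'VariableNames', {'B_ID' 'Animal'} );

% Same tables with the key converted to categorical
tcA = tA;  tcA.Animal = categorical( tcA.Animal );
tcB = tB;  tcB.Animal = categorical( tcB.Animal );

% timeit runs each join several times and returns a typical time in seconds
tString = timeit( @() innerjoin( tA,  tB,  'Keys', 'Animal' ) );
tCateg  = timeit( @() innerjoin( tcA, tcB, 'Keys', 'Animal' ) );
fprintf( 'cellstr keys: %.4g s, categorical keys: %.4g s\n', tString, tCateg );
```

The same pattern could be reused for filtering (e.g. timing T.Animal == 'cat' against strcmp on the cellstr column) to cover the non-join case.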

Related

Few minizinc questions on constraints

A little bit of background. I'm trying to make a model for clustering a Design Structure Matrix(DSM). I made a draft model and have a couple of questions. Most of them are not directly related to DSM per se.
include "globals.mzn";
int: dsmSize = 7;
int: maxClusterSize = 7;
int: maxClusters = 4;
int: powcc = 2;
enum dsmElements = {A, B, C, D, E, F,G};
array[dsmElements, dsmElements] of int: dsm =
[|1,1,0,0,1,1,0
|0,1,0,1,0,0,1
|0,1,1,1,0,0,1
|0,1,1,1,1,0,1
|0,0,0,1,1,1,0
|1,0,0,0,1,1,0
|0,1,1,1,0,0,1|];
array[1..maxClusters] of var set of dsmElements: clusters;
array[1..maxClusters] of var int: clusterCard;
constraint forall(i in 1..maxClusters)(
clusterCard[i] = pow(card(clusters[i]), powcc)
);
% #1
% constraint forall(i, j in clusters where i != j)(card(i intersect j) == 0);
% #2
constraint forall(i, j in 1..maxClusters where i != j)(
card(clusters[i] intersect clusters[j]) == 0
);
% #3
% constraint all_different([i | i in clusters]);
constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;
var int: intraCost = sum(i in 1..maxClusters, j, k in clusters[i] where k != j)(
(dsm[j,k] + dsm[k,j]) * clusterCard[i]
) ;
var int: extraCost = sum(el in dsmElements,
c in clusters where card(c intersect {el}) = 0,
k,j in c)(
(dsm[j,k] + dsm[k,j]) * pow(card(dsmElements), powcc)
);
var int: TCC = trace("\(intraCost), \(extraCost)\n", intraCost+extraCost);
solve maximize TCC;
Question 1
I was under the impression, that constraints #1 and #2 are the same. However, seems like they are not. The question here is why? What is the difference?
Question 2
How can I replace constraint #2 with all_different? Does it make sense?
Question 3
Why does trace("\(intraCost), \(extraCost)\n", intraCost+extraCost); show nothing useful in the output? The output I see using gecode is:
Running dsm.mzn
intraCost, extraCost
clusters = array1d(1..4, [{A, B, C, D, E, F, G}, {}, {}, {}]);
clusterCard = array1d(1..4, [49, 0, 0, 0]);
----------
<snipped to save space>
----------
clusters = array1d(1..4, [{B, C, D, G}, {A, E, F}, {}, {}]);
clusterCard = array1d(1..4, [16, 9, 0, 0]);
----------
==========
Finished in 5s 419msec
Question 4
The expression constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;, here I wanted to say that the union of all clusters should match the set of all nodes. Unfortunately, I did not find a way to make this big union more dynamic, so for now I just manually provide all clusters. Is there a way to make this expression return union of all sets from the array of sets?
Question 5
Basically, if I understand it correctly, for example from here, the Intra-cluster cost is the sum of all interactions within a cluster multiplied by the size of the cluster in some power, basically the cardinality of the set of nodes, that represents the cluster.
The Extra-cluster cost is a sum of interactions between some random element that does not belong to a cluster and all elements of that cluster multiplied by the cardinality of the whole space of nodes to some power.
The main question here is whether the intraCost and extraCost in the model are correct (they seem to be, but still), and whether there is a better way to express these sums.
Thanks!
(Perhaps you would get more answers if you separate this into multiple questions.)
Question 3:
Here's an answer on the trace question:
When running the model, the trace actually shows this:
intraCost, extraCost
which is not what you expect, of course. trace is in effect when the model is created, but at that stage these two decision variables have no values yet, so MiniZinc shows only the variable names. They get values once the (first) solution is reached, and can then be shown in the output section.
trace is mostly used to see what's happening in loops where one can trace the (fixed) loop variables etc.
If you trace an array of decision variables then they will be represented in a different fashion, the array x will be shown as X_INTRODUCED_0_ etc.
And you can also use trace for domain reflection, e.g. using lb and ub to get the lower/upper value of the domain of a variable ("safe approximation of the bounds" as the documentation states it: https://www.minizinc.org/doc-2.5.5/en/predicates.html?highlight=ub_array). Here's an example which shows the domain of the intraCost variable:
constraint
trace("intraCost: \(lb(intraCost))..\(ub(intraCost))\n")
;
which shows
intraCost: -infinity..infinity
You can read a little more about trace here https://www.minizinc.org/doc-2.5.5/en/efficient.html?highlight=trace .
Update Answer to question 1, 2 and 4.
Constraints #1 and #2 mean the same thing, i.e. that the elements in clusters should be disjoint. Constraint #1 is a little different in that it loops over decision variables, while constraint #2 uses plain indices. One can guess that #2 is faster, since #1 uses the where i != j, which must be translated to some extra constraints. (And using i < j instead should be a little faster.)
The all_different constraint states about the same and depending on the underlying solver it might be faster if it's translated to an efficient algorithm in the solver.
In the model there is also the following constraint which states that all elements must be used:
constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;
Apart from efficiency, all these constraints above can be replaced with one single constraint: partition_set, which ensures that the clusters are disjoint and that all elements in dsmElements are used in clusters.
constraint partition_set(clusters,dsmElements);
It might be faster to also combine with the all_different constraint, but that has to be tested.

Organising large datasets in Matlab

I have a problem I hope you can help me with.
I have imported a large dataset (200000 x 5 cell) in Matlab that has the following structure:
'Year' 'Country' 'X' 'Y' 'Value'
Columns 1 and 5 contain numeric values, while columns 2 to 4 contain strings.
I would like to arrange all this information into a variable that would have the following structure:
NewVariable{Country_1 : Country_n , Year_1 : Year_n}(Y_1 : Y_n , X_1 : X_n)
All I can think of is to loop through the whole dataset to find matches between the names of the Country, Year, X and Y variables combining the if and strcmp functions, but this seems to be the most ineffective way of achieving what I am trying to do.
Can anyone help me out?
Thanks in advance.
As mentioned in the comments you can use categorical array:
% some arbitrary data:
country = repmat('ca',10,1);
country = [country; repmat('cb',10,1)];
country = [country; repmat('cc',10,1)];
T = table(repmat((2001:2005)',6,1),cellstr(country),...
cellstr(repmat(['x1'; 'x2'; 'x3'],10,1)),...
cellstr(repmat(['y1'; 'y2'; 'y3'],10,1)),...
randperm(30)','VariableNames',{'Year','Country','X','Y','Value'});
% convert all non-number data to categorical arrays:
T.Country = categorical(T.Country);
T.X = categorical(T.X);
T.Y = categorical(T.Y);
% here is an example for using categorical array:
newVar = T(T.Country=='cb' & T.Year==2004,:);
The table class is made for such things, and very convenient. Just expand the logic statement in the last line T.Country=='cb' & T.Year==2004 to match your needs.
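To get closer to the Y-by-X layout asked for in the question, one possible sketch is to filter one Country/Year combination and then spread the values into a matrix with accumarray. The small table below and the names S and M are made up for illustration; the sketch assumes that X and Y are categorical, so that double() returns their category indices, and note that accumarray sums the values if an (Y, X) pair occurs more than once.

```matlab
% Tiny hypothetical table with the same columns as in the question
T = table( [2001; 2001; 2001; 2001], ...
    categorical( {'ca'; 'ca'; 'ca'; 'ca'} ), ...
    categorical( {'x1'; 'x2'; 'x1'; 'x2'} ), ...
    categorical( {'y1'; 'y1'; 'y2'; 'y2'} ), ...
    [10; 20; 30; 40], ...
    'VariableNames', {'Year','Country','X','Y','Value'} );

% Rows for one country and one year
S = T( T.Country == 'ca' & T.Year == 2001, : );

% Rows of M follow the categories of Y, columns follow the categories of X
nY = numel( categories( T.Y ) );
nX = numel( categories( T.X ) );
M  = accumarray( [double(S.Y) double(S.X)], S.Value, [nY nX] );
```

Looping this over the categories of Country and the unique years would fill a cell array indexed as {Country, Year}, if that layout is really needed.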
Tell me if this helps ;)

Efficiently calculate mean value from files

From a Monte-Carlo simulation I have a range of files, say: file_1.mat, file_2.mat,...,file_n.mat, where n is large. Each file contains one or several (maximum 3 if it matters) large 1D arrays in time of interest, say var1, var2, var3.
I am now as always interested in finding the mean value of these variables. My question is now, how do I do this in the most efficient way? The keyword here is efficiency. Below you will find the MWE which is done the standard way, but this is quite time consuming as the files are large and there are many.
I am programming in Matlab, however ideas presented in pseudocode are also very welcome.
MWE:(The standard way)
meanVar1 = zeros(1,1e6); %I do not remember the exact size, just use 1e6
meanVar2 = zeros(1,1e6);
meanVar3 = zeros(1,1e6);
for i = 1:n
load(strcat('file_',int2str(i)),'var1','var2','var3')
meanVar1 = meanVar1 + var1;
meanVar2 = meanVar2 + var2;
meanVar3 = meanVar3 + var3;
end
meanVar1 = meanVar1/n;
meanVar2 = meanVar2/n;
meanVar3 = meanVar3/n;
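Since the question says each file holds one to three of the variables, a variant of the loop above can load each file into a struct and average only the variables that are actually present. This is just a sketch and makes no claim about being the fastest approach: the temporary files, their contents and the names sums, counts and means are assumptions created here so that the example runs on its own.

```matlab
% Create three small hypothetical files so the sketch is self-contained
folder = tempname;  mkdir( folder );
for i = 1:3
    var1 = i * ones(1,4);
    save( fullfile( folder, sprintf('file_%d.mat', i) ), 'var1' );
end
var2 = [2 4 6 8];                         % only file_2 also contains var2
save( fullfile( folder, 'file_2.mat' ), 'var2', '-append' );

names  = {'var1' 'var2' 'var3'};
sums   = struct();                        % running sum per variable
counts = struct();                        % number of files per variable
for i = 1:3
    S = load( fullfile( folder, sprintf('file_%d.mat', i) ) );
    for k = 1:numel(names)
        v = names{k};
        if isfield( S, v )                % skip variables missing from this file
            if isfield( sums, v )
                sums.(v)   = sums.(v) + S.(v);
                counts.(v) = counts.(v) + 1;
            else
                sums.(v)   = S.(v);
                counts.(v) = 1;
            end
        end
    end
end

% Divide each sum by the number of files that contributed to it
found = fieldnames( sums );
means = struct();
for k = 1:numel(found)
    means.(found{k}) = sums.(found{k}) / counts.(found{k});
end
rmdir( folder, 's' );                     % remove the temporary files
```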

Using For and While Loops to Determine Who to Hire MATLAB

It's that time of the week where I realize just how little I understand in MATLAB. This week, we have homework on iteration, so using for-loops and while-loops. The problem I am currently experiencing difficulties with is one where I have to write a function that decides whom to hire. I'm given a list of names, a list of GPAs and a logical vector that tells me whether or not a student stayed to talk. What I have to output is the names of people to hire and the time they spent chatting with the recruiter.
function[candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
In order to be hired, a candidate must have a GPA that is higher than 2.5 (not inclusive). The student must also stick around to talk; if they don't talk, they don't get hired. The names are separated by a ', ' and the GPAs is a vector. The time spent talking is determined by:
Time in minutes = (GPA - 2.5) * 4;
My code so far:
function[candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
candidates = strsplit(names, ', ');
%// My attempt to split up the candidates names.
%// I get a 1x3 cell array though
for i = 1:length(GPAs)
%// This is where I ran into trouble, I need to separate the GPAs
student_GPA = (GPAs(1:length(GPAs)));
%// The length is unknown, but this isn't working out quite yet.
%// Not too sure how to fix that
return
end
time_spent = (student_GPA - 2.5) * 4; %My second output
while stays_to_talk == 1 %// My first attempt at a while-loop!
if student_GPA > 2.5
%// If the student has a high enough GPA and talks, yay for them
student = 'hired';
else
student = 'nothired'; %If not, sadface
return
end
end
hired = 'hired';
%// Here was my attempt to get it to realize how was hired, but I need
%// to concatenate the names that qualify into a string for the end
nothired = 'nothired';
canidates_hire = [hired];
My main issue here is figuring out how to let the function know that names(1) has the GPA of GPAs(1). It was recommended that I start a counter, and that I had to make sure my loops kept the names with them. Any suggestions with this problem? Please and thank you :)
Test Codes
[Names, Time] = CFRecruiter('Jack, Rose, Tom', [3.9, 2.3, 3.3],...
[false true true])
=> Name = 'Tom'
Time = 3.2000
[Names, Time] = CFRecruiter('Vatech, George Burdell, Barnes Noble',...
[4.0, 2.5, 3.6], [true true true])
=> Name = 'Vatech, Barnes Noble'
Time = 10.4000
I'm going to do away with for and while loops for this particular problem, mainly because you can solve this problem very elegantly in (I kid you not) three lines of code... well four if you count returning the candidate names. Also, the person who is teaching you MATLAB (absolutely no offense intended) hasn't the faintest idea of what they're talking about. The #1 rule in MATLAB is that if you can vectorize your code, do it. However, there are certain situations where a for loop is very suitable due to the performance enhancements of the JIT (Just-In-Time) accelerator. If you're curious, you can check out this link for more details on what JIT is about. However, I can guarantee that using loops in this case will be slow.
We can decompose your problem into three steps:
1. Determine who stuck around to talk.
2. For those who stuck around to talk, check their GPAs to see if they are > 2.5.
3. For those that have satisfied (1) and (2), determine the total time spent on talking by using the formula in your post for each person and add up the times.
We can use a logical vector to generate a Boolean array that simultaneously checks steps #1 and #2 so that we can index into our GPA array that you are specifying. Once we do this, we simply apply the formula to the filtered GPAs, then sum up the time spent. Therefore, your code is very simply:
function [candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
%// Pre-processing - split up the names
candidates = strsplit(names, ', ');
%// Steps #1 and #2
filtered_candidates = GPAs > 2.5 & stays_to_talk;
%// Return candidates who are hired
candidates_hire = strjoin(candidates(filtered_candidates), ', ');
%// Step #3
time_spent = sum((GPAs(filtered_candidates) - 2.5) * 4);
You had the right idea to split up the names based on the commas. strsplit splits up a string that has the token you're looking for (which is , in your case) into separate strings inside a cell array. As such, you will get a cell array where each element has the name of the person to be interviewed. Now, I combined steps #1 and #2 into a single step where I have a logical vector calculated that tells you which candidates satisfied the requirements. I then use this to index into our candidates cell array, then use strjoin to join all of the names together in a single string, where each name is separated by , as per your example output.
The final step would be to use the logical vector to index into the GPAs vector, grab those GPAs from those candidates who are successful, then apply the formula to each of these elements and sum them up. With this, here are the results using your sample inputs:
>> [Names, Time] = CFRecruiter('Jack, Rose, Tom', [3.9, 2.3, 3.3],...
[false true true])
Names =
Tom
Time =
3.2000
>> [Names, Time] = CFRecruiter('Vatech, George Burdell, Barnes Noble',...
[4.0, 2.5, 3.6], [true true true])
Names =
Vatech, Barnes Noble
Time =
10.4000
To satisfy the masses...
Now, if you're absolutely hell bent on using for loops, we can replace steps #1 and #2 by using a loop and an if condition, as well as a counter to keep track of the total amount of time spent so far. We will also need an additional cell array to keep track of those names that have passed the requirements. As such:
function [candidates_hire, time_spent] = CFRecruiter(names, GPAs, stays_to_talk)
%// Pre-processing - split up the names
candidates = strsplit(names, ', ');
final_names = {}; %// cell array that collects the names of hired candidates
time_spent = 0;
for idx = 1 : length(candidates)
%// Steps #1 and #2
if GPAs(idx) > 2.5 && stays_to_talk(idx)
%// Step #3
time_spent = time_spent + (GPAs(idx) - 2.5)*4;
final_names = [final_names candidates(idx)];
end
end
%// Return candidates who are hired
candidates_hire = strjoin(final_names, ', ');
The trick with the above code is that we are keeping an additional cell array around that stores those candidates that have passed. We will then join all of the strings together with a , between each name as we did before. You'll also notice that there is a difference in checking for steps #1 and #2 between the two methods. In particular, there is a & in the first method and a && in the second method. The single & is for arrays and matrices while && is for single values. If you don't know what that symbol is, that is the symbol for logical AND. This means that something is true only if both the left side of the & and the right side of the & are both true. In your case, this means that someone who has a GPA of > 2.5 and stays to talk must both be true if they are to be hired.
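The difference between & and && can be seen on the first test case from the question. This is a small illustrative sketch; the names mask and firstOK are made up here.

```matlab
GPAs          = [3.9 2.3 3.3];
stays_to_talk = [false true true];

% Element-wise AND: works on whole vectors and returns a logical vector
mask = GPAs > 2.5 & stays_to_talk;

% Short-circuit AND: both operands must be scalar logical values
firstOK = GPAs(1) > 2.5 && stays_to_talk(1);
```

Here mask is true only for the third candidate (Tom), and firstOK is false because Jack has the GPA but did not stay to talk.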

Summing specific fields in Matlab

How do I sum different fields? I want to sum all of the information for material(1) ...so I want to add 5+4+6+300 but I am unsure how. Like is there another way besides just doing material(1).May + material(1).June etc....
material(1).May= 5;
material(1).June=4;
material(1).July=6;
material(1).price=300;
material(2).May=10;
material(2).price=550;
material(3).May=90;
You can use structfun for this:
result = sum( structfun(@(x)x, material(1)) );
The inner portion (structfun(@(x)x, material(1))) runs a function on each individual field in the structure, and returns the results in an array. By using the identity function (@(x)x) we just get the values. sum of course does the obvious thing.
A slightly longer way to do this is to access each field in a loop. For example:
fNames = fieldnames(material(1));
accumulatedValue = 0;
for ix = 1:length(fNames)
accumulatedValue = accumulatedValue + material(1).(fNames{ix});
end
result = accumulatedValue
For some users this will be easier to read, although for expert users the first will be easier to read. The result and (approximate) performance are the same.
I think Pursuit's answer is very good, but here is an alternative off the top of my head:
sum( cell2mat( struct2cell( material(1) )));
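As a quick check, the approaches above can be run side by side on the values from the question. The scalar struct s and the names viaStructfun, viaCell and viaLoop are only illustrative; the sketch assumes every field holds a numeric scalar.

```matlab
% Same values as material(1) in the question
s = struct( 'May', 5, 'June', 4, 'July', 6, 'price', 300 );

% structfun with the identity function returns a column of field values
viaStructfun = sum( structfun( @(x) x, s ) );

% struct2cell followed by cell2mat gives the same column
viaCell = sum( cell2mat( struct2cell( s ) ) );

% Explicit loop over the field names
fNames  = fieldnames( s );
viaLoop = 0;
for ix = 1:numel( fNames )
    viaLoop = viaLoop + s.(fNames{ix});
end
```

All three give 5 + 4 + 6 + 300 = 315.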