I was under the impression that structure in matlab were similar to query tables in sql but I have a feeling I might be wrong.
I have a rather large dataset consisting of many entries and many fields. Ideally I want to index the structure, pulling out only the data I am interested in. Here is an example of the dataset
Cond Type Stime ETime
2 10 1 900
2 10 1 900
2 10 1 900
3 1 901 1800
3 1 901 1800
4 1 1801 2700
8 1 901 1800
8 1 901 1800
9 1 901 1800
9 1 901 1800
12 1 901 1800
12 1 901 1800
13 10 1 900
13 10 1 900
13 10 1 900
16 1 901 1800
16 1 901 1800
17 10 1 900
17 10 1 900
17 10 1 900
19 10 1 900
19 10 1 900
19 10 1 900
20 10 1 900
20 10 1 900
20 10 1 900
22 1 901 1800
22 1 901 1800
25 10 1 900
25 10 1 900
25 10 1 900
27 1 901 1800
27 1 901 1800
28 1 901 1800
28 1 901 1800
30 1 1801 2700
31 1 901 1800
31 1 901 1800
32 10 1 900
32 10 1 900
32 10 1 900
35 10 1 900
35 10 1 900
35 10 1 900
What I want to do is pull specific data entries for analysis example being I want all entries where Type is equal to 10 or I want all Cond from 1:20 that have ETime == 900.
I can do this by the following
idx = find([stats.Type] == 10);
[stats(idx).Stime]
but for multiple types I need a for loop as trying to use a vector throws an error.
idx = find([stats.Type] == 1:10); % Does not work
% must use this
temp = [];
for aa = 1:10
idx = find([stats.Type] == aa);
temp = horzcat(idx,temp);
end
[stats(temp).Stime]
Is this the wrong way to use structures? Is there an easier method to index a structure to pull data of interest?
This answer proposes using table indexing instead of struct indexing, which is a bit of a side-step to directly answering the question. However, my comments on this post were deemed useful so I've formalised as an answer...
If you use struct2table then you can interact with it as a table, which is generally much more intuitive.
Structures are useful if your fields have different numbers of elements (i.e. you couldn't form a consistent height table). In almost all other areas, I find tables are easier to use.
With tables you can use:
Logical indexing
Sorting (including sortrows by column name)
The family of "join" operations
Dot notation for accessing table columns by name, as you do for accessing struct fields, or select multiple columns by name using myTable( :, {'col1','col2'} ). - You don't need weird syntactic tricks like [stats.Type] to group outputs, you can just do stats.Type
I would then use ismember to compare multiple items against a table column...
idx = ismember( stats.Type, 1:10 );
Unless you need the indices, you can skip using find for speed, and just directly index using idx.
Related
I'm trying to perform some simple logic on a table but I'd like to verify that the columns exists prior to doing so as a validation step. My data consists of standard table names though they are not always present in each data source.
While the following seems to work (just validating AAA at present) I need to expand to ensure that PRI_AAA (and eventually many other variables) is present as well.
t: $[`AAA in cols `t; temp: update AAA_VAL: AAA*AAA_PRICE from t;()]
Two part question
This seems quite tedious for each variable (imagine AAA-ZZZ inputs and their derivatives). Is there a clever way to leverage a dictionary (or table) to see if a number of variables exists or insert a place holder column of zeros if they do not?
Similarly, can we store a formula or instructions to to apply within a dictionary (or table) to validate and return a calculation (i.e. BBB_VAL: BBB*BBB_PRICE.) Some calculations would be dependent on others (i.e. BBB_Tax_Basis = BBB_VAL - BBB_COSTS costs for example so there could be iterative issues.
Thank in advance!
A functional update may be the best way to achieve this if your intention is to update many columns of a table in a similar fashion.
func:{[t;x]
if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
:$[x in cols t;
![t;();0b;(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")];
t;
];
};
This function will update t with *_VAL columns for any column you pass as an argument, while first also adding a zero column for any missing columns passed as an argument.
q)t:([]AAA:10?100;BBB:10?100;CCC:10?100;AAA_PRICE:10*10?10;BBB_PRICE:10*10?10;CCC_PRICE:10*10?10;DDD_PRICE:10*10?10)
q)func/[t;`AAA`BBB`CCC`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL CCC_VAL DDD DDD_VAL
---------------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0 0
39 17 97 50 90 40 10 1950 1530 3880 0 0
76 11 11 0 0 50 10 0 0 550 0 0
26 55 99 20 60 80 90 520 3300 7920 0 0
91 51 3 30 20 0 60 2730 1020 0 0 0
83 81 7 70 60 40 90 5810 4860 280 0 0
76 68 98 40 80 90 70 3040 5440 8820 0 0
88 96 30 70 0 80 80 6160 0 2400 0 0
4 61 2 70 90 0 40 280 5490 0 0 0
56 70 15 0 50 30 30 0 3500 450 0 0
As you've already mentioned, to cover point 2, a dictionary of functions might be the best way to go.
q)dict:raze{(enlist`$string[x],"_VAL")!enlist(*;x;`$string[x],"_PRICE")}each`AAA`BBB`DDD
q)dict
AAA_VAL| * `AAA `AAA_PRICE
BBB_VAL| * `BBB `BBB_PRICE
DDD_VAL| * `DDD `DDD_PRICE
And then a slightly modified function...
func:{[dict;t;x]
if[not x in cols t;t:![t;();0b;(enlist x)!enlist 0]];
:$[x in cols t;
![t;();0b;(enlist`$string[x],"_VAL")!enlist(dict`$string[x],"_VAL")];
t;
];
};
yields a similar result.
q)func[dict]/[t;`AAA`BBB`DDD]
AAA BBB CCC AAA_PRICE BBB_PRICE CCC_PRICE DDD_PRICE AAA_VAL BBB_VAL DDD DDD_VAL
-------------------------------------------------------------------------------
70 28 89 10 90 0 0 700 2520 0 0
39 17 97 50 90 40 10 1950 1530 0 0
76 11 11 0 0 50 10 0 0 0 0
26 55 99 20 60 80 90 520 3300 0 0
91 51 3 30 20 0 60 2730 1020 0 0
83 81 7 70 60 40 90 5810 4860 0 0
76 68 98 40 80 90 70 3040 5440 0 0
88 96 30 70 0 80 80 6160 0 0 0
4 61 2 70 90 0 40 280 5490 0 0
56 70 15 0 50 30 30 0 3500 0 0
Here's another approach which handles dependent/cascading calculations and also figures out which calculations are possible or not depending on the available columns in the table.
q)show map:`AAA_VAL`BBB_VAL`AAA_RevenueP`AAA_RevenueM`BBB_Other!((*;`AAA;`AAA_PRICE);(*;`BBB;`BBB_PRICE);(+;`AAA_Revenue;`AAA_VAL);(%;`AAA_RevenueP;1e6);(reciprocal;`BBB_VAL));
AAA_VAL | (*;`AAA;`AAA_PRICE)
BBB_VAL | (*;`BBB;`BBB_PRICE)
AAA_RevenueP| (+;`AAA_Revenue;`AAA_VAL)
AAA_RevenueM| (%;`AAA_RevenueP;1000000f)
BBB_Other | (%:;`BBB_VAL)
func:{c:{$[0h=type y;.z.s[x]each y;-11h<>type y;y;y in key x;.z.s[x]each x y;y]}[y]''[y];
![x;();0b;where[{all in[;cols x]r where -11h=type each r:(raze/)y}[x]each c]#c]};
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB AAA_VAL AAA_RevenueP AAA_RevenueM
---------------------------------------------------------------
1 1 10 4 1 11 1.1e-05
2 2 20 5 4 24 2.4e-05
3 3 30 6 9 39 3.9e-05
/if the right columns are there
q)t:([] AAA:1 2 3;AAA_PRICE:1 2 3f;AAA_Revenue:10 20 30;BBB:4 5 6;BBB_PRICE:4 5 6f);
q)func[t;map]
AAA AAA_PRICE AAA_Revenue BBB BBB_PRICE AAA_VAL BBB_VAL AAA_RevenueP AAA_RevenueM BBB_Other
--------------------------------------------------------------------------------------------
1 1 10 4 4 1 16 11 1.1e-05 0.0625
2 2 20 5 5 4 25 24 2.4e-05 0.04
3 3 30 6 6 9 36 39 3.9e-05 0.02777778
The only caveat is that your map can't have the same column name as both the key and in the value of your map, aka cannot re-use column names. And it's assumed all symbols in your map are column names (not global variables) though it could be extended to cover that
EDIT: if you have a large number of column maps then it will be easier to define it in a more vertical fashion like so:
map:(!). flip(
(`AAA_VAL; (*;`AAA;`AAA_PRICE));
(`BBB_VAL; (*;`BBB;`BBB_PRICE));
(`AAA_RevenueP;(+;`AAA_Revenue;`AAA_VAL));
(`AAA_RevenueM;(%;`AAA_RevenueP;1e6));
(`BBB_Other; (reciprocal;`BBB_VAL))
);
I am given a matrix of horizontal edge points. What I want to do is to combine these edge points and find the distance between them if possible.
Let's say, the matrix M below represents the edge points on the xy coordinate plane.
M=[15 1
16 1
19 1
21 1
17 2
18 2
94 2
98 2
20 3
95 3
97 3
96 4
16 20
18 20
21 20
17 21
19 22
20 23];
What I want to do is to combine closer points in a piecewise fashion and get the matrix below:
M=[15 1
16 1
17 2
18 2
19 1
20 3
21 1
16 20
17 21
18 20
19 22
20 23
21 20
94 2
95 3
96 4
97 3
98 2];
which means the points that are close to each other are shown together. Sorting using M=sortrows(M,1) or M=sortrows(M,2) doesn't group the points together. What should I be using?
This is a clustering problem, not a sorting problem. Try using kmeans (which performs k-means clustering, a well known method to find groups of data that are close together).
If you have the MATLAB Statistics and Machine Learning Toolbox, use the built-in kmeans function (run ver to see your installed toolboxes). You must specify the number of clusters. For a simple dataset like the one you have posted, this is easy to do by looking at a scatter plot (scatter(M(:,1), M(:,2))).
Run k-means clustering, which outputs a vector containing the cluster that each point belongs to:
clusters = kmeans(M, 3)
Add this as the third column of M sort by that column. The 3rd column now denotes which cluster each point is in (e.g., all points with 1 in the third column are a group).
M(:,3) = clusters
groupedData = sortrows(M, 3)
16 20 1
18 20 1
21 20 1
17 21 1
19 22 1
20 23 1
94 2 2
98 2 2
95 3 2
97 3 2
96 4 2
15 1 3
16 1 3
19 1 3
21 1 3
17 2 3
18 2 3
20 3 3
You could then select the points in a single group using logical indexing:
M(M(:,3) == 1, :)
16 20 1
18 20 1
21 20 1
17 21 1
19 22 1
20 23 1
If you do not have the MATLAB Statistics and Machine Learning Toolbox package, this file from the FileExchange could be a good replacement: Kmeans Clustering by Mo Chen
I have two arrays. The first one is a consecutive sequential one, like:
seq1 =
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
...continues
The second one is like:
seq2 =
2 250
3 260
5 267
6 270
8 280
10 290
13 300
18 310
20 320
21 330
...continues
I need to embed seq2 into seq1 in such a way that I end up with the sequence:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
... continues
I could do this with loops but the arrays are really big so I don't want to use two for loops, it is taking too long. How can I do this in a vectorised manner?
I think this does what you want:
[~, jj, vv] = find(sum(bsxfun(#le, seq2(:,1), seq1(:,1).'), 1));
seq3 = seq1;
seq3(jj,2) = seq2(vv,2);
How it works
The required index is obtained by computing how many values in the first column of seq2 are less than or equal to each value in the first column or seq1 (code sum(bsxfun(#le, ...), 1)). This will be used to select the appropriate entries from the second column of seq2 and write them into the result. But before that, the value 0 in this index needs to be discarded. This is done using the three-output version of find (code [~, jj, vv] = find(...)).
If your second column of data is always increasing, you can solve this easily with accumarray and cummax:
seq = [seq1; seq2];
seq3 = cummax(accumarray(seq(:, 1), seq(:, 2), [], #max));
seq3 = [(1:numel(seq3)).' seq3];
And here's what you get for your sample inputs:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
12 290
13 300
14 300
15 300
16 300
17 300
18 310
19 310
20 320
21 330
How it works...
After concatenating seq1 and seq2, accumarray collects all the values in the second column that have the same value in the first column (i.e. [0 250] for the value 2), then gets the maximum value of each set. The function cummax is then used to fill any zero values with the previous non-zero value. Finally, an index column is added to the new sequence.
Say I have a matrix A whose first column contains item IDs with repetition and second column contains their weights.
A= [1 40
3 33
2 12
4 22
2 10
3 6
1 15
6 29
4 10
1 2
5 18
5 11
2 8
6 25
1 14
2 11
4 28
3 38
5 35
3 9];
I now want to find the difference of each instance of A and its associated minimum weight. For that, I make a matrix B with its first column containing the unique IDs from column 1 of A, and its column 2 containing the associated minimum weight found from column 2 of A.
B=[1 2
2 8
3 6
4 10
5 11
6 25];
Then, I want to store in column 3 of A the difference of each entry and its associated minimum weight.
A= [1 40 38
3 33 27
2 12 4
4 22 12
2 10 2
3 6 0
1 15 13
6 29 4
4 10 0
1 2 0
5 18 7
5 11 0
2 8 0
6 25 0
1 14 12
2 11 3
4 28 18
3 38 32
5 35 24
3 9 3];
This is the code I wrote to do this:
for i=1:size(A,1)
A(i,3) = A(i,1) - B(B(:,1)==A(i,2),2);
end
But this code takes a long time to execute as it needs to loop through B every time it loops through A. That is, it has a complexity of size(A) x size(B). Is there a better way to do this without using loops, that would execute faster?
You can use accumarray to first compute the minimum value in the second column of A for each unique value in the first column of A. We can then index into the result using the first column of A and compare to the second column of A to create the third column.
% Compute the mins
min_per_group = accumarray(A(:,1), A(:,2), [], #min);
% Compute the difference between the second column and the minima
A(:,3) = A(:,2) - min_per_group(A(:,1));
I have the following matrix declared in Matlab:
EmployeeData =
1 20 100000 42 14
2 15 95000 35 14
3 18 70000 28 14
4 10 85000 35 14
5 10 40000 21 12
6 4 45000 14 8
7 3 50000 21 10
8 5 55000 21 14
9 1 25000 14 7
10 2 50000 21 9
42 4 100000 42 10
Where column 1 represents ID numbers, 2 represents years, 3 is salary, 4 is vacation days, and 5 is sick days. I am trying to find the maximum value of a column (in this case the salary column), and print out the ID associated with that value. If more than one employee has the maximum value, all the IDs with that maximum are supposed to be shown. So here is how I naively implemented a way to do it:
>> maxVal = [];
>> j = 1;
>> for i = EmployeeData(:, 3)
if i == max(EmployeeData(:, 3))
maxVal = [maxVal EmployeeData(j, 1)];
end
j = j + 1;
end
But it shows maxVal to be [] in my workspace variables, instead of [1 42] as I expected. Upon inserting a disp(i) in the for loop above the if to debug, I get the following output:
100000
95000
70000
85000
40000
45000
50000
55000
25000
50000
Just like I expected. But when I switch out that disp(i) with a disp(j), I get this for my output:
1
What am I doing wrong? Should this not work?
MATLAB for loops operate on rows, not columns. You should try replacing your for loop with:
for i = EmployeeData(:, 3)' % NOTE THE TRANSPOSE
...
end
EDIT: Note that you can do what you're trying to do without a forloop:
maxVal = EmployeeData(EmployeeData(:,3) == max(EmployeeData(:,3)),1);
Is this what you want?
>> EmployeeData(EmployeeData(:,3)==max(EmployeeData(:,3)),1)
ans =
1
42