Splitting a table in MATLAB by Sensor ID - matlab

I have a large table in MATLAB which contains over 1000 rows of data, in two columns. Column 1 is the ID of the sensor which gathered the data, and column two is the data itself (in this case a voltage).
I have been able to sort my table to gather all the data for sensors together. So, all the data from Sensor 1 is in rows 1 to 100, the data for Sensor 2 is in rows 101 to 179, the data for Sensor 3 is in rows 180 to 310, and so on. In other words, the number of rows which contain data for a given sensor is never the same.
Now, I want to split this main table into separate tables for each sensor ID, and I am having trouble figuring out a way to do it. I imagine I could do it with a loop, where my I cycle through the various IDs, but that doesn't seem like a very, MATLAB way of doing it.
What would be an efficient way to complete this task? Or would a loop really be the only way?
I have attached a small screenshot of some of my data.

The screenshot you shared shows a 1244x1 structure array with 2 fields but the question describes a table. You could convert the structure array to a table using,
T = struct2table(S); % Assuming S is the name of your structure
Whether the variable is a structure or table, it's better to not separate the variable and to use indexing instead. For example, assuming the variable is a table, you can compute the mean voltage for sensor1 using,
mean(T.reported_voltage(strcmp(T.sensor_id,'Sensor1')))
and you could report the mean of all groups using,
groupsummary(T,'sensor_id', 'mean')
or
splitapply(#mean,T.reported_voltage,findgroups(T.sensor_id))
But if you absolutely must break apart and tidy, well-organized table, you can do so by splitting the table into sub-tables stored within a cell array using,
unqSensorID = unique(T.sensor_id);
C = arrayfun(#(id){T(strcmp(T.sensor_id, id),:)},unqSensorID)

In this case the for loop is fine because (I guess) there aren't that many different sensors and your code will likely spend most of its time processing the data anyway - the loop won't give you a significant overhead.
Assuming your table is called t, the following should do what you want.
unique_sensors = unique(t.sensor_id)
for i = 1:length(unique_sensors)
sensor_data = t(t.sensor_id == unique_sensors(i), :);
% save or do some processing on this data
end

Related

Add missing/extra values in data array in Matlab

I have recorded WiFi CSI sensor data 5000 packets in 5 seconds(5000 packets x 57 subcarriers). But due to dynamic hardware configuration sometimes I only receive 4998 x 57. I want to add and estimate 2 rows so that my original design has consistent 5000 rows x 57 columns.
As you can see some data are 5000x57, and some are 4998x57.
You can achieve your desired output using mean()-function combined with the concatenation operator [] and the repmat() like this:
A=randi(100,4998,57);
A=[A;repmat(mean(A),2,1)];
Most of the functions in Matlab that take arrays as an input will calculate for each column except if the input array hast just 1 row. So does the mean function and you can just append means output to your arrays.
If you show me the code that you used to import the data, I might be able to help you create a cleaner data structure and thus be able to automatically process all of your arrays. The way the data is currently designed it's only possible to do this with dynamic variable names which is considered bad programming practice.

MATLAB spending an incredible amount of time writing a relatively small matrix

I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code is taking a truly incredible amount of time (hours) to run what should be achievable in at most some seconds. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on input, but in all usage is smaller than 1000x1000.
The code is as follows
function [data] = DataHandler(D)
n = size(D,1);
s = max(D,1);
data = zeros(s,s);
for i = 1:n
data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
end
It's the data = zeros(s,s); line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there're better ways to do this, but being that I just bashed the code together to quickly format some data I wasn't too concerned. As I said, I fixed it by just replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. New code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. Its a little disconcerting, and it would be good to be able to be confident that I can initialize matrices freely without killing MATLAB.
Your call to zeros is incorrect. Looking at your code, D looks like a D x 2 array. However, your call of s = max(D,1) would actually generate another D x 2 array. By consulting the documentation for max, this is what happens when you call max in the way you used:
C = max(A,B) returns an array the same size as A and B with the largest elements taken from A or B. Either the dimensions of A and B are the same, or one can be a scalar.
Therefore, because you used max(D,1), you are essentially comparing every value in D with the value of 1, so what you're actually getting is just a copy of D in the end. Using this as input into zeros has rather undefined behaviour. What will actually happen is that for each row of s, it will allocate a temporary zeros matrix of that size and toss the temporary result. Only the dimensions of the last row of s is what is recorded. Because you have a very large matrix D, this is probably why the profiler hangs here at 100% utilization. Therefore, each parameter to zeros must be scalar, yet your call to produce s would produce a matrix.
What I believe you intended should have been:
s = max(D(:));
This finds the overall maximum of the matrix D by unrolling D into a single vector and finding the overall maximum. If you do this, your code should run faster.
As a side note, this post may interest you:
Faster way to initialize arrays via empty matrix multiplication? (Matlab)
It was shown in this post that doing zeros(n,n) is in fact slow and there are several neat tricks to initializing an array of zeros. One way is to accomplish this by empty matrix multiplication:
data = zeros(n,0)*zeros(0,n);
One of my personal favourites is that if you assume that data was not declared / initialized, you can do:
data(n,n) = 0;
If I can also comment, that for loop is quite inefficient. What you are doing is calculating a 2D histogram / accumulation of data. You can replace that for loop with a more efficient accumarray call. This also avoids allocating an array of zeros and accumarray will do that under the hood for you.
As such, your code would basically become this:
function [data] = DataHandler(D)
data = accumarray([D(:,1) D(:,2)+1], 1);
accumarray in this case will take all pairs of row and column coordinates, stored in D(i,1) and D(i,2) + 1 for i = 1, 2, ..., size(D,1) and place all that match the same row and column coordinates into a separate 2D bin, we then add up all of the occurrences and the output at this 2D bin gives you the total tally of how many values at this 2D bin which corresponds to the row and column coordinate of interest mapped to this location.

MATLAB loop over unique function output

following problem:
I have a very large matrix and several rows share the same identifier in column 1. For these rows I need to do some averaging, reformatting etc.
Currently I am identifying all unique identifier values in column 1 by using the function unique and then do averaging, reformatting of values in other columns within this loop for each set of rows sharing the same column 1 value within a loop.
ID = unique(data.1);
for i = 1:length(ID);
do stuff
end
I guess this is highly inefficient and slow but I cannot think of a better way of handling this.

When to use a cell, matrix, or table in Matlab

I am fairly new to matlab and I am trying to figure out when it is best to use cells, tables, or matrixes to store sets of data and then work with the data.
What I want is to store data that has multiple lines that include strings and numbers and then want to work with the numbers.
For example a line would look like
'string 1' , time, number1, number 2
. I know a matrix works best if al elements are numbers, but when I use a cell I keep having to convert the numbers or strings to a matrix in order to work with them. I am running matlab 2012 so maybe that is a part of the problem. Any help is appreciated. Thanks!
Use a matrix when :
the tabular data has a uniform type (all are floating points like double, or integers like int32);
& either the amount of data is small, or is big and has static (predefined) size;
& you care about the speed of accessing data, or you need matrix operations performed on data, or some function requires the data organized as such.
Use a cell array when:
the tabular data has heterogeneous type (mixed element types, "jagged" arrays etc.);
| there's a lot of data and has dynamic size;
| you need only indexing the data numerically (no algebraic operations);
| a function requires the data as such.
Same argument for structs, only the indexing is by name, not by number.
Not sure about tables, I don't think is offered by the language itself; might be an UDT that I don't know of...
Later edit
These three types may be combined, in the sense that cell arrays and structs may have matrices and cell arrays and structs as elements (because thy're heterogeneous containers). In your case, you might have 2 approaches, depending on how you need to access the data:
if you access the data mostly by row, then an array of N structs (one struct per row) with 4 fields (one field per column) would be the most effective in terms of performance;
if you access the data mostly by column, then a single struct with 4 fields (one field per column) would do; first field would be a cell array of strings for the first column, second field would be a cell array of strings or a 1D matrix of doubles depending on how you want to store you dates, the rest of the fields are 1D matrices of doubles.
Concerning tables: I always used matrices or cell arrays until I
had to do database related things such as joining datasets by a unique key; the only way I found to do this in was by using tables. It takes a while to get used to them and it's a bit annoying that some functions that work on cell arrays don't work on tables vice versa. MATLAB could have done a better job explaining when to use one or the other because it's not super clear from the documentation.
The situation that you describe, seems to be as follows:
You have several columns. Entire columns consist of 1 datatype each, and all columns have an equal number of rows.
This seems to match exactly with the recommended situation for using a [table][1]
T = table(var1,...,varN) creates a table from the input variables,
var1,...,varN . Variables can be of different sizes and data types,
but all variables must have the same number of rows.
Actually I don't have much experience with tables, but if you can't figure it out you can always switch to using 1 cell array for the first column, and a matrix for all others (in your example).

Creating dataset from matrix in Matlab

I'm trying to create a dataset from a double matrix and cell array of labels.
I don't have access to the mat2dataset function so I'm trying to write something similar.
>> whos data feature_labels
Name Size Bytes Class Attributes
data 2x208 3328 double
feature_labels 1x208 50776 cell
In actual use the data will have ~2million rows and always be double format. The number of columns will range from 20 up to 2000, so doing something like;
>> D = dataset([],[],[],[],[],...[], 'VarNames', feature_labels);
isn't really feasible.
Any suggestions?
edit:
Currently using a for loop and horzcat to concatenate new dataset columns on each loop. I don't see a way to pre-allocate the dataset size is this way so I imagine performance will chug with the larger datasets though..
Have you considered using a struct? I use these all the time in MATLAB for database things. I know it works absolutely fantastic for up to 20,000 elements with about 15 fields each, so I think it would still work just as well as anything else for 2 million items with 2 fields.
Alternatively, can't you just put it in a cell array?
DataBase{rowNum,1}=dataVector(rowNum,:);
DataBase{rowNum,2}=label{rowNum};
To preallocate a struct or cell, its relatively easy, with a struct, once you make your first one to initialize the fields, just say Struct(2000000).fieldName =[]
TO preallocate your cell array, just do
DataBase={[]}
DataBase{2000000,2}=[]
This preallocates all of it and fills it with empty values.