Organising large datasets in Matlab - matlab

I have a problem I hope you can help me with.
I have imported a large dataset (200000 x 5 cell) in Matlab that has the following structure:
'Year' 'Country' 'X' 'Y' 'Value'
Columns 1 and 5 contain numeric values, while columns 2 to 4 contain strings.
I would like to arrange all this information into a variable that would have the following structure:
NewVariable{Country_1 : Country_n , Year_1 : Year_n}(Y_1 : Y_n , X_1 : X_n)
All I can think of is to loop through the whole dataset and find matches between the Country, Year, X and Y names by combining if and strcmp, but this seems like the least efficient way of achieving what I am trying to do.
Can anyone help me out?
Thanks in advance.

As mentioned in the comments, you can use a categorical array:
% some arbitrary data:
country = repmat('ca',10,1);
country = [country; repmat('cb',10,1)];
country = [country; repmat('cc',10,1)];
T = table(repmat((2001:2005)',6,1),cellstr(country),...
cellstr(repmat(['x1'; 'x2'; 'x3'],10,1)),...
cellstr(repmat(['y1'; 'y2'; 'y3'],10,1)),...
randperm(30)','VariableNames',{'Year','Country','X','Y','Value'});
% convert all non-number data to categorical arrays:
T.Country = categorical(T.Country);
T.X = categorical(T.X);
T.Y = categorical(T.Y);
% here is an example for using categorical array:
newVar = T(T.Country=='cb' & T.Year==2004,:);
The table class is made for such things and is very convenient. Just extend the logical expression in the last line, T.Country=='cb' & T.Year==2004, to match your needs.
Tell me if this helps ;)
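To get closer to the Y-by-X layout asked for in the question, one option is to turn the filtered rows into a matrix with accumarray. This is only a sketch under assumptions: the tiny table below is made up for illustration, and it relies on double() of a categorical returning the category index.

```matlab
% Hypothetical 4-row table with the same variable names as above
T = table([2001; 2001; 2001; 2001], ...
    categorical({'ca'; 'ca'; 'ca'; 'ca'}), ...
    categorical({'x1'; 'x2'; 'x1'; 'x2'}), ...
    categorical({'y1'; 'y1'; 'y2'; 'y2'}), ...
    [10; 20; 30; 40], ...
    'VariableNames', {'Year', 'Country', 'X', 'Y', 'Value'});

% Pick one country/year combination
sub = T(T.Country == 'ca' & T.Year == 2001, :);

% Rows are indexed by the Y category, columns by the X category
nY = numel(categories(sub.Y));
nX = numel(categories(sub.X));
M = accumarray([double(sub.Y) double(sub.X)], sub.Value, [nY nX]);
```

Here M(i,j) holds the Value for the i-th Y category and the j-th X category. Note that accumarray sums duplicates, so this assumes each (Y, X) pair occurs at most once per country and year.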

Related

Find and replace in matlab?

I want to do a "find and replace all" in MATLAB (as we do in MS Office).
https://www.dropbox.com/s/hxfqunjwhnvkl1f/matlab.mat?dl=0
I have a cell array LUT_HS_complete (identifier in column 1, protein name in column 2 and summary in column 3); this is my lookup table. I also have my protein-protein interaction data (named Second_layer, with identifiers in the first two columns and the score in column 3).
I want to replace the first two columns in my Second_layer with the corresponding protein name from my look up table.
I tried strmatch, but that didn't help me.
Source_gene = Second_layer(:,1); Source_gene = regexprep(Source_gene,'[-/\s]','');
Target_gene = Second_layer(:,2); Target_gene = regexprep(Target_gene,'[-/\s]','');
Inter_score = Second_layer(:,3);
%%
for i=1:length(Source_gene(1:end,1));
SG = strmatch(Source_gene(i),LUT_HS_complete(1:end,1),'exact');
renamed_Source_gene(SG,1) = LUT_HS_complete(SG,2);
end
for j=1:length(Target_gene(1:end,1));
TG = strmatch(Target_gene(j),LUT_HS_complete(1:end,1),'exact');
renamed_Target_gene(TG,1) = LUT_HS_complete(TG,2);
end
If you could find a solution, it would be a great help.
Might this work for you?
renamed_Second_layer(:,1)=LUT_HS_complete(cellfun(@(x) find(strcmp(x,LUT_HS_complete(:,1))),Second_layer(:,1)),2);
renamed_Second_layer(:,2)=LUT_HS_complete(cellfun(@(x) find(strcmp(x,LUT_HS_complete(:,1))),Second_layer(:,2)),2);
renamed_Second_layer(:,3)=Second_layer(:,3);
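The cellfun/find version errors out if an identifier is missing from the lookup table or appears in it more than once. A possible alternative is the second output of ismember, which returns the matching row for each identifier and 0 where there is none. The sketch below uses small made-up arrays (LUT, pairs) in place of LUT_HS_complete and Second_layer:

```matlab
% Hypothetical stand-ins for the lookup table and the interaction data
LUT   = {'id1', 'ProtA', 's1'; 'id2', 'ProtB', 's2'; 'id3', 'ProtC', 's3'};
pairs = {'id2', 'id1', 0.9; 'id3', 'idX', 0.4};   % 'idX' is not in LUT

ids = pairs(:, 1:2);                      % identifier columns only
[found, loc] = ismember(ids, LUT(:, 1));  % loc is the LUT row, 0 if absent

names = ids;                              % unmatched identifiers stay as-is
names(found) = LUT(loc(found), 2);        % swap in the protein names

renamed = pairs;
renamed(:, 1:2) = names;
```

With this input, renamed is {'ProtB','ProtA',0.9; 'ProtC','idX',0.4}, so the unknown identifier is simply left untouched.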

Filtering by Column in Matlab for a list or variety of values

I've produced a code which separates data within a text file into the required format, filters the data and averages the output (in this case, the value in the fourth column)
I am trying to filter the data in column one for a list of values at the same time, with no strict pattern for the values. e.g 1001, 1007, 1048, 1192, 1200 ....
Currently my code only filters by a single value (1001). Is there a way of incorporating a list of values into this filter?
C_f = C(C(:,1) == 1001 , :);
Any help would be much appreciated!
See if this is what you want,
val = [1000 1001];
ind = ismember(C(:,1),val);
C_f = C(ind,:)

Is there a quick way to assign unique text entries in an array a number?

In MATLAB, I have several data vectors that contain text. For example:
speciesname = [species1 species2 species3];
genomelength = [8 10 5];
gonometype = [RNA DNA RNA];
I realise that to make a plot, arrays must be numerical. Is there a quick and easy way to assign unique entries in an array a number, for example so that RNA = 1 and DNA = 2? Note that some arrays might not be binary (i.e. have more than two options).
Thanks!
There is a quick way to do it, but I'm not sure your plots will be very intelligible if you use numbers instead of words.
You can make a unique array like this:
u = unique(gonometype);
The corresponding number array is then just 1:length(u).
When you go through your data, the number of the current word will be:
find(u == current_name);
For your particular case you will need to utilize cells:
gonometype = {'RNA', 'DNA', 'RNA'};
u = unique(gonometype)
u =
'DNA' 'RNA'
current = 'RNA';
find(strcmp(u, current))
ans =
2
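As a possible shortcut, the third output of unique already gives the number for every entry, so no find or loop is needed. A minimal sketch using the same cell array as above (the name codes is just illustrative):

```matlab
gonometype = {'RNA', 'DNA', 'RNA'};

% u holds the sorted unique labels, codes the position of each entry in u
[u, ~, codes] = unique(gonometype);

% u is {'DNA', 'RNA'}; codes contains 2, 1, 2
```

Since u is sorted alphabetically, DNA becomes 1 and RNA becomes 2 here. This also works when there are more than two distinct labels, and u(codes) recovers the original labels if you need them for tick labels.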

Keep every x-th point and store it in variable 'newdata' in MATLAB

I have data stored in variable data.
data =
[43.98272955 39.55809471;
-49.51656799 28.57164726;
-9.475861028 -44.31264255;
27.14884251 2.603921223;
-2.914496888 7.864022006;
4.093025860 4.816211687;
-12.11007479 5.797539648;
-1.653535904 -12.49864642;
5.978990391 1.229984916;
0.9837133282 -2.001124423;
5.674977844 6.323209942;
-9.574459589 3.696791663;
0.3410452503 -7.338955191]
but I need to use only the rows whose index is a multiple of x.
Example:
if x = 3,
I want to store only the rows that are multiples of 3 (rows 3, 6, 9, 12), so
newdata = [-9.475861028 -44.31264255;
4.093025860 4.816211687;
5.978990391 1.229984916;
-9.574459589 3.696791663]
how do I do that?
P.S I would use the command textscan.
This is straightforward with indexing:
newData = data(3:3:end,:)
If I understood the question correctly:
data(x:x:size(data,1),:)
You could also just scan it row by row, using the mod (modulo) function to extract the rows corresponding to the desired multiples. For example:
x=3;
newdata=[];
for k=1:size(data,1)
if mod(k,x)==0
newdata=[newdata; data(k,:)];
end
end

How to extract year from a dates cell array in MATLAB?

I have a cell array of dates, shown below. How can I extract the year, i.e. the last 4 digits? Could anyone show me how to locate the year in the string? Thank you!
'31.12.2001'
'31.12.2000'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2000'
'31.12.1999'
'31.12.1998'
'31.12.1997'
'31.12.2005'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2001'
'31.12.2000'
'31.12.1999'
'31.12.1998'
'31.12.2005'
'31.12.2004'
'31.12.2003'
'31.12.2002'
'31.12.2005'
Example cell array:
A = {'31.12.2001'; '31.12.2002'; '31.12.2003'};
Apply some regular expressions:
B = regexp(A, '\d\d\d\d', 'match')
B = [B{:}];
EDIT: I never realized that matlab will "nest" an extra layer of cells until I tested this. I don't like this solution as much now that I know the second line is necessary. Here is an alternative approach that gets you the years in numeric form:
C = datevec(A, 'dd.mm.yyyy');
C = C(:, 1);
SECOND EDIT: Surprisingly, if your cell array has fewer than 10000 elements, the regexp approach is faster on my machine. But its output is another cell array (which takes up much more memory than a numeric matrix). You can use B = cell2mat(B) to get a character array instead, but this brings the two approaches to approximately equal efficiency.
Just to add a fun answer, designed to take the OP to the stranger regions of Matlab:
D = char(C);
y = (D(:,7:end)-'0') * 10.^(3:-1:0).'
which is an order of magnitude faster than anything posted in the other answers :)
Or, to stay a bit closer to home,
y = cellfun(@(x)str2double(x(7:end)),C);
or, yet another regexp variation:
y = str2num(char(regexprep(C, '\d+\.\d+\.','')));
Assuming your matrix with dates is M or a cell array C:
In case your data is in a cell array start with
M = cell2mat(C)
Then get the relevant part
Y=M(:,end-3:end)
If required you can even make the year a number
Year = str2num(Y)
Using regexp, this will also work with dates in slightly different formats, like 1.1.2000, which can mess with your offsets:
res = regexp(dates, '(?<=\d+\.\d+\.)\d+', 'match')
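Another option, assuming a MATLAB release that has the datetime type, is to parse the strings as dates and ask for the year directly. This is only a sketch using the small example array from earlier in the thread:

```matlab
A = {'31.12.2001'; '31.12.2002'; '31.12.2003'};

% Note the datetime format uses MM for months (mm would mean minutes)
t = datetime(A, 'InputFormat', 'dd.MM.yyyy');

% year returns a numeric column vector: 2001, 2002, 2003
y = year(t);
```

This gives numeric years without any character offsets, at the cost of parsing the full date.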