How can I extract the "Depth" and "Mean" data for each month from a file like the following?
MEAN, S.D., NO. OF OBSERVATIONS
January February ...
Depth Mean S.D. #Obs Mean S.D. #Obs ...
0 32.92 0.43 9 32.95 0.32 21
10 32.92 0.43 14 33.06 0.37 48
20 32.88 0.46 10 33.06 0.37 50
30 32.90 0.51 9 33.12 0.35 48
50 33.05 0.54 6 33.20 0.42 41
75 33.70 1.11 7 33.53 0.67 37
100 34.77 1 34.47 0.42 10
150
200
July August
Depth Mean S.D. #Obs Mean S.D. #Obs
0 32.76 0.45 18 32.75 0.80 73
10 32.76 0.40 23 32.65 0.92 130
20 32.98 0.53 24 32.84 0.84 121
30 32.99 0.50 24 32.93 0.59 120
50 33.21 0.48 16 33.05 0.47 109
75 33.70 0.77 10 33.41 0.73 80
100 34.72 0.54 3 34.83 0.62 20
150 34.69 1
200
There is an unpredictable number of spaces between the values, and an introductory line at the beginning.
Thank you!
Here is an example of how to read a file line by line:
fid = fopen('yourfile.txt');
tline = fgetl(fid);
while ischar(tline)
disp(tline)
tline = fgetl(fid);
end
fclose(fid);
Inside the while loop you'll want to use strtok (or something like it) to break up each line into string tokens delimited by spaces.
MATLAB's regexp is powerful for pulling data out of less-structured text. It's really worth getting familiar with regular expressions in general: http://www.mathworks.com/help/techdoc/ref/regexp.html
In this case, you would define the pattern to capture each observation group (Mean SD Obs), e.g.: 32.92 0.43 9
Here I see a pattern for each group of data: each group is preceded by 6 spaces (regular expression = \s{6}), and the 3 data points within a group are separated by fewer than 6 spaces (\s+). The data itself consists of two floats (\d+\.\d+) and one integer (\d+).
So, putting this together, your capture pattern would look something like this (the parentheses surround each piece of data to capture):
expr = '\s{6}(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+)';
We can add names for each token (i.e. each data point to capture in the group) by adding '?<name>' at the start of each pair of parentheses:
expr = '\s{6}(?<mean>\d+\.\d+)\s+(?<sd>\d+\.\d+)\s+(?<obs>\d+)';
Then, just read your file into one string variable 'strFile' and extract the data using this defined pattern:
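strFile = fileread('mydata.txt'); % read the whole file into one character vector
[tokens, data] = regexp(strFile, expr, 'tokens', 'names');
The variable 'tokens' will then contain a sequence of observation groups and 'data' is a structure array with fields .mean, .sd and .obs (because these are the token names in 'expr'). The captured values are text, so convert them with str2double before doing arithmetic.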
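As a quick check of the named-token pattern, here is a minimal sketch run on a single made-up line rather than a file. The variable sample is hypothetical and is built with exactly 6 spaces before each group, which is the assumption the pattern relies on:

```matlab
% Named-token pattern from the answer above
expr = '\s{6}(?<mean>\d+\.\d+)\s+(?<sd>\d+\.\d+)\s+(?<obs>\d+)';

% Hypothetical sample line: depth, then two groups each preceded by 6 spaces
sample = ['0' blanks(6) '32.92  0.43  9' blanks(6) '32.95  0.32  21'];

% One struct per match; every field holds text
data = regexp(sample, expr, 'names');

% Convert the captured text of every group's mean to numbers
means = str2double({data.mean});
```

If the real file does not have exactly 6 spaces before each group, the pattern will silently match nothing, so it is worth checking numel(data) after the call.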
If you just want to get, for example, the first two columns, then textscan() is a great choice.
fid = fopen('yourfile.txt');
tline = fgetl(fid);
while ischar(tline)
    oneCell = textscan(tline, '%n'); % read the whole line, put it into a cell
    allTheNums = oneCell{1}; % open up the cell to get at the columns
    if numel(allTheNums) >= 2 % skip header lines and rows without data
        usefulNums = allTheNums(1:2) % get the first two columns
    end
    tline = fgetl(fid); % advance to the next line, otherwise the loop never ends
end
fclose(fid);
textscan automatically splits the strings you feed it where there is whitespace, so the undefined number of spaces between columns isn't an issue. A string with no numbers will give an empty array, which you can test for to avoid out-of-bounds or bad data errors.
If you need to programmatically figure out which columns to get, you can scan for the words 'Depth' and 'Mean' to find the indices. Regular expressions might be helpful here, but textscan should work fine too.
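To pull out the depth together with the mean of every month on a row, a minimal sketch along the same lines is shown here. It works on a hypothetical in-memory cell array named lines instead of a file, and it assumes two months per block, so a complete row has 7 numbers (depth plus 3 values per month). Rows with missing values are skipped, because once the whitespace is collapsed it is no longer clear which column a value belongs to:

```matlab
% Hypothetical sample rows in the layout of the question
lines = {'Depth Mean S.D. #Obs Mean S.D. #Obs', ...
         '0 32.92 0.43 9 32.95 0.32 21', ...
         '10 32.92 0.43 14 33.06 0.37 48', ...
         '150'};

depth = zeros(0, 1);   % one depth per complete row
means = zeros(0, 2);   % one column per month

for k = 1:numel(lines)
    nums = sscanf(lines{k}, '%f')';   % empty for header lines
    if numel(nums) == 7               % depth + 2 months x (mean, sd, obs)
        depth(end+1, 1) = nums(1);
        means(end+1, :) = nums(2:3:end);   % positions 2 and 5 are the means
    end
end
```

After the loop, depth(i) and means(i, :) describe the same row of the file.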
Related
I need to write a string and a table to one text file. The string contains a description of the data in the table, but is not the headers for the table. I am using R2019a which I guess means the "Append" writetable option does not work? Please see my example data below:
% Load data
load cereal.mat;
table = table(Calories, Carbo, Cups, Fat, Fiber, Mfg, Name, Potass);
string = {'This is a string about the cereal table'};
filename = "dummyoutput.sfc";
% How I tried to do this (which does not work!)
fid = fopen(filename, 'w', 'n');
fprintf(fid, '%s', cell2mat(string))
fclose(fid);
writetable(table, filename, 'FileType', 'text', 'WriteVariableNames', 0, 'Delimiter', 'tab', 'WriteMode', 'Append')
I get this error:
Error using writetable (line 155)
Wrong number of arguments.
Does anyone have a suggestion as to how to proceed?
Thanks!
A bit hacky, but here's an idea.
Convert your existing table to a cell array with table2cell.
Prepend a row of cells which consists of your string, followed by empty cells.
Convert the cell array back to a table with cell2table, and write the new table to the file.
load cereal.mat;
table = table(Calories, Carbo, Cups, Fat, Fiber, Mfg, Name, Potass);
s = {'This is a string about the cereal table'};
filename = "dummyoutput.sfc";
new_table = cell2table([[s repmat({''},1,size(table,2)-1)] ; table2cell(table)]);
writetable(new_table,filename,'FileType','text','WriteVariableNames',0,'Delimiter','tab');
>> !head dummyoutput.sfc
This is a string about the cereal table
70 5 0.33 1 10 N 100% Bran 280
120 8 -1 5 2 Q 100% Natural Bran 135
70 7 0.33 1 9 K All-Bran 320
50 8 0.5 0 14 K All-Bran with Extra Fiber 330
110 14 0.75 2 1 R Almond Delight -1
110 10.5 0.75 2 1.5 G Apple Cinnamon Cheerios 70
110 11 1 0 1 K Apple Jacks 30
130 18 0.75 2 2 G Basic 4 100
90 15 0.67 1 4 R Bran Chex 125
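On releases where writetable does not accept 'WriteMode', another option is to skip writetable and write both the description and the rows with fprintf. This is only a minimal sketch: the table T and the file name below are hypothetical stand-ins (not cereal.mat), with one numeric and one text column, and the format string would need one specifier per column of the real table:

```matlab
% Hypothetical stand-in for the cereal table
T = table([70; 120], {'100% Bran'; '100% Natural Bran'}, ...
          'VariableNames', {'Calories', 'Name'});
description = 'This is a string about the cereal table';
filename = 'dummyoutput_sketch.txt';

fid = fopen(filename, 'w');
fprintf(fid, '%s\n', description);   % description on its own line
for k = 1:height(T)
    % One tab-delimited line per table row
    fprintf(fid, '%g\t%s\n', T.Calories(k), T.Name{k});
end
fclose(fid);
```

This keeps the description out of the table itself, at the cost of maintaining the format string by hand.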
I'm working on an assignment where I have to read a tab delimited text file and my output has to be a matlab structure.
The contents of the file look like this (It is a bit messy but you get the picture). The actual file contains 500 genes (the rows starting at Analyte 1) and 204 samples (the columns starting at A2)
#1.2
500 204
Name Desc A2 B2 C2 D2 E2 F2 G2 H2
Analyte 1 Analyte 1 978 903 1060 786 736 649 657 733.5
Analyte 2 Analyte 2 995 921 995.5 840 864.5 757 739 852
Analyte 3 Analyte 3 1445.5 1556.5 1579 1147.5 1249 1069.5 1048 1235
Analyte 4 Analyte 4 1550 1371 1449 1127 1196 1337 1167 1359
Analyte 5 Analyte 5 2074 1776 1960 1653 1544 1464 1338 1706
Analyte 6 Analyte 6 2667 2416.5 2601 2257 2258 2144 2173.5 2348
Analyte 7 Analyte 7 3381.5 3013.5 3353 3099.5 2763 2692 2774 2995
My code is as follows:
fid = fopen('gene_expr_500x204.gct', 'r');%Open the given file
% Skip the first line and determine the number of rows and number of samples
dims = textscan(fid, '%d', 2, 'HeaderLines', 1);
ncols = dims{1}(2);
% Now read the variable names
varnames = textscan(fid, '%s', 2 + ncols);
varnames = varnames{1};
% Now create the format spec for your data (2 strings and the rest floats)
spec = ['%s%s', repmat('%f', [1 ncols])];
% Read in all of the data using this custom format specifier. The delimiter will be a tab
data = textscan(fid, spec, 'Delimiter', '\t');
% Place the data into a struct where the variable names are the fieldnames
ge = data{3:ncols+2}
S = struct('gn', data{1}, 'gd', data{2}, 'sid', {varnames});
The part about ge is my current attempt, but it's not really working. Any help would be very appreciated, thank you in advance!!
A struct field can hold any datatype including a multi-dimensional array or matrix.
Your issue is that data{3:ncols+2} creates a comma-separated list. Since you only have one output on the left side of the assignment, ge will only hold the first column's value. You need to use cat to concatenate all of the columns into a big matrix.
ge = cat(2, data{3:end});
% Or you can do this implicitly with []
% ge = [data{3:end}];
Then you can pass this value to the struct constructor
S = struct('gn', data(1), 'gd', data(2), 'sid', {varnames}, 'ge', ge);
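The difference between the two assignments can be seen on a tiny example. This is a minimal sketch with a hypothetical 2-gene, 3-sample cell array shaped like the output of textscan:

```matlab
% Hypothetical stand-in for the textscan output: two text columns, three numeric ones
data = {{'Analyte 1'; 'Analyte 2'}, {'Analyte 1'; 'Analyte 2'}, ...
        [978; 995], [903; 921], [1060; 995.5]};

firstOnly = data{3:end};        % comma-separated list: only data{3} is assigned
ge = cat(2, data{3:end});       % all numeric columns side by side, 2-by-3

% data(1) is a 1-by-1 cell, so struct stores its contents in a single field
S = struct('gn', data(1), 'gd', data(2), 'ge', ge);
```

Passing data(1) rather than data{1} matters: a multi-element cell array given to struct produces a struct array with one element per cell, which is not what is wanted here.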
I want to insert a number into the following n x 1 matrix:
6
103
104
660
579
750
300
299
300
750
579
661
580
760
302
301
302
760
580
662
581
How do I insert it in the middle and shift the remaining numbers? I tried the following code:
Idx=[723];
c=false(1,length(Element_set2)+length(Idx));
c(Idx)=true;
result=nan(size(c));
result(~c)=Element_set2;
result(c)=8
You are complicating things. Simply find the middle index by taking the length of the array, dividing by 2 and rounding down, then use simple indexing to build the new matrix. Supposing that result is your column vector and number is the value you want to insert in the middle, do the following:
number = 8; %// Change to suit whatever number you desire
middle = floor(numel(result) / 2);
result = [result(1:middle); number; result(middle+1:end)];
In the future, please read this great MATLAB tutorial on indexing directly from MathWorks: http://www.mathworks.com/company/newsletters/articles/matrix-indexing-in-matlab.html. It's a good resource on the kinds of indexing operations one expects from starting out in MATLAB.
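A minimal sketch of the same idea on the first five values from the question, using a hypothetical variable v so the effect is easy to see:

```matlab
v = [6; 103; 104; 660; 579];    % first five values from the question
number = 8;                     % value to insert

middle = floor(numel(v) / 2);   % floor(5 / 2) = 2
vNew = [v(1:middle); number; v(middle+1:end)];
% vNew is [6; 103; 8; 104; 660; 579]
```

For a row vector, the same pattern works with commas (or spaces) instead of semicolons inside the brackets.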
consider an array in MATLAB :
a = [102 20 1 30 8 255];
In this array, I need to make all the numbers three digits long by prefixing zeros to the values, like this:
a = 102 020 001 030 008 255
After that, I need to convert it back to the original numbers. How can I do this?
I tried to separate the digits and do this. But it failed.
You can use the format specifiers of fprintf; sprintf takes the same specifiers and returns the result as a string:
>> a = [102 20 1 30 8 255]
a =
102 20 1 30 8 255
>> b = sprintf('%.3d ',a) % b is a single string
b =
102 020 001 030 008 255
>> a = str2num(b)
a =
102 20 1 30 8 255
You probably need to convert to string. Have a look at the int2str or num2str functions, for example. Then you can easily concatenate zeros at the beginning. For example:
s = int2str(10);
['0' s]
This gives you 010 as an output.
You can then revert with the str2num function.
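If each padded number should be kept as its own string rather than as one long string, a minimal sketch using sprintf with the '%03d' specifier per element could look like this (the names padded and restored are hypothetical):

```matlab
a = [102 20 1 30 8 255];

% Pad each value to three digits; the result is a cell array of character vectors
padded = arrayfun(@(v) sprintf('%03d', v), a, 'UniformOutput', false);

% Convert back to numbers; leading zeros are ignored
restored = str2double(padded);
```

Unlike the manual ['0' s] concatenation, this pads with as many zeros as each value needs.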
I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?
Consider the data given below:
x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
34, 47, 455, 21, , 22, 100 ];
Once the data is loaded, see if you can find any:
Outliers or
Missing data in the data file
Correct the missing values using median, mode and noisy data using median binning, mean binning and bin boundaries.
This isn't so bad. First off, take a look at the distribution of your data. You can see that the majority of your data has double digits. The outliers are those with single digits, or those that are much larger than double digits. Mind you, this is totally subjective, so someone else may tell you that the single digits are part of your data too. The missing data are the gaps between consecutive commas. If you try copying and pasting this code directly into MATLAB, it will give you a syntax error, because when you define numbers explicitly this way, all of them have to be there. So let's write some MATLAB code that changes the gaps to NaN (not-a-number).
To do this, first store the statement as a string. Then use regexprep so that wherever the string has a comma, a space, then another comma, a NaN is put in between. Finally, use eval to convert the string into an actual MATLAB statement:
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));
If we display this data, we get:
y =
Columns 1 through 6
1 48 81 2 10 25
Columns 7 through 12
NaN 14 18 53 41 56
Columns 13 through 18
89 0 1000 NaN 34 47
Columns 19 through 23
455 21 NaN 22 100
As such, to answer our first question, any values that are missing are denoted as NaN and those numbers that are bigger than double digits are outliers.
For the next question, we simply extract those values that are not missing, calculate the mean and median of what is not missing, and fill in those NaN values with the mean and median. For the bin boundaries, this is the same thing as using the values to the left (or right... depends on your definition, but let's use left) of the missing value and filling those in. Note that the indexing used for this assumes the first value is not missing and that no two missing values are adjacent. As such:
yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median
%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
%// Bin boundaries
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);
The mean and median of the non-missing values are:
meanY =
105.8500
medianY =
37.5000
The outputs for each of these, in addition to the original data with the missing values, look like:
format bank; %// Do this to show just the first two decimal places for compact output
format compact;
y =
Columns 1 through 5
1 48 81 2 10
Columns 6 through 10
25 NaN 14 18 53
Columns 11 through 15
41 56 89 0 1000
Columns 16 through 20
NaN 34 47 455 21
Columns 21 through 23
NaN 22 100
yMean =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 105.85 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
105.85 34.00 47.00 455.00 21.00
Columns 21 through 23
105.85 22.00 100.00
yMedian =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 37.50 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
37.50 34.00 47.00 455.00 21.00
Columns 21 through 23
37.50 22.00 100.00
yBinBound =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 25.00 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
1000.00 34.00 47.00 455.00 21.00
Columns 21 through 23
21.00 22.00 100.00
If you take a look at each of the output values, this fills in our data with the mean, median and also the bin boundaries as per the question.
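The assignment also asks for a mode-based fill, which the code above does not cover. Here is a minimal sketch, starting from the same vector y with NaN for the missing entries. Two things to keep in mind: when every value is unique, MATLAB's mode returns the smallest value, so for this data the mode is 0; and the loop is an alternative left-neighbour fill (the name yLeft is hypothetical) that also copes with adjacent gaps:

```matlab
y = [1 48 81 2 10 25 NaN 14 18 53 41 56 89 0 1000 NaN 34 47 455 21 NaN 22 100];
yMissing = isnan(y);

% Mode of the non-missing values (smallest value when all are unique)
modeY = mode(y(~yMissing));
yMode = y;
yMode(yMissing) = modeY;

% Left-neighbour fill done sequentially, so consecutive gaps are filled too;
% a gap in the very first position would be left as NaN
yLeft = y;
for k = 2:numel(yLeft)
    if isnan(yLeft(k))
        yLeft(k) = yLeft(k - 1);
    end
end
```

Since the mode carries little information for data without repeated values, the median fill is probably the more meaningful choice for this particular vector.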