Extracting data from Excel to MATLAB

Suppose I have an Excel file (data.xlsx) that contains the following data:
Name age
Tom 43
Dick 24
Harry 32
Now I want to extract the data from it into two cell arrays (or matrices) containing
name = {'Tom'; 'Dick'; 'Harry'}
age = [43; 24; 32]
I have used xlsread('data.xlsx'), but it only extracts the numerical values. I want to obtain both, as shown above. Please help me out.

You have to use the additional output arguments of xlsread in order to get the text.
I created a dummy Excel file with your data, and here is the output (never mind the NaNs):
[ndata, text, alldata] = xlsread('DummyExcel.xlsx')
ndata =
43
24
32
text =
'Name' 'Age'
'Tom' ''
'Dick' ''
'Harry' ''
alldata =
[NaN] 'Name' 'Age'
[NaN] 'Tom' [ 43]
[NaN] 'Dick' [ 24]
[NaN] 'Harry' [ 32]
Now if you use this:
text{2:end,1}
you get
ans =
Tom
ans =
Dick
ans =
Harry
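To end up with the two variables from the question, the header row has to be skipped and the numeric column converted. With the outputs shown above, that would be name = text(2:end,1) and age = ndata. The sketch below shows the same idea on the raw cell array; the variable raw is typed in by hand (an assumption, standing in for the third output of xlsread) so that it runs without an Excel file.

```matlab
% Stand-in for the third output of xlsread (the raw cell array),
% typed by hand here so the sketch runs without an Excel file.
raw = {'Name', 'Age'; 'Tom', 43; 'Dick', 24; 'Harry', 32};

name = raw(2:end, 1);             % 3x1 cell array of character vectors
age  = cell2mat(raw(2:end, 2));   % 3x1 numeric column vector
```

The names are kept in a cell array because they have different lengths and cannot be stacked into a plain char matrix.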

You can use the function called importdata.
Example:
%Import Data
filename = 'yourfilename.xlsx';
delimiterIn = ' ';
headerlinesIn = 1;
A = importdata(filename,delimiterIn,headerlinesIn);
This reads both the text data and the numerical data. The text is stored in A.textdata and the numerical data in A.data.
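In newer MATLAB releases, readtable is another option that keeps the names and ages together in one table. The following is only a sketch: it first writes a small temporary workbook (the file name and the variables T, T2 and fname are made up for the illustration) so that the example does not depend on an existing file.

```matlab
% Build a small table and write it to a temporary .xlsx file
% (hypothetical data matching the question).
T = table(["Tom"; "Dick"; "Harry"], [43; 24; 32], ...
    'VariableNames', {'Name', 'Age'});
fname = [tempname '.xlsx'];
writetable(T, fname);

% Read it back: the header row becomes the variable names.
T2 = readtable(fname);
name = T2.Name;   % text column
age  = T2.Age;    % numeric column

delete(fname);    % remove the temporary file
```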

Related

Adding a datapoint to datastruct in matlab

I am trying to add a datapoint to an existing data struct. I have created the following data struct.
ourdata.animal= {'wolf', 'dog', 'cat'}
ourdata.height = [110 51 32]
ourdata.weight = [55 22 10]
Say I want to add another entry to the struct, with name 'fish', height 3 and weight 1. How do I go about this?
You can simply attach it to the end of the structure:
ourdata.animal{end+1} = 'fish'
ourdata.height(end+1) = 3
ourdata.weight(end+1) = 1
If you want to work with multiple structures, you can write a little function to combine the values of fields in multiple structs. Here's one, using fieldnames() to discover what fields exist:
function out = slapItOn(aStruct, anotherStruct)
% Slap more data on to the end of fields of a struct
out = aStruct;
for fld = string(fieldnames(aStruct))'
out.(fld) = [aStruct.(fld) anotherStruct.(fld)];
end
end
Works like this:
>> ourdata
ourdata =
struct with fields:
animal: {'wolf' 'dog' 'cat'}
height: [110 51 32]
weight: [55 22 10]
>> newdata = slapItOn(ourdata, struct('animal',{{'bobcat'}}, 'height',420, 'weight',69))
newdata =
struct with fields:
animal: {'wolf' 'dog' 'cat' 'bobcat'}
height: [110 51 32 420]
weight: [55 22 10 69]
>>
BTW, I'd suggest that you use string arrays instead of cellstrs for storing your string data. They're better in pretty much every way (except performance). Get them with double quotes:
>> strs = ["wolf" "dog" "cat"]
strs =
1×3 string array
"wolf" "dog" "cat"
>>
Also, consider using a table array instead of a struct array for tabular-looking data like this. Tables are nice!
>> animal = ["wolf" "dog" "cat"]';
>> height = [110 51 32]';
>> weight = [55 22 10]';
>> t = table(animal, height, weight)
t =
3×3 table
animal height weight
______ ______ ______
"wolf" 110 55
"dog" 51 22
"cat" 32 10
>>
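With a table, the 'fish' entry from the question can be appended as a one-row table that uses the same variable names. This is a minimal sketch; the name newRow is made up for the example, and the columns are passed to table directly so that no variable called height shadows the built-in height function.

```matlab
% Build the table from the example above.
animal = ["wolf"; "dog"; "cat"];
t = table(animal, [110; 51; 32], [55; 22; 10], ...
    'VariableNames', {'animal', 'height', 'weight'});

% Append one row: a one-row table with matching variable names.
newRow = table("fish", 3, 1, 'VariableNames', t.Properties.VariableNames);
t = [t; newRow];
```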

What's the best way to build a dictionary (word count) for NLP in MATLAB?

I have a frequency count dictionary, and I want to be able to look up the frequency count of a given word in it.
For example,
if my input word is 'about', the output should be the count of 'about' in my dictionary, which is 139, so that I can calculate the probability.
139 about
133 according
163 accusing
244 actually
567 afternoon
175 again
156 ah
167 a-ha
165 ahh
I tried to do this with fopen, but I am not getting the result I want.
fid = fopen('dictionary.txt');
words = textscan(fid, '%s');
fclose(fid);
words = words{1};
I tried this as well, but I get a different result:
countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, words);
tally = [words num2cell(count)];
sortrows(tally, 2);
The problem is that you're running countfunction for each instance of each word in the dictionary, rather than each unique word in the dictionary.
Here's how to incrementally improve your code:
words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
unique_words = unique(words(:));
countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, unique_words);
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
'a' [1]
'the' [2]
'hi' [3]
However, I'd recommend using grpstats instead:
words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
[unique_words, count] = grpstats(ones(size(words)), words(:), {'gname', 'numel'});
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
'a' [1]
'the' [2]
'hi' [3]
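If the file already holds one "count word" pair per line, as in the question, the counts do not need to be recomputed; they can be read and looked up directly. The sketch below is one possible way, assuming that layout. It parses the sample lines from a character vector instead of dictionary.txt so that it is self-contained, and the names txt, dict, n and p are made up for the example.

```matlab
% Sample lines in the "count word" layout from the question.
txt = sprintf(['139 about\n133 according\n163 accusing\n244 actually\n' ...
               '567 afternoon\n175 again\n156 ah\n167 a-ha\n165 ahh\n']);

% Read a number and a word from each line.
C = textscan(txt, '%f %s');
counts = C{1};   % numeric column vector
words  = C{2};   % cell array of character vectors

% Map each word to its count for direct lookup.
dict = containers.Map(words, counts);

n = dict('about');      % count for the input word
p = n / sum(counts);    % relative frequency within this list
```

For a real file, the first argument of textscan would be the file identifier returned by fopen instead of txt.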

Compute the Frequency of bigrams in Matlab

I am trying to compute and plot the distribution of bigram frequencies.
First I generated all possible bigrams, which gives 1296 bigrams.
Then I extract the bigrams from a given file and save them in words1.
My question is: how do I compute the frequency of these 1296 bigrams for the file a.txt?
If some bigrams do not appear in the file at all, their frequencies should be zero.
a.txt is any text file.
clear
clc
%************create bigrams 1296 ***************************************
chars ='1234567890abcdefghijklmonpqrstuvwxyz';
chars1 ='1234567890abcdefghijklmonpqrstuvwxyz';
bigram='';
for i=1:36
for j=1:36
bigram = sprintf('%s%s%s',bigram,chars(i),chars1(j));
end
end
temp1 = regexp(bigram, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp1(1:end-1)', temp1(2:end)','un',0);
bigrams = temp2;
bigrams = unique(bigrams);
bigrams = rot90(bigrams);
bigram = char(bigrams(1:end));
all_bigrams_len = length(bigrams);
clear temp temp1 temp2 i j chars1 chars;
%****** 1. Cleaning Data ******************************
collection = fileread('e:\a.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W','');
collection = strtrim(regexprep(collection,'\s*',''));
%*******************************************************
temp = regexp(collection, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp(1:end-1)', temp(2:end)','un',0);
words1 = rot90(temp2);
%*******************************************************
words1_len = length(words1);
vocab1 = unique(words1);
vocab_len1 = length(vocab1);
[vocab1,void1,index1] = unique(words1);
frequencies1 = hist(index1,vocab_len1);
I. Character counting problem for a string
bsxfun based solution for counting characters -
counts = sum(bsxfun(@eq,[string1-0]',65:90))
Output -
counts =
2 0 0 0 0 2 0 1 0 0 ....
If you would like to get a tabulate output of counts against each letter -
out = [cellstr(['A':'Z']') num2cell(counts)']
Output -
out =
'A' [2]
'B' [0]
'C' [0]
'D' [0]
'E' [0]
'F' [2]
'G' [0]
'H' [1]
'I' [0]
....
Please note that this was case-sensitive counting of upper-case letters.
For lower-case letter counting, use this edit of the earlier code -
counts = sum(bsxfun(@eq,[string1-0]',97:122))
For case-insensitive counting, use this -
counts = sum(bsxfun(@eq,[upper(string1)-0]',65:90))
II. Bigram counting case
Let us suppose that you have all the possible bigrams saved in a 1D cell array bigrams1 and the incoming bigrams from the file are saved into another cell array words1. Let us also assume certain values in them for demonstration -
bigrams1 = {
'ar';
'de';
'c3';
'd1';
'ry';
't1';
'p1'}
words1 = {
'de';
'c3';
'd1';
'r9';
'yy';
'de';
'ry';
'de';
'dd';
'd1'}
Now, you can get the counts of the bigrams from words1 that are present in bigrams1 with this code -
[~,~,ind] = unique(vertcat(bigrams1,words1));
bigrams_lb = ind(1:numel(bigrams1)); %// label bigrams1
words1_lb = ind(numel(bigrams1)+1:end); %// label words1
counts = sum(bsxfun(@eq,bigrams_lb,words1_lb'),2)
out = [bigrams1 num2cell(counts)]
The output on code run is -
out =
'ar' [0]
'de' [3]
'c3' [1]
'd1' [2]
'ry' [1]
't1' [0]
'p1' [0]
The result shows that the first element ar from the list of all possible bigrams does not occur in words1, the second element de occurs three times in words1, and so on.
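The same counts can also be obtained without building the full comparison matrix. The following is an alternative sketch, not the method used above: it looks up each incoming bigram with ismember and tallies the hits with accumarray. The names found and loc are made up for the example.

```matlab
bigrams1 = {'ar'; 'de'; 'c3'; 'd1'; 'ry'; 't1'; 'p1'};
words1   = {'de'; 'c3'; 'd1'; 'r9'; 'yy'; 'de'; 'ry'; 'de'; 'dd'; 'd1'};

% loc(k) is the position of words1{k} in bigrams1, or 0 if it is absent.
[found, loc] = ismember(words1, bigrams1);

% Count how often each position occurs; unseen bigrams stay at zero.
counts = accumarray(loc(found), 1, [numel(bigrams1) 1]);

out = [bigrams1 num2cell(counts)];
```

Bigrams in words1 that are not in bigrams1 (such as 'r9' or 'yy') are dropped by the found mask.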
Hey, similar to Dennis's solution, you can just use histc():
string1 = 'ASHRAFF'
histc(string1,'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
This counts the number of entries in the bins defined by the string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', which is hopefully the whole alphabet (I wrote it quickly, so no guarantee). The result is:
Columns 1 through 21
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
Columns 22 through 26
0 0 0 0 0
Just a little modification of my solution:
string1 = 'ASHRAFF'
alphabet1='A':'Z'; %%// as stated by Oleg Komarov
data=histc(string1,alphabet1);
results=cell(2,26);
for k=1:26
results{1,k}= alphabet1(k);
results{2,k}= data(k);
end
If you look at results now, you can easily check whether it works or not :D
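In newer MATLAB releases histc is no longer recommended, and histcounts is the suggested replacement. A rough equivalent of the letter count is sketched below. It assumes the input contains only upper-case letters, and the bin edges are placed halfway between character codes so that each letter gets its own bin.

```matlab
string1 = 'ASHRAFF';

% 27 edges from 64.5 to 90.5 give 26 bins, one per letter 'A'..'Z'.
edges = 64.5:1:90.5;
nr = histcounts(double(string1), edges);

% Pair each letter with its count, as in the results cell above.
results = [num2cell('A':'Z'); num2cell(nr)];
```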
This answer creates all bigrams, loads the file, does a little cleanup, and then uses a combination of unique and histc to count the rows.
Generate all Bigrams
Note that the order here is important: unique will sort the array, so the set is created presorted and the output matches expectations.
[y,x] = ndgrid(['0':'9','a':'z']);
allBigrams = [x(:),y(:)];
Read The File
this removes capitalisation and just pulls out any 0-9 or a-z character then creates a column vector of these
fileText = lower(fileread('d:\loremipsum.txt'));
cleanText = regexp(fileText,'([a-z0-9])','tokens');
cleanText = cell2mat(vertcat(cleanText{:}));
create bigrams from file by shifting by one and concatenating
fileBigrams = [cleanText(1:end-1),cleanText(2:end)];
Get Counts
The set of all bigrams is appended to the bigrams from the file, so that a value exists for every possible bigram. Then a value ∈{1,2,...,1296} is assigned to each unique row using unique's 3rd output. Counts are then created with histc, with the bins equal to the set of values from unique's output. 1 is subtracted from each bin to remove the complete set of bigrams we added.
[~,~,c] = unique([fileBigrams;allBigrams],'rows');
counts = histc(c,1:1296)-1;
Display
to view counts against text
[allBigrams, counts+'0']
or for something potentially more useful...
[sortedCounts,sortInd] = sort(counts,'descend');
[allBigrams(sortInd,:), sortedCounts+'0']
ans =
or9
at8
re8
in7
ol7
te7
do6 ...
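A related way to get all 1296 counts at once is to map every character to its position in the alphabet and let accumarray fill a 36-by-36 matrix. This is only a sketch under the assumption that the text has already been cleaned to the characters 0-9 and a-z, as above; the short string txt and the name idx are made up for the example.

```matlab
alphabet = ['0':'9' 'a':'z'];   % 36 symbols
txt = 'abab01';                 % stand-in for the cleaned file text

% Position of every character in the alphabet (1..36).
[~, idx] = ismember(txt, alphabet);

% counts(i, j) is the number of times alphabet(i) is followed by alphabet(j).
counts = accumarray([idx(1:end-1)' idx(2:end)'], 1, [36 36]);
```

Bigrams that never occur keep a count of zero, so the matrix always covers all 36*36 combinations.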
Did not look into the entire code fragment, but from the example at the top of your question, I think you are looking to make a histogram:
string1 = 'ASHRAFF'
nr = histc(string1,'A':'Z')
Will give you:
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
(I got a working solution with hist, but as @The Minion shows, histc is easier to use here.)
Note that this solution only deals with upper case letters.
You may want to do something like so if you want to put lower case letters in their correct bin:
string1 = 'ASHRAFF'
nr = histc(upper(string1),'A':'Z')
Or if you want them to be counted separately (no upper here, and 'A':'Z' comes first because the bin edges must be increasing):
string1 = 'ASHRaFf'
nr = histc(string1,['A':'Z' 'a':'z'])
bi_freq1 = zeros(1,all_bigrams_len);
for k=1: vocab_len1
for i=1:all_bigrams_len
if char(vocab1(k)) == char(bigrams(i))
bi_freq1(i) = frequencies1(k);
end
end
end

MATLAB: textscan to parse irregular text, trouble debugging format specifier

I've been browsing Stack Overflow and the MathWorks website trying to come up with a solution for reading an irregularly formatted text file into MATLAB using textscan, but have yet to figure out a good one.
The format of the text file looks as such:
// Reference=MNI
// Citation=Beauregard M, 1998
// Condition=Primed - Unprimed Semantic Category Decision
// Domain=Semantics
// Modality=Visual
// Subjects=13
-55 -25 -23
33 -9 -20
// Citation=Beauregard M, 1998
// Condition=Unprimed Semantic Category Decision - Baseline
// Domain=Semantics
// Modality=Visual
// Subjects=13
0 -73 9
-25 -59 47
0 -14 59
8 -18 63
-21 -90 -11
-24 -4 62
24 -93 -6
-21 15 47
-35 -26 -21
9 13 44
// Citation=Binder J R, 1996
// Condition=Words > Tones - Passive
// Domain=Language Perception
// Modality=Auditory
// Subjects=12
-58.73 -12.05 -4.61
I would like to end up with a cell array that looks like this
{nx3 double} {nx1 cellstr} {nx1 cellstr} {nx1 cellstr} {nx1 double}
Here the first element holds the 3D coordinates, the second the citation, the third the condition, the fourth the domain, the fifth the modality and the sixth the number of subjects.
I would then like to use this cell array to organize the data into a structure that allows easy indexing of the coordinates by each of the features I extracted from the text file.
I've tried a bunch of things, but have only been able to extract the coordinates as strings and the features as a single cell array.
Here is how far I have gotten after searching through stack overflow and the mathworks website:
fid = fopen(fullfile(path2proj,path2loc),'r');
data = textscan(fid,'%s %s %s','HeaderLines',1,...
'delimiter',{...
sprintf('// '),...
'Citation=',...
'Condition=',...
'Domain=',...
'Modality=',...
'Subjects='});
I get the following output with this code:
data =
{16470x1 cell} {16470x1 cell} {16470x1 cell}
data{1}(1:20)
ans =
''
''
''
''
''
'-55 -25 -23'
'33 -9 -20'
''
''
''
''
''
'0 -73 9'
'-25 -59 47'
'0 -14 59'
'8 -18 63'
'-21 -90 -11'
'-24 -4 62'
'24 -93 -6'
'-21 15 47'
data{2}(1:20)
ans =
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
data{3}(1:20)
ans =
'Beauregard M, 1998'
'Primed - Unprimed Semantic Category Decision'
'Semantics'
'Visual'
'13'
''
''
'Beauregard M, 1998'
'Unprimed Semantic Category Decision - Baseline'
'Semantics'
'Visual'
'13'
''
''
''
''
''
''
''
''
Although I can work with the data in this format, it would be nice to understand how to correctly write a format specifier that extracts each piece of data into its own cell array. Does anyone have any ideas?
Assuming that Reference appears only in the first line, you could do the following to obtain the values you want from each Citation section.
% read the file and split it into sections based on Citation
filecontents = strsplit(fileread('data.txt'), '// Citation=');
% iterate through the sections and extract the desired info from each.
% We start from i=2, as for i=1 we have the 'Reference' line.
for i = 2:numel(filecontents)
    % split into lines and trim whitespace (including any trailing \r)
    lines = strtrim(regexp(filecontents{i}, '\n', 'split'));
    % remove empty lines
    lines(strcmp(lines, '')) = [];
    % get values of the fields
    citation = lines{1};
    condition = get_value(lines{2}, 'Condition');
    domain = get_value(lines{3}, 'Domain');
    modality = get_value(lines{4}, 'Modality');
    subjects = get_value(lines{5}, 'Subjects');
    coordinates = cellfun(@str2num, lines(6:end), 'UniformOutput', 0)';
    % now you can save in some global cell,
    % display or process the extracted values as you please.
end
where get_value is:
function value = get_value(line, search_for)
% return the text after 'search_for=' as a character vector
tokens = regexp(line, [search_for, '=(.+)'], 'tokens', 'once');
value = tokens{1};
Hope this helps.
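To index the coordinates by feature, as asked in the question, the values from each section can be collected in a struct array. The following is a rough sketch along the same lines as the loop above. It parses an abbreviated copy of the sample text from a character vector so that it runs without data.txt, and the names studies, field and coords are made up for the example.

```matlab
% Abbreviated copy of the sample file from the question.
txt = sprintf(['// Reference=MNI\n' ...
    '// Citation=Beauregard M, 1998\n' ...
    '// Condition=Primed - Unprimed Semantic Category Decision\n' ...
    '// Domain=Semantics\n' ...
    '// Modality=Visual\n' ...
    '// Subjects=13\n' ...
    '-55 -25 -23\n' ...
    '33 -9 -20\n' ...
    '// Citation=Binder J R, 1996\n' ...
    '// Condition=Words > Tones - Passive\n' ...
    '// Domain=Language Perception\n' ...
    '// Modality=Auditory\n' ...
    '// Subjects=12\n' ...
    '-58.73 -12.05 -4.61\n']);

% Helper: text after 'name=' up to the end of that line.
field = @(sec, name) char(regexp(sec, [name '=([^\r\n]*)'], 'tokens', 'once'));

sections = strsplit(txt, '// Citation=');
studies = struct('citation', {}, 'condition', {}, 'domain', {}, ...
                 'modality', {}, 'subjects', {}, 'coordinates', {});

for k = 2:numel(sections)          % section 1 is the Reference line
    sec = sections{k};
    lines = strtrim(strsplit(sec, newline));
    lines = lines(~cellfun(@isempty, lines));

    % Coordinate rows: everything after the citation that is not metadata.
    coordLines = lines(2:end);
    coordLines = coordLines(~strncmp(coordLines, '//', 2));
    coords = cell2mat(cellfun(@(l) sscanf(l, '%f')', coordLines(:), ...
                              'UniformOutput', false));

    studies(end+1) = struct( ...
        'citation',    lines{1}, ...
        'condition',   field(sec, 'Condition'), ...
        'domain',      field(sec, 'Domain'), ...
        'modality',    field(sec, 'Modality'), ...
        'subjects',    str2double(field(sec, 'Subjects')), ...
        'coordinates', coords); %#ok<SAGROW>
end

% Example query: all studies with visual modality.
visual = studies(strcmp({studies.modality}, 'Visual'));
```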

how can I import multiple csv files with selected columns using textscan?

I have a large number of csv files to process. I only want selected columns from each file, and I want to load all the files from a certain folder and output them as one combined file. Here is my code, which runs with errors. Could anyone help me solve this problem?
data_directory = 'C:\Users\...\data';
numfiles = 17;
for n = 1:numfiles
filepath = [data_directory,'data_', num2str(n),'_output.csv'];
fid = fopen (filepath, 'rt');
wanted_columns= [2 3 4 5 10 11 12 13 14 15 16 17 35 36 41 42 44 45 59 61];
format = [];
columns = 109;
for i = 1 : columns;
if any (i == wanted_columns)
format = [format '%s'];
else
format = [format '%*s'];
end
end
data = textscan(fid, format, 'Delimiter',',','HeaderLines',1);
fclose(fid);
end
I think you should check whether the file is opened correctly. The error message seems to indicate that this is not the case. If it is not, check if the filepath is correct.
fid = fopen (filepath, 'rt');
if fid == -1
error('Failed to open file');
end
If the error is thrown here, you know that there was a problem with 'fopen'.
Of course I don't know which files are on your computer, but I assume the '...' in the filename is not in your actual MATLAB file, only in your question on SO.
But could it be that you repeat the word 'data', while the actual filename contains 'data' only once? Your code now results in filenames like 'C:\Users\...\datadata_1_output.csv'. Maybe 'data' should be removed from data_directory or from filepath = ...?
Here is another way to set up the format string in a vectorized manner:
fcell = repmat({'%*s '},1,n_columns);
fcell(wanted_columns) = {'%s '};
formatstr = [fcell{:}];
Note that format is a built-in function in MATLAB, so it is better not to use it as a variable name.
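Putting these suggestions together, a loop over the files could look roughly like the sketch below. It is an illustration only: it first creates two tiny csv files in a temporary folder (hypothetical data with 4 columns, of which columns 2 and 4 are wanted) so that it can run anywhere, builds each path with fullfile to avoid the missing file separator, and stacks the selected columns in a cell array called combined.

```matlab
% Create two small sample files in a temporary folder (hypothetical data).
dataDir = tempname;
mkdir(dataDir);
numFiles = 2;
for n = 1:numFiles
    fid = fopen(fullfile(dataDir, sprintf('data_%d_output.csv', n)), 'wt');
    fprintf(fid, 'a,b,c,d\n');                        % header line
    fprintf(fid, '%d,x%d,y%d,z%d\n', n, n, n, n);     % one data row
    fclose(fid);
end

% Build the format string: keep wanted columns, skip the rest.
nColumns = 4;
wantedColumns = [2 4];
fcell = repmat({'%*s'}, 1, nColumns);
fcell(wantedColumns) = {'%s'};
formatSpec = strjoin(fcell, ' ');

% Read every file and stack the selected columns.
combined = cell(0, numel(wantedColumns));
for n = 1:numFiles
    filepath = fullfile(dataDir, sprintf('data_%d_output.csv', n));
    fid = fopen(filepath, 'rt');
    if fid == -1
        error('Failed to open %s', filepath);
    end
    data = textscan(fid, formatSpec, 'Delimiter', ',', 'HeaderLines', 1);
    fclose(fid);
    combined = [combined; [data{:}]]; %#ok<AGROW>
end
```

The combined cell array can then be written out as one file, for example by converting it with cell2table and passing the result to writetable.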