How to get MATLAB to read the correct number of XML nodes

I'm reading a simple xml file using matlab's xmlread internal function.
<root>
<ref>
<requestor>John Doe</requestor>
<project>X</project>
</ref>
</root>
But when I call getChildNodes() on the ref element, MATLAB reports that it has 5 children.
It works fine if I put all the XML on one line: MATLAB then reports that the ref element has 2 children.
It doesn't seem to like the whitespace between elements.
Even if I run Canonicalize in the oXygen XML editor, I still get the same result, because Canonicalize leaves the whitespace in place.
MATLAB uses Java and Xerces for its XML handling.
Question:
What can I do so that I can keep my xml file in human readable format (not all in one line) but still have matlab correctly parse it?
Code Update:
filename='example01.xml';
docNode = xmlread(filename);
rootNode = docNode.getDocumentElement;
entries = rootNode.getChildNodes;
nEnt = entries.getLength

The XML parser creates #text nodes behind the scenes for all whitespace between elements. Wherever there is a newline or indentation, it creates a #text node whose data is the newline plus the indentation spaces that follow it. So for the XML example you provided, parsing the child nodes of the "ref" element returns 5 nodes:
Node 1: #text with newline and indentation spaces
Node 2: "requestor" node which in turn has a #text child with "John Doe" in the data portion
Node 3: #text with newline and indentation spaces
Node 4: "project" node which in turn has a #text child with "X" in the data portion
Node 5: #text with newline and indentation spaces
This function removes all of these unwanted #text nodes for you. Note that if you intentionally have an XML element that contains nothing but whitespace, this function will remove that text too, but for the vast majority of XML files it should work just fine.
function removeIndentNodes( childNodes )
    numNodes = childNodes.getLength;
    remList = [];
    for i = numNodes:-1:1
        theChild = childNodes.item(i-1);
        if (theChild.hasChildNodes)
            removeIndentNodes(theChild.getChildNodes);
        else
            if ( theChild.getNodeType == theChild.TEXT_NODE && ...
                    ~isempty(char(theChild.getData())) && ...
                    all(isspace(char(theChild.getData()))))
                remList(end+1) = i-1; % java indexing
            end
        end
    end
    for i = 1:length(remList)
        childNodes.removeChild(childNodes.item(remList(i)));
    end
end
Call it like this
tree = xmlread( xmlfile );
removeIndentNodes( tree.getChildNodes );
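If you only need the element children (or their count), another option is to leave the tree untouched and skip the non-element nodes while iterating. The following is a minimal sketch of that idea, not part of the answer above; it writes the question's XML to a temporary file whose name is made up for the demo, and the variable names are assumptions.

```matlab
% Write the question's XML to a scratch file (hypothetical temp file name).
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'wt');
fprintf(fid, '<root>\n  <ref>\n    <requestor>John Doe</requestor>\n    <project>X</project>\n  </ref>\n</root>\n');
fclose(fid);

docNode  = xmlread(xmlFile);
refNode  = docNode.getElementsByTagName('ref').item(0);
children = refNode.getChildNodes;      % includes the whitespace #text nodes

nElements = 0;
for k = 0:children.getLength-1         % Java NodeList is zero-based
    node = children.item(k);
    if node.getNodeType == node.ELEMENT_NODE
        nElements = nElements + 1;     % count only real elements
    end
end
delete(xmlFile);
```

Here children.getLength still counts the whitespace nodes, while nElements counts only requestor and project.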

I felt that @cholland's answer was good, but I didn't like the extra XML work. So here is a solution that strips the whitespace, which is the root cause of the unwanted nodes, from a copy of the XML file.
fid = fopen('tmpCopy.xml','wt');
str = regexprep(fileread(filename),'[\n\r]+',' ');
str = regexprep(str,'>[\s]*<','><');
fprintf(fid,'%s', str);
fclose(fid);
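To see what the two regexprep calls do, the sketch below applies them to the question's XML held in a string instead of a file. The strtrim call is an addition for the demo, and the variable raw is an assumption.

```matlab
% The question's XML as a char array, with newlines and indentation.
raw = sprintf('<root>\n  <ref>\n    <requestor>John Doe</requestor>\n    <project>X</project>\n  </ref>\n</root>\n');

str = regexprep(raw, '[\n\r]+', ' ');   % line breaks become single spaces
str = regexprep(str, '>\s*<', '><');    % drop whitespace between tags
str = strtrim(str);                     % remove the trailing space
```

After writing the result to tmpCopy.xml as above, you would call xmlread('tmpCopy.xml') instead of reading the original file. Be aware that the first replacement also changes line breaks inside text content, so this is only safe when element text does not span lines.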


Having trouble conditionally moving files based on their names

I am trying to write a script that will auto sort files based on the 7th and 8th digit in their name. I get the following error: "Argument must be a string scalar or character vector". Error is coming from line 16:
Argument must be a string scalar or character vector.
Error in sort_files (line 16)
movefile (filelist(i), DirOut)
Here's the code:
DirIn = 'C:\Folder\Experiment' %set incoming directory
DirOut = 'C:\Folder\Experiment\1'
eval(['filelist=dir(''' DirIn '/*.wav'')']) %get file list
for i = 1:length(filelist);
Filename = filelist(i).name
name = strsplit(Filename, '_');
newStr = extractBetween(name,7,8);
if strcmp(newStr,'01')
movefile (filelist(i), DirOut)
end
end
Also, I am trying to make the file folder conditional so that if the 10-11 digits are 02 the file goes to DirOut/02 etc.
First, try to avoid the eval function: it is slow and makes code hard to understand, especially when it is used to create variables. Instead do this:
filelist = dir(fullfile(DirIn,'*.wav'));
Second, the line:
name = strsplit(Filename, '_');
makes name a cell array, so name{1} or possibly name{2} are char arrays, but name itself is not. Passing name to extractBetween therefore works on the pieces of the split name, not on the file name as a whole. If what you want is the 7th and 8th characters of the file name, you can simply index the char array:
newStr = Filename(7:8);
This works because Filename is a string, which in MATLAB is a char array.
EDIT:
Since it has now been clarified that the error occurs on movefile (filelist(i), DirOut), the likely cause is that filelist(i) is a struct, whereas a file name (char array) is expected as input. The fix is to replace this line with:
movefile(fullfile(filelist(i).folder, filelist(i).name), DirOut)
Now, if you want to number the output folders too, you can do this:
movefile(fullfile(filelist(i).folder, filelist(i).name), [DirOut,filesep,Filename(7:8)])
This will move a file to DirOut/01. If you wanted DirOut/1, you could do this:
movefile(fullfile(filelist(i).folder, filelist(i).name), [DirOut,filesep,int2str(str2double(Filename(7:8)))])
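Putting the pieces together, a minimal sketch of the whole loop might look like the following. It assumes the code really is the 7th and 8th characters of the file name; the scratch folder and the two file names are made up for the demo, and the target folder is created first so that movefile receives an existing folder as its destination.

```matlab
% Demo setup: a scratch folder with two empty .wav files (names are made up).
DirIn = tempname;
mkdir(DirIn);
fclose(fopen(fullfile(DirIn, 'abcdef01_x.wav'), 'w'));
fclose(fopen(fullfile(DirIn, 'abcdef02_y.wav'), 'w'));

filelist = dir(fullfile(DirIn, '*.wav'));    % no eval needed
for i = 1:numel(filelist)
    Filename = filelist(i).name;             % char array
    code     = Filename(7:8);                % 7th and 8th characters
    target   = fullfile(DirIn, code);        % e.g. <DirIn>/01
    if ~exist(target, 'dir')
        mkdir(target);                       % make sure the folder exists
    end
    movefile(fullfile(DirIn, Filename), target);
end
```

File names shorter than 8 characters would make Filename(7:8) fail, so a real script may want to check numel(Filename) first.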

matlab text read and write %s character (without escaping)

Dear All (with many thanks in advance),
The following script has trouble reading (and therefore writing) the %s character in the file 'master.py'.
I get that matlab thinks the %s is an escape character, so perhaps an option is to modify the terminator, but I have found this difficult.
(EDIT: Forgot to mention the file master.py is not in my control, so I can't modify the file to %%s for example).
%matlab script
%===============
fileID = fopen('script.py','w');
yMax=5;
fprintf(fileID,'yOverallDim = %d\n', -1*yMax);
%READ IN "master.py" for rest of script
fileID2 = fopen('master.py','r');
currentLine = fgets(fileID2);
while ischar(currentLine)
fprintf(fileID,currentLine);
currentLine = fgets(fileID2);
end
fclose(fileID);
fclose(fileID2);
The file 'master.py' looks like this (the problem is on line 6, 'setName ="Set-%s"%(i+1)'):
i=0
for yPos in range (0,yOverallDim,yVoxelSize):
    yCoordinate=yPos+(yVoxelSize/2) #
    for xPos in range (0,xOverallDim,xVoxelSize):
        xCoordinate=xPos+(xVoxelSize/2)
        setName ="Set-%s"%(i+1)
        p = mdb.models['Model-1'].parts['Part-1']
        # p = mdb.models['Model-1'].parts['Part-2']
        c = p.cells
        cells = c.findAt(((xCoordinate, yCoordinate, 10.0), ))
        region = p.Set(cells=cells, name=setName)
        p.SectionAssignment(region=region, sectionName='Section-1', offset=0.0, offsetType=MIDDLE_SURFACE, offsetField='', thicknessAssignment=FROM_SECTION)
        i+=1
In the documentation of fprintf you'll find this:
fprintf(fileID,formatSpec,A1,...,An) applies the formatSpec to all elements of arrays A1,...An in column order, and writes the data to a text file.
So in your script, fprintf uses currentLine as the format specification, which produces unexpected output for line 6. Calling fprintf correctly, with an explicit formatSpec, fixes this issue and doesn't require any replace operations:
fprintf(fileID, '%s', currentLine);
Your script has no trouble reading the % characters correctly. The "problem" is with fprintf(). This function correctly interprets the percent signs in the string as formatting characters. Therefore, I think you have to manually escape every single % character in your currentLine string:
currentLine = strrep(currentLine, '%', '%%');
At least, it worked when I checked it on your example data.
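The two approaches above can be compared with sprintf, which formats the same way as fprintf but returns the text. This is only a sketch: line1 is the line from master.py, while line2 is a made-up line containing a backslash, used to show where the approaches differ.

```matlab
% Two sample lines; the second one is made up to show the backslash case.
line1 = 'setName ="Set-%s"%(i+1)';
line2 = 'print("a\nb %d")';

% Option 1: pass the text as data. Neither % nor \ is interpreted.
out1 = sprintf('%s', line1);
out2 = sprintf('%s', line2);

% Option 2: escape the percent signs and use the text as the format.
esc1 = sprintf(strrep(line1, '%', '%%'));   % matches line1 again
esc2 = sprintf(strrep(line2, '%', '%%'));   % \n turns into a real newline
```

With '%s' both lines come back unchanged. Escaping only the percent signs works for line1, but it still lets the format engine interpret the backslash sequence in line2, which is one reason to prefer the '%s' form for copying a file verbatim.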
Thanks applesoup for identifying my fundamental oversight - the problem is in the fprintf - not in the file read
Thanks serial for enhancing the fprintf

How to get rid of punctuation and check for spelling errors?

eliminate punctuation
split the text into words at newlines and spaces, then store them in an array
check the text file for errors with the function in checkSpelling.m
sum up the total number of errors in the article
'no suggestion' is assumed to be no error, and returns -1
sum of errors > 20: return 1
sum of errors <= 20: return -1
I would like to check a paragraph for spelling errors, but I have trouble getting rid of the punctuation. There may be other causes as well; it returns the error below:
My data2 file is:
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(@isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Replace your regexprep call with
word_punctuation=regexprep(CharData,'\W','\n');
Here \W matches every non-alphanumeric character (including spaces), and each match is replaced with a newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see, you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(@isempty,word)) = [];
Everything worked for me. However, I have to say that your checkSpelling function is very slow. At every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
UPDATE
The only problem I see is that this also removes the quote ' character (I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or, instead of \W, you can list all the non-alphanumeric characters to be removed in square brackets.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
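As a quick check of this last pattern, the sketch below runs it on a made-up sentence (the sample text is an assumption, not the asker's data) and then splits and cleans the result as described above.

```matlab
% Made-up sample text with apostrophes, punctuation and a line break.
CharData = sprintf('I''m here, don''t go.\nNew line!');

% Everything except letters, digits, apostrophe and underscore becomes \n.
word_punctuation = regexprep(CharData, '[^A-Za-z0-9''_]', '\n');

% Split at newlines and drop the empty cells left by adjacent separators.
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];
```

The apostrophes inside I'm and don't survive, while spaces, commas, periods and the exclamation mark act as word separators.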

printing a vector of string in static text box with new lines

I have a bunch of classes that I am iterating through, collecting the classes the student is failing. If the student fails, I collect the name of the class in a vector called retake.
retake =[Math History Science]
I have line breaks so when the classes print in the command window it shows as:
retake=
Math
History
Science.
However, I am trying to display retake in a static text box in GUIDE so that it looks like the above. Instead, the static text box shows:
MathHistoryScience
set(handles.text13,'String', retake) % this is what I tried
Can you please show me how to make it print:
Math
History
Science
It looks to me like you need to add line breaks.
Assuming you have a cell array with strings (rather than concatenated strings using [], which will give you a single long line), you can do it as follows:
retake = {'Math', 'History', 'Science'};
rString = '';
for ii = 1:numel(retake)-1
rString = [rString sprintf('%s\n', retake{ii})];
end
rString = [rString retake{end}];
Notice the use of '' to denote strings, {} to denote a cell array, '\n' as the end-of-line character, and [a b] to do simple string concatenation.
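The same string can also be built without a loop. This is a sketch using strjoin, which accepts escape sequences such as \n in its delimiter; the variable names follow the answer above.

```matlab
retake  = {'Math', 'History', 'Science'};   % cell array of class names
rString = strjoin(retake, '\n');            % newline between entries, none at the end
```

The result would then be passed to the text box as in the question, for example set(handles.text13, 'String', rString). Passing the cell array retake itself as the 'String' value may also display one entry per line, but treat that as something to verify in your MATLAB version.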

What do I have to add at the beginning of this loop?

How can I read the following files using a for loop? (Can the loop ignore the characters in the filenames?)
abc-1.TXT
cde-2.TXT
ser-3.TXT
wsz-4.TXT
aqz-5.TXT
iop-6.TXT
What do I have to add at the beginning of this loop ??
for i = 1:1:6
nom_fichier = strcat(['MyFile\.......' num2str(i) '.TXT']);
You can avoid constructing the filenames by using the DIR command. For instance:
myfiles = dir('*.txt');
for i = 1:length(myfiles)
nom_fichier = myfiles(i).name;
...do processing here...
end
First of all, why would you use strcat here? The expression below is, by itself, a single string: all concatenation has already been done by the brackets [].
['MyFile\.......' num2str(i) '.TXT']
Next, I'm not certain what your question is. Is it how to load the data? If the files are simply delimited numbers, with the same number of them on each line, then load will suffice, or perhaps you may need textread.
My guess is that you do not know how to build the main part of the file name. You might do it this way:
Names = {'abc' 'cde' 'ser' 'wsz' 'aqz' 'iop'};
for i = 1:6
fn = ['MyFile',filesep,Names{i},'-',num2str(i),'.TXT'];
data = load(fn);
% do other stuff ...
end
If you don't want to create a variable with the names by typing them in, then use dir, perhaps like this to create a list of text file names:
Names = dir('MyFile\*.TXT');
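If the files need to be processed in the order of the number in their names, the list returned by dir can be reordered. The sketch below assumes every name ends in -<number>.TXT, as in the question; the cell array names stands in for the {Names.name} list that dir would give, so the example runs without any files on disk.

```matlab
% Stand-in for {Names.name} from dir, deliberately out of order.
names = {'wsz-4.TXT', 'abc-1.TXT', 'iop-6.TXT', 'cde-2.TXT', 'aqz-5.TXT', 'ser-3.TXT'};

% Pull out the digits between the dash and the extension.
tok = regexp(names, '-(\d+)\.TXT$', 'tokens', 'once');
idx = cellfun(@(t) str2double(t{1}), tok);   % numeric index per file

[~, order]  = sort(idx);
sortedNames = names(order);                  % abc-1.TXT first, iop-6.TXT last
```

Looping over sortedNames then visits the files as 1 through 6 regardless of the letters in front of the dash.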