Iterate all XML nodes and their children - dom

What would be the most efficient way in VBScript to iterate through an XML file?
I am looking for a way to iterate all nodes in the XML file. I cannot use XQL queries, because I really do need to iterate all nodes to check all attributes in the file.
PS: Basically I am writing a script to replace references to file paths. The problem is that these file paths can appear in a large number of places (but that's for me to find out). I only need help with the XML iterating part.

While I suspect that putting some intelligence and XPath expressions into the search would increase efficiency, this
Option Explicit
Dim oXDoc : Set oXDoc = CreateObject( "Msxml2.DOMDocument" )
oXDoc.async = False
oXDoc.load "..\data\31677574.xml"
If 0 = oXDoc.ParseError Then
    WScript.Echo oXDoc.documentElement.xml
    walk oXDoc.documentElement, 0
Else
    WScript.Echo oXDoc.ParseError.Reason
End If
' Print the element's tag name and attributes, then recurse into its children.
Sub walk(e, i)
    ' 1 = NODE_ELEMENT; text and comment nodes have no tagName or attributes
    If e.nodeType <> 1 Then Exit Sub
    WScript.Echo Space(i), e.tagName
    Dim a
    For Each a In e.Attributes
        WScript.Echo Space(i + 1), a.name, a.value
    Next
    Dim c
    For Each c In e.childNodes
        walk c, i + 2
    Next
End Sub
output:
cscript 31694559.vbs
<Configuration>
<Add SourcePath="\\sample" ApplicationEdition="32">
<Product ID="SampleProductID">
<Language ID="en-us"/>
<Language ID="en-us"/>
</Product>
</Add>
</Configuration>
 Configuration
   Add
    SourcePath \\sample
    ApplicationEdition 32
     Product
      ID SampleProductID
       Language
        ID en-us
       Language
        ID en-us
will visit all elements and their attributes.
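For comparison, the same recursive walk can be sketched in Python (used here only for illustration, not VBScript). This is a minimal sketch that assumes the sample XML from the output above is held inline in a string; the `walk` helper and its `out` list are names introduced for the example.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the output above (raw string keeps the backslashes).
XML = r"""<Configuration>
  <Add SourcePath="\\sample" ApplicationEdition="32">
    <Product ID="SampleProductID">
      <Language ID="en-us"/>
      <Language ID="en-us"/>
    </Product>
  </Add>
</Configuration>"""


def walk(element, depth=0, out=None):
    """Collect one line per element and per attribute, indented by depth."""
    if out is None:
        out = []
    out.append(" " * depth + element.tag)
    for name, value in element.attrib.items():
        out.append(" " * (depth + 1) + f"{name} {value}")
    for child in element:  # iterating an Element yields its child elements
        walk(child, depth + 2, out)
    return out


lines = walk(ET.fromstring(XML))
print("\n".join(lines))
```

Because every attribute passes through the loop, this is also the place where a path replacement could be applied to `value` before writing the document back.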


How to use a dynamic string variable in an XPath expression in MATLAB

I'm trying to find the children of a specific node (concept) of an XML document in MATLAB using XPath.
With the following code I get 5 children, which is correct.
expression = xpath.compile('//concept\[@name="con1"\]/\*');
Childs = expression.evaluate(xDoc, XPathConstants.NODESET);
But for my project I have to use the string values of the "name" attribute of each concept dynamically, so I stored them in a vector in order to call them one by one.
For example, ConceptName(1)="con1". However, when I execute the following code, I get zero children:
expression = xpath.compile('//concept\[@name="ConceptName(1)"\]/\*');
Childs = expression.evaluate(xDoc, XPathConstants.NODESET);
If someone can help me use string variables in the XPath expression, I would be very grateful.
Thank you in advance.
Here is what my XML document looks like. My desired output would be a list of four concepts (the direct children of the concept named "con1"), but I must extract the name of the parent concept dynamically because the structure will be unknown.
<?xml version="1.0" encoding="UTF-8"?>
<taxonomy>
<concept name="con1">
<concept name="con11">
<concept name="con1033990258">
<concept name="con271874239">
<concept name="con1657241849">
<concept name="con1448945150">
<instance name="inst686829093"/>
<instance name="inst1379512917"/>
<instance name="inst2072196703"/>
</concept>
</concept>
</concept>
</concept>
<concept name="con12"> </concept>
<concept name="con13"></concept>
<concept name="con14"></concept>
</concept>
</taxonomy>
This is my code
% get the xpath mechanism into the workspace
import javax.xml.xpath.*
factory = XPathFactory.newInstance;
xpath = factory.newXPath;
% read the XML file
filedir = 'C:\Users\Asus\Documents\Asma\MatlabCode\Contribution2\WSC2009_XML'; %location of the file
files = dir(fullfile(filedir, '*.xml'));
xDoc = xmlread(fullfile(filedir, files(1).name)); % read the XML doc but return "[#document: null]". The xmlread function returns a Java object that represents the file's Document Object Model, or DOM. The "null" is simply what the org.apache.xerces.dom.DeferredDocumentImpl's implementation of toString() dumps to the MATLAB Command Window
XDocInMatlab = xmlwrite(xDoc); % show the XML file
taxonomy = xDoc.getElementsByTagName('taxonomy'); %% get the root element
concepts = xDoc.getElementsByTagName('concept'); %% get the concept element nodes
concept_Matrix = strings(concepts.getLength,1);
for i = 0 : concepts.getLength-1
    conceptName = string(concepts.item(i).getAttribute('name'));
    concept_Matrix(i+1,1) = conceptName;
    if concepts.item(i).hasChildNodes
        expression = xpath.compile('//concept[@name=conceptName]/*');
        Childs = expression.evaluate(xDoc, XPathConstants.NODESET);
        % Iterate through the nodes that are returned.
        for j = 0:Childs.getLength-1
            ChildsName(j+1) = char(Childs.item(j).getAttribute('name'));
        end
    end
end
The expression @name="ConceptName(1)" doesn't select anything because you don't have any elements whose name attribute has the value "ConceptName(1)".
It's hard to know how to correct your code because you don't really tell us what you thought it might mean. You say you stored the attribute names "in a vector" but there's no such thing as a vector in XPath, so I really don't know what you did or what you are trying to achieve.
My guess is that you want to replace the following line
expression = xpath.compile('//concept\[@name="ConceptName(1)"\]/\*')
with something like this:
expression = xpath.compile('//concept\[@name="' + ConceptName(1) + '"\]/\*')
Note that this works only if ConceptName is a string array type (with double quotes), not a char vector (single quotes).
Note also that it is not necessary to escape square brackets and asterisks in strings:
expression = xpath.compile('//concept[@name="' + ConceptName(1) + '"]/*')
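The underlying idea, building the XPath string from the variable's value rather than from its name, is not specific to MATLAB. Here is a minimal sketch in Python using the standard-library ElementTree module, which supports a limited XPath subset. The shortened sample taxonomy (including the made-up `con111` node) and the `child_names` helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the taxonomy in the question.
XML = """<taxonomy>
  <concept name="con1">
    <concept name="con11"><concept name="con111"/></concept>
    <concept name="con12"/>
    <concept name="con13"/>
    <concept name="con14"/>
  </concept>
</taxonomy>"""


def child_names(root, concept_name):
    """Return the names of the direct children of the named concept."""
    # The value of concept_name is spliced into the expression; writing the
    # variable's name inside the quotes would search for that literal text.
    expr = f".//concept[@name='{concept_name}']/*"
    return [child.get("name") for child in root.findall(expr)]


root = ET.fromstring(XML)
for name in ["con1", "con11"]:
    print(name, "->", child_names(root, name))
```

Simple splicing like this is only safe when the names contain no quote characters.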

Finding text AND fields with variable content in Word

I need to find and delete every occurrence of the following pattern in a Word 2010 document:
RPDIS→ text {INCLUDEPICTURE c:\xxx\xxx.png" \*MERGEFORMAT} text ←RPDIS
Where:
RPDIS→ and ←RPDIS are start and end delimiters
Between the start and end delimiters there can be just text or text and fields with variable content
The * wildcard in the Word Find and Replace dialog box will find the pattern if it contains text only but it will ignore patterns where text is combined with fields. And ^19 will find the field but not the rest of the pattern until the end delimiter.
Can anyone help, please?
Here's a VBA solution. It wildcard searches for RPDIS→*←RPDIS. If the found text contains ^19 (assuming field codes visible; if objects are visible instead of field codes, then the appropriate test is text contains ^01), the found text is deleted. Note that this DOES NOT care about the type of embedded field --- it will delete ANY AND ALL embedded fields that occur between RPDIS→ and ←RPDIS, so use at your own risk. Also, the code has ChrW(8594) and ChrW(8592) to match right-arrow and left-arrow respectively. You may need to change that if your arrows are encoded differently.
Sub test()
    Dim wdDoc As Word.Document
    Dim r As Word.Range
    Dim s As String
    ' Const c As Integer = 19 ' Works when field codes are visible
    Const c As Integer = 1 ' Works when objects are visible
    Set wdDoc = ActiveDocument
    Set r = wdDoc.Content
    With r.Find
        .Text = "RPDIS" & ChrW(8594) & "*" & ChrW(8592) & "RPDIS"
        .MatchWildcards = True
        While .Execute
            s = r.Text
            If InStr(1, s, chr(c), vbTextCompare) > 0 Then
                Debug.Print "Delete: " & s
                ' r.Delete ' This line commented out for testing; remove comments to actively delete
            Else
                Debug.Print "Keep: " & s
            End If
        Wend
    End With
End Sub
Hope that helps.
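The matching logic of the macro can be tried outside Word on a plain string. The Python sketch below is only an illustration of that logic, not a replacement for the VBA. It assumes the document text has already been extracted with field codes visible, so that a field start appears as character 19. The sample text and the `strip_field_blocks` helper are invented for the example.

```python
import re

FIELD_BEGIN = chr(19)  # character that marks a field start when field codes are visible
# Non-greedy match from the start delimiter to the nearest end delimiter.
PATTERN = re.compile("RPDIS\u2192.*?\u2190RPDIS", re.DOTALL)


def strip_field_blocks(text):
    """Delete delimited blocks that contain a field; keep text-only blocks."""
    def replace(match):
        block = match.group(0)
        return "" if FIELD_BEGIN in block else block
    return PATTERN.sub(replace, text)


sample = (
    "keep RPDIS\u2192 plain text \u2190RPDIS mid "
    "RPDIS\u2192 a " + chr(19) + "INCLUDEPICTURE x.png" + chr(21) + " b \u2190RPDIS end"
)
cleaned = strip_field_blocks(sample)
print(repr(cleaned))
```

As in the macro, a block is removed only when it contains a field marker; text-only blocks are left alone.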

How to edit all word documents in a directory?

I've been given the scut job of correcting some hundred or so code testing reports that have been filled out incorrectly by a senior coder who has more important work to do.
Unluckily for me, all the files are MS Word documents. But luckily for me, the formatting is all the same and the errors are all made in the same cells in the same table.
In the past I wrote a bash script to change single quotes to double quotes in multiple XML files. But that was on a Linux machine. This time around I have only a Windows machine.
Any hints where to begin?
The answer was to use VBA. I built two subroutines.
The first subroutine loops through the directory and opens each *.doc file it finds. It then calls the second subroutine on the open document. After the second subroutine has finished, the document is saved and then closed.
Sub DoVBRoutineNow()
    Dim file
    Dim path As String
    path = "C:\Documents and Settings\userName\My Documents\myWorkFolder\"
    file = Dir(path & "*.doc")
    Do While file <> ""
        Documents.Open FileName:=path & file
        Call editCellsTableRow2
        ActiveDocument.Save
        ActiveDocument.Close
        file = Dir()
    Loop
End Sub
~~~~~~
The second subroutine only works if all documents have the same formatting.
For example: the second row of the only table in the document has cells numbered 6, 7, 8. These contain "dd/MM/yyyy", "Last Name", "First Name".
These cells need to be changed to "yyyy/MM/dd", "Surname", "Given Name".
Sub editCellsTableRow2()
    Application.ScreenUpdating = False
    Dim Tbl As Table, cel As Cell, i As Long, n As Long
    With ActiveDocument
        For Each Tbl In .Tables
            ' wdAlignRowCenter is Word's row-alignment constant; xlCenter is an Excel constant
            Tbl.Rows(2).Alignment = wdAlignRowCenter
            For Each cel In Tbl.Rows(2).Cells
                If cel.ColumnIndex = 6 Then
                    cel.Range.Text = vbCrLf + "yyyy/MM/dd"
                End If
                If cel.ColumnIndex = 7 Then
                    cel.Range.Text = vbCrLf + "Surname"
                End If
                If cel.ColumnIndex = 8 Then
                    cel.Range.Text = vbCrLf + "Given Name"
                End If
            Next cel
        Next Tbl
    End With
    Set cel = Nothing: Set Tbl = Nothing
    Application.ScreenUpdating = True
End Sub

How to get MATLAB to read the correct number of XML nodes

I'm reading a simple XML file using MATLAB's built-in xmlread function.
<root>
    <ref>
        <requestor>John Doe</requestor>
        <project>X</project>
    </ref>
</root>
But when I call getChildNodes() on the ref element, it tells me that it has 5 children.
It works fine if I put all the XML on one line: MATLAB then tells me that the ref element has 2 children.
It doesn't seem to like the whitespace between elements.
Even if I run Canonicalize in the oXygen XML editor, I still get the same result, because Canonicalize still leaves the whitespace in place.
MATLAB uses Java and Xerces for its XML handling.
Question:
What can I do so that I can keep my XML file in a human-readable format (not all on one line) but still have MATLAB parse it correctly?
Code Update:
filename='example01.xml';
docNode = xmlread(filename);
rootNode = docNode.getDocumentElement;
entries = rootNode.getChildNodes;
nEnt = entries.getLength
The XML parser behind the scenes creates #text nodes for all whitespace between the elements. Wherever there is a newline or indentation, it creates a #text node with the newline and the following indentation spaces in the data portion of the node. So for the XML example you provided, parsing the child nodes of the "ref" element returns 5 nodes:
Node 1: #text with newline and indentation spaces
Node 2: "requestor" node which in turn has a #text child with "John Doe" in the data portion
Node 3: #text with newline and indentation spaces
Node 4: "project" node which in turn has a #text child with "X" in the data portion
Node 5: #text with newline and indentation spaces
This function removes all of these useless #text nodes for you. Note that if you intentionally have an xml element composed of nothing but whitespace then this function will remove it but for the 99.99% of xml cases this should work just fine.
function removeIndentNodes( childNodes )
    numNodes = childNodes.getLength;
    remList = [];
    for i = numNodes:-1:1
        theChild = childNodes.item(i-1);
        if (theChild.hasChildNodes)
            removeIndentNodes(theChild.getChildNodes);
        else
            if ( theChild.getNodeType == theChild.TEXT_NODE && ...
                    ~isempty(char(theChild.getData())) && ...
                    all(isspace(char(theChild.getData()))))
                remList(end+1) = i-1; % java indexing
            end
        end
    end
    for i = 1:length(remList)
        childNodes.removeChild(childNodes.item(remList(i)));
    end
end
Call it like this
tree = xmlread( xmlfile );
removeIndentNodes( tree.getChildNodes );
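The effect is easy to reproduce with another DOM implementation. The following sketch uses Python's standard `xml.dom.minidom` module (not MATLAB or Xerces) on the XML from the question. The `remove_indent_nodes` helper mirrors the idea of the function above but is written for this example, so treat it as an illustration rather than a drop-in equivalent.

```python
from xml.dom import minidom

XML = """<root>
  <ref>
    <requestor>John Doe</requestor>
    <project>X</project>
  </ref>
</root>"""


def remove_indent_nodes(node):
    """Recursively drop text nodes that contain only whitespace."""
    # Iterate over a copy because childNodes changes while nodes are removed.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and child.data.strip() == "":
            node.removeChild(child)
        else:
            remove_indent_nodes(child)


doc = minidom.parseString(XML)
ref = doc.getElementsByTagName("ref")[0]
before = len(ref.childNodes)   # counts the whitespace #text nodes as well
remove_indent_nodes(doc)
after = len(ref.childNodes)    # only the element children remain
print(before, after)
```

Text nodes with real content, such as "John Doe", are untouched because they are not whitespace-only.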
I felt that @cholland's answer was good, but I didn't like the extra XML work. So here is a solution that strips the whitespace from a copy of the XML file, since that whitespace is the root cause of the unwanted nodes.
fid = fopen('tmpCopy.xml','wt');
str = regexprep(fileread(filename),'[\n\r]+',' ');
str = regexprep(str,'>[\s]*<','><');
fprintf(fid,'%s', str);
fclose(fid);

EDIFACT macro (readable message structure)

I'm working in the EDI area and would like some help with an EDIFACT macro to make EDIFACT files more readable.
The message looks like this:
data'data'data'data'
I would like to have the macro converting the structure to:
data'
data'
data'
data'
Please let me know how to do this.
Thanks in advance!
BR
Jonas
If you merely want to view the files in a more readable format, try downloading the Softshare EDI Notepad. It's a fairly good tool just for that purpose, it supports X12, EDIFACT and TRADACOMS standards, and it's free.
Replacing in VIM (assuming that the standard EDIFACT separators/escape characters for UNOA character set are in use):
:s/\([^?]'\)\(.\)/\1\r\2/g
Breaking down the regex:
\([^?]'\) - search for ' which occurs after any character except ? (the standard escape character) and capture these two characters as the first atom. These are the last two characters of each segment.
\(.\) - capture any single character following the segment terminator (i.e. don't match if the segment terminator is already at the end of a line)
Then replace all matches on this line with a new line between the segment terminator and the beginning of the next segment.
Otherwise you could end up with this:
...
FTX+AAR+++FORWARDING?: Freight under Vendor?'
s care.'
NAD+BY+9312345123452'
CTA+PD+0001:Terence Trent D?'
Arby'
...
instead of this:
...
FTX+AAR+++FORWARDING?: Freight under Vendor?'s care .'
NAD+BY+9312345123452'
CTA+PD+0001:Terence Trent D?'Arby'
...
Is this what you are looking for?
Option Explicit
Dim stmOutput: Set stmOutput = CreateObject("ADODB.Stream")
stmOutput.Open
stmOutput.Type = 2 'adTypeText
stmOutput.Charset = "us-ascii"
Dim stm: Set stm = CreateObject("ADODB.Stream")
stm.Type = 1 'adTypeBinary
stm.Open
stm.LoadFromFile "EDIFACT.txt"
stm.Position = 0
stm.Type = 2 'adTypeText
stm.Charset = "us-ascii"
Dim c: c = ""
Do Until stm.EOS
    c = stm.ReadText(1)
    Select Case c
        Case Chr(39)
            stmOutput.WriteText c & vbCrLf
        Case Else
            stmOutput.WriteText c
    End Select
Loop
stm.Close
Set stm = Nothing
stmOutput.SaveToFile "EDIFACT.with-CRLF.txt"
stmOutput.Close
Set stmOutput = Nothing
WScript.Echo "Done."