Sum of DOM elements using XPath - dom

I am using MSXML v3.0 in a VB 6.0 application. The application calculates sum of an attribute of all nodes using for each loop as shown below
Set subNodes = docXML.selectNodes("//Transaction")
For Each subNode In subNodes
total = total + Val(subNode.selectSingleNode("Amount").nodeTypedValue)
Next
This loop is taking too much time, sometime it takes 15-20 minutes for 60 thousand nodes.
I am looking for XPath/DOM solution to eliminate this loop, probably
docXML.selectNodes("//Transaction").Sum("Amount")
or
docXML.selectNodes("Sum(//Transaction/Amount)")
Any suggestion is welcomed to get this sum faster.

// Open the XML.
docNav = new XPathDocument(#"c:\books.xml");
// Create a navigator to query with XPath.
nav = docNav.CreateNavigator();
// Find the sum
// This expression uses standard XPath syntax.
strExpression = "sum(/bookstore/book/price)";
// Use the Evaluate method to return the evaluated expression.
Console.WriteLine("The price sum of the books are {0}", nav.Evaluate(strExpression));
source: http://support.microsoft.com/kb/308333

Any solution that uses the XPath // pseudo-operator on an XML document with 60000+ nodes is going to be quite slow, because //x causes a complete traversal of the tree starting at the root of the document.
The solution can be speeded up significantly, if a more exact XPath expression is used, that doesn't include the // pseudo-operator.
If you know the structure of the XML document, always use a specific chain of location steps -- never //.
If you provide a small example, showing the specific structure of the document, then many people will be able to provide a faster solution than any solution that uses //.
For example, if it is known that all Transaction elements can be selected using this XPath expression:
/x/y/Transaction
then the evaluation of
sum(/x/y/Transaction/Amount)
is likely to be significantly faster than Sum(//Transaction/Amount)
Update:
The OP has revealed in a comment that the structure of the XML file is quite simple.
Accordingly, I tried with a 60000 Transaction nodes XML document the following:
/*/*/Amount
With .NET XslCompiledTransform (Yes, I used XSLT as the host for the XPath engine) this took 220ms (milliseconds), that means 0.22 seconds, to produce the sum.
With MSXML3 it takes 334 seconds.
With MSXML6 it takes 76 seconds -- still quite slow.
Conclusion: This is a bug in MSXML3 -- try to upgrade to another XPath engine, such as the one offered by .NET.

Related

dataFrame keying using pandas groupby method

I new to pandas and trying to learn how to work with it. Im having a problem when trying to use an example I saw in one of wes videos and notebooks on my data. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I loading it to a data frame and the group it by "filePath" and "vp", the code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now Im trying to approach the index like a dict, as i saw in examples, but when im doing
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do succeed to get a result when im putting the filepath, but as i understand and saw in previouse examples i should be able to use the vp keys as well, isnt is so?
Sorry if its a trivial one, i just cant understand why it is working in one example but not in the other.
Rutger you are not correct. It is possible to "partial" index a multiIndex series. I simply did it the wrong way.
The index first level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name i have multiple vps.
Now, this is correct:
res['E:\Audio\7168965711_5601_4.wav]
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner index (the Cust_ indexes) will fail.
You groupby two columns and therefore get a MultiIndex in return. This means you also have to slice using those to columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it in a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexibile, you could test for multiple values at once for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]

Data Processing, how to approach

I have the following Problem, given this XML Datastructure:
<level1>
<level2ElementTypeA></level2ElementTypeA>
<level2ElementTypeB>
<level3ElementTypeA>String1Ineed<level3ElementTypeB>
</level2ElementTypeB>
...
<level2ElementTypeC>
<level3ElementTypeB attribute1>
<level4ElementTypeA>String2Ineed<level4ElementTypeA>
<level3ElementTypeB>
<level2ElementTypeC>
...
<level2ElementTypeD></level2ElementTypeD>
</level1>
<level1>...</level1>
I need to create an Entity which contain: String1Ineed and String2Ineed.
So every time I came across a level3ElementTypeB with a certain value in attribute1, I have my String2Ineed. The ugly part is how to obtain String1Ineed, which is located in the first element of type level2ElementTypeB above the current level2ElementTypeC.
My 'imperative' solution looks like that that I always keep an variable with the last value of String1Ineed and if I hit criteria for String2Ineed, I simply use that. If we look at this from a plain collection processing point of view. How would you model the backtracking logic between String1Ineed and String2Ineed? Using the State Monad?
Isn't this what XPATH is for? You can find String2Ineed and then change the axis to search back for String1Ineed.

TermQuery not returning on a known search term, but WildcardQuery does

Am hoping someone with enough insight into the inner workings of Lucene might be able to point me in the right direction =)
I'll skip most of the surrounding irellevant code, and cut right to the chase. I have a Lucene index, to which I am adding the following field to the index (variables replaced by their literal values):
document.Add( new Field("Typenummer", "E5CEB501A244410EB1FFC4761F79E7B7",
Field.Store.YES , Field.Index.UN_TOKENIZED));
Later, when I search my index (using other types of queries), I am able to verify that this field does indeed appear in my index - like when looping through all Fields returned by Document.GetFields()
Field: Typenummer, Value: E5CEB501A244410EB1FFC4761F79E7B7
So far so good :-)
Now the real problem is - why can I not use a TermQuery to search against this value and actually get a result.
This code produces 0 hits:
// Returns 0 hits
bq.Add( new TermQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
But if I switch this to a WildcardQuery (with no wildcards), I get the 1 hit I expect.
// returns the 1 hit I expect
bq.Add( new WildcardQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
I've checked field lengths, I've checked that I am using the same Analyzer and so on and I am still on square 1 as to why this is.
Can anyone point me in a direction I should be looking?
I finally figured out what was going on. I'm expanding the tags for this question as it, much to my surprise, actually turned out to be an issue with the CMS this particular problem exists in. In summary, the problem came down to this:
The field is stored UN_TOKENIZED, meaning Lucene will store it excactly "as-is"
The BooleanQuery I pasted snippets from gets sent to the Sitecore SearchManager inside a PreparedQuery wrapper
The behaviour I expected from this was, that my query (having already been prepared) would go - unaltered - to the Lucene API
Turns out I was wrong. It passes through a RewriteQuery method that copies my entire set of nested queries as-is, with one exception - all the Term arguments are passed through a LowercaseStrategy()
As I indexed an UPPERCASE Term (UN_TOKENIZED), and Sitecore changes my PreparedQuery to lowercase - 0 results are returned
Am not going to start an argument of whether this is "by design" or "by design flaw" implementation of the Lucene Wrapper API - I'll just note that rewriting my query when using the PreparedQuery overload is... to me... unexpected ;-)
Further teachings from this; storing the field as TOKENIZED will eliminate this problem too, as the StandardAnalyzer by default will lowercase all tokens.

How to process a simple loop in Perl's WWW::Mechanize?

Especially interesting for me as a PHP/Perl-beginner is this site in Switzerland:
see this link:http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=1308
Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it.
what we have so far: Well the harvesting task should be no problem if i take WWW::Mechanize - particularly for doing the form based search and selecting the individual entries. Hmm - i guess that the algorithm would be basically 2 nested loops: the outer loop runs the form based search, the inner loop processes the search results.
The outer loop would use the select() and the submit_form() functions on the second search form on the page. Can we use DOM processing here. Well - how can we get the get the selection values.
The inner loop through the results would use the follow link function to get to the actual entries using the following call.
$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/evs2000.*\?
Id=\d+$/, n => $result_nbr);
This would forward our mechanic browser to the entry page. Basically the URL query looks for links that have the webgrap_path to Id pattern, which is unique for each database entry. The $result_nbr variable tells mecha which one of the results it should follow next.
If we have several result pages we would also use the same trick to traverse through the result pages. For the semantic extraction of the entry information,we could parse the content of the actual entries with XML:LibXML's html parser (which works fine on this page), because it gives you some powerful DOM selection (using XPath) methods.
Well the actual looping through the pages should be doable in a few lines of perl of max. 20 lines - likely less.
But wait: the processing of the entry pages will then be the most complex part
of the script.
Approaches: In principle we could do the same algorithm with a single while loop
if we use the back() function smartly.
Can you give me a hint for the beginning - the processing of the entry pages - doing this in Perl:: Mechanize
"Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it."
Not true. See http://perlmonks.org/?node_id=905767
"The data is copyrighted even though it is made available freely: "Downloading or copying of texts, illustrations, photos or any other data does not entail any transfer of rights on the content." (and again, in German, as you've been scraping some other German list to spam before)."

Search for a date between given ranges - Lotus

I have been trying to work out what is the best way to search for gather all of the documents in a database that have a certain date.
Originally I was trying to use FTsearch or search to move through a document collection, but I changed over to processing a view and associated documents.
My first question is what is the easiest way to spin through a set of documents and find if a date stored in the documents is greater than or less than a specified date?
So, to continue working I implemented the following code.
If (doc.creationDate(0) > cdat(parm1))
And (doc.creationDate(0) < CDat(parm2)) then
...
end if
but the results are off
Included! Date:3/12/10 11:07:08 P1:3/1/10 P2: 3/5/10
Included! Date:3/13/10 9:15:09 P1:3/1/10 P2: 3/5/10
Included! Date:3/17/10 16:22:07P1:3/1/10 P2: 3/5/10
You can see that the date stored in the doc is not between P1 and P2. BUT! it does limit the documents with a date less than P1 correctly. So I won't get a result for a document with a date less than 3/1/10
If there isn't a better way than the if statement, can someone help me understand why the two examples from above are included?
Hi you can try something like this:
searchStr = {(Form = "yourForm" & ((#Created > [} & parm1 & {]) & (#Created < [} & parm2 & {])))}
Set docCollection = currentDB.Search(searchStr, Nothing, 0)
If(docCollection.Count > 0)Then
'do your stuff with the collection returned
End If
Carlos' response is pretty good.
If you have a lot of documents, you can also use a full-text search which will is much faster. The method call is very similar (db.ftsearch(), online help can be found here).
The standard DB Search method operates in the same way as view index updates, so it can get a little slow if you have thousands of documents to search through.
Just make sure you enable full text index for your database in the database properties, (last tab).
Syntax on this approach is very similar, this link provides a good reference for FTsearch. Using Carlos' syntax, you can substitute FTSearch and searchStr assignment for faster searching.