Parsing XML and retrieving attributes from (nested?) elements - element

I am trying to get specific data from an XML file, namely X, Y coordinates that appear, to my beginner's eyes, to be attributes of an element called "Point" in my file. I cannot get to that data with anything other than a sledgehammer approach and would gratefully accept some help.
I have used the following successfully:
for Shooter in root.iter('Shooter'):
    print(Shooter.attrib)
But if I try the same with "Point" (or "Points") there is no output. I cannot even see "Point" when I use the following:
for child in root:
    print(child.tag, child.attrib)
So: the sledgehammer
print([elem.attrib for elem in root.iter()])
Which gives me the attributes for every element. This file is a single collection of data and could contain hundreds of data points and so I would rather try to be a little more subtle and home in on exactly what I need.
My XML file
https://pastebin.com/abQT3t9k
UPDATE: Thanks for the answers so far. I tried the solution posted and ended up with 7000 lines of output, which wasn't quite what I was after. I should have explained in more detail. I also tried (as suggested):
def find_rec(node, element, result):
    for item in node.findall(element):
        result.append(item)
        find_rec(item, element, result)
    return result

print(find_rec(ET.parse(filepath_1), 'Shooter', []))  # Returns <Element 'Shooter' at 0x125b0f958>
print(find_rec(ET.parse(filepath_1), 'Point', []))  # Returns None
I admit I have never worked with XML files before, and I am new to Python (but enjoying it). I wanted to get the solution myself but I have spent days getting nowhere.
I perhaps should have just asked from the beginning how to extract the XY data for each ShotNbr (in this file there is just one) but I didn't want code written for me.
I've managed to get the XY from this file but my code will never work if there is more than one shot, or if I want to specifically look at, say, shot number 20.
How can I find shot number 2 (ShotNbr="2") and extract only its XY data points?

Assuming that you are using xml.etree.ElementTree, you are only looking at the direct children of root.
You need to recurse into the tree to access elements lower in the hierarchy.
This seems to be the same problem as "ElementTree - findall to recursively select all child elements", which has an excellent answer that I am not going to plagiarize.
Just apply it.
Alternatively,
import xml.etree.ElementTree as ET
root = ET.parse("file.xml")
print(root.findall('.//Point'))
should work.
See: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax
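For the follow-up question (only the XY data of one specific shot), a minimal sketch using the same XPath subset might look like the following. The layout of the real file is not shown here, so the sample XML, the element name "Shot" and the attribute names "X" and "Y" are assumptions; only "ShotNbr" and "Point" come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's layout and names may differ.
SAMPLE = """
<Session>
  <Shooter Name="A">
    <Shot ShotNbr="1">
      <Points>
        <Point X="1.0" Y="2.0"/>
      </Points>
    </Shot>
    <Shot ShotNbr="2">
      <Points>
        <Point X="3.5" Y="4.5"/>
        <Point X="5.0" Y="6.0"/>
      </Points>
    </Shot>
  </Shooter>
</Session>
"""

def points_for_shot(root, shot_nbr):
    """Return (x, y) float pairs for every Point below the shot with this number."""
    # The attribute predicate [@ShotNbr='2'] selects one shot at any depth.
    shot = root.find(f".//Shot[@ShotNbr='{shot_nbr}']")
    if shot is None:
        return []
    # iter() walks all descendants of that shot only, not the whole file.
    return [(float(p.get("X")), float(p.get("Y"))) for p in shot.iter("Point")]

root = ET.fromstring(SAMPLE)
print(points_for_shot(root, 2))
```

With a real file, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE), and the names would need adjusting to whatever the file actually uses.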

Is there a way to modify a molecule in RDKit?

I have a branched molecule just like in the Image (left one).
I want to add COOH at the end of each branch like Image (right one)
Here is the SMILES format of my molecule in a simplified form with 4 branches.
[N:1]([CH2:2][CH2:3][N:4]([CH2:47][CH2:48][CH:49]([NH:50][CH2:51][CH2:52][NH2:53])[O-:55])[CH2:66][CH2:67][CH:68]([NH:69][CH2:70][CH2:71][NH2:72])[O-:74])([CH2:9][CH2:10][CH:11]([NH:12][CH2:13][CH2:14][NH2:15])[O-:17])[CH2:28][CH2:29][CH:30]([NH:31][CH2:32][CH2:33][NH2:34])[O-:36]
I actually have a much bigger molecule, but if I can find a way to do it with the simple one, I think I can extend the solution to the bigger one.
Here is a code example
mod_mol = Chem.ReplaceSubstructs(m,
                                 Chem.MolFromSmiles('[NH2:34]'),
                                 Chem.MolFromSmiles('[CH2:99]'),
                                 replaceAll=True)
mod_mol[0]
For example, I tried to change NH2 to CH2, but nothing happens.
In general, it is helpful to look at where the error shows a NoneType. In this case:
rdkit.Chem.rdmolops.ReplaceSubstructs(Mol, NoneType, Mol)
The issue arose because Chem.MolFromSmiles was given a SMARTS string, like this:
Chem.MolFromSmiles('[NH2:34]')
The solution is to use Chem.MolFromSmarts instead, like this:
Chem.MolFromSmarts('[NH2:34]')

dataFrame keying using pandas groupby method

I am new to pandas and trying to learn how to work with it. I'm having a problem when trying to apply, to my own data, an example I saw in one of Wes's videos and notebooks. I have a csv file that looks like this:
filePath,vp,score
E:\Audio\7168965711_5601_4.wav,Cust_9709495726,-2
E:\Audio\7168965711_5601_4.wav,Cust_9708568031,-80
E:\Audio\7168965711_5601_4.wav,Cust_9702445777,-2
E:\Audio\7168965711_5601_4.wav,Cust_7023544759,-35
E:\Audio\7168965711_5601_4.wav,Cust_9702229339,-77
E:\Audio\7168965711_5601_4.wav,Cust_9513243289,25
E:\Audio\7168965711_5601_4.wav,Cust_2102513187,18
E:\Audio\7168965711_5601_4.wav,Cust_6625625104,-56
E:\Audio\7168965711_5601_4.wav,Cust_6073165338,-40
E:\Audio\7168965711_5601_4.wav,Cust_5105831247,-30
E:\Audio\7168965711_5601_4.wav,Cust_9513082770,-55
E:\Audio\7168965711_5601_4.wav,Cust_5753907026,-79
E:\Audio\7168965711_5601_4.wav,Cust_7403410322,11
E:\Audio\7168965711_5601_4.wav,Cust_4062144116,-70
I am loading it into a data frame and then grouping it by "filePath" and "vp". The code is:
res = df.groupby(['filePath','vp']).size()
res.index
and the output is:
[E:\Audio\7168965711_5601_4.wav Cust_2102513187,
Cust_4062144116, Cust_5105831247,
Cust_5753907026, Cust_6073165338,
Cust_6625625104, Cust_7023544759,
Cust_7403410322, Cust_9513082770,
Cust_9513243289, Cust_9702229339,
Cust_9702445777, Cust_9708568031,
Cust_9709495726]
Now I'm trying to access the index like a dict, as I saw in examples, but when I do
res['Cust_4062144116']
I get an error:
KeyError: 'Cust_4062144116'
I do get a result when I pass the file path, but as I understand it, and as I saw in previous examples, I should be able to use the vp keys as well. Isn't that so?
Sorry if it's a trivial one, I just can't understand why it works in one example but not in the other.
Rutger, you are not correct. It is possible to "partially" index a MultiIndex series. I simply did it the wrong way.
The first index level is the file name (e.g. E:\Audio\7168965711_5601_4.wav above) and the second level is vp. Meaning, for each file name I have multiple vps.
Now, this is correct:
res[r'E:\Audio\7168965711_5601_4.wav']
and will return:
Cust_2102513187 2
Cust_4062144116 8
....
but trying to index by the inner level (the Cust_ values) will fail.
You group by two columns and therefore get a MultiIndex in return. This means you also have to slice using those two columns, not with a single index value.
Your .size() on the groupby object converts it into a Series. If you force it into a DataFrame you can use the .xs method to slice a single level:
res = pd.DataFrame(df.groupby(['filePath','vp']).size())
res.xs('Cust_4062144116', level=1)
That works. If you want to keep it as a series, boolean indexing can help, something like:
res[res.index.get_level_values(1) == 'Cust_4062144116']
The last option is a bit less readable, but sometimes also more flexible. You could test for multiple values at once, for example:
res[res.index.get_level_values(1).isin(['Cust_4062144116', 'Cust_6073165338'])]

Data Processing, how to approach

I have the following problem. Given this XML data structure:
<level1>
  <level2ElementTypeA></level2ElementTypeA>
  <level2ElementTypeB>
    <level3ElementTypeA>String1Ineed</level3ElementTypeA>
  </level2ElementTypeB>
  ...
  <level2ElementTypeC>
    <level3ElementTypeB attribute1="...">
      <level4ElementTypeA>String2Ineed</level4ElementTypeA>
    </level3ElementTypeB>
  </level2ElementTypeC>
  ...
  <level2ElementTypeD></level2ElementTypeD>
</level1>
<level1>...</level1>
I need to create an entity which contains String1Ineed and String2Ineed.
So every time I come across a level3ElementTypeB with a certain value in attribute1, I have my String2Ineed. The ugly part is how to obtain String1Ineed, which is located in the first element of type level2ElementTypeB above the current level2ElementTypeC.
My 'imperative' solution is to always keep a variable holding the last value of String1Ineed and, when I hit the criteria for String2Ineed, simply use it. But looking at this from a plain collection-processing point of view, how would you model the backtracking logic between String1Ineed and String2Ineed? Using the State monad?
Isn't this what XPATH is for? You can find String2Ineed and then change the axis to search back for String1Ineed.
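Where a full XPath engine is not at hand (Python's standard xml.etree.ElementTree, for instance, implements only a subset of XPath and has no preceding-sibling axis), the stateful scan described in the question can be sketched as a generator that carries the last String1 forward. The sample document is a well-formed stand-in for the structure above; the attribute value "wanted" and all text values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the structure in the question; "wanted" is a
# hypothetical placeholder for the "certain value" of attribute1.
SAMPLE = """
<root>
  <level1>
    <level2ElementTypeA/>
    <level2ElementTypeB>
      <level3ElementTypeA>first-A</level3ElementTypeA>
    </level2ElementTypeB>
    <level2ElementTypeC>
      <level3ElementTypeB attribute1="wanted">
        <level4ElementTypeA>first-B</level4ElementTypeA>
      </level3ElementTypeB>
    </level2ElementTypeC>
    <level2ElementTypeB>
      <level3ElementTypeA>second-A</level3ElementTypeA>
    </level2ElementTypeB>
    <level2ElementTypeC>
      <level3ElementTypeB attribute1="other">
        <level4ElementTypeA>ignored</level4ElementTypeA>
      </level3ElementTypeB>
    </level2ElementTypeC>
    <level2ElementTypeC>
      <level3ElementTypeB attribute1="wanted">
        <level4ElementTypeA>second-B</level4ElementTypeA>
      </level3ElementTypeB>
    </level2ElementTypeC>
    <level2ElementTypeD/>
  </level1>
</root>
"""

def pair_strings(level1, wanted):
    """Yield (string1, string2) pairs, carrying the last seen string1 forward."""
    last_string1 = None  # state: text from the nearest preceding level2ElementTypeB
    for child in level1:  # direct children, in document order
        if child.tag == "level2ElementTypeB":
            last_string1 = child.findtext("level3ElementTypeA")
        elif child.tag == "level2ElementTypeC":
            for hit in child.findall(f"level3ElementTypeB[@attribute1='{wanted}']"):
                yield last_string1, hit.findtext("level4ElementTypeA")

root = ET.fromstring(SAMPLE)
pairs = [p for level1 in root.iter("level1") for p in pair_strings(level1, "wanted")]
print(pairs)
```

The local variable plays the role the State monad would play in a purely functional version: the state is threaded through a single left-to-right pass over the children of each level1.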

PDF Table of Contents Parsing with iOS Quartz 2D

This question has been asked before, I know. However, nobody has answered it well. I'm wondering how to parse a PDF's "table of contents" on the iPhone. The docs tell me to use CGPDFDocumentGetCatalog but not how to use it. All they say is that it returns a dictionary. Also, I can't find any example code. Any suggestions?
It looks like the closest thing seen on SO is "Create a table of contents from a pdf file".
It's basically just parsing the CGPDFDictionary called "Outlines" in the document catalog.
// get outline & loop through dictionary...
CGPDFDictionaryRef outlineRef;
if(CGPDFDictionaryGetDictionary(pdfDocDictionary, "Outlines", &outlineRef)) {
}
then you start with the First element and parse your way through.
CGPDFDictionaryGetDictionary(outlineRef, "First", &firstEntry)
You want to get the Title and the Destination.
NSString *outlineTitle = PSPDFStringFromPDFDict(outlineElementRef, @"Title");
CGPDFDictionaryGetObject(outlineElementRef, "Dest", &destinationRef)
The tricky part starts with getting the correct destination, because there are (hooray, PDF!) several ways to store it, plus several ways that are not defined in the PDF Reference but are still out in the wild, plus several variants that are just broken and that you have to deal with anyway.
For example, you could get the Count of the outline dictionary using
CGPDFInteger elements;
if (CGPDFDictionaryGetInteger(outlineRef, "Count", &elements)) {
    PSPDFLog(@"parsing outline: %ld elements. (Count will be ignored anyway)", (long int)elements);
} else {
    PSPDFLogError(@"Error while parsing outline. No outlineRef?");
}
But note that Count is sometimes invalid due to broken PDF creation tools. Think of PDF like HTML: even if a file is broken, parsers will do their best to display as much data as they can. So my advice is to ignore Count and parse the dictionary anyway. (A few weeks ago I encountered a document that had Count = -10. Go figure.)
I can't post the full code, as it's from my commercial PDF library PSPDFKit, and I need to make a living out of it ;) But this should get you started.
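To make the traversal order concrete, here is a language-neutral sketch in Python rather than Quartz code. It models each outline item as a plain dict using the PDF key names "Title", "Dest", "First" and "Next", follows the First/Next links, and never reads Count. The sample data and the function name walk_outline are made up for illustration; this is not how PSPDFKit does it.

```python
# Model only: each outline item is a plain dict keyed by the PDF names
# "Title", "Dest", "First" and "Next". The sample data below is made up.
def walk_outline(item, depth=0):
    """Yield (depth, title, dest) for an item, its children and its following siblings."""
    seen = set()
    while item is not None and id(item) not in seen:
        seen.add(id(item))  # stop if a broken file links "Next" back into this chain
        yield depth, item.get("Title"), item.get("Dest")
        child = item.get("First")
        if child is not None:
            yield from walk_outline(child, depth + 1)  # descend before moving on
        item = item.get("Next")

section_b = {"Title": "1.2 Details", "Dest": "page-4"}
section_a = {"Title": "1.1 Overview", "Dest": "page-2", "Next": section_b}
chapter_2 = {"Title": "2 Results", "Dest": "page-9"}
chapter_1 = {"Title": "1 Intro", "Dest": "page-1", "First": section_a, "Next": chapter_2}
outlines = {"First": chapter_1, "Count": -10}  # Count is deliberately wrong and never read

entries = list(walk_outline(outlines.get("First")))
for depth, title, dest in entries:
    print("  " * depth + f"{title} -> {dest}")
```

In the Quartz version, each item.get(...) would correspond to a CGPDFDictionaryGetDictionary or CGPDFDictionaryGetObject call on the current outline entry, and resolving "Dest" is where the real work lies.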

Asp.Net MVC 2: How exactly does a view model bind back to the model upon post back?

Sorry for the length, but a picture is worth 1000 words:
In ASP.NET MVC 2, the input form field "name" attribute must contain exactly the syntax below that you would use to reference the object in C# in order to bind it back to the object upon post back. That said, if you have an object like the following where it contains multiple Orders having multiple OrderLines, the names would look and work well like this (case sensitive):
This works:
Order[0].id
Order[0].orderDate
Order[0].Customer.name
Order[0].Customer.Address
Order[0].OrderLine[0].itemID // first order line
Order[0].OrderLine[0].description
Order[0].OrderLine[0].qty
Order[0].OrderLine[0].price
Order[0].OrderLine[1].itemID // second order line, same names
Order[0].OrderLine[1].description
Order[0].OrderLine[1].qty
Order[0].OrderLine[1].price
However we want to add order lines and remove order lines at the client browser. Apparently, the indexes must start at zero and contain every consecutive index number to N.
The black belt ninja Phil Haack's blog entry here explains how to remove the [0] index, have duplicate names, and let MVC auto-enumerate duplicate names with the [0] notation. However, I have failed to get this to bind back using a nested object:
This fails:
Order.id // Duplicate names should enumerate at 0 .. N
Order.orderDate
Order.Customer.name
Order.Customer.Address
Order.OrderLine.itemID // And likewise for nested properties?
Order.OrderLine.description
Order.OrderLine.qty
Order.OrderLine.price
Order.OrderLine.itemID
Order.OrderLine.description
Order.OrderLine.qty
Order.OrderLine.price
I haven't found any advice out there yet that describes how this works for binding back nested ViewModels on post. Any links to existing code examples or strict examples on the exact names necessary to do nested binding with ILists?
Steve Sanderson has code that does this sort of thing here, but we cannot seem to get this to bind back to nested objects. Anything not having the [0]..[n] AND being consecutive in numbering simply drops off of the return object.
Any ideas?
We found a workaround by using the following:
Html.EditorFor(m => m, "ViewNameToUse", "FieldPrefix")
where FieldPrefix is the "object[0]" prefix. This is hardly ideal, but it certainly works pretty well, and it's simple.