I have been using the excellent python-docx package to read, modify, and write Microsoft Word files. The package supports extracting the text from each paragraph. It also allows accessing a paragraph a "run" at a time, where the run is a set of characters that have the same font information. Unfortunately, when you access a paragraph by runs, you lose the links, because the package does not support links. The package also does not support accessing change tracking information.
My problem is that I need to access change tracking information. Or, more specifically, I need to copy paragraphs that have change tracking indicated from one document to another.
I've tried doing this at the XML level. For example, this code snippet appends the contents of file1.docx to file2.docx:
from docx import Document
doc1 = Document("file1.docx")
doc2 = Document("file2.docx")
doc2.element.body.append(doc1.element.body)
doc2.save("file2-appended.docx")
When I try to open the result on my Mac, I get this error for complicated files:
But if I click OK, the contents are there. The manipulation also works without problem for very simple files.
What am I missing?
The .element attribute is really an "internal" interface and should be named ._element. In most other places I have named it that. What you're getting there is the root element of the document part. You can see what it is by calling:
print(doc2.element.xml)
That element has one and only one w:body element below it, which is what you get with doc2.element.body (.xml will work on that too, btw, if you want to inspect that element).
What your code is doing is appending one w:body element as the last child of another w:body element, thereby forming XML that is invalid against the schema. The WordprocessingML vocabulary is quite strict about which element can follow another, how many are allowed, and so forth. The only surprise for me is that it actually sometimes works for you, I take it :)
If you want to manipulate the XML directly, which is what the ._element attribute is there for, you need to do it carefully, in view of the (complex) WordprocessingML XML Schema.
Unlike when you stick to the published API, there's no safety net once ._element (or .element) appears in your code.
The body XML can also contain references (relationships) to external document parts, like images and hyperlinks. These are only valid within the document in which they appear, so they no longer resolve once the XML is copied into another document. This might explain why some files can be repaired.
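To illustrate the point about nesting, here is a minimal sketch that copies the children of one w:body into another instead of appending the w:body itself, and keeps the trailing w:sectPr as the last child. It works on plain document.xml-style strings with the standard library rather than on python-docx objects, and the function name merge_bodies is made up for this example. It does not fix up relationships (images, hyperlinks), so treat it as an outline of the idea rather than a complete solution.

```python
import copy
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def merge_bodies(dst_xml: str, src_xml: str) -> str:
    """Copy the block-level children of src's w:body into dst's w:body.

    Hypothetical helper: relationships (r:id references) are NOT remapped.
    """
    dst_root = ET.fromstring(dst_xml)
    src_root = ET.fromstring(src_xml)
    dst_body = dst_root.find(f"{{{W}}}body")
    src_body = src_root.find(f"{{{W}}}body")
    sect_tag = f"{{{W}}}sectPr"

    # Insert before the destination's trailing w:sectPr, if it has one.
    children = list(dst_body)
    insert_at = len(children)
    if children and children[-1].tag == sect_tag:
        insert_at -= 1

    for child in src_body:
        if child.tag == sect_tag:
            continue  # keep only the destination's section properties
        dst_body.insert(insert_at, copy.deepcopy(child))
        insert_at += 1

    return ET.tostring(dst_root, encoding="unicode")
```

The elements python-docx exposes through ._element are lxml elements, so the same "copy the children, not the body" approach should carry over, but that part is an assumption to verify against your own files.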
Setting the scene:
I'm working on a webapp in Eclipse, with a bunch of JSPs and XSLs. And I need to hide some features depending on some users' attributes. This is the third job in the series.
For the previous 2 jobs, I was able to achieve my goals because either the required changes were on JSPs (straightforward) or on XSLs for which the objects existed.
That is, we have the following sequence:
- searchForm.jsp (a form with some criteria you set up and submit),
- resultsList.jsp (a list of search results - clicking on any result brings the full record for the result),
- displayItem.jsp (the full record),
- record.xsl (the xsl that transforms displayItem.jsp).
Thus for the 2 previous jobs, the features I needed to hide were either on these JSPs themselves or on the xsl, in which case the (relevant) JSPs had:
<c:set target="${item}" property="xsltParameter" value="xxxx=Y"/>
where 'item' is an existing object and 'xxxx' a (usually but not necessarily global) parameter defined in the xsl.
For example:
<xsl:param name="xxxx">N</xsl:param>
Thus, if my changes were on the xsl, I would reuse the 'item' object to pass on my parameter and process it in the XSL(s).
E.g. I'd put on the jsp:
<c:set target="${item}" property="xsltParameter" value="abcd=AAA"/>
and add to the xsl:
<xsl:param name="abcd"></xsl:param>
This way I was able to pass my own parameters to the xsl.
For this last job however:
- The XSL (filters.xsl) is quite short and self-contained.
- It appears on the same page as resultsList.jsp, (therefore after searchForm.jsp) but has no connection I can see with it.
- On the JSPs, there's NO (appropriate) target object I can (re)use.
- I've already tried on a/the JSP the way I know to create variables/parameters:
<c:set var="xyz" value="${abcd}"/>
but this doesn't seem to work (when I create a corresponding xyz global parameter in filters.xsl).
My issues are:
- I'm struggling to create an object (if that's what I need to do).
- I may need to do something else, but I'm not sure what (hence my post).
Plan B
In desperation, as plan B, I'm creating a proof-of-concept: a static XML file (param-val.xml), in the same location as the XSLs, in which I put my external parameter/criterion:
<paramroot>
<paramval>abcd</paramval>
</paramroot>
What I'd like to do is use the document() function to extract this parameter and use it either within filters.xsl, if possible, or otherwise in a go-between prefilters.xsl that filters.xsl would import. This is where I need some help/tips/etc.
<xsl:param name="theXML" select="'prefilters.xml'" />
<xsl:variable name="myDoc" select="document($theXML)" />
I've been reading stuff on the web and tested a few things, but I'm stuck (rusty on some topics and learning others). How can I grab and use the 'abcd' in either filters or prefilters?
Any suggestions on how best to handle this?
Sorry for the lengthy post. Any help would be greatly appreciated.
Many thanks and regards.
I have a service that takes an .odt template file and some text values, and produces an .odt as its output. I need to make this service available via HTTP, and I don't quite know what the most RESTful way to make the interface work is.
I need to be able to supply the template file, and the input values, to the server - and get the resulting .odt file sent back to me. The options I see for how this would work are:
1. PUT or POST the template to the server, then do a GET request, passing along the URI of the template I just posted, plus the input values - the GET response body would have the .odt
2. Send the template and the parameters in a single GET request - the template file would go in the GET request body.
3. Like (2) above except do the whole thing as a single POST request instead of GET.
The problem with (1) is that I do not want to store the template file on the server. This adds complexity and storing the file is not useful to me beyond the fact that it's a very RESTful approach. Also, a single request would be better than 2, all other things being equal.
The problem with (2) is that putting a body in a GET request is bordering on abuse of HTTP - it is supported by the software I'm using now, but may not always be.
Number (3) seems misleading since this is more naturally a 'read' or 'get' operation than a 'post'.
What I am doing is inherently like a function call - I need to pass a significant amount of data in, and I am really just using HTTP as a convenient way of exposing my code across the network. Perhaps what I'm trying to do is inherently un-RESTful, and there is no REST-friendly solution? Can anyone advise? Thank you!
Wow, so this answer escalated quickly...
Over the last year or so I've attempted to gain a much better understanding of REST through books, mailing lists, etc. For some reason I decided to pick your question as a test of what I've learned.
Sorry :P
Let's make this entire example one step simpler. Rather than worry about the user uploading a file, we'll instead assume that the user just passes a string. So, really, they are going to pass a string, in addition to the arguments of characters to replace (a list of key/values). We'll deal with the file upload part later.
Here's a RESTful way of doing it which doesn't require anything to be stored on the server. I will use some HTML (albeit broken, I'll leave out stuff like HEAD) as my media type, just because it's fairly well known.
A Sample Solution
First, the user will need to access our REST service.
GET /
<body>
<a rel="http://example.com/rels/arguments" href="/arguments">
Start Building Arguments
</a>
</body>
This basically gives the user a way to start actually interacting with our service. Right now they have only one option: use the link to build a new set of arguments (the name/value pairings that will eventually be used in the string replacement scheme). So the user goes to that link.
GET /arguments
<body>
<a rel="self" href="/arguments"/>
<form rel="http://example.com/rels/arguments" method="get" action="/arguments?{key}={value}">
<input id="key" name="key" type="text"/>
<input id="value" name="value" type="text"/>
</form>
<form rel="http://example.com/rels/processed_string" action="/processed_string/{input_string}">
<input id="input_string" name="input_string" />
</form>
</body>
This brings us to an instance of an "arguments" resource. Notice that this isn't a JSON or XML document that returns to you just the plain data of the key/value pairings; it is hypermedia. It contains controls that direct the user to what they can do next (sometimes referred to as allowing the user to "follow their nose"). This specific URL ("/arguments") represents an empty list of key/value pairings. I could very well have named the URL "/empty_arguments" if I wanted to. This is an example of why it's silly to think about REST in terms of URLs: it really shouldn't matter what the URL is.
In this new HTML, the user is given three different resources that they can navigate to:
They can use the link to "self" to navigate to the same resource they are currently on.
They can use the first form to navigate to a new resource which represents an argument list with the additional name/value pairing that they specify in the form.
They can use the second form to provide the string that they wish to finally do their replacement on.
Note: You probably noticed that the first form has a strange "action" URL:
/arguments?{key}={value}
Here, I cheated: I'm using URI Templates. This allows me to specify how the arguments are going to be placed onto the URL, rather than using the default HTML scheme of just using <input-name>=<input-value>. Obviously, for this to work, the user can't use a browser (as browsers don't implement this): they would need to use software that understands HTML and URI templating. Of course, I'm using HTML only as an example; your REST service could use some kind of XML that supports URI Templating as defined by the URI Template spec.
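As a rough illustration of what such a client would do with the form's "action" value, here is a minimal sketch of simple {name} expansion with percent-encoding. It covers only the plain-variable case, not the full URI Template spec, and the function name expand is made up for this example.

```python
import re
from urllib.parse import quote


def expand(template: str, values: dict) -> str:
    """Replace each {name} in the template with its percent-encoded value.

    Only handles simple variables; operators from the URI Template spec
    (such as {?x,y} or {+path}) are deliberately not supported here.
    """
    def substitute(match):
        return quote(str(values[match.group(1)]), safe="")

    return re.sub(r"\{(\w+)\}", substitute, template)


# Filling in the first form with key "Author" and value "John Doe"
print(expand("/arguments?{key}={value}", {"key": "Author", "value": "John Doe"}))
```

The same call applied to the longer "action" URL of a later response simply appends the next key/value pair, which is how the argument list grows without the server storing anything.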
Anyway, let's say the user wants to add their arguments. The user uses the first form (e.g., filling in the "key" input with "Author" and the "value" input with "John Doe"). This results in...
GET /arguments?Author=John%20Doe
<body>
<a rel="self" href="/arguments?Author=John%20Doe"/>
<form rel="http://example.com/rels/arguments" method="get" action="/arguments?Author=John%20Doe&{key}={value}">
<input id="key" name="key" type="text"/>
<input id="value" name="value" type="text"/>
</form>
<form rel="http://example.com/rels/processed_string" action="/processed_string/{input_string}?Author=John%20Doe">
<input id="input_string" name="input_string" />
</form>
</body>
This is now a brand new resource. You can describe it as an argument list (key/value pairs) with a single key/value pair: "Author"/"John Doe". The HTML is pretty much the same as before, with a few changes:
The "self" link now points to the current resource's URL (changed from "/arguments" to "/arguments?Author=John%20Doe").
The "action" attribute of the first form now has the longer URL, but once again we use URI Templates to allow us to build a larger URI.
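The second form's "action" attribute now carries the argument list as a query string after the {input_string} template variable.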
The user now wants to add a "Date" argument, so they once again submit the first form, this time with key of "Date" and a value of "2003-01-02".
GET /arguments?Author=John%20Doe&Date=2003-01-02
<body>
<a rel="self" href="/arguments?Author=John%20Doe&Date=2003-01-02"/>
<form rel="http://example.com/rels/arguments" method="get" action="/arguments?Author=John%20Doe&Date=2003-01-02&{key}={value}">
<input id="key" name="key" type="text"/>
<input id="value" name="value" type="text"/>
</form>
<form rel="http://example.com/rels/processed_string" action="/processed_string/{input_string}?Author=John%20Doe&Date=2003-01-02">
<input id="input_string" name="input_string" />
</form>
</body>
Finally, the user is ready to process their string, so they use the second form and fill in the "input_string" variable. This once again uses URI Templates, bringing the user to the next resource. Let's say that the string is the following:
{Author} wrote some books in {Date}
The results would be:
GET /processed_string/%7BAuthor%7D+wrote+some+books+in+%7BDate%7D?Author=John%20Doe&Date=2003-01-02
<body>
<a rel="self" href="/processed_string/%7BAuthor%7D+wrote+some+books+in+%7BDate%7D?Author=John%20Doe&Date=2003-01-02"/>
<span class="results">John Doe wrote some books in 2003-01-02</span>
</body>
PHEW! That's a lot of work! But it's (AFAIC) RESTful, and it fulfills the requirement of not needing to actually store ANYTHING on the server side (including the argument list, or the string that you eventually want to process).
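To make the last step concrete, here is a minimal sketch of how a server could compute the result purely from the request URL, with nothing stored between requests. The function name process_url is made up, and treating "+" in the path as a space is an assumption chosen to match the example URL above.

```python
from urllib.parse import parse_qsl, unquote_plus, urlsplit

PREFIX = "/processed_string/"


def process_url(url: str) -> str:
    """Derive the processed string from the URL alone (no server state)."""
    parts = urlsplit(url)
    # The path segment after the prefix is the percent-encoded input string.
    text = unquote_plus(parts.path[len(PREFIX):])
    # The query string is the argument list built up in the earlier steps.
    for key, value in parse_qsl(parts.query):
        text = text.replace("{" + key + "}", value)
    return text


url = ("/processed_string/%7BAuthor%7D+wrote+some+books+in+%7BDate%7D"
       "?Author=John%20Doe&Date=2003-01-02")
print(process_url(url))  # John Doe wrote some books in 2003-01-02
```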
Important Things to Note
One thing that is important here is that I wasn't just talking about URLs. In fact, the majority of the time, I'm talking about the HTML. The HTML is the hypermedia, and that is a huge part of REST that is often forgotten. All those APIs that say they are "restful" where they say "do a GET on this URL with these parameters and POST on this URL with a document that looks like this" are not practicing REST. Roy Fielding (who literally wrote the book on REST) made this observation himself.
Another thing to note is that it was quite a bit of pain just to set up the arguments. After the initial GET / to get to the root (you can think of it as the "menu") of the service, you would need five more GET calls to build an arguments resource with four key/value pairings. This could be alleviated by not using plain HTML. After all, I already went beyond HTML by using URI Templates in my example, so it is fair to say that HTML alone just isn't good enough here. Using a hypermedia format (like some derivation of XML) that supports something similar to forms, but with the ability to specify "mappings" of values, you could do this in one go. For example, we could extend the HTML media type to allow another input type called "mappings"...
So long as the client using our API understands what a "mappings" input type is, they will be able to build their arguments resource with a single GET.
At that point, you might not even need an "arguments" resource. You could just skip right to the "processed_string" resource that contains the mapping and the actual string...
What about file upload?
Okay, so originally you mentioned file uploads, and how to get this without needing to store the file. Well, basically, we can use our existing example, but replace the last step with a file.
Here, we are basically doing the same thing as before, except we are uploading a file. What is important to note is that now we are hinting to the user (through the "method" attribute on the form) that they should do a POST rather than a GET. Note that even though you hear everywhere that POST is a non-safe (it could cause changes on the server), non-idempotent operation, there is nothing saying that it MUST change state on the server.
Finally, the server can return the new file (even better would be to return some hypermedia or LOCATION header with a link to the new file, but that would require storage).
Final Comments
This is just one take on this specific example. While I hope you have gained some sort of insight, I would caution you against accepting this as gospel. I'm sure there are things I have said that are not really "REST". I plan on posting this question and answer to the REST-Discuss Mailing List to see what others have to say about it.
One main thing I hope to express through this is that your easiest solution might simply be to use RPC. After all, what was your original attempt at making it RESTful trying to accomplish? If you just want to be able to tell people that you do "REST", keep in mind that plenty of APIs that have called themselves "RESTful" have really just been RPC disguised by URLs with nouns rather than verbs.
If it was because you have heard some of the benefits of REST, and how to gain those benefits implicitly by making your API RESTful, the unfortunate truth is that there's more to REST than URLs and whether you GET or POST to them. Hypermedia plays a huge part.
Finally, sometimes you will encounter issues that mean you might do things that SEEM non-RESTful. Perhaps you need to do a POST rather than a GET because the URI (which in theory has unlimited length, but in practice runs into plenty of technical limitations) would get too long. Well then, you need to do a POST. Maybe
More resources:
REST-Discuss
My e-mail on this answer to REST-Discuss
RESTful Web Services Cookbook
Hypermedia APIs with HTML5 and Node (Not specifically about REST, but a VERY good introduction to Hypermedia)
What you are doing is not REST-ful - or, at least, is difficult to express in REST, because you are thinking about the operation first, not the objects first.
The most REST-ful expression would be to create a new "OdtTemplate" resource (or get the URI of an existing one), create a new "SetOfValues" resource, then create a "FillInTemplateWithValues" job resource that was tied to both those inputs, and which could be read to determine the status of the job, and to obtain a pointer to the final "FilledInDocument" object that contained your result.
REST is all about creating, reading, updating, and destroying objects. If you can't model your process as a CRUD database, it isn't really REST. That means you do need to, e.g., store the template on the server.
You might be better off, though, just implementing an RPC over HTTP model, and submitting the template and values, then getting the response synchronously - or one of the other non-REST patterns you named... since that is just what you want.
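For comparison, here is a minimal sketch of that RPC-over-HTTP shape: a single POST carries the template and the values, and the response body is the result, with nothing stored on the server. It borrows the plain-string simplification from the earlier answer instead of handling a real .odt file, and the JSON payload layout and the name fill_template_app are assumptions made up for this example.

```python
import json


def fill_template_app(environ, start_response):
    """WSGI app: POST {"template": "...", "values": {...}} -> filled text."""
    if environ["REQUEST_METHOD"] != "POST":
        start_response("405 Method Not Allowed", [("Allow", "POST")])
        return [b""]

    # Read exactly the request body; nothing is written to disk.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))

    text = payload["template"]
    for key, value in payload["values"].items():
        text = text.replace("{" + key + "}", str(value))

    body = text.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, fill_template_app).serve_forever() can host the app. A real service would swap the string replacement for the .odt processing and return the document bytes with the matching content type.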
If there is no value in storing the templates then option 2 is the most RESTful, but as you are aware there is the possibility of having your GET body dropped.
However, if I was a user of this system, I would find it very wasteful to have to upload the template each time I would like to populate it with values. Instead it would seem more appropriate to have the template stored and allow different requests with different values to populate the resulting documents.
I already know how to parse XML elements that contain content (<this> Content </this>) in Objective-C, but I am currently using a web service that returns the content I need between two empty elements (<begin-paragraph/> The content I need <end-paragraph/>). I have been looking online for examples of anyone else doing this, but I could not find anything. If anyone knows how to read between the two empty elements and would care to share, I would appreciate that very much.
I have to say I regard that as an abuse of XML.
But I've checked and sadly it is well formed so NSXMLParser (which I assume is what you are using) should be able to cope with it.
You basically need to track which element you are in by handling the start-element and end-element events in your NSXMLParserDelegate. After receiving the -parser:didEndElement:namespaceURI:qualifiedName: message for begin-paragraph, grab all the text you receive in -parser:foundCharacters: until you receive -parser:didStartElement:namespaceURI:qualifiedName:attributes: for end-paragraph.
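The same event-driven idea can be sketched with Python's xml.sax, which fires start, end, and character callbacks much like the NSXMLParser delegate methods named above. This is only an illustration of the state tracking, not Objective-C code, and the class name ParagraphCollector is made up.

```python
import xml.sax


class ParagraphCollector(xml.sax.ContentHandler):
    """Collect text that appears between <begin-paragraph/> and <end-paragraph/>."""

    def __init__(self):
        super().__init__()
        self.collecting = False
        self.chunks = []
        self.paragraphs = []

    def endElement(self, name):
        # The empty begin-paragraph element has just closed: start collecting.
        if name == "begin-paragraph":
            self.collecting = True
            self.chunks = []

    def startElement(self, name, attrs):
        # The end-paragraph marker is opening: stop and store the text.
        if name == "end-paragraph" and self.collecting:
            self.paragraphs.append("".join(self.chunks).strip())
            self.collecting = False

    def characters(self, content):
        # May be called several times per text run, so accumulate pieces.
        if self.collecting:
            self.chunks.append(content)


def extract_paragraphs(xml_text: str) -> list:
    handler = ParagraphCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.paragraphs
```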
I don't know what the DOM conformance on the iPhone is like, but the general procedure would be:
Navigate to <begin-paragraph /> in your DOM.
Get the next sibling of that node. That is the content you need. (Node::nextSibling property)
If there are other elements in there that you want, keep collecting them by the same method, until you reach <end-paragraph />
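The sibling-walking procedure above can be sketched with Python's xml.dom.minidom, which exposes the standard nextSibling property. Whether an equivalent DOM API is available on the iPhone is not something this sketch settles, and the function names are made up for illustration.

```python
from xml.dom import minidom


def _text_of(node) -> str:
    """Return the text of a node, including text nested in child elements."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(_text_of(child) for child in node.childNodes)


def content_between_markers(xml_text: str) -> list:
    dom = minidom.parseString(xml_text)
    results = []
    for start in dom.getElementsByTagName("begin-paragraph"):
        parts = []
        node = start.nextSibling
        # Walk siblings until the matching <end-paragraph/> marker.
        while node is not None and not (
            node.nodeType == node.ELEMENT_NODE and node.tagName == "end-paragraph"
        ):
            parts.append(_text_of(node))
            node = node.nextSibling
        results.append("".join(parts).strip())
    return results
```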
I want to parse an XML file. The file structure is the following:
<?xml version="1.0" encoding="utf-8"?>
<Level>
<p id='327'>
<Item>
<Id>5877</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn1Item1</Title>
</Item>
<Item>
<Id>5925</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn1Item4</Title>
</Item>
</p>
<p id='328'>
<Item>
<Id>5878</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn2Item1</Title>
</Item>
<Item>
<Id>5926</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn2Item4</Title>
</Item>
</p>
</Level>
In the XML above there are only 2 <p> tags, but in the actual file there are many. I want to find the specific <p> tag whose id attribute has a specific value (say 327).
One way is to parse the XML file from the start to get the desired result. Is there any other method by which I can locate the desired tag directly? For example, if I want to find the <p> tag in the XML above with id=328, the parser should not process id=327 and should directly return only those items that are related to id=328.
Please suggest.
Depends how you define "parse".
A "quick & dirty" (and potentially buggy) way would be to find the fragment using a regex search (or a custom parser) first, then feed the fragment to a true XML parser. I don't know of anything that would do this for you, you'd have to roll it yourself. I would suggest that it's not the way to go.
The next level is to feed it through a SAX-like parser (which NSXMLParser is a form of).
In your handler for the <p> element, check the id attribute and if it matches your value (or values), set a flag to indicate if child elements should be interpreted.
In your child element handlers, just check that flag first (in a raw NSXMLParser handler all elements would go to the same method, of course).
So it's true that NSXMLParser would be parsing the whole document - but just to do the minimal work to establish the correct XML parser context. The real work of handling the elements would be deferred until the value is met. I don't see any way around that without something hacky like the regex suggestion.
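Here is a minimal sketch of that flag-based approach using Python's xml.sax in place of NSXMLParser. The parser still walks the whole document, but child elements are only interpreted while the flag is set. The class and function names are made up for this example.

```python
import xml.sax


class ItemsForParagraph(xml.sax.ContentHandler):
    """Collect <Item> records only inside the <p> whose id matches."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.inside = False   # the flag set by the <p> handler
        self.items = []
        self.current = None   # dict for the <Item> being read
        self.field = None     # name of the child element being read

    def startElement(self, name, attrs):
        if name == "p":
            self.inside = attrs.get("id") == self.wanted_id
        elif not self.inside:
            return            # minimal work for non-matching sections
        elif name == "Item":
            self.current = {}
        elif self.current is not None:
            self.field = name
            self.current[name] = ""

    def characters(self, content):
        if self.inside and self.field is not None:
            self.current[self.field] += content

    def endElement(self, name):
        if name == "p":
            self.inside = False
        elif not self.inside:
            return
        elif name == "Item":
            self.items.append(self.current)
            self.current = None
        elif name == self.field:
            self.field = None


def items_for(xml_bytes: bytes, wanted_id: str) -> list:
    handler = ItemsForParagraph(wanted_id)
    xml.sax.parseString(xml_bytes, handler)
    return handler.items
```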
If this is too much overhead I'd reconsider whether XML is the right serialization format for you (assuming you have any control over that)?
If you do stick with NSXMLParser, my blog article here might help to at least make the experience nicer.
The libxml2 library can evaluate XPath queries with the following extensions. With these extensions you might issue the XPath query /Level/p[@id = "328"] to retrieve that specific node and, from it, its children.
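To show what such an attribute predicate selects, here is a small sketch using the limited XPath subset in Python's xml.etree.ElementTree rather than libxml2. Note that it still parses the whole document before the query runs, and the function name titles_for is made up for this example.

```python
import xml.etree.ElementTree as ET


def titles_for(xml_text: str, p_id: str) -> list:
    """Return the <Title> text of every <Item> under the <p> with the given id."""
    root = ET.fromstring(xml_text)   # root is the <Level> element
    # Attribute predicate, equivalent in spirit to /Level/p[@id = "328"]
    p = root.find(f"./p[@id='{p_id}']")
    if p is None:
        return []
    return [item.findtext("Title") for item in p.findall("Item")]
```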