Understanding WordprocessingML tags and avoiding unnecessary tags - ms-word

I am using the MS Word API to generate a .docx file containing data fetched from a database, applying the appropriate styles, fonts, symbols, etc. When the amount of data fetched is very large, displaying it in the .docx file becomes a problem. I found that MS Word 2007 internally writes some content through tags that may not be needed to display the data. I am therefore trying to work out which MS Word tags are actually necessary in the .xml file, so that I can avoid the unnecessary ones and build only the tags needed to display the data. In other words, I am planning to write my own .xml file containing only the required MS Word tags, rather than generating the .xml from a .docx file.
My questions are:
1) Is it true that MS Word generates some tags that are not needed when a .docx is converted to document.xml, and that these make the file heavy? If so, which tags are they, so that I can avoid them when I write my own .xml file?
2) Could you point me to links that explain the MS Word tags and their purpose, and which tags are needed and which are not?
3) Is my approach of writing a new .xml similar to document.xml worth pursuing, so that I can build the .xml with only the tags I need and improve the performance of the data display?
Please shed some light on this. Thanks in advance.
Thanks,
Rithu

You'll want to learn WordprocessingML in much more detail to do this. It certainly isn't impossible, but it is quite a learning curve to start with. Probably the best place to start is with this eBook. If you go the manual route, you'll need a zip library, since a .docx file is a zip package with document.xml inside it. If you're in Visual Studio, you can make the writing of all of this easier by using the Open XML SDK.
As to your questions on 'unnecessary tags', it's hard to believe that there would be much at all in the file that is unnecessary. But that depends on what you consider not needed. For example, if a word is caught as misspelled, Word writes proofing markers around the affected run (w:proofErr elements with type spellStart and spellEnd). If you're okay with displaying misspelled words without that marking, then those markers could be considered unnecessary. It really depends on what you're displaying the document for, and in what application.
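To give a feel for the manual route, here is a minimal sketch in Python of what "just the tags you need" looks like. It assumes you only want plain paragraphs with no styles, fonts or symbols; the function names build_document_xml and build_docx are made up for this example. It writes the three parts a .docx package needs at a minimum ([Content_Types].xml, _rels/.rels and word/document.xml) into a zip archive.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points the package at its main document part.
ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def build_document_xml(paragraphs):
    """Return document.xml with one paragraph, one run and one text node per string."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )


def build_docx(paragraphs):
    """Zip the three minimal parts into an in-memory .docx and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", build_document_xml(paragraphs))
    return buffer.getvalue()
```

You would write the result to disk with something like open("report.docx", "wb").write(build_docx(rows)). Anything beyond plain text (styles, fonts, tables) needs more markup and usually more parts, which is where the learning curve comes in.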

Related

Generate .hhk file From Word Document

I am trying to convert an MS Word file to a CHM file. I have a well-organized Word document, but I cannot figure out how to turn a Word document saved as HTML into a CHM file. I know I can add the HTML file to a project I have created, but I could not work out how to convert the Word table of contents into an index file in the HTML Help Workshop program. I would be very grateful if someone could provide an example of converting Word documents. (I am trying to achieve this through the HTML Help Workshop program.)
Best regards,
Converting a Word document to CHM format is difficult without special (often expensive) tools and has a learning curve.
You should consider whether the PDF format would be sufficient. That said, the CHM format, which is integrated into the Windows operating system, does of course have some popular features.
I recommend reading through "Search and Index not working after converting from Word 2016 to CHM".
As I mentioned in my answer there, I have never used chmProcessor myself (I use other tools), but it seems to be a surprisingly good option for converting Word documents in a simple way.
Please try chmProcessor for your needs. You may want to ask a new question here on SO later.
Edit:
Maybe you have additional interest in the following CodeProject article:
How to Easily Write a User's Guide for Your Application using Different File Extensions

Word 2010 additional file format

I'm not sure whether this is the best approach, or whether I should perhaps ask the question more clearly.
What I want to do is to create an additional file output - e.g. if the user uses Word to create a description consisting of known tags, I want to be able to save this as bbcode.
Now, I do have an idea of how to do the conversion itself, but is there a way to, say, add another file format to the "Save file" dialog box and have it run a parser and file writer that would read the current document and export it using known bbcode tags (which could perhaps be adjusted from some configuration window)?
The result would be a file containing bbcode as well as the text information that the user has entered.
How would I hook up my addin to the file output dialog? Is there a way to do this? I'm not sure it's custom XML since I won't be using the XML at all.
Thanks in advance and please excuse my poor English.
Edit: after having a look at the Word 2010 AddIn project, I realized that I'm looking for a way to define my own "export" format. I'd like to export the BBCode to a .txt (or even .bbcode) file. Microsoft.Office.Interop.Word.WdExportFormat seems to be a fixed enumeration. Is there a way to add an export format?
There is some code for this here:
phpbb.com/community/viewtopic.php?f=17&t=395554

Best way to get a database friendly list of Veteran Affairs Hospital

I sincerely apologize if this isn't the proper forum to discuss this, but I wasn't sure where to go or what would be the best option.
Basically, I'm trying to find a database friendly list of veteran affairs hospitals. The closest thing that I've been able to find is www.va.gov/ofcadmin/docs/CATB.pdf as it has all the information I'm looking for:
Region
Address
City in a separate column
Zip Code in a separate column
State
Facility # (also known as StationID)
VISN
Symbol
I've tried exporting that PDF out into CSV but it's a complete nightmare to get working. So, I was curious if anyone had any ideas or insights into how I could accomplish this task.
First, here's a CSV file containing the data found in CATB.pdf. The very first line contains the column headers, and the rest of the file contains the contents.
http://tmp.alexloney.com/CATB.csv
Now, for the more detailed explanation...I took the PDF you provided a link to, converted it to an HTML document using Adobe Acrobat, then I used a lot of Regular Expressions to parse the file and clean it up. Once the file was cleaned up enough, I was able to write a program to parse through the remainder of the file, grab the state and region, and spit it all out in a nicely formatted CSV.
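As a rough illustration of that last step, here is a sketch in Python. It is not the program I actually used, and the line layout is an assumption made up for the example: after cleanup, each facility is assumed to sit on one line as "StationID | Address | City | Zip", with "Region ..." and "State: XX" header lines above the facilities they apply to. The patterns, field names and sample values are all hypothetical.

```python
import csv
import io
import re

# Hypothetical layout of the cleaned-up text (not the real CATB format).
REGION_RE = re.compile(r"^Region\s+(?P<region>\S.*)$")
STATE_RE = re.compile(r"^State:\s*(?P<state>[A-Z]{2})$")
FACILITY_RE = re.compile(
    r"^(?P<station_id>\d{3}\w*)\s*\|\s*(?P<address>[^|]+?)\s*\|"
    r"\s*(?P<city>[^|]+?)\s*\|\s*(?P<zip>\d{5})$"
)

FIELDS = ["Region", "State", "StationID", "Address", "City", "Zip"]


def parse_lines(lines):
    """Carry the current region and state forward onto each facility row."""
    region = state = ""
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = REGION_RE.match(line)
        if match:
            region = match.group("region")
            continue
        match = STATE_RE.match(line)
        if match:
            state = match.group("state")
            continue
        match = FACILITY_RE.match(line)
        if match:
            rows.append({
                "Region": region,
                "State": state,
                "StationID": match.group("station_id"),
                "Address": match.group("address"),
                "City": match.group("city"),
                "Zip": match.group("zip"),
            })
        # Lines matching none of the patterns are skipped as leftover noise.
    return rows


def to_csv(rows):
    """Write the rows as CSV text with the column headers on the first line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Keeping the zip code as text (rather than converting it to a number) preserves leading zeros in the output.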
Hope that helps you!
I believe that PDFill has an option that will convert a PDF file to Excel. Once the data is in Excel, you should have no problem converting it to a CSV file.

Which module is efficient for parsing a .pdf file in one go? CAM::PDF or PDF::API2

I want to extract all the keywords from a huge PDF file (50 MB).
Which module is good for parsing large PDF files?
I'm concerned about memory use when parsing a huge file and extracting almost all the keywords.
I want SAX-style (single-pass) parsing here, not DOM-style parsing, by analogy with XML.
To read text out of a PDF, we use CAM::PDF, and it worked just fine. It wasn't hugely fast on some larger files, but its ability to handle large files was not bad. We certainly had a few that were ~100 MB and which were handled OK. If I recall, we struggled with a few that were 130 MB on a 32-bit (Windows) Perl, but we had a whole lot of other stuff in memory at the time. We did look at PDF::API2, but it seemed more oriented to generating PDFs than to reading from them. We didn't throw large files into PDF::API2, so I can't give a real benchmark figure.
The only significant downside we found with using CAM::PDF is that PDF 1.6 is becoming more common, and that doesn't work at all in CAM::PDF yet. That might not be an issue for you, but it might be something to consider.
In answer to your question, I'm pretty sure both modules read the whole source PDF into memory in one form or another, but I don't think CAM::PDF builds as many complex structures out of it. So neither is really SAX-like, but CAM::PDF seemed to be lighter in general, and it can retrieve one page at a time, so it might reduce the load when extracting very large texts.
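The page-at-a-time idea looks roughly like the sketch below. It is written in Python rather than Perl purely to show the pattern, and it does not use CAM::PDF: the pages argument stands in for whatever yields the text of one page at a time, and count_keywords, fake_pages and the word pattern are invented for the example. Only the running counts and the current page are held in memory.

```python
import re
from collections import Counter
from typing import Iterable, Iterator

# Very rough notion of a "keyword": a run of letters, apostrophes or hyphens.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]+")


def count_keywords(pages: Iterable[str], min_length: int = 4) -> Counter:
    """Count words page by page, so each page's text can be discarded after use."""
    counts: Counter = Counter()
    for page_text in pages:
        for word in WORD_RE.findall(page_text):
            if len(word) >= min_length:
                counts[word.lower()] += 1
    return counts


def fake_pages() -> Iterator[str]:
    """Stand-in for a real page extractor; yields one page of text at a time."""
    yield "Parsing large PDF files"
    yield "Large files need streaming parsing"
```

Calling count_keywords(fake_pages()) walks the generator once. With a real extractor, the underlying library may still hold the whole PDF in memory, so this only limits the size of the extracted text, not of the parsed file.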

Conversion between docx / doc / rtf and lightweight markup

I am looking for a tool or set of tools to convert between file formats D and M where
D is a format handled by MSWord, in order of preference, docx, doc, rtf
M is a lightweight markup, such as markdown, textile, txt2tags, it can be an esoteric one
there is a way to generate html from M
conversion is two-way, it's done both from D to M, and from M to D
utf-8 encoding is handled properly
the content is simple, paragraphs, some simple formatting like bold and italics, maybe lists
the tools are platform-independent
What I've found so far
TeX, LaTeX -- too heavyweight
docx2txt -- too lightweight, it supports no formatting at all
html -- MSWord produces bloated html
a few one-way conversions, like doc to mediawiki,
UPDATE:
The use case is a document workflow between technical and non-technical people
I, the technical guy edit a document in plain text, put it into version control, etc.
I send it to my manager or other non-technical people
They add comments, make changes to it using their Word, then they send it back to me
I want to simply grok their changes, make my changes, put it into version control, without having to use Word
I think Pandoc more than meets all of these requirements.
http://pandoc.org
Adam, I've used docx4j to convert docx to html, edit the html in CKEditor, and then use docx4j to convert the html back to docx. My process made some assumptions about the css (i.e. it was designed to handle docx4j's clean html, and editing in CKEditor).
You don't say whether there is a way to generate M from HTML?
This is probably hard to do two-way, since you will have impedance mismatches between the various formats.
The best world I can think of would be a sort of Wiki / Word hybrid: Maybe you can get Google Wave to do that for you?
Another solution that might work is a CMS like Plone (did they ever add WYSIWYG capability? I stopped caring after version 1). Keep your documents there. Let the system handle changes, annotations etc. You can automate retrieval of the source (should be ReStructuredText) and commit that to your source control if you have to.
This script I wrote might help you in your workflow:
https://github.com/matb33/docx2md
It is a command-line PHP script that will only work with .docx files. It will extract the XML, run some XSL transformations, and provide you the result in Markdown format.
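To show the general idea (this is not the script's code, and it walks the XML tree with Python's ElementTree instead of applying XSL transformations), here is a minimal sketch that opens the .docx as a zip archive, reads word/document.xml and emits Markdown for paragraphs with bold and italic runs. The function names are made up for the example, and it assumes simple documents: lists, headings and runs nested inside hyperlinks are ignored, and formatted runs with trailing spaces are not tidied up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, tag):
    """True if a toggle such as w:b or w:i is present and not switched off."""
    if run_props is None:
        return False
    element = run_props.find(W + tag)
    if element is None:
        return False
    return element.get(W + "val", "true") not in ("0", "false")


def run_to_markdown(run):
    """Convert one w:r element to text, wrapping it for bold and/or italic."""
    text = "".join(t.text or "" for t in run.findall(W + "t"))
    if not text.strip():
        return text
    props = run.find(W + "rPr")
    if _is_on(props, "b"):
        text = "**" + text + "**"
    if _is_on(props, "i"):
        text = "*" + text + "*"
    return text


def docx_to_markdown(docx_bytes):
    """Extract word/document.xml from the package and join paragraphs as Markdown."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # Only runs that are direct children of the paragraph are considered.
        line = "".join(run_to_markdown(r) for r in paragraph.findall(W + "r"))
        if line.strip():
            paragraphs.append(line)
    return "\n\n".join(paragraphs)
```

You would call it with the file's bytes, for example docx_to_markdown(open("notes.docx", "rb").read()), and commit the returned text to version control.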
I encourage you to send me .docx files that don't convert accurately. I'd love to make this script as robust and reliable as possible.