Prevent Word 2010 from saving o:gfxdata base64 or uuencoded VML? - ms-word

I am working with .docx files containing several drawing canvases with images inserted and some lines and arrows drawn in Word 2010. I am using 2010 format with no compatibility mode.
Word inserts an o:gfxdata attribute into each v:shape and v:group element and fills it with ASCII-encoded data of some kind. From what I have read, it may be a copy of the VML describing the v:shape or v:group. I may simply not know what to look for, but I cannot determine what this data is for: removing it has no apparent effect on my ability to read or edit the document in Word 2003, 2007, or 2010.
It does swell the document.xml to almost twice the (apparent) necessary size. This considerably slows OpenTBS' processing so I would like to remove it, if possible. Does anyone know of a way to tell Word 2010 to quit saving this extra data? Or what it is for? I have really struggled to find any documentation on it beyond this post.
Edit:
Here is a sample .docx. The document.xml is ~141KB and OpenTBS takes an average of 10.35 seconds to create a file that includes this as a subtemplate 21 times. If I remove all of the o:gfxdata attributes, the file size is reduced to ~37KB and OpenTBS takes only 2.99 seconds to produce the same file.
Edit 2:
After further investigation, it appears that removing the o:gfxdata attributes may cause Word 2003 with an older Compatibility Pack installed to object to the file with the following error:
"This is a pre-release version of the Compatibility Pack and can open
pre-release Office 2007 files only. Do you want to check for a newer
version of the Compatibility Pack?"
I have been able to open the file by installing a newer compatibility pack - though it prompts the user about the incompatibility and converts the file in order to open it. This does not damage my file, but it is something to look out for.

The o:gfxdata attribute is poorly documented on the web.
Judging from your investigation, it is some kind of extra compatibility information.
You can delete those attributes in your template using OpenTBS.
The cleaning can be done once on your template, without any merging, after which you save the cleaned template as a new template. Or you can perform the cleaning each time you open the template.
Cleaning the DOCX file:
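// Find the next start tag that still carries an o:gfxdata attribute
while ($x = clsTbsXmlLoc::FindStartTagHavingAtt($TBS->Source, 'o:gfxdata', 0) ) {
    // Empty the attribute value in place
    $x->ReplaceAtt('o:gfxdata', '');
    // Remove the now-empty attribute so the loop does not find it again
    $TBS->Source = str_replace(' o:gfxdata=""', '', $TBS->Source);
}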
Note that the class clsTbsXmlLoc is provided with OpenTBS and is undocumented.
The code should work with OpenTBS 1.8.0 and later (1.8.0 is currently in stable beta).
I've noticed that once the o:gfxdata attributes are deleted, they do not come back immediately when you edit the docx.
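For a one-off cleaning outside OpenTBS, the same idea can be sketched with Python's standard library. This is only a sketch under assumptions: the function names are made up for illustration, only word/document.xml is touched, and the attribute value is assumed to be double-quoted with no literal double quote inside it (true for base64 text). Test it on a copy of the template first.

```python
import re
import zipfile

# Matches one o:gfxdata="..." attribute plus the whitespace before it.
# Assumption: the value is double-quoted and contains no literal '"'.
GFXDATA_ATTR = re.compile(rb'\s+o:gfxdata="[^"]*"')


def strip_gfxdata(xml_bytes):
    """Return the XML with every o:gfxdata attribute removed."""
    return GFXDATA_ATTR.sub(b"", xml_bytes)


def clean_docx(src_path, dst_path):
    """Copy a .docx, stripping o:gfxdata from word/document.xml only."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == "word/document.xml":
                data = strip_gfxdata(data)
            dst.writestr(name, data)
```

Calling clean_docx('template.docx', 'template_clean.docx') (hypothetical file names) writes a cleaned copy and leaves the original untouched.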

Related

Use older MATLAB save formats

I'm running a model that has a bunch of DLLs which read some .mat files.
When I use an old version of MATLAB (I think 2011a) to generate the files I get files that work okay, but when I create them with 2017a the files seem not to work with the same script.
I've used 2017 to read in the working 2011 file and then saved it, and these files also don't work.
I've also tried the above with the '-vXX' settings at all available values according to the help, with no success.
Example:
clear; load('v2011file.mat'); save('v2017copy.mat', '-v6', 'var1', 'var2', 'var3');
One thing that I have noticed between the two is that when they're selected in the "Current folder" browser, the preview always shows the 2017 files with the variable names in alphabetical order, regardless of the order that I saved them in, while the older 2011 file seems to maintain the order that they were saved. I can only assume that this is something related to a change in the way that files are saved - it might not be the problem but it does hint toward a change (it does this whether or not I include '-vXX' to use older formats).
It's probably worth noting that the 2011 files are created on XP, while the 2017 files are made on Windows 7.
Essentially I'm looking for anyone who might know whether it's possible for me to change the way the file is put together by MATLAB, rather than having to change the DLLs to accept a newer file.
It looks like I can work around the save order issue and have something that works by doing:
save('new2017file.mat', 'var1');
save('new2017file.mat', 'var3', '-append');
save('new2017file.mat', 'var2', '-append');
This means I can put them in a specific order. I also have to have the default save format set to -v7 under Preferences > General > MAT-Files.
I wouldn't say no to a more elegant answer if there's one available though!

Huge docx filled with <w:p> tags

My girlfriend is writing a Word document for a homework assignment. She's using the old .doc format as required by her teacher ( :'( ).
At some point, the .doc file went from 150 kB to 2.6 MB with no noticeable change (seen in the Dropbox history; sadly, Word's comparison function fails because Word crashes). From that point on, she was unable to save her document without crashing Word...
I converted the .doc to .docx, unzipped it, and found an 18 MB document.xml file!
I can't even format the XML properly because it crashes Notepad++, but I can see that the file is filled with the same XML tag repeating over and over:
<w:p w:rsidR="002A70E5" w:rsidRDefault="002A70E5" w:rsidP="00565ED9"/>
Do you have any idea what could cause this?
EDIT: Here's the docx
EDIT2: The motivation for this question is more curiosity than looking for a fix. Thanks for your answers though.
If you're willing to edit the XML directly, you can just delete all the empty <w:p> tags and rezip.
If you're good with Python, you might give python-docx a try and use it to delete all empty paragraphs.
Hopefully that will at least recover the work she's done so far.
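As a rough illustration of the direct-XML route, here is a minimal sketch using only Python's standard library (not python-docx). The function name is made up for this example. Rather than deleting every empty paragraph, it collapses each run of consecutive empty paragraphs down to one, on the assumption that a single blank paragraph may be intentional or structurally required (for instance inside a table cell).

```python
import re

# One self-closing paragraph, e.g.
# <w:p w:rsidR="002A70E5" w:rsidRDefault="002A70E5" w:rsidP="00565ED9"/>
# It does not match <w:pPr/> or paragraphs that have content.
EMPTY_P = rb'<w:p(?:\s[^<>]*)?/>'

# A run of two or more empty paragraphs; group 1 captures the first one.
EMPTY_RUN = re.compile(rb'(' + EMPTY_P + rb')(?:' + EMPTY_P + rb')+')


def collapse_empty_paragraphs(xml_bytes):
    """Replace each run of consecutive empty paragraphs with its first one."""
    return EMPTY_RUN.sub(lambda match: match.group(1), xml_bytes)
```

The function works on the raw bytes of word/document.xml, so the result still has to be written back into the package and rezipped before Word can open it.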
Not sure how this would happen, or whether it matters much. Only thing I can think of is a sticking Return key on the keyboard that would insert a huge number of carriage returns. Those each insert a new paragraph. I've actually had that happen occasionally on a Windows virtual machine running on my Mac. No clue why it does it though.
The tag you are talking about is part of the Open XML format used to build Word documents. Open XML stores the document as a zipped package, and I am afraid what you are seeing is the unzipped document.xml file. If you want to keep working with the document, just convert the .doc file to .docx. Don't unzip it.

current scctext replacement for textual representation of vfp binary files

What are people using in vfp 9 for a replacement for the built-in scctext.prg that translates binary files in vfp to a textual representation?
We’re moving an existing project that’s in vfp 9 sp1 into tfs source control, but we need a way to make sure that the non-textual files can get the benefits of comparison that only non-binary text files allow. We plan to check both the textual representation and the binary file into source control (the binary is more for the “just in case” scenario).
According to the document at
http://www.ita-software.com/papers/Borup_Mercurial_Published.pdf
there are at least three options for converting .scx, .frx, .lbx, .prj and other non-prg dbf files in visual foxpro (vfp) to a textual representation. Only some of them allow for converting the textual information back to binary - not sure how often we’d really use that or not.
ALTERNATE SCCTEXT
This one seems older, with the latest version from 2009. I'm not sure if it’s still the preferred tool, and it seems to have no way to take the textual representation and convert it back to a binary file.
http://vfpx.codeplex.com/releases/view/12955
TWOFOX
This one seems similar to foxbin2prg except that it creates xml files. It seems that only one dev is working on it, unlike the others, which are open to contributions, so I'm not sure how current it is or how much it’s being used by other developers. It does have two-way conversion like foxbin2prg has.
http://www.foxpert.com/downloads.htm
FOXBIN2PRG
This one is fairly recent, but I'm not sure if it’s production-ready enough to use for production coding work. It does have two-way conversion.
http://vfpx.codeplex.com/releases/view/116407
TRIGGER INVOKE ONE OF THE ABOVE ON CHANGE OF BINARY FILES IN VFP IDE
What are people using to invoke these textual representation options?
I’ve seen this class that was created to run one of the programs listed above for all files in the project. Apparently it does it when the date time of the last generate is older than the date time on the textual version of the file. One detriment I’ve read is that it generates for foundation classes and other things that really are not items that a dev is working on (code that is referenced by but not included in your project).
http://codepaste.net/9yy1gm
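To make the timestamp idea concrete, here is a minimal Python sketch of the kind of check such a trigger could perform. Everything in it is an assumption for illustration: the function name is made up, the extension mapping is hypothetical and must be adjusted to whatever the chosen converter writes, and the rule used (regenerate when the text file is missing or older than the binary) is one plausible reading of the behaviour described above. It only lists candidates; it does not invoke any converter.

```python
from pathlib import Path

# Hypothetical mapping from VFP binary extensions to their text twins.
# Assumption: adjust this to the extensions your converter actually produces.
TEXT_TWIN = {".scx": ".sc2", ".vcx": ".vc2", ".frx": ".fr2", ".lbx": ".lb2"}


def stale_binaries(project_dir):
    """Return binaries whose text twin is missing or older than the binary."""
    stale = []
    for path in sorted(Path(project_dir).rglob("*")):
        twin_ext = TEXT_TWIN.get(path.suffix.lower())
        if twin_ext is None or not path.is_file():
            continue  # not a binary type we track
        twin = path.with_suffix(twin_ext)
        if not twin.exists() or twin.stat().st_mtime < path.stat().st_mtime:
            stale.append(path)
    return stale
```

Scanning only the project's own directory, rather than everything the project references, is one way to avoid regenerating text for foundation classes.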
Thanks for any advice from those that are using vfp 9 with source control out there!
You should check out the scX library written by Paul McNett which is published on Ed Leafe's web site. I haven't used it in a mission-critical software project yet, but I have tested it out. It seemed to catch all the potential problems I've encountered with other scctext replacements.
I haven't used it in a big project for a couple of reasons:
1) It is a breaking change for source control history. So, comparing source code in your current SCA or VCA files with the new files generated by scX isn't going to be simple.
2) It isn't a drop-in replacement for scctext. Instead of checking files into and out of source control directly from the IDE, you'll have an intermediary folder. You'll check your files out of source control into one folder, convert them to FoxPro format, and then edit them in the FoxPro IDE. Then, you'll save your changes in the FoxPro IDE, convert them to scX format, and then check them into source control.
I'm sure much of #2 can be automated; but combined with #1, making the change to scX wasn't worth it for me.
FoxBin2Prg is production ready and, AFAIK, it's the only tool that allows diff and merge of the generated text (tx2) files and can regenerate the binaries from them.
The generated files are PRG style, so developers can read them as if they were modifying a PRG (with PROC/ENDPROC structures and such), but they aren't meant to be compiled. The primary use is for SCM tools, but it can be used separately.
I'm actually using it on production code with a 10-member team making concurrent modifications to forms and classes.
Some documentation is available on VFPx in English and Spanish. Internal messages are available in both languages, and from version v1.19.24 a German translation is available too.
More info on the VFPx site.
Best regards!

Apache FOP generated PDFs printing strangely from Adobe Reader; OK on screen

I'm not sure how this can be a FOP issue, but I've never seen it with PDFs from any other source, so I've tried to investigate further.
Our application creates PDFs via xsl-fo, using FOP. This has worked great for a couple of years -- occasionally a user will have trouble printing a specific document, and see a very particular type of corruption, wherein most characters are "incremented". That is to say, 1 becomes 2, M becomes N, period becomes a slash, and the word invoice becomes the mildly amusing "jowpjdf". The document displays fine (typically in Adobe Reader). We've generally worked around it, but now an even odder case presents itself.
A new addition to our application creates 2 substantially similar PDFs with FOP, then concatenates them using Perl's PDF::Reuse, which grabs the files from the filesystem and creates a new document that is then sent to the user by email. The user opens the document fine in Reader, hits print, and something new happens... Page 1 prints perfectly, but page 2 is corrupt in exactly the manner described above.
If it was a consistent print driver issue, I'd expect to see both pages corrupted. If it was a FOP issue, likewise. If it was a PDF::Reuse issue, I'd expect to see more fundamental breakage, and this breakage is not new since we started concatenating documents. I'm at a loss where to investigate next.
Has anyone seen similar corruption in PDFs, especially when generating using Apache FOP?
tl;dr PDFs created using FOP sometimes print with every character shifted by 1, e.g. A->B, 3->4
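To pin down the symptom, here is a tiny Python sketch that reproduces the described shift. It models only what appears on paper, assuming every affected character's code point is moved up by exactly one; it says nothing about the cause, and the function name is made up for this example.

```python
def shift_text(text, offset=1):
    """Shift every character's code point by the given offset."""
    return "".join(chr(ord(ch) + offset) for ch in text)


print(shift_text("invoice"))      # jowpjdf
print(shift_text("1M."))          # 2N/
print(shift_text("jowpjdf", -1))  # invoice
```

Running a corrupted printout through the function with offset=-1 is a quick way to check whether a given page shows this same uniform shift or some other kind of damage.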

Understanding WordProcessingML tags and avoid unnecessary tags

I am using the MS Word API to generate a .docx containing data fetched from a DB, to which I apply the respective styles, fonts, symbols, etc. If the amount of data fetched from the DB is very large, there is a problem displaying it in the .docx file. I found that MS Word 2007 internally writes some content through tags that may not be needed to display the data. So I am trying to figure out which MS Word tags are actually necessary in the .xml file, so that I can avoid the unnecessary ones and emit only those needed to display the data. My plan is to write my own .xml with just the required MS Word tags, rather than generating the .xml from a .docx file.
My queries are:-
1) Is it right that MS Word generates some tags that may not be needed when a .docx is converted to document.xml, and that this makes it heavy? If so, which tags are they, so that I can avoid them when I write my own .xml file?
2) Please send links that explain the MS Word tags and their purpose: which tags are needed and which are not?
3) Is my approach of writing a new .xml similar to document.xml (from the .docx conversion) worth pursuing, so that I can build the .xml with only the tags I need and improve the performance of the data display?
Please shed some light on this, and thanks in advance.
Thanks,
Rithu
You'll want to learn WordprocessingML in much more detail to do this. It certainly isn't impossible, but it is quite a learning curve to start with. Probably the best place to start is with this eBook. If you go the manual route, you'll need a zip technology. If you're in Visual Studio, you can make the writing of all of this easier by using the Open XML SDK.
As to your questions on 'unnecessary tags', it's hard to believe that there would be much at all in the file that is unnecessary. But that depends on what you consider not needed - for example, if a word is caught as misspelled, there will be "dirty=1" attribute on the Run tag. If you're okay with displaying misspelled words, then that could be considered unnecessary. Really depends on what you're displaying for and in what.
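For the manual route, here is a minimal sketch of what "your own XML plus a zip technology" can look like, using only Python's standard library. The function names are made up for this example, and the package is deliberately bare: one content-types part, one root relationship, and a document.xml holding plain paragraphs with no styles, fonts, or section properties. Treat it as a starting point for experiments, not as a complete writer.

```python
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def document_xml(paragraphs):
    """Build a document.xml with one plain run per paragraph."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )


def write_minimal_docx(target, paragraphs):
    """Write the three parts into a zip; target is a path or file object."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", document_xml(paragraphs))
```

Comparing this document.xml with the one Word itself saves for the same text is a practical way to see which elements Word adds on its own, and then to decide which of them matter for your display.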