How to edit pasted content using the Open XML SDK - ms-word

I have a custom template in which I'd like to control (as best I can) the types of content that can exist in a document. To that end, I disable controls, and I also intercept pastes to remove some of those content types, e.g. charts. I am aware that this content can also be drag-and-dropped, so I also check for it later, but I'd prefer to stop or warn the user as soon as possible.
I have tried a few strategies:
RTF manipulation
Open XML manipulation
RTF manipulation is so far working fairly well, but I'd really prefer to use Open XML as I expect it to be more useful in the future. I just can't get it working.
Open XML Manipulation
The wonderfully undocumented (as far as I can tell) "Embed Source" format appears to contain a compound document object, which I can use to modify the copied content using the Open XML SDK. But I have been unable to put the modified content back into an object that pastes correctly.
The modification part seems to work fine. I can see, if I save the modified content to a temporary .docx file, that the changes are being made correctly. It's the return to the clipboard that seems to be giving me trouble.
I have tried assigning just the Embed Source object back to the clipboard (so that the other types such as RTF get wiped out), and in this case nothing at all gets pasted. I've also tried re-assigning the Embed Source object back to the clipboard's data object, so that the remaining data types are still there (but with mismatched content, probably), which results in an empty embedded document getting pasted.
Here's a sample of what I'm doing with Open XML:
using System.IO;
using OpenMcdf;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using Forms = System.Windows.Forms;
...
IDataObject dataObj = Forms.Clipboard.GetDataObject();
object embedSrcObj = dataObj.GetData("Embed Source");
if (embedSrcObj is Stream)
{
    // read it with OpenMCDF
    Stream stream = embedSrcObj as Stream;
    CompoundFile cf = new CompoundFile(stream);
    CFStream cfs = cf.RootStorage.GetStream("Package");
    byte[] bytes = cfs.GetData();
    string savedDoc = Path.GetTempFileName() + ".docx";
    File.WriteAllBytes(savedDoc, bytes);

    // And then use the Open XML SDK to read/edit the document:
    using (WordprocessingDocument openDoc = WordprocessingDocument.Open(savedDoc, true))
    {
        OpenXmlElement body = openDoc.MainDocumentPart.RootElement.ChildElements[0];
        foreach (OpenXmlElement ele in body.ChildElements)
        {
            if (ele is Paragraph)
            {
                Paragraph para = (Paragraph)ele;
                if (para.ParagraphProperties != null && para.ParagraphProperties.ParagraphStyleId != null)
                {
                    string styleName = para.ParagraphProperties.ParagraphStyleId.Val;
                    Run run = para.LastChild as Run; // I know I'm assuming things here but it's sufficient for a test case
                    run.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Text("test"));
                }
            }
            // etc.
        }
        openDoc.MainDocumentPart.Document.Save(); // I think this is redundant in later versions than what I'm using
    }

    // repackage the document
    bytes = File.ReadAllBytes(savedDoc);
    cf.RootStorage.Delete("Package");
    cfs = cf.RootStorage.AddStream("Package");
    cfs.Append(bytes);

    MemoryStream ms = new MemoryStream();
    cf.Save(ms);
    ms.Position = 0;

    dataObj.SetData("Embed Source", ms);
    // or,
    // Forms.Clipboard.SetData("Embed Source", ms);
}
Question
What am I doing wrong? Is this just a bad/unworkable approach?

Related

Word OpenXml Word Found Unreadable Content

We are trying to manipulate a Word document to remove a paragraph based on certain conditions, but the file produced always ends up corrupted, failing to open with the error:
Word found unreadable content
The code below corrupts the file, but if we remove the line:
Document document = mdp.Document;
then the file is saved and opens without issue. Is there an obvious issue that I am missing?
var readAllBytes = File.ReadAllBytes(@"C:\Original.docx");
using (var stream = new MemoryStream(readAllBytes))
{
    using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
    {
        MainDocumentPart mdp = wpd.MainDocumentPart;
        Document document = mdp.Document;
    }
}
File.WriteAllBytes(@"C:\New.docx", readAllBytes);
UPDATE:
using (WordprocessingDocument wpd = WordprocessingDocument.Open(@"C:\Original.docx", true))
{
    MainDocumentPart mdp = wpd.MainDocumentPart;
    Document document = mdp.Document;
    document.Save();
}
Running the code above on a physical file, we can still open Original.docx without the error, so the problem seems limited to modifying a stream.
Here's a method that reads a document into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
    byte[] buffer = File.ReadAllBytes(path);
    var destStream = new MemoryStream(buffer.Length);
    destStream.Write(buffer, 0, buffer.Length);
    destStream.Seek(0, SeekOrigin.Begin);
    return destStream;
}
Note how the MemoryStream is instantiated. I am passing the capacity rather than the buffer (as in your own code). Why is that?
When using MemoryStream() or MemoryStream(int), you create a resizable MemoryStream instance, which you will want in case you make changes to your document. When using MemoryStream(byte[]) (as in your code), the MemoryStream instance is not resizable, which is problematic unless you make no changes to your document, or your changes only ever shrink it.
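A minimal illustration of the difference, using the same file as above:
byte[] buffer = File.ReadAllBytes(@"C:\Original.docx");

// Fixed-size: wraps the array directly and cannot grow.
var fixedStream = new MemoryStream(buffer);

// Resizable: starts at the same capacity but can expand as needed.
var resizableStream = new MemoryStream(buffer.Length);
resizableStream.Write(buffer, 0, buffer.Length);

// Writing past the end of the fixed-size stream throws
// NotSupportedException ("Memory stream is not expandable"),
// while the resizable stream simply grows.
fixedStream.Seek(0, SeekOrigin.End);
// fixedStream.WriteByte(0); // would throw NotSupportedException
resizableStream.Seek(0, SeekOrigin.End);
resizableStream.WriteByte(0); // fine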
Now, to read a Word document into a MemoryStream, manipulate that Word document in memory, and end up with a consistent MemoryStream, you will have to do the following:
// Get a MemoryStream.
// In this example, the MemoryStream is created by reading a file stored
// in the file system. Depending on the Stream you "receive", it makes
// sense to copy the Stream to a MemoryStream before processing.
MemoryStream stream = ReadAllBytesToMemoryStream(@"C:\Original.docx");

// Open the Word document on the MemoryStream.
using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
{
    MainDocumentPart mdp = wpd.MainDocumentPart;
    Document document = mdp.Document;
    // Manipulate document ...
}

// After having closed the WordprocessingDocument (by leaving the using statement),
// you can use the MemoryStream for whatever comes next, e.g., to write it to a
// file stored in the file system. Note the use of ToArray() rather than
// GetBuffer(): the latter returns the whole internal buffer, which can be
// larger than the actual content.
File.WriteAllBytes(@"C:\New.docx", stream.ToArray());
Note that you will have to reset the stream.Position property by calling stream.Seek(0, SeekOrigin.Begin) whenever your next action depends on the stream's Position property (e.g., CopyTo, CopyToAsync). Right after having left the using statement, the stream's position will be equal to its length.
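For example (the target path here is made up for the sketch):
// Reset the position before handing the stream to a positional consumer.
stream.Seek(0, SeekOrigin.Begin);
using (var fileStream = File.Create(@"C:\Copy.docx"))
{
    stream.CopyTo(fileStream);
}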

Apache Tika - Parsing and extracting only metadata without reading content

Is there a way to configure Apache Tika so that it only extracts the metadata properties from the file and does not access the content of the file? We need a way to do this so as to avoid reading the entire content of larger files.
The extraction code we are using is as follows:
var tikaConfig = TikaConfig.getDefaultConfig();
var metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
BodyContentHandler handler = new BodyContentHandler();
using (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata))
{
    parser.parse(stream, handler, metadata, new ParseContext());
    Array metadataKeys = metadata.names();
    Array.Sort(metadataKeys);
}
With the above code sample, the content is read even when we only want the metadata. We need a way to avoid that.
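One option worth trying (a sketch, assuming the same Tika bindings as the snippet above, where SAX's org.xml.sax.helpers.DefaultHandler is available) is to pass a no-op handler instead of a BodyContentHandler. Tika still has to parse the stream to discover the metadata, but this avoids accumulating the body text in memory:
var tikaConfig = TikaConfig.getDefaultConfig();
var metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
// DefaultHandler discards all SAX content events, so nothing is buffered.
var handler = new org.xml.sax.helpers.DefaultHandler();
using (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata))
{
    parser.parse(stream, handler, metadata, new ParseContext());
    Array metadataKeys = metadata.names();
    Array.Sort(metadataKeys);
}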

How to store and compare annotation (with Gold Standard) in GATE

I am very comfortable with UIMA, but my new work requires me to use GATE.
So, I started learning GATE. My question is about how to calculate the performance of my tagging engines (Java-based).
With UIMA, I generally dump all my system annotations into an XMI file and then, using Java code, compare them with human-annotated (gold standard) annotations to calculate precision, recall and F-score.
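(For reference, with TP, FP and FN counted as matching, spurious and missing annotations respectively, the scores are: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-score = 2 * precision * recall / (precision + recall).)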
But I am still struggling to find something similar in GATE.
After going through GATE's Annotation-Diff and other info on that page, I feel there has to be an easy way to do it in Java, but I am not able to figure out how. I thought I would put this question here, as someone might have already figured it out.
How to store system annotations in an XMI file (or any other format) programmatically.
How to create one-time gold standard data (i.e. human-annotated data) for performance calculation.
Let me know if you need more specifics or details.
This code seems helpful for writing the annotations to an XML file:
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/BatchProcessApp.java
String docXMLString = null;
// if we want to just write out specific annotation types, we must
// extract the annotations into a Set
if(annotTypesToWrite != null) {
  // Create a temporary Set to hold the annotations we wish to write out
  Set annotationsToWrite = new HashSet();
  // we only extract annotations from the default (unnamed) AnnotationSet
  // in this example
  AnnotationSet defaultAnnots = doc.getAnnotations();
  Iterator annotTypesIt = annotTypesToWrite.iterator();
  while(annotTypesIt.hasNext()) {
    // extract all the annotations of each requested type and add them to
    // the temporary set
    AnnotationSet annotsOfThisType =
        defaultAnnots.get((String)annotTypesIt.next());
    if(annotsOfThisType != null) {
      annotationsToWrite.addAll(annotsOfThisType);
    }
  }
  // create the XML string using these annotations
  docXMLString = doc.toXml(annotationsToWrite);
}
// otherwise, just write out the whole document as GateXML
else {
  docXMLString = doc.toXml();
}

// Release the document, as it is no longer needed
Factory.deleteResource(doc);

// output the XML to <inputFile>.out.xml
String outputFileName = docFile.getName() + ".out.xml";
File outputFile = new File(docFile.getParentFile(), outputFileName);

// Write output files using the same encoding as the original
FileOutputStream fos = new FileOutputStream(outputFile);
BufferedOutputStream bos = new BufferedOutputStream(fos);
OutputStreamWriter out;
if(encoding == null) {
  out = new OutputStreamWriter(bos);
}
else {
  out = new OutputStreamWriter(bos, encoding);
}
out.write(docXMLString);
out.close();
System.out.println("done");

Protovis - dealing with a text source

Let's say I have a text file with lines such as these:
[4/20/11 17:07:12:875 CEST] 00000059 FfdcProvider W com.test.ws.ffdc.impl.FfdcProvider logIncident FFDC1003I: FFDC Incident emitted on D:/Prgs/testing/WebSphere/AppServer/profiles/ProcCtr01/logs/ffdc/server1_3d203d20_11.04.20_17.07.12.8755227341908890183253.txt com.test.testserver.management.cmdframework.CmdNotificationListener 134
[4/20/11 17:07:27:609 CEST] 0000005d wle E CWLLG2229E: An exception occurred in an EJB call. Error: Snapshot with ID Snapshot.8fdaaf3f-ce3f-426e-9347-3ac7e8a3863e not found.
com.lombardisoftware.core.TeamWorksException: Snapshot with ID Snapshot.8fdaaf3f-ce3f-426e-9347-3ac7e8a3863e not found.
at com.lombardisoftware.server.ejb.persistence.CommonDAO.assertNotNull(CommonDAO.java:70)
Is there any way to easily import a data source such as this into Protovis? If not, what would be the easiest way to parse this into a JSON format? For example, the first entry might be parsed like so:
[
  {
    "Date": "4/20/11 17:07:12:875 CEST",
    "Status": "00000059",
    "Msg": "FfdcProvider W com.test.ws.ffdc.impl.FfdcProvider logIncident FFDC1003I"
  }
]
Thanks, David
Protovis itself doesn't offer any utilities for parsing text files, so your options are:
Use Javascript to parse the text into an object, most likely using regex.
Pre-process the text using the text-parsing language or utility of your choice, exporting a JSON file.
Which you choose depends on several factors:
Is the data somewhat static, or are you going to be running this on a new or dynamic file each time you look at it? With static data, it might be easiest to pre-process; with dynamic data, this may add an annoying extra step.
How much data do you have? Parsing a 20K text file in Javascript is totally fine; parsing a 2MB file will be really slow, and will cause the browser to hang while it's working (unless you use Workers).
If there's a lot of processing involved, would you rather put that load on the server (by using a server-side script for pre-processing) or on the client (by doing it in the browser)?
If you wanted to do this in Javascript, based on the sample you provided, you might do something like this:
// Assumes var text = 'your text';
// use the utility of your choice to load your text file into the
// variable (e.g. jQuery.get()), or just paste it in.
var lines = text.split(/[\r\n\f]+/),
// regex to match your log entry beginning
patt = /^\[(\d\d?\/\d\d?\/\d\d? \d\d:\d\d:\d\d:\d{3} [A-Z]+)\] (\d{8})/,
items = [],
currentItem;
// loop through the lines in the file
lines.forEach(function(line) {
// look for the beginning of a log entry
var initialData = line.match(patt);
if (initialData) {
// start a new item, using the captured matches
currentItem = {
Date: initialData[1],
Status: initialData[2],
Msg: line.substr(initialData[0].length + 1)
}
items.push(currentItem);
} else {
// this is a continuation of the last item
currentItem.Msg += "\n" + line;
}
});
// items now contains an array of objects with your data

Custom clipboard data format across RDC (.NET)

I'm trying to copy a custom object from an RDC window to the host (my local) machine. It fails.
Here's the code I'm using to 1) copy and 2) paste:
1) Remote (client running on Windows XP accessed via RDC):
//copy entry
IDataObject ido = new DataObject();
XmlSerializer x = new XmlSerializer(typeof(EntryForClipboard));
StringWriter sw = new StringWriter();
x.Serialize(sw, new EntryForClipboard(entry));
ido.SetData(typeof(EntryForClipboard).FullName, sw.ToString());
Clipboard.SetDataObject(ido, true);
2) Local (client running on local Windows XP x64 workstation):
//paste entry
IDataObject ido = Clipboard.GetDataObject();
DataFormats.Format cdf = DataFormats.GetFormat(typeof(EntryForClipboard).FullName);
if (ido.GetDataPresent(cdf.Name)) //<- this always returns false
{
    //can never get here!
    XmlSerializer x = new XmlSerializer(typeof(EntryForClipboard));
    string xml = (string)ido.GetData(cdf.Name);
    StringReader sr = new StringReader(xml);
    EntryForClipboard data = (EntryForClipboard)x.Deserialize(sr);
}
It works perfectly on the same machine though.
Any hints?
There are a couple of things you could look into:
Are you sure the serialization of the object truly converts it into XML? Perhaps the output XML has references to your memory space? Try looking at the text of the XML to see.
If you really have a serialized XML version of the object, why not store the value as plain-vanilla text rather than using typeof(EntryForClipboard)? Something like:
XmlSerializer x = new XmlSerializer(typeof(EntryForClipboard));
StringWriter sw = new StringWriter();
x.Serialize(sw, new EntryForClipboard(entry));
Clipboard.SetText(sw.ToString(), TextDataFormat.UnicodeText);
And then, all you'd have to do in the client program is check whether the text in the clipboard can be deserialized back into your object.
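A sketch of that paste-side check (assuming the same EntryForClipboard type as above; XmlSerializer.Deserialize throws InvalidOperationException when the text isn't valid XML for the type):
if (Clipboard.ContainsText(TextDataFormat.UnicodeText))
{
    var x = new XmlSerializer(typeof(EntryForClipboard));
    try
    {
        using (var sr = new StringReader(Clipboard.GetText(TextDataFormat.UnicodeText)))
        {
            var data = (EntryForClipboard)x.Deserialize(sr);
            // use data ...
        }
    }
    catch (InvalidOperationException)
    {
        // the clipboard text was not a serialized EntryForClipboard
    }
}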
Ok, found what the issue was.
Custom clipboard format names get truncated to 16 characters when copied over RDC.
In the line
ido.SetData(typeof(EntryForClipboard).FullName, sw.ToString());
the format name was quite long.
When I received the copied data on the host machine, the available formats included my custom format, but truncated to 16 characters.
IDataObject ido = Clipboard.GetDataObject();
ido.GetFormats(); //used to see available formats.
So I just used a shorter format name:
//to copy
ido.SetData("MyFormat", sw.ToString());
...
//to paste
DataFormats.Format cdf = DataFormats.GetFormat("MyFormat");
if (ido.GetDataPresent(cdf.Name)) {
    //this now works
    ...
}