Open XML encoding - ms-word

I'm using the Open XML SDK to read and write information to custom XML parts in a Word document.
I'm deserializing an XML structure into a class, modifying the data and serializing it back into the document.
Example code:
DocumentFormat.OpenXml.Packaging.CustomXmlPart myPart = GetCustomXmlPart(mainPart, AITNamespaces.SiteDataNameSpace);
StreamReader myStrR = new System.IO.StreamReader(myPart.GetStream());
string myXML = myStrR.ReadToEnd();
sitedata = ObjectXMLSerializer<WDSiteData>.LoadString(myXML);
Site selSite = _applicationService.GetApplicationData().Sites.Find(item => item.Id == siteId);
sitedata.SiteId= selSite.Id;
sitedata.SiteName = "ööööaaaaaåååå"; // selSite.SiteName;
myXML = ObjectXMLSerializer<WDSiteData>.GetXMLString(sitedata);
myPart = GetCustomXmlPart(wordProcDocument.MainDocumentPart, AITNamespaces.SiteDataNameSpace);
using (StreamWriter sw = new StreamWriter(myPart.GetStream(FileMode.Create)))
{
    sw.Write(myXML);
}
My problem is that the national characters get mangled: the text "ööööaaaaaåååå" is displayed in the Word document as "????????aaaaa????????".
The actual encoding is done in:
myXML = ObjectXMLSerializer<WDSiteData>.GetXMLString(sitedata);
Does anyone have a clue how to go about handling national characters in this situation?

For whom it may concern: my trouble was that in my serializer the encoding was hard-coded to ASCII:
return ASCIIEncoding.ASCII.GetString(memStream.ToArray());
When I changed it to
return ASCIIEncoding.UTF8.GetString(memStream.ToArray());
all is well.
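For reference, here is a minimal sketch of what the fixed serializer method can look like. This is only an assumption about how a GetXMLString-style helper is built internally (the class name XmlStringSerializer is hypothetical), not the actual ObjectXMLSerializer implementation:
using System.IO;
using System.Text;
using System.Xml.Serialization;

public static class XmlStringSerializer
{
    // Serialize to a MemoryStream and decode the bytes as UTF-8 rather than ASCII,
    // so characters like ö and å survive the round trip.
    public static string GetXmlString<T>(T instance)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var memStream = new MemoryStream())
        {
            using (var writer = new StreamWriter(memStream, new UTF8Encoding(false), 1024, leaveOpen: true))
            {
                serializer.Serialize(writer, instance);
            }
            return Encoding.UTF8.GetString(memStream.ToArray());
        }
    }
}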


I need to extract text from a PDF file using iText7 or iTextSharp and put an HTML bold tag around all the words that use bold fonts

I am using iText7 and I want to extract all the text from a PDF, put an HTML bold tag (<b>...</b>) around all the words that use bold fonts, and save the result in a text file. Any pointers? I am able to independently extract the text and also extract all the bold words, but I am not able to correlate the two.
Here is the code snippet I am using for extracting the text:
PdfDocument MyDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string MyText = PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1),
    new SimpleTextExtractionStrategy());
Here is the code I am using for extracting all the words using the bold font:
Rectangle MyRectangle = new Rectangle(0, 0, 50, 100);
CustomFontFilter fontFilter = new CustomFontFilter(MyRectangle);
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy =
    listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.ProcessPageContent(MyDocument.GetPage(1));
String MyBoldTextList = extractionStrategy.GetResultantText();
//------
class CustomFontFilter : TextRegionEventFilter
{
    public CustomFontFilter(iText.Kernel.Geom.Rectangle filterRect) : base(filterRect) { }

    public override bool Accept(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_TEXT)
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            PdfFont font = renderInfo.GetFont();
            if (font != null)
                return font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold");
        }
        return false;
    }
}
The problem is that the PDF in question here is a multi-column document. The SimpleTextExtractionStrategy brings the text in perfect order, but if I use the LocationTextExtractionStrategy, it messes up the text by jumping from one column to the next in each line. I am not able to find any way to get the list of bold words using the SimpleTextExtractionStrategy. With the LocationTextExtractionStrategy, the list I get is not in the right order, so I am unable to correlate it.
So to summarize:
You want to extract all the text from a pdf and put the html tag for bold (<b>...</b>) around all the text that uses bold fonts.
Your PDFs allow normal text extraction (without those <b> tags) using the SimpleTextExtractionStrategy. The LocationTextExtractionStrategy on the other hand cannot be used as it messes up the order of the multi-column text.
Bold text in your PDFs can properly be recognized by your CustomFontFilter, i.e. by the
font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold")
condition.
Thus, one way to implement your task would be to extend the SimpleTextExtractionStrategy to check every chunk received using the CustomFontFilter condition and insert <b> tags where required.
For example like this:
public class BoldTaggingSimpleTextExtractionStrategy : SimpleTextExtractionStrategy
{
    FieldInfo textField = typeof(TextRenderInfo).GetField("text", BindingFlags.NonPublic | BindingFlags.Instance);
    bool currentlyBold = false;

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (type.Equals(EventType.RENDER_TEXT))
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            string fontName = renderInfo.GetFont()?.GetFontProgram()?.GetFontNames()?.GetFontName();
            if (fontName != null && fontName.Contains("Bold"))
            {
                if (!currentlyBold)
                {
                    textField.SetValue(renderInfo, "<b>" + renderInfo.GetText());
                    currentlyBold = true;
                }
            }
            else if (currentlyBold)
            {
                AppendTextChunk("</b>");
                currentlyBold = false;
            }
        }
        base.EventOccurred(data, type);
    }
}
As you see, I used reflection here. I did so because (A) TextRenderInfo does not allow public setting of the text, and (B) AppendTextChunk must not be used before the first chunk has been processed by base.EventOccurred: there, the size of a StringBuilder containing the collected text chunks is used to check whether the chunk currently being processed is the first one or not, and if something is in that builder before at least one chunk has been processed, one gets a NullReferenceException. There are other work-arounds for that, but reflection here means only one more line of code.
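For completeness, a possible usage sketch (the file paths are just examples): the class above is a drop-in replacement for SimpleTextExtractionStrategy, so it can be handed to PdfTextExtractor.GetTextFromPage like any other strategy:
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Extract page 1 with the bold-tagging strategy and save the tagged text.
PdfDocument myDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string taggedText = PdfTextExtractor.GetTextFromPage(
    myDocument.GetPage(1), new BoldTaggingSimpleTextExtractionStrategy());
myDocument.Close();
File.WriteAllText("C:\\MyTest.txt", taggedText);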

Word OpenXml Word Found Unreadable Content

We are trying to manipulate a Word document to remove a paragraph based on certain conditions. But the Word file produced always ends up corrupted, and when we try to open it we get the error:
Word found unreadable content
The code below corrupts the file, but if we remove the line:
Document document = mdp.Document;
then the file is saved and opens without issue. Is there an obvious issue that I am missing?
var readAllBytes = File.ReadAllBytes(@"C:\Original.docx");
using (var stream = new MemoryStream(readAllBytes))
{
    using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
    {
        MainDocumentPart mdp = wpd.MainDocumentPart;
        Document document = mdp.Document;
    }
}
File.WriteAllBytes(@"C:\New.docx", readAllBytes);
UPDATE:
using (WordprocessingDocument wpd = WordprocessingDocument.Open(@"C:\Original.docx", true))
{
    MainDocumentPart mdp = wpd.MainDocumentPart;
    Document document = mdp.Document;
    document.Save();
}
Running the code above on a physical file, we can still open Original.docx without the error, so the problem seems limited to modifying a stream.
Here's a method that reads a document into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
    byte[] buffer = File.ReadAllBytes(path);
    var destStream = new MemoryStream(buffer.Length);
    destStream.Write(buffer, 0, buffer.Length);
    destStream.Seek(0, SeekOrigin.Begin);
    return destStream;
}
Note how the MemoryStream is instantiated. I am passing the capacity rather than the buffer (as in your own code). Why is that?
When using MemoryStream() or MemoryStream(int), you are creating a resizable MemoryStream instance, which is what you want in case you make changes to your document. When using MemoryStream(byte[]) (as in your code), the MemoryStream instance is not resizable, which will be problematic unless you make no changes to your document at all, or your changes only ever shrink it.
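A quick sketch of that difference (my own illustration; the path is just an example):
using System.IO;

byte[] bytes = File.ReadAllBytes(@"C:\Original.docx");

// MemoryStream(byte[]) wraps the array and is not expandable.
using (var fixedSize = new MemoryStream(bytes))
{
    fixedSize.Seek(0, SeekOrigin.End);
    // Writing past the end would throw NotSupportedException,
    // because the stream cannot grow:
    // fixedSize.Write(new byte[] { 0 }, 0, 1);
}

// MemoryStream(int) only sets an initial capacity; the stream grows as needed.
using (var resizable = new MemoryStream(bytes.Length))
{
    resizable.Write(bytes, 0, bytes.Length);
    resizable.Write(new byte[] { 0 }, 0, 1); // fine
}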
Now, to read a Word document into a MemoryStream, manipulate that Word document in memory, and end up with a consistent MemoryStream, you will have to do the following:
// Get a MemoryStream.
// In this example, the MemoryStream is created by reading a file stored
// in the file system. Depending on the Stream you "receive", it makes
// sense to copy the Stream to a MemoryStream before processing.
MemoryStream stream = ReadAllBytesToMemoryStream(@"C:\Original.docx");

// Open the Word document on the MemoryStream.
using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
{
    MainDocumentPart mdp = wpd.MainDocumentPart;
    Document document = mdp.Document;
    // Manipulate document ...
}
// After having closed the WordprocessingDocument (by leaving the using statement),
// you can use the MemoryStream for whatever comes next, e.g., to write it to a
// file stored in the file system.
File.WriteAllBytes(@"C:\New.docx", stream.GetBuffer());
Note that you will have to reset the stream.Position property by calling stream.Seek(0, SeekOrigin.Begin) whenever your next action depends on that MemoryStream.Position property (e.g., CopyTo, CopyToAsync). Right after having left the using statement, the stream's position will be equal to its length.
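For example, a minimal sketch of the CopyTo case mentioned above (the target path is just an example):
// Rewind the MemoryStream before copying it into another stream.
stream.Seek(0, SeekOrigin.Begin);
using (FileStream target = File.Create(@"C:\New.docx"))
{
    stream.CopyTo(target);
}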

How to edit pasted content using the Open XML SDK

I have a custom template in which I'd like to control (as best I can) the types of content that can exist in a document. To that end, I disable controls, and I also intercept pastes to remove some of those content types, e.g. charts. I am aware that this content can also be drag-and-dropped, so I also check for it later, but I'd prefer to stop or warn the user as soon as possible.
I have tried a few strategies:
RTF manipulation
Open XML manipulation
RTF manipulation is so far working fairly well, but I'd really prefer to use Open XML as I expect it to be more useful in the future. I just can't get it working.
Open XML Manipulation
The wonderfully-undocumented (as far as I can tell) "Embed Source" appears to contain a compound document object, which I can use to modify the copied content using the Open XML SDK. But I have been unable to put the modified content back into an object that lets it be pasted correctly.
The modification part seems to work fine. I can see, if I save the modified content to a temporary .docx file, that the changes are being made correctly. It's the return to the clipboard that seems to be giving me trouble.
I have tried assigning just the Embed Source object back to the clipboard (so that the other types such as RTF get wiped out), and in this case nothing at all gets pasted. I've also tried re-assigning the Embed Source object back to the clipboard's data object, so that the remaining data types are still there (but with mismatched content, probably), which results in an empty embedded document getting pasted.
Here's a sample of what I'm doing with Open XML:
using OpenMcdf;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
...
var dataObj = Forms.Clipboard.GetDataObject();
object embedSrcObj = dataObj.GetData("Embed Source");
if (embedSrcObj is Stream)
{
    // read it with OpenMCDF
    Stream stream = embedSrcObj as Stream;
    CompoundFile cf = new CompoundFile(stream);
    CFStream cfs = cf.RootStorage.GetStream("package");
    byte[] bytes = cfs.GetData();
    string savedDoc = Path.GetTempFileName() + ".docx";
    File.WriteAllBytes(savedDoc, bytes);

    // And then use the OpenXML SDK to read/edit the document:
    using (WordprocessingDocument openDoc = WordprocessingDocument.Open(savedDoc, true))
    {
        OpenXmlElement body = openDoc.MainDocumentPart.RootElement.ChildElements[0];
        foreach (OpenXmlElement ele in body.ChildElements)
        {
            if (ele is Paragraph)
            {
                Paragraph para = (Paragraph)ele;
                if (para.ParagraphProperties != null && para.ParagraphProperties.ParagraphStyleId != null)
                {
                    string styleName = para.ParagraphProperties.ParagraphStyleId.Val;
                    Run run = para.LastChild as Run; // I know I'm assuming things here but it's sufficient for a test case
                    run.RunProperties = new RunProperties();
                    run.RunProperties.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Text("test"));
                }
            }
            // etc.
        }
        openDoc.MainDocumentPart.Document.Save(); // I think this is redundant in later versions than what I'm using
    }

    // repackage the document
    bytes = File.ReadAllBytes(savedDoc);
    cf.RootStorage.Delete("Package");
    cfs = cf.RootStorage.AddStream("Package");
    cfs.Append(bytes);
    MemoryStream ms = new MemoryStream();
    cf.Save(ms);
    ms.Position = 0;
    dataObj.SetData("Embed Source", ms);
    // or,
    // Clipboard.SetData("Embed Source", ms);
}
Question
What am I doing wrong? Is this just a bad/unworkable approach?

OpenXML editing docx

I have a dynamically generated docx file.
I need to write text strictly at the end of the page.
With Microsoft Office Interop I insert paragraphs before the text:
int kk = objDoc.ComputeStatistics(WdStatistic.wdStatisticPages, ref wMissing);
while (objDoc.ComputeStatistics(WdStatistic.wdStatisticPages, ref wMissing) != kk + 1)
{
    objWord.Selection.TypeParagraph();
}
objWord.Selection.TypeBackspace();
But I can't use the same code with Open XML, because the page count is calculated only by Word itself.
Using Interop is not an option, because it is far too slow.
There are two options for doing this in Open XML:
1. Create a Content Placeholder (content control) at the end of your document from the Microsoft Office Developer tab; you can then access this placeholder programmatically and place any text in it (a minimal sketch of this approach follows at the end of this answer).
2. You can append text directly to your Word document, where it will be inserted after the end of your existing text. In this approach you have to write all the other content to your document first; once you are done, you can append to your document the following way:
public void WriteTextToWordDocument()
{
    using (WordprocessingDocument doc = WordprocessingDocument.Open(documentPath, true))
    {
        MainDocumentPart mainPart = doc.MainDocumentPart;
        Body body = mainPart.Document.Body;

        Paragraph paragraph = new Paragraph();
        Run run = new Run();
        Text myText = new Text("Append this text at the end of the word document");
        run.Append(myText);
        paragraph.Append(run);
        body.Append(paragraph);

        // don't forget to save and close your document as in the following two lines
        mainPart.Document.Save();
        doc.Close();
    }
}
I haven't tested the above code, but I hope it will give you an idea of dealing with Word documents in Open XML.
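For the first option, here is a minimal sketch of my own (not tested against your document either): it assumes the template contains a block-level content control whose w:tag is the value passed in tagName (the helper name FillContentControl and the tag are hypothetical), finds it with the Open XML SDK and fills it with a single paragraph of text:
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static void FillContentControl(string documentPath, string tagName, string text)
{
    using (WordprocessingDocument doc = WordprocessingDocument.Open(documentPath, true))
    {
        MainDocumentPart mainPart = doc.MainDocumentPart;

        // Find the block-level content control whose w:tag matches tagName.
        SdtBlock sdt = mainPart.Document.Body
            .Descendants<SdtBlock>()
            .FirstOrDefault(s => s.GetFirstChild<SdtProperties>()?
                .GetFirstChild<Tag>()?.Val?.Value == tagName);
        if (sdt == null)
            return;

        // Replace the control's current content with a single paragraph of text.
        SdtContentBlock content = sdt.GetFirstChild<SdtContentBlock>()
            ?? sdt.AppendChild(new SdtContentBlock());
        content.RemoveAllChildren();
        content.AppendChild(new Paragraph(new Run(new Text(text))));

        mainPart.Document.Save();
    }
}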
Regards,

Parsing an XML string containing "&#x20;" (which must be preserved)

I have code that is passed a string containing XML. This XML may contain one or more instances of &#x20; (an entity reference for the blank space character). I have a requirement that these references should not be resolved (i.e. they should not be replaced with an actual space character).
Is there any way for me to achieve this?
Basically, given a string containing the XML:
<pattern value="[A-Z0-9&#x20;]" />
I do not want it to be converted to:
<pattern value="[A-Z0-9 ]" />
(What I am actually trying to achieve is to simply take an XML string and write it to a "pretty-printed" file. This is having the side-effect of resolving occurrences of &#x20; in the string to a single space character, which need to be preserved. The reason for this requirement is that the written XML document must conform to an externally-defined specification.)
I have tried creating a sub-class of XmlTextReader to read from the XML string and overriding the ResolveEntity() method, but this isn't called. I have also tried assigning a custom XmlResolver.
I have also tried, as suggested, to "double encode" the reference. Unfortunately, this has not had the desired effect, as the &amp; is not decoded by the parser. Here is the code I used:
string schemaText = @"...<pattern value=""[A-Z0-9&amp;#x20;]"" />...";

XmlWriterSettings writerSettings = new XmlWriterSettings();
writerSettings.Indent = true;
writerSettings.NewLineChars = Environment.NewLine;
writerSettings.Encoding = Encoding.Unicode;
writerSettings.CloseOutput = true;
writerSettings.OmitXmlDeclaration = false;
writerSettings.IndentChars = "\t";

StringBuilder writtenSchema = new StringBuilder();
using (StringReader sr = new StringReader(schemaText))
using (XmlReader reader = XmlReader.Create(sr))
using (TextWriter tr = new StringWriter(writtenSchema))
using (XmlWriter writer = XmlWriter.Create(tr, writerSettings))
{
    XPathDocument doc = new XPathDocument(reader);
    XPathNavigator nav = doc.CreateNavigator();
    nav.WriteSubtree(writer);
}
The written XML ends up with:
<pattern value="[A-Z0-9&amp;#x20;]" />
If you want it to be preserved, you need to double-encode it: &amp;#x20;. The XML reader will translate entities; that's more or less how XML works.
<pattern value="[A-Z0-9&amp;#x20;]" />
What I did above is replace "&" with "&amp;", thereby escaping the ampersand.