Insert HTML in docx file - openxml

I have made an application that fills wordfiles with customxmlparts now I am trying to put text into a textfield, but it has HTML in it and I want it to show the styling of it. I tried converting it to rich text format but that just gets pasted in the word file. Here is an example of the code:
var taskId = Guid.NewGuid();
var tempFilePath = $"{Path.GetTempPath()}/{taskId}";
using (var templateStream = new FileStream($"{tempFilePath}.docx", FileMode.CreateNew))
{
templateStream.Write(template, 0, template.Length);
// 1. Fill template.
using (WordprocessingDocument doc = WordprocessingDocument.Open(templateStream, true))
{
MainDocumentPart mainDocument = doc.MainDocumentPart;
if (mainDocument.CustomXmlParts != null)
{
mainDocument.DeleteParts<CustomXmlPart>(mainDocument.CustomXmlParts);
}
CustomXmlPart cxp = mainDocument.AddCustomXmlPart(CustomXmlPartType.CustomXml);
foreach (var line in data.Lines)
{
if (line.MoreInfo != null && line.MoreInfo != " ") {
}
}
var xmlData = ObjectToXml(data);
using (var stream = GenerateStreamFromString(tempFilePath, xmlData))
{
cxp.FeedData(stream);
}
mainDocument.Document.Save();
}
}

You can't just write the HTML formatted text into a DOCX field, you would need to convert it into a WordprocessingML format.
However, there is another way that you could try and that is to insert an "AltChunk" element. That element represents a sort of like a placeholder which can reference a HTML file and then when the DOCX file is opened in MS Word, it will make that HTML to WordprocessingML conversion for you. For details see: How to Use altChunk for Document Assembly
Alternatively you could use some third party, like GemBox.Document, which can make that HTML to WordprocessingML conversion for you.
For example check this Set Content example:
// Set content using HTML tags
document.Sections[0].Blocks[4].Content.LoadText(
"Paragraph 5 <b>(part of this paragraph is bold)</b>", LoadOptions.HtmlDefault);

Related

Word found unreadable content in xxx.docx after split a docx using openxml

I have a full.docx which includes two math questions, the docx embeds some pictures and MathType equation (oleobject), I split the doc according to this, get two files (first.docx, second.docx) , first.docx works fine, the second.docx, however, pops up a warning dialog when I try to open it:
"Word found unreadable content in second.docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes."
After click "Yes", the doc can be opened, the content is also correct, I want to know what is wrong with the second.docx? I have checked it with "Open xml sdk 2.5 productivity tool", but found no reason. Very appreciated for any help. Thanks.
The three files have been uploaded to here.
Show some code:
byte[] templateBytes = System.IO.File.ReadAllBytes(TEMPLATE_YANG_FILE);
using (MemoryStream templateStream = new MemoryStream())
{
templateStream.Write(templateBytes, 0, (int)templateBytes.Length);
string guidStr = Guid.NewGuid().ToString();
using (WordprocessingDocument document = WordprocessingDocument.Open(templateStream, true))
{
document.ChangeDocumentType(DocumentFormat.OpenXml.WordprocessingDocumentType.Document);
MainDocumentPart mainPart = document.MainDocumentPart;
mainPart.Document = new Document();
Body bd = new Body();
foreach (DocumentFormat.OpenXml.Wordprocessing.Paragraph clonedParagrph in lst)
{
bd.AppendChild<DocumentFormat.OpenXml.Wordprocessing.Paragraph>(clonedParagrph);
clonedParagrph.Descendants<Blip>().ToList().ForEach(blip =>
{
var newRelation = document.CopyImage(blip.Embed, this.wordDocument);
blip.Embed = newRelation;
});
clonedParagrph.Descendants<DocumentFormat.OpenXml.Vml.ImageData>().ToList().ForEach(imageData =>
{
var newRelation = document.CopyImage(imageData.RelationshipId, this.wordDocument);
imageData.RelationshipId = newRelation;
});
}
mainPart.Document.Body = bd;
mainPart.Document.Save();
}
string subDocFile = System.IO.Path.Combine(this.outDir, guidStr + ".docx");
this.subWordFileLst.Add(subDocFile);
File.WriteAllBytes(subDocFile, templateStream.ToArray());
}
the lst contains Paragraph cloned from original docx using:
(DocumentFormat.OpenXml.Wordprocessing.Paragraph)p.Clone();
Using productivity tool, found oleobjectx.bin not copied, so I add below code after copy Blip and ImageData:
clonedParagrph.Descendants<OleObject>().ToList().ForEach(ole =>
{
var newRelation = document.CopyOleObject(ole.Id, this.wordDocument);
ole.Id = newRelation;
});
Solved the issue.

Remove Content controls after adding text using open xml

By the help of some very kind community members here I managed to programatically create a function to replace text inside content controls in a Word document using open xml. After the document is generated it removes the formatting of the text after I replace the text.
Any ideas on how I can still keep the formatting in word and remove the content control tags ?
This is my code:
using (var wordDoc = WordprocessingDocument.Open(mem, true))
{
var mainPart = wordDoc.MainDocumentPart;
ReplaceTags(mainPart, "FirstName", _firstName);
ReplaceTags(mainPart, "LastName", _lastName);
ReplaceTags(mainPart, "WorkPhoe", _workPhone);
ReplaceTags(mainPart, "JobTitle", _jobTitle);
mainPart.Document.Save();
SaveFile(mem);
}
private static void ReplaceTags(MainDocumentPart mainPart, string tagName, string tagValue)
{
//grab all the tag fields
IEnumerable<SdtBlock> tagFields = mainPart.Document.Body.Descendants<SdtBlock>().Where
(r => r.SdtProperties.GetFirstChild<Tag>().Val == tagName);
foreach (var field in tagFields)
{
//remove all paragraphs from the content block
field.SdtContentBlock.RemoveAllChildren<Paragraph>();
//create a new paragraph containing a run and a text element
Paragraph newParagraph = new Paragraph();
Run newRun = new Run();
Text newText = new Text(tagValue);
newRun.Append(newText);
newParagraph.Append(newRun);
//add the new paragraph to the content block
field.SdtContentBlock.Append(newParagraph);
}
}
Keeping the style is a tricky problem as there could be more than one style applied to the text you are trying to replace. What should you do in that scenario?
Assuming a simple case of one style (but potentially over many Paragraphs, Runs and Texts) you could keep the first Text element you come across per SdtBlock and place your required value in that element then delete any further Text elements from the SdtBlock. The formatting from the first Text element will then be maintained. Obviously you can apply this theory to any of the Text blocks; you don't have to necessarily use the first. The following code should show what I mean:
private static void ReplaceTags(MainDocumentPart mainPart, string tagName, string tagValue)
{
IEnumerable<SdtBlock> tagFields = mainPart.Document.Body.Descendants<SdtBlock>().Where
(r => r.SdtProperties.GetFirstChild<Tag>().Val == tagName);
foreach (var field in tagFields)
{
IEnumerable<Text> texts = field.SdtContentBlock.Descendants<Text>();
for (int i = 0; i < texts.Count(); i++)
{
Text text = texts.ElementAt(i);
if (i == 0)
{
text.Text = tagValue;
}
else
{
text.Remove();
}
}
}
}

iText not returning text contents of a PDF after first page

I am trying to use the iText library with c# to capture the text portion of pdf files.
I created a pdf from excel 2013 (exported) and then copied the sample from the web of how to use itext (added the lib ref to the project).
It reads perfectly the first page but it gets garbled info after that. It is keeping part of the first page and merging the info with the next page. The commented lines is when I was trying to solve the problem, the string "thePage" is recreated inside the for loop.
Here is the code. I can email the pdf to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
//string[] theLines;
//theLines = new string[COLUMNS];
//string thePage;
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = "";
thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string [] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
// text.AppendLine(" ");
// Array.Clear(theLines, 0, theLines.Length);
// thePage = "";
}
return text.ToString();
}
}
A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.

Open XML: Word - Getting all Paragraphs marked as "Heading1" style

Using Word I have created a Docx with the standard normal.dot as a test. Hello-world level complexity.
I wish to get all the paragraphs which are styled with the "Heading1" style in Word.
I can get all the paragraphs, but don't know how to filter down to Heading1.
using (var doc = WordprocessingDocument.Open(documentFileName, false))
{
paragraphs = doc.MainDocumentPart.Document.Body
.OfType<Paragraph>().ToList();
}
[Test]
public void FindHeadingParagraphs()
{
var paragraphs = new List<Paragraph>();
// Open the file read-only since we don't need to change it.
using (var wordprocessingDocument = WordprocessingDocument.Open(documentFileName, false))
{
paragraphs = wordprocessingDocument.MainDocumentPart.Document.Body
.OfType<Paragraph>()
.Where(p => p.ParagraphProperties != null &&
p.ParagraphProperties.ParagraphStyleId != null &&
p.ParagraphProperties.ParagraphStyleId.Val.Value.Contains("Heading1")).ToList();
}
}

Merge 2 pdf byte streams using Itextsharp

I have a method that returns a pdf byte stream (from fillable pdf) Is there a straight forward way to merge 2 streams into one stream and make one pdf out of it? I need to run my method twice but need the two pdf's into One pdf stream. Thanks.
You didn't say if you're flattening the filled forms with the PdfStamper, so I'll just say you must flatten the before trying to merge them. Here's a working .ashx HTTP handler:
<%# WebHandler Language="C#" Class="mergeByteForms" %>
using System;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
public class mergeByteForms : IHttpHandler {
HttpServerUtility Server;
public void ProcessRequest (HttpContext context) {
Server = context.Server;
HttpResponse Response = context.Response;
Response.ContentType = "application/pdf";
using (Document document = new Document()) {
using (PdfSmartCopy copy = new PdfSmartCopy(
document, Response.OutputStream) )
{
document.Open();
for (int i = 0; i < 2; ++i) {
PdfReader reader = new PdfReader(_getPdfBtyeStream(i.ToString()));
copy.AddPage(copy.GetImportedPage(reader, 1));
}
}
}
}
public bool IsReusable { get { return false; } }
// simulate your method to use __one__ byte stream for __one__ PDF
private byte[] _getPdfBtyeStream(string data) {
// replace with __your__ PDF template
string pdfTemplatePath = Server.MapPath(
"~/app_data/template.pdf"
);
PdfReader reader = new PdfReader(pdfTemplatePath);
using (MemoryStream ms = new MemoryStream()) {
using (PdfStamper stamper = new PdfStamper(reader, ms)) {
AcroFields form = stamper.AcroFields;
// replace this with your form field data
form.SetField("title", data);
// ...
// this is __VERY__ important; since you're using the same fillable
// PDF, if you don't set this property to true the second page will
// lose the filled fields.
stamper.FormFlattening = true;
}
return ms.ToArray();
}
}
}
Hopefully the inline comments make sense. _getPdfBtyeStream() method above simulates your PDF byte streams. The reason you need to set FormFlattening to true is that a when you fill PDF form fields, names are supposed to be unique. In your case the second page is the same fillable PDF form, so it has the same field names as the first page and when you fill them they're ignored. Comment out the example line above:
stamper.FormFlattening = true;
to see what I mean.
In other words, a lot of the generic code to merge PDFs on the Internet and even here on stackoverflow will not work (for fillable forms) because Acrofields are not being accounted for. In fact, if you take a look at stackoverflow's about itextsharp tag "SO FAQ & Popular" to Merge PDFs, it's mentioned in the third comment for the correctly marked answer by #Ray Cheng.
Another way to merge fillable PDF (without flattening the form) is to rename the form fields for the second/following page(s), but that's more work.