OpenXML editing docx - openxml

I have a dynamically generated docx file.
I need to place some text exactly at the bottom of the page.
With Microsoft.Office.Interop.Word I insert empty paragraphs before the text:
int kk = objDoc.ComputeStatistics(WdStatistic.wdStatisticPages, ref wMissing);
while (objDoc.ComputeStatistics(WdStatistic.wdStatisticPages, ref wMissing) != kk + 1)
{
    objWord.Selection.TypeParagraph();
}
objWord.Selection.TypeBackspace();
But I can't use the same approach with Open XML, because the page count is calculated only by Word itself.
Using interop is not an option, because it is far too slow.

There are two ways of doing this in Open XML.
Create a content control (placeholder) from the Developer tab in Microsoft Office at the end of your document; you can then access this content control programmatically and place any text in it.
Append the text directly to your Word document, so that it is inserted after the existing content. With this approach you have to write everything else to the document first, and once you are done you can append the text like this:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public void WriteTextToWordDocument()
{
    using (WordprocessingDocument doc = WordprocessingDocument.Open(documentPath, true))
    {
        MainDocumentPart mainPart = doc.MainDocumentPart;
        Body body = mainPart.Document.Body;
        Paragraph paragraph = new Paragraph();
        Run run = new Run();
        Text myText = new Text("Append this text at the end of the word document");
        run.Append(myText);
        paragraph.Append(run);
        body.Append(paragraph);
        // Save the changes; leaving the using block disposes (closes) the document.
        mainPart.Document.Save();
    }
}
I haven't tested the above code but hope it will give you an idea of dealing with word document in OpenXML.
Regards,

Related

I need to extract text from a pdf file using itext7 or itextsharp and put html tag for bold around all the words using bold font

I am using iText7 and I want to extract all the text from a pdf, put the html tag for bold ( <b>...</b> ) around all the words that use bold fonts, and save the result in a text file. Any pointers? I am able to independently extract the text and also extract all the bold words, but I am not able to correlate the two.
Here is the code snippet I am using for extracting the text:
PdfDocument MyDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string MyText = PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1), new SimpleTextExtractionStrategy());
Here is the code I am using for extracting all the words using the bold font:
Rectangle MyRectangle = new Rectangle(0, 0, 50, 100);
CustomFontFilter fontFilter = new CustomFontFilter(MyRectangle);
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy =
    listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.ProcessPageContent(MyDocument.GetPage(1));
String MyBoldTextList = extractionStrategy.GetResultantText();
//------
class CustomFontFilter : TextRegionEventFilter
{
    public CustomFontFilter(iText.Kernel.Geom.Rectangle filterRect) : base(filterRect) { }

    public override bool Accept(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_TEXT)
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            PdfFont font = renderInfo.GetFont();
            if (font != null)
                return font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold");
        }
        return false;
    }
}
The problem is that the pdf in question is a multi-column document. SimpleTextExtractionStrategy returns the text in perfect order, but LocationTextExtractionStrategy messes up the text by jumping from one column to the next on each line. I cannot find any way to get the list of bold words using SimpleTextExtractionStrategy, and the list I get from LocationTextExtractionStrategy is not in the right order, so I am unable to correlate the two.
So to summarize:
You want to extract all the text from a pdf and put the html tag for bold (<b>...</b>) around all the text that uses bold fonts.
Your PDFs allow normal text extraction (without those <b> tags) using the SimpleTextExtractionStrategy. The LocationTextExtractionStrategy on the other hand cannot be used as it messes up the order of the multi-column text.
Bold text in your PDFs can properly be recognized by your CustomFontFilter, i.e. by the
font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold")
condition.
Thus, one way to implement your task would be to extend the SimpleTextExtractionStrategy to check every chunk received using the CustomFontFilter condition and insert <b> tags where required.
For example like this:
public class BoldTaggingSimpleTextExtractionStrategy : SimpleTextExtractionStrategy
{
    FieldInfo textField = typeof(TextRenderInfo).GetField("text", BindingFlags.NonPublic | BindingFlags.Instance);
    bool currentlyBold = false;

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (type.Equals(EventType.RENDER_TEXT))
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            string fontName = renderInfo.GetFont()?.GetFontProgram()?.GetFontNames()?.GetFontName();
            if (fontName != null && fontName.Contains("Bold"))
            {
                if (!currentlyBold)
                {
                    textField.SetValue(renderInfo, "<b>" + renderInfo.GetText());
                    currentlyBold = true;
                }
            }
            else if (currentlyBold)
            {
                AppendTextChunk("</b>");
                currentlyBold = false;
            }
        }
        base.EventOccurred(data, type);
    }
}
As you see I used reflection here. I did so because (A) TextRenderInfo does not allow public setting of the text and (B) AppendTextChunk must not be used before the first chunk is processed by base.EventOccurred - there the size of a StringBuilder containing the collected text chunks is used to check whether the chunk currently processed is the first one or not; if something is in that builder before at least one chunk has been processed, one gets a NullReferenceException. There are other work-arounds for that but reflection here means but one more line of code.
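The tagging logic itself does not depend on iText, so it can be tried in isolation. Below is a minimal sketch in plain Java; the Chunk class and the BoldTagger name are hypothetical stand-ins for the render events and are not part of any iText API. It assumes, like the filter above, that a font name containing "Bold" marks bold text. Unlike the strategy above, it also closes a bold run that is still open when the input ends.

```java
import java.util.List;

// Hypothetical stand-in for one text chunk and the font name reported for it.
class Chunk {
    final String text;
    final String fontName;

    Chunk(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

class BoldTagger {
    // Wraps each run of consecutive bold chunks in <b>...</b>, keeping the incoming order.
    static String tag(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        boolean currentlyBold = false;
        for (Chunk chunk : chunks) {
            boolean bold = chunk.fontName != null && chunk.fontName.contains("Bold");
            if (bold && !currentlyBold) {
                out.append("<b>");   // a bold run starts with this chunk
            } else if (!bold && currentlyBold) {
                out.append("</b>");  // the previous chunk ended a bold run
            }
            currentlyBold = bold;
            out.append(chunk.text);
        }
        // Close a bold run that lasts until the end of the text.
        if (currentlyBold) {
            out.append("</b>");
        }
        return out.toString();
    }
}
```

Because the chunks are processed in the order they arrive, the tags end up in the same reading order that SimpleTextExtractionStrategy produces.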

copy contents from existing pdf to a new pdf using itextsharp

I am able to copy the contents and edit them, but the new file does not keep the layout (template) of the old one. The old file also contains an image, and that image is not copied into the new file; the rest of the content is copied. Can someone help me make the new pdf look like the old one? Here is my code:
public static void Main(string[] args)
{
    var editedText = ExtractTextFromPdf(@"C:\backup_temp\Template.pdf");
    string outputfile = @"C:\backup_temp\Result.pdf";
    using (var fileStream = new FileStream(outputfile, FileMode.Create, FileAccess.Write))
    {
        Document document = new Document(PageSize.A4, 25, 25, 30, 30);
        PdfWriter writer = PdfWriter.GetInstance(document, fileStream);
        document.Open();
        document.Add(new Paragraph(editedText));
        document.Close();
        writer.Close();
        fileStream.Close();
    }
}
public static string ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
text.Replace("[DMxxxxxxx]", "[DM123456]");
}
return text.ToString();
}
}
As Bruno says, if your "template" is another pdf document, you cannot achieve this in a trivial way. Pdf documents do not automatically reflow their content. And to the best of my knowledge, there is no pdf library that will allow you to insert/replace/edit content and still produce a nice-looking document.
The best solution in your case would be:
store the template document as an easy to edit format
generate the pdf document based on this easy template
Example use-case:
I have some HTML document that contains the precise layout and images and text, and some placeholders for things I want to fill in.
I use JSoup (or some other library) to edit the DOM structure of my template, this is very easy since I can give elements IDs and simply change the content by ID. I don't need regular expressions.
I use pdfHTML (iText add-on) to convert my html document to pdf
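The "change the content by ID" step can be sketched without any third-party library. The example below is only an illustration in plain Java: it uses the JDK's built-in XML parser instead of JSoup, so it assumes the template is well-formed XHTML, and the TemplateFiller class and its fill method are hypothetical names. The conversion to pdf is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TemplateFiller {
    // Replaces the text of the element whose id attribute equals elementId
    // and returns the serialized document.
    static String fill(String xhtml, String elementId, String value) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Look the placeholder up by id; no regular expressions involved.
            Element target = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@id='" + elementId + "']", dom, XPathConstants.NODE);
            if (target == null) {
                throw new IllegalArgumentException("No element with id " + elementId);
            }
            target.setTextContent(value);

            // Serialize the modified DOM back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not process the template", e);
        }
    }
}
```

Images and layout stay untouched because only the text of the addressed element changes; the filled-in markup would then be handed to the html-to-pdf converter.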

Chapters in iText 7

I'm looking to create a pdf file with chapters and sub chapters with iText 7. I've found examples for previous versions of iText using the Chapter class. However this class does not seem to be included in iText 7.
How is that functionality implemented in iText7?
The Chapter and Section classes in iText 5 were problematic. Already with iText 5, we advised people to use PdfOutline.
For an example on how to create chapters, and more specifically, the corresponding outlines in the bookmarks panel, please take a look at the iText 7: Building Blocks tutorial. This tutorial has a recurring theme: the novel "The Strange Case of Dr. Jekyll and Mr. Hyde."
We use that text and a database with movies based on this novel to explain how iText 7 works. If you don't have the time to read it, please jump to Chapter 6.
In this chapter, we create a document with chapter titles and matching outlines in the bookmarks panel.
The full sample code is available as TOC_OutlinesDestinations:
BufferedReader br = new BufferedReader(new FileReader(SRC));
String name, line;
Paragraph p;
boolean title = true;
int counter = 0;
PdfOutline outline = null;
while ((line = br.readLine()) != null) {
p = new Paragraph(line);
p.setKeepTogether(true);
if (title) {
name = String.format("title%02d", counter++);
outline = createOutline(outline, pdf, line, name);
p.setFont(bold).setFontSize(12)
.setKeepWithNext(true)
.setDestination(name);
title = false;
document.add(p);
}
else {
p.setFirstLineIndent(36);
if (line.isEmpty()) {
p.setMarginBottom(12);
title = true;
}
else {
p.setMarginBottom(0);
}
document.add(p);
}
}
In this example, we loop over a text file that contains titles and chapters. Every time we encounter a title, we create a name (title00, title01, and so on, since the counter starts at 0), and we use this name as a named destination for the title paragraph: setDestination(name).
We create the outlines using the PdfOutline object for which we define a named destination like this: PdfDestination.makeDestination(new PdfString(name))
public PdfOutline createOutline(PdfOutline outline, PdfDocument pdf, String title, String name) {
if (outline == null) {
outline = pdf.getOutlines(false);
outline = outline.addOutline(title);
outline.addDestination(PdfDestination.makeDestination(new PdfString(name)));
return outline;
}
PdfOutline kid = outline.addOutline(title);
kid.addDestination(PdfDestination.makeDestination(new PdfString(name)));
return outline;
}
There are other ways to achieve this result, but using named destinations is the simplest way. Just try the example; you'll discover that most of its complexity is caused by the fact that we turn a simple text file into a document with chapter titles and chapter content.
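To see which lines the loop treats as titles and which destination names they receive, the title detection can be isolated from iText. This is a minimal sketch in plain Java with a hypothetical TitleScanner class; it applies the same rule as the loop above (the first line is a title, and so is every line that follows an empty line) and creates no pdf objects.

```java
import java.util.ArrayList;
import java.util.List;

class TitleScanner {
    // Returns one "name=title" entry for every line treated as a chapter title.
    static List<String> namedTitles(List<String> lines) {
        List<String> result = new ArrayList<>();
        boolean title = true; // the first line of the file is a title
        int counter = 0;
        for (String line : lines) {
            if (title) {
                // Same naming scheme as in the sample: title00, title01, ...
                String name = String.format("title%02d", counter++);
                result.add(name + "=" + line);
                title = false;
            } else if (line.isEmpty()) {
                title = true; // an empty line announces the next title
            }
        }
        return result;
    }
}
```

Each returned name corresponds to the string that would be passed both to setDestination(name) on the paragraph and to PdfDestination.makeDestination(new PdfString(name)) on the outline.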

Inserting a cross-reference with Aspose Words

Does anybody know if it is possible to insert a cross reference (I want to reference a bookmark, but I could make anything else work as well), using Aspose Words, in C#?
If you want to insert a footnote or endnote, you can use the following code.
DocumentBuilder builder = new DocumentBuilder(doc);
builder.Write("Some text is added.");
Footnote endNote = new Footnote(doc, FootnoteType.Endnote);
builder.CurrentParagraph.AppendChild(endNote);
endNote.Paragraphs.Add(new Paragraph(doc));
endNote.FirstParagraph.Runs.Add(new Run(doc, "Endnote text."));
doc.Save(MyDir + @"FootNote.docx");
If you want to insert a bookmark, you can use the following code.
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
builder.StartBookmark("MyBookmark");
builder.Writeln("Text inside a bookmark.");
builder.EndBookmark("MyBookmark");
If you want to update a bookmark, you can use the following code.
Document doc = new Document(MyDir + "Bookmark.doc");
// Use the indexer of the Bookmarks collection to obtain the desired bookmark.
Bookmark bookmark = doc.Range.Bookmarks["MyBookmark"];
// Get the name and text of the bookmark.
string name = bookmark.Name;
string text = bookmark.Text;
// Set the name and text of the bookmark.
bookmark.Name = "RenamedBookmark";
bookmark.Text = "This is a new bookmarked text.";
I work as developer evangelist at Aspose.
You can use the following code to get the page number of a bookmark.
Document doc = new Document("Bookmark.docx");
Aspose.Words.Layout.LayoutCollector layoutCollector = new Aspose.Words.Layout.LayoutCollector(doc);
// Use the indexer of the Bookmarks collection to obtain the desired bookmark.
Bookmark bookmark = doc.Range.Bookmarks["MyBookmark"];
// Get the name and text of the bookmark.
string name = bookmark.Name;
string text = bookmark.Text;
int pageNumber = layoutCollector.GetStartPageIndex(bookmark.BookmarkStart);
Console.Write("Bookmark name is {0}, it is placed on page number {1} and following is the text inside it: {2}", name, pageNumber, text);

How to get text from textbox of MS word document using Apache POI?

I want to get the information written in a text box in an MS Word document. I am using Apache POI to parse the document.
Currently I am iterating through all the Paragraph objects, but this paragraph list does not contain the content of text boxes, so that information is missing from my output.
e.g.
paragraph in plain text
**<some information in text box>**
one more paragraph in plain text
What I want to extract:
<para>paragraph in plain text</para>
<text_box>some information in text box</text_box>
<para>one more paragraph in plain text</para>
What I am getting currently:
paragraph in plain text
one more paragraph in plain text
Does anyone know how to extract information from a text box using Apache POI?
This worked for me:
private void printContentsOfTextBox(XWPFParagraph paragraph) {
    // The XPath is built by concatenation; a Java string literal cannot span several lines.
    XmlObject[] textBoxObjects = paragraph.getCTP().selectPath(
        "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
        + "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' "
        + "declare namespace v='urn:schemas-microsoft-com:vml' "
        + ".//*/wps:txbx/w:txbxContent | .//*/v:textbox/w:txbxContent");
    for (int i = 0; i < textBoxObjects.length; i++) {
        XWPFParagraph embeddedPara = null;
        try {
            XmlObject[] paraObjects = textBoxObjects[i].selectChildren(
                new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p"));
            for (int j = 0; j < paraObjects.length; j++) {
                embeddedPara = new XWPFParagraph(
                    CTP.Factory.parse(paraObjects[j].xmlText()), paragraph.getBody());
                // Here you have your paragraph
                System.out.println(embeddedPara.getText());
            }
        } catch (XmlException e) {
            // handle
        }
    }
}
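What the path expression selects can be checked on a raw WordprocessingML fragment without POI. The following is a minimal sketch using only the JDK's XPath support; the TextBoxXPathDemo class is a hypothetical name, the path is simplified to //w:txbxContent/w:p, and the input is assumed to be the XML of a document part passed in as a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextBoxXPathDemo {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every paragraph that sits directly inside a w:txbxContent element.
    static List<String> textBoxParagraphs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that the w: prefix can be resolved
            Document dom = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            NodeList paragraphs = (NodeList) xpath.evaluate(
                    "//w:txbxContent/w:p", dom, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                result.add(paragraphs.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the XPath", e);
        }
    }
}
```

The same w:txbxContent element is the container in both the wps:txbx (DrawingML) and v:textbox (VML) variants, which is why the POI code above unions the two paths.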
To extract all occurrences of text from Word .doc and .docx files for crgrep, I used the Apache Tika source as a reference for how the Apache POI APIs should be used. This is useful if you want to use POI directly and not depend on Tika.
For Word .docx files, take a look at this Tika class:
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator
If you ignore XHTMLContentHandler and the formatting code, you can see how to navigate an XWPFDocument correctly using POI.
For .doc files this class is helpful:
org.apache.tika.parser.microsoft.WordExtractor
Both are in tika-parsers-1.x.jar. An easy way to access the Tika code through your Maven dependencies is to add Tika temporarily to your pom.xml, such as
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.7</version>
</dependency>
Then let your IDE resolve the attached sources and step into the classes above.
If you want to get the text from a text box in a docx file (using POI 3.10-FINAL), here is some sample code:
FileInputStream fileInputStream = new FileInputStream(inputFile);
XWPFDocument document = new XWPFDocument(OPCPackage.open(fileInputStream));
for (XWPFParagraph xwpfParagraph : document.getParagraphs()) {
    String text = xwpfParagraph.getParagraphText(); // here is where you receive the text from the text box
}
Or you can iterate over each XWPFRun in the XWPFParagraph and invoke its toString() method. Same result.