I need to extract text from a pdf file using itext7 or itextsharp and put html tag for bold around all the words using bold font

I need to extract text from a pdf file using itext7 or itextsharp and put html tag for bold around all the words using bold font - itext

I am using iText7 and I want to extract all the texts from a pdf and put html tag for bold ( <b>...</b> ) around all the words that uses bold fonts and save it in text file. Any pointers? I am able to independently extract text and also extract all the bold words but not able to co-relate the two.
Here is the code snippet I am using for extracting the text:
PdfDocument MyDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string MyText = PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1), new
SimpleTextExtractionStrategy());
Here is the code I am using for extracting all the words using the bold font:
MyRectangle = new Rectangle(0, 0, 50, 100);
CustomFontFilter fontFilter = new CustomFontFilter(MyRectangle);
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy =
listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.ProcessPageContent(MyDocument.GetPage(1));
String MyBoldTextList = extractionStrategy.GetResultantText();
//------
class CustomFontFilter : TextRegionEventFilter
{
public CustomFontFilter(iText.Kernel.Geom.Rectangle filterRect) : base(filterRect){ }
override public bool Accept(IEventData data, EventType type)
{
if (type == EventType.RENDER_TEXT){
TextRenderInfo renderInfo = (TextRenderInfo)data;
PdfFont font = renderInfo.GetFont();
if (font!=null)
return font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold");
}
return false;
}
}
The problem is that the pdf in question here is a multi-column document. SimpleTextExtractionStrategy brings the text in perfect order but if I use the LocationStrategy, it messes up texts by jumping from one column to next column in each line. I am not able to find any way to get the list of bold words using SimpleTextExtractionStrategy. In LocationStrategy, the list that I get is not in the right order so I am unable to co-relate it.

So to summarize:
You want to extract all the text from a pdf and put the html tag for bold (<b>...</b>) around all the text that uses bold fonts.
Your PDFs allow normal text extraction (without those <b> tags) using the SimpleTextExtractionStrategy. The LocationTextExtractionStrategy on the other hand cannot be used as it messes up the order of the multi-column text.
Bold text in your PDFs can properly be recognized by your CustomFontFilter, i.e. by the
font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold")
condition.
Thus, one way to implement your task would be to extend the SimpleTextExtractionStrategy to check every chunk received using the CustomFontFilter condition and insert <b> tags where required.
For example like this:
public class BoldTaggingSimpleTextExtractionStrategy : SimpleTextExtractionStrategy
{
FieldInfo textField = typeof(TextRenderInfo).GetField("text", BindingFlags.NonPublic | BindingFlags.Instance);
bool currentlyBold = false;
public override void EventOccurred(IEventData data, EventType type)
{
if (type.Equals(EventType.RENDER_TEXT))
{
TextRenderInfo renderInfo = (TextRenderInfo)data;
string fontName = renderInfo.GetFont()?.GetFontProgram()?.GetFontNames()?.GetFontName();
if (fontName != null && fontName.Contains("Bold"))
{
if (!currentlyBold)
{
textField.SetValue(renderInfo, "<b>" + renderInfo.GetText());
currentlyBold = true;
}
}
else if (currentlyBold)
{
AppendTextChunk("</b>");
currentlyBold = false;
}
}
base.EventOccurred(data, type);
}
}
As you see I used reflection here. I did so because (A) TextRenderInfo does not allow public setting of the text and (B) AppendTextChunk must not be used before the first chunk is processed by base.EventOccurred - there the size of a StringBuilder containing the collected text chunks is used to check whether the chunk currently processed is the first one or not; if something is in that builder before at least one chunk has been processed, one gets a NullReferenceException. There are other work-arounds for that but reflection here means but one more line of code.

Related

Arabic problems with converting html to PDF using ITextRenderer

When I use ITextRenderer converting html to PDF.this is my code
ByteArrayOutputStream out = new ByteArrayOutputStream();
ITextRenderer renderer = new ITextRenderer();
String inputFile = "C://Users//Administrator//Desktop//aaa2.html";
String url = new File(inputFile).toURI().toURL().toString();
renderer.setDocument(url);
renderer.getSharedContext().setReplacedElementFactory(
new B64ImgReplacedElementFactory());
// 解决阿拉伯语问题
ITextFontResolver fontResolver = renderer.getFontResolver();
try {
fontResolver.addFont("C://Users//Administrator//Desktop//arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
} catch (DocumentException e) {
e.printStackTrace();
}
renderer.layout();
OutputStream outputStream = new FileOutputStream("C://Users//Administrator//Desktop//HTMLasPDF.pdf");
renderer.createPDF(outputStream, true);
/*PdfWriter writer = renderer.getWriter();
writer.open();
writer.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
OutputStream outputStream2 = new FileOutputStream( "C://Users//Administrator//Desktop//HTMLasPDFcopy.txt");
renderer.createPDF(outputStream2);*/
renderer.finishPDF();
out.flush();
out.close();
Actual PDF Result:
Expected PDF Result:
How to make arabic ligature?

If you want to do this properly (I assume using iText, since your post is tagged as such), you should use
iText7
pdfHTML (to convert HTML to PDF)
pdfCalligraph (to handle Arabic ligatures properly)
a font that supports these features (as indicated by another answer)
For an example, please consult the HTML to PDF tutorial, more specifically the following FAQ item: How to convert HTML containing Arabic/Hebrew characters to PDF?
You need fonts that contain the glyphs you need, e.g.:
public static final String[] FONTS = {
"src/main/resources/fonts/noto/NotoSans-Regular.ttf",
"src/main/resources/fonts/noto/NotoNaskhArabic-Regular.ttf",
"src/main/resources/fonts/noto/NotoSansHebrew-Regular.ttf"
};
And you need a FontProvider that knows how to find these fonts in the ConverterProperties:
public void createPdf(String src, String[] fonts, String dest) throws IOException {
ConverterProperties properties = new ConverterProperties();
FontProvider fontProvider = new DefaultFontProvider(false, false, false);
for (String font : fonts) {
FontProgram fontProgram = FontProgramFactory.createFont(font);
fontProvider.addFont(fontProgram);
}
properties.setFontProvider(fontProvider);
HtmlConverter.convertToPdf(new File(src), new File(dest), properties);
}
Note that the text will come out all wrong if you don't have the pdfCalligraph add-on. That add-on didn't exist at the time Flying Saucer was created, hence you can't use Flying Saucer for converting documents with text in Arabic, Hindi, Telugu,... Read the pdFCalligraph white paper if you want to know more about ligatures.

Greek characters seemed to be omitted; they didn’t show up in the document.
In flying saucer the generated PDF uses some kind of default
(probably Helvetica) font, that contains a very limited character set,
that obviously does not contain the Greek code page. link

I change the way to convert pdf by using wkhtmltopdf.

How to replace text with a MS Word web add-in by preserving the formatting?

I'm working on a simple grammar correction web add-in for MS Word. Basically, I want to get the selected text, make minimal changes and update the document with the corrected text. Currently, if I use 'text' as the coercion type, I lose formatting. If there is a table or image in the selected text, they are also gone!
As I understand from the investigation I've been doing so far, openxml is the way to go. But I couldn't find any useful example on the web. How can I manipulate text by preserving the original formatting data? How can I ignore non-text paragraphs? I want to be able to do this with the Office JavaScript API:

I would do something like this:
// get data as OOXML
Office.context.document.getSelectedDataAsync(Office.CoercionType.Ooxml, function (result) {
if (result.status === "succeeded") {
var selectionAsOOXML = result.value;
var bodyContentAsOOXML = selectionAsOOXML.match(/<w:body.*?>(.*?)<\/w:body>/)[1];
// perform manipulations to the body
// it can be difficult to do to OOXML but with som regexps it should be possible
bodyContentAsOOXML = bodyContentAsOOXML.replace(/error/g, 'rorre'); // reverse the word 'error'
// insert the body back in to the OOXML
selectionAsOOXML = selectionAsOOXML.replace(/(<w:body.*?>)(.*?)<\/w:body>/, '$1' + bodyContentAsOOXML + '<\/w:body>');
// replace the selected text with the new OOXML
Office.context.document.setSelectedDataAsync(selectionAsOOXML, { coercionType: Office.CoercionType.Ooxml }, function (asyncResult) {
if (asyncResult.status === "failed") {
console.log("Action failed with error: " + asyncResult.error.message);
}
});
}
});

Chapters in iText 7

I'm looking to create a pdf file with chapters and sub chapters with iText 7. I've found examples for previous versions of iText using the Chapter class. However this class does not seem to be included in iText 7.
How is that functionality implemented in iText7?

The Chapter and Section class in iText 5 were problematic. Already with iText 5, we advised people to use PdfOutline.
For an example on how to create chapters, and more specifically, the corresponding outlines in the bookmarks panel, please take a look at the iText 7: Building Blocks tutorial. This tutorial has a recurring theme: the novel "The Strange Case of Dr. Jekyll and Mr. Hyde."
We use that text and a database with movies based on this novel to explain how iText 7 works. If you don't have the time to read it, please jump to Chapter 6.
In this chapter, we create a document that looks like this:
You can download the full sample code here: TOC_OutlinesDestinations
BufferedReader br = new BufferedReader(new FileReader(SRC));
String name, line;
Paragraph p;
boolean title = true;
int counter = 0;
PdfOutline outline = null;
while ((line = br.readLine()) != null) {
p = new Paragraph(line);
p.setKeepTogether(true);
if (title) {
name = String.format("title%02d", counter++);
outline = createOutline(outline, pdf, line, name);
p.setFont(bold).setFontSize(12)
.setKeepWithNext(true)
.setDestination(name);
title = false;
document.add(p);
}
else {
p.setFirstLineIndent(36);
if (line.isEmpty()) {
p.setMarginBottom(12);
title = true;
}
else {
p.setMarginBottom(0);
}
document.add(p);
}
}
In this example, we loop over a text file that contains titles and chapters. Every time we encounter a title, we create a name (title01, title02, and so on), and we use this named as named destination for the title paragraph: setDestination(name).
We create the outlines using the PdfOutline object for which we define a named destination like this: PdfDestination.makeDestination(new PdfString(name))
public PdfOutline createOutline(PdfOutline outline, PdfDocument pdf, String title, String name) {
if (outline == null) {
outline = pdf.getOutlines(false);
outline = outline.addOutline(title);
outline.addDestination(PdfDestination.makeDestination(new PdfString(name)));
return outline;
}
PdfOutline kid = outline.addOutline(title);
kid.addDestination(PdfDestination.makeDestination(new PdfString(name)));
return outline;
}
There are other ways to achieve this result, but using named destinations is the most simple way. Just try the example, you'll discover that most of the complexity of this example is caused by the fact that we turn a simple text file into a document with chapter titles and chapter content.

OpenXML editing docx

I have a dynamically generated docx file.
Need write the text strictly to end of page.
With Microsoft.Interop i insert Paragraphs before text:
int kk = objDoc.ComputeStatistics(WdStatistic.wdStatisticPages, ref wMissing);
while (objDoc.ComputeStatistics(WdStatistic.wdStatisticPages, ref wMissing) != kk + 1)
{
objWord.Selection.TypeParagraph();
}
objWord.Selection.TypeBackspace();
But i can't use same code with Open XML, because pages.count calculated only by word.
Using interop impossible, because it so slowwwww.

There are 2 options of doing this in Open XML.
create Content Place holder from Microsoft Office Developer Tab at the end of your document and now you can access this Content Place Holder programatically and can place any text in it.
you can append text driectly to your word document where it will be inserted at the end of your text. In this approach you got to write all the stuff to your document first and once you are done than you can append your document the following way
//
public void WriteTextToWordDocument()
{
using(WordprocessingDocument doc = WordprocessingDocument.Open(documentPath, true))
{
MainDocumentPart mainPart = doc.MainDocumentPart;
Body body = mainPart.Document.Body;
Paragraph paragraph = new Paragraph();
Run run = new Run();
Text myText = new Text("Append this text at the end of the word document");
run.Append(myText);
paragraph.Append(run);
body.Append(paragraph);
// dont forget to save and close your document as in the following two lines
mainPart.Document.Save();
doc.Close();
}
}
I haven't tested the above code but hope it will give you an idea of dealing with word document in OpenXML.
Regards,

How to get text from textbox of MS word document using Apache POI?

I want to get information written in Textbox in an MS word document. I am using Apache POI to parse word document.
Currently I am iterating through all the Paragraph objects but this Paragraph list does not contain information from TextBox so I am missing this information in output.
e.g.
paragraph in plain text
**<some information in text box>**
one more paragraph in plain text
what i want to extract :
<para>paragraph in plain text</para>
<text_box>some information in text box</text_box>
<para>one more paragraph in plain text</para>
what I am getting currently :
paragraph in plain text
one more paragraph in plain text
Anyone knows how to extract information from text box using Apache POI?

This worked for me,
private void printContentsOfTextBox(XWPFParagraph paragraph) {
XmlObject[] textBoxObjects = paragraph.getCTP().selectPath("
declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'
declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape'
declare namespace v='urn:schemas-microsoft-com:vml'
.//*/wps:txbx/w:txbxContent | .//*/v:textbox/w:txbxContent");
for (int i =0; i < textBoxObjects.length; i++) {
XWPFParagraph embeddedPara = null;
try {
XmlObject[] paraObjects = textBoxObjects[i].
selectChildren(
new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p"));
for (int j=0; j<paraObjects.length; j++) {
embeddedPara = new XWPFParagraph(
CTP.Factory.parse(paraObjects[j].xmlText()), paragraph.getBody());
//Here you have your paragraph;
System.out.println(embeddedPara.getText());
}
} catch (XmlException e) {
//handle
}
}
}

To extract all occurrences of text from Word .doc and .docx files for crgrep I used the Apache Tika source as a reference of how the Apache POI APIs should be correctly used. This is useful if you want to use POI directly and not depend on Tika.
For Word .docx files, take a look at this Tika class:
org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator
if you ignore XHTMLContentHandler and formatting code you can see how to navigate a XWPFDocument correctly using POI.
For .doc files this class is helpful:
org.apache.tika.parser.microsoft.WordExtractor
both from the tika-parsers-1.x.jar. An easy way to access the Tika code through your maven dependencies is add Tika temporarily to your pom.xml such as
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.7</version>
</dependency>
let your IDE resolve attached source and step into the classes above.

If you want to get text from textbox in docx file (using POI 3.10-FINAL) here is sample code:
FileInputStream fileInputStream = new FileInputStream(inputFile);
XWPFDocument document = new XWPFDocument(OPCPackage.open(fileInputStream));
for (XWPFParagraph xwpfParagraph : document.getParagraphs()) {
String text = xwpfParagraph.getParagraphText(); //here is where you receive text from textbox
}
Or you can iterate over each
XWPFRun in XWPFParagraph and invoke toString() method. Same result.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

I need to extract text from a pdf file using itext7 or itextsharp and put html tag for bold around all the words using bold font - itext

Related

Arabic problems with converting html to PDF using ITextRenderer

How to replace text with a MS Word web add-in by preserving the formatting?

Chapters in iText 7

OpenXML editing docx

How to get text from textbox of MS word document using Apache POI?

Categories

Resources