iText PDF Reader with Images

I have a PDF in a 2-column format. I am able to parse it to plain text, but the PDF also has images between the text blocks. As a result, my text output gets jumbled for the pages that contain images.
For example consider a 2 column page format
Image Text2
Image Image
Image Text3
Text1 Image
Text4
Output is
Text4 Text3 Text2 Text1 instead of Text1 Text2 Text3 Text4
Is there a way to read the text in the proper order?
I am using the following code:
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));
    TextExtractionStrategy strategy;
    for (int i = 76; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
    }
    out.flush();
    out.close();
}

You are using the SimpleTextExtractionStrategy. This strategy assumes that the letter groups in the page content are already in a sensible order. Try the LocationTextExtractionStrategy instead, which sorts those letter groups by position.
You seem to prefer an interesting order, though. According to your question, you want to get Text1 Text2 Text3 Text4 for
Image Text2
Image Image
Image Text3
Text1 Image
Text4
The LocationTextExtractionStrategy will order top to bottom primarily, though, and only secondarily left to right. Thus, you'll get Text2 Text3 Text1 Text4. For your requirement you should copy LocationTextExtractionStrategy and change it to order the text fragments the way you need them.
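As a rough illustration of what "ordering the text fragments the way you need them" could look like, here is a minimal, library-free sketch. It is not the code of LocationTextExtractionStrategy: the Fragment class, the coordinates, and the two thresholds (columnBoundaryX, columnsBottomY) are all hypothetical. It also assumes that Text4 sits in a full-width area below the two columns, which is only one possible reading of the layout in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnOrderSketch {

    /** Hypothetical stand-in for a positioned text chunk; this is not an iText class. */
    static final class Fragment {
        final String text;
        final float x; // left edge of the chunk
        final float y; // baseline; PDF y coordinates grow upward

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** 0 = left column, 1 = right column, 2 = assumed full-width area below the columns. */
    static int band(Fragment f, float columnBoundaryX, float columnsBottomY) {
        if (f.y < columnsBottomY) {
            return 2;
        }
        return f.x < columnBoundaryX ? 0 : 1;
    }

    /** Returns the texts band by band; inside a band, top to bottom, then left to right. */
    static List<String> readingOrder(List<Fragment> fragments,
                                     float columnBoundaryX, float columnsBottomY) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort((a, b) -> {
            int byBand = Integer.compare(band(a, columnBoundaryX, columnsBottomY),
                                         band(b, columnBoundaryX, columnsBottomY));
            if (byBand != 0) {
                return byBand;
            }
            int byY = Float.compare(b.y, a.y); // larger y is nearer the top of the page
            if (byY != 0) {
                return byY;
            }
            return Float.compare(a.x, b.x);
        });
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up coordinates that mimic the layout from the question.
        List<Fragment> page = Arrays.asList(
                new Fragment("Text4", 50f, 500f),
                new Fragment("Text3", 300f, 600f),
                new Fragment("Text2", 300f, 700f),
                new Fragment("Text1", 50f, 550f));
        System.out.println(readingOrder(page, 200f, 520f)); // [Text1, Text2, Text3, Text4]
    }
}
```

In a real copy of the strategy, the same comparison would have to be applied to the text chunks the strategy collects, and the thresholds would have to come from the actual page layout.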
If that desired order is due to the content being meant to be interpreted as being in two columns, though, you may want to parse the columns separately by filtering the strategy input:
Rectangle rect = new Rectangle(x1, y1, x2, y2);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
See the ExtractPageContentArea example from iText in Action, 2nd edition.
Regards, Michael

Multiline column copy paste in VS Code

Is it possible to paste one clipboard line per cursor in multi-cursor editing (each cursor is shown as |)?
text1 = [|]
text2 = [|]
text3 = [|]
text4 = [|]
Assuming I have pasted the following lines:
val1
val2
val3
val4
I would like to have this result:
text1 = [val1]
text2 = [val2]
text3 = [val3]
text4 = [val4]
What actually happens is that the clipboard content is pasted four times, once for each cursor.
I am looking for something like what is described in this answer, but pasting instead of typing: https://stackoverflow.com/a/30039968/1374488
Use column-edit instead of the multi-line edit mode:
1. Click the end of the source text.
2. Hold Shift+Alt and click the beginning.
3. Copy.
4. Click the end of the destination text.
5. Hold Shift+Alt and click the beginning.
6. Paste.
I had some trouble with this until I figured it out. The second selection (where you want to paste) must have the same number of lines as the first selection; otherwise it pastes all items at each location instead of one item per row.
1. Select the column of data you want to copy by holding Alt+Shift while dragging a selection box, and copy it with Ctrl+C.
2. Select the places you want to paste into with Alt+click (note: this helps if the lines to be pasted into are in different places).
3. Paste into the selected locations with Ctrl+V.
I had to do this for hundreds of lines, mapping DB columns.
What I ended up doing to speed this up was creating an Excel sheet with 3 columns:
COL1 COL2 COL3
text1 = [ val1 ]
text2 = [ val2 ]
text3 = [ val3 ]
text4 = [ val4 ]
And then searching and replacing tabs.
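If an editor-based approach gets tedious for hundreds of lines, the same pairing can also be scripted. The following is only a sketch under assumptions: the class and method names are hypothetical, and it assumes every target line contains an empty `[]` placeholder and that there is exactly one value per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnPasteSketch {

    /** Puts the i-th value into the empty brackets of the i-th line. */
    static List<String> fillBrackets(List<String> lines, List<String> values) {
        if (lines.size() != values.size()) {
            throw new IllegalArgumentException("need exactly one value per line");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // String.replace works on literal text, so "[]" is not treated as a regex.
            result.add(lines.get(i).replace("[]", "[" + values.get(i) + "]"));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("text1 = []", "text2 = []", "text3 = []", "text4 = []");
        List<String> values = Arrays.asList("val1", "val2", "val3", "val4");
        for (String line : fillBrackets(lines, values)) {
            System.out.println(line);
        }
    }
}
```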
This worked for me: https://github.com/john-guo/columnpaste. It adds a column paste command.

Ə character in Pdf

I want to add a paragraph containing the Ə character to a PDF file. I have tried everything, but the Ə character is not shown in the PDF file. What can I do in this situation?
My code is here:
BaseFont bf = BaseFont.createFont("assets/font/AZER_TM.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Paragraph tranportParagraph = new Paragraph("XƏƏƏƏƏöğııəçş\u0259", new Font(bf, 22));
tranportParagraph.setAlignment(Element.ALIGN_CENTER);
document.add(tranportParagraph);
This is a picture of my AZER_TM.ttf file:
As you can see in the picture, the Ə character is present in my AZER_TM.ttf file, but the character is not shown in the PDF file.

LibreOffice Draw - add hyperlinks based on query table

I am using Draw to mark up an index map in PDF format, so that in grid 99, for example, the text hyperlinks to map99.pdf.
There are thousands of grid cells. Is there a way for a macro to scan for text listed in a sheet like this:
Text in File | Link to add
99|file:///c:/maps/map99.pdf
100|file:///c:/maps/map100.pdf
and add a link to the relevant file wherever the text is found (99, 100, etc.)?
I don't use LibreOffice much, but I am happy to implement any programmatic solution.
Ok, after using xray to drill through enumerated content, I finally have the answer. The code needs to create a text field using a cursor. Here is a complete working solution:
Sub AddLinks
    Dim oDocument As Object
    Dim vDescriptor, vFound
    Dim numText As String, tryNumText As Integer
    Dim oDrawPages, oDrawPage
    Dim oField, oCurs
    Dim numChanged As Integer
    oDocument = ThisComponent
    oDrawPages = oDocument.getDrawPages()
    oDrawPage = oDrawPages.getByIndex(0)
    numChanged = 0
    For tryNumText = 1 to 1000
        vDescriptor = oDrawPage.createSearchDescriptor
        With vDescriptor
            '.SearchString = "[:digit:]+" 'Patterns work in search box but not here?
            .SearchString = tryNumText
        End With
        vFound = oDrawPage.findFirst(vDescriptor)
        If Not IsNull(vFound) Then
            numText = vFound.getString()
            oField = ThisComponent.createInstance("com.sun.star.text.TextField.URL")
            oField.Representation = numText
            oField.URL = numText & ".pdf"
            vFound.setString("")
            oCurs = vFound.getText().createTextCursorByRange(vFound)
            oCurs.getText().insertTextContent(oCurs, oField, False)
            numChanged = numChanged + 1
        End If
    Next tryNumText
    MsgBox("Added " & numChanged & " links.")
End Sub
To save relative links, go to File -> Export as PDF -> Links and check Export URLs relative to file system.
I uploaded an example file here that works. For some reason your example file is hanging on my system -- maybe it's too large.
Replacing text with links is much easier in Writer than in Draw. However Writer does not open PDF files.
There is some related code at https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=1401.

How can I use regular and bold in a single String?

I have a String that consists of a constant part and a variable part.
I want the variable to be formatted using a regular font within the text paragraph, whereas I want the constant part to be bold.
This is my code:
String cc_cust_name = request.getParameter("CC_CUST_NAME");
document.add(new Paragraph(" NAME " + cc_cust_name, fontsmallbold));
My code for a cell in a table looks like this:
cell1 = new PdfPCell(new Phrase("Date of Birth" + cc_cust_dob ,fontsmallbold));
In both cases, the first part (" NAME " and "Date of Birth") should be bold and the variable part (cc_cust_name and cc_cust_dob) should be regular.
Right now you are creating a Paragraph using a single font: fontsmallbold. You want to create a Paragraph that uses two different fonts:
Font regular = new Font(FontFamily.HELVETICA, 12);
Font bold = new Font(FontFamily.HELVETICA, 12, Font.BOLD);
Paragraph p = new Paragraph("NAME: ", bold);
p.add(new Chunk(cc_cust_name, regular));
As you can see, we create a Paragraph with content "NAME: " that uses font bold. Then we add a Chunk to the Paragraph with the value of cc_cust_name in font regular.
See also How to set two different colors for a single string in itext and Applying color to Strings in Paragraph using Itext which are two questions that address the same topic.
You can also use this in the context of a PdfPCell in which case you create a Phrase that uses two fonts:
Font regular = new Font(FontFamily.HELVETICA, 12);
Font bold = new Font(FontFamily.HELVETICA, 12, Font.BOLD);
Phrase p = new Phrase("NAME: ", bold);
p.add(new Chunk(cc_cust_name, regular));
PdfPCell cell = new PdfPCell(p);

How to use non breaking space in iTextSharp

How can a non-breaking space be used to get multiline content in a PdfPTable cell? iTextSharp breaks the text at the space characters.
The scenario is that I want multiline content in a table header: the first line should display "Text1 &" and the second line should display "Text". When the PDF is rendered, "Text1" is displayed on the first line and "&" on the second; the third line is cut to the length of the first line, and the remaining characters are pushed to the next line.
Alternatively, can I set a specific width for each column of the table, so that the text wraps within that width?
You didn't specify a language so I'll answer in VB.Net but you can easily convert it to C# if needed.
To your first question: to use a non-breaking space, just use the appropriate Unicode code point, U+00A0.
In VB.Net you'd declare it like:
Dim NBSP As Char = ChrW(&HA0)
And in C#:
Char NBSP = '\u00a0';
Then you can just concatenate it where needed:
Dim Text2 As String = "This is" & NBSP & "also" & NBSP & "a test"
You might also find the non-breaking hyphen (U+2011) helpful, too.
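For readers working in Java, the same idea can be sketched without any PDF library. The helper name keepTogether is hypothetical; the sketch only builds the string, and whether a given layout engine honors U+00A0 as non-breaking still depends on that engine and the font in use.

```java
public class NbspSketch {

    /** U+00A0, the non-breaking space. */
    static final char NBSP = '\u00A0';

    /** Replaces ordinary spaces so the phrase offers no break opportunity at a space. */
    static String keepTogether(String phrase) {
        return phrase.replace(' ', NBSP);
    }

    public static void main(String[] args) {
        // "Text1 &" should stay on one line; the ordinary space before "Text" may still break.
        String header = keepTogether("Text1 &") + " " + "Text";
        System.out.println(header.indexOf(NBSP)); // 5
    }
}
```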
To your second question, yes you can set the width of every column. However, column widths are always set as relative widths so if you use:
T.SetTotalWidth(New Single() {2.0F, 1.0F})
What you are actually saying is that, for the given table, the first column should be twice as wide as the second column. You are NOT saying that the first column is 2px wide and the second is 1px. This is very important to understand. The above code is exactly the same as either of the next two lines:
T.SetTotalWidth(New Single() {4.0F, 2.0F})
T.SetTotalWidth(New Single() {100.0F, 50.0F})
The column widths are relative to the table's width which by default (if I remember correctly) is 80% of the writable page's width. If you would like to fix the table's width to an absolute width you need to set two properties:
''//Set the width
T.TotalWidth = 200.0F
''//Lock it from trying to expand
T.LockedWidth = True
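The arithmetic behind relative widths can be sketched in a few lines of plain Java. This is not iTextSharp code: absoluteWidths is a hypothetical helper that only shows how relative weights could be scaled to a fixed total width, under the assumption that the layout simply divides the total in proportion to the weights.

```java
import java.util.Arrays;

public class RelativeWidthsSketch {

    /** Scales relative column weights so that the resulting widths sum to totalWidth. */
    static float[] absoluteWidths(float[] relative, float totalWidth) {
        float sum = 0f;
        for (float w : relative) {
            sum += w;
        }
        if (sum <= 0f) {
            throw new IllegalArgumentException("relative widths must sum to a positive value");
        }
        float[] result = new float[relative.length];
        for (int i = 0; i < relative.length; i++) {
            result[i] = totalWidth * relative[i] / sum;
        }
        return result;
    }

    public static void main(String[] args) {
        // All three weight sets describe the same 2:1 split of a 200-unit-wide table.
        System.out.println(Arrays.toString(absoluteWidths(new float[] {2f, 1f}, 200f)));
        System.out.println(Arrays.toString(absoluteWidths(new float[] {4f, 2f}, 200f)));
        System.out.println(Arrays.toString(absoluteWidths(new float[] {100f, 50f}, 200f)));
    }
}
```

With a total width of 200, each of the three weight sets yields roughly 133.3 for the first column and 66.7 for the second, which is why they are interchangeable.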
Putting all of the above together, below is a full working WinForms app targeting iTextSharp 5.1.1.0:
Option Explicit On
Option Strict On
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Public Class Form1
    Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        ''//File that we will create
        Dim OutputFile As String = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "TableTest.pdf")
        ''//Standard PDF init
        Using FS As New FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None)
            Using Doc As New Document(PageSize.LETTER)
                Using writer = PdfWriter.GetInstance(Doc, FS)
                    Doc.Open()
                    ''//Create our table with two columns
                    Dim T As New PdfPTable(2)
                    ''//Set the relative widths of each column
                    T.SetTotalWidth(New Single() {2.0F, 1.0F})
                    ''//Set the table width
                    T.TotalWidth = 200.0F
                    ''//Lock the table from trying to expand
                    T.LockedWidth = True
                    ''//Our non-breaking space character
                    Dim NBSP As Char = ChrW(&HA0)
                    ''//Normal string
                    Dim Text1 As String = "This is a test"
                    ''//String with some non-breaking spaces
                    Dim Text2 As String = "This is" & NBSP & "also" & NBSP & "a test"
                    ''//Add the text to the table
                    T.AddCell(Text1)
                    T.AddCell(Text2)
                    ''//Add the table to the document
                    Doc.Add(T)
                    Doc.Close()
                End Using
            End Using
        End Using
        Me.Close()
    End Sub
End Class