itextsharp text extraction fails for some pdfs - itext

I have a couple of PDF files that I cannot extract text from. These PDF files were created by converting Word files to PDF.
The main reason I am extracting text from the PDFs is to index it and make it searchable.
PdfReader reader = new PdfReader(inFileName);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // where strPDFText is a StringBuilder
    strPDFText.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page) + " ");
}
string str = strPDFText.ToString();
I get an empty string. What could be the reason for the same. I am using iTextSharp 5.5.

While the sample PDF provided by the OP indeed indicates that it is a MS Word export, it simply does not contain any text, only an image (which incidentally shows text).
The content of the PDF is this:
/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 540.1 500.95 Tm
/GS7 gs
0 g
0 G
[( )] TJ
ET
EMC /P <</MCID 1>> BDC q
0.000000071 488.88 612 231.12 re
W* n
468 0 0 219.05 72 500.95 cm
/Image8 Do Q
EMC
As you can see, the only actual text displayed is a single space ([( )] TJ), and the only remaining content is a bitmap image (/Image8 Do).
Thus,
I get an empty string. What could be the reason for the same.
The reason is that there is no text in your document.
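As a rough check of the point above, the content stream can be scanned for text-showing operators. The sketch below is only an illustration in Python, not how iTextSharp works internally: it assumes an uncompressed stream with simple literal strings (no escaped or nested parentheses), and the names CONTENT, shown_text, and drawn_xobjects are made up for this example.

```python
import re

# The page content quoted above, copied as-is
CONTENT = r"""/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 540.1 500.95 Tm
/GS7 gs
0 g
0 G
[( )] TJ
ET
EMC /P <</MCID 1>> BDC q
0.000000071 488.88 612 231.12 re
W* n
468 0 0 219.05 72 500.95 cm
/Image8 Do Q
EMC"""

# Either an array followed by TJ, or a single literal string followed by Tj
TEXT_OP = re.compile(r"\[([^\]]*)\]\s*TJ|\(([^()]*)\)\s*Tj")


def shown_text(content):
    """Concatenate the literal strings passed to the TJ / Tj operators."""
    pieces = []
    for match in TEXT_OP.finditer(content):
        if match.group(1) is not None:
            # TJ array: strings mixed with kerning numbers, e.g. [(A) -20 (B)]
            pieces.extend(re.findall(r"\(([^()]*)\)", match.group(1)))
        else:
            pieces.append(match.group(2))
    return "".join(pieces)


def drawn_xobjects(content):
    """List the names of XObjects (such as images) painted with the Do operator."""
    return re.findall(r"/(\w+)\s+Do\b", content)
```

For the stream above, shown_text(CONTENT) yields a single space and drawn_xobjects(CONTENT) yields one name, Image8, which matches the conclusion that the page holds an image rather than text.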

Replacement of text in xml file by AHK - get error when trying to open as xml file

I am using an AHK script to replace some text in an .xml file (Result.xml in this case). The script then saves a file as Result_copy.xml. It changes exactly what I need, but when I try to open the new xml file, it won't open, giving me the error:
This page contains the following errors:
error on line 4 at column 16: Encoding error
Below is a rendering of the page up to the first error.
I only replaced text at line 38 using:
#Include TF.ahk
path = %1%
text = %2%
TF_ReplaceLine(path, 38, 38, text)
%1% and %2% are supplied by another program and work as they should.
I also see that the original Result.xml is 123 kb and Result_copy.xml is 62 kb, even though I only add text. When I take Result.xml, manually add the text, and save it, it is 123 kb and still opens. So now both files contain exactly the same characters, but one won't open as XML. I think that something happens during saving/copying that I don't understand.
Could someone help me out on this one? I don't have a lot of experience in AHK scripting and do not have a programming background.
Thank you in advance!
Michel
TF.ahk contains this:
/*
Name : TF: Textfile & String Library for AutoHotkey
Version : 3.8
Documentation : https://github.com/hi5/TF
AutoHotkey.com: https://www.autohotkey.com/boards/viewtopic.php?f=6&t=576
AutoHotkey.com: http://www.autohotkey.com/forum/topic46195.html (Also for examples)
License : see license.txt (GPL 2.0)
Credits & History: See documentation at GH above.
*/
TF_ReplaceLine(Text, StartLine = 1, Endline = 0, ReplaceText = "")
{
    TF_GetData(OW, Text, FileName)
    TF_MatchList:=_MakeMatchList(Text, StartLine, EndLine, 0, A_ThisFunc) ; create MatchList
    Loop, Parse, Text, `n, `r
    {
        If A_Index in %TF_MatchList%
            Output .= ReplaceText "`n"
        Else
            Output .= A_LoopField "`n"
    }
    Return TF_ReturnOutPut(OW, OutPut, FileName)
}
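One possible explanation, offered only as a guess from the symptoms: a file that shrinks from 123 kb to 62 kb while gaining text is what you would expect if Result.xml is stored as UTF-16 (two bytes per character) and the copy was written in an 8-bit encoding, while the XML declaration still names the original encoding. The Python sketch below replaces one line and writes the result back in the encoding detected from the byte order mark. The function names detect_encoding and replace_line and the BOM-based detection are assumptions for illustration; they are not part of TF.ahk.

```python
import codecs


def detect_encoding(raw):
    """Guess the text encoding from a byte order mark (assumption: no BOM means UTF-8)."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # the decoder consumes the BOM, the encoder writes one
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # same behaviour for the UTF-8 signature
    return "utf-8"


def replace_line(src_path, dst_path, line_no, new_text):
    """Replace the 1-based line `line_no` and keep the source file's encoding."""
    with open(src_path, "rb") as f:
        raw = f.read()
    encoding = detect_encoding(raw)
    lines = raw.decode(encoding).split("\n")
    # Preserve a Windows line ending (\r\n) on the replaced line if it had one
    suffix = "\r" if lines[line_no - 1].endswith("\r") else ""
    lines[line_no - 1] = new_text + suffix
    with open(dst_path, "wb") as f:
        f.write("\n".join(lines).encode(encoding))
    return encoding
```

A call such as replace_line("Result.xml", "Result_copy.xml", 38, text) would then produce a copy whose bytes still agree with the encoding named in the XML declaration, provided the guess about the encoding is right.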

LibreOffice Draw -add hyperlinks based on query table

I am using Draw to mark up a PDF index map, so that in grid cell 99 the text hyperlinks to map99.pdf.
There are thousands of grid cells. Is there a way for a macro to scan for text using a lookup sheet like this:
Text in File | Link to add
99|file:///c:/maps/map99.pdf
100|file:///c:/maps/map100.pdf
and add a link to the relevant file wherever the text is found (99, 100, etc.)?
I don't use LibreOffice much, but I am happy to implement any programmatic solution.
Ok, after using xray to drill through enumerated content, I finally have the answer. The code needs to create a text field using a cursor. Here is a complete working solution:
Sub AddLinks
    Dim oDocument As Object
    Dim vDescriptor, vFound
    Dim numText As String, tryNumText As Integer
    Dim oDrawPages, oDrawPage
    Dim oField, oCurs
    Dim numChanged As Integer
    oDocument = ThisComponent
    oDrawPages = oDocument.getDrawPages()
    oDrawPage = oDrawPages.getByIndex(0)
    numChanged = 0
    For tryNumText = 1 to 1000
        vDescriptor = oDrawPage.createSearchDescriptor
        With vDescriptor
            '.SearchString = "[:digit:]+" 'Patterns work in search box but not here?
            .SearchString = tryNumText
        End With
        vFound = oDrawPage.findFirst(vDescriptor)
        If Not IsNull(vFound) Then
            numText = vFound.getString()
            oField = ThisComponent.createInstance("com.sun.star.text.TextField.URL")
            oField.Representation = numText
            oField.URL = numText & ".pdf"
            vFound.setString("")
            oCurs = vFound.getText().createTextCursorByRange(vFound)
            oCurs.getText().insertTextContent(oCurs, oField, False)
            numChanged = numChanged + 1
        End If
    Next tryNumText
    MsgBox("Added " & numChanged & " links.")
End Sub
To save relative links, go to File -> Export as PDF -> Links and check Export URLs relative to file system.
I uploaded an example file here that works. For some reason your example file is hanging on my system -- maybe it's too large.
Replacing text with links is much easier in Writer than in Draw. However Writer does not open PDF files.
There is some related code at https://forum.openoffice.org/en/forum/viewtopic.php?f=20&t=1401.
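The macro above derives each URL from the number it finds. If the links should instead come from the lookup table in the question, the table first has to be read into a mapping. Here is a minimal sketch in Python (the language choice, the function name load_links, and the assumption that the first row is a header are all illustrative; the resulting dictionary could then drive the search loop):

```python
import csv
import io


def load_links(table_text):
    """Parse 'text|url' rows into a dict, treating the first row as a header."""
    reader = csv.reader(io.StringIO(table_text), delimiter="|")
    next(reader, None)  # skip the header row
    # Ignore blank or malformed rows that do not have both columns
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


TABLE = """Text in File | Link to add
99|file:///c:/maps/map99.pdf
100|file:///c:/maps/map100.pdf"""

links = load_links(TABLE)
```

With the sample table, links maps the string "99" to file:///c:/maps/map99.pdf, so the text found on the page can be used directly as the lookup key.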

Combine multiple Excel workbooks into one [closed]

Update:
Below is sample VBA code that I found on joinedupdata.com. I need help making two modifications: (1) remove the logic that deletes repeated header rows and (2) see if there's a way to separate the concatenated data from each Excel file by a blank row in the combined sheet that has the filename of the following table in the left-most cell.
Dim firstRowHeaders As Boolean
Dim fso As Object
Dim dir As Object
Dim filename As Variant
Dim wb As Workbook
Dim s As Worksheet
Dim thisSheet As Worksheet
Dim lastUsedRow As Range
Dim file As String
Dim i As Integer 'Number of files merged so far
On Error GoTo ErrMsg
Application.ScreenUpdating = False
firstRowHeaders = True 'Change from True to False if there are no headers in the first row
Set fso = CreateObject("Scripting.FileSystemObject")
'PLEASE NOTE: Change <<Full path to your Excel files folder>> to the path to the folder containing your Excel files to merge
Set dir = fso.Getfolder("<<Full path to your Excel files folder>>")
Set thisSheet = ThisWorkbook.ActiveSheet
For Each filename In dir.Files
    'Open the spreadsheet in ReadOnly mode
    Set wb = Application.Workbooks.Open(filename, ReadOnly:=True)
    'Copy the used range (i.e. cells with data) from the opened spreadsheet
    If firstRowHeaders And i > 0 Then 'Only include headers from the first spreadsheet
        Dim mr As Integer
        mr = wb.ActiveSheet.UsedRange.Rows.Count
        wb.ActiveSheet.UsedRange.Offset(1, 0).Resize(mr - 1).Copy
    Else
        wb.ActiveSheet.UsedRange.Copy
    End If
    'Paste after the last used cell in the master spreadsheet
    If Application.Version < "12.0" Then 'Excel 2007 introduced more rows
        Set lastUsedRow = thisSheet.Range("A65536").End(xlUp)
    Else
        Set lastUsedRow = thisSheet.Range("A1048576").End(xlUp)
    End If
    'Only offset by 1 if there are current rows with data in them
    If thisSheet.UsedRange.Rows.Count > 1 Or Application.CountA(thisSheet.Rows(1)) Then
        Set lastUsedRow = lastUsedRow.Offset(1, 0)
    End If
    lastUsedRow.PasteSpecial
    Application.CutCopyMode = False
    i = i + 1 'Without this increment the header check above never triggers
Next filename
ThisWorkbook.Save
Set wb = Nothing
#If Mac Then
    'Do nothing. Closing workbooks fails on Mac for some reason
#Else
    'Close the workbooks except this one
    For Each filename In dir.Files
        file = Right(filename, Len(filename) - InStrRev(filename, Application.PathSeparator, , 1))
        Workbooks(file).Close SaveChanges:=False
    Next filename
#End If
Application.ScreenUpdating = True
ErrMsg:
If Err.Number <> 0 Then
    MsgBox "There was an error. Please try again. [" & Err.Description & "]"
End If
I've been trying (without much success) to find a way to merge multiple Excel spreadsheets into one. I'm using MATLAB to analyze experimental data. A dozen Excel spreadsheets go in and an equal number come out.
Spreadsheet Structure:
The data in each Excel file is only on the first sheet (Sheet 1).
Each sheet has four columns of data (with headers) and a variable number of data rows underneath.
Each Excel file has a unique filename.
Example:
Header 1 | Header 2 | Header 3 | Header 4
1111     | 22222    | 3333     | 4444
11122    | 11223    | 33344    | 33444
etc      | etc      | etc      | etc
Preferred Merging Behavior:
1) Multiple Excel files are merged into one sheet on a single new spreadsheet.
2) Column headers are maintained during the merge.
3) Instead of adding each successive data set to the bottom of the previous one ("vertical" addition), it would be great if the columns could be placed side-by-side ("horizontal" addition) with a one-column break in-between.
4) The filename of each original file is placed into a row just above the first column header.
5) Preferably cross-platform (Windows/Mac OS X). However, if VBA with ActiveX is the only way to go, that's also fine.
Sample Output:
Filename1 Filename2
Header 1 | Header 2 | Header 3 | Header 4 Header 1 | Header 2 | Header 3 | ...
111 22222 33333 4444 1111 222222 44444
Data... Data... Data... Data... Data... Data... Data...
A simple loop through the workbooks in the same folder as the master workbook should suffice.
Sub collect_wb_data()
    Dim wbm As Workbook, wb As Workbook
    Dim fp As String, fn As String, nc As Long
    'Application.ScreenUpdating = False
    Set wbm = ThisWorkbook
    With wbm.Worksheets("sheet1") 'set this properly to the receiving worksheet in the master workbook
        fp = wbm.Path
        fn = "*.xl*"
        fn = Dir(fp & Chr(92) & fn)
        Do While CBool(Len(fn))
            If Not fn = .Parent.Name Then
                Set wb = Workbooks.Open(Filename:=fp & Chr(92) & fn, _
                                        UpdateLinks:=False, _
                                        ReadOnly:=True)
                nc = nc + 1
                .Cells(1, nc) = Left(fn, InStr(1, fn, Chr(46)) - 1)
                wb.Worksheets(1).Cells(1, 1).CurrentRegion.Copy Destination:=.Cells(2, nc)
                wb.Close SaveChanges:=False
                Set wb = Nothing
                nc = .Cells(2, Columns.Count).End(xlToLeft).Offset(0, 1).Column
            End If
            fn = Dir
        Loop
        '.parent.save 'Uncomment to save before finishing operation
    End With
    Set wbm = Nothing
    Application.ScreenUpdating = True
End Sub
Oddly, there has been scant mention of just how the list of workbooks to be processed was intended to be derived. I've used a simple file mask on the same folder that the master workbook resides in, but I have left it easy to change. If specific files are to be processed, a multiple-selection list can be made from a standard File Open dialog instead. A hard-coded array of workbook names is another option.
I've left a couple of commands (e.g. screen updating disabled, saving before finishing) commented out. You might want to uncomment these once you are satisfied with the method(s).
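The column bookkeeping in the loop above (filename in row 1, data from row 2 down, one empty column between blocks) can be illustrated independently of Excel. The Python sketch below works on plain lists of rows; place_side_by_side and the sample data are hypothetical, and reading real workbooks would need a spreadsheet library that is not shown here.

```python
def place_side_by_side(tables, gap=1):
    """Lay out {name: rows} blocks left to right with `gap` empty columns between them.

    Returns a list of rows (lists of cell values); empty cells are "".
    Assumes every table has at least one non-empty row.
    """
    height = 1 + max(len(rows) for rows in tables.values())  # +1 for the filename row
    grid = [[] for _ in range(height)]
    for index, (name, rows) in enumerate(tables.items()):
        width = max(len(row) for row in rows)
        if index > 0:
            # Leave the separating blank column(s) before every block but the first
            for grid_row in grid:
                grid_row.extend([""] * gap)
        # Row 0 holds the filename above the first header of this block
        grid[0].extend([name] + [""] * (width - 1))
        for r in range(1, height):
            cells = list(rows[r - 1]) if r - 1 < len(rows) else []
            grid[r].extend(cells + [""] * (width - len(cells)))
    return grid


sample = {
    "Filename1": [["Header 1", "Header 2"], [1111, 22222], [11122, 11223]],
    "Filename2": [["Header 1", "Header 2"], [111, 222]],
}
merged = place_side_by_side(sample)
```

Each block is padded to its own width and to the height of the tallest block, so shorter tables simply leave empty cells underneath.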

How to use non breaking space in iTextSharp

How can a non-breaking space be used to get multiline content in a PdfPTable cell? iTextSharp breaks the text at the space characters.
The scenario is that I want multiline content in a table header: the first line should display "Text1 &" and the second line "Text". When the PDF is rendered, "Text1" is displayed on the first line, "&" on the second, and on the third line it takes the length of the first line and wraps the remaining characters onto the next line.
Alternatively, can I set a specific width for each column of the table to accommodate the text, so that the text wraps within that width?
You didn't specify a language so I'll answer in VB.Net but you can easily convert it to C# if needed.
To your first question, to use a non-breaking space just use the appropriate Unicode code point U+00A0:
In VB.Net you'd declare it like:
Dim NBSP As Char = ChrW(&HA0)
And in C#:
Char NBSP = '\u00a0';
Then you can just concatenate it where needed:
Dim Text2 As String = "This is" & NBSP & "also" & NBSP & "a test"
You might also find the non-breaking hyphen (U+2011) helpful, too.
To your second question, yes, you can set the width of every column. However, column widths passed to SetWidths are relative widths, so if you use:
T.SetWidths(New Single() {2.0F, 1.0F})
What you are actually saying is that, for the given table, the first column should be twice as wide as the second column. You are NOT saying that the first column is 2px wide and the second is 1px. This is very important to understand. The above code is exactly equivalent to either of the next two lines:
T.SetWidths(New Single() {4.0F, 2.0F})
T.SetWidths(New Single() {100.0F, 50.0F})
The column widths are relative to the table's width which by default (if I remember correctly) is 80% of the writable page's width. If you would like to fix the table's width to an absolute width you need to set two properties:
''//Set the width
T.TotalWidth = 200.0F
''//Lock it from trying to expand
T.LockedWidth = True
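To make the arithmetic concrete: with relative widths of 2 and 1 and a locked total width of 200, each column gets its proportional share of the total. The Python helper below is just a worked calculation under that reading, not iTextSharp code, and its name absolute_widths is made up for this example.

```python
def absolute_widths(relative, total_width):
    """Split total_width among columns in proportion to their relative widths."""
    whole = sum(relative)
    return [total_width * share / whole for share in relative]


# Three ways of writing the same 2:1 ratio
two_to_one = absolute_widths([2, 1], 200)
four_to_two = absolute_widths([4, 2], 200)
hundred_to_fifty = absolute_widths([100, 50], 200)
```

All three calls give roughly 133.3 for the first column and 66.7 for the second, which is why the numbers passed in only matter as a ratio.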
Putting the above all together, below is a full working WinForms app targeting iTextSharp 5.1.1.0:
Option Explicit On
Option Strict On
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Public Class Form1
    Private Sub Form1_Load(sender As System.Object, e As System.EventArgs) Handles MyBase.Load
        ''//File that we will create
        Dim OutputFile As String = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "TableTest.pdf")
        ''//Standard PDF init
        Using FS As New FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None)
            Using Doc As New Document(PageSize.LETTER)
                Using writer = PdfWriter.GetInstance(Doc, FS)
                    Doc.Open()
                    ''//Create our table with two columns
                    Dim T As New PdfPTable(2)
                    ''//Set the relative widths of each column
                    T.SetWidths(New Single() {2.0F, 1.0F})
                    ''//Set the table width
                    T.TotalWidth = 200.0F
                    ''//Lock the table from trying to expand
                    T.LockedWidth = True
                    ''//Our non-breaking space character
                    Dim NBSP As Char = ChrW(&HA0)
                    ''//Normal string
                    Dim Text1 As String = "This is a test"
                    ''//String with some non-breaking spaces
                    Dim Text2 As String = "This is" & NBSP & "also" & NBSP & "a test"
                    ''//Add the text to the table
                    T.AddCell(Text1)
                    T.AddCell(Text2)
                    ''//Add the table to the document
                    Doc.Add(T)
                    Doc.Close()
                End Using
            End Using
        End Using
        Me.Close()
    End Sub
End Class

Displaying text from two fields, separated by a varying number of "." symbols, while preserving the total string length

I'm trying to create a Table of Contents for a small publication using Filemaker 10, since that's what the data has been stored in previously.
I'm able to generate page numbers, add headings to the TOC, and do pretty much everything else I've needed, with one exception.
Our designer wants to fill each TOC line with "." to make it easier to read.
Currently:
Using Stack Overflow 1
Why Reddit is better than digg 7
Does Filemaker really suck this much 84
Ways to convince bosses 92
Ditching FileMaker 97
Wanted:
Using Stack Overflow..................................................1
Why Reddit is better than digg........................................7
Does Filemaker really suck this much.................................84
Ways to convince bosses..............................................92
Ditching FileMaker...................................................97
The item and page number are in different fields. Using a border is unsatisfactory because it underlines everything.
Solutions?
You can do this using tab stops in the Format -> Text menu
1) Create a calc field with the following definition (the character in the quotes is a tab):
title & " " & page
2) Add this field to your layout (it needs to be an actual field, not a merge field)
3) Highlight the field and choose format -> text -> paragraph -> tabs
4) Create a new Tab with a position of 6 inches and a Fill Character of "." or "…"
Now when viewed, any space from the end of the title up to the tab stop 6 inches away is filled with the fill character. No monospace font required.
You need to break it up into bits and then put it back together with the right spacing. Something like this would do:
Let ( [
    text = "Why Reddit is better than digg........................................7" ;
    len = Length ( text ) ;
    end = RightWords ( text ; 1 ) ;
    lenEnd = Length ( end ) ;
    lenStart = Length ( Trim ( Left ( text ; len - lenEnd ) ) ) ] ;
    Left ( text ; lenStart ) &
    Left ( "..........................................................................." ; len - lenStart - lenEnd ) &
    end )
I've built the "text" variable into the calc for testing, but you could do this as a Custom Function or just inside a calculation with the field instead.
Also, this assumes you're using a monospaced font and that the gap in the middle consists of space characters.
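For comparison, here is the same fixed-width padding idea in a general-purpose language. This Python sketch only illustrates the logic (it is not FileMaker syntax), and like the calculation above it only lines up in a monospaced font; the function name toc_line and the default width of 60 are arbitrary choices.

```python
def toc_line(title, page, width=60, fill="."):
    """Join title and page number with enough fill characters to reach `width`."""
    page_text = str(page)
    # Never go negative: an over-long title simply gets no leader dots
    dots = max(width - len(title) - len(page_text), 0)
    return title + fill * dots + page_text


entries = [("Using Stack Overflow", 1), ("Ditching FileMaker", 97)]
lines = [toc_line(title, page, width=30) for title, page in entries]
```

Every line comes out exactly 30 characters long here, with the page number right-aligned regardless of how many digits it has.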