iText not returning text contents of a PDF after first page

I am trying to use the iText library with C# to capture the text portion of PDF files.
I created a PDF from Excel 2013 (exported) and then copied a sample from the web showing how to use iText (I added the library reference to the project).
It reads the first page perfectly, but I get garbled output after that: it keeps part of the first page and merges it with the next page. The commented-out lines are from my attempts to solve the problem; the string "thePage" is recreated inside the for loop.
Here is the code. I can email the PDF to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        //string[] theLines;
        //theLines = new string[COLUMNS];
        //string thePage;
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            string thePage = "";
            thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
            string[] theLines = thePage.Split('\n');
            foreach (var theLine in theLines)
            {
                text.AppendLine(theLine);
            }
            // text.AppendLine(" ");
            // Array.Clear(theLines, 0, theLines.Length);
            // thePage = "";
        }
        return text.ToString();
    }
}

A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.
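To make the effect concrete, here is a toy model in plain Java. It is only a sketch: CollectingStrategy is a made-up stand-in that appends everything it is given, which is the behaviour described above, and it is not the iText API. It shows what reusing one strategy does compared with creating a new one per page.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only: mimics a text-collecting strategy, not the iText API. */
class StrategyReuseDemo {

    /** Hypothetical stand-in for an extraction strategy: it only ever appends. */
    static class CollectingStrategy {
        private final StringBuilder collected = new StringBuilder();

        String extract(String pageText) {
            collected.append(pageText);
            // Returns everything seen so far, not just the current page.
            return collected.toString();
        }
    }

    /** Reuses one strategy for all pages (the pattern from the question). */
    static String extractWithSharedStrategy(List<String> pages) {
        CollectingStrategy shared = new CollectingStrategy();
        StringBuilder out = new StringBuilder();
        for (String page : pages) {
            out.append(shared.extract(page)).append('\n');
        }
        return out.toString();
    }

    /** Creates a fresh strategy for each page (the fix from the answer). */
    static String extractWithFreshStrategy(List<String> pages) {
        StringBuilder out = new StringBuilder();
        for (String page : pages) {
            out.append(new CollectingStrategy().extract(page)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("A", "B", "C");
        System.out.print(extractWithSharedStrategy(pages)); // lines: A, AB, ABC
        System.out.print(extractWithFreshStrategy(pages));  // lines: A, B, C
    }
}
```

In the question's code, the equivalent change would be to construct the strategy inside the for loop instead of once before it.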

Related

How to remove the extra page at the end of a word document which created during mail merge

I have written a piece of code to create a Word document by mail merge using Syncfusion (Assembly Syncfusion.DocIO.Portable, Version=17.1200.0.50), Angular 7+ and .NET Core. Please see the code below.
private MemoryStream MergePaymentPlanInstalmentsScheduleToPdf(
    List<PaymentPlanInstalmentReportModel> PaymentPlanDetails, byte[] templateFileBytes)
{
    if (templateFileBytes == null || templateFileBytes.Length == 0)
    {
        return null;
    }
    var templateStream = new MemoryStream(templateFileBytes);
    var pdfStream = new MemoryStream();
    WordDocument mergeDocument = null;
    using (mergeDocument = new WordDocument(templateStream, FormatType.Docx))
    {
        if (mergeDocument != null)
        {
            var mergeList = new List<PaymentPlanInstalmentScheduleMailMergeModel>();
            var obj = new PaymentPlanInstalmentScheduleMailMergeModel();
            obj.Applicants = 0;
            if (PaymentPlanDetails != null && PaymentPlanDetails.Any())
            {
                var applicantCount = PaymentPlanDetails.GroupBy(a => a.StudentID)
                    .Select(s => new
                    {
                        StudentID = s.Key,
                        Count = s.Select(a => a.StudentID).Distinct().Count()
                    });
                obj.Applicants = applicantCount?.Count() > 0 ? applicantCount.Count() : 0;
            }
            mergeList.Add(obj);
            var reportDataSource = new MailMergeDataTable("Report", mergeList);
            var tableDataSource = new MailMergeDataTable("PaymentPlanDetails", PaymentPlanDetails);
            List<DictionaryEntry> commands = new List<DictionaryEntry>();
            commands.Add(new DictionaryEntry("Report", ""));
            commands.Add(new DictionaryEntry("PaymentPlanDetails", ""));
            MailMergeDataSet ds = new MailMergeDataSet();
            ds.Add(reportDataSource);
            ds.Add(tableDataSource);
            mergeDocument.MailMerge.ExecuteNestedGroup(ds, commands);
            mergeDocument.UpdateDocumentFields();
            using (var converter = new DocIORenderer())
            {
                using (var pdfDocument = converter.ConvertToPDF(mergeDocument))
                {
                    pdfDocument.Save(pdfStream);
                    pdfDocument.Close();
                }
            }
            mergeDocument.Close();
        }
    }
    return pdfStream;
}
Once the document is generated, I notice a blank page (with the footer) at the end. I have searched the internet repeatedly but could not find a solution. As the experts advise, I have done the initial checks, such as making sure that the original Word template file has no page breaks.
I am wondering whether there is something I can do from my code to remove any extra page breaks or similar elements that could cause this.
Any other suggested solution, including modifications to the MS Word document, would also be appreciated.
Please refer to the documentation link below to remove the empty page at the end of a Word document using the Syncfusion Word library (Essential DocIO).
https://www.syncfusion.com/kb/10724/how-to-remove-empty-page-at-end-of-word-document
Please reuse the code snippet from that article before converting Word to PDF in your sample application.
Note: I work for Syncfusion.

iText7 Merge pdf annotations on a new pdf document

I have multiple copies of a .pdf document that are commented by different users. I would like to merge all these comments into a new pdf "merged".
I wrote this sub inside a class called document with properties "path" and "directory".
Public Sub MergeComments(ByVal pdfDocuments As String())
    Dim oSavePath As String = Directory & "\" & FileName & "_Merged.pdf"
    Dim oPDFdocument As New iText.Kernel.Pdf.PdfDocument(New PdfReader(Path),
        New PdfWriter(New IO.FileStream(oSavePath, IO.FileMode.Create)))
    For Each oFile As String In pdfDocuments
        Dim oSecundairyPDFdocument As New iText.Kernel.Pdf.PdfDocument(New PdfReader(oFile))
        Dim oAnnotations As New PDFannotations
        For i As Integer = 1 To oSecundairyPDFdocument.GetNumberOfPages
            Dim pdfPage As PdfPage = oSecundairyPDFdocument.GetPage(i)
            For Each oAnnotation As Annot.PdfAnnotation In pdfPage.GetAnnotations()
                oPDFdocument.GetPage(i).AddAnnotation(oAnnotation)
            Next
        Next
    Next
    oPDFdocument.Close()
End Sub
This code results in an exception that I am failing to solve.
iText.Kernel.PdfException: 'Pdf indirect object belongs to other PDF document. Copy object to current pdf document.'
What do I need to change in order to perform this task? Or am I completely off with my code block?
You need to explicitly copy the underlying PDF object to the destination document. After that, you can add that object to the page's annotations.
Instead of adding the annotation directly:
oPDFdocument.GetPage(i).AddAnnotation(oAnnotation)
Copy the object to the destination document first, wrap it in a PdfAnnotation with the makeAnnotation method, and then add it as usual. The code is in Java, but you should be able to convert it to VB easily:
PdfObject annotObject = oAnnotation.getPdfObject().copyTo(pdfDocument);
pdfDocument.getPage(i).addAnnotation(PdfAnnotation.makeAnnotation(annotObject));
Here is working Java code that copies annotations from one document to another using the copyTo method.
PdfReader reader = new PdfReader(new RandomAccessSourceFactory().createBestSource(sourceFileName), null);
PdfDocument document = new PdfDocument(reader);
PdfReader toMergeReader = new PdfReader(new RandomAccessSourceFactory().createBestSource(targetFileName), null);
PdfDocument toMergeDocument = new PdfDocument(toMergeReader);
PdfWriter writer = new PdfWriter(targetFileName + "_MergedVersion.pdf");
PdfDocument writeDocument = new PdfDocument(writer);
int pageCount = toMergeDocument.getNumberOfPages();
for (int i = 1; i <= pageCount; i++) {
    PdfPage page = document.getPage(i);
    writeDocument.addPage(page.copyTo(writeDocument));
    PdfPage pdfPage = toMergeDocument.getPage(i);
    List<PdfAnnotation> pageAnnots = pdfPage.getAnnotations();
    if (pageAnnots != null) {
        for (PdfAnnotation pdfAnnotation : pageAnnots) {
            PdfObject annotObject = pdfAnnotation.getPdfObject().copyTo(writeDocument);
            writeDocument.getPage(i).addAnnotation(PdfAnnotation.makeAnnotation(annotObject));
        }
    }
}
reader.close();
toMergeReader.close();
toMergeDocument.close();
document.close();
writeDocument.close();
writer.close();

Reading PDF document with iTextSharp creates string with repeating first page

I currently use iTextSharp to read in some PDF files and parse them using the string I receive. I have encountered strange behavior with some PDF files. When reading, for example, a 4-page PDF, the returned string contains the pages in the following order:
1 2 1 3 1 4
My code for reading the files is as follows:
using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
    }
    Debug.WriteLine(sb.ToString());
}
Here is a link to a file with which this behaviour occurs:
https://onedrive.live.com/redir?resid=D9FEFF3BF45E05FD!1536&authkey=!AFLRlskAvlg89yY&ithint=file%2cpdf
Hope you guys can help me out!
Thanks to Chris Haas I found out what was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect, or at least incorrect for my implementation.
The SimpleTextExtractionStrategy needs to be instantiated for every page you try to read. Not doing so repeats the previous pages in the resulting string.
Also the line where the StringBuilder is being appended can be changed from:
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
to
sb.Append(text);
Thus the following code gives the correct result:
using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(text);
    }
    Debug.WriteLine(sb.ToString());
}
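On the encoding line that was dropped: encoding a string to the default code page and converting it back can at best return the same string, and it loses any character that code page cannot represent. Here is a rough illustration in plain Java, assuming Encoding.Default is a legacy single-byte code page (as on .NET Framework under Windows) and using ISO-8859-1 as a stand-in for it. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch of a string -> legacy bytes -> string round trip. */
class EncodingRoundTripDemo {

    /** Encodes text with a legacy charset and decodes it again. */
    static String roundTrip(String text, Charset legacy) {
        // Characters the charset cannot represent are replaced with '?'.
        byte[] legacyBytes = text.getBytes(legacy);
        return new String(legacyBytes, legacy);
    }

    public static void main(String[] args) {
        Charset legacy = StandardCharsets.ISO_8859_1; // stand-in for a legacy code page

        // Plain ASCII survives, so the round trip is a no-op here.
        System.out.println(roundTrip("Total 42", legacy)); // prints "Total 42"

        // U+03A9 (Greek capital omega) is not in ISO-8859-1 and is lost.
        System.out.println(roundTrip("R = 5 \u03A9", legacy)); // prints "R = 5 ?"
    }
}
```

Appending the extracted string directly, as in the corrected code, avoids this loss.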

Openxml: Added ImagePart is not showing in Powerpoint / Missing RelationshipID

I'm trying to dynamically create a PowerPoint presentation. One slide has a bunch of placeholder images that need to be changed based on certain values.
My approach is to create a new ImagePart and link it to the corresponding Blip. The image is downloaded and stored in the presentation just fine. The problem is that no relationship is created in the slide.xml.rels file for the image, which leads to a warning about missing images and empty boxes on the slide.
Any ideas what I am doing wrong?
Thanks in advance for your help! Best wishes
SPSecurity.RunWithElevatedPrivileges(delegate()
{
    using (SPSite siteCollection = new SPSite(SPContext.Current.Site.RootWeb.Url))
    {
        using (SPWeb oWeb = siteCollection.OpenWeb())
        {
            SPList pictureLibrary = oWeb.Lists[pictureLibraryName];
            SPFile imgFile = pictureLibrary.RootFolder.Files[imgPath];
            byte[] byteArray = imgFile.OpenBinary();
            int pos = Convert.ToInt32(name.Replace("QQ", "").Replace("Image", ""));
            foreach (DocumentFormat.OpenXml.Presentation.Picture pic in pictureList)
            {
                var oldimg = pic.BlipFill.Blip.Embed.ToString();
                ImagePart ip = (ImagePart)slidePart.AddImagePart(ImagePartType.Png, oldimg+pos);
                using (var writer = new BinaryWriter(ip.GetStream()))
                {
                    writer.Write(byteArray);
                }
                string newId = slidePart.GetIdOfPart(ip);
                setDebugMessage("new img id: " + newId);
                pic.BlipFill.Blip.Embed = newId;
            }
            slidePart.Slide.Save();
        }
    }
});
So, for everyone who's experiencing a similar problem: I finally found the solution. It was quite a stupid mistake. Instead of
PresentationDocument document = PresentationDocument.Open(mstream, true);
you have to use
using (PresentationDocument document = PresentationDocument.Open(mstream, true))
{
    // do your editing here
}
This answer put me on the right track.
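The likely reason this helps is that the document only writes its pending changes to the underlying stream when it is closed or disposed, which the using block guarantees. The same pattern can be shown with an analogy in plain Java (not Open XML SDK code): a buffered writer whose content only reaches the target once it is closed, with try-with-resources playing the role of using. The class and method names are made up for the example.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Analogy only: shows that buffered content is written when the resource is closed. */
class DisposeFlushDemo {

    /** Writes a short text but never closes the writer (it is deliberately left open). */
    static int bytesWrittenWithoutClose(String text) {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(target, StandardCharsets.UTF_8));
        try {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A short text is still sitting in the buffer, so the target is empty.
        return target.size();
    }

    /** Writes the same text inside try-with-resources, the Java counterpart of using. */
    static int bytesWrittenWithClose(String text) {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(target, StandardCharsets.UTF_8))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // close() has run by now and flushed the buffer into the target.
        return target.size();
    }

    public static void main(String[] args) {
        System.out.println(bytesWrittenWithoutClose("hello")); // prints 0
        System.out.println(bytesWrittenWithClose("hello"));    // prints 5
    }
}
```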

Replace the text in pdf document using itextSharp

I want to replace a particular piece of text in a PDF document. I am currently using the iTextSharp library to work with PDF documents.
I extracted the bytes from the PDF document, replaced the matching bytes, and then wrote the document out again from those bytes, but it is not working. In the example below I am trying to replace the string 1234 with 5678.
Any advice on how to do this would be helpful.
PdfReader reader = new PdfReader(opf.FileNames[i]);
byte[] pdfbytes = reader.GetPageContent(1);
PdfString oldstring = new PdfString("1234");
PdfString newstring = new PdfString("5678");
byte[] byte1022 = oldstring.GetOriginalBytes();
byte[] byte1067 = newstring.GetOriginalBytes();
int position = 0;
for (int j = 0; j < pdfbytes.Length; j++)
{
    if (pdfbytes[j] == byte1022[0])
    {
        if (pdfbytes[j+1] == byte1022[1])
        {
            if (pdfbytes[j+2] == byte1022[2])
            {
                if (pdfbytes[j+3] == byte1022[3])
                {
                    position = j;
                    break;
                }
            }
        }
    }
}
pdfbytes[position] = byte1067[0];
pdfbytes[position + 1] = byte1067[1];
pdfbytes[position + 2] = byte1067[2];
pdfbytes[position + 3] = byte1067[3];
File.WriteAllBytes(opf.FileNames[i].Replace(".pdf","j.pdf"), pdfbytes);
What makes you think 1234 is part of the page's content stream and not of a form XObject? Your code is never going to work in general if you don't parse all the resources of a page.
Also: I see GetPageContent(), but I don't see you using SetPageContent() anywhere. How are the changes ever going to be stored in the PdfReader object?
Moreover, I don't see you using PdfStamper to write the altered PdfReader contents to a file.
Finally: I'm too shy to quote the words of Leonard Rosenthol, Adobe's PDF Architect, but ask him, and he'll tell you personally that you shouldn't do what you're trying to do. PDF is NOT a format for editing. Read the intro of chapter 6 of the book I wrote on iText: http://www.manning.com/lowagie2/samplechapter6.pdf
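Separately from the objections above, the search loop in the question has two problems of its own: it can index past the end of the array, and when nothing matches it still overwrites the bytes at offset 0. Here is a minimal bounds-safe sketch in plain Java. The class and method names are made up for illustration and are not part of iText. It still only touches raw bytes, so it does not address any of the points above.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper only; the names here are hypothetical, not iText API. */
class ByteReplaceDemo {

    /** Returns the first index of needle in haystack, or -1 if it is absent. */
    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return 0;
        }
        // Stop early enough that haystack[i + j] can never run off the end.
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** Overwrites the first occurrence in place; only valid for equal-length values. */
    static boolean replaceFirst(byte[] data, byte[] oldValue, byte[] newValue) {
        if (oldValue.length != newValue.length) {
            throw new IllegalArgumentException("replacement must have the same length");
        }
        int pos = indexOf(data, oldValue);
        if (pos < 0) {
            return false; // nothing found: leave the data untouched
        }
        System.arraycopy(newValue, 0, data, pos, newValue.length);
        return true;
    }

    public static void main(String[] args) {
        byte[] content = "BT (Invoice 1234) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] oldValue = "1234".getBytes(StandardCharsets.US_ASCII);
        byte[] newValue = "5678".getBytes(StandardCharsets.US_ASCII);

        System.out.println(replaceFirst(content, oldValue, newValue));       // prints true
        System.out.println(new String(content, StandardCharsets.US_ASCII));  // prints BT (Invoice 5678) Tj ET
        System.out.println(replaceFirst(content, oldValue, newValue));       // prints false
    }
}
```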