Reading PDF document with iTextSharp creates string with repeating first page - itext

I currently use iTextSharp to read PDF files and parse the extracted text. With some PDF files I see strange behavior: for a 4-page PDF, for example, the returned string contains the pages in the following order:
1 2 1 3 1 4
My code for reading the files is as follows:
using (PdfReader reader = new PdfReader(fileStream))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
if (!string.IsNullOrWhiteSpace(text))
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
}
Debug.WriteLine(sb.ToString());
}
Here is a link to a file with which this behaviour occurs:
https://onedrive.live.com/redir?resid=D9FEFF3BF45E05FD!1536&authkey=!AFLRlskAvlg89yY&ithint=file%2cpdf
Hope you guys can help me out!

Thanks to Chris Haas I found out what was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect, or at least incorrect for my implementation.
The SimpleTextExtractionStrategy needs to be instantiated for every page you read. A reused instance keeps the text of the pages it has already processed, so earlier pages are repeated in the resulting string.
Also the line where the StringBuilder is being appended can be changed from:
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
to
sb.Append(text);
Thus the following code gives the correct result:
using (PdfReader reader = new PdfReader(fileStream))
{
StringBuilder sb = new StringBuilder();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
if (!string.IsNullOrWhiteSpace(text))
sb.Append(text);
}
Debug.WriteLine(sb.ToString());
}

Related

MalformedInputException: Input length = 1 while reading text file with Files.readAllLines(Paths.get("file")).get(0)

Why am I getting this error? I'm trying to extract information from a bank statement PDF and tally different bills for the month. I write the data from a PDF to a text file so I can get specific data from the file (e.g. ASPEN HOME IMPRO, then iterate down to what the dollar amount is, then read that text line to a string)
When the Files.readAllLines(Paths.get("bankData")).get(0) line runs, I get the error. Any thoughts why? Is it an encoding issue?
Here is the code:
public static void main(String[] args) throws IOException {
File file = new File("C:\\Users\\wmsai\\Desktop\\BankStatement.pdf");
PDFTextStripper stripper = new PDFTextStripper();
BufferedWriter bw = new BufferedWriter(new FileWriter("bankData"));
BufferedReader br = new BufferedReader(new FileReader("bankData"));
String pdfText = stripper.getText(Loader.loadPDF(file)).toUpperCase();
bw.write(pdfText);
bw.flush();
bw.close();
LineNumberReader lineNum = new LineNumberReader(new FileReader("bankData"));
String aspenHomeImpro = "PAYMENT: ACH: ASPEN HOME IMPRO";
String line;
while ((line = lineNum.readLine()) != null) {
if (line.contains(aspenHomeImpro)) {
int lineNumber = lineNum.getLineNumber();
int newLineNumber = lineNumber + 4;
String aspenData = Files.readAllLines(Paths.get("bankData")).get(0); //This is the code with the error
System.out.println(newLineNumber);
break;
} else if (!line.contains(aspenHomeImpro)) {
continue;
}
}
}
So I figured it out. I had to check the properties of the text file in question (I'm using Eclipse) to find out what its actual encoding was.
Then, when creating the file in the program, I wrote the text file as UTF-8 so that Files.readAllLines could read it and return the data I wanted. Files.readAllLines(Path) decodes the file as UTF-8, while a FileWriter created without an explicit charset uses the platform default encoding, so the two can disagree.
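The fix described above can be sketched as follows. This is a minimal illustration rather than the poster's actual program: the class name Utf8RoundTrip, the method roundTrip, and the temporary file are assumptions made for the example, and the PDFBox extraction step is replaced by a plain string. The point is only that the writer and the reader name the same charset explicitly.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class, used only for this illustration.
class Utf8RoundTrip {

    // Writes the text to a temporary file as UTF-8, reads it back as UTF-8,
    // and returns the lines. Both sides name the charset explicitly.
    static List<String> roundTrip(String text) {
        Path path = null;
        try {
            path = Files.createTempFile("bankData", ".txt");
            try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
                writer.write(text);
            }
            return Files.readAllLines(path, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Best-effort cleanup of the temporary file.
            if (path != null) {
                try {
                    Files.deleteIfExists(path);
                } catch (IOException ignored) {
                    // Nothing useful to do if cleanup fails.
                }
            }
        }
    }

    public static void main(String[] args) {
        // The second line contains a non-ASCII character (the euro sign).
        List<String> lines = roundTrip("PAYMENT: ACH: ASPEN HOME IMPRO\n\u20AC 125,00\n");
        System.out.println(lines.get(0));
    }
}
```

In the original program, the equivalent change would be to replace new FileWriter("bankData") with a writer that is given StandardCharsets.UTF_8.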

How to remove the extra page at the end of a Word document created during mail merge

I have written a piece of code to create a Word document by mail merge using Syncfusion (Assembly Syncfusion.DocIO.Portable, Version=17.1200.0.50), Angular 7+ and .NET Core. Please see the code below.
private MemoryStream MergePaymentPlanInstalmentsScheduleToPdf(List<PaymentPlanInstalmentReportModel>
PaymentPlanDetails, byte[] templateFileBytes)
{
if (templateFileBytes == null || templateFileBytes.Length == 0)
{
return null;
}
var templateStream = new MemoryStream(templateFileBytes);
var pdfStream = new MemoryStream();
WordDocument mergeDocument = null;
using (mergeDocument = new WordDocument(templateStream, FormatType.Docx))
{
if (mergeDocument != null)
{
var mergeList = new List<PaymentPlanInstalmentScheduleMailMergeModel>();
var obj = new PaymentPlanInstalmentScheduleMailMergeModel();
obj.Applicants = 0;
if (PaymentPlanDetails != null && PaymentPlanDetails.Any()) {
var applicantCount = PaymentPlanDetails.GroupBy(a => a.StudentID)
.Select(s => new
{
StudentID = s.Key,
Count = s.Select(a => a.StudentID).Distinct().Count()
});
obj.Applicants = applicantCount?.Count() > 0 ? applicantCount.Count() : 0;
}
mergeList.Add(obj);
var reportDataSource = new MailMergeDataTable("Report", mergeList);
var tableDataSource = new MailMergeDataTable("PaymentPlanDetails", PaymentPlanDetails);
List<DictionaryEntry> commands = new List<DictionaryEntry>();
commands.Add(new DictionaryEntry("Report", ""));
commands.Add(new DictionaryEntry("PaymentPlanDetails", ""));
MailMergeDataSet ds = new MailMergeDataSet();
ds.Add(reportDataSource);
ds.Add(tableDataSource);
mergeDocument.MailMerge.ExecuteNestedGroup(ds, commands);
mergeDocument.UpdateDocumentFields();
using (var converter = new DocIORenderer())
{
using (var pdfDocument = converter.ConvertToPDF(mergeDocument))
{
pdfDocument.Save(pdfStream);
pdfDocument.Close();
}
}
mergeDocument.Close();
}
}
return pdfStream;
}
Once the document is generated, I notice there is a blank page (with the footer) at the end. I searched the internet repeatedly but was not able to find a solution. As experts recommend, I have done the initial checks, such as making sure that the Word template file has no page breaks.
I am wondering if there is something I can do from my code to remove any extra page breaks or anything similar that could cause this.
Any other suggested solution, including modifications to the MS Word document, is also appreciated.
Please refer to the documentation link below to remove the empty page at the end of a Word document using the Syncfusion Word library (Essential DocIO).
https://www.syncfusion.com/kb/10724/how-to-remove-empty-page-at-end-of-word-document
Please use the code snippet from that article before converting Word to PDF in your sample application.
Note: I work for Syncfusion.

Jasper Reports: Exporting report to multiple files

I'm developing a jrxml template to generate job candidates' resumes. The candidates are in my database.
I need to generate one Word file (.docx) per record (that is, per job candidate), as in the image below:
How can I make Jasper generate one file for each record of my SQL query? And export these files to Word?
I saw there is an exporter parameter called PAGE_INDEX, but I did not find out how to use it.
Can someone help me please?
Note 1: My reports are not generated by JasperServer. I developed a Java program to generate them and send reports by email.
Note 2: The number of pages for each candidate may be different.
Updating status
I managed to generate one record per file, but only for the first record.
I need to generate files for the remaining records as well.
I also still have another problem: how do I split the output into separate files when the number of pages for each record (candidate entity) can vary?
final JRDocxExporter exporter = new JRDocxExporter();
exporter.setExporterInput(new SimpleExporterInput(jasperPrint));
exporter.setExporterOutput(new SimpleOutputStreamExporterOutput(new java.io.File("/home/admin/resume candidate.docx")));
SimpleDocxReportConfiguration configuration = new SimpleDocxReportConfiguration();
configuration.setPageIndex(0);
exporter.setConfiguration(configuration);
exporter.exportReport();
PROBLEM SOLUTION
I solved the problem by inserting a variable in the footer of each page with the expression $V{REPORT_COUNT}, which holds the count of records in the Detail band.
After that, the Java program loops over the pages of the JasperPrint object.
On each page, I locate that element, which tells me which candidate the page belongs to.
Based on this information, and by storing each candidate's index and its pages in a HashMap<Integer, List<Integer>> mapCandPage, I can determine the start page and the end page for each candidate. That way I can export one document per candidate record.
public static void main(String args[]) throws Exception {
File relJasperArqFile = new File("Candidate Resume Template.jasper");
Connection conn = ConnectionFactory.getNewConnectionSQLDRIVER();
JasperReport jasperReport = (JasperReport) JRLoader.loadObject(relJasperArqFile);
JasperPrint jasperPrint
= JasperFillManager.fillReport(jasperReport,
null,
conn);
final JRDocxExporter exporter = new JRDocxExporter();
exporter.setExporterInput(new SimpleExporterInput(jasperPrint));
List<JRPrintPage> listPrintPage = jasperPrint.getPages();
int candIdx = 0;
int fileIdx = 0;
int lastCandIdx = 0;
HashMap<Integer, List<Integer>> mapCandPage = new HashMap<>();
for (int pageIdx = 0; pageIdx < listPrintPage.size(); pageIdx++) {
JRPrintPage page = listPrintPage.get(pageIdx);
candIdx = getCandIdx(page);
if (!mapCandPage.containsKey(candIdx)) {
mapCandPage.put(candIdx, (new ArrayList<>()));
}
mapCandPage.get(candIdx).add(pageIdx);
if (pageIdx > 0 && candIdx != lastCandIdx) {
fileIdx++;
exporter.setExporterOutput(new SimpleOutputStreamExporterOutput(new File(String.format("Candidate Resume %d.docx", fileIdx))));
SimpleDocxReportConfiguration configuration = new SimpleDocxReportConfiguration();
configuration.setStartPageIndex(mapCandPage.get(lastCandIdx).get(0));
configuration.setEndPageIndex(mapCandPage.get(lastCandIdx).get(mapCandPage.get(lastCandIdx).size() - 1));
exporter.setConfiguration(configuration);
exporter.exportReport();
}
lastCandIdx = candIdx;
}
fileIdx++;
exporter.setExporterOutput(new SimpleOutputStreamExporterOutput(new File(String.format("Candidate Resume %d.docx", fileIdx))));
SimpleDocxReportConfiguration configuration = new SimpleDocxReportConfiguration();
configuration.setStartPageIndex(mapCandPage.get(lastCandIdx).get(0));
configuration.setEndPageIndex(mapCandPage.get(lastCandIdx).get(mapCandPage.get(lastCandIdx).size() - 1));
exporter.setConfiguration(configuration);
exporter.exportReport();
}
public static Integer getCandIdx(JRPrintPage page) {
JRPrintElement lastRowNumber = page.getElements().get(page.getElements().size() - 1);
return Integer.parseInt(((JRTemplatePrintText) lastRowNumber).getFullText());
}
This is a test and my code is not optimized. If anyone has suggestions or a better idea, please post here. Thank you.
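As one possible simplification, the page-grouping step described above can be separated from the export and done in a single pass. The sketch below uses only the Java standard library and no JasperReports calls; the class PageRanges, the nested Range type, and the method group are hypothetical names for illustration. It assumes, as the solution above does, that all pages of one candidate are contiguous, and it takes as input the candidate index already read from each page (what getCandIdx returns).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JasperReports.
class PageRanges {

    /** Inclusive range of page indexes that belong to one candidate. */
    static final class Range {
        final int candidate;
        final int startPage;
        final int endPage;

        Range(int candidate, int startPage, int endPage) {
            this.candidate = candidate;
            this.startPage = startPage;
            this.endPage = endPage;
        }

        @Override
        public String toString() {
            return candidate + ":" + startPage + "-" + endPage;
        }
    }

    // candPerPage.get(i) is the candidate index found on page i.
    // A range is closed whenever the candidate changes or the pages run out.
    static List<Range> group(List<Integer> candPerPage) {
        List<Range> ranges = new ArrayList<>();
        int start = 0;
        for (int page = 1; page <= candPerPage.size(); page++) {
            boolean atEnd = page == candPerPage.size();
            if (atEnd || !candPerPage.get(page).equals(candPerPage.get(start))) {
                ranges.add(new Range(candPerPage.get(start), start, page - 1));
                start = page;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Candidate 1 spans pages 0-1, candidate 2 page 2, candidate 3 pages 3-5.
        System.out.println(group(Arrays.asList(1, 1, 2, 3, 3, 3)));
    }
}
```

Each resulting range could then be passed to setStartPageIndex and setEndPageIndex in one export loop, which would remove the duplicated export code after the for loop.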

iText not returning text contents of a PDF after first page

I am trying to use the iText library with C# to capture the text portion of PDF files.
I created a PDF from Excel 2013 (exported) and then copied a sample from the web showing how to use iText (I added the library reference to the project).
It reads the first page perfectly, but it returns garbled text after that: it keeps part of the first page and merges it with the next page. The commented-out lines are from my attempts to solve the problem; the string "thePage" is recreated inside the for loop.
Here is the code. I can email the PDF to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
//string[] theLines;
//theLines = new string[COLUMNS];
//string thePage;
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = "";
thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string [] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
// text.AppendLine(" ");
// Array.Clear(theLines, 0, theLines.Length);
// thePage = "";
}
return text.ToString();
}
}
A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.
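The effect can be illustrated with a simplified model. The sketch below is not iText code: AccumulatingStrategy and textFromPage are hypothetical stand-ins, written in plain Java, that assume only what is stated above, namely that a strategy object keeps appending whatever it is given and has no notion of page boundaries.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the behavior; none of these types come from iText.
class StrategyReuseDemo {

    /** Stand-in for an extraction strategy: it only ever appends text. */
    static final class AccumulatingStrategy {
        private final StringBuilder collected = new StringBuilder();

        void render(String chunk) {
            collected.append(chunk);
        }

        String resultantText() {
            return collected.toString();
        }
    }

    // Stand-in for the extractor: feeds one page to the strategy and
    // returns everything the strategy holds afterwards.
    static String textFromPage(String pageContent, AccumulatingStrategy strategy) {
        strategy.render(pageContent);
        return strategy.resultantText();
    }

    // One strategy for all pages: earlier pages leak into later results.
    static String extractWithSharedStrategy(List<String> pages) {
        AccumulatingStrategy shared = new AccumulatingStrategy();
        StringBuilder out = new StringBuilder();
        for (String page : pages) {
            out.append(textFromPage(page, shared));
        }
        return out.toString();
    }

    // A new strategy per page: each result contains only that page.
    static String extractWithFreshStrategy(List<String> pages) {
        StringBuilder out = new StringBuilder();
        for (String page : pages) {
            out.append(textFromPage(page, new AccumulatingStrategy()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("A", "B", "C");
        System.out.println(extractWithSharedStrategy(pages)); // prints AABABC
        System.out.println(extractWithFreshStrategy(pages));  // prints ABC
    }
}
```

In the question's code, the corresponding change is to create the LocationTextExtractionStrategy inside the for loop instead of once before it.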

Replace the text in pdf document using itextSharp

I want to replace a particular text in a PDF document. I am currently using the iTextSharp library to work with PDF documents.
I extracted the bytes of the page content, replaced the matching bytes, and wrote the document out again with the modified bytes, but it is not working. In the example below I am trying to replace the string 1234 with 5678.
Any advice on how to do this would be helpful.
PdfReader reader = new PdfReader(opf.FileNames[i]);
byte[] pdfbytes = reader.GetPageContent(1);
PdfString oldstring = new PdfString("1234");
PdfString newstring = new PdfString("5678");
byte[] byte1022 = oldstring.GetOriginalBytes();
byte[] byte1067 = newstring.GetOriginalBytes();
int position = 0;
for (int j = 0; j <pdfbytes.Length ; j++)
{
if (pdfbytes[j] == byte1022[0])
{
if (pdfbytes[j+1] == byte1022[1])
{
if (pdfbytes[j+2] == byte1022[2])
{
if (pdfbytes[j+3] == byte1022[3])
{
position = j;
break;
}
}
}
}
}
pdfbytes[position] = byte1067[0];
pdfbytes[position + 1] = byte1067[1];
pdfbytes[position + 2] = byte1067[2];
pdfbytes[position + 3] = byte1067[3];
File.WriteAllBytes(opf.FileNames[i].Replace(".pdf","j.pdf"), pdfbytes);
What makes you think 1234 is part of the page's content stream and not of a form XObject? Your code is never going to work in general if you don't parse all the resources of a page.
Also: I see GetPageContent(), but I don't see you using SetPageContent() anywhere. How are the changes ever going to be stored in the PdfReader object?
Moreover, I don't see you using PdfStamper to write the altered PdfReader contents to a file.
Finally: I'm too shy to quote the words of Leonard Rosenthol, Adobe's PDF Architect, but ask him, and he'll tell you personally that you shouldn't do what you're trying to do. PDF is NOT a format for editing. Read the intro of chapter 6 of the book I wrote on iText: http://www.manning.com/lowagie2/samplechapter6.pdf