OpenPDF columnText.go creating corrupt document - itext

I keep getting an error when I try writing to the PDF with ColumnText. setSimpleColumn() used to work fine, but now this code throws the error, and setColumns() throws one as well. I can't think of what is causing it. Was I supposed to close the ColumnText somehow?
The immediately related code is:
fun testBox(content: List<String>, font: List<Font>, page: Int, leftLimit: Float, rightLimit: Float, topLimit: Float, bottomLimit: Float) {
    val columnText = ColumnText(setCanvas("$filepath$name.pdf", page))
    columnText.alignment = ALIGN_JUSTIFIED
    columnText.runDirection = RUN_DIRECTION_RTL
    columnText.setSimpleColumn(leftLimit, bottomLimit, rightLimit, getRectangle("LETTER").top - topLimit)
    var i = 0
    while (i < content.size) {
        columnText.addText(Chunk(content[i], font[i]))
        columnText.go()
        i++
    }
}
I'll show more code if needed, but I don't think the rest of it is related to the issue.
I'm really stumped, and I can't find much info on this issue.
This is the resulting file:
https://drive.google.com/file/d/135EhLyiyDj6iAexUJ0upRdG6eXHYXVdw/view?usp=sharing
EDIT:
I forgot that I made the function setCanvas in order to save a bunch of code, here is the function:
fun setCanvas(file: String, page: Int): PdfContentByte? {
    val reader = PdfReader(file)
    val stamper = PdfStamper(reader, FileOutputStream(File(file)))
    if (reader.numberOfPages < page) {
        stamper.insertPage(reader.numberOfPages + (page - reader.numberOfPages), reader.getPageSize(1) ?: getRectangle("LETTER"))
        if (page - reader.numberOfPages != 1) { throw Error("EMPTY PAGE!") }
    }
    return stamper.getOverContent(page)
}
EDIT 2: [combined two functions]
testBox(listOf(content1, content2, content3), listOf(font, fontBold, font), 1, document.left(), document.right(), document.bottom(), document.top())
fun testBox(content: List<String>, font: List<Font>, page: Int, leftLimit: Float, rightLimit: Float, topLimit: Float, bottomLimit: Float) {
    val reader = PdfReader("$filepath$name.pdf")
    val stamper = PdfStamper(reader, FileOutputStream(File("$filepath$name - edit.pdf")))
    if (reader.numberOfPages < page) {
        stamper.insertPage(reader.numberOfPages + (page - reader.numberOfPages), reader.getPageSize(1) ?: getRectangle("LETTER"))
        if (page - reader.numberOfPages > 0) { throw Error("EMPTY PAGE!") }
    }
    val columnText = ColumnText(stamper.getOverContent(page))
    columnText.alignment = ALIGN_JUSTIFIED
    columnText.runDirection = RUN_DIRECTION_RTL
    columnText.setSimpleColumn(leftLimit, bottomLimit, getRectangle("LETTER").right - rightLimit, getRectangle("LETTER").top - topLimit)
    var i = 0
    while (i < content.size) {
        columnText.addText(Chunk(content[i], font[i]))
        columnText.go()
        i++
    }
}
link:
https://drive.google.com/file/d/1ybvDVSxKOJdbnA2fSRjEmxlDszIktWTL/view?usp=sharing

There are two errors in your original code:
1. You use the same file as both source and target for stamping.
2. You don't keep a reference to the PdfStamper, so you cannot close it after manipulating the PDF.
The first problem causes the file to be truncated before it has been completely read. You fixed this in your edit.
The second can leave your output PDF unfinished, missing the trailer it needs: PdfStamper only writes the end of the file when close() is called.
After you also fixed the second problem, your code stopped producing a corrupt file.
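The truncation behind the first problem can be reproduced without any PDF library. The sketch below is a minimal illustration in Scala using only java.io and java.nio; the names TruncationDemo and sizesAroundOpen are made up for this example. It shows that constructing a FileOutputStream on an existing file empties that file at once, which is why a reader still working from the same file finds nothing left to read.

```scala
import java.io.{File, FileOutputStream}
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object TruncationDemo {
  // Writes some bytes to a temporary file, then opens a FileOutputStream on
  // the same file (as a stamper writing to its own source would) and returns
  // the file size before and after the stream was opened.
  def sizesAroundOpen(): (Long, Long) = {
    val file = File.createTempFile("truncation-demo", ".bin")
    try {
      Files.write(file.toPath, "pretend this is a PDF".getBytes(StandardCharsets.UTF_8))
      val before = file.length()
      // Opening the stream (without append mode) truncates the file immediately.
      val out = new FileOutputStream(file)
      try {
        (before, file.length())
      } finally out.close()
    } finally file.delete()
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = sizesAroundOpen()
    println(s"size before: $before, size after opening for write: $after")
  }
}
```

Writing to a separate target file, as the code under EDIT 2 does, avoids this.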

java heap space error when converting csv to json but no error with d3.csv()

Platform being used: Apache Zeppelin
Language: scala, javascript
I use d3js to read a csv file of size ~40MB and it works perfectly fine with the below code:
<script type="text/javascript">
    d3.csv("test.csv", function(data) {
        // data is JSON array. Do something with data;
        console.log(data);
    });
</script>
Now, the idea is to avoid d3js and instead construct the JSON array in Scala, then access that variable from JavaScript through z.angularBind(). Both snippets below work for smaller files but give a Java heap space error for the 40MB CSV file. What I can't understand is why these two snippets fail when d3.csv() does the job without any heap space error.
Edited Code 1: Using scala's
import java.io.BufferedReader;
import java.io.FileReader;
import org.json._
import scala.io.Source
var br = new BufferedReader(new FileReader("/root/test.csv"))
var contentLine = br.readLine();
var keys = contentLine.split(",")
contentLine = br.readLine();
var ja = new JSONArray();
while (contentLine != null) {
    var splits = contentLine.split(",")
    var i = 0
    var jo = new JSONObject()
    for (i <- 0 to splits.length - 1) {
        jo.put(keys(i), splits(i));
    }
    ja.put(jo);
    contentLine = br.readLine();
}
//z.angularBind("ja",ja.toString()) //ja can be accessed now in javascript (EDITED-10/11/15)
Edited Code 2:
I thought the heap space issue might go away if I used Apache Spark to construct the JSON array, as in the code below, but this too gives a heap space error:
def myf(keys: Array[String], value: String): String = {
    var splits = value.split(",")
    var jo = new JSONObject()
    for (i <- 0 to splits.length - 1) {
        jo.put(keys(i), splits(i));
    }
    return (jo.toString())
}
val csv = sc.textFile("/root/test.csv")
val firstrow = csv.first
val header = firstrow.split(",")
val data = csv.filter(x => x != firstrow)
var g = data.map(value => myf(header, value)).collect()
// EDITED BELOW 2 LINES-10/11/15
//var ja= g.mkString("[", ",", "]")
//z.angularBind("ja",ja) //ja can be accessed now in javascript
You are creating JSON objects. They are not native to Java/Scala and will therefore take up more space in that environment. What does z.angularBind() really do?
Also, what is the heap size of your JavaScript environment (see https://www.quora.com/What-is-the-maximum-size-of-a-JavaScript-object-in-browser-memory for Chrome) and of your Java environment (see "How is the default java heap size determined?")?
Update: Removed the original part of the answer where I misunderstood the question
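On the JVM side, the heap limits can be read from inside the running process. This is a minimal sketch, assuming it is run as a standalone program or adapted into a Scala paragraph of the notebook; HeapInfo and its method names are illustrative, while Runtime.maxMemory, totalMemory and freeMemory are standard JVM calls.

```scala
object HeapInfo {
  private val Mb = 1024L * 1024L

  // Upper bound the JVM will attempt to use for the heap, in megabytes.
  def maxHeapMb: Long = Runtime.getRuntime().maxMemory() / Mb

  // Heap currently in use (allocated minus free), in megabytes.
  def usedHeapMb: Long = {
    val rt = Runtime.getRuntime()
    (rt.totalMemory() - rt.freeMemory()) / Mb
  }

  def main(args: Array[String]): Unit =
    println(s"max heap: $maxHeapMb MB, used: $usedHeapMb MB")
}
```

If the maximum turns out to be small, it can be raised with the JVM's -Xmx option; where that option is set for Zeppelin depends on the installation.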

Reading PDF document with iTextSharp creates string with repeating first page

I currently use iTextSharp to read in some PDF files and parse the string I get back. I have encountered strange behavior with some PDF files. For a 4-page PDF, for example, the returned string contains the pages in the following order:
1 2 1 3 1 4
My code for reading the files is as follows:
using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
    }
    Debug.WriteLine(sb.ToString());
}
Here is a link to a file with which this behaviour occurs:
https://onedrive.live.com/redir?resid=D9FEFF3BF45E05FD!1536&authkey=!AFLRlskAvlg89yY&ithint=file%2cpdf
Hope you guys can help me out!
Thanks to Chris Haas I found out what was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect, or at least incorrect for my implementation.
The SimpleTextExtractionStrategy needs to be instantiated for every page you read. A reused instance keeps the text of the pages it has already processed, so earlier pages are repeated in the resulting string.
Also the line where the StringBuilder is being appended can be changed from:
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
to
sb.Append(text);
Thus the following code gives the correct result:
using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(text);
    }
    Debug.WriteLine(sb.ToString());
}

How to check the existence of a value in an ArrayBuffer?

I am new to Scala and I am writing a program in which I have an ArrayBuffer of points of a binary image. In a loop, I want to check whether a specific point already exists in that ArrayBuffer and, if it does, not add it again. This is the part of the code I am working on:
var vectVisitedPoint = new scala.collection.mutable.ArrayBuffer[Point]()
var pTemp = new Point(0, 0)
var res = new Array[Byte](1)
img.get(pTemp.x.toInt, pTemp.y.toInt, res) // img is a binary image
var value1: Int = 0
var value2: Int = 0
scala.util.control.Breaks.breakable {
    while (value1 < img.rows) {
        while (value2 < img.cols) {
            if (res(0) == -1 && vectVisitedPoint.exists(value1, value2)) { // this is where I want to check whether the current point (value1, value2) already exists in vectVisitedPoint
                pTemp.x = (pTemp.x.toInt) + value1
                pTemp.y = (pTemp.y.toInt) + value2
                vectVisitedPoint.append(new Point(pTemp.x, pTemp.y))
                scala.util.control.Breaks.break()
            }
            value2 = value2 + 1
            img.get(value1, value2, res)
        }
        value2 = 0
        value1 = value1 + 1
    }
}
I think I need to write it in another way but don't know how?!
Thanks.
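You can use exists with a predicate that compares the coordinates:
vectVisitedPoint.exists(p => p.x == value1 && p.y == value2)
Compare the coordinates rather than testing a Point against the tuple (value1, value2) with ==; a Point and a tuple are different types, so that comparison is always false.
Would you like me to refactor your code into something much shorter, more functional, more readable and probably more efficient? If so, create another question and I will.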
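A small self-contained sketch of the membership check follows. The Point class here is a stand-in written for this example (an assumption: the real Point has numeric x and y fields), and VisitedPoints, isVisited and addIfNew are illustrative names rather than anything from the question.

```scala
import scala.collection.mutable.ArrayBuffer

object VisitedPoints {
  // Stand-in for the Point class in the question (assumed to expose
  // mutable numeric x and y fields).
  class Point(var x: Double, var y: Double)

  // True if some point in the buffer has exactly these coordinates.
  def isVisited(visited: ArrayBuffer[Point], x: Int, y: Int): Boolean =
    visited.exists(p => p.x == x && p.y == y)

  // Appends the point only when it has not been recorded yet.
  // Returns true if the point was added.
  def addIfNew(visited: ArrayBuffer[Point], x: Int, y: Int): Boolean =
    if (isVisited(visited, x, y)) false
    else {
      visited += new Point(x, y)
      true
    }

  def main(args: Array[String]): Unit = {
    val visited = ArrayBuffer[Point]()
    println(addIfNew(visited, 3, 4))
    println(addIfNew(visited, 3, 4))
    println(visited.size)
  }
}
```

exists scans the buffer from the start on every call. For a large image, keeping the visited coordinates in a mutable.HashSet[(Int, Int)] instead would likely make each lookup cheaper, at the cost of losing insertion order.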

iText not returning text contents of a PDF after first page

I am trying to use the iText library with C# to capture the text portion of PDF files.
I created a PDF from Excel 2013 (exported) and then copied a sample from the web showing how to use iText (and added the library reference to the project).
It reads the first page perfectly, but it returns garbled output after that: it keeps part of the first page and merges it with the next page. The commented-out lines are from my attempts to solve the problem; the string "thePage" is recreated inside the for loop.
Here is the code. I can email the PDF to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        //string[] theLines;
        //theLines = new string[COLUMNS];
        //string thePage;
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            string thePage = "";
            thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
            string[] theLines = thePage.Split('\n');
            foreach (var theLine in theLines)
            {
                text.AppendLine(theLine);
            }
            // text.AppendLine(" ");
            // Array.Clear(theLines, 0, theLines.Length);
            // thePage = "";
        }
        return text.ToString();
    }
}
A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.
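The effect of sharing one strategy can be modelled without iText. The sketch below is a toy model in Scala, not the iTextSharp API: AccumulatingStrategy is a hypothetical class that, like a text extraction strategy, keeps appending to an internal buffer, and textFromPage is a stand-in for PdfTextExtractor.GetTextFromPage.

```scala
object StrategyReuseModel {
  // Hypothetical stand-in for an extraction strategy: it appends everything
  // it is shown to one buffer and never clears it.
  final class AccumulatingStrategy {
    private val buffer = new StringBuilder
    def render(text: String): Unit = { buffer.append(text); () }
    def resultantText: String = buffer.toString
  }

  // Stand-in for extracting one page: feed the page to the strategy and
  // return whatever the strategy has collected so far.
  def textFromPage(page: String, strategy: AccumulatingStrategy): String = {
    strategy.render(page)
    strategy.resultantText
  }

  // One shared strategy: each call also returns the earlier pages.
  def extractShared(pages: List[String]): String = {
    val shared = new AccumulatingStrategy
    pages.map(p => textFromPage(p, shared)).mkString
  }

  // A fresh strategy per page: each call returns only that page.
  def extractFresh(pages: List[String]): String =
    pages.map(p => textFromPage(p, new AccumulatingStrategy)).mkString

  def main(args: Array[String]): Unit = {
    val pages = List("A", "B", "C")
    println(extractShared(pages))
    println(extractFresh(pages))
  }
}
```

In this model the shared strategy yields the first page again with every later page, while the per-page strategy yields each page once.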

java.nio.BufferUnderflowException when processing files in Scala

I got a similar problem to this guy while processing a 4MB log file. Actually I'm processing multiple files simultaneously, but since I keep getting this exception, I decided to test with just a single file:
val temp = Source.fromFile("./datasource/input.txt")
val dummy = new PrintWriter("test.txt")
var itr = 0
println("Default Buffer size: " + Source.DefaultBufSize)
try {
    for (chr <- temp) {
        dummy.print(chr.toChar)
        itr += 1
        if (itr == 75703) println("Passed line 85")
        if (itr % 256 == 0) { print("..." + itr); temp.reset; System.gc; }
        if (itr == 75703) println("Passed line 87")
        if (itr % 2048 == 0) println("")
        if (itr == 75703) println("Passed line 89")
    }
} finally {
    println("\nFailed at itr = " + itr)
}
It always fails at itr = 75703, and my output file is always 64KB (exactly 65536 bytes). No matter where I put temp.reset or System.gc, every experiment ends the same way.
It seems the problem lies in some memory allocation, but I cannot find any useful information on it. Any idea how to solve this?
Any help is greatly appreciated.
EDIT: Actually I want to process the files as binary, so this technique is not a good solution; many have recommended that I use BufferedInputStream instead.
Why are you calling reset on the Source before it has finished iterating through the file?
val temp = Source.fromFile("./datasource/input.txt")
try {
    for (line <- temp.getLines()) {
        // whatever
    }
} finally temp.close()
Should work just fine with no underflows. See also this question
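For the binary case mentioned in the question's edit, a byte-oriented stream avoids character decoding entirely. This is a minimal sketch using only java.io; BinaryCopy and copyFile are illustrative names, and the 8192-byte chunk size is an arbitrary choice.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}

object BinaryCopy {
  // Copies src to dst as raw bytes and returns the number of bytes copied.
  def copyFile(src: String, dst: String): Long = {
    val in = new BufferedInputStream(new FileInputStream(src))
    try {
      val out = new BufferedOutputStream(new FileOutputStream(dst))
      try {
        val chunk = new Array[Byte](8192)
        var total = 0L
        var n = in.read(chunk)
        // read returns -1 once the end of the file has been reached.
        while (n != -1) {
          out.write(chunk, 0, n)
          total += n
          n = in.read(chunk)
        }
        total
      } finally out.close() // close flushes the buffered output
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    if (args.length == 2) println(s"${copyFile(args(0), args(1))} bytes copied")
    else println("usage: BinaryCopy <source> <target>")
  }
}
```

The per-chunk loop is where any real processing of the bytes would go; both streams are closed in finally blocks so the file handles are released even if that processing throws.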