Which Java API is best for converting a PDF to an image in Scala?

I tried three Java APIs for PDF, but none of them worked properly:
1. PDFFile
2. PDDocument
3. PDFDocumentReader
My PDF has two layers, and the upper one is partially transparent. When any of these three APIs converts it to an image, only the upper layer appears, with no transparency, but both layers should be visible.
Can anyone suggest another API that fulfills this requirement?
Code for PDFFile:
val raf = new RandomAccessFile(file, "r")
val channel = raf.getChannel()
val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
raf.close()
val pdffile = new PDFFile(buf)
val numPgs = pdffile.getNumPages() + 1
for (i <- 1 until numPgs) {
  val page = pdffile.getPage(i)
  val pwdt = page.getBBox().getWidth().toDouble
  val phgt = page.getBBox().getHeight().toDouble
  val rect = new Rectangle(0, 0, pwdt.toInt, phgt.toInt)
  val rsiz = resize(method, size, pwdt, phgt)
  val img = page.getImage(rsiz("width"), rsiz("height"), rect, null, true, true)
  result ::= buffer(img)
}
Code for PDDocument:
val doc = PDDocument.load(new FileInputStream(file))
val pages = doc.getDocumentCatalog().getAllPages()
for (i <- 0 until pages.size()) {
  val page = pages.get(i)
  val before = page.asInstanceOf[PDPage].convertToImage()
}
Code for PDFDocumentReader:
val inputStream = new FileInputStream(file)
val document = new PDFDocumentReader(inputStream)
val numPgs = document.getNumberOfPages
for (i <- 0 until numPgs) {
  val pageDetail = new PageDetail("", "", i, "")
  val resourceDetails = document.getPageAsImage(pageDetail)
  val image = ImageIO.read(new ByteArrayInputStream(resourceDetails.getBytes()))
  result ::= image
}
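One API worth trying is Apache PDFBox 2.x's PDFRenderer, whose rendering path composites transparency groups rather than dropping them the way the older convertToImage route can. A minimal sketch, assuming a PDFBox 2.x dependency on the classpath; the input file name is hypothetical:

```scala
import java.io.File
import javax.imageio.ImageIO
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.rendering.{ImageType, PDFRenderer}

// Render each page of a layered PDF to a PNG at 150 DPI.
val doc = PDDocument.load(new File("layered.pdf")) // hypothetical path
try {
  val renderer = new PDFRenderer(doc)
  for (i <- 0 until doc.getNumberOfPages) {
    val img = renderer.renderImageWithDPI(i, 150, ImageType.RGB)
    ImageIO.write(img, "png", new File(s"page-$i.png"))
  }
} finally doc.close()
```

Whether the transparency survives depends on how the layers are built (optional content groups vs. transparency groups), so test against your actual file.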

How to iterate over files and perform action on them - Scala Spark

I am reading around 1000 .eml files (email message files) one by one from a directory, parsing them and extracting values with the javax.mail APIs, and finally storing the results in a DataFrame. Sample code below:
var x = Seq[DataFrame]()
val emlFiles = getListOfFiles("tmp/sample")
val fileCount = emlFiles.length
val fs = FileSystem.get(sc.hadoopConfiguration)
for (i <- 0 until fileCount) {
  var emlData = spark.emptyDataFrame
  val f = new File(emlFiles(i))
  val fileName = f.getName()
  val path = Paths.get(emlFiles(i))
  val session = Session.getInstance(new Properties())
  val messageIn = new FileInputStream(path.toFile())
  val mimeJournal = new MimeMessage(session, messageIn)
  // Extracting metadata (note: "From" is the sender, "To" the receiver)
  val Senders = mimeJournal.getHeader("From")(0)
  val Receivers = mimeJournal.getHeader("To")(0)
  val Date = mimeJournal.getHeader("Date")(0)
  val Subject = mimeJournal.getHeader("Subject")(0)
  val Size = mimeJournal.getSize
  emlData = Seq((fileName, Receivers, Senders, Date, Subject, Size))
    .toDF("fileName", "Receivers", "Senders", "Date", "Subject", "Size")
  x = emlData +: x
}
The problem is that the for loop makes this very slow. Is there a way to avoid the loop, or otherwise read and parse the files faster?
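Much of the loop's cost comes from building a separate DataFrame per file; the parsing itself is plain JVM work. A sketch of the alternative pattern, collecting plain tuples first and converting once at the end. The directory and the name/size "parse" step here are placeholders for the real MimeMessage extraction:

```scala
import java.io.File
import java.nio.file.Files

// Parse every file into a plain tuple first; no Spark involved yet.
// (Here the "parse" is just file name and size; the real code would
// build the tuple from the MimeMessage headers instead.)
def parseAll(dir: File): Seq[(String, Long)] =
  dir.listFiles().toSeq.map(f => (f.getName, f.length))

// With a SparkSession in scope, a single conversion then replaces the
// one-DataFrame-per-file loop:
//   import spark.implicits._
//   val emlData = parseAll(new File("tmp/sample")).toDF("fileName", "Size")
```

This way Spark sees one collection of rows instead of a thousand tiny DataFrames.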

When I try to convert a Base64 string to an image, I get the error "Cannot resolve overloaded method 'write'"

val decoder = new BASE64Decoder
val decodedBytes = decoder.decodeBuffer(base64String)
val uploadFile = "C:/Users/BabuSuku/Downloads/SpineorDownloads/test.png"
val image = ImageIO.read(new ByteArrayInputStream(decodedBytes))
val f = new Nothing(uploadFile)
ImageIO.write(image, "png", uploadFile)
You passed a String as the third parameter to write, but it needs a File instead. Change the last two lines accordingly:
val decoder = new BASE64Decoder
val decodedBytes = decoder.decodeBuffer(base64String)
val uploadFile = "C:/Users/BabuSuku/Downloads/SpineorDownloads/test.png"
val image = ImageIO.read(new ByteArrayInputStream(decodedBytes))
val f = new File(uploadFile)
ImageIO.write(image, "png", f)
See the ImageIO.write documentation.
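Note also that BASE64Decoder lives in sun.misc, an internal package that was removed in newer JDKs. A sketch of the same fix on top of the supported java.util.Base64 API (Java 8+); the helper name is made up for illustration:

```scala
import java.io.{ByteArrayInputStream, File}
import java.util.Base64
import javax.imageio.ImageIO

// Decode a Base64 payload and write it out as a PNG file.
def writePng(base64String: String, uploadFile: String): Unit = {
  val decodedBytes = Base64.getDecoder.decode(base64String)
  val image = ImageIO.read(new ByteArrayInputStream(decodedBytes))
  ImageIO.write(image, "png", new File(uploadFile)) // a File, not a String
}
```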

Flatten array in yield statement in Scala

I have the following piece of code:
var splitDf = fullCertificateSourceDf.map(row => {
  val ID = row.getAs[String]("ID")
  val CertificateID = row.getAs[String]("CertificateID")
  val CertificateTag = row.getAs[String]("CertificateTag")
  val CertificateDescription = row.getAs[String]("CertificateDescription")
  val WorkBreakdownUp1Summary = row.getAs[String]("WorkBreakdownUp1Summary")
  val ProcessBreakdownSummaryList = row.getAs[String]("ProcessBreakdownSummaryList")
  val ProcessBreakdownUp1SummaryList = row.getAs[String]("ProcessBreakdownUp1SummaryList")
  val ProcessBreakdownUp2Summary = row.getAs[String]("ProcessBreakdownUp2Summary")
  val ProcessBreakdownUp3Summary = row.getAs[String]("ProcessBreakdownUp3Summary")
  val ActualStartDate = row.getAs[java.sql.Date]("ActualStartDate")
  val ActualEndDate = row.getAs[java.sql.Date]("ActualEndDate")
  val ApprovedDate = row.getAs[java.sql.Date]("ApprovedDate")
  val CurrentState = row.getAs[String]("CurrentState")
  val DataType = row.getAs[String]("DataType")
  val PullDate = row.getAs[String]("PullDate")
  val PullTime = row.getAs[String]("PullTime")
  val split_ProcessBreakdownSummaryList = ProcessBreakdownSummaryList.split(",")
  val split_ProcessBreakdownUp1SummaryList = ProcessBreakdownUp1SummaryList.split(",")
  val Pattern = "^.*?(?= - *[a-zA-Z])".r
  for {
    subSystem: String <- split_ProcessBreakdownSummaryList
  } yield (ID,
    CertificateID,
    CertificateTag,
    CertificateDescription,
    WorkBreakdownUp1Summary,
    subSystem,
    for { system: String <- split_ProcessBreakdownUp1SummaryList if (system contains subSystem.trim().substring(0, 11)) } yield (system),
    ProcessBreakdownUp2Summary,
    ProcessBreakdownUp3Summary,
    ActualStartDate,
    ActualEndDate,
    ApprovedDate,
    CurrentState,
    DataType,
    PullDate,
    PullTime
  )
}).flatMap(identity(_))
display(splitDf)
How can I get the first matching element from this portion of the statement above:
for { system: String <- split_ProcessBreakdownUp1SummaryList if (system contains subSystem.trim().substring(0, 11)) } yield (system)
At the moment it returns an array with one element in it. I don't want the array, just the element.
Thank you in advance.
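A for/yield over a filtered array always yields a collection; to get only the first matching element, .find returns an Option you can unwrap. A minimal sketch with shortened sample data (the real comparison uses substring(0, 11)):

```scala
// .find stops at the first element satisfying the predicate and returns
// an Option; getOrElse supplies a fallback when nothing matches.
def firstMatch(candidates: Array[String], subSystem: String, prefixLen: Int): String =
  candidates
    .find(system => system contains subSystem.trim().substring(0, prefixLen))
    .getOrElse("")
```

In the yield, the inner for/yield expression would then be replaced by a call of this shape, producing a plain String instead of an Array.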

Join two strings in Scala with one to one mapping

I have two strings in Scala
Input 1 : "a,c,e,g,i,k"
Input 2 : "b,d,f,h,j,l"
How do I join the two Strings in Scala?
Required output = "ab,cd,ef,gh,ij,kl"
I tried something like:
var columnNameSetOne: Array[String] = Array() // v1 = "a,c,e,g,i,k"
var columnNameSetTwo: Array[String] = Array() // v2 = "b,d,f,h,j,l"
After I get the input data as mentioned above:
columnNameSetOne = v1.split(",")
columnNameSetTwo = v2.split(",")
val newColumnSet = IntStream
  .range(0, Math.min(columnNameSetOne.length, columnNameSetTwo.length))
  .mapToObj(j => columnNameSetOne(j) + columnNameSetTwo(j))
  .collect(Collectors.joining(","))
println(newColumnSet)
But I am getting a compile error on j, and I am not sure this Java-stream approach even works from Scala.
object Solution1 extends App {
  val input1 = "a,c,e,g,i,k"
  val input2 = "b,d,f,h,j,l"
  val i1 = input1.split(",")
  val i2 = input2.split(",")
  val x = i1.zipAll(i2, "", "").map {
    case (a, b) => a + b
  }
  println(x.mkString(","))
}
// output: ab,cd,ef,gh,ij,kl
Easy to do using the zip function on lists; zip alone gives pairs, so map each pair into a concatenated string:
val v1 = "a,c,e,g,i,k"
val v2 = "b,d,f,h,j,l"
val list1 = v1.split(",").toList
val list2 = v2.split(",").toList
list1.zip(list2).map { case (a, b) => a + b }.mkString(",") // res0: String = ab,cd,ef,gh,ij,kl

Round-tripping through Deflater in Scala fails

I'm looking to roundtrip bytes through java's Deflater and running into issues. First the output, then the code. What am I doing wrong here, and how can I properly round trip through these streams?
Output:
scala> new String(decompress(compress("face".getBytes)))
(crazy output string of length 20)
Code:
def compress(bytes: Array[Byte]): Array[Byte] = {
  val deflater = new java.util.zip.Deflater
  val baos = new ByteArrayOutputStream
  val dos = new DeflaterOutputStream(baos, deflater)
  dos.write(bytes)
  baos.close
  dos.finish
  dos.close
  baos.toByteArray
}

def decompress(bytes: Array[Byte]): Array[Byte] = {
  val deflater = new java.util.zip.Deflater
  val baos = new ByteArrayOutputStream(512)
  val bytesIn = new ByteArrayInputStream(bytes)
  val in = new DeflaterInputStream(bytesIn, deflater)
  var go = true
  while (go) {
    val b = in.read
    if (b == -1) go = false
    else baos.write(b)
  }
  baos.close
  in.close
  baos.toByteArray
}
You're deflating the already-deflated bytes a second time in decompress when you should be inflating them: use an InflaterInputStream instead of a second DeflaterInputStream.
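For completeness, a corrected round trip under that fix: deflate on the way in, inflate on the way out. The buffered read loop is just one way to copy the stream:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}

// Compress with a Deflater stream; close() finishes and flushes it.
def compress(bytes: Array[Byte]): Array[Byte] = {
  val baos = new ByteArrayOutputStream
  val dos = new DeflaterOutputStream(baos)
  dos.write(bytes)
  dos.close()
  baos.toByteArray
}

// Decompress with an Inflater stream (NOT a second Deflater).
def decompress(bytes: Array[Byte]): Array[Byte] = {
  val in = new InflaterInputStream(new ByteArrayInputStream(bytes))
  val baos = new ByteArrayOutputStream
  val buf = new Array[Byte](512)
  var n = in.read(buf)
  while (n != -1) {
    baos.write(buf, 0, n)
    n = in.read(buf)
  }
  in.close()
  baos.toByteArray
}
```

With this pair, new String(decompress(compress("face".getBytes))) yields "face" again.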