Writing files takes a lot of time - Scala

I am writing three Lists of TripleInts with approximately 277,270 rows.
My class TripleInts is the following:
class tripleInt (var sub:Int, var pre:Int, var obj:Int)
Additionally, I create each list with Apache Jena components from an RDF file: I transform the RDF elements to ids and store these ids in the different lists. Once I have the lists, I write the files with the following code:
class Indexes(val listSPO: List[tripleInt], val listPSO: List[tripleInt], val listOSP: List[tripleInt]) {
  val sl = listSPO.sortBy(l => (l.sub, l.pre))
  val pl = listPSO.sortBy(l => (l.sub, l.pre))
  //val ol = listOSP.sortBy(l => (l.sub, l.pre))
  var y1: Int = 0
  var y2: Int = 0
  var y3: Int = 0
  val fstream: FileWriter = new FileWriter("patSPO.dat")
  var out: BufferedWriter = new BufferedWriter(fstream)
  //val fstream:FileOutputStream = new FileOutputStream("patSPO.dat")
  //var out:ObjectOutputStream = new ObjectOutputStream(fstream)
  //out.writeObject(listSPO)
  val fstream2: FileWriter = new FileWriter("patPSO.dat")
  var out2: BufferedWriter = new BufferedWriter(fstream2)
  /*val fstream3:FileOutputStream = new FileOutputStream("patOSP.dat")
  var out3:BufferedOutputStream = new BufferedOutputStream(fstream3)*/
  for (a <- 0 to sl.size - 1) {
    y1 = sl(a).sub
    y2 = sl(a).pre
    y3 = sl(a).obj
    out.write(y1.toString + "," + y2.toString + "," + y3.toString + "\n")
  }
  for (a <- 0 to pl.size - 1) {
    y1 = pl(a).sub
    y2 = pl(a).pre
    y3 = pl(a).obj
    out2.write(y1.toString + "," + y2.toString + "," + y3.toString + "\n")
  }
  out.close()
  out2.close()
}
This process takes approximately 30 minutes. My PC has 16 GB of RAM and a Core i7, so I don't understand why it takes so long. Is there a way to optimize this?
Thank you

Yes, you need to choose your data structures wisely. List is meant for sequential access (Seq), not random access (IndexedSeq): indexing into it, as in sl(a), is O(n), so indexing a large List inside a loop makes the whole pass O(n^2). The following should be much faster (O(n), and hopefully easier to read too):
import java.io.{BufferedWriter, FileWriter}

class Indexes(val listSPO: List[tripleInt], val listPSO: List[tripleInt], val listOSP: List[tripleInt]) {
  val sl = listSPO.sortBy(l => (l.sub, l.pre))
  val pl = listPSO.sortBy(l => (l.sub, l.pre))
  var y1: Int = 0
  var y2: Int = 0
  var y3: Int = 0
  val fstream: FileWriter = new FileWriter("patSPO.dat")
  val out: BufferedWriter = new BufferedWriter(fstream)
  for (s <- sl) {
    y1 = s.sub
    y2 = s.pre
    y3 = s.obj
    out.write(s"$y1,$y2,$y3\n")
  }
  // TODO close in finally
  out.close()
  val fstream2: FileWriter = new FileWriter("patPSO.dat")
  val out2: BufferedWriter = new BufferedWriter(fstream2)
  for (p <- pl) {
    y1 = p.sub
    y2 = p.pre
    y3 = p.obj
    out2.write(s"$y1,$y2,$y3\n")
  }
  // TODO close in finally
  out2.close()
}
(It would not hurt to use IndexedSeq/Vector as inputs, but there may be constraints that make List preferable in your case.)
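As a further optional cleanup, the two loops can be factored into one small helper that also closes its writer in a finally block, as the TODOs above suggest. This is only a sketch on top of the question's tripleInt class and input lists, not a drop-in replacement for the class:

import java.io.{BufferedWriter, FileWriter}

// Write one sorted index to a file, closing the writer even if writing fails.
def writeTriples(triples: Seq[tripleInt], fileName: String): Unit = {
  val out = new BufferedWriter(new FileWriter(fileName))
  try triples.foreach(t => out.write(s"${t.sub},${t.pre},${t.obj}\n"))
  finally out.close()
}

writeTriples(listSPO.sortBy(l => (l.sub, l.pre)), "patSPO.dat")
writeTriples(listPSO.sortBy(l => (l.sub, l.pre)), "patPSO.dat")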

Related

Load a csv file into a Breeze DenseMatrix[Double]

I have a CSV file and I want to load it into a Breeze DenseMatrix[Double].
This code will eventually work, but I think it's not the Scala way of doing things:
val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.toArray
val numRows: Int = tmp.size
val numCols: Int = tmp(0).split(",").size
val m = DenseMatrix.zeros[Double](numRows, numCols)
//Now do some for loops and fill the matrix
Is there a more elegant and functional way of doing this?
val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.map(l => l.split(",").map(str => str.toDouble)).toList
val m = DenseMatrix(tmp:_*)
much better
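For completeness, here is the same idea with the imports spelled out and the Source closed when done. This is a sketch, not part of the original answer, and scala.util.Using requires Scala 2.13 or newer:

import breeze.linalg.DenseMatrix
import scala.io.Source
import scala.util.Using

// Read the CSV rows, convert them to Array[Double], and build the matrix,
// releasing the Source once the block finishes.
val m: DenseMatrix[Double] = Using.resource(Source.fromResource("data/houses.txt")) { src =>
  val rows = src.getLines().map(_.split(",").map(_.toDouble)).toList
  DenseMatrix(rows: _*)
}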

How to create a DataFrame using a string consisting of key-value pairs?

I'm getting logs from a firewall in CEF format as a string which looks like:
ABC|XYZ|F123|1.0|DSE|DSE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0
How can I create a DataFrame from this kind of string, where I'm getting key-value pairs separated by =?
My objective is to infer the schema from this string using the keys dynamically, i.e. extract the keys from the left side of the = operator and create a schema from them.
What I am doing currently is pretty lame (IMHO) and not very dynamic in approach, because the number of key-value pairs can change for different types of logs.
val a: String = "ABC|XYZ|F123|1.0|DSE|DCE|4|externalId=e705265d0d9e4d4fcb218b cn2=329160 cn1=3053998 dhost=SRV2019 duser=admin msg=Process accessed NTDS fname=ntdsutil.exe filePath=\\Device\\HarddiskVolume2\\Windows\\System32 cs5="C:\\Windows\\system32\\ntdsutil.exe" "ac i ntds" ifm "create full ntdstest3" q q fileHash=80c8b68240a95 dntdom=adminDomain cn3=13311 rt=1610948650000 tactic=Credential Access technique=Credential Dumping objective=Gain Access patternDisposition=Detection. outcome=0"
val ttype: String = "DCE"
type parseReturn = (String,String,List[String],Int)
def cefParser(a: String, ttype: String): parseReturn = {
  val firstPart = a.split("\\|")
  var pD = new ListBuffer[String]()
  var listSize: Int = 0
  if (firstPart.size == 8 && firstPart(4) == ttype) {
    pD += firstPart(0)
    pD += firstPart(1)
    pD += firstPart(2)
    pD += firstPart(3)
    pD += firstPart(4)
    pD += firstPart(5)
    pD += firstPart(6)
    val secondPart = parseSecondPart(firstPart(7), ttype)
    pD ++= secondPart
    listSize = pD.toList.length
    (firstPart(2), ttype, pD.toList, listSize)
  } else {
    val temp: List[String] = List(a)
    (firstPart(2), "IRRELEVANT", temp, temp.length)
  }
}
The method parseSecondPart is:
def parseSecondPart(m: String, ttype: String): ListBuffer[String] = ttype match {
  case auditActivity.ttype => parseAuditEvent(m)
Another function call just replaces some text in the logs:
def parseAuditEvent(msg: String): ListBuffer[String] = {
  val updated_msg = msg.replace("cat=", "metadata_event_type=")
    .replace("destinationtranslatedaddress=", "event_user_ip=")
    .replace("duser=", "event_user_id=")
    .replace("deviceprocessname=", "event_service_name=")
    .replace("cn3=", "metadata_offset=")
    .replace("outcome=", "event_success=")
    .replace("devicecustomdate1=", "event_utc_timestamp=")
    .replace("rt=", "metadata_event_creation_time=")
  parseEvent(updated_msg)
}
Final function to get only the values:
def parseEvent(msg: String): ListBuffer[String] = {
  val newMsg = msg.replace("\\=", "$_equal_$")
  val pD = new ListBuffer[String]()
  val splitData = newMsg.split("=")
  val mSize = splitData.size
  for (i <- 1 until mSize) {
    if (i < mSize - 1) {
      val a = splitData(i).split(" ")
      val b = a.size - 1
      val c = a.slice(0, b).mkString(" ")
      pD += c.replace("$_equal_$", "=")
    } else if (i == mSize - 1) {
      val a = splitData(i).replace("$_equal_$", "=")
      pD += a
    } else {
      logExceptions(newMsg)
    }
  }
  pD
}
The return contains a ListBuffer[String] at the 3rd position, using which I create a DataFrame as follows:
val df = ss.sqlContext
.createDataFrame(tempRDD.filter(x => x._1 != "IRRELEVANT")
.map(x => Row.fromSeq(x._3)), schema)
People of Stack Overflow, I really need your help in improving my code, both for performance and approach.
Any kind of help and/or suggestions will be highly appreciated.
Thanks in advance.
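As a rough illustration of the "extract the keys dynamically and build the schema from them" idea described above, here is a sketch that is not part of the thread (the kvPattern regex and the toKeyValues helper are illustrative assumptions); it pulls the key=value pairs out of the extension part of the CEF string and turns them into a single-row DataFrame:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Each key is a run of word characters followed by "=";
// its value runs until the next key or the end of the string.
val kvPattern = """(\w+)=""".r

def toKeyValues(extension: String): Seq[(String, String)] = {
  val keys = kvPattern.findAllMatchIn(extension).toVector
  keys.zipWithIndex.map { case (m, i) =>
    val end = if (i + 1 < keys.size) keys(i + 1).start else extension.length
    m.group(1) -> extension.substring(m.end, end).trim
  }
}

val spark = SparkSession.builder().master("local[*]").appName("cef").getOrCreate()
val extension = a.split("\\|")(7) // `a` is the CEF string from the question
val kvs = toKeyValues(extension)
val schema = StructType(kvs.map { case (k, _) => StructField(k, StringType, nullable = true) })
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row.fromSeq(kvs.map(_._2)))),
  schema)
df.show(false)

Because each value is taken to run up to the next key= token, multi-word values such as msg=Process accessed NTDS survive the split.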

How to iterate over files and perform action on them - Scala Spark

I am reading 1000s of .eml files (message/email files) one by one from a directory, parsing them and extracting values from them using the javax.mail APIs, and in the end storing them in a DataFrame. Sample code below:
var x = Seq[DataFrame]()
val emlFiles = getListOfFiles("tmp/sample")
val fileCount = emlFiles.length
val fs = FileSystem.get(sc.hadoopConfiguration)
for (i <- 0 until fileCount) {
  var emlData = spark.emptyDataFrame
  val f = new File(emlFiles(i))
  val fileName = f.getName()
  val path = Paths.get(emlFiles(i))
  val session = Session.getInstance(new Properties())
  val messageIn = new FileInputStream(path.toFile())
  val mimeJournal = new MimeMessage(session, messageIn)
  // Extracting Metadata
  val Receivers = mimeJournal.getHeader("From")(0)
  val Senders = mimeJournal.getHeader("To")(0)
  val Date = mimeJournal.getHeader("Date")(0)
  val Subject = mimeJournal.getHeader("Subject")(0)
  val Size = mimeJournal.getSize
  emlData = Seq((fileName, Receivers, Senders, Date, Subject, Size)).toDF("fileName", "Receivers", "Senders", "Date", "Subject", "Size")
  x = emlData +: x
}
The problem is that I am using a for loop to do this and it's taking a lot of time. Is there a way to get rid of the for loop and read the files?
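One common restructuring, shown only as a sketch and not taken from the thread, is to parse every file into a plain tuple first and call toDF once at the end, instead of building and prepending one DataFrame per file. It reuses the question's getListOfFiles, spark, and javax.mail parsing, and assumes spark.implicits can be imported:

import java.io.{File, FileInputStream}
import java.util.Properties
import javax.mail.Session
import javax.mail.internet.MimeMessage

import spark.implicits._

val session = Session.getInstance(new Properties())

// Parse each .eml file into a tuple of plain values.
val rows = getListOfFiles("tmp/sample").map { pathStr =>
  val f = new File(pathStr)
  val in = new FileInputStream(f)
  try {
    val msg = new MimeMessage(session, in)
    (f.getName,
      msg.getHeader("From")(0),
      msg.getHeader("To")(0),
      msg.getHeader("Date")(0),
      msg.getHeader("Subject")(0),
      msg.getSize)
  } finally in.close()
}

// Build a single DataFrame at the end instead of one per file.
val emlData = rows.toDF("fileName", "Receivers", "Senders", "Date", "Subject", "Size")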

Join two strings in Scala with one to one mapping

I have two strings in Scala
Input 1 : "a,c,e,g,i,k"
Input 2 : "b,d,f,h,j,l"
How do I join the two Strings in Scala?
Required output = "ab,cd,ef,gh,ij,kl"
I tried something like:
var columnNameSetOne:Array[String] = Array(); //v1 = "a,c,e,g,i,k"
var columnNameSetTwo:Array[String] = Array(); //v2 = "b,d,f,h,j,l"
After I get the input data as mentioned above:
columnNameSetOne = v1.split(",")
columnNameSetTwo = v2.split(",");
val newColumnSet = IntStream.range(0, Math.min(columnNameSetOne.length, columnNameSetTwo.length)).mapToObj(j => (columnNameSetOne(j) + columnNameSetTwo(j))).collect(Collectors.joining(","));
println(newColumnSet)
But I am getting an error on j.
Also, I am not sure if this would work!
object Solution1 extends App {
  val input1 = "a,c,e,g,i,k"
  val input2 = "b,d,f,h,j,l"
  val i1 = input1.split(",")
  val i2 = input2.split(",")
  val x = i1.zipAll(i2, "", "").map {
    case (a, b) => a + b
  }
  println(x.mkString(","))
}
//output : ab,cd,ef,gh,ij,kl
Easy to do using zip function on list.
val v1 = "a,c,e,g,i,k"
val v2 = "b,d,f,h,j,l"
val list1 = v1.split(",").toList
val list2 = v2.split(",").toList
list1.zip(list2).mkString(",") // res0: String = (a,b),( c,d),( e,f),( g,h),( i,j),( k,l)
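To get exactly the required "ab,cd,ef,gh,ij,kl" rather than the tuple output shown above, the pairs can be concatenated before joining (a small addition, not part of the original answer):

list1.zip(list2).map { case (a, b) => a + b }.mkString(",") // res1: String = ab,cd,ef,gh,ij,kl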

Task not serializable in scala

In my application, I'm using the parallelize method to save an Array to a file.
code as follows:
val sourceRDD = sc.textFile(inputPath + "/source")
val destinationRDD = sc.textFile(inputPath + "/destination")
val source_primary_key = sourceRDD.map(rec => (rec.split(",")(0).toInt, rec))
val destination_primary_key = destinationRDD.map(rec => (rec.split(",")(0).toInt, rec))
val extra_in_source = source_primary_key.subtractByKey(destination_primary_key)
val extra_in_destination = destination_primary_key.subtractByKey(source_primary_key)
val source_subtract = source_primary_key.subtract(destination_primary_key)
val Destination_subtract = destination_primary_key.subtract(source_primary_key)
val exact_bestmatch_src = source_subtract.subtractByKey(extra_in_source).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_Dest = Destination_subtract.subtractByKey(extra_in_destination).sortByKey(true).map(rec => (rec._2))
val exact_bestmatch_src_p = exact_bestmatch_src.map(rec => (rec.split(",")(0).toInt))
val primary_key_distinct = exact_bestmatch_src_p.distinct.toArray()
for (i <- primary_key_distinct) {
  var dummyVar: String = ""
  val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
  var dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray
  for (print1 <- src) {
    var sourceArr: Array[String] = print1.split(",")
    var exactbestMatchCounter: Int = 0
    var index: Array[Int] = new Array[Int](1)
    println(print1 + "source")
    for (print2 <- dest) {
      var bestMatchCounter = 0
      var i: Int = 0
      println(print1 + "source + destination" + print2)
      for (i <- 0 until sourceArr.length) {
        if (print1.split(",")(i).equals(print2.split(",")(i))) {
          bestMatchCounter += 1
        }
      }
      if (exactbestMatchCounter < bestMatchCounter) {
        exactbestMatchCounter = bestMatchCounter
        dummyVar = print2
        index +:= exactbestMatchCounter //9,8,9
      }
    }
    var z = index.zipWithIndex.maxBy(_._1)._2
    if (exactbestMatchCounter >= 0) {
      var samparr: Array[String] = new Array[String](4)
      samparr +:= print1 + " BEST_MATCH " + dummyVar
      var deletedest: Array[String] = new Array[String](1)
      deletedest = dest.take(z) ++ dest.drop(1)
      dest = deletedest
      val myFile = sc.parallelize((samparr)).saveAsTextFile(outputPath)
I have used the parallelize method, and I even tried the method below to save it as a file:
val myFile = sc.textFile(samparr.toString())
val finalRdd = myFile
finalRdd.coalesce(1).saveAsTextFile(outputPath)
but it keeps throwing the error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
You can't treat an RDD like a local collection. All operations against it happen over a distributed cluster. To work, all functions you run against that RDD must be serializable.
The line
for (print1 <- src) {
Here you are iterating over the RDD src; everything inside the loop must be serializable, as it will be run on the executors.
Inside, however, you try to run sc.parallelize(...) while still inside that loop. SparkContext is not serializable. Working with RDDs and the SparkContext are things you do on the driver, and cannot do within an RDD operation.
I'm not entirely sure what you are trying to accomplish, but it looks like some sort of hand-coded join operation between the source and destination. You can't loop over RDDs like you can over local collections. Make use of the APIs map, join, groupBy, and the like to create your final RDD, then save that.
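For illustration, here is a rough sketch of that join-based shape. It is not from the original answer; it assumes exact_bestmatch_src and exact_bestmatch_Dest are the RDD[String]s built above and scores a candidate pair by the number of matching comma-separated fields:

// Pair source and destination lines that share a primary key...
val candidates = exact_bestmatch_src
  .map(line => (line.split(",")(0).toInt, line))
  .join(exact_bestmatch_Dest.map(line => (line.split(",")(0).toInt, line)))

// ...score each pair by the number of identical fields, keep the best per key...
val best = candidates
  .map { case (key, (srcLine, destLine)) =>
    val score = srcLine.split(",").zip(destLine.split(",")).count { case (s, d) => s == d }
    (key, (score, srcLine + " BEST_MATCH " + destLine))
  }
  .reduceByKey((a, b) => if (a._1 >= b._1) a else b)
  .map { case (_, (_, line)) => line }

// ...and save once, outside any loop.
best.saveAsTextFile(outputPath)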
If you absolutely feel you must use a foreach loop over the RDD like this, then you can't use sc.parallelize().saveAsTextFile(). Instead, open an output stream using the Hadoop file API and write your array to the file manually.
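For example, something along these lines (a sketch only, not from the original answer; the bestmatch.txt file name under outputPath is made up):

import java.io.PrintWriter
import org.apache.hadoop.fs.{FileSystem, Path}

// Open a stream through the Hadoop FileSystem API on the driver and write the
// local array directly, instead of parallelizing it back into an RDD.
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = new PrintWriter(fs.create(new Path(outputPath + "/bestmatch.txt")))
try samparr.foreach(out.println)
finally out.close()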
Finally, this piece of code helped me save the array to a file:
new PrintWriter(outputPath) { write(array.mkString(" ")); close }