Load a CSV file into a Breeze DenseMatrix[Double] - Scala

I have a CSV file that I want to load into a Breeze DenseMatrix[Double].
The following code will eventually work, but I don't think it's the Scala way of doing things:
import scala.io.Source
import breeze.linalg.DenseMatrix

val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.toArray
val numRows: Int = tmp.size
val numCols: Int = tmp(0).split(",").size
val m = DenseMatrix.zeros[Double](numRows, numCols)
// Now do some for loops and fill the matrix
Is there a more elegant and functional way of doing this?

val resource = Source.fromResource("data/houses.txt")
val lines: Iterator[String] = resource.getLines
val tmp = lines.map(l => l.split(",").map(str => str.toDouble)).toList
val m = DenseMatrix(tmp:_*)
much better
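One small follow-up: neither version closes the Source. A minimal sketch of the same idea with automatic resource handling, assuming Scala 2.13+ where scala.util.Using is available:

import scala.io.Source
import scala.util.Using
import breeze.linalg.DenseMatrix

// Parse the resource into a DenseMatrix and close it afterwards (sketch, Scala 2.13+).
val m: DenseMatrix[Double] = Using.resource(Source.fromResource("data/houses.txt")) { src =>
  val rows = src.getLines().map(_.split(",").map(_.toDouble)).toList
  DenseMatrix(rows: _*)
}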

Related

How to iterate over files and perform action on them - Scala Spark

I am reading about 1000 .eml files (email message files) one by one from a directory, parsing them and extracting values with the javax.mail APIs, and in the end storing them in a DataFrame. Sample code below:
import java.io.{File, FileInputStream}
import java.nio.file.Paths
import java.util.Properties
import javax.mail.Session
import javax.mail.internet.MimeMessage
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.DataFrame
import spark.implicits._

var x = Seq[DataFrame]()
val emlFiles = getListOfFiles("tmp/sample")
val fileCount = emlFiles.length
val fs = FileSystem.get(sc.hadoopConfiguration)
for (i <- 0 until fileCount) {
  var emlData = spark.emptyDataFrame
  val f = new File(emlFiles(i))
  val fileName = f.getName()
  val path = Paths.get(emlFiles(i))
  val session = Session.getInstance(new Properties())
  val messageIn = new FileInputStream(path.toFile())
  val mimeJournal = new MimeMessage(session, messageIn)
  // Extracting metadata
  val Receivers = mimeJournal.getHeader("From")(0)
  val Senders = mimeJournal.getHeader("To")(0)
  val Date = mimeJournal.getHeader("Date")(0)
  val Subject = mimeJournal.getHeader("Subject")(0)
  val Size = mimeJournal.getSize
  emlData = Seq((fileName, Receivers, Senders, Date, Subject, Size)).toDF("fileName", "Receivers", "Senders", "Date", "Subject", "Size")
  x = emlData +: x
}
The problem is that I am using a for loop to do this and it is taking a lot of time. Is there a way to avoid the for loop when reading the files?
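No answer is recorded here, but as a hedged sketch of one common restructuring: parse each file into a plain tuple first, then make a single toDF call at the end instead of building a DataFrame per file. It assumes getListOfFiles returns file paths as strings (as implied by the code above); everything not taken from the question is an assumption.

import java.io.{File, FileInputStream}
import java.util.Properties
import javax.mail.Session
import javax.mail.internet.MimeMessage
import spark.implicits._

// Parse one .eml file into a tuple of its metadata (sketch; mirrors the extraction above).
def parseEml(path: String): (String, String, String, String, String, Int) = {
  val session = Session.getInstance(new Properties())
  val in = new FileInputStream(new File(path))
  try {
    val msg = new MimeMessage(session, in)
    (new File(path).getName,
      msg.getHeader("From")(0),
      msg.getHeader("To")(0),
      msg.getHeader("Date")(0),
      msg.getHeader("Subject")(0),
      msg.getSize)
  } finally in.close()
}

// One map over the file list, one toDF call at the end instead of a DataFrame per file.
val rows = getListOfFiles("tmp/sample").map(parseEml)
val emlData = rows.toDF("fileName", "Receivers", "Senders", "Date", "Subject", "Size")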

How to convert a Seq of tuples into Sets of individual elements in Scala

We have a sequence of tuples, depTitleSeq: Seq[(department, title)], from which we would like to extract a Set(department) and a Set(title). We are looking for the best way to do this; so far the best we could come up with is:
val depTitleSeq = getDepTitleTupleSeq()
var departmentSeq = ArrayBuffer[String]()
var titleSeq = ArrayBuffer[String]()
for (depTitle <- depTitleSeq) yield {
  departmentSeq += depTitle._1
  titleSeq += depTitle._2
}
val depSet = departmentSeq.toSet
val titleSet = titleSeq.toSet
We are fairly new to Scala; I'm sure there are better and more efficient ways to achieve this. If you could point us in the right direction it would be of great help.
If you have two Seqs of data that you want combined into a Seq of tuples, you can zip them together.
If you have a Seq of tuples and you want the elements separated, then you can unzip them.
val (departmentSeq, titleSeq) = getDepTitleTupleSeq().unzip
val depSet :Set[String] = departmentSeq.toSet
val titleSet :Set[String] = titleSeq.toSet
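For reference, the zip direction mentioned above (not needed here) is just the inverse of unzip; a small sketch with made-up data:

val departments = Seq("x", "y")
val titles = Seq("a", "b")

// zip pairs elements positionally; unzip above is its inverse.
val depTitleSeq: Seq[(String, String)] = departments.zip(titles)
// Seq(("x","a"), ("y","b"))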
val depTitleSeq = Seq(("x","a"),("y","b"))
val depSet = depTitleSeq.map(_._1).toSet
val titleSet = depTitleSeq.map(_._2).toSet
In Scala REPL:
scala> val depTitleSeq = Seq(("x","a"),("y","b"))
depTitleSeq: Seq[(String, String)] = List((x,a), (y,b))
scala> val depSet = depTitleSeq.map(_._1).toSet
depSet: scala.collection.immutable.Set[String] = Set(x, y)
scala> val titleSet = depTitleSeq.map(_._2).toSet
titleSet: scala.collection.immutable.Set[String] = Set(a, b)
You can also use foldLeft to achieve this:
val result: (Set[String], Set[String]) =
  depTitleSeq.foldLeft((Set[String](), Set[String]())) { (a, b) => (a._1 + b._1, a._2 + b._2) }

Overlap in the data - Spark Scala

I want to find the overlap between each pair of brands. For example, when I compare HORLICKS vs BOOST, I have to find the HORLICKS-only percentage, the BOOST-only percentage, and their intersection. It is basically a Venn diagram problem. I have computed this for one combination, but I want to compute it for all combinations, like HORLICKS vs BOOST, HORLICKS vs NESTLE, HORLICKS vs BOURNVITA, etc.
Can somebody help me? I am new to Spark.
Below is my code:
val A_storeList = sourceDf.where(col("CATEGORY").equalTo(item1(0)) and col("SUBCATEGORY").equalTo(item1(1)) and col("product_char_name").equalTo(item1(2)) and col("product_charval_dsc").equalTo(item1(3))).select("store_id").collect().map(_(0)).distinct
val B_storeList = sourceDf.where(col("CATEGORY").equalTo(item2(0)) and col("SUBCATEGORY").equalTo(item2(1)) and col("product_char_name").equalTo(item2(2)) and col("product_charval_dsc").equalTo(item2(3))).select("store_id").collect().map(_(0)).distinct
val aAndBstoreList = A_storeList.intersect(B_storeList)
val AunionB_storeList = A_storeList.union(B_storeList).distinct
val AOnly_storeList = A_storeList.diff(B_storeList)
val Bonly_storeList = B_storeList.diff(A_storeList)
val subSetOfSourceDf = sourceDf.withColumn("Versus",lit(item1(3)+"Vs"+item2(3)))
val A = subSetOfSourceDf.where(col("store_id").isin(A_storeList:_*)).withColumn("Venn",lit("A"))
val B = subSetOfSourceDf.where(col("store_id").isin(B_storeList:_*)).withColumn("Venn",lit("B"))
val AandB = subSetOfSourceDf.where(col("store_id").isin(aAndBstoreList:_*)).withColumn("product_charval_dsc",when(col("product_charval_dsc").equalTo(item1(3)),item1(3)+" and "+item2(3)).when(col("product_charval_dsc").equalTo(item2(3)),item1(3)+" and "+item2(3)).otherwise(col("product_charval_dsc"))).withColumn("Venn",lit("AintersectB"))
val AunionB = subSetOfSourceDf.where(col("store_id").isin(AunionB_storeList:_*)).withColumn("product_charval_dsc",when(col("product_charval_dsc").equalTo(item2(3)),item1(3)+" and "+item2(3)).when(col("product_charval_dsc").equalTo(item1(3)),item1(3)+" and "+item2(3)).otherwise(col("product_charval_dsc"))).withColumn("Venn",lit("AunionB"))
val AOnly = subSetOfSourceDf.where(col("store_id").isin(AOnly_storeList:_*)).withColumn("Venn",lit("AOnly"))
val BOnly = subSetOfSourceDf.where(col("store_id").isin(Bonly_storeList:_*)).withColumn("Venn",lit("BOnly"))
val allInOne = A.union(B.union(AandB.union(AunionB).union(AOnly.union(BOnly))))
val divisor = allInOne.where((col("Venn").equalTo("A").and(col("product_charval_dsc").equalTo(item1(3)))) or (col("Venn").equalTo("B").and(col("product_charval_dsc").equalTo(item2(3)))) )
.groupBy("CATEGORY","SUBCATEGORY","product_char_name").agg(sum("SALVAL") as "TOTAL")
val finalDf1 = allInOne.groupBy("CATEGORY","SUBCATEGORY","product_char_name","product_charval_dsc","Venn")
.agg(sum("SALVAL") as "SALVAL")
.where(col("product_charval_dsc").equalTo(item1(3)) or col("product_charval_dsc").equalTo(item2(3)) or col("product_charval_dsc").equalTo(item1(3)+" and "+item2(3)))
val outputDf =finalDf1.join(divisor,Seq("CATEGORY","SUBCATEGORY","product_char_name"))
.withColumn("SALE_PERCENT",col("SALVAL")/col("TOTAL") multiply(100)).withColumn("Versus",lit(item1(3)+" Vs "+item2(3)))
With this code I have generated the output, but I want to know how to do this for all combinations.
Generated result: (output not shown)
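No answer is recorded here, but a sketch of the usual approach to "all combinations": collect the distinct brand descriptors, then iterate over every unordered pair with combinations(2) and apply the existing pairwise logic to each pair. The pairwiseVenn helper below is hypothetical and stands for the code above wrapped into a function of (item1, item2); everything else is an assumption.

import org.apache.spark.sql.DataFrame

// Hypothetical wrapper around the pairwise logic from the question.
def pairwiseVenn(item1: Seq[String], item2: Seq[String]): DataFrame = ???

// Distinct (CATEGORY, SUBCATEGORY, product_char_name, product_charval_dsc) descriptors.
val brands: Seq[Seq[String]] = sourceDf
  .select("CATEGORY", "SUBCATEGORY", "product_char_name", "product_charval_dsc")
  .distinct()
  .collect()
  .toSeq
  .map(_.toSeq.map(_.toString))

// Every unordered pair, e.g. HORLICKS vs BOOST, HORLICKS vs NESTLE, ...
val allResults: DataFrame = brands
  .combinations(2)
  .map { case Seq(a, b) => pairwiseVenn(a, b) }
  .reduce(_ union _)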

Iterating through files in Scala to create values based on the file names

I think there may be a simple solution to this: I was wondering if anybody knew how to iterate over a set of files and output a value based on each file's name.
My problem is that I want to read in a set of graph edges for each month and then create separate monthly graphs.
Currently I've done this the long way, which is fine for one year's worth, but I'd like a way to automate it.
You can see my code below which hopefully clearly shows what I am doing.
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Edge, Graph}

// Load vertex data
val vertices = (sc.textFile("D:~vertices.csv")
  .map(line => line.split(",")).map(parts => (parts.head.toLong, parts.tail)))
// Define a function for creating edges from a csv file
def EdgeMaker(file: RDD[String]): RDD[Edge[String]] = {
  file.flatMap { line =>
    if (!line.isEmpty && line(0) != '#') {
      val lineArray = line.split(",")
      if (lineArray.length < 3) {
        Seq.empty[Edge[String]]
      } else {
        val srcId = lineArray(0).toLong
        val dstId = lineArray(1).toLong
        val ID = lineArray(2)
        Seq(Edge(srcId, dstId, ID))
      }
    } else {
      Seq.empty[Edge[String]]
    }
  }
}
//make graphs -This is where I want automation, so I can iterate through a
//folder of edge files and output corresponding monthly graphs.
val edgesJan = EdgeMaker(sc.textFile("D:~edges2011Jan.txt"))
val graphJan = Graph(vertices, edgesJan)
val edgesFeb = EdgeMaker(sc.textFile("D:~edges2011Feb.txt"))
val graphFeb = Graph(vertices, edgesFeb)
val edgesMar = EdgeMaker(sc.textFile("D:~edges2011Mar.txt"))
val graphMar = Graph(vertices, edgesMar)
val edgesApr = EdgeMaker(sc.textFile("D:~edges2011Apr.txt"))
val graphApr = Graph(vertices, edgesApr)
val edgesMay = EdgeMaker(sc.textFile("D:~edges2011May.txt"))
val graphMay = Graph(vertices, edgesMay)
val edgesJun = EdgeMaker(sc.textFile("D:~edges2011Jun.txt"))
val graphJun = Graph(vertices, edgesJun)
val edgesJul = EdgeMaker(sc.textFile("D:~edges2011Jul.txt"))
val graphJul = Graph(vertices, edgesJul)
val edgesAug = EdgeMaker(sc.textFile("D:~edges2011Aug.txt"))
val graphAug = Graph(vertices, edgesAug)
val edgesSep = EdgeMaker(sc.textFile("D:~edges2011Sep.txt"))
val graphSep = Graph(vertices, edgesSep)
val edgesOct = EdgeMaker(sc.textFile("D:~edges2011Oct.txt"))
val graphOct = Graph(vertices, edgesOct)
val edgesNov = EdgeMaker(sc.textFile("D:~edges2011Nov.txt"))
val graphNov = Graph(vertices, edgesNov)
val edgesDec = EdgeMaker(sc.textFile("D:~edges2011Dec.txt"))
val graphDec = Graph(vertices, edgesDec)
Any help or pointers on this would be much appreciated.
You can use SparkContext's wholeTextFiles to keep the file name, and use that String for naming, filtering, etc. of your values and output.
val fileLoad = sc.wholeTextFiles("hdfs:///..Path").map { case (filename, content) => ... }
SparkContext's textFile only reads the data; it does not keep the file name.
----EDIT----
Sorry, I seem to have misunderstood the question; you can load multiple files using
sc.wholeTextFiles("~/path/file[0-5]*,/anotherPath/*.txt").map { case (filename, content) => ... }
The asterisk * should load all files in the path, assuming they are all supported input file types.
This read will concatenate all your files into a single large RDD, avoiding repeated calls (since each call requires the path and file name, which I think is what you want to avoid).
Reading with the file name allows you to group by file name and apply your graph function to each group.
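As a hedged sketch of how this could look for the monthly graphs above, using wholeTextFiles together with the EdgeMaker and vertices from the question; the month extraction assumes a file naming scheme like edges2011Jan.txt:

// One (fileName, content) pair per edge file; assumes all monthly files sit in one folder.
val monthlyGraphs: Map[String, Graph[Array[String], String]] =
  sc.wholeTextFiles("D:~edges2011*.txt")
    .collect()                                    // a handful of files, so collecting to the driver is fine for modest sizes
    .map { case (fileName, content) =>
      // e.g. ".../edges2011Jan.txt" -> "Jan" (naming scheme assumed from the question)
      val month = fileName.stripSuffix(".txt").takeRight(3)
      val edges = EdgeMaker(sc.parallelize(content.split("\n").toSeq))
      month -> Graph(vertices, edges)
    }
    .toMap

// Usage: monthlyGraphs("Jan") is the January graph.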

Convert a Scala String to RDD[Seq[String]]

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

// 4 workers
val sc = new SparkContext("local[4]", "naivebayes")
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("/tmp/test.txt").map(_.split(" ").toSeq)
documents.zipWithIndex.foreach {
  case (e, i) =>
    val collectedResult = Tokenizer.tokenize(e.mkString)
}
val hashingTF = new HashingTF()
//pass collectedResult instead of document
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
In the above code snippet, I want to extract collectedResult so I can reuse it for hashingTF.transform. How can this be achieved, given that the signature of the tokenize function is:
def tokenize(content: String): Seq[String] = {
...
}
Looks like you want map rather than foreach. I don't understand what you're using zipWithIndex for, nor why you're calling split on your lines only to join them straight back up again with mkString.
val lines: RDD[String] = sc.textFile("/tmp/test.txt")
val tokenizedLines = lines.map(tokenize)                 // one Seq[String] per line
val hashes = tokenizedLines.map(hashingTF.transform(_))  // one term-frequency Vector per line
hashes.cache()
...