Build sparse matrix scala spark - scala

I have an input file of the form
(id | column_name | value)
...
column_name can take on some 50 names, and the list of ids can be huge.
I want to build a tall-and-skinny matrix whose (i,j) coefficient corresponds to the value found at (id, column_name) with id mapped to i and column name mapped to j.
So far, here's my approach
I load the file:
val f = sc.textFile("example.txt")
val data = f.map(_.split('|') match {
  case Array(id, column_name, score) =>
    (id.toInt, column_name.toString, score.toDouble)
})
Then I build the column_name and id lists:
val column_name_list = data.map(x => x._2).distinct.collect.zipWithIndex
val ids_list = data.map(x => x._1).distinct.collect.zipWithIndex
val nCols = column_name_list.length
val nRows = ids_list.length
and then I build a CoordinateMatrix, defining the entries using the mapping I just created:
val broadcastcolumn_name = sc.broadcast(column_name_list.toMap)
val broadcastIds = sc.broadcast(ids_list.toMap)
val matrix_entries_tmp = data.map {
  case (id, column_name, score) =>
    (broadcastIds.value.getOrElse(id, 0), broadcastcolumn_name.value.getOrElse(column_name, 0), score)
}
val matrix_entries = matrix_entries_tmp.map {
  e => MatrixEntry(e._1, e._2, e._3)
}
val coo_matrix = new CoordinateMatrix(matrix_entries)
This works fine on small examples. However, I get a memory error when the id list gets huge. The problem seems to be:
val ids_list = data.map(x=>x._1).distinct.collect.zipWithIndex
which triggers the memory error.
What would be a workaround? I actually don't really need the id mapping. What is important are the column names and that each row corresponds to some (lost) id. I was thinking about using an IndexedRowMatrix but I am stuck on how to do it.
Thanks for the help!!

CoordinateMatrix
Too ugly to be a decent solution, but it should give you somewhere to start.
First, let's create a mapping between column name and index:
val colIdxMap = sc.broadcast(data.
  map({ case (row, col, value) => col }).
  distinct.
  zipWithIndex.
  collectAsMap)
Group columns by row id and map values to pairs (colIdx, value):
val values = data.
  groupBy({ case (row, col, value) => row }).
  mapValues({ _.map { case (_, col, value) => (colIdxMap.value(col), value) } }).
  values
Generate entries:
val entries = values.
  zipWithIndex.
  flatMap { case (vals, row) =>
    vals.map { case (col, value) => MatrixEntry(row, col, value) }
  }
Create a final matrix:
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
RowMatrix
If row ids are not important at all you can use a RowMatrix as follows:
First, let's group the data by row:
val dataByRow = data.groupBy { case (row, col, value) => row }
Generate a sparse vector for each row:
val rows = dataByRow.mapValues((vals) => {
  val cols = vals.map {
    case (_, col, value) => (colIdxMap.value(col).toInt, value)
  }
  Vectors.sparse(colIdxMap.value.size, cols.toSeq)
}).values
Create a matrix:
val mat: RowMatrix = new RowMatrix(rows)
You can use zipWithIndex on rows to create an RDD[IndexedRow] and an IndexedRowMatrix as well.
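A minimal sketch of that IndexedRowMatrix variant, reusing the rows RDD built in the previous step (variable names here are assumptions, not from the original answer):
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// The zipWithIndex index becomes the row index of the distributed matrix.
val indexedRows = rows.zipWithIndex.map { case (vector, idx) => IndexedRow(idx, vector) }
val indexedMat: IndexedRowMatrix = new IndexedRowMatrix(indexedRows)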

Related

Scala: How to return column name and value from a dataframe

I am trying to create a function which can scan a dataframe row by row and, for each row, spit out the non-empty columns and the column names. But the challenge is that I don't know the number of columns or their names in the input dataframe.
A function something like GetNotEmptyCols(InputRow: Row): (Colname:String, ColValue:String)
As sample data, consider the following dataframes.
val DataFrameA = Seq(("tot","","ink"), ("yes","yes",""), ("","","many")).toDF("ColA","ColB","ColC")
val DataFrameB = Seq(("yes",""), ("","")).toDF("ColD","ColE")
I have been trying to get the column value for each row object but don't know how to do that when I don't have the names of the columns. I could extract the column names from the dataframe and pass them to the function as an additional variable, but I am hoping for a better approach, since the Row object should have the column names and I should be able to extract them.
The output I am working to get is something like this:
DataFrameA.foreach{ row => GetNotEmptyCols(row)} gives output
For row1: ("ColA", "tot"), ("ColC", "ink")
For row2: ("ColA","yes"),("ColB","yes")
For row3: ("ColC","many")
DataFrameB.foreach{ row => GetNotEmptyCols(row)} gives output
For row1: ("ColD", "yes")
For row2: ()
Please find below my implementation of GetNotEmptyCols, which takes the row along with the columns:
import org.apache.spark.sql.{Row, SparkSession}
import scala.collection.mutable.ArrayBuffer

object StackoverFlowProblem {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Test").master("local").getOrCreate()
    import spark.sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val DataFrameA = Seq(("tot","","ink"), ("yes","yes",""), ("","","many")).toDF("ColA","ColB","ColC")
    val DataFrameB = Seq(("yes",""), ("","")).toDF("ColD","ColE")

    // Store the column names in a variable, appending the to-be-added column 'index' as well
    val columns = DataFrameA.columns :+ "index"

    // Use the monotonically_increasing_id() API to add row indices to the dataframe
    DataFrameA.withColumn("index", monotonically_increasing_id()).foreach(a => GetNotEmptyCols(a, columns))
  }

  def GetNotEmptyCols(inputRow: Row, columns: Array[String]): Unit = {
    val rowIndex = inputRow.getAs[Long]("index")
    val a = inputRow.length
    val nonEmptyCols = ArrayBuffer[(String, String)]()
    for (i <- 0 until a - 1) {
      val value = inputRow.getAs[String](i)
      if (!value.isEmpty) {
        val name = columns(i)
        nonEmptyCols += Tuple2(name, value)
      }
    }
    println(s"For row $rowIndex: " + nonEmptyCols.mkString(","))
  }
}
This will print the output below for your first DataFrame (I have used zero-based indexing for the row printing):
For row 0: (ColA,tot),(ColC,ink)
For row 1: (ColA,yes),(ColB,yes)
For row 2: (ColC,many)
I found a solution myself. I can use the getValuesMap method to create a map of column names and values, which I can return and then convert to a list.
def returnNotEmptyCols(inputRow: Row): Map[String, String] = {
  val colValues = inputRow.getValuesMap[String](inputRow.schema.fieldNames)
    .filter(x => x._2 != null && x._2 != "")
  colValues
}

returnNotEmptyCols(rowA1).map { case (k, v) => (k, v) }.toList
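A minimal sketch of wiring this up for the sample DataFrameA from the question, collecting to the driver purely for printing (the row numbering here is an assumption):
// Hypothetical driver-side usage of returnNotEmptyCols over the sample dataframe.
DataFrameA.collect().zipWithIndex.foreach { case (row, idx) =>
  println(s"For row${idx + 1}: " + returnNotEmptyCols(row).toList.mkString(", "))
}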

Finding Max sum of marks each year

I am new to Scala and Spark. Can someone optimize the Scala code below for finding the maximum marks scored by students each year?
val m = sc.textFile("marks.csv")
val SumOfMarks = m.map(_.split(","))
  .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  .map(l => ((l(0), l(1)), l(3).toInt))
  .reduceByKey(_ + _)
  .sortBy(line => (line._1._1, line._2), ascending = false)
var s: Int = 0
var y: String = "0"
for (i <- SumOfMarks) {
  if ((i._1._1 != y) || (i._2 == s && i._1._1 == y)) {
    println(i); s = i._2; y = i._1._1
  }
}
Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98
Output :
2017,raghu,288
2017,rajesh,288
2016,raj,278
I am not sure what you mean exactly by "Optimised", but a more "scala-y" and "spark-y" way of doing this might be as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header", "true").csv("marks.csv")

// Calculate the total marks for each student in each year. The new total mark column will be called "totMark".
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))

// Rank the marks within each year. The highest mark gets rank 1, the second highest rank 2, and so on.
// A benefit of rank is that if two scores have the same mark, they will both get the same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))

// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student")
topStudents.show
This will produce the following output in Spark-shell:
+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016| raj| 278.0| 1|
|2017| raghu| 288.0| 1|
|2017| rajesh| 288.0| 1|
+----+-------+-------+----+
If you need a CSV displayed as per your question, you can use:
topStudents.collect.map(_.mkString(",")).foreach(println)
which produces:
2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
I have broken the process up into individual steps. This will allow you to see what is going on at each step by simply running show on an intermediate result. For example, to see what the spark.read.option... step does, simply enter marksDF.show into the spark-shell.
Since the OP wanted an RDD version, here is one example. It is probably not optimal, but it does give the correct result:
import org.apache.spark.rdd.RDD

// A helper function which makes it slightly easier to view RDD content.
def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)

val marksRdd = sc.textFile("marks.csv")

// A case class to annotate the content in the RDD.
case class Report(year: Int, student: String, sub: String, mark: Int)

// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
  (idx, iter) => if (idx == 0) iter.drop(1) else iter
}.map(r => Report(r(0).toInt, r(1), r(2), r(3).toInt))

// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))

// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map { case (key, marks: Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark)) }

// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map { case (key, score: Int) => (key._1, score) }.reduceByKey(math.max(_, _))

// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key, which is a Tuple2(year, score).
// To achieve this, both RDDs must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map { case (key, score) => ((key._1, score), key._2) }.join(yearScoreHighest.map(k => (k, "x")))

// Finally extract the year, student name and score from the joined RDD.
// Sort by year and name.
val result = highestRankStudentByYear.map { case (key, score) => (key._1, score._1, key._2) }.sortBy(r => (r._1, r._2))

// Show the final result.
dump(result)
The result of the above is:
(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)
As before, you can view the intermediate RDDs simply by dumping them using the dump function. NB: the dump function takes an RDD. If you want to show the content of a DataFrame or Dataset, use its show method.
There is probably a more optimal solution than the one above, but it does the job.
Hopefully the RDD version will encourage you to use DataFrames and/or DataSets if you can. Not only is the code simpler, but:
Spark can analyse DataFrame and DataSet transformations and optimise the overall process. RDDs are not optimised this way (they are executed one after another, as written), so DataFrame- and DataSet-based processes will likely run faster (assuming you don't manually optimise the RDD equivalent).
DataSets and DataFrames support schemas to varying degrees (e.g. named columns and data typing).
DataFrames and DataSets can be queried using SQL.
DataFrame and DataSet operations/methods are more aligned with SQL constructs.
DataFrames and DataSets are easier to use than RDDs.
DataSets (and RDDs) offer compile-time error detection.
DataSets are the future direction.
Check out these couple of links for more information:
https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/
https://medium.com/#sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9
or simply google "spark should i use rdd or dataframe"
All the best with your project.
Try it in the Scala spark-shell:
scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.registerTempTable("record")
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show
It will generate the following table
+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016| raj|278.0|
|2017| raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions

// Finding the max sum of marks each year
object Marks2 {

  def getSparkContext() = {
    val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
    val sc = new SparkContext(conf)
    sc
  }

  def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)

  def main(args: Array[String]): Unit = {
    // System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
    val sc = getSparkContext()
    val inpRDD = sc.textFile("marks.csv")
    val head = inpRDD.first()
    val marksRdd = inpRDD.filter(record => !record.equals(head)).map(rec => rec.split(","))
    val marksByNameyear = marksRdd.map(rec => ((rec(0).toInt, rec(1)), rec(3).toInt))
    //marksByNameyear.cache()
    val aggMarksByYearName = marksByNameyear.reduceByKey(_ + _)
    val maxMarksByYear = aggMarksByYearName.map(s => (s._1._1, s._2)).reduceByKey(math.max(_, _))
    val markYearName = aggMarksByYearName.map(s => (s._2.toInt, s._1._2))
    val marksAndYear = maxMarksByYear.map(s => (s._2.toInt, s._1))
    val tt = sc.broadcast(marksAndYear.collect().toMap)
    marksAndYear.flatMap { case (key, value) => tt.value.get(key).map { other => (other, value, key) } }
    val yearMarksName = marksAndYear.leftOuterJoin(markYearName)
    val result = yearMarksName.map(s => (s._2._1, s._2._2, s._1)).sortBy(f => f._3, true)
    //dump(markYearName)
    dump(result)
  }
}

Call the function n times by passing 1 to n as a parameter with dataframe as the output of the function

I have a dataframe and a lookup dataframe. I want to join my inputDf with my lookupDf N times, passing N as a parameter to the function joinByColumn, since N becomes one of the join conditions. The output should be a combination of inputDf and the selected columns of lookupDf.
I can achieve this with the foldLeft function, but I want to do it using a map or iterator function.
val result = (0 to n).foldLeft(inputDf) {
  case (df, colName) => joinByColumn(colName.toString(), df).toDF()
}

def joinByColumn(value: String, inputDf: DataFrame): DataFrame = {
  val lookupDF = readFromCassandraTableAsDataFrame(sqlContext, "keyspace", "table")
  inputDf.as("src")
    .join(lookupDF, lookupDF("a").equalTo(inputDf("input_a")) && lookupDF("b").equalTo((value.toInt + 1).toString), "left")
    .select("src.*", "c")
    .withColumnRenamed("c", value)
}
I want the output to be a dataframe with all the joined columns.
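No answer is recorded here, but as a hedged sketch of an iterative alternative to foldLeft (reusing the question's joinByColumn, inputDf and n, all assumed to be in scope):
// Minimal sketch, not from the original post: the same accumulation written with a
// mutable variable and an iterator instead of foldLeft.
var accumulatedDf = inputDf
(0 to n).iterator.foreach { i =>
  accumulatedDf = joinByColumn(i.toString, accumulatedDf).toDF()
}
val result = accumulatedDf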

Iterate rows and columns in Spark dataframe

I have the following Spark dataframe that is created dynamically:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val sf1 = StructField("name", StringType, nullable = true)
val sf2 = StructField("sector", StringType, nullable = true)
val sf3 = StructField("age", IntegerType, nullable = true)
val fields = List(sf1, sf2, sf3)
val schema = StructType(fields)

val row1 = Row("Andy", "aaa", 20)
val row2 = Row("Berta", "bbb", 30)
val row3 = Row("Joe", "ccc", 40)
val data = Seq(row1, row2, row3)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
Now, I need to iterate each row and column in sqlDF to print each column, this is my attempt:
sqlDF.foreach { row =>
  row.foreach { col => println(col) }
}
row is of type Row, which is not iterable; that's why this code throws a compilation error on row.foreach. How can I iterate over each column in a Row?
Consider that you have a DataFrame like the one below:
+-----+------+---+
| name|sector|age|
+-----+------+---+
| Andy| aaa| 20|
|Berta| bbb| 30|
| Joe| ccc| 40|
+-----+------+---+
To loop over your DataFrame and extract its elements, you can choose one of the approaches below.
Approach 1 - Loop using foreach
Looping over a dataframe directly using a foreach loop is not possible. To do this, first you have to define the schema of the dataframe using a case class, and then you have to specify this schema for the dataframe.
import spark.implicits._
import org.apache.spark.sql._
case class cls_Employee(name:String, sector:String, age:Int)
val df = Seq(cls_Employee("Andy","aaa", 20), cls_Employee("Berta","bbb", 30), cls_Employee("Joe","ccc", 40)).toDF()
df.as[cls_Employee].take(df.count.toInt).foreach(t => println(s"name=${t.name},sector=${t.sector},age=${t.age}"))
Approach 2 - Loop using rdd
Use rdd.collect on top of your DataFrame. The row variable will contain each row of the DataFrame as an rdd Row type. To get each element from a row, use row.mkString(","), which will contain the value of each row as comma-separated values. Using the split function (a built-in function) you can then access each column value of the rdd row by index.
for (row <- df.rdd.collect) {
  var name = row.mkString(",").split(",")(0)
  var sector = row.mkString(",").split(",")(1)
  var age = row.mkString(",").split(",")(2)
}
Note that there are two drawbacks to this approach.
1. If there is a , in a column value, the data will be wrongly split across adjacent columns.
2. rdd.collect is an action that returns all the data to the driver's memory, and the driver's memory may not be large enough to hold the data, causing the application to fail (a lower-memory alternative is sketched below).
I would recommend using Approach 1.
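As an aside not in the original answer: if collecting everything to the driver is the concern, Spark's Dataset.toLocalIterator pulls the data back one partition at a time. A minimal sketch, assuming the same df as above:
import scala.collection.JavaConverters._

// Streams rows back partition by partition instead of materialising everything with collect.
for (row <- df.toLocalIterator().asScala) {
  println(row.getAs[String]("name") + "," + row.getAs[String]("sector") + "," + row.getAs[Int]("age"))
}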
Approach 3 - Using where and select
You can directly use where and select, which will internally loop through and find the data. To make sure it does not throw an index out of bound exception, an if condition is used:
if(df.where($"name" === "Andy").select(col("name")).collect().length >= 1)
name = df.where($"name" === "Andy").select(col("name")).collect()(0).get(0).toString
Approach 4 - Using temp tables
You can register the dataframe as a temp table, which will be stored in Spark's memory. Then you can use a select query, just like with any other database, to query the data and then collect and save it in a variable:
df.registerTempTable("student")
name = sqlContext.sql("select name from student where name='Andy'").collect()(0).toString().replace("[","").replace("]","")
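As a side note not in the original answer, registerTempTable is deprecated in later Spark versions; a minimal sketch of the same approach with createOrReplaceTempView and the SparkSession (assumed here to be named spark):
// Same idea with the non-deprecated API.
df.createOrReplaceTempView("student")
val name = spark.sql("select name from student where name = 'Andy'").collect()(0).getString(0)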
You can convert a Row to a Seq with toSeq. Once turned into a Seq you can iterate over it as usual with foreach, map or whatever you need:
sqlDF.foreach { row =>
  row.toSeq.foreach { col => println(col) }
}
Output:
Berta
bbb
30
Joe
Andy
aaa
20
ccc
40
You should use mkString on your Row:
sqlDF.foreach { row =>
  println(row.mkString(","))
}
But note that this will be printed inside the executors' JVMs, so normally you won't see the output (unless you work with master = local).
sqlDF.foreach did not work for me, but Approach 1 from @Sarath Avanavu's answer works, although it was also sometimes shuffling the order of the records.
I found one more way that works:
df.collect().foreach { row =>
  println(row.mkString(","))
}
You should iterate over the partitions, which allows the data to be processed by Spark in parallel, and you can do a foreach on each row inside the partition.
You can further group the data in a partition into batches if need be:
sqlDF.foreachPartition { partitionedRows: Iterator[Model1] =>
  if (partitionedRows.take(1).nonEmpty) {
    partitionedRows.grouped(numberOfRowsPerBatch).foreach { batch =>
      batch.foreach { row =>
        .....
      }
    }
  }
}
This worked fine for me:
sqlDF.collect().foreach(row => row.toSeq.foreach(col => println(col)))
Simply collect the result and then apply foreach:
df.collect().foreach(println)
My solution uses FOR because it was what I needed:
Solution 1:
case class campos_tablas(name: String, sector: String, age: Int)

for (row <- df.as[campos_tablas].take(df.count.toInt)) {
  print(row.name.toString)
}
Solution 2:
for (row <- df.take(df.count.toInt)) {
  print(row(0).toString)
}
Let's assume resultDF is the DataFrame.
val resultDF = // DataFrame //
var itr = 0
val resultRow = resultDF.count
val resultSet = resultDF.collectAsList

var load_id = 0
var load_dt = ""
var load_hr = 0L

while (itr < resultRow) {
  load_id = resultSet.get(itr).getInt(0)
  load_dt = resultSet.get(itr).getString(1) // if the column holds a String value
  load_hr = resultSet.get(itr).getLong(2)   // if the column holds a Long value
  // Write other logic for your code //
  itr = itr + 1
}

How to add corresponding Integer values in 2 different DataFrames

I have two DataFrames in my code with exactly the same dimensions, let's say 1,000,000 x 50. I need to add the corresponding values in both dataframes. How can I achieve that?
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other, more elegant way?
Thanks.
Your approach is good. Another option is to take the RDDs, zip them together, and then iterate over them to sum the columns and create a new dataframe using either of the original dataframe schemas.
Assuming the data types for all the columns are integer, this code snippet should work. Please note that this has been done in Spark 2.1.0.
import org.apache.spark.sql.{DataFrame, Row}
import spark.implicits._

val a: DataFrame = spark.sparkContext.parallelize(Seq(
  (1, 2),
  (3, 6)
)).toDF("column_1", "column_2")

val b: DataFrame = spark.sparkContext.parallelize(Seq(
  (3, 4),
  (1, 5)
)).toDF("column_1", "column_2")

// Merge rows
val rows = a.rdd.zip(b.rdd).map {
  case (rowLeft, rowRight) =>
    val totalColumns = rowLeft.schema.fields.size
    val summedRow = for (i <- 0 until totalColumns) yield rowLeft.getInt(i) + rowRight.getInt(i)
    Row.fromSeq(summedRow)
}

// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use either of the schemas
ab.show()
Update:
So, I tried to experiment with the performance of my solution vs yours. I tested with 100,000 rows, and each row has 50 columns. In the case of your approach there are 51 columns, the extra one being the ID column. On a single machine (no cluster), my solution seems to work a bit faster.
The union and group by approach takes about 5598 milliseconds.
Whereas my solution takes about 5378 milliseconds.
My assumption is that the first solution takes a bit more time because of the union operation on the two dataframes.
Here are the methods I created for testing the two approaches:
def option_1()(implicit spark: SparkSession): Unit = {
  import spark.implicits._

  val a: DataFrame = getDummyData(withId = true)
  val b: DataFrame = getDummyData(withId = true)

  val allData = a.union(b)
  val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
  println(result.count())
  // result.show()
}

def option_2()(implicit spark: SparkSession): Unit = {
  val a: DataFrame = getDummyData()
  val b: DataFrame = getDummyData()

  // Merge rows
  val rows = a.rdd.zip(b.rdd).map {
    case (rowLeft, rowRight) =>
      val totalColumns = rowLeft.schema.fields.size
      val summedRow = for (i <- 0 until totalColumns) yield rowLeft.getInt(i) + rowRight.getInt(i)
      Row.fromSeq(summedRow)
  }

  // Create new data frame
  val result: DataFrame = spark.createDataFrame(rows, a.schema) // use either of the schemas
  println(result.count())
  // result.show()
}