Dot product in Spark Scala - scala

I have two data frames in Spark Scala where the second column of each data frame is an array of numbers
val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22.show(truncate=false)
+---+---------------------+
|ID |tf_idf |
+---+---------------------+
|1 |[0.693, 0.702] |
|2 |[0.69314, 0.0] |
|3 |[0.0, 0.693147] |
+---+---------------------+
val data12= Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12.show(truncate=false)
+---+--------------------+
|ID |tf_idf |
+---+--------------------+
|1 |[0.693, 0.805] |
+---+--------------------+
I need to perform the dot product between rows in this two data frames. That is I need to multiply the tf_idf array in data12 with each row of tf_idf in data22.
(Ex: The first row in dot product should be like this : 0.693*0.693 + 0.702*0.805
Second row : 0.69314*0.693 + 0.0*0.805
Third row : 0.0*0.693 + 0.693147*0.805 )
Basically I want something(like matrix multiplication) data22*transpose(data12)
I would be grateful if someone can suggest a method to do this in Spark Scala .
Thank you

Spark Version 2.4+: Use the several functions for array such as zip_with and aggregate, that give you more simpler code. To follow your detailed description, I have changed the join into crossJoin.
val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
val data12= Seq((1,List(0.693,0.805))).toDF("ID2","tf_idf2")
val df = data22.crossJoin(data12).drop("ID2")
df.withColumn("DotProduct", expr("aggregate(zip_with(tf_idf, tf_idf2, (x, y) -> x * y), 0D, (sum, x) -> sum + x)")).show(false)
Here is the result.
+---+---------------------+--------------+-------------------+
|ID |tf_idf |tf_idf2 |DotProduct |
+---+---------------------+--------------+-------------------+
|1 |[0.693147, 0.6931471]|[0.693, 0.805]|1.0383342865 |
|2 |[0.69314, 0.0] |[0.693, 0.805]|0.48034601999999993|
|3 |[0.0, 0.693147] |[0.693, 0.805]|0.557983335 |
+---+---------------------+--------------+-------------------+

The solution is shown below:
scala> val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]
scala> val data12= Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]
scala> val arrayDot = data12.take(1).map(row => (row.getAs[Int](0), row.getAs[WrappedArray[Double]](1).toSeq))
arrayDot: Array[(Int, Seq[Double])] = Array((1,WrappedArray(0.69314, 0.6931471)))
scala> val dotColumn = arrayDot(0)._2
dotColumn: Seq[Double] = WrappedArray(0.69314, 0.6931471)
scala> val dotUdf = udf((y: Seq[Double]) => y zip dotColumn map(z => z._1*z._2) reduce(_ + _))
dotUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(ArrayType(DoubleType,false))))
scala> data22.withColumn("dotProduct", dotUdf('tf_idf)).show
+---+--------------------+-------------------+
| ID| tf_idf| dotProduct|
+---+--------------------+-------------------+
| 1|[0.693147, 0.6931...| 0.96090081381841|
| 2| [0.69314, 0.0]|0.48044305959999994|
| 3| [0.0, 0.693147]| 0.4804528329237|
+---+--------------------+-------------------+
Note that it multiplies multiply the tf_idf array in data12 with each row of tf_idf in data22.
Let me know if it helps!!

Related

In spark iterate through each column and find the max length

I am new to spark scala and I have following situation as below
I have a table "TEST_TABLE" on cluster(can be hive table)
I am converting that to dataframe
as:
scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")
Now the DF can be viewed as
scala> testDF.show()
COL1|COL2|COL3
----------------
abc|abcd|abcdef
a|BCBDFG|qddfde
MN|1234B678|sd
I want an output like below
COLUMN_NAME|MAX_LENGTH
COL1|3
COL2|8
COL3|6
Is this feasible to do so in spark scala?
Plain and simple:
import org.apache.spark.sql.functions._
val df = spark.table("TEST_TABLE")
df.select(df.columns.map(c => max(length(col(c)))): _*)
You can try in the following way:
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._
val df = Seq(("abc","abcd","abcdef"),
("a","BCBDFG","qddfde"),
("MN","1234B678","sd"),
(null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()
val output = df.columns.map(c => (c, df.agg(max(length(df(s"$c")))).as[Int].first())).toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
| COL1| 3|
| COL2| 8|
| COL3| 6|
+-----------+----------+
I think it's good idea to cache input dataframe df to make the computation faster.
Here is one more way to get the report of column names in vertical
scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.show(false)
+----+--------+------+
|COL1|COL2 |COL3 |
+----+--------+------+
|abc |abcd |abcdef|
|a |BCBDFG |qddfde|
|MN |1234B678|sd |
+----+--------+------+
scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)
scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]
scala> df2.select( columns.map(x => max(col(x))):_* ).show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3 |8 |6 |
+---------+---------+---------+
scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1 |_2 |
+----+---+
|COL1|3 |
|COL2|8 |
|COL3|6 |
+----+---+
scala>
To get the results into Scala collections, say Map()
scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala> result
res47: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala>

conditional operator with groupby in spark rdd level - scala

I am using Spark 1.60 and Scala 2.10.5
I have a dataframe like this,
+------------------+
|id | needed |
+------------------+
|1 | 2 |
|1 | 0 |
|1 | 3 |
|2 | 0 |
|2 | 0 |
|3 | 1 |
|3 | 2 |
+------------------+
From this df I created an rdd like this,
val dfRDD = df.rdd
from my rdd, I want to group by id and count of needed is > 0.
((1, 2), (2,0), (3,2))
So, I tried like this,
val groupedDF = dfRDD.map(x =>(x(0), x(1) > 0)).count.redueByKey(_+_)
In this case, I am getting an error:
error: value > is not a member of any
I need that in rdd level. Any help to get my desired output would be great.
The problem is that in your map you're calling the apply method of Row, and as you can see in its scaladoc, that method returns Any - and as you can see for the error and from the scaladoc there is not such method < in Any
You can fix it using the getAs[T] method.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
val spark =
SparkSession
.builder
.master("local[*]")
.getOrCreate()
import spark.implicits._
val df =
List(
(1, 2),
(1, 0),
(1, 3),
(2, 0),
(2, 0),
(3, 1),
(3, 2)
).toDF("id", "needed")
val rdd: RDD[(Int, Int)] = df.rdd.map(row => (row.getAs[Int](fieldName = "id"), row.getAs[Int](fieldName = "needed")))
From there you can continue with the aggregation, you have a few mistakes in your logic.
First, you don't need the count call.
And second, if you need to count the amount of times "needed" was greater than one you can't do _ + _, because that is the sum of needed values.
val grouped: RDD[(Int, Int)] = rdd.reduceByKey { (acc, v) => if (v > 0) acc + 1 else acc }
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,3), (2,0), (3,2))
PS: You should tell your professor to upgrade to Spark 2 and Scala 2.11 ;)
Edit
Using case classes in the above example.
final case class Data(id: Int, needed: Int)
val rdd: RDD[Data] = df.as[Data].rdd
val grouped: RDD[(Int, Int)] = rdd.map(d => d.id -> d.needed).reduceByKey { (acc, v) => if (v > 0) acc + 1 else acc }
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,3), (2,0), (3,2))
There's no need to do the calculation at the rdd level. Aggregation with the data frame should work:
df.groupBy("id").agg(sum(($"needed" > 0).cast("int")).as("positiveCount")).show
+---+-------------+
| id|positiveCount|
+---+-------------+
| 1| 2|
| 3| 2|
| 2| 0|
+---+-------------+
If you have to work with RDD, use row.getInt or as #Luis' answer row.getAs[Int] to get the value with explicit type, and then do the comparison and reduceByKey:
df.rdd.map(r => (r.getInt(0), if (r.getInt(1) > 0) 1 else 0)).reduceByKey(_ + _).collect
// res18: Array[(Int, Int)] = Array((1,2), (2,0), (3,2))

How to use Except function with spark Dataframe

I would like to get differences between two dataframe but returning the row with the different fields only. For example, I have 2 dataframes as follow:
val DF1 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Hyderabad","ram",9847, 50000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
val DF2 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Sydney","ram",9847, 48000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
The only difference between these two dataframe is emp_city and emp_sal for the second row.
Now, I am using the except function which gives me the entire row as follow:
DF1.except(DF2)
+------+---------+--------+---------+-------+--------+
|emp_id| emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+---------+--------+---------+-------+--------+
| 1|Hyderabad| ram| 9847| 50000| SF|
+------+---------+--------+---------+-------+--------+
However, I need the output to be like this:
+---------+--------+-----+
|emp_id| emp_city|emp_sal|
+------+---------+-------+
| 1|Hyderabad| 50000|
+------+---------+-------+
Which shows the different cells as well as emp_id.
Edit :
if there is change in column then it should appear if there is no change then it should be hidden or Null
The following should give you the result you are looking for.
DF1.except(DF2).select("emp_id","emp_city","emp_sal")
You should consider the comment from #user238607 as we cannot predict which columns are going to differ,
Still you can try this workaround.
I'm assuming emp_id is unique,
scala> val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
scala> DF1.join(DF2, DF1("emp_id") === DF2("emp_id"))
res15: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 10 more fields]
scala> res15.withColumn("diffcolumn", split(concat_ws(",",DF1.columns.map(x => diff(lit(x), DF1(x), DF2(x))):_*),","))
res16: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 11 more fields]
scala> res16.show(false)
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|emp_id|emp_city |emp_name|emp_phone|emp_sal|emp_site|emp_id|emp_city|emp_name|emp_phone|emp_sal|emp_site|diffcolumn |
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|3 |Chennai |rahman |9846 |45000 |SanRamon|3 |Chennai |rahman |9846 |45000 |SanRamon|[, , , , , ] |
|1 |Hyderabad|ram |9847 |50000 |SF |1 |Sydney |ram |9847 |48000 |SF |[, emp_city, , , emp_sal, ]|
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
scala> val diff_cols = res16.select(explode($"diffcolumn")).filter("col != ''").distinct.collect.map(a=>col(a(0).toString))
scala> val exceptOpr = DF1.except(DF2)
scala> exceptOpr.select(diff_cols:_*).show
+-------+---------+
|emp_sal| emp_city|
+-------+---------+
| 50000|Hyderabad|
+-------+---------+
I found this solution which seems to be working fine :
val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null ).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)
It generates the output as follow :
+------+-------------------+--------+---------+--------------+--------+
|emp_id| emp_city|emp_name|emp_phone| emp_sal|emp_site|
+------+-------------------+--------+---------+--------------+--------+
| 1|[Sydney, Hyderabad]| null| null|[48000, 50000]| null|
|
+------+-------------------+--------+---------+--------------+--------+

How to subtract one Scala Spark DataFrame from another (Normalise to the mean)

I have two Spark DataFrames:
df1 with 80 columns
CO01...CO80
+----+----+
|CO01|CO02|
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.70|0.87|
|1.90|0.64|
+----+----+
and df2 with 80 columns
avg(CO01)...avg(CO80)
which is mean of each column
+------------------+------------------+
| avg(CO01)| avg(CO02)|
+------------------+------------------+
|2.6185106382978716|1.0080985915492937|
+------------------+------------------+
How can i subtract df2 from df1 for corresponding values?
I'm looking for solution that does not require to list all the columns.
P.S
In pandas it could be simply done by:
df2=df1-df1.mean()
Here is what you can do
scala> val df = spark.sparkContext.parallelize(List(
| (2.06,0.56),
| (1.96,0.72),
| (1.70,0.87),
| (1.90,0.64))).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)
subMean: (mean: Double)org.apache.spark.sql.expressions.UserDefinedFunction
scala>
scala> val result = df.columns.foldLeft(df)( (df, col) =>
| { val avg = df.select(mean(col)).first().getAs[Double](0);
| df.withColumn(col, subMean(avg)(df(col)))
| })
result: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> result.show(10, false)
+---------------------+---------------------+
|c1 |c2 |
+---------------------+---------------------+
|0.15500000000000025 |-0.13749999999999996 |
|0.05500000000000016 |0.022499999999999964 |
|-0.20499999999999985 |0.1725 |
|-0.004999999999999893|-0.057499999999999996|
+---------------------+---------------------+
Hope, this helps!
Please note that, this will work for n number of columns as long as all columns in dataframe are of numeric type

Get minimum value from an Array in a Spark DataFrame column

I have a DataFrame with Arrays.
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
|id |complete1|complete2|
+-------------+---------+---------+
| 123| [, 1, 2]|[3, 3, 4]|
| 124| [, 3, 2]| [, 3, 4]|
+-------------+---------+---------+
How do I extract the minimum of each arrays?
|id |complete1|complete2|
+-------------+---------+---------+
| 123| 1 | 3 |
| 124| 2 | 3 |
+-------------+---------+---------+
I have tried defining a UDF to do this but I am getting an error.
def minArray(a:Array[String]) :String = a.filter(_.nonEmpty).min.mkString
val minArrayUDF = udf(minArray _)
def getMinArray(df: DataFrame, i: Int): DataFrame = df.withColumn("complete" + i, minArrayUDF(df("complete" + i)))
val minDf = (1 to 2).foldLeft(DF){ case (df, i) => getMinArray(df, i)}
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
Since Spark 2.4, you can use array_min to find the minimum value in an array. To use this function you will first have to cast your arrays of strings to arrays of integers. Casting will also take care of the empty strings by converting them into null values.
DF.select($"id",
array_min(expr("cast(complete1 as array<int>)")).as("complete1"),
array_min(expr("cast(complete2 as array<int>)")).as("complete2"))
You can define your udf function as below
def minUdf = udf((arr: Seq[String])=> arr.filterNot(_ == "").map(_.toInt).min)
and call it as
DF.select(col("id"), minUdf(col("complete1")).as("complete1"), minUdf(col("complete2")).as("complete2")).show(false)
which should give you
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|123|1 |3 |
|124|2 |3 |
+---+---------+---------+
Updated
In case if the array passed to udf functions are empty or array of empty strings then you will encounter
java.lang.UnsupportedOperationException: empty.min
You should handle that with if else condition in udf function as
def minUdf = udf((arr: Seq[String])=> {
val filtered = arr.filterNot(_ == "")
if(filtered.isEmpty) 0
else filtered.map(_.toInt).min
})
I hope the answer is helpful
Here is how you can do it without using udf
First explode the array you got with split() and then group by the same id and find min
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select($"id", split($"complete1", "\\|").as("complete1"), split($"complete2", "\\|").as("complete2"))
.withColumn("complete1", explode($"complete1"))
.withColumn("complete2", explode($"complete2"))
.groupBy($"id").agg(min($"complete1".cast(IntegerType)).as("complete1"), min($"complete2".cast(IntegerType)).as("complete2"))
Output:
+---+---------+---------+
|id |complete1|complete2|
+---+---------+---------+
|124|2 |3 |
|123|1 |3 |
+---+---------+---------+
You don't need an UDF for this, you can use sort_array:
val DF = Seq(
("123", "|1|2","3|3|4" ),
("124", "|3|2","|3|4" )
).toDF("id", "complete1", "complete2")
.select(
$"id",
split(regexp_replace($"complete1","^\\|",""), "\\|").as("complete1"),
split(regexp_replace($"complete2","^\\|",""), "\\|").as("complete2")
)
// now select minimum
DF.
.select(
$"id",
sort_array($"complete1")(0).as("complete1"),
sort_array($"complete2")(0).as("complete2")
).show()
+---+---------+---------+
| id|complete1|complete2|
+---+---------+---------+
|123| 1| 3|
|124| 2| 3|
+---+---------+---------+
Note that I removed the leading | before splitting to avoid empty strings in the array