Conditional operator with groupBy at the Spark RDD level (Scala)

I am using Spark 1.6.0 and Scala 2.10.5.
I have a dataframe like this,
+------------------+
|id | needed |
+------------------+
|1 | 2 |
|1 | 0 |
|1 | 3 |
|2 | 0 |
|2 | 0 |
|3 | 1 |
|3 | 2 |
+------------------+
From this df I created an RDD like this:
val dfRDD = df.rdd
From this RDD, I want to group by id and count the rows where needed is > 0:
((1, 2), (2,0), (3,2))
So I tried this:
val groupedDF = dfRDD.map(x =>(x(0), x(1) > 0)).count.reduceByKey(_+_)
In this case, I am getting an error:
error: value > is not a member of Any
I need this at the RDD level. Any help getting my desired output would be great.

The problem is that in your map you're calling the apply method of Row, which, as its scaladoc shows, returns Any. As both the error and the scaladoc indicate, Any has no > method.
You can fix it by using the getAs[T] method.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
val spark =
SparkSession
.builder
.master("local[*]")
.getOrCreate()
import spark.implicits._
val df =
List(
(1, 2),
(1, 0),
(1, 3),
(2, 0),
(2, 0),
(3, 1),
(3, 2)
).toDF("id", "needed")
val rdd: RDD[(Int, Int)] = df.rdd.map(row => (row.getAs[Int](fieldName = "id"), row.getAs[Int](fieldName = "needed")))
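The same typing issue can be reproduced without Spark. The sketch below uses a plain value typed as Any (the name raw is purely illustrative) and shows two ways to recover the Int before comparing, which is roughly what getAs[Int] does for a Row field:

```scala
// A value typed as Any has no `>` method, so this line would not compile:
//   val bad = (2: Any) > 0
val raw: Any = 2

// Option 1: cast explicitly (throws ClassCastException if the value is not an Int).
val viaCast: Boolean = raw.asInstanceOf[Int] > 0

// Option 2: pattern match, which lets you pick a fallback for unexpected types.
val viaMatch: Boolean = raw match {
  case n: Int => n > 0
  case _      => false
}
```

The cast is shorter, while the pattern match is the safer choice when the column type is not guaranteed.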
From there you can continue with the aggregation, but there are a few mistakes in your logic.
First, you don't need the count call.
Second, if you need to count the number of times "needed" was greater than zero, you can't apply _ + _ to the raw values, because that gives the sum of the needed values. Map each value to 1 or 0 first, then sum those:
val grouped: RDD[(Int, Int)] = rdd.mapValues(v => if (v > 0) 1 else 0).reduceByKey(_ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2))
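To see why the values are best mapped to 0/1 before reducing, here is a small sketch with plain Scala collections (no Spark). The pairs are a hypothetical stand-in for the question's table. A reduce-style function combines two values of the same type, so there is no accumulator that starts at zero: the first value of each key is used as the starting point.

```scala
// Hypothetical stand-in for the (id, needed) pairs in the question.
val pairs = Seq((1, 2), (1, 0), (1, 3), (2, 0), (2, 0), (3, 1), (3, 2))

// Reducing the raw values with an "accumulator" function:
// for id 1 the reduction starts from the first value (2), not from 0.
val reducedRaw: Map[Int, Int] = pairs.groupBy(_._1).map { case (id, kvs) =>
  id -> kvs.map(_._2).reduce((acc, v) => if (v > 0) acc + 1 else acc)
}

// Mapping every value to 1 or 0 first, then summing, gives the real count.
val counted: Map[Int, Int] = pairs.groupBy(_._1).map { case (id, kvs) =>
  id -> kvs.map { case (_, v) => if (v > 0) 1 else 0 }.sum
}
```

Here reducedRaw reports 3 for id 1, while counted reports the expected 2. In Spark the function passed to reduceByKey should also be associative and commutative, which plain addition of 0/1 indicators is.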
PS: You should tell your professor to upgrade to Spark 2 and Scala 2.11 ;)
Edit
Using case classes in the above example.
final case class Data(id: Int, needed: Int)
val rdd: RDD[Data] = df.as[Data].rdd
val grouped: RDD[(Int, Int)] = rdd.map(d => d.id -> (if (d.needed > 0) 1 else 0)).reduceByKey(_ + _)
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,2), (2,0), (3,2))

There's no need to do the calculation at the rdd level. Aggregation with the data frame should work:
df.groupBy("id").agg(sum(($"needed" > 0).cast("int")).as("positiveCount")).show
+---+-------------+
| id|positiveCount|
+---+-------------+
| 1| 2|
| 3| 2|
| 2| 0|
+---+-------------+
If you have to work with an RDD, use row.getInt or, as in @Luis' answer, row.getAs[Int] to get the value with an explicit type, then do the comparison and reduceByKey:
df.rdd.map(r => (r.getInt(0), if (r.getInt(1) > 0) 1 else 0)).reduceByKey(_ + _).collect
// res18: Array[(Int, Int)] = Array((1,2), (2,0), (3,2))

Related

Add new column containing an Array of column names sorted by the row-wise values

Given a dataFrame with a few columns, I'm trying to create a new column containing an array of these columns' names sorted by decreasing order, based on the row-wise values of these columns.
| a | b | c | newcol|
|---|---|---|-------|
| 1 | 4 | 3 |[b,c,a]|
| 4 | 1 | 3 |[a,c,b]|
---------------------
The names of the columns are stored in a var names:Array[String]
What approach should I go for?
Using a UDF is the simplest way to achieve custom tasks like this.
val df = spark.createDataFrame(Seq((1,4,3), (4,1,3))).toDF("a", "b", "c")
val names=df.schema.fieldNames
val sortNames = udf((v: Seq[Int]) => {v.zip(names).sortBy(-_._1).map(_._2)}) // negate the value to sort in decreasing order
df.withColumn("newcol", sortNames(array(names.map(col): _*))).show
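The core of that UDF can be checked with plain Scala collections. This is only a sketch of the sorting step, with names hard-coded as an assumption in place of df.schema.fieldNames; it uses a reversed Ordering rather than negation, which also avoids overflow for Int.MinValue.

```scala
// Assumed column names; in the answer they come from df.schema.fieldNames.
val names = Seq("a", "b", "c")

// Pair each row value with its column name, sort by value in decreasing order,
// and keep only the names.
def sortedNames(values: Seq[Int]): Seq[String] =
  values.zip(names).sortBy(_._1)(Ordering[Int].reverse).map(_._2)
```

For the two rows in the question, sortedNames(Seq(1, 4, 3)) yields b, c, a and sortedNames(Seq(4, 1, 3)) yields a, c, b. The sort is stable, so columns with equal values keep their original relative order.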
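Something like this can be an approach using Dataset:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}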
case class Element(name: String, value: Int)
case class Columns(a: Int, b: Int, c: Int, elements: Array[String])
def function1()(implicit spark: SparkSession) = {
import spark.implicits._
val df0: DataFrame =
spark.createDataFrame(spark.sparkContext
.parallelize(Seq(Row(1, 2, 3), Row(4, 1, 3))),
StructType(Seq(StructField("a", IntegerType, false),
StructField("b", IntegerType, false),
StructField("c", IntegerType, false))))
val df1 = df0
.flatMap(row => Seq(Columns(row.getAs[Int]("a"),
row.getAs[Int]("b"),
row.getAs[Int]("c"),
Array(Element("a", row.getAs[Int]("a")),
Element("b", row.getAs[Int]("b")),
Element("c", row.getAs[Int]("c"))).sortBy(-_.value).map(_.name))))
df1
}
def main(args: Array[String]) : Unit = {
implicit val spark = SparkSession.builder().master("local[1]").getOrCreate()
function1().show()
}
gives:
+---+---+---+---------+
| a| b| c| elements|
+---+---+---+---------+
| 1| 2| 3|[c, b, a]|
| 4| 1| 3|[a, c, b]|
+---+---+---+---------+
Try something like this:
val sorted_column_names = udf((column_map: Map[String, Int]) =>
column_map.toSeq.sortBy(- _._2).map(_._1)
)
df.withColumn("column_map", map(lit("a"), $"a", lit("b"), $"b", lit("c"), $"c"))
.withColumn("newcol", sorted_column_names($"column_map"))

Spark Scala - drop the first element from the array in dataframe

I have a following dataframe
+--------------------+
| values |
+--------------------+
|[[1,1,1],[3,2,4],[1,|
|[[1,1,2],[2,2,4],[1,|
|[[1,1,3],[4,2,4],[1,|
I want a column with the tail of the list. So far I know how to select the first element:
val df1 = df.select($"values".getItem(0))
but is there a method that would let me drop the first element?
A UDF with a simple size check seems to be the simplest solution:
val df = Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5))).toDF("c1", "c2")
def tail = udf( (s: Seq[Int]) => if (s.size > 1) s.tail else Seq.empty[Int] )
df.select($"c1", tail($"c2").as("c2tail")).show
// +---+------+
// | c1|c2tail|
// +---+------+
// | 1|[2, 3]|
// | 2| [5]|
// +---+------+
As suggested in the comments, the preferred solution is to use Spark's built-in slice function (available since Spark 2.4):
df.select($"c1", slice($"c2", 2, Int.MaxValue).as("c2tail"))
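The size check in the UDF matters because tail throws on an empty collection. A plain-Scala sketch of the difference follows; safeTail is a hypothetical helper name, and drop(1) is used here as an alternative to the explicit size check:

```scala
import scala.util.Try

val empty = Seq.empty[Int]

// tail on an empty collection throws, so this Try ends up as a Failure.
val tailFails: Boolean = Try(empty.tail).isFailure

// drop(1) never throws: it returns an empty Seq for empty or single-element input.
def safeTail(s: Seq[Int]): Seq[Int] = s.drop(1)
```

Inside a UDF, s.drop(1) would behave the same as the if/else in the answer for non-null arrays. A null array column would still need separate handling.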
I don't think a built-in operator exists for this.
But you can use UDFs, for example:
import collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}
def tailUdf = udf((array: WrappedArray[WrappedArray[Int]])=> array.tail)
df.select(tailUdf(col("values"))).show()

In spark iterate through each column and find the max length

I am new to Spark with Scala and I have the following situation.
I have a table "TEST_TABLE" on the cluster (it can be a Hive table).
I am converting it to a dataframe as:
scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")
Now the DF can be viewed as
scala> testDF.show()
COL1|COL2|COL3
----------------
abc|abcd|abcdef
a|BCBDFG|qddfde
MN|1234B678|sd
I want an output like below
COLUMN_NAME|MAX_LENGTH
COL1|3
COL2|8
COL3|6
Is this feasible to do so in spark scala?
Plain and simple:
import org.apache.spark.sql.functions._
val df = spark.table("TEST_TABLE")
df.select(df.columns.map(c => max(length(col(c)))): _*)
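Note that this select produces a single wide row with one max(length(...)) column per input column, whereas the question asks for one (COLUMN_NAME, MAX_LENGTH) pair per column. The target shape can be sketched with plain Scala collections, using the question's sample rows as hypothetical in-memory data:

```scala
// Hypothetical in-memory copy of the sample table from the question.
val columns = Seq("COL1", "COL2", "COL3")
val rows = Seq(
  Seq("abc", "abcd", "abcdef"),
  Seq("a", "BCBDFG", "qddfde"),
  Seq("MN", "1234B678", "sd")
)

// For each column index, take the longest string length across all rows
// and pair it with the column name.
val maxLengths: Seq[(String, Int)] =
  columns.zipWithIndex.map { case (name, i) => name -> rows.map(_(i).length).max }
```

This sketch assumes there are no null values; in Spark, length returns null for a null input and max ignores nulls.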
You can try in the following way:
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._
val df = Seq(("abc","abcd","abcdef"),
("a","BCBDFG","qddfde"),
("MN","1234B678","sd"),
(null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()
val output = df.columns.map(c => (c, df.agg(max(length(df(s"$c")))).as[Int].first())).toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
| COL1| 3|
| COL2| 8|
| COL3| 6|
+-----------+----------+
I think it's a good idea to cache the input dataframe df, because it is scanned once per column here.
Here is one more way to get the report with the column names listed vertically:
scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.show(false)
+----+--------+------+
|COL1|COL2 |COL3 |
+----+--------+------+
|abc |abcd |abcdef|
|a |BCBDFG |qddfde|
|MN |1234B678|sd |
+----+--------+------+
scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)
scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]
scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3 |8 |6 |
+---------+---------+---------+
scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1 |_2 |
+----+---+
|COL1|3 |
|COL2|8 |
|COL3|6 |
+----+---+
scala>
To get the results into Scala collections, say Map()
scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala> result
res47: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala>

Spark: reduce/aggregate by key

I am new to Spark and Scala, so I have no idea what this kind of problem is called (which makes searching for it pretty hard).
I have data of the following structure:
[(date1, (name1, 1)), (date1, (name1, 1)), (date1, (name2, 1)), (date2, (name3, 1))]
In some way, this has to be reduced/aggregated to:
[(date1, [(name1, 2), (name2, 1)]), (date2, [(name3, 1)])]
I know how to do reduceByKey on a list of key-value pairs, but this particular problem is a mystery to me.
Thanks in advance!
Using my own sample data, here it is step by step:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2))
val rdd3 = rdd2.groupByKey
val rdd4 = rdd3.map{
case (str, nums) => (str, nums.sum)
}
val rdd5 = rdd4.map(x => (x._1._1, (x._1._2, x._2))).groupByKey
rdd5.collect
returns:
res28: Array[(String, Iterable[(String, Int)])] = Array((d2,CompactBuffer((E,1))), (d1,CompactBuffer((A,2), (B,1))))
Better approach avoiding groupByKey is as follows:
val rdd1 = sc.makeRDD(Array( ("d1",("A",1)), ("d1",("A",1)), ("d1",("B",1)), ("d2",("E",1)) ),2)
val rdd2 = rdd1.map(x => ((x._1, x._2._1), x._2._2)) // key = (date, name), value = count
val rdd3 = rdd2.reduceByKey(_+_)
val rdd4 = rdd3.map(x => (x._1._1, (x._1._2, x._2))).groupByKey // Necessary Shuffle
rdd4.collect
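The two-step idea (sum per (date, name) first, then regroup by date) can be sketched with plain Scala collections, no Spark required. The names perDateAndName and perDate are illustrative only:

```scala
val data = Seq(("d1", ("A", 1)), ("d1", ("A", 1)), ("d1", ("B", 1)), ("d2", ("E", 1)))

// Step 1: use (date, name) as a composite key and sum the counts.
val perDateAndName: Map[(String, String), Int] =
  data.groupBy { case (date, (name, _)) => (date, name) }
    .map { case (key, group) => key -> group.map(_._2._2).sum }

// Step 2: regroup by date only, keeping (name -> total) for each date.
val perDate: Map[String, Map[String, Int]] =
  perDateAndName.groupBy { case ((date, _), _) => date }
    .map { case (date, entries) =>
      date -> entries.map { case ((_, name), total) => name -> total }
    }
```

In the RDD version, step 1 corresponds to reduceByKey on the composite key, which combines values within each partition before shuffling. Step 2 corresponds to the final groupByKey, which then only has to move the already reduced records.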
As I stated in the comments, this can also be done with DataFrames for structured data, so run the following:
// This above should be enough.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val rddA = sc.makeRDD(Array( ("d1","A",1), ("d1","A",1), ("d1","B",1), ("d2","E",1) ),2)
val dfA = rddA.toDF("c1", "c2", "c3")
val dfB = dfA
.groupBy("c1", "c2")
.agg(sum("c3").alias("sum"))
dfB.show
returns:
+---+---+---+
| c1| c2|sum|
+---+---+---+
| d1| A| 2|
| d2| E| 1|
| d1| B| 1|
+---+---+---+
But you can do the following to approximate the CompactBuffer output above.
import org.apache.spark.sql.functions.{col, udf}
case class XY(x: String, y: Long)
val xyTuple = udf((x: String, y: Long) => XY(x, y))
val dfC = dfB
.withColumn("xy", xyTuple(col("c2"), col("sum")))
.drop("c2")
.drop("sum")
dfC.printSchema
dfC.show
// Then ... this gives you the CompactBuffer answer but from a DF-perspective
val dfD = dfC.groupBy(col("c1")).agg(collect_list(col("xy")))
dfD.show
returns (some renaming and possibly sorting required):
+---+----------------+
| c1|collect_list(xy)|
+---+----------------+
| d2| [[E, 1]]|
| d1|[[A, 2], [B, 1]]|
+---+----------------+

Comparing two array columns in Scala Spark

I have a dataframe of format given below.
movieId1 | genreList1 | genreList2
--------------------------------------------------
1 |[Adventure,Comedy] |[Adventure]
2 |[Animation,Drama,War] |[War,Drama]
3 |[Adventure,Drama] |[Drama,War]
and trying to create another flag column which shows whether genreList2 is a subset of genreList1.
movieId1 | genreList1 | genreList2 | Flag
---------------------------------------------------------------
1 |[Adventure,Comedy] | [Adventure] |1
2 |[Animation,Drama,War] | [War,Drama] |1
3 |[Adventure,Drama] | [Drama,War] |0
I have tried this:
def intersect_check(a: Array[String], b: Array[String]): Int = {
if (b.sameElements(a.intersect(b))) { return 1 }
else { return 2 }
}
def intersect_check_udf =
udf((colvalue1: Array[String], colvalue2: Array[String]) => intersect_check(colvalue1, colvalue2))
data = data.withColumn("Flag", intersect_check_udf(col("genreList1"), col("genreList2")))
But this throws error
org.apache.spark.SparkException: Failed to execute user defined function.
P.S.: The above function (intersect_check) works for Arrays.
We can define an udf that calculates the length of the intersection between the two Array columns and checks whether it is equal to the length of the second column. If so, the second array is a subset of the first one.
Also, the inputs of your udf need to be class WrappedArray[String], not Array[String] :
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}
val same_elements = udf { (a: WrappedArray[String],
b: WrappedArray[String]) =>
if (a.intersect(b).length == b.length){ 1 }else{ 0 }
}
df.withColumn("test",same_elements(col("genreList1"),col("genreList2")))
.show(truncate = false)
+--------+-----------------------+------------+----+
|movieId1|genreList1 |genreList2 |test|
+--------+-----------------------+------------+----+
|1 |[Adventure, Comedy] |[Adventure] |1 |
|2 |[Animation, Drama, War]|[War, Drama]|1 |
|3 |[Adventure, Drama] |[Drama, War]|0 |
+--------+-----------------------+------------+----+
Data
val df = List((1,Array("Adventure","Comedy"), Array("Adventure")),
(2,Array("Animation","Drama","War"), Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))).toDF("movieId1","genreList1","genreList2")
Here is a solution that converts both arrays to sets and uses subsetOf:
val spark =
SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(
Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
)).toDF("movieId1", "genreList1", "genreList2")
val subsetOf = udf((col1: Seq[String], col2: Seq[String]) => {
if (col2.toSet.subsetOf(col1.toSet)) 1 else 0
})
data.withColumn("flag", subsetOf(data("genreList1"), data("genreList2"))).show()
Hope this helps!
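The intersect-length check and the subsetOf check agree on the sample data, but they can differ when the second list contains duplicates, because intersect on a Seq is a multiset intersection. A plain-Scala sketch (the function names are illustrative):

```scala
// Check 1: compare the size of the intersection with the size of b.
def viaIntersectLength(a: Seq[String], b: Seq[String]): Boolean =
  a.intersect(b).length == b.length

// Check 2: convert both to sets and test the subset relation.
def viaSubsetOf(a: Seq[String], b: Seq[String]): Boolean =
  b.toSet.subsetOf(a.toSet)

// Check 3: every element of b must occur somewhere in a.
def viaForall(a: Seq[String], b: Seq[String]): Boolean =
  b.forall(x => a.contains(x))

val genres = Seq("Animation", "Drama", "War")
```

With genres as the first list, all three checks accept Seq("War", "Drama") and reject Seq("Drama", "Comedy"). For Seq("Drama", "Drama"), only the intersect-length check returns false, so which one fits depends on whether repeated genres should count.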
One solution may be to exploit spark array builtin functions: genreList2 is subset of genreList1 if the intersection between the two is equal to genreList2. In the code below a sort_array operation has been added to avoid a mismatch between two arrays with different ordering but same elements.
val spark = {
SparkSession
.builder()
.master("local")
.appName("test")
.getOrCreate()
}
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val df = Seq(
(1, Array("Adventure","Comedy"), Array("Adventure")),
(2, Array("Animation","Drama","War"), Array("War","Drama")),
(3, Array("Adventure","Drama"), Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
df
.withColumn("flag",
sort_array(array_intersect($"genreList1",$"genreList2"))
.equalTo(
sort_array($"genreList2")
)
.cast("integer")
)
.show()
The output is
+--------+--------------------+------------+----+
|movieId1| genreList1| genreList2|flag|
+--------+--------------------+------------+----+
| 1| [Adventure, Comedy]| [Adventure]| 1|
| 2|[Animation, Drama...|[War, Drama]| 1|
| 3| [Adventure, Drama]|[Drama, War]| 0|
+--------+--------------------+------------+----+
This also works here, and it does not use a UDF (array_except requires Spark 2.4+):
import spark.implicits._
import org.apache.spark.sql.functions._
val data = Seq(
(1,Array("Adventure","Comedy"),Array("Adventure")),
(2,Array("Animation","Drama","War"),Array("War","Drama")),
(3,Array("Adventure","Drama"),Array("Drama","War"))
).toDF("movieId1", "genreList1", "genreList2")
data
.withColumn("size",size(array_except($"genreList2",$"genreList1")))
.withColumn("flag",when($"size" === lit(0), 1) otherwise(0))
.show(false)
Spark 3.0+ (forall)
forall($"genreList2", x => array_contains($"genreList1", x)).cast("int")
Full example:
val df = Seq(
(1, Seq("Adventure", "Comedy"), Seq("Adventure")),
(2, Seq("Animation", "Drama","War"), Seq("War", "Drama")),
(3, Seq("Adventure", "Drama"), Seq("Drama", "War"))
).toDF("movieId1", "genreList1", "genreList2")
val df2 = df.withColumn("Flag", forall($"genreList2", x => array_contains($"genreList1", x)).cast("int"))
df2.show()
// +--------+--------------------+------------+----+
// |movieId1| genreList1| genreList2|Flag|
// +--------+--------------------+------------+----+
// | 1| [Adventure, Comedy]| [Adventure]| 1|
// | 2|[Animation, Drama...|[War, Drama]| 1|
// | 3| [Adventure, Drama]|[Drama, War]| 0|
// +--------+--------------------+------------+----+