isin throws StackOverflowError in withColumn function in Spark - Scala

I am using Spark 2.3 in my Scala application. I have a DataFrame, created from Spark SQL, that is named sqlDF in the sample code I shared. I have a string list with the items below:
List[String] stringList items
-9, -8, -7, -6
I want to replace, in every column of the DataFrame, all values that match an item in this list with 0.
Initial DataFrame:
column1 | column2 | column3
      1 |       1 |       1
      2 |      -5 |       1
      6 |      -6 |       1
     -7 |      -8 |      -7
It must return:
column1 | column2 | column3
      1 |       1 |       1
      2 |      -5 |       1
      6 |       0 |       1
      0 |       0 |       0
For this, I am iterating the statement below over all columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But I am getting the error below. If I pick only a single column, the iteration works; when I run the code above over all 500 columns, it fails:
Exception in thread "streaming-job-executor-0" java.lang.StackOverflowError
  at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
  at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
  at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What am I missing?

Here is a different approach: apply a left anti join between each columnX and X, where X is your list of items turned into a DataFrame. The left anti join returns all the items not present in X; the per-column results are then concatenated through an outer join (which can be replaced with a left join for better performance, although that would exclude records whose values are all zeros, i.e. id == 3), keyed on an id assigned with monotonically_increasing_id:
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
import spark.implicits._

val df = Seq(
  (1, 1, 1),
  (2, -5, 1),
  (6, -6, 1),
  (-7, -8, -7))
  .toDF("c1", "c2", "c3")
  .withColumn("id", monotonically_increasing_id())

val exdf = Seq(-9, -8, -7, -6).toDF("x")

// Apply the anti join to the data columns only; id is just the join key.
df.columns.filter(_ != "id").map { c =>
    df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
  }
  .reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
  .na.fill(0)
  .show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+

foldLeft works perfectly for your case here, as shown below:
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
  (1, 1, 1),
  (2, -5, 1),
  (6, -6, 1),
  (-7, -8, -7)
)).toDF("a", "b", "c")

val list = Seq(-6, -7, -8, -9)

val resultDF = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}

resultDF.show(false)
Output:
+---+---+---+
|a  |b  |c  |
+---+---+---+
|1  |1  |1  |
|2  |-5 |1  |
|6  |0  |1  |
|0  |0  |0  |
+---+---+---+
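As a side note on the StackOverflowError itself (not part of this answer): hundreds of chained withColumn calls nest one projection per column into the logical plan, so an alternative is to build all replacement expressions up front and apply them in a single select. A minimal sketch, reusing df and list from above:
import org.apache.spark.sql.functions.{col, when}

// Sketch: one projection over all columns instead of one withColumn per column.
val replaced = df.select(df.columns.map { name =>
  when(col(name).isin(list: _*), 0).otherwise(col(name)).as(name)
}: _*)

replaced.show(false)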

I would suggest you broadcast the list of Strings:
val stringList = sc.broadcast(<your List[String]>)
After that, use this:
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value: _*), 0).otherwise(col(currColumnName)))
Make sure your currColumnName column is also of String type; the comparison should be String to String.
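A minimal sketch of how this might be wired across all columns, assuming sqlDF from the question and the string values -9 to -6 (names and values are taken from the question, not from a tested setup):
import org.apache.spark.sql.functions.{col, when}

// Broadcast the list once and reuse it in every column expression.
val stringList = sc.broadcast(Seq("-9", "-8", "-7", "-6"))

val cleanedDF = sqlDF.columns.foldLeft(sqlDF) { (acc, currColumnName) =>
  acc.withColumn(
    currColumnName,
    when(col(currColumnName).isin(stringList.value: _*), 0).otherwise(col(currColumnName))
  )
}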

Related

Add values to a dataframe against some particular ID in Spark Scala

I have the following dataframe:
ID Name City
1 Ali swl
2 Sana lhr
3 Ahad khi
4 ABC fsd
And a list of values like (1,2,1):
val nums: List[Int] = List(1, 2, 1)
I want to add these values against ID == 3, so that the DataFrame looks like this:
ID Name City newCol newCol2 newCol3
1 Ali swl null null null
2 Sana lhr null null null
3 Ahad khi 1 2 1
4 ABC fsd null null null
I wonder if it is possible? Any help will be appreciated. Thanks
Yes, it's possible.
Use when to populate matched values and otherwise for the unmatched ones.
I have used zipWithIndex to make the column names unique.
Please check the code below.
scala> import org.apache.spark.sql.functions._
scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)
scala> val filterData = List(3)
scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df, c) => df.withColumn(s"newCol${c._2}", when($"id".isin(filterData: _*), c._1).otherwise(null))).show(false) } // zipWithIndex makes the generated column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1 |Ali |swl |null |null |null |
|2 |Sana|lhr |null |null |null |
|3 |Ahad|khi |1 |2 |1 |
|4 |ABC |fsd |null |null |null |
+---+----+----+-------+-------+-------+
Time taken: 43 ms
scala>
First, you can convert it to a DataFrame with a single array column and then "decompose" the array column into separate columns as follows:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val numsDf =
  Seq(nums)
    .toDF("nums")
    .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
After that you can use an outer join to join your original DataFrame (here named data) to numsDf with the ID == 3 condition, as follows:
val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer")
resultDf.show() will print:
+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
| 1| Ali| swl| null| null| null|
| 2|Sana| lhr| null| null| null|
| 3|Ahad| khi| 1| 2| 1|
| 4| ABC| fsd| null| null| null|
+---+----+----+-------+-------+-------+
Make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:
val spark = SparkSession.builder()
  ...
  .config("spark.sql.crossJoin.enabled", value = true)
  .getOrCreate()

Spark GroupBy and Aggregate Strings to Produce a Map of Counts of the Strings Based on a Condition

I have a dataframe with multiple columns, two of which are id and label, as shown below.
+---+------+
| id| label|
+---+------+
|  1| "abc"|
|  1| "abc"|
|  1| "def"|
|  2| "def"|
|  2| "def"|
+---+------+
I want to groupBy "id" and aggregate the label column into a map of counts per label (ignoring nulls); the expected result is shown below:
+---+-------------------+
| id|              label|
+---+-------------------+
|  1| {"abc":2, "def":1}|
|  2|          {"def":2}|
+---+-------------------+
Is it possible to do this without using user-defined aggregate functions? I saw a similar answer here, but it doesn't aggregate based on the count of each item.
I apologize if this question is silly, I am new to both Scala and Spark.
Thanks
Without Custom UDFs
import org.apache.spark.sql.functions.{collect_list, map}

df.groupBy("id", "label")
  .count
  .select($"id", map($"label", $"count").as("map"))
  .groupBy("id")
  .agg(collect_list("map"))
  .show(false)
+---+------------------------+
|id |collect_list(map) |
+---+------------------------+
|1 |[[def -> 1], [abc -> 2]]|
|2 |[[def -> 2]] |
+---+------------------------+
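If you are on Spark 2.4 or later, a possible variant (not part of the original answer) collapses this list of single-entry maps into one map per id using map_from_entries. A minimal sketch, assuming the same df:
import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

// Sketch (assumes Spark 2.4+): collect (label, count) structs per id and
// turn the collected entries into a single map column.
df.groupBy("id", "label")
  .count()
  .groupBy("id")
  .agg(map_from_entries(collect_list(struct($"label", $"count"))).as("label_counts"))
  .show(false)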
Using a custom UDF:
import org.apache.spark.sql.functions.{collect_list, udf}

val customUdf = udf((seq: Seq[String]) => {
  seq.groupBy(x => x).map(x => x._1 -> x._2.size)
})

df.groupBy("id")
  .agg(collect_list("label").as("list"))
  .select($"id", customUdf($"list").as("map"))
  .show(false)
+---+--------------------+
|id |map |
+---+--------------------+
|1 |[abc -> 2, def -> 1]|
|2 |[def -> 2] |
+---+--------------------+

Speed up spark dataframe groupBy

I am fairly inexperienced in Spark, and need help with groupBy and aggregate functions on a dataframe. Consider the following dataframe:
val df = Seq(
  (1, "a", "1"),
  (1, "b", "3"),
  (1, "c", "6"),
  (2, "a", "9"),
  (2, "c", "10"),
  (1, "b", "8"),
  (2, "c", "3"),
  (3, "r", "19")).toDF("col1", "col2", "col3")
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 1|
| 1| b| 3|
| 1| c| 6|
| 2| a| 9|
| 2| c| 10|
| 1| b| 8|
| 2| c| 3|
| 3| r| 19|
+----+----+----+
I need to group by col1 and col2 and calculate the mean of col3, which I can do using:
val col1df = df.groupBy("col1").agg(round(mean("col3"),2).alias("mean_col1"))
val col2df = df.groupBy("col2").agg(round(mean("col3"),2).alias("mean_col2"))
However, on a large dataframe with a few million rows and tens of thousands of unique elements in the grouping columns, it takes a very long time. On top of that, I have many more columns to group by, so the total runtime becomes unreasonably long, and I am looking to reduce it. Is there a better way to do the groupBy followed by the aggregation?
You could use ideas from Multiple Aggregations (grouping sets); it can do everything in one shuffle operation, which is the most expensive part.
Example:
val df = Seq(
  (1, "a", "1"),
  (1, "b", "3"),
  (1, "c", "6"),
  (2, "a", "9"),
  (2, "c", "10"),
  (1, "b", "8"),
  (2, "c", "3"),
  (3, "r", "19")).toDF("col1", "col2", "col3")

df.createOrReplaceTempView("data")

val grpRes = spark.sql(
  """select grouping_id() as gid, col1, col2, round(mean(col3), 2) as res
     from data group by col1, col2 grouping sets ((col1), (col2))""")

grpRes.show(100, false)
Output:
+---+----+----+----+
|gid|col1|col2|res |
+---+----+----+----+
|1 |3 |null|19.0|
|2 |null|b |5.5 |
|2 |null|c |6.33|
|1 |1 |null|4.5 |
|2 |null|a |5.0 |
|1 |2 |null|7.33|
|2 |null|r |19.0|
+---+----+----+----+
gid is a bit funny to use, as it is derived from a bitmask underneath. But if your grouping columns cannot contain nulls, then you can use it to select the correct groups, for example as shown below.
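For instance, to recover the two separate aggregations from the combined result (a sketch against the grpRes DataFrame above; the gid values follow the output shown):
// gid == 1 -> rows grouped by col1 only; gid == 2 -> rows grouped by col2 only.
val col1df = grpRes.filter($"gid" === 1).select($"col1", $"res".as("mean_col1"))
val col2df = grpRes.filter($"gid" === 2).select($"col2", $"res".as("mean_col2"))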
Execution Plan:
scala> grpRes.explain
== Physical Plan ==
*(2) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[avg(cast(col3#9 as double))])
+- Exchange hashpartitioning(col1#111, col2#112, spark_grouping_id#108, 200)
+- *(1) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[partial_avg(cast(col3#9 as double))])
+- *(1) Expand [List(col3#9, col1#109, null, 1), List(col3#9, null, col2#110, 2)], [col3#9, col1#111, col2#112, spark_grouping_id#108]
+- LocalTableScan [col3#9, col1#109, col2#110]
As you can see, there is a single Exchange operation, the expensive shuffle.

Element position from nested DataFrame array (Spark 2.2)

I'm trying to explode a nested DataFrame in Spark Scala. I have a DataFrame df which contains the following information:
root
|-- id: integer (nullable = false)
|-- features: array (nullable = true)
| |-- element: float (containsNull = false)
I've exploded the array information into a flat DataFrame with:
df.selectExpr("id","explode(features) as features")
and got the following DataFrame:
id features
0 0.0629885
0 0.15931357
0 0.08922347
My end goal is to pivot the data and calculate some similarities with that information. To do that, it would be very cool to get the actual position of the feature for every ID into the DataFrame, like this:
id features feature_pos
0 0.0629885 0
0 0.15931357 1
0 0.08922347 2
Use posexplode in place of explode; it creates a new row for each element together with its position in the given array or map column. (If you also need a (null, null) row when the array/map is null or empty, use posexplode_outer instead.)
Here is an example with posexplode.
scala> val df = Seq((0, Seq(0.1f, 0.2f, 0.3f)),(1, Seq(0.4f, 0.5f, 0.6f))).toDF("id", "features")
df: org.apache.spark.sql.DataFrame = [id: int, features: array<float>]
scala> df.show(false)
+---+---------------+
|id |features |
+---+---------------+
|0 |[0.1, 0.2, 0.3]|
|1 |[0.4, 0.5, 0.6]|
+---+---------------+
Note that df.withColumn("pos cols", posexplode('features)).show(false) will throw an error, because posexplode produces two columns (pos and col) while withColumn expects a single-column expression; use df.select() instead:
scala> df.select(posexplode('features)).show(false)
+---+---+
|pos|col|
+---+---+
|0 |0.1|
|1 |0.2|
|2 |0.3|
|0 |0.4|
|1 |0.5|
|2 |0.6|
+---+---+
scala>
The default column names are "pos" and "col". You can rename them as follows:
scala> df.select(posexplode('features).as(Seq("a","b"))).show(false)
+---+---+
|a |b |
+---+---+
|0 |0.1|
|1 |0.2|
|2 |0.3|
|0 |0.4|
|1 |0.5|
|2 |0.6|
+---+---+
scala>
When you want to explode and keep all the original columns as well, use:
scala> df.select(col("*"), posexplode('features).as( Seq("a","b")) ).show(false)
+---+---------------+---+---+
|id |features |a |b |
+---+---------------+---+---+
|0 |[0.1, 0.2, 0.3]|0 |0.1|
|0 |[0.1, 0.2, 0.3]|1 |0.2|
|0 |[0.1, 0.2, 0.3]|2 |0.3|
|1 |[0.4, 0.5, 0.6]|0 |0.4|
|1 |[0.4, 0.5, 0.6]|1 |0.5|
|1 |[0.4, 0.5, 0.6]|2 |0.6|
+---+---------------+---+---+
scala>
You can also apply Scala's zipWithIndex via a UDF as follows:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val df = Seq(
  (0, Seq(0.1f, 0.2f, 0.3f)),
  (1, Seq(0.4f, 0.5f, 0.6f))
).toDF("id", "features")

def addIndex = udf(
  (s: Seq[Float]) => s.zipWithIndex
)

val df2 = df.withColumn("features_idx", explode(addIndex($"features")))

df2.select($"id", $"features_idx._1".as("features"), $"features_idx._2".as("features_pos")).show
+---+--------+------------+
| id|features|features_pos|
+---+--------+------------+
| 0| 0.1| 0|
| 0| 0.2| 1|
| 0| 0.3| 2|
| 1| 0.4| 0|
| 1| 0.5| 1|
| 1| 0.6| 2|
+---+--------+------------+

Parse nested data using SCALA

I have a dataframe as follows:
ColA ColB ColC
1 [2,3,4] [5,6,7]
I need to convert it to the following:
ColA ColB ColC
1 2 5
1 3 6
1 4 7
Can someone please help with the code in Scala?
You can zip the two array columns by means of a UDF and explode the zipped column as follows:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val df = Seq(
  (1, Seq(2, 3, 4), Seq(5, 6, 7))
).toDF("ColA", "ColB", "ColC")

def zip = udf(
  (x: Seq[Int], y: Seq[Int]) => x zip y
)

val df2 = df.select($"ColA", zip($"ColB", $"ColC").as("BzipC"))
  .withColumn("BzipC", explode($"BzipC"))

val df3 = df2.select($"ColA", $"BzipC._1".as("ColB"), $"BzipC._2".as("ColC"))

df3.show
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 1| 2| 5|
| 1| 3| 6|
| 1| 4| 7|
+----+----+----+
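As a side note not in the original answer: on Spark 2.4 or later the same zipping can be done without a UDF via the built-in arrays_zip (a sketch; it assumes the resulting struct fields keep the source column names, which holds for plain column references):
import org.apache.spark.sql.functions.{arrays_zip, explode}

// Sketch (assumes Spark 2.4+): zip ColB and ColC into an array of structs,
// explode the result, then project the struct fields back out as columns.
df.select($"ColA", explode(arrays_zip($"ColB", $"ColC")).as("z"))
  .select($"ColA", $"z.ColB".as("ColB"), $"z.ColC".as("ColC"))
  .show()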
The idea I am presenting here is a bit more involved: use map to combine the two arrays ColB and ColC, then use the explode function to explode the combined array, and finally extract the exploded combined array into separate columns.
import scala.collection.mutable

import org.apache.spark.sql.functions._
import spark.implicits._

val tempDF = df.map(row => {
    val colB = row(1).asInstanceOf[mutable.WrappedArray[Int]]
    val colC = row(2).asInstanceOf[mutable.WrappedArray[Int]]
    var array = Array.empty[(Int, Int)]
    for (loop <- 0 to colB.size - 1) {
      array = array :+ (colB(loop), colC(loop))
    }
    (row(0).asInstanceOf[Int], array)
  })
  .toDF("ColA", "ColB")
  .withColumn("ColD", explode($"ColB"))

tempDF.withColumn("ColB", $"ColD._1").withColumn("ColC", $"ColD._2").drop("ColD").show(false)
This would give you the result:
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|1 |2 |5 |
|1 |3 |6 |
|1 |4 |7 |
+----+----+----+
You can also use a combination of posexplode and LATERAL VIEW from HiveQL:
sqlContext.sql("""
select 1 as colA, array(2,3,4) as colB, array(5,6,7) as colC
""").registerTempTable("test")
sqlContext.sql("""
select
colA , b as colB, c as colC
from
test
lateral view
posexplode(colB) columnB as seqB, b
lateral view
posexplode(colC) columnC as seqC, c
where
seqB = seqC
""" ).show
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 1| 2| 5|
| 1| 3| 6|
| 1| 4| 7|
+----+----+----+
Credits: https://stackoverflow.com/a/40614822/7224597 ;)