Spark find key/value pairs with key equals to other values and join - scala

If we have the following key value pairs:
[T,V] [V,W] [A,B] [B,C]
I need to result to be
[T,V] [V,W] [T,W] [A,B] [B,C] [A,C]
So basically to generate [T,W] from [T,V] and [V,W] and append to the existing set
I'm not sure how to do this in spark with scala, please help.

val df = sc.parallelize(
Array(("T","V"),("V","W"),("A","B"),("B","C"))
).toDF("key","value")
df.show
+---+-----+
|key|value|
+---+-----+
| T| V|
| V| W|
| A| B|
| B| C|
+---+-----+
df.join(
df.toDF("keyR", "valueR"),
$"value" === $"keyR"
).explode($"key",$"value",$"keyR",$"valueR"){row => Seq(
(row.getString(0), row.getString(1)),
(row.getString(2), row.getString(3)),
(row.getString(0), row.getString(3))
)}.select($"_1" as "key", $"_2" as "value").show
+---+-----+
|key|value|
+---+-----+
| A| B|
| B| C|
| A| C|
| T| V|
| V| W|
| T| W|
+---+-----+

Using purely Scala collection functions (in Set) - I don't use Spark:
val ex = Set("T" -> "V", "V" -> "W", "A" -> "B", "B" -> "C")
val keysEquallingValues = ex.flatMap { tuple =>
ex.find(t => tuple._2 == t._1).map(t => tuple -> t)
}
val r = ex ++ keysEquallingValues.map(pair => pair._1._1 -> pair._2._2)
Explanation:
ex is your example input Set
We flatMap over it, using an expression that returns an Option[((String,String), (String, String))] - i.e. if the condition is there a tuple with a key the same as the current value? is true, we'll have a Some containing a tuple of the two tuples (!) that satisfy the condition.
Using flatMap and Option like this allows us to drop out non-matching cases (like a filter) but also simultaneously transform the content of the collection in the one pass.
Finally we cherrypick the key of the first tuple and the value of the second, to get the desired combination, and add it to your original Set.

Related

How to create a map column to count occurrences without udaf

I would like to create a Map column which counts the number of occurrences.
For instance:
+---+----+
| b| a|
+---+----+
| 1| b|
| 2|null|
| 1| a|
| 1| a|
+---+----+
would result in
+---+--------------------+
| b| res|
+---+--------------------+
| 1|[a -> 2.0, b -> 1.0]|
| 2| []|
+---+--------------------+
For the moment, in Spark 2.4.6, I was able to make it using udaf.
While bumping to Spark3 I was wondering if I could get rid of this udaf (I tried using the new method aggregate without success)
Is there an efficient way to do it?
(For the efficiency part, I am able to test easily)
Here a Spark 3 solution:
import org.apache.spark.sql.functions._
df.groupBy($"b",$"a").count()
.groupBy($"b")
.agg(
map_from_entries(
collect_list(
when($"a".isNotNull,struct($"a",$"count"))
)
).as("res")
)
.show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
Here the solution using Aggregator:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Encoder
val countOcc = new Aggregator[String, Map[String,Int], Map[String,Int]] with Serializable {
def zero: Map[String,Int] = Map.empty.withDefaultValue(0)
def reduce(b: Map[String,Int], a: String) = if(a!=null) b + (a -> (b(a) + 1)) else b
def merge(b1: Map[String,Int], b2: Map[String,Int]) = {
val keys = b1.keys.toSet.union(b2.keys.toSet)
keys.map{ k => (k -> (b1(k) + b2(k))) }.toMap
}
def finish(b: Map[String,Int]) = b
def bufferEncoder: Encoder[Map[String,Int]] = implicitly(ExpressionEncoder[Map[String,Int]])
def outputEncoder: Encoder[Map[String, Int]] = implicitly(ExpressionEncoder[Map[String, Int]])
}
val countOccUDAF = udaf(countOcc)
df
.groupBy($"b")
.agg(countOccUDAF($"a").as("res"))
.show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
You could always use collect_list with UDF, but only if you groupings are not too lage:
val udf_histo = udf((x:Seq[String]) => x.groupBy(identity).mapValues(_.size))
df.groupBy($"b")
.agg(
collect_list($"a").as("as")
)
.select($"b",udf_histo($"as").as("res"))
.show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
This should be faster than UDAF: Spark custom aggregation : collect_list+UDF vs UDAF
We can achieve this is spark 2.4
//GET THE COUNTS
val groupedCountDf = originalDf.groupBy("b","a").count
//CREATE MAPS FOR EVERY COUNT | EMPTY MAP FOR NULL KEY
//AGGREGATE THEM AS ARRAY
val dfWithArrayOfMaps = groupedCountDf
.withColumn("newMap", when($"a".isNotNull, map($"a",$"count")).otherwise(map()))
.groupBy("b").agg(collect_list($"newMap") as "multimap")
//EXPRESSION TO CONVERT ARRAY[MAP] -> MAP
val mapConcatExpr = expr("aggregate(multimap, map(), (k, v) -> map_concat(k, v))")
val finalDf = dfWithArrayOfMaps.select($"b", mapConcatExpr.as("merged_data"))
Here a solution with a single groupBy and a slightly complex sql expression. This solution works for Spark 2.4+
df.groupBy("b")
.agg(expr("sort_array(collect_set(a)) as set"),
expr("sort_array(collect_list(a)) as list"))
.withColumn("res",
expr("map_from_arrays(set,transform(set, x -> size(filter(list, y -> y=x))))"))
.show()
Output:
+---+------+---------+----------------+
| b| set| list| res|
+---+------+---------+----------------+
| 1|[a, b]|[a, a, b]|[a -> 2, b -> 1]|
| 2| []| []| []|
+---+------+---------+----------------+
The idea is to collect the data from column a twice: one time into a set and one time into a list. Then with the help of transform for each element of the set the number of occurences of the particular element in the list is counted. Finally, the set and the number of elements are combined with map_from_arrays.
However I cannot say if this approach is really faster than a UDAF.

Collect most occurring unique values across columns after a groupby in Spark

I have the following dataframe
val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("XX","j","i","h",null))
.toDF("main","value1","value2","value3","value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
| ZZ| a| a| b| b|
| ZZ| a| b| c| d|
| YY| b| e| null| f|
| YY| b| b| null| f|
| XX| j| i| h| null|
+----+------+------+------+------+
I need to group by the main column and pick the two most occurring values from the remaining columns for each main value
I did the following
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
val newdf2 = newdf.groupBy('main).agg(collect_set('values).alias("values"))
val newdf3 = newdf2.select('main, flatten($"values").alias("values"))
To get the data in the following form
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j, i, h,]|
+----+--------------------+
Now I need to pick the most occurring two items from the list as two columns. Dunno how to do that.
So, in this case the expected output should be
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| i|
+----+------+------+
null should not be counted and the final values should be null only if there are no other values to fill
Is this the best way to do things ? Is there a better way of doing it ?
You can use an udf to select the two values from the array that occur the most often.
input.withColumn("values", array("value1", "value2", "value3", "value4"))
.groupBy("main").agg(flatten(collect_list("values")).as("values"))
.withColumn("max", maxUdf('values)) //(1)
.cache() //(2)
.withColumn("value1", 'max.getItem(0))
.withColumn("value2", 'max.getItem(1))
.drop("values", "max")
.show(false)
with maxUdf being defined as
def getMax[T](array: Seq[T]) = {
array
.filter(_ != null) //remove null values
.groupBy(identity).mapValues(_.length) //count occurences of each value
.toSeq.sortWith(_._2 > _._2) //sort (3)
.map(_._1).take(2) //return the two (or one) most common values
}
val maxUdf = udf(getMax[String] _)
Remarks:
using an udf here means that the whole array with all entries for a single value of main has to fit into the memory of one Spark executor
cache is required here or the the udf will be called twice, once for value1 and once for value2
the sortWith here is stable but it might be necessary to add some extra logic to handle the situation if two elements have the same number of occurences (like i, j and h for the main value XX)
Here is my try without udf.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('main).orderBy('count.desc)
newdf3.withColumn("values", explode('values))
.groupBy('main, 'values).agg(count('values).as("count"))
.filter("values is not null")
.withColumn("target", concat(lit("value"), lit(row_number().over(w))))
.filter("target < 'value3'")
.groupBy('main).pivot('target).agg(first('values)).show
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| null|
+----+------+------+
The last row has the null value because I have modified your dataframe in this way,
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j,,,]| <- For null test
+----+--------------------+

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like below -
I am trying to create another dataframe from this which has 2 columns - the column name and the sum of values in each column like this -
So far, I've tried this (in Spark 2.2.0) but throws a stack trace -
val get_count: (String => Long) = (c: String) => {
df.groupBy("id")
.agg(sum(c) as "s")
.select("s")
.collect()(0)
.getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives of accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
(0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
.toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations computed in aggs. One would expect something more simple like df.agg(aggs : _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)
maybe to ensure that at least one column is used, and this is why you need to split aggs in aggs.head and aggs.tail.
What i do is to define a method to create a struct from the desired values:
def kv (columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This functions receives a list of columns to transpose (your 3 last columns in your case) and transform them in a struct with the column name as key and the column value as value
And then use that method to create an struct and process it as you want
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previous defined function to have the desired columns as the said struct, and then deconstruct the struct to have a column key and a column value in each row.
Then you can aggregate by the column name and sum the values
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

How to compose column name using another column's value for withColumn in Scala Spark

I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.
For instance, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I'd like to obtain this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C whose value came from either column A_1 or B_2. The name of the source column A_1 comes from concatenating the value of columns A and B.
I know that I can add a new column based on another and a constant like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
Which doesn't work.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
You can achieve your requirement by writing a udf function. I am suggesting udf, as your requirement is to process dataframe row by row contradicting to inbuilt functions which functions column by column.
But before that you would need array of column names
val columns = df.columns
Then write a udf function as
import org.apache.spark.sql.functions._
def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A+"_"+B)))
where
A is the first column value
B is the second column value
array is the Array of all the columns values
Now just call the udf function using withColumn api
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
You should get your desired output dataframe.
You can select from a map. Define map which translates name to column value:
import org.apache.spark.sql.functions.{col, concat_ws, lit, map}
val dataMap = map(
df.columns.diff(Seq("A", "B")).flatMap(c => lit(c) :: col(c) :: Nil): _*
)
df.select(dataMap).show(false)
+---------------------------+
|map(A_1, A_1, B_2, B_2) |
+---------------------------+
|Map(A_1 -> 0.1, B_2 -> 0.3)|
|Map(A_1 -> 0.2, B_2 -> 0.4)|
+---------------------------+
and select from it with apply:
df.withColumn("C", dataMap(concat_ws("_", $"A", $"B"))).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
You can also try mapping, but I suspect it won't perform well with very wide data:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val outputEncoder = RowEncoder(df.schema.add(StructField("C", DoubleType)))
df.map(row => {
val a = row.getAs[String]("A")
val b = row.getAs[String]("B")
val key = s"${a}_${b}"
Row.fromSeq(row.toSeq :+ row.getAs[Double](key))
})(outputEncoder).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
and in general I wouldn't recommend this approach.
If data comes from csv you might consider skipping default csv reader and use custom logic to push column selection directly into parsing process. With pseudocode:
spark.read.text(...).map { line => {
val a = ??? // parse A
val b = ??? // parse B
val c = ??? // find c, based on a and b
(a, b, c)
}}

Spark join produces wrong results

Presenting here before possibly filing a bug. I'm using Spark 1.6.0.
This is a simplified version of the problem I'm dealing with. I've filtered a table, and then I'm trying to do a left outer join with that subset and the main table, matching all the columns.
I've only got 2 rows in the main table and one in the filtered table. I'm expecting the resulting table to only have the single row from the subset.
scala> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
b: org.apache.spark.sql.DataFrame = [a: string, b: string, c: int]
scala> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
a: org.apache.spark.sql.DataFrame = [filta: string, filtb: string, c: int]
scala> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
| a| b| 1| a| b| 2|
+-----+-----+---+---+---+---+
I didn't expect that result at all. I expected the first row, but not the second. I suspected it's the null-safe equality, so I tried it without.
scala> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
16/03/21 12:50:00 WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'. Perhaps you need to use aliases.
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
+-----+-----+---+---+---+---+
OK, that's the result I expected, but then I got suspicious of the warning. There is a separate StackOverflow question to deal with that warning here: Spark SQL performing carthesian join instead of inner join
So I create a new column that avoids the warning.
scala> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
+-----+-----+---+----+---+---+---+
|filta|filtb| c|newc| a| b| c|
+-----+-----+---+----+---+---+---+
| a| b| 1| 1| a| b| 1|
| a| b| 1| 1| a| b| 2|
+-----+-----+---+----+---+---+---+
But now the result is wrong again!
I have a lot of null-safe equality checks, and the warning isn't fatal, so I don't see a clear path to working with/around this.
Is the behaviour a bug, or is this expected behaviour? If expected, why?
If you want an expected behavior use either join on names:
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")
a.join(b, Seq("a", "b", "c")).show
// +---+---+---+
// | a| b| c|
// +---+---+---+
// | a| b| 1|
// +---+---+---+
or aliases:
val aa = a.alias("a")
val bb = b.alias("b")
aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
You can use <=> as well:
aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")
As far as I remember there's been a special case for simple equality for a while. That's why you get correct results despite the warning.
The second behavior looks indeed like a bug related to the fact that you still have a.c in your data. It looks like it is picked downstream before b.c and the evaluated condition is actually a.newc = a.c.
val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")