Spark: how to group rows into a fixed size array? - scala

I have a dataset that looks like this:
+---+
|col|
+---+
| a|
| b|
| c|
| d|
| e|
| f|
| g|
+---+
I want to reformat this dataset so that I aggregate the rows into a arrays of fixed length, like so:
+------+
| col|
+------+
|[a, b]|
|[c, d]|
|[e, f]|
| [g]|
+------+
I tried this:
spark.sql("select collect_list(col) from (select col, row_number() over (order by col) row_number from dataset) group by floor(row_number/2)")
But the problem with this is that my actual dataset is too large to process in a single partition for row_number()

As you wish to distribute this, there are a couple of steps necessary.
In case, you wish to run the code, I am starting from this:
var df = List(
"a", "b", "c", "d", "e", "f", "g"
).toDF("col")
val desiredArrayLength = 2
First, split tyour dataframe into a small one which you can process on single node, and larger one which has number of rows which is multiple of size of desired array (in your example, this is 2)
val nRowsPrune = 1 //number of rows to prune such that remaining dataframe has number of
// rows is multiples of the desired length of array
val dfPrune = df.sort(desc("col")).limit(nRowsPrune)
df = df.join(dfPrune,Seq("col"),"left_anti") //separate small from large dataframe
By construction, you can apply the original code on the small dataframe,
val groupedPruneDf = dfPrune//.withColumn("g",floor((lit(-1)+row_number().over(w))/lit(desiredArrayLength ))) //added -1 as row-number starts from 1
//.groupBy("g")
.agg( collect_list("col").alias("col"))
.select("col")
Now, we need to figure a way to deal with the remaining large dataframe. However, now we made sure, that df has a number of rows which is a multiple of the array size.
This is where we use a great trick, which is repartitioning using repartitionByRange. Basically, the partitioning guarantees to preserve the sorting and as you are partitioning each partition will have same size.
You can now, collect each array within each partition,
val nRows = df.count()
val maxNRowsPartition = desiredArrayLength //make sure its a multiple of desired array length
val nPartitions = math.max(1,math.floor(nRows/maxNRowsPartition) ).toInt
df = df.repartitionByRange(nPartitions, $"col".desc)
.withColumn("partitionId",spark_partition_id())
val w = Window.partitionBy($"partitionId").orderBy("col")
val groupedDf = df
.withColumn("g", floor( (lit(-1)+row_number().over(w))/lit(desiredArrayLength ))) //added -1 as row-number starts from 1
.groupBy("partitionId","g")
.agg( collect_list("col").alias("col"))
.select("col")
Finally combining the two results yields what you are looking for,
val result = groupedDf.union(groupedPruneDf)
result.show(truncate=false)

Related

spark scala cartesian product of each element in a column

I have a dataframe which is like :
df:
col1 col2
a [p1,p2,p3]
b [p1,p4]
Desired output is that:
df_out:
col1 col2 col3
p1 p2 a
p1 p3 a
p2 p3 a
p1 p4 b
I did some research and i think that converting df to rdd and then flatMap with cartesian product are ideal for the problem. However i could not combine them together.
Thanks,
It looks like you are trying to do combination rather than cartesian. Please check my understanding.
This is in PySpark but the only python thing is the UDF, the rest is just DataFrame operations.
process is
Create dataframe
define UDF to get all pairs of combinations ignoring order
use UDF to convert array into array of pairs of structs, one for each element of the combination
explode the results to get rows of pair of structs
select each struct and original column 1 into desired result columns
from itertools import combinations
from pyspark.sql import functions as F
df = spark.createDataFrame([
("a", ["p1", "p2", "p3"]),
("b", ["p1", "p4"])
],
["col1", "col2"]
)
# define and register udf that takes an array and returns an array of struct of two strings
#udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
return combinations(x, 2)
resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| p1| p2| a|
| p1| p3| a|
| p2| p3| a|
| p1| p4| b|
+----+----+----+

How to find intersection of dataframes based on multiple columns?

I have two dataframes as below. I'm trying to find the intersection of two dataframes based on either of the two columns, not only both of them.
So In this case, I want to return dataframe C, which has df A row 1 (as A row1 col1= row one col1 in B), df A row 2(A row 2 Col 2=row 1 Col2 in B) and df A row 4(as Col1 row 2 in B = Col 1 row 4 in A), and row 5 in A. But if I do a intersect of A and B, it will only return row 5 in A, as that's a match of both columns. How do I do this? Many thanks.Let me know if I'm not explaining the question very well.
A:
Col1 Col2
1 2
2 3
3 7
5 4
1 3
B:
Col1 Col2
1 3
5 1
C:
1 2
2 3
5 4
1 3
With the following data:
val df1 = sc.parallelize(Seq(1->2, 2->3, 3->7, 5->4, 1->3)).toDF("col1", "col2")
val df2 = sc.parallelize(Seq(1->3, 5->1)).toDF("col1", "col2")
Then you can join your datasets with a or condition:
val cols = df1.columns
df1.join(df2, cols.map(c => df1(c) === df2(c)).reduce(_ || _) )
.select(cols.map(df1(_)) :_*)
.distinct
.show
+----+----+
|col1|col2|
+----+----+
| 2| 3|
| 1| 2|
| 1| 3|
| 5| 4|
+----+----+
The join condition is generic and would work for any number of columns. The code maps each column to an equality between that column in df1 and the same one in df2 cols.map(c => df1(c) === df2(c)). The the reduce takes the logical or of all these equalities, which is what you want.
The select is there because otherwise the columns of both dataframes would be kept. Here I simply keep the ones from df1. I also added a distinct in case several lines of df2 would match a line of df1 or vice versa. Indeed, you may get a cartesian product.
Note that this method does not need any collection to the driver so it will work regardless of the size of the datasets. Yet, if df2 is small enough to be collected to the driver and braodcasted, you would get faster results with a method like this:
// to each column name, we map the set of values in df2.
val valueMap = df2.rdd
.flatMap(row => cols.map(name => name -> row.getAs[Any](name)))
.distinct
.groupByKey
.mapValues(_.toSet)
.collectAsMap
//we create a udf that looks up in valueMap
val filter = udf((name : String, value : Any) =>
valueMap(name).contains(value))
//Finally we apply the filter.
df1.where( cols.map(c => filter(lit(c), df1(c))).reduce(_||_))
.show
With this method, no shuffling of df1 and no cartesian product. If df2 is small, this is definitely the way to go.
You should perform two join operations individually on each of the join columns, and then perform a union of the two resulting Dataframes:
val dfA = List((1,2),(2,3),(3,7),(5,4),(1,3)).toDF("Col1", "Col2")
val dfB = List((1,3),(5,1)).toDF("Col1", "Col2")
val res1 = dfA.join(dfB, dfA.col("Col1")===dfB.col("Col1"))
val res2 = dfA.join(dfB, dfA.col("Col2")===dfB.col("Col2"))
val res = res1.union(res2)

How to merge two columns into a new DataFrame?

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame df1 has one column called col1, and the second one df2 has also 1 column called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumm, but it does not compile.
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6
If that there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
| a |
| b |
| c |
| d |
| e |
| f |
+---+
[edit]
Now you've made clear that you just want two columns, then with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value:
import org.apache.spark.sql.functions.monotonically_increasing_id
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
.drop("id")
.show
+---+---+
|one|two|
+---+---+
| a | d |
| b | e |
| c | f |
+---+---+
As far as I know, the only way to do want you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
val df3 = spark.createDataFrame(df1.collect() zip df2.collect()).withColumnRenamed("_1", "col1").withColumnRenamed("_2", "col2")
Depends in what you want to do.
If you want to merge two DataFrame you should use the join. There are the same join's types has in relational algebra (or any DBMS)
You are saying that your Data Frames just had one column each.
In that case you might want todo a cross join (cartesian product) with give you a two columns table of all possible combination of col1 and col2, or you might want the uniao (as referred by #Chondrops) witch give you a one column table with all elements.
I think all other join's types uses can be done specialized operations in spark (in this case two Data Frames one column each).

How to calculate product of columns followed by sum over all columns?

Table 1 --Spark DataFrame table
There is a column called "productMe" in Table 1; and there are also other columns like a, b, c and so on whose schema name is contained in a schema array T.
What I want is the inner product of columns(product each row of the two columns) in schema array T with the column productMe(Table 2). And sum each column of Table 2 to get Table 3.
Table 2 is not necessary if you have good idea to get Table 3 in one step.
Table 2 -- Inner product table
For example, the column "a·productMe" is (3*0.2, 6*0.6, 5*0.4) to get (0.6, 3.6, 2)
Table 3 -- sum table
For example, the column "sum(a·productMe)" is 0.6+3.6+2=6.2.
Table 1 is DataFrame of Spark, how can I get Table 3?
You can try something like the following :
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
import org.apache.spark.sql.functions.col
val columnsToSum = df.
columns. // <-- grab all the columns by their name
tail. // <-- skip productMe
map(col). // <-- create Column objects
map(c => round(sum(c * col("productMe")), 3).as(s"sum_${c}_productMe"))
val df2 = df.select(columnsToSum: _*)
df2.show()
# +---------------+---------------+---------------+
# |sum_a_productMe|sum_b_productMe|sum_c_productMe|
# +---------------+---------------+---------------+
# | 6.2| 6.3| 4.3|
# +---------------+---------------+---------------+
The trick is to use df.select(columnsToSum: _*) which means that you want to select all the columns on which we did the sum of columns times the productMe column. The :_* is a Scala-specific syntax to specify that we are passing repeated arguments because we don't have a fix number of arguments.
We can do it with simple SparkSql
val table1 = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
table1.show
table1.createOrReplaceTempView("table1")
val table2 = spark.sql("select a*productMe, b*productMe, c*productMe from table1") //spark is sparkSession here
table2.show
val table3 = spark.sql("select sum(a*productMe), sum(b*productMe), sum(c*productMe) from table1")
table3.show
All the other answers use sum aggregation that use groupBy under the covers.
groupBy always introduces a shuffle stage and usually (always?) is slower than corresponding window aggregates.
In this particular case, I also believe that window aggregates give better performance as you can see in their physical plans and details for their only one job.
CAUTION
Either solution uses one single partition to do the calculation that in turn makes them unsuitable for large datasets as their size together may easily exceed the memory size of a single JVM.
Window Aggregates
What follows is a window aggregate-based calculation which, in this particular case where we group over all the rows in a dataset, unfortunately gives the same physical plan. That makes my answer just a (hopefully) nice learning experience.
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
// yes, I did borrow this trick with columns from #eliasah's answer
import org.apache.spark.sql.functions.col
val columns = df.columns.tail.map(col).map(c => c * col("productMe") as s"${c}_productMe")
val multiplies = df.select(columns: _*)
scala> multiplies.show
+------------------+------------------+------------------+
| a_productMe| b_productMe| c_productMe|
+------------------+------------------+------------------+
|0.6000000000000001| 1.5|1.2000000000000002|
|3.5999999999999996|1.7999999999999998|0.6000000000000001|
| 2.0| 3.0| 2.5|
+------------------+------------------+------------------+
def sumOverRows(name: String) = sum(name) over ()
val multipliesCols = multiplies.
columns.
map(c => sumOverRows(c) as s"sum_${c}")
val answer = multiplies.
select(multipliesCols: _*).
limit(1) // <-- don't use distinct or dropDuplicates here
scala> answer.show
+-----------------+---------------+-----------------+
| sum_a_productMe|sum_b_productMe| sum_c_productMe|
+-----------------+---------------+-----------------+
|6.199999999999999| 6.3|4.300000000000001|
+-----------------+---------------+-----------------+
Physical Plan
Let's see the physical plan then (as it was the only reason why we wanted to see how to do the query using window aggregates, wasn't it?)
The following is the details for the only job 0.
If I understand your question correctly then following can be your solution
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
This gives input dataframe as you have (you can add more)
+---------+---+---+---+
|productMe|a |b |c |
+---------+---+---+---+
|3 |0.2|0.5|0.4|
|6 |0.6|0.3|0.1|
|5 |0.4|0.6|0.5|
+---------+---+---+---+
And
val productMe = df.columns.head
val colNames = df.columns.tail
var tempdf = df
for(column <- colNames){
tempdf = tempdf.withColumn(column, col(column)*col(productMe))
}
Above steps should give you Table2
+---------+------------------+------------------+------------------+
|productMe|a |b |c |
+---------+------------------+------------------+------------------+
|3 |0.6000000000000001|1.5 |1.2000000000000002|
|6 |3.5999999999999996|1.7999999999999998|0.6000000000000001|
|5 |2.0 |3.0 |2.5 |
+---------+------------------+------------------+------------------+
Table3 can be achieved as following
tempdf.select(sum("a").as("sum(a.productMe)"), sum("b").as("sum(b.productMe)"), sum("c").as("sum(c.productMe)")).show(false)
Table3 is
+-----------------+----------------+-----------------+
|sum(a.productMe) |sum(b.productMe)|sum(c.productMe) |
+-----------------+----------------+-----------------+
|6.199999999999999|6.3 |4.300000000000001|
+-----------------+----------------+-----------------+
Table2 can be achieved for any number of columns you have but Table3 would require you to define columns explicitly

How to union only those rows (from large table) with keys in left small table?

I have two dataframe, one is large, the other is small:
val small_df = sc.parallelize(List(("Alice", 15), ("Bob", 20)).toDF("name", "age")
val large_df = sc.parallelize(("Bob", 40), ("SomeOne", 50) , ... ).toDF("name", "age")
I want to add up these two dataframe but only those with the key in my small table, that is, I want my result to be like:
List(("Alice", 15), ("Bob", 60))
My first attempt is try to do union and reduceByKey, but I can't seem to find a way to union two tables and keep those rows with keys in the smaller one only.
Is there a way to do something like "left union" or other way to approach my answer?
One way to solve this problem would be to make an outer join and then sum the two resulting age columns together. Note that spark.implicits._ should be imported for use of $ and org.apache.spark.sql.functions.broadcast for the broadcast.
If any of the two dataframes contains duplicates (in the name column) the final dataframe will contains duplicates as well, which could be what you want or not. For duplicates in large_df those will only show up if there is a corresponding name in small_df, as specified in the question.
As an optimization, as one of the dataframes are small it can be broadcasted before the join to increase the performance.
val small_df = sc.parallelize(List(("Alice", 15), ("Bob", 20)).toDF("name", "age")
val large_df = sc.parallelize(("Bob", 40), ("SomeOne", 50)).toDF("name", "age")
val df = large_df.withColumnRenamed("age", "large_age").join(broadcast(small_df), Array("name"), "right_outer")
val df2 = df.withColumn("age", when($"large_age".isNotNull, $"age" + $"large_age").otherwise($"age")).select("name", "age")
df2.show
+-----+----+
| name| age|
+-----+----+
|Alice|15.0|
| Bob|60.0|
+-----+----+
That should give you what you want:
val existingKeys = small_df.
join(large_df, "name").
select($"name", large_df("age"))
val all = small_df.
union(existingKeys).
groupBy("name").
agg(sum("age") as "age")
scala> all.show
+-----+---+
| name|age|
+-----+---+
| Bob| 60|
|Alice| 15|
+-----+---+