Spark - group and aggregate only several smallest items - scala

In short
I have cartesian-product (cross-join) of two dataframes and function which gives some score for given element of this product. I want now to get few "best matched" elements of the second DF for every member of the first DF.
In details
What follows is a simplified example as my real code is somewhat bloated with additional fields and filters.
Given two sets of data, each having some id and value:
// simple rdds of tuples
val rdd1 = sc.parallelize(Seq(("a", 31),("b", 41),("c", 59),("d", 26),("e",53),("f",58)))
val rdd2 = sc.parallelize(Seq(("z", 16),("y", 18),("x",3),("w",39),("v",98), ("u", 88)))
// convert them to dataframes:
val df1 = spark.createDataFrame(rdd1).toDF("id1", "val1")
val df2 = spark.createDataFrame(rdd2).toDF("id2", "val2")
and some function which for pair of the elements from the first and second dataset gives their "matching score":
def f(a:Int, b:Int):Int = (a * a + b * b * b) % 17
// convert it to udf
val fu = udf((a:Int, b:Int) => f(a, b))
we can create the product of two sets and calculate score for every pair:
val dfc = df1.crossJoin(df2)
val r = dfc.withColumn("rez", fu(col("val1"), col("val2")))
r.show
+---+----+---+----+---+
|id1|val1|id2|val2|rez|
+---+----+---+----+---+
| a| 31| z| 16| 8|
| a| 31| y| 18| 10|
| a| 31| x| 3| 2|
| a| 31| w| 39| 15|
| a| 31| v| 98| 13|
| a| 31| u| 88| 2|
| b| 41| z| 16| 14|
| c| 59| z| 16| 12|
...
And now we want to have this result grouped by id1:
r.groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches")).show
+---+--------------------+
|id1| matches|
+---+--------------------+
| f|[[v,2], [u,8], [y...|
| e|[[y,5], [z,3], [x...|
| d|[[w,2], [x,6], [v...|
| c|[[w,2], [x,6], [v...|
| b|[[v,2], [u,8], [y...|
| a|[[x,2], [y,10], [...|
+---+--------------------+
But really we want only to retain only few (say 3) of "matches", those having the best score (say, least score).
The question is
How to get the "matches" sorted and reduced to top-N elements? Probably it is something about collect_list and sort_array, though I don't know how to sort by inner field.
Is there a way to ensure optimization in case of large input DFs - e.g. choosing minimums directly while aggregating. I know it could be done easily if I wrote the code without spark - keeping small array or priority queue for every id1 and adding element where it should be, possibly dropping out some previously added.
E.g. it's ok that cross-join is costly operation, but I want to avoid wasting memory on the results most of which I'm going to drop in the next step. My real use case deals with DFs with less than 1 mln entries so cross-join is yet viable but as we want to select only 10-20 top matches for each id1 it seems to be quite desirable not to keep unnecessary data between steps.

For start we need to take only the first n rows. To do this we are partitioning the DF by 'id1' and sorting the groups by the res. We use it to add row number column to the DF, like that we can use where function to take the first n rows. Than you can continue doing the same code your wrote. Grouping by 'id1' and collecting the list. Only now you already have the highest rows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val n = 3
val w = Window.partitionBy($"id1").orderBy($"res".desc)
val res = r.withColumn("rn", row_number.over(w)).where($"rn" <= n).groupBy("id1").agg(collect_set(struct("id2", "res")).as("matches"))
A second option that might be better because you won't need to group the DF twice:
val sortTakeUDF = udf{(xs: Seq[Row], n: Int)} => xs.sortBy(_.getAs[Int]("res")).reverse.take(n).map{case Row(x: String, y:Int)}}
r.groupBy("id1").agg(sortTakeUDF(collect_set(struct("id2", "res")), lit(n)).as("matches"))
In here we create a udf that take the array column and an integer value n. The udf sorts the array by your 'res' and returns only the first n elements.

Related

How to transform two data-frames into a List of Tuples of Rows

I have a list of tuple of numbers which brings the data from the dataframe. I extract the data from the dataframe which corresponds to the numbers(SNO). I want to pass that data into a function which accepts Row as a parameter.
I am thinking to convert that dataframe into List of tuple of Rows => List(Tuple2(Row, Row))
So that I can pass those rows into a function in interative basis.
Any efficient method would e appreciated.
Imagine I have
val list0: List[(Int, Int)] = List((1,2),(5,4),(3,6))
& I have two sample dataframe
+-------+-----+-------+
|Country| Item|groupNo|
+-------+-----+-------+
| India|mango| 1|
| India|Apple| 5|
| India| musk| 3|
+-------+-----+-------+
and another dataframe is like
+-------+-----+-------+
|Country| Item|groupNo|
+-------+-----+-------+
| India| musk| 2|
| India|mango| 6|
| India|mango| 4|
+-------+-----+-------+
So I want result like
List((Row(India,mango,1), Row(India,musk,2)), (Row(India,Apple,5), Row(India,mango,4)), etc...)
So that I can pass that List(Tuple2(Row, Row)) to a certain function as it is.

How to efficiently select only columns which contain variation in values for a given subset of rows?

For context, my ultimate goal is to remove nearly-duplicated rows from a very large dataframe. Here is some dummy data:
+---+--------+----------+---+-------+-------+---+-------+-------+
|key|unique_1| unique_2|...|col_125|col_126|...|col_414|col_415|
+---+--------+----------+---+-------+-------+---+-------+-------+
| 1| 123|01-01-2000|...| 1| true|...| 100| 25|
| 2| 123|01-01-2000|...| 0| false|...| 100| 25|
| 3| 321|12-12-2012|...| 3| true|...| 99| 1|
| 4| 321|12-12-2012|...| 3| false|...| 99| 5|
+---+--------+----------+---+-------+-------+---+-------+-------+
In this data, combinations of observations from unique_1 and unique_2 should be distinct, but they aren't always. When they are repeated, they have the same values for the vast majority of the columns, but have variation on a very small set of other columns. I am trying to develop a strategy to deal with the near-duplicates, but it is complicated because each set of near-duplicates has a different set of columns which contain variation.
I'm trying to see the columns that contain variation for a single set of near-duplicates at a time - like this:
+---+-------+-------+
|key|col_125|col_126|
+---+-------+-------+
| 1| 1| true|
| 2| 20| false|
+---+-------+-------+
or this:
+---+-------+-------+
|key|col_126|col_415|
+---+-------+-------+
| 3| true| 1|
| 4| false| 5|
+---+-------+-------+
I've successfully gotten this result with a few different approaches. This was my first attempt:
def findColumnsWithDiffs(df: DataFrame): DataFrame = {
df.columns.foldLeft(df){(a,b) =>
a.select(b).distinct.count match {
case 1 => a.drop(b)
case _ => a
}
}
}
val smallFrame = originalData.filter(($"key" === 1) || ($"key" === 2))
val desiredOutput = findColumnsWithDiffs(smallFrame)
And this works insofar as it gave me what I want, but it is so unbelievably slow. It is approximately 10x slower for the function above to run then it takes to display all of the data in smallFrame (and I think that the performance only gets worse with the size of the data - although I have not tested that hypothesis thoroughly).
I thought that using fold instead of foldLeft might yield some improvements, so I rewrote the findColumnsWithDiffs function like this:
def findColumnsWithDiffsV2(df: DataFrame): DataFrame = {
val colsWithDiffs = df.columns.map(colName => List(colName)).toList.fold(Nil){(a,b) =>
df.select(col(b(0))).distinct.count match {
case 1 => a
case _ => a ++ b
}
}
df.select(colsWithDiffs.map(colName => col(colName)):_*)
}
But performance was the same. I also tried was to map each column to the number of distinct values it has and work from there, but again performance was the same. At this point I'm out of ideas. My hunch is that the filter is being performed for each column which is why it is so terribly slow, but I don't know how to verify that theory and/or change what I'm doing to fix it if I'm correct. Does anyone have ideas to improve the efficiency of what I'm doing?
I'm currently using spark 2.1.0 / scala 2.11.8
All of the approaches to identifying the distinct values are fine, the issue is with the lazy evaluation of the filter. To improve performance, call smallFrame.cache before using findColsWithDiffs. This will save the filtered data in memory, which will be fine because it is only a few rows at a time.

how to select elements in scala dataframe?

Reference to How do I select item with most count in a dataframe and define is as a variable in scala?
Given a table below, how can I select nth src_ip and put it as a variable?
+--------------+------------+
| src_ip|src_ip_count|
+--------------+------------+
| 58.242.83.11| 52|
|58.218.198.160| 33|
|58.218.198.175| 22|
|221.194.47.221| 6|
You can create another column with row number as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val tempdf = df.withColumn("row_number", monotonically_increasing_id())
tempdf.withColumn("row_number", row_number().over(Window.orderBy("row_number")))
which should give you tempdf as
+--------------+------------+----------+
| src_ip|src_ip_count|row_number|
+--------------+------------+----------+
| 58.242.83.11| 52| 1|
|58.218.198.160| 33| 2|
|58.218.198.175| 22| 3|
|221.194.47.221| 6| 4|
+--------------+------------+----------+
Now you can use filter to filter in the nth row as
.filter($"row_number" === n)
That should be it.
For extracting the ip, lets say your n is 2 as
val n = 2
Then the above process would give you
+--------------+------------+----------+
| src_ip|src_ip_count|row_number|
+--------------+------------+----------+
|58.218.198.160| 33| 2|
+--------*------+------------+----------+
getting the ip address* is explained in the link you provided in the question by doing
.head.get(0)
Safest way is to use zipWithIndex in the dataframe converted into rdd and then convert back to dataframe, so that we have unmistakable row_number column.
val finalDF = df.rdd.zipWithIndex().map(row => (row._1(0).toString, row._1(1).toString, (row._2+1).toInt)).toDF("src_ip", "src_ip_count", "row_number")
Rest of the steps are already explained before.

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+-------------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
|pc_pid|pc_aid| reason
+------+------+-------------------
| 4569| 1101| body pain
| 63961| 1101| review lab result
|140677| 4364| body pain
|127113| 7| sick visit
| 96097| 480| test
| 8309| 3129| other
| 45218| 89| follow up
|147036| 3289| annual meet
| 88493| 3669| review lab result
| 29973| 3129| REF BY DR
|127444| 3129| skin chk
| 36095| 89| other
In the reasons I have only 9 records and in the ID dataframe I have 40k records, I want to assign reason randomly to each and every id.
The following solution tries to be more robust to the number of reasons (ie. you can have as many reasons as you can reasonably fit in your cluster). If you just have few reasons (like the OP asks), you can probably broadcast them or embed them in a udf and easily solve this problem.
The general idea is to create an index (sequential) for the reasons and then random values from 0 to N (where N is the number of reasons) on the IDs dataset and then join the two tables using these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numerOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
val d1WithRand = d1.select($"id", (rand * numerOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and the remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
pyspark random join itself
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (huge dataframe) and dataB (smaller dataframe, sorted by any column):
dfB = dataB.withColumn(
"index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over sorted window with high degree of parallelism. F.rand() is another highly parallel function, so adding index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: Can be very expensive to convert dataframe to rdd, zipWithIndex, and then to df.
monotonically_increasing_id: Need to be used with row_number which will collect all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6

tidyr::spread() with plain Scala and Spark (rows to columns)

my prototype (written in R with the packages dplyr and tidyr) is hitting a wall in terms of computational complexity - even on my powerfull working station. Therefore, I want to port the code to Spark using Scala.
I looked up all transformations, actions, functions (SparkSQL) and column operations (also SparkSQL) and found all function equivalents except the one for the tidyr::spread() function, available in R.
df %>% tidyr::spread(key = COL_KEY , value = COL_VAL) basically spreads a key-value pair across multiple columns. E.g. the table
COL_KEY | COL_VAL
-----------------
A | 1
B | 1
A | 2
will be transformed to by
A | B
------------
1 | 0
0 | 1
2 | 1
In case there is no "out-of-the-box"-solution available: Could you point me in the right direction? Maybe a user defined function?
I'm free which Spark (and Scala) version to choose (therefore I'd go for the latest, 2.0.0).
Thanks!
Out-of-the-box but requires a shuffle:
df
// A dummy unique key to perform grouping
.withColumn("_id", monotonically_increasing_id)
.groupBy("_id")
.pivot("COL_KEY")
.agg(first("COL_VAL"))
.drop("_id")
// +----+----+
// | A| B|
// +----+----+
// | 1|null|
// |null| 1|
// | 2|null|
// +----+----+
You can optionally follow it with .na.fill(0).
Manually without shuffle:
// Find distinct keys
val keys = df.select($"COL_KEY").as[String].distinct.collect.sorted
// Create column expressions for each key
val exprs = keys.map(key =>
when($"COL_KEY" === key, $"COL_VAL").otherwise(lit(0)).alias(key)
)
df.select(exprs: _*)
// +---+---+
// | A| B|
// +---+---+
// | 1| 0|
// | 0| 1|
// | 2| 0|
// +---+---+