Spark2 Dataframe/RDD process in groups - scala

I have the following table stored in Hive called ExampleData:
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
|       1|10:00| 20|
|       1|11:00| 21|
|       2|10:00| 24|
|       2|11:00| 24|
|       2|12:00| 20|
|       3|11:00| 24|
+--------+-----+---+
I need to be able to process the data by Site. Unfortunately partitioning it by Site doesn't work (there are over 100k sites, all with fairly small amounts of data).
For each site, I need to select the Time column and the Age column separately and feed them into a function (which ideally I want to run on the executors, not the driver).
I've got a stub of how I think I want it to work, but this solution would only run on the driver, so it's very slow. I need to find a way of writing it so it will run at executor level:
// fetch a list of distinct sites and return them to the driver
// (if you don't, you won't be able to loop around them as they're not on the executors)
val distinctSites = spark.sql("SELECT site_id FROM ExampleData GROUP BY site_id LIMIT 10")
  .collect

val allSiteData = spark.sql("SELECT site_id, time, age FROM ExampleData")

distinctSites.foreach(row => {
  // filter this site's rows and keep the result
  val siteData = allSiteData.filter("site_id = " + row.get(0))
  val times = siteData.select("time").collect()
  val ages  = siteData.select("age").collect()
  processTimesAndAges(times, ages)
})

def processTimesAndAges(times: Array[Row], ages: Array[Row]): Unit = {
  // do some processing
}
I've tried broadcasting the distinctSites across all nodes, but this did not prove fruitful.
This seems such a simple concept and yet I have spent a couple of days looking into this. I'm very new to Scala/Spark, so apologies if this is a ridiculous question!
Any suggestions or tips are greatly appreciated.

The RDD API provides a number of functions which can be used to perform operations in groups, starting with the low-level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Example:
rdd.map(tup => (tup._1, tup)).   // key each record by its grouping key (e.g. the site id)
  groupByKey().
  foreachPartition(iter => doSomeJob(iter))
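Applied to the question's data, a minimal sketch (assuming site_id is an Int, time a String and age an Int, and that processTimesAndAges is adapted to take plain arrays instead of Array[Row]) could look like this; both the grouping and the function call happen on the executors:

val bySite = spark.sql("SELECT site_id, time, age FROM ExampleData")
  .rdd
  .map(row => (row.getInt(0), (row.getString(1), row.getInt(2))))   // key by site_id
  .groupByKey()

bySite.foreach { case (siteId, rows) =>
  val times = rows.map(_._1).toArray   // all times for this site
  val ages  = rows.map(_._2).toArray   // all ages for this site
  // processTimesAndAges adapted here to take Array[String] / Array[Int]
  processTimesAndAges(times, ages)     // runs on the executor that holds this key
}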
With DataFrames you can use aggregate functions; the GroupedData class provides methods for the most common aggregations, including count, max, min, mean and sum.
Example:
val df = sc.parallelize(Seq(
  (1, 10.3, 10), (1, 11.5, 10),
  (2, 12.6, 20), (3, 2.6, 30))
).toDF("Site_ID", "Time", "Age")
df.show()
+-------+----+---+
|Site_ID|Time|Age|
+-------+----+---+
|      1|10.3| 10|
|      1|11.5| 10|
|      2|12.6| 20|
|      3| 2.6| 30|
+-------+----+---+
df.groupBy($"Site_ID").count.show
+-------+-----+
|Site_ID|count|
+-------+-----+
|      1|    2|
|      3|    1|
|      2|    1|
+-------+-----+
Note: as you mentioned, the solution is very slow, so you need to partition the data; in your case range partitioning is a good option.
http://dev.sortable.com/spark-repartition/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
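If you stay with DataFrames, here is a minimal sketch of the range-partitioning idea (repartitionByRange is available from Spark 2.3 onwards; the partition count of 200 is just an illustrative value):

import spark.implicits._

// keep all rows for a given site_id in the same partition
val partitioned = spark.table("ExampleData")
  .repartitionByRange(200, $"site_id")
  .sortWithinPartitions($"site_id")

Per-group work done afterwards then finds each site's rows in a single partition.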

Related

How can I transpose a data frame in pyspark?

I couldn't find any function in pyspark for transposing a dataframe.
Cal   Cal2  Cal3
'A'     12    11
'U'     10     9
'O'      5     5
'ER'     6     5
and the desired (transposed) output:
Cal    'A'  'U'  'O'  'ER'
Cal2    12   10    5     6
Cal3    11    9    5     5
In pandas this is very easy (df.T), but I am not sure how to do it in pyspark!
Generation of the sample dataframe
df = spark.createDataFrame([('A' ,12 ,11),('U' ,10 ,9 ),('O' , 5 ,5 ),('ER', 6 ,5 )], ['Cal','Cal2','Cal3'])
Option 1: pyspark.pandas.DataFrame.T
For large dataframes compute.max_rows might be required
import pyspark.pandas as ps
ps.get_option("compute.max_rows") # 1000
ps.set_option("compute.max_rows", 2000)
(df
 .to_pandas_on_spark()
 .set_index('Cal')
 .T
 .reset_index()
 .rename(columns={"index": "Cal"})
 .to_spark()
 .show())
+----+---+---+---+---+
| Cal| A| U| O| ER|
+----+---+---+---+---+
|Cal2| 12| 10| 5| 6|
|Cal3| 11| 9| 5| 5|
+----+---+---+---+---+
Option 2: pyspark, the hard way
import pyspark.sql.functions as F
header_col = 'Cal'
cols_minus_header = df.columns
cols_minus_header.remove(header_col)
df1 = (df
       .groupBy()
       .pivot('Cal')
       .agg(F.first(F.array(cols_minus_header)))
       .withColumn(header_col, F.array(*map(F.lit, cols_minus_header)))
       )
df1.show(truncate = False)
+--------+------+------+-------+------------+
| A| ER| O| U| Cal|
+--------+------+------+-------+------------+
|[12, 11]|[6, 5]|[5, 5]|[10, 9]|[Cal2, Cal3]|
+--------+------+------+-------+------------+
df2 = df1.select(F.arrays_zip(*df1.columns).alias('az')).selectExpr('inline(az)')
df2.show(truncate = False)
+---+---+---+---+----+
|A |ER |O |U |Cal |
+---+---+---+---+----+
|12 |6 |5 |10 |Cal2|
|11 |5 |5 |9 |Cal3|
+---+---+---+---+----+
You can unpivot the dataframe and then pivot it based on a different column.
from pyspark.sql import functions as F
data = [('A', 12, 11),
        ('U', 10, 9),
        ('O', 5, 5),
        ('ER', 6, 5)]
df = spark.createDataFrame(data, ("Cal", "Cal2", "Cal3",))
key_column = "Cal"
unpivot_cols = [x for x in df.columns if x != key_column]
unpivot_col_expr = " ,".join([f"'{c}', {c}" for c in unpivot_cols])
unpivot_expr = f"stack({len(unpivot_cols)}, {unpivot_col_expr}) as (key,value)"
unpivoted_df = df.select("Cal", F.expr(unpivot_expr))
unpivoted_df.groupBy("key").pivot(key_column).agg(F.first("value")).withColumnRenamed("key", key_column).show()
"""
+----+---+---+---+---+
| Cal| A| ER| O| U|
+----+---+---+---+---+
|Cal3| 11| 5| 5| 9|
|Cal2| 12| 6| 5| 10|
+----+---+---+---+---+
"""
Since Spark 3.2.1, pyspark supports the pandas API as well.
If your dataframe is small, you can make use of it.
This method is based on an expensive operation due to the nature of big data. Internally it needs to generate each row for each value, and then group twice - it is a huge operation. To prevent misusage, this method has the ‘compute.max_rows’ default limit of input length, and raises a ValueError.
See the below link for more details -
pyspark.pandas.DataFrame.transpose
In addition to the above, you can also use Koalas (available in Databricks), which is similar to pandas but makes more sense for distributed processing, and is available in PySpark (from 3.0.0 onwards). Something like the below -
kdf = df.to_koalas()
Transpose_kdf = kdf.transpose()
TransposeDF = Transpose_kdf.to_spark()
Koalas documentation - Databricks Koalas
One thing to note is that you need to define partitions in order to use Koalas efficiently, else there could be serious performance issues.
Another option that pyspark provides natively is the pivot and stack functions. See the below documentation for details -
Stack
Pivot
I'll leave these two for you to explore in detail.

Spark Scala - Bitwise operation in filter

I have a source dataset aggregated with columns col1 and col2. The col2 values are aggregated by a bitwise OR operation. I need to apply a filter on the col2 values to select records whose bits are on for 8, 4, 2.
initial source raw data
val SourceRawData = Seq(
  ("Flag1", 2), ("Flag1", 4), ("Flag1", 8),
  ("Flag2", 8), ("Flag2", 16), ("Flag2", 32),
  ("Flag3", 2), ("Flag4", 32),
  ("Flag5", 2), ("Flag5", 8)
).toDF("col1", "col2")
SourceRawData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 2|
|Flag1| 4|
|Flag1| 8|
|Flag2| 8|
|Flag2| 16|
|Flag2| 32|
|Flag3| 2|
|Flag4| 32|
|Flag5| 2|
|Flag5| 8|
+-----+----+
Aggregated source data based on the SourceRawData above, after collapsing the col1 values to a single row per col1 value (this is provided by another team); the col2 values are aggregated by a bitwise OR operation. Note: here I am providing the output, not the actual aggregation logic.
val AggregatedSourceData = Seq(
  ("Flag1", 14L), ("Flag2", 56L), ("Flag3", 2L), ("Flag4", 32L), ("Flag5", 10L)
).toDF("col1", "col2")
AggregatedSourceData.show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag4| 32|
|Flag5| 10|
+-----+----+
Now I need to apply a filter on the aggregated dataset above to get the rows whose col2 values have any of the (8, 4, 2) bits on, as per the original source raw data values.
expected output
+-----+----+
|Col1 |Col2|
+-----+----+
|Flag1|14 |
|Flag2|56 |
|Flag3|2 |
|Flag5|10 |
+-----+----+
I tried something like the below and it seems to give the expected output, but I'm unable to understand how it's working. Is this the correct approach? If so, how does it work? (I am not that knowledgeable in bitwise operations, so I'm looking for expert help to understand, please.)
var myfilter: Long = 2 | 4 | 8
AggregatedSourceData.filter($"col2".bitwiseAND(myfilter) =!= 0).show()
+-----+----+
| col1|col2|
+-----+----+
|Flag1| 14|
|Flag2| 56|
|Flag3| 2|
|Flag5| 10|
+-----+----+
I think you do not need to use bitwiseAND to filter; instead, just use contains/in with "a set of decimal representations of the bit numbers you want", or == to "a decimal representation of the bit number you want".
Also, if you try your existing calculations without Scala or Spark, you will see where you understood things wrong, e.g. use:
https://www.rapidtables.com/calc/math/binary-calculator.html
and you will find you defined your filter "wrong".
18 & 18 is 18
18 | 2 is 18
In your dataset the flag column of each row will only hold one value, so just filter the flag column for values that are in the set of numbers you want, for example:
$"flag" === 18
// or
$"flag".isin(18, 2, 20)
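Following the suggestion to redo the calculations outside Spark, here is the plain-Scala bitwise arithmetic for the values in the question (just the arithmetic, no Spark); it shows what $"col2".bitwiseAND(2 | 4 | 8) evaluates to per row:

val myfilter: Long = 2 | 4 | 8   // binary 1110 = 14
14 & myfilter   // = 14 (non-zero) -> Flag1 kept  (14 = 2|4|8)
56 & myfilter   // =  8 (non-zero) -> Flag2 kept  (56 = 8|16|32)
 2 & myfilter   // =  2 (non-zero) -> Flag3 kept
32 & myfilter   // =  0 (zero)     -> Flag4 filtered out
10 & myfilter   // = 10 (non-zero) -> Flag5 kept  (10 = 2|8)

A row survives the =!= 0 filter exactly when its col2 shares at least one set bit with 2, 4 or 8.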

Spark - group and aggregate only several smallest items

In short
I have the cartesian product (cross join) of two dataframes and a function which gives a score for a given element of this product. I now want to get a few "best matched" elements of the second DF for every member of the first DF.
In details
What follows is a simplified example as my real code is somewhat bloated with additional fields and filters.
Given two sets of data, each having some id and value:
// simple rdds of tuples
val rdd1 = sc.parallelize(Seq(("a", 31),("b", 41),("c", 59),("d", 26),("e",53),("f",58)))
val rdd2 = sc.parallelize(Seq(("z", 16),("y", 18),("x",3),("w",39),("v",98), ("u", 88)))
// convert them to dataframes:
val df1 = spark.createDataFrame(rdd1).toDF("id1", "val1")
val df2 = spark.createDataFrame(rdd2).toDF("id2", "val2")
and some function which for pair of the elements from the first and second dataset gives their "matching score":
def f(a:Int, b:Int):Int = (a * a + b * b * b) % 17
// convert it to udf
val fu = udf((a:Int, b:Int) => f(a, b))
we can create the product of two sets and calculate score for every pair:
val dfc = df1.crossJoin(df2)
val r = dfc.withColumn("rez", fu(col("val1"), col("val2")))
r.show
+---+----+---+----+---+
|id1|val1|id2|val2|rez|
+---+----+---+----+---+
| a| 31| z| 16| 8|
| a| 31| y| 18| 10|
| a| 31| x| 3| 2|
| a| 31| w| 39| 15|
| a| 31| v| 98| 13|
| a| 31| u| 88| 2|
| b| 41| z| 16| 14|
| c| 59| z| 16| 12|
...
And now we want to have this result grouped by id1:
r.groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches")).show
+---+--------------------+
|id1| matches|
+---+--------------------+
| f|[[v,2], [u,8], [y...|
| e|[[y,5], [z,3], [x...|
| d|[[w,2], [x,6], [v...|
| c|[[w,2], [x,6], [v...|
| b|[[v,2], [u,8], [y...|
| a|[[x,2], [y,10], [...|
+---+--------------------+
But really we want to retain only a few (say 3) of the "matches", those having the best score (say, the least score).
The question is
How can I get the "matches" sorted and reduced to the top-N elements? Probably it is something about collect_list and sort_array, though I don't know how to sort by an inner field.
Is there a way to ensure optimization in the case of large input DFs - e.g. choosing minimums directly while aggregating? I know it could be done easily if I wrote the code without Spark - keeping a small array or priority queue for every id1 and adding each element where it should be, possibly dropping out some previously added ones.
E.g. it's ok that the cross join is a costly operation, but I want to avoid wasting memory on results most of which I'm going to drop in the next step. My real use case deals with DFs of fewer than 1 mln entries, so the cross join is still viable, but as we want to select only the 10-20 top matches for each id1, it seems quite desirable not to keep unnecessary data between steps.
To start we need to take only the first n rows. To do this we partition the DF by 'id1' and sort each group by the score. We use this to add a row number column to the DF, so that we can use the where function to take the first n rows. Then you can continue with the same code you wrote: grouping by 'id1' and collecting the list. Only now you already have only the highest rows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val n = 3
val w = Window.partitionBy($"id1").orderBy($"rez".desc)
val res = r.withColumn("rn", row_number.over(w))
  .where($"rn" <= n)
  .groupBy("id1")
  .agg(collect_set(struct("id2", "rez")).as("matches"))
A second option that might be better because you won't need to group the DF twice:
import org.apache.spark.sql.Row

val sortTakeUDF = udf { (xs: Seq[Row], n: Int) =>
  xs.sortBy(_.getAs[Int]("rez")).reverse.take(n).map { case Row(id2: String, rez: Int) => (id2, rez) }
}
r.groupBy("id1").agg(sortTakeUDF(collect_set(struct("id2", "rez")), lit(n)).as("matches"))
Here we create a udf that takes the array column and an integer value n. The udf sorts the array by the 'rez' field and returns only the first n elements.

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+------+------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this:
+------+------+-----------------+
|pc_pid|pc_aid|           reason|
+------+------+-----------------+
|  4569|  1101|        body pain|
| 63961|  1101|review lab result|
|140677|  4364|        body pain|
|127113|     7|       sick visit|
| 96097|   480|             test|
|  8309|  3129|            other|
| 45218|    89|        follow up|
|147036|  3289|      annual meet|
| 88493|  3669|review lab result|
| 29973|  3129|        REF BY DR|
|127444|  3129|         skin chk|
| 36095|    89|            other|
+------+------+-----------------+
In Reasons I have only 9 records and in the IDs dataframe I have 40k records; I want to assign a reason randomly to each and every id.
The following solution tries to be more robust to the number of reasons (i.e. you can have as many reasons as you can reasonably fit in your cluster). If you just have a few reasons (like the OP asks), you can probably broadcast them or embed them in a udf and easily solve this problem.
The general idea is to create a sequential index for the reasons, add random values from 0 to N (where N is the number of reasons) to the IDs dataset, and then join the two tables using these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
d2.show()
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numberOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
import org.apache.spark.sql.functions.rand
val d1WithRand = d1.select($"id", (rand * numberOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and then remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
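For the OP's actual tables, a hedged sketch along the same lines (the DataFrame names ids and reasons are assumptions here; the column names pc_pid / pc_aid / reasons come from the question; everything else is the same zipWithIndex + rand pattern as above):

import org.apache.spark.sql.functions.rand
import spark.implicits._

val numberOfReasons = reasons.count()

// index the 9 reasons sequentially
val reasonsIndexed = spark.createDataFrame(
  reasons.rdd.map(_.getString(0)).zipWithIndex
).toDF("reason", "idx")

// attach a random index between 0 and numberOfReasons-1 to every id
val idsWithRand = ids.select($"pc_pid", $"pc_aid",
  (rand * numberOfReasons).cast("int").as("rnd"))

idsWithRand
  .join(reasonsIndexed, idsWithRand("rnd") === reasonsIndexed("idx"))
  .drop("rnd", "idx")
  .show()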
pyspark random join itself:
from random import uniform
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (huge dataframe) and dataB (smaller dataframe, sorted by any column):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

dfB = dataB.withColumn(
    "index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over the sorted window with a high degree of parallelism. F.rand() is another highly parallel function, so adding the index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: it can be very expensive to convert a dataframe to an rdd, zipWithIndex, and then back to a df.
monotonically_increasing_id: it needs to be used with row_number, which will collect all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6

Scala: Any better way to join two DataFrames by the relationship from the third one

I have two DataFrames and want to outer join them, but the join mapping is in another dataframe.
Now I am using the way below; it works, but I hope there is a more efficient way, as I have >1,000,000 rows:
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
scala> ta.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
scala> tb.show
+---+---+
| C| D|
+---+---+
| 2| 1|
+---+---+
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
scala> tc.show
+---+---+---+
| D| E| F|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
scala> val tmp = ta.join(tb, Seq("C"), "left_outer")
tmp: org.apache.spark.sql.DataFrame = [C: int, A: int, B: int, D: int]
scala> tmp.show
+---+---+---+----+
| C| A| B| D|
+---+---+---+----+
| 1| 1| 1|null|
| 2| 2| 2| 1|
+---+---+---+----+
scala> tmp.join(tc, Seq("D"), "outer").show
+----+----+----+----+----+----+
| D| C| A| B| E| F|
+----+----+----+----+----+----+
|null| 1| 1| 1|null|null|
| 1| 2| 2| 2| 1| 1|
| 2|null|null|null| 2| 2|
+----+----+----+----+----+----+
As Umberto noted, a good reference on how to improve performance of your joins is Holden Karau and Rachel Warren's High Performance Spark > Chapter 4. Joins (SQL & Core).
From the standpoint of your code, running it as you noted or the SQL equivalent (as noted below) should result in about the same performance.
// Create initial tables
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
// _.createOrReplaceTempView
ta.createOrReplaceTempView("ta")
tb.createOrReplaceTempView("tb")
tc.createOrReplaceTempView("tc")
// SQL Query
spark.sql("""
  select tc.D, ta.A, ta.B, ta.C, tc.E, tc.F
  from ta
  left outer join tb
    on tb.C = ta.C
  full outer join tc
    on tc.D = tb.D
""")
The reason is that the Spark SQL Catalyst Optimizer takes the DataFrame query and builds up an optimized logical plan. A number of physical plans are developed, and the Spark SQL engine's cost optimizer chooses the best physical plan and generates the code to produce the RDDs.
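If you want to check this yourself, explain(true) prints the parsed, analyzed, optimized logical and physical plans, so you can compare the DataFrame version with the SQL version (both should end up with essentially the same physical plan); for example, for the DataFrame version from the question:

// inspect the plans Catalyst produces for the two-step join
ta.join(tb, Seq("C"), "left_outer")
  .join(tc, Seq("D"), "outer")
  .explain(true)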
Saying this, the key concern is that when you're working with a lot of rows that use up a lot of memory, you have to take the partitioning into account. For example, if you can ensure that the mapping DataFrame (tc) has the same / a similar partitioning scheme as the other DataFrames (ta, tb), then you can have a co-located join (this is Figure 4-3 within High Performance Spark > Chapter 4. Joins).
If the partitions for your three DataFrames (ta, tb, tc) all have different partitioning, the keys for your DataFrames will not have a 1-to-1 matching between the partitions. That is, this will result in a shuffle join (this is Figure 4-2 within High Performance Spark > Chapter 4. Joins), which potentially could be more costly.
Basically, from the standpoint of your query, the concern is less about the query itself and more about the partitioning schemes for your DataFrames. But before experimenting too much with the partitioning schemes of your DataFrames, experiment with your queries to see if the default Spark SQL / DataFrame queries are able to take care of the partitioning by themselves.
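As a hedged sketch of that experiment (the partition count of 200 is illustrative, and whether the final joins actually avoid an extra shuffle should be checked with explain), you could pre-partition each DataFrame on its join key before joining:

import org.apache.spark.sql.functions.col

// repartition each DataFrame on the key it will be joined on
val taByC = ta.repartition(200, col("C"))
val tbByC = tb.repartition(200, col("C"))
val tcByD = tc.repartition(200, col("D"))

taByC.join(tbByC, Seq("C"), "left_outer")
  .join(tcByD, Seq("D"), "outer")
  .show()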