Regarding train-test split of data in Spark Scala

I have a Spark Scala DataFrame like this:
val df = Seq(
  (10, 12),
  (44, 14),
  (32, 25),
  (31, 24),
  (75, 25),
  (80, 20),
  (35, 55),
  (32, 25),
  (67, 72),
  (32, 21)
).toDF("x1", "x2")
df.show()
+---+---+
| x1| x2|
+---+---+
| 10| 12|
| 44| 14|
| 32| 25|
| 31| 24|
| 75| 25|
| 80| 20|
| 35| 55|
| 32| 25|
| 67| 72|
| 32| 21|
+---+---+
I need to split this data into training and testing sets, where the training data is the first 8 rows (80%) and the testing data is the last 2 rows (20%).
I tried val Array(train, test) = df.randomSplit(Array(0.8, 0.2)), but it selects 8 rows randomly as training (instead of the first 8 rows) and the rest as testing.
So can anyone suggest how to select the partitions as I mentioned above?
Thank you

Maybe there is a better way, but nothing else comes to my mind since you require the data to stay ordered.
import org.apache.spark.sql.functions.monotonically_increasing_id

val cnt = df.count
val testSize = (0.2 * cnt).toInt
val trainSize = (cnt - testSize).toInt
// sorting by monotonically_increasing_id preserves the existing row order
val trainDf = df.sort(monotonically_increasing_id()).limit(trainSize)
val testDf = df.sort(monotonically_increasing_id().desc).limit(testSize)
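If you want to avoid the two sorts, an alternative sketch (my own variant, not from the original answer) is to attach an explicit row index with zipWithIndex and filter on it; trainSize here is the value computed above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField}

// attach a 0-based index that follows the DataFrame's current row order
val withIdx = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  df.schema.add(StructField("idx", LongType, nullable = false))
)
val trainDf2 = withIdx.filter(col("idx") < trainSize).drop("idx")
val testDf2  = withIdx.filter(col("idx") >= trainSize).drop("idx")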

Related

Last unique entries above current row in Spark

I have a Spark dataframe with the following data:
val df = sc.parallelize(Seq(
  (1, "A", "2022-01-01", 30, 0),
  (1, "A", "2022-01-02", 20, 30),
  (1, "B", "2022-01-03", 50, 20),
  (1, "A", "2022-01-04", 10, 70),
  (1, "B", "2022-01-05", 30, 60),
  (1, "A", "2022-01-06", 0, 40),
  (1, "C", "2022-01-07", 100, 30),
  (2, "D", "2022-01-08", 5, 0)
)).toDF("id", "event", "eventTimestamp", "amount", "expected")
display(df)
+---+-----+--------------+------+--------+
| id|event|eventTimestamp|amount|expected|
+---+-----+--------------+------+--------+
|  1|    A|    2022-01-01|    30|       0|
|  1|    A|    2022-01-02|    20|      30|
|  1|    B|    2022-01-03|    50|      20|
|  1|    A|    2022-01-04|    10|      70|
|  1|    B|    2022-01-05|    30|      60|
|  1|    A|    2022-01-06|     0|      40|
|  1|    C|    2022-01-07|   100|      30|
|  2|    D|    2022-01-08|     5|       0|
+---+-----+--------------+------+--------+
I want to find the following for each row: the sum of the last entries (above the current row) for each id and each unique event. The desired outcome is in the column "expected".
E.g. for the order "C" I'd like to get the latest amounts for "A" and "B": 0 + 30 = 30.
I tried the following query; however, it sums up the amounts of all previous orders, including duplicates (I'm not sure if it's possible to apply a filter on the sum to take only distinct values):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val days = (x: Int) => x * 86400
val idWindow = Window.partitionBy("id")
  .orderBy(col("eventTimestamp").cast("timestamp").cast("long"))
  .rangeBetween(Window.unboundedPreceding, -days(1))
val res = df.withColumn("totalAmount", sum($"amount").over(idWindow))
Please note that the rangeBetween functionality is important for my use-case and should be preserved.
The trick is to convert the amounts to diffs within (id, event) pairs, which allows you to calculate a moving sum in the next step. That moving sum maintains the latest amount of each unique event.
// the two windows below were not shown in the original snippet; these definitions match the output
val wIdEvent = Window.partitionBy("id", "event").orderBy("eventTimestamp")
val wId = Window.partitionBy("id").orderBy("eventTimestamp")

df
  .withColumn("diff", coalesce($"amount" - lag($"amount", 1).over(wIdEvent), $"amount"))
  .withColumn("sum", sum($"diff").over(wId))
  .withColumn("final", coalesce(lag($"sum", 1).over(wId), lit(0)))
  .orderBy($"eventTimestamp").show
+---+-----+--------------+------+--------+----+---+-----+
| id|event|eventTimestamp|amount|expected|diff|sum|final|
+---+-----+--------------+------+--------+----+---+-----+
|  1|    A|    2022-01-01|    30|       0|  30| 30|    0|
|  1|    A|    2022-01-02|    20|      30| -10| 20|   30|
|  1|    B|    2022-01-03|    50|      20|  50| 70|   20|
|  1|    A|    2022-01-04|    10|      70| -10| 60|   70|
|  1|    B|    2022-01-05|    30|      60| -20| 40|   60|
|  1|    A|    2022-01-06|     0|      40| -10| 30|   40|
|  1|    C|    2022-01-07|   100|      30| 100|130|   30|
|  2|    D|    2022-01-08|     5|       0|   5|  5|    0|
+---+-----+--------------+------+--------+----+---+-----+

Finding Percentile in Spark-Scala per group

I am trying to compute a percentile over a column using a Window function, as below. I have referred to this approach to use the ApproxQuantile definition over a group.
val df1 = Seq(
  (1, 10.0), (1, 20.0), (1, 40.6), (1, 15.6), (1, 17.6), (1, 25.6),
  (1, 39.6), (2, 20.5), (2, 70.3), (2, 69.4), (2, 74.4), (2, 45.4),
  (3, 60.6), (3, 80.6), (4, 30.6), (4, 90.6)
).toDF("ID", "Count")

val idBucketMapping = Seq((1, 4), (2, 3), (3, 2), (4, 2))
  .toDF("ID", "Bucket")
//jpp
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.expressions.Window

object PercentileApprox {
  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(
      col.expr, percentage.expr, accuracy.expr
    ).toAggregateExpression
    new Column(expr)
  }
  def percentile_approx(col: Column, percentage: Column): Column =
    percentile_approx(col, percentage, lit(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
}
import PercentileApprox._

def doBucketing(bucket_size: Int) = (1 until bucket_size)
  .scanLeft(0d)((a, _) => a + (1 / bucket_size.toDouble))

var res = df1
  .withColumn("percentile",
    percentile_approx(col("count"), typedLit(doBucketing(2)))
      .over(Window.partitionBy("ID"))
  )
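For reference (my own check, not from the original post): as written, doBucketing produces evenly spaced percentages starting at 0, e.g.

doBucketing(2)  // 0.0, 0.5
doBucketing(4)  // 0.0, 0.25, 0.5, 0.75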
scala> df1.show
+---+-----+
| ID|Count|
+---+-----+
| 1| 10.0|
| 1| 20.0|
| 1| 40.6|
| 1| 15.6|
| 1| 17.6|
| 1| 25.6|
| 1| 39.6|
| 2| 20.5|
| 2| 70.3|
| 2| 69.4|
| 2| 74.4|
| 2| 45.4|
| 3| 60.6|
| 3| 80.6|
| 4| 30.6|
| 4| 90.6|
+---+-----+
scala> idBucketMapping.show
+---+------+
| ID|Bucket|
+---+------+
|  1|     4|
|  2|     3|
|  3|     2|
|  4|     2|
+---+------+
scala> res.show
+---+-----+------------------+
| ID|Count| percentile|
+---+-----+------------------+
| 1| 10.0|[10.0, 20.0, 40.6]|
| 1| 20.0|[10.0, 20.0, 40.6]|
| 1| 40.6|[10.0, 20.0, 40.6]|
| 1| 15.6|[10.0, 20.0, 40.6]|
| 1| 17.6|[10.0, 20.0, 40.6]|
| 1| 25.6|[10.0, 20.0, 40.6]|
| 1| 39.6|[10.0, 20.0, 40.6]|
| 3| 60.6|[60.6, 60.6, 80.6]|
| 3| 80.6|[60.6, 60.6, 80.6]|
| 4| 30.6|[30.6, 30.6, 90.6]|
| 4| 90.6|[30.6, 30.6, 90.6]|
| 2| 20.5|[20.5, 69.4, 74.4]|
| 2| 70.3|[20.5, 69.4, 74.4]|
| 2| 69.4|[20.5, 69.4, 74.4]|
| 2| 74.4|[20.5, 69.4, 74.4]|
| 2| 45.4|[20.5, 69.4, 74.4]|
+---+-----+------------------+
Up to here it is all well and good and the logic is simple. But I need the results in a dynamic fashion: the argument doBucketing(2) to this function should be taken from idBucketMapping based on the ID value.
This seems to be a little bit tricky for me. Is this possible by any means?
Expected output (the percentile bucket is based on the idBucketMapping DataFrame):
+---+-----+------------------------+
|ID |Count|percentile |
+---+-----+------------------------+
|1 |10.0 |[10.0, 15.6, 20.0, 39.6]|
|1 |20.0 |[10.0, 15.6, 20.0, 39.6]|
|1 |40.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |15.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |17.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |25.6 |[10.0, 15.6, 20.0, 39.6]|
|1 |39.6 |[10.0, 15.6, 20.0, 39.6]|
|3 |60.6 |[60.6, 60.6] |
|3 |80.6 |[60.6, 60.6] |
|4 |30.6 |[30.6, 30.6] |
|4 |90.6 |[30.6, 30.6] |
|2 |20.5 |[20.5, 45.4, 70.3] |
|2 |70.3 |[20.5, 45.4, 70.3] |
|2 |69.4 |[20.5, 45.4, 70.3] |
|2 |74.4 |[20.5, 45.4, 70.3] |
|2 |45.4 |[20.5, 45.4, 70.3] |
+---+-----+------------------------+
I have a solution for you that is extremely inelegant and works only if you have a limited number of possible bucket sizes.
My first version is very ugly.
// for the sake of clarity, let's define a function that generates the
// window aggregation
def per(x: Int) = percentile_approx(col("count"), typedLit(doBucketing(x)))
  .over(Window.partitionBy("ID"))

// then, we simply try to match the Bucket column with a possible value
val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile", when('Bucket === 2, per(2))
    .otherwise(when('Bucket === 3, per(3))
    .otherwise(per(4)))
  )
That's nasty, but it works in your case.
Slightly less ugly, but with the very same logic: you can define a set of possible numbers of buckets and use it to do the same thing as above.
val possible_number_of_buckets = 2 to 5

val res = df1
  .join(idBucketMapping, Seq("ID"))
  .withColumn("percentile", possible_number_of_buckets
    .tail
    .foldLeft(per(possible_number_of_buckets.head))(
      (column, size) => when('Bucket === size, per(size))
        .otherwise(column)))
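For what it's worth (my own illustration, not part of the original answer), the fold simply builds the same nested when/otherwise chain as the first version; for buckets 2 to 5 it expands to:

when('Bucket === 5, per(5))
  .otherwise(when('Bucket === 4, per(4))
  .otherwise(when('Bucket === 3, per(3))
  .otherwise(per(2))))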
percentile_approx takes a percentage and an accuracy. It seems they both must be constant literals, so we can't compute percentile_approx at runtime with a dynamically calculated percentage and accuracy.
Ref: Apache Spark git, percentile_approx (ApproximatePercentile) source.

How do I sum a column and add the summed column to a Spark DataFrame?

I have a Spark DataFrame as follows:
val someDF5 = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42),
  ("202003101810", "202003101700", 2)
).toDF("number", "word", "value")
I then create a column num_records by doing the following:
val DF1 = someDF5.groupBy("number","word").agg(count("*").alias("num_records"))
DF1:
+------------+------------+-----------+
|      number|        word|num_records|
+------------+------------+-----------+
|202003101750|202003101700|          2|
|202003101800|202003101700|          1|
|202003101810|202003101700|          1|
+------------+------------+-----------+
How can I add another column, say total_records, which keeps track of the total of num_records, to the DataFrame? For example, this is what I expect:
+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
|202003101810|202003101700|          1|            4|
+------------+------------+-----------+-------------+
Note: total_records should keep updating/adding whenever num_records changes
Add withColumn with the count, that's all:
val someDF5 = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42),
  ("202003101810", "202003101700", 2)
).toDF("number", "word", "value")

val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
  .withColumn("total_records", lit(someDF5.count))
DF1.show
Result :
+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
|202003101810|202003101700|          1|            4|
+------------+------------+-----------+-------------+
If the number of records increases, as below, the count is automatically updated:
val someDF5 = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42),
  ("202003101810", "202003101700", 2),
  ("202003101810", "22222222", 222)
).toDF("number", "word", "value")

val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
  .withColumn("total_records", lit(someDF5.count))
Result :
+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            5|
|202003101800|202003101700|          1|            5|
|202003101810|202003101700|          1|            5|
|202003101810|    22222222|          1|            5|
+------------+------------+-----------+-------------+
I think you can do it by creating a new DataFrame with the sum:
val total = DF1.agg(sum(col("num_records"))).head().getAs[Long](0)
val dfWithTotal = DF1.withColumn("total_records", lit(total))
dfWithTotal.show()
+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101810|202003101700|          1|            4|
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
+------------+------------+-----------+-------------+
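A variant sketch (my own, not from either answer): the total can also be computed within the same DataFrame using a window that spans all rows, at the cost of Spark moving everything into a single partition for that window:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// empty partitionBy() = one global window over the whole DataFrame
val dfWithTotal2 = DF1.withColumn("total_records",
  sum("num_records").over(Window.partitionBy()))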

How to subtract DataFrames using subset of columns in Apache Spark

How can I perform a filter operation on Dataframe1 using Dataframe2?
I want to remove rows from DataFrame1 for the below matching condition:
Dataframe1.col1 = Dataframe2.col1
Dataframe1.col2 = Dataframe2.col2
My question is different from subtracting two DataFrames, because subtract uses all columns, whereas here I want to use only a limited number of columns.
join with "left_anti"
scala> df1.show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   2|  two|  dho|
|   3|three|theen|
|   4| four|chaar|
+----+-----+-----+
scala> df2.show
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
|   2| two|  dho|
|   4|four|chaar|
+----+----+-----+
scala> df1.join(df2, Seq("col1", "col2"), "left_anti").show
+----+-----+-----+
|col1| col2| col3|
+----+-----+-----+
|   1|  one|   ek|
|   3|three|theen|
+----+-----+-----+
Possible duplicate of: Spark: subtract two DataFrames if both datasets have exact same columns
If you want a custom join condition then you can use an "anti" join. Here is the PySpark version.
Creating two data frames:
Dataframe1 :
l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)]
df1 = spark.createDataFrame(l1).toDF('col1','col2')
df1.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row1| 10|
|col1_row2| 20|
|col1_row3| 30|
+---------+----+
Dataframe2 :
l2 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row4', 40)]
df2 = spark.createDataFrame(l2).toDF('col1','col2')
df2.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row1| 10|
|col1_row2| 20|
|col1_row4| 40|
+---------+----+
Using the subtract API:
df_final = df1.subtract(df2)
df_final.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row3| 30|
+---------+----+
Using left_anti :
Join condition:
join_condition = [df1["col1"] == df2["col1"], df1["col2"] == df2["col2"]]
Finally, join:
df_final = df1.join(df2, join_condition, 'left_anti')
df_final.show()
+---------+----+
| col1|col2|
+---------+----+
|col1_row3| 30|
+---------+----+
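Since the rest of this page uses Scala, here is a rough Scala equivalent of the explicit-condition anti join (my own sketch, reusing df1/df2 from the left_anti answer above):

// keep rows of df1 that have no match in df2 on col1 and col2
val joinCondition = df1("col1") === df2("col1") && df1("col2") === df2("col2")
df1.join(df2, joinCondition, "left_anti").show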

What is the right way to join these 2 Spark DataFrames?

Let's assume I have 2 Spark DataFrames:
val addStuffDf = Seq(
  ("A", "2018-03-22", 5),
  ("A", "2018-03-24", 1),
  ("B", "2018-03-24", 3)
).toDF("user", "dt", "count")
val removedStuffDf = Seq(
  ("C", "2018-03-25", 10),
  ("A", "2018-03-24", 5),
  ("B", "2018-03-25", 1)
).toDF("user", "dt", "count")
and in the end I want to get a single DataFrame with summary statistics like this (ordering doesn't matter, actually):
+----+----------+-----+-------+
|user|        dt|added|removed|
+----+----------+-----+-------+
|   A|2018-03-22|    5|      0|
|   A|2018-03-24|    1|      5|
|   B|2018-03-24|    3|      0|
|   B|2018-03-25|    0|      1|
|   C|2018-03-25|    0|     10|
+----+----------+-----+-------+
It's quite clear that I can simply rename the "count" columns at "step 0", so as to have DataFrames df1 and df2:
val df1 = addStuffDf.withColumnRenamed("count", "added")
df1.show()
+----+----------+-----+
|user|        dt|added|
+----+----------+-----+
|   A|2018-03-22|    5|
|   A|2018-03-24|    1|
|   B|2018-03-24|    3|
+----+----------+-----+
val df2 = removedDf.withColumnRenamed("count", "removed")
df2.show()
+----+----------+-------+
|user|        dt|removed|
+----+----------+-------+
|   C|2018-03-25|     10|
|   A|2018-03-24|      5|
|   B|2018-03-25|      1|
+----+----------+-------+
But now I'm failing to define "step 1" - namely, to determine the transform that would zip df1 and df2 together.
From a logical standpoint, a full_outer join brings all the rows I need into a single DF, but then I need to merge the duplicated columns somehow:
df1.as('d1)
  .join(df2.as('d2),
    $"d1.user" === $"d2.user" && $"d1.dt" === $"d2.dt",
    "full_outer")
  .show()
+----+----------+-----+----+----------+-------+
|user|        dt|added|user|        dt|removed|
+----+----------+-----+----+----------+-------+
|null|      null| null|   C|2018-03-25|     10|
|null|      null| null|   B|2018-03-25|      1|
|   B|2018-03-24|    3|null|      null|   null|
|   A|2018-03-22|    5|null|      null|   null|
|   A|2018-03-24|    1|   A|2018-03-24|      5|
+----+----------+-----+----+----------+-------+
How can I merge these user and dt columns together? And, overall - am I using the correct approach to solve my problem or is there a more straightforward/efficient solution?
Since the columns to be joined for the two DataFrames have matching names, using Seq("user", "dt") for the join conditions will result in the merged table you want:
val addStuffDf = Seq(
  ("A", "2018-03-22", 5),
  ("A", "2018-03-24", 1),
  ("B", "2018-03-24", 3)
).toDF("user", "dt", "count")

val removedStuffDf = Seq(
  ("C", "2018-03-25", 10),
  ("A", "2018-03-24", 5),
  ("B", "2018-03-25", 1)
).toDF("user", "dt", "count")

val df1 = addStuffDf.withColumnRenamed("count", "added")
val df2 = removedStuffDf.withColumnRenamed("count", "removed")

df1.as('d1).join(df2.as('d2), Seq("user", "dt"), "full_outer").
  na.fill(0).
  show
// +----+----------+-----+-------+
// |user|        dt|added|removed|
// +----+----------+-----+-------+
// |   C|2018-03-25|    0|     10|
// |   B|2018-03-25|    0|      1|
// |   B|2018-03-24|    3|      0|
// |   A|2018-03-22|    5|      0|
// |   A|2018-03-24|    1|      5|
// +----+----------+-----+-------+
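And if you do want to keep the aliased full_outer join from the question, the duplicated key columns can be merged explicitly with coalesce; a rough sketch (my own, not from the answer):

import org.apache.spark.sql.functions._

df1.as("d1")
  .join(df2.as("d2"),
    $"d1.user" === $"d2.user" && $"d1.dt" === $"d2.dt",
    "full_outer")
  .select(
    coalesce($"d1.user", $"d2.user").as("user"),
    coalesce($"d1.dt", $"d2.dt").as("dt"),
    coalesce($"d1.added", lit(0)).as("added"),
    coalesce($"d2.removed", lit(0)).as("removed"))
  .show()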