Spark Combining Disparate rate Dataframes in Time - scala

Using Spark and Scala, I have two DataFrames with data values.
I'm trying to accomplish something that would be trivial when processing serially, but seems daunting when processing in a cluster.
Let's say I have two sets of values. One of them is very regular:
Relative Time  Value1
10             1
20             2
30             3
And I want to combine it with another value that is very irregular:
Relative Time  Value2
1              100
22             200
And get this (driven by Value1):
Relative Time  Value1  Value2
10             1       100
20             2       100
30             3       200
Note: There are a few scenarios here. One of them is that Value1 is a massive DataFrame and Value2 only has a few hundred values. The other scenario is that they're both massive.
Also note: I depict Value2 as being very slow, and it might be, but it could also be much faster than Value1, so I may have 10 or 100 values of Value2 before my next value of Value1, and I'd want the latest one. Because of this, doing a union of them and windowing it doesn't seem practical.
How would I accomplish this in Spark?

I think you can do this in two steps:
1) Do a full outer join between the two tables.
2) Use the last function (ignoring nulls) over a window ordered by time to carry forward the most recent value of value2.
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last
val df1 = spark.sparkContext.parallelize(Seq(
  (10, 1),
  (20, 2),
  (30, 3)
)).toDF("Relative Time", "value1")
val df2 = spark.sparkContext.parallelize(Seq(
  (1, 100),
  (22, 200)
)).toDF("Relative Time", "value2_temp")
// The full outer join keeps every timestamp from both DataFrames.
val df = df1.join(df2, Seq("Relative Time"), "outer")
// Note: a window without partitionBy moves all rows into a single partition.
val window = Window.orderBy("Relative Time")
val result = df
  .withColumn("value2", last($"value2_temp", ignoreNulls = true).over(window))
  .filter($"value1".isNotNull) // keep only the rows driven by value1
  .drop("value2_temp")
result.show()
+-------------+------+------+
|Relative Time|value1|value2|
+-------------+------+------+
| 10| 1| 100|
| 20| 2| 100|
| 30| 3| 200|
+-------------+------+------+
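For the scenario from the question in which Value2 has only a few hundred rows, a broadcast lookup is one possible alternative to the join-and-window approach above. The following is a minimal sketch under the assumption that df2 fits on the driver; the names sorted, times, values, and latestValue2 are hypothetical helpers and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("asof-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((10, 1), (20, 2), (30, 3)).toDF("Relative Time", "value1")
val df2 = Seq((1, 100), (22, 200)).toDF("Relative Time", "value2")

// Assumption: df2 is small enough to collect and sort on the driver.
val sorted = df2.collect().map(r => (r.getInt(0), r.getInt(1))).sortBy(_._1)
val times = spark.sparkContext.broadcast(sorted.map(_._1))
val values = spark.sparkContext.broadcast(sorted.map(_._2))

// Hypothetical helper: the latest value2 at or before time t, or null if there is none.
// A linear scan is used for brevity; a binary search would also work on the sorted array.
val latestValue2 = udf { (t: Int) =>
  val i = times.value.lastIndexWhere(_ <= t)
  if (i >= 0) Some(values.value(i)) else None
}

val result = df1.withColumn("value2", latestValue2(col("Relative Time")))
result.show()
```

This avoids both the join and the single-partition window, but it only applies while the lookup table stays small; when both DataFrames are massive, the join-based approach is the one to start from.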

Related

Spark: how to group rows into a fixed size array?

I have a dataset that looks like this:
+---+
|col|
+---+
| a|
| b|
| c|
| d|
| e|
| f|
| g|
+---+
I want to reformat this dataset so that I aggregate the rows into arrays of fixed length, like so:
+------+
| col|
+------+
|[a, b]|
|[c, d]|
|[e, f]|
| [g]|
+------+
I tried this:
spark.sql("select collect_list(col) from (select col, row_number() over (order by col) row_number from dataset) group by floor(row_number/2)")
But the problem with this is that my actual dataset is too large to process in a single partition for row_number().
Since you want to distribute this, a couple of steps are necessary.
In case you wish to run the code, I am starting from this:
var df = List(
"a", "b", "c", "d", "e", "f", "g"
).toDF("col")
val desiredArrayLength = 2
First, split your DataFrame into a small one, which you can process on a single node, and a larger one whose number of rows is a multiple of the desired array size (2 in your example).
import org.apache.spark.sql.functions._ // desc, collect_list, floor, lit, row_number, spark_partition_id
import org.apache.spark.sql.expressions.Window
val nRowsPrune = 1 // number of rows to prune so that the remaining row count is a multiple of desiredArrayLength (here 7 % 2 = 1)
val dfPrune = df.sort(desc("col")).limit(nRowsPrune)
df = df.join(dfPrune, Seq("col"), "left_anti") // separate the small DataFrame from the large one
By construction, you can apply the original code to the small DataFrame. Since it holds fewer rows than desiredArrayLength here, the row_number grouping from the question is not needed and a single aggregation suffices:
val groupedPruneDf = dfPrune
  // No withColumn("g", ...) / groupBy("g") step: all pruned rows end up in one array.
  .agg(collect_list("col").alias("col"))
  .select("col")
Now we need a way to deal with the remaining large DataFrame. We have made sure that df has a number of rows that is a multiple of the array size.
This is where we use a useful trick: repartitioning with repartitionByRange. Range partitioning preserves the sort order across partitions and aims for partitions of equal size. Note that Spark estimates the range boundaries by sampling, so equal sizes are not strictly guaranteed and are worth checking on your data.
You can now collect the arrays within each partition:
val nRows = df.count()
val maxNRowsPartition = desiredArrayLength //make sure its a multiple of desired array length
val nPartitions = math.max(1,math.floor(nRows/maxNRowsPartition) ).toInt
df = df.repartitionByRange(nPartitions, $"col".desc)
.withColumn("partitionId",spark_partition_id())
val w = Window.partitionBy($"partitionId").orderBy("col")
val groupedDf = df
.withColumn("g", floor( (lit(-1)+row_number().over(w))/lit(desiredArrayLength ))) //added -1 as row-number starts from 1
.groupBy("partitionId","g")
.agg( collect_list("col").alias("col"))
.select("col")
Finally combining the two results yields what you are looking for,
val result = groupedDf.union(groupedPruneDf)
result.show(truncate=false)
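As an alternative to the pruning and repartitionByRange steps above, the row index can also be assigned with RDD.zipWithIndex, which numbers rows across partitions without a single-partition window. This is a swapped-in technique and not the answer's method; the following is a minimal sketch, and the names indexed, grouped, idx, and g are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, floor, sort_array}

val spark = SparkSession.builder().master("local[*]").appName("chunk-sketch").getOrCreate()
import spark.implicits._

val desiredArrayLength = 2
val df = List("a", "b", "c", "d", "e", "f", "g").toDF("col")

// zipWithIndex assigns consecutive indices following the order of the sorted RDD,
// so no window over a single partition is required.
val indexed = df.orderBy("col").rdd
  .zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }
  .toDF("col", "idx")

val grouped = indexed
  .groupBy(floor(col("idx") / desiredArrayLength).as("g"))
  // collect_list does not guarantee order, so sort within each group.
  .agg(sort_array(collect_list("col")).as("col"))
  .orderBy("g")
  .select("col")
grouped.show(truncate = false)
```

Sorting inside each group works here because every group covers a contiguous range of the sorted column; if the arrays had to follow a different ordering, the index would need to be collected alongside the value.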

How to union only those rows (from large table) with keys in left small table?

I have two dataframe, one is large, the other is small:
val small_df = sc.parallelize(List(("Alice", 15), ("Bob", 20))).toDF("name", "age")
val large_df = sc.parallelize(List(("Bob", 40), ("SomeOne", 50), ...)).toDF("name", "age")
I want to add up these two DataFrames, but only for the keys in my small table; that is, I want my result to be:
List(("Alice", 15), ("Bob", 60))
My first attempt was to do a union and reduceByKey, but I can't seem to find a way to union the two tables and keep only those rows with keys in the smaller one.
Is there a way to do something like a "left union", or another way to approach this?
One way to solve this problem is to do an outer join (here a right outer join onto small_df) and then sum the two resulting age columns. Note that spark.implicits._ must be imported to use $, and org.apache.spark.sql.functions.broadcast and when must be imported for the broadcast hint and the conditional.
If either of the two DataFrames contains duplicates (in the name column), the final DataFrame will contain duplicates as well, which may or may not be what you want. Duplicates in large_df only show up if there is a corresponding name in small_df, as specified in the question.
As an optimization, since one of the DataFrames is small, it can be broadcast before the join to improve performance.
import org.apache.spark.sql.functions.{broadcast, when}
val small_df = sc.parallelize(List(("Alice", 15), ("Bob", 20))).toDF("name", "age")
val large_df = sc.parallelize(List(("Bob", 40), ("SomeOne", 50))).toDF("name", "age")
// The right outer join keeps every row of small_df and only the matching rows of large_df.
val df = large_df.withColumnRenamed("age", "large_age").join(broadcast(small_df), Seq("name"), "right_outer")
val df2 = df.withColumn("age", when($"large_age".isNotNull, $"age" + $"large_age").otherwise($"age")).select("name", "age")
df2.show
+-----+----+
| name| age|
+-----+----+
|Alice|15.0|
| Bob|60.0|
+-----+----+
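The join-and-sum idea above can also be written with coalesce instead of when/otherwise. The following is a minimal sketch, assuming that duplicate names in large_df should be summed before the join (a design choice that differs from the answer, which keeps duplicates); the names largeTotals and combined are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, coalesce, col, lit, sum}

val spark = SparkSession.builder().master("local[*]").appName("left-union-sketch").getOrCreate()
import spark.implicits._

val small_df = Seq(("Alice", 15), ("Bob", 20)).toDF("name", "age")
val large_df = Seq(("Bob", 40), ("SomeOne", 50)).toDF("name", "age")

// Assumption: pre-aggregate large_df so duplicate names there do not duplicate result rows.
val largeTotals = large_df.groupBy("name").agg(sum("age").as("large_age"))

val combined = largeTotals
  .join(broadcast(small_df), Seq("name"), "right_outer")
  // Names missing from large_df have a null large_age; treat it as 0.
  .select(col("name"), (col("age") + coalesce(col("large_age"), lit(0))).as("age"))
combined.show()
```

Because sum produces a long, the resulting age column is a long rather than an int.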
That should give you what you want:
val existingKeys = small_df.
join(large_df, "name").
select($"name", large_df("age"))
val all = small_df.
union(existingKeys).
groupBy("name").
agg(sum("age") as "age")
scala> all.show
+-----+---+
| name|age|
+-----+---+
| Bob| 60|
|Alice| 15|
+-----+---+

How to Merge values from multiple rows so they can be processed together - Spark scala

I have multiple database rows per personId with columns that may or may not have values - I'm using colors here as the data is text not numeric so doesn't lend itself to built-in aggregation functions. A simplified example is
PersonId ColA ColB ColC
100 red
100 green
100 gold
100 green
110 yellow
110 white
110
120
etc...
I want to be able to decide in a function which column data to use per unique PersonId. A three-way join on the table against itself would be a good solution if the data didn't have multiple values(colors) per column. E.g. that join merges 3 of the rows into one but still produces multiple rows.
PersonId ColA ColB ColC
100 red green gold
100 green
110 white yellow
110
120
So the solution I'm looking for is something that will allow me to address all the values (colors) for a person in one place (function) so the decision can be made across all their data.
The real data of course has more columns but the primary ones for this decision are the three columns. The data is being read in Scala Spark as a Dataframe and I'd prefer using the API to sql. I don't know if any of the exotic windows or groupby functions will help or if it's gonna be down to plain old iterate and accumulate.
The technique used in "How to aggregate values into collection after groupBy?" might be applicable, but it's a bit of a leap.
Consider using a custom UDF for this.
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and $ (already in scope in spark-shell)
val df = Seq((100, "red", null, null), (100, null, "white", null), (100, null, null, "green"), (200, null, "red", null)).toDF("PID", "A", "B", "C")
df.show()
+---+----+-----+-----+
|PID| A| B| C|
+---+----+-----+-----+
|100| red| null| null|
|100|null|white| null|
|100|null| null|green|
|200|null| red| null|
+---+----+-----+-----+
// collect_set drops nulls; the UDF returns the first non-empty value, or null if none remain.
val customUDF = udf((array: Seq[String]) => array.find(_.nonEmpty).orNull)
df.groupBy($"PID")
  .agg(
    customUDF(collect_set($"A")).as("colA"),
    customUDF(collect_set($"B")).as("colB"),
    customUDF(collect_set($"C")).as("colC"))
  .show
+---+----+-----+-----+
|PID|colA| colB| colC|
+---+----+-----+-----+
|100| red|white|green|
|200|null| red| null|
+---+----+-----+-----+

Find columns with different values

My DataFrame has 120 columns. Suppose it has the structure below:
Id value1 value2 value3
a 10 1983 19
a 20 1983 20
a 10 1983 21
b 10 1984 1
b 10 1984 2
We can see here that for Id a, value1 has different values (10, 20). I have to find the columns that have different values for a particular Id. Is there a statistical or other approach in Spark to solve this problem?
Expected output
id new_column
a value1,value3
b value3
The following code might be a start of an answer:
val result = log.select("Id","value1","value2","value3").groupBy('Id).agg('Id, countDistinct('value1),countDistinct('value2),countDistinct('value3))
It does the following:
1) log.select("Id","value1","value2","value3") selects the relevant columns (this may be redundant if you want all columns).
2) groupBy('Id) groups rows with the same Id.
3) agg('Id, countDistinct('value1),countDistinct('value2),countDistinct('value3)) outputs the Id and the number of distinct values per Id for each column.
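To get from these distinct counts to the expected output in the question, the columns whose count exceeds 1 can be concatenated per Id. The following is a minimal sketch built on the question's sample data; the names valueCols, distinctCounts, counts, and result are hypothetical, and it assumes every non-Id column should be checked.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, countDistinct, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("diff-cols-sketch").getOrCreate()
import spark.implicits._

val log = Seq(
  ("a", 10, 1983, 19), ("a", 20, 1983, 20), ("a", 10, 1983, 21),
  ("b", 10, 1984, 1), ("b", 10, 1984, 2)
).toDF("Id", "value1", "value2", "value3")

// Build one countDistinct per value column, so this scales to many columns.
val valueCols = log.columns.filter(_ != "Id")
val distinctCounts = valueCols.map(c => countDistinct(col(c)).as(c))
val counts = log.groupBy("Id").agg(distinctCounts.head, distinctCounts.tail: _*)

// when(...) without otherwise yields null, and concat_ws skips nulls,
// so only the names of columns with more than one distinct value remain.
val result = counts.select(
  col("Id"),
  concat_ws(",", valueCols.map(c => when(col(c) > 1, lit(c))): _*).as("new_column")
)
result.show()
```

Note that countDistinct ignores nulls, so a column that mixes one value with nulls is reported as having no differences in this sketch.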
You can do it in several ways. One of them is the distinct method, which behaves like DISTINCT in SQL. Another is the groupBy method, where you pass the names of the columns you want to group by (e.g. df.groupBy("Id", "value1")).
Below is an example using the distinct method.
scala> case class Person(name : String, age: Int)
defined class Person
scala> val persons = Seq(Person("test", 10), Person("test", 20), Person("test", 10)).toDF
persons: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.show
+----+---+
|name|age|
+----+---+
|test| 10|
|test| 20|
|test| 10|
+----+---+
scala> persons.select("name", "age").distinct().show
+-----+---+
| name|age|
+-----+---+
| test| 10|
| test| 20|
+-----+---+

spark (Scala) dataframe filtering (FIR)

Let's say I have a DataFrame (stored in a Scala val as df) which contains the data from a CSV:
time,temperature
0,65
1,67
2,62
3,59
I have no problem reading this from a file as a Spark DataFrame in Scala.
I would like to add a filtered column (by filter I mean signal-processing moving-average filtering); say I want to compute (T[n]+T[n-1])/2.0:
time,temperature,temperatureAvg
0,65,(65+0)/2.0
1,67,(67+65)/2.0
2,62,(62+67)/2.0
3,59,(59+62)/2.0
(Actually, for the first row I want 32.5 rather than the literal expression (65+0)/2.0; I wrote it out to clarify the expected two-time-step filtering output.)
So how can I achieve this? I am not familiar with Spark DataFrame operations that combine rows iteratively along a column...
Spark 3.1+
Replace
$"time".cast("timestamp")
with
import org.apache.spark.sql.functions.timestamp_seconds
timestamp_seconds($"time")
Spark 2.0+
In Spark 2.0 and later it is possible to use the window function as an input for groupBy. It allows you to specify windowDuration, slideDuration and startTime (offset). It works only with a TimestampType column, but it is not that hard to find a workaround for that. In your case it will require some additional steps to correct for boundaries, but the general solution can be expressed as shown below:
import org.apache.spark.sql.functions.{window, avg}
df
.withColumn("ts", $"time".cast("timestamp"))
.groupBy(window($"ts", windowDuration="2 seconds", slideDuration="1 second"))
.avg("temperature")
Spark < 2.0
If there is a natural way to partition your data you can use window functions as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy($"id").orderBy($"time").rowsBetween(-1, 0)
val df = sc.parallelize(Seq(
(1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
)).toDF("id", "time", "temperature")
df.select($"*", mean($"temperature").over(w).alias("temperatureAvg")).show
// +---+----+-----------+--------------+
// | id|time|temperature|temperatureAvg|
// +---+----+-----------+--------------+
// | 1| 0| 65| 65.0|
// | 1| 1| 67| 66.0|
// | 1| 2| 62| 64.5|
// | 1| 3| 59| 60.5|
// +---+----+-----------+--------------+
You can create windows with arbitrary weights using lead / lag functions:
lit(0.6) * $"temperature" +
lit(0.3) * lag($"temperature", 1) +
lit(0.2) * lag($"temperature", 2)
It is still possible without a partitionBy clause, but it will be extremely inefficient. If this is the case, you won't be able to use DataFrames. Instead you can use sliding over an RDD (see for example Operate on neighbor elements in RDD in Spark). There is also the spark-timeseries package, which you may find useful.
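The lag-based form above can reproduce the exact output requested in the question, including 32.5 for the first row, if the missing previous sample is treated as 0. The following is a minimal sketch, assuming the hypothetical id partition column from the earlier example; the names previous and filtered are hypothetical as well.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, lit}

val spark = SparkSession.builder().master("local[*]").appName("fir-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
).toDF("id", "time", "temperature")

// lag is applied over an ordered window without an explicit rowsBetween frame.
val w = Window.partitionBy("id").orderBy("time")
// Treat the missing previous sample of the first row as 0, i.e. T[-1] = 0.
val previous = coalesce(lag(col("temperature"), 1).over(w), lit(0))
// Two-tap moving average: (T[n] + T[n-1]) / 2.0
val filtered = df.withColumn("temperatureAvg", (col("temperature") + previous) / 2.0)
filtered.show()
```

Other weights follow the same pattern: multiply each lag term by its coefficient before summing.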