Conditional Concatenation in Spark - scala

I have a dataframe with the below structure:
+----------+------+------+----------------+------+----+
|      date|market|metric|aggregator_Value|  type|rank|
+----------+------+------+----------------+------+----+
|2018-08-05|    m1|    16|              m1|median|   1|
|2018-08-03|    m1|     5|              m1|median|   2|
|2018-08-01|    m1|    10|              m1|  mean|   3|
|2018-08-05|    m2|    35|              m2|  mean|   1|
|2018-08-03|    m2|    25|              m2|  mean|   2|
|2018-08-01|    m2|     5|              m2|  mean|   3|
+----------+------+------+----------------+------+----+
In this dataframe the rank column is calculated by ordering on date (descending) within each market group, like this:
val w_rank = Window.partitionBy("market").orderBy(desc("date"))
val outputDF2=outputDF1.withColumn("rank",rank().over(w_rank))
I want to extract the concatenated value of the metric column in the output dataframe for the rank = 1 row, with the condition that if type = "median" in the rank = 1 row, then concatenate all the metric values for that market; otherwise, if type = "mean" in the rank = 1 row, then concatenate only the previous two metric values. Like this:
+----------+------+------+----------------+------+--------+
|      date|market|metric|aggregator_Value|  type|  result|
+----------+------+------+----------------+------+--------+
|2018-08-05|    m1|    16|              m1|median| 10|5|16|
|2018-08-05|    m2|    35|              m2|  mean|   25|35|
+----------+------+------+----------------+------+--------+
How can I achieve this?

You could nullify column metric according to the specific condition and apply collect_list followed by concat_ws to get the wanted result, as shown below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("2018-08-05", "m1", 16, "m1", "median", 1),
  ("2018-08-03", "m1", 5, "m1", "median", 2),
  ("2018-08-01", "m1", 10, "m1", "mean", 3),
  ("2018-08-05", "m2", 35, "m2", "mean", 1),
  ("2018-08-03", "m2", 25, "m2", "mean", 2),
  ("2018-08-01", "m2", 5, "m2", "mean", 3)
).toDF("date", "market", "metric", "aggregator_value", "type", "rank")

val win_desc = Window.partitionBy("market").orderBy(desc("date"))
val win_asc = Window.partitionBy("market").orderBy(asc("date"))

df.
  // propagate the type of the rank-1 (latest) row to every row of its market
  withColumn("rank1_type", first($"type").over(win_desc.rowsBetween(Window.unboundedPreceding, 0))).
  // null out metric where it should be excluded; collect_list skips nulls
  withColumn("cond_metric", when($"rank1_type" === "mean" && $"rank" > 2, null).otherwise($"metric")).
  withColumn("result", concat_ws("|", collect_list("cond_metric").over(win_asc))).
  where($"rank" === 1).
  show
// +----------+------+------+----------------+------+----+----------+-----------+-------+
// | date|market|metric|aggregator_value| type|rank|rank1_type|cond_metric| result|
// +----------+------+------+----------------+------+----+----------+-----------+-------+
// |2018-08-05| m1| 16| m1|median| 1| median| 16|10|5|16|
// |2018-08-05| m2| 35| m2| mean| 1| mean| 35| 25|35|
// +----------+------+------+----------------+------+----+----------+-----------+-------+

Related

Spark Scala split column values in a dataframe to appended lists

I have data in a Spark dataframe in which I need to search for elements by name, append their values to a list, and split the matched elements into separate columns of the dataframe.
I am using Scala, and below is an example of my current code; it works for getting the first value, but I need to append all available values, not just the first.
I'm new to Scala (and python) so any help will be greatly appreciated!
val getNumber: (String => String) = (colString: String) => {
  if (colString != null) {
    raw"number:(\d+)".r
      .findAllIn(colString)
      .group(1)
  }
  else
    null
}

val udfGetColumn = udf(getNumber)

val mydf = df.select(cols.....)
  .withColumn("var_number", udfGetColumn($"var"))
Example Data:
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| key| var |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |["[number:123456 rate:111970 position:1]","[number:123457 rate:662352 position:2]","[number:123458 rate:890 position:3]","[number:123459 rate:190 position:4]"] | |
|2 |["[number:654321 rate:211971 position:1]","[number:654322 rate:124 position:2]","[number:654323 rate:421 position:3]"] |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
Desired Result:
+---+----------+--------+------------+
|key|var_number|var_rate|var_position|
+---+----------+--------+------------+
|  1|    123456|  111970|           1|
|  1|    123457|  662352|           2|
|  1|    123458|     890|           3|
|  1|    123459|     190|           4|
|  2|    654321|  211971|           1|
|  2|    654322|     124|           2|
|  2|    654323|     421|           3|
+---+----------+--------+------------+
You don't need to use UDF here. You can easily transform the array column var by converting each element into a map using str_to_map after removing the square brackets ([]) with regexp_replace function. Finally, explode the transformed array and select the fields:
val df = Seq(
  (1, Seq("[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]")),
  (2, Seq("[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"))
).toDF("key", "var")

val result = df.withColumn(
  "var",
  explode(expr(raw"transform(var, x -> str_to_map(regexp_replace(x, '[\\[\\]]', ''), ' '))"))
).select(
  col("key"),
  col("var").getField("number").alias("var_number"),
  col("var").getField("rate").alias("var_rate"),
  col("var").getField("position").alias("var_position")
)
result.show
//+---+----------+--------+------------+
//|key|var_number|var_rate|var_position|
//+---+----------+--------+------------+
//| 1| 123456| 111970| 1|
//| 1| 123457| 662352| 2|
//| 1| 123458| 890| 3|
//| 1| 123459| 190| 4|
//| 2| 654321| 211971| 1|
//| 2| 654322| 124| 2|
//| 2| 654323| 421| 3|
//+---+----------+--------+------------+
From your comment, it appears the column var is of type string, not array. In this case, you can first transform it by removing the [] and " characters, then split by comma to get an array:
val df = Seq(
  (1, """["[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]"]"""),
  (2, """["[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"]""")
).toDF("key", "var")

val result = df.withColumn(
  "var",
  split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
  "var",
  explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
  // select your columns as above...
)

How to sum(case when then) in SparkSQL DataFrame just like sql?

I'm new to SparkSQL, and I want to calculate the percentage of each status in my data.
Here is my data:
A B
11 1
11 3
12 1
13 3
12 2
13 1
11 1
12 2
So, I can do it in SQL like this:
select (C.oneTotal / C.total) as onePercentage,
       (C.twoTotal / C.total) as twotPercentage,
       (C.threeTotal / C.total) as threPercentage
from (select count(*) as total,
             A,
             sum(case when B = '1' then 1 else 0 end) as oneTotal,
             sum(case when B = '2' then 1 else 0 end) as twoTotal,
             sum(case when B = '3' then 1 else 0 end) as threeTotal
      from test
      group by A) as C;
But with the SparkSQL DataFrame API, I first calculate the total count for every status like below:
// wrong code
val cc = transDF.select("transData.*").groupBy("A")
.agg(count("transData.*").alias("total"),
sum(when(col("B") === "1", 1)).otherwise(0)).alias("oneTotal")
sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal")
The problem is that the sum(when) result is zero. Am I using it incorrectly?
How can I implement this in SparkSQL just like my SQL above, and then calculate the proportion of every status?
Thank you for your help. In the end, I solved it with sum(when); below is my current code.
val cc = transDF.select("transData.*").groupBy("A")
  .agg(count("transData.*").alias("total"),
       sum(when(col("B") === "1", 1).otherwise(0)).alias("oneTotal"),
       sum(when(col("B") === "2", 1).otherwise(0)).alias("twoTotal"))
  .select(col("total"),
          col("A"),
          (col("oneTotal") / col("total")).alias("oneRate"),
          (col("twoTotal") / col("total")).alias("twoRate"))
Thanks again.
You can use sum(when(...)) or count(when(...)); the second option is shorter to write:
val df = Seq(
  (11, 1),
  (11, 3),
  (12, 1),
  (13, 3),
  (12, 2),
  (13, 1),
  (11, 1),
  (12, 2)
).toDF("A", "B")

df
  .groupBy($"A")
  .agg(
    count("*").as("total"),
    count(when($"B" === "1", $"A")).as("oneTotal"),
    count(when($"B" === "2", $"A")).as("twoTotal"),
    count(when($"B" === "3", $"A")).as("threeTotal")
  )
  .select(
    $"A",
    ($"oneTotal" / $"total").as("onePercentage"),
    ($"twoTotal" / $"total").as("twoPercentage"),
    ($"threeTotal" / $"total").as("threePercentage")
  )
  .show()
gives
+---+------------------+------------------+------------------+
| A| onePercentage| twoPercentage| threePercentage|
+---+------------------+------------------+------------------+
| 12|0.3333333333333333|0.6666666666666666| 0.0|
| 13| 0.5| 0.0| 0.5|
| 11|0.6666666666666666| 0.0|0.3333333333333333|
+---+------------------+------------------+------------------+
Alternatively, you could produce a "long" table with window functions:
import org.apache.spark.sql.expressions.Window

df
  .groupBy($"A", $"B").count()
  .withColumn("total", sum($"count").over(Window.partitionBy($"A")))
  .select(
    $"A",
    $"B",
    ($"count" / $"total").as("percentage")
  ).orderBy($"A", $"B")
  .show()
+---+---+------------------+
| A| B| percentage|
+---+---+------------------+
| 11| 1|0.6666666666666666|
| 11| 3|0.3333333333333333|
| 12| 1|0.3333333333333333|
| 12| 2|0.6666666666666666|
| 13| 1| 0.5|
| 13| 3| 0.5|
+---+---+------------------+
As far as I understand, you want to implement the same logic as the SQL shown in the question. One way is the example below:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggTest extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate

  import spark.implicits._

  val df = Seq(
    (11, 1),
    (11, 3),
    (12, 1),
    (13, 3),
    (12, 2),
    (13, 1),
    (11, 1),
    (12, 2)
  ).toDF("A", "B")

  df.show(false)
  df.createOrReplaceTempView("test")

  spark.sql(
    """
      |select (C.oneTotal / C.total) as onePercentage,
      |       (C.twoTotal / C.total) as twotPercentage,
      |       (C.threeTotal / C.total) as threPercentage
      |from (select count(*) as total,
      |             A,
      |             sum(case when B = '1' then 1 else 0 end) as oneTotal,
      |             sum(case when B = '2' then 1 else 0 end) as twoTotal,
      |             sum(case when B = '3' then 1 else 0 end) as threeTotal
      |      from test
      |      group by A) as C
    """.stripMargin).show
}
Result :
+---+---+
|A |B |
+---+---+
|11 |1 |
|11 |3 |
|12 |1 |
|13 |3 |
|12 |2 |
|13 |1 |
|11 |1 |
|12 |2 |
+---+---+
+------------------+------------------+------------------+
| onePercentage| twotPercentage| threPercentage|
+------------------+------------------+------------------+
|0.3333333333333333|0.6666666666666666| 0.0|
| 0.5| 0.0| 0.5|
|0.6666666666666666| 0.0|0.3333333333333333|
+------------------+------------------+------------------+
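Note that without the grouping key it is hard to tell which result row belongs to which value of A. A minimal variant that keeps it, reusing the same temp view test registered above:
spark.sql(
  """
    |select C.A,
    |       (C.oneTotal / C.total) as onePercentage,
    |       (C.twoTotal / C.total) as twotPercentage,
    |       (C.threeTotal / C.total) as threPercentage
    |from (select count(*) as total,
    |             A,
    |             sum(case when B = '1' then 1 else 0 end) as oneTotal,
    |             sum(case when B = '2' then 1 else 0 end) as twoTotal,
    |             sum(case when B = '3' then 1 else 0 end) as threeTotal
    |      from test
    |      group by A) as C
  """.stripMargin).show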

Find a record with max value in a group

I have the following dataset:
|month|temperature|city|
| 1| 15.0 |foo |
| 1| 20.0 |bar |
| 2| 25.0 |baz |
| 2| 30.0 |quok|
I want to find the city with the highest temperature in each month:
|month|temperature|city|
| 1|20.0 |bar |
| 2|30.0 |quok|
How can I do this using Apache Spark SQL? I tried to use window functions but failed to get the right results.
Using a window function you can do it as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

val l = Seq((1, 15.0, "foo"), (1, 20.0, "bar"), (2, 25.0, "baz"), (2, 30.0, "quok"))
val df = l.toDF("month", "temperature", "city")

val w = Window.partitionBy("month")

df.withColumn("m", max("temperature").over(w))
  .filter($"temperature" === $"m")
  .select("month", "temperature", "city")
  .show()
+-----+-----------+----+
|month|temperature|city|
+-----+-----------+----+
| 1| 20.0| bar|
| 2| 30.0|quok|
+-----+-----------+----+
Alternatively, you can also do it using groupBy + join:
val maxT = df.groupBy("month").agg(max("temperature").alias("maxT"))

df.join(maxT, Seq("month"), "left")
  .filter($"temperature" === $"maxT")
  .select("month", "temperature", "city")
  .show()
+-----+-----------+----+
|month|temperature|city|
+-----+-----------+----+
| 1| 20.0| bar|
| 2| 30.0|quok|
+-----+-----------+----+
Which is more efficient depends on the data. If the aggregated DataFrame is small enough to be broadcast, the join will be more efficient.
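For example, a minimal sketch of forcing the broadcast with the broadcast hint, reusing the maxT aggregate from above and assuming it fits in executor memory:
import org.apache.spark.sql.functions.broadcast

// Broadcast the small aggregated DataFrame so the join avoids shuffling df
df.join(broadcast(maxT), Seq("month"), "left")
  .filter($"temperature" === $"maxT")
  .select("month", "temperature", "city")
  .show()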
The most efficient way is probably to put both temperature and city in a struct and combine it with a max aggregation: max on a struct compares its fields in order, so the row with the highest temperature wins and carries its city along.
val df = Seq((1, 15.0, "foo"), (1, 20.0, "bar"), (2, 25.0, "baz"), (2, 30.0, "quok")).toDF("month", "temperature", "city")

df
  .groupBy($"month")
  .agg(max(struct($"temperature", $"city")).as("maxtemp"))
  .select($"month", $"maxtemp.*")
  .show()
gives :
+-----+-----------+----+
|month|temperature|city|
+-----+-----------+----+
| 1| 20.0| bar|
| 2| 30.0|quok|
+-----+-----------+----+

Comparing DataFrames in Spark

I have 2 dataframes
df1
+----------+----------------+--------------------+--------------+-------------+
| WEEK|DIM1 |DIM2 |T1 | T2 |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02| 14| NULL| 9874| 880 |
|2016-04-30| 14|FR | 9875| 13 |
|2017-06-10| 15| PQR| 9867| 57721 |
+----------+----------------+--------------------+--------------+-------------+
df2
+----------+----------------+--------------------+--------------+-------------+
| WEEK|DIM1 |DIM2 |T1 | T2 |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02| 14| NULL| 9879| 820 |
|2016-04-30| 14|FR | 9785| 9 |
|2017-06-10| 15| XYZ| 9967| 57771 |
+----------+----------------+--------------------+--------------+-------------+
I want to write a comparator in Spark that compares T1 and T2 in both dataframes, matched on WEEK, DIM1, DIM2, where T1 and T2 in df1 should be greater than their counterparts in df2 by 3. I want to return all rows that do not match this criterion, along with the difference in T1 and T2 between the dataframes. I also want to flag rows present in df1 but not in df2, and vice versa, for the combination WEEK, DIM1, DIM2.
The output should look like this:
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
| WEEK|DIM1 |DIM2 |T1_dIFF | T2_dIFF | Presenent_In_DF1 | Presenent_In_DF2|
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
|2016-04-30| 14|FR | 90| 4 | Y | Y |
|2017-06-10| 15|PQR | 9867| 57721 | Y | N |
|2017-06-10| 15|XYZ | 9967| 57771 | N | Y |
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
What is the best way to go about this?
I have implemented the following but do not know how to proceed from here:
val df1 = Seq(
("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
val df2 = Seq(
("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
import org.apache.spark.sql.functions._
val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
The joined dataframe looks like this:
+----------+----+----+----+-----+----+-----+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|
+----------+----+----+----+-----+----+-----+
|2016-04-02| 14|NULL|9874| 880|9879| 820|
|2017-06-10| 15| PQR|9867|57721|null| null|
|2017-06-10| 15| XYZ|null| null|9967|57771|
|2016-04-30| 14| FR|9875| 13|9785| 9|
+----------+----+----+----+-----+----+-----+
I do not know how to proceed after this in a good way; I am relatively new to Scala.
One easy solution could be to join df1 and df2 with WEEK as the unique key. In the joined data you need to keep all the columns from df1 and df2.
Then you can do a map operation on the dataframe to produce the rest of the columns (see the sketch after the snippet below).
Something like
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
val df = spark.sql("select df1.*, df2.DIM1 as df2_DIM1, df2.DIM2 as df2_DIM2, df2.T1 as df2_T1, df2.T2 as df2_T2 from df1 join df2 on df1.WEEK = df2.WEEK")
// Now map on the dataframe to produce the diff dataframe
// Or you can use the SQL to do that.
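For instance, building on the full outer join (joined) already constructed in the question, a minimal sketch of that remaining step could derive the differences and presence flags with DataFrame functions; the column names T1_diff, T2_diff, Present_In_DF1, Present_In_DF2 are illustrative, not part of the original post:
import org.apache.spark.sql.functions._
// assumes spark.implicits._ is in scope for the $ syntax, as in the question
// joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
val result = joined
  .withColumn("T1_diff", coalesce($"l.T1", lit(0)) - coalesce($"r.T1", lit(0)))
  .withColumn("T2_diff", coalesce($"l.T2", lit(0)) - coalesce($"r.T2", lit(0)))
  .withColumn("Present_In_DF1", when($"l.T1".isNotNull, "Y").otherwise("N"))
  .withColumn("Present_In_DF2", when($"r.T1".isNotNull, "Y").otherwise("N"))
  .select("WEEK", "DIM1", "DIM2", "T1_diff", "T2_diff", "Present_In_DF1", "Present_In_DF2")
A filter on T1_diff, T2_diff and the presence flags can then express the "greater by 3" criterion from the question.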

Pyspark - Depth-First Search on Dataframe

I have the following pyspark application that generates sequences of child/parent processes from a csv of child/parent process ids. Considering the problem as a tree, I'm using an iterative depth-first search starting at leaf nodes (processes that have no children) and iterating through my file to create these closures, where process 1 is the parent of process 2, which is the parent of process 3, and so on.
In other words, given a csv as shown below, is it possible to implement a depth-first search (iteratively or recursively) using pyspark dataframes and appropriate pyspark-isms to generate said closures without having to use the .collect() function (which is incredibly expensive)?
from pyspark.sql.functions import monotonically_increasing_id
import copy
from pyspark.sql import SQLContext
from pyspark import SparkContext


class Test():
    def __init__(self):
        self.process_list = []


def main():
    test = Test()
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    df = sc.textFile("<path to csv>")
    df = df.map(lambda line: line.split(","))
    header = df.first()
    data = df.filter(lambda row: row != header)
    data = data.toDF(header)
    data.createOrReplaceTempView("flat")

    data = sqlContext.sql("""select doc_process_pid, doc_parent_pid from flat
                             where doc_parent_pid is not null AND
                                   doc_process_pid is not null""")
    data = data.select(monotonically_increasing_id().alias("rowId"), "*")
    data.createOrReplaceTempView("data")

    leaf_df = sqlContext.sql("""select doc_process_pid, doc_parent_pid from data
                                where doc_parent_pid != -1 AND
                                      doc_process_pid == -1""")
    leaf_df = leaf_df.rdd.collect()

    data = sqlContext.sql("""select doc_process_pid, doc_parent_pid from data
                             where doc_process_pid != -1""")
    data.createOrReplaceTempView("data")

    for row in leaf_df:
        path = []
        rowID = row[0]
        data = data.filter(data['rowId'] != rowID)
        parentID = row[4]
        path.append(parentID)
        while (True):
            next_df = sqlContext.sql(
                "select doc_process_pid, doc_parent_pid from data where "
                "doc_process_pid == " + str(parentID))
            next_df_rdd = next_df.collect()
            print("parent: ", next_df_rdd[0][1])
            parentID = next_df_rdd[0][1]
            if (int(parentID) != -1):
                path.append(next_df_rdd[0][1])
            else:
                test.process_list.append(copy.deepcopy(path))
                break
    print("final: ", test.process_list)

main()
Here is my csv:
doc_process_pid doc_parent_pid
1 -1
2 1
6 -1
7 6
8 7
9 8
21 -1
22 21
24 -1
25 24
26 25
27 26
28 27
29 28
99 6
107 99
108 -1
109 108
222 109
1000 7
1001 1000
-1 9
-1 22
-1 29
-1 107
-1 1001
-1 222
-1 2
It represents child/parent process relationships. If we consider this as a tree, then leaf nodes are defined by doc_process_pid == -1 and root nodes are processes where doc_parent_pid == -1.
The code above generates two data frames:
Leaf Nodes:
+---------------+--------------+
|doc_process_pid|doc_parent_pid|
+---------------+--------------+
| -1| 9|
| -1| 22|
| -1| 29|
| -1| 107|
| -1| 1001|
| -1| 222|
| -1| 2|
+---------------+--------------+
The remaining child/parent processes sans leaf nodes:
+---------------+--------------+
|doc_process_pid|doc_parent_pid|
+---------------+--------------+
| 1| -1|
| 2| 1|
| 6| -1|
| 7| 6|
| 8| 7|
| 9| 8|
| 21| -1|
| 22| 21|
| 24| -1|
| 25| 24|
| 26| 25|
| 27| 26|
| 28| 27|
| 29| 28|
| 99| 6|
| 107| 99|
| 108| -1|
| 109| 108|
| 222| 109|
| 1000| 7|
+---------------+--------------+
The output would be:
[[1, 2],
[6, 99, 107],
[6, 99, 7, 1000, 1001],
[6, 7, 1000, 8, 9],
[21, 22],
[24, 25, 26, 27, 28, 29],
[108, 109, 222]]
Thoughts? While this is a bit specific, I want to emphasize the generalized question of performing depth-first searches to generate closures of sequences represented in this DataFrame format.
Thanks in advance for the help!
I don't think pyspark is the best language for this.
A solution would be to iterate through the tree node levels, joining the dataframe with itself every time.
Let's create our dataframe; there is no need to split it into leaf and other nodes, we'll just keep the original dataframe:
data = spark.createDataFrame(
sc.parallelize(
[[1, -1], [2, 1], [6, -1], [7, 6], [8, 7], [9, 8], [21,-1], [22,21], [24,-1], [25,24], [26,25], [27,26], [28,27],
[29,28], [99, 6], [107,99], [108,-1], [109,108], [222,109], [1000,7], [1001,1000], [ -1,9], [ -1,22], [ -1,29],
[ -1,107], [ -1, 1001], [ -1,222], [ -1,2]]
),
["doc_process_pid", "doc_parent_pid"]
)
We'll now create two dataframes from this tree: one will be our building base and the other will be our construction bricks:
df1 = data.filter("doc_parent_pid = -1").select(data.doc_process_pid.alias("node"))
df2 = data.select(data.doc_process_pid.alias("son"), data.doc_parent_pid.alias("node")).filter("node != -1")
Let's define a function for step i of the construction:
def add_node(df, i):
    return df.filter("node != -1").join(df2, "node", "inner") \
        .withColumnRenamed("node", "node" + str(i)) \
        .withColumnRenamed("son", "node")
Let's define our initial state:
from pyspark.sql.types import *

df = df1
i = 0
df_end = spark.createDataFrame(
    sc.emptyRDD(),
    StructType([StructField("branch", ArrayType(LongType()), True)])
)
When a branch is fully constructed, we take it out of df and put it in df_end:
import pyspark.sql.functions as psf

while df.count() > 0:
    i = i + 1
    df = add_node(df, i)
    df_end = df.filter("node = -1").drop('node').select(
        psf.array(*[c for c in reversed(df.columns) if c != "node"]).alias("branch")
    ).unionAll(df_end)
    df = df.filter("node != -1")
At the end, df is empty and we have
df_end.show(truncate=False)
+------------------------+
|branch |
+------------------------+
|[24, 25, 26, 27, 28, 29]|
|[6, 7, 8, 9] |
|[6, 7, 1000, 1001] |
|[108, 109, 222] |
|[6, 99, 107] |
|[21, 22] |
|[1, 2] |
+------------------------+
The worst case for this algorithm is as many joins as the maximum branch length.