Duplicating the record count in Apache Spark (Scala)

This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
That solution works, but here the counts need to go into different categories depending on a condition.
So the output should look like this:
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| share1 | 2|
| Boston| share2 | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share1 | 1|
| Warsaw|share2 | 1|
| Warsaw|like | 1|
+-------+--------+-----+
Here, if the action is share, it needs to be counted under both share1 and share2. When I count this programmatically, I use a case statement: when the action is share, share1 = share1 + 1 and share2 = share2 + 1.
How can I do this in Scala, PySpark, or SQL?

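A simple filter and a few unions should give you the desired output:
import org.apache.spark.sql.functions.lit
// counts per (city, media)
val media = sales.groupBy("city", "media").count()
// counts per (city, action), with action renamed to media so the frames can be unioned
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")
// emit the share counts twice, relabelled as share1 and share2
media.union(action.filter($"media" =!= "share"))
.union(share.withColumn("media", lit("share1")))
.union(share.withColumn("media", lit("share2")))
.show(false)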
which should give you
+-------+--------+-----+
|city |media |count|
+-------+--------+-----+
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
+-------+--------+-----+
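The case-statement idea from the question can also be expressed directly. The following is only a sketch of an alternative, not the approach from the answer above: each row is expanded into the labels it should be counted under with `when` and `explode`, and the result is grouped once. The names `labels` and `counts` and the local `SparkSession` are assumptions made for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, explode, lit, when}

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("share-labels").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, "facebook", "share", 100),
  ("Warsaw", 2017, "facebook", "like", 200),
  ("Boston", 2015, "twitter", "share", 50),
  ("Boston", 2016, "facebook", "share", 150),
  ("Toronto", 2017, "twitter", "like", 50)
).toDF("city", "year", "media", "action", "amount")

// Every row is counted under its media label. A "share" row is additionally
// counted under both share1 and share2; any other action is counted as itself.
val labels = when($"action" === "share", array($"media", lit("share1"), lit("share2")))
  .otherwise(array($"media", $"action"))

// One output row per label, then a single groupBy instead of several unions.
val counts = sales
  .select($"city", explode(labels).as("media"))
  .groupBy("city", "media")
  .count()

counts.orderBy("city", "media").show(false)
```

This trades the three unions for one `explode`, which may be easier to extend if more actions need to be split into several labels.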

Related

How to perform one to many mapping on spark scala dataframe column using flatmaps

I am specifically looking for a flatMap solution for mocking the data in a Spark Scala DataFrame by duplicating rows, i.e. a one-to-many mapping inside flatMap.
My data looks something like this:
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
+---+----+-----+
After a 1-to-3 mapping of the id column, I expect something like this:
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
|2 |null|null |
|3 |null|null |
|1 |null|null |
|2 |null|null |
|1 |null|null |
|3 |null|null |
+---+----+-----+
Please feel free to let me know if there is any clarification required on the requirement part
Thanks in advance!!!
I see that you are attempting to generate data with a requirement of re-using values in the ID column.
You can just select the ID column and generate random values and do a union back to your original dataset.
For example:
val data = Seq((1,"asd",15), (2,"asd",20), (3,"test",99)).toDF("id","testName","marks")
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
+---+--------+-----+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val newRecords = data.select("id")
.withColumn("testName", concat(lit("name_"), lit(rand()*10).cast(IntegerType).cast(StringType)))
.withColumn("marks", lit(rand()*100).cast(IntegerType))
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
val result = data.unionAll(newRecords)
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
You can run the randomisation step in a loop and union all the generated DataFrames to get as many copies of each id as you need.
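A minimal sketch of that loop, assuming two extra random batches per id (a 1-to-3 mapping). The helper `mockBatch` and the value `copies` are hypothetical names introduced for this example, and the local `SparkSession` is only there to make the snippet self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{concat, lit, rand}

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mock-rows").getOrCreate()
import spark.implicits._

val data = Seq((1, "asd", 15), (2, "asd", 20), (3, "test", 99)).toDF("id", "testName", "marks")

// Hypothetical helper: one batch of mocked rows that reuses the ids of `df`
// and fills the other two columns with random values.
def mockBatch(df: DataFrame): DataFrame =
  df.select($"id")
    .withColumn("testName", concat(lit("name_"), (rand() * 10).cast("int").cast("string")))
    .withColumn("marks", (rand() * 100).cast("int"))

// Assumption: 2 extra copies per id, so every id ends up with 3 rows.
val copies = 2
val result = (1 to copies).map(_ => mockBatch(data)).foldLeft(data)(_ union _)

result.show()
```

`union` matches columns by position, so the helper has to produce the columns in the same order as `data`.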

Scala spark to filter out reoccurring zero values

I created a dataframe in spark with the following schema:
root
|-- user_id: string (nullable = true)
|-- rate: decimal(32,16) (nullable =true)
|-- date: timestamp (nullable =true)
|-- type: string (nullable = true)
Data is like this in my schema
+----------+----------+-------------+---------+
| user_id| rate |date | type |
+----------+----------+-------------+---------+
| XO_121 | 10 |2020-04-20 | A |
| XO_121 | 10 |2020-04-21 | A |
| XO_121 | 30 |2020-04-22 | A |
| XO_121 |0 |2020-04-23 | A |
| XO_121 |0 |2020-04-24 | A |
| XO_121 |0 |2020-04-25 | A |
| XO_121 |0 |2020-04-26 | A |
| XO_121 |5 |2020-04-27 | A |
| XO_121 |0 |2020-04-28 | A |
| XO_121 |0 |2020-04-29 | A |
| XO_121 |1 |2020-04-30 | A |
To save space, I want to drop rows whose rate is zero, keeping only the first row of each run of consecutive zeros. Duplicates of other rates are allowed (see the two rows with rate 10), and the date order must be preserved. After applying the filter, my data should look like this:
+----------+----------+-------------+---------+
| user_id| rate |date | type |
+----------+----------+-------------+---------+
| XO_121 | 10 |2020-04-20 | A |
| XO_121 | 10 |2020-04-21 | A |
| XO_121 | 30 |2020-04-22 | A |
| XO_121 |0 |2020-04-23 | A |
| XO_121 |5 |2020-04-27 | A |
| XO_121 |0 |2020-04-28 | A |
| XO_121 |1 |2020-04-30 | A |
I'm new to Spark and am looking for a way to do this filtering. I tried using rank, but that doesn't work. Can anybody suggest a solution?
Data preparation:
val df = Seq( ("XO_121","10","2020-04-20"),("XO_121","10","2020-04-21"),("XO_121","30","2020-04-22"),("XO_121","0","2020-04-23"),("XO_121","0","2020-04-24"),("XO_121","0","2020-04-25"),("XO_121","0","2020-04-26"),("XO_121","5","2020-04-27"),("XO_121","0","2020-04-28"),("XO_121","0","2020-04-29"),("XO_121","1","2020-04-30"))
.toDF("user_id","rate","date")
Get the previous value of rate with lag, then drop every record where both rate and previous_rate are "0":
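import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
val winSpec = Window.partitionBy("user_id").orderBy("date")
val finalDf = df.withColumn("previous_rate", lag("rate", 1).over(winSpec))
// keep the first row of each user (no previous rate) and drop zeros that directly follow a zero
.filter($"previous_rate".isNull || !($"rate" === "0" && $"previous_rate" === "0"))
.drop("previous_rate")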
Output:
scala> finalDf.show
+-------+----+----------+
|user_id|rate| date|
+-------+----+----------+
| XO_121| 10|2020-04-20|
| XO_121| 10|2020-04-21|
| XO_121| 30|2020-04-22|
| XO_121| 0|2020-04-23|
| XO_121| 5|2020-04-27|
| XO_121| 0|2020-04-28|
| XO_121| 1|2020-04-30|
+-------+----+----------+
Now you can apply orderBy($"date") or orderBy($"user_id", $"date"), whichever applies in your case.
You can use row_number() instead of rank, as below:
from pyspark.sql import Window as W, functions as F
_w = W.partitionBy("col2").orderBy("col1")
df = df.withColumn("rnk", F.row_number().over(_w))
df = df.filter(F.col("rnk") == 1)
df.show()
+------+----+---+
| col1|col2|rnk|
+------+----+---+
|XO_121| 0| 1|
|XO_121| 10| 1|
|XO_121| 30| 1|
|XO_121| 20| 1|
|XO_121| 40| 1|
+------+----+---+
Alternatively, you can use first() if you know that 0 is the only value that repeats:
df = df.groupBy("col1","col2").agg(F.first("col2").alias("col2")).orderBy("col2")
df.show()
+------+----+----+
| col1|col2|col2|
+------+----+----+
|XO_121| 0| 0|
|XO_121| 10| 10|
|XO_121| 20| 20|
|XO_121| 30| 30|
|XO_121| 50| 50|
+------+----+----+

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum length in a Spark dataframe. Every column is of string type.
The dataframe is like:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi| |
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | | d |
|3 |b | c | a | d |
+-----+---+----+---+---+
The expectation is:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | a | d |
+-----+---+----+---+---+
I can't figure out how to do this easily with Spark...
Thanks in advance
Note: This approach handles columns being added to or removed from the DataFrame without any code change.
Concatenate all columns except the first (id), compute the length of the result, and keep only the rows whose length is the maximum for their id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val output = input.withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
.withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
.filter($"rowLength" === $"maxLength")
.drop("rowLength", "maxLength")
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(concat_ws("",collect_set(col("A"))).alias("A"),concat_ws("",collect_set(col("B"))).alias("B"),concat_ws("",collect_set(col("C"))).alias("C"),concat_ws("",collect_set(col("D"))).alias("D")).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+

Sorting numeric String in Spark Dataset

Let's assume that I have the following Dataset:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-13| 300|
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-100| 870|
+-----------+----------+
Here productCode is a String and amount is an Int.
Ordering by productCode gives the result below, which is expected given how String comparison works:
def orderProducts(product: Dataset[Product]): Dataset[Product] = {
product.orderBy("productCode")
}
// Output:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-10| 35|
| XX-100| 870|
| XX-13| 300|
| XX-2| 410|
| XX-9| 50|
+-----------+----------+
How can I get the output ordered by the integer part of productCode, as shown below, using the Dataset API?
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-13| 300|
| XX-100| 870|
+-----------+----------+
Use an expression in the orderBy. Check this out:
scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]
scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
| XX-1|250|
| XX-2|410|
| XX-9| 50|
| XX-10| 35|
| XX-13|300|
| XX-100|870|
+-----------+---+
With window functions, you could do it like this:
scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1 |250|1 |
|XX-2 |410|2 |
|XX-9 |50 |3 |
|XX-10 |35 |4 |
|XX-13 |300|5 |
|XX-100 |870|6 |
+-----------+---+----+
Note that Spark warns about moving all the data to a single partition, because the window has no partitionBy.
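Since the question asks about the Dataset API, the same idea can be kept typed. This is a sketch under assumptions: the case class is called `ProductRow` here (a hypothetical name, chosen to avoid clashing with `scala.Product`), and the numeric suffix is pulled out with `regexp_extract` rather than `split`, so the prefix may contain dashes as well.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.regexp_extract

// Hypothetical row type for the sketch.
case class ProductRow(productCode: String, amount: Int)

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("numeric-sort").getOrCreate()
import spark.implicits._

// Order by the trailing digits of productCode, cast to int.
// orderBy with a Column keeps the Dataset[ProductRow] type.
def orderProducts(products: Dataset[ProductRow]): Dataset[ProductRow] =
  products.orderBy(regexp_extract($"productCode", "(\\d+)$", 1).cast("int"))

val products = Seq(
  ProductRow("XX-13", 300), ProductRow("XX-1", 250), ProductRow("XX-2", 410),
  ProductRow("XX-9", 50), ProductRow("XX-10", 35), ProductRow("XX-100", 870)
).toDS()

val sorted = orderProducts(products)
sorted.show()
```

Codes without trailing digits yield an empty match, which casts to null and therefore sorts first in ascending order.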

Spark SQL window function look ahead and complex function

I have the following data:
+-----+----+-----+
|event|t |type |
+-----+----+-----+
| A |20 | 1 |
| A |40 | 1 |
| B |10 | 1 |
| B |20 | 1 |
| B |120 | 1 |
| B |140 | 1 |
| B |320 | 1 |
| B |340 | 1 |
| B |360 | 7 |
| B |380 | 1 |
+-----+----+-----+
And what I want is something like this:
+-----+----+----+
|event|t |grp |
+-----+----+----+
| A |20 |1 |
| A |40 |1 |
| B |10 |2 |
| B |20 |2 |
| B |120 |3 |
| B |140 |3 |
| B |320 |4 |
| B |340 |4 |
| B |380 |5 |
+-----+----+----+
Rules:
Group together all values of the same event whose t (in ms) is at most 50 ms after the previous one; a gap of more than 50 ms starts a new group.
When a row of type 7 appears, start a new group there too and remove that row (see the last rows).
The first rule I can achieve with the answer from this thread:
Code:
val windowSpec= Window.partitionBy("event").orderBy("t")
val newSession = (coalesce(
($"t" - lag($"t", 1).over(windowSpec)),
lit(0)
) > 50).cast("bigint")
val sessionized = df.withColumn("session", sum(newSession).over(windowSpec))
I have to say I can't figure out how it works, and I don't know how to modify it so that rule 2 also works...
Hope someone can give me some useful hints.
What I tried:
val newSession = (coalesce(
($"t" - lag($"t", 1).over(windowSpec)),
lit(0)
) > 50 || lead($"type",1).over(windowSpec) =!= 7 ).cast("bigint")
But only an error occurred: "Must follow method; cannot follow org.apache.spark.sql.Column val grp = (coalesce(
This should do the trick:
// same window as in the question: per event, ordered by t
val win = Window.partitionBy("event").orderBy("t")
val newSession = (coalesce(
($"t" - lag($"t", 1).over(win)),
lit(0)
) > 50
or $"type"===7) // also start new group in this case
.cast("bigint")
df.withColumn("session", sum(newSession).over(win))
.where($"type"=!=7) // remove these rows
.orderBy($"event",$"t")
.show
gives:
+-----+---+----+-------+
|event| t|type|session|
+-----+---+----+-------+
| A| 20| 1| 0|
| A| 40| 1| 0|
| B| 10| 1| 0|
| B| 20| 1| 0|
| B|120| 1| 1|
| B|140| 1| 1|
| B|320| 1| 2|
| B|340| 1| 2|
| B|380| 1| 3|
+-----+---+----+-------+
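Since the question also asks how the flag-and-sum trick works, here is the same logic traced on a plain Scala collection for event B, without Spark. It is only an illustration of the idea; the names `gaps`, `flags` and `sessions` are made up for this walk-through.

```scala
// Rows of event B, already ordered by t: (t, type)
val rows = List((10, 1), (20, 1), (120, 1), (140, 1), (320, 1), (340, 1), (360, 7), (380, 1))

// Gap to the previous row. The first row has no predecessor, so its gap
// counts as 0, which is what coalesce(t - lag(t), lit(0)) does.
val gaps = 0 :: rows.sliding(2).map { case Seq((prev, _), (cur, _)) => cur - prev }.toList

// Flag is 1 where a new group starts: gap larger than 50, or a type 7 row.
val flags = rows.zip(gaps).map { case ((_, tpe), gap) => if (gap > 50 || tpe == 7) 1 else 0 }

// Running sum of the flags: every 1 bumps the session id for all later rows.
// This is the role of sum(newSession).over(win).
val sessions = flags.scanLeft(0)(_ + _).tail

// Finally drop the type 7 rows, like .where($"type" =!= 7).
val result = rows.zip(sessions).collect { case ((t, tpe), s) if tpe != 7 => (t, s) }

result.foreach(println)
```

The type 7 row at t = 360 bumps the counter before it is removed, so the row at t = 380 lands in its own session even though it is only 20 ms after its predecessor.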