How to Reverse arrangement DataFrame in Apache Spark - scala

How can I reverse this DataFrame using Scala.
I saw sort functions but must specific column, I only want to reverse them
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|1 | james |any |
|3 | marry |some |
|2 | john |some |
|5 | tom |any |
+---+--------+-----+
to:
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|5 | tom |any |
|2 | john |some |
|3 | marry |some |
|1 | james |any |
+---+--------+-----+

You can add a column with increasing id with use of monotonically_increasing_id()
and sort in descending order
val dff = Seq(
(1, "james", "any"),
(3, "marry", "some"),
(2, "john", "some"),
(5, "tom", "any")
).toDF("id", "name", "note")
dff.withColumn("index", monotonically_increasing_id())
.sort($"index".desc)
.drop($"index")
.show(false)
Output:
+---+-----+----+
|id |name |note|
+---+-----+----+
|5 |tom |any |
|2 |john |some|
|3 |marry|some|
|1 |james|any |
+---+-----+----+

You could do something like this:
val reverseDf = df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
.orderBy($"row_num".desc)
.drop("row_num")
Or refer this instead of row number.

Related

How to rank dataframe depending on a group of rows in a column?

I have this dataframe :
+-----+----------+---------+
|num |Timestamp |frequency|
+-----+----------+---------+
|20.0 |1632899456|4 |
|20.0 |1632901256|4 |
|20.0 |1632901796|4 |
|20.0 |1632899155|4 |
|10.0 |1632901743|2 |
|10.0 |1632899933|2 |
|91.0 |1632899756|1 |
|32.0 |1632900776|1 |
|41.0 |1632900176|1 |
+-----+----------+---------+
I want to add a column containing the rank of each frequency. The new dataframe would be like this :
+-----+----------+---------+------------+
|num |Timestamp |frequency|rank |
+-----+----------+---------+------------+
|20.0 |1632899456|4 |1 |
|20.0 |1632901256|4 |1 |
|20.0 |1632901796|4 |1 |
|20.0 |1632899155|4 |1 |
|10.0 |1632901743|2 |2 |
|10.0 |1632899933|2 |2 |
|91.0 |1632899756|1 |3 |
|32.0 |1632900776|1 |3 |
|41.0 |1632900176|1 |3 |
+-----+----------+---------+------------+
I am using Spark version 2.4.3 and SQLContext, with scala language.
You can use dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency")))

Spark Scala, merging two columnar dataframes duplicating the second dataframe each time

I want to merge 2 columns or 2 dataframes like
df1
+--+
|id|
+--+
|1 |
|2 |
|3 |
+--+
df2 --> this one can be a list as well
+--+
|m |
+--+
|A |
|B |
|C |
+--+
I want to have as resulting table
+--+--+
|id|m |
+--+--+
|1 |A |
|1 |B |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |A |
|3 |B |
|3 |C |
+--+--+
def crossJoin(right: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame
Using crossJoin function you can get same result. Please check code below.
scala> dfa.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
scala> dfb.show
+---+
| m|
+---+
| A|
| B|
| C|
+---+
scala> dfa.crossJoin(dfb).orderBy($"id".asc).show(false)
+---+---+
|id |m |
+---+---+
|1 |B |
|1 |A |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |C |
|3 |B |
|3 |A |
+---+---+

how to rename the Columns Produced by count() function in Scala

I have the below df:
+------+-------+--------+
|student| vars|observed|
+------+-------+--------+
| 1| ABC | 19|
| 1| ABC | 1|
| 2| CDB | 1|
| 1| ABC | 8|
| 3| XYZ | 3|
| 1| ABC | 389|
| 2| CDB | 946|
| 1| ABC | 342|
|+------+-------+--------+
I wanted to add a new frequency column groupBy two columns "student", "vars" in SCALA.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies BUT losing observed column from the df.
I would like to create a new df as follows without losing "observed" column
+------+-------+--------+------------+
|student| vars|observed|total_count |
+------+-------+--------+------------+
| 1| ABC | 9|22
| 1| ABC | 1|22
| 2| CDB | 1|7
| 1| ABC | 2|22
| 3| XYZ | 3|3
| 1| ABC | 8|22
| 2| CDB | 6|7
| 1| ABC | 2|22
|+------+-------+-------+--------------+
You cannot do this directly but there are couple of ways,
You can join original df with count df. check here
You collect the observed column while doing aggregation and explode it again
With explode:
val frequency = df.groupBy("student", "vars").agg(collect_list("observed").as("observed_list"),count("*").as("total_count")).select($"student", $"vars",explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
We can use Window functions as well
val windowSpec = Window.partitionBy("student","vars")
val frequency = df.withColumn("total_count", count(col("student")) over windowSpec)
.show
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+

Map values of a column with ArrayType based on values from another dataframe in PySpark

What I have:
| ids. |items |item_id|value|timestamp|
+--------+--------+-------+-----+---------+
|[A,B,C] |1.0 |1 |5 |100 |
|[A,B,D] |1.0 |2 |6 |90 |
|[D] |0.0. |3 |7 |80 |
|[C] |0.0. |4 |8 |80 |
+--------+--------+-------+-----+----------
| ids |id_num |
+--------+--------+
|A |1 |
|B |2 |
|C |3 |
|D |4 |
+---+----+--------+
What I want:
| ids |
+--------+
|[1,2,3] |
|[1,2,4] |
|[3] |
|[4] |
+--------+
Is there a way to do this without an explode? Thank you for your help!
You can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType
# Suppose this is the dictionary you want to map
map_dict = {'A':1, 'B':2,'C':3,'D':4}
def array_map(array_col):
return list(map(map_dict.get, array_col))
"""
If you prefer list comprehension, you can return [map_dict[k] for k in array_col]
"""
array_map_udf = udf(array_map, ArrayType())
df = df.withColumn("mapped_array", array_map_udf(col("ids")))
I can't think of a different method, but to get a parallelized dictionary, you can just use the toJSON method. It will require further processing on the kind of reference df you have:
import json
df_json = df.toJSON().map(lambda x: json.loads(x))

How to find the next occurring item from current row in a data frame using Spark Windowing?

I have the following Dataframe:
+------+----------+-------------+--------------------+---------+-----+----------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |
+------+----------+----------------------------------+---------+-----+----------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |
+------+----------+----------------------------------+---------+-----+----------+
For every row, I have to find the occurrence of next 'PD' Typ at BFS level from the current row and populate its associated ID as a new column named 'NEXT_PD_TYP_ID'
The output I am expecting is:
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |NEXT_PD_TYP_ID |
+------+---------+----------------------------------+----+-----+---------+---------------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |105772 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |105775 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |105775 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |105776 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |null |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |null |
+------+---------+----------------------------------+----+-----+---------+---------------+
Need help.
I have tried using the conditional aggregation: max(when), however since it has more than one 'PD' the max is returning only one value for all the rows.
No error messages
I hope this helps.
I created a new column with ID's of TYP === PD. I called it TYPPDID.
Then I used Window frame ranging from next row to unbounded following row and got the first not-null TYPPDID
orderBy("ID") in the end is only to show records in order.
import org.apache.spark.sql.functions._
val df = Seq(
("105771", "BRIMONIDINE", "PD"),
("105772", "BRIMONIDINE", "PD"),
("105773", "BRIMONIDINE","RV"),
("105774", "TIMOLOL", "RV"),
("105775", "BRIMONIDINE", "PD"),
("105776", "TIMOLOL", "PD")
).toDF("ID", "BFS", "TYP").withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
df: org.apache.spark.sql.DataFrame = [ID: string, BFS: string ... 2 more fields]
scala> df.show
+------+-----------+---+-------+
| ID| BFS|TYP|TYPPDID|
+------+-----------+---+-------+
|105771|BRIMONIDINE| PD| 105771|
|105772|BRIMONIDINE| PD| 105772|
|105773|BRIMONIDINE| RV| null|
|105774| TIMOLOL| RV| null|
|105775|BRIMONIDINE| PD| 105775|
|105776| TIMOLOL| PD| 105776|
+------+-----------+---+-------+
scala> val overColumns = Window.partitionBy("BFS").orderBy("ID").rowsBetween(1, Window.unboundedFollowing)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec#eb923ef
scala> df.withColumn("NEXT_PD_TYP_ID",first("TYPPDID", true).over(overColumns)).orderBy("ID").show(false)
+------+-----------+---+-------+-------+
|ID |BFS |TYP|TYPPDID|NEXT_PD_TYP_ID|
+------+-----------+---+-------+-------+
|105771|BRIMONIDINE|PD |105771 |105772 |
|105772|BRIMONIDINE|PD |105772 |105775 |
|105773|BRIMONIDINE|RV |null |105775 |
|105774|TIMOLOL |RV |null |105776 |
|105775|BRIMONIDINE|PD |105775 |null |
|105776|TIMOLOL |PD |105776 |null |
+------+-----------+---+-------+-------+