How can I reverse this DataFrame using Scala?
I saw sort functions, but they require a specific column; I only want to reverse the existing row order:
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|1 | james |any |
|3 | marry |some |
|2 | john |some |
|5 | tom |any |
+---+--------+-----+
to:
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|5 | tom |any |
|2 | john |some |
|3 | marry |some |
|1 | james |any |
+---+--------+-----+
You can add a column with an increasing id using monotonically_increasing_id() and then sort on it in descending order:
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._   // for toDF and the $ column syntax

val dff = Seq(
  (1, "james", "any"),
  (3, "marry", "some"),
  (2, "john", "some"),
  (5, "tom", "any")
).toDF("id", "name", "note")

dff.withColumn("index", monotonically_increasing_id())
  .sort($"index".desc)
  .drop($"index")
  .show(false)
Output:
+---+-----+----+
|id |name |note|
+---+-----+----+
|5 |tom |any |
|2 |john |some|
|3 |marry|some|
|1 |james|any |
+---+-----+----+
You could do something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val reverseDf = df.withColumn("row_num", row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1))))
  .orderBy($"row_num".desc)
  .drop("row_num")
Or use monotonically_increasing_id(), as shown above, instead of row_number.
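DataFrames do not guarantee row order in general, so if you want an index that is guaranteed to follow the rows' current order, one option is zipWithIndex on the underlying RDD. A minimal sketch, assuming a SparkSession named spark with spark.implicits._ in scope and df as the DataFrame to reverse:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a positional index that follows the current row order, then reverse on it.
val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val schema  = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))

spark.createDataFrame(indexed, schema)
  .sort($"index".desc)
  .drop("index")
  .show(false)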
I have this dataframe :
+-----+----------+---------+
|num |Timestamp |frequency|
+-----+----------+---------+
|20.0 |1632899456|4 |
|20.0 |1632901256|4 |
|20.0 |1632901796|4 |
|20.0 |1632899155|4 |
|10.0 |1632901743|2 |
|10.0 |1632899933|2 |
|91.0 |1632899756|1 |
|32.0 |1632900776|1 |
|41.0 |1632900176|1 |
+-----+----------+---------+
I want to add a column containing the rank of each frequency. The new dataframe would be like this :
+-----+----------+---------+------------+
|num |Timestamp |frequency|rank |
+-----+----------+---------+------------+
|20.0 |1632899456|4 |1 |
|20.0 |1632901256|4 |1 |
|20.0 |1632901796|4 |1 |
|20.0 |1632899155|4 |1 |
|10.0 |1632901743|2 |2 |
|10.0 |1632899933|2 |2 |
|91.0 |1632899756|1 |3 |
|32.0 |1632900776|1 |3 |
|41.0 |1632900176|1 |3 |
+-----+----------+---------+------------+
I am using Spark 2.4.3 with SQLContext, in Scala.
You can use dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, desc}

val df2 = df.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency"))))
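For reference, a self-contained sketch on the sample data, reusing the imports above and assuming spark.implicits._ is in scope. dense_rank assigns consecutive ranks with no gaps, so all rows sharing a frequency share a rank, which matches the expected output above:
val sample = Seq(
  (20.0, 1632899456L, 4), (20.0, 1632901256L, 4), (20.0, 1632901796L, 4),
  (20.0, 1632899155L, 4), (10.0, 1632901743L, 2), (10.0, 1632899933L, 2),
  (91.0, 1632899756L, 1), (32.0, 1632900776L, 1), (41.0, 1632900176L, 1)
).toDF("num", "Timestamp", "frequency")

// frequency 4 -> rank 1, 2 -> rank 2, 1 -> rank 3
sample.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency")))).show(false)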
I want to merge 2 columns or 2 dataframes, like this:
df1
+--+
|id|
+--+
|1 |
|2 |
|3 |
+--+
df2 --> this one can be a list as well
+--+
|m |
+--+
|A |
|B |
|C |
+--+
I want to have as resulting table
+--+--+
|id|m |
+--+--+
|1 |A |
|1 |B |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |A |
|3 |B |
|3 |C |
+--+--+
def crossJoin(right: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame
Using the crossJoin function you can get the same result. Please check the code below.
scala> dfa.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
scala> dfb.show
+---+
| m|
+---+
| A|
| B|
| C|
+---+
scala> dfa.crossJoin(dfb).orderBy($"id".asc).show(false)
+---+---+
|id |m |
+---+---+
|1 |B |
|1 |A |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |C |
|3 |B |
|3 |A |
+---+---+
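If the second side starts out as a plain Scala list (as the question mentions), one minimal sketch, assuming spark.implicits._ is in scope, is to lift it into a single-column DataFrame first and then cross join:
val ms  = List("A", "B", "C")
val dfb = ms.toDF("m")

dfa.crossJoin(dfb).orderBy($"id".asc, $"m".asc).show(false)
Adding $"m".asc to the orderBy just makes the output deterministic within each id.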
I have the below df:
+------+-------+--------+
|student| vars|observed|
+------+-------+--------+
| 1| ABC | 19|
| 1| ABC | 1|
| 2| CDB | 1|
| 1| ABC | 8|
| 3| XYZ | 3|
| 1| ABC | 389|
| 2| CDB | 946|
| 1| ABC | 342|
+------+-------+--------+
I want to add a new frequency column grouped by the two columns "student" and "vars" in Scala.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies, but it loses the "observed" column from the df.
I would like to create a new df as follows without losing "observed" column
+-------+-----+--------+-----------+
|student| vars|observed|total_count|
+-------+-----+--------+-----------+
|      1| ABC |       9|         22|
|      1| ABC |       1|         22|
|      2| CDB |       1|          7|
|      1| ABC |       2|         22|
|      3| XYZ |       3|          3|
|      1| ABC |       8|         22|
|      2| CDB |       6|          7|
|      1| ABC |       2|         22|
+-------+-----+--------+-----------+
You cannot do this directly, but there are a couple of ways:
You can join the original df with the count df (see the sketch below).
You can collect the observed column while aggregating and then explode it again.
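A minimal sketch of the first option: compute the counts per ("student", "vars") and join them back onto the original rows, so "observed" is kept:
val counts = df.groupBy("student", "vars").count().withColumnRenamed("count", "total_count")
df.join(counts, Seq("student", "vars")).show(false)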
With explode:
import org.apache.spark.sql.functions.{collect_list, count, explode}

val frequency = df.groupBy("student", "vars")
  .agg(collect_list("observed").as("observed_list"), count("*").as("total_count"))
  .select($"student", $"vars", explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
We can use window functions as well:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val windowSpec = Window.partitionBy("student", "vars")
val frequency = df.withColumn("total_count", count(col("student")).over(windowSpec))
frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
What I have:
+--------+------+-------+-----+---------+
|ids     |items |item_id|value|timestamp|
+--------+------+-------+-----+---------+
|[A,B,C] |1.0   |1      |5    |100      |
|[A,B,D] |1.0   |2      |6    |90       |
|[D]     |0.0   |3      |7    |80       |
|[C]     |0.0   |4      |8    |80       |
+--------+------+-------+-----+---------+
| ids |id_num |
+--------+--------+
|A |1 |
|B |2 |
|C |3 |
|D |4 |
+--------+--------+
What I want:
| ids |
+--------+
|[1,2,3] |
|[1,2,4] |
|[3] |
|[4] |
+--------+
Is there a way to do this without an explode? Thank you for your help!
You can use a UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# Suppose this is the dictionary you want to map
map_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

def array_map(array_col):
    # If you prefer a list comprehension: [map_dict[k] for k in array_col]
    return list(map(map_dict.get, array_col))

array_map_udf = udf(array_map, ArrayType(IntegerType()))

df = df.withColumn("mapped_array", array_map_udf(col("ids")))
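The same idea as a minimal Scala sketch, since the rest of the thread uses Scala; the Map literal is an assumption taken from the lookup table in the question, and the column name "ids" comes from the first table:
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical lookup built from the question's second table.
val idNum = Map("A" -> 1, "B" -> 2, "C" -> 3, "D" -> 4)

// Map each element of the array column through the lookup; no explode needed.
// (Keys missing from the map would throw here; switch to .get/.getOrElse if that matters.)
val mapIds = udf((ids: Seq[String]) => ids.map(idNum))

df.withColumn("ids", mapIds(col("ids"))).select("ids").show(false)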
I can't think of a different method, but to get a parallelized dictionary you can just use the toJSON method. It will require further processing depending on the kind of reference df you have:
import json
df_json = df.toJSON().map(lambda x: json.loads(x))
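To avoid hard-coding the dictionary, a small Scala sketch (refDf is a hypothetical name for the lookup DataFrame, its "ids" and "id_num" columns are assumed to be string and integer, and spark.implicits._ is assumed in scope) that collects it into a driver-side map instead of going through toJSON:
val idNum: Map[String, Int] =
  refDf.select("ids", "id_num").as[(String, Int)].collect().toMap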
I have the following Dataframe:
+------+----------+-------------+--------------------+---------+-----+----------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |
+------+----------+-------------+--------------------+---------+-----+----------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |
+------+----------+-------------+--------------------+---------+-----+----------+
For every row, I have to find the next occurrence of a 'PD' TYP at the BFS level after the current row and populate its associated ID into a new column named 'NEXT_PD_TYP_ID'.
The output I am expecting is:
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |NEXT_PD_TYP_ID |
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |105772 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |105775 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |105775 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |105776 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |null |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |null |
+------+---------+-------------+--------------------+----+-----+---------+---------------+
I need help with this. I have tried conditional aggregation with max(when), but since there is more than one 'PD', max returns only one value for all the rows. There are no error messages.
I hope this helps.
I created a new column with the IDs of rows where TYP === "PD" and called it TYPPDID.
Then I used a window frame ranging from the next row to the unbounded following row and took the first non-null TYPPDID.
The orderBy("ID") at the end is only there to show the records in order.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(
("105771", "BRIMONIDINE", "PD"),
("105772", "BRIMONIDINE", "PD"),
("105773", "BRIMONIDINE","RV"),
("105774", "TIMOLOL", "RV"),
("105775", "BRIMONIDINE", "PD"),
("105776", "TIMOLOL", "PD")
).toDF("ID", "BFS", "TYP").withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
df: org.apache.spark.sql.DataFrame = [ID: string, BFS: string ... 2 more fields]
scala> df.show
+------+-----------+---+-------+
| ID| BFS|TYP|TYPPDID|
+------+-----------+---+-------+
|105771|BRIMONIDINE| PD| 105771|
|105772|BRIMONIDINE| PD| 105772|
|105773|BRIMONIDINE| RV| null|
|105774| TIMOLOL| RV| null|
|105775|BRIMONIDINE| PD| 105775|
|105776| TIMOLOL| PD| 105776|
+------+-----------+---+-------+
scala> val overColumns = Window.partitionBy("BFS").orderBy("ID").rowsBetween(1, Window.unboundedFollowing)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@eb923ef
scala> df.withColumn("NEXT_PD_TYP_ID",first("TYPPDID", true).over(overColumns)).orderBy("ID").show(false)
+------+-----------+---+-------+--------------+
|ID    |BFS        |TYP|TYPPDID|NEXT_PD_TYP_ID|
+------+-----------+---+-------+--------------+
|105771|BRIMONIDINE|PD |105771 |105772        |
|105772|BRIMONIDINE|PD |105772 |105775        |
|105773|BRIMONIDINE|RV |null   |105775        |
|105774|TIMOLOL    |RV |null   |105776        |
|105775|BRIMONIDINE|PD |105775 |null          |
|105776|TIMOLOL    |PD |105776 |null          |
+------+-----------+---+-------+--------------+
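To match the expected output columns in the question exactly, the helper column can be dropped at the end; a minimal sketch continuing the snippet above:
df.withColumn("NEXT_PD_TYP_ID", first("TYPPDID", ignoreNulls = true).over(overColumns))
  .drop("TYPPDID")
  .orderBy("ID")
  .show(false)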