I am trying to implement a unique count in Spark Scala.
Below is the transformation I am trying to implement:
case when ([last_revision]=1) and ([source]="AR") then UniqueCount([review_uuid]) OVER ([encounter_id]) end
Input
|last_revision|source|review_uuid |encounter_id|
|-------------|------|--------------|------------|
|1 |AR |123-1234-12345|7654 |
|1 |AR |123-7890-45678|7654 |
|1 |MR |789-1234-12345|7654 |
Expected Output
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |2 |
|1 |AR |123-7890-45678|7654 |2 |
|1 |MR |789-1234-12345|7654 |null |
My code :
.withColumn(
  "reviews_per_encounter",
  when(
    col("last_revision") === "1" && col("source") === "AR",
    size(collect_set(col("review_uuid")).over(Window.partitionBy(col("encounter_id"))))
  )
)
My Output :
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |3 |
|1 |AR |123-7890-45678|7654 |3 |
|1 |MR |789-1234-12345|7654 |null |
Schema :
last_revision : integer
source : string
review_uuid : string
encounter_id : string
reviews_per_encounter : integer
In place of the expected 2 I am getting the value 3; I am not sure what mistake I am making here.
Please help. Thanks.
The output makes perfect sense. As I commented, this is because this:
size(collect_set(col("review_uuid")))
means:
give me the count of unique review_uuids over the whole encounter_id partition; the when condition only decides whether the result is displayed, not which rows feed the window (result: 3).
What you're looking for is:
give me the count of unique review_uuids only over the rows where source is "AR" and last_revision is 1 (result: 2).
Notice the difference: this doesn't actually need window functions and over. You can achieve it using either a subquery or a self join; here's how to do it with a self left join:
df.join(
  df.where(col("last_revision") === lit(1) && col("source") === "AR")
    .select(count_distinct(col("review_uuid")) as "reviews_per_encounter"),
  col("last_revision") === lit(1) && col("source") === "AR",
  "left"
)
Output:
+-------------+------+-----------+------------+---------------------+
|last_revision|source|review_uuid|encounter_id|reviews_per_encounter|
+-------------+------+-----------+------------+---------------------+
| 1| AR| 12345| 7654| 2|
| 1| AR| 45678| 7654| 2|
| 1| MR| 78945| 7654| null|
+-------------+------+-----------+------------+---------------------+
(I used some shortened UUIDs in the output; the real ones were too long to copy :) )
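For reference, here is a self-contained sketch that reproduces the answer above with the question's sample data. Assumptions: a spark-shell style session with spark.implicits._ available; count_distinct needs Spark 3.2+, on older versions use countDistinct instead.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "AR", "123-1234-12345", "7654"),
  (1, "AR", "123-7890-45678", "7654"),
  (1, "MR", "789-1234-12345", "7654")
).toDF("last_revision", "source", "review_uuid", "encounter_id")

// Distinct review_uuids over the filtered rows only; that single value is then
// attached to every row satisfying the same condition via the left join.
val counts = df
  .where(col("last_revision") === 1 && col("source") === "AR")
  .select(count_distinct(col("review_uuid")).as("reviews_per_encounter"))

df.join(counts, col("last_revision") === lit(1) && col("source") === "AR", "left")
  .show(false)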
Related
I have a dataframe called rawEDI that looks something like this:
|Line_number|Segment|
|-----------|-------|
|1          |ST     |
|2          |BPT    |
|3          |SE     |
|4          |ST     |
|5          |BPT    |
|6          |N1     |
|7          |SE     |
|8          |ST     |
|9          |PTD    |
|10         |SE     |
Each row represents a line in a file. Each line is called a segment and is denoted by a segment identifier, a short string. Segments are grouped together in chunks that start with an ST segment identifier and end with an SE segment identifier. There can be any number of ST chunks in a given file, and the size of any given ST chunk is not fixed.
I want to create a new column on the dataframe that represents numerically what ST group a given segment belongs to. This will allow me to use groupBy to perform aggregate operations across all ST segments without having to loop over each individual ST segment, which is too slow.
The final DataFrame would look like this:
|Line_number|Segment|ST_Group|
|-----------|-------|--------|
|1          |ST     |1       |
|2          |BPT    |1       |
|3          |SE     |1       |
|4          |ST     |2       |
|5          |BPT    |2       |
|6          |N1     |2       |
|7          |SE     |2       |
|8          |ST     |3       |
|9          |PTD    |3       |
|10         |SE     |3       |
In short, I want to create and populate a DataFrame column with a number that increments by one whenever the value "ST" appears in the Segment column.
I am using Spark 2.3.2 and Scala 2.11.8.
My initial thought was to use iteration. I collected another DataFrame, df, that contained the starting and ending line_number for each segment, looking like this:
|Start|End|
|-----|---|
|1    |3  |
|4    |7  |
|8    |10 |
Then iterate over the rows of the dataframe and use them to populate the new column like this:
var st = 1
for (row <- df.collect()) {
  val start = row(0)
  val end = row(1)
  var labelSTs = rawEDI.filter("line_number > = ${start}").filter("line_number <= ${end}").withColumn("ST_Group", lit(st))
  st = st + 1
}
However, this yields an empty DataFrame. Additionally, the use of a for loop is time-prohibitive, taking over 20s on my machine for this. Achieving this result without the use of a loop would be huge, but a solution with a loop may also be acceptable if performant.
I have a hunch this can be accomplished using a udf or a Window, but I'm not certain how to attack that.
This
val func = udf((s:String) => if(s == "ST") 1 else 0)
var labelSTs = rawEDI.withColumn("ST_Group", func(col("segment")))
Only populates the column with 1 at each ST segment start.
And this
val w = Window.partitionBy("Segment").orderBy("line_number")
val labelSTs = rawEDI.withColumn("ST_Group", row_number().over(w))
Returns a nonsense dataframe.
One way is to create an intermediate dataframe of "groups" that would tell you on which line each group starts and ends (sort of what you've already done), and then join it to the original table using greater-than/less-than conditions.
Sample data
scala> val input = Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),
(6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE"))
.toDF("linenumber","segment")
scala> input.show(false)
+----------+-------+
|linenumber|segment|
+----------+-------+
|1 |ST |
|2 |BPT |
|3 |SE |
|4 |ST |
|5 |BPT |
|6 |N1 |
|7 |SE |
|8 |ST |
|9 |PTD |
|10 |SE |
+----------+-------+
Create a dataframe for groups, using Window just as your hunch was telling you:
scala> val groups = input.where("segment='ST'")
.withColumn("endline",lead("linenumber",1) over Window.orderBy("linenumber"))
.withColumn("groupnumber",row_number() over Window.orderBy("linenumber"))
.withColumnRenamed("linenumber","startline")
.drop("segment")
scala> groups.show(false)
+---------+-----------+-------+
|startline|groupnumber|endline|
+---------+-----------+-------+
|1 |1 |4 |
|4 |2 |8 |
|8 |3 |null |
+---------+-----------+-------+
Join both to get the result
scala> input.join(groups,
input("linenumber") >= groups("startline") &&
(input("linenumber") < groups("endline") || groups("endline").isNull))
.select("linenumber","segment","groupnumber")
.show(false)
+----------+-------+-----------+
|linenumber|segment|groupnumber|
+----------+-------+-----------+
|1 |ST |1 |
|2 |BPT |1 |
|3 |SE |1 |
|4 |ST |2 |
|5 |BPT |2 |
|6 |N1 |2 |
|7 |SE |2 |
|8 |ST |3 |
|9 |PTD |3 |
|10 |SE |3 |
+----------+-------+-----------+
The only problem with this is Window.orderBy() on an unpartitioned dataframe, which collects all the data into a single partition and thus could be a performance killer on large data.
If you just want to add a column with a number that increments by one whenever the value "ST" appears in the Segment column, you can filter the lines with the ST segment into a separate dataframe,
var labelSTs = rawEDI.filter("Segment == 'ST'")
// then group by Segment and collect the line numbers into a list
var groupedDf = labelSTs.groupBy("Segment").agg(collect_list("Line_number").alias("Line_numbers"))
// now flatten the dataframe back out, keeping track of the line-number order
var flattenedDf = groupedDf.select($"Segment", explode($"Line_numbers").as("Line_number"))
// record the line-number index in your target column ST_Group
val withIndexDF = flattenedDf.withColumn("ST_Group", row_number().over(Window.partitionBy($"Segment").orderBy($"Line_number")))
and you get this as a result:
+-------+-----------+--------+
|Segment|Line_number|ST_Group|
+-------+-----------+--------+
|     ST|          1|       1|
|     ST|          4|       2|
|     ST|          8|       3|
+-------+-----------+--------+
Then you join this back with the other segments in the initial dataframe, as in the sketch below.
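A minimal sketch of that join-back step (my own, not from the original answer), assuming the withIndexDF built above and that org.apache.spark.sql.functions._ and spark.implicits._ are in scope: each raw line gets the ST_Group of the closest ST line at or before it.
// Hypothetical join-back: assign each line the group of the most recent ST line.
val stGroups = withIndexDF.select($"Line_number".as("st_line"), $"ST_Group")

val labelled = rawEDI
  .join(stGroups, rawEDI("Line_number") >= stGroups("st_line"), "left")
  .groupBy(rawEDI("Line_number"), rawEDI("Segment"))
  .agg(max($"ST_Group").as("ST_Group"))
  .orderBy("Line_number")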
Found a simpler way: add a column which will have 1 when the segment column value is ST, otherwise 0. Then use a Window function to compute the cumulative sum of that new column. This will give you the desired result.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val rawEDI=Seq((1,"ST"),(2,"BPT"),(3,"SE"),(4,"ST"),(5,"BPT"),(6,"N1"),(7,"SE"),(8,"ST"),(9,"PTD"),(10,"SE")).toDF("line_number","segment")
val newDf=rawEDI.withColumn("ST_Group", ($"segment" === "ST").cast("bigint"))
val windowSpec = Window.orderBy("line_number")
newDf.withColumn("ST_Group", sum("ST_Group").over(windowSpec))
.show
+-----------+-------+--------+
|line_number|segment|ST_Group|
+-----------+-------+--------+
| 1| ST| 1|
| 2| BPT| 1|
| 3| SE| 1|
| 4| ST| 2|
| 5| BPT| 2|
| 6| N1| 2|
| 7| SE| 2|
| 8| ST| 3|
| 9| PTD| 3|
| 10| SE| 3|
+-----------+-------+--------+
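One caveat, echoing the note under the join-based answer above: Window.orderBy("line_number") without a partitionBy pulls every row into a single partition. If the data has a natural grouping key (for example a hypothetical file_id column, not part of the original data), adding it to the window spec keeps the cumulative sum scalable:
// Sketch only: file_id is a hypothetical grouping column, not in the original data.
val windowSpec = Window.partitionBy("file_id").orderBy("line_number")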
I have this kind of dataset:
val cols = Seq("col_1","col_2")
val data = List(("a",1),
("b",1),
("a",2),
("c",3),
("a",3))
val df = spark.createDataFrame(data).toDF(cols:_*)
+-----+-----+
|col_1|col_2|
+-----+-----+
|a |1 |
|b |1 |
|a |2 |
|c |3 |
|a |3 |
+-----+-----+
I want to add an if-then column based on the existing columns.
df
.withColumn("col_new",
when(col("col_2").isin(2, 5), "str_1")
.when(col("col_2").isin(4, 6), "str_2")
.when(col("col_2").isin(1) && col("col_1").contains("a"), "str_3")
.when(col("col_2").isin(3) && col("col_1").contains("b"), "str_1")
.when(col("col_2").isin(1,2,3), "str_4")
.otherwise(lit("other")))
Instead of the list of when-then statements, I would prefer to apply a custom function. In Python I would run a lambda & map.
thank you!
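As a hedged sketch (my own, not from the original post): the chain of when clauses can be wrapped in a plain Scala function returning a Column, which keeps the logic in Spark's native expressions rather than reaching for a UDF.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Hypothetical helper that encapsulates the when/otherwise chain above.
def labelRow(col1: Column, col2: Column): Column =
  when(col2.isin(2, 5), "str_1")
    .when(col2.isin(4, 6), "str_2")
    .when(col2.isin(1) && col1.contains("a"), "str_3")
    .when(col2.isin(3) && col1.contains("b"), "str_1")
    .when(col2.isin(1, 2, 3), "str_4")
    .otherwise(lit("other"))

df.withColumn("col_new", labelRow(col("col_1"), col("col_2"))).show(false)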
I have a dataframe which I want to transform into the output below, where each row's start_duration and end_duration are derived from the previous row's start_duration and end_duration. Please let me know how to achieve this in Spark using Scala.
Below are the formulas to calculate start_duration and end_duration:
start_duration = max(previous end_duration + 1, current date);
end_duration = min(prescription end_date, start_duration + duration - 1)
Input Data Frame :
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|prescription_uid|patient_uid|ndc |label |dispensation_uid|date |duration|start_date|end_date |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|0 |0 |16714-128|sinvastatin|0 |2015-06-10|30 |2015-06-01|2015-12-01|
|0 |0 |16714-128|sinvastatin|1 |2015-07-15|30 |2015-06-01|2015-12-01|
|0 |0 |16714-128|sinvastatin|2 |2015-08-01|30 |2015-06-01|2015-12-01|
|0 |0 |16714-128|sinvastatin|3 |2015-10-01|30 |2015-06-01|2015-12-01|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
EXPECTED RESULT:
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid|ndc |label |dispensation_uid|date |duration|start_date|end_date |first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|0 |0 |16714-128|sinvastatin|0 |2015-06-10|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-06-10 |2015-07-09 |
|0 |0 |16714-128|sinvastatin|1 |2015-07-15|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-07-15 |2015-08-13 |
|0 |0 |16714-128|sinvastatin|2 |2015-08-01|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-08-14 |2015-09-13 |
|0 |0 |16714-128|sinvastatin|3 |2015-10-01|30 |2015-06-01|2015-12-01|2015-06-10 |2015-07-09 |2015-10-01 |2015-10-30 |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
This continues a previous question; the solution from that thread didn't work when there is a gap between rows, as in the example above where the gap is between the third and last rows:
https://stackoverflow.com/questions/64396803/how-to-apply-window-function-in-memory-transformation-with-new-column-scala/64405160#64405160
Break your problem into two parts:
1. Use lag to get the previous row's values (and lead similarly) and create new columns.
2. Use least (for end_duration) and greatest (for start_duration) to cap the results.
I can help in SQL.
prescription_uid,patient_uid,ndc,label,dispensation_uid,date,duration,start_date,end_date
0,0 ,16714-128,sinvastatin,0,2015-06-10,30,2015-06-01,2015-12-01
0,0 ,16714-128,sinvastatin,1,2015-07-15,30,2015-06-01,2015-12-01
0,0 ,16714-128,sinvastatin,2,2015-08-01,30,2015-06-01,2015-12-01
0,0 ,16714-128,sinvastatin,3,2015-10-01,30,2015-06-01,2015-12-01
import java.text.SimpleDateFormat
import java.util.Date
import java.util.concurrent.TimeUnit
import org.apache.spark.sql.functions._

var df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load("file:///home/xxxx/Ram/sample_stack.csv")
// first_start_duration is simply the date of the first dispensation
var date = df.select("date").collect.map(r => r(0))
var dt = date(0).toString
df = df.withColumn("first_start_duration", lit(dt))
// add (y - 1) days to a yyyy-MM-dd date string
val date_add = udf((x: String, y: Int) => {
  val sdf = new SimpleDateFormat("yyyy-MM-dd")
  val z = y - 1
  val result = new Date(sdf.parse(x).getTime() + TimeUnit.DAYS.toMillis(z))
  sdf.format(result)
})
df = df.withColumn("first_end_duration", date_add($"first_start_duration", $"duration"))
df = df.withColumn("start_duration", df("date"))
df = df.withColumn("end_duration", date_add($"start_duration", $"duration"))
Result :
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid| ndc| label|dispensation_uid| date|duration|start_date| end_date|first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
| 0| 0|16714-128|sinvastatin| 0|2015-06-10| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-06-10| 2015-07-09|
| 0| 0|16714-128|sinvastatin| 1|2015-07-15| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-07-15| 2015-08-13|
| 0| 0|16714-128|sinvastatin| 2|2015-08-01| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-08-01| 2015-08-30|
| 0| 0|16714-128|sinvastatin| 3|2015-10-01| 30|2015-06-01|2015-12-01| 2015-06-10| 2015-07-09| 2015-10-01| 2015-10-30|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
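Note that in this result the third row's start_duration stays at the raw dispensation date (2015-08-01) instead of the expected 2015-08-14, because the recursive max(previous end_duration + 1, date) rule is never applied. Below is a minimal, hedged sketch (not from the original thread) of how that rule could be applied per prescription with the Dataset API, assuming the ids and dates can be treated as yyyy-MM-dd strings, duration casts to an integer, and an active SparkSession named spark.
import java.time.LocalDate
import spark.implicits._

// Assumed, simplified row shapes; the real dataframe has more columns.
case class Dispensation(prescription_uid: String, dispensation_uid: String, date: String, duration: Int, end_date: String)
case class Durations(prescription_uid: String, dispensation_uid: String, start_duration: String, end_duration: String)

val durations = df
  .selectExpr(
    "cast(prescription_uid as string) prescription_uid",
    "cast(dispensation_uid as string) dispensation_uid",
    "date",
    "cast(duration as int) duration",
    "end_date")
  .as[Dispensation]
  .groupByKey(_.prescription_uid)
  .flatMapGroups { (_, rows) =>
    var prevEnd: Option[LocalDate] = None
    rows.toSeq.sortBy(_.date).map { r =>
      val current = LocalDate.parse(r.date)
      val prescriptionEnd = LocalDate.parse(r.end_date)
      // start_duration = max(previous end_duration + 1, current date)
      val start = prevEnd.map(_.plusDays(1)) match {
        case Some(p) if p.isAfter(current) => p
        case _ => current
      }
      // end_duration = min(prescription end_date, start_duration + duration - 1)
      val rawEnd = start.plusDays(r.duration - 1)
      val end = if (rawEnd.isAfter(prescriptionEnd)) prescriptionEnd else rawEnd
      prevEnd = Some(end)
      Durations(r.prescription_uid, r.dispensation_uid, start.toString, end.toString)
    }
  }

// Join the computed columns back onto the original dataframe (cast key types if needed).
val result = df.join(durations, Seq("prescription_uid", "dispensation_uid"))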
Here is some sample data:
val df4 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A6", 30, "9", 1, 450),
("A7", 89, "7", 1, 333),
("A7", 89, "4", 1, 320),
("A2",60, "5", 1, 22),
("A1",45, "22", 1, 1)
)).toDF("CID","age", "children", "marketplace_id","value")
thanks to #Shu for this piece of code
val df5 = df4.selectExpr("CID","""to_json(named_struct("id", children)) as item""", "value", "marketplace_id")
+---+-----------+-----+--------------+
|CID|item |value|marketplace_id|
+---+-----------+-----+--------------+
|A1 |{"id":"5"} |90 |1 |
|A2 |{"id":"1"} |120 |1 |
|A6 |{"id":"9"} |450 |1 |
|A7 |{"id":"7"} |333 |1 |
|A7 |{"id":"4"} |320 |1 |
|A2 |{"id":"5"} |22 |1 |
|A1 |{"id":"22"}|1 |1 |
+---+-----------+-----+--------------+
when you do df5.dtypes
(CID,StringType), (item,StringType), (value,IntegerType), (marketplace_id,IntegerType)
The column item is of string type; is there a way it can be a JSON/object type (if that is a thing)?
EDIT 1:
I will describe what I am trying to achieve here; the two steps above remain the same.
val w = Window.partitionBy("CID").orderBy(desc("value"))
val sorted_list = df5.withColumn("item", collect_list("item").over(w)).groupBy("CID").agg(max("item") as "item")
Output:
+---+-------------------------+
|CID|item |
+---+-------------------------+
|A6 |[{"id":"9"}] |
|A2 |[{"id":"1"}, {"id":"5"}] |
|A7 |[{"id":"7"}, {"id":"4"}] |
|A1 |[{"id":"5"}, {"id":"22"}]|
+---+-------------------------+
Now whatever is inside the [ ] is a string, which is causing a problem for one of the tools we are using.
Sorry, pardon me if this is a basic question; I am new to Scala and Spark.
Store the JSON data using a struct type; check the code below.
scala> dfa
.withColumn("item_without_json",struct($"cid".as("id")))
.withColumn("item_as_json",to_json($"item_without_json"))
.show(false)
+---+-----------+-----+--------------+-----------------+------------+
|CID|item |value|marketplace_id|item_without_json|item_as_json|
+---+-----------+-----+--------------+-----------------+------------+
|A1 |{"id":"A1"}|90 |1 |[A1] |{"id":"A1"} |
|A2 |{"id":"A2"}|120 |1 |[A2] |{"id":"A2"} |
|A6 |{"id":"A6"}|450 |1 |[A6] |{"id":"A6"} |
|A7 |{"id":"A7"}|333 |1 |[A7] |{"id":"A7"} |
|A7 |{"id":"A7"}|320 |1 |[A7] |{"id":"A7"} |
|A2 |{"id":"A2"}|22 |1 |[A2] |{"id":"A2"} |
|A1 |{"id":"A1"}|1 |1 |[A1] |{"id":"A1"} |
+---+-----------+-----+--------------+-----------------+------------+
Based on the comment you made about having the dataset converted to JSON, you would use:
df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .write
  .json(path)
The output will look like:
{"items":[{"id":"A1"},{"id":"A2"},{"id":"A6"},{"id":"A7"}, ...]}
If you need the result in memory to pass down to a function, use toJSON instead of write.json(...).
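A small sketch of that toJSON variant, mirroring the write example above (the column choice is just for illustration):
// Collect the grouped structs as JSON strings in memory instead of writing files.
val jsonStrings: Array[String] = df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .toJSON
  .collect()

jsonStrings.foreach(println) // e.g. {"items":[{"id":"A1"},{"id":"A2"},...]}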
My spark dataframe looks like this:
+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23 |null |dsad |3 |
|11 |44 |null |4 |
|231 |null |temp |5 |
|231 |null |temp |2 |
+------+------+-------+------+
I want to do the calculation for each pair of userid and useid1/userid2 (whichever is not null).
And if it's useid1, I multiply the score by 5, if it's userid2, I multiply the score by 3.
Finally, I want to add all score for each pair.
The result should be:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|11 |44 |20 |
|231 |temp |21 |
+------+--------+-----------+
How can I do this?
For the groupBy part, I know dataframe has the groupBy function, but I don't know if I can use it conditionally, like if userid1 is null, groupby(userid, userid2), if userid2 is null, groupby(userid, useid1).
For the calculation part, how to multiply 3 or 5 based on the condition?
The solution below will help solve your problem.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")
val finalScoreDF = userDF.withColumn("useid1/2", when($"useid1".isNull, $"userid2").otherwise($"useid1"))
  .withColumn("finalscore", when($"useid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore").distinct()
Using the when method in Spark SQL, select useid1 or userid2 and multiply the score based on the condition.
Output:
+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
| 11 | 44| 20.0|
| 23 | dsad| 9.0|
| 231| temp| 21.0|
+------+--------+----------+
Group by will work:
val original = Seq(
(23, null, "dsad", 3),
(11, "44", null, 4),
(231, null, "temp", 5),
(231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")
// action
val result = original
.withColumn("useid1/2", coalesce($"useid1", $"userid2"))
.withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
.groupBy("userid", "useid1/2")
.agg(sum("score").alias("final score"))
result.show(false)
Output:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|231 |temp |21 |
|11 |44 |20 |
+------+--------+-----------+
coalesce will do what's needed:
df.withColumn("useid1/2", coalesce(col("useid1"), col("userid2")))
Basically this function returns the first non-null value in the given order.
documentation :
COALESCE(T v1, T v2, ...)
Returns the first v that is not NULL, or NULL if all v's are NULL.
It needs an import: import org.apache.spark.sql.functions.coalesce