Filling blank field in a DataFrame with previous field value - scala

I am working with Scala and Spark and I am relatively new to programming in Scala, so maybe my question has a simple solution.
I have one DataFrame that keeps information about the active and deactivate clients in some promotion. That DataFrame shows the Client Id, the action that he/she took (he can activate or deactivate from the promotion at any time) and the Date that he or she took this action. Here is an example of that format:
Example of how the DataFrame works
I want a daily monitoring of the clients that are active and wish to see how this number varies through the days, but I am not able to code anything that works like that.
My idea was to make a crossJoin of two Dataframes; one that has only the Client Ids and another with only the dates, so I would have all the Dates related to all the Client IDs and I only needed to see the Client Status in each of the Dates (if the Client is active or desactive). So after that I made a left join of these new Dataframe with the DataFrame that related the Client ID and the events, but the result is a lot of dates that have a "null" status and I don't know how to fill it with the correct status. Here's the example:
Example of the final DataFrame
I have already tried to use the lag function, but it did not solve my problem. Does anyone have any idea that could help me?
Thank You!

A slightly expensive operation due to Spark SQL having restrictions on correlated sub-queries with <, <= >, >=.
Starting from your second dataframe with NULLs and assuming that large enough system and volume of data manageable:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// My sample input
val df = Seq(
(1,"2018-03-12", "activate"),
(1,"2018-03-13", null),
(1,"2018-03-14", null),
(1,"2018-03-15", "deactivate"),
(1,"2018-03-16", null),
(1,"2018-03-17", null),
(1,"2018-03-18", "activate"),
(2,"2018-03-13", "activate"),
(2,"2018-03-14", "deactivate"),
(2,"2018-03-15", "activate")
).toDF("ID", "dt", "act")
//df.show(false)
val w = Window.partitionBy("ID").orderBy(col("dt").asc)
val df2 = df.withColumn("rank", dense_rank().over(w)).select("ID", "dt","act", "rank") //.where("rank == 1")
//df2.show(false)
val df3 = df2.filter($"act".isNull)
//df3.show(false)
val df4 = df2.filter(!($"act".isNull)).toDF("ID2", "dt2", "act2", "rank2")
//df4.show(false)
val df5 = df3.join(df4, (df3("ID") === df4("ID2")) && (df4("rank2") < df3("rank")),"inner")
//df5.show(false)
val w2 = Window.partitionBy("ID", "rank").orderBy(col("rank2").desc)
val df6 = df5.withColumn("rank_final", dense_rank().over(w2)).where("rank_final == 1").select("ID", "dt","act2").toDF("ID", "dt", "act")
//df6.show
val df7 = df.filter(!($"act".isNull))
val dfFinal = df6.union(df7)
dfFinal.show(false)
returns:
+---+----------+----------+
|ID |dt |act |
+---+----------+----------+
|1 |2018-03-13|activate |
|1 |2018-03-14|activate |
|1 |2018-03-16|deactivate|
|1 |2018-03-17|deactivate|
|1 |2018-03-12|activate |
|1 |2018-03-15|deactivate|
|1 |2018-03-18|activate |
|2 |2018-03-13|activate |
|2 |2018-03-14|deactivate|
|2 |2018-03-15|activate |
+---+----------+----------+
I solved this step-wise and in a rush, but no so apparent.

Related

Spark How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return back df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work and I'm not surprised, but I think it helps show what I am trying to do if my previous explanation was not sufficient enough. Thank you for your help ahead of time.
Convert the first dataframe df1 to List[String] and then create one udf and apply filter condition
Spark-shell-
import org.apache.spark.sql.functions._
//Converting df1 to list
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
//Creating udf , spark stands for spark session
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
//Applying filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
| key| Value|
+------+--------+
| Date|1/1/2018|
|Number| 17.5|
+------+--------+
Scala-IDE -
val sparkSession=SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df1.csv")
val df2=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df2.csv")
import sparkSession.implicits._
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to List. Convert df2 to Dataset.
case class s(key:String,Value:String)
df2Ds = df2.as[s]
Then we can use the filter method to filter out the records.
Somewhat like this.
def check(str:String):Boolean = {
var i = ""
for(i<-df1List)
{
if(str.contains(i))
return false
}
return true
}
df2Ds.filter(s=>check(s.Value)).collect

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
F.regexp_extract($"value", r, 1).as("id"),
F.regexp_extract($"value", r, 2).as("community")
).show()
A bunch of regular expressions should give you required result.
df.select(
regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use split and regex_replace inbuilt functions to get your desired output dataframe as
import org.apache.spark.sql.functions._
df.select(regexp_replace((split(col("value"), ",")(0)), "\\(", "").as("id"), regexp_replace((split(col("value"), ",")(1)), "\\{community:", "").as("value") ).show()
I hope the answer is helpful

Retaining a field value when the value includes to other values from a separate field

Imagine that we are a health plan and we have a database that records all of the claim line details for every claim a provider bills to us. I want to identify claim ids that were billed with two specific procedure codes. Here is a dummy dataset:
val df = Seq(
("153T32", "D0201"),
("153T32", "D3303"),
("153T32", "F2303"),
("421F32", "D0200"),
("421F32", "D1111"),
("421F32", "D0201"),
("991E32", "D0201"),
("991E32", "F2303"),
("991E32", "A1123"),
("529E52", "G1029"),
("529E52", "B0392")).
toDF("claim_id", "code")
In this fake example, I want to identify the claim_id values that are associated with both code === "D0201" and code === "F2303". I figured out how to do this creating two new DataFrames where each is filtered on one of the code values and then inner joining them together. But if there is a way to do this without creating to intermediary DataFrames, then I would like to see how you all would do it.
Here is my current solution:
val df1 = df.where($"code" === "D0201")
val df2 = df.where($"code" === "F2303")
val joinExpr = df1.col("claim_id") === df2.col("claim_id")
val finalDF = df1.join(df2, joinExpr, "inner").select(df1.col("claim_id"))
finalDF.show()
+--------+
|claim_id|
+--------+
| 153T32|
| 991E32|
+--------+
Assume there are no duplicated rows in the original data frame, here is one approach without joining:
(df.where($"code".isin("D0201", "F2303"))
.groupBy("claim_id").agg(count($"code").as("cnt"))
.where($"cnt" === 2).select("claim_id")
).show
+--------+
|claim_id|
+--------+
| 153T32|
| 991E32|
+--------+

How to calculate product of columns followed by sum over all columns?

Table 1 --Spark DataFrame table
There is a column called "productMe" in Table 1; and there are also other columns like a, b, c and so on whose schema name is contained in a schema array T.
What I want is the inner product of columns(product each row of the two columns) in schema array T with the column productMe(Table 2). And sum each column of Table 2 to get Table 3.
Table 2 is not necessary if you have good idea to get Table 3 in one step.
Table 2 -- Inner product table
For example, the column "a·productMe" is (3*0.2, 6*0.6, 5*0.4) to get (0.6, 3.6, 2)
Table 3 -- sum table
For example, the column "sum(a·productMe)" is 0.6+3.6+2=6.2.
Table 1 is DataFrame of Spark, how can I get Table 3?
You can try something like the following :
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
import org.apache.spark.sql.functions.col
val columnsToSum = df.
columns. // <-- grab all the columns by their name
tail. // <-- skip productMe
map(col). // <-- create Column objects
map(c => round(sum(c * col("productMe")), 3).as(s"sum_${c}_productMe"))
val df2 = df.select(columnsToSum: _*)
df2.show()
# +---------------+---------------+---------------+
# |sum_a_productMe|sum_b_productMe|sum_c_productMe|
# +---------------+---------------+---------------+
# | 6.2| 6.3| 4.3|
# +---------------+---------------+---------------+
The trick is to use df.select(columnsToSum: _*) which means that you want to select all the columns on which we did the sum of columns times the productMe column. The :_* is a Scala-specific syntax to specify that we are passing repeated arguments because we don't have a fix number of arguments.
We can do it with simple SparkSql
val table1 = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
table1.show
table1.createOrReplaceTempView("table1")
val table2 = spark.sql("select a*productMe, b*productMe, c*productMe from table1") //spark is sparkSession here
table2.show
val table3 = spark.sql("select sum(a*productMe), sum(b*productMe), sum(c*productMe) from table1")
table3.show
All the other answers use sum aggregation that use groupBy under the covers.
groupBy always introduces a shuffle stage and usually (always?) is slower than corresponding window aggregates.
In this particular case, I also believe that window aggregates give better performance as you can see in their physical plans and details for their only one job.
CAUTION
Either solution uses one single partition to do the calculation that in turn makes them unsuitable for large datasets as their size together may easily exceed the memory size of a single JVM.
Window Aggregates
What follows is a window aggregate-based calculation which, in this particular case where we group over all the rows in a dataset, unfortunately gives the same physical plan. That makes my answer just a (hopefully) nice learning experience.
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)).toDF("productMe", "a", "b", "c")
// yes, I did borrow this trick with columns from #eliasah's answer
import org.apache.spark.sql.functions.col
val columns = df.columns.tail.map(col).map(c => c * col("productMe") as s"${c}_productMe")
val multiplies = df.select(columns: _*)
scala> multiplies.show
+------------------+------------------+------------------+
| a_productMe| b_productMe| c_productMe|
+------------------+------------------+------------------+
|0.6000000000000001| 1.5|1.2000000000000002|
|3.5999999999999996|1.7999999999999998|0.6000000000000001|
| 2.0| 3.0| 2.5|
+------------------+------------------+------------------+
def sumOverRows(name: String) = sum(name) over ()
val multipliesCols = multiplies.
columns.
map(c => sumOverRows(c) as s"sum_${c}")
val answer = multiplies.
select(multipliesCols: _*).
limit(1) // <-- don't use distinct or dropDuplicates here
scala> answer.show
+-----------------+---------------+-----------------+
| sum_a_productMe|sum_b_productMe| sum_c_productMe|
+-----------------+---------------+-----------------+
|6.199999999999999| 6.3|4.300000000000001|
+-----------------+---------------+-----------------+
Physical Plan
Let's see the physical plan then (as it was the only reason why we wanted to see how to do the query using window aggregates, wasn't it?)
The following is the details for the only job 0.
If I understand your question correctly then following can be your solution
val df = Seq(
(3,0.2,0.5,0.4),
(6,0.6,0.3,0.1),
(5,0.4,0.6,0.5)
).toDF("productMe", "a", "b", "c")
This gives input dataframe as you have (you can add more)
+---------+---+---+---+
|productMe|a |b |c |
+---------+---+---+---+
|3 |0.2|0.5|0.4|
|6 |0.6|0.3|0.1|
|5 |0.4|0.6|0.5|
+---------+---+---+---+
And
val productMe = df.columns.head
val colNames = df.columns.tail
var tempdf = df
for(column <- colNames){
tempdf = tempdf.withColumn(column, col(column)*col(productMe))
}
Above steps should give you Table2
+---------+------------------+------------------+------------------+
|productMe|a |b |c |
+---------+------------------+------------------+------------------+
|3 |0.6000000000000001|1.5 |1.2000000000000002|
|6 |3.5999999999999996|1.7999999999999998|0.6000000000000001|
|5 |2.0 |3.0 |2.5 |
+---------+------------------+------------------+------------------+
Table3 can be achieved as following
tempdf.select(sum("a").as("sum(a.productMe)"), sum("b").as("sum(b.productMe)"), sum("c").as("sum(c.productMe)")).show(false)
Table3 is
+-----------------+----------------+-----------------+
|sum(a.productMe) |sum(b.productMe)|sum(c.productMe) |
+-----------------+----------------+-----------------+
|6.199999999999999|6.3 |4.300000000000001|
+-----------------+----------------+-----------------+
Table2 can be achieved for any number of columns you have but Table3 would require you to define columns explicitly

scala spark - matching dataframes based on variable dates

I'm trying to match two dataframes based on a variable date window. I am not simply trying to get an exact match, which my code achieves but to get all likely candidates within a variable day window.
I was able to get exact matches on dates with my code.
But I want to find out if the records are still viable to match since they could be a few days off either side but would still be reasonable enough to join on.
I've tried looking for something similar to python's pd.to_timedelta('1 day') in spark to add to the filter but alas have struck no luck.
Here is my current code which matches the dataframe on the ID column and then runs a filter to ensure that the from_date in the second dataframe is between the start_date and the end_date of the first dataframe.
What I need is not the exact date match but be able to match records if they fall between a day or two (either side) of the actual dates.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
val df1 = spark.read.option("header","true")
.option("inferSchema","true").csv("../data/df1.csv")
val df2 = spark.read.option("header","true")
.option("inferSchema","true")
.csv("../data/df2.csv")
val df = df2.join(df1,
(df1("ID") === df2("ID")) &&
(df2("from_date") >= df1("start_date")) &&
(df2("from_date") <= df1("end_date")),"left")
.select(df1("ID"), df1("start_date"), df1("end_date"),
$"from_date", $"to_date")
df.coalesce(1).write.format("com.databricks.spark.csv")
.option("header", "true").save("../mydata.csv")
Essentially I want to be able to edit this date window to increase or decrease the data actually matching.
Would really appreciate your input. I'm new to spark/scala but gotta say I'm loving it so far ... soo much faster (and cleaner) than python!
cheers
You can apply date_add and date_sub to start_date/end_date in your join condition, as shown below:
import org.apache.spark.sql.functions._
import java.sql.Date
val df1 = Seq(
(1, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-05")),
(2, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-06")),
(3, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-07"))
).toDF("ID", "start_date", "end_date")
val df2 = Seq(
(1, Date.valueOf("2018-11-30")),
(2, Date.valueOf("2018-12-08")),
(3, Date.valueOf("2018-12-08"))
).toDF("ID", "from_date")
val deltaDays = 1
df2.join( df1,
df1("ID") === df2("ID") &&
df2("from_date") >= date_sub(df1("start_date"), deltaDays) &&
df2("from_date") <= date_add(df1("end_date"), deltaDays),
"left_outer"
).show
// +---+----------+----+----------+----------+
// | ID| from_date| ID|start_date| end_date|
// +---+----------+----+----------+----------+
// | 1|2018-11-30| 1|2018-12-01|2018-12-05|
// | 2|2018-12-08|null| null| null|
// | 3|2018-12-08| 3|2018-12-01|2018-12-07|
// +---+----------+----+----------+----------+
You can get the same results using datediff() function also. Check this out:
scala> val df1 = Seq((1, "2018-12-01", "2018-12-05"),(2, "2018-12-01", "2018-12-06"),(3, "2018-12-01", "2018-12-07")).toDF("ID", "start_date", "end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))
df1: org.apache.spark.sql.DataFrame = [ID: int, start_date: date ... 1 more field]
scala> val df2 = Seq((1, "2018-11-30"), (2, "2018-12-08"),(3, "2018-12-08")).toDF("ID", "from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [ID: int, from_date: date]
scala> val delta = 1;
delta: Int = 1
scala> df2.join(df1,df1("ID") === df2("ID") && datediff('from_date,'start_date) >= -delta && datediff('from_date,'end_date)<=delta, "leftOuter").show(false)
+---+----------+----+----------+----------+
|ID |from_date |ID |start_date|end_date |
+---+----------+----+----------+----------+
|1 |2018-11-30|1 |2018-12-01|2018-12-05|
|2 |2018-12-08|null|null |null |
|3 |2018-12-08|3 |2018-12-01|2018-12-07|
+---+----------+----+----------+----------+
scala>