I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can use a Window function with the built-in lead function to create the pairs, and the built-in to_utc_timestamp function to convert the date to a timestamp. Finally, filter out the unpaired rows, since you don't need them in the output.
The following program implements the explanation above. I have added comments for clarity:
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("group").orderBy("date") //defining window function grouping by group and ordering by date
import org.apache.spark.sql.functions._
df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) //converting the date to epoch datetime you can choose other timezone as required
.withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) //using window for creating pairs
.filter(col("nodeId_2").isNotNull) //filtering out the unpaired rows
.select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) //selecting as required final dataframe
.show(false)
You should get the final dataframe as required
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful.
Note: to get the correct date I used Asia/Kathmandu as the timezone.
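To make the pairing logic concrete, here is a minimal sketch in plain Scala, without Spark: it groups the rows, sorts each group by date, pairs each row with its successor (what lead does over the window), and converts the date to Unix epoch seconds with java.time. The Event and Pair case classes and the epochSeconds helper are hypothetical names used only for illustration. In Spark itself, unix_timestamp is the built-in function that returns epoch seconds.

```scala
import java.time.Instant

// Hypothetical stand-in for one DataFrame row
case class Event(group: Int, nodeId: Int, date: String)
// One output row: two consecutive nodeIds plus the epoch seconds of the first one
case class Pair(group: Int, nodeId1: Int, nodeId2: Int, epochSeconds: Long)

val events = Seq(
  Event(1, 1, "2016-10-12T12:10:00.000Z"),
  Event(1, 2, "2016-10-12T12:00:00.000Z"),
  Event(1, 3, "2016-10-12T12:05:00.000Z"),
  Event(2, 1, "2016-10-12T12:30:00.000Z"),
  Event(2, 2, "2016-10-12T12:35:00.000Z")
)

// ISO-8601 instant string to seconds since 1970-01-01T00:00:00Z
def epochSeconds(iso: String): Long = Instant.parse(iso).getEpochSecond

val pairs: Seq[Pair] = events
  .groupBy(_.group)
  .toSeq
  .flatMap { case (group, rows) =>
    rows
      .sortBy(r => epochSeconds(r.date)) // ascending by date within the group
      .sliding(2)                        // each row together with its successor
      .collect { case Seq(a, b) => Pair(group, a.nodeId, b.nodeId, epochSeconds(a.date)) }
  }
  .sortBy(p => (p.group, p.epochSeconds))

pairs.foreach(println)
```

The last row of each group has no successor, so it produces no pair, which mirrors the isNotNull filter in the Spark version.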
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
(1, 1, "2016-10-12T12:10:00.000Z"),
(1, 2, "2016-10-12T12:00:00.000Z"),
(1, 3, "2016-10-12T12:05:00.000Z"),
(2, 1, "2016-10-12T12:30:00.000Z"),
(2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+
I have a Spark DataFrame with two columns: one is ID and the second is COL_DATETIME, as you can see in the DataFrame given below. How can I filter the DataFrame based on COL_DATETIME to get the oldest month's data? I want to achieve the result dynamically because I have 20-odd DataFrames.
INPUT DF:-
import spark.implicits._
val data = Seq((1 , "2020-07-02 00:00:00.0"),(2 , "2020-08-02 00:00:00.0"),(3 , "2020-09-02 00:00:00.0"),(4 , "2020-10-02 00:00:00.0"),(5 , "2020-11-02 00:00:00.0"),(6 , "2020-12-02 00:00:00.0"),(7 , "2021-01-02 00:00:00.0"),(8 , "2021-02-02 00:00:00.0"),(9 , "2021-03-02 00:00:00.0"),(10, "2021-04-02 00:00:00.0"),(11, "2021-05-02 00:00:00.0"),(12, "2021-06-02 00:00:00.0"),(13, "2021-07-22 00:00:00.0"))
val dfFromData1 = data.toDF("ID","COL_DATETIME").withColumn("COL_DATETIME",to_timestamp(col("COL_DATETIME")))
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
OUTPUT:-
DF1 : - Oldest month data
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
+------+---------------------+
DF2:- latest months' data after removing the oldest month's data from the original DF.
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
logic/approach:-
Step1:- calculate the minimum datetime of the COL_DATETIME column for the given DataFrame and assign it to the mindate variable.
Let's assume I will get
var mindate = "2020-07-02 00:00:00.0"
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
print(mindate)
result:-
mindate : org.apache.spark.sql.Row = [2020-07-02 00:00:00.0]
[2020-07-02 00:00:00.0]
Step2:- get the end date of the month using mindate. I don't have code for this part yet.
val enddatemonth = "2020-07-31 00:00:00.0"
Step3 : - Now I can use enddatemonth variable to filter the spark dataframe in DF1 and DF2 based on conditions.
Even when I try to filter the DataFrame based on mindate, I get an error:
val DF1 = dfFromData1.where(col("COL_DATETIME") <= enddatemonth)
val DF2 = dfFromData1.where(col("COL_DATETIME") > enddatemonth)
Error : <console>:166: error: type mismatch;
found : org.apache.spark.sql.Row
required: org.apache.spark.sql.Column
val DF1 = dfFromData1.where(col("COL_DATETIME" )<= mindate)
Thanks...!!
A similar approach, but I find it cleaner to deal just with months.
Idea: just as we have an epoch for seconds, compute one for months.
val dfWithEpochMonth = dfFromData1.
withColumn("year",year($"COL_DATETIME")).
withColumn("month",month($"COL_DATETIME")).
withColumn("epochMonth", (($"year" - 1970 - 1) * 12) + $"month")
Now your df will look like :
+---+-------------------+----+-----+----------+
| ID| COL_DATETIME|year|month|epochMonth|
+---+-------------------+----+-----+----------+
| 1|2020-07-02 00:00:00|2020| 7| 595|
| 2|2020-08-02 00:00:00|2020| 8| 596|
| 3|2020-09-02 00:00:00|2020| 9| 597|
| 4|2020-10-02 00:00:00|2020| 10| 598|
Now, you can calculate min epochMonth and filter directly.
val minEpochMonth = dfWithEpochMonth.select(min("epochMonth")).first().apply(0).toString().toInt
val df1 = dfWithEpochMonth.where($"epochMonth" <= minEpochMonth)
val df2 = dfWithEpochMonth.where($"epochMonth" > minEpochMonth)
You can drop unnecessary columns.
To address your error message :
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
val mindateString = mindate.apply(0).toString()
Now you can use mindateString to filter.
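For the month-boundary step that the question left open, here is a minimal sketch using java.time in plain Scala. It assumes the minimum date is available as a string in the format shown above. The startOfNextMonth helper and the sample rows are hypothetical. It uses the start of the following month as an exclusive cutoff rather than "2020-07-31 00:00:00.0", because rows later on the last day of the month would otherwise fall on the wrong side.

```scala
import java.time.{LocalDateTime, YearMonth}
import java.time.format.DateTimeFormatter

// Format of the strings in the question, e.g. "2020-07-02 00:00:00.0"
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S")

// Hypothetical helper: midnight on the first day of the month after `minDate`
def startOfNextMonth(minDate: String): LocalDateTime =
  YearMonth.from(LocalDateTime.parse(minDate, fmt)).plusMonths(1).atDay(1).atStartOfDay()

// A few sample rows standing in for the DataFrame
val rows = Seq(
  1 -> "2020-07-02 00:00:00.0",
  2 -> "2020-08-02 00:00:00.0",
  13 -> "2021-07-22 00:00:00.0"
)

// Lexicographic min is the chronological min for this fixed-width format
val minDate = rows.map(_._2).min
val cutoff = startOfNextMonth(minDate)

// Rows before the cutoff belong to the oldest month, the rest are newer
val (oldestMonth, rest) = rows.partition { case (_, ts) =>
  LocalDateTime.parse(ts, fmt).isBefore(cutoff)
}
```

In Spark, one option would be to turn cutoff into a literal with lit(java.sql.Timestamp.valueOf(cutoff)) and compare COL_DATETIME against it with < and >=.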
The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How can I get the total count of each word, like this:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (e.g. no minDF or vocabSize limitations). In these cases you can use Summarizer to sum the vectors directly. Note: Summarizer requires Spark 2.3+.
import org.apache.spark.ml.linalg.Vector // the ML Vector type used in .as[(Vector, Vector)] below
import org.apache.spark.ml.stat.Summarizer.metrics
// You need to select normL1 and another item (like mean) because, for some reason, Spark
// won't allow one Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
.select(metrics("normL1", "mean").summary($"features").as("summary"))
.select("summary.normL1", "summary.mean")
.as[(Vector, Vector)]
.first()
._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
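The zip step could look roughly like the sketch below. The two arrays are hypothetical stand-ins: in Spark they would come from cvModel.vocabulary and totalCounts.toArray, and the order of the words depends on the fitted vocabulary.

```scala
// Hypothetical stand-ins for cvModel.vocabulary and totalCounts.toArray
val vocabulary: Array[String] = Array("a", "b", "c")
val totalCounts: Array[Double] = Array(3.0, 3.0, 2.0)

// Pair each word with its summed count; the position in the vocabulary becomes the id
val wordCounts: List[(Int, String, Long)] =
  vocabulary.zip(totalCounts).zipWithIndex.map {
    case ((word, count), id) => (id, word, count.toLong)
  }.toList

wordCounts.foreach(println)
```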
You can simply explode and groupBy to get the count of each word
cvModel.transform(df).withColumn("words", explode($"words"))
.groupBy($"words")
.agg(count($"words").as("counts"))
.withColumn("id", row_number().over(Window.orderBy("words")) -1)
.show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a |3 |0 |
|b |3 |1 |
|c |2 |2 |
+-----+------+---+
For a given order_id, I am trying to count how many orders with a payment there were in the past 365 days. That part is not the problem: I use a window function.
Where it gets tricky for me is: I don't want to count orders in this time window where the payment_date is after order_date of the current order_id.
Currently, I have something like this:
val window: WindowSpec = Window
.partitionBy("customer_id")
.orderBy("order_date")
.rangeBetween(-365*days, -1)
and
df.withColumn("paid_order_count", count("*") over window)
which would count all orders for the customer within the last 365 days before his current order.
How can I now incorporate a condition for the counting that takes the order_date of the current order into account?
Example:
+---------+-----------+-------------+------------+
|order_id |order_date |payment_date |customer_id |
+---------+-----------+-------------+------------+
|1 |2017-01-01 |2017-01-10 |A |
|2 |2017-02-01 |2017-02-10 |A |
|3 |2017-02-02 |2017-02-20 |A |
The resulting table should look like this:
+---------+-----------+-------------+------------+-----------------+
|order_id |order_date |payment_date |customer_id |paid_order_count |
+---------+-----------+-------------+------------+-----------------+
|1 |2017-01-01 |2017-01-10 |A |0 |
|2 |2017-02-01 |2017-02-10 |A |1 |
|3 |2017-02-02 |2017-02-20 |A |1 |
For order_id = 3 the paid_order_count should not be 2 but 1 as order_id = 2 is paid after order_id = 3 is placed.
I hope that I explained my problem well and look forward to your ideas!
Very good question!!!
A couple of remarks: a frame defined by a number of rows (rowsBetween) is fixed by the row count and not by the values, so a 365-row frame would be problematic in these cases:
1. the customer does not have orders every single day, so a 365-row window might contain rows with order_date well over a year ago
2. the customer has more than one order per day, which throws off the one-year coverage
3. a combination of 1 and 2
rangeBetween is based on the values of the ordering column instead, but it does not work directly with Date and Timestamp data types.
To solve it, you can use a window function that collects lists, together with a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.sparkContext.parallelize(Seq(
(1, "2017-01-01", "2017-01-10", "A")
, (2, "2017-02-01", "2017-02-10", "A")
, (3, "2017-02-02", "2017-02-20", "A")
)
).toDF("order_id", "order_date", "payment_date", "customer_id")
.withColumn("order_date_ts", to_timestamp($"order_date", "yyyy-MM-dd").cast("long"))
.withColumn("payment_date_ts", to_timestamp($"payment_date", "yyyy-MM-dd").cast("long"))
// df.printSchema()
// df.show(false)
val window = Window.partitionBy("customer_id").orderBy("order_date_ts").rangeBetween(Window.unboundedPreceding, -1)
val count_filtered_dates = udf( (days: Int, top: Long, array: Seq[Long]) => {
val bottom = top - days.toLong * 60 * 60 * 24 // timestamps cast to long are in seconds; convert to Long first so the product cannot overflow Int
array.count(v => v >= bottom && v < top)
}
)
val res = df.withColumn("paid_orders", collect_list("payment_date_ts") over window)
.withColumn("paid_order_count", count_filtered_dates(lit(365), $"order_date_ts", $"paid_orders"))
res.show(false)
Output:
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|order_id|order_date|payment_date|customer_id|order_date_ts|payment_date_ts|paid_orders |paid_order_count|
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|1 |2017-01-01|2017-01-10 |A |1483228800 |1484006400 |[] |0 |
|2 |2017-02-01|2017-02-10 |A |1485907200 |1486684800 |[1484006400] |1 |
|3 |2017-02-02|2017-02-20 |A |1485993600 |1487548800 |[1484006400, 1486684800]|1 |
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
Converting the dates to Spark timestamps in seconds makes the lists more memory-efficient.
This is the easiest code to implement, but not the most efficient, as the lists take up some memory. A custom UDAF would be best but requires more coding; I might add one later. This will still work if you have thousands of orders per customer.
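The counting rule can be checked in isolation with a minimal plain Scala sketch, without Spark. The Order case class and the countPaidBefore name are hypothetical; the epoch values are the ones from the output above.

```scala
// Hypothetical stand-in for one order, with dates already converted to epoch seconds
case class Order(orderId: Int, orderTs: Long, paymentTs: Long)

val secondsPerDay = 24L * 60 * 60

// Same rule as the UDF above: a payment counts if it falls in [top - days, top)
def countPaidBefore(days: Int, top: Long, payments: Seq[Long]): Int = {
  val bottom = top - days * secondsPerDay
  payments.count(p => p >= bottom && p < top)
}

val orders = Seq(
  Order(1, 1483228800L, 1484006400L),
  Order(2, 1485907200L, 1486684800L),
  Order(3, 1485993600L, 1487548800L)
)

// For each order, look only at the payments of strictly earlier orders,
// which is what the unboundedPreceding .. -1 frame collects
val paidOrderCounts: Seq[(Int, Int)] = orders.sortBy(_.orderTs).map { o =>
  val earlierPayments = orders.filter(_.orderTs < o.orderTs).map(_.paymentTs)
  (o.orderId, countPaidBefore(365, o.orderTs, earlierPayments))
}

paidOrderCounts.foreach(println)
```

Order 3 gets a count of 1 because the payment for order 2 happened after order 3 was placed.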
We have the following input dataframe.
df1
Dep |Gender|Salary|DOB |Place
Finance |Male |5000 |2009-02-02 00:00:00|UK
HR |Female|6000 |2006-02-02 00:00:00|null
HR |Male |14200 |null |US
IT |Male |null |2008-02-02 00:00:00|null
IT |Male |55555 |2008-02-02 00:00:00|UK
Marketing|Female|12200 |2005-02-02 00:00:00|UK
I used the following code to find the count:
df = df1.groupBy(df1['Dep'])
df2 = df.agg({'Salary':'count'})
df2.show()
The result is:
Dep |count(Salary)
Finance |1
HR |2
Marketing|1
IT |1
The expected result is shown below.
Dep |count(Salary)
Finance |1
HR |2
Marketing|1
IT |2
The issue comes from the 4th row, where Salary is null: the count operation does not count null values.
I would appreciate your help in solving this issue.
You can replace the null values before counting, so that those rows are included in the count:
df1 \
.na.fill({'Salary':0}) \
.groupBy('Dep') \
.agg({'Salary':'count'})
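The difference between the two counts can be illustrated with a small plain Scala sketch, where a missing salary is modelled as None. The rows value and both result names are hypothetical stand-ins for the DataFrame and its aggregates.

```scala
// Hypothetical stand-in for the rows: a null Salary is modelled as None
val rows: Seq[(String, Option[Int])] = Seq(
  ("Finance", Some(5000)),
  ("HR", Some(6000)),
  ("HR", Some(14200)),
  ("IT", None),
  ("IT", Some(55555)),
  ("Marketing", Some(12200))
)

// Counting only the salaries that are present, as count(Salary) does
val nonNullCounts: Map[String, Int] =
  rows.groupBy(_._1).map { case (dep, rs) => dep -> rs.count(_._2.isDefined) }

// Counting after replacing missing salaries with 0, as in the na.fill approach
val filledCounts: Map[String, Int] =
  rows.groupBy(_._1).map { case (dep, rs) => dep -> rs.map(_._2.getOrElse(0)).size }
```

If the goal is simply the number of rows per department, counting rows instead of a nullable column gives the same result without changing the data.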
I have an RDD with multiple rows which looks like below.
val row = [(String, String), (String, String, String)]
The value is a sequence of Tuples. In the tuple, the last String is a timestamp and the second one is category. I want to filter this sequence based on maximum timestamp for each category.
(A,B) Id Category Timestamp
-------------------------------------------------------
(123,abc) 1 A 2016-07-22 21:22:59+0000
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
(234,bcd) 2 B 2015-07-20 21:21:20+0000
I want one row for each of the categories. I am looking for help with getting the row with the max timestamp for each category. The result I am looking to get is
(A,B) Id Category Timestamp
-------------------------------------------------------
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
Given input dataframe as
+---------+---+--------+------------------------+
|(A,B) |Id |Category|Timestamp |
+---------+---+--------+------------------------+
|[123,abc]|1 |A |2016-07-22 21:22:59+0000|
|[234,bcd]|2 |B |2016-07-20 21:21:20+0000|
|[123,abc]|1 |A |2017-07-09 21:22:59+0000|
|[345,cde]|4 |C |2016-07-05 09:22:30+0000|
|[456,def]|5 |D |2016-07-21 07:32:06+0000|
|[234,bcd]|2 |B |2015-07-20 21:21:20+0000|
+---------+---+--------+------------------------+
You can do the following to get the result dataframe you require
import org.apache.spark.sql.functions._
val requiredDataframe = df.orderBy($"Timestamp".desc).groupBy("Category").agg(first("(A,B)").as("(A,B)"), first("Id").as("Id"), first("Timestamp").as("Timestamp"))
You should have the requiredDataframe as
+--------+---------+---+------------------------+
|Category|(A,B) |Id |Timestamp |
+--------+---------+---+------------------------+
|B |[234,bcd]|2 |2016-07-20 21:21:20+0000|
|D |[456,def]|5 |2016-07-21 07:32:06+0000|
|C |[345,cde]|4 |2016-07-05 09:22:30+0000|
|A |[123,abc]|1 |2017-07-09 21:22:59+0000|
+--------+---------+---+------------------------+
You can do the same with a Window function, as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("Category").orderBy($"Timestamp".desc)
df.withColumn("rank", rank().over(windowSpec)).filter($"rank" === lit(1)).drop("rank")
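Since the question starts from an RDD of tuples, the same "latest row per category" idea can also be sketched with plain Scala collections. The Record case class and the field names are hypothetical, and the timestamp pattern is an assumption based on the sample values shown above.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

// Hypothetical stand-in for one row of the input
case class Record(ab: (String, String), id: Int, category: String, timestamp: String)

// Assumed pattern for strings such as "2016-07-22 21:22:59+0000"
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ssZ")

val records = Seq(
  Record(("123", "abc"), 1, "A", "2016-07-22 21:22:59+0000"),
  Record(("234", "bcd"), 2, "B", "2016-07-20 21:21:20+0000"),
  Record(("123", "abc"), 1, "A", "2017-07-09 21:22:59+0000"),
  Record(("345", "cde"), 4, "C", "2016-07-05 09:22:30+0000"),
  Record(("456", "def"), 5, "D", "2016-07-21 07:32:06+0000"),
  Record(("234", "bcd"), 2, "B", "2015-07-20 21:21:20+0000")
)

// Keep, for each category, the record with the largest timestamp
val latestPerCategory: Map[String, Record] =
  records.groupBy(_.category).map { case (category, rows) =>
    category -> rows.maxBy(r => OffsetDateTime.parse(r.timestamp, fmt).toEpochSecond)
  }

latestPerCategory.values.foreach(println)
```

On an RDD keyed by category, a reduceByKey that keeps the record with the later timestamp would follow the same idea.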