I am trying to compare the current and previous rows in the DataFrame below, and I want to recalculate the AMOUNT column.
scala> val dataset = sc.parallelize(Seq((1, 123, 50), (2, 456, 30), (3, 456, 70), (4, 789, 80))).toDF("SL_NO","ID","AMOUNT")
scala> dataset.show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 70|
| 4|789| 80|
+-----+---+------+
Calculation logic:
For row 1, keep the AMOUNT of SL_NO 1 (i.e. 50).
For row 2, if the ID of SL_NO 2 differs from the ID of SL_NO 1, keep the AMOUNT of SL_NO 2 (i.e. 30); otherwise take the AMOUNT of SL_NO 1 (i.e. 50).
For row 3, if the ID of SL_NO 3 differs from the ID of SL_NO 2, keep the AMOUNT of SL_NO 3 (i.e. 70); otherwise take the AMOUNT of SL_NO 2 (i.e. 30).
The same logic applies to the remaining rows.
Expected Output:
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 30|
| 4|789| 80|
+-----+---+------+
Please help.
You could use lag together with when/otherwise; here is a demonstration:
import org.apache.spark.sql.expressions.Window
val w = Window.orderBy($"SL_NO")
dataset.withColumn("AMOUNT",
when($"ID" === lag($"ID", 1).over(w), lag($"AMOUNT", 1).over(w)).otherwise($"AMOUNT")
).show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 30|
| 4|789| 80|
+-----+---+------+
Note: since this example doesn't use any partitioning, it could have performance problems. With your real data it would help if the problem can be partitioned by some column, for example Window.partitionBy($"ID").orderBy($"SL_NO"), depending on your actual problem and whether rows with the same ID are ordered together.
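For reference, here is a minimal sketch of that partitioned variant (my naming; it assumes rows sharing an ID are contiguous in SL_NO order, in which case it produces the same result as above). Within an ID partition the lagged ID always equals the current ID, so the when/otherwise collapses to a coalesce:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag}

val wp = Window.partitionBy($"ID").orderBy($"SL_NO")
// first row of each ID keeps its own AMOUNT; later rows copy the previous AMOUNT of the same ID
dataset.withColumn("AMOUNT", coalesce(lag($"AMOUNT", 1).over(wp), $"AMOUNT")).show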
Related
I need help creating the following logic.
I have a df with daily information for different customers, like in this example:
val df = Seq(
  (1, "2021-08-15", 10),
  (1, "2021-08-16", 10),
  (1, "2021-08-17", 12),
  (2, "2021-08-15", 5),
  (2, "2021-08-16", 5)
).toDF("id", "date", "money")
I want to create an additional column using a condition, in which the column is true if the value for the same customer on the previous date is equal to today's value and false if the values change. The condition is for each specific customer and it shouldn't consider other customers.
My ideal final output would be:
// +---+----------+-----+--------------+
// | id|date |money|col-comparison|
// +---+----------+-----+--------------+
// | 1|2021-08-15| 10 | null|
// | 1|2021-08-16| 10 | true|
// | 1|2021-08-17| 12 | false|
// | 2|2021-08-15| 5 | null|
// | 2|2021-08-16| 5 | true|
I created this code, but it's not giving me the desired output:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
def compareCol1(curr: Column, prev: Column): Column = curr === prev
val window = Window.orderBy("id", "date")
df.withColumn("col-comparison", compareCol1($"money", lag("money", 1).over(window)))
The issue I'm having with the code is that it doesn't understand that comparisons should only be made within the same id. The output that I'm getting is this one:
// +---+----------+-----+--------------+
// | id|date |money|col-comparison|
// +---+----------+-----+--------------+
// | 1|2021-08-15| 10 | null|
// | 1|2021-08-16| 10 | true|
// | 1|2021-08-17| 12 | false|
// | 2|2021-08-15| 5 | false|
// | 2|2021-08-16| 5 | true|
Instead of making one comparison per id, it's always comparing against the previous value in the whole dataset.
Does someone know how I can fix this? Maybe it's a very easy question but I'm not sure how to do it!
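A likely fix, sketched here under the assumption that the comparison should reset per customer, is to partition the window by id so that lag never crosses from one customer to the next; for the first row of each id the lag is null, so the comparison yields null, matching the desired output (windowPerId is just an illustrative name):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// restrict the lag to rows of the same customer, ordered by date
val windowPerId = Window.partitionBy("id").orderBy("date")
df.withColumn("col-comparison", compareCol1($"money", lag("money", 1).over(windowPerId))).show()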
I have a dataframe with several columns, some of which are labeled PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, then count each combination into a column called "count". I also need to take the averages of total_amount and trip_distance and divide the first by the second into a column called "trip_rate". The end DF should be:
+------------+------------+-----+---------+
|PULocationID|DOLocationID|count|trip_rate|
+------------+------------+-----+---------+
|         123|         422|    1|   5.2435|
|           3|          27|    4|   6.6121|
+------------+------------+-----+---------+
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns).
import pyspark.sql.functions as F
new_df = df.groupBy(
"PULocationID",
"DOLocationID",
).agg(
F.count(F.lit(1)).alias("count"),
F.avg(F.col("total_amount")).alias("avg_amt"),
F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
"PULocationID",
"DOLocationID",
"count",
(F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate")
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+
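For completeness, a rough sketch of the withColumn-and-drop variant mentioned above (same aggregation, only the final projection differs):
import pyspark.sql.functions as F

new_df = (
    df.groupBy("PULocationID", "DOLocationID")
    .agg(
        F.count(F.lit(1)).alias("count"),
        F.avg("total_amount").alias("avg_amt"),
        F.avg("trip_distance").alias("avg_distance"),
    )
    # derive trip_rate from the two averages, then discard the helper columns
    .withColumn("trip_rate", F.col("avg_amt") / F.col("avg_distance"))
    .drop("avg_amt", "avg_distance")
)
new_df.show()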
I already posted a similar question, and someone gave me a trick to avoid using an "if condition".
Here I am in a similar position, and I cannot find any trick to avoid it...
I have a dataframe.
var df = sc.parallelize(Array(
(1, "2017-06-29 10:53:53.0","2017-06-25 14:60:53.0","boulanger.fr"),
(2, "2017-07-05 10:48:57.0","2017-09-05 08:60:53.0","patissier.fr"),
(3, "2017-06-28 10:31:42.0","2017-02-28 20:31:42.0","boulanger.fr"),
(4, "2017-08-21 17:31:12.0","2017-10-21 10:29:12.0","patissier.fr"),
(5, "2017-07-28 11:22:42.0","2017-05-28 11:22:42.0","boulanger.fr"),
(6, "2017-08-23 17:03:43.0","2017-07-23 09:03:43.0","patissier.fr"),
(7, "2017-08-24 16:08:07.0","2017-08-22 16:08:07.0","boulanger.fr"),
(8, "2017-08-31 17:20:43.0","2017-05-22 17:05:43.0","patissier.fr"),
(9, "2017-09-04 14:35:38.0","2017-07-04 07:30:25.0","boulanger.fr"),
(10, "2017-09-07 15:10:34.0","2017-07-29 12:10:34.0","patissier.fr"))).toDF("id", "date1","date2", "mail")
df = df.withColumn("date1", (unix_timestamp($"date1", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
df = df.withColumn("date2", (unix_timestamp($"date2", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
df = df.orderBy("date1", "date2")
It looks like:
+---+---------------------+---------------------+------------+
|id |date1 |date2 |mail |
+---+---------------------+---------------------+------------+
|3 |2017-06-28 10:31:42.0|2017-02-28 20:31:42.0|boulanger.fr|
|1 |2017-06-29 10:53:53.0|2017-06-25 15:00:53.0|boulanger.fr|
|2 |2017-07-05 10:48:57.0|2017-09-05 09:00:53.0|patissier.fr|
|5 |2017-07-28 11:22:42.0|2017-05-28 11:22:42.0|boulanger.fr|
|4 |2017-08-21 17:31:12.0|2017-10-21 10:29:12.0|patissier.fr|
|6 |2017-08-23 17:03:43.0|2017-07-23 09:03:43.0|patissier.fr|
|7 |2017-08-24 16:08:07.0|2017-08-22 16:08:07.0|boulanger.fr|
|8 |2017-08-31 17:20:43.0|2017-05-22 17:05:43.0|patissier.fr|
|9 |2017-09-04 14:35:38.0|2017-07-04 07:30:25.0|boulanger.fr|
|10 |2017-09-07 15:10:34.0|2017-07-29 12:10:34.0|patissier.fr|
+---+---------------------+---------------------+------------+
For each id I want to count, among all the other lines, the number of lines with:
a date1 in [my_current_date1 - 60 days, my_current_date1 - 1 day]
a date2 < my_current_date1
the same mail as my current mail
If I look at the line with id 5, I want to return the number of lines with:
date1 in [2017-05-29 11:22:42.0, 2017-07-27 11:22:42.0]
date2 < 2017-07-28 11:22:42.0
mail = boulanger.fr
--> The result would be 2 (corresponding to id 1 and id 3)
So I would like to do something like:
val w = Window.partitionBy("mail").orderBy(col("date1").cast("long")).rangeBetween(-60*24*60*60,-1*24*60*60)
df = df.withColumn("all_previous", count("mail") over w)
But this handles condition 1 and condition 3, not the second one... I have to add something to include this second condition comparing date2 to my current date1...
Using a generalized Window spec, with last(date1) as the current date1 per Window partition and a sum over 0s and 1s as a conditional count, here's how I would incorporate your condition #2 into the counting criteria:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
def days(n: Long): Long = n * 24 * 60 * 60
val w = Window.partitionBy("mail").orderBy($"date1".cast("long"))
val w1 = w.rangeBetween(days(-60), days(0))
val w2 = w.rangeBetween(days(-60), days(-1))
df.withColumn("all_previous", sum(
when($"date2".cast("long") < last($"date1").over(w1).cast("long"), 1).
otherwise(0)
).over(w2)
).na.fill(0).
show
// +---+-------------------+-------------------+------------+------------+
// | id| date1| date2| mail|all_previous|
// +---+-------------------+-------------------+------------+------------+
// | 3|2017-06-28 10:31:42|2017-02-28 20:31:42|boulanger.fr| 0|
// | 1|2017-06-29 10:53:53|2017-06-25 15:00:53|boulanger.fr| 1|
// | 5|2017-07-28 11:22:42|2017-05-28 11:22:42|boulanger.fr| 2|
// | 7|2017-08-24 16:08:07|2017-08-22 16:08:07|boulanger.fr| 3|
// | 9|2017-09-04 14:35:38|2017-07-04 07:30:25|boulanger.fr| 2|
// | 2|2017-07-05 10:48:57|2017-09-05 09:00:53|patissier.fr| 0|
// | 4|2017-08-21 17:31:12|2017-10-21 10:29:12|patissier.fr| 0|
// | 6|2017-08-23 17:03:43|2017-07-23 09:03:43|patissier.fr| 0|
// | 8|2017-08-31 17:20:43|2017-05-22 17:05:43|patissier.fr| 1|
// | 10|2017-09-07 15:10:34|2017-07-29 12:10:34|patissier.fr| 2|
// +---+-------------------+-------------------+------------+------------+
[UPDATE]
This solution is incorrect, even though the result appears to be correct with the sample dataset. In particular, last($"date1").over(w1) did not work as intended. The answer is kept in the hope that it serves as a lead towards a working solution.
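As one possible lead (this is not the original answer's approach), the same three conditions can be expressed without window functions at all, using a conditional self-join followed by a group-by count; the a/b aliases and variable names are only illustrative, and a self-join can be expensive on large data:
import org.apache.spark.sql.functions.count

val day = 24L * 60 * 60

// for each row "a", count the other rows "b" with the same mail, a date1 between
// 60 days and 1 day before a.date1, and a date2 strictly before a.date1
val counted = df.as("a")
  .join(df.as("b"),
    $"a.mail" === $"b.mail" &&
    $"b.date1".cast("long") >= $"a.date1".cast("long") - 60 * day &&
    $"b.date1".cast("long") <= $"a.date1".cast("long") - 1 * day &&
    $"b.date2" < $"a.date1",
    "left_outer")
  .groupBy($"a.id", $"a.date1", $"a.date2", $"a.mail")
  .agg(count($"b.id").as("all_previous"))

counted.orderBy("date1").show(false)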
This is my input dataframe:
id val
1 Y
1 N
2 a
2 b
3 N
Result should be:
id val
1 Y
2 a
2 b
3 N
I want to group by the id column and, for the ids that have both Y and N in val, remove the rows where val is "N".
Please help me resolve this issue, as I am a beginner in PySpark.
You can first identify the relevant rows with a filter for val == "Y" and then join this dataframe back to the original one. Finally you can filter for null values and for the rows you want to keep, e.g. val == "Y". PySpark should be able to handle the self-join even if there are a lot of rows.
The example is shown below:
from pyspark.sql.functions import col

df_new = spark.createDataFrame([
(1, "Y"), (1, "N"), (1,"X"), (1,"Z"),
(2,"a"), (2,"b"), (3,"N")
], ("id", "val"))
df_Y = df_new.filter(col("val")=="Y").withColumnRenamed("val","val_Y").withColumnRenamed("id","id_Y")
df_new = df_new.join(df_Y, df_new["id"]==df_Y["id_Y"],how="left")
df_new.filter((col("val_Y").isNull()) | ((col("val_Y")=="Y") & ~(col("val")=="N"))).select("id","val").show()
The result would be your preferred output:
+---+---+
| id|val|
+---+---+
| 1| X|
| 1| Y|
| 1| Z|
| 3| N|
| 2| a|
| 2| b|
+---+---+
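As an aside, a window-based alternative sketch (not part of the answer above) that avoids the self-join: flag, per id, whether a "Y" exists, and drop the "N" rows only for those ids:
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("id")
result = (
    df_new
    # has_Y is 1 for every row of an id that contains at least one "Y"
    .withColumn("has_Y", F.max(F.when(F.col("val") == "Y", 1).otherwise(0)).over(w))
    # drop "N" rows only when the same id also has a "Y"
    .filter(~((F.col("has_Y") == 1) & (F.col("val") == "N")))
    .drop("has_Y")
)
result.show()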
I have the following table stored in Hive called ExampleData:
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
|1       |10:00| 20|
|1       |11:00| 21|
|2       |10:00| 24|
|2       |11:00| 24|
|2       |12:00| 20|
|3       |11:00| 24|
+--------+-----+---+
I need to be able to process the data by Site. Unfortunately partitioning it by Site doesn't work (there are over 100k sites, all with fairly small amounts of data).
For each site, I need to select the Time column and the Age column separately, and feed them into a function (which ideally I want to run on the executors, not the driver).
I've got a stub of how I think I want it to work, but this solution would only run on the driver, so it's very slow. I need to find a way of writing it so it will run at the executor level:
// fetch a list of distinct sites and return them to the driver
// (if you don't, you won't be able to loop over them, as they're not on the executors)
val distinctSites = spark.sql("SELECT site_id FROM ExampleData GROUP BY site_id LIMIT 10")
  .collect

val allSiteData = spark.sql("SELECT site_id, time, age FROM ExampleData")

distinctSites.foreach(row => {
  val siteData = allSiteData.filter("site_id = " + row.get(0))
  val times = siteData.select("time").collect()
  val ages = siteData.select("age").collect()
  processTimesAndAges(times, ages)
})

def processTimesAndAges(times: Array[Row], ages: Array[Row]): Unit = {
  // do some processing
}
I've tried broadcasting the distinctSites across all nodes, but this did not prove fruitful.
This seems such a simple concept and yet I have spent a couple of days looking into this. I'm very new to Scala/Spark, so apologies if this is a ridiculous question!
Any suggestions or tips are greatly appreciated.
The RDD API provides a number of functions that can be used to perform operations on groups, starting with the low-level repartition / repartitionAndSortWithinPartitions and ending with the various *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Example:
rdd.map(tup => ((tup._1, tup._2, tup._3), tup)).
  groupByKey().
  foreachPartition(iter => doSomeJob(iter))
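Applied to the question's ExampleData, a rough sketch of per-site grouping could look like the following (the site_id/time/age types and the commented-out call are assumptions based on the question; processTimesAndAges would need to accept plain arrays rather than Rows here):
import org.apache.spark.sql.Row

val bySite = spark.sql("SELECT site_id, time, age FROM ExampleData")
  .rdd
  .map { case Row(site: Int, time: String, age: Int) => (site, (time, age)) }
  .groupByKey()

// each (site, rows) pair is handled on the executor that holds it
bySite.foreach { case (site, rows) =>
  val times = rows.map(_._1).toArray
  val ages  = rows.map(_._2).toArray
  // processTimesAndAges(times, ages)
}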
With DataFrames you can use aggregate functions; the GroupedData class provides a number of methods for the most common aggregations, including count, max, min, mean and sum.
Example:
val df = sc.parallelize(Seq(
(1, 10.3, 10), (1, 11.5, 10),
(2, 12.6, 20), (3, 2.6, 30))
).toDF("Site_ID ", "Time ", "Age")
df.show()
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
| 1| 10.3| 10|
| 1| 11.5| 10|
| 2| 12.6| 20|
| 3| 2.6| 30|
+--------+-----+---+
df.groupBy($"Site_ID ").count.show
+--------+-----+
|Site_ID |count|
+--------+-----+
| 1| 2|
| 3| 1|
| 2| 1|
+--------+-----+
Note: as you mentioned that the solution is very slow, you need to use partitioning; in your case range partitioning is a good option.
http://dev.sortable.com/spark-repartition/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
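Regarding the range-partitioning note above, a minimal sketch of what that could look like with the DataFrame API (assuming Spark 2.3+, where repartitionByRange is available; the partition count of 200 is only illustrative):
// co-locate rows of the same site before any per-site processing
val bySiteRange = spark.sql("SELECT site_id, time, age FROM ExampleData")
  .repartitionByRange(200, $"site_id")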