I am trying to subtract some months from a date. I have the following DataFrame, called df1, where MonthSub is always positive, so I may have to make it negative before subtracting it from the date:
+-------------+----------+
| Date | MonthSub |
+-------------+----------+
| 30/11/2020 | 12 |
| 25/07/2020 | 5 |
| 11/01/2020 | 1 |
+-------------+----------+
And I expect to get the following:
+-------------+----------+-------------+
| Date | MonthSub | Result |
+-------------+----------+-------------+
| 30/11/2020 | 12 | 30/11/2019 |
| 25/07/2020 | 5 | 25/02/2020 |
| 11/01/2020 | 1 | 11/12/2019 |
+-------------+----------+-------------+
Schema of DF1:
root
|-- Date: string (nullable = true)
|-- MonthSub: string (nullable = true)
What I am doing:
df1 = df1.withColumn("MonthSub", col("MonthSub").cast(IntegerType))
val dfMonth = df1.withColumn("Result", add_months(to_date(col("Date"), "dd-MM-yyyy"), col("MonthSub")))
But I keep getting null values.
Are there other options to do this, or what am I doing wrong?
You can use add_months with a negative months value, as below. Note that the to_date format also has to be "dd/MM/yyyy" to match the "/" separators in your data; the "dd-MM-yyyy" pattern in your code is what was producing the nulls.
val dfMonth = df1.withColumn("Result", add_months(
to_date(col("Date"), "dd/MM/yyyy"), col("MonthSub") * lit(-1))
)
dfMonth.show(false)
Output:
+----------+--------+----------+
|Date |MonthSub|Result |
+----------+--------+----------+
|30/11/2020|12 |2019-11-30|
|25/07/2020|5 |2020-02-25|
|11/01/2020|1 |2019-12-11|
+----------+--------+----------+
You can change the date format as you like.
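For instance, a quick sketch of putting the Result column back into the dd/MM/yyyy form of the input (this turns Result into a string column):
import org.apache.spark.sql.functions.{col, date_format}

val dfFormatted = dfMonth.withColumn("Result", date_format(col("Result"), "dd/MM/yyyy"))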
Related
I have a dataset as follows:
| id | text |
--------------
| 01 | hello world |
| 02 | this place is hell |
I also have a list of keywords I'm searching for:
Keywords = ['hell', 'horrible', 'sucks']
When using the following solution with .rlike() or .contains(), sentences with either partial or exact matches to the list of words are flagged as true. I would like only exact matches to be returned.
Current code:
KEYWORDS = 'hell|horrible|sucks'
df = (
df
.select(
F.col('id'),
F.col('text'),
F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
)
Current output:
| id | text | keyword_found |
-------------------------------
| 01 | hello world | 1 |
| 02 | this place is hell | 1 |
Expected output:
| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
Try the code below; I have only changed the keyword pattern:
from pyspark.sql.functions import col,when
data = [["01","hello world"],["02","this place is hell"]]
schema =["id","text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id| text|
+---+------------------+
| 01| hello world|
| 02|this place is hell|
+---+------------------+
KEYWORDS = '(hell|horrible|sucks)$'
df = (
df2
.select(
col('id'),
col('text'),
when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
)
)
df.show()
+---+------------------+-------------+
| id| text|keyword_found|
+---+------------------+-------------+
| 01| hello world| 0|
| 02|this place is hell| 1|
+---+------------------+-------------+
Let me know if you need more help on this.
This should work
Keywords = 'hell|horrible|sucks'
df = df.select(F.col('id'), F.col('text'),
F.when(F.col('text').rlike(r'(' + Keywords + r')(\s|$)'), 1).otherwise(0).alias('keyword_found'))
| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
I have a dataframe in Spark (Scala) read from a large CSV file.
The dataframe looks something like this:
key| col1 | timestamp |
---------------------------------
1 | aa | 2019-01-01 08:02:05.1 |
1 | aa | 2019-09-02 08:02:05.2 |
1 | cc | 2019-12-24 08:02:05.3 |
2 | dd | 2013-01-22 08:02:05.4 |
I need to add two columns, start_date and end_date, something like this:
key| col1 | timestamp | start date | end date |
---------------------------------+---------------------------------------------------
1 | aa | 2019-01-01 08:02:05.1 | 2019-01-01 08:02:05.1 | 2019-09-02 08:02:05.2 |
1 | aa | 2019-09-02 08:02:05.2 | 2018-09-02 08:02:05.2 | 2019-12-24 08:02:05.3 |
1 | cc | 2019-12-24 08:02:05.3 | 2019-12-24 08:02:05.3 | NULL |
2 | dd | 2013-01-22 08:02:05.4 | 2013-01-22 08:02:05.4 | NULL |
Here, for each "key", end_date is the next timestamp for the same key. However, end_date for the latest timestamp of a key should be NULL.
What I tried so far:
I tried to use a window function to calculate a rank for each partition, something like this:
var df = read_csv()
//copy timestamp to start_date
df = df
.withColumn("start_date", df.col("timestamp"))
//add null value to the end_date
df = df.withColumn("end_date", typedLit[Option[String]](None))
val windowSpec = Window.partitionBy("merge_key_column").orderBy("start_date")
df
.withColumn("rank", dense_rank()
.over(windowSpec))
.withColumn("max", max("rank").over(Window.partitionBy("merge_key_column")))
So far, I haven't got the desired output.
Use the window lead function for this case.
Example:
val df=Seq((1,"aa","2019-01-01 08:02:05.1"),(1,"aa","2019-09-02 08:02:05.2"),(1,"cc","2019-12-24 08:02:05.3"),(2,"dd","2013-01-22 08:02:05.4")).toDF("key","col1","timestamp")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df1=df.withColumn("start_date",col("timestamp"))
val windowSpec = Window.partitionBy("key").orderBy("start_date")
df1.withColumn("end_date",lead(col("start_date"),1).over(windowSpec)).show(10,false)
//+---+----+---------------------+---------------------+---------------------+
//|key|col1|timestamp |start_date |end_date |
//+---+----+---------------------+---------------------+---------------------+
//|1 |aa |2019-01-01 08:02:05.1|2019-01-01 08:02:05.1|2019-09-02 08:02:05.2|
//|1 |aa |2019-09-02 08:02:05.2|2019-09-02 08:02:05.2|2019-12-24 08:02:05.3|
//|1 |cc |2019-12-24 08:02:05.3|2019-12-24 08:02:05.3|null |
//|2 |dd |2013-01-22 08:02:05.4|2013-01-22 08:02:05.4|null |
//+---+----+---------------------+---------------------+---------------------+
I am trying to work out a good approach to filtering rows in a dataframe of GPS positions using a distance threshold.
First I calculate the distance with a UDF, using a lag over the previous point.
I was wondering how I can do this without having to recalculate the distance a second time after filtering out the first row whose distance exceeds the threshold, and then running the distance UDF over the whole dataframe again.
Input
|-- device_id --|-- gps_point -- |-- date -- |
| 1 | latlongvalue 1 | 2000-01-01 |
| 1 | latlongvalue 2 | 2000-01-02 |
| 1 | latlongvalue 3 | 2000-01-03 |
OutPut
|-- device_id --|-- gps_point -- |-- date -- |-- distance_covered --|
| 1 | latlongvalue 1 | 2000-01-01 | 0 |
| 1 | latlongvalue 2 | 2000-01-02 | 10000(error) |
| 1 | latlongvalue 3 | 2000-01-03 | 10300 |
Desired output
|-- device_id --|-- gps_point -- |-- date -- |-- distance_covered --|
| 1 | latlongvalue 1 | 2000-01-01 | 0 |
| 1 | latlongvalue 3 | 2000-01-03 | X km (from GPS1 to 3)|
val dfDeviceGpsEvents = gpsEvents.join(devices, Seq("device_id"), "left")
val dfDeviceWindow = Window
.partitionBy("device_id")
.orderBy($"device_id", $"date_gps")
val DfwithDistance = dfDeviceGpsEvents
.withColumn("gps_point", convertToMagellanPointUDF($"real_gps"))
.withColumn("prev_gps_point", lag("gps_point", -1, null)
.over(dfDeviceWindow)
)
.withColumn("distance_covered",
when($"prev_gps_point".isNotNull,
computeHaversineDistanceUDF($"gps_point", $"prev_gps_point"))
.otherwise(lit(null))
).withColumn("idx", monotonically_increasing_id)
What I am trying to code:
var loopError = 1
while (loopError > 0) {
  // get the first row in the DF with a distance above the threshold and read its index
  val rowsToFilter = firstWithDistance
    .filter($"distance_covered" > threshold)
    .select("idx")
    .head(1)
  if (rowsToFilter.nonEmpty) {
    // filter out that row of the DF
    firstWithDistance = firstWithDistance.filter($"idx" =!= rowsToFilter(0).getLong(0))
    loopError = 1
    // redo the distance calculation with lag over the whole DF
    firstWithDistance = firstWithDistance
      .withColumn("next_gps_point", lag("gps_point", -1, null).over(deviceWindow))
      .withColumn("distance_covered",
        when($"next_gps_point".isNotNull, computeHaversineUDF($"gps_point", $"next_gps_point"))
          .otherwise(lit(null)))
  } else {
    loopError = 0
  }
}
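In case it is useful, here is a rough single-pass alternative that avoids re-running the loop: collect each device's points, walk them once in date order, and keep a point only when it is within the threshold of the last kept point. This is only a sketch and makes assumptions not in the question: the input dataframe is called pointsDf and has plain lat/lon double columns instead of Magellan points, date_gps strings sort chronologically, and the rule is "drop a point when it is more than thresholdKm from the last kept point".
import org.apache.spark.sql.functions._
import spark.implicits._

val thresholdKm = 5.0

def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
  6371.0 * 2 * math.asin(math.sqrt(a))
}

// Walk each device's points once, keeping only points within the threshold of the last kept point.
val filterTrack = udf { (dates: Seq[String], lats: Seq[Double], lons: Seq[Double]) =>
  val points = dates.indices.map(i => (dates(i), lats(i), lons(i))).sortBy(_._1)
  if (points.isEmpty) Seq.empty[(String, Double, Double, Double)]
  else {
    val kept = scala.collection.mutable.ArrayBuffer(
      (points.head._1, points.head._2, points.head._3, 0.0))
    points.tail.foreach { case (d, la, lo) =>
      val (_, prevLa, prevLo, _) = kept.last
      val dist = haversineKm(prevLa, prevLo, la, lo)
      // else: drop the point and keep comparing the next one to the same kept point
      if (dist <= thresholdKm) kept += ((d, la, lo, dist))
    }
    kept.toSeq
  }
}

val filtered = pointsDf
  .groupBy($"device_id")
  .agg(
    collect_list($"date_gps").as("dates"),
    collect_list($"lat").as("lats"),
    collect_list($"lon").as("lons"))
  .withColumn("pt", explode(filterTrack($"dates", $"lats", $"lons")))
  .select(
    $"device_id",
    $"pt._1".as("date_gps"),
    $"pt._2".as("lat"),
    $"pt._3".as("lon"),
    $"pt._4".as("distance_covered"))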
I need to get a date from the input below on which there has been a consecutive 'complete' status for the past 7 days from that given date.
Requirement:
1. Go back 8 days (this is easy).
2. So we are on 20190111 in the data frame below; I need to check day by day from 20190111 back to 20190104 (a 7-day period) and get a date on which the status has been 'complete' for 7 consecutive days. So we should get 20190108.
I need this in spark-scala.
input
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
| 9|20190109| pending|
| 10|20190110|complete|
| 11|20190111|complete|
| 12|20190112| pending|
| 13|20190113|complete|
| 14|20190114|complete|
| 15|20190115| pending|
| 16|20190116| pending|
| 17|20190117| pending|
| 18|20190118| pending|
| 19|20190119| pending|
+---+--------+--------+
output
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
For Spark >= 2.4:
import org.apache.spark.sql.expressions.Window
val df= Seq((1,"20190101","complete"),(2,"20190102","complete"),
(3,"20190103","complete"),(4,"20190104","complete"), (5,"20190105","complete"),(6,"20190106","complete"),(7,"20190107","complete"),(8,"20190108","complete"),
(9,"20190109", "pending"),(10,"20190110","complete"),(11,"20190111","complete"),(12,"20190112", "pending"),(13,"20190113","complete"),(14,"20190114","complete"),(15,"20190115", "pending") , (16,"20190116", "pending"),(17,"20190117", "pending"),(18,"20190118", "pending"),(19,"20190119", "pending")).toDF("id","date","status")
val df1= df.select($"id", to_date($"date", "yyyyMMdd").as("date"), $"status")
val win = Window.orderBy("id")
Coalesce lag_status and status to remove nulls:
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
Create an integer column to denote whether the status for the current day equals the status for the previous day:
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
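// previous_7_sum sums status_flag over the current row and the 7 rows with smaller ids (rangeBetween(0, 7) on the descending order covers ids id-7 .. id); a value of 8 means none of those 8 days changed status from the day before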
val df_new= df4.where($"previous_7_sum"===8).select($"date").select(explode(sequence(date_sub($"date",7), $"date")).as("date"))
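// explode(sequence(date_sub(date, 7), date)) expands each qualifying end date into its full 8-day run so it can be joined back to the original rows below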
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
+---+--------+--------+
For Spark < 2.4:
Use a UDF instead of the built-in array function "sequence".
val df1= df.select($"id", $"date".cast("integer").as("date"), $"status")
val win = Window.orderBy("id")
Coalesce lag_status and status to remove nulls:
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
Create an integer column to denote whether the status for the current day equals the status for the previous day:
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val ud1 = udf((col1: Int) => ((col1 - 7) to col1).toArray)
val df_new= df4.where($"previous_7_sum"===8)
.withColumn("dt_arr", ud1($"date"))
.select(explode($"dt_arr" ).as("date"))
val df5=df4.join(df_new, Seq("date"), "inner").select($"id", concat_ws("",split($"date".cast("string"), "-")).as("date"), $"status")
I'm trying to figure out if what I'm trying to accomplish is even possible in Spark. Let's say I have a CSV that, if read in as a DataFrame, looks like so:
+---------------------+-----------+-------+-------------+
| TimeStamp | Customer | User | Application |
+---------------------+-----------+-------+-------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 |
| 2017-01-01 12:00:05 | customer1 | user1 | app1 |
| 2017-01-01 14:00:03 | customer1 | user2 | app2 |
| 2017-01-01 23:50:50 | customer1 | user1 | app1 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 |
+---------------------+-----------+-------+-------------+
I'm trying to produce a dataframe that includes a count of the number of times a unique user from a certain customer has visited an application in the last 24 hours. So the result would look like so:
+---------------------+-----------+-------+-------------+----------------------+
| TimeStamp | Customer | User | Application | UniqueUserVisitedApp |
+---------------------+-----------+-------+-------------+----------------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 | 0 |
| 2017-01-01 12:00:05 | customer1 | user2 | app1 | 1 |
| 2017-01-01 13:00:05 | customer1 | user2 | app1 | 2 |
| 2017-01-01 14:00:03 | customer1 | user1 | app1 | 2 |
| 2017-01-01 23:50:50 | customer1 | user3 | app1 | 2 |
| 2017-01-01 23:50:51 | customer2 | user4 | app2 | 0 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 | 3 |
+---------------------+-----------+-------+-------------+----------------------+
So I can do a tumbling window with the code below, but that's not quite what we are looking for.
val data = spark.read.csv("path/to/csv")
val tumblingWindow = data
.groupBy(col("Customer"), col("Application"), window(data.col("TimeStamp"), "24 hours"))
.agg(countDistinct("user").as("UniqueUsersVisitedApp"))
The result is this:
+-----------+-------------+-------------------------+-----------------------+
| Customer | Application | Window | UniqueUsersVisitedApp |
+-----------+-------------+-------------------------+-----------------------+
| customer1 | app1 | [2017-01-01 00:00:00... | 2 |
| customer2 | app2 | [2017-01-01 00:00:00... | 1 |
| customer1 | app1 | [2017-01-02 00:00:00... | 1 |
+-----------+-------------+-------------------------+-----------------------+
Any help would be much appreciated.
If I understand your question correctly, just apply a filter before doing the groupBy:
data = spark.read.csv('path/to/csv')
result = (data
.filter(data['TimeStamp'] > now_minus_24_hours)
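# (now_minus_24_hours is not defined in this answer; it stands for a timestamp or string literal 24 hours before the reference time)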
.groupBy(["Customer", "Application", "User"])
.count())
Note that users who haven't visited in the last 24 hours will be missing from the DataFrame, instead of having a count of zero.
Edit
If you are trying to get the number of visits in the last 24 hours relative to each timestamp, you can do something similar to my answer here. The basic steps will be as follows (a rough code sketch follows the list):
reduceByKey to get a list of timestamps for each user/app/customer combination (identical to the other example). Each row will now be in the form:
((user, app, customer), list_of_timestamps)
Process each list of timestamps to generate a list of "number of visits in the previous 24 hours" for each timestamp. The data will now be in the form:
((user, app, customer), [(ts_0, num_visits_24hr_before_ts_0), (ts_1, num_visits_24_hr_before ts_2), ...])
flatMap each row back to multiple rows using something like:
lambda row: [(*row[0], *ts_num_visits) for ts_num_visits in row[1]]
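A minimal Scala sketch of those three steps, reusing the data DataFrame from the question (Scala to match the rest of this page; the column layout TimeStamp, Customer, User, Application is taken from the question, timestamps are assumed to parse with java.sql.Timestamp.valueOf, and the per-key counting is quadratic, which is fine for a sketch):
import java.sql.Timestamp
import spark.implicits._

// ((user, app, customer), [timestamp in epoch seconds])
val keyed = data.rdd.map { row =>
  val ts = Timestamp.valueOf(row.getString(0)).getTime / 1000
  ((row.getString(2), row.getString(3), row.getString(1)), List(ts))
}

val result = keyed
  .reduceByKey(_ ++ _)  // one list of timestamps per (user, app, customer)
  .flatMap { case ((user, app, customer), tss) =>
    val sorted = tss.sorted
    sorted.map { t =>
      // earlier visits of this key within the 24 hours before t
      val visits24h = sorted.count(prev => prev < t && prev >= t - 24 * 3600)
      (new Timestamp(t * 1000), customer, user, app, visits24h)
    }
  }
  .toDF("TimeStamp", "Customer", "User", "Application", "UniqueUserVisitedApp")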
I have tried it using a PySpark window function, by creating a subpartition for each date and applying a count over it. I'm not sure how efficient this is. Here is my code snippet:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.types import TimestampType
>>> l = [('2017-01-01 00:00:01','customer1','user1','app1'),('2017-01-01 12:00:05','customer1','user1','app1'),('2017-01-01 14:00:03','customer1','user2','app2'),('2017-01-01 23:50:50','customer1','user1','app1'),('2017-01-02 00:00:02','customer1','user1','app1'),('2017-01-02 12:00:02','customer1','user1','app1'),('2017-01-03 14:00:02','customer1','user1','app1'),('2017-01-02 00:00:02','customer1','user2','app2'),('2017-01-01 16:04:01','customer1','user1','app1'),('2017-01-01 23:59:01','customer1','user1','app1'),('2017-01-01 18:00:01','customer1','user2','app2')]
>>> df = spark.createDataFrame(l,['TimeStamp','Customer','User','Application'])
>>> df = df.withColumn('TimeStamp',df['TimeStamp'].cast('timestamp')).withColumn('Date',F.to_date(F.col('TimeStamp')))
>>> df.show()
+-------------------+---------+-----+-----------+----------+
| TimeStamp| Customer| User|Application| Date|
+-------------------+---------+-----+-----------+----------+
|2017-01-01 00:00:01|customer1|user1| app1|2017-01-01|
|2017-01-01 12:00:05|customer1|user1| app1|2017-01-01|
|2017-01-01 14:00:03|customer1|user2| app2|2017-01-01|
|2017-01-01 23:50:50|customer1|user1| app1|2017-01-01|
|2017-01-02 00:00:02|customer1|user1| app1|2017-01-02|
|2017-01-02 12:00:02|customer1|user1| app1|2017-01-02|
|2017-01-03 14:00:02|customer1|user1| app1|2017-01-03|
|2017-01-02 00:00:02|customer1|user2| app2|2017-01-02|
|2017-01-01 16:04:01|customer1|user1| app1|2017-01-01|
|2017-01-01 23:59:01|customer1|user1| app1|2017-01-01|
|2017-01-01 18:00:01|customer1|user2| app2|2017-01-01|
+-------------------+---------+-----+-----------+----------+
>>> df.printSchema()
root
|-- TimeStamp: timestamp (nullable = true)
|-- Customer: string (nullable = true)
|-- User: string (nullable = true)
|-- Application: string (nullable = true)
|-- Date: date (nullable = true)
>>> w = Window.partitionBy('Customer','User','Application','Date').orderBy('Timestamp')
>>> diff = F.coalesce(F.datediff("TimeStamp", F.lag("TimeStamp", 1).over(w)), F.lit(0))
>>> subpartition = F.count(diff<1).over(w)
>>> df.select("*",(subpartition-1).alias('Count')).drop('Date').orderBy('Customer','User','Application','TimeStamp').show()
+-------------------+---------+-----+-----------+-----+
| TimeStamp| Customer| User|Application|Count|
+-------------------+---------+-----+-----------+-----+
|2017-01-01 00:00:01|customer1|user1| app1| 0|
|2017-01-01 12:00:05|customer1|user1| app1| 1|
|2017-01-01 16:04:01|customer1|user1| app1| 2|
|2017-01-01 23:50:50|customer1|user1| app1| 3|
|2017-01-01 23:59:01|customer1|user1| app1| 4|
|2017-01-02 00:00:02|customer1|user1| app1| 0|
|2017-01-02 12:00:02|customer1|user1| app1| 1|
|2017-01-03 14:00:02|customer1|user1| app1| 0|
|2017-01-01 14:00:03|customer1|user2| app2| 0|
|2017-01-01 18:00:01|customer1|user2| app2| 1|
|2017-01-02 00:00:02|customer1|user2| app2| 0|
+-------------------+---------+-----+-----------+-----+