I'm trying to figure out if what I'm trying to accomplish is even possible in Spark. Let's say I have a CSV that, if read in as a DataFrame, looks like this:
+---------------------+-----------+-------+-------------+
| TimeStamp | Customer | User | Application |
+---------------------+-----------+-------+-------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 |
| 2017-01-01 12:00:05 | customer1 | user1 | app1 |
| 2017-01-01 14:00:03 | customer1 | user2 | app2 |
| 2017-01-01 23:50:50 | customer1 | user1 | app1 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 |
+---------------------+-----------+-------+-------------+
I'm trying to produce a DataFrame that includes a count of the number of times a unique user from a certain customer has visited an application in the last 24 hours. So the result would look like this:
+---------------------+-----------+-------+-------------+----------------------+
| TimeStamp | Customer | User | Application | UniqueUserVisitedApp |
+---------------------+-----------+-------+-------------+----------------------+
| 2017-01-01 00:00:01 | customer1 | user1 | app1 | 0 |
| 2017-01-01 12:00:05 | customer1 | user2 | app1 | 1 |
| 2017-01-01 13:00:05 | customer1 | user2 | app1 | 2 |
| 2017-01-01 14:00:03 | customer1 | user1 | app1 | 2 |
| 2017-01-01 23:50:50 | customer1 | user3 | app1 | 2 |
| 2017-01-01 23:50:51 | customer2 | user4 | app2 | 0 |
| 2017-01-02 00:00:02 | customer1 | user1 | app1 | 3 |
+---------------------+-----------+-------+-------------+----------------------+
I can do a tumbling window with the code below, but that's not quite what we are looking for.
val data = spark.read.csv("path/to/csv")
val tumblingWindow = data
  .groupBy(col("Customer"), col("Application"), window(data.col("TimeStamp"), "24 hours"))
  .agg(countDistinct("User").as("UniqueUsersVisitedApp"))
The result is this:
+-----------+-------------+-------------------------+-----------------------+
| Customer | Application | Window | UniqueUsersVisitedApp |
+-----------+-------------+-------------------------+-----------------------+
| customer1 | app1 | [2017-01-01 00:00:00... | 2 |
| customer2 | app2 | [2017-01-01 00:00:00... | 1 |
| customer1 | app1 | [2017-01-02 00:00:00... | 1 |
+-----------+-------------+-------------------------+-----------------------+
Any help would be much appreciated.
If I understand your question correctly, just apply a filter before doing the groupBy:
from pyspark.sql import functions as F

data = spark.read.csv('path/to/csv')
# now_minus_24_hours is the cutoff timestamp: the current time minus 24 hours
now_minus_24_hours = F.expr('current_timestamp() - INTERVAL 24 HOURS')
result = (data
          .filter(data['TimeStamp'] > now_minus_24_hours)
          .groupBy(["Customer", "Application", "User"])
          .count())
Note that users who haven't visited in the last 24 hours will be missing from the DataFrame, instead of having a count of zero.
Edit
If you are trying to get the number of visits in the last 24 hours relative to each timestamp, you can do something similar to my answer here. The basic steps will be (a rough sketch combining them follows the list):
reduceByKey to get a list of timestamps for each user/app/customer combination (identical to the other example). Each row will now be in the form:
((user, app, customer), list_of_timestamps)
Process each list of timestamps to generate a list of "number of visits in the previous 24 hours" for each timestamp. The data will now be in the form:
((user, app, customer), [(ts_0, num_visits_24hr_before_ts_0), (ts_1, num_visits_24hr_before_ts_1), ...])
flatMap each row back to multiple rows using something like:
lambda row: [(*row[0], *ts_num_visits) for ts_num_visits in row[1]]
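Putting those steps together, a rough PySpark sketch could look like the one below. It assumes data already has TimeStamp cast to a timestamp type; counts_in_prior_24h is just an illustrative helper name, not something from the original answer.
from datetime import timedelta

def counts_in_prior_24h(timestamps):
    # For each timestamp, count how many earlier timestamps fall within the preceding 24 hours.
    ts_sorted = sorted(timestamps)
    return [(ts, sum(1 for other in ts_sorted if ts - timedelta(hours=24) <= other < ts))
            for ts in ts_sorted]

result = (data.rdd
          .map(lambda r: ((r['User'], r['Application'], r['Customer']), [r['TimeStamp']]))
          .reduceByKey(lambda a, b: a + b)   # ((user, app, customer), list_of_timestamps)
          .mapValues(counts_in_prior_24h)    # ((user, app, customer), [(ts_0, n_0), (ts_1, n_1), ...])
          .flatMap(lambda row: [(*row[0], ts, n) for ts, n in row[1]])
          .toDF(['User', 'Application', 'Customer', 'TimeStamp', 'Count']))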
I have tried it using a PySpark window function, creating a subpartition for each date and applying a count over it. I'm not sure how efficient this is. Here is my code snippet:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> l = [('2017-01-01 00:00:01','customer1','user1','app1'),('2017-01-01 12:00:05','customer1','user1','app1'),('2017-01-01 14:00:03','customer1','user2','app2'),('2017-01-01 23:50:50','customer1','user1','app1'),('2017-01-02 00:00:02','customer1','user1','app1'),('2017-01-02 12:00:02','customer1','user1','app1'),('2017-01-03 14:00:02','customer1','user1','app1'),('2017-01-02 00:00:02','customer1','user2','app2'),('2017-01-01 16:04:01','customer1','user1','app1'),('2017-01-01 23:59:01','customer1','user1','app1'),('2017-01-01 18:00:01','customer1','user2','app2')]
>>> df = spark.createDataFrame(l,['TimeStamp','Customer','User','Application'])
>>> df = df.withColumn('TimeStamp',df['TimeStamp'].cast('timestamp')).withColumn('Date',F.to_date(F.col('TimeStamp')))
>>> df.show()
+-------------------+---------+-----+-----------+----------+
| TimeStamp| Customer| User|Application| Date|
+-------------------+---------+-----+-----------+----------+
|2017-01-01 00:00:01|customer1|user1| app1|2017-01-01|
|2017-01-01 12:00:05|customer1|user1| app1|2017-01-01|
|2017-01-01 14:00:03|customer1|user2| app2|2017-01-01|
|2017-01-01 23:50:50|customer1|user1| app1|2017-01-01|
|2017-01-02 00:00:02|customer1|user1| app1|2017-01-02|
|2017-01-02 12:00:02|customer1|user1| app1|2017-01-02|
|2017-01-03 14:00:02|customer1|user1| app1|2017-01-03|
|2017-01-02 00:00:02|customer1|user2| app2|2017-01-02|
|2017-01-01 16:04:01|customer1|user1| app1|2017-01-01|
|2017-01-01 23:59:01|customer1|user1| app1|2017-01-01|
|2017-01-01 18:00:01|customer1|user2| app2|2017-01-01|
+-------------------+---------+-----+-----------+----------+
>>> df.printSchema()
root
|-- TimeStamp: timestamp (nullable = true)
|-- Customer: string (nullable = true)
|-- User: string (nullable = true)
|-- Application: string (nullable = true)
|-- Date: date (nullable = true)
>>> w = Window.partitionBy('Customer','User','Application','Date').orderBy('Timestamp')
>>> diff = F.coalesce(F.datediff("TimeStamp", F.lag("TimeStamp", 1).over(w)), F.lit(0))
>>> subpartition = F.count(diff<1).over(w)
>>> df.select("*",(subpartition-1).alias('Count')).drop('Date').orderBy('Customer','User','Application','TimeStamp').show()
+-------------------+---------+-----+-----------+-----+
| TimeStamp| Customer| User|Application|Count|
+-------------------+---------+-----+-----------+-----+
|2017-01-01 00:00:01|customer1|user1| app1| 0|
|2017-01-01 12:00:05|customer1|user1| app1| 1|
|2017-01-01 16:04:01|customer1|user1| app1| 2|
|2017-01-01 23:50:50|customer1|user1| app1| 3|
|2017-01-01 23:59:01|customer1|user1| app1| 4|
|2017-01-02 00:00:02|customer1|user1| app1| 0|
|2017-01-02 12:00:02|customer1|user1| app1| 1|
|2017-01-03 14:00:02|customer1|user1| app1| 0|
|2017-01-01 14:00:03|customer1|user2| app2| 0|
|2017-01-01 18:00:01|customer1|user2| app2| 1|
|2017-01-02 00:00:02|customer1|user2| app2| 0|
+-------------------+---------+-----+-----------+-----+
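If a true rolling 24-hour count is needed rather than per-date subpartitions, one possible sketch (assuming Spark 2.x or later, and that the goal is the number of distinct users per Customer/Application in the 24 hours before each row) is a range-based window over the timestamp in epoch seconds:
from pyspark.sql import functions as F
from pyspark.sql import Window

# Range frame in seconds: everything from 24 hours before up to (but not including) the current row.
w = (Window.partitionBy('Customer', 'Application')
     .orderBy(F.col('TimeStamp').cast('long'))
     .rangeBetween(-24 * 3600, -1))

rolling = df.withColumn('UniqueUserVisitedApp', F.size(F.collect_set('User').over(w)))
countDistinct itself is not supported as a window function, which is why collect_set plus size is used here; how well this performs will depend on how much data falls in each partition.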
Related
I need help implementing the Python logic below on a PySpark DataFrame.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+
First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
from pyspark.sql import functions as F

df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)
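One caveat: rlike treats the joined string as a regular expression, so if any sub_string could contain regex metacharacters, it is safer to escape each one first. A small sketch, using the Column rlike API to avoid quoting issues inside the SQL string:
import re
from pyspark.sql import functions as F

# Escape regex metacharacters in every substring before building the pattern.
substr_list = '|'.join(re.escape(s) for s in df3.first()[0])
df = df1.withColumn('isRT', F.lower(F.col('main_string')).rlike(substr_list))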
For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with a simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
.withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
.drop('sub_string') \
.show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+
I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've tried naively running a UDF on individual columns, but it takes 30s, 50s, 80s, ... (keeps increasing) per column, so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
    df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this transformation of adding one column to many others?
Using the RDD API, it can be done this way.
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
# subtract the last column (Average) from every other column, keeping Average itself
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
    .toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+
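A DataFrame-only alternative, sketched under the assumption that every column except Average should have Average subtracted from it, is to build all the new columns in a single select, which avoids chaining many withColumn calls:
from pyspark.sql import functions as F

other_cols = [c for c in df.columns if c != 'Average']
result = df.select(
    *[(F.col(c) - F.col('Average')).alias(c) for c in other_cols],
    F.col('Average'))
result.show()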
I want to join two tables A and B and pick the records having max date from table B for each value.
Consider the following tables:
Table A:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | a | 4/1/2018 |
| 3 | a | 8/1/2018 |
| 4 | c | 1/1/2018 |
| 5 | d | 1/1/2018 |
| 6 | e | 1/1/2018 |
+---+-----+----------+
Table B:
+---+-----+----------+
|Key|Value|sent_date |
+---+-----+----------+
| x | a | 2/1/2018 |
| y | a | 7/1/2018 |
| z | a | 11/1/2018|
| p | c | 5/1/2018 |
| q | d | 5/1/2018 |
| r | e | 5/1/2018 |
+---+-----+----------+
The aim is to bring in column id from Table A to Table B for each value in Table B.
To do this, Tables A and B need to be joined on the Value column, and for each record in B, the record in A with max(A.start_date) satisfying A.start_date < B.sent_date is found.
Let's consider value=a here.
In Table A, we can see 3 records for Value=a with 3 different start_dates.
So when joining with Table B, for value=a with sent_date=2/1/2018, the record with the max(start_date) among start_dates less than the sent_date is taken (in this case 1/1/2018), and the corresponding A.id is pulled into Table B.
Similarly for record with value=a and sent_date = 11/1/2018 in Table B, id=3 from table A needs to be pulled to table B.
The result must be as follows:
+---+-----+----------+---+
|Key|Value|sent_date |id |
+---+-----+----------+
| x | a | 2/1/2018 | 1 |
| y | a | 7/1/2018 | 2 |
| z | a | 11/1/2018| 3 |
| p | c | 5/1/2018 | 4 |
| q | d | 5/1/2018 | 5 |
| r | e | 5/1/2018 | 6 |
+---+-----+----------+---+
I am using Spark 2.3.
I have joined the two tables(using Dataframe) and found the max(start_date) based on the condition.
But I am unable to figure out how to pull the records here.
Can anyone help me out here?
Thanks in Advance!!
I just changed the date "11/1/2018" to "9/1/2018", as string sorting gives incorrect results. When converted to a date, the logic would still work. See below:
scala> val df_a = Seq((1,"a","1/1/2018"),
| (2,"a","4/1/2018"),
| (3,"a","8/1/2018"),
| (4,"c","1/1/2018"),
| (5,"d","1/1/2018"),
| (6,"e","1/1/2018")).toDF("id","value","start_date")
df_a: org.apache.spark.sql.DataFrame = [id: int, value: string ... 1 more field]
scala> val df_b = Seq(("x","a","2/1/2018"),
| ("y","a","7/1/2018"),
| ("z","a","9/1/2018"),
| ("p","c","5/1/2018"),
| ("q","d","5/1/2018"),
| ("r","e","5/1/2018")).toDF("key","valueb","sent_date")
df_b: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 1 more field]
scala> val df_join = df_b.join(df_a, 'valueb === 'valuea, "inner")
df_join: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 4 more fields]
scala> df_join.filter('sent_date >= 'start_date).withColumn("rank", rank().over(Window.partitionBy('key,'valueb,'sent_date).orderBy('start_date.desc))).filter('rank===1).drop("valuea","start_date","rank").show()
+---+------+---------+---+
|key|valueb|sent_date| id|
+---+------+---------+---+
| q| d| 5/1/2018| 5|
| p| c| 5/1/2018| 4|
| r| e| 5/1/2018| 6|
| x| a| 2/1/2018| 1|
| y| a| 7/1/2018| 2|
| z| a| 9/1/2018| 3|
+---+------+---------+---+
scala>
UPDATE
Below is a UDF to handle date strings in MM/dd/yyyy format:
scala> def dateConv(x:String):String=
| {
| val y = x.split("/").map(_.toInt).map("%02d".format(_))
| y(2)+"-"+y(0)+"-"+y(1)
| }
dateConv: (x: String)String
scala> val udfdateconv = udf( dateConv(_:String):String )
udfdateconv: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df_a_dt = df_a.withColumn("start_date",date_format(udfdateconv('start_date),"yyyy-MM-dd").cast("date"))
df_a_dt: org.apache.spark.sql.DataFrame = [id: int, valuea: string ... 1 more field]
scala> df_a_dt.printSchema
root
|-- id: integer (nullable = false)
|-- valuea: string (nullable = true)
|-- start_date: date (nullable = true)
scala> df_a_dt.show()
+---+------+----------+
| id|valuea|start_date|
+---+------+----------+
| 1| a|2018-01-01|
| 2| a|2018-04-01|
| 3| a|2018-08-01|
| 4| c|2018-01-01|
| 5| d|2018-01-01|
| 6| e|2018-01-01|
+---+------+----------+
scala>
I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using a Zeppelin notebook.
One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, followed by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use the lag function over a window to get the previous date, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp for sorting to work correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
Then you can apply the lag function over the window and finally get the date difference:
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff", when(lag("Date_d_id", 1).over(windowSpec).isNull, null).otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
Now the final step is to use groupBy and aggregation to find the average:
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Now if you want the rows with null Day_diff to count towards the denominator as well (treating the first visit as a difference of 0), you can do
laggeddf.groupBy("Store_id")
  .agg((sum("Day_diff")/count("*")).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful
Let's say we have a dataset/dataframe in Spark which has 3 columns:
ID, Word, Timestamp
I want to write a UDAF so that I can do something like this:
df.show()
ID | Word | Timestamp
1 | I | "2017-1-1 00:01"
1 | am | "2017-1-1 00:02"
1 | Chris | "2017-1-1 00:03"
2 | I | "2017-1-1 00:01"
2 | am | "2017-1-1 00:02"
2 | Jessica | "2017-1-1 00:03"
val df_merged = df.groupBy("ID")
  .sort("ID", "Timestamp")
  .agg(custom_agg("ID", "Word", "Timestamp"))
df_merged.show
ID | Words | StartTime | EndTime |
1 | "I am Chris" | "2017-1-1 00:01" | "2017-1-1 00:03" |
1 | "I am Jessica" | "2017-1-1 00:01" | "2017-1-1 00:03" |
The question is: how can I ensure that the column Words will be merged in the right order inside my UDAF?
Here is a solution with Spark 2's groupByKey (used with an untyped Dataset). The advantage of groupByKey is that you have access to the group (you get an Iterator[Row] in mapGroups):
df.groupByKey(r => r.getAs[Int]("ID"))
.mapGroups{case(id,rows) => {
val sorted = rows
.toVector
.map(r => (r.getAs[String]("Word"),r.getAs[java.sql.Timestamp]("Timestamp")))
.sortBy(_._2.getTime)
(id,
sorted.map(_._1).mkString(" "),
sorted.map(_._2).head,
sorted.map(_._2).last
)
}
}.toDF("ID","Words","StartTime","EndTime")
Sorry, I don't use Scala, and I hope you can read it.
A window function can do what you want:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = df.withColumn('Words', f.collect_list(df['Word']).over(
    Window.partitionBy(df['ID']).orderBy('Timestamp')
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
Output:
+---+-------+-----------------+----------------+
| ID| Word| Timestamp| Words|
+---+-------+-----------------+----------------+
| 1| I|2017-1-1 00:01:00| [I, am, Chris]|
| 1| am|2017-1-1 00:02:00| [I, am, Chris]|
| 1| Chris|2017-1-1 00:03:00| [I, am, Chris]|
| 2| I|2017-1-1 00:01:00|[I, am, Jessica]|
| 2| am|2017-1-1 00:02:00|[I, am, Jessica]|
| 2|Jessica|2017-1-1 00:03:00|[I, am, Jessica]|
+---+-------+-----------------+----------------+
Then groupBy above data:
df = df.groupBy(df['ID'], df['Words']).agg(
f.min(df['Timestamp']).alias('StartTime'), f.max(df['Timestamp']).alias('EndTime'))
df = df.withColumn('Words', f.concat_ws(' ', df['Words']))
Output:
+---+------------+-----------------+-----------------+
| ID| Words| StartTime| EndTime|
+---+------------+-----------------+-----------------+
| 1| I am Chris|2017-1-1 00:01:00|2017-1-1 00:03:00|
| 2|I am Jessica|2017-1-1 00:01:00|2017-1-1 00:03:00|
+---+------------+-----------------+-----------------+
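Another hedged sketch in PySpark, if you would rather skip the window step: collect (Timestamp, Word) structs in the groupBy itself and sort the array, which also guarantees the word order. This assumes, as in this example, that the Timestamp values sort correctly as strings; none of these names come from the answers above.
from pyspark.sql import functions as f

df_merged = (df.groupBy('ID')
             .agg(f.sort_array(f.collect_list(f.struct('Timestamp', 'Word'))).alias('pairs'),
                  f.min('Timestamp').alias('StartTime'),
                  f.max('Timestamp').alias('EndTime'))
             .withColumn('Words', f.concat_ws(' ', f.col('pairs.Word')))
             .drop('pairs'))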