Dataframe Aggregation - scala

I have a dataframe DF with the following structure:
ID, DateTime, Latitude, Longitude, otherArgs
I want to group my data by ID and time window, and keep information about the location (for example, the mean of the grouped latitudes and the mean of the grouped longitudes).
I successfully got a new dataframe with data grouped by ID and time using:
DF.groupBy($"ID",window($"DateTime","2 minutes")).agg(max($"ID"))
But I lose my location data doing that.
What I am looking for is something that would look like this for example:
DF.groupBy($"ID",window($"DateTime","2 minutes"),mean("latitude"),mean("longitude")).agg(max($"ID"))
It should return only one row for each ID and time window.
EDIT:
Sample input:
DF: ID, DateTime, Latitude, Longitude, otherArgs
0 , 2018-01-07T04:04:00 , 25.000, 55.000, OtherThings
0 , 2018-01-07T04:05:00 , 26.000, 56.000, OtherThings
1 , 2018-01-07T04:04:00 , 26.000, 50.000, OtherThings
1 , 2018-01-07T04:05:00 , 27.000, 51.000, OtherThings
Sample output:
DF: ID, window(DateTime), Latitude, Longitude
0 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 25.5, 55.5
1 , (2018-01-07T04:04:00 : 2018-01-07T04:06:00) , 26.5, 50.5

Here is what you can do: you need to use mean within the aggregation.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType
import spark.implicits._
val df = Seq(
  (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
  (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
  (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
  (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
  // convert String to DateType for DateTime
  .withColumn("DateTime", $"DateTime".cast(DateType))
df.groupBy($"id", window($"DateTime", "2 minutes"))
  .agg(
    mean("Latitude").as("lat"),
    mean("Longitude").as("long")
  )
  .show(false)
Output:
+---+---------------------------------------------+----+----+
|id |window |lat |long|
+---+---------------------------------------------+----+----+
|1 |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|26.5|50.5|
|0 |[2018-01-06 23:59:00.0,2018-01-07 00:01:00.0]|25.5|55.5|
+---+---------------------------------------------+----+----+
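Note that the cast to DateType above drops the time of day, which is why every row of an ID falls into the same midnight-anchored window. If you want the 2-minute windows to follow the event times as in the sample output, a minimal variation (a sketch, assuming the DateTime strings parse as timestamps, which the ISO format shown does) is to cast to TimestampType instead:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
import spark.implicits._
// same data as above, but the cast keeps hours and minutes, so the
// 2-minute windows are derived from the actual event times
val dfTs = Seq(
  (0, "2018-01-07T04:04:00", 25.000, 55.000, "OtherThings"),
  (0, "2018-01-07T04:05:00", 26.000, 56.000, "OtherThings"),
  (1, "2018-01-07T04:04:00", 26.000, 50.000, "OtherThings"),
  (1, "2018-01-07T04:05:00", 27.000, 51.000, "OtherThings")
).toDF("ID", "DateTime", "Latitude", "Longitude", "otherArgs")
  .withColumn("DateTime", $"DateTime".cast(TimestampType))
dfTs.groupBy($"ID", window($"DateTime", "2 minutes"))
  .agg(mean("Latitude").as("lat"), mean("Longitude").as("long"))
  .show(false)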

You should use the .agg() method for the aggregation.
Perhaps this is what you mean?
DF
  .groupBy(
    'ID,
    window('DateTime, "2 minutes")
  )
  .agg(
    mean("latitude").as("latitudeMean"),
    mean("longitude").as("longitudeMean")
  )

Related

pyspark groupby agg with new col: diff between oldest and newest timestamp

I have a pyspark dataframe with the following columns:
session_id
timestamp
data = [
    ("ID1", "2021-12-10 10:00:00"),
    ("ID1", "2021-12-10 10:05:00"),
    ("ID2", "2021-12-10 10:20:00"),
    ("ID2", "2021-12-10 10:24:00"),
    ("ID2", "2021-12-10 10:26:00"),
]
I would like to group the sessions and add a new column called duration, which would be the difference between the oldest and newest timestamp for that session (in seconds):
ID1: 300
ID2: 360
How can I achieve it?
Thanks,
You can use an aggregate function like collect_list and then perform max and min operations on the list. To get duration in seconds, you can convert the time values to unix_timestamp and then perform the difference.
Try this:
from pyspark.sql.functions import (
    col,
    array_max,
    collect_list,
    array_min,
    unix_timestamp,
)
data = [
    ("ID1", "2021-12-10 10:00:00"),
    ("ID1", "2021-12-10 10:05:00"),
    ("ID2", "2021-12-10 10:20:00"),
    ("ID2", "2021-12-10 10:24:00"),
    ("ID2", "2021-12-10 10:26:00"),
]
df = spark.createDataFrame(data, ["sessionId", "time"]).select(
    "sessionId", col("time").cast("timestamp")
)
df2 = (
    df.groupBy("sessionId")
    .agg(
        array_max(collect_list("time")).alias("max_time"),
        array_min(collect_list("time")).alias("min_time"),
    )
    .withColumn("duration", unix_timestamp("max_time") - unix_timestamp("min_time"))
)
df2.show()

Scala Spark, array with incremental new column

Spark is reading from cosmosDB, which contains records like:
{
  "answers": [
    {
      "answer": "2005-01-01 00:00",
      "answerDt": "2022-07-01CEST08:07",
      ...,
      "id": {uuid}
}
and code that takes those answers and creates a DF where each row is one record from that array:
dataDF
  .select(
    col("id").as("recordId"),
    explode($"answers").as("qa")
  )
  .select(
    col("recordId"),
    $"qa.questionText",
    col("qa.question").as("q-id"),
    $"qa.answerText",
    $"qa.answerDt"
  )
  .withColumn("id", concat_ws("-", col("q-id"), col("recordId")))
  .drop(col("q-id"))
At the end I save it to another collection.
What I need is to add a position number to those records, so that each answer row also gets an int number that is unique per recordId, e.g. from 1 to 20.
+---+--------------------+--------------------+----------+-------------------+--------------------+
| lp|            recordId|        questionText|answerText|           answerDt|                  id|
+---+--------------------+--------------------+----------+-------------------+--------------------+
|  1|951a508c-d970-4d2...|Please give me th...|    197...|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  2|951a508c-d970-4d2...|What X should I N...|    female|2022-06-28CEST16:52|123abcde_VB_GEN_Q...|
|  3|951a508c-d970-4d2...|Please Share me t...|     72 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
|  1|12345678-0987-4d2...|Give me the smth ...|     10 kg|2022-06-28CEST16:53|123abcde_VB_GEN_Q...|
+---+--------------------+--------------------+----------+-------------------+--------------------+
Is it possible? Thanks.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy("recordId").orderBy("your col")
val resDF = sourceDF.withColumn("row_num", row_number().over(w))
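If the position you want is simply the element's index inside the original answers array, another option is posexplode, which emits the index alongside the element in one pass. A minimal sketch under that assumption, reusing the column names from the question (the +1 only turns the 0-based index into a 1-based one):
import org.apache.spark.sql.functions._
import spark.implicits._
// posexplode produces two columns: the array index (pos) and the element itself (qa)
val withPosition = dataDF
  .select(col("id").as("recordId"), posexplode($"answers").as(Seq("pos", "qa")))
  .withColumn("lp", col("pos") + 1) // 1-based position, unique per recordId
  .drop("pos")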

How can I split a timestamp into date and time?

//loading DF
val df1 = spark.read.option("header",true).option("inferSchema",true).csv("time.csv ")
//
+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I use cast to change the column value to DateType, it shows an error
=> the datatype is not matching (date_time: bigint) in df
df1.withColumn("date_time", df1("date").cast(DateType)).show()
Any solution for solving it?
I tried doing
val a = df1.withColumn("date_time",df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime",a("date_time").cast(DateType)).show()
but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch format to date and then do the computation. You can try this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")
val df1 = df
  .withColumn(
    "dateCreated",
    date_format(
      to_date(
        substring(
          from_unixtime($"date_time".divide(1000)),
          0,
          10
        ),
        "yyyy-MM-dd"
      ),
      "dd-MM-yyyy"
    )
  )
  .withColumn(
    "timeCreated",
    substring(
      from_unixtime($"date_time".divide(1000)),
      11,
      19
    )
  )
Sample data from my usecase:
+---------+-------------+--------+-----------+-----------+
| adId| date_time| price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000| 5950.0| 22-07-2016| 14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016| 19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016| 07:09:52|
|229218611|1467815284000| 9996.0| 06-07-2016| 19:58:04|
|229105894|1467656022000| 7700.0| 04-07-2016| 23:43:42|
|230214681|1469559471000| 4600.0| 27-07-2016| 00:27:51|
|230158375|1469469248000| 999.0| 25-07-2016| 23:24:08|
+---------+-------------+--------+-----------+-----------+
You may need to adjust for the time zone: by default the conversion uses your session's time zone (for me it is GMT+05:30). Hope it helps.
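If you want the rendered date and time to be stable across machines, one option is to pin the Spark session time zone before running the conversion. A minimal sketch (the zone name is just an example; depending on your Spark version the JVM default zone may also play a role):
// make epoch-to-string conversions use a fixed zone instead of the machine's local one
spark.conf.set("spark.sql.session.timeZone", "UTC")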

Scala: For loop on dataframe, create new column from existing by index

I have a dataframe with two columns:
id (string), date (timestamp)
I would like to loop through the dataframe and add a new column with a URL that includes the id. The algorithm should look something like this:
add one new column with the following value:
for each id
"some url" + the value of the dataframe's id column
I tried to make this work in Scala, but I have problems getting the specific id at index "a":
val k = df2.count().asInstanceOf[Int]
// for loop execution with a range
for( a <- 1 to k){
  // println( "Value of a: " + a );
  val dfWithFileURL = dataframe.withColumn("fileUrl", "https://someURL/" + dataframe("id")[a])
}
But this
dataframe("id")[a]
is not working in Scala. I could not find a solution yet, so any kind of suggestion is welcome!
You can simply use the withColumn function in Scala, something like this:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1, "1 Jan 2000"),
  (2, "2 Feb 2014"),
  (3, "3 Apr 2017")
).toDF("id", "date")
// Add the fileUrl column
val dfNew = df.withColumn("fileUrl", concat(lit("https://someURL/"), $"id"))
dfNew.show
My results: each row gets a fileUrl value such as https://someURL/1.
Not sure if this is what you require but you can use zipWithIndex for indexing.
data.show()
+---+------------------+
| Id|               Url|
+---+------------------+
|111|http://abc.go.org/|
|222|http://xyz.go.net/|
+---+------------------+
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val df = sqlContext.createDataFrame(
  data.rdd.zipWithIndex
    .map { case (r, i) => Row.fromSeq(r.toSeq :+ s"""${r.getString(1)}${i + 1}""") },
  StructType(data.schema.fields :+ StructField("fileUrl", StringType, false))
)
Output:
df.show(false)
+---+------------------+-------------------+
|Id |Url               |fileUrl            |
+---+------------------+-------------------+
|111|http://abc.go.org/|http://abc.go.org/1|
|222|http://xyz.go.net/|http://xyz.go.net/2|
+---+------------------+-------------------+

Pyspark groupby then sort within group

I have a table which contains id, offset, text. Suppose input:
id offset text
1 1 hello
1 7 world
2 1 foo
I want output like:
id text
1 hello world
2 foo
I'm using:
df.groupby(id).agg(concat_ws("", collect_list(text)))
But I don't know how to ensure the order of the text. I sorted the data before the groupby, but I've heard that groupby might shuffle the data. Is there a way to sort within each group after the groupby?
This will create the required df:
df1 = sqlContext.createDataFrame([("1", "1","hello"), ("1", "7","world"), ("2", "1","foo")], ("id", "offset" ,"text" ))
display(df1)
Then you can use the following code (it could be optimized further):
from pyspark.sql.functions import udf, col, lit, concat, concat_ws, collect_list
@udf
def sort_by_offset(col):
    result = ""
    text_list = col.split("-")
    for i in range(len(text_list)):
        text_list[i] = text_list[i].split(" ")
        text_list[i][0] = int(text_list[i][0])
    text_list = sorted(text_list, key=lambda x: x[0], reverse=False)
    for i in range(len(text_list)):
        result = result + " " + text_list[i][1]
    return result.lstrip()
df2 = df1.withColumn("offset_text", concat(col("offset"), lit(" "), col("text")))
df3 = df2.groupby(col("id")).agg(concat_ws("-", collect_list(col("offset_text"))).alias("offset_text"))
df4 = df3.withColumn("text", sort_by_offset(col("offset_text")))
display(df4)
Final output: one row per id, with the texts joined in offset order ("hello world" for 1, "foo" for 2).
Add sort_array:
from pyspark.sql.functions import sort_array
df.groupby(id).agg(concat_ws("", sort_array(collect_list(text))))