how to use "windows" over "aggregation using groupby followed by join"? - pyspark

I have a dataframe (df) with below structure:
Grid_ID, Latitude, Longitude, DateTimeStamp
Input:
Grid_ID Latitude Longitude DateTimeStamp
Grid_1 Lat1 Long1 2021-06-30 00:00:00
Grid_1 Lat1 Long1 2021-06-30 00:01:00
Grid_1 Lat1 Long1 2021-06-30 00:02:00
Grid_1 Lat2 Long2 2021-07-01 00:00:00
Grid_1 Lat2 Long2 2021-07-01 00:01:00
Grid_1 Lat2 Long2 2021-07-01 00:02:00
Each Grid_ID has two sets of lat/long that are mutually exclusive based on the date column, i.e. when Date <= 06/30/2021, Latitude/Longitude = Lat1/Long1, and when Date > 06/30/2021, Latitude/Longitude = Lat2/Long2.
I need to create new columns (Corrected_Lat and Corrected_Long) and assign Lat2/Long2 (the latest coordinates) to all rows of the Grid_ID.
I am using groupBy and agg as below to do the above:
df_dated = df.withColumn("date", F.to_date("DateTimeStamp")) \
.filter(F.col("date") == "2021-07-01") \
.groupBy("Grid_ID") \
.agg(F.collect_set("Latitude").getItem(0).cast("float").alias("corrected_lat"),
F.collect_set("Longitude").getItem(0).cast("float").alias("corrected_long")) \
.withColumnRenamed("Grid_ID", "Grid_ID_dated") \
.select("Grid_ID_dated", "corrected_lat", "corrected_long")
df_final = df.join(df_dated, on=[df.Grid_ID == df_dated.Grid_ID_dated],
how="inner") \
.select(*df.columns, "corrected_lat", "corrected_long")
Output:
Grid_ID Latitude Longitude DateTimeStamp corrected_lat corrected_long
Grid_1 Lat1 Long1 2021-06-30 00:00:00 Lat2 Long2
Grid_1 Lat1 Long1 2021-06-30 00:01:00 Lat2 Long2
Grid_1 Lat1 Long1 2021-06-30 00:02:00 Lat2 Long2
Grid_1 Lat2 Long2 2021-07-01 00:00:00 Lat2 Long2
Grid_1 Lat2 Long2 2021-07-01 00:01:00 Lat2 Long2
Grid_1 Lat2 Long2 2021-07-01 00:02:00 Lat2 Long2
But I am wondering whether a window function could be used here, and whether it would be faster than the first approach using groupBy and agg.
Any other approach that is faster would certainly be appreciated.

This will apply the latest Latitude/Longitude value to all rows (per Grid_ID):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (
    Window.partitionBy('Grid_ID')
    .orderBy("DateTimeStamp")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
df_final = (
    df.withColumn("date", F.to_date("DateTimeStamp"))
    .select(*df.columns,
            F.last('Latitude').over(w).alias('corrected_lat'),
            F.last('Longitude').over(w).alias('corrected_long'),
    )
)
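Another option worth sketching (not part of the original answer): recent Spark versions have the SQL aggregate max_by, which picks the Latitude/Longitude belonging to the latest DateTimeStamp per Grid_ID in a single aggregation, so collect_set and the date filter are not needed. A minimal sketch, assuming max_by is available in your Spark version:

import pyspark.sql.functions as F

# Pick the coordinates associated with the most recent DateTimeStamp per grid.
df_latest = df.groupBy("Grid_ID").agg(
    F.expr("max_by(Latitude, DateTimeStamp)").alias("corrected_lat"),
    F.expr("max_by(Longitude, DateTimeStamp)").alias("corrected_long"),
)
df_final = df.join(df_latest, on="Grid_ID", how="inner")

Like the groupBy/join approach, this still shuffles on Grid_ID, so it is not guaranteed to be faster than the window version; it is mainly a more direct way to express "take the latest value".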

Related

Pyspark: How to get time difference in seconds while splitting it into each day

The question itself might be a little awkward, so I will show what I mean by example.
Here is the dataframe that I am dealing with:
+---------+-------------------+-------------------+---------------------+
| id| start_time| ts_utc|stream_duration_total|
+---------+-------------------+-------------------+---------------------+
| 33|2022-07-03 00:07:20|2022-07-06 11:10:34| 298994|
+---------+-------------------+-------------------+---------------------+
I calculated the column stream_duration_total (the time difference in seconds) myself; 298994 is the total time difference in seconds. However, I was asked for daily info, which means I need to produce a data frame like the one below, which splits the seconds across each day. For example, from 2022-07-03 00:07:20 to the end of 2022-07-03 (23:59:59) there is a gap of 85959 seconds (86399 - 440 = 85959), and so on.
id start_time ts_utc dt stream_duration_total
33 2022-07-03 00:07:20 2022-07-06 11:10:34 2022-07-03 85959
33 2022-07-03 00:07:20 2022-07-06 11:10:34 2022-07-04 86400
33 2022-07-03 00:07:20 2022-07-06 11:10:34 2022-07-05 86400
33 2022-07-03 00:07:20 2022-07-06 11:10:34 2022-07-06 40235
Please help me! Thanks in advance!
A pyspark solution:
from datetime import timedelta
from pyspark.sql.functions import col, explode, lit, udf, when
from pyspark.sql.types import ArrayType, DateType

# udf parse_date_range: list of calendar dates covered by the two timestamps
@udf(returnType=ArrayType(DateType()))
def parse_date_range(start_date, end_date):
    dates = []
    for i in range((end_date - start_date).days + 1):
        dates.append(start_date + timedelta(days=i))
    return dates

df = spark.createDataFrame([(33, "2022-07-03 00:07:20", "2022-07-06 11:10:34")], ['id', 'start_time', 'ts_utc'])\
    .withColumn("start_time", col("start_time").cast("timestamp"))\
    .withColumn("ts_utc", col("ts_utc").cast("timestamp"))
df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- start_time: timestamp (nullable = true)
# |-- ts_utc: timestamp (nullable = true)

df = df.withColumn("split_date", parse_date_range(col("start_time"), col("ts_utc")))\
    .withColumn("split_date", explode(col("split_date")))\
    .withColumn("stream_duration_total",
                when(col("start_time").cast("date") == col("split_date"),
                     86399 + col("split_date").cast("timestamp").cast("long") - col("start_time").cast("long"))
                .when(col("ts_utc").cast("date") == col("split_date"),
                      1 + col("ts_utc").cast("long") - col("split_date").cast("timestamp").cast("long"))
                .otherwise(lit(86400)))
df.show()
df.show()
+---+-------------------+-------------------+----------+---------------------+
| id| start_time| ts_utc|split_date|stream_duration_total|
+---+-------------------+-------------------+----------+---------------------+
| 33|2022-07-03 00:07:20|2022-07-06 11:10:34|2022-07-03| 85959|
| 33|2022-07-03 00:07:20|2022-07-06 11:10:34|2022-07-04| 86400|
| 33|2022-07-03 00:07:20|2022-07-06 11:10:34|2022-07-05| 86400|
| 33|2022-07-03 00:07:20|2022-07-06 11:10:34|2022-07-06| 40235|
+---+-------------------+-------------------+----------+---------------------+
We can do this without the use of UDFs as well. The sequence() function will help you create the dates between the two timestamps, and that list can be exploded to create rows.
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_ls = [
    ('33', '2022-07-03 00:07:20', '2022-07-06 11:10:34')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id', 'start_ts', 'ts_utc']). \
    withColumn('start_ts', func.col('start_ts').cast('timestamp')). \
    withColumn('ts_utc', func.col('ts_utc').cast('timestamp'))

# +---+-------------------+-------------------+
# |id |start_ts           |ts_utc             |
# +---+-------------------+-------------------+
# |33 |2022-07-03 00:07:20|2022-07-06 11:10:34|
# +---+-------------------+-------------------+

data_sdf. \
    withColumn('dt_seq', func.expr('sequence(to_date(start_ts), to_date(ts_utc), interval 1 day)')). \
    withColumn('dt', func.explode('dt_seq')). \
    withColumn('end_ts', func.from_unixtime(func.date_add('dt', 1).cast('timestamp').cast('int') - 1).cast('timestamp')). \
    withColumn('duration',
               func.when(func.to_date('start_ts') == func.col('dt'), func.col('end_ts').cast('long') - func.col('start_ts').cast('long')).
               when(func.to_date('ts_utc') == func.col('dt'), func.col('end_ts').cast('long') - func.col('ts_utc').cast('long')).
               otherwise(func.col('end_ts').cast('long') - func.lag('end_ts').over(wd.partitionBy('id').orderBy('dt')).cast('long'))
               ). \
    show(truncate=False)
# +---+-------------------+-------------------+------------------------------------------------+----------+-------------------+--------+
# |id |start_ts |ts_utc |dt_seq |dt |end_ts |duration|
# +---+-------------------+-------------------+------------------------------------------------+----------+-------------------+--------+
# |33 |2022-07-03 00:07:20|2022-07-06 11:10:34|[2022-07-03, 2022-07-04, 2022-07-05, 2022-07-06]|2022-07-03|2022-07-03 23:59:59|85959 |
# |33 |2022-07-03 00:07:20|2022-07-06 11:10:34|[2022-07-03, 2022-07-04, 2022-07-05, 2022-07-06]|2022-07-04|2022-07-04 23:59:59|86400 |
# |33 |2022-07-03 00:07:20|2022-07-06 11:10:34|[2022-07-03, 2022-07-04, 2022-07-05, 2022-07-06]|2022-07-05|2022-07-05 23:59:59|86400 |
# |33 |2022-07-03 00:07:20|2022-07-06 11:10:34|[2022-07-03, 2022-07-04, 2022-07-05, 2022-07-06]|2022-07-06|2022-07-06 23:59:59|46165 |
# +---+-------------------+-------------------+------------------------------------------------+----------+-------------------+--------+
I've retained all calculated fields for your understanding. You can select only the required ones - select('id', 'start_ts', 'ts_utc', 'dt', 'duration').
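For instance, assuming the chained transformation above were assigned to a variable (say daily_sdf, a name used here only for illustration) instead of ending in .show(), the trimmed result would simply be:

# daily_sdf is a hypothetical name for the chained result above, minus the .show()
result = daily_sdf.select('id', 'start_ts', 'ts_utc', 'dt', 'duration')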

How to round timestamp to 10 minutes in Spark 3.0?

I have a timestamp like this in $"my_col":
2022-01-21 22:11:11
With date_trunc("minute", $"my_col") I get
2022-01-21 22:11:00
and with date_trunc("hour", $"my_col") I get
2022-01-21 22:00:00
What is a Spark 3.0 way to get
2022-01-21 22:10:00
?
Convert the timestamp into seconds using the unix_timestamp function, then divide by 600 (10 minutes), round the result of the division, and multiply by 600 again:
val df = Seq(
  ("2022-01-21 22:11:11"),
  ("2022-01-21 22:04:04"),
  ("2022-01-21 22:19:34"),
  ("2022-01-21 22:57:14")
).toDF("my_col").withColumn("my_col", to_timestamp($"my_col"))

df.withColumn(
  "my_col_rounded",
  from_unixtime(round(unix_timestamp($"my_col") / 600) * 600)
).show
//+-------------------+-------------------+
//|my_col |my_col_rounded |
//+-------------------+-------------------+
//|2022-01-21 22:11:11|2022-01-21 22:10:00|
//|2022-01-21 22:04:04|2022-01-21 22:00:00|
//|2022-01-21 22:19:34|2022-01-21 22:20:00|
//|2022-01-21 22:57:14|2022-01-21 23:00:00|
//+-------------------+-------------------+
You can also truncate the original timestamp to the hour, round the minutes to the nearest 10, and add them back to the truncated timestamp using an interval:
df.withColumn(
  "my_col_rounded",
  date_trunc("hour", $"my_col") + format_string(
    "interval %s minute",
    expr("round(extract(MINUTE FROM my_col)/10.0)*10")
  ).cast("interval")
)
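For reference, a minimal PySpark sketch of the first (unix_timestamp) approach above, assuming a DataFrame df with a timestamp column my_col:

from pyspark.sql import functions as F

# Round to the nearest 10 minutes (600 seconds), mirroring the Scala answer above.
df_rounded = df.withColumn(
    "my_col_rounded",
    F.from_unixtime(F.round(F.unix_timestamp("my_col") / 600) * 600).cast("timestamp"),
)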

How to split a DataFrame column value on linefeed and create a new column with the last 2 items (lines)

I'd like to split a column value on line feeds and create a new column with the last two items (lines):
df1 = spark.createDataFrame([
["001\r\nLuc Krier\r\n2363 Ryan Road, Long Lake South Dakota"],
["002\r\nJeanny Thorn\r\n2263 Patton Lane Raleigh North Carolina"],
["003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin"],
["004\r\nPhilippe Schauss\r\n1 Im Oberdorf Allemagne"],
["005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam"]
]).toDF("s")
This is not working (it gives a None value):
df1.withColumn('last_2', split(df1.s, '\r\n')[-2])
You can achieve this simply with the substring_index function:
import pyspark.sql.functions as f

df1.withColumn('last2', f.substring_index('s', '\r\n', -2)).drop('s').show(10, False)
+-----------------------------------------------------------+
|last2 |
+-----------------------------------------------------------+
|Luc Krier
2363 Ryan Road, Long Lake South Dakota |
|Jeanny Thorn
2263 Patton Lane Raleigh North Carolina |
|Teddy E Beecher
2839 Hartland Avenue Fond Du Lac Wisconsin|
|Philippe Schauss
1 Im Oberdorf Allemagne |
|Meindert I Tholen
Hagedoornweg 138 Amsterdam |
+-----------------------------------------------------------+
Hope it helps
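Note that substring_index keeps the last two lines as a single string with the line break still inside it. If you would rather have them as separate columns, a small follow-up sketch (the column names name and address are just for illustration):

import pyspark.sql.functions as f

two_cols = (
    df1.withColumn("last2", f.substring_index("s", "\r\n", -2))
       .withColumn("name", f.split("last2", "\r\n").getItem(0))     # first of the last two lines
       .withColumn("address", f.split("last2", "\r\n").getItem(1))  # second of the last two lines
)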
Yes, I am also facing the same issue with negative indexing, but positive indexing works.
I tried the slice function and it worked fine. Can you try this:
import pyspark.sql.functions as F

df1 = sqlContext.createDataFrame([
    ["001\r\nLuc Krier\r\n2363 Ryan Road, Long Lake South Dakota"],
    ["002\r\nJeanny Thorn\r\n2263 Patton Lane Raleigh North Carolina"],
    ["003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin"],
    ["004\r\nPhilippe Schauss\r\n1 Im Oberdorf Allemagne"],
    ["005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam"]
]).toDF("s")
df_r = df1.withColumn('spl', F.split(F.col('s'), '\r\n'))
df_res = df_r.withColumn("res", F.slice(F.col("spl"), -2, 2))  # last two elements
Perhaps this is helpful:
val sDF = Seq("""001\r\nLuc Krier\r\n2363 Ryan Road, Long Lake South Dakota""",
"""002\r\nJeanny Thorn\r\n2263 Patton Lane Raleigh North Carolina""",
"""003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin""",
"""004\r\nPhilippe Schauss\r\n1 Im Oberdorf Allemagne""",
"""005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam""").toDF("""s""")
val processedDF = sDF.withColumn("col1", slice(split(col("s"), """\\r\\n"""), -2, 2))
processedDF.show(false)
processedDF.printSchema()
/**
* +--------------------------------------------------------------------+-------------------------------------------------------------+
* |s |col1 |
* +--------------------------------------------------------------------+-------------------------------------------------------------+
* |001\r\nLuc Krier\r\n2363 Ryan Road, Long Lake South Dakota |[Luc Krier, 2363 Ryan Road, Long Lake South Dakota] |
* |002\r\nJeanny Thorn\r\n2263 Patton Lane Raleigh North Carolina |[Jeanny Thorn, 2263 Patton Lane Raleigh North Carolina] |
* |003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin|[Teddy E Beecher, 2839 Hartland Avenue Fond Du Lac Wisconsin]|
* |004\r\nPhilippe Schauss\r\n1 Im Oberdorf Allemagne |[Philippe Schauss, 1 Im Oberdorf Allemagne] |
* |005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam |[Meindert I Tholen, Hagedoornweg 138 Amsterdam] |
* +--------------------------------------------------------------------+-------------------------------------------------------------+
*
* root
* |-- s: string (nullable = true)
* |-- col1: array (nullable = true)
* | |-- element: string (containsNull = true)
*/

Data processing for time series data - Spark

Given the below sample data (t = time series datetime sample, lat = latitude, long = longitude):
t lat long
0 27 28
5 27 28
10 27 28
15 29 49
20 29 49
25 27 28
30 27 28
I want to get output similar to the following: I want to process the time series data so that, by grouping on the (lat, long) pair, I get the distinct time interval for each run of that pair.
I am doing the processing in Spark.
Lat-long interval
(27,28) (0,10)
(29,49) (15,20)
(27,28) (25,30)
I wouldn't have suggested this solution if your data were huge, but since you commented
"I am processing the day-wise data which is stored in Cassandra, size of 5-6k rows of records/second"
the following proposal should be fine.
Looking at your given dataframe, the schema should be:
root
|-- t: integer (nullable = false)
|-- lat: integer (nullable = false)
|-- long: integer (nullable = false)
And your expected output suggests that you need an additional column for grouping the dataframe, which requires collecting the data to the driver:
import scala.collection.mutable.ArrayBuffer

val collectedRows = df.collect()
var varianceCount, latitude, longitude = 0
val groupedData = new ArrayBuffer[(Int, Int, Int, Int)]()
for (row <- collectedRows) {
  val t = row.getAs[Int]("t")
  val lat = row.getAs[Int]("lat")
  val long = row.getAs[Int]("long")
  if (lat != latitude || long != longitude) {
    varianceCount = varianceCount + 1
    latitude = lat
    longitude = long
    groupedData.append((t, lat, long, varianceCount))
  }
  else {
    groupedData.append((t, lat, long, varianceCount))
  }
}
Then you convert the ArrayBuffer to a dataframe and use groupBy and aggregation:
import org.apache.spark.sql.functions._
import spark.implicits._

val finalDF = groupedData
  .toDF("t", "lat", "long", "grouped")
  .groupBy(struct("lat", "long").as("lat-long"), col("grouped"))
  .agg(struct(min("t"), max("t")).as("interval"))
  .drop("grouped")
finalDF should be
+--------+--------+
|lat-long|interval|
+--------+--------+
|[29,49] |[15,20] |
|[27,28] |[0,10] |
|[27,28] |[25,30] |
+--------+--------+
I hope the answer is helpful
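For larger data, a collect-free alternative is worth sketching (not part of the original answer): the same grouping column can be built with a lag + cumulative-sum ("gaps and islands") window. A minimal PySpark sketch, assuming the dataframe is called df:

from pyspark.sql import functions as F, Window

# Single global ordering by t; fine for small data, add partitionBy (e.g. by day) for large data.
w = Window.orderBy("t")

intervals = (
    df.withColumn(
        "changed",
        (F.lag("lat").over(w) != F.col("lat")) | (F.lag("long").over(w) != F.col("long")),
    )
    # running count of changes = group id for each consecutive run of the same (lat, long)
    .withColumn("grouped", F.sum(F.when(F.col("changed"), 1).otherwise(0)).over(w))
    .groupBy("lat", "long", "grouped")
    .agg(F.min("t").alias("start_t"), F.max("t").alias("end_t"))
    .drop("grouped")
)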

PostgreSQL check if coordinate is inside a bounding box

I have some locations I want to store in a database. The locations are defined by 4 coordinates: p1(lat,long), p2(lat,long), p3(lat,long) and p4(lat,long); the only characteristic is that they always form a rectangle.
Once a few locations are stored in the DB, I want to be able to query it with a point (lat, long) and check whether that point is inside any of the boxes in the DB.
My first question is: what's the best way to design this table to make it easy and efficient to query later? My first guess is something like this:
| id | lat1 | lon1 | lat2 | lon2 | lat3 | lon3 | lat4 | lon4 |
--------------------------------------------------------------
But I'm not sure what the best query is to get all the locations (or the single location) that another point is inside. For example, the database contains 2 locations (rows):
| id | lat1 | lon1 | lat2 | lon2 | lat3 | lon3 | lat4 | lon4 |
--------------------------------------------------------------
| 1 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 0 |
| 2 | 50 | 50 | 50 | 60 | 60 | 60 | 60 | 50 |
If I have the point (5,5) how can I query the DB to get row 1?
You can use the least() and greatest() functions to get the min/max values for the x,y values (and maybe use these to construct two points and an enclosing rectangle)
CREATE TABLE latlon
( id INTEGER NOT NULL PRIMARY KEY
, lat1 INTEGER NOT NULL , lon1 INTEGER NOT NULL
, lat2 INTEGER NOT NULL , lon2 INTEGER NOT NULL
, lat3 INTEGER NOT NULL , lon3 INTEGER NOT NULL
, lat4 INTEGER NOT NULL , lon4 INTEGER NOT NULL
);
INSERT into LATLON ( id,lat1,lon1,lat2,lon2,lat3,lon3,lat4,lon4) VALUES
( 1 ,0 ,0 ,0 ,10 ,10 ,10 ,10 ,0 ),
( 2 ,50 ,50 ,50 ,60 ,60 ,60 ,60 ,50 );
SELECT id
, LEAST(lat1,lat2,lat3,lat4) AS MINLAT
, LEAST(lon1,lon2,lon3,lon4) AS MINLON
, GREATEST(lat1,lat2,lat3,lat4) AS MAXLAT
, GREATEST(lon1,lon2,lon3,lon4) AS MAXLON
FROM latlon;
Result:
CREATE TABLE
INSERT 0 2
id | minlat | minlon | maxlat | maxlon
----+--------+--------+--------+--------
1 | 0 | 0 | 10 | 10
2 | 50 | 50 | 60 | 60
(2 rows)
finding the (5,5) point:
SELECT * FROM (
SELECT id
, LEAST(lat1,lat2,lat3,lat4) AS minlat
, LEAST(lon1,lon2,lon3,lon4) AS minlon
, GREATEST(lat1,lat2,lat3,lat4) AS maxlat
, GREATEST(lon1,lon2,lon3,lon4) AS maxlon
FROM latlon
) rect
WHERE 5 >= rect.minlat AND 5 < rect.maxlat
  AND 5 >= rect.minlon AND 5 < rect.maxlon
;
Question 1: Are the rectangles aligned with the axes, like in your examples?
If they are, then it's enough to consider lat/long 1 and 3, for example, because you know that:
minLat = lat1
minLong = long1
maxLat = lat3
maxLong = long3
Question 2: Are lat1 and long1 less than lat3 and long3 (i.e. are they ordered)?
If yes:
// You simply need to check that:
(lat1 <= latX <= lat3)
&& (long1 <= longX <= long3)
If not, you can first check whether lat1 <= lat3 and long1 <= long3 and swap them if needed.
Finally: if the answer to question 1 was "no", then you can use the same principle, but first you need to apply some additional math. Judging by the provided examples, though, I suppose that is not the case.
...anyway, if you are planning to handle more complex cases (such as real, not only integer, coordinates) you should probably try PostGIS...