I am trying to implement a rolling window with a 30-minute time frame, grouped by source_ip. The idea is to get the average number of packets for each source_ip. I'm not sure this is the right way to do it. The problem is that for IP 192.168.1.3 the average seems to span more than the 30-minute window, since the row with 25 packets is a couple of days later.
df = sqlContext.createDataFrame([('192.168.1.1', 17, "2017-03-10T15:27:18+00:00"),
('192.168.1.2', 1, "2017-03-15T12:27:18+00:00"),
('192.168.1.2', 2, "2017-03-15T12:28:18+00:00"),
('192.168.1.2', 3, "2017-03-15T12:29:18+00:00"),
('192.168.1.3', 4, "2017-03-15T12:28:18+00:00"),
('192.168.1.3', 5, "2017-03-15T12:29:18+00:00"),
('192.168.1.3', 25, "2017-03-18T11:27:18+00:00")],
["source_ip","packets", "timestampGMT"])
w = (Window()
.partitionBy("source_ip")
.orderBy(F.col("timestampGMT").cast('long'))
.rangeBetween(-1800, 0))
df = df.withColumn('rolling_average', F.avg("packets").over(w))
df.show(100,False)
This is the result I get. I would expect 4.5 for the first 2 entries and 25 for the third entry?
+-----------+-------+-------------------------+------------------+
|source_ip |packets|timestampGMT |rolling_average |
+-----------+-------+-------------------------+------------------+
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|11.333333333333334|
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|11.333333333333334|
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|11.333333333333334|
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|2.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|2.0 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|17.0 |
+-----------+-------+-------------------------+------------------+
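Casting the ISO-8601 string straight to long returns null for every row (with default, non-ANSI settings), so all rows in a partition share the same ordering value and the range frame covers the whole partition. Convert the string to a timestamp first, then to epoch seconds, and order by that: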
import pyspark.sql.functions as F
from pyspark.sql import Window
w = (Window()
.partitionBy("source_ip")
.orderBy(F.col("timestamp"))
.rangeBetween(-1800, 0))
df = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
.withColumn('rolling_average', F.avg("packets").over(w))
df.printSchema()
df.show(100,False)
root
|-- source_ip: string (nullable = true)
|-- packets: long (nullable = true)
|-- timestampGMT: string (nullable = true)
|-- timestamp: long (nullable = true)
|-- rolling_average: double (nullable = true)
+-----------+-------+-------------------------+----------+---------------+
|source_ip |packets|timestampGMT |timestamp |rolling_average|
+-----------+-------+-------------------------+----------+---------------+
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|1489580838|1.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|1489580898|1.5 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|1489580958|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|1489159638|17.0 |
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|1489580898|4.0 |
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|1489580958|4.5 |
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|1489836438|25.0 |
+-----------+-------+-------------------------+----------+---------------+
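To check the numbers above without a Spark session, here is a minimal plain-Python sketch of what rangeBetween(-1800, 0) does once the ordering column holds epoch seconds. The helper name rolling_average is made up for illustration; it is not part of any Spark API, and it only mimics the frame logic on the sample rows from the question.

```python
from datetime import datetime

# Sample rows from the question: (source_ip, packets, timestampGMT)
rows = [
    ("192.168.1.1", 17, "2017-03-10T15:27:18+00:00"),
    ("192.168.1.2", 1, "2017-03-15T12:27:18+00:00"),
    ("192.168.1.2", 2, "2017-03-15T12:28:18+00:00"),
    ("192.168.1.2", 3, "2017-03-15T12:29:18+00:00"),
    ("192.168.1.3", 4, "2017-03-15T12:28:18+00:00"),
    ("192.168.1.3", 5, "2017-03-15T12:29:18+00:00"),
    ("192.168.1.3", 25, "2017-03-18T11:27:18+00:00"),
]


def rolling_average(rows, window_seconds=1800):
    """Hypothetical helper: per row, average the packets of rows with the
    same IP whose epoch second lies in [t - window_seconds, t]."""
    # Parse the ISO-8601 strings (with offset) into epoch seconds.
    keyed = [
        (ip, packets, int(datetime.fromisoformat(ts).timestamp()))
        for ip, packets, ts in rows
    ]
    result = []
    for ip, packets, t in keyed:
        frame = [
            p
            for other_ip, p, other_t in keyed
            if other_ip == ip and t - window_seconds <= other_t <= t
        ]
        result.append((ip, packets, t, sum(frame) / len(frame)))
    return result


averages = rolling_average(rows)
for row in averages:
    print(row)
```

The row with 25 packets is almost three days after the other two rows for 192.168.1.3, so its frame contains only itself, which is why it averages to 25.0.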
I can do the following statement in SparkSQL:
result_df = spark.sql("""select
one_field,
field_with_struct
from purchases""")
The resulting data frame will contain the full struct in field_with_struct:
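+---------+-----------------------+
|one_field|field_with_struct      |
+---------+-----------------------+
|123      |{name1,val1,val2,f2,f4}|
|555      |{name2,val3,val4,f6,f7}|
+---------+-----------------------+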
I want to select only a few fields from field_with_struct, but still keep them in a struct in the resulting data frame. Something like this, if it were possible (this is not real code):
result_df = spark.sql("""select
one_field,
struct(
field_with_struct.name,
field_with_struct.value2
) as my_subset
from purchases""")
To get this:
+---------+------------+
|one_field|my_subset   |
+---------+------------+
|123      |{name1,val2}|
|555      |{name2,val4}|
+---------+------------+
Is there any way of doing this with SQL? (not with fluent API)
There's a much simpler solution using arrays_zip, with no need for explode/collect_list (which can be error-prone or difficult with complex data, since it relies on something like an id column):
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import arrays_zip
>>> df = spark.createDataFrame((([Row(x=1, y=2, z=3), Row(x=2, y=3, z=4)],),), ['array_of_structs'])
>>> df.show(truncate=False)
+----------------------+
|array_of_structs |
+----------------------+
|[{1, 2, 3}, {2, 3, 4}]|
+----------------------+
>>> df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
| | |-- z: long (nullable = true)
>>> # Selecting only two of the nested fields:
>>> selected_df = df.select(arrays_zip("array_of_structs.x", "array_of_structs.y").alias("array_of_structs"))
>>> selected_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> selected_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
EDIT: Adding the corresponding Spark SQL code, since it was requested by the OP:
>>> df.createTempView("test_table")
>>> sql_df = spark.sql("""
SELECT
transform(array_of_structs, x -> struct(x.x, x.y)) as array_of_structs
FROM test_table
""")
>>> sql_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> sql_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
In fact, the pseudocode I provided works as written. For a nested array of objects it's not so straightforward: first explode the array (EXPLODE() function), then select the subset of fields, and finally reassemble the array with COLLECT_LIST().
WITH
unfold_by_items AS (SELECT id, EXPLODE(Items) AS item FROM spark_tbl_items)
, format_items as (SELECT
id
, STRUCT(
item.item_id
, item.name
) AS item
FROM unfold_by_items)
, fold_by_items AS (SELECT id, COLLECT_LIST(item) AS Items FROM format_items GROUP BY id)
SELECT * FROM fold_by_items
This selects only two fields from each struct in Items and returns a dataset that again contains an array of Items.
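The three CTE steps above can be sketched in plain Python to make the data flow explicit. The sample rows and the extra price field are invented for illustration; only the id, Items, item_id and name names come from the query. Note that in Spark the element order produced by COLLECT_LIST is not guaranteed, whereas this sketch simply keeps input order.

```python
from collections import defaultdict

# Hypothetical rows shaped like spark_tbl_items: an id plus an array of item structs.
table = [
    {"id": 1, "Items": [{"item_id": 10, "name": "a", "price": 1.5},
                        {"item_id": 11, "name": "b", "price": 2.0}]},
    {"id": 2, "Items": [{"item_id": 20, "name": "c", "price": 3.0}]},
]

# Step 1 (unfold_by_items): EXPLODE turns each array element into its own row.
unfolded = [(row["id"], item) for row in table for item in row["Items"]]

# Step 2 (format_items): STRUCT(item.item_id, item.name) keeps only two fields.
formatted = [
    (id_, {"item_id": item["item_id"], "name": item["name"]})
    for id_, item in unfolded
]

# Step 3 (fold_by_items): COLLECT_LIST ... GROUP BY id rebuilds one array per id.
folded = defaultdict(list)
for id_, item in formatted:
    folded[id_].append(item)

result = dict(folded)
print(result)
```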
Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map column and select all the columns that end with _id.
I'm using the code below.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES"))
exploded.show()
I am getting the sample output below.
+----------+-----------+-----+
|date      |key        |value|
+----------+-----------+-----+
|2021-05-18|SYST_id    |85   |
|2021-05-18|RECVR_id   |1    |
|2021-05-18|Account_Id |12345|
|2021-05-18|Vb_id      |845  |
|2021-05-18|SYS_INFO_id|640  |
|2021-05-18|mem_id     |456  |
+----------+-----------+-----+
However, my required output is as below.
+----------+-------+--------+----------+-----+-----------+------+
|date      |SYST_id|RECVR_id|Account_Id|Vb_id|SYS_INFO_id|mem_id|
+----------+-------+--------+----------+-----+-----------+------+
|2021-05-18|85     |1       |12345     |845  |640        |456   |
+----------+-------+--------+----------+-----+-----------+------+
Could someone please assist?
Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as aggregation function.
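To illustrate what groupBy, pivot and first do together, here is a small plain-Python sketch on the sample rows from the question. The function name pivot_first is hypothetical; it only mimics the "first value per (date, key)" behaviour and is not Spark code.

```python
# Exploded rows from the question: (date, key, value)
exploded = [
    ("2021-05-18", "SYST_id", "85"),
    ("2021-05-18", "RECVR_id", "1"),
    ("2021-05-18", "Account_Id", "12345"),
    ("2021-05-18", "Vb_id", "845"),
    ("2021-05-18", "SYS_INFO_id", "640"),
    ("2021-05-18", "mem_id", "456"),
]


def pivot_first(rows):
    """Hypothetical helper: one output row per date, one column per key,
    keeping the first value seen for each (date, key) pair."""
    wide = {}
    for date, key, value in rows:
        # setdefault only stores the value if the key is not present yet,
        # which corresponds to taking the first value in the aggregation.
        wide.setdefault(date, {}).setdefault(key, value)
    return wide


wide = pivot_first(exploded)
print(wide)
```

If a (date, key) pair could occur more than once, replacing the inner setdefault with a list append would correspond to the collect_list variant mentioned above.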
Edit:
To add srcId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
.groupBy("date").pivot("key").agg(first("value")).show()
I would like to aggregate the sum of Order.amount for each CustomerID where date <= '03/30/2021' (MM/dd/yyyy), taking advantage of having the array per CustomerID row.
Expected output, based on the input data below:
1 250
2 450
Input data:
CustomerID Order
1 [[1,100,01/01/2021],[2,200,06/01/2021],[3,150,03/01/2021]]
2 [[1,200,02/01/2021],[2,250,03/01/2021],[3,300,05/01/2021]]
Schema:
root
|-- CustomerID
|-- Order: array
| |-- element: struct
| | |-- Order
| | |-- amount
| | |-- date
Suppose df is your dataframe,
df = spark.createDataFrame(
((1, [[1,100,'01/01/2021'],[2,200,'06/01/2021'],[3,150,'03/01/2021']]),
(2 , [[1,200,'02/01/2021'],[2,250,'03/01/2021'],[3,300,'05/01/2021']])),
"CustomerID : int, Order : array<struct<Order: int, amount: int, date: string>>")
df.printSchema()
# root
# |-- CustomerID: integer (nullable = true)
# |-- Order: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- Order: integer (nullable = true)
# | | |-- amount: integer (nullable = true)
# | | |-- date: string (nullable = true)
For Spark 2.4 and later, you can use higher-order functions to work with arrays. In this case, filter by date and then aggregate the amounts.
import pyspark.sql.functions as F
sql_expr = """
aggregate(
filter(order, x -> to_date(x.date, 'MM/dd/yyyy') <= '2021-03-30').amount,
0, (a, b) -> a + b)
"""
df = df.withColumn('total_amount', F.expr(sql_expr))
df.show()
# +----------+--------------------+------------+
# |CustomerID| Order|total_amount|
# +----------+--------------------+------------+
# | 1|[[1, 100, 01/01/2...| 250|
# | 2|[[1, 200, 02/01/2...| 450|
# +----------+--------------------+------------+
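For readers who want to verify the totals by hand, the filter-then-aggregate expression can be mirrored in plain Python. This is only a sketch of the same logic on the sample data; the names orders, cutoff and total_before are made up for illustration.

```python
from datetime import datetime
from functools import reduce

# Sample data from the question: CustomerID -> list of (Order, amount, date)
orders = {
    1: [(1, 100, "01/01/2021"), (2, 200, "06/01/2021"), (3, 150, "03/01/2021")],
    2: [(1, 200, "02/01/2021"), (2, 250, "03/01/2021"), (3, 300, "05/01/2021")],
}

cutoff = datetime(2021, 3, 30)


def total_before(order_list, cutoff):
    """Hypothetical helper mirroring aggregate(filter(...).amount, 0, (a, b) -> a + b)."""
    # filter(...): keep amounts whose MM/dd/yyyy date is on or before the cutoff.
    kept = [
        amount
        for _, amount, date in order_list
        if datetime.strptime(date, "%m/%d/%Y") <= cutoff
    ]
    # aggregate(..., 0, (a, b) -> a + b): fold the amounts starting from 0.
    return reduce(lambda a, b: a + b, kept, 0)


totals = {customer_id: total_before(o, cutoff) for customer_id, o in orders.items()}
print(totals)  # {1: 250, 2: 450}
```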
I have a column which represents unix_timestamp and want to convert it into string with this format, 'yyyy-MM-dd HH:mm:ss.SSS'.
unix_timestamp | time_string
1578569683753 | 2020-01-09 11:34:43.753
1578569581793 | 2020-01-09 11:33:01.793
1578569581993 | 2020-01-09 11:33:01.993
Is there a built-in function for this, or how else can it be done? Thanks.
df1 = df1.withColumn('utc_stamp', F.from_unixtime('Timestamp', format="yyyy-MM-dd HH:mm:ss"))
df1.show(truncate=False)
from_unixtime only converts at second precision; for milliseconds, I just have to concatenate them from the original column onto the new column.
unix_timestamp only supports second precision. Your values are in milliseconds: the last 3 digits are the milliseconds.
from pyspark.sql.functions import substring,unix_timestamp,col,to_timestamp,concat,lit,from_unixtime
df = spark.createDataFrame([('1578569683753',), ('1578569581793',),('1578569581993',)], ['TMS'])
df.show(3,False)
df.printSchema()
Result
+-------------+
|TMS |
+-------------+
|1578569683753|
|1578569581793|
|1578569581993|
+-------------+
root
|-- TMS: string (nullable = true)
Convert to the human-readable timestamp format
df1 = (df
.select("TMS"
,from_unixtime(substring(col("TMS"),1,10), format="yyyy-MM-dd HH:mm:ss").alias("TMS_WITHOUT_MILLISECONDS")
,(substring("TMS",11,3)).alias("MILLISECONDS")
,(concat(from_unixtime(substring(col("TMS"),1,10), format="yyyy-MM-dd HH:mm:ss"),lit('.'), substring(df.TMS,11,3))).alias("TMS_StringType")
,to_timestamp(concat(from_unixtime(substring(col("TMS"),1,10), format="yyyy-MM-dd HH:mm:ss"),lit('.'), substring(df.TMS,11,3))).alias("TMS_TimestampType")
)
)
df1.show(3,False)
df1.printSchema()
Output
+-------------+------------------------+------------+-----------------------+-----------------------+
|TMS          |TMS_WITHOUT_MILLISECONDS|MILLISECONDS|TMS_StringType         |TMS_TimestampType      |
+-------------+------------------------+------------+-----------------------+-----------------------+
|1578569683753|2020-01-09 11:34:43     |753         |2020-01-09 11:34:43.753|2020-01-09 11:34:43.753|
|1578569581793|2020-01-09 11:33:01     |793         |2020-01-09 11:33:01.793|2020-01-09 11:33:01.793|
|1578569581993|2020-01-09 11:33:01     |993         |2020-01-09 11:33:01.993|2020-01-09 11:33:01.993|
+-------------+------------------------+------------+-----------------------+-----------------------+
root
|-- TMS: string (nullable = true)
|-- TMS_WITHOUT_MILLISECONDS: string (nullable = true)
|-- MILLISECONDS: string (nullable = true)
|-- TMS_StringType: string (nullable = true)
|-- TMS_TimestampType: timestamp (nullable = true)
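The split-and-concatenate idea (format the seconds part, then append the last three digits) can be checked in plain Python. This is a minimal sketch that assumes the values should be rendered in UTC, which matches the times shown in the question; format_epoch_millis is a made-up helper name, and Spark itself renders in the session time zone.

```python
from datetime import datetime, timezone


def format_epoch_millis(ms_epoch):
    """Hypothetical helper: render epoch milliseconds as 'yyyy-MM-dd HH:mm:ss.SSS' in UTC."""
    # Integer division avoids floating-point rounding of the millisecond part.
    seconds, millis = divmod(int(ms_epoch), 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    # Re-attach the milliseconds, zero-padded to three digits.
    return f"{base}.{millis:03d}"


for tms in ("1578569683753", "1578569581793", "1578569581993"):
    print(tms, format_epoch_millis(tms))
```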
So my table looks something like this:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,310)| 34
a | NY | b |(2024,201,310)| 21
a | NY | b |(2010,304,312)| 76
c | NY | x |(2010,304,310)| 11
a | NY | b |(453,131,235) | 10
I've tried the following, but it does not eliminate the duplicates, because the full array is still part of the struct (as it should be; I need it in the end result).
val df= df_one.withColumn("vs", struct(col("item").getItem(size(col("item"))-1), col("item"), col("count")))
.groupBy(col("customer_1"), col("place"), col("customer_2"))
.agg(max("vs").alias("vs"))
.select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))
I would like to group by the customer_1, place and customer_2 columns and, for each distinct last element (index -1) of item, return only the row with the highest count. Any ideas?
Expected output:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,312)| 76
a | NY | b |(2010,304,310)| 34
a | NY | b |(453,131,235) | 10
c | NY | x |(2010,304,310)| 11
Given that the schema of the dataframe is as follows:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- count: string (nullable = true)
You can apply the concat function to create a temporary column for detecting duplicate rows, as done below:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item"(size($"item")-1)))
.dropDuplicates("temp")
.drop("temp")
You should get following output
+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item |count|
+----------+-----+----------+----------------+-----+
|a |NY |b |[2010, 304, 312]|76 |
|c |NY |x |[2010, 304, 310]|11 |
|a |NY |b |[453, 131, 235] |10 |
|a |NY |b |[2010, 304, 310]|34 |
+----------+-----+----------+----------------+-----+
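One caveat worth spelling out: dropDuplicates keeps an arbitrary row for each key, so it does not by itself guarantee the row with the highest count. The selection the question asks for can be sketched in plain Python as below. The helper keep_highest_count is hypothetical, and count is cast to int because the schema stores it as a string. In Spark, the analogous step would typically be a row_number() window partitioned by the same key and ordered by count descending.

```python
# Sample rows from the question: (customer_1, place, customer_2, item, count)
rows = [
    ("a", "NY", "b", (2010, 304, 310), "34"),
    ("a", "NY", "b", (2024, 201, 310), "21"),
    ("a", "NY", "b", (2010, 304, 312), "76"),
    ("c", "NY", "x", (2010, 304, 310), "11"),
    ("a", "NY", "b", (453, 131, 235), "10"),
]


def keep_highest_count(rows):
    """Hypothetical helper: for each (customer_1, place, customer_2, last item
    element), keep only the row with the highest numeric count."""
    best = {}
    for row in rows:
        customer_1, place, customer_2, item, count = row
        key = (customer_1, place, customer_2, item[-1])
        # count is a string in the schema, so compare it numerically.
        if key not in best or int(count) > int(best[key][4]):
            best[key] = row
    return list(best.values())


deduplicated = keep_highest_count(rows)
for row in deduplicated:
    print(row)
```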
Struct
Given that the schema of the dataframe is as follows:
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: integer (nullable = false)
| |-- _3: integer (nullable = false)
|-- count: string (nullable = true)
We can do the same as above, with a slight change to get the third item from the struct:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item._3"))
.dropDuplicates("temp")
.drop("temp")
Hope the answer is helpful