Spark ML Transformer - aggregate over a window using rangeBetween - scala

I would like to create a custom Spark ML Transformer that applies an aggregation function over a rolling window, using the over(window) construct. I would like to be able to use this transformer in a Spark ML Pipeline.
I would like to achieve something that can be done quite easily with withColumn, as shown in this answer:
Spark Window Functions - rangeBetween dates
For example:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}
val w = Window.orderBy(col("unixTimeMS")).rangeBetween(0, 700)
val df_new = df.withColumn("cts", sum("someColumnName").over(w))
Where
df is my dataframe
unixTimeMS is unix time in milliseconds
someColumnName is the column that I want to aggregate.
In this example I compute a sum over the rows within the window.
The window w includes the current transaction and all transactions up to 700 ms after it.
Is it possible to put such a window aggregation into a Spark ML Transformer?
I was able to achieve something similar with the Spark ML SQLTransformer:
val query = """SELECT *,
sum(someColumnName) over (order by unixTimeMS) as cts
FROM __THIS__"""
new SQLTransformer().setStatement(query)
But I can't figure out how to express rangeBetween in SQL to select a period of time rather than a number of rows. I need a specific period of time relative to the unixTimeMS of the current row.
I understand that UnaryTransformer is not the way to do it, because I need to compute an aggregate. Do I need to define a UDAF (user-defined aggregate function) and use it in SQLTransformer?
I wasn't able to find any example of a UDAF containing a window function.

I am answering my own question for future reference. I ended up using SQLTransformer with a window function that uses range between, as in this example:
val query = """SELECT *,
sum(dollars) over (
partition by Numerocarte
order by UnixTime
range between 1000 preceding and 200 following) as cts
FROM __THIS__"""
Here 1000 and 200 in range between are expressed in the units of the order by column.
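As a minimal sketch of how this statement could be wired into a Pipeline: the object name RangeWindowPipeline, the run helper, and the sample rows are hypothetical and only there for illustration; the column names follow the query above. It assumes a local SparkSession and a Spark version that accepts integer offsets for a range frame over a bigint order by column.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.{DataFrame, SparkSession}

object RangeWindowPipeline {
  // SQLTransformer stage: per card, sum dollars over the rows whose UnixTime
  // lies in [current - 1000, current + 200], in the units of UnixTime.
  val rollingSum: SQLTransformer = new SQLTransformer().setStatement(
    """SELECT *,
      |  sum(dollars) OVER (
      |    PARTITION BY Numerocarte
      |    ORDER BY UnixTime
      |    RANGE BETWEEN 1000 PRECEDING AND 200 FOLLOWING) AS cts
      |FROM __THIS__""".stripMargin)

  // Hypothetical helper: fit a one-stage pipeline and apply it to df.
  def run(df: DataFrame): DataFrame =
    new Pipeline()
      .setStages(Array[PipelineStage](rollingSum))
      .fit(df)
      .transform(df)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("range-window-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: (Numerocarte, UnixTime, dollars).
    val df = Seq(
      ("A", 0L, 1.0),
      ("A", 150L, 2.0),
      ("A", 900L, 4.0),
      ("A", 2500L, 8.0),
      ("B", 100L, 16.0)
    ).toDF("Numerocarte", "UnixTime", "dollars")

    run(df).orderBy("Numerocarte", "UnixTime").show()
    spark.stop()
  }
}
```

With this sample data, the row for card A at UnixTime 900 sums the rows at 0, 150 and 900, because all three fall between 900 - 1000 and 900 + 200. Other stages (feature assemblers, estimators) could be appended to the same setStages array.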

Related

Pyspark filter a dataframe based on another dataframe containing distinct values

I have a df1 of distinct values with two columns, date and value. There is a df2 that has multiple columns, including the date and value columns. For each distinct value from df1, I want to filter df2 so that the records before the date from df1 are dropped. It would be rather easy for a single distinct value: I could filter by value and then use something like gt(lit(date)). However, I have over 500 such distinct pairs in df1. A single operation takes around 20 minutes, so using a loop is not computationally feasible. Perhaps somebody can advise me on a better approach here.
I have tried multiple approaches, but nothing has worked yet.

How to calculate median on timestamp windowed data in Spark?

I am very new to Spark and have been stuck on this for a while now.
I am using Databricks for this. I create a dataframe df from a database, then collect the data using a 30-minute time window. The first challenge is that I am unable to figure out whether this groupBy operation is correct. Using agg does not seem to work.
val window30 = df.groupBy(window($"X_DATE_TS", "30 minute"), $"X_ID").agg(sort_array($"X_VALUE"))
display(window30)
Secondly, if I want to calculate the median of the column X_VALUE, how do I compute it over the multiple rows selected in each window?

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip and dst_port, and I also want the latest value from the received_time column of the original dataframe.
I want a dataframe with distinct src_ip values, their count, and the latest received_time in a new column named last_seen.
I know how to use .withColumn, and I think that .map() could also be used here.
Since I'm relatively new to this field, I really don't know how to proceed. I would appreciate help with this task.
Assuming you have a dataframe df with src_ip,src_port,dst_ip,dst_port and received_time, you can try:
import org.apache.spark.sql.functions.{col, count, max}
val mydf = df.groupBy(col("src_ip"),col("src_port"),col("dst_ip"),col("dst_port")).agg(count("received_time").as("row_count"),max(col("received_time")).as("max_received_time"))
The line above computes, for each combination of the group-by columns, the number of non-null received_time values and the maximum received_time.
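The question also asks for one row per distinct src_ip with its count and a last_seen column. A possible variant for that, sketched under the assumption that grouping by src_ip alone is what is wanted; the object name LastSeenBySrcIp, the lastSeen helper, and the sample rows are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, max}

object LastSeenBySrcIp {
  // One row per distinct src_ip, with the number of rows seen for it and
  // the latest received_time exposed as last_seen.
  def lastSeen(df: DataFrame): DataFrame =
    df.groupBy(col("src_ip"))
      .agg(
        count(lit(1)).as("count"),
        max(col("received_time")).as("last_seen"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("last-seen-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: (src_ip, received_time).
    val df = Seq(
      ("10.0.0.1", 1000L),
      ("10.0.0.1", 3000L),
      ("10.0.0.2", 2000L)
    ).toDF("src_ip", "received_time")

    lastSeen(df).orderBy("src_ip").show()
    spark.stop()
  }
}
```

Here count(lit(1)) counts every row in the group, whereas count("received_time") would skip rows where received_time is null.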

Unable to Convert to DataFrame from RDD after applying partitioning

I am using Spark 2.1.0.
When I try to use a window function on a DataFrame:
val winspec = Window.partitionBy("partition_column")
DF.withColumn("column", avg(DF("col_name")).over(winspec))
the plan changes: the lines below are added to the physical plan, which introduces an extra stage and an extra shuffle. The data is huge, so this slows the query down badly and it runs for hours.
+- Window [avg(cast(someColumn#262 as double)) windowspecdefinition(partition_column#460, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS someColumn#263], [partition_column#460]
+- *Sort [partition_column#460 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(partition_column#460, 200)
I also see a stage named MapInternalPartition, which I assume means the data is partitioned internally, but I don't know what it is. I suspect it is the cause: my 100 tasks took 30+ minutes, of which 99 completed within 1-2 minutes and the last one took the remaining 30 minutes, leaving my cluster idle with no parallel processing. That makes me wonder whether the data is partitioned properly when a window function is used.
I tried to apply hash partitioning by converting the DataFrame to an RDD, because a custom partitioner or HashPartitioner cannot be applied to a DataFrame.
So if I do this:
val myVal = DF.rdd.partitioner(new HashPartitioner(10000))
I get a return type of Any, on which no actions are available.
I checked and saw that the column used for partitioning in the window function contains only NULL values.
TL;DR:
1) A shuffle when using window functions is not extra. It is required for correctness and cannot be removed.
2) Applying a partitioner shuffles.
3) Datasets cannot reuse RDD partitioning. Partitioning a Dataset should be done with the repartition method:
df.repartition($"col_name")
But it won't help you because of 2)
And this:
val myVal = DF.rdd.partitioner(new HashPartitioner(10000))
wouldn't return Any. It wouldn't compile, as there is no partitioner method on RDD that takes arguments.
The correct method is partitionBy, but it is not applicable to RDD[Row] and it wouldn't help you because of 3).
If there is enough memory, you can try:
df.join(
broadcast(df.groupBy("partition_column").agg(avg(df("col_name")))),
Seq("partition_column")
)
Edit:
If you're trying to compute a running average (avg with Window.partitionBy("partition_column") computes the global average per group, not a running average), then you're out of luck.
If the partitioning column contains only NULLs, then the task is not distributed and runs fully sequentially.
To compute a global running average you can try to apply logic similar to that in "How to compute cumulative sum using Spark".

kdb/q: use function in a select from partitioned table

I'm trying to get max drawdown from a partitioned table across multiple dates. The query works fine when run with a date constrained to a specific day. E.g.
select {max neg x-maxs x} pnl from trades where date=last date
It's getting map-reduced over multiple dates so the above query no longer works. I can make the query run over multiple dates by adding another aggregation:
select max {max neg x-maxs x} pnl from trades
but this does not return the max drawdown over the continuous sequence of trades; it returns the maximum of the daily drawdowns.
I wonder if there's a way to make it work with a single select without chaining selects like
select {max neg x-maxs x} pnl from select pnl from trades
I've got a rather big query to pull a lot of various metrics on the trades where max drawdown is just one of them. Using chained select means that I need to break the big query into two queries, map-reduced and non-map-reduced, and then join them back which would make the query look ugly.
Thanks!
A select query runs on each date partition of the database, applies the function to each date's values, and finally aggregates the results depending on the call (a user-defined function behaves differently from built-in q functions).
So I don't think you can combine that into one query. But there are ways to make your query more general and reusable for different scenarios.
For example, convert your query to functional form and use variables in it for the column name and the user function. Put this in a function that accepts a column name and a user function. Now you can call this function with different (column; function) pairs. Something like:
runF:{[col;usrfunc] functional_query_uses_col_userfunc }
All of this depends on your use case. Also check memory usage, as you'll be pulling a lot of data into memory.