How to calculate median on timestamp windowed data in Spark? - scala

I am very new to Spark and have been stuck on this for a while.
I am using Databricks. I create a dataframe df from a database, then collect the data into 30-minute time windows. My first problem is that I cannot tell whether this groupBy operation is correct; using agg does not seem to work.
val window30 = df.groupBy(window($"X_DATE_TS", "30 minute"), $"X_ID").agg(sort_array($"X_VALUE"))
display(window30)
Secondly, if I want to calculate the median of the column X_VALUE, how do I do that over the multiple rows selected in each window?
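To make the goal concrete, here is a minimal plain-Python sketch (standard library only, no Spark) of what a median per 30-minute window and X_ID computes. The helper names window_start and windowed_median and the sample rows are hypothetical, and windows are assumed to be aligned to the Unix epoch. In Spark itself, one commonly used option is the built-in SQL aggregate percentile_approx(X_VALUE, 0.5) inside agg, which returns an approximate median.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median

WINDOW_SECONDS = 30 * 60  # 30-minute tumbling windows


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 30-minute window (epoch-aligned)."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def windowed_median(rows):
    """rows: iterable of (x_date_ts, x_id, x_value) tuples (hypothetical layout)."""
    groups = defaultdict(list)
    for ts, x_id, value in rows:
        # Same grouping key as groupBy(window(X_DATE_TS, "30 minute"), X_ID).
        groups[(window_start(ts), x_id)].append(value)
    return {key: median(values) for key, values in groups.items()}


# Made-up sample data, for illustration only.
rows = [
    (datetime(2020, 1, 1, 10, 5, tzinfo=timezone.utc), "a", 3.0),
    (datetime(2020, 1, 1, 10, 20, tzinfo=timezone.utc), "a", 1.0),
    (datetime(2020, 1, 1, 10, 29, tzinfo=timezone.utc), "a", 2.0),
    (datetime(2020, 1, 1, 10, 31, tzinfo=timezone.utc), "a", 10.0),
    (datetime(2020, 1, 1, 10, 45, tzinfo=timezone.utc), "a", 20.0),
    (datetime(2020, 1, 1, 10, 10, tzinfo=timezone.utc), "b", 7.0),
]
result = windowed_median(rows)
```

Each key of result is a (window start, X_ID) pair, so the rows at 10:05, 10:20 and 10:29 for id "a" share one median while the rows at 10:31 and 10:45 fall into the next window.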

Related

Dataframe display function in pyspark on databricks platform

I am new to Databricks and was studying dataframes in PySpark.
df = spark.read.parquet(salesPath)
display(df)
Above is my code. What do the up arrows in the output actually do?
And why is this nice display function not included in the Apache PySpark documentation?
The arrows sort the displayed portion of the dataframe. Note that the display function shows at most 1000 records and does not load the whole dataset.
The display function is not in the PySpark documentation because it is specific to Databricks. Similar functions also exist in Jupyter that you can use with PySpark, but they are not part of PySpark. (You can use df.show() to display the data as a text table; it is part of PySpark's DataFrame API.)

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a cassandra table using scala and apache-spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data according to src_ip,src_port,dst_ip,dst_port and I also want to have the latest value from a received_time column of the original dataframe.
I want a dataframe with distinct src_ip values with their count and latest received_time in a new column as last_seen.
I know how to use .withColumn and also, I think that .map() can be used here.
Since I'm relatively new in this field, I really don't know how to proceed further. I could really use your help to get done with this task.
Assuming you have a dataframe df with src_ip,src_port,dst_ip,dst_port and received_time, you can try:
val mydf = df.groupBy(col("src_ip"),col("src_port"),col("dst_ip"),col("dst_port")).agg(count("received_time").as("row_count"),max(col("received_time")).as("max_received_time"))
The line above computes, for each combination of the group-by columns, the number of received_time values (row_count) and the latest received_time (max_received_time).

How to fit a kernel density estimate on a pyspark dataframe column and use it for creating a new column with the estimates

My use case is the following. Consider a pyspark dataframe with the following format:
df.columns:
1. hh: Contains the hour of the day (type int)
2. userId : some unique identifier.
I want to find the list of userIds that have anomalous hits on the page. So I first do a groupby like so:
df = df.groupby("hh", "userId").count().withColumnRenamed("count", "LoginCounts")
Now the format of the dataframe would be:
1. hh
2. userId
3. LoginCounts: Number of times a specific user logs in at a particular hour.
I want to use the pyspark kde function as follows:
from pyspark.mllib.stat import KernelDensity
kd=KernelDensity()
kd.setSample(df.select("LoginCounts").rdd)
kd.estimate([13.0,14.0])
I get the error:
Py4JJavaError: An error occurred while calling o647.estimateKernelDensity.
: org.apache.spark.SparkException: Job aborted due to stage failure
My end goal is to fit a kde on, say, one day's hourly data and then use the next day's data to get the probability estimates for each login count.
For example, I would like to achieve something of this nature:
df.withColumn("kdeProbs",kde.estimate(col("LoginCounts")))
So the column kdeProbs will contain P(LoginCount=x | estimated kde).
I have tried searching for an example of the same but am always redirected to the standard kde example on the spark.apache.org page, which does not solve my case.
It's not enough to just select one column and convert it to an RDD: the result is an RDD of Row objects, while the sample must be an RDD of the numeric values themselves. You need to extract the actual data from each row for it to work. Try this:
from pyspark.mllib.stat import KernelDensity
dat_rdd = df.select("LoginCounts").rdd
# actually select data from RDD
dat_rdd_data = dat_rdd.map(lambda x: x[0])
kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0,14.0])
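For intuition about the numbers estimate returns, here is a minimal plain-Python sketch of a Gaussian kernel density estimate. It assumes a Gaussian kernel whose bandwidth is the standard deviation, with a default of 1.0, which is my understanding of what MLlib's KernelDensity does; the function name gaussian_kde_estimate and the sample values are hypothetical. It also shows the unwrapping step: each one-element row is turned into a float before it is used as a sample.

```python
import math


def gaussian_kde_estimate(samples, points, bandwidth=1.0):
    """Average of Gaussian kernels centred on each sample, evaluated at each point."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    n = len(samples)
    return [
        sum(norm * math.exp(-0.5 * ((p - s) / bandwidth) ** 2) for s in samples) / n
        for p in points
    ]


# Stand-ins for the Row objects of df.select("LoginCounts"); made-up values.
rows = [(12,), (13,), (13,), (15,)]

# Same idea as dat_rdd.map(lambda x: x[0]): keep the value, drop the row wrapper.
login_counts = [float(r[0]) for r in rows]

densities = gaussian_kde_estimate(login_counts, [13.0, 14.0])
```

The values in densities are probability densities rather than probabilities, so they are best read as relative likelihoods of each login count under the fitted sample.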

Spark ML Transformer - aggregate over a window using rangeBetween

I would like to create custom Spark ML Transformer that applies an aggregation function within rolling window with the construct over window. I would like to be able to use this transformer in Spark ML Pipeline.
I would like to achieve something that could be done quite easily with withColumn as given in this answer
Spark Window Functions - rangeBetween dates
for example:
val w = Window.orderBy(col("unixTimeMS")).rangeBetween(0, 700)
val df_new = df.withColumn("cts", sum("someColumnName").over(w))
Where
df is my dataframe
unixTimeMS is unix time in milliseconds
someColumnName is some column that I want to perform aggregation.
In this example I do a sum over the rows within the window.
the window w includes current transaction and all transactions within 700 ms from the current transaction.
Is it possible to put such window aggregation into Spark ML Transformer?
I was able to achieve something similar with Spark ML SQLTransformer:
val query = """SELECT *,
sum(someColumnName) over (order by unixTimeMS) as cts
FROM __THIS__"""
new SQLTransformer().setStatement(query)
But I can't figure out how to use rangeBetween in SQL to select a period of time rather than just a number of rows. I need a specific period of time relative to the unixTimeMS of the current row.
I understand that UnaryTransformer is not the way to do it because I need to compute an aggregate. Do I need to define a UDAF (user defined aggregate function) and use it in SQLTransformer?
I wasn't able to find any example of UDAF containing window function.
I am answering my own question for future reference. I ended up using SQLTransformer with a window function that uses range between, just like in the example:
val query = """SELECT *,
sum(dollars) over (
partition by Numerocarte
order by UnixTime
range between 1000 preceding and 200 following) as cts
FROM __THIS__"""
Where 1000 and 200 in range between relate to units of the order by column.
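To illustrate what such a range frame selects, here is a small plain-Python sketch (no Spark) that sums, for each row, the amounts of all rows in the same partition whose ordering value lies between current minus 1000 and current plus 200. The function name range_window_sum and the sample rows are hypothetical, and the quadratic loop is only meant to show the semantics, not to be efficient.

```python
def range_window_sum(rows, preceding, following):
    """rows: list of (partition_key, order_value, amount) tuples (hypothetical layout)."""
    out = []
    for key, t, amount in rows:
        # The frame is defined on the VALUE of the ordering column, not on row counts.
        total = sum(
            a
            for k, u, a in rows
            if k == key and t - preceding <= u <= t + following
        )
        out.append((key, t, amount, total))
    return out


# Made-up data: (card number, unix time in ms, dollars).
rows = [
    ("c1", 0, 10.0),
    ("c1", 150, 5.0),
    ("c1", 900, 1.0),
    ("c1", 2500, 2.0),
    ("c2", 100, 7.0),
]
with_cts = range_window_sum(rows, preceding=1000, following=200)
totals = [r[3] for r in with_cts]
```

The row at time 900 picks up the rows at 0 and 150 because they fall within 1000 units before it, while the row at 2500 only sees itself.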

dataframe.selectExpr performance for selecting large number of columns

I am using Spark dataframes in Scala. My dataframe holds about 400 columns, with 1000 to 1M rows. I am running a
dataframe.selectExpr operation (1st to 400th column) based on certain criteria, and once the columns are fetched, I aggregate the values of all of them.
My selectExpr statement:
val df = df2.selectExpr(getColumn(beginDate, endDate, x._2): _*)
The getColumn method fetches the day-wise columns between the start and end dates from my dataframe (this may be 365 columns, as we have day-wise data).
My summing expression is:
df.map(row => (row(0), row(1), row(2), (3 until row.length).map(row.getLong(_)).sum)).collect()
I find that selecting this many columns degrades the performance of my job. Is there any way to make fetching these 400 columns much faster?
DataFrames are better optimized than RDDs, so using them is a good start. Please check the Spark UI to see which stage takes the most time, and whether that time goes to computation or to loading the data. Reshaping the data may help. Also check whether partitioning helps your code run faster, and scale up gradually. How fast the code runs depends on several factors, so try to optimize the ones the Spark UI points to.