DataFrame display function in PySpark on the Databricks platform

I am new to Databricks, and I was studying the DataFrame topic in PySpark:
df = spark.read.parquet(salesPath)
display(df)
Above is my code. I don't understand what the up arrows actually do.
And why isn't this handy display function included in the Apache PySpark documentation?

The arrows are used to sort the displayed portion of the DataFrame. But please note that the display function shows at most 1,000 records and won't load the whole dataset.
The display function isn't included in the PySpark documentation because it's specific to Databricks. A similar function also exists in Jupyter that you can use with PySpark, but it's not part of PySpark itself. (You can use the df.show() function to display the data as a text table; it's part of PySpark's DataFrame API.)
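For comparison, a minimal sketch (reusing salesPath from the question; display() only works in Databricks notebooks, while show() works anywhere PySpark runs):

# display() renders an interactive, sortable table but is Databricks-only;
# show() prints a plain-text table and is part of the standard DataFrame API.
df = spark.read.parquet(salesPath)
display(df)                  # Databricks notebooks only
df.show(5)                   # prints the first 5 rows as text
df.show(5, truncate=False)   # same, without truncating long column values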

Related

How to calculate median on timestamp windowed data in Spark?

I am very new to Spark, and I have been stuck on this for a while now.
I am using Databricks for this. I create a dataframe df from a database, then I collect the data using a 30-minute time window. The first challenge is that I am unable to figure out whether this groupBy operation is correct; using agg does not seem to be working.
val window30 = df.groupBy(window($"X_DATE_TS", "30 minute"), $"X_ID").agg(sort_array($"X_VALUE"))
display(window30)
Secondly, if I want to calculate the median of the column X_VALUE, how do I perform this over the rows selected in each group?
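A rough PySpark sketch of one way to approach this (the question uses Scala, but the same idea applies; an exact median is expensive in Spark, so this uses the approximate percentile_approx SQL function and assumes X_DATE_TS is a timestamp column):

from pyspark.sql import functions as F

# Approximate median of X_VALUE per 30-minute window and X_ID.
# percentile_approx(..., 0.5) is Spark SQL's approximate percentile;
# an exact median would require sorting every group.
window30 = (
    df.groupBy(F.window("X_DATE_TS", "30 minutes"), "X_ID")
      .agg(F.expr("percentile_approx(X_VALUE, 0.5)").alias("median_X_VALUE"))
)
display(window30)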

How to populate a Spark DataFrame column based on another column's value?

I have a use case where I need to select certain columns from a dataframe containing at least 30 columns and millions of rows.
I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip and dst_port, and I also want to have the latest value from the received_time column of the original dataframe.
I want a dataframe with distinct src_ip values, their count, and the latest received_time in a new column called last_seen.
I know how to use .withColumn, and I think .map() could also be used here.
Since I'm relatively new to this field, I really don't know how to proceed further. I could really use your help to get this task done.
Assuming you have a dataframe df with src_ip, src_port, dst_ip, dst_port and received_time, you can try:
val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The above calculates, for each combination of the groupBy columns, the count of received_time values as well as the maximum timestamp for that group.
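For reference, a rough PySpark equivalent of the same aggregation (the question and answer use Scala), naming the latest timestamp last_seen as requested:

from pyspark.sql import functions as F

# Count rows per (src_ip, src_port, dst_ip, dst_port) and keep the
# latest received_time as last_seen.
mydf = (
    df.groupBy("src_ip", "src_port", "dst_ip", "dst_port")
      .agg(
          F.count("received_time").alias("row_count"),
          F.max("received_time").alias("last_seen"),
      )
)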

PySpark Sliding Window Calculations

I have a PySpark dataframe on which I want to run a sliding window calculation. Here is sample code of the operation I want to run (shown for a pandas dataframe):
df["Total"].shift(1).rolling(7, min_periods=7).mean()
Can anyone perhaps tell me how I can replicate this operation in PySpark?
Check this sample:
How to use window functions in PySpark?
Details on window functions: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
and the docs: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
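For example, a rough sketch of the pandas expression above in PySpark (it assumes an ordering column named date; without a partitionBy, Spark pulls all rows into a single partition, so partition if you can):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Equivalent of df["Total"].shift(1).rolling(7, min_periods=7).mean():
# average the 7 rows ending at the previous row, but only when all 7 exist.
w = Window.orderBy("date").rowsBetween(-7, -1)
df_out = df.withColumn(
    "rolling_avg",
    F.when(F.count("Total").over(w) == 7, F.avg("Total").over(w))
)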

How to find widths of a Flat File using read_fwf() in Pandas?

I have downloaded some data from the mainframe (.DATA format) and I need to parse it to create a PySpark DataFrame and perform some operations on it. Before doing that, I created a sample file and read it with the read_fwf() function of pandas.
I was able to read the file and create the DataFrame, but I encountered some problems, such as:
Padding of "0" in the first column of some of the rows
Repeating headers while reading the data
These are issues I can handle; however, the key challenge I am facing is identifying the widths of the columns. I currently have 65 columns, but in order to create a PySpark DataFrame I would need to know the widths of these columns. Can read_fwf() tell me what widths it is using for each column?
And is there a read_fwf()-like function in PySpark, or would we have to write MapReduce code for it?
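There is no built-in read_fwf() equivalent in PySpark, but as a rough sketch you can read the file as plain text and slice each line with substring() (the path and widths below are purely illustrative):

from pyspark.sql import functions as F

# Read each line into a single "value" column, then cut fixed-width fields out of it.
# substring(col, pos, len) is 1-based; replace the positions and lengths with your layout.
raw = spark.read.text("/mnt/mainframe/sample.DATA")   # hypothetical path
parsed = raw.select(
    F.substring("value", 1, 10).alias("col1"),
    F.substring("value", 11, 5).alias("col2"),
    F.substring("value", 16, 8).alias("col3"),
)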

Spark ML Transformer - aggregate over a window using rangeBetween

I would like to create a custom Spark ML Transformer that applies an aggregation function within a rolling window using the over(window) construct. I would like to be able to use this transformer in a Spark ML Pipeline.
I would like to achieve something that can be done quite easily with withColumn, as given in this answer:
Spark Window Functions - rangeBetween dates
for example:
val w = Window.orderBy(col("unixTimeMS")).rangeBetween(0, 700)
val df_new = df.withColumn("cts", sum("someColumnName").over(w))
where
df is my dataframe,
unixTimeMS is Unix time in milliseconds,
someColumnName is the column that I want to aggregate.
In this example I compute a sum over the rows within the window.
The window w includes the current transaction and all transactions within 700 ms after the current transaction.
Is it possible to put such window aggregation into Spark ML Transformer?
I was able to achieve something similar with Spark ML's SQLTransformer, where the statement is:
val query = """SELECT *,
sum(someColumnName) over (order by unixTimeMS) as cts
FROM __THIS__"""
new SQLTransformer().setStatement(query)
But I can't figure out how to use rangeBetween in SQL to select a period of time rather than just a number of rows. I need a specific period of time with respect to the unixTimeMS of the current row.
I understand that a UnaryTransformer is not the way to do it because I need to compute an aggregate. Do I need to define a UDAF (user-defined aggregate function) and use it in the SQLTransformer?
I wasn't able to find any example of a UDAF containing a window function.
I am answering my own question for future reference. I ended up using SQLTransformer, just like the window function in the example above, but with range between:
val query = """SELECT *,
  sum(dollars) over (
    partition by Numerocarte
    order by UnixTime
    range between 1000 preceding and 200 following) as cts
FROM __THIS__"""
Here, the 1000 and 200 in range between are expressed in the units of the order by column.
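For completeness, a rough PySpark sketch of wiring the same statement into an ML Pipeline (the answer above is Scala; the column names are taken from it):

from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

# SQLTransformer runs the SQL against the input DataFrame (__THIS__) at transform time,
# so the range-based window aggregation becomes an ordinary pipeline stage.
query = """SELECT *,
  sum(dollars) OVER (
    PARTITION BY Numerocarte
    ORDER BY UnixTime
    RANGE BETWEEN 1000 PRECEDING AND 200 FOLLOWING) AS cts
FROM __THIS__"""

pipeline = Pipeline(stages=[SQLTransformer(statement=query)])
df_with_cts = pipeline.fit(df).transform(df)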