Spark (Scala) dataframe filtering (FIR)

Let's say I have a dataframe (stored in a Scala val as df) which contains the data from a CSV:
time,temperature
0,65
1,67
2,62
3,59
I have no problem reading this from a file as a Spark dataframe in Scala.
I would like to add a filtered column (by "filter" I mean signal-processing moving-average filtering), say (T[n] + T[n-1]) / 2.0:
time,temperature,temperatureAvg
0,65,(65+0)/2.0
1,67,(67+65)/2.0
2,62,(62+67)/2.0
3,59,(59+62)/2.0
(Actually, for the first row I want the value 32.5 rather than the literal expression (65+0)/2.0; I wrote the expressions to clarify the expected two-time-step filtering output.)
So how can I achieve this? I am not familiar with Spark dataframe operations that combine rows iteratively along a column...

Spark 3.1+
Replace
$"time".cast("timestamp")
with
import org.apache.spark.sql.functions.timestamp_seconds
timestamp_seconds($"time")
Spark 2.0+
In Spark 2.0 and later it is possible to use the window function as an input for groupBy. It allows you to specify windowDuration, slideDuration and startTime (an offset). It works only with a TimestampType column, but it is not that hard to find a workaround for that. In your case it will require some additional steps to correct for boundaries, but the general solution can be expressed as shown below:
import org.apache.spark.sql.functions.{window, avg}
df
.withColumn("ts", $"time".cast("timestamp"))
.groupBy(window($"ts", windowDuration="2 seconds", slideDuration="1 second"))
.avg("temperature")
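To make the "correct for boundaries" remark a bit more concrete, here is a rough sketch (an illustration only, building on the snippet above): it maps each window back to the time it ends at and rejoins with the input. Note that the first row still averages over a single sample (65.0 rather than the desired 32.5) and would need extra handling:
import org.apache.spark.sql.functions.{avg, window}

val averaged = df
  .withColumn("ts", $"time".cast("timestamp"))
  .groupBy(window($"ts", "2 seconds", "1 second"))
  .agg(avg("temperature").alias("temperatureAvg"))
  // the window [n-1, n+1) covers T[n-1] and T[n]; its end is n+1 seconds,
  // so end - 1 recovers the original time n
  .select(($"window.end".cast("long") - 1).alias("time"), $"temperatureAvg")

df.join(averaged, Seq("time")).orderBy("time").show()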
Spark < 2.0
If there is a natural way to partition your data you can use window functions as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy($"id").orderBy($"time").rowsBetween(-1, 0)
val df = sc.parallelize(Seq(
  (1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
)).toDF("id", "time", "temperature")
df.select($"*", mean($"temperature").over(w).alias("temperatureAvg")).show
// +---+----+-----------+--------------+
// | id|time|temperature|temperatureAvg|
// +---+----+-----------+--------------+
// | 1| 0| 65| 65.0|
// | 1| 1| 67| 66.0|
// | 1| 2| 62| 64.5|
// | 1| 3| 59| 60.5|
// +---+----+-----------+--------------+
You can create windows with arbitrary weights using the lead / lag functions (note that lag returns null where no previous row exists, e.g. for the first rows):
import org.apache.spark.sql.functions.{lag, lit}

lit(0.6) * $"temperature" +
lit(0.3) * lag($"temperature", 1).over(w) +
lit(0.2) * lag($"temperature", 2).over(w)
It is still possible without a partitionBy clause, but it will be extremely inefficient because all data ends up in a single partition. If this is the case you won't be able to use DataFrames effectively. Instead you can use sliding over an RDD (see for example Operate on neighbor elements in RDD in Spark); a short sketch follows below. There is also the spark-timeseries package, which you may find useful.
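For completeness, here is a minimal sketch of that RDD-based sliding approach (it assumes the simple (time, temperature) pairs from the question and uses the sliding helper from MLlib; treat it as an illustration rather than a production solution):
import org.apache.spark.mllib.rdd.RDDFunctions._

val temps = sc.parallelize(Seq((0, 65.0), (1, 67.0), (2, 62.0), (3, 59.0)))

val avgd = temps
  .sortBy(_._1)  // sliding assumes the RDD is already ordered by time
  .sliding(2)    // yields Array((t[n-1], T[n-1]), (t[n], T[n])) pairs of neighbours
  .map { case Array((_, prev), (t, cur)) => (t, cur, (cur + prev) / 2.0) }

avgd.collect.foreach(println)
// (1,67.0,66.0)
// (2,62.0,64.5)
// (3,59.0,60.5)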

Related

Using pandas udf without looping in pyspark

So suppose I have a big Spark dataframe. I don't know how many columns it has.
(The solution has to be in PySpark using a pandas UDF, not a different approach.)
I want to perform an action on all columns, so it's OK to loop over the columns inside.
But I don't want to loop through rows. I want it to act on each column at once.
I didn't find on the internet how this could be done.
Suppose I have this dataframe:
A B C
5 3 2
1 7 0
Now I want to send it to a pandas UDF to get the sum of each row:
Sum
10
8
The number of columns is not known.
I can do it inside the UDF by looping one row at a time, but I don't want that. I want it to act on all rows without looping, and I allow looping through columns if needed.
One option I tried is combining all columns into an array column:
ARR
[5,3,2]
[1,7,0]
But even here it doesn't work for me without looping.
I send this column to the UDF, and then inside I need to loop through its rows and sum each value of the list-row.
It would be nice if I could separate out each column and act on the whole column at once.
How do I act on a column at once, without looping through the rows?
If I loop through the rows, I guess it's no better than a regular Python UDF.
I wouldn't use pandas UDFs; only resort to UDFs when it can't be done in plain PySpark. Anyway, code for both is below:
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import array, expr, lit

df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True, format='csv')
sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))
# sf.show()

columns = ["id", "ArrDelay", "DepDelay"]
data = [("a", 81.0, 3),
        ("b", 36.2, 5),
        ("c", 12.0, 5),
        ("d", 81.0, 5),
        ("e", 36.3, 5),
        ("f", 12.0, 5),
        ("g", 111.7, 5)]
sf = spark.createDataFrame(data=data, schema=columns)
sf.show()

# Use the aggregate higher-order function
new = (sf.withColumn('sums', array(*[x for x in ['ArrDelay', 'DepDelay']]))  # create an array of values per row from the desired columns
         .withColumn('sums', expr("aggregate(sums, cast(0 as double), (c, i) -> c + i)"))  # use aggregate to sum the array
      ).show()

# Use a pandas UDF (mapInPandas)
sch = sf.withColumn('v', lit(90.087654623)).schema

def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf.assign(v=pdf.sum(1))  # sum across each row's columns at once

sf.mapInPandas(sum_s, schema=sch).show()
Here's a simple way to do it:
from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        (5, 3, 2),
        (1, 7, 0),
    ],
    ["A", "B", "C"],
)

cols = df.columns
# Build a single Column expression A + B + C, whatever the column names are
calculate_sum = reduce(lambda a, x: a + x, map(F.col, cols))

df = df.withColumn("sum", calculate_sum)
df.show()
output:
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 5| 3| 2| 10|
| 1| 7| 0| 8|
+---+---+---+---+

Check the minimum by iterating one row in a dataframe over all the rows in another dataframe

Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
For every Place in DF1, I want to determine which City in DF2 it is closest to. The distance can be calculated based on the index, so for every row in DF1 I have to iterate over every row in DF2 and look for the shortest distance based on a calculation with the indexes. For the distance calculation there is a function defined:
val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
The code compiles, but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I'm doing something wrong when iterating over each row in DF2 while taking one row from DF1, but I couldn't find a solution.
What am I doing wrong? And am I heading in the right direction?
You are getting this error because the index column you are using only exists in DF2, not in DF1, where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to:
1. Cross join DF1 and DF2 to have every index of DF1 matching every index of DF2
2. Determine the distance using your udf
3. Find the min on this new cross-joined dataframe with the distances
This may look like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}

val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })

val resultDF = DF1.crossJoin(DF2)
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  // Instead of using a groupBy and then matching the min distance of the aggregation with the
  // initial df, I've chosen to use a window function min to determine the min_distance of each
  // group (determined by Place) and filter for the city with the min distance to each place.
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")
This will result in a dataframe with the columns from both dataframes and an additional column distance.
NB. Your current approach, which compares every item in one df to every item in the other df, is an expensive operation. If you have the opportunity to filter early (e.g. by joining on heuristic columns, i.e. other columns which may indicate that a place is likely to be closer to a city), that is recommended; a rough sketch follows below.
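For illustration only, here is a minimal sketch of such an early filter, assuming a hypothetical coarse region column existed in both dataframes (it does not in the question); the equi-join prunes candidate pairs before the distance computation:
// Sketch only: "region" is a hypothetical pre-filter column, not part of the original data
val prunedDF = DF1.join(DF2, Seq("region"))  // only compare places and cities in the same region
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")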
Let me know if this works for you.
If you have only a few cities (less than or around 1000), you can avoid the crossJoin and the Window shuffle by collecting the cities into an array and then performing the distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}

val citiesIndexes = df2.select("City", "IndexB")
  .collect()
  .map(row => (row.getString(0), row.getLong(1)))

val result = df1.withColumn(
  "City",
  array_min(
    transform(
      typedLit(citiesIndexes),
      x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
    )
  ).getItem("col2")
)
This piece of code works for Spark 3 and later. If you are on a Spark version lower than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function, for example as sketched below.
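A minimal sketch of that pre-3.0 fallback could look like the following; it reuses the collected citiesIndexes array and the h3 distance call from above, and is meant as an illustration rather than a drop-in replacement:
import org.apache.spark.sql.functions.{col, udf}

// Sketch only: scan the collected (city, index) pairs inside the UDF
// and return the name of the closest city for each place.
val closestCity = udf { indexA: Long =>
  citiesIndexes.minBy { case (_, indexB) => h3.instance.h3Distance(indexA, indexB) }._1
}

val resultPre3 = df1.withColumn("City", closestCity(col("IndexA")))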

Spark: Is "count" on Grouped Data a Transformation or an Action?

I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following
scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]
scala> empDF.show
+---+----------------+---+----------+
| id| name|age|department|
+---+----------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3| Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+
scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]
scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
| Homicide| 2|
| Records| 1|
| Forensics| 2|
+----------+-----+
When I called count on the GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count; instead, they started when I ran res1.show.
I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?
The .count() that you have used in your code is on a RelationalGroupedDataset; it creates a new column with the count of elements in each group. This is a transformation. Refer to:
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset
The .count() that you normally use on an RDD/DataFrame/Dataset is completely different from the above; that .count() is an action. Refer to: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD
EDIT:
always use count() inside .agg() when operating on a grouped dataset, to avoid confusion in the future:
empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
Case 1:
You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, it's an action on an RDD.
for ex: rdd.count // it returns a Long value
Case 2:
If you call count on a Dataframe, it initiates the DAG execution and returns the data to the driver, so it's an action on a Dataframe.
for ex: df.count // it returns a Long value
Case 3:
In your case you are calling groupBy on a dataframe, which returns a RelationalGroupedDataset object, and you are calling count on that grouped dataset, which returns a Dataframe. So it's a transformation, since it doesn't bring the data to the driver and doesn't initiate DAG execution.
for ex:
df.groupBy("department") // returns RelationalGroupedDataset
  .count                 // returns a Dataframe, so a transformation
  .count                 // returns a Long value since it is called on a DF, so an action
As you've already figured out: if a method returns a distributed object (Dataset or RDD), it can be qualified as a transformation.
However, these distinctions are much better suited to RDDs than Datasets. The latter feature an optimizer, including the recently added cost-based optimizer, and may be much less lazy than the old API, blurring the difference between transformations and actions in some cases.
Here, however, it is safe to say count is a transformation.
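If you want to see this for yourself, a quick check in the shell (a sketch along the lines of the session above) is to ask for the query plan before triggering anything:
// Sketch: the grouped count only builds a plan; nothing runs until an action is called
val grouped = empDF.groupBy("department").count()  // transformation: returns a DataFrame
grouped.explain()                                  // prints the plan, still no Spark job
grouped.show()                                     // action: now a job actually runs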

How to flatten a nested field in a Spark Dataset?

I have a nested field like the one below. I want to call flatMap (I think) to produce a flattened row.
My dataset has
A,B,[[x,y,z]],C
I want to convert it to produce output like
A,B,X,Y,Z,C
This is for Spark 2.0+
Thanks!
Apache DataFu has a generic explodeArray method that will do exactly what you need.
import datafu.spark.DataFrameOps._
import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C"))).toDF
df.explodeArray(col("_3"), "token").show
This will produce:
+---+---+---------+---+------+------+------+
| _1| _2| _3| _4|token0|token1|token2|
+---+---+---------+---+------+------+------+
| A| B|[X, Y, Z]| C| X| Y| Z|
+---+---+---------+---+------+------+------+
One thing to consider is that this method evaluates the data frame in order to determine how many columns to create - if it's expensive to compute it should be cached.
Full disclosure - I am a member of Apache DataFu.
Try this for an RDD:
val rdd = sc.parallelize(Seq(("A","B",Array("X","Y","Z"),"C")))
rdd.flatMap(x => Option(x._3).map(y => (x._1, x._2, y(0), y(1), y(2), x._4))).collect.foreach(println)
Output:
(A,B,X,Y,Z,C)
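Since the question mentions Spark 2.0+, a DataFrame-only alternative is also possible. Here is a minimal sketch, reusing the df built in the DataFu example above and assuming the array column always holds exactly three elements:
import org.apache.spark.sql.functions.col

// Sketch: pull the three array elements out into their own columns by position,
// turning (A, B, [X, Y, Z], C) into (A, B, X, Y, Z, C)
val flat = df.select(
  col("_1"), col("_2"),
  col("_3").getItem(0), col("_3").getItem(1), col("_3").getItem(2),
  col("_4"))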

How can I build a CoordinateMatrix in Spark using a DataFrame?

I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below, as training data:
|--------------|--------------|--------------|
| userId | itemId | rating |
|--------------|--------------|--------------|
Now, I would like to create a sparse matrix, to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value in the matrix will be zero. Thus, in the end, most values will be zero.
But how can I achieve this, using a CoordinateMatrix? I'm saying CoordinateMatrix because I'm using Spark 2.1.1, with python, and in the documentation, I saw that a CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
In other words, how can I get from this DataFrame to a CoordinateMatrix, where the rows would be users, the columns would be items and the ratings would be the values in the matrix?
A CoordinateMatrix is just a wrapper for an RDD of MatrixEntry objects, and a MatrixEntry is just a wrapper over a (long, long, float) tuple. PySpark allows you to create a CoordinateMatrix from an RDD of such tuples. If the userId and itemId fields are both IntegerType and the rating is something like a FloatType, then creating the desired matrix is very straightforward:
from pyspark.mllib.linalg.distributed import CoordinateMatrix

cmat = CoordinateMatrix(df.rdd.map(tuple))
It is only slightly more complicated if you have StringTypes for the userId and itemId fields. You would need to index those strings first and then pass the indices to the CoordinateMatrix.
With Spark 2.4.0, here is a complete example that I hope meets your need.
Create the dataframe using a dictionary and pandas:
my_dict = {
    'userId': [1, 2, 3, 4, 5, 6],
    'itemId': [101, 102, 103, 104, 105, 106],
    'rating': [5.7, 8.8, 7.9, 9.1, 6.6, 8.3]
}
import pandas as pd
pd_df = pd.DataFrame(my_dict)
df = spark.createDataFrame(pd_df)
See the dataframe:
df.show()
+------+------+------+
|userId|itemId|rating|
+------+------+------+
| 1| 101| 5.7|
| 2| 102| 8.8|
| 3| 103| 7.9|
| 4| 104| 9.1|
| 5| 105| 6.6|
| 6| 106| 8.3|
+------+------+------+
Create CoordinateMatrix from dataframe:
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coorRDD = df.rdd.map(lambda x: MatrixEntry(x[0], x[1], x[2]))
coorMatrix = CoordinateMatrix(coorRDD)
Now check the data type of the result:
type(coorMatrix)
pyspark.mllib.linalg.distributed.CoordinateMatrix