I created a DataFrame called df in PySpark with HiveContext (not SQLContext).
However, after calling df.cache() I can no longer call df.show(). For example:
>>> df.show(2)
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
| bits| dst_ip|dst_port|flow_direction|in_iface|ip_dscp|out_iface| pkts|protocol| src_ip|src_port| tag|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|16062594|42.120.84.166| 11291| 1| 3| 36| 2|17606406| pnni|42.120.84.115| 14166|10008|
|13914480|42.120.82.254| 13667| 0| 4| 32| 1|13953516| ax.25| 42.120.86.49| 19810|10002|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
only showing top 2 rows
>>>
>>> df.cache()
DataFrame[bits: bigint, dst_ip: string, dst_port: bigint, flow_direction: string, in_iface: bigint, ip_dscp: string, out_iface: bigint, pkts: bigint, protocol: string, src_ip: string, src_port: bigint, tag: string]
>>> df.show(2)
16/05/16 15:59:32 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<stdin>", line 1, in <lambda>
IndexError: list index out of range
After calling df.unpersist(), df.show() works again.
I do not understand this, because I thought df.cache() just caches the data for later use. Why does df.show() stop working after calling cache()?
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
Caching Data In Memory
Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
https://forums.databricks.com/questions/6834/cache-table-advanced-before-executing-the-spark-sq.html#answer-6900
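One possible reading of the traceback above, offered as an assumption because the code that built df is not shown: the IndexError is raised inside a user lambda, so df may come from an RDD whose lambda indexes into split fields, and some later row may have fewer fields than expected. Without a cache, show(2) only pulls the first rows; filling the cache would have to evaluate whole partitions. The plain-Python sketch below (no Spark, hypothetical data) imitates that difference with a lazy generator.

```python
import itertools

# Hypothetical raw input; the last line is malformed (only two fields).
lines = ["a,1,x", "b,2,y", "c,3"]

def parsed():
    # Lazy mapping, analogous to rdd.map(lambda l: l.split(",")[2]).
    return (line.split(",")[2] for line in lines)

# Like show(2) without a cache: only the first two rows are pulled.
first_two = list(itertools.islice(parsed(), 2))
print(first_two)

# Like filling a cache: every row is evaluated, so the bad row surfaces.
try:
    list(parsed())
    full_scan_failed = False
except IndexError as exc:
    full_scan_failed = True
    print("IndexError:", exc)
```

If this reading applies, the fix would be in the parsing lambda (for example, filtering out short rows), not in cache() itself.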
I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following
scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]
scala> empDF.show
+---+----------------+---+----------+
| id| name|age|department|
+---+----------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3| Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+
scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]
scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
| Homicide| 2|
| Records| 1|
| Forensics| 2|
+----------+-----+
When I called count on GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count, instead, they started when I ran res1.show.
I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?
The .count() you have used in your code is called on a RelationalGroupedDataset; it creates a new column holding the number of elements in each group. This is a transformation. Refer:
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset
The .count() that you normally call on an RDD/DataFrame/Dataset is a completely different method, and that .count() is an action. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD
EDIT:
To avoid confusion in the future, prefer using count inside .agg() when operating on a grouped dataset:
empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
Case 1:
You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the result to the driver, it is an action on the RDD.
For example: rdd.count // returns a Long value
Case 2:
If you call count on a DataFrame, it initiates the DAG execution and returns the result to the driver, so it is an action on the DataFrame.
For example: df.count // returns a Long value
Case 3:
In your case you are calling groupBy on a DataFrame, which returns a RelationalGroupedDataset object, and then calling count on that grouped dataset, which returns a DataFrame. It is a transformation, since it neither brings data to the driver nor initiates the DAG execution.
For example:
df.groupBy("department") // returns RelationalGroupedDataset
.count // returns a Dataframe so a transformation
.count // returns a Long value since called on DF so an action
As you've already figured out, if a method returns a distributed object (Dataset or RDD), it can be classified as a transformation.
However, these distinctions are much better suited to RDDs than to Datasets. The latter feature an optimizer, including a recently added cost-based optimizer, and can be much less lazy than the old API, blurring the difference between transformations and actions in some cases.
Here, however, it is safe to say count is a transformation.
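The distinction drawn in these answers can be illustrated with a toy model in plain Python. This is only a sketch: ToyFrame, group_count and the runs log are hypothetical names invented for illustration and say nothing about how Spark is implemented. A method that returns another deferred object runs nothing, while a method that returns a plain number has to run the computation.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame: wraps a deferred computation."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning a list of rows

    def group_count(self, key):
        # "Transformation": builds a new ToyFrame, runs nothing yet.
        def compute():
            counts = {}
            for row in self._compute():
                counts[row[key]] = counts.get(row[key], 0) + 1
            return [{key: k, "count": v} for k, v in counts.items()]
        return ToyFrame(compute)

    def count(self):
        # "Action": runs the deferred computation and returns a plain int.
        return len(self._compute())


runs = []  # records every time the source is actually scanned

def load():
    runs.append("scan")
    departments = ["Homicide", "Homicide", "Records", "Forensics", "Forensics"]
    return [{"department": d} for d in departments]

emp = ToyFrame(load)
grouped = emp.group_count("department")   # returns a ToyFrame
runs_after_transformation = len(runs)     # still 0: nothing was scanned
n_groups = grouped.count()                # triggers the scan
print(runs_after_transformation, n_groups, len(runs))
```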
I receive multiple incoming files and have to compare each one with the source file, then merge them: replace the old rows with the new rows and append any extra rows present in the source file. Afterwards I have to take the updated source file, compare it with the next incoming file, update it again, and so on.
So far I have created a DataFrame for each file and compared and merged them using a join. I want to save all the updates made to the source file and reuse the updated source file to compare against and update from further incoming files.
val merge = df.union(dfSource.join(df, Seq( "EmployeeID" ),joinType= "left_anti").orderBy("EmployeeID") )
merge.write.mode ("append").format("text").insertInto("dfSource")
merge.show()
I tried it this way, but it doesn't update my dfSource DataFrame. Could somebody help, please?
Thanks
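For reference, the merge semantics described in the question (incoming rows win, source rows without a match are kept, and the result becomes the source for the next file) can be sketched in plain Python without Spark. The upsert helper and the sample rows are hypothetical; only the EmployeeID key comes from the question.

```python
def upsert(source, incoming, key="EmployeeID"):
    """Return source updated by incoming: matching keys are replaced, new keys appended."""
    merged = {row[key]: row for row in source}
    merged.update({row[key]: row for row in incoming})
    return sorted(merged.values(), key=lambda row: row[key])

# Hypothetical data: one source snapshot and two incoming files, in delivery order.
source = [{"EmployeeID": 1, "name": "old-1"}, {"EmployeeID": 2, "name": "old-2"}]
incoming_files = [
    [{"EmployeeID": 2, "name": "new-2"}, {"EmployeeID": 3, "name": "new-3"}],
    [{"EmployeeID": 1, "name": "newer-1"}],
]

# The result of each merge is rebound to `source` before the next file is processed.
for incoming in incoming_files:
    source = upsert(source, incoming)

print(source)
```

The point of the loop is the rebinding: a merged result only takes effect for the next file if it replaces the previous source.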
This is not possible this way. You need to use tables and then save to a file as the final part of the process.
I suggest you align your approach as follows. It allows parallel loading, although I suspect that is not really of benefit.
1. Load all files in order of delivery, tagging each loaded record with a timestamp or some ordering sequence derived from the file's position in the sequence, along with the type of record. E.g. file X at, say, position 2 in the sequence gets its records loaded with seqnum = 2. You can use the DF approach on the file being processed and append to an Impala / Hive KUDU table if performing everything within the Spark domain.
2. For records in the same file, apply monotonically_increasing_id() to get an ordering within the file if the same key can occur more than once in a file. See DataFrame-ified zipWithIndex. Alternatively, use zipWithIndex by converting to an RDD and back to a DF.
3. Then issue a select statement that takes, per key, the record with the maximum timestamp / seq_num. E.g. if the current run has, say, 3 records for key=1, only one needs to be processed, presumably the one with the highest value.
4. Save as a new file.
5. Process this new file accordingly.
OR:
Bypass step 3, read in ascending order and process the data accordingly.
A comment:
Typically I load such data into Hive / Impala with LOAD, setting the partitioning key by extracting the timestamp from the file name. This requires some Linux scripting / processing. That's a question of style and should not be a real Big Data bottleneck.
Here is a snippet with simulated input showing how some aspects can be done to allow a MAX select against a key for upserts. The operation type (DEL, ALT, whatever you need) still has to be added, although from what I have seen I think you can do this yourself:
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
def dfSchema(columnNames: List[String]): StructType =
StructType(
Seq(
StructField(name = "key", dataType = StringType, nullable = false),
StructField(name = "file", dataType = StringType, nullable = false),
StructField(name = "ts", dataType = StringType, nullable = false),
StructField(name = "val", dataType = StringType, nullable = false),
StructField(name = "seq_val", dataType = LongType, nullable = false)
)
)
val newSchema = dfSchema(List("key", "file", "ts", "val", "seq_val"))
val df1 = Seq(
("A","F1", "ts1","1"),
("B","F1", "ts1","10"),
("A","F1", "ts2","2"),
("C","F2", "ts3","8"),
("A","F2", "ts3","3"),
("A","F0", "ts0","0")
).toDF("key", "file", "ts","val")
val rddWithId = df1.sort($"key", $"ts".asc).rdd.zipWithIndex
val dfZippedWithId = spark.createDataFrame(rddWithId.map{ case (row, index) => Row.fromSeq(row.toSeq ++ Array(index))}, newSchema)
dfZippedWithId.show
returns:
+---+----+---+---+-------+
|key|file| ts|val|seq_val|
+---+----+---+---+-------+
| A| F0|ts0| 0| 0|
| A| F1|ts1| 1| 1|
| A| F1|ts2| 2| 2|
| A| F2|ts3| 3| 3|
| B| F1|ts1| 10| 4|
| C| F2|ts3| 8| 5|
+---+----+---+---+-------+
ready for subsequent processing.
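The subsequent "MAX select against a key" could look roughly like the following plain-Python sketch, which uses the rows from the output above. It only illustrates the selection logic (keep the row with the highest seq_val per key); latest_per_key is a hypothetical helper, and in Spark this would be expressed as a query or window function rather than a loop.

```python
# Rows as (key, file, ts, val, seq_val), copied from the output above.
rows = [
    ("A", "F0", "ts0", "0", 0),
    ("A", "F1", "ts1", "1", 1),
    ("A", "F1", "ts2", "2", 2),
    ("A", "F2", "ts3", "3", 3),
    ("B", "F1", "ts1", "10", 4),
    ("C", "F2", "ts3", "8", 5),
]

def latest_per_key(rows):
    """Keep, for every key, the row with the highest seq_val."""
    latest = {}
    for row in rows:
        key, seq_val = row[0], row[4]
        if key not in latest or seq_val > latest[key][4]:
            latest[key] = row
    return [latest[key] for key in sorted(latest)]

winners = latest_per_key(rows)
for row in winners:
    print(row)
```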
Is it possible to use QuantileDiscretizer, keeping NaN values, with a RandomForestClassifier?
I have been getting an error like this:
18/03/23 17:38:15 ERROR Executor: Exception in task 3.0 in stage 133.0 (TID 381)
java.lang.IllegalArgumentException: DecisionTree given invalid data: Feature 1 is categorical with values in {0,...,1, but a data point gives it value 2.0.
Bad data point: (1.0,[1.0,2.0])
Example
The idea here is to create a numeric column and discretize it using quantiles, keeping invalid numbers (NaN) in a special bucket.
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler,
QuantileDiscretizer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassifier}
val tseq = Seq((0, "a", 1.0), (1, "b", 0.0), (2, "c", 2.0),
(3, "a", 1.0), (4, "a", 3.0), (5, "c", Double.NaN))
val tdf = SparkInit.ss.createDataFrame(tseq).toDF("id", "category", "class")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
val discr = new QuantileDiscretizer()
.setInputCol("class")
.setOutputCol("quant")
.setNumBuckets(2)
.setHandleInvalid("keep")
val assembler = new VectorAssembler()
.setInputCols(Array("categoryIndex", "quant"))
.setOutputCol("features")
val rf = new RandomForestClassifier()
.setLabelCol("categoryIndex")
.setFeaturesCol("features")
.setNumTrees(3)
new Pipeline()
.setStages(Array(indexer, discr, assembler, rf))
.fit(tdf)
.transform(tdf)
.show()
Without trying to fit the Random Forest, I was getting a DataFrame like this:
+---+--------+-----+-------------+-----+---------+
| id|category|class|categoryIndex|quant| features|
+---+--------+-----+-------------+-----+---------+
| 0| a| 1.0| 0.0| 1.0|[0.0,1.0]|
| 1| b| 0.0| 2.0| 0.0|[2.0,0.0]|
| 2| c| 2.0| 1.0| 1.0|[1.0,1.0]|
| 3| a| 1.0| 0.0| 1.0|[0.0,1.0]|
| 4| a| 3.0| 0.0| 1.0|[0.0,1.0]|
| 5| c| NaN| 1.0| 2.0|[1.0,2.0]|
+---+--------+-----+-------------+-----+---------+
If I try to fit the model, I get the error:
18/03/23 17:54:12 WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 6 (= number of training instances)
18/03/23 17:54:12 WARN BlockManager: Putting block rdd_490_3 failed due to an exception
18/03/23 17:54:12 WARN BlockManager: Block rdd_490_3 could not be removed as it was not found on disk or in memory
18/03/23 17:54:12 ERROR Executor: Exception in task 3.0 in stage 143.0 (TID 414)
java.lang.IllegalArgumentException: DecisionTree given invalid data: Feature 1 is categorical with values in {0,...,1, but a data point gives it value 2.0.
Bad data point: (1.0,[1.0,2.0])
at org.apache.spark.ml.tree.impl.TreePoint$.findBin(TreePoint.scala:124)
at org.apache.spark.ml.tree.impl.TreePoint$.org$apache$spark$ml$tree$impl$TreePoint$$labeledPointToTreePoint(TreePoint.scala:93)
at org.apache.spark.ml.tree.impl.TreePoint$$anonfun$convertToTreeRDD$2.apply(TreePoint.scala:73)
at org.apache.spark.ml.tree.impl.TreePoint$$anonfun$convertToTreeRDD$2.apply(TreePoint.scala:72)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
Does QuantileDiscretizer insert some kind of metadata about the special extra bucket? It's odd, because I was previously able to build a model using columns with the same values, but without forcing any discretization.
Update
Yes, the column does have attached metadata, and it looks like this:
org.apache.spark.sql.types.Metadata = {"ml_attr":
{"ord":true,
"vals":["-Infinity, 5.0","5.0, 10.0","10.0, Infinity"],
"type":"nominal"}
}
The question now might be: how do I set the metadata correctly so that it includes values like Double.NaN?
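To make the mismatch concrete, here is a plain-Python sketch (no Spark) of what appears to happen in the example: NaN is placed in one extra bucket after the regular ones, while the declared number of categories still covers only the regular buckets. The split point 1.0 is an assumption inferred from the quant column shown earlier, and bucketize and check_categorical are hypothetical helpers, not Spark APIs.

```python
import bisect
import math

def bucketize(x, splits):
    """Return the bucket index of x; NaN goes to one extra bucket at the end."""
    n_buckets = len(splits) - 1
    if math.isnan(x):
        return n_buckets
    return min(bisect.bisect_right(splits, x) - 1, n_buckets - 1)

def check_categorical(value, arity):
    """Mimic the tree's validation: a categorical value must lie in 0..arity-1."""
    if not 0 <= value < arity:
        raise ValueError(f"value {value} outside 0..{arity - 1}")

splits = [-math.inf, 1.0, math.inf]          # assumed: two regular buckets
values = [1.0, 0.0, 2.0, 1.0, 3.0, math.nan]  # the "class" column of the example
quant = [bucketize(v, splits) for v in values]
print(quant)

declared_arity = len(splits) - 1  # the extra NaN bucket is not counted
try:
    for q in quant:
        check_categorical(q, declared_arity)
    rejected = None
except ValueError as exc:
    rejected = str(exc)
print(rejected)
```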
The workaround I used was simply to remove the associated metadata from the discretized columns, letting the decision tree implementation decide what to do with the data. I think the column then actually becomes a numerical column ([0, 1, 2, 2, 1], for example), but if too many categories are created, the column could be discretized again (look for the parameter maxBins).
In my case, the simplest way to remove the metadata was to fill the DataFrame after applying QuantileDiscretizer:
// Nothing is actually filled in my case, since there were no missing
// values before this operation.
df.na.fill(Double.NaN, Array("quant"))
I'm almost sure you could also remove the metadata manually by accessing the column object directly.
Update
We can change a column's metadata by creating an alias (reference):
val metadata: Metadata = ...
df.select($"colA".as("colB", metadata))
This answer describes a way to get the column's metadata by getting the respective StructField of a DataFrame's schema.
Sometimes (e.g. for testing and benchmarking) I want to force the execution of the transformations defined on a DataFrame. AFAIK calling an action like count does not ensure that all columns are actually computed, and show may only compute a subset of all rows (see examples below).
My solution is to write the DataFrame to HDFS using df.write.saveAsTable, but this "clutters" my system with tables I don't want to keep any further.
So what is the best way to trigger the evaluation of a DataFrame?
Edit:
Note that there is also a recent discussion on the spark developer list : http://apache-spark-developers-list.1001551.n3.nabble.com/Will-count-always-trigger-an-evaluation-of-each-row-td21018.html
I made a small example which shows that count on DataFrame does not evaluate everything (tested using Spark 1.6.3 and spark-master = local[2]):
val df = sc.parallelize(Seq(1)).toDF("id")
val myUDF = udf((i:Int) => {throw new RuntimeException;i})
df.withColumn("test",myUDF($"id")).count // runs fine
df.withColumn("test",myUDF($"id")).show() // gives Exception
Using the same logic, here is an example showing that show does not evaluate all rows:
val df = sc.parallelize(1 to 10).toDF("id")
val myUDF = udf((i:Int) => {if(i==10) throw new RuntimeException;i})
df.withColumn("test",myUDF($"id")).show(5) // runs fine
df.withColumn("test",myUDF($"id")).show(10) // gives Exception
Edit 2 : For Eliasah: The Exception says this:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 6, localhost): java.lang.RuntimeException
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:68)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:68)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:68)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
.
.
.
.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1506)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1376)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2100)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1457)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
.
.
.
.
It's a bit late, but here's the fundamental reason: count does not behave the same way on an RDD and on a DataFrame.
For DataFrames there is an optimization: in some cases Spark does not need to load the data to know how many elements there are (especially in a case like yours, where no data shuffling is involved). Hence, the DataFrame materialized when count is called will not load any column data and will never reach your exception-throwing UDF. You can easily verify this by defining your own DefaultSource and Relation: calling count on a DataFrame always ends up in the buildScan method with no requiredColumns, no matter how many columns you selected (cf. org.apache.spark.sql.sources.interfaces to understand more). It's actually a very efficient optimization ;-)
For RDDs, though, there is no such optimization (that's why one should always try to use DataFrames when possible). Hence count on an RDD executes the whole lineage and returns the sum of the sizes of the iterators of all partitions.
Calling dataframe.count falls under the first explanation, but calling dataframe.rdd.count falls under the second, since you built an RDD out of your DataFrame. Note that calling dataframe.cache().count forces the DataFrame to be materialized, because you asked Spark to cache the results (hence it needs to load all the data and transform it). But it does have the side effect of caching your data...
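The difference between the two kinds of count can be imitated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark's actual mechanism: optimized_count and materializing_count are hypothetical functions that stand for "count with no required columns" and "count after building every row" respectively.

```python
def boom(i):
    # Plays the role of the UDF that always throws.
    raise RuntimeError("UDF evaluated")

rows = [1]
columns = {"id": lambda i: i, "test": boom}

def optimized_count(rows, columns):
    # Needs no column values, so no column function is ever called.
    return sum(1 for _ in rows)

def materializing_count(rows, columns):
    # Builds every column of every row first, then counts the rows.
    built = [{name: fn(r) for name, fn in columns.items()} for r in rows]
    return len(built)

fast = optimized_count(rows, columns)
print(fast)

try:
    materializing_count(rows, columns)
    udf_ran = False
except RuntimeError:
    udf_ran = True
print(udf_ran)
```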
I guess simply getting the underlying RDD from the DataFrame and triggering an action on it should achieve what you're looking for.
df.withColumn("test",myUDF($"id")).rdd.count // this gives proper exceptions
It appears that df.cache.count is the way to go:
scala> val myUDF = udf((i:Int) => {if(i==1000) throw new RuntimeException;i})
myUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))
scala> val df = sc.parallelize(1 to 1000).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.withColumn("test",myUDF($"id")).show(10)
[rdd_51_0]
+---+----+
| id|test|
+---+----+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
| 6| 6|
| 7| 7|
| 8| 8|
| 9| 9|
| 10| 10|
+---+----+
only showing top 10 rows
scala> df.withColumn("test",myUDF($"id")).count
res13: Long = 1000
scala> df.withColumn("test",myUDF($"id")).cache.count
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (int) => int)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
.
.
.
Caused by: java.lang.RuntimeException
I prefer to use df.write.parquet(path). This does add disk I/O time, which you can estimate and subtract out later, but you can be sure that Spark performed each step you expected and did not trick you with lazy evaluation.
Let's say I have a DataFrame (stored in a Scala val named df) which contains the data from a CSV:
time,temperature
0,65
1,67
2,62
3,59
I have no problem reading this from a file as a Spark DataFrame in Scala.
I would like to add a filtered column (by filter I mean signal-processing moving-average filtering), say (T[n]+T[n-1])/2.0:
time,temperature,temperatureAvg
0,65,(65+0)/2.0
1,67,(67+65)/2.0
2,62,(62+67)/2.0
3,59,(59+62)/2.0
(Actually, for the first row I want 32.5 rather than the literal (65+0)/2.0; I wrote it out to clarify the expected two-time-step filtering output.)
So how can I achieve this? I am not familiar with Spark DataFrame operations that combine rows iteratively along a column...
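As a reference for the arithmetic the question asks for, here is the two-point filter (T[n]+T[n-1])/2.0 in plain Python, with T[-1] taken as 0 as the question describes. The function name two_point_average is hypothetical, and the sketch says nothing about how to express this in Spark.

```python
def two_point_average(values, initial=0.0):
    """Return (T[n] + T[n-1]) / 2.0 for every n, using `initial` as T[-1]."""
    out = []
    prev = initial
    for value in values:
        out.append((value + prev) / 2.0)
        prev = value
    return out

temperature = [65, 67, 62, 59]
temperature_avg = two_point_average(temperature)
print(temperature_avg)  # [32.5, 66.0, 64.5, 60.5]
```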
Spark 3.1+
Replace
$"time".cast("timestamp")
with
import org.apache.spark.sql.functions.timestamp_seconds
timestamp_seconds($"time")
Spark 2.0+
In Spark 2.0 and later it is possible to use the window function as an input for groupBy. It allows you to specify windowDuration, slideDuration and startTime (offset). It works only with a TimestampType column, but it is not that hard to find a workaround for that. In your case it will require some additional steps to correct for boundaries, but the general solution can be expressed as shown below:
import org.apache.spark.sql.functions.{window, avg}
df
.withColumn("ts", $"time".cast("timestamp"))
.groupBy(window($"ts", windowDuration="2 seconds", slideDuration="1 second"))
.avg("temperature")
Spark < 2.0
If there is a natural way to partition your data you can use window functions as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy($"id").orderBy($"time").rowsBetween(-1, 0)
val df = sc.parallelize(Seq(
(1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
)).toDF("id", "time", "temperature")
df.select($"*", mean($"temperature").over(w).alias("temperatureAvg")).show
// +---+----+-----------+--------------+
// | id|time|temperature|temperatureAvg|
// +---+----+-----------+--------------+
// | 1| 0| 65| 65.0|
// | 1| 1| 67| 66.0|
// | 1| 2| 62| 64.5|
// | 1| 3| 59| 60.5|
// +---+----+-----------+--------------+
You can create windows with arbitrary weights using lead / lag functions:
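// lag is a window function: apply it over an ordered window that has no row frame
val wLag = Window.partitionBy($"id").orderBy($"time")
lit(0.6) * $"temperature" +
lit(0.3) * lag($"temperature", 1).over(wLag) +
lit(0.2) * lag($"temperature", 2).over(wLag)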
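To show what such a weighted combination of lags computes, here is a plain-Python sketch using the same weights and the sample temperatures. The helper weighted_lags is hypothetical; rows that lack a full set of lagged values yield None, which is meant to mirror lag returning null for those rows.

```python
def weighted_lags(values, weights):
    """weights[k] multiplies the value k steps back; incomplete rows yield None."""
    out = []
    for n in range(len(values)):
        if n < len(weights) - 1:
            out.append(None)  # not enough history for every lag
        else:
            out.append(sum(w * values[n - k] for k, w in enumerate(weights)))
    return out

temperature = [65, 67, 62, 59]
filtered = weighted_lags(temperature, [0.6, 0.3, 0.2])
print(filtered)
```

Note that the weights 0.6, 0.3 and 0.2 sum to 1.1, so the result is a weighted sum rather than a normalized average; rescale the weights if an average is wanted.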
It is still possible without a partitionBy clause, but it will be extremely inefficient. If this is your case, you won't be able to use DataFrames. Instead, you can use sliding over an RDD (see for example Operate on neighbor elements in RDD in Spark). There is also the spark-timeseries package, which you may find useful.