How to get the previous few rows of the current row in a Structured Streaming DataFrame - spark-structured-streaming

I want to get the several rows before the current row; however, first, take and rowsBetween are not supported on a Structured Streaming DataFrame.

Using groupByKey and flatMapGroupsWithState can achieve the goal.
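As a rough illustration, here is a minimal sketch (the Event and Enriched case classes, the key column, the window of 3 rows and the streaming source are all hypothetical, and it assumes spark.implicits._ is in scope for the encoders): the group state keeps the last few rows per key, so each incoming row can be emitted together with the rows that preceded it.
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
case class Event(key: String, ts: java.sql.Timestamp, value: Double)
case class Enriched(key: String, ts: java.sql.Timestamp, value: Double, previous: Seq[Double])
val enriched = events                              // events: a hypothetical streaming Dataset[Event] from readStream
  .groupByKey(_.key)
  .flatMapGroupsWithState[List[Event], Enriched](
    OutputMode.Append, GroupStateTimeout.NoTimeout) { (key, rows, state) =>
      var recent = state.getOption.getOrElse(List.empty[Event])
      val out = rows.toSeq.sortBy(_.ts.getTime).map { e =>
        val row = Enriched(e.key, e.ts, e.value, recent.map(_.value))   // attach the values of the preceding rows
        recent = (e :: recent).take(3)                                  // keep only the last 3 rows per key
        row
      }
      state.update(recent)
      out.iterator
  }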

Related

Best practice in Spark to filter a dataframe, execute different actions on the resulting dataframes and then union the new dataframes back

Since I am new to Spark I would like to ask a question about a pattern that I am using in Spark but don't know if it's bad practice (splitting a dataframe in two based on a filter, executing different actions on them and then joining them back).
To give an example, having dataframe df:
val dfFalse = df.filter(col === false).distinct()
val dfTrue = df.filter(col === true).join(otherDf, Seq(id), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has millions of rows, I am curious whether filtering twice like this is bad practice and I should use some other pattern in Spark which I may not be aware of. In other cases I even need to do 3 or 4 filters, apply different actions to the individual dataframes and then union them all back.
Kind regards,
There are several key points to take into account when you start to use Spark to process large amounts of data and want to analyze its performance:
Spark parallelism depends on the number of partitions in your distributed in-memory representations (RDDs or DataFrames). That means the processing (Spark actions) will be executed in parallel across the cluster. But note that there are two main kinds of transformations: narrow transformations and wide transformations. The former are operations that execute without a shuffle, so the data doesn't need to be reallocated to different partitions, which avoids data transfer among workers. Consider that if you want to perform a distinct by a specific key, Spark must reorganize the data in order to detect the duplicates. Take a look at the doc.
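For example (a toy dataframe, assuming a SparkSession named spark is in scope), a filter is a narrow transformation while a distinct is a wide one:
import org.apache.spark.sql.functions._
val df = spark.range(1000).withColumn("bucket", pmod(col("id"), lit(5)))
val narrow = df.filter(col("bucket") === 0)    // narrow: each partition is processed independently, no shuffle
val wide = df.select("bucket").distinct()      // wide: Spark must shuffle to detect duplicates across partitions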
Regarding doing more or less filter transformations:
Spark is based on a lazy evaluation model: none of the transformations you execute on a dataframe run until you call an action, for example a write operation. The Spark optimizer evaluates your transformations in order to create an optimized execution plan. So if you have five or six filter operations, it will never traverse the dataframe six times (in contrast to some other dataframe frameworks); the optimizer will take your filtering operations and combine them into one. Here are some details.
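For example, chaining several filters on the asker's df (the column names here are made up) still results in a single pass, which you can confirm with explain():
import org.apache.spark.sql.functions._
val filtered = df
  .filter(col("amount") > 0)
  .filter(col("country") === "US")
  .filter(col("active") === true)
filtered.explain()   // the physical plan shows one combined Filter over a single scan, not three passes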
So keep in mind that Spark is a distributed in-memory data processor, and it is a must to know these details, because you may be spawning hundreds of cores over hundreds of GBs.
The efficiency of this approach depends heavily on your ability to reduce the number of overlapping data files that are scanned by both splits.
I will focus on two techniques that allow data-skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe in two by filtering on a partitioned column, each dataframe will scan only the corresponding portion of the data. In that case your approach will perform really well, as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format supporting filter pushdown (Parquet, for example) allows reading only the files that contain records with values matching the condition. If the values of the filtered column are spread across many files, the filter pushdown will be inefficient, since data is skipped on a per-file basis, and if a certain file contains values for both splits it will be scanned twice. Writing the data sorted by the filtered column can improve the efficiency of the filter pushdown (on read) by gathering the same values into fewer files.
As long as you manage to split your dataframe using the above techniques and minimize the overlap between the splits, this approach will be more efficient.
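As a hedged sketch (the paths and the column name flag are hypothetical), the two layouts could look like this:
// 1) Partition by the filter column: each split reads only its own directories.
df.write.partitionBy("flag").parquet("/data/events_partitioned")
// 2) Sort by the filter column before writing Parquet: equal values cluster into fewer files,
//    so min/max statistics let filter pushdown skip whole files on read.
df.sort("flag").write.parquet("/data/events_sorted")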

How to calculate the number of rows obtained after a join, a filter or a write without using the count function - PySpark

I am using PySpark to join, filter and write large dataframes to CSV.
After each filter, join or write, I count the number of rows with df.count().
However, counting the rows means reloading the data and re-performing the various operations.
How could I count the number of rows produced by each operation without reloading and recomputing, as df.count() does?
I am aware that the cache function could be a solution to avoid reloading and recomputing, but I am looking for another solution, as it's not always the best one.
Thank you in advance!
Why not look at the Spark UI to get a feel for what's happening instead of using count()? This might help you get a feel without actually doing counts. The Jobs/Tasks pages can help you find the bottlenecks, and the SQL tab can help you look at your plan to understand what's actually happening.
If you want something cheaper than count: countApprox is cheaper (it's RDD-level tooling). And you should be caching if you are going to count the dataframe and then use it again afterwards; in fact, count is sometimes used precisely to force caching.
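A minimal sketch of both suggestions, written in Scala for consistency with the rest of this page (the PySpark calls are analogous; the timeout value is arbitrary):
// Approximate count at the RDD level, bounded by a timeout in milliseconds.
val approx = df.rdd.countApprox(timeout = 1000L, confidence = 0.95)
// Cache before counting if the dataframe is reused afterwards: the count materializes the
// cache, so later joins/filters/writes read from memory instead of recomputing the lineage.
val cached = df.cache()
val n = cached.count()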

Large number of transformations on multiple dataframes in Spark

I have a transform engine built on Spark that is metadata-driven. I perform a set of transformations on multiple dataframes stored in memory in a Scala Map[String, DataFrame]. I encounter a condition where I generate a dataframe using 84 transforms (including withColumn, join, union, etc.). After these, the output dataframe is used as input to another set of transformations.
If I write the intermediate result after the first 84 transformations and then load the dataframe back into the Map from the output path, the next set of transformations works fine. If I do not do this, it takes 30 minutes just to evaluate.
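For reference, a minimal sketch of that write-and-reload workaround (the path is hypothetical and it assumes dfMap is a mutable Map[String, DataFrame]); materializing the intermediate result means the second set of transformations starts from a fresh, short query plan:
val outPath = s"/tmp/intermediate/$target"                // hypothetical output path
dfMap(target).write.mode("overwrite").parquet(outPath)    // materialize the result of the 84 transforms
dfMap(target) = spark.read.parquet(outPath)               // reload it; later transforms no longer carry the old lineage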
My Approach: I tried persisting the Dataframe using:
dfMap(target).cache()
But this approach did not help.
So of these 84 transformations, how many are aggregations based on the same key? For example, if you are calculating min, max, etc. on a particular column such as user_id, then it makes sense to store your original dataframe after bucketing it by user_id. Also, for joining, if you are using the same key, you can partition by it. If you don't bucket, then each such transformation causes a Spark shuffle.
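A hedged sketch of the bucketing idea (the table name, bucket count and the user_id key are hypothetical):
df.write
  .bucketBy(64, "user_id")            // co-locate equal keys in the same bucket files
  .sortBy("user_id")
  .saveAsTable("events_bucketed")     // bucketing metadata is only kept for tables, not plain files
val bucketed = spark.table("events_bucketed")
// joins and aggregations keyed on user_id can now reuse the bucketing instead of reshuffling each time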
This answer should help - Spark Data set transformation to array

Spark grouped map UDF in Scala

I am trying to write some code that would allow me to perform some action on a group of rows of a dataframe. In PySpark, this is possible by defining a Pandas UDF of type GROUPED_MAP. However, in Scala, I have only found ways to create custom aggregators (UDAFs) or classic UDFs.
My temporary solution is to generate a list of keys that encode my groups, which allows me to filter the dataframe and perform my action on each subset of the dataframe. However, this approach is not optimal and is very slow.
The actions are performed sequentially, thus taking a lot of time. I could parallelize the loop, but I doubt this would show any improvement since Spark is already distributed.
Is there any better way to do what I want ?
Edit: I tried parallelizing using Futures, but there was no speed improvement, as expected.
To the best of my knowledge, this is something that's not possible in Scala. Depending on what you want, I think there could be other ways of applying a transformation to a group of rows in Spark / Scala:
Do a groupBy(...).agg(collect_list(<column_names>)) and use a UDF that operates on the array of values. If desired, you can use a select statement with explode(<array_column>) to revert to the original format (see the sketch after the window example below).
Try rewriting what you want to achieve using window functions. You can add a new column with an aggregate expression like so:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._   // pmod, lit, sum
import spark.implicits._                   // enables the 'column symbol syntax
val w = Window.partitionBy('group)
val result = spark.range(100)
  .withColumn("group", pmod('id, lit(3)))
  .withColumn("group_sum", sum('id).over(w))

Group data based on multiple columns in Spark using Scala's API

I have an RDD and want to group the data based on multiple columns. For a large dataset, Spark cannot cope using combineByKey, groupByKey, reduceByKey or aggregateByKey; these give heap space errors. Can you suggest another method for resolving this using Scala's API?
You may want to use treeReduce() for doing an incremental reduce in Spark. However, your hypothesis that Spark cannot work on large datasets is not true, and I suspect you just don't have enough partitions in your data, so maybe a repartition() is what you need.
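A hedged sketch of the repartitioning route, keying on a tuple of the grouping columns and using reduceByKey from the question rather than groupByKey (the Event fields and partition count are made up; assumes a SparkSession named spark is in scope):
case class Event(country: String, city: String, amount: Double)
val events = spark.sparkContext.parallelize(Seq(
  Event("US", "NYC", 10.0), Event("US", "NYC", 5.0), Event("FR", "Paris", 7.0)))
val sums = events
  .map(e => ((e.country, e.city), e.amount))   // group on multiple columns by keying on a tuple
  .repartition(400)                            // more, smaller partitions keeps each task within executor memory
  .reduceByKey(_ + _)                          // combines per partition before the shuffle, unlike groupByKey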