Windowing Aggregations in Spark - pyspark

I am learning Spark Structured Streaming and I am a bit confused about windowing aggregations. I have the following questions.
Does a windowing aggregation always consider the time that is passed as a column to the window function?
If so, do we always have to provide a timestamp column in the dataset?
What happens when the dataset doesn't contain a timestamp column?
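For context, here is the kind of windowed aggregation being asked about; a minimal sketch in Scala (the equivalent API exists in PySpark), assuming a streaming DataFrame events with a timestamp column "eventTime" and a grouping column "word" (all names illustrative):
import org.apache.spark.sql.functions.{col, window}

// `events` is assumed to be a streaming DataFrame with an event-time
// column "eventTime" and a grouping column "word".
val windowedCounts = events
  .groupBy(
    window(col("eventTime"), "10 minutes", "5 minutes"), // event-time window
    col("word"))
  .count()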

Related

Spark Scala: what's the difference between predicate pushdown and partitioning in terms of processing and storage?

I'm using DataFrames and I have come across these terms. I couldn't fully understand them. If possible, can you give an example of both?
Without predicate pushdown, when filtering query results, a consumer of the parquet-mr API in Spark fetches all records from the API and then evaluates each record against the predicates of the filtering condition. For example, suppose you are joining two large Parquet tables:
SELECT * FROM museum m JOIN painting p ON p.museumid = m.id WHERE p.width > 120 AND p.height > 150
However, this requires assembling all records in memory, even non-matching ones. With predicate pushdown, these conditions are passed to the parquet-mr library instead, which evaluates the predicates at a lower level and discards non-matching records without assembling them first. So the data gets filtered based on the condition first, and only then is it brought into memory for the join.
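A quick way to see this in practice is to check the physical plan; a minimal sketch, assuming an existing SparkSession spark and illustrative Parquet paths and column names matching the SQL above:
import org.apache.spark.sql.functions.col

val paintings = spark.read.parquet("/data/painting")
val museums = spark.read.parquet("/data/museum")

val result = paintings
  .filter(col("width") > 120 && col("height") > 150)
  .join(museums, paintings("museumid") === museums("id"))

// With pushdown enabled, the scan node of the physical plan typically lists
// something like: PushedFilters: [GreaterThan(width,120), GreaterThan(height,150)]
result.explain()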
Partitioning in Spark is about how the data is logically divided. When you read files from S3/ADLS/HDFS (for example gzip files), Spark creates a single partition for each non-splittable file. Later, when you process the data, the shuffle-partition setting determines how your DataFrame's partitioning changes. So I think partitioning and predicate pushdown have only one thing in common: when fetching data from columnar files, if a filter is pushed down, the data that arrives in each Spark partition is less than the data lying on the file system.
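To see the partitioning side concretely, a small sketch (assuming an existing SparkSession spark and an illustrative Parquet path):
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/data/painting") // illustrative path
println(df.rdd.getNumPartitions) // partitions determined by the file splits

// After a shuffle (groupBy/join), the partition count follows this setting
// (unless adaptive query execution coalesces it).
spark.conf.set("spark.sql.shuffle.partitions", "200")
val grouped = df.groupBy(col("museumid")).count()
println(grouped.rdd.getNumPartitions)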
Hope this helps

How to handle too many aggregation operations?

In my requirement, I read a table from Hive (size: around 1 TB) and have to do a lot of aggregation operations, mostly avg and sum.
I tried the following code. It runs for a long time. Is there another way to optimize it, or a more efficient way of handling multiple agg operations?
finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty")
   .agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),
     avg($"neg"),avg($"Rd"),avg("savg"),avg("slavg"),avg($"dex"),avg("cur"),avg($"Nexp"), avg($"NExpp"),avg($"Psat"),
     avg($"Pexps"),avg($"Pxn"),avg($"Pn"),avg($"AP3"),avg($"APd"),avg($"RInd"),avg($"CP"),avg($"CScr"),
     avg($"Fspct7p1"), avg($"Fspts7p1"),avg($"TlpScore"),avg($"Ordrs"),avg($"Drs"),
     avg("Lns"),avg("Judg"),avg("ds"),
     avg("ob"),sum("Ss"),sum("dol"),sum("liens"),sum("pct"),
     sum("jud"),sum("sljd"),sum("pNB"),avg("pctt"),sum($"Dolneg"),sum("Ls"),sum("sl"),sum($"PA"),sum($"DS"),
     sum($"DA"),sum("dcur"),sum($"sat"),sum($"Pes"),sum($"Pn"),sum($"Pn"),sum($"Dlo"),sum($"Dol"),sum("pdol"),sum("pct"),sum("judg"))
Note - I am using Spark Scala
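For readability, the same aggregation can also be built programmatically from lists of column names; a minimal sketch (the lists below are abbreviated and illustrative, and would need to match the full query above):
import org.apache.spark.sql.functions.{avg, col, count, sum}

// Abbreviated column lists; fill in the full sets from the query above.
val avgCols = Seq("Emp", "Ntw", "Age", "DAll", "PAll")
val sumCols = Seq("Ss", "dol", "liens", "pct")

val aggExprs = count(col("Dseq")) +:
  (avgCols.map(c => avg(col(c))) ++ sumCols.map(c => sum(col(c))))

val result = finalDF
  .groupBy(col("Dseq"), col("FmNum"), col("yrs"), col("mnt"), col("FromDnsty"))
  .agg(aggExprs.head, aggExprs.tail: _*)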

Improve groupby operation in Spark 1.5.2

We are facing poor performance using Spark.
I have 2 specific questions:
When debugging, we noticed that a few of the groupBy operations done on RDDs are taking more time.
Also, a few of the stages appear twice, some finishing very quickly and some taking more time.
We are currently running locally, with shuffle partitions set to 2 and the number of partitions set to 5; the data is around 100,000 records.
Speaking of the groupBy operation, we are grouping a DataFrame (which is the result of several joins) on two columns and then applying a function to get some result.
val groupedRows = rows.rdd.groupBy(row => (
  row.getAs[Long](Column1),
  row.getAs[Int](Column2)
))
val rdd = groupedRows.values.map(Criteria)
where Criteria is some function acting on the grouped rows. Can we optimize this groupBy in any way?
I would suggest that you not convert the existing DataFrame to an RDD for the complex processing you are performing.
If you want to apply the Criteria function to the groups defined by the two columns (Column1 and Column2), you can do this directly on the DataFrame. Moreover, if your Criteria can be reduced to a combination of built-in functions, that would be ideal; but you can always use UDF functions for custom rules.
What I would suggest is to do the groupBy on the DataFrame and apply aggregation functions:
rows.groupBy("Column1", "Column2").agg(/* aggregation expressions implementing Criteria */)
You can use Window functions if you want multiple rows back from the grouped DataFrame; more info here.
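As a concrete sketch of both suggestions (the "value" column and the aggregates are illustrative; only Column1 and Column2 come from the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, max}

// Replace the RDD groupBy with a DataFrame groupBy plus built-in aggregates.
val aggregated = rows
  .groupBy(col("Column1"), col("Column2"))
  .agg(avg(col("value")).as("avgValue"), max(col("value")).as("maxValue"))

// Or, if every row of each group is needed in the output, use a Window.
val w = Window.partitionBy(col("Column1"), col("Column2"))
val withGroupAvg = rows.withColumn("avgValue", avg(col("value")).over(w))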
.groupBy is known not to be the most efficient approach:
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
Sometimes it is better to use .reduceByKey or .aggregateByKey, as explained here:
While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.
Why do .reduceByKey and .aggregateByKey work faster than .groupBy? Because part of the aggregation happens during the map phase, so less data is shuffled between worker nodes during the reduce phase. Here is a good explanation of how aggregateByKey works.
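A minimal sketch of the difference, assuming an existing SparkContext sc (the keys and values are illustrative):
// A small pair RDD of (key, value).
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// groupByKey ships every value across the network, then sums on the reducer side.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values per key on each partition before the shuffle,
// so far less data crosses the network.
val viaReduce = pairs.reduceByKey(_ + _)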

What is the difference between transformations and RDD functions in Spark?

I am reading Spark textbooks and I see transformations and actions, and then I read about RDD functions, so I am confused. Can anyone explain the basic difference between transformations and Spark RDD functions?
Both are used to change the RDD's contents and return a new RDD, but I want to know the precise explanation.
Spark RDD functions include both transformations and actions. A transformation is a function that produces a new RDD from the existing data, and an action is a function that doesn't change the data but gives an output.
For example:
map, filter, union, etc. are all transformations, as they produce new RDDs from the existing data.
reduce, collect, count are all actions, as they give an output rather than changing the data.
For more info, visit Spark and Jacek.
RDDs support only two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
"RDD functions" is a generic term used in textbooks that covers both of these kinds of operations.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results, while reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
Since Spark's collections are immutable in nature, we can't change the data once the RDD is created.
Transformations are functions that apply to RDDs and produce other RDDs as output (e.g. map, flatMap, filter, join, groupBy, ...).
Actions are functions that apply to RDDs and produce non-RDD data (Array, List, etc.) as output (e.g. count, saveAsTextFile, foreach, collect, ...).
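A minimal sketch of the laziness distinction, assuming an existing SparkContext sc:
val numbers = sc.parallelize(Seq(1, 2, 3, 4))

// Transformations: lazily define new RDDs; nothing is computed yet.
val doubled = numbers.map(_ * 2)
val evens = doubled.filter(_ % 4 == 0)

// Actions: trigger the computation and return results to the driver.
val total = evens.reduce(_ + _) // 12
val asArray = evens.collect()   // Array(4, 8)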

The fastest way to get count of processed dataframe records in Spark

I'm currently developing a Spark application that uses DataFrames to compute and aggregate specific columns from a Hive table.
Aside from using the count() function on DataFrames/RDDs, is there a more optimal approach to get the number of records processed, or the record count of a DataFrame?
I just need to know if there is a specific function I should override, or something along those lines.
Any replies would be appreciated. I'm currently using Apache Spark 1.6.
Thank you.
Aside from using count() function in dataframes/rdd, is there a more optimal approach to get the number of records processed or number of count of records of a dataframe?
Nope. Since an RDD may have an arbitrarily complex execution plan, involving JDBC table queries, file scans, etc., there's no a priori way to determine its size short of counting.
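In practice, if you need both the processed data and its count, the usual pattern is to cache before counting so the plan is not evaluated twice; a small sketch with illustrative names (df and the filter are placeholders):
import org.apache.spark.sql.functions.col

val processed = df.filter(col("age") > 18) // illustrative processing step
processed.cache()

val n = processed.count()            // evaluates the plan and materializes the cache
processed.write.parquet("/tmp/out")  // reuses the cached result instead of recomputing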