Unable to Convert to DataFrame from RDD after applying partitioning - scala

I am using Spark 2.1.0
When I try to use a window function on a DataFrame:
val winspec = Window.partitionBy("partition_column")
DF.withColumn("column", avg(DF("col_name")).over(winspec))
my plan changes and the lines below are added to the physical plan. Because of this, an extra stage and extra shuffling happen, and since the data is huge this slows down my query dramatically and it runs for hours.
+- Window [avg(cast(someColumn#262 as double)) windowspecdefinition(partition_column#460, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS someColumn#263], [partition_column#460]
+- *Sort [partition_column#460 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(partition_column#460, 200)
I also see a stage called MapInternalPartition, which I think means the data is partitioned internally; I don't know what this is. But I suspect that because of it, out of my 100 tasks that took 30+ minutes, 99 completed within 1-2 minutes and the last one took the remaining 30 minutes, leaving my cluster idle with no parallel processing. So is the data partitioned properly when a window function is used?
I tried to apply hash partitioning by converting to an RDD, because we cannot apply a custom partitioner / HashPartitioner to a DataFrame.
So if I do this:
val myVal = DF.rdd.partitioner(new HashPartitioner(10000))
I get a return type of Any, on which I cannot call any actions.
I also checked and saw that the column used for partitioning in the window function contains only NULL values.

TL;DR:
1. A shuffle when using window functions is not extra; it is required for correctness and cannot be removed.
2. Applying a partitioner shuffles as well.
3. Datasets cannot reuse RDD partitioning. Partitioning a Dataset should be done with the repartition method:
df.repartition($"col_name")
but it won't help you here because of 2).
And this:
val myVal = DF.rdd.partitioner(new HashPartitioner(10000))
wouldn't return Any; it wouldn't compile at all, as there is no partitioner method on RDD that takes arguments.
The correct method is partitionBy, but it is only defined for pair RDDs, not RDD[Row], and it wouldn't help you anyway because of 3).
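For reference, a minimal sketch of what an RDD-level HashPartitioner would look like, keeping the question's DF and 10000 and assuming spark is the SparkSession; it also shows why the detour does not help:

import org.apache.spark.HashPartitioner

// partitionBy is only defined on pair RDDs, so key the rows first.
val keyed = DF.rdd.keyBy(row => row.getAs[Any]("partition_column"))
val partitioned = keyed.partitionBy(new HashPartitioner(10000))

// Converting back to a DataFrame drops the partitioner again (point 3),
// so the Exchange introduced by the window function is not avoided.
val backToDf = spark.createDataFrame(partitioned.values, DF.schema)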
If there is enough memory you can try
df.join(
  broadcast(df.groupBy("partition_column").agg(avg(df("col_name")))),
  Seq("partition_column")
)
Edit:
If you're trying to compute running average (avg with Window.partitionBy("partition_column") computes global average by group, not running average), then you're out of luck.
If the partitioning column contains only NULLs, then the computation is not distributed and runs fully sequentially.
To compute a global running average you can try to apply logic similar to this: How to compute cumulative sum using Spark.
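A minimal sketch of that idea (order_column is a placeholder, not a column from the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Global running average: no partitionBy, so all rows land in a single
// partition and the computation stays sequential, as noted above.
val runningSpec = Window
  .orderBy("order_column")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withRunningAvg = DF.withColumn("running_avg", avg(DF("col_name")).over(runningSpec))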

Related

Spark [1.5] write dataframe as parquet to HDFS time increases linearly with time

I am working on Spark 1.5. I have a list of DataFrames on the driver that I iterate over, unioning 10 DataFrames at a time using grouped(10) on the list, and then writing each unioned DataFrame as Parquet to HDFS.
dfsList.toList.grouped(10).toList.zipWithIndex.par.map {
  case (f, count) =>
    f.reduce(_ unionAll _)
      .where($"sstp".isNotNull)
      .write
      .partitionBy("sstp")
      .parquet(s"$parquetPath/${startIndex}_${count}")
}
Now I can see that the time taken to write increases from 0.2 seconds at the beginning to almost 7 minutes by the 47th job, and I am unable to find the reason for this.
Please help/provide any hints or solutions to understand and fix this.
EDIT - Here is what df.explain looks like for a union before the write.
== Physical Plan ==
Union
Filter isnotnull(sstp#1606075)
Scan PhysicalRDD[sno#1606074,sstp#1606075,hdfspath#1606076]
Filter isnotnull(sstp#1606090)
Scan PhysicalRDD[sno#1606089,sstp#1606090,hdfspath#1606091]
Filter isnotnull(sstp#1606108)
Scan PhysicalRDD[sno#1606107,sstp#1606108,hdfspath#1606109]
Filter isnotnull(sstp#1606123)
Scan PhysicalRDD[sno#1606122,sstp#1606123,hdfspath#1606124]
Filter isnotnull(sstp#1606135)
Scan PhysicalRDD[sno#1606134,sstp#1606135,hdfspath#1606136]
Filter isnotnull(sstp#1606153)
Scan PhysicalRDD[sno#1606152,sstp#1606153,hdfspath#1606154]
Filter isnotnull(sstp#1606171)
Scan PhysicalRDD[sno#1606170,sstp#1606171,hdfspath#1606172]
Filter isnotnull(sstp#1606189)
Scan PhysicalRDD[sno#1606188,sstp#1606189,hdfspath#1606190]
Filter isnotnull(sstp#1606195)
Scan PhysicalRDD[sno#1606194,sstp#1606195,hdfspath#1606196]
Filter isnotnull(sstp#1606198)
Scan PhysicalRDD[sno#1606197,sstp#1606198,hdfspath#1606199]
EDIT 2: I have also realized that at any given time, out of all the active jobs, only 2 or 3 seem to show any progress; the rest remain in a sort of waiting queue, and this causes each job to take linearly increasing time.
Why is this happening? Possibly because my data isn't partitioned correctly.
Note: I have 10 executors with 4 cores per executor for this spark context.

Apache Spark, range-joins, data skew and performance

I have the following Apache Spark SQL join predicate:
t1.field1 = t2.field1 and t2.start_date <= t1.event_date and t1.event_date < t2.end_date
data:
The t1 DataFrame has over 50 million rows.
The t2 DataFrame has over 2 million rows.
Almost all t1.field1 values in the t1 DataFrame are the same (null).
Right now the Spark cluster hangs for more than 10 minutes on a single task while performing this join, because of data skew: only one worker, with one task, is doing any work at that point, while the other 9 workers are idle. How can I improve this join so that the load is distributed from this one task across the whole Spark cluster?
I am assuming you are doing an inner join.
The following steps can be followed to optimise the join:
1. Before joining, we can filter t1 and t2 based on the smallest or largest start_date, event_date, end_date. It will reduce the number of rows.
2. Check whether the t2 dataset has null values for field1; if not, t1 can be filtered with a not-null condition before the join. It will reduce the size of t1.
3. If your job is getting fewer executors than are available, you have too few partitions. Simply repartition the dataset to an optimal number, so that you don't end up with too many or too few partitions.
4. You can check whether the partitioning has happened properly (no skew) by looking at task execution times; they should be similar.
5. Check whether the smaller dataset can fit in executor memory; if so, a broadcast join can be used (see the sketch below).
You might like to read - https://github.com/vaquarkhan/Apache-Kafka-poc-and-notes/wiki/Apache-Spark-Join-guidelines-and-Performance-tuning
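Regarding point 5, a minimal sketch of the broadcast variant of this join, assuming t2 (~2 million rows) is small enough to fit in executor memory:

import org.apache.spark.sql.functions.broadcast

// Broadcasting the smaller t2 avoids shuffling the heavily skewed
// t1.field1 key across the cluster.
val joined = t1.join(
  broadcast(t2),
  t1("field1") === t2("field1") &&
    t2("start_date") <= t1("event_date") &&
    t1("event_date") < t2("end_date")
)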
If almost all the rows in t1 have t1.field1 = null, and the event_date column is numeric (or you convert it to a timestamp), you can first use Apache DataFu to do a ranged join, and filter out the rows in which t1.field1 != t2.field1 afterwards.
The range join code would look like this:
t1.joinWithRange("event_date", t2, "start_date", "end_date", 10)
The last argument - 10 - is the decrease factor. This does bucketing, as Raphael Roth suggested in his answer.
You can see an example of such a ranged join in the blog post introducing DataFu-Spark.
Full disclosure - I am a member of DataFu and wrote the blog post.
I assume Spark already pushed down the not-null filter on t1.field1; you can verify this in the explain plan.
I would rather experiment with creating an additional attribute which can be used as an equi-join condition, e.g. by bucketing. For example, you could create a month attribute. To do this you would need to enumerate the months in t2, which is usually done using UDFs. See this SO question for an example: How to improve broadcast Join speed with between condition in Spark
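A rough sketch of that bucketing idea with a month attribute (the UDF and all intermediate names are made up; it assumes the date columns are non-null timestamps and that spark is the SparkSession):

import java.sql.Timestamp
import org.apache.spark.sql.functions.{date_format, explode, udf}
import spark.implicits._

// Enumerate the "yyyy-MM" buckets covered by [start_date, end_date].
val monthsBetween = udf { (start: Timestamp, end: Timestamp) =>
  val first = java.time.YearMonth.from(start.toLocalDateTime)
  val last = java.time.YearMonth.from(end.toLocalDateTime)
  Iterator.iterate(first)(_.plusMonths(1)).takeWhile(!_.isAfter(last)).map(_.toString).toSeq
}

// Tag both sides with the bucket, equi-join on (field1, month),
// then re-apply the exact range condition.
val t1m = t1.withColumn("month", date_format($"event_date", "yyyy-MM"))
val t2m = t2.withColumn("month", explode(monthsBetween($"start_date", $"end_date")))

val joined = t1m.join(t2m, Seq("field1", "month"))
  .where($"start_date" <= $"event_date" && $"event_date" < $"end_date")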

Spark ML Transformer - aggregate over a window using rangeBetween

I would like to create a custom Spark ML Transformer that applies an aggregation function within a rolling window, using the over window construct. I would like to be able to use this transformer in a Spark ML Pipeline.
I would like to achieve something that can be done quite easily with withColumn, as given in this answer:
Spark Window Functions - rangeBetween dates
for example:
val w = Window.orderBy(col("unixTimeMS")).rangeBetween(0, 700)
val df_new = df.withColumn("cts", sum("someColumnName").over(w))
Where
df is my dataframe
unixTimeMS is unix time in milliseconds
someColumnName is some column over which I want to perform the aggregation.
In this example I do a sum over the rows within the window.
The window w includes the current transaction and all transactions within 700 ms of the current transaction.
Is it possible to put such window aggregation into Spark ML Transformer?
I was able to achieve something similar with the Spark ML SQLTransformer, where
val query = """SELECT *,
sum(someColumnName) over (order by unixTimeMS) as cts
FROM __THIS__"""
new SQLTransformer().setStatement(query)
But I can't figure out how to use rangeBetween in SQL to select a period of time rather than a number of rows. I need a specific period of time with respect to the unixTimeMS of the current row.
I understand that a UnaryTransformer is not the way to do it because I need to compute an aggregate. Do I need to define a UDAF (user defined aggregate function) and use it in SQLTransformer?
I wasn't able to find any example of a UDAF containing a window function.
I am answering my own question for future reference. I ended up using SQLTransformer, just like the window function in the example, but with range between:
val query = """SELECT *,
  sum(dollars) over (
    partition by Numerocarte
    order by UnixTime
    range between 1000 preceding and 200 following) as cts
  FROM __THIS__"""
Here 1000 and 200 in range between are in the units of the order by column.
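For completeness, a minimal sketch of dropping that SQLTransformer into an ML Pipeline (df and the extra stages are placeholders):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.SQLTransformer

val cumulativeSum = new SQLTransformer().setStatement(
  """SELECT *,
    |  sum(dollars) over (
    |    partition by Numerocarte
    |    order by UnixTime
    |    range between 1000 preceding and 200 following) as cts
    |FROM __THIS__""".stripMargin)

// SQLTransformer is an ordinary PipelineStage, so it can sit next to other stages.
val pipeline = new Pipeline().setStages(Array[PipelineStage](cumulativeSum /*, other stages */))
val model = pipeline.fit(df)
val result = model.transform(df)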

dataframe.selectExpr performance for selecting a large number of columns

I am using a Spark DataFrame in Scala. My DataFrame holds about 400 columns, with 1000 to 1M rows. I am running a
dataframe.selectExpr operation (columns 1 to 400) based on certain criteria, and once they are fetched I aggregate the values of all these columns.
My selectExpr statement:
val df = df2.selectExpr(getColumn(beginDate, endDate, x._2): _*)
The getColumn method fetches columns day-wise between beginDate and endDate from my DataFrame (this may be 365 columns, as we have day-wise data).
My summing expression is:
df.map(row => (row(0), row(1), row(2), (3 until row.length).map(row.getLong(_)).sum)).collect()
I find that selecting this many columns degrades the performance of my job. Is there any way to make this fetching of 400 columns much faster?
A DataFrame is better optimized than an RDD, and you are using one, which is good. But please check in the Spark UI which stage is taking the most time, and whether the time goes into the computation or into loading the data (Reshaping Data). Try to scale up slowly for faster output, and check whether partitioning can also help your code run faster (Apache Spark Code Faster). Making the code faster depends on various factors; try to optimize using those.
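Not from the answer above, just one hedged direction to experiment with: the per-row sum can be expressed as a single column expression, so the aggregation stays inside the DataFrame plan instead of a map over Rows followed by collect (assuming, as in the question, that the first three columns are keys and the remaining ones are numeric):

import org.apache.spark.sql.functions.col

// Keep the first three columns as-is and sum the remaining ones
// with one column expression.
val metricCols = df.columns.drop(3).map(col)
val summed = df.select(
  col(df.columns(0)),
  col(df.columns(1)),
  col(df.columns(2)),
  metricCols.reduce(_ + _).as("total")
)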

SparkSQL PostgreSQL DataFrame partitions

I have a very simple setup of SparkSQL connecting to a Postgres DB, and I'm trying to get a DataFrame from a table with a number X of partitions (let's say 2). The code would be the following:
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
options.put("driver", POSTGRES_DRIVER);
options.put("dbtable", "select ID, OTHER from TABLE limit 1000");
options.put("partitionColumn", "ID");
options.put("lowerBound", "100");
options.put("upperBound", "500");
options.put("numPartitions","2");
DataFrame housingDataFrame = sqlContext.read().format("jdbc").options(options).load();
For some reason, one partition of the DataFrame contains almost all rows.
From what I understand, lowerBound/upperBound are the parameters used to fine-tune this. In SparkSQL's documentation (Spark 1.4.0 - spark-sql_2.11) it says they are used to define the stride, not to filter/range the partition column. But that raises several questions:
Is the stride the frequency (number of elements returned per query) with which Spark will query the DB for each executor (partition)?
If not, what is the purpose of these parameters, what do they depend on, and how can I balance my DataFrame partitions in a stable way (not asking that all partitions contain the same number of elements, just that there is an equilibrium - for example, with 2 partitions and 100 elements, 55/45, 60/40 or even 65/35 would do)?
I can't seem to find a clear answer to these questions and was wondering if some of you could clear these points up for me, because right now this is affecting my cluster performance when processing X million rows, with all the heavy lifting going to one single executor.
Cheers and thanks for your time.
Essentially the lower and upper bound and the number of partitions are used to calculate the increment or split for each parallel task.
Let's say the table has partition column "year", and has data from 2006 to 2016.
If you define the number of partitions as 10, with lower bound 2006 and upper bound 2016, you will have each task fetching data for its own year - the ideal case.
Even if you incorrectly specify the lower and/or upper bound, e.g. set lower = 0 and upper = 2016, there will be a skew in data transfer, but you will not "lose" or fail to retrieve any data, because:
The first task will fetch data for year < 0.
The second task will fetch data for year between 0 and 2016/10.
The third task will fetch data for year between 2016/10 and 2*2016/10.
...
And the last task will have a where condition that is open-ended above, so everything up to and beyond 2016 is still retrieved.
The lower and upper bounds are indeed used against the partitioning column; refer to this code (current version at the moment of writing this):
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
Function columnPartition contains the code for the partitioning logic and the use of lower / upper bound.
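A simplified paraphrase of that logic, so the stride becomes concrete (not the exact Spark code, which differs between versions; see the linked source for the authoritative version):

// One WHERE clause is generated per partition / parallel JDBC query.
def jdbcWhereClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = upper / numPartitions - lower / numPartitions
  var current = lower
  (0 until numPartitions).map { i =>
    val lowerClause = if (i != 0) s"$column >= $current" else null
    current += stride
    val upperClause = if (i != numPartitions - 1) s"$column < $current" else null
    if (upperClause == null) lowerClause            // last partition: open-ended above
    else if (lowerClause == null) upperClause       // first partition: open-ended below
    else s"$lowerClause AND $upperClause"
  }
}

// jdbcWhereClauses("year", 2006, 2016, 10) gives "year < 2007", "year >= 2007 AND year < 2008", ..., "year >= 2015".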
lowerBound and upperBound do what the previous answers describe. A follow-up question is how to balance the data across partitions without looking at the min/max values, or when your data is heavily skewed.
If your database supports a "hash" function, it could do the trick:
partitionColumn = "hash(column_name)%num_partitions"
numPartitions = 10 // whatever you want
lowerBound = 0
upperBound = numPartitions
This will work as long as the modulus operation returns a uniform distribution over [0,numPartitions)
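A sketch of what this would look like with the question's JDBC options (in Scala here, while the question used Java; the hash expression is a placeholder for whatever your database actually supports):

// Subqueries passed as dbtable generally need to be parenthesised and aliased.
val numPartitions = 10
val housingDataFrame = sqlContext.read
  .format("jdbc")
  .option("url", DB_URL)
  .option("driver", POSTGRES_DRIVER)
  .option("dbtable", "(select ID, OTHER from TABLE limit 1000) as t")
  .option("partitionColumn", s"hash(ID) % $numPartitions")
  .option("lowerBound", "0")
  .option("upperBound", numPartitions.toString)
  .option("numPartitions", numPartitions.toString)
  .load()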