As far as I understand, this is the most efficient way to calculate average in Spark: Spark : Average of values instead of sum in reduceByKey using Scala.
My question is: if I use the high-level dataset with a groupby followed by Spark functions' avg(), will I get the same RDD under the hood? Can I trust Catalyst or I should use the low-level RDD? I mean, will writing low-level code yield better results than a dataset?
Example code:
employees
.groupBy($"employee")
.agg(
avg($"salary").as("avg_salary")
)
Versus:
employees
.mapValues(employee => (employee.salary, 1)) // map entry with a count of 1
.reduceByKey {
case ((sumL, countL), (sumR, countR)) =>
(sumL + sumR, countL + countR)
}
.mapValues {
case (sum , count) => sum / count
}
I don't see it a black-and-white question. In general, if you have a RDD, especially if it's a PairRDD, and need a result in RDD, it would make sense to settle with reduceByKey. On the other hand, given a DataFrame I would recommend going with groupBy/agg(avg).
A couple of things to consider:
Built-in optimization
While reduceByKey is relatively efficient compared to functions like groupByKey, it does induce stage boundaries since the operation requires repartitioning the data by keys. Depending on the RDD's partitions, the number of tasks in the derived stage may end up to be too small to take advantage of the available cpu cores, potentially resulting in a performance bottleneck. Such performance issue could be addressed, for instance, by manually assigning numPartition in reduceByKey:
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
But the point is that, to fully optimize RDD operations, one might need to put in some manual tweaking effort. In contrary, most operations for DataFrames are automatically optimized by the built-in Catalyst query optimizer.
Memory usage efficiency
Perhaps the other more significant factor to be looked at is related to memory usage for large dataset. When a RDD needs to be distributed across nodes or written to disk, Spark will serialize every row of data into objects, subject to costly Garbage Collection overhead. On the other hand, with knowledge of a DataFrame's schema, Spark doesn't need to serialize the data into objects. The Tungsten execution engine can leverage off-heap memory to store data in binary format for transformations, resulting in more efficient use of memory.
In conclusion, while there may be more knobs for tweaking using low-level code, that does not necessarily result in more performant code due to inadequate optimization, additional cost for serialization, etc.
We can conclude this from the plan generated by Spark.
This is the plan for DataFrame syntax-
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees
.groupBy($"employee")
.agg(
avg($"salary").as("avg_salary")
).explain(true)
Plan -
== Parsed Logical Plan ==
'Aggregate ['employee], [unresolvedalias('employee, None), avg('salary) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, avg_salary: double
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Optimized Logical Plan ==
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- LocalRelation [employee#4, salary#5]
== Physical Plan ==
*(2) HashAggregate(keys=[employee#4], functions=[avg(salary#5)], output=[employee#4, avg_salary#11])
+- Exchange hashpartitioning(employee#4, 10)
+- *(1) HashAggregate(keys=[employee#4], functions=[partial_avg(salary#5)], output=[employee#4, sum#17, count#18L])
+- LocalTableScan [employee#4, salary#5]
As the plan suggests first "HashAggregate" happened with partial average then "exchange hashpartitioning" happened for full average. The conclusion is that catalyst optimized the DataFrame operation as if we programmed with "reduceByKey" syntax. So we needn't take the burden of writing low level code.
Here is how RDD code and plan looks like.
employees
.map(employee => ("key",(employee.getAs[Double]("salary"), 1))) // map entry with a count of 1
.rdd.reduceByKey {
case ((sumL, countL), (sumR, countR)) =>
(sumL + sumR, countL + countR)
}
.mapValues {
case (sum , count) => sum / count
}.toDF().explain(true)
Plan -
== Parsed Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Analyzed Logical Plan ==
_1: string, _2: double
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Optimized Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- ExternalRDD [obj#29]
== Physical Plan ==
*(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- Scan[obj#29]
The plan is optimized and also involves serialization of data into objects which means extra pressure of memory.
Conclusion
I would use daraframe syntax for its simplicity and possibly better performance.
Debug at println("done") , Go to http://localhost:4040/stages/ ,You will get the result.
val spark = SparkSession
.builder()
.master("local[*]")
.appName("example")
.getOrCreate()
val employees = spark.createDataFrame(Seq(("employee1",1000),("employee2",2000),("employee3",1500))).toDF("employee","salary")
import spark.implicits._
import org.apache.spark.sql.functions._
// Spark functions
employees
.groupBy("employee")
.agg(
avg($"salary").as("avg_salary")
).show()
// your low-level code
println("done")
Related
When computing statistics for a simple parallelized collection in Spark 2.3.0 I'm getting some strange results:
val df = spark.sparkContext.parallelize(Seq("y")).toDF("y")
df.queryExecution.stringWithStats
== Optimized Logical Plan
== Project [value#7 AS y#9], Statistics(sizeInBytes=8.0 EB, hints=none)
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7], Statistics(sizeInBytes=8.0 EB, hints=none)
+- ExternalRDD [obj#6], Statistics(sizeInBytes=8.0 EB, hints=none)
That is 8.0 Exabytes of data.
If I do the equivalent without parallelizing
== Optimized Logical Plan
== LocalRelation [x#3], Statistics(sizeInBytes=20.0 B, hints=none)
Clearly when the collection is getting serialized there is a side effect that the query planner can't accurately figure out the size. Am I missing something here?
Spark SQL doesn't seem to be able to push down LIMIT operators on inner joined tables. This is an issue when joining large tables to extract a small subset of rows. I'm testing on Spark 2.2.1 (most recent release).
Below is a contrived example, which runs in the spark-shell (Scala):
First, set up the tables:
case class Customer(id: Long, name: String, email: String, zip: String)
case class Order(id: Long, customer: Long, date: String, amount: Long)
val customers = Seq(
Customer(0, "George Washington", "gwashington#usa.gov", "22121"),
Customer(1, "John Adams", "gwashington#usa.gov", "02169"),
Customer(2, "Thomas Jefferson", "gwashington#usa.gov", "22902"),
Customer(3, "James Madison", "gwashington#usa.gov", "22960"),
Customer(4, "James Monroe", "gwashington#usa.gov", "22902")
)
val orders = Seq(
Order(1, 1, "07/04/1776", 23456),
Order(2, 3, "03/14/1760", 7850),
Order(3, 2, "05/23/1784", 12400),
Order(4, 3, "09/03/1790", 6550),
Order(5, 4, "07/21/1795", 2550),
Order(6, 0, "11/27/1787", 1440)
)
import spark.implicits._
val customerTable = spark.sparkContext.parallelize(customers).toDS()
customerTable.createOrReplaceTempView("customer")
val orderTable = spark.sparkContext.parallelize(orders).toDS()
orderTable.createOrReplaceTempView("order")
Now run the following join query, with a LIMIT and an arbitrary filter for each joined table:
scala> val join = spark.sql("SELECT c.* FROM customer c JOIN order o ON c.id = o.customer WHERE c.id > 1 AND o.amount > 5000 LIMIT 1")
Then print the corresponding optimized execution plan:
scala> println(join.queryExecution.sparkPlan.toString)
CollectLimit 1
+- Project [id#5L, name#6, email#7, zip#8]
+- SortMergeJoin [id#5L], [customer#17L], Inner
:- Filter (id#5L > 1)
: +- SerializeFromObject [assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).id AS id#5L, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).name, true) AS name#6, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).email, true) AS email#7, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).zip, true) AS zip#8]
: +- Scan ExternalRDDScan[obj#4]
+- Project [customer#17L]
+- Filter ((amount#19L > 5000) && (customer#17L > 1))
+- SerializeFromObject [assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).id AS id#16L, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).customer AS customer#17L, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).date, true) AS date#18, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).amount AS amount#19L]
+- Scan ExternalRDDScan[obj#15]
and you can immediately see that both tables are sorted in their entirety before being joined (although for these small example tables the explicit Sort step is not shown before the SortMergeJoin), and only afterward is the LIMIT applied.
If one of the databases contains billions of rows, this query becomes extremely slow and resource intensive, regardless of the LIMIT size.
Is Spark capable of optimizing such a query? Or can I work around the issue without mangling my SQL beyond recognition?
Spark 3.0 adds LimitPushDown so from that version it will be able to do that.
Is Spark capable of optimizing such a query
In short, it is not.
Using old nomenclature, join is a wide a transformation, where each output partition depends on every upstream partition. As a result, both parent datasets have to be fully scanned, to compute even a single partition of the child.
It is not impossible that some optimizations will be included in the future, however if you goal is to:
extract a small subset of rows.
then you should consider using database, not Apache Spark.
Not sure whether this will be helpful in your use case, but did you consider adding the filter and limit clause in order table itself.
val orderTableLimited = orderTable.filter($"customer" >
1).filter($"amount" > 5000).limit(1)
I have a Dataset that i created from a RDD and try to join it with another Dataset which is created from my Phoenix Table:
val dfToJoin = sparkSession.createDataset(rddToJoin)
val tableDf = sparkSession
.read
.option("table", "table")
.option("zkURL", "localhost")
.format("org.apache.phoenix.spark")
.load()
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")
When i execute it, it seems that the whole database table is loaded to do the join.
Is there a way to do such a join so that the filtering is done on the database instead of in spark?
Also: dfToJoin is smaller than the table, i do not know if this is important.
Edit: Basically i want to join my Phoenix table with an Dataset created through spark, without fetching the whole table into the executor.
Edit2: Here is the physical plan:
*Project [FEATURE#21, SEQUENCE_IDENTIFIER#22, TAX_NUMBER#23,
WINDOW_NUMBER#24, uniqueIdentifier#5, readLength#6]
+- *SortMergeJoin [FEATURE#21], [feature#4], Inner
:- *Sort [FEATURE#21 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(FEATURE#21, 200)
: +- *Filter isnotnull(FEATURE#21)
: +- *Scan PhoenixRelation(FEATURES,localhost,false)
[FEATURE#21,SEQUENCE_IDENTIFIER#22,TAX_NUMBER#23,WINDOW_NUMBER#24]
PushedFilters: [IsNotNull(FEATURE)], ReadSchema:
struct<FEATURE:int,SEQUENCE_IDENTIFIER:string,TAX_NUMBER:int,
WINDOW_NUMBER:int>
+- *Sort [feature#4 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(feature#4, 200)
+- *Filter isnotnull(feature#4)
+- *SerializeFromObject [assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).feature AS feature#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).uniqueIdentifier, true) AS uniqueIdentifier#5, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).readLength AS readLength#6]
+- Scan ExternalRDDScan[obj#3]
As you can see the equals-filter is not contained in the pushed-filters list, so it is obvious that no predicate pushdown is happening.
Spark will fetch the Phoenix table records to appropriate executors(not the entire table to one executor)
As the is no direct filter on Phoenix table df, we see only *Filter isnotnull(FEATURE#21) in physical plan.
As you are mentioning Phoenix table data is less when you apply filter on it. You push the filter to phoenix table on feature column by finding feature_ids in other dataset.
//This spread across workers - fully distributed
val dfToJoin = sparkSession.createDataset(rddToJoin)
//This sits in driver - not distributed
val list_of_feature_ids = dfToJoin.dropDuplicates("feature")
.select("feature")
.map(r => r.getString(0))
.collect
.toList
//This spread across workers - fully distributed
val tableDf = sparkSession
.read
.option("table", "table")
.option("zkURL", "localhost")
.format("org.apache.phoenix.spark")
.load()
.filter($"FEATURE".isin(list_of_feature_ids:_*)) //added filter
//This spread across workers - fully distributed
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")
joinedDf.explain()
I have a huge Spark Dataset with columns A, B, C, D, E. Question is if I initially repartition on column A, and subsequently do two 'within-partition' groupBy operations:
**groupBy("A", "C")**....map(....).**groupBy("A", "E")**....map(....)
is Spark 2.0 clever enough to by-pass shuffling since both groupBy operations are 'within-partition' with respect to the parent stage - i.e. column A is included in both groupBy column specs? If not, what can I do to ensure a narrow dependency throughout the chain of operations?
Spark indeed supports optimization like this. You can check that by analyzing execution plan:
val df = Seq(("a", 1, 2)).toDF("a", "b", "c")
df.groupBy("a").max().groupBy("a", "max(b)").sum().explain
== Physical Plan ==
*HashAggregate(keys=[a#42, max(b)#92], functions=[sum(cast(max(b)#92 as bigint)), sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42, max(b)#92], functions=[partial_sum(cast(max(b)#92 as bigint)), partial_sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42], functions=[max(b#43), max(c#44)])
+- Exchange hashpartitioning(a#42, 200)
+- *HashAggregate(keys=[a#42], functions=[partial_max(b#43), partial_max(c#44)])
+- LocalTableScan [a#42, b#43, c#44]
As you can see there is only one exchange but two hash aggregates.
I've started using Spark SQL and DataFrames in Spark 1.4.0. I'm wanting to define a custom partitioner on DataFrames, in Scala, but not seeing how to do this.
One of the data tables I'm working with contains a list of transactions, by account, silimar to the following example.
Account Date Type Amount
1001 2014-04-01 Purchase 100.00
1001 2014-04-01 Purchase 50.00
1001 2014-04-05 Purchase 70.00
1001 2014-04-01 Payment -150.00
1002 2014-04-01 Purchase 80.00
1002 2014-04-02 Purchase 22.00
1002 2014-04-04 Payment -120.00
1002 2014-04-04 Purchase 60.00
1003 2014-04-02 Purchase 210.00
1003 2014-04-03 Purchase 15.00
At least initially, most of the calculations will occur between the transactions within an account. So I would want to have the data partitioned so that all of the transactions for an account are in the same Spark partition.
But I'm not seeing a way to define this. The DataFrame class has a method called 'repartition(Int)', where you can specify the number of partitions to create. But I'm not seeing any method available to define a custom partitioner for a DataFrame, such as can be specified for an RDD.
The source data is stored in Parquet. I did see that when writing a DataFrame to Parquet, you can specify a column to partition by, so presumably I could tell Parquet to partition it's data by the 'Account' column. But there could be millions of accounts, and if I'm understanding Parquet correctly, it would create a distinct directory for each Account, so that didn't sound like a reasonable solution.
Is there a way to get Spark to partition this DataFrame so that all data for an Account is in the same partition?
Spark >= 2.3.0
SPARK-22614 exposes range partitioning.
val partitionedByRange = df.repartitionByRange(42, $"k")
partitionedByRange.explain
// == Parsed Logical Plan ==
// 'RepartitionByExpression ['k ASC NULLS FIRST], 42
// +- AnalysisBarrier Project [_1#2 AS k#5, _2#3 AS v#6]
//
// == Analyzed Logical Plan ==
// k: string, v: int
// RepartitionByExpression [k#5 ASC NULLS FIRST], 42
// +- Project [_1#2 AS k#5, _2#3 AS v#6]
// +- LocalRelation [_1#2, _2#3]
//
// == Optimized Logical Plan ==
// RepartitionByExpression [k#5 ASC NULLS FIRST], 42
// +- LocalRelation [k#5, v#6]
//
// == Physical Plan ==
// Exchange rangepartitioning(k#5 ASC NULLS FIRST, 42)
// +- LocalTableScan [k#5, v#6]
SPARK-22389 exposes external format partitioning in the Data Source API v2.
Spark >= 1.6.0
In Spark >= 1.6 it is possible to use partitioning by column for query and caching. See: SPARK-11410 and SPARK-4849 using repartition method:
val df = Seq(
("A", 1), ("B", 2), ("A", 3), ("C", 1)
).toDF("k", "v")
val partitioned = df.repartition($"k")
partitioned.explain
// scala> df.repartition($"k").explain(true)
// == Parsed Logical Plan ==
// 'RepartitionByExpression ['k], None
// +- Project [_1#5 AS k#7,_2#6 AS v#8]
// +- LogicalRDD [_1#5,_2#6], MapPartitionsRDD[3] at rddToDataFrameHolder at <console>:27
//
// == Analyzed Logical Plan ==
// k: string, v: int
// RepartitionByExpression [k#7], None
// +- Project [_1#5 AS k#7,_2#6 AS v#8]
// +- LogicalRDD [_1#5,_2#6], MapPartitionsRDD[3] at rddToDataFrameHolder at <console>:27
//
// == Optimized Logical Plan ==
// RepartitionByExpression [k#7], None
// +- Project [_1#5 AS k#7,_2#6 AS v#8]
// +- LogicalRDD [_1#5,_2#6], MapPartitionsRDD[3] at rddToDataFrameHolder at <console>:27
//
// == Physical Plan ==
// TungstenExchange hashpartitioning(k#7,200), None
// +- Project [_1#5 AS k#7,_2#6 AS v#8]
// +- Scan PhysicalRDD[_1#5,_2#6]
Unlike RDDs Spark Dataset (including Dataset[Row] a.k.a DataFrame) cannot use custom partitioner as for now. You can typically address that by creating an artificial partitioning column but it won't give you the same flexibility.
Spark < 1.6.0:
One thing you can do is to pre-partition input data before you create a DataFrame
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.HashPartitioner
val schema = StructType(Seq(
StructField("x", StringType, false),
StructField("y", LongType, false),
StructField("z", DoubleType, false)
))
val rdd = sc.parallelize(Seq(
Row("foo", 1L, 0.5), Row("bar", 0L, 0.0), Row("??", -1L, 2.0),
Row("foo", -1L, 0.0), Row("??", 3L, 0.6), Row("bar", -3L, 0.99)
))
val partitioner = new HashPartitioner(5)
val partitioned = rdd.map(r => (r.getString(0), r))
.partitionBy(partitioner)
.values
val df = sqlContext.createDataFrame(partitioned, schema)
Since DataFrame creation from an RDD requires only a simple map phase existing partition layout should be preserved*:
assert(df.rdd.partitions == partitioned.partitions)
The same way you can repartition existing DataFrame:
sqlContext.createDataFrame(
df.rdd.map(r => (r.getInt(1), r)).partitionBy(partitioner).values,
df.schema
)
So it looks like it is not impossible. The question remains if it make sense at all. I will argue that most of the time it doesn't:
Repartitioning is an expensive process. In a typical scenario most of the data has to be serialized, shuffled and deserialized. From the other hand number of operations which can benefit from a pre-partitioned data is relatively small and is further limited if internal API is not designed to leverage this property.
joins in some scenarios, but it would require an internal support,
window functions calls with matching partitioner. Same as above, limited to a single window definition. It is already partitioned internally though, so pre-partitioning may be redundant,
simple aggregations with GROUP BY - it is possible to reduce memory footprint of the temporary buffers**, but overall cost is much higher. More or less equivalent to groupByKey.mapValues(_.reduce) (current behavior) vs reduceByKey (pre-partitioning). Unlikely to be useful in practice.
data compression with SqlContext.cacheTable. Since it looks like it is using run length encoding, applying OrderedRDDFunctions.repartitionAndSortWithinPartitions could improve compression ratio.
Performance is highly dependent on a distribution of the keys. If it is skewed it will result in a suboptimal resource utilization. In the worst case scenario it will be impossible to finish the job at all.
A whole point of using a high level declarative API is to isolate yourself from a low level implementation details. As already mentioned by #dwysakowicz and #RomiKuntsman an optimization is a job of the Catalyst Optimizer. It is a pretty sophisticated beast and I really doubt you can easily improve on that without diving much deeper into its internals.
Related concepts
Partitioning with JDBC sources:
JDBC data sources support predicates argument. It can be used as follows:
sqlContext.read.jdbc(url, table, Array("foo = 1", "foo = 3"), props)
It creates a single JDBC partition per predicate. Keep in mind that if sets created using individual predicates are not disjoint you'll see duplicates in the resulting table.
partitionBy method in DataFrameWriter:
Spark DataFrameWriter provides partitionBy method which can be used to "partition" data on write. It separates data on write using provided set of columns
val df = Seq(
("foo", 1.0), ("bar", 2.0), ("foo", 1.5), ("bar", 2.6)
).toDF("k", "v")
df.write.partitionBy("k").json("/tmp/foo.json")
This enables predicate push down on read for queries based on key:
val df1 = sqlContext.read.schema(df.schema).json("/tmp/foo.json")
df1.where($"k" === "bar")
but it is not equivalent to DataFrame.repartition. In particular aggregations like:
val cnts = df1.groupBy($"k").sum()
will still require TungstenExchange:
cnts.explain
// == Physical Plan ==
// TungstenAggregate(key=[k#90], functions=[(sum(v#91),mode=Final,isDistinct=false)], output=[k#90,sum(v)#93])
// +- TungstenExchange hashpartitioning(k#90,200), None
// +- TungstenAggregate(key=[k#90], functions=[(sum(v#91),mode=Partial,isDistinct=false)], output=[k#90,sum#99])
// +- Scan JSONRelation[k#90,v#91] InputPaths: file:/tmp/foo.json
bucketBy method in DataFrameWriter (Spark >= 2.0):
bucketBy has similar applications as partitionBy but it is available only for tables (saveAsTable). Bucketing information can used to optimize joins:
// Temporarily disable broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
df.write.bucketBy(42, "k").saveAsTable("df1")
val df2 = Seq(("A", -1.0), ("B", 2.0)).toDF("k", "v2")
df2.write.bucketBy(42, "k").saveAsTable("df2")
// == Physical Plan ==
// *Project [k#41, v#42, v2#47]
// +- *SortMergeJoin [k#41], [k#46], Inner
// :- *Sort [k#41 ASC NULLS FIRST], false, 0
// : +- *Project [k#41, v#42]
// : +- *Filter isnotnull(k#41)
// : +- *FileScan parquet default.df1[k#41,v#42] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/spark-warehouse/df1], PartitionFilters: [], PushedFilters: [IsNotNull(k)], ReadSchema: struct<k:string,v:int>
// +- *Sort [k#46 ASC NULLS FIRST], false, 0
// +- *Project [k#46, v2#47]
// +- *Filter isnotnull(k#46)
// +- *FileScan parquet default.df2[k#46,v2#47] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/spark-warehouse/df2], PartitionFilters: [], PushedFilters: [IsNotNull(k)], ReadSchema: struct<k:string,v2:double>
* By partition layout I mean only a data distribution. partitioned RDD has no longer a partitioner.
** Assuming no early projection. If aggregation covers only small subset of columns there is probably no gain whatsoever.
In Spark < 1.6 If you create a HiveContext, not the plain old SqlContext you can use the HiveQL DISTRIBUTE BY colX... (ensures each of N reducers gets non-overlapping ranges of x) & CLUSTER BY colX... (shortcut for Distribute By and Sort By) for example;
df.registerTempTable("partitionMe")
hiveCtx.sql("select * from partitionMe DISTRIBUTE BY accountId SORT BY accountId, date")
Not sure how this fits in with Spark DF api. These keywords aren't supported in the normal SqlContext (note you dont need to have a hive meta store to use the HiveContext)
EDIT: Spark 1.6+ now has this in the native DataFrame API
So to start with some kind of answer : ) - You can't
I am not an expert, but as far as I understand DataFrames, they are not equal to rdd and DataFrame has no such thing as Partitioner.
Generally DataFrame's idea is to provide another level of abstraction that handles such problems itself. The queries on DataFrame are translated into logical plan that is further translated to operations on RDDs. The partitioning you suggested will probably be applied automatically or at least should be.
If you don't trust SparkSQL that it will provide some kind of optimal job, you can always transform DataFrame to RDD[Row] as suggested in of the comments.
Use the DataFrame returned by:
yourDF.orderBy(account)
There is no explicit way to use partitionBy on a DataFrame, only on a PairRDD, but when you sort a DataFrame, it will use that in it's LogicalPlan and that will help when you need to make calculations on each Account.
I just stumbled upon the same exact issue, with a dataframe that I want to partition by account.
I assume that when you say "want to have the data partitioned so that all of the transactions for an account are in the same Spark partition", you want it for scale and performance, but your code doesn't depend on it (like using mapPartitions() etc), right?
I was able to do this using RDD. But I don't know if this is an acceptable solution for you.
Once you have the DF available as an RDD, you can apply repartitionAndSortWithinPartitions to perform custom repartitioning of data.
Here is a sample I used:
class DatePartitioner(partitions: Int) extends Partitioner {
override def getPartition(key: Any): Int = {
val start_time: Long = key.asInstanceOf[Long]
Objects.hash(Array(start_time)) % partitions
}
override def numPartitions: Int = partitions
}
myRDD
.repartitionAndSortWithinPartitions(new DatePartitioner(24))
.map { v => v._2 }
.toDF()
.write.mode(SaveMode.Overwrite)