spark.sql.shuffle.partitions local spark performance behavior - scala
I recently came across a strange performance issue with spark while local testing that turned out to be related to the number of shuffle partitions. I found this quip on the readme for "spark-fast-tests":
It's best to set the number of shuffle partitions to a small number like one or four in your test suite. This configuration can make your tests run up to 70% faster. You can remove this configuration option or adjust it if you're working with big DataFrames in your test suite.
But, I'd like to know... WHY?
So much so that I've gone to the trouble of reproducing the issue (obfuscating quite a lot of business-domain case classes using this gist).
The test below runs in ~10s on my Mac locally, using a fairly vanilla SparkSession:
lazy val spark: SparkSession =
  SparkSession
    .builder()
    .appName("Test-Driver")
    .master("local[2]")
    .config("spark.sql.warehouse.dir", TempWarehouseDir.toString)
    .getOrCreate()
That is with the shuffle setting at 1.
However, if I change the shuffle setting to something a cluster might have, say 200, performance drops to nearly a minute:
spark.sqlContext.setConf("spark.sql.shuffle.partitions", "200")
Does anyone know what is going on here? Why would increasing the shuffle partitions cause the performance to drop so significantly locally?
Granted, the domain classes are large, but I don't think that fully explains why the test behaves this way.
Here is the test code:
"list join df takes a long time" in {
spark.sqlContext.setConf("spark.sql.shuffle.partitions", "200")
val withList =
Seq(
("key1", Seq(MyBigDomainClass(None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None))),
)
.toDF("key1", "values").as[(String, List[MyBigDomainClass])]
val withoutList =
Seq(
("key1", 1),
("key2", 2)
).toDF("key1", "value").as[(String, Int)]
var start = System.currentTimeMillis()
val joined = withoutList.joinWith(withList, withList("key1") === withoutList("key1"), "full")
joined.show
println(s"[join] elapsed: ${(System.currentTimeMillis() - start) / 1000}s")
start = System.currentTimeMillis()
joined.map {
case (a, b) => (Option(a), Option(b).fold(List.empty[MyBigDomainClass])(_._2))
}.show
println(s"[map] elapsed: ${(System.currentTimeMillis() - start) / 1000}s")
}
And the domain classes:
package com.edmunds.bde.dataservices.imt.inventory.pipeline.job
case class TestClass_2(field_80:Option[String], field_81:Option[String], field_82:Option[Int])
case class TestClass_3(field_84:Option[Int], field_85:Option[Int], field_86:Option[Int])
case class TestClass_4(field_90:Option[String], field_91:Option[String], field_92:Option[String], field_93:Option[Double], field_94:Option[Double])
case class TestClass_5(field_96:Option[String], field_97:Option[String], field_98:Option[String], field_99:Option[Double], field_100:Option[String], field_101:Option[Int], field_102:Option[String], field_103:Option[String], field_104:Option[String], field_105:Option[Int], field_106:Option[Int], field_107:Option[Int], field_108:Option[Int])
case class TestClass_6(field_111:Option[String], field_112:Option[String], field_113:Option[String], field_114:Option[String], field_115:Option[String], field_116:Option[String], field_117:Option[String], field_118:Option[String], field_119:Option[String])
case class TestClass_7(field_121:Option[String], field_122:Option[String], field_123:Option[String], field_124:Option[String], field_125:Option[String], field_126:Option[String], field_127:Option[String], field_128:Option[String], field_129:Option[String])
case class TestClass_8(field_131:Option[String], field_132:Option[String], field_133:Option[String], field_134:Option[String], field_135:Option[String], field_136:Option[String], field_137:Option[String], field_138:Option[String], field_139:Option[String])
case class TestClass_9(field_141:Option[Long], field_142:Option[String], field_143:Option[String], field_144:Option[String], field_145:Option[Long], field_146:Option[Long])
case class TestClass_10(field_150:Option[Long], field_151:Option[String], field_152:Option[String], field_153:Option[String], field_154:Option[Seq[String]])
case class TestClass_1(field_70:Option[Long], field_71:Option[String], field_72:Option[String], field_73:Option[Long], field_74:Option[String], field_75:Option[String], field_76:Option[String], field_77:Option[String], field_78:Option[String], field_82:Option[TestClass_2], field_86:Option[TestClass_3], field_87:Option[Double], field_88:Option[Double], field_94:Option[Seq[TestClass_4]], field_108:Option[TestClass_5], field_109:Option[String], field_119:Option[TestClass_6], field_129:Option[TestClass_7], field_139:Option[TestClass_8], field_146:Option[Seq[TestClass_9]], field_147:Option[Seq[String]], field_148:Option[Seq[String]], field_154:Option[Seq[TestClass_10]])
case class TestClass_12(field_157:Option[Double], field_158:Option[Double], field_159:Option[Double], field_160:Option[Double], field_161:Option[Double], field_162:Option[java.math.BigDecimal], field_163:Option[java.math.BigDecimal], field_164:Option[Double], field_165:Option[Double])
case class TestClass_11(field_165:Option[TestClass_12], field_166:Option[Long], field_167:Option[scala.collection.Map[String, String]])
case class TestClass_14(field_170:Option[Double], field_171:Option[Double], field_172:Option[String])
case class TestClass_15(field_174:Option[Double], field_175:Option[Double], field_176:Option[Double], field_177:Option[Double], field_178:Option[Double], field_179:Option[Double], field_180:Option[Double], field_181:Option[Double], field_182:Option[Double], field_183:Option[Double], field_184:Option[Double], field_185:Option[Double], field_186:Option[Double], field_187:Option[Int], field_188:Option[Long], field_189:Option[Long], field_190:Option[Long], field_191:Option[Long])
case class TestClass_16(field_193:Option[Double], field_194:Option[Double], field_195:Option[Double], field_196:Option[Double], field_197:Option[Double], field_198:Option[Double])
case class TestClass_17(field_200:Option[java.math.BigDecimal], field_201:Option[Double], field_202:Option[java.math.BigDecimal], field_203:Option[Int])
case class TestClass_19(field_211:Option[Int], field_212:Option[String], field_213:Option[Double], field_214:Option[Int], field_215:Option[Double], field_216:Option[Int], field_217:Option[Double], field_218:Option[Int], field_219:Option[Int], field_220:Option[Int], field_221:Option[Int], field_222:Option[String], field_223:Option[java.sql.Date], field_224:Option[Int], field_225:Option[Int], field_226:Option[Int], field_227:Option[Int], field_228:Option[String])
case class TestClass_18(field_205:Option[Double], field_206:Option[Double], field_207:Option[Double], field_208:Option[Double], field_209:Option[String], field_228:Option[TestClass_19])
case class TestClass_20(field_230:Option[java.sql.Timestamp], field_231:Option[Long], field_232:Option[String], field_233:Option[String], field_234:Option[String], field_235:Option[java.sql.Timestamp], field_236:Option[java.sql.Timestamp], field_237:Option[Double], field_238:Option[Int], field_239:Option[Int], field_240:Option[Boolean], field_241:Option[Int], field_242:Option[Int], field_243:Option[Double], field_244:Option[Long], field_245:Option[String], field_246:Option[java.sql.Timestamp], field_247:Option[String])
case class TestClass_21(field_249:Option[java.sql.Timestamp], field_250:Option[Long], field_251:Option[String], field_252:Option[String], field_253:Option[String], field_254:Option[java.sql.Timestamp], field_255:Option[java.sql.Timestamp], field_256:Option[Double], field_257:Option[Int], field_258:Option[Int], field_259:Option[Boolean], field_260:Option[Int], field_261:Option[Int], field_262:Option[Double], field_263:Option[Long], field_264:Option[String], field_265:Option[java.sql.Timestamp], field_266:Option[String])
case class TestClass_13(field_172:Option[TestClass_14], field_191:Option[TestClass_15], field_198:Option[TestClass_16], field_203:Option[TestClass_17], field_228:Option[TestClass_18], field_247:Option[Seq[TestClass_20]], field_266:Option[Seq[TestClass_21]], field_267:Option[java.math.BigDecimal])
case class TestClass_22(field_269:Option[String], field_270:Option[String], field_271:Option[String], field_272:Option[String], field_273:Option[Double], field_274:Option[String])
case class TestClass_23(field_277:Option[Int], field_278:Option[Boolean], field_279:Option[Int], field_280:Option[Boolean], field_281:Option[Boolean], field_282:Option[Boolean], field_283:Option[Boolean], field_284:Option[Boolean], field_285:Option[Boolean], field_286:Option[String], field_287:Option[String], field_288:Option[String], field_289:Option[Boolean], field_290:Option[Boolean])
case class TestClass_25(field_293:Option[Boolean], field_294:Option[Boolean], field_295:Option[String], field_296:Option[String])
case class TestClass_26(field_298:Option[Boolean], field_299:Option[Boolean], field_300:Option[String], field_301:Option[String])
case class TestClass_27(field_303:Option[Boolean], field_304:Option[Boolean], field_305:Option[String], field_306:Option[String])
case class TestClass_24(field_296:Option[TestClass_25], field_301:Option[TestClass_26], field_306:Option[TestClass_27])
case class TestClass_28(field_311:Option[Long], field_312:Option[Long], field_313:Option[Boolean], field_314:Option[Int], field_315:Option[String], field_316:Option[String], field_317:Option[Boolean], field_318:Option[Boolean], field_319:Option[Boolean])
case class MyBigDomainClass(field_1:Option[String], field_2:Option[String], field_3:Option[String], field_4:Option[String], field_5:Option[java.sql.Timestamp], field_6:Option[java.sql.Date], field_7:Option[String], field_8:Option[String], field_9:Option[String], field_10:Option[String], field_11:Option[Int], field_12:Option[String], field_13:Option[String], field_14:Option[String], field_15:Option[String], field_16:Option[String], field_17:Option[String], field_18:Option[Double], field_19:Option[Double], field_20:Option[Double], field_21:Option[Double], field_22:Option[Double], field_23:Option[Double], field_24:Option[Double], field_25:Option[Double], field_26:Option[Double], field_27:Option[Double], field_28:Option[Double], field_29:Option[Double], field_30:Option[String], field_31:Option[String], field_32:Option[String], field_33:Option[String], field_34:Option[String], field_35:Option[String], field_36:Option[String], field_37:Option[String], field_38:Option[String], field_39:Option[String], field_40:Option[String], field_41:Option[String], field_42:Option[String], field_43:Option[String], field_44:Option[String], field_45:Option[String], field_46:Option[String], field_47:Option[Int], field_48:Option[Int], field_49:Option[java.sql.Date], field_50:Option[java.sql.Date], field_51:Option[java.sql.Date], field_52:Option[java.sql.Date], field_53:Option[String], field_54:Option[String], field_55:Option[Int], field_56:Option[java.sql.Date], field_57:Option[String], field_58:Option[String], field_59:Option[String], field_60:Option[String], field_61:Option[String], field_62:Option[String], field_63:Option[String], field_64:Option[Boolean], field_65:Option[scala.collection.Map[String, String]], field_66:Option[Int], field_67:Option[Int], field_68:Option[String], field_154:Option[TestClass_1], field_167:Option[TestClass_11], field_267:Option[TestClass_13], field_274:Option[Seq[TestClass_22]], field_275:Option[Int], field_290:Option[TestClass_23], field_306:Option[TestClass_24], field_307:Option[Int], field_308:Option[Boolean], field_309:Option[Boolean], field_319:Option[TestClass_28], field_320:Option[java.sql.Timestamp], field_321:Option[java.sql.Date])
I had exactly the same issue on one of my previous projects. When the number of shuffle partitions is 200, every shuffle operation (join, group by, etc.) creates 200 partitions in your dataset. Since you have just 2 worker threads, those 200 partitions are processed essentially sequentially (no more than 2 partitions at a time).
Every partition carries some fixed overhead. Say you have 1000 records in total. Split into 200 partitions, that is 5 records per partition, so the total cost is the work for 1000 records plus the overhead of 200 partitions. Because each partition is so small (just 5 records), the time spent processing the data in a partition is less than the overhead incurred by having the partition at all.
The general rule of thumb is to have ~2 partitions per core.
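For illustration only (this is my own sketch, not part of the original answer): one way to apply that rule of thumb in a local test suite is to derive the shuffle partition count from the number of cores when building the session. TempWarehouseDir is reused from the question; the multiplier is just the ~2-per-core heuristic.

import org.apache.spark.sql.SparkSession

lazy val spark: SparkSession = {
  val cores = Runtime.getRuntime.availableProcessors()   // threads available to local[*]
  SparkSession
    .builder()
    .appName("Test-Driver")
    .master(s"local[$cores]")
    // keep shuffle partitions near ~2x the core count so per-task overhead
    // does not dominate tiny test datasets
    .config("spark.sql.shuffle.partitions", (cores * 2).toString)
    .config("spark.sql.warehouse.dir", TempWarehouseDir.toString)
    .getOrCreate()
}

For small test DataFrames, hard-coding the value to "1" or "4", as the quoted README suggests, works just as well.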
Adding to https://stackoverflow.com/a/72801534/4278032, the difference you experienced (10 secs using 4 partitions vs ~60 secs with 200 partitions) seems way bigger than anything I've seen before.
Running your code on my machine shows results more in line with what I would expect (for Spark 2.4). Most interesting is that setting shuffle partitions for tests no longer seems to be required for Spark 3.2 / 3.3.
Using 200 partitions (Spark 3.3.0):
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|{key1, 1}|{key1, [{null, nu...|
|{key2, 2}| null|
+---------+--------------------+
[join] elapsed: 2s
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|{key1, 1}|[{null, null, nul...|
|{key2, 2}| []|
+---------+--------------------+
[map] elapsed: 2s
Using 4 partitions (Spark 3.3.0):
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|{key1, 1}|{key1, [{null, nu...|
|{key2, 2}| null|
+---------+--------------------+
[join] elapsed: 2s
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|{key1, 1}|[{null, null, nul...|
|{key2, 2}| []|
+---------+--------------------+
[map] elapsed: 2s
Using 200 partitions (Spark 2.4.8):
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|[key1, 1]|[key1, [[,,,,,,,,...|
|[key2, 2]| null|
+---------+--------------------+
[join] elapsed: 5s
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|[key1, 1]|[[,,,,,,,,,,,,,,,...|
|[key2, 2]| []|
+---------+--------------------+
[map] elapsed: 17s
Using 4 partitions (Spark 2.4.8):
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|[key1, 1]|[key1, [[,,,,,,,,...|
|[key2, 2]| null|
+---------+--------------------+
[join] elapsed: 2s
+---------+--------------------+
| _1| _2|
+---------+--------------------+
|[key1, 1]|[[,,,,,,,,,,,,,,,...|
|[key2, 2]| []|
+---------+--------------------+
[map] elapsed: 3s
However, when the log level is reduced to DEBUG, execution time on Spark 2.4 really explodes, purely because of the overhead of logging for all of these empty partitions (and the constant synchronisation between the two threads).
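A plausible explanation for the flat Spark 3.3 timings (my reading, not something stated above) is adaptive query execution, which is enabled by default since Spark 3.2 and coalesces tiny shuffle partitions at runtime. A minimal sketch of the relevant settings, reusing the question's session:

// Both settings default to true from Spark 3.2; AQE is not on by default in 2.4.
spark.conf.get("spark.sql.adaptive.enabled")
spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled")

// To reproduce the old behaviour (all 200 shuffle partitions run as tasks):
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "200")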
Related
Spark Combining Disparate rate Dataframes in Time
Using Spark and Scala, I have two DataFrames with data values. I'm trying to accomplish something that, when processed serially, would be trivial, but when processed in a cluster seems daunting. Let's say I have two sets of values. One of them is very regular:

Relative Time | Value1
10            | 1
20            | 2
30            | 3

And I want to combine it with another value that is very irregular:

Relative Time | Value2
1             | 100
22            | 200

And get this (driven by Value1):

Relative Time | Value1 | Value2
10            | 1      | 100
20            | 2      | 100
30            | 3      | 200

Note: there are a few scenarios here. One of them is that Value1 is a massive DataFrame and Value2 only has a few hundred values. The other scenario is that they're both massive. Also note: I depict Value2 as being very slow, and it might be, but it could also be much faster than Value1, so I may have 10 or 100 values of Value2 before my next value of Value1, and I'd want the latest. Because of this, doing a union of them and windowing it doesn't seem practical. How would I accomplish this in Spark?
I think you can do a full outer join between the two tables, then use the last function to look back for the closest value of value2:

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val df1 = spark.sparkContext.parallelize(Seq(
  (10, 1), (20, 2), (30, 3)
)).toDF("Relative Time", "value1")

val df2 = spark.sparkContext.parallelize(Seq(
  (1, 100), (22, 200)
)).toDF("Relative Time", "value2_temp")

val df = df1.join(df2, Seq("Relative Time"), "outer")
val window = Window.orderBy("Relative Time")

val result = df
  .withColumn("value2", last($"value2_temp", ignoreNulls = true).over(window))
  .filter($"value1".isNotNull)
  .drop("value2_temp")

result.show()
+-------------+------+------+
|Relative Time|value1|value2|
+-------------+------+------+
|           10|     1|   100|
|           20|     2|   100|
|           30|     3|   200|
+-------------+------+------+
Dynamic dataframe with n columns and m rows
I'm reading data from JSON (dynamic schema) and loading it into a dataframe.

Example dataframe:

scala> import spark.implicits._
import spark.implicits._

scala> val DF = Seq(
  (1, "ABC"),
  (2, "DEF"),
  (3, "GHIJ")
).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> DF.show
+---+----+
| id|word|
+---+----+
|  1| ABC|
|  2| DEF|
|  3|GHIJ|
+---+----+

Requirement: the column count and names can be anything. I want to read the rows in a loop and fetch each column one by one; I need both the column name and the value so I can process them in subsequent flows. I'm using Scala. In Python this would be:

for i, j in df.iterrows():
    print(i, j)

I need the same functionality in Scala, with the column name and value fetched separately. Kindly help.
df.iterrows is not from PySpark, but from pandas. In Spark, you can use foreach:

import org.apache.spark.sql.Row

DF.foreach {
  case Row(id: Int, word: String) => println(id, word)
}

Result:

(2,DEF)
(3,GHIJ)
(1,ABC)

If you don't know the number of columns, you cannot use unapply on Row, so just do:

DF.foreach(row => println(row))

Result:

[1,ABC]
[2,DEF]
[3,GHIJ]

and operate on each row using its methods (getAs, etc.).
Dataframe replace each row null values with unique epoch time
I have 3 rows in a dataframe, and in 2 of those rows the column id is null. I need to loop through each row for that specific column id and replace the nulls with an epoch time, which should be unique, and it should happen in the dataframe itself. How can it be done?

For example:

id      | name
1       | a
null    | b
null    | c

I want a dataframe which converts the nulls to epoch time:

id      | name
1       | a
1435232 | b
1542344 | c
df.select(
  when($"id".isNull, /* epoch time */).otherwise($"id").alias("id"),
  $"name"
)

EDIT

You need to make sure the UDF is precise enough - if it only has millisecond resolution you will see duplicate values. See my example below, which clearly illustrates that the approach works:

scala> def rand(s: String): Double = Math.random
rand: (s: String)Double

scala> val udfF = udf(rand(_: String))
udfF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(StringType)))

scala> res11.select(when($"id".isNull, udfF($"id")).otherwise($"id").alias("id"), $"name").collect
res21: Array[org.apache.spark.sql.Row] = Array([0.6668195187088702,a], [0.920625293516218,b])
Check this out:

scala> val s1: Seq[(Option[Int], String)] = Seq((Some(1), "a"), (null, "b"), (null, "c"))
s1: Seq[(Option[Int], String)] = List((Some(1),a), (null,b), (null,c))

scala> val df = s1.toDF("id", "name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> val epoch = java.time.Instant.now.getEpochSecond
epoch: Long = 1539084285

scala> df.withColumn("id", when($"id".isNull, epoch).otherwise($"id")).show
+----------+----+
|        id|name|
+----------+----+
|         1|   a|
|1539084285|   b|
|1539084285|   c|
+----------+----+

EDIT1: I used milliseconds and still got the same values. Spark doesn't capture nanoseconds in the time portion, so many rows can end up with the same millisecond value; your assumption of getting unique values based on the epoch would therefore not hold.

scala> def getEpoch(x: String): Long = java.time.Instant.now.toEpochMilli
getEpoch: (x: String)Long

scala> val myudfepoch = udf(getEpoch(_: String): Long)
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))

scala> df.withColumn("id", when($"id".isNull, myudfepoch('name)).otherwise($"id")).show
+-------------+----+
|           id|name|
+-------------+----+
|            1|   a|
|1539087300957|   b|
|1539087300957|   c|
+-------------+----+

The only other possibility is to use monotonicallyIncreasingId, but those values may not always have the same length:

scala> df.withColumn("id", when($"id".isNull, myudfepoch('name) + monotonicallyIncreasingId).otherwise($"id")).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-------------+----+
|           id|name|
+-------------+----+
|            1|   a|
|1539090186541|   b|
|1539090186543|   c|
+-------------+----+

EDIT2: I'm able to trick System.nanoTime into producing increasing ids; they will not be sequential, but the length can be kept constant. See below:

scala> def getEpoch(x: String): String = System.nanoTime.toString.take(12)
getEpoch: (x: String)String

scala> val myudfepoch = udf(getEpoch(_: String): String)
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.withColumn("id", when($"id".isNull, myudfepoch('name)).otherwise($"id")).show
+------------+----+
|          id|name|
+------------+----+
|           1|   a|
|186127230392|   b|
|186127230399|   c|
+------------+----+

Try this when running on a cluster, and adjust the take(12) if you get duplicate values.
Spark: Is "count" on Grouped Data a Transformation or an Action?
I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following:

scala> val empDF = Seq(
  (1, "James Gordon", 30, "Homicide"),
  (2, "Harvey Bullock", 35, "Homicide"),
  (3, "Kristen Kringle", 28, "Records"),
  (4, "Edward Nygma", 30, "Forensics"),
  (5, "Leslie Thompkins", 31, "Forensics")
).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]

scala> empDF.show
+---+----------------+---+----------+
| id|            name|age|department|
+---+----------------+---+----------+
|  1|    James Gordon| 30|  Homicide|
|  2|  Harvey Bullock| 35|  Homicide|
|  3| Kristen Kringle| 28|   Records|
|  4|    Edward Nygma| 30| Forensics|
|  5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+

scala> empDF.groupBy("department").count // count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]

scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
|  Homicide|    2|
|   Records|    1|
| Forensics|    2|
+----------+-----+

When I called count on GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count; instead, they started when I ran res1.show. I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?
The .count() you have used in your code is on a RelationalGroupedDataset, and it creates a new column with the count of elements in each group. This is a transformation. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset

The .count() that you normally use on an RDD/DataFrame/Dataset is completely different from the above; that .count() is an action. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD

EDIT: always use .count() with .agg() while operating on a grouped dataset, to avoid confusion in the future:

empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
Case 1: You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, it's an action on an RDD.

rdd.count // returns a Long value

Case 2: If you call count on a DataFrame, it initiates the DAG execution and returns the data to the driver, so it's an action on a DataFrame.

df.count // returns a Long value

Case 3: In your case you are calling groupBy on a DataFrame, which returns a RelationalGroupedDataset object, and you are calling count on that grouped dataset, which returns a DataFrame - so it's a transformation, since it doesn't bring the data to the driver and doesn't initiate the DAG execution.

df.groupBy("department") // returns RelationalGroupedDataset
  .count                 // returns a DataFrame, so a transformation
  .count                 // returns a Long value; called on a DF, so an action
As you've already figured out, if a method returns a distributed object (Dataset or RDD) it can be qualified as a transformation. However, these distinctions are much better suited to RDDs than Datasets. The latter feature an optimizer, including the recently added cost-based optimizer, and may be much less lazy than the old API, blurring the difference between transformation and action in some cases. Here, however, it is safe to say count is a transformation.
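To see the distinction for yourself, here is a minimal sketch (mine, not from the answers above) reusing the empDF from the question: the grouped count only builds a plan, and nothing executes until an action such as show, collect or the plain Dataset count is called.

val grouped = empDF.groupBy("department").count // transformation: returns a DataFrame, no job runs yet
grouped.explain()                               // prints the (still unexecuted) physical plan
val rows: Long = grouped.count()                // Dataset.count(): an action, this triggers the job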
Spark2 Dataframe/RDD process in groups
I have the following table stored in Hive, called ExampleData:

+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
|1       |10:00| 20|
|1       |11:00| 21|
|2       |10:00| 24|
|2       |11:00| 24|
|2       |12:00| 20|
|3       |11:00| 24|
+--------+-----+---+

I need to be able to process the data by site. Unfortunately, partitioning it by site doesn't work (there are over 100k sites, all with fairly small amounts of data).

For each site, I need to select the Time column and the Age column separately and use them to feed into a function (which ideally I want to run on the executors, not the driver).

I've got a stub of how I think I want it to work, but this solution would only run on the driver, so it's very slow. I need to find a way of writing it so it will run at the executor level:

// fetch a list of distinct sites and return them to the driver
// (if you don't, you won't be able to loop around them as they're not on the executors)
val distinctSites = spark.sql("SELECT site_id FROM ExampleData GROUP BY site_id LIMIT 10")
  .collect

val allSiteData = spark.sql("SELECT site_id, time, age FROM ExampleData")

distinctSites.foreach(row => {
  allSiteData.filter("site_id = " + row.get(0))
  val times = allSiteData.select("time").collect()
  val ages = allSiteData.select("ages").collect()
  processTimesAndAges(times, ages)
})

def processTimesAndAges(times: Array[Row], ages: Array[Row]) {
  // do some processing
}

I've tried broadcasting the distinctSites across all nodes, but this did not prove fruitful. This seems such a simple concept, and yet I have spent a couple of days looking into it. I'm very new to Scala/Spark, so apologies if this is a ridiculous question! Any suggestions or tips are greatly appreciated.
The RDD API provides a number of functions which can be used to perform operations in groups, starting with the low-level repartition / repartitionAndSortWithinPartitions and ending with the various *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).

Example:

rdd
  .map(tup => ((tup._1, tup._2, tup._3), tup))
  .groupByKey()
  .foreachPartition(iter => doSomeJob(iter))

With DataFrames you can use aggregate functions; the GroupedData class provides a number of methods for the most common aggregations, including count, max, min, mean and sum.

Example:

val df = sc.parallelize(Seq(
  (1, 10.3, 10), (1, 11.5, 10), (2, 12.6, 20), (3, 2.6, 30)
)).toDF("Site_ID ", "Time ", "Age")

df.show()
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
|       1| 10.3| 10|
|       1| 11.5| 10|
|       2| 12.6| 20|
|       3|  2.6| 30|
+--------+-----+---+

df.groupBy($"Site_ID ").count.show
+--------+-----+
|Site_ID |count|
+--------+-----+
|       1|    2|
|       3|    1|
|       2|    1|
+--------+-----+

Note: as you have mentioned that the solution is very slow, you need to use partitioning; in your case, range partitioning is a good option.

http://dev.sortable.com/spark-repartition/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
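If the per-site processing really has to run on the executors, as the question above asks, a typed groupByKey/mapGroups approach along these lines may help. This is an illustration only: SiteRecord and the simplified processTimesAndAges signature are hypothetical, and it assumes each site's rows fit comfortably in memory, which the question suggests they do.

import spark.implicits._

// Hypothetical row type matching the ExampleData table
case class SiteRecord(site_id: Long, time: String, age: Int)

// Hypothetical per-site function; it runs on the executors inside mapGroups
def processTimesAndAges(times: Seq[String], ages: Seq[Int]): String =
  s"processed ${times.size} rows"

val siteData = spark.table("ExampleData").as[SiteRecord]

val perSiteResults = siteData
  .groupByKey(_.site_id)          // shuffles so all rows for a site land in one group
  .mapGroups { (siteId, rows) =>
    val buffered = rows.toSeq     // safe here because each site has little data
    (siteId, processTimesAndAges(buffered.map(_.time), buffered.map(_.age)))
  }

perSiteResults.show()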