How can I parallelize different SparkSQL executions efficiently? - scala

Environment
Scala
Apache Spark: Spark 2.2.1
EMR on AWS: emr-5.12.1
Content
I have one large DataFrame, like below:
val df = spark.read.option("basePath", "s3://some_bucket/").json("s3://some_bucket/group_id=*/")
There are roughly 1 TB of JSON files at s3://some_bucket, split into 5000 partitions by group_id.
I want to run a conversion with SparkSQL, and the conversion differs for each group_id.
The Spark code is like below:
// Create view
val df = spark.read.option("basePath", "s3://data_lake/").json("s3://data_lake/group_id=*/")
df.createOrReplaceTempView("lakeView")
// one of queries like this:
// SELECT
// col1 as userId,
// col2 as userName,
// .....
// FROM
// lakeView
// WHERE
// group_id = xxx;
val queries: Seq[String] = getGroupIdMapping
// ** Want to know better ways **
queries.par.foreach(query => {
  val convertedDF: DataFrame = spark.sql(query)
  convertedDF.write.save("s3://another_bucket/")
})
par parallelizes with up to Runtime.getRuntime.availableProcessors threads, which equals the number of the driver's cores.
But this seems awkward and not efficient enough, because it has nothing to do with Spark's own parallelization.
I really want to do something like groupBy in scala.collection.Seq.
This is not valid Spark code:
df.groupBy(groupId).foreach((groupId, parDF) => {
  parDF.createOrReplaceTempView("lakeView")
  val convertedDF: DataFrame = spark.sql(queryByGroupId)
  convertedDF.write.save("s3://another_bucket")
})

1) First of all, if your data is already stored in one directory per group id, there is no reason to mix it all up and then group by id in Spark.
It's much simpler and more efficient to load only the relevant files for each group id, as in the sketch below.
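A minimal sketch of that idea; groupIdToQuery (a map from group id to its query) and the per-group output prefix are hypothetical names, not from the question:
groupIdToQuery.foreach { case (groupId, query) =>
  // basePath keeps the group_id partition column visible, so the existing
  // per-group queries (WHERE group_id = xxx) still work unchanged.
  val groupDF = spark.read.option("basePath", "s3://data_lake/")
    .json(s"s3://data_lake/group_id=$groupId/")
  groupDF.createOrReplaceTempView("lakeView")
  spark.sql(query).write.save(s"s3://another_bucket/group_id=$groupId/")
}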
2) Spark itself parallelizes the computation, so in most cases there is no need for external parallelization.
But if you feel that Spark doesn't utilize all resources, you can consider two cases:
a) If each individual computation takes less than a few seconds, the task scheduling overhead is comparable to the task execution time, so it's possible to get a boost by running a few tasks in parallel.
b) The computation takes a significant amount of time, but resources are still underutilized. Then most probably you should increase the number of partitions of your dataset (see the sketch below).
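For case b), a minimal sketch of raising the partition count (2000 is an arbitrary placeholder, not a recommendation for this workload):
// More partitions means more tasks Spark can schedule concurrently.
val repartitionedDF = df.repartition(2000)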
3) If you finally decide to run several tasks in parallel, it can be achieved this way:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
// make the pool-backed context implicit so Future.apply and Future.sequence pick it up
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)

val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map { query =>
  Future {
    // spark stuff here, e.g. spark.sql(query).write.save(...)
    0
  }
}

val allDone: Future[Seq[Int]] = Future.sequence(results)
// wait for results
Await.result(allDone, Duration.Inf)
executor.shutdown() // otherwise the JVM will probably not exit
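If the concurrent jobs should also share executors fairly instead of queueing FIFO, Spark's fair scheduler pools can be assigned per thread; a hedged sketch (the pool name is hypothetical and spark.scheduler.mode=FAIR must be configured):
// Inside each Future, before issuing the query, assign this thread's jobs to a pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "groupIdPool")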

Related

How to put all the TreeNodes from LogicalPlan in the list

I am working on a problem where Spark 3.0.1 truncates my execution plan when I apply the toJSON method to it:
//spark 3.0.1
val myDf: DataFrame = ???
val myPersistDf = myDf.persist
val plan: LogicalPlan = myPersistDf.queryExecution.optimizedPlan
//toJSON method cuts down my plan
val jsonPlan = plan.toJSON
My execution plan is being truncated, so I want to work around this problem. The logical plan is stored in Spark as a tree of TreeNode objects. Is there a way to get a list of all tree nodes of the execution plan?
For example, such a list:
Seq[TreeNode[_]]
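One possible way to enumerate the nodes, as a hedged sketch, is Catalyst's TreeNode.collect, which walks the whole tree and returns every node the partial function matches:
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Matching every node yields the complete list of tree nodes of the plan.
val allNodes: Seq[LogicalPlan] = plan.collect { case node => node }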

Spark Repeatable/Deterministic Results [duplicate]

This question already has answers here:
Why does df.limit keep changing in Pyspark?
I'm running the Spark code below (basically created as an MVE), which does a:
Read parquet and limit
Partition by
Join
Filter
I'm struggling to understand why I get a different number of rows in the joined dataframe (i.e. the dataframe after step 3 above) each time I run the application. Why is this happening?
The reason I think that shouldn't be happening is that the limit is deterministic so each time the same rows should be in the partitioned dataframe, albeit in a different order. In the join I am joining on the field that the partition was done on. I am expecting to have every combination of pairs within a partition, but I think this should equate to the same number each time.
def main(args: Array[String]) {
  val maxRows = args(0)
  val spark = SparkSession.builder.getOrCreate()

  val windowSpec = Window.partitionBy("epoch_1min").orderBy("epoch")
  val data = spark.read.parquet("srcfile.parquet").limit(maxRows.toInt)
  val partitionDf = data.withColumn("row", row_number().over(windowSpec))
  partitionDf.persist(StorageLevel.MEMORY_ONLY)
  logger.debug(s"${partitionDf.count()} rows in partitioned data")

  val dfOrig = partitionDf.withColumnRenamed("epoch_1min", "epoch_1min_orig").withColumnRenamed("row", "row_orig")
  val dfDest = partitionDf.withColumnRenamed("epoch_1min", "epoch_1min_dest").withColumnRenamed("row", "row_dest")

  val joined = dfOrig.join(dfDest, dfOrig("epoch_1min_orig") === dfDest("epoch_1min_dest"), "inner")
  logger.debug(s"Rows in joined dataframe ${joined.count()}")

  val filtered = joined.filter(col("row_orig") < col("row_dest"))
  logger.debug(s"Rows in filtered dataframe ${filtered.count()}")
}
There could be underlying data changes if you start a new application.
Otherwise, with Spark SQL, just as with ANSI SQL on an RDBMS, there is no guaranteed ordering of data when ORDER BY is not used. So, with varying executor allocation, you cannot assume that the processing (without ordering/sorting) will be the same the second time around; in particular, limit without an ordering is free to select a different set of rows on each run.
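A hedged sketch of one way to make the selected rows repeatable, assuming the chosen columns give a (near-)total order over the rows:
// Impose a deterministic order before limiting so the same rows are picked each run.
val data = spark.read.parquet("srcfile.parquet")
  .orderBy("epoch_1min", "epoch")
  .limit(maxRows.toInt)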

HBase Concurrent / Parallel Scan from Spark 1.6, Scala 2.10.6 besides multithreading

I have a list of rowPrefixes Array("a", "b", ...)
I need to query HBase (using Nerdammer) for each of the rowPrefix. My current solution is
case class Data(x: String)

val rowPrefixes = Array("a", "b", "c")

rowPrefixes.par
  .map( rowPrefix => {
    val rdd = sc.hbaseTable[Data]("tableName")
      .inColumnFamily("columnFamily")
      .withStartRow(rowPrefix)
    rdd
  })
  .reduce(_ union _)
I am basically loading multiple RDDs using multiple threads (.par) and then unioning all of them at the end. Is there a better way to do this? I don't mind using another library besides Nerdammer.
Besides, I'm worried about the reflection API thread-safety issue, since I'm reading HBase into an RDD of a case class.
I haven't used the Nerdammer connector, but if we consider your example of three row-key prefix filters, using par the amount of parallelism would be limited, the cluster may be underutilized, and results may be slow.
You can check whether the following can be achieved with the Nerdammer connector; I have used the hbase-spark connector (CDH). In the approach below, the row key prefixes are scanned across all table partitions, i.e. all the table regions spread across the cluster, in parallel. This can utilize the available resources (cores/RAM) more efficiently and, more importantly, leverage the power of distributed computing.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{FilterList, PrefixFilter}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
// set zookeeper quorum properties in hbaseConf
val hbaseContext = new HBaseContext(sc, hbaseConf)

val rowPrefixes = Array("a", "b", "c")
// MUST_PASS_ONE ORs the prefix filters together (the default would AND them)
val filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE)
rowPrefixes.foreach { x => filterList.addFilter(new PrefixFilter(Bytes.toBytes(x))) }

val scan = new Scan()
scan.setFilter(filterList)
scan.addFamily(Bytes.toBytes("myCF"))

val rdd = hbaseContext.hbaseRDD(TableName.valueOf("tableName"), scan)
rdd.mapPartitions(populateCaseClass)
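populateCaseClass above is your own mapping from raw HBase results to the Data case class; a hedged sketch of what it might look like (the column family "myCF" and qualifier "x" are assumptions for illustration):
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical mapping from scan results to the question's Data case class.
def populateCaseClass(rows: Iterator[(ImmutableBytesWritable, Result)]): Iterator[Data] =
  rows.map { case (_, result) =>
    Data(Bytes.toString(result.getValue(Bytes.toBytes("myCF"), Bytes.toBytes("x"))))
  }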
In your case too, a full table scan will happen, but only 3 partitions will do a considerable amount of work, assuming you have sufficient cores available and par can allocate one core to each element in the rowPrefixes array.
Hope this helps.

Is there an alternative to joinWithCassandraTable for DataFrames in Spark (Scala) when retrieving data from only certain Cassandra partitions?

When extracting a small number of partitions from a large C* table using RDDs, we can use this:
val rdd = … // rdd including partition data
val data = rdd.repartitionByCassandraReplica(keyspace, tableName)
.joinWithCassandraTable(keyspace, tableName)
Do we have an equally effective approach available using DataFrames?
Update (Apr 26, 2017):
To be more concrete, I prepared an example.
I have 2 tables in Cassandra:
CREATE TABLE ids (
  id text,
  registered timestamp,
  PRIMARY KEY (id)
)

CREATE TABLE cpu_utils (
  id text,
  date text,
  time timestamp,
  cpu_util int,
  PRIMARY KEY (( id, date ), time)
)
The first one contains a list of valid IDs and the second one cpu utilization data. I would like to efficiently get the average cpu utilization for each id in table ids for one day, say "2017-04-25".
The most efficient way with the RDDs that I know of is the following:
val sc: SparkContext = ...
val date = "2017-04-25"

val partitions = sc.cassandraTable(keyspace, "ids")
  .select("id").map(r => (r.getString("id"), date))

val data = partitions.repartitionByCassandraReplica(keyspace, "cpu_utils")
  .joinWithCassandraTable(keyspace, "cpu_utils")
  .select("id", "cpu_util").values
  .map(r => (r.getString("id"), (r.getDouble("cpu_util"), 1)))

// aggrData in form: (id, (avg(cpu_util), count))
// example row: ("718be4d5-11ad-4849-8aab-aa563c9c290e",(6,723))
val aggrData = data.reduceByKey((a, b) => (
  1d * (a._1 * a._2 + b._1 * b._2) / (a._2 + b._2),
  a._2 + b._2))

aggrData.foreach(println)
This approach takes about 5 seconds to complete (setup with Spark on my local machine, Cassandra on some remote server). Using it, I am performing operations on less than 1% of the partitions in table cpu_utils.
With DataFrames, this is the approach I am currently using:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val date = "2017-04-25"

val partitions = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "ids", "keyspace" -> keyspace)).load()
  .select($"id").withColumn("date", lit(date))

val data: DataFrame = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "cpu_utils", "keyspace" -> keyspace)).load()
  .select($"id", $"cpu_util", $"date")

val dataFinal = partitions.join(data,
    partitions.col("id").equalTo(data.col("id")) and partitions.col("date").equalTo(data.col("date")))
  .select(data.col("id"), data.col("cpu_util"))
  .groupBy("id")
  .agg(avg("cpu_util"), count("cpu_util"))

dataFinal.show()
However, this approach seems to load the whole cpu_utils table into memory, as the execution time here is considerably longer (almost 1 minute).
Is there a better approach using DataFrames that would at least match, if not outperform, the RDD approach mentioned above?
P.S.: I am using Spark 1.6.1.
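One possible workaround, only as a hedged sketch, is to keep the efficient joinWithCassandraTable on the RDD side and hand the result back to the DataFrame API for the aggregation; it reuses the connector APIs already shown in the RDD approach above and assumes sc, keyspace, date and sqlContext.implicits._ from the earlier snippets are in scope:
import com.datastax.spark.connector._
import org.apache.spark.sql.functions.{avg, count}

// Build (id, date) partition keys, join only those Cassandra partitions,
// then convert to a DataFrame for the aggregation.
val partitionKeys = sc.cassandraTable(keyspace, "ids")
  .select("id").map(r => (r.getString("id"), date))

val joinedRows = partitionKeys
  .repartitionByCassandraReplica(keyspace, "cpu_utils")
  .joinWithCassandraTable(keyspace, "cpu_utils")
  .select("id", "cpu_util").values
  .map(r => (r.getString("id"), r.getDouble("cpu_util")))

val result = joinedRows.toDF("id", "cpu_util")
  .groupBy("id")
  .agg(avg("cpu_util"), count("cpu_util"))
result.show()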

spark: read parquet file and process it

I am new to Spark 1.6. I'd like to read a parquet file and process it.
To simplify, suppose we have a parquet file with this structure:
id, amount, label
and I have 3 rules:
amount < 10000 => label = LOW
10000 < amount < 100000 => label = MEDIUM
amount > 100000 => label = HIGH
How can I do it in Spark and Scala?
I try something like that:
case class SampleModels(
  id: String,
  amount: Double,
  label: String
)
val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read.parquet("/path/file/")
val ds = df.as[SampleModels].map( row =>
  MY LOGIC
  WRITE OUTPUT IN PARQUET
)
Is this the right approach? Is it efficient? "MY LOGIC" could be more complex.
Thanks
Yes, it's the right way to work with Spark. If your logic is simple, you can try to use built-in functions to operate on the DataFrame directly (like when in your case); it will be a little faster than mapping rows to a case class and executing code in the JVM, and you will be able to save the results back to parquet easily.
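A hedged sketch of that when/otherwise approach (column names taken from the question; the output path is an assumption):
import org.apache.spark.sql.functions.{col, when}

// Derive the label column without deserializing rows into the case class.
val labelled = df.withColumn("label",
  when(col("amount") < 10000, "LOW")
    .when(col("amount") < 100000, "MEDIUM")
    .otherwise("HIGH"))

labelled.write.parquet("/path/output/") // hypothetical output path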
Yes, it is the correct approach.
It will do one pass over your complete data to build the extra column you need.
If you want a SQL way, this is the way to go:
val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")
val df2 = sqlContext.sql(
  "select *, case when amount < 10000 then 'LOW' when amount < 100000 then 'MEDIUM' else 'HIGH' end as label from MY_TABLE")
Remember to use HiveContext instead of SQLContext, though.