Spark Group By with Rank function is running very slow - Scala

I am writing a Spark app to find the top n accessed URLs within a time frame. But this job keeps running and, for one instance, takes hours for 389,451 records in ES. I want to reduce this time.
I am reading from Elasticsearch in Spark as below:
val df = sparkSession.read
  .format("org.elasticsearch.spark.sql")
  .load(date + "/" + business)
  .withColumn("ts_str", date_format($"ts", "yyyy-MM-dd HH:mm:ss")).drop("ts").withColumnRenamed("ts_str", "ts")
  .select(selects.head, selects.tail: _*)
  .filter($"ts" === ts)
  .withColumn("url", split($"uri", "\\?")(0)).drop("uri").withColumnRenamed("url", "uri")
  .cache()
In the above DataFrame I am reading from and filtering Elasticsearch. I am also removing query params from the URI.
Then I am doing a group by:
var finalDF = df.groupBy("col1", "col2", "col3", "col4", "col5", "uri")
  .agg(sum("total_bytes").alias("total_bytes"), sum("total_req").alias("total_req"))
Then I am running a window function
val partitionBy = Seq("col1", "col2", "col3", "col4", "col5")
val window = Window.partitionBy(partitionBy.head, partitionBy.tail: _*).orderBy(desc("total_req"))
finalDF = finalDF.withColumn("rank", rank().over(window)).where($"rank" <= 5).drop("rank")
Then I am writing finalDF to Cassandra:
finalDF.write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table", "keyspace" -> "keyspace"))
  .mode(SaveMode.Append)
  .save()
I have 4 data nodes in the ES cluster, and my Spark machine is a VM with 16 cores and 64 GB RAM. Please help me find where the problem is.

It could be a good idea to persist your DataFrame after reading it, because you are going to use it several times in the aggregation and rank steps.
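For example, a minimal sketch reusing df from the question (the MEMORY_AND_DISK storage level is an assumption; tune it to your cluster):

import org.apache.spark.storage.StorageLevel

val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()  // optionally force materialization once, before the groupBy and window steps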

Related

Spark Repeatable/Deterministic Results [duplicate]

This question already has answers here: Why does df.limit keep changing in Pyspark? (3 answers). Closed 2 years ago.
I'm running the Spark code below (basically created as an MVE), which does the following:
1. Read parquet and limit
2. Partition by
3. Join
4. Filter
I'm struggling to understand why I get a different number of rows in the joined dataframe, i.e. the dataframe after stage 3 above, each time I run the application. Why is this happening?
The reason I think this shouldn't be happening is that the limit is deterministic, so each time the same rows should be in the partitioned dataframe, albeit in a different order. In the join I am joining on the field that the partition was done on. I am expecting to have every combination of pairs within a partition, but I think this should equate to the same number each time.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
import org.apache.spark.storage.StorageLevel

def main(args: Array[String]) {
  val maxRows = args(0)
  val spark = SparkSession.builder.getOrCreate()

  val windowSpec = Window.partitionBy("epoch_1min").orderBy("epoch")
  val data = spark.read.parquet("srcfile.parquet").limit(maxRows.toInt)
  val partitionDf = data.withColumn("row", row_number().over(windowSpec))
  partitionDf.persist(StorageLevel.MEMORY_ONLY)
  logger.debug(s"${partitionDf.count()} rows in partitioned data")

  val dfOrig = partitionDf.withColumnRenamed("epoch_1min", "epoch_1min_orig").withColumnRenamed("row", "row_orig")
  val dfDest = partitionDf.withColumnRenamed("epoch_1min", "epoch_1min_dest").withColumnRenamed("row", "row_dest")
  val joined = dfOrig.join(dfDest, dfOrig("epoch_1min_orig") === dfDest("epoch_1min_dest"), "inner")
  logger.debug(s"Rows in joined dataframe ${joined.count()}")

  val filtered = joined.filter(col("row_orig") < col("row_dest"))
  logger.debug(s"Rows in filtered dataframe ${filtered.count()}")
}
There could be underlying data changes if you start a new app.
Otherwise, with Spark SQL just as with ANSI SQL on an RDBMS, there is no guaranteed ordering of data when ORDER BY is not used. So, with varying executor allocation, you cannot assume that the processing (without ordering/sorting) will be the same the second time around.
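If the selected rows need to be stable across runs, one option (a sketch, assuming the epoch column uniquely orders the rows) is to fix an explicit ordering before the limit:

val data = spark.read.parquet("srcfile.parquet")
  .orderBy("epoch")        // explicit, deterministic ordering
  .limit(maxRows.toInt)    // the same rows are now selected on every run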

HBase Concurrent / Parallel Scan from Spark 1.6, Scala 2.10.6 besides multithreading

I have a list of row prefixes: Array("a", "b", ...)
I need to query HBase (using Nerdammer) for each of the row prefixes. My current solution is:
case class Data(x: String)

val rowPrefixes = Array("a", "b", "c")

rowPrefixes.par
  .map { rowPrefix =>
    val rdd = sc.hbaseTable[Data]("tableName")
      .inColumnFamily("columnFamily")
      .withStartRow(rowPrefix)
    rdd
  }
  .reduce(_ union _)
I am basically loading multiple RDDs using multiple threads (.par) and then unioning all of them at the end. Is there a better way to do this? I don't mind using a library other than Nerdammer.
Besides, I'm worried about reflection API thread-safety issues, since I'm reading HBase into an RDD of a case class.
I haven't used the Nerdammer connector, but if we consider your example of a handful of row-key prefix filters, then with par the amount of parallelism is limited to the number of prefixes, the cluster may go underutilized, and results may be slow.
You can check whether the following can be achieved with the Nerdammer connector; I have used the hbase-spark connector (CDH). In the approach below, the row-key prefixes are scanned across all table partitions, i.e. all the table regions spread across the cluster, in parallel. This utilizes the available resources (cores/RAM) more efficiently and, more importantly, leverages the power of distributed computing.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{FilterList, PrefixFilter}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
// set zookeeper quorum properties in hbaseConf

val hbaseContext = new HBaseContext(sc, hbaseConf)

val rowPrefixes = Array("a", "b", "c")
val filterList = new FilterList()
rowPrefixes.foreach { x => filterList.addFilter(new PrefixFilter(Bytes.toBytes(x))) }

val scan = new Scan()
scan.setFilter(filterList)
scan.addFamily(Bytes.toBytes("myCF"))

val rdd = hbaseContext.hbaseRDD(TableName.valueOf("tableName"), scan)
rdd.mapPartitions(populateCaseClass)
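populateCaseClass is not defined above; a possible sketch (a hypothetical helper, assuming the value of column x in family myCF maps to Data.x from the question):

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

def populateCaseClass(it: Iterator[(ImmutableBytesWritable, Result)]): Iterator[Data] =
  it.map { case (_, result) =>
    // read the raw bytes of myCF:x and convert them to a String
    Data(Bytes.toString(result.getValue(Bytes.toBytes("myCF"), Bytes.toBytes("x"))))
  }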
In your case too, a full table scan will happen, but only as many partitions as there are prefixes will do a considerable amount of work, assuming you have sufficient cores available and par can allocate one core to each element of the rowPrefixes array.
Hope this helps.

Is there an alternative to joinWithCassandraTable for DataFrames in Spark (Scala) when retrieving data from only certain Cassandra partitions?

When extracting a small number of partitions from a large C* table using RDDs, we can use this:
val rdd = … // rdd including partition data
val data = rdd.repartitionByCassandraReplica(keyspace, tableName)
  .joinWithCassandraTable(keyspace, tableName)
Do we have an equally effective approach available using DataFrames?
Update (Apr 26, 2017):
To be more concrete, I prepared an example.
I have 2 tables in Cassandra:
CREATE TABLE ids (
  id text,
  registered timestamp,
  PRIMARY KEY (id)
)

CREATE TABLE cpu_utils (
  id text,
  date text,
  time timestamp,
  cpu_util int,
  PRIMARY KEY ((id, date), time)
)
The first one contains a list of valid IDs and the second one CPU utilization data. I would like to efficiently get the average CPU utilization for each id in table ids for one day, say "2017-04-25".
The most efficient way with the RDDs that I know of is the following:
val sc: SparkContext = ...
val date = "2017-04-25"

val partitions = sc.cassandraTable(keyspace, "ids")
  .select("id").map(r => (r.getString("id"), date))

val data = partitions.repartitionByCassandraReplica(keyspace, "cpu_utils")
  .joinWithCassandraTable(keyspace, "cpu_utils")
  .select("id", "cpu_util").values
  .map(r => (r.getString("id"), (r.getDouble("cpu_util"), 1)))

// aggrData in form: (id, (avg(cpu_util), count))
// example row: ("718be4d5-11ad-4849-8aab-aa563c9c290e", (6, 723))
val aggrData = data.reduceByKey((a, b) => (
  1d * (a._1 * a._2 + b._1 * b._2) / (a._2 + b._2),
  a._2 + b._2))

aggrData.foreach(println)
This approach takes about 5 seconds to complete (set up with Spark on my local machine and Cassandra on a remote server). Using it, I am performing operations on less than 1% of the partitions in table cpu_utils.
With DataFrames, this is the approach I am currently using:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val date = "2017-04-25"

val partitions = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "ids", "keyspace" -> keyspace)).load()
  .select($"id").withColumn("date", lit(date))

val data: DataFrame = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "cpu_utils", "keyspace" -> keyspace)).load()
  .select($"id", $"cpu_util", $"date")

val dataFinal = partitions.join(data,
    partitions.col("id").equalTo(data.col("id")) and partitions.col("date").equalTo(data.col("date")))
  .select(data.col("id"), data.col("cpu_util"))
  .groupBy("id")
  .agg(avg("cpu_util"), count("cpu_util"))

dataFinal.show()
However, this approach seems to load the whole cpu_utils table into memory, as execution time here is considerably longer (almost 1 minute).
Is there a better approach using DataFrames that would at least match, if not outperform, the RDD approach mentioned above?
P.s.: I am using Spark 1.6.1.

Weird behavior of DataFrame operations

Consider the code:
val df1 = spark.table("t1").filter(col("c1") === lit(127))
val df2 = spark.sql("select x, y, z from ORCtable")
val df3 = df1.join(df2.toDF(df2.columns.map(_ + "_R"): _*),
  trim(upper(coalesce(col("y_R"), lit("")))) === trim(upper(coalesce(col("a"), lit("")))), "leftouter")
df3.select($"y_R", $"z_R").show(500, false)
This produces the warning WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again. The code then fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
But if I run the following code:
val df1 = spark.table("t1").filter(col("c1") === lit(127))
val df2 = spark.sql("select x, y, z from ORCtable limit 2000000") // only difference here
// ORC table has 1651343 rows so doesn't exceed limit 2000000
val df3 = df1.join(df2.toDF(df2.columns.map(_ + "_R"): _*),
  trim(upper(coalesce(col("y_R"), lit("")))) === trim(upper(coalesce(col("a"), lit("")))), "leftouter")
df3.select($"y_R", $"z_R").show(500, false)
This produces the correct output. I'm at a loss as to why this happens and what changes. Can someone help me make sense of this?
To answer my own question: the Spark physical execution plans are different for the two ways of generating the same dataframe, which can be checked by calling the .explain() method.
The first way uses a broadcast hash join, which causes java.lang.OutOfMemoryError: GC overhead limit exceeded, whereas the latter runs a sort-merge join, which is typically slower but does not strain garbage collection as much.
This difference in physical execution plans is introduced by the additional filter operation on the df2 dataframe.
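For reference, a small sketch of how one might compare the two plans and force a sort-merge join without changing the query (the threshold property is standard Spark configuration; "-1" disables automatic broadcast joins entirely):

df3.explain()  // prints the physical plan; look for BroadcastHashJoin vs SortMergeJoin

// disable automatic broadcast joins so the planner falls back to sort-merge
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")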

spark: read parquet file and process it

I am new to Spark 1.6. I'd like to read a parquet file and process it.
To simplify, suppose we have a parquet file with this structure:
id, amount, label
and I have 3 rules:
amount < 10000 => label = LOW
10000 < amount < 100000 => label = MEDIUM
amount > 100000 => label = HIGH
How can I do this in Spark and Scala?
I tried something like this:
case class SampleModels(
  id: String,
  amount: Double,
  label: String
)

val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read.parquet("/path/file/")
val ds = df.as[SampleModels].map { row =>
  // MY LOGIC
  // WRITE OUTPUT IN PARQUET
}
Is this the right approach? Is it efficient? "MY LOGIC" could be more complex.
Thanks
Yes, it's the right way to work with Spark. If your logic is simple, you can try to use built-in functions to operate on the DataFrame directly (like when in your case); it will be a little faster than mapping rows to the case class and executing your own code in the JVM, and you will be able to save the results back to parquet easily.
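For instance, a minimal sketch with the built-in when/otherwise functions (column names and thresholds follow the question; the output path is hypothetical):

import org.apache.spark.sql.functions.{col, when}

val labelled = df.withColumn("label",
  when(col("amount") < 10000, "LOW")
    .when(col("amount") < 100000, "MEDIUM")
    .otherwise("HIGH"))

labelled.write.parquet("/path/output/")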
Yes, it is the correct approach.
It will do one pass over your complete data to build the extra column you need.
If you want a SQL way, this is the way to go:
val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")
val df2 = sqlContext.sql("select *, case when amount < 10000 then 'LOW' else 'HIGH' end as label from MY_TABLE")
Remember to use a HiveContext instead of a plain SQLContext, though.