Union operation in spark running very slow - scala

I'm running a Spark SQL job with the statements and configuration below, but the dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1) step takes a long time to execute, roughly 5 minutes, even though my input Parquet file has just 88 records. Any thoughts on what the issue could be?
val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("spark.master", "local")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
// set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
spark.conf.set("spark.driver.host", "localhost")
spark.conf.set("spark.cores.max", "8")
// m is the collection of column names to profile; parquetDFTable is the temp view over the input Parquet file
val dfs = m.map(field => spark.sql(
  s"select 'DataProfilerStats' as Table_Name, '$field' as Column_Name, min($field) as min_value from parquetDFTable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1)
UPDATE
I'm reading a single Parquet file into the DataFrame; part of the question is also whether it can be split into smaller chunks.
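One thing worth checking, as a sketch rather than a confirmed fix: with .config("spark.master", "local") everything runs on a single thread, and the plan behind withSum scans the Parquet file and runs a separate aggregation once per column before unioning the single-row results and shuffling them through distinct() and coalesce(1). A single query that computes all the minimums in one pass avoids most of that work; this assumes m is the collection of column names used above and parquetDFTable is the registered temp view over the input file.
import org.apache.spark.sql.functions.min
// Hypothetical single-pass alternative: one scan and one aggregation covering every profiled column.
val minsRow = spark.table("parquetDFTable")
  .agg(min(m.head), m.tail.map(c => min(c)): _*)
  .first()
val stats = m.zipWithIndex.map { case (field, i) =>
  ("DataProfilerStats", field, minsRow.get(i))   // (Table_Name, Column_Name, min_value)
}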

Related

Improve performance of slow running spark streaming process that uses micro-batching

I am trying to create an application to process 10 million JSON files, where the size of a single JSON file can vary from 1 MB to 50 MB.
To avoid overburdening the driver, I am using the Structured Streaming API to process 100,000 JSON files at a time rather than loading all the source files at once.
mySchema
val mySchema: StructType = StructType(Array(
  StructField("ID", StringType, true),
  StructField("StartTime", DoubleType, true),
  StructField("Data", ArrayType(StructType(Array(
    StructField("field1", DoubleType, true),
    StructField("field2", LongType, true),
    StructField("field3", LongType, true),
    StructField("field4", DoubleType, true),
    StructField("field5", DoubleType, true),
    StructField("field6", DoubleType, true),
    StructField("field7", LongType, true),
    StructField("field8", LongType, true)
  )), true), true)
))
Create a streaming DataFrame that picks up 100,000 files at a time:
val readDF = spark.readStream
  .format("json")
  .option("maxFilesPerTrigger", 100000)
  .option("pathGlobFilter", "*.json")
  .option("recursiveFileLookup", "true")
  .schema(mySchema)
  .load("/mnt/source/2020/*")
writeStream to start the streaming computation:
val sensorFileWriter = readDF
  .writeStream
  .queryName("myStream")
  .format("delta")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .outputMode("append")
  .option("checkpointLocation", "/mnt/dir/checkpoint")
  .foreachBatch(
    (batchDF: DataFrame, batchId: Long) => {
      batchDF.persist()
      val parseDF = batchDF
        .withColumn("New_Set", expr("transform(Data, x -> (x.field1 as f1, x.field2 as f2, x.field3 as field3))"))
        .withColumn("Data_New", addCol(uuid(), to_json($"New_Set")))
        .withColumn("data_size", size(col("Data")))
        .withColumn("requestid", uuid())
        .withColumn("start_epoch_double", bround($"StartTime").cast("long"))
        .withColumn("Start_date", from_unixtime($"start_epoch_double", "yyyyMMdd"))
        .withColumn("request", concat(lit("start"), col("Data_New"), lit("end")))
        .persist()
      val requestDF = parseDF
        .select($"Start_date", $"request")
      requestDF.write
        .partitionBy("Start_date")
        .mode("append")
        .parquet("/mnt/Target/request")
    }
  )
In the above "addCol" is a user defined function that adds new StructField to Array of StructFields
val addCol = udf((id: String, json: String) => {
  import org.json4s._
  import org.json4s.jackson.JsonMethods._
  import org.json4s.JsonDSL._
  implicit val formats = DefaultFormats
  compact(parse(json).extract[List[Map[String, String]]].map(m => Map("requestid" -> id) ++ m))
})
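For illustration, here is roughly what the body of addCol does to one value (a standalone sketch; the sample JSON and id are made up, and it assumes json4s on the classpath as above):
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.JsonDSL._
implicit val formats = DefaultFormats
val sampleJson = """[{"f1":"1.5","f2":"2"},{"f1":"3.5","f2":"4"}]"""
val withRequestId = compact(
  parse(sampleJson)
    .extract[List[Map[String, String]]]                    // each array element becomes a Map
    .map(m => Map("requestid" -> "11111111-2222") ++ m))   // add the generated id to every element
// withRequestId is the same array with a "requestid" entry in each object, e.g.
// [{"requestid":"11111111-2222","f1":"1.5","f2":"2"},{"requestid":"11111111-2222","f1":"3.5","f2":"4"}]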
"uuid" is another udf that generates a unique id
val uuid = udf(() => java.util.UUID.randomUUID().toString)
Databricks cluster config:
Apache Spark 2.4.5
70 Workers: 3920.0 GB Memory, 1120 Cores (i.e. 56.0 GB Memory and 16 Cores per Worker)
1 Driver: 128.0 GB Memory, 32 Cores
Writing each batch of 100,000 files takes more than an hour, and the entire process takes days to finish processing the 10 million JSON files.
How can I make this streaming process run faster?
Should I be setting "spark.sql.shuffle.partitions"? If so, what is a good value for this property?
In most cases you want a 1-to-1 mapping of partitions to cores for streaming applications (per the Azure Databricks Structured Streaming guidance).
To optimize the mapping of your partitions to cores, try this:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)
This will give you a 1-to-1 mapping of your partitions to cores.
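If it helps to see the starting point, the current values can be checked before applying the setting (a quick sketch; sc in the snippet above is the SparkContext, i.e. spark.sparkContext):
println(spark.conf.get("spark.sql.shuffle.partitions"))   // "200" unless it has been overridden
println(spark.sparkContext.defaultParallelism)            // roughly the total cores available to the application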

Why could streaming join of queries over kafka topics take so long?

I'm using Spark Structured Streaming and joining two streams from Kafka topics.
I noticed that the streaming query takes around 15 seconds for each record; a single stage (stage 2) accounts for the full 15s. Why could that be?
The code is as follows:
val kafkaTopic1 = "demo2"
val kafkaTopic2 = "demo3"
val bootstrapServer = "localhost:9092"
val spark = SparkSession
.builder
.master("local")
.getOrCreate
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic1)
.option("failOnDataLoss", false)
.load
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic2)
.option("failOnDataLoss", false)
.load
val order_details = df1
.withColumn(...)
.select(...)
val invoice_details = df2
.withColumn(...)
.where(...)
order_details
.join(invoice_details)
.where(order_details.col("s_order_id") === invoice_details.col("order_id"))
.select(...)
.writeStream
.format("console")
.option("truncate", false)
.start
.awaitTermination()
Code-wise everything works fine. The only problem is the time to join the two streams. How could this query be optimised?
It's quite possible that the execution time is unsatisfactory because of the master URL, i.e. .master("local"), which runs everything on a single thread. Change it to local[*] at the very least and you should find the join completes faster.
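A minimal sketch of that change, with everything else unchanged; local[*] uses as many worker threads as the machine has cores:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder
  .master("local[*]")   // was "local", which runs all tasks on a single thread
  .getOrCreate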

DataFrame persist does not improve performance in Spark

I am writing a Scala script that reads from a table, transforms the data and shows the result using Spark. I am using Spark 2.1.1.2 and Scala 2.11.8. There is a DataFrame instance I use twice in the script (df2 in the code below). Since DataFrames are computed when an action is called on them, not when they are declared, I expect this DataFrame to be computed twice. I thought that persisting it would improve performance, since it would then be computed once (when persisted) instead of twice.
However, the script runs ~10 seconds longer when I persist than when I don't. I cannot figure out what the reason for this is. If someone has an idea, it would be much appreciated.
My submission command line is below:
spark-submit --class TestQuery --master yarn --driver-memory 10G --executor-memory 10G --executor-cores 2 --num-executors 4 /home/bcp_data/test/target/TestQuery-1.0-SNAPSHOT.jar
The Scala script is below:
val spark = SparkSession
  .builder()
  .appName("TestQuery")
  .config("spark.sql.warehouse.dir", "file:/tmp/hsperfdata_hdfs/spark-warehouse/")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions.countDistinct
val m = spark.sql("select id, startdate, enddate, status from members")
val l = spark.sql("select mid, no, status, potential from log")
val r = spark.sql("select mid, code from records")
val df1 = m.filter($"status".isin(1, 2).and($"startdate" <= one_year_ago).and($"enddate" >= one_year_ago))
val df2 = df1.select($"id", $"code").join(l, "mid").filter($"status".equalTo(1).and($"potential".notEqual(9))).select($"no", $"id", $"code")
df2.persist
val df3 = df2.join(r, df2("id").equalTo(r("mid"))).filter($"code".isin("0001","0010","0015","0003","0012","0014","0032","0033")).groupBy($"code").agg(countDistinct($"no"))
val fa = spark.sql("select mid, acode from actions")
val fc = spark.sql("select dcode, fcode from params.codes")
val df5 = fa.join(fc, fa("acode").startsWith(fc("dcode")), "left_outer").select($"mid", $"fcode")
val df6 = df2.join(df5, df2("id").equalTo(df5("mid"))).groupBy($"code", $"fcode")
println("count1: " + df3.count + " count2: " + df6.count)
Using caching is the right choice here, but note that caching is not free: the first action still has to compute df2 and additionally store it, which can easily account for the extra ~10 seconds when the recomputation it saves is cheap. Also, persist returns the persisted DataFrame, so rather than the standalone statement
df2.persist
it is cleaner to chain it directly into the definition:
val df2 = df1.select($"id", $"code")
  .join(l, "mid")
  .filter($"status".equalTo(1).and($"potential".notEqual(9)))
  .select($"no", $"id", $"code")
  .persist
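As a small follow-up sketch (an addition, not part of the original answer): whichever action runs first materializes the cache, and it is worth releasing it once both results have been computed.
df3.count        // first action: computes df2 once and populates the cache
// ...any further actions over df2 now reuse the cached data...
df2.unpersist()  // free the cached blocks once they are no longer needed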

How to load rows from Cassandra table as Dataframe in Spark?

I can load a whole Cassandra table as a DataFrame as below:
val tableDf = sparkSession.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
But I couldn't find a way to fetch rows by primary key, something like
select * from table where key = ''
Is there a way to do this?
val tableDf = sparkSession.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
.filter("key='YOUR_KEY'")
Used this way, the spark-cassandra-connector applies predicate pushdown and fetches only the required rows; you can confirm the pushdown with explain(), as sketched after the reference below.
Dataframes and Predicate pushdown
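A quick way to check that the filter really is pushed down to Cassandra (a sketch; the key value is a placeholder and the exact plan text varies by Spark and connector version):
tableDf.filter("key='YOUR_KEY'").explain()
// The physical plan should list the condition under PushedFilters, e.g.
// PushedFilters: [*EqualTo(key,YOUR_KEY)]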
The Java equivalent is:
SparkSession sparkSession = SparkSession.builder().appName("Spark Sql Job").master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
.config("spark.cassandra.connection.host", "localhost")
.config("spark.cassandra.connection.port", "9042").getOrCreate();
SQLContext sqlCtx = sparkSession.sqlContext();
Dataset<Row> rowsDataset = sqlCtx.read().format("org.apache.spark.sql.cassandra").option("keyspace", "myschema")
.option("table", "mytable").load();
rowsDataset.show();
It should be the same for Scala, I believe.

PySpark - read recursive Hive table

I have a Hive table that has multiple sub-directories in HDFS, something like:
/hdfs_dir/my_table_dir/my_table_sub_dir1
/hdfs_dir/my_table_dir/my_table_sub_dir2
...
Normally I set the following parameters before I run a Hive script:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
select * from my_db.my_table;
I'm trying to do the same using PySpark,
conf = (SparkConf().setAppName("My App")
...
.set("hive.input.dir.recursive", "true")
.set("hive.mapred.supports.subdirectories", "true")
.set("hive.supports.subdirectories", "true")
.set("mapred.input.dir.recursive", "true"))
sc = SparkContext(conf = conf)
sqlContext = HiveContext(sc)
my_table = sqlContext.sql("select * from my_db.my_table")
and end up with an error like:
java.io.IOException: Not a file: hdfs://hdfs_dir/my_table_dir/my_table_sub_dir1
What's the correct way to read a Hive table with sub-directories in Spark?
What I have found is that these values must be prefixed with "spark.", as in:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
Try setting them through sqlContext.sql() before executing the query:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
my_table = sqlContext.sql("select * from my_db.my_table")
Try setting them through the SparkSession builder before executing the query:
sparkSession = (SparkSession
.builder
.appName('USS - Unified Scheme of Sells')
.config("hive.metastore.uris", "thrift://probighhwm001:9083", conf=SparkConf())
.config("hive.input.dir.recursive", "true")
.config("hive.mapred.supports.subdirectories", "true")
.config("hive.supports.subdirectories", "true")
.config("mapred.input.dir.recursive", "true")
.enableHiveSupport()
.getOrCreate()
)