Spark Scala performance issues with take and InsertInto commands - scala

Please look at the attached screenshot.
I am trying to do some performance improvement to my spark job and its taking almost 5 min to execute the take action on dataframe.
I am using take for making sure that dataframe has some records in it and if it is present, I want to proceed for further processing.
I tried take and count and don't see much difference in the time for execution.
Another scenario is where its taking around 10min to write the datafrane into hive table(it has max 200 rows and 10 columns).
df.write.mode("append").partitionBy("date").insertInto(tablename)
Please suggest how we can minimize the time its taking for take and insert into hive table.
Updates:
Here is my spark submit : spark-submit --master yarn-cluster --class com.xxxx.info.InfoAssets --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.security.auth.login.config=kafka_spark_jaas.conf" --files /home/ngap.app.rcrp/hive-site.xml,/home//kafka_spark_jaas.conf,/etc/security/keytabs/ngap.sa.rcrp.keytab --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar --executor-memory 3G --num-executors 3 --executor-cores 10 /home/InfoAssets/InfoAssets.jar
Code details:
its a simple dataframe which has 8 columns with around 200 records in it and I am using following code to insert into hive table.
df.write.mode("append").partitionBy("partkey").insertInto(hiveDB + "." + tableName)
Thanks,Bab

Don't use count before the write if not necessary and if your table is already created then use Spark SQL to insert the data into Hive Partitioned table.
spark.sql("Insert into <tgt tbl> partition(<col name>) select cols,partition col from temp_tbl")

Related

Total records processed in each micro batch spark streaming

Is there a way I can find how many records got processed into downstream delta table for each micro-batch. I've streaming job, which runs hourly once using trigger.once() with the append mode. For audit purpose, I want to know how many records got processed for each micro batch. I've tried the below code to print the count of records processed(shown in the second line).
ss_count=0
def write_to_managed_table(micro_batch_df, batchId):
#print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
ss_count = micro_batch_df.count()
saveloc = "TABLE_PATH"
df_final.writeStream.trigger(once=True).foreachBatch(write_to_managed_table).option('checkpointLocation', f"{saveloc}/_checkpoint").start(saveloc)
print(ss_count)
Streaming job will run without any issues but micro_batch_df.count() will not print any count.
Any pointers would be much appreciated.
Here is a working example of what you are looking for (structured_steaming_example.py):
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("StructuredStreamTesting") \
.getOrCreate()
# Create DataFrame representing the stream of input
df = spark.read.parquet("data/")
lines = spark.readStream.schema(df.schema).parquet("data/")
def batch_write(output_df, batch_id):
print("inside foreachBatch for batch_id:{0}, rows in passed dataframe: {1}".format(batch_id, output_df.count()))
save_loc = "/tmp/example"
query = (lines.writeStream.trigger(once=True)
.foreachBatch(batch_write)
.option('checkpointLocation', save_loc + "/_checkpoint")
.start(save_loc)
)
query.awaitTermination()
The sample parquet file is attached. Please put that in the data folder and execute the code using spark-submit
spark-submit --master local structured_steaming_example.py
Please put any sample parquet file under data folder for testing.

Spark-Optimization Techniques

Hi I have 90 GB data In CSV file I'm loading this data into one temp table and then from temp table to orc table using select insert command but for converting and loading data into orc format its taking 4 hrs in spark sql.Is there any kind of optimization technique which i can use to reduce this time.As of now I'm not using any kind of optimization technique I'm just using spark sql and loading data from csv file to table(textformat) and then from this temp table to orc table(using select insert)
using spark submit as:
spark-submit \
--class class-name\
--jar file
or can I add any extra Parameter in spark submit for improving the optimization.
scala code(sample):
All Imports
object sample_1 {
def main(args: Array[String]) {
//sparksession with enabled hivesuppport
var a1=sparksession.sql("load data inpath 'filepath' overwrite into table table_name")
var b1=sparksession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
}
}
First of all, you don't need to store the data in the temp table to write into hive table later. You can straightaway read the file and write the output using the DataFrameWriter API. This will reduce one step from your code.
You can write as follows:
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
val df = spark.read.csv(filePath) //Add header or delimiter options if needed
inputDF.write.mode("append").format(outputFormat).saveAsTable(outputDB + "." + outputTableName)
Here, the outputFormat will be orc, the outputDB will be your hive database and outputTableName will be your Hive table name.
I think using the above technique, your write time will reduce significantly. Also, please mention the resources your job is using and I may be able to optimize it further.
Another optimization you can use is to partition your dataframe while writing. This will make the write operation faster. However, you need to decide the columns on which to partition carefully so that you don't end up creating a lot of partitions.

How to avoid gc overhead limit exceeded in a range query with GeoSpark?

I am using Spark 2.4.3 with the extension of GeoSpark 1.2.0.
I have two tables to join as range distance. One table (t1) if ~ 100K rows with one column only that is a Geospark's geometry. The other table (t2) is ~ 30M rows and it is composed by an Int value and a Geospark's geometry column.
What I am trying to do is just a simple:
val spark = SparkSession
.builder()
// .master("local[*]")
.config("spark.serializer", classOf[KryoSerializer].getName)
.config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
.config("geospark.global.index", "true")
.config("geospark.global.indextype", "rtree")
.config("geospark.join.gridtype", "rtree")
.config("geospark.join.numpartition", 200)
.config("spark.sql.parquet.filterPushdown", "true")
// .config("spark.sql.shuffle.partitions", 10000)
.config("spark.sql.autoBroadcastJoinThreshold", -1)
.appName("PropertyMaster.foodDistanceEatout")
.getOrCreate()
GeoSparkSQLRegistrator.registerAll(spark)
spark.sparkContext.setLogLevel("ERROR")
spark.read
.load(s"$dataPath/t2")
.repartition(200)
.createOrReplaceTempView("t2")
spark.read
.load(s"$dataPath/t1")
.repartition(200)
.cache()
.createOrReplaceTempView("t1")
val query =
"""
|select /*+ BROADCAST(t1) */
| t2.cid, ST_Distance(t1.geom, t2.geom) as distance
| from t2, t1 where ST_Distance(t1.geom, t2.geom) <= 3218.69""".stripMargin
spark.sql(query)
.repartition(200)
.write.mode(SaveMode.Append)
.option("path", s"$dataPath/my_output.csv")
.format("csv").save()
I tried different configurations, cboth when I run it locally or on my local cluster on my laptop (tot mem 16GB and 8 cores) but without any luck as the program crashes at "Distinct at Join" for GeoSpark with lots of shuffling. However I am not able to remove the shuffling from SparkSQL syntax. I thought to add an extra column id on the biggest table as for example same integer every 200 rows or so and repartition by that, but didn't work too.
I was expecting a partitioner for GeoSpark indexing but I am not sure it is working.
Any idea?
I have found an answer myself, as the problem of the GC overhead was due to partitioning but also the memory needed for the Partitioner by GeoSpark (based on index) and the timeout due to long geoquery calculations that have been solved adding the following parameters as suggested by GeoSpark website itself:
spark.executor.memory 4g
spark.driver.memory 10g
spark.network.timeout 10000s
spark.driver.maxResultSize 5g

How to optimize withColumn in Spark scala?

I am new at Spark and Scala and I want to optimize a request that I wrote on Spark which is very very heavy and slow (my database is huge and it contains a lot of data).
I have a first table "city_ID" :
ID City_CODE
1 10
2 11
3 12
And a second table "city_name" that has a common field with the first one :
City_Code City_NAME_CRYPT
10 ADFKJV - Paris
11 AGHDFBNLKFJ - London
12 AGZEORIUYG- Madrid
What I want to have in my final result is the city id and its proper name (which I can compute with a regex on the city_name field) WITHOUT ANY OTHER DATA. So, it should look like this :
ID NAME
10 Paris
11 London
12 Madrid
Here is my current code :
val result = spark.sql(""" SELECT t1.id, t2.city_name_crypt AS name_crypt
FROM table1 t1
INNER JOIN table2
on t1.city_code = t2.city_code""").withColumn("name", regexp_extract($"name_crypt", ".*?(\\d+)\\)$", 1)).drop($"name_crypt").show()
The big problem for me is that I just want to have 2 columns, not 3! But since I did an inner join I am obliged to keep this third column on my dataframe while it's useless in my case. It's why I used the drop after the with column method.
Can you please help me to fix this problem?
Thank you in advance!
I think that's not what is making it slow. But you can use withColumnRenamed like so...
result.withColumnRenamed("name", regexp_extract($"name_crypt", ".*?(\\d+)\\)$", 1))
If you are new to Spark a lot of people don't parallelize the tasks at first. Perhaps you should make sure that the parallelization of your tasks is good. Check the num-executors and executor-memory
https://spark.apache.org/docs/latest/configuration.html
Here is an example spark-submit command...
spark-submit \
--class yourClass \
--master yarn \
--deploy-mode cluster \
--executor-memory 8G \
--num-executors 40 \
/path/to/myJar.jar

Spark dataframe write method writing many small files

I've got a fairly simple job coverting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files - our block size is 128MB), which is approx 12 thousand files.
Job works as follows:
val events = spark.sparkContext
.textFile(s"$stream/$sourcetype")
.map(_.split(" \\|\\| ").toList)
.collect{case List(date, y, "Event") => MyEvent(date, y, "Event")}
.toDF()
df.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")
It collects the events with a common schema, converts to a DataFrame, and then writes out as parquet.
The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.
Ideally I want to create only a handful of parquet files within the partition 'date'.
What would be the best way to control this? Is it by using 'coalesce()'?
How will that effect the amount of files created in a given partition? Is it dependent on how many executors I have working in Spark? (currently set at 100).
you have to repartiton your DataFrame to match the partitioning of the DataFrameWriter
Try this:
df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
In Python you can rewrite Raphael's Roth answer as:
(df
.repartition("date")
.write.mode("append")
.partitionBy("date")
.parquet("{path}".format(path=path)))
You might also consider adding more columns to .repartition to avoid problems with very large partitions:
(df
.repartition("date", another_column, yet_another_colum)
.write.mode("append")
.partitionBy("date)
.parquet("{path}".format(path=path)))
The simplest solution would be to replace your actual partitioning by :
df
.repartition(to_date($"date"))
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
You can also use more precise partitioning for your DataFrame i.e the day and maybe the hour of an hour range. and then you can be less precise for writer.
That actually depends on the amount of data.
You can reduce entropy by partitioning DataFrame and the write with partition by clause.
I came across the same issue and I could using coalesce solved my problem.
df
.coalesce(3) // number of parts/files
.write.mode(SaveMode.Append)
.parquet(s"$path")
For more information on using coalesce or repartition you can refer to the following spark: coalesce or repartition
Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916
This is working for me very well:
data.repartition(n, "key").write.partitionBy("key").parquet("/location")
It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.
If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write outs) and once it's all settled use hadoop FileUtil (or just the aws cli) to copy everything over:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
// ...
def copy(
in : String,
out : String,
sparkSession: SparkSession
) = {
FileUtil.copy(
FileSystem.get(new URI(in), sparkSession.sparkContext.hadoopConfiguration),
new Path(in),
FileSystem.get(new URI(out), sparkSession.sparkContext.hadoopConfiguration),
new Path(out),
false,
sparkSession.sparkContext.hadoopConfiguration
)
}
how about trying running scripts like this as map job consolidating all the parquet files into one:
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat