How to optimize withColumn in Spark Scala?

I am new to Spark and Scala, and I want to optimize a query I wrote in Spark which is very heavy and slow (my database is huge and contains a lot of data).
I have a first table, "city_ID":
ID City_CODE
1 10
2 11
3 12
And a second table, "city_name", that has a field in common with the first one:
City_Code City_NAME_CRYPT
10 ADFKJV - Paris
11 AGHDFBNLKFJ - London
12 AGZEORIUYG- Madrid
What I want in my final result is the city id and its proper name (which I can extract with a regex from the city_name_crypt field), WITHOUT ANY OTHER DATA. So it should look like this:
ID NAME
10 Paris
11 London
12 Madrid
Here is my current code:
val result = spark.sql("""
    SELECT t1.id, t2.city_name_crypt AS name_crypt
    FROM table1 t1
    INNER JOIN table2 t2
      ON t1.city_code = t2.city_code""")
  .withColumn("name", regexp_extract($"name_crypt", ".*?(\\d+)\\)$", 1))
  .drop($"name_crypt")

result.show()
The big problem for me is that I just want to have 2 columns, not 3! But since I did an inner join, I am obliged to keep this third column in my dataframe even though it is useless in my case. That is why I used drop after the withColumn method.
Can you please help me to fix this problem?
Thank you in advance!

I think that's not what is making it slow. Also note that withColumnRenamed only renames an existing column (it takes two column names, not an expression). To end up with just the two columns, you can replace the .withColumn(...).drop(...) part with a single select, like so...
.select($"id", regexp_extract($"name_crypt", ".*?(\\d+)\\)$", 1).as("name"))
Since you are new to Spark: a lot of people don't parallelize their jobs properly at first. Make sure the parallelization of your job is reasonable; check the num-executors and executor-memory settings:
https://spark.apache.org/docs/latest/configuration.html
Here is an example spark-submit command...
spark-submit \
--class yourClass \
--master yarn \
--deploy-mode cluster \
--executor-memory 8G \
--num-executors 40 \
/path/to/myJar.jar
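For completeness, here is a rough sketch of the whole query using the DataFrame API directly. Table and column names are taken from the question, but the regex is an assumption based on the sample rows shown above (e.g. "ADFKJV - Paris"): it extracts the text after the " - " separator, so adjust it to whatever the real name_crypt format is.

// Sketch only: the regex pattern is an assumption based on the sample data above.
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._   // assumes a SparkSession named `spark` is in scope

val result = spark.table("table1").as("t1")
  .join(spark.table("table2").as("t2"), $"t1.city_code" === $"t2.city_code")
  .select(
    $"t1.id",
    regexp_extract($"t2.city_name_crypt", "-\\s*(.+)$", 1).as("name"))

result.show()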

Related

PySpark structured streaming and filtered processing for parts

I want to evaluate a streamed (unbounded) data frame within Spark 2.4:
time id value
6:00:01.000 1 333
6:00:01.005 1 123
6:00:01.050 2 544
6:00:01.060 2 544
When all the data for id 1 has arrived in the dataframe and the data for the next id 2 starts coming in, I want to run calculations on the complete data of id 1. But how do I do that? I don't think I can use the window functions, since I do not know the time in advance, and it also varies for each id. And I do not know the id from any source other than the streamed data frame.
The only solution that comes to my mind involves variable comparison (a memory) and a while loop:
id_old = 0  # start value
while true:
    id_cur = id_from_dataframe
    if id_cur != id_old:  # id has changed
        do calculation for id_cur
        id_old = id_cur
But I do not think that this is the right solution. Can you give me a hint or point me to documentation? I cannot find relevant examples or documentation.
I got it running with a combination of watermarking and grouping:
import pyspark.sql.functions as F

d2 = d1.withWatermark("time", "60 second") \
    .groupby('id', F.window('time', "40 second")) \
    .agg(
        F.count("*").alias("count"),
        F.min("time").alias("time_start"),
        F.max("time").alias("time_stop"),
        F.round(F.avg("value"), 1).alias('value_avg'))
Most of the documentation only shows the basics, grouping by time only; I saw just one example with an additional grouping column, so I put my 'id' there.
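For completeness, the aggregation above still has to be started as a streaming query with a sink before anything runs. A minimal sketch of the equivalent in Scala (the console sink and output mode are assumptions chosen just for local testing; d1 is assumed to be the streaming dataframe):

import org.apache.spark.sql.functions.{window, count, min, max, round, avg}
import spark.implicits._   // assumes a SparkSession named `spark` is in scope

val d2 = d1.withWatermark("time", "60 seconds")
  .groupBy($"id", window($"time", "40 seconds"))
  .agg(
    count("*").as("count"),
    min("time").as("time_start"),
    max("time").as("time_stop"),
    round(avg("value"), 1).as("value_avg"))

d2.writeStream
  .outputMode("update")   // "append" only emits a window once the watermark has passed it
  .format("console")
  .start()
  .awaitTermination()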

Many to many join on large datasets in Spark

I have two large datasets, A and B, which I wish to join on key K.
Each dataset contains many rows with the same value of K, so this is a many-to-many join.
This join fails with memory-related errors if I just try it naively.
Let's also say that grouping both datasets by K, doing the join and then exploding back out with some trickery to get the correct result isn't a viable option, again due to memory issues.
Are there any clever tricks people have found which improve the chance of this working?
Update:
Adding a very, very contrived concrete example:
spark-shell --master local[4] --driver-memory 5G --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.sql.shuffle.partitions=10000 --conf spark.default.parallelism=10000
val numbersA = (1 to 100000).toList.toDS
val numbersWithDataA = numbersA.repartition(10000).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataA.write.mode("overwrite").parquet("numbersWithDataA.parquet")
val numbersB = (1 to 100).toList.toDS
val numbersWithDataB = numbersB.repartition(100).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataB.write.mode("overwrite").parquet("numbersWithDataB.parquet")
val numbersWithDataInA = spark.read.parquet("numbersWithDataA.parquet").toDF("numberA", "one", "dataA")
val numbersWithDataInB = spark.read.parquet("numbersWithDataB.parquet").toDF("numberB", "one", "dataB")
numbersWithDataInA.join(numbersWithDataInB, Seq("one")).write.mode("overwrite").parquet("joined.parquet")
Fails with Caused by: java.lang.OutOfMemoryError: Java heap space
--conf spark.sql.autoBroadcastJoinThreshold=-1
means you are disabling the broadcast feature.
You can change it to any suitable value below 2 GB (there is a 2 GB limit). The default for spark.sql.autoBroadcastJoinThreshold is 10 MB, as per the Spark documentation. I don't know why you have disabled it; if you disable it, SparkStrategies will switch the plan to a sort-merge join or a shuffle hash join. See my article for details.
Other than that, I don't think anything needs to change, as this is a common pattern for joining 2 datasets.
Further reading: DataFrame join optimization - Broadcast Hash Join
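If the smaller side really does fit in memory, you can also force the broadcast explicitly instead of relying on the threshold; a sketch using the contrived example's dataframes (the output path is just illustrative):

import org.apache.spark.sql.functions.broadcast

// Explicitly broadcast the much smaller B side so the join does not shuffle A.
numbersWithDataInA
  .join(broadcast(numbersWithDataInB), Seq("one"))
  .write.mode("overwrite").parquet("joined_broadcast.parquet")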
UPDATE: Alternatively, in your real example (not the contrived one :-)) you can do these steps (a fuller sketch follows after the example below):
Steps:
1) In each dataset, find the join key (for example, pick a unique/distinct category, country or state field) and collect the distinct values as an array; since it is small data, you can collect it.
2) For each category element in the array, join the 2 datasets (playing with small dataset joins) with the category as a where condition, and append the result to a sequence of dataframes.
3) Reduce and union these dataframes.
Scala example:
val dfCategories = Seq(df1Category1, df2Category2, df3Category3)
dfCategories.reduce(_ union _)
Note: for each of these joins I still prefer a broadcast hash join (BHJ), since it will involve less shuffle, or none at all.
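A rough sketch of steps 1) to 3), where dfA and dfB stand for the two datasets, "K" is the join key and "category" is a hypothetical coarse field used to split the work (all names are illustrative, not from the question):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// 1) Collect the distinct categories; assumed to be a small set, so collect() is fine.
val categories: Seq[String] =
  dfA.select("category").distinct().collect().map(_.getString(0)).toSeq

// 2) Join the two datasets one category at a time, broadcasting the smaller filtered side.
val perCategory: Seq[DataFrame] = categories.map { c =>
  dfA.filter(dfA("category") === c)
    .join(broadcast(dfB.filter(dfB("category") === c)), Seq("K"))
}

// 3) Reduce and union the per-category results back into a single dataframe.
val joined: DataFrame = perCategory.reduce(_ union _)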

How to avoid gc overhead limit exceeded in a range query with GeoSpark?

I am using Spark 2.4.3 with the GeoSpark 1.2.0 extension.
I have two tables to join on range distance. One table (t1) has ~100K rows with a single column, a GeoSpark geometry. The other table (t2) has ~30M rows and is composed of an Int value and a GeoSpark geometry column.
What I am trying to do is quite simple:
val spark = SparkSession
  .builder()
  // .master("local[*]")
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
  .config("geospark.global.index", "true")
  .config("geospark.global.indextype", "rtree")
  .config("geospark.join.gridtype", "rtree")
  .config("geospark.join.numpartition", 200)
  .config("spark.sql.parquet.filterPushdown", "true")
  // .config("spark.sql.shuffle.partitions", 10000)
  .config("spark.sql.autoBroadcastJoinThreshold", -1)
  .appName("PropertyMaster.foodDistanceEatout")
  .getOrCreate()

GeoSparkSQLRegistrator.registerAll(spark)
spark.sparkContext.setLogLevel("ERROR")

spark.read
  .load(s"$dataPath/t2")
  .repartition(200)
  .createOrReplaceTempView("t2")

spark.read
  .load(s"$dataPath/t1")
  .repartition(200)
  .cache()
  .createOrReplaceTempView("t1")

val query =
  """
    |select /*+ BROADCAST(t1) */
    |  t2.cid, ST_Distance(t1.geom, t2.geom) as distance
    |  from t2, t1 where ST_Distance(t1.geom, t2.geom) <= 3218.69""".stripMargin

spark.sql(query)
  .repartition(200)
  .write.mode(SaveMode.Append)
  .option("path", s"$dataPath/my_output.csv")
  .format("csv").save()
I tried different configurations, both running it locally and on the local cluster on my laptop (16 GB total memory, 8 cores), but without any luck: the program crashes at "Distinct at Join" for GeoSpark, with lots of shuffling. However, I am not able to remove the shuffling from the Spark SQL syntax. I thought of adding an extra id column to the biggest table (for example the same integer every 200 rows or so) and repartitioning by that, but that didn't work either.
I was expecting GeoSpark's indexing to provide a partitioner, but I am not sure it is working.
Any idea?
I found an answer myself: the GC overhead was due to partitioning, but also to the memory needed by GeoSpark's partitioner (based on the index) and to a timeout caused by the long geo-query calculations. It was solved by adding the following parameters, as suggested by the GeoSpark website itself:
spark.executor.memory 4g
spark.driver.memory 10g
spark.network.timeout 10000s
spark.driver.maxResultSize 5g
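For reference, a minimal sketch of passing the same settings on the command line, in the same style as the spark-submit example earlier (class name and jar path are placeholders); driver-side settings such as spark.driver.memory generally have to be set this way or in spark-defaults.conf, since they are read before the driver JVM starts:

spark-submit \
  --class your.main.Class \
  --driver-memory 10g \
  --executor-memory 4g \
  --conf spark.network.timeout=10000s \
  --conf spark.driver.maxResultSize=5g \
  /path/to/your-geospark-app.jar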

Spark Scala performance issues with take and InsertInto commands

Please look at the attached screenshot.
I am trying to make some performance improvements to my Spark job, and it is taking almost 5 minutes to execute the take action on a dataframe.
I am using take to make sure the dataframe has some records in it; if it does, I want to proceed with further processing.
I tried both take and count and don't see much difference in execution time.
Another scenario is that it takes around 10 minutes to write the dataframe into a Hive table (it has at most 200 rows and 10 columns).
df.write.mode("append").partitionBy("date").insertInto(tablename)
Please suggest how we can minimize the time taken by take and by the insert into the Hive table.
Updates:
Here is my spark submit : spark-submit --master yarn-cluster --class com.xxxx.info.InfoAssets --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.security.auth.login.config=kafka_spark_jaas.conf" --files /home/ngap.app.rcrp/hive-site.xml,/home//kafka_spark_jaas.conf,/etc/security/keytabs/ngap.sa.rcrp.keytab --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar --executor-memory 3G --num-executors 3 --executor-cores 10 /home/InfoAssets/InfoAssets.jar
Code details:
It's a simple dataframe which has 8 columns and around 200 records in it, and I am using the following code to insert it into the Hive table.
df.write.mode("append").partitionBy("partkey").insertInto(hiveDB + "." + tableName)
Thanks,Bab
Don't use count before the write if it is not necessary, and if your table is already created, then use Spark SQL to insert the data into the Hive partitioned table:
spark.sql("Insert into <tgt tbl> partition(<col name>) select cols, partition col from temp_tbl")

Spark taking long time to return the count of an RDD

I have a dataframe that was produced by a query. The data size might be more than 1 GB.
After using Spark SQL, consider that I have a dataframe df. After this,
I filtered the dataset into 2 dataframes:
one with values less than 540, i.e. filteredFeaturesBefore, and another with values between 540 and 640, i.e. filteredFeaturesAfter.
After this, I combined the 2 dataframes mentioned above into combinedFeatures by using an inner join:
val combinedFeatures = sqlContext.sql("""select * from filteredFeaturesBefore inner join filteredFeaturesAfter on filteredFeaturesBefore.ID = filteredFeaturesAfter.ID """)
Later I create a dataset of LabeledPoint to pass the data into the machine learning model:
val combinedRddFeatures = combinedFeatures.rdd
val finalSamples = combinedRddFeatures.map(event =>
  LabeledPoint(parseDouble(event(0) + ""),
    Vectors.dense(parseDouble(event(1) + ""), parseDouble(event(2) + ""),
      parseDouble(event(3) + ""), parseDouble(event(4) + ""))))
If I perform finalSamples.count(), Spark executes but does not return anything for a long time. I ran the program for 6 hours and still no results were returned. I had to stop the execution because the laptop became very slow and was not responding properly.
I don't know whether this is because of my laptop's processor speed or whether Spark is hung.
I'm using a MacBook Air 2017, which has a 1.8 GHz processor.
Can you tell me why this is happening, as I'm new to Spark?
Also, is there any workaround for this? Instead of splitting the data into 2 dataframes, can I iterate over both dataframes and extract the LabeledPoint data structure? If yes, can you suggest a method to do this?