Spark table joins - Resource allocation issue - Scala

I am stuck while working with Hive tables using a Spark cluster (YARN is in place). I have 7 tables which I need to join, then replace some null values, and finally write the result back to a Hive table.
I use Spark SQL (Scala), creating a separate DataFrame for each table first, then joining all the DataFrames and writing the result back to a Hive table.
After about five minutes my code throws the error below, which I know is due to not setting my resource allocation properly.
19/10/13 06:46:53 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from /100.66.0.1:36467 is closed
19/10/13 06:46:53 ERROR cluster.YarnScheduler: Lost executor 401 on aaaa-bd10.pq.internal.myfove.com: Container container_e33_1570683426425_4555_01_000414 exited from explicit termination request.
19/10/13 06:47:02 ERROR cluster.YarnScheduler: Lost executor 391 on aaaa-bd10.pq.internal.myfove.com: Container marked as failed: container_e33_1570683426425_4555_01_000403 on host: aaaa-bd10.pq.internal.myfove.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
My hardware specification:
HostName  Memory in GB  CPU  Memory for YARN (GB)  CPU for YARN
Node 1 126 32 90 26
Node 2 126 32 90 26
Node 3 126 32 90 26
Node 4 126 32 90 26
How should I set the variables below properly so that my code doesn't throw this error (container marked as failed, killed on request, exit code 143)?
I have tried different configurations, but nothing has helped yet.
val spark = (SparkSession.builder
.appName("Final Table")
.config("spark.driver.memory", "5g")
.config("spark.executor.memory", "15g")
.config("spark.dynamicAllocation.maxExecutors","6")
.config("spark.executor.cores", "5")
.enableHiveSupport()
.getOrCreate())
val df1 = spark.sql("Select * from table_1") // 1.4 million records and 10 columns
val df2 = spark.sql("Select * from table_2") // 1.4 million records and 3000 columns
val df3 = spark.sql("Select * from table_3") // 1.4 million records and 300 columns
val df4 = spark.sql("Select * from table_4") // 1.4 million records and 600 columns
val df5 = spark.sql("Select * from table_5") // 1.4 million records and 150 columns
val df6 = spark.sql("Select * from table_6") // 1.4 million records and 2 columns
val df7 = spark.sql("Select * from table_7") // 1.4 million records and 12 columns
val joinDF1 = df1.join(df2, df1("number") === df2("number"), "left_outer").drop(df2("number"))
val joinDF2 = joinDF1.join(df3,joinDF1("number") === df3("number"), "left_outer").drop(df3("number"))
val joinDF3 = joinDF2.join(df4,joinDF2("number") === df4("number"), "left_outer").drop(df4("number"))
val joinDF4 = joinDF3.join(df5,joinDF3("number") === df5("number"), "left_outer").drop(df5("number"))
val joinDF5 = joinDF4.join(df6,joinDF4("number") === df6("number"), "left_outer").drop(df6("number")).drop("Dt")
val joinDF6 = joinDF5.join(df7,joinDF5("number") === df7("number"), "left_outer").drop(df7("number")).drop("Dt")
joinDF6.createOrReplaceTempView("joinDF6")
spark.sql("create table hive table as select * from joinDF6")

Please check your yarn.nodemanager.log-dirs in Ambari, if you are using Ambari. If not, find this property some other way, and if it points to a directory with very little free space, change it to a directory that has more space.
While running tasks, containers create blocks which get stored in the yarn.nodemanager.log-dirs location; if there is not enough space to store those blocks, containers start to fail.
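If Ambari is not available, one way to see what value the client picks up for this property is to read it from the Hadoop configuration that Spark loads. A minimal sketch, assuming yarn-site.xml is on the classpath (the effective NodeManager value may still differ per node):
// Read the YARN log-dirs setting visible to this Spark application
val logDirs = spark.sparkContext.hadoopConfiguration.get("yarn.nodemanager.log-dirs")
println(s"yarn.nodemanager.log-dirs = $logDirs")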

Related

Why is adaptive SQL not working with df persist?

val spark = SparkSession.builder().master("local[4]").appName("Test")
.config("spark.sql.adaptive.enabled", "true")
.config("spark.sql.adaptive.coalescePartitions.enabled", "true")
.config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
.config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
.config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1024")
.getOrCreate()
val df = spark.read.csv("<Input File Path>")
val df1 = df.distinct()
df1.persist() // On removing this line, the code works as expected
df1.write.csv("<Output File Path>")
I have an input file of size 2 GB which is read as 16 partitions of 128 MB each. I have enabled adaptive query execution to coalesce partitions after the shuffle.
Without df1.persist, df1.write.csv writes 4 partition files of 50 MB each, which is expected.
If I include df1.persist, Spark writes 200 partitions (the adaptive coalesce is not applied).
.config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
Adding this config worked
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-38172?filter=reportedbyme
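For context, a minimal sketch of the question's builder with the extra flag added (assuming Spark 3.2 or later, where this setting exists):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("Test")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "50m")
  // Allow AQE to change the output partitioning of a cached (persisted) plan
  .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "true")
  .getOrCreate()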

Spark JDBC Read, Partition On, Column Type to Select?

I am trying to read a SQL table (15 million rows) into a Spark DataFrame. I want to leverage multiple cores to do the read very fast and to partition it. Which column(s) can I select to partition on? Is it an ID, UUID, sequence, or date-time? How should I calculate the number of partitions?
There are multiple complex questions in your question:
- Which column(s) can I select to partition on?
It depends on your needs, your computing goals, and the transformations you will do next with Spark on your data (for example, if you groupBy(key) and your key is a date-time, then you should partition by date-time).
- The number of partitions depends on the size of your data, your hardware resources, your needs, and so on. It is a complex question; you also have to take shuffle partitions into account for transformations (the default is 200; the value advised by Spark is 3 * the number of CPU cores).
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// The original answer did not define this type; shown here as a placeholder for the connection details
case class Database(url: String, databaseName: String, tableName: String, user: String, passwd: String)

val sparkSession = SparkSession.builder
  .master("local[*]")
  .appName("JdbcMicroService")
  .getOrCreate()

// Shuffle partitions: roughly 3 x the number of CPU cores
val numCpuCores = Runtime.getRuntime.availableProcessors()
sparkSession.conf.set("spark.sql.shuffle.partitions", 3 * numCpuCores)

def requestPostgreSql(sparkSession: SparkSession, database: Database, dateOfRequest: String): DataFrame = {
  val url = "jdbc:postgresql://" + database.url + "/" + database.databaseName
  val requestDF = sparkSession.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", url)
    .option("dbtable", database.tableName)
    .option("user", database.user)
    .option("password", database.passwd)
    .load()
    .repartition(col("colName")) // repartition by the chosen key column after loading
  requestDF
}
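Spark's JDBC source can also parallelize the read itself when given a roughly uniformly distributed numeric (or date/timestamp) column with known bounds, which addresses the original question more directly. A hedged sketch; the column name, bounds, and partition count below are placeholders, not values from the question:
// Parallel JDBC read: each of numPartitions tasks issues its own range query
val parallelDF = sparkSession.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/dbname") // placeholder connection URL
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "schema.table_name")              // placeholder table
  .option("user", "user")
  .option("password", "passwd")
  .option("partitionColumn", "id")   // placeholder: a numeric, date or timestamp column
  .option("lowerBound", "1")         // min value of the partition column
  .option("upperBound", "15000000")  // max value (the question mentions ~15 million rows)
  .option("numPartitions", "48")     // e.g. 3 x the number of CPU cores
  .load()
The bounds should cover the real range of the partition column; rows outside them are not dropped but land in the first and last partitions.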

Spark Group By and with Rank function is running very slow

I am writing a Spark app for finding the top n accessed URLs within a time frame, but this job keeps running and takes hours for 389451 records in ES for one instance. I want to reduce this time.
I am reading from Elasticsearch in Spark as below:
val df = sparkSession.read
.format("org.elasticsearch.spark.sql")
.load(date + "/" + business)
.withColumn("ts_str", date_format($"ts", "yyyy-MM-dd HH:mm:ss")).drop("ts").withColumnRenamed("ts_str", "ts")
.select(selects.head, selects.tail:_*)
.filter($"ts" === ts)
.withColumn("url", split($"uri", "\\?")(0)).drop("uri").withColumnRenamed("url", "uri").cache()
In the above DataFrame I am reading and filtering from Elasticsearch. I am also removing query params from the URI.
Then I am doing a group by:
var finalDF = df.groupBy("col1","col2","col3","col4","col5","uri").agg(sum("total_bytes").alias("total_bytes"), sum("total_req").alias("total_req"))
Then I am running a window function
val partitionBy = Seq("col1","col2","col3","col4","col5")
val window = Window.partitionBy(partitionBy.head, partitionBy.tail:_*).orderBy(desc("total_req"))
finalDF = finalDF.withColumn("rank", rank.over(window)).where($"rank" <= 5).drop("rank")
Then I am writing finalDF to Cassandra:
finalDF.write.format("org.apache.spark.sql.cassandra").options(Map("table" -> "table", "keyspace" -> "keyspace")).mode(SaveMode.Append).save()
I have 4 data nodes in the ES cluster and my Spark machine is a 16-core, 64 GB RAM VM. Please help me find where the problem is.
It could be a good idea to persist your DataFrame after the read, because you are going to use it many times in the rank function.
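A minimal sketch of that suggestion, reusing the df and column names from the question; note that the question's read already ends with .cache(), which for DataFrames is equivalent to persist at MEMORY_AND_DISK, so an explicit persist mainly makes the storage level visible:
import org.apache.spark.storage.StorageLevel

// Materialize the filtered ES data once so downstream stages re-read it from memory/disk instead of Elasticsearch
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
var finalDF = cached.groupBy("col1","col2","col3","col4","col5","uri")
  .agg(sum("total_bytes").alias("total_bytes"), sum("total_req").alias("total_req"))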

Why is the DataStax Cassandra sparkContext selecting 124 rows in the RDD when trying to select only a single row through limit 1?

I am using the DataStax Cassandra sparkContext (sc) and running the following Scala code:
val table = sc.cassandraTable("database_name", "table_name")
val rdd = table.limit(1)
rdd.count
The above prints 124, which means 124 records are selected, even though the limit given as a parameter is 1. Similarly, table.limit(2) gives 248 rows. Is there a specific reason for the 124 multiplier, or am I missing something?
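If the connector applies limit(n) per Spark partition (one partition per token range), the total count becomes n times the number of partitions, which would explain the 124 multiplier. A sketch under that assumption; take(1) fetches exactly one row on the driver instead:
import com.datastax.spark.connector._

val table = sc.cassandraTable("database_name", "table_name")
// If the limit is pushed down per partition, the count tracks the partition count
println(table.limit(1).partitions.length) // expected to match the 124 seen above, under this assumption
// Fetch exactly one row on the driver, independent of the per-partition behaviour
val oneRow = table.take(1)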

Weird behavior of DataFrame operations

Consider the code:
val df1 = spark.table("t1").filter(col("c1")=== lit(127))
val df2 = spark.sql("select x,y,z from ORCtable")
val df3 = df1.join(df2.toDF(df2.columns.map(_ + "_R"): _*),
trim(upper(coalesce(col("y_R"), lit("")))) === trim(upper(coalesce(col("a"), lit("")))), "leftouter")
df3.select($"y_R",$"z_R").show(500,false)
This produces the warning WARN TaskMemoryManager: Failed to allocate a page (2097152 bytes), try again. The code then fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
But if I run the following code:
val df1 = spark.table("t1").filter(col("c1")=== lit(127))
val df2 = spark.sql("select x,y,z from ORCtable limit 2000000")//only difference here
//ORC table has 1651343 rows so doesn't exceed limit 2000000
val df3 = df1.join(df2.toDF(df2.columns.map(_ + "_R"): _*),
trim(upper(coalesce(col("y_R"), lit("")))) === trim(upper(coalesce(col("a"), lit("")))), "leftouter")
df3.select($"y_R",$"z_R").show(500,false)
This produces the correct output. I'm at a loss as to why this happens and what changes. Can someone help me make sense of this?
To answer my own question: the Spark physical execution plans are different for the two ways of generating the same dataframe, which can be checked by calling the .explain() method.
The first way uses a broadcast-hash join, which causes the java.lang.OutOfMemoryError: GC overhead limit exceeded, whereas the latter way runs a sort-merge join, which is typically slower but does not strain garbage collection as much.
This difference in physical execution plans is introduced by the additional filter operation on the df2 dataframe.
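A hedged sketch of how to confirm and, if needed, force the behaviour described above, assuming an active SparkSession named spark and the table/column names from the question; the broadcast threshold is a standard Spark SQL setting:
import org.apache.spark.sql.functions._

// Inspect the physical plan: look for BroadcastHashJoin vs SortMergeJoin
val dfA = spark.table("t1").filter(col("c1") === lit(127))
val dfB = spark.sql("select x,y,z from ORCtable")
dfA.join(dfB.toDF(dfB.columns.map(_ + "_R"): _*),
  trim(upper(coalesce(col("y_R"), lit("")))) === trim(upper(coalesce(col("a"), lit("")))),
  "leftouter").explain()

// Disabling auto-broadcast (-1) steers Spark to the sort-merge join that completed successfully here
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")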