Simple Spark job fails due to GC overhead limit - Scala

I've created a standalone Spark (2.1.1) cluster on my local machines
with 9 cores / 80G RAM per machine (27 cores / 240G RAM in total).
I've got a sample Spark job that sums all the numbers from 1 to x.
This is the code:
package com.example

import org.apache.spark.sql.SparkSession

object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .config("spark.driver.maxResultSize", "3g")
      .appName("ExampleApp")
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(List.range(1, 1000))
    val sum = rdd.reduce((a, b) => a + b)
    println(sum)
    done
  }

  def done = {
    println("\n\n")
    println("-------- DONE --------")
  }
}
When running the above code I get results after a few seconds,
so I cranked the code up to sum all the numbers from 1 to 1B (1,000,000,000), and then I get a GC overhead limit exceeded error.
I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help.
Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10G
I'm not a developer and have no knowledge of Scala, but I would like to find a way to run this code without GC issues.
Per #philantrovert's request, I'm adding my spark-submit command:
/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar
In addition, my spark/conf files are as follows:
The slaves file contains the 3 IP addresses of my nodes (including the master).
spark-defaults.conf contains:
spark.master spark://192.168.1.2:7077
spark.driver.memory 10g
spark-env.sh contains:
SPARK_LOCAL_DIRS= shared folder among all nodes
SPARK_EXECUTOR_MEMORY=10G
SPARK_DRIVER_MEMORY=10G
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=10G
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_DIR= shared folder among all nodes
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Thanks

I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (about 4 GB). There is a more efficient way to create a Dataset/RDD programmatically:
val rdd = spark.range(1000000000L).rdd
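For example, a minimal sketch of the job rewritten around spark.range (the master URL and app name are taken from the question; the sum is done with Longs to avoid Int overflow):

import org.apache.spark.sql.SparkSession

object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .appName("ExampleApp")
      .getOrCreate()

    // range() generates the numbers lazily on the executors instead of
    // materialising a 1-billion-element List on the driver.
    val sum = spark.range(1L, 1000000001L).rdd.map(_.toLong).reduce(_ + _)
    println(sum)

    spark.stop()
  }
}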

Related

Spark Cassandra integration not using C* optimizations

I am running the code from the IntelliJ IDE. My Spark/Cassandra cluster has 3 nodes. The Cassandra nodes and Spark workers are on the same machines.
val sparkConf = new SparkConf()
.set(s"spark.sql.catalog.mycatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
.set("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.set("spark.sql.catalog.casscatalog", "com.datastax.spark.connector.datasource.CassandraCatalog");
val sc = SparkSession.builder()
.config(sparkConf)
.master("spark://master")
.withExtensions(new CassandraSparkExtensions)
.getOrCreate();
val table = sc.sql("select * from table where primarykeyA = 1");
table.show(10)
Now, when I run the above query normally it completes in milliseconds, since I have specified the partition key.
Expectation: this query should hit only the one worker node that holds the partition data and then return.
Somehow when this runs it ends up going to all worker nodes, which indicates the DataStax optimizations are not in place.
Is there a way I can pass the parameter --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta via code?
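A hedged sketch, assuming the standard spark.jars.packages configuration key can stand in for --packages when it is set before the session is created (the coordinates are copied from the question):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: express --packages as spark.jars.packages so the
// connector and its extensions are resolved when the session starts.
val sparkConf = new SparkConf()
  .set("spark.jars.packages",
    "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta")
  .set("spark.sql.extensions",
    "com.datastax.spark.connector.CassandraSparkExtensions")
  .set("spark.sql.catalog.casscatalog",
    "com.datastax.spark.connector.datasource.CassandraCatalog")

val spark = SparkSession.builder().config(sparkConf).getOrCreate()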

Spark SQL Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

While using Spark to read a data set with the following code:
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
Then I want to collect the data to the driver:
val rows_array: Array[Row] = df.collect()
An error occurred:
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)
The reason for this error seems to be that there are too many data items, because when I use
val rows_array: Array[Row] = df.take(10000000)
it runs successfully.
But when I use
val rows_array: Array[Row] = df.take(100000000)
the error appears again (it ran successfully the next day, but there are still errors when fetching all the data).
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows of data about 800,000,000, 12GB
and the memory is enough.
---------------------------------------------------------edit line--------------------------------------------------------
Today I ran it again using the code below:
val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)
val spark: SparkSession = SparkSession.builder.appName("test").getOrCreate
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
df.cache()
df.count()
val df_rows: Array[Row] = df.collect()
print("df_rows[0] + df_rows(0))
print("df_rows size:" + df_rows.length)
I submit the application by:
{SPARK_HOME}/bin/spark-submit --master spark://master:7077 \
--conf spark.executor.cores=35 \
--total-executor-cores 105 \
--executor-memory 145g \
--driver-memory 200g \
--conf "spark.executor.extraJavaOptions=-Xms145g" \
--conf "spark.driver.extraJavaOptions=-Xms200g"
The size of the csv file I read is about 12GB, each line in the csv file is two integers, and the memory occupied by df.cache() is 5.5GB.
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows of data about 800,000,000, 12GB
There are three machines in the cluster and they are all workers; the node submitting the job (the driver node) has 370GB of memory.
I monitored the Spark web UI and all the tasks completed successfully (running time was about 60s), but after more than ten minutes an error appears in the shell (note: the process did not exit, and the file in HDFS is fine):
WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block xxxxx
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)
WARN hdfs.DFSClient: Error Recovery for block xxxxx in pipeline DatanodeInfoWithStorage[ipxxx,DISK],DatanodeInfoWithStorage[ipxxx,DISK],DatanodeInfoWithStorage[ipxxx,DISK]
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream.run(DFSOutputStream.java:508)
And the following two lines of code did not run, because I don't see their output:
print("df_rows[0]: " + df_rows(0))
print("df_rows size:" + df_rows.length)
In addition, I monitored the memory usage of the machine where the driver is located and found that it keeps increasing (after running for 30 minutes, the spark-submit process [the driver] uses about 120GB).
2020.04.16:
Another piece of information:
Converting the DataFrame to an RDD first runs successfully:
val rows_array: Array[Row] = df.rdd.collect()
But in this case, the memory consumption is very high (the memory I allocated is almost used up).
Does anyone know the reason? And another question: why is the memory usage so large?

Is there a way to optimize Spark SQL code?

Updated:
I'm using Spark SQL 1.5.2. I'm trying to read many Parquet files, then filter and aggregate rows - there are ~35M rows stored in ~30 files in my HDFS, and it takes more than 10 minutes to process.
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12
  .where("event_data.level >= 90")
  .select("pid", "timestamp", "event_data.level")
  .withColumn("event_date", to_date(logins_12("timestamp")))
  .drop("timestamp")
  .toDF("pid", "level", "event_date")
  .groupBy("pid", "event_date")
  .agg(Map("level" -> "max"))
  .toDF("pid", "event_date", "level")
l_12.first()
My Spark is running in a two-node cluster with 8 cores and 16GB RAM each; the Scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing the computation, which is still less than I was expecting, since there are ~30 files that could be processed in parallel.
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting the Spark shell with 14g for the executor and 4g for the driver in yarn-client mode:
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
my default config for spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
These are the partition counts of the RDDs:
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
Is there a way to optimize this code?
Thanks
Both behaviors are more or less expected. Spark is rather lazy: not only does it not execute transformations unless you trigger an action, it can also skip tasks if they are not required for the output. Since first requires only a single element, it can compute just one partition, which is most likely why you see only one running thread at some point.
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely a matter of Spark defaults. As you can read in the configuration guide, spark.executor.cores on YARN is set to 1 by default. Two workers give two running threads.
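As a hedged illustration for a self-contained Spark 1.x application (the values here are only examples, not recommendations for this cluster), the executor cores can be raised on the SparkConf before the context is created:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Ask YARN for more cores per executor so more tasks can run concurrently.
val conf = new SparkConf()
  .setAppName("parquet-aggregation")
  .set("spark.executor.cores", "4")     // the YARN default is 1
  .set("spark.executor.memory", "14g")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)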

How to parallelize loading data into an RDD in Spark

I am using the MLlib library and the MLUtils.loadLibSVMFile() method. I have a file that is 10GB and a cluster with 5 slaves, each with 2 cores and 6GB of memory. According to the documentation found here, the method has the following parameters:
loadLibSVMFile(sc, path, multiclass=False, numFeatures=-1, minPartitions=None)
I would like to have 10 partitions, and I don't know the number of features. When I run the method without specifying the minimum partitions, I get java.lang.OutOfMemoryError: Java heap space, as would be expected.
However, when I specify numFeatures as -1 (the default in the documentation) and the number of partitions as 10, based on the web UI it does distribute the work, but after some time I get a java.lang.ArrayIndexOutOfBoundsException. The rest of the code looks identical to the example code written here.
I am brand new to Spark so please tell me if I am making some obvious mistake. Thanks!
EDIT 2: I ran the same thing with a sample dataset that was tiny and dense, and it works fine. Seems to be a problem with the number of features?
EDIT 3:
Here is the code I am trying to run:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.SVMWithSGD

object MLLibExample {
  def main(args: Array[String]) {
    val sc = new SparkContext()

    // loading the RDD
    val data = MLUtils.loadLibSVMFile(sc, "s3n://file.libsvm", -1, 10)

    // Run training algorithm to build the model
    val numIterations = 10
    val model = SVMWithSGD.train(data, numIterations)
  }
}
Here is the exception:
15/07/20 20:42:43 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 35, 10.249.67.97): java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:136)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:106)
at org.apache.spark.mllib.optimization.HingeGradient.compute(Gradient.scala:317)
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192)
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190)
- I have made sure the input file is correctly formatted (0/1 labels and space-delimited)
- Tried a subsample of the original file that is 10 MB
- Set both driver and executor memory
- Added this flag to my spark-submit command: --conf "spark.driver.maxResultSize=6G"

Sharing data between nodes using Apache Spark

Here is how I launch the Spark job:
./bin/spark-submit \
  --class MyDriver \
  --master spark://master:7077 \
  --executor-memory 845M \
  --deploy-mode client \
  ./bin/SparkJob-0.0.1-SNAPSHOT.jar
The class MyDriver accesses the Spark context using:
val sc = new SparkContext(new SparkConf())
val dataFile= sc.textFile("/data/example.txt", 1)
In order to run this within a cluster, I copy the file "/data/example.txt" to all nodes in the cluster. Is there a mechanism in Spark to share this data file between nodes without manually copying it? I don't think I can use a broadcast variable in this case?
Update:
An option is to have a dedicated file server which shares the file to be processed:
val dataFile = sc.textFile("http://fileserver/data/example.txt", 1)
sc.textFile("/some/file.txt") read a file distributed in hdfs, i.e.:
/some/file.txt is (already) split in multiple parts which are distributed a couple of computers each.
and each worker/task read one parts of the file. This is useful because you don't need to manage which part yourself.
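For instance, a minimal sketch (the HDFS URL and partition hint below are hypothetical):

// Read a file that already lives in HDFS; Spark assigns roughly one task per
// block/partition, so no manual splitting is needed.
val distributed = sc.textFile("hdfs://namenode:9000/data/example.txt", 8)
println(distributed.partitions.length)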
If you have copied the file to each worker node, you can read it in every task:
val myRdd = sc.parallelize(1 to 100) // 100 tasks
val fileReadEveryWhere = myRdd.map( i => read("/my/file.txt") )
and have the code of read(...) implemented somewhere.
Otherwise, you can also use a broadcast variable that is sent from the driver to all workers:
val myObject = read("/my/file.txt") // obj instantiated on driver node
val bdObj = sc.broadcast(myObject)
val myRdd = sc.parallelize(1 to 100)
.map{ i =>
// use bdObj in task i, ex:
bdObj.value.process(i)
}
In this case, myObject should be serializable and it is better if it is not too big.
Also, the method read(...) is run on the driver machine, so you only need the file on the driver. But if you don't know which machine that is (e.g. if you use spark-submit), then the file should be on all machines :-\ . In this case, it may be better to read from a database or an external file system.
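For completeness, a minimal sketch of the broadcast approach with a simple read(...) implemented via scala.io.Source (the path is hypothetical and only needs to exist on the driver):

import scala.io.Source

// Read the whole file into memory on the driver; keep it small enough to broadcast.
def read(path: String): Seq[String] = {
  val src = Source.fromFile(path)
  try src.getLines().toList finally src.close()
}

val lines = read("/my/file.txt")     // runs on the driver only
val bdLines = sc.broadcast(lines)    // shipped once to every executor

val result = sc.parallelize(1 to 100)
  .map(i => s"task $i sees ${bdLines.value.size} lines")
result.take(5).foreach(println)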