is there a way to optimize spark sql code? - scala

updated:
I'm using Spark SQL 1.5.2. I'm trying to read many Parquet files, then filter and aggregate the rows; there are ~35M rows stored in ~30 files in my HDFS, and it takes more than 10 minutes to process:
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12
  .where("event_data.level >= 90")
  .select("pid", "timestamp", "event_data.level")
  .withColumn("event_date", to_date(logins_12("timestamp")))
  .drop("timestamp")
  .toDF("pid", "level", "event_date")
  .groupBy("pid", "event_date")
  .agg(Map("level" -> "max"))
  .toDF("pid", "event_date", "level")
l_12.first()
My Spark runs on a two-node cluster with 8 cores and 16 GB RAM each. The scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing the computation, which is still fewer than I was expecting, since there are ~30 files that could be processed in parallel:
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting the Spark console with 14g for the executor and 4g for the driver in yarn-client mode:
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
my default config for spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
Partition counts of the underlying RDDs:
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks

Both behaviors are more or less expected. Spark is rather lazy: not only does it avoid executing transformations until you trigger an action, it can also skip tasks that are not required for the output. Since first requires only a single element, it can get away with computing a single partition. That is most likely why you see only one running thread at some point.
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely down to the Spark defaults. As you can read in the Configuration guide, spark.executor.cores on YARN is set to 1 by default. Two workers give two running threads.
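A minimal sketch of overriding that default when launching the shell (the executor and core counts below are illustrative assumptions, not values from the original post):

./bin/spark-shell --master yarn-client \
  --executor-memory 14g \
  --driver-memory 4g \
  --executor-cores 8 \
  --num-executors 2

With more cores per executor, each executor can run that many tasks concurrently, so more of the ~30 input files get read in parallel.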

Related

spark cache manager behavior

I'm trying to understand the Spark cache manager behavior. I deployed my test code to Spark Job Server to get a long-running context, and I want to test the behavior by executing the same job several times in a row to see how caching behaves.
val manager = spark.sharedState.cacheManager
val DF = collectData.retrieveDataFromCass(spark) // loaded from cassandra successfully with 2k rows
val testCachedData = if (manager.lookupCachedData(DF.queryExecution.logical).isEmpty) 0 else 1
DF.createOrReplaceTempView(tempName1)
spark.sqlContext.cacheTable(tempName1)
DF.count() // action
testCachedData
Then I'm returning testCachedData.
I expected testCachedData to be 0 on the first job execution and 1 on the following runs.
But every job returns 0, as if the cache were empty each time, even though when I check the Spark UI Storage tab I can see cached data there.
Why can't the cache manager see my cached data within the same Spark application?
This test is using:
SPARK 3.2
spark-cassandra-connector 3.0.1
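One thing worth checking (an assumption on my part, not something the post confirms): the cache manager matches entries against analyzed plans, and in the code above lookupCachedData is called before cacheTable runs, so it can come back empty even when the data does get cached afterwards. A minimal sketch that does the lookup after caching, against the analyzed plan, and also via the public catalog API:

// Hedged sketch, assuming spark is the long-running SparkSession and
// DF / tempName1 are the DataFrame and view name from the job above.
DF.createOrReplaceTempView(tempName1)
spark.sqlContext.cacheTable(tempName1)
DF.count() // action that materializes the cache

// look up after caching, using the analyzed plan rather than the raw logical plan
val cachedViaManager =
  spark.sharedState.cacheManager.lookupCachedData(DF.queryExecution.analyzed).isDefined
// or ask the catalog directly by view name
val cachedViaCatalog = spark.catalog.isCached(tempName1)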

Spark SQL Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

While using Spark to read a data set with the following code:
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
Then I want to collect the data to the driver:
val rows_array: Array[Row] = df.collect()
An error occurred:
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)
The reason for this error seems to be that there is too much data, because when I use
val rows_array: Array[Row] = df.take(10000000)
it runs successfully.
But when I use
val rows_array: Array[Row] = df.take(100000000)
the error appears again (it ran successfully the next day, but there are still errors when fetching all the data).
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows of data about 800,000,000, 12GB
and the memory is enough.
Edit:
Today I ran it again using the code below:
val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)
val spark: SparkSession = SparkSession.builder.appName("test").getOrCreate
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
df.cache()
df.count()
val df_rows: Array[Row] = df.collect()
print("df_rows(0): " + df_rows(0))
print("df_rows size: " + df_rows.length)
I submit the application by:
${SPARK_HOME}/bin/spark-submit --master spark://master:7077 \
--conf spark.executor.cores=35 \
--total-executor-cores 105 \
--executor-memory 145g \
--driver-memory 200g \
--conf "spark.executor.extraJavaOptions=-Xms145g" \
--conf "spark.driver.extraJavaOptions=-Xms200g"
The size of the csv file I read is about 12GB, each line in the csv file is two integers, and the memory occupied by df.cache() is 5.5GB.
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows of data about 800,000,000, 12GB
There are three machines in the cluster and all of them are workers; the node submitting the job (the driver node) has 370 GB of memory.
I monitored the Spark web UI and all tasks completed successfully (running time about 60s), but after more than ten minutes an error shows up in the shell (note: the process did not exit, and the file in HDFS is fine):
WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block xxxxx
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)
WARN hdfs.DFSClient: Error Recovery for block xxxxx in pipeline DatanodeInfoWithStorage[ipxxx,DISK],DatanodeInfoWithStorage[ipxxx,DISK],DatanodeInfoWithStorage[ipxxx,DISK]
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream.run(DFSOutputStream.java:508)
And the following two lines of code did not run, because I don't see their output:
print("df_rows(0): " + df_rows(0))
print("df_rows size: " + df_rows.length)
In addition, I also monitored the memory usage of the machine where the driver is located and found that it keeps increasing (after running for 30 minutes, the spark-submit process [the driver] takes about 120 GB).
2020.04.16:
More information: converting to an RDD first runs successfully:
val rows_array: Array[Row] = df.rdd.collect()
But in this case the memory consumption is very high (the memory I allocated is almost used up).
Does anyone know the reason? And another question: why is the memory usage so large?
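If the goal is only to iterate over the rows on the driver rather than hold them all in a single Array, a sketch of an alternative (my assumption about the goal, not something the post states) is toLocalIterator, which pulls one partition at a time to the driver:

// Hedged sketch: df is the cached DataFrame from above.
// toLocalIterator streams partitions to the driver one by one, so the driver
// only needs memory for the largest partition instead of the whole dataset.
import scala.collection.JavaConverters._

val it = df.toLocalIterator().asScala
it.take(5).foreach(println)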

Spark DataFrame writes part files to _temporary instead of directly creating part files in the output directory [duplicate]

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
The df writes fine to HDFS; reading it back shows the expected count:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local parquet or csv file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So no parquet file is being written.
I have tried this maybe twenty times, for both csv and parquet, and on two different EMR servers: the same behavior is exhibited in all cases.
Is this an EMR-specific bug? A more general EC2 bug? Something else? This code works with Spark on macOS.
In case it matters, here is the versioning info:
Release label: emr-5.13.0
Hadoop distribution: Amazon 2.8.3
Applications: Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because you happen to have a shared file system).
A local path is not interpreted (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local file system.
Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the file system), but depending on the commit algorithm, it might not even be finalized (moved out of the temporary directory).
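A minimal sketch of the usual workaround (the paths are illustrative, and copying the result to the driver's disk with the Hadoop FileSystem API is my suggestion, not part of the original answer): write to the distributed file system first, then pull the output to the machine that needs it.

// Hedged sketch: write to HDFS, then copy the output directory to the driver's local disk.
import org.apache.hadoop.fs.{FileSystem, Path}

df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(new Path("/tmp/topVendors"), new Path("file:///tmp/topVendors"))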
This error usually occurs when you try to read an empty directory as parquet.
You could check:
1. whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it (see the sketch below);
2. whether the path you are giving is correct.
Also, in what mode are you running your application? Try running it in client mode if you are running in cluster mode.
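For the first check, a small sketch of guarding the write (df and the path are the ones used above; the guard itself is only an illustration):

// Hedged sketch: skip the write when there is nothing to write.
if (!df.rdd.isEmpty()) {
  df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
} else {
  println("DataFrame is empty, skipping the write")
}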

Simple Spark job fails due to GC overhead limit

I've created a standalone Spark (2.1.1) cluster on my local machines
with 9 cores / 80 GB per machine (a total of 27 cores / 240 GB RAM).
I've got a sample Spark job that sums all the numbers from 1 to x.
This is the code:
package com.example

import org.apache.spark.sql.SparkSession

object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .config("spark.driver.maxResultSize", "3g")
      .appName("ExampleApp")
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(List.range(1, 1000))
    val sum = rdd.reduce((a, b) => a + b)
    println(sum)
    done
  }

  def done = {
    println("\n\n")
    println("-------- DONE --------")
  }
}
When running the above code I get results after a few seconds, so I cranked the code up to sum all the numbers from 1 to 1B (1,000,000,000), and then I get GC overhead limit reached.
I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help:
Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10
I'm not a developer and have no knowledge of Scala, but I would like to find a solution to run this code without GC issues.
Per @philantrovert's request I'm adding my spark-submit command:
/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar
In addition, my spark/conf files are as follows:
The slaves file contains the 3 IP addresses of my nodes (including the master).
spark-defaults.conf contains:
spark.master spark://192.168.1.2:7077
spark.driver.memory 10g
spark-env.sh contains:
SPARK_LOCAL_DIRS= shared folder among all nodes
SPARK_EXECUTOR_MEMORY=10G
SPARK_DRIVER_MEMORY=10G
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=10G
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_DIR= shared folder among all nodes
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Thanks
I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (at least 4 GB). There is a more efficient way to programmatically create a Dataset/RDD:
val rdd = spark.range(1000000000L).rdd
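Dropped into the example above, a sketch of the change could look like this (the exact bounds are mine; the sum of 1 to 1,000,000,000 fits comfortably in a Long):

import spark.implicits._

// Hedged sketch: generate the numbers on the executors instead of building a
// 1-billion-element List on the driver; spark.range produces the values lazily.
val rdd = spark.range(1L, 1000000001L).as[Long].rdd // 1 to 1,000,000,000
val sum = rdd.reduce(_ + _)
println(sum) // 500000000500000000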

How to parallelize loading data into an RDD in Spark

I am using the MLlib library and the MLUtils.loadLibSVMFile() method. I have a file that is 10 GB and a cluster with 5 slaves, each with 2 cores and 6 GB of memory. According to the documentation found here, the method has the following parameters:
loadLibSVMFile(sc, path, multiclass=False, numFeatures=-1, minPartitions=None)
I would like to have 10 partitions, and I don't know the number of features. When I run the method without specifying the minimum partitions, I get java.lang.OutOfMemoryError: Java heap space, as would be expected.
However, when I specify numFeatures as -1 (the default per the documentation) and the number of partitions as 10, the WebUI shows that the work is distributed, but after some time I get a java.lang.ArrayIndexOutOfBoundsException. The rest of the code looks identical to the example code written here.
I am brand new to Spark so please tell me if I am making some obvious mistake. Thanks!
EDIT 2: I ran the same thing with a sample dataset that was tiny and dense, and it works fine. Seems to be a problem with the number of features?
EDIT 3:
Here is the code I am trying to run:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

object MLLibExample {
  def main(args: Array[String]) {
    val sc = new SparkContext()

    // loading the RDD
    val data = MLUtils.loadLibSVMFile(sc, "s3n://file.libsvm", -1, 10)

    // Run training algorithm to build the model
    val numIterations = 10
    val model = SVMWithSGD.train(data, numIterations)
  }
}
Here is the exception:
15/07/20 20:42:43 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 35, 10.249.67.97): java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:136)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:106)
at org.apache.spark.mllib.optimization.HingeGradient.compute(Gradient.scala:317)
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192)
at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190)
- I have made sure the input file is correctly formatted (0/1 labels and space-delimited)
- Tried a subsample of the original file that is 10 MB
- Set both driver and executor memory
- Added this flag to my spark-submit command: --conf "spark.driver.maxResultSize=6G"
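One more thing that may be worth checking (my reading of the stack trace, not something established above): the exception reports index -1 inside BLAS.dot, which is what a 0-based feature index in the file would become after the loader shifts libsvm's 1-based indices down. A small sketch that scans the raw file for such entries before loading it:

// Hedged sketch: sc is the SparkContext from the code above, and the path is illustrative.
// Each libsvm line is "<label> <index>:<value> ...", with 1-based indices, so any
// feature token starting with "0:" would end up as index -1 after the shift.
val zeroIndexedLines = sc.textFile("s3n://file.libsvm")
  .filter(line => line.trim.split("\\s+").drop(1).exists(_.startsWith("0:")))
println(s"lines with a zero-based feature index: ${zeroIndexedLines.count()}")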