Read a huge Oracle table using Spark - scala

Spark version: 2.4.7
OS: Linux RHEL Fedora
I have the following use case:
Read an Oracle table that contains ~150 million records (done daily), and then write these records to 800 files (using repartition) on a shared file system to be used by another application.
I can read the table into a data frame, but the write never finishes.
df
res38: org.apache.spark.sql.DataFrame = [ROW_ID: string, CREATED: string ... 276 more fields]
When I limit the number of retrieved records to 1 million and repartition into 6 partitions, the read and write together take 2-3 minutes.
I searched for insights on the issue but couldn't figure it out. When running the process on the full data set, I see the following line in the UI logs (it keeps repeating):
INFO sort.UnsafeExternalSorter: Thread 135 spilling sort data of 5.2 GB to disk (57 times so far)
I submit the job using the following:
time spark-submit --verbose --conf spark.dynamicAllocation.enabled=false --conf spark.spark.sql.broadcastTimeout=1000 --conf spark.sql.shuffle.partitions=1500 --conf "spark.ui.enabled=true" --master yarn --driver-memory 60G --executor-memory 10G --num-executors 40 --executor-cores 8 --jars ojdbc6.jar SparkOracleExtractor.jar
Please note there are enough resources on the cluster and the queue is only 5.0% used, and a constraint is to use Spark.
Appreciate any help on where to get information to resolve the issue and speed up the process.
This is the code:
// db_connect_string, db_user and db_pass are defined elsewhere
val myquery = "select * from mytable"
val dt = 20221023
val df = spark.read.format("jdbc").
  option("url", s"jdbc:oracle:thin:@$db_connect_string").
  option("driver", "oracle.jdbc.driver.OracleDriver").
  option("query", myquery).
  option("user", db_user).
  option("password", db_pass).
  option("fetchsize", 10000).
  option("delimiter", "|").
  load()
// Shuffle into 800 partitions, then write one CSV file per partition
df.repartition(800).write.csv(s"file:///fs/extrat_path/${dt}")
These are the shuffle spill sizes after 2.5 hours:
Shuffle Spill (Memory): 259.4 GB
Shuffle Spill (Disk): 22.0 GB
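A possible direction, offered as an untested sketch rather than a confirmed fix: a JDBC read with no partitioning options is loaded through a single partition, and repartition(800) then shuffles every row. Spark's DataFrameReader also has a jdbc(url, table, predicates, connectionProperties) overload that creates one partition per predicate, so the 800 slices could be read in parallel and written without a repartition step. The helper name hashPredicates is hypothetical, and the sketch assumes ROW_ID is non-null and reasonably evenly distributed, and that Oracle's ORA_HASH is acceptable as the bucketing function.

```scala
object OraclePartitionedRead {
  /** Builds one WHERE-clause predicate per bucket, using Oracle's
    * ORA_HASH(expr, max_bucket), which returns a value in 0..max_bucket.
    * (Hypothetical helper; the column choice is an assumption.) */
  def hashPredicates(column: String, buckets: Int): Array[String] =
    Array.tabulate(buckets)(i => s"ORA_HASH($column, ${buckets - 1}) = $i")

  def main(args: Array[String]): Unit = {
    val predicates = hashPredicates("ROW_ID", 800)
    println(predicates.take(2).mkString("\n"))
    // prints:
    // ORA_HASH(ROW_ID, 799) = 0
    // ORA_HASH(ROW_ID, 799) = 1

    // With a SparkSession in scope, the predicates could then be passed to
    //   spark.read.jdbc(url, "mytable", predicates, connectionProperties)
    // where connectionProperties is a java.util.Properties holding user,
    // password, driver and fetchsize. Each predicate becomes one partition.
  }
}
```

The number of simultaneous Oracle connections is still bounded by the number of running tasks (executors times cores), not by the number of predicates, so it is worth checking what the database can tolerate before trying this.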

Related

Spark SQL Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

I am using Spark to read a data set with the following code:
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
Then I want to collect the data to the driver:
val rows_array: Array[Row] = df.collect()
An error occurred:
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)
The reason for this error seems to be that there are too many data items, because when I use
val rows_array: Array[Row] = df.take(10000000)
it can run successfully.
But when I use
val rows_array: Array[Row] = df.take(100000000)
the error appeared again. (It ran successfully the next day, but there are still errors when fetching all the data.)
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows: about 800,000,000 (12 GB)
There is enough memory.
---------------------------------------------------------edit line--------------------------------------------------------
Today I ran it again using the code below:
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Two integer columns: col0 and col1
val fields = Array.range(0, 2).map(i => StructField(s"col$i", IntegerType))
val schema: StructType = new StructType(fields)
val spark: SparkSession = SparkSession.builder.appName("test").getOrCreate
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
df.cache()
df.count()
// Pull every row to the driver
val df_rows: Array[Row] = df.collect()
print("df_rows(0): " + df_rows(0))
print("df_rows size:" + df_rows.length)
I submit the application by:
${SPARK_HOME}/bin/spark-submit --master spark://master:7077 \
--conf spark.executor.cores=35 \
--total-executor-cores 105 \
--executor-memory 145g \
--driver-memory 200g \
--conf "spark.executor.extraJavaOptions=-Xms145g" \
--conf "spark.driver.extraJavaOptions=-Xms200g"
The size of the csv file I read is about 12GB, each line in the csv file is two integers, and the memory occupied by df.cache() is 5.5GB.
Environment:
Spark 2.4.3
Hadoop 2.7.7
Total rows of data about 800,000,000, 12GB
There are three machines in the cluster and they are all workers. The node that submits the job (the driver node) has 370 GB of memory.
I monitored the Spark web UI, and all the tasks completed successfully (running time is about 60 s), but more than ten minutes later an error appears in the shell (note: the process did not exit, and the file in HDFS is OK):
WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block xxxxx
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)
WARN hdfs.DFSClient: Error Recovery for block xxxxx in pipeline DatanodeInfoWithStorage[ipxxx,DISK],DatanodeInfoWithStorage[ipxxx,DISK],DatanodeInfoWithStorage[ipxxx,DISK]
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStream.run(DFSOutputStream.java:508)
The following two lines of code did not run, because I don't see their output:
print("df_rows(0): " + df_rows(0))
print("df_rows size:" + df_rows.length)
In addition, I monitored the memory usage of the machine where the driver runs and found that it keeps increasing (after running for 30 minutes, the spark-submit process [driver] takes about 120 GB).
2020.04.16:
Additional information:
Converting to an RDD first runs successfully:
val rows_array: Array[Row] = df.rdd.collect()
But in this case the memory consumption is very high (the memory I allocated is almost used up).
Does anyone know the reason? Also, why is the memory usage so large?

How to Read a large avro file

I am trying to read a large Avro file (2 GB) using spark-shell, but I am getting a StackOverflowError.
val newDataDF = spark.read.format("com.databricks.spark.avro").load("abc.avro")
java.lang.StackOverflowError
at com.databricks.spark.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:71)
at com.databricks.spark.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:81)
I tried to increase driver memory and executor memory, but I am still getting the same error.
./bin/spark-shell --packages com.databricks:spark-avro_2.11:3.1.0 --driver-memory 8G --executor-memory 8G
How can I read this file? Is there a way to partition this file?

is there a way to optimize spark sql code?

updated:
I'm using Spark SQL 1.5.2. I am trying to read many Parquet files, then filter and aggregate rows. There are ~35M rows stored in ~30 files in my hdfs, and it takes more than 10 minutes to process.
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
import org.apache.spark.sql.functions.to_date

// Keep high-level logins, then take the max level per (pid, event_date)
val l_12 = logins_12
  .where("event_data.level >= 90")
  .select("pid", "timestamp", "event_data.level")
  .withColumn("event_date", to_date(logins_12("timestamp")))
  .drop("timestamp")
  .toDF("pid", "level", "event_date")
  .groupBy("pid", "event_date")
  .agg(Map("level" -> "max"))
  .toDF("pid", "event_date", "level")
l_12.first()
My Spark is running on a two-node cluster with 8 cores and 16 GB RAM each. The Scala output makes me think the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
When I try count() instead of first(), it looks like two threads are doing computations, which is still fewer than I was expecting, as there are ~30 files that can be processed in parallel.
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting spark console with 14g for executor and 4g for driver in yarn-client mode
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
my default config for spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
there are 200 partitions of the rdd
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks
Both behaviors are more or less expected. Spark is lazy: it not only defers transformations until you trigger an action, but can also skip tasks that are not required for the output. Since first requires only a single element, it may need to compute only one partition. That is most likely why you see only one running thread at some point.
Regarding the second issue, it is most likely a matter of configuration. Assuming there is nothing wrong with the YARN configuration (I don't use YARN, but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem), it is most likely a matter of Spark defaults. As you can read in the Configuration guide, spark.executor.cores on YARN is set to 1 by default. Two workers give two running threads.
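The first point can be illustrated with a plain-Scala analogy. This is not Spark code: a lazy Iterator stands in for the partitions, and the names LazyFirstDemo, partitions and computed are made up for the illustration. Asking for one element does the work for one "partition" only, while counting forces all of them.

```scala
object LazyFirstDemo {
  // How many "partitions" have actually been computed so far.
  var computed = 0

  // Each element stands in for a partition; the map body runs lazily,
  // only when the element is actually requested.
  def partitions(n: Int): Iterator[Int] =
    Iterator.range(0, n).map { p => computed += 1; p * 10 }

  def main(args: Array[String]): Unit = {
    computed = 0
    val first = partitions(368).next() // like first(): one element is enough
    println(s"first=$first computed=$computed") // first=0 computed=1

    computed = 0
    val count = partitions(368).size // like count(): every element is needed
    println(s"count=$count computed=$computed") // count=368 computed=368
  }
}
```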

Spark: Executor Lost Failure (After adding groupBy job)

I’m trying to run Spark job on Yarn client. I have two nodes and each node has the following configurations.
I’m getting “ExecutorLostFailure (executor 1 lost)”.
I have tried most of the Spark tuning configuration. I have reduced to one executor lost because initially I got like 6 executor failures.
These are my configuration (my spark-submit) :
HADOOP_USER_NAME=hdfs spark-submit --class genkvs.CreateFieldMappings \
  --master yarn-client --driver-memory 11g --executor-memory 11G \
  --total-executor-cores 16 --num-executors 15 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.akka.frameSize=1000 --conf spark.shuffle.memoryFraction=1 \
  --conf spark.rdd.compress=true \
  --conf spark.core.connection.ack.wait.timeout=800 \
  my-data/lookup_cache_spark-assembly-1.0-SNAPSHOT.jar -h hdfs://hdp-node-1.zone24x7.lk:8020 -p 800
My data size is 6GB and I’m doing a groupBy in my job.
import org.apache.spark.rdd.RDD

// Group all records by their fourth field
def process(in: RDD[(String, String, Int, String)]) = {
  in.groupBy(_._4)
}
I’m new to Spark, please help me find my mistake. I’ve been struggling with this for at least a week now.
Thank you very much in advance.
Two issues stand out:
spark.shuffle.memoryFraction is set to 1. Why did you choose that instead of leaving the default of 0.2? That may starve other non-shuffle operations.
You only have 11G available to 16 cores. With only 11G I would set the number of workers in your job to no more than 3, and initially (to get past the executor-lost issue) just try 1. With 16 executors each one gets about 700 MB, so it is no surprise that they are getting OOME / executor lost.
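The 700 MB figure is simple division. As a quick check, assuming (as the answer does) that roughly 11 GB is shared evenly between the executors; the object and method names are made up for the illustration:

```scala
object ExecutorMemoryMath {
  /** Memory per executor in MB when totalGb is split evenly (integer division). */
  def perExecutorMb(totalGb: Int, executors: Int): Int =
    totalGb * 1024 / executors

  def main(args: Array[String]): Unit = {
    println(perExecutorMb(11, 16)) // 704
    println(perExecutorMb(11, 3))  // 3754
    println(perExecutorMb(11, 1))  // 11264
  }
}
```

With 3 workers each one would get a little under 4 GB, which is where the suggestion of "no more than 3" leaves considerably more headroom for a groupBy.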

Why is "Error communicating with MapOutputTracker" reported when Spark tries to send GetMapOutputStatuses?

I'm using Spark 1.3 to do an aggregation on a lot of data. The job consists of 4 steps:
Read a big (1TB) sequence file (corresponding to 1 day of data)
Filter out most of it and get about 1GB of shuffle write
keyBy customer
aggregateByKey() to a custom structure that build a profile for that customer, corresponding to a HashMap[Long, Float] per customer. The Long keys are unique and never bigger than 50K distinct entries.
I'm running this with this configuration:
--name geo-extract-$1-askTimeout \
--executor-cores 8 \
--num-executors 100 \
--executor-memory 40g \
--driver-memory 4g \
--driver-cores 8 \
--conf 'spark.storage.memoryFraction=0.25' \
--conf 'spark.shuffle.memoryFraction=0.35' \
--conf 'spark.kryoserializer.buffer.max.mb=1024' \
--conf 'spark.akka.frameSize=1024' \
--conf 'spark.akka.timeout=200' \
--conf 'spark.akka.askTimeout=111' \
--master yarn-cluster \
And getting this error:
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:117)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
...
Caused by: org.apache.spark.SparkException: Error sending message [message = GetMapOutputStatuses(0)]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113)
... 21 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
The job and the logic have been shown to work with a small test set and I can even run this job for some dates but not for others. I've googled around and found hints that "Error communicating with MapOutputTracker" is related to internal Spark messages, but I already increased "spark.akka.frameSize", "spark.akka.timeout" and "spark.akka.askTimeout" (this last one does not even appear on Spark documentation, but was mentioned in the Spark mailing list), to no avail. There is still some timeout going on at 30 seconds that I have no clue how to identify or fix.
I see no reason for this to fail due to data size, as the filtering operation and the fact that aggregateByKey performs local partial aggregations should be enough to address the data size. The number of tasks is 16K (automatic from the original input), much more than the 800 cores that are running this, on 100 executors, so it is not as simple as the usual "increment partitions" tip. Any clues would be greatly appreciated! Thanks!
I had a similar issue: my job would work fine with a smaller dataset but fail with larger ones.
After a lot of configuration changes, I found that changing the driver memory settings has much more of an impact than changing the executor memory settings.
Using the new garbage collector also helps a lot. I am using the following configuration for a cluster of 3 machines with 40 cores each. Hope the following config helps:
spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.driver.memory=8g
spark.driver.cores=10
spark.driver.maxResultSize=8g
spark.executor.memory=16g
spark.executor.cores=25
spark.default.parallelism=50
spark.eventLog.dir=hdfs://mars02-db01/opt/spark/logs
spark.eventLog.enabled=true
spark.kryoserializer.buffer=512m
spark.kryoserializer.buffer.max=1536m
spark.rdd.compress=true
spark.storage.memoryFraction=0.15
spark.storage.MemoryStore=12g
What's going on in the driver at the time of this failure? It could be due to memory pressure on the driver causing it to be unresponsive. If I recall correctly, the MapOutputTracker that it's trying to reach when it calls GetMapOutputStatuses is running in the Spark driver process.
If you're facing long GCs or other pauses for some reason in that process this would cause the exceptions you're seeing above.
One thing to try is running jstack against the driver process when you start seeing these errors, to see what it is doing. If jstack doesn't respond, it could be that your driver isn't sufficiently responsive.
16K tasks does sound like a lot for the driver to keep track of. Any chance you can increase the driver memory past 4g?
Try the following property
spark.shuffle.reduceLocality.enabled = false.
Refer to this link.
https://issues.apache.org/jira/browse/SPARK-13631