How to Read a large avro file - scala

I am trying to read a large Avro file (2 GB) using spark-shell, but I am getting a StackOverflowError.
val newDataDF = spark.read.format("com.databricks.spark.avro").load("abc.avro")
java.lang.StackOverflowError
at com.databricks.spark.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:71)
at com.databricks.spark.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:81)
I tried increasing the driver memory and executor memory, but I still get the same error.
./bin/spark-shell --packages com.databricks:spark-avro_2.11:3.1.0 --driver-memory 8G --executor-memory 8G
How can I read this file? Is there a way to partition it?
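One thing worth checking that does not depend on file size: both frames in the trace are SchemaConverters.toSqlType calling itself, which is the pattern unbounded recursion over a self-referential record type would produce. That the file's schema is recursive is only an assumption here, since the post does not show the schema. The sketch below uses a toy schema model (Node, RecordRef, ArrayOf and RecursionCheck are hypothetical names, not the Avro or spark-avro API) to show how such a cycle could be detected before any conversion is attempted.

```scala
// Toy schema model, for illustration only (NOT the Avro API).
sealed trait Node
case object Primitive extends Node
final case class RecordRef(name: String) extends Node // reference to a named record
final case class ArrayOf(element: Node) extends Node

object RecursionCheck {
  // Names of the records a field type refers to, directly or through arrays.
  private def refs(n: Node): Seq[String] = n match {
    case Primitive    => Nil
    case RecordRef(r) => Seq(r)
    case ArrayOf(e)   => refs(e)
  }

  // `records` maps each record name to the types of its fields.
  // Returns true if following references from `root` ever revisits
  // a record that is already on the current path.
  def isRecursive(root: String, records: Map[String, Seq[Node]]): Boolean = {
    def visit(name: String, path: Set[String]): Boolean =
      if (path.contains(name)) true
      else records.getOrElse(name, Nil)
        .exists(field => refs(field).exists(r => visit(r, path + name)))
    visit(root, Set.empty)
  }
}
```

If the real schema turned out to be recursive, more driver or executor memory would not help, because the conversion would never terminate regardless of heap size.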

Related

Read a huge Oracle table using Spark

Spark version: 2.4.7
OS: Linux RHEL Fedora
I have the following use case:
Read an oracle table that contains ~150 million records (Done Daily), and then write these records to 800 files (using repartition) on a shared file system to be used in another application.
I can read the table into a data frame, but when trying to write it never finishes.
df
res38: org.apache.spark.sql.DataFrame = [ROW_ID: string, CREATED: string ... 276 more fields]
When I limit the number of retrieved records to 1 million, and repartition by (6) the process to read and write takes 2-3 minutes.
I searched for insights on the issue but couldn't figure it out. When running the process on the full data set, I see the following line in the UI logs (it keeps repeating):
INFO sort.UnsafeExternalSorter: Thread 135 spilling sort data of 5.2 GB to disk (57 times so far)
I submit the job using the following:
time spark-submit --verbose --conf spark.dynamicAllocation.enabled=false --conf spark.spark.sql.broadcastTimeout=1000 --conf spark.sql.shuffle.partitions=1500 --conf "spark.ui.enabled=true" --master yarn --driver-memory 60G --executor-memory 10G --num-executors 40 --executor-cores 8 --jars ojdbc6.jar SparkOracleExtractor.jar
Please note there are enough resources on the cluster and the queue is only 5.0% used, and a constraint is to use Spark.
Appreciate any help on where to get information to resolve the issue and speed up the process.
This is the code:
val myquery = "select * from mytable"
val dt = 20221023
// Assign the result of the read so that df exists for the write below
val df = spark.read.format("jdbc").
  option("url", s"jdbc:oracle:thin:@$db_connect_string").
  option("driver", "oracle.jdbc.driver.OracleDriver").
  option("query", myquery).
  option("user", db_user).
  option("password", db_pass).
  option("fetchsize", 10000).
  option("delimiter", "|").
  load()
df.repartition(800).write.csv(s"file:///fs/extrat_path/${dt}")
These are the shuffle spill sizes after 2.5 hours:
Shuffle Spill (Memory): 259.4 GB
Shuffle Spill (Disk): 22.0 GB
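The read above sets no partitioning options, and in that case Spark's JDBC source loads the whole result through a single partition, so one task ends up holding all ~150 million rows before repartition(800) can spread them. One way to parallelise the read itself is the DataFrameReader.jdbc(url, table, predicates, connectionProperties) overload, which creates one partition per predicate. Below is a minimal sketch of building such predicates, assuming the table has a numeric key column; the column name ID, the bounds and the helper RangePredicates are hypothetical (ROW_ID in the post is a string, so a different split rule would be needed for it).

```scala
// Hypothetical helper (not part of Spark): split a numeric key range into
// non-overlapping WHERE-clause predicates, one per JDBC partition.
object RangePredicates {
  def build(column: String, lower: Long, upper: Long, parts: Int): Seq[String] = {
    require(parts >= 1, "parts must be at least 1")
    require(upper > lower, "upper must be greater than lower")
    if (parts == 1) Seq("1 = 1") // a single partition reads everything
    else {
      val span = upper - lower
      // Interior cut points only; the first and last ranges stay open-ended
      // so rows outside [lower, upper] and NULL keys are not dropped.
      val cuts = (1 until parts).map(i => lower + span * i / parts)
      val first = s"$column < ${cuts.head} OR $column IS NULL"
      val middle = cuts.sliding(2).collect {
        case Seq(lo, hi) => s"$column >= $lo AND $column < $hi"
      }.toSeq
      val last = s"$column >= ${cuts.last}"
      (first +: middle) :+ last
    }
  }
}

// Possible use (url, props and the table name are assumptions):
//   val preds = RangePredicates.build("ID", 0L, 150000000L, 40).toArray
//   val df = spark.read.jdbc(url, "mytable", preds, props)
```

With the data already arriving in many partitions, the later repartition may shuffle far less per task, or may not be needed at all if the number of predicates matches the number of output files wanted.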

Container killed by YARN for exceeding memory limits in Spark Scala

I am facing the error below while running my Spark Scala code with the spark-submit command.
ERROR cluster.YarnClusterScheduler: Lost executor 14 on XXXX: Container killed by YARN for exceeding memory limits. 55.6 GB of 55 GB physical memory used.
The line of code that throws the error is below:
df.write.mode("overwrite").parquet("file")
I am writing a Parquet file. It was working until yesterday; only since the last run has it been throwing this error, with the same input file.
Thanks,
Naveen
Running with the conf below in the spark-submit command resolved the issue, and the code ran successfully.
--conf spark.dynamicAllocation.enabled=true
Thanks,
Naveen
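For context on the limit quoted in the error: on YARN, the container size requested for an executor is the executor heap plus an off-heap overhead. In recent Spark versions that overhead defaults to the larger of 384 MiB and 10% of the executor memory (spark.executor.memoryOverhead, or spark.yarn.executor.memoryOverhead before Spark 2.3). The sketch below just works through that arithmetic; the object and method names are hypothetical, and it ignores YARN rounding the request up to its allocation increment.

```scala
// Sketch of the default container-size arithmetic on YARN.
// Names are illustrative; this is not Spark's own code.
object YarnContainerSize {
  val MinOverheadMiB = 384L

  // Default overhead: 10% of executor memory, but never below 384 MiB.
  def overheadMiB(executorMemoryMiB: Long): Long =
    math.max(executorMemoryMiB / 10, MinOverheadMiB)

  // Total memory the container may use before YARN kills it.
  def containerMiB(executorMemoryMiB: Long): Long =
    executorMemoryMiB + overheadMiB(executorMemoryMiB)
}
```

When a job is killed for exceeding this total, raising the overhead setting explicitly is a commonly tried alternative to raising the heap itself.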

Spark: Executor Lost Failure (After adding groupBy job)

I’m trying to run Spark job on Yarn client. I have two nodes and each node has the following configurations.
I’m getting “ExecutorLostFailure (executor 1 lost)”.
I have tried most of the Spark tuning configurations. I am now down to one lost executor; initially I got about 6 executor failures.
This is my configuration (my spark-submit command):
HADOOP_USER_NAME=hdfs spark-submit --class genkvs.CreateFieldMappings \
  --master yarn-client --driver-memory 11g --executor-memory 11G \
  --total-executor-cores 16 --num-executors 15 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.akka.frameSize=1000 --conf spark.shuffle.memoryFraction=1 \
  --conf spark.rdd.compress=true \
  --conf spark.core.connection.ack.wait.timeout=800 \
  my-data/lookup_cache_spark-assembly-1.0-SNAPSHOT.jar \
  -h hdfs://hdp-node-1.zone24x7.lk:8020 -p 800
My data size is 6GB and I’m doing a groupBy in my job.
def process(in: RDD[(String, String, Int, String)]) = {
in.groupBy(_._4)
}
I’m new to Spark, please help me to find my mistake. I’m struggling for at least a week now.
Thank you very much in advance.
Two issues pop out:
spark.shuffle.memoryFraction is set to 1. Why did you choose that instead of leaving the default of 0.2? That may starve other, non-shuffle operations.
You only have 11G available to 16 cores. With only 11G, I would set the number of workers in your job to no more than 3, and initially (to get past the executor-lost issue) just try 1. With 16 executors, each one gets only about 700 MB, so it is no surprise they are hitting OOME / executor lost.
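The 700 MB figure in the second point is simply the available memory divided by the number of slots sharing it. A tiny sketch of that arithmetic, with a hypothetical helper name:

```scala
// Illustrative only: how much memory each of `slots` workers gets
// when `totalMemoryGiB` is divided evenly between them.
object ExecutorSizing {
  def memoryPerSlotMiB(totalMemoryGiB: Int, slots: Int): Int = {
    require(slots > 0, "slots must be positive")
    totalMemoryGiB * 1024 / slots
  }
}

// ExecutorSizing.memoryPerSlotMiB(11, 16) gives 704 (the "about 700 MB" above)
// ExecutorSizing.memoryPerSlotMiB(11, 3)  gives 3754
```

A groupBy has to hold every value of a key in memory at once on the receiving side, so the per-slot figure matters more here than it would for a map-only job.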

Increase Spark memory when using local[*]

How do I increase Spark memory when using local[*]?
I tried setting the memory like this:
val conf = new SparkConf()
.set("spark.executor.memory", "1g")
.set("spark.driver.memory", "4g")
.setMaster("local[*]")
.setAppName("MyApp")
But I still get:
MemoryStore: MemoryStore started with capacity 524.1 MB
Does this have something to do with:
.setMaster("local[*]")
Assuming that you are using the spark-shell: setting spark.driver.memory in your application isn't working because your driver process has already started with the default memory.
You can either launch your spark-shell using:
./bin/spark-shell --driver-memory 4g
or you can set it in spark-defaults.conf:
spark.driver.memory 4g
If you are launching an application using spark-submit, you must specify the driver memory as an argument:
./bin/spark-submit --driver-memory 4g --class main.class yourApp.jar
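Whichever launch option is used, a quick way to confirm what the driver JVM actually received is to ask the JVM itself. This is plain JDK API, not Spark; the object name HeapCheck is just for illustration.

```scala
// Reports the maximum heap the current JVM will attempt to use.
object HeapCheck {
  def maxHeapMiB: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)
}

// In spark-shell or at the start of the application:
//   println(s"max heap: ${HeapCheck.maxHeapMiB} MiB")
```

If the printed value does not change after setting spark.driver.memory in SparkConf, the setting was applied too late, which is consistent with the explanation above.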
I was able to solve this by running SBT with:
sbt -mem 4096
However the MemoryStore is half the size. Still looking into where this fraction is.
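In Spark 2.x, you can use SparkSession, which looks like this:
import org.apache.spark.sql.SparkSession
// SparkSession is created through its builder, not with `new`
val spark = SparkSession.builder()
  .config("spark.executor.memory", "1g")
  .config("spark.driver.memory", "4g")
  .master("local[*]")
  .appName("MyApp")
  .getOrCreate()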
I tried --driver-memory 4g and --executor-memory 4g; neither increased the working memory. However, I noticed that bin/spark-submit was picking up _JAVA_OPTIONS, and setting that to -Xmx4g resolved it. I use JDK 7.
The fraction of the heap used for Spark's memory cache is 0.6 by default, so if you need more than 524.1 MB, you should increase the spark.executor.memory setting :)
Technically you could also increase the fraction used for Spark's memory cache, but I believe this is discouraged or at least requires you to do some additional configuration. See https://spark.apache.org/docs/1.0.2/configuration.html for more details.
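As a rough illustration of where a number like 524.1 MB can come from: in the legacy memory manager that the linked 1.0.2 docs describe, the cache capacity is approximately the JVM's max heap multiplied by spark.storage.memoryFraction (default 0.6) and by spark.storage.safetyFraction (default 0.9). The sketch below assumes those defaults; the function name is hypothetical, and later Spark versions with unified memory management use a different formula.

```scala
// Legacy (Spark 1.x style) estimate of the storage memory capacity.
// Assumes the default fractions; not Spark's own code.
object LegacyStorageMemory {
  def capacityMiB(maxHeapMiB: Double,
                  memoryFraction: Double = 0.6,
                  safetyFraction: Double = 0.9): Double =
    maxHeapMiB * memoryFraction * safetyFraction
}

// A heap of roughly 1 GB lands a little over 500 MB, the same range as
// the figure in the question:
//   LegacyStorageMemory.capacityMiB(970.0)
```

Because the result scales with the heap, the figure only moves once the JVM heap itself is larger.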
You can't change the driver memory after the application has started.
Version
spark-2.3.1
Source Code
org.apache.spark.launcher.SparkSubmitCommandBuilder:267
String memory = firstNonEmpty(tsMemory, config.get(SparkLauncher.DRIVER_MEMORY),
System.getenv("SPARK_DRIVER_MEMORY"), System.getenv("SPARK_MEM"), DEFAULT_MEM);
cmd.add("-Xmx" + memory);
SparkLauncher.DRIVER_MEMORY
--driver-memory 2g
SPARK_DRIVER_MEMORY
vim conf/spark-env.sh
SPARK_DRIVER_MEMORY="2g"
SPARK_MEM
vim conf/spark-env.sh
SPARK_MEM="2g"
DEFAULT_MEM
1g
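The lookup order in the snippet above can be restated as a small sketch: take the first non-empty value among the configured driver memory, SPARK_DRIVER_MEMORY, SPARK_MEM, and finally the 1g default. This is a simplified Scala rendering with hypothetical names; it leaves out the tsMemory argument that appears first in the original Java call.

```scala
// Simplified model of the precedence shown above (not Spark's code).
object DriverMemory {
  val DefaultMem = "1g"

  // First candidate that is present and non-empty, if any.
  def firstNonEmpty(candidates: Option[String]*): Option[String] =
    candidates.flatten.find(_.nonEmpty)

  // confValue: the configured driver memory, e.g. from --driver-memory
  // env: the process environment, e.g. sys.env
  def resolve(confValue: Option[String], env: Map[String, String]): String =
    firstNonEmpty(
      confValue,
      env.get("SPARK_DRIVER_MEMORY"),
      env.get("SPARK_MEM")
    ).getOrElse(DefaultMem)
}
```

For example, DriverMemory.resolve(None, Map("SPARK_DRIVER_MEMORY" -> "2g")) yields "2g", while passing Some("4g") as the first argument wins over both environment variables.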
To assign memory to Spark:
On the command line:
/usr/lib/spark/bin/spark-shell --driver-memory=16G --num-executors=100 --executor-cores=8 --executor-memory=16G
/usr/lib/spark/bin/spark-shell --driver-memory=16G --num-executors=100 --executor-cores=8 --executor-memory=16G --conf spark.driver.maxResultSize=2G

Why is "Error communicating with MapOutputTracker" reported when Spark tries to send GetMapOutputStatuses?

I'm using Spark 1.3 to do an aggregation on a lot of data. The job consists of 4 steps:
Read a big (1TB) sequence file (corresponding to 1 day of data)
Filter out most of it and get about 1GB of shuffle write
keyBy customer
aggregateByKey() to a custom structure that builds a profile for that customer, corresponding to a HashMap[Long, Float] per customer. The Long keys are unique and never bigger than 50K distinct entries.
I'm running this with this configuration:
--name geo-extract-$1-askTimeout \
--executor-cores 8 \
--num-executors 100 \
--executor-memory 40g \
--driver-memory 4g \
--driver-cores 8 \
--conf 'spark.storage.memoryFraction=0.25' \
--conf 'spark.shuffle.memoryFraction=0.35' \
--conf 'spark.kryoserializer.buffer.max.mb=1024' \
--conf 'spark.akka.frameSize=1024' \
--conf 'spark.akka.timeout=200' \
--conf 'spark.akka.askTimeout=111' \
--master yarn-cluster \
And getting this error:
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:117)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
...
Caused by: org.apache.spark.SparkException: Error sending message [message = GetMapOutputStatuses(0)]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113)
... 21 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
The job and the logic have been shown to work with a small test set, and I can even run this job for some dates but not for others. I've googled around and found hints that "Error communicating with MapOutputTracker" is related to internal Spark messages, but I already increased "spark.akka.frameSize", "spark.akka.timeout" and "spark.akka.askTimeout" (this last one does not even appear in the Spark documentation, but was mentioned on the Spark mailing list), to no avail. There is still some timeout going on at 30 seconds that I have no clue how to identify or fix.
I see no reason for this to fail due to data size, as the filtering operation and the fact that aggregateByKey performs local partial aggregations should be enough to address the data size. The number of tasks is 16K (automatic from the original input), much more than the 800 cores that are running this, on 100 executors, so it is not as simple as the usual "increment partitions" tip. Any clues would be greatly appreciated! Thanks!
I had a similar issue: my job would work fine with a smaller dataset but fail with larger ones.
After a lot of configuration changes, I found that changing the driver memory settings has much more of an impact than changing the executor memory settings.
Using the new garbage collector (G1) also helps a lot. I am using the following configuration for a cluster of 3, with 40 cores each. Hope it helps:
spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:NewRatio=3 -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:MaxPermSize=4g -XX:PermSize=1G -XX:+PrintGCTimeStamps -XX:+UnlockDiagnosticVMOptions
spark.driver.memory=8g
spark.driver.cores=10
spark.driver.maxResultSize=8g
spark.executor.memory=16g
spark.executor.cores=25
spark.default.parallelism=50
spark.eventLog.dir=hdfs://mars02-db01/opt/spark/logs
spark.eventLog.enabled=true
spark.kryoserializer.buffer=512m
spark.kryoserializer.buffer.max=1536m
spark.rdd.compress=true
spark.storage.memoryFraction=0.15
spark.storage.MemoryStore=12g
What's going on in the driver at the time of this failure? It could be due to memory pressure on the driver causing it to be unresponsive. If I recall correctly, the MapOutputTracker that it's trying to reach when it calls GetMapOutputStatuses is running in the Spark driver process.
If you're facing long GCs or other pauses for some reason in that process this would cause the exceptions you're seeing above.
One thing to try would be running jstack on the driver process when you start seeing these errors, to see what it is doing. If jstack doesn't respond, it could be that your driver isn't sufficiently responsive.
16K tasks does sound like it would be a lot for the driver to keep track of, any chance you can increase the driver memory past 4g?
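To get a feel for what the driver has to track, here is a back-of-envelope estimate. Each map task reports one size entry per reduce partition, so the bookkeeping grows with the number of map tasks multiplied by the number of reduce partitions. The sketch assumes one byte per entry before any compression, which is my understanding of the compressed map status format rather than something stated in this thread, and the reducer count used in the comment is hypothetical because the question does not give it.

```scala
// Rough upper-bound estimate of map output status data held by the driver
// for one shuffle. Illustrative arithmetic only, not Spark's own accounting.
object MapStatusEstimate {
  // Assumes one byte per (map task, reduce partition) pair.
  def uncompressedBytes(numMapTasks: Long, numReducePartitions: Long): Long =
    numMapTasks * numReducePartitions

  def toMiB(bytes: Long): Double = bytes.toDouble / (1024.0 * 1024.0)
}

// With the 16K map tasks from the question and a hypothetical 1000 reducers:
//   MapStatusEstimate.uncompressedBytes(16000L, 1000L)  // 16,000,000 bytes
```

Every reducer asks the driver for these statuses at roughly the same time, so even a modest total can keep a 4g driver busy if it is already under GC pressure.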
Try setting the following property:
spark.shuffle.reduceLocality.enabled=false
Refer to this link.
https://issues.apache.org/jira/browse/SPARK-13631