knowing the size of a broadcasted variable in spark - scala

I have broadcasted a variable in spark(scala) but because of the size of data, it gives output as this
WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, 10.240.0.33): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
When run on smaller database, it works fine. I want to know the size of this broadcasted variable (in mb/gb). Is there a way to find this?

Assuming you are trying to broadcast obj, you can find its size as follows:
import org.apache.spark.util.SizeEstimator
val objSize = SizeEstimator.estimate(obj)
Note that this is an estimator which means it is not 100% correct

This is because the driver runs out of memory. By default this is 1g, this can be increased using --driver-memory 4g. By default Spark will broadcast a dataframe when it is <10m, although I found out that broadcasting bigger dataframes is also not a problem. This might significantly speed up the join, but when the dataframe becomes too big, it might even slow down the join operation because of the overhead of broadcasting all the data to the different executors.
What is your datasource? When the table is read into Spark, under the sql tab and then open the dag diagram of the query you are executing, should give some metadata about the number of rows and size. Otherwise you could also check the actual size on the hdfs using hdfs dfs -du /path/to/table/.
Hope this helps.

Related

Spark/Synapse Optimal handling huge gzipped xml file (+600mb compressed size)

For a task we need to process huge transactional xml files which are gz(ipped). Each line in the uncompressed file can be interpreted as its own xml record.
When working with small files like 100 MiB this works fine. The moment à collect() is performed on the huge input file it tends to fail OOM and the jvm crashes.
As this is a compressed (gz) file it can not be processed in parallel (AFAIK).
I was thinking about
using the toLocalIterator() to split it first up into smaller packets of 200K xml entries which are distributed to the other nodes for their cost om processing. Apparently the toLocalIterator() does also the collect() first (to test)
Other option is to use the some kind of index value and filter on it ("index > 5000") and set the limit(5000) to simulate paging through the 2 Million or more entries.
But I have no clue to what I should pay attention to parralize. Any tips are welcome.
Settings to pay attention and how to apply them in Azure Synapse etc.
how to push the read xml over the nodes to be processed in their executor/tasks.
could streaming a single file be an option?
any tips are welcome
Currently my code is done in scala due the fact the java libraries are easily accessible to convert the xml to json and extract the values I need.
Many thanks in advance (also for reading this)
TL;DR suggestion:
Step 1. Increase driver memory and test
Step 2. Increase executor memory and test if first step fails
Slightly longer version:
The fact that it gives OOM on collect() operation doesn't indicate whether it is OOM on spark.read operation or df.collect()
Spark scheduler will run the DAG when it encounters an Action but not when it is a Transformation.
So if collect is your first action, it is at that point it actually runs the DAG and the OOM may even be on read but manifesting as OOM on collect
Spark UI will provide insights on where the OOM happens
You are right that uncompressing gzip wont be parallelised. On read operation, it will use a single executor and even a single core. So I would increase executor memory until there is sufficient memory to gzip the whole file into memory - not just to the exact file size, leave the usual 400MB / 0.7% buffer.
If the error is indeed happening on the collect() operation, then you need to sufficiently increase driver memory.
Your app will not be parallelised at read. Your app will not be parallelised at collect()
You app can be parallelised during the transformations in between them and you can force the parallelisation to the extent you want to tune it by repartitioning your dataframe / dataset / rdd further.
Finally, I would consider again whether you do need the collect or whether you can store the output as a number of partitioned files?
I think the zipped file is always going to be a bottleneck so one alternative is to unzip it and see if that helps. I would also consider loading up the xml to a table in Synapse (which can deal with gzipped files). This would have the effect of unzipping it, you could then pass it into a Synapse Notebook with the synapsesql method, eg in Scala:
// Get the table with the XML column from the database and expose as temp view
val df = spark.read.synapsesql("yourPool.dbo.someXMLTable")
df.createOrReplaceTempView("someXMLTable")
You could process the XML as I have done here and then write it back to the Synapse dedicated SQL pool as an internal table:
val df2 = spark.sql("""
SELECT
colA,
colB,
xpath_string(pkData,'/DataSet/EnumObject[name="Inpatient"]/value') xvalue
FROM someXMLTable
""")
// Write that dataframe back to the dedicated SQL pool
df2.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
This would ensure you are keeping things in parallel, no collect required. NB there are a couple of assumptions in there around uploading the gzipped files to a dedicated SQL pool and that the xpath_string does what you need which need to be checked and confirmed. The proposed pattern:

Performance tuning in spark

I am running a spark job which processes about 2 TB of data. The processing involves:
Read data (avrò files)
Explode on a column which is a map type
OrderBy key from the exploded column
Filter the DataFrame (I have a very small(7) set of keys (call it keyset) that I want to filter the df for). I do a df.filter(col("key").isin(keyset: _*) )
I write this df to a parquet (this dataframe is very small)
Then I filter the original dataframe again for all the key which are not in the keyset
df.filter(!col("key").isin(keyset: _*) ) and write this to a parquet. This is the larger dataset.
The original avro data is about 2TB. The processing takes about 1 hr. I would like to optimize it. I am caching the dataframe after step 3, using shuffle partition size of 6000. min executors = 1000, max = 2000, executor memory = 20 G, executor core = 2. Any other suggestions for optimization ? Would a left join be better performant than filter ?
All look right to me.
If you have small dataset then isin is okay.
1) Ensure that you can increase the number of cores. executor core=5
More than 5 cores not recommended for each executor. This is based on a study where any application with more than 5 concurrent threads would start hampering the performance.
2) Ensure that you have good/uniform partition strucutre.
Example (only for debug purpose not for production):
import org.apache.spark.sql.functions.spark_partition_id
yourcacheddataframe.groupBy(spark_partition_id).count.show()
This is will print spark partition number and how many records
exists in each partition. based on that you can repartition, if you wanot more parllelism.
3) spark.dynamicAllocation.enabled could be another option.
For Example :
spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=100 --conf spark.shuffle.service.enabled=true
along with all other required props ..... thats for that job. If you give these props in spark-default.conf it would be applied for all jobs.
With all these aforementioned options your processing time might lower.
On top of what has been mentioned, a few suggestions depending on your requirements and cluster:
If the job can run at 20g executor memory and 5 cores, you may be able to fit more workers by decreasing the executor memory and keeping 5 cores
Is the orderBy actually required? Spark ensures that rows are ordered within partitions, but not between partitions which usually isn't terribly useful.
Are the files required to be in specific locations? If not, adding a
df.withColumn("in_keyset", when( col('key').isin(keyset), lit(1)).otherwise(lit(0)). \
write.partitionBy("in_keyset").parquet(...)
may speed up the operation to prevent the data from being read in + exploded 2x. The partitionBy ensures that the items in the keyset are in a different directory than the other keys.
spark.dynamicAllocation.enabled is enabled
partition sizes are quite uneven (based on the size of output parquet part files) since I am doing an orderBy key and some keys are more frequent than others.
keyset is a really small set (7 elements)

Are there any memory or table size constraints when using DataFrame.registerTempTable?

Is there any size limit for using registerTempTable in Spark? Is the data kept in memory or swapped in disk internally in case of large DataFrame? Can there be issues related to this if I use registerTempTable on a dataframe which has a lot of records?
Is there a limit in relation to the spark configuration (executor memory/driver memory etc) for registerTempTable to work normally? For example if executor memory is 2g then registerTempTable should only store a dataframe of size 1.8g or something?
Is there any size limit for using registerTempTable in Spark?
No.
Is the data kept in memory or swapped in disk internally in case of large DataFrame?
No and no.
Can there be issues related to this if I use registerTempTable on a dataframe which has a lot of records?
No.
Is there a limit in relation to the spark configuration (executor memory/driver memory etc) for registerTempTable to work normally?
No.
I hope the above No's helped a bit, but just to shed more light on DataFrame.registerTempTable think about it as a way to register a name (temporarily) that is associated to a structured query which is going to be executed when the data is required, i.e. when an action is executed that triggers a Spark job.
In other words, registering a temporary table is just a handy shortcut so you can use it in SQL queries rather than using the high-level DataFrame operators.

Why does Spark RDD partition has 2GB limit for HDFS?

I have get an error when using mllib RandomForest to train data. As my dataset is huge and the default partition is relative small. so an exception thrown indicating that "Size exceeds Integer.MAX_VALUE" ,the orignal stack trace as following,
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in
stage 6.0 (TID 120, 10.215.149.47):
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
at
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
The Integer.MAX_SIZE is 2GB, it seems that some partition out of memory. So i repartiton my rdd partition to 1000, so that each partition could hold far less data as before. Finally, the problem is solved!!!
So, my question is :
Why partition size has the 2G limit? It seems that there is no configure set for the limit in the spark
The basic abstraction for blocks in spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2GB).
It is a critical issue which prevents use of spark with very large datasets.
Increasing the number of partitions can resolve it (like in OP's case), but is not always feasible, for instance when there is large chain of transformations part of which can increase data (flatMap etc) or in cases where data is skewed.
The solution proposed is to come up with an abstraction like LargeByteBuffer which can support list of bytebuffers for a block. This impacts overall spark architecture, so it has remained unresolved for quite a while.
the problem is when using datastores like Casandra, HBase, or Accumulo the block size is based on the datastore splits (which can be over 10 gig). when loading data from these datastores you have to immediately repartitions with 1000s of partitions so you can operated the data without blowing the 2gig limit.
most people that use spark are not really using large data; to them if it is bigger that excel can hold or tableau is is big data to them; mostly data scientist who use quality data or use a sample size small enough to work with the limit.
when processing large volumes of data, i end of having to go back to mapreduce and only used spark once the data has been cleaned up. this is unfortunate however, the majority of the spark community has no interest in addressing the issue.
a simple solution would be to create an abstraction and use bytearray as default; however, allow to overload a spark job with an 64bit data pointer to handle the large jobs.
The Spark 2.4.0 release removes this limit by replicating block data as a stream. See Spark-24926 for details.

Spark out of memory

I have a folder with 150 G of txt files (around 700 files, on average each 200 MB).
I'm using scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that:
manually loop through all the files, do the calculations per file and merge the results in the end
read the whole folder to one RDD, do all the operations on this single RDD and let spark do all the parallelization
I'm leaning towards the second approach as it seems cleaner (no need for parallelization specific code), but I'm wondering if my scenario will fit the constraints imposed by my hardware and data. I have one workstation with 16 threads and 64 GB of RAM available (so the parallelization will be strictly local between different processor cores). I might scale the infrastructure with more machines later on, but for now I would just like to focus on tunning the settings for this one workstation scenario.
The code I'm using:
- reads TSV files, and extracts meaningful data to (String, String, String) triplets
- afterwards some filtering, mapping and grouping is performed
- finally, the data is reduced and some aggregates are calculated
I've been able to run this code with a single file (~200 MB of data), however I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out of heap exception when adding more data (the application breaks with 6GB of data but I would like to use it with 150 GB of data).
I guess I would have to tune some parameters to make this work. I would appreciate any tips on how to approach this problem (how to debug for memory demands). I've tried increasing the 'spark.executor.memory' and using a smaller number of cores (the rational being that each core needs some heap space), but this didn't solve my problems.
I don't need the solution to be very fast (it can easily run for a few hours even days if needed). I'm also not caching any data, but just saving them to the file system in the end. If you think it would be more feasible to just go with the manual parallelization approach, I could do that as well.
Me and my team had processed a csv data sized over 1 TB over 5 machine #32GB of RAM each successfully. It depends heavily what kind of processing you're doing and how.
If you repartition an RDD, it requires additional computation that
has overhead above your heap size, try loading the file with more
paralelism by decreasing split-size in
TextInputFormat.SPLIT_MINSIZE and TextInputFormat.SPLIT_MAXSIZE
(if you're using TextInputFormat) to elevate the level of
paralelism.
Try using mapPartition instead of map so you can handle the
computation inside a partition. If the computation uses a temporary
variable or instance and you're still facing out of memory, try
lowering the number of data per partition (increasing the partition
number)
Increase the driver memory and executor memory limit using
"spark.executor.memory" and "spark.driver.memory" in spark
configuration before creating Spark Context
Note that Spark is a general-purpose cluster computing system so it's unefficient (IMHO) using Spark in a single machine
To add another perspective based on code (as opposed to configuration): Sometimes it's best to figure out at what stage your Spark application is exceeding memory, and to see if you can make changes to fix the problem. When I was learning Spark, I had a Python Spark application that crashed with OOM errors. The reason was because I was collecting all the results back in the master rather than letting the tasks save the output.
E.g.
for item in processed_data.collect():
print(item)
failed with OOM errors. On the other hand,
processed_data.saveAsTextFile(output_dir)
worked fine.
Yes, PySpark RDD/DataFrame collect() function is used to retrieve all the elements of the dataset (from all nodes) to the driver node. We should use the collect() on smaller dataset usually after filter(), group(), count() etc. Retrieving larger dataset results in out of memory.