Safe to assume pyspark's subtract on an RDD is slow / problematic? - pyspark

While running a pyspark job, I notice that as the input grows I keep getting memory errors like the following...
ERROR cluster.YarnScheduler: Lost executor 12 on compute-2-10.local:
Container killed by YARN for exceeding memory limits. 1.5 GB of 1.5 GB
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
While the job completes successfully on a smaller input even though the error is repeated throughout execution, it eventually dies as the input size increases.
I have ~20,000,000 rows from which I need to filter out ~661,000 rows. Given the format of the key, I can't think of any other way of dealing with this besides using subtract.

1.5 GB is very low for executor memory; as a rule of thumb, give it at least twice that amount. This problem should likely be resolved simply by not starving the executors of resources.

I'd use at least 4 GB of executor memory. You can set it like this:
pyspark --num-executors 5 --driver-memory 2g --executor-memory 4g
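If boosting memory alone is not enough: subtract itself is expensive because it shuffles both datasets by key. A lighter-weight alternative is to broadcast the ~661,000 keys and filter. A minimal pyspark sketch, assuming hypothetical RDDs big_rdd and exclude_rdd keyed by their first field:
# collect the small key set on the driver and broadcast it to every executor
exclude_keys = set(exclude_rdd.map(lambda row: row[0]).collect())
exclude_bc = sc.broadcast(exclude_keys)
# filter the large RDD locally on each executor, avoiding the shuffle that subtract triggers
filtered = big_rdd.filter(lambda row: row[0] not in exclude_bc.value)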

Related

How to perform large computations on Spark

I have 2 tables in Hive, user and item, and I am trying to calculate the cosine similarity between 2 features of each table for the Cartesian product of the 2 tables, i.e. a cross join.
There are around 20,000 users and 5,000 items, resulting in 100 million rows to calculate. I am running the computation using Scala Spark on a Hive cluster with 12 cores.
The code goes a little something like this:
val pairs = userDf.crossJoin(itemDf).repartition(100)
val results = pairs.mapPartitions(computeScore) // computeScore is a function to compute the similarity scores I need
The Spark job always fails due to memory issues (GC Allocation Failure) on the Hadoop cluster. If I reduce the computation to around 10 million rows, it definitely works, in under 15 minutes.
How do I compute the whole set without increasing the hardware specifications? I am fine if the job takes longer to run and does not fail halfway.
If you take a look at the Spark documentation you will see that Spark supports different strategies for data management. These policies are enabled by the user via settings in the Spark configuration files or directly in the code or script.
Among the documented storage levels:
The "MEMORY_AND_DISK" level would be good for you, because if the data (RDD) does not fit in RAM, the remaining partitions are stored on disk. This strategy can be slow, though, if you have to access the disk often.
There are a few steps for doing that:
1. Check the expected data volume after the cross join and divide it by 200, since spark.sql.shuffle.partitions defaults to 200. If that is more than about 1 GB of raw data per partition, the default is too low.
2. Calculate the size of one row and multiply it by the row count of the other table to estimate the rough volume. The process works much better with Parquet than with CSV files.
3. spark.sql.shuffle.partitions needs to be set based on total data volume / 500 MB (see the sketch after this list).
4. spark.shuffle.minNumPartitionsToHighlyCompress needs to be set a little lower than the shuffle partition count.
5. Bucket the source Parquet data on the join column for both files/tables.
6. Provide generous Spark executor memory, and manage the Java heap as well, keeping the available heap space in mind.
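A hedged pyspark sketch of steps 3 and 5 above, with purely illustrative numbers and column names:
# step 3: size shuffle partitions to roughly 500 MB each (2000 is only an example value)
spark.conf.set("spark.sql.shuffle.partitions", 2000)
# step 5: bucket both source tables on the (hypothetical) join column and store them as tables
user_df.write.bucketBy(64, "user_id").sortBy("user_id").saveAsTable("user_bucketed")
item_df.write.bucketBy(64, "item_id").sortBy("item_id").saveAsTable("item_bucketed")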

Is the memory limit affected when running multiple instances of Matlab?

I can run multiple instances of Matlab by simply opening the program multiple times. An instance of Matlab has a memory limit.
If I open two Matlab programs on my computer, will this limit be affected and how? E.g. will it be split in two?
As far as I can observe, the memory limit per instance is calculated dynamically based on the actually available RAM. For example, we run one instance per thread on a 12-thread CPU (with 64 GB of RAM) and have never had problems with insufficient memory.
I did a simple test:
Run the first Matlab instance and use the memory command to get memory information:
memory
Maximum possible array: 7651 MB (8.023e+09 bytes)
Memory available for all arrays: 7651 MB (8.023e+09 bytes)
Memory used by MATLAB: 2268 MB (2.378e+09 bytes)
Physical Memory (RAM): 16263 MB (1.705e+10 bytes)
Opening the second instance and using the memory command in both instances shows that the available memory decreased in the first instance and is nearly the same as in the second instance.
Opening some other programs which use some memory and running the memory command in both instances again shows that the available memory decreases.
Creating some huge variables in one or both instances also reduces the memory reported in each instance.
I hope this answer helps, although it is more experimental than authoritative.

Spark executor max memory limit

I am wondering if there is any size limit to Spark executor memory?
Considering the case of running a badass job doing collect, unions, count, etc.
Just a bit of context, let's say I have these resources (2 machines)
Cores: 40 per machine, Total = 80 cores
Memory: 156 GB per machine, Total = 312 GB
What's the recommendation, bigger vs smaller executors ?
The suggestion from the Spark development team is not to have an executor larger than about 64 GB (this is often mentioned in training videos by Databricks). The idea is that a larger JVM has a larger heap, which can lead to really slow garbage collection cycles.
I think it is good practice to keep your executors at 32 GB, or even 24 GB or 16 GB, so instead of one large executor you have 2-4 smaller ones.
This will perhaps add some coordination overhead, but it should be fine for the vast majority of applications.
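As a hedged illustration with the resources above (2 machines, 80 cores, 312 GB in total), a layout of several mid-sized executors could look like this; the numbers are only a starting point and your_app.jar is a placeholder:
spark-submit \
  --num-executors 8 \
  --executor-cores 9 \
  --executor-memory 32g \
  --conf spark.executor.memoryOverhead=4g \
  your_app.jar
That leaves a few cores and some memory on each machine for the OS and cluster daemons; on older Spark versions the overhead setting is called spark.yarn.executor.memoryOverhead.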
If you have not read this post, please do.

Size of a random forest model in MLlib

I have to compute and keep in memory several (e.g. 20 or more) random forest models with Apache Spark.
I have only 8 GB available on the driver of the YARN cluster I use to launch the job, and I am facing OutOfMemory errors because the models do not fit in memory. I have already decreased the ratio spark.storage.memoryFraction to 0.1 to try to increase the non-RDD memory.
I have thus two questions:
How could I make these models fit in memory?
How could I check the size of my models?
EDIT
I have 200 executors which have 8GB of space.
I am not sure my models live on the driver, but I suspect they do, since I get OutOfMemory errors while there is plenty of space on the executors. Furthermore, I store these models in Arrays.
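One rough, hedged way to answer the second question: serialize a model and measure its on-disk footprint as a proxy for its in-memory size. The sketch below assumes pyspark in local mode, an already trained pyspark.mllib tree model named model, and a hypothetical path:
import os

model.save(sc, "file:///tmp/rf_model_size_check")  # hypothetical local path
size_bytes = sum(
    os.path.getsize(os.path.join(dirpath, name))
    for dirpath, _, names in os.walk("/tmp/rf_model_size_check")
    for name in names)
print("approx. serialized size: %.1f MB" % (size_bytes / 1e6))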

Degrading performance when increasing number of slaves [duplicate]

I am doing a simple scaling test on Spark using a sort benchmark -- from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.
//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
The input and output directories in each case are in HDFS.
1 core: 80 secs
8 cores: 160 secs
I would expect the 8-core run to show some amount of speedup.
Theoretical limitations
I assume you are familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as follows:
S(s) = 1 / ((1 - p) + p / s)
where:
s - is the speedup of the parallel part.
p - is the fraction of the program that can be parallelized.
In practice the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95) the theoretical limit is quite low, since the speedup can never exceed 1 / (1 - p) = 20x:
[Figure: Amdahl's law speedup curves for various values of p. Image by Daniels220 at English Wikipedia, CC BY-SA 3.0.]
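As a quick numeric illustration of the formula, a small Python snippet evaluating the bound for p = 0.95:
def amdahl_speedup(p, s):
    # theoretical speedup when a fraction p is parallelized and that part is sped up s times
    return 1.0 / ((1.0 - p) + p / s)

for s in [2, 4, 8, 16, 1e9]:
    print(s, round(amdahl_speedup(0.95, s), 2))
# roughly 1.9, 3.5, 5.9, 9.1 and, in the limit, 20.0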
Effectively, this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn't dream about anything close to 0.95 or higher. This is because
Spark is a high cost abstraction
Spark is designed to work on commodity hardware at datacenter scale. Its core design is focused on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes
and execute long-running jobs, but it doesn't scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations between multiple nodes.
Increasing amount of available memory without increasing the cost per unit.
which are fundamental problems for large scale, data intensive systems.
Parallel processing is more a side effect of the particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with an increasing amount of data by scaling out, not to speed up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than on a typical Spark cluster, but this doesn't necessarily help in data-intensive jobs due to IO and memory limitations. The problem is how to load data fast enough, not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or multithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvements and will typically decrease performance due to the overhead of its components.
In this context:
Assuming that the class and jar are what they claim to be and this is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to run the whole Spark sorting machinery with its shuffle files and data exchange.
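For intuition, a minimal local baseline, assuming the input is a plain text file of lines that fits in memory (as 800 MB does); sorting it this way involves no shuffle files, no serialization between processes and no task scheduling:
# read, sort and write the file entirely within one process
with open("data_800MB.txt") as f:
    lines = f.readlines()
lines.sort()
with open("data_800MB_output", "w") as f:
    f.writelines(lines)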