Size of a random forest model in MLlib - scala

I have to compute and keep in memory several (e.g. 20 or more) random forest models with Apache Spark.
I only have 8 GB available on the driver of the YARN cluster I use to launch the job, and I am running into OutOfMemory errors because the models do not fit in memory. I have already decreased the ratio spark.storage.memoryFraction to 0.1 to try to increase the non-RDD memory.
I thus have two questions:
How can I make these models fit in memory?
How can I check the size of my models?
EDIT
I have 200 executors, each with 8 GB of space.
I am not sure my models live on the driver, but I suspect they do, since I get OutOfMemory errors while there is plenty of free space on the executors. Furthermore, I store these models in Arrays.
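A rough way to check the size of the models (the second question) is Spark's SizeEstimator utility, which estimates the in-memory footprint of a JVM object. A minimal sketch, assuming the old mllib RandomForestModel API and that the trained forests already sit in an Array on the driver (the models value and the HDFS path below are placeholders):

import org.apache.spark.util.SizeEstimator
import org.apache.spark.mllib.tree.model.RandomForestModel

// placeholder: the forests you already trained and keep in an Array on the driver
val models: Array[RandomForestModel] = ???

// rough in-memory size of each model, in bytes
models.zipWithIndex.foreach { case (model, i) =>
  println(s"model $i ~ ${SizeEstimator.estimate(model)} bytes")
}

// If the total does not fit in driver memory, one option is to persist each model
// to disk/HDFS as soon as it is trained instead of keeping them all in the Array,
// e.g. model.save(sc, s"hdfs:///models/rf_$i") where sc is your SparkContext.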

Related

Spark Garbage Collection Tuning - Reduce Memory for Caching using spark.memory.fraction - Why?

I was going through the garbage collection tuning section of the book Spark: The Definitive Guide, where it says that
If a full garbage collection is invoked multiple times before a task completes, it means that there isn’t enough memory available for executing tasks, so you should decrease the amount of memory Spark uses for caching i.e. spark.memory.fraction
Also the Spark documentation says,
If the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution
(https://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning)
Question -
Why should we reduce spark.memory.fraction to reduce the memory for caching?
Shouldn't we instead reduce spark.memory.storageFraction, which is the amount of storage memory immune to eviction, expressed as a fraction of the size of the region set aside by spark.memory.fraction?
There is a relationship between the two:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
So this is really the coarse tuning knob and the fine tuning knob. In practice, for performance tuning, I would start by tuning the code, then the number of partitions, and only then consider tuning configuration settings. I hope you are following that path first before digging into the minutiae of these settings. They come in handy, but only when you've done the rest of the work to get to them.
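For concreteness, here is a minimal sketch of where the two knobs live; the numbers are purely illustrative, not recommendations, and the same settings can be passed on the command line via --conf:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  // M: fraction of (JVM heap - 300MB) shared by execution and storage (default 0.6)
  .config("spark.memory.fraction", "0.5")
  // R: fraction of M whose cached blocks are immune to eviction by execution (default 0.5)
  .config("spark.memory.storageFraction", "0.3")
  .getOrCreate()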

Can fastText train with a corpus bigger than RAM?

I need to train a fastText model on a 400 GB corpus. As I don't have a machine with 400 GB of RAM, I want to know if the fastText implementation (for example, following this tutorial: https://fasttext.cc/docs/en/unsupervised-tutorial.html) supports corpora bigger than RAM, and what RAM requirements I would have.
Generally for such models, the peak RAM requirement is a function of the size of the vocabulary of unique words, rather than the raw training material.
So, are there only 100k unique words in your 400GB? No problem: it'll only be reading a range at a time and updating a small, stable amount of RAM. Are there 50M unique words? You'll need a lot of RAM.
Have you tried it to see what would happen?
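If you want a rough idea before committing to a full training run, you can estimate the vocabulary size with a single streaming pass over (a sample of) the corpus. A sketch, assuming whitespace tokenisation and that the vocabulary itself fits in memory; corpus.txt is a placeholder for your data:

import scala.io.Source
import scala.collection.mutable

// stream the file line by line; only the set of distinct tokens is kept in memory
val vocab = mutable.HashSet.empty[String]
for (line <- Source.fromFile("corpus.txt").getLines(); token <- line.split("\\s+") if token.nonEmpty)
  vocab += token
println(s"approximate number of unique tokens: ${vocab.size}")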

AnyLogic: Efficiently exporting data from experiment with replications

I am doing a parameter variation experiment with 1000 replications for each iteration. For each of these model runs, I want to store a copy of a dataset that is in Main. My current setup is to write that dataset to an Excel file after each simulation run, using the After simulation run field of the experiment with the following code:
ds_export.fillFrom(root.ds_costAll);
excelfile.writeDataSet(ds_export, 1, 2, 1 + i*2);
Where i is a counter for the current iteration.
However, I am running into some performance issues. I believe copies of ds_costAll are being stored in my system's memory, in anticipation of my experiment being completed, upon which they will be written to the Excel file. This means that my system's memory utilization is nearing 100% while the CPU is hardly bothered. My system has 16 GB of memory, and the maximum available memory of the experiment is also 16 GB. Is there a way to export this data more efficiently?
How many cores are you using at runtime?
Tools->Preferences->Runtime->Number of processes for parallel execution
Might be an option to reduce it a bit.
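If the bottleneck really is buffering everything for the ExcelFile until the experiment ends, another idea worth considering is appending each run's rows to a plain CSV file as soon as the run finishes, so nothing accumulates in memory. The sketch below is generic JVM code, shown in Scala for consistency with the rest of this post; the file name, run counter and row extraction are placeholders, and in AnyLogic you would put the equivalent Java in the After simulation run field:

import java.io.FileWriter

// placeholders: the iteration counter (i in the question) and this run's (x, y) pairs
val runIndex: Int = ???
val rows: Seq[(Double, Double)] = ???

// open in append mode and flush this run's data to disk immediately
val writer = new FileWriter("costAll.csv", true)
try rows.foreach { case (x, y) => writer.write(s"$runIndex,$x,$y\n") }
finally writer.close()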

Datalab kernel crashes because of data set size. Is load balancing an option?

I am currently running the virtual machine with the highest memory, n1-highmem-32 (32 vCPUs, 208 GB memory).
My data set is around 90 gigs, but has the potential to grow in the future.
The data is stored in many zipped CSV files. I am loading the data into a sparse matrix in order to perform some dimensionality reduction and clustering.
The Datalab kernel runs on a single machine. Since you are already running on a 208GB RAM machine, you may have to switch to a distributed system to analyze the data.
If the operations you are doing on the data can be expressed as SQL, I'd suggest loading the data into BigQuery, which Datalab has a lot of support for. Otherwise you may want to convert your processing pipeline to use Dataflow (which has a Python SDK). Depending on the complexity of your operations, either of these may be difficult, though.

Degrading performance when increasing number of slaves [duplicate]

I am doing a simple scaling test on Spark using a sort benchmark -- from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.
# run Spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
# run Spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
The input and output directories are in HDFS in each case.
1 core: 80 secs
8 cores: 160 secs
I would expect the 8-core run to show some amount of speedup.
Theoretical limitations
I assume you are familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as:
S(s) = 1 / ((1 - p) + p / s)
where:
s - is the speedup of the parallel part.
p - is the fraction of the program that can be parallelized.
In practice the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95), the theoretical limit is quite low: as s grows, the speedup can never exceed 1 / (1 - p) = 20x.
[Plot of Amdahl's law: speedup vs. number of processors for different values of p. From English Wikipedia, by Daniels220, CC BY-SA 3.0.]
Effectively this sets a theoretical bound on how fast you can get. You can expect p to be relatively high in the case of embarrassingly parallel jobs, but I wouldn't dream about anything close to 0.95 or higher. This is because
Spark is a high cost abstraction
Spark is designed to work on commodity hardware at datacenter scale. Its core design is focused on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes and execute long-running jobs, but it doesn't scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations across multiple nodes.
Increasing the amount of available memory without increasing the cost per unit.
These are fundamental problems for large-scale, data-intensive systems.
Parallel processing is more a side effect of this particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with increasing amounts of data by scaling out, not to speed up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than on a typical Spark cluster, but it doesn't necessarily help in data-intensive jobs, due to IO and memory limitations. The problem is how to load data fast enough, not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or multithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvement, and it will typically decrease performance due to the overhead of the components.
In this context:
Assuming that the class and the jar are meaningful and it is indeed a sort, it is just cheaper to read the data (single partition in, single partition out) and sort in memory on a single partition than to execute the whole Spark sorting machinery with shuffle files and data exchange.
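For comparison, this is roughly the baseline such a small job competes against: a plain, single-threaded in-memory sort with no shuffle, no serialization and no task scheduling. The file names are taken from the question; treat this as a sketch rather than the asker's actual john.sort code:

import scala.io.Source
import java.io.PrintWriter

// read the whole 800 MB file, sort it in memory, write it back out
val lines  = Source.fromFile("data_800MB.txt").getLines().toArray
val sorted = lines.sorted
val out    = new PrintWriter("data_800MB_output")
try sorted.foreach(line => out.println(line)) finally out.close()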