How to purge spark driver memory after collect() - scala

I have to do a lot of little collect() operations in my application to send data through HTTP calls.
val payload = sparkSession.sql(s"select * from table where ID = $id").toJSON.collect().mkString("\n")
Is there a way to purge used objects to free some memory space in my driver between operations?

First off, I agree with @Luis Miguel Mejia Suarez here in that collects are generally bad practice and a bad code smell. I'd take a look at why you are doing collects, and determine if you can do this in a different way.
As for your actual question, the garbage collector will free any unreferenced memory once memory starts getting tight. The snippet you showed above should be fine: the output of collect is operated on immediately and then discarded, so that array becomes eligible for collection at the next GC pause, while the mkString output is kept. Make sure the same holds for the other collect statements you are using.
Additionally, if you are seeing long GC pauses, consider lowering your driver memory size, so that there's less memory to collect. You might also look into tuning your GC parameters. There's lots of documentation on that on the internet, and it is too intricate to describe in detail here.
Finally, you can force the JVM to run garbage collection. You should be able to use System.gc() (https://docs.oracle.com/javase/7/docs/api/java/lang/System.html#gc()). It is a Java method, but since Scala runs on the JVM you can call it directly from Scala code.
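For illustration, here is a minimal sketch of that pattern, assuming a SparkSession is available; the httpPost function and the id value are placeholders, not part of any real API:

import org.apache.spark.sql.SparkSession

def sendChunk(spark: SparkSession, id: String, httpPost: String => Unit): Unit = {
  val payload = spark
    .sql(s"select * from table where ID = '$id'")
    .toJSON
    .collect()        // pulls the (small) result set onto the driver
    .mkString("\n")
  httpPost(payload)
  // payload becomes unreachable once this method returns, so the next GC pause
  // can reclaim it; an explicit System.gc() here is rarely necessary.
}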

Related

Is manually managing memory with .unpersist() a good idea?

I've read a lot of questions and answers here about unpersist() on dataframes, but so far I haven't found an answer to this question:
In Spark, once I am done with a dataframe, is it a good idea to call .unpersist() to manually force that dataframe to be unpersisted from memory, as opposed to waiting for GC (which is an expensive task)? In my case I am loading many dataframes so that I can perform joins and other transformations.
So, for example, if I wish to load and join 3 dataframes A, B and C:
I load dataframes A and B, join these two to create X, and then call .unpersist() on B because I don't need it any more (but I will need A), and could use the memory to load C (which is big). So then I load C, join C to X, and call .unpersist() on C so I have more memory for the operations I will now perform on X and A.
I understand that GC will unpersist for me eventually, but I also understand that GC is an expensive task that should be avoided if possible. To re-phrase my question: is this an appropriate method of manually managing memory, to optimise my Spark jobs?
My understanding (please correct if wrong):
I understand that .unpersist() is a very cheap operation.
I understand that GC calls .unpersist() on my dataframes eventually anyway.
I understand that Spark monitors its cache and drops entries based on a Least Recently Used policy. But in some cases I do not want the least recently used DF to be dropped, so instead I can call .unpersist() on the dataframes I know I will not need in future, so that I don't drop the DFs I will need and have to reload them later.
To re-phrase my question again if unclear: is this an appropriate use of .unpersist(), or should I just let Spark and GC do their job?
Thanks in advance :)
There seems to be some misconception here. While using unpersist is a valid approach to get better control over the storage, it doesn't avoid garbage collection. In fact, all the on-heap objects associated with the cached data are still left to the garbage collector.
So while the operation itself is relatively cheap, the chain of events it triggers might not be. Luckily, an explicit unpersist is no worse than waiting for the automatic cleaner or a GC-triggered cleanup, so if you want to clean up specific objects, go ahead and do it.
To limit the GC impact of cached data, it might be worth taking a look at the OFF_HEAP StorageLevel.
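As a rough sketch of the load/join/unpersist pattern from the question (the parquet paths and the join column are placeholders), it could look like this; note that OFF_HEAP additionally requires off-heap memory to be enabled via spark.memory.offHeap.enabled and spark.memory.offHeap.size:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().getOrCreate()

val a = spark.read.parquet("/data/A").persist()
val b = spark.read.parquet("/data/B").persist()
val x = a.join(b, "key").persist()
x.count()          // materialise X so B's cached blocks are no longer needed
b.unpersist()      // non-blocking by default; unpersist(true) waits for the blocks to be dropped

// OFF_HEAP keeps the cached blocks outside the Java heap, reducing GC pressure.
val c = spark.read.parquet("/data/C").persist(StorageLevel.OFF_HEAP)
val result = x.join(c, "key")
result.count()
c.unpersist()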

What's the difference between collect and count actions?

When I was writing a Scala program in local-mode Spark, with code similar to RDD.map(x => function()).collect, there was no log output in the console for a long time and I guessed it was stuck. So I changed the action from collect to count, and the whole execution completed quickly. Also, the map phase produces only a few records to be collected, so the problem can't be caused by the network transfer when sending the result back to the driver.
Does anyone know the reason, or has anyone run into a similar problem?
The method count sums the number of entries of the RDD for every partition and returns a single number, so the data transfer is minimal. The method collect, on the other hand, brings all the entries back to the driver, as its name suggests, so if there isn't enough memory you may get various exceptions (this is why a collect is not recommended unless you are sure the result will fit in your driver; there are usually faster alternatives such as take, which also triggers the lazy transformations). It also requires transferring much more data.
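To make the difference concrete, here is a small hypothetical example (expensiveFn stands in for the real map function):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("collect-vs-count").getOrCreate()
def expensiveFn(x: Int): Int = x * 2   // placeholder for the real work

val rdd = spark.sparkContext.parallelize(1 to 1000000).map(expensiveFn)

val n: Long = rdd.count()     // each partition returns one count; only a handful of numbers reach the driver
val sample = rdd.take(10)     // cheap way to peek at a few results
val all = rdd.collect()       // copies every element into driver memory; can run out of memory or appear to hang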

Scala: performance boost on incremental garbage collection

I have written an application in Scala. Basically, the first step is to create an array of objects and then to initialise these objects from a CSV file. When running the application on the JVM it is really slow, and after some experimenting I found out that using the -J-Xincgc flag, which enables incremental garbage collection, speeds up the application by a factor of 4 (it's 4 times faster with the switch!). I wonder:
Why?
Did I use some inefficient coding, and if so, where should I start to find out what's going on?
Thanks!
I'll assume you're running this on HotSpot.
The HotSpot JVM has a whole zoo of garbage collectors, most of which also have sub-modes or various command-line switches that significantly alter their behavior.
Which GC is used by default varies based on JVM version, operating system and whether the VM is 32- or 64-bit.
So you basically changed whatever the default was to a specific algorithm that happened to perform "faster" for your workload.
But "faster" is a fuzzy measure. Wall time is not the same as CPU cycles spent if you consider multi-threading. And some collectors may simply choose to grow the heap more aggressively, thus deferring the cost of collection to a later point in time, which you might not have measured if your program didn't run long enough.
To make an accurate assessment, much more information would be needed:
what GC was used by default
your VM version
how many cores your CPU has
what kind of workload you have (multi/single-threaded, long/short-running, expected memory footprint, object allocation rate)
Oracle's GC tuning guide may prove useful for you.
In your case, -Xincgc translates to CMS in incremental mode, which is intended for single-core environments and has been deprecated as of java8. It probably just happened to be better than the default, but it's not necessarily an optimal choice.
If you get into a situation where you are running close to your heap-size limit, you can waste a lot of GC time, which can lead to a lot of false findings about performance. If that's your situation, first increase your heap-size limit before doing anything else. Consider using jvisualvm to eyeball the situation - it's trivially easy to get started with.
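If you want to check which collectors your JVM actually picked before experimenting with flags, one quick way (a sketch, assuming Scala 2.13 for the collection converters) is to query the GC MX beans:

import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

object GcReport extends App {
  // Prints the name of each active collector plus how many collections it has
  // run and the total time spent in them.
  ManagementFactory.getGarbageCollectorMXBeans.asScala.foreach { gc =>
    println(s"${gc.getName}: ${gc.getCollectionCount} collections, ${gc.getCollectionTime} ms total")
  }
}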

Is it a good idea to warm up a cache in the BEGIN block in Perl?

Is it a good idea to warm up cache in the BEGIN block, when it gets used?
You didn't really provide any information on what kind of environment you're talking about, which I think is important. In most cases the answer is probably "no", but I can think of one case where it's a definite yes, which is preforking servers -- web applications and the like. In that case, any work that you can do "before the fork" not only saves the cost of having the children recompute the same values individually, it also saves memory, since the pages containing the results can be shared across all of the child processes by the OS's COW mechanism.
If you're talking about a module you're writing and not an application, then I'd say no, don't lift things to compilation time without the user's permission unless they're things that have to be done for the module to work. Instead, provide a preheat_cache class method, and if your caller has a reason to need a hot cache at compile time they can put the call into a BEGIN block themselves. You could also use a :preheat_cache import tag but that's unnecessarily fancy in my book.
If it's a choice between preloading your cache at compile time, or preloading your cache as the first thing you do at run time, there's virtually no difference.
If your cache is large enough that loading it will trigger a lot of page swaps, that's an argument for waiting until run time. That way, all your module loading and other compile time code can be done while your system is under a lighter load.
I'm going to go with "no", even though I could be wrong. Reasoning goes like this: keep the code, and data it uses, small, so that it takes up less space in any caches (I am presuming you mean CPU cache, not programmatic hashes with common query results or some such thing).
Unless you see some sort of bad access pattern, trying to second guess what needs to be prefetched is probably useless at best. In fact such code or initialization data is likely to displace something you (or another process on the system) were actually using. Think about what you can do in the actual work part of the code to maximize locality of reference, to try to stay within smaller memory regions at any one time.
I used to use "top" to detect when processes were swapping between memory and disk. I don't know of any good tools yet to tell how often a process is getting cache misses and going to plain old slow mo'board memory. There must be such tools, I just don't know what they are yet (software tools, rather than some custom In Circuit Emulator type hardware). Perhaps some thought on this earlier in the day...
By "warm up" I assume you mean using a BEGIN block to guarantee the cache is preloaded before anything else in your script executes?
If you need the cache for your program to run properly, then yes, I think it would be a good idea.

Immutable Map implementation for huge maps

If I have an immutable Map which I might expect (over a very short period of time - like a few seconds) to be adding/removing hundreds of thousands of items from, is the standard HashMap a bad idea? Let's say I want to pass 1 GB of data through the Map in under 10 seconds in such a way that the maximum size of the Map at any one instant is only 256 MB.
I get the impression that the map keeps some kind of "history" but I will always be accessing the last-updated table (i.e. I do not pass the map around) because it is a private member variable of an Actor which is updated/accessed only from within reactions.
Basically I suspect that this data structure may be (partly) at fault for the issues I am seeing with JVMs running out of memory when reading in large amounts of data in a short time.
Would I be better off with a different map implementation and, if so, what is it?
Ouch. Why do you have to use an immutable map? Poor garbage collector! Immutable maps generally require O(log n) new objects per operation in addition to O(log n) time, or they really just wrap mutable hash maps and layer changesets on top (which slows things down and can increase the number of object creations).
Immutability is great, but this does not seem to me like the time to use it. If I were you, I'd stick with scala.collection.mutable.HashMap. If you need concurrent access, wrap the Java util.concurrent one instead.
You also might want to increase the size of the young generation in the JVM: -Xmn1G or more (assuming you're running with -Xmx3G). Also, use the throughput (parallel) garbage collector.
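A minimal sketch of the mutable alternative suggested here (the key/value types are placeholders, and the actor plumbing is omitted; all access is assumed to happen from the single actor thread):

import scala.collection.mutable

final class TableState {
  // Plain mutable map held as private actor state: updates happen in place
  // instead of allocating new tree nodes per operation, which keeps GC pressure low.
  private val table = mutable.HashMap.empty[String, Array[Byte]]

  def put(key: String, value: Array[Byte]): Unit = table.update(key, value)
  def remove(key: String): Option[Array[Byte]] = table.remove(key)
  def get(key: String): Option[Array[Byte]] = table.get(key)
  def size: Int = table.size
}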
That would be awful. You say you always want to access the last-updated table, which means you only need an ephemeral data structure; there is no need to pay the cost of a persistent data structure. It's like trading time and memory to gain completely arguable "style points". You are not building your karma by blindly using persistent structures when they are not called for.
Also, a hash table is a particularly difficult structure to make persistent. In other words, "very, very slow" (basically it is usable when reads greatly outnumber writes - and you seem to be talking about many writes).
By the way, a ConcurrentHashMap wouldn't make sense in this design, given that the map is accessed from a single actor (that's what I understand from the description).
Scala's so-called(*) immutable Map is broken beyond basic usage up to Scala 2.7. Don't trust me, just look up the number of open tickets for it. The solution was simply "it will be replaced with something else in Scala 2.8" (which it was).
So, if you want an immutable map for Scala 2.7.x, I'd advise looking for it in something other than Scala. Or just use TreeHashMap instead.
(*) Scala's immutable Map isn't really immutable. It is a mutable data structure internally, which requires a lot of synchronization.