Scala specific -- does an immutable.Set consume significantly more memory than a mutable.Set?

Using Scala 2.12.8 on Java HotSpot 17.0.1, I have a large number of instances of an object which contains code like this:
var tmpSet: mutable.Set[SomeType] = mutable.Set[SomeType]()
lazy val finalSet: immutable.Set[SomeType] = { val tmp = tmpSet.toSet; tmpSet = null; tmp }
During initialization, the logic builds tmpSet in all of the objects, then runs this:
for(i <- 0 until numberOfInstances ) instance(i).finalSet.size
to force the conversion to an immutable.Set, which will be used for all further processing.
Before the conversion, running with -Xmx14G, about 4.5G of memory has been consumed (for all the tmpSets). The conversion always throws an OOM. I have placed traces of memory use at points within the for(...) loop and can see memory usage steadily increasing until the OOM.
Any idea what is happening here? Even if the GC never runs and does not recover any of the tmpSet instances that have been set to null, there should still be enough RAM -- unless an immutable.Set takes far more memory than the equivalent mutable.Set.
WITHDRAWING this question. I wrote a testbed (below as an answer) to mimic this situation and it does NOT show this behaviour -- so it must be some other problem within my codebase.

Testbed code to mimic the situation, and it does not exhibit the problem.
/** Test case to explore an OOM issue when converting a large number of mutable.Sets to immutable.Sets.
  * The 'lazy val finalSet' logic converts each mutable.Set into an immutable.Set, releasing the mutable.Set in the process.
  *
  * The original expectation was that this would consume roughly the same amount of memory as before the conversion!!!
  **/
import scala.collection.mutable
import scala.collection.immutable
case class TestSetOOM(numStrings: Int) {
  import TestSetOOM._

  // tmpSet is loaded with sizeOfSet unique strings when this class is initialized
  var tmpSet: mutable.Set[String] = mutable.Set[String]()
  // finalSet is an immutable.Set initialized when finalSet.size is invoked. tmpSet is converted, and then released
  lazy val finalSet: immutable.Set[String] = { val tmp = tmpSet.toSet; tmpSet = null; tmp }

  // Executes at initialization
  for(i <- 0 until numStrings) tmpSet += getString
}
object TestSetOOM {
  /////// The basic parameters controlling the tests ////////
  val numInstances = 10000 // How many instances of TestSetOOM to create
  val sizeOfString = 1000  // How large to make each test string (+ 16 for the number)
  val sizeOfSet    = 1000  // How many test strings to add to the 'tmpSet' within each instance
  val gcEveryN     = 1000  // During the conversion phase, request a GC & pause every N instances processed
                           // NOTE: Would REALLY love to have a method to FORCE a complete, deep GC for use
                           //       in exactly test routines such as this!!!
  val pauseMillis  = 100   // Number of milliseconds to pause during the gcEveryN, can be zero for no pause
                           // The hope is that if we pause the main thread then the JVM may actually run the GC
  val reportAsMB   = true  // True == show memory reports as MB, false as GBs
  //////// ... end of basic parameters ... ////////

  val baseData = numInstances.toLong * sizeOfSet * (sizeOfString + 16)

  def main(ignored: Array[String]): Unit = {
    var instances: List[TestSetOOM] = Nil // Create numInstances and prepend to this list
    for(i <- 0 until numInstances) instances ::= TestSetOOM(sizeOfSet)
    runtime.gc
    if(pauseMillis > 0) Thread.sleep(pauseMillis)
    ln("")
    ln(f"Instances: $numInstances%,d, Size Of mutable.Set: $sizeOfSet%,d, Size of String: ${sizeOfString + 16}%,d == ${MBorGB(baseData)} base data size")
    ln("")
    ln(s"   BASELINE -- $memUsedStr -- after initialization of all test data")
    var dummy = 0L
    var cnt   = 0
    instances.foreach { item =>
      dummy += item.finalSet.size // Forces the conversion
      cnt   += 1
      if(gcEveryN > 0 && (cnt % gcEveryN) == 0){
        runtime.gc
        if(pauseMillis > 0) Thread.sleep(pauseMillis)
        ln(f"$cnt%,11d converted -- $memUsedStr ")
      }
    }
    ln("")
    ln(s"   FINAL    -- $memUsedStr")
  }
  val runtime = Runtime.getRuntime

  // Get a memory report in either MBs or GBs
  def memUsedStr = {
    val max  = runtime.maxMemory
    val ttl  = runtime.totalMemory
    val free = runtime.freeMemory
    val used = ttl - free
    s"Memory -- Max: ${MBorGB(max)}, Total: ${MBorGB(ttl)}, Used: ${MBorGB(used)}"
  }

  def MBorGB(amount: Long) = {
    val amt = amount / (if(reportAsMB) 1000000 else 1000000000L)
    val as  = if(reportAsMB) "MB" else "GB"
    f"$amt%,9d $as"
  }

  // Generate strings with leading unique numbers as test data so the compiler cannot
  // recognize them as equal and memoize just one if they were all the same!
  val emptyString = "X" * sizeOfString
  var numString   = 0L

  def getString = {
    numString += 1
    f"${numString}%,16d$emptyString"
  }

  def ln(str: String) = println(str)
}
Output of execution
Instances: 10,000, Size Of mutable.Set: 1,000, Size of String: 1,016 == 10,160 MB base data size
BASELINE -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,746 MB -- after initialization of all test data
1,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,783 MB
2,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,825 MB
3,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,867 MB
4,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,909 MB
5,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,951 MB
6,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 10,992 MB
7,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,034 MB
8,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,076 MB
9,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,117 MB
10,000 converted -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,158 MB
FINAL -- Memory -- Max: 15,032 MB, Total: 15,032 MB, Used: 11,162 MB
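Regarding the NOTE in the testbed about wanting a way to FORCE a complete, deep GC: a stock JVM offers no guaranteed mechanism, but a common heuristic is to allocate a weakly referenced sentinel and loop on System.gc() until it clears. A minimal sketch of that trick (heuristic only, not a guarantee):
import java.lang.ref.WeakReference

// Heuristic: a cleared WeakReference usually (but not provably) means a collection ran.
def forceGC(maxTries: Int = 50): Unit = {
  val sentinel = new WeakReference(new Object)
  var tries = 0
  while (sentinel.get != null && tries < maxTries) {
    System.gc()      // request a collection
    Thread.sleep(10) // give the collector a chance to run
    tries += 1
  }
}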

Related

Speed up autovacuum in Postgres

I have a question regarding Postgres autovacuum / vacuum settings.
I have a table with 4.5 billion rows and there was a period of time with a lot of updates resulting in ~ 1.5 billion dead tuples. At this point autovacuum was taking a long time (days) to complete.
When looking at the pg_stat_progress_vacuum view I noticed that:
max_dead_tuples = 178956970
resulting in multiple index rescans (index_vacuum_count)
According to the docs, max_dead_tuples is the number of dead tuples that we can store before needing to perform an index vacuum cycle, based on maintenance_work_mem.
According to this, one dead tuple requires 6 bytes of space.
So 6 bytes × 178,956,970 ≈ 1 GB.
But my settings are
maintenance_work_mem = 20GB
autovacuum_work_mem = -1
So what am I missing? Why didn't all my 1.5 billion dead tuples fit in max_dead_tuples, since 20GB should give enough space, and why were multiple index vacuum cycles necessary?
There is a hard-coded limit of 1GB on the memory used to hold dead tuples in one VACUUM cycle; see the source:
/*
 * Return the maximum number of dead tuples we can record.
 */
static long
compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
{
    long    maxtuples;
    int     vac_work_mem = IsAutoVacuumWorkerProcess() &&
        autovacuum_work_mem != -1 ?
        autovacuum_work_mem : maintenance_work_mem;

    if (useindex)
    {
        maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
        maxtuples = Min(maxtuples, INT_MAX);
        maxtuples = Min(maxtuples, MAXDEADTUPLES(MaxAllocSize));

        /* curious coding here to ensure the multiplication can't overflow */
        if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
            maxtuples = relblocks * LAZY_ALLOC_TUPLES;

        /* stay sane if small maintenance_work_mem */
        maxtuples = Max(maxtuples, MaxHeapTuplesPerPage);
    }
    else
        maxtuples = MaxHeapTuplesPerPage;

    return maxtuples;
}
MaxAllocSize is defined in src/include/utils/memutils.h as
#define MaxAllocSize ((Size) 0x3fffffff) /* 1 gigabyte - 1 */
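Incidentally, that cap explains the exact max_dead_tuples figure seen above: MAXDEADTUPLES divides the available allocation by the 6-byte ItemPointerData entries. A quick back-of-the-envelope check (ignoring the small array header that the macro subtracts):
// Rough arithmetic only; the real MAXDEADTUPLES macro subtracts a small struct header first.
val maxAllocSize     = 0x3fffffffL // MaxAllocSize: 1 GB - 1
val itemPointerBytes = 6L          // sizeof(ItemPointerData)
println(maxAllocSize / itemPointerBytes) // 178956970, matching max_dead_tuples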
You could lobby on the pgsql-hackers list to increase the limit.

Async reading of large Cassandra table using Scala / Phantom-DSL

I have an issue reading a table containing >800k rows. I need to read the rows from top to bottom in order to process them.
I use Scala and Phantom for this purpose.
Here is how my table looks:
CREATE TABLE raw (
    id uuid PRIMARY KEY,
    b1 text,
    b2 timestamp,
    b3 text,
    b4 text,
    b5 text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
So far I've tried to read the table using:
def getAllRecords: Future[Seq[row]] = select.fetch
or the more fancy Play Enumerator, combined with an Iteratee:
def getAllRecords: Enumerator = select.fetchEnumerator
Neither of these works; it seems like Cassandra/the driver/my program always tries to read all records upfront. What am I missing here?
Have you tried reviewing the code in the bigger read tests?
class IterateeBigReadPerformanceTest extends BigTest with ScalaFutures {

  it should "read the correct number of records found in the table" in {
    val counter: AtomicLong = new AtomicLong(0)
    val result = TestDatabase.primitivesJoda.select
      .fetchEnumerator run Iteratee.forEach {
        r => counter.incrementAndGet()
      }

    result.successful {
      query => {
        info(s"done, reading: ${counter.get}")
        counter.get() shouldEqual 2000000
      }
    }
  }
}
This is not something that will read your records upfront. In fact, we have tests that run for more than one hour to guarantee sufficient GC pauses, no GC overhead, and that permgen/metaspace pressure remains within bounds, etc.
If anything has indeed changed, it is only in error, but this should still work.
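If the enumerator still appears to buffer everything, one thing worth checking (an assumption on my part) is the fetch size of the underlying DataStax Java driver, which controls how many rows each page request retrieves. A minimal sketch against a plain driver Cluster; how this threads into the Phantom connector varies by Phantom version:
import com.datastax.driver.core.{Cluster, QueryOptions}

// A smaller fetch size makes the driver page through the table
// instead of pulling large chunks of the result at once.
val cluster = Cluster.builder()
  .addContactPoint("127.0.0.1") // hypothetical contact point
  .withQueryOptions(new QueryOptions().setFetchSize(1000)) // rows per page
  .build()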

Scope of 'spark.driver.maxResultSize'

I'm running a Spark job to aggregate data. I have a custom data structure called a Profile, which basically contains a mutable.HashMap[Zone, Double]. I want to merge all profiles that share a given key (a UUID), with the following code:
def merge = (up1: Profile, up2: Profile) => { up1.addWeights(up2); up1 }
val aggregated = dailyProfiles
  .aggregateByKey(new Profile(), 3200)(merge, merge).cache()
Curiously, Spark fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 116318 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
The obvious solution is to increase spark.driver.maxResultSize, but two things puzzle me.
It seems too much of a coincidence that the reported total (1024.0 MB) is "bigger than" a limit of exactly 1024.0 MB.
All the documentation and help I found googling this particular error and configuration parameter indicate that it affects functions that take a value back to the driver (say take() or collect()), but I'm not taking ANYTHING back to the driver, just reading from HDFS, aggregating, and saving back to HDFS.
Does anyone know why I'm getting this error?
Yes, it's failing because the values we see in the exception message are rounded to one decimal place, while the comparison happens in bytes.
Your serialized output must be more than 1024.0 MB and less than 1024.1 MB.
Check the Apache Spark code snippets below; it's very interesting, and very rare to get this error. :)
Here totalResultSize > maxResultSize: both are Long types and hold the value in bytes. But msg holds the rounded value from Utils.bytesToString().
// TaskSetManager.scala
def canFetchMoreResults(size: Long): Boolean = sched.synchronized {
  totalResultSize += size
  calculatedTasks += 1
  if (maxResultSize > 0 && totalResultSize > maxResultSize) {
    val msg = s"Total size of serialized results of ${calculatedTasks} tasks " +
      s"(${Utils.bytesToString(totalResultSize)}) is bigger than spark.driver.maxResultSize " +
      s"(${Utils.bytesToString(maxResultSize)})"
    logError(msg)
    abort(msg)
    false
  } else {
    true
  }
}
Apache Spark 1.3 - source
// Utils.scala
def bytesToString(size: Long): String = {
  val TB = 1L << 40
  val GB = 1L << 30
  val MB = 1L << 20
  val KB = 1L << 10

  val (value, unit) = {
    if (size >= 2*TB) {
      (size.asInstanceOf[Double] / TB, "TB")
    } else if (size >= 2*GB) {
      (size.asInstanceOf[Double] / GB, "GB")
    } else if (size >= 2*MB) {
      (size.asInstanceOf[Double] / MB, "MB")
    } else if (size >= 2*KB) {
      (size.asInstanceOf[Double] / KB, "KB")
    } else {
      (size.asInstanceOf[Double], "B")
    }
  }
  "%.1f %s".formatLocal(Locale.US, value, unit)
}
Apache Spark 1.3 - source
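To make the rounding concrete, here is a small standalone sketch of the same formatting logic (a simplified reimplementation for illustration, not the Spark source), showing two different byte counts that both print as 1024.0 MB:
import java.util.Locale

// Simplified copy of the MB branch of bytesToString, for illustration only.
def asMB(size: Long): String =
  "%.1f MB".formatLocal(Locale.US, size.toDouble / (1L << 20))

val maxResultSize   = 1L << 30             // default 1g = 1,073,741,824 bytes
val totalResultSize = maxResultSize + 1024 // just over the limit, in bytes

println(asMB(maxResultSize))   // 1024.0 MB
println(asMB(totalResultSize)) // 1024.0 MB -- same rounded display, yet greater in bytes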

servicestack.redis wrapper poor performance

We are trying to store some big buffers (8MB each) in Redis using the ServiceStack wrapper. We use the "RedisNativeClient.Set(string key, byte[] value)" API to set the buffers.
Both client and server reside on the same machine.
Persistence is disabled.
We are currently using the evaluation version of ServiceStack.
The problem is that we get very poor performance -- around 60 MB/sec.
Using a different C# wrapper ("Sider"), we get better performance (~400 MB/sec).
The code I used for my measurements:
public void SimpleTest()
{
    Stopwatch sw;
    long ms1, ms2, interval;
    int nBytesHandled = 0;
    int nBlockSizeBytes = 8000000;
    int nMaxIterations = 5;
    byte[] pBuffer = new byte[nBlockSizeBytes];

    // Create Redis Wrapper
    ServiceStack.Redis.RedisNativeClient m_serviceStackRedisClient = new ServiceStack.Redis.RedisNativeClient();
    // Clear DB
    m_serviceStackRedisClient.FlushAll();

    sw = Stopwatch.StartNew();
    ms1 = sw.ElapsedMilliseconds;
    for (int i = 0; i < nMaxIterations; i++)
    {
        m_serviceStackRedisClient.Set("eitan" + i.ToString(), pBuffer);
        nBytesHandled += nBlockSizeBytes;
    }
    ms2 = sw.ElapsedMilliseconds;
    interval = ms2 - ms1;

    // Calculate rate
    double dMBPerSEc = nBytesHandled / 1024.0 / 1024.0 / ((double)interval / 1000.0);
    Console.WriteLine("Rate {0:N4}", dMBPerSEc);
}
What could the problem be?
Thanks,
Eitan.
ServiceStack.Redis uses a reusable buffer pool to reduce memory pressure by reusing a pool of byte buffers. The size of the default byte[] buffer is 1450 bytes, to fit within the Ethernet MTU packet size. Whilst this default configuration is optimal for the normal use case of smaller payloads (<100k), it looks like it ends up being slower for larger payloads (>1MB).
Based on this, the ServiceStack.Redis client has now been modified so that it no longer uses the buffer pool for payloads larger than 500k, a threshold which is now configurable with RedisConfig.BufferPoolMaxSize, e.g.:
RedisConfig.BufferPoolMaxSize = 500000;
The default 1450 byte size of the byte[] buffer is now also configurable with:
RedisConfig.BufferLength = 1450;
This change now improves the throughput performance of ServiceStack.Redis for larger payloads as seen in RedisBenchmarkTests suite which uses your benchmark with different payload sizes, e.g:
public void Run(string name, int nBlockSizeBytes, Action<int, byte[]> fn)
{
    Stopwatch sw;
    long ms1, ms2, interval;
    int nBytesHandled = 0;
    int nMaxIterations = 5;
    byte[] pBuffer = new byte[nBlockSizeBytes];

    // Create Redis Wrapper
    var redis = new RedisNativeClient();
    // Clear DB
    redis.FlushAll();

    sw = Stopwatch.StartNew();
    ms1 = sw.ElapsedMilliseconds;
    for (int i = 0; i < nMaxIterations; i++)
    {
        fn(i, pBuffer);
        nBytesHandled += nBlockSizeBytes;
    }
    ms2 = sw.ElapsedMilliseconds;
    interval = ms2 - ms1;

    // Calculate rate
    double dMBPerSEc = nBytesHandled / 1024.0 / 1024.0 / (interval / 1000.0);
    Console.WriteLine(name + ": Rate {0:N4}, Total: {1}ms", dMBPerSEc, ms2);
}
Results running from my MacBook Pro and redis-server running in an Ubuntu VirtualBox VM:
1K Results:
ServiceStack.Redis 1K: Rate 4.7684, Total: 1ms
Sider 1K: Rate 0.4768, Total: 10ms
10K Results:
ServiceStack.Redis 10K: Rate 47.6837, Total: 1ms
Sider 10K: Rate 4.3349, Total: 11ms
100K Results:
ServiceStack.Redis 100K: Rate 26.4910, Total: 18ms
Sider 100K: Rate 20.7321, Total: 23ms
1MB Results:
ServiceStack.Redis 1MB: Rate 103.6603, Total: 46ms
Sider 1MB: Rate 70.1231, Total: 68ms
8MB Results:
ServiceStack.Redis 8MB: Rate 77.0646, Total: 495ms
Sider 8MB: Rate 84.3960, Total: 452ms
So the performance of ServiceStack.Redis is faster for smaller payloads, and now close to Sider for payloads as large as 8MB.
This change is available from v4.0.41+ that's now available on MyGet.

Paradoxical timing functions

I have a function to compute the time spent in a block:
import collection.mutable.{Map => MMap}

var timeTotalMap = MMap[String, Long]()
var numMap       = MMap[String, Float]()
var averageMsMap = MMap[String, Float]()

def time[T](key: String)(block: => T): T = {
  val start = System.nanoTime()
  val res = block
  val total = System.nanoTime - start
  timeTotalMap(key) = timeTotalMap.getOrElse(key, 0L) + total
  numMap(key)       = numMap.getOrElse(key, 0f) + 1
  averageMsMap(key) = timeTotalMap(key)/1000000000f/numMap(key)
  res
}
I am timing both a function itself and the one place where it is called:
time("outerpos") { intersectPos(a, b) }
and the function itself starts with:
def intersectPos(p1: SPosition, p2: SPosition)(implicit unify: Option[Identifier] = None): SPosition =
  time("innerpos") { (p1, p2) match {
    ...
  }
When I display the nano times for each key (timeTotalMap), I get (spaces added for legibility):
outerpos -> 37 870 034 714
innerpos ->     53 647 956
It means that the total execution time of innerpos is about 1000 times less than that of outerpos.
What?! There is a factor of 1000 between the two? That says the outer call takes 1000x more time than all the inner calls combined. Am I missing something, or is there a memory-leaking ghost somewhere?
Update
When I compare the number of execution for each block (numMap), I find the following:
outerpos -> 362878
innerpos -> 21764
This is paradoxical. Even if intersectPos is called somewhere else, shouldn't the number of times it is called be at least as great as the number of times outerpos is called?
EDIT
If I move the line numMap(key) = numMap.getOrElse(key, 0f) + 1 to the top of the time function, then these numbers become approximately equal.
nanoTime on JVM is considered safe but not very accurate. Here are some good reads:
Is System.nanoTime() completely useless?
System.currentTimeMillis vs System.nanoTime
Basically, your test will suffer from timer inaccuracy. One way to work around that would be to run time("outerpos") for quite a long time (note that JIT compiler optimization might kick in at some point) and measure start and end times only once. Take an average and repeat with time("innerpos"). That's all I could think of. Still not the best test ever ;)
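A minimal sketch of that suggestion, with a hypothetical work() standing in for intersectPos: warm up first so the JIT compiles the block, then read nanoTime once around a large batch and divide by the iteration count:
// Hypothetical stand-in for the real intersectPos workload.
def work(): Int = (1 to 100).map(i => i * i).sum

// Times 'block' over many iterations with a single start/end nanoTime reading.
def averageNanos(iterations: Int)(block: => Any): Double = {
  var i = 0
  while (i < iterations) { block; i += 1 } // warm-up pass for the JIT
  val start = System.nanoTime()
  i = 0
  while (i < iterations) { block; i += 1 } // single timed interval around all calls
  (System.nanoTime() - start).toDouble / iterations
}

println(f"avg: ${averageNanos(1000000)(work())}%.1f ns/call")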