why is kdb process showing high memory usage on system? - kdb

I am running into serious memory issues with my kdb process. Here is the architecture in brief.
The process runs in slave mode (4 slaves). It loads a ton of data from database into memory initially (total size of all variables loaded in memory calculated from -22! is approx 11G). Initially this matches .Q.w[] and close to unix process memory usage. This data set increases by very little in incremental operations. However, after a long operation, although the kdb internal memory stats (.Q.w[]) show expected memory usage (both used and heap) ~ 13 G, the process is consuming close to 25G on the system (unix /proc, top) eventually running out of physical memory.
Now, when I run garbage collection manually (.Q.gc[]), it frees up memory and brings unix process usage close to heap number displayed by .Q.w[].
I am running Q 2.7 version with -g 1 option to run garbage collection in immediate mode.
Why is unix process usage so significantly differently from kdb internal statistic -- where is the difference coming from? Why is "-g 1" option not working? When i run a simple example, it works fine. But in this case, it seems to leak a lot of memory.
I tried with 2.6 version which is supposed to have automated garbage collection. Suprisingly, there is still a huge difference between used and heap numbers from .Q.w when running with version 2.6 both in single threaded (each) and multi threaded modes (peach). Any ideas?

I am not sure of the concrete answer but this is my deduction based on following information (and some practical experiments) which is mentioned on wiki:
http://code.kx.com/q/ref/control/#peach
It says:
Memory Usage
Each slave thread has its own heap, a minimum of 64MB.
Since kdb 2.7 2011.09.21, .Q.gc[] in the main thread executes gc in the slave threads too.
Automatic garbage collection within each thread (triggered by a wsful, or hitting the artificial heap limit as specified with -w on the command line) is only executed for that particular thread, not across all threads.
Symbols are internalized from a single memory area common to all threads.
My observations:
Thread Specific Memory:
.Q.w[] only shows stats of main thread and not the summation of all the threads (total process memory). This could be tested by starting 'q' with 2 threads. Total memory in that case should be at least 128MB as per point 1 but .Q.w[] it still shows 64 MB.
That's why in your case at the start memory stats were close to unix stats as all the data was in main thread and nothing on other threads. After doing some operations some threads might have taken some memory (used/garbage) which is not shown by .Q.w[].
Garbage collector call
As mentioned on wiki, calling garbage collector on main thread calls GC on all threads. So that might have collected the garbage memory from threads and reduced the total memory usage which was reflected by reduced unix memory stats.

Related

Common heap behavior for Wildfly or application memory leak?

We're running our application in Wildfly 14.0.1, with a -Xmx of 4096, running with OpenJDK 11.0.2. I've been using VisualVM 1.4.2 to monitor our heap since we previously were having OOM exceptions (because our -Xmx was only 512 which was incredibly bad).
While we are well within our memory allocation now, we have no more OOM exceptions happening, and even with a good amount of clients and processing happening we're nowhere near the -Xmx4096 (the servers have 16GB so memory isn't an issue), I'm seeing some strange heap behavior that I can't figure out where it's coming from.
Using VisualVM, Eclipse MemoryAnalyzer, as well as heaphero.io, I get summaries like the following:
Total Bytes: 460,447,623
Total Classes: 35,708
Total Instances: 2,660,155
Classloaders: 1,087
GC Roots: 4,200
Number of Objects Pending for Finalization: 0
However, in watching the Heap Monitor, I see the Used Heap over a 4 minute time period increase by about 450MB before the GC runs and drops back down only to spike again. Here's an image:
This is when no clients are connected and nothing is actively happening in our application. We do use Apache File IO to monitor remote directories, we have JMS topics, etc. so it's not like the application is completely idle, but there's zero logging and all that.
My largest objects are the well-known io.netty.buffer.PoolChunk, which in the heap dumps are about 60% of my memory usage, the total is still around 460MB so I'm confused why the heap monitor is going from ~425MB to ~900MB repeatedly, and no matter where I take my snapshots, I can't see any large increase of object counts or memory usage.
I'm just seeing a disconnect between the heap monitor, and .hprof analysis. So there doesn't see a way to tell what's causing the heap to hit that 900MB peak.
My question is if these heap spikes are totally expected when running within Wildfly, or is there something within our application that is spinning up a bunch of objects that then get GC'd? In doing a Component report, objects in our application's package structure make up an extremely small amount of the dump. Which doesn't clear us, we easily could be calling things without closing appropriately, etc.

Scala concurrency performance issues

I have a data mining app.
There is 1 Mining Actor which receives and processes a Json containing 1000 objects. I put this into a list and foreach, I log the data by sending it to 1 Logger Actor which logs data into many files.
Processing the list sequentially, my app uses 700MB and takes ~15 seconds of 20% cpu power to process (4 core cpu). When I parallelize the list, my app uses 2GB and ~ the same amount of time and cpu to process.
My questions are:
Since I parallelized the list and thus the computation, shouldn't the compute-time decrease?
I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?
Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?
Without more details, any answer is a guess. However, even a guess might point you to the right direction.
Parallelized execution should decrease the running time but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that loggers are not the bottleneck here.
The memory usage might jump if your threads allocate a lot of memory each. In that case, the more threads you have the more memory will be required. I don't know what kind of collection you are parallelizing on, but most are stored in memory, completely. Yes, the garbage collector will free any resources that do not require you to explicitly free them, such as files.
How many threads for reading and writing to the hard disk?
The memory increases because I send messages faster than the Logger can write, so the Mailbox balloons in size until the Logger has processed the messages and the GC kicks in.
I solved this by writing state to a protocol buffer file. Before doing any writes, I compare with the protobuf file because reads are significantly cheaper than writes. My resource usage is now 10% for 2 seconds, and less than 400MB RAM.

PostgreSQL Table in memory

I created a database containing a total of 3 tables for a specific purpose. The total size of all tables is about 850 MB - very lean... out of which one single table contains about 800 MB (including index) of data and 5 million records (daily addition of about 6000 records).
The system is PG-Windows with 8 GB RAM Windows 7 laptop with SSD.
I allocated 2048MB as shared_buffers, 256MB as temp_buffers and 128MB as work_mem.
I execute a single query multiple times against the single table - hoping that the table stays in RAM (hence the above parameters).
But, although I see a spike in memory usage during execution (by about 200 MB), I do not see memory consumption remaining at at least 500 MB (for the data to stay in memory). All postgres exe running show 2-6 MB size in task manager. Hence, I suspect the LRU does not keep the data in memory.
Average query execution time is about 2 seconds (very simple single table query)... but I need to get it down to about 10-20 ms or even lesser if possible, purely because there are just too many times, the same is going to be executed and can be achieved only by keeping stuff in memory.
Any advice?
Regards,
Kapil
You should not expect postgres processes to show large memory use, even if the whole database is cached in RAM.
That is because PostgreSQL relies on buffered reads from the operating system buffer cache. In simplified terms, when PostgreSQL does a read(), the OS looks to see whether the requested blocks are cached in the "free" RAM that it uses for disk cache. If the block is in cache, the OS returns it almost instantly. If the block is not in cache the OS reads it from disk, adds it to the disk cache, and returns the block. Subsequent reads will fetch it from the cache unless it's displaced from the cache by other blocks.
That means that if you have enough free memory to fit the whole database in "free" operating system memory, you won't tend to hit the disk for reads.
Depending on the OS, behaviour for disk writes may differ. Linux will write-back cache "dirty" buffers, and will still return blocks from cache even if they've been written to. It'll write these back to the disk lazily unless forced to write them immediately by an fsync() as Pg uses at COMMIT time. When it does that it marks the cached blocks clean, but doesn't flush them. I don't know how Windows behaves here.
The point is that PostgreSQL can be running entirely out of RAM with a 1GB database, even though no PostgreSQL process seems to be using much RAM. Having shared_buffers too high just leads to double-caching and can reduce the amount of RAM available for the OS to cache blocks.
It isn't easy to see exactly what's cached in RAM because Pg relies on the OS cache. That's why I referred you to pg_fincore.
If you're on Windows and this won't work, you really just have to rely on observing disk activity. Does performance monitor show lots of uncached disk reads? Does operating system memory monitoring show lots of memory used for disk cache in the OS?
Make sure that effective_cache_size correctly reflects the RAM used for disk cache. It will help PostgreSQL choose appropriate query plans.
You are making the assumption, without apparent evidence, that the query performance you are experiencing is explained by disk read delays, and that it can be improved by in-memory caching. This may not be the case at all. You need to look at explain analyze output and system performance metrics to see what's going on.

MongoDB Stops Responding During Background Flush

Mongodb Background Flushing blocks all the requests:
Server: Windows server 2008 R2
CPU Usage: 10 %
Memory: 64G, Used 7%, 250MB for Mongod
Disk % Read/Write Time: less than 5% (According to Perfmon)
Mongodb Version: 2.4.6
Mongostat Normally:
insert:509 query:608 update:331 delete:*0 command:852|0 flushes:0 mapped:63.1g vsize:127g faults:6449 locked db:Radius:12.0%
Mongostat Before(maybe while) Flushing:
insert:1 query:4 update:3 delete:*0 command:7|0 flushes:0 mapped:63.1g vsize:127g faults:313 locked db:local:0.0%
And Mongostat After Flushing:
insert:1572 query:1849 update:1028 delete:*0 command:2673|0 flushes:1 mapped:63.1g vsize:127g faults:21065 locked db:.:99.0%
As you see when flushes happening lock is 99% just at this point mongod stops responding any read/write operation (mongotop and mongostat also stop). The flushing takes about 7 to 8 seconds to complete which does not increase disk load more than 10%.
Is there any suggestions?
Under Windows server 2008 R2 (and other versions of Windows I would suspect, although I don't know for sure), MongoDB's (2.4 and older) background flush process imposes a global lock, doing substantial blocking of reads and writes, and the length of the flush time tends to be proportional to the amount of memory MongoDB is using (both resident and system cache for memory-mapped files), even if very little actual write activity is going on. This is a phenomenon we ran into at our shop.
In one replica set where we were using MongoDB version 2.2.2, on a host with some 128 GBs of RAM, when most of the RAM was in use either as resident memory or as standby system cache, the flush time was reliably between 10 and 15 seconds under almost no load and could go as high as 30 to 40 seconds under load. This could cause Mongo to go into long pauses of unresponsiveness every minute. Our storage did not show signs of being stressed.
The basic problem, it seems, is that Windows handles flushing to memory-mapped files differently than Linux. Apparently, the process is synchronous under Windows and this has a number of side effects, although I don't understand the technical details well enough to comment.
MongoDb, Inc., is aware of this issue and is working on optimizations to address it. The problem is documented in a couple of tickets:
https://jira.mongodb.org/browse/SERVER-13444
https://jira.mongodb.org/browse/SERVER-12401
What to do?
The phenomenon is tied, to some degree, to the minimum latency of the disk subsystem as measured under low stress, so you might try experimenting with faster disks, if you can. Some improvements have been reported with this approach.
A strategy that worked for us in some limited degree is avoiding provisioning too much RAM. It happened that we really didn't need 128 GBs of RAM, so by dialing back on the RAM, we were able to reduce the flush time. Naturally, that wouldn't work for everyone.
The latest versions of MongoDB (2.6.0 and later) seem to handle the
situation better in that writes are still blocked during the long
flush but reads are able to proceed.
If you are working with a sharded cluster, you could try dividing the RAM by putting multiple shards on the same host. We didn't try this ourselves, but it seems like it might have worked. On the other hand, careful design and testing would be highly recommended in any such scenario to avoid compromising performance and/or high availability
We tried playing with syncdelay. Reducing it didn't help (the long flush times just happened more frequently). Increasing it helped a little (there was more time between flushes to get work done), but increasing it too much can exacerbate the problem severely. We boosted the syncdelay to five minutes (300 seconds), at one point, and were rewarded with a background flush of 20 minutes.
Some optimizations are in the works at MongoDB, Inc. These may be available soon.
In our case, to relieve the pressure on the primary host, we periodically rebooted one of the secondaries (clearing all memory) and then failed over to it. Naturally, there is some performance hit due to re-caching, and I think this only worked for us because our workload is write-heavy. Moreover, this technique not in any sense a solution. But if high flush times are causing serious disruption, this may be one way to "reduce the fever" so to speak.
Consider running on Linux... :-)
Background flush by default does not block read/write. mongod does flush every 60s, unless otherwise specified with -syncDelay parameter. syncDelay uses fsync() operation, which can set to block write while in-memory pages flush to disk. A blocked write could have potential to block reads as well. Read more: http://docs.mongodb.org/manual/reference/command/fsync/
However, normally a flush should not take more than 1000ms (1 second). If it does, it is likely the amount of data flushing to disk is too large for your disk to handle.
Solution: upgrade to a faster disk like SSD, or decrease flush interval (try 30s, rather than the default 60s).

Everyday GC is running on same time

In my server everyday 3:00AM GC is running and Heapspace is filling in a flash.
This causing site outage. ANY inputs?
following are my JVM settings.I am using JBOSS server.
-Dprogram.name=run.sh -server -Xms1524m -Xmx1524m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -XX:NewSize=512m -XX:MaxNewSize=512m -Djava.net.preferIPv4Stack=true -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -Djavax.net.ssl.trustStorePassword=changeit -Dcom.sun.management.jmxremote.port=8888 -Djava.rmi.server.hostname=192.168.100.140 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote
Any suggestions really helpful..
(This turned out somewhat long; there is an actual suggestion for a fix at the end.)
Very very briefly, garbage collection when you use -XX:+UseConcMarkSweepGC works like this:
All objects are allocated in the so-called young generation. This is typically a couple of hundred megs up to a gig in size, depending om VM settings, number of CPU:s and total heap size. The young generation is collected in a stop-the-world pause, followed by a parallel (multiple CPU) compacting (moving objects) collection. The young generation is sized so as to make this pause reasonably large.
When objects have survived (are still reachable) young gen they get promoted to "old-gen" (old generation).
The old generation is where -XX:+UseConcMarkSweepGC kicks in. In the default mode (without -XX:+UseConcMarkSweepGC) when the old generation becomes full the entire heap is collected and compacted (moving around, eliminating fragmentation) at once in a stop-the-world copy. That pause will typically be longer than young-gen pauses because the entire heap is involved, which is bigger.
With CMS (-XX:+UseConcMarkSweepGC) the work to compact the old generation is mostly concurrent (meaning, running in the background with the application not paused). This work is also not compacting; it works more like malloc()/free() and you are subject to fragmentation.
The main upside of CMS is that when things work well, you avoid long pause times that are linear in the size of the heap, because the main work is cone concurrently (there are some stop-the-world steps involved but they are supposed to usually be short).
The two primary downsides are that:
You are subject to fragmentation because old-gen is not compacted.
If you don't finish a concurrent collection cycle before old-gen fills up, or if fragmentation prevents allocation, the resulting full collection of the entire heap is not parallel as it is with the default collector. I..e, only one CPU is used. That means that when/if you do hit a full garbage collection, the pause will be longer than it would have been with the default collector.
Now... your logs. "Concurrent mode failure" is intended to convey that the concurrent mark/sweep work did not complete in time for another young-gen GC that needs to promote surviving objects into the old generation. The "promotion failed" is rather that during promotion from young-gen to old-gen, an object was unable to be allocated in old-gen due to fragmentation.
Unless you are hitting a true bug in the JVM, the sudden increase in heap usage is almost certainly from your application, JBoss, or some external entity acting on your application. So I can't really help with that. However, what is likely happening is a combination of two things:
The spike in activity is causing an increase in heap usage too quick for the concurrent collection to complete in time.
Old-gen is too fragmented, causing problems especially when the old-gen is almost full.
I should also point out now that the default behavior of CMS is to try to postpone concurrent collections as long as possible (yet not too long) for performance reasons. The later it happens, the more efficient (in terms of CPU usage) the collection is. However, a trade-off is that you're increasing the risk of not finishing in time (which again, will trigger a full GC and a long pause). It should also (I have not made empirical tests here, but it stands to reason) result in fragmentation being a greater concern; basically the more full old-gen is when an object is promoted, the greater is the likelyhood that the object's promotion will worsen fragmentation concerns (too long to go into details here).
In your case, I would do two things:
Keep figuring out what is causing the activity. I would say it's fairly unlikely that it is a GC/JVM bug.
Re-configure the JVM to trigger concurrent collection cycles earlier in order to avoid the heap every becoming so full that fragmentation becomes a particularly huge concern, and giving it more time to complete in time even during your sudden spikes of activity.
You can accomplish (2) most easily be using the JVM options
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
in order to explicitly force the JVM to kick start a CMS cycle at a certain level of heap usage (in this example 75% - you may need to change that; the lower the percentage, the earlier it will kick in).
Note that depending on what your live size is (the number of bytes that are in fact live and reachable) in your application, forcing an earlier CMS cycle may also require that you increase your heap size to avoid CMS running constantly (not a good use of CPU).