TFDV generate_stats_from_csv triggers an out of memory error - tensorflow-data-validation

I have a problem when generating stats over a small dataset (~10MB).
It takes too much time and consumes too much memory (it reaches 25 GB, which makes no sense for a file this size), and at the end it stops by throwing an out of memory error.
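For reference, here is a minimal sketch of the kind of call that is failing, together with a couple of StatsOptions knobs that can reduce memory pressure. The CSV path is hypothetical, and the option names (sample_rate, num_top_values) reflect my understanding of recent TFDV releases (where the function is spelled generate_statistics_from_csv), so they may differ between versions.

import tensorflow_data_validation as tfdv

# Assumed knobs: sample a fraction of the rows and track fewer top values per
# feature, both of which should lower the memory needed to compute statistics.
stats_options = tfdv.StatsOptions(
    sample_rate=0.1,
    num_top_values=10,
)

stats = tfdv.generate_statistics_from_csv(
    data_location='data/train.csv',  # hypothetical path to the ~10MB CSV
    stats_options=stats_options,
)
tfdv.visualize_statistics(stats)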

Related

Common heap behavior for Wildfly or application memory leak?

We're running our application in Wildfly 14.0.1 with an -Xmx of 4096, on OpenJDK 11.0.2. I've been using VisualVM 1.4.2 to monitor our heap, since we previously had OOM exceptions (because our -Xmx was only 512, which was incredibly bad).
We are now well within our memory allocation: there are no more OOM exceptions, and even with a good number of clients and plenty of processing going on we're nowhere near the -Xmx4096 (the servers have 16GB, so physical memory isn't an issue). Still, I'm seeing some strange heap behavior that I can't explain.
Using VisualVM, Eclipse MemoryAnalyzer, as well as heaphero.io, I get summaries like the following:
Total Bytes: 460,447,623
Total Classes: 35,708
Total Instances: 2,660,155
Classloaders: 1,087
GC Roots: 4,200
Number of Objects Pending for Finalization: 0
However, watching the Heap Monitor, I see the used heap increase by about 450MB over a 4-minute period before the GC runs and it drops back down, only to spike again.
This is when no clients are connected and nothing is actively happening in our application. We do use Apache File IO to monitor remote directories, we have JMS topics, etc. so it's not like the application is completely idle, but there's zero logging and all that.
My largest objects are the well-known io.netty.buffer.PoolChunk instances, which account for about 60% of memory usage in the heap dumps. The dump total is still around 460MB, so I'm confused about why the heap monitor goes from ~425MB to ~900MB repeatedly; no matter when I take my snapshots, I can't see any large increase in object counts or memory usage.
I'm just seeing a disconnect between the heap monitor and the .hprof analysis, so there doesn't seem to be a way to tell what's causing the heap to hit that 900MB peak.
My question is whether these heap spikes are totally expected when running within Wildfly, or whether something within our application is spinning up a bunch of objects that then get GC'd. In a Component report, objects in our application's package structure make up an extremely small share of the dump. That doesn't clear us, though; we could easily be calling things without closing them appropriately, etc.

Observable takes infinite amount of memory?

Since each Observable has a cache that can be traced back to the very first emitted value, it seems the amount of memory used to store this cache is unbounded.
I've tested this assumption with the following code:
import scala.concurrent.duration._
import rx.lang.scala.Observable // assuming RxScala; the post does not name the Rx library used

Observable.interval(1.microsecond)
  .map(_ => System.currentTimeMillis)
  .subscribe(x => ())
And indeed, memory usage rose steadily during the whole 10-minute period while the app was running.
My question is whether it's possible to instantiate a special Observable without a cache, or maybe instruct it to cap its cache at some level?
Only a specific set of Observables (ReplaySubject, replay(), and GroupedObservable, for example) cache items; Observable.interval() does not.
What you are likely experiencing here is the hundreds of thousands of boxed Long values. If you have a lot of RAM, the GC might not kick in; the JVM may simply keep growing the heap up to its maximum. Assuming you can really get a 1-microsecond timer, you have roughly a 24 MB/s allocation rate, or 1.4 GB/minute. Left alone for 10 minutes, you'd likely see a sawtooth-like shape in memory usage.
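As a back-of-envelope check of those numbers (sketched in Python purely for the arithmetic), the assumed ~24 bytes per boxed java.lang.Long - header plus value plus padding on a typical 64-bit JVM - is what makes the figures line up:

events_per_second = 1_000_000          # one emission per microsecond
bytes_per_boxed_long = 24              # assumption about the JVM object layout

rate = events_per_second * bytes_per_boxed_long
print(rate / 1e6, "MB/s")              # ~24 MB/s
print(rate * 60 / 1e9, "GB/minute")    # ~1.4 GB/minute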

Scala concurrency performance issues

I have a data mining app.
There is one Mining Actor which receives and processes a JSON payload containing 1000 objects. I put these into a list, and for each element I log the data by sending it to one Logger Actor, which writes the data to many files.
Processing the list sequentially, my app uses 700MB and takes ~15 seconds at 20% CPU (4-core CPU). When I parallelize the list, my app uses 2GB and takes roughly the same amount of time and CPU.
My questions are:
Since I parallelized the list and thus the computation, shouldn't the compute-time decrease?
I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?
Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?
Without more details, any answer is a guess. However, even a guess might point you to the right direction.
Parallelized execution should decrease the running time but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that loggers are not the bottleneck here.
The memory usage might jump if each of your threads allocates a lot of memory; in that case, the more threads you have, the more memory is required. I don't know what kind of collection you are parallelizing over, but most are stored completely in memory. And yes, the garbage collector will free anything that doesn't require explicit cleanup on your part (unlike, say, open files, which you must close yourself).
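A quick way to test the "mostly waiting on I/O" hypothesis is to compare CPU time with wall-clock time for the single-threaded run; here is a small Python sketch of the idea, with a simulated I/O-bound job standing in for the real processing:

import time

def io_bound_job():
    # Stand-in for the real workload: mostly sleeping, like a process that
    # spends most of its time waiting on disk or network.
    for _ in range(10):
        time.sleep(0.1)       # simulated I/O wait
        sum(range(100_000))   # a little actual CPU work

wall_start = time.perf_counter()
cpu_start = time.process_time()
io_bound_job()
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# A low cpu/wall ratio means the process mostly waits, and parallelizing the
# computation alone will not shorten the run by much.
print(f"wall={wall:.2f}s cpu={cpu:.2f}s cpu/wall={cpu / wall:.0%}")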
The memory increases because I send messages faster than the Logger can write, so the mailbox balloons in size until the Logger has processed the messages and the GC kicks in.
I solved this by writing state to a protocol buffer file. Before doing any writes, I compare against the protobuf file, because reads are significantly cheaper than writes. My resource usage is now 10% CPU for 2 seconds and less than 400MB of RAM.
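A minimal sketch of that "compare before writing" pattern, using a plain pickled blob instead of the original protocol buffer schema (the file name and state shape are illustrative only):

import os
import pickle

STATE_FILE = "logger_state.bin"   # hypothetical path

def load_state():
    if not os.path.exists(STATE_FILE):
        return None
    with open(STATE_FILE, "rb") as f:
        return pickle.load(f)

def save_if_changed(new_state):
    # Reads are much cheaper than writes, so only touch the disk with a write
    # when the state actually differs from what is already persisted.
    if new_state == load_state():
        return False
    with open(STATE_FILE, "wb") as f:
        pickle.dump(new_state, f)
    return True

save_if_changed({"records_logged": 1000})   # example usage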

Why is a kdb process showing high memory usage on the system?

I am running into serious memory issues with my kdb process. Here is the architecture in brief.
The process runs in slave mode (4 slaves). It initially loads a ton of data from the database into memory (the total size of all variables loaded, calculated from -22!, is approximately 11G). Initially this matches .Q.w[] and is close to the unix process memory usage. The data set grows very little during incremental operations. However, after a long operation, although the kdb internal memory stats (.Q.w[]) show the expected memory usage (both used and heap) of ~13G, the process is consuming close to 25G on the system (unix /proc, top), eventually running out of physical memory.
Now, when I run garbage collection manually (.Q.gc[]), it frees up memory and brings unix process usage close to heap number displayed by .Q.w[].
I am running q version 2.7 with the -g 1 option to run garbage collection in immediate mode.
Why is the unix process usage so significantly different from the kdb internal statistics, and where is the difference coming from? Why is the "-g 1" option not working? When I run a simple example it works fine, but in this case it seems to leak a lot of memory.
I tried version 2.6, which is supposed to have automated garbage collection. Surprisingly, there is still a huge difference between the used and heap numbers from .Q.w[] when running with version 2.6, both in single-threaded (each) and multi-threaded (peach) modes. Any ideas?
I am not sure of the concrete answer, but this is my deduction based on the following information from the wiki (and some practical experiments):
http://code.kx.com/q/ref/control/#peach
It says:
Memory Usage
Each slave thread has its own heap, a minimum of 64MB.
Since kdb 2.7 2011.09.21, .Q.gc[] in the main thread executes gc in the slave threads too.
Automatic garbage collection within each thread (triggered by a wsfull, or by hitting the artificial heap limit as specified with -w on the command line) is only executed for that particular thread, not across all threads.
Symbols are internalized from a single memory area common to all threads.
My observations:
Thread Specific Memory:
.Q.w[] only shows the stats of the main thread, not the sum over all threads (the total process memory). This can be tested by starting q with 2 slave threads: total memory in that case should be at least 128MB as per the first point above, but .Q.w[] still shows 64MB.
That's why, in your case, the memory stats were close to the unix stats at the start: all the data was in the main thread and nothing was on the other threads. After some operations, the slave threads may have taken some memory (used/garbage) which is not shown by .Q.w[].
Garbage collector call:
As mentioned on the wiki, calling the garbage collector on the main thread runs GC on all threads too. That may have collected the garbage memory from the slave threads and reduced the total memory usage, which was reflected in the lower unix memory stats.

Memory leak - absence of a garbage collector

Consider a program with a memory leak, where a block of heap memory is not freed before the program terminates. If this were (say) a Java program, the built-in garbage collector would have automatically deallocated this heap block before the program exits.
But even in C++, when the program exits, wouldn't the kernel automatically deallocate all space associated with the process? Also, in the Java case, the kernel still has to deallocate the space for the text part (code) of the process (even if the stack and heap parts are handled by the garbage collector). So is the overall advantage of using a garbage collector just the saving in time from having the heap deallocated by the program itself rather than by the kernel (if there is any such saving)?
EDIT: A primary doubt of mine, arising from the responses: will the GC invoke itself automatically when memory usage reaches a limit? If a GC were only invoked just before the program terminates, it would not be useful for long-running programs.
That assumes that the kernel cleans up after you. Not all OSs take care of dynamically-allocated memory automatically. (But to be fair: Most modern ones, at least on the desktop, do.)
Even the OSs that reclaim all memory only do so when the process terminates. Most programs allocate far more memory over their total runtime than they need at any given point in time (when run long enough, where "long" can be a few seconds for many data-crunching applications).
Because of that, many - especially long-running - processes would create more and more garbage (memory that isn't used any more, and won't be used ever again) over their lifetime without any hope of getting rid of it without terminating. You don't want to kill and restart the whole process just to keep memory usage low, do you?
If that unused memory were almost never disposed of (and there are quite a few processes that run indefinitely, and some that run for hours), you would get a serious memory shortage after a while. Your browser would keep every image, HTML document, JS object, etc. you opened during the session in memory, because you won't bother to restart it every few minutes. That would be bullshit and a serious problem in the browser, you say? My point exactly.
Moreover, most (that is to say, all good ones) GCs don't deallocate everything - they run from time to time when they think it's worth it, but when the process shuts down, everything that remains in memory is left to a lower level (be it a custom allocator or the OS) to be freed. This is also why finalizers aren't guaranteed to run - in a short-running program that doesn't make many allocations, the GC may never run.
So no, GC isn't about saving time. It's about saving tons of memory: preventing long-running, allocation-intensive programs from hogging all available memory and eventually making everything die from out-of-memory errors.
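To the EDIT above: yes, typical collectors trigger themselves during execution, not only at exit, although the exact trigger differs per runtime. As one concrete, runnable illustration (CPython's cyclic collector rather than the JVM, so treat it as an analogy), collections fire automatically when allocation counters cross configurable thresholds:

import gc

print(gc.get_threshold())   # e.g. (700, 10, 10): trigger points by allocation count

class Node:
    def __init__(self):
        self.ref = None

# Create lots of reference cycles; generation-0 collections run automatically
# whenever the allocation counter crosses the threshold, with no explicit call.
for _ in range(100_000):
    a, b = Node(), Node()
    a.ref, b.ref = b, a

print(gc.get_stats()[0]["collections"], "automatic gen-0 collections so far")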
Let's say a program allocates some resource, uses it the whole time it is running, but doesn't release it properly before exit. When the program exits, the kernel deallocates all of its resources - that case is OK.
Now consider the situation where some function creates a memory leak on every call, and this function is called 100 times a second. After a few minutes or hours the program crashes because there is no free memory left.
The bad thing is that a programmer who makes memory and resource leaks of the first type usually makes leaks of the second type too, producing dirty and unstable code. A professional programmer writes code with zero resource and memory leaks. If a garbage collector is available, fine; if not, manage resources yourself.
BTW, it is still possible to create leaks with a garbage collector - like the well-known .NET event source-consumer leak. So a garbage collector is very useful and saves a lot of developer time, but in any case the developer must carefully manage program resources.
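Here is a minimal sketch of that event source-consumer leak, written in Python rather than .NET just to show the shape of the problem (all names are illustrative): the publisher's handler list holds strong references, so subscribers stay reachable and are never collected until they are explicitly unsubscribed.

class EventSource:
    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)   # strong reference held for the source's lifetime

    def fire(self, payload):
        for handler in self._handlers:
            handler(payload)

class Consumer:
    def __init__(self, source):
        self.buffer = [0] * 100_000      # some per-consumer state
        source.subscribe(self.on_event)  # the bound method keeps `self` alive

    def on_event(self, payload):
        self.buffer[0] = payload

source = EventSource()
for _ in range(1_000):
    Consumer(source)   # "short-lived" consumers that never actually go away

# Even with a garbage collector, memory grows with every subscription until the
# handlers are explicitly removed (an unsubscribe/dispose step).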