Serialized KnowledgeBase Size (huge) - drools

(Drools 7.68.0 with dialect == MVEL)
I have a KieBase that contains 500 rules. The serialized size of this KieBase is 5MB, which amounts to 10KB per rule. This seems unreasonably large. Deserializing this KieBase and instantiating it in memory takes up to 20 seconds, which limits the number of rules a given KieBase can contain. We need to support a separate KieBase per customer (with multiple users per customer), so there are LOTS of KieBase instances of varying sizes that need to be loaded at runtime, and there is not enough memory to keep them all loaded on each server in the cluster.
Is there any way to (significantly) reduce the size of the KB prior to serialization?

Related

Large Temporal Cache that need to be Cleared on JVM

I have an in-house data pipeline that needs to process streams of events one window (a few minutes) at a time. A window contains tens of GBs of data (at least 30 million records). Because the internal state will exceed 1 TB, I am using external KV storage (e.g. Bigtable) as durable storage for it. The issue is that, for the duration of a window, I need to keep a large temporal cache in memory, backed by the external KV storage, to improve the latency and responsiveness of the pipeline.
If the cached state were long-lived, off-heap memory might have been a good option. However, for temporal state that should be cleared, I am leaning toward on-heap memory, in which case the heap would have to be large, which is not optimal. Also, GC cannot be triggered on demand, so an on-heap cache could potentially cause an OOM.
What are the best practices for storing a large in-memory cache that is subject to garbage collection at some interval (window)? Note: this is a Scala app on the JVM.

BsonChunkPool and memory leak

I use MongoDB 3.6 with the .NET driver (MongoDb.Driver 2.10) to manage our data. Recently, we've noticed that our (background) services consume a lot of memory. After analyzing a dump, it turned out that there's a Mongo object called BsonChunkPool that always consumes around 0.5 GB of memory. Is this normal? I cannot find any useful documentation about this type or what it actually does. Can anyone help?
The BsonChunkPool exists so that large memory buffers (chunks) can be reused, thus easing the amount of work the garbage collector has to do.
Initially the pool is empty, but as buffers are returned to the pool the pool is no longer empty. Whatever memory is held in the pool will not be garbage collected. This is by design. That memory is intended to be reused. This is not a memory leak.
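To make the "held, not leaked" behaviour concrete, here is a minimal sketch of a bounded buffer pool, written in Scala. It is not the MongoDB driver's implementation; the class name `ChunkPoolSketch` and its methods are hypothetical and only illustrate why memory returned to a pool stays reachable, and therefore uncollected, until it is rented again.

```scala
// Minimal sketch of a bounded chunk pool; NOT the MongoDB driver's code.
// All names here (ChunkPoolSketch, rent, giveBack) are hypothetical.
final class ChunkPoolSketch(maxChunkCount: Int, chunkSize: Int) {
  // Chunks currently held by the pool; they stay reachable, so the GC keeps them.
  private var free: List[Array[Byte]] = Nil

  // Reuse a pooled chunk if one is available, otherwise allocate a new one.
  def rent(): Array[Byte] = free match {
    case head :: tail => free = tail; head
    case Nil          => new Array[Byte](chunkSize)
  }

  // Keep the chunk for reuse unless the pool is already at capacity.
  def giveBack(chunk: Array[Byte]): Unit =
    if (free.length < maxChunkCount) free = chunk :: free

  // Memory the pool is holding on to right now.
  def retainedBytes: Long = free.length.toLong * chunkSize
}
```

The pool starts empty and only grows when chunks are handed back, which matches the description above: retained memory rises with use and is capped by the configured chunk count.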
The default configuration of the BsonChunkPool is such that it can hold a maximum of 8192 chunks of 64KB each, so if the pool were to grow to its maximum size it would use 512MB of memory (even more than the 7 or 35 MB you are observing).
If for some reason you don't want the BsonChunkPool to use that much memory, you can configure it differently by putting the following statement at the beginning of your application:
BsonChunkPool.Default = new BsonChunkPool(16, 64 * 1024); // e.g. max 16 chunks of 64KB each, for a total of 1MB
We haven't experimented with different values for chunk counts and sizes so if you do decide to change the default BsonChunkPool configuration you should do some benchmarking and verify that it doesn't have an adverse impact on your performance.
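The 512MB and 1MB totals quoted above are simply chunk count times chunk size. A quick check, sketched in Scala; `poolCapacityBytes` and `toMiB` are hypothetical helper names used only for this arithmetic and are not part of the MongoDB driver.

```scala
object ChunkPoolMath {
  // Upper bound on the memory a pool of fixed-size chunks can retain.
  def poolCapacityBytes(maxChunkCount: Int, chunkSizeBytes: Int): Long =
    maxChunkCount.toLong * chunkSizeBytes.toLong

  // Convert bytes to whole mebibytes (integer division).
  def toMiB(bytes: Long): Long = bytes / (1024L * 1024L)

  // Default configuration quoted above: 8192 chunks of 64KB each.
  val defaultMiB: Long = toMiB(poolCapacityBytes(8192, 64 * 1024))

  // Reduced configuration from the example statement: 16 chunks of 64KB each.
  val reducedMiB: Long = toMiB(poolCapacityBytes(16, 64 * 1024))
}
```

With these inputs `defaultMiB` is 512 and `reducedMiB` is 1, matching the figures in the answer.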
From jira: BsonChunkPool and memory leak

Can Flink handle ~50 GB of state for a single table/window?

I am building a streaming analytic that requires ~50 GB of initial state in-memory for a single table. ~50 GB is the amount of RAM used when I load the state into a Scala HashMap[String,String].
Can Flink handle having ~50 GB of state for a single table that grows over time?
Will I be able to perform lookups and updates to this table in a streaming fashion?
Notes:
I cannot change the types to anything smaller.
The state is used as a lookup for mapping one String to another String.
It would take like three years for the state to double to 100 GB (aggressive estimate as the current state required ten years to produce).
This Flink blog claims that the state size should not be a problem but I thought I would double check before spinning it up. Terabytes of state are mentioned.
https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
50-100 GB for a single table in Flink state is not a problem.
But to be clear, when we talk about having huge amounts of state in Flink (e.g., terabytes) we are talking about keyed state that is sharded across many parallel tasks. Yes, you can have a single table that is very large, but any given instance will only have a subset of the rows of that table.
Note that you will need to choose a state backend -- either a heap-based state backend that will keep the state in memory, as objects on the JVM heap, or the RocksDB state backend, that will keep the state as serialized bytes on disk with an in-memory cache.
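The point about sharding can be illustrated with a simplified sketch in plain Scala. It does not use the Flink API, and Flink's real key-to-task assignment goes through key groups rather than a plain modulo; `shardFor` and `shard` are hypothetical helpers that only show how one logical String-to-String table ends up as disjoint subsets, one per parallel instance.

```scala
object KeyedShardingSketch {
  // Hypothetical helper: map a key to one of `parallelism` instances.
  // floorMod keeps the result non-negative even for negative hash codes.
  def shardFor(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Split one logical table into per-instance maps; each key lands in exactly one.
  def shard(table: Map[String, String], parallelism: Int): Map[Int, Map[String, String]] =
    table.groupBy { case (key, _) => shardFor(key, parallelism) }

  // A lookup only needs to consult the single instance that owns the key.
  def lookup(shards: Map[Int, Map[String, String]], key: String, parallelism: Int): Option[String] =
    shards.get(shardFor(key, parallelism)).flatMap(_.get(key))
}
```

Under this model, a 50 GB table split across, say, 10 instances leaves each instance holding roughly a tenth of the rows, assuming the keys hash evenly.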

Caching one big RDD or many small RDDs

I have a large RDD (R) which I cut into 20 chunks (C_1, C_2, ..., C_20) such that:
If the time it takes to cache depends only on the size of the RDD (e.g. 10 seconds per MB), then caching the individual chunks is better.
However, I suspect there is some additional overhead I'm not aware of, like seek time when persisting to disk.
So, my questions are:
Are there any additional overheads when writing to memory?
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
EDIT: To give some more context, I'm currently running the application on my computer, but in the end it will run on a cluster of 10 nodes, each with 8 cores. However, since we only have access to the cluster for a short time, I wanted to start experimenting locally on my computer.
From my understanding, the application won't need a lot of shuffling, as I can partition it rather nicely so that each chunk runs on a single node.
However, I'm still thinking about the partitioning, so it is not yet 100% decided.
Spark performs the computations in memory. So there is no real extra overhead when you cache data to memory. Caching to memory essentially says, reuse these intermediate results. The only issue that you can run into is having too much data in memory and then it spills to disk. There you will incur disk read time costs. unpersist() will be needed for swapping things out of memory as you get finished with the various intermediate results, if you run into memory limitations.
When determining where to cache your data you need to look at the flow of your data. If you read in a file and then filter it 3 times and write out each one of those filters separately, without caching you will end up reading in that file 3 times.
val data = spark.read.parquet("file:///testdata/").limit(100)
data.select("col1").write.parquet("file:///test1/")
data.select("col2").write.parquet("file:///test2/")
data.select("col3").write.parquet("file:///test3/")
If instead you read in the file, cache it, and then filter 3 times and write out the results, you will read in the file once and then write out each result.
val data = spark.read.parquet("file:///testdata/").limit(100).cache()
data.select("col1").write.parquet("file:///test4/")
data.select("col2").write.parquet("file:///test5/")
data.select("col3").write.parquet("file:///test6/")
The general test that you can use as to what to cache is, "Am I performing multiple actions on the same RDD?" If yes, cache it. In your example if you break the large RDD into chunks and the large RDD isn't cached you will most likely be recalculating the large RDD every time that you perform an action on it. Then if you don't cache the chunks and you perform multiple actions on those then you will have to recalculate those chunks every time.
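The "multiple actions on the same RDD" test can be illustrated without Spark. The sketch below is a plain-Scala analogy, not Spark code: `Source` is a hypothetical stand-in for an expensive read, and `reads` counts how often it runs. Each "action" on an uncached lineage goes back to the source, while caching the intermediate result means the source is read once.

```scala
object RecomputeVsCache {
  // Hypothetical stand-in for an expensive source read; `reads` counts invocations.
  final class Source {
    var reads = 0
    def load(): Vector[Int] = { reads += 1; Vector(1, 2, 3) }
  }

  // Three "actions", each re-evaluating the lineage from the source.
  def withoutCache(src: Source): (Seq[Int], Int) = {
    val results = Seq(src.load().sum, src.load().max, src.load().size)
    (results, src.reads)
  }

  // The intermediate result is materialised once and reused by all three actions.
  def withCache(src: Source): (Seq[Int], Int) = {
    val cached = src.load()
    val results = Seq(cached.sum, cached.max, cached.size)
    (results, src.reads)
  }
}
```

Both variants return the same results; only the number of source reads differs (three without the cached value, one with it).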
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
So to answer that, it all depends on what you are doing with each intermediate result. It looks like you will definitely want to properly repartition your large RDD according to the number of executors and then cache it. Then, if you perform more than one action on each one of the chunks that you create from the large RDD, you may want to cache those.

Garbage Collection issues on MapPartitions

I currently have a mapPartitions job which is flatMapping each value in the iterator, and I'm running into an issue where there will be major GC costs on certain executions. Some executors will take 20 minutes, 15 of which are pure garbage collection, and I believe that a lot of it has to do with the ArrayBuffer that I am outputting. Does anyone have any suggestions as to how I can do some form of a stream output?
Also, does anyone have any advice in general for tracking down/addressing GC issues in Spark?
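One possible reading of "stream output" is to return a lazy iterator from the partition function instead of filling an ArrayBuffer first. The sketch below uses plain Scala iterators only, with no Spark dependency; `expand` is a hypothetical placeholder for the real per-record logic, and whether this removes the GC pressure in the actual job is an assumption that would need measuring.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyPartitionOutput {
  // Hypothetical per-record expansion standing in for the real flatMap logic.
  def expand(x: Int): Seq[Int] = Seq(x, x * 10)

  // Buffered: the whole partition's output is held in memory before returning.
  def buffered(part: Iterator[Int]): Iterator[Int] = {
    val out = ArrayBuffer.empty[Int]
    part.foreach(x => out ++= expand(x))
    out.iterator
  }

  // Streaming: output elements are produced only as the consumer pulls them,
  // so at most one record's expansion is alive at a time.
  def streaming(part: Iterator[Int]): Iterator[Int] =
    part.flatMap(expand)
}
```

Both functions have the `Iterator => Iterator` shape that mapPartitions expects, so the streaming variant could be swapped in for a buffered one without changing the caller.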
Please refer to the documentation below, taken from the official Spark tuning page. I hope it will at least give some direction to your analysis:
Memory Management Overview
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.
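As a worked example of the two settings, the sizes of M and R can be computed from the formulas quoted above. This is a sketch in plain Scala, not Spark code; the object and method names are hypothetical, and the 4096 MiB heap is an assumed figure chosen only for illustration.

```scala
object UnifiedMemorySketch {
  // Fixed amount subtracted from the heap before the fraction is applied (from the text above).
  val ReservedMiB = 300L

  // M = (JVM heap - 300MB) * spark.memory.fraction
  def unifiedRegionMiB(heapMiB: Long, memoryFraction: Double = 0.6): Double =
    (heapMiB - ReservedMiB) * memoryFraction

  // R = M * spark.memory.storageFraction
  def protectedStorageMiB(heapMiB: Long,
                          memoryFraction: Double = 0.6,
                          storageFraction: Double = 0.5): Double =
    unifiedRegionMiB(heapMiB, memoryFraction) * storageFraction

  // Assumed example: a 4096 MiB executor heap with the default settings.
  val exampleM: Double = unifiedRegionMiB(4096L)
  val exampleR: Double = protectedStorageMiB(4096L)
}
```

For the assumed 4096 MiB heap, M comes out at about 2277.6 MiB and R at about 1138.8 MiB, leaving the remainder of the heap for user data structures and Spark's internal metadata.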