Large Temporal Cache that needs to be Cleared on JVM - scala

I have an in-house data pipeline that needs to process streams of events one window (a few minutes) at a time. A window contains tens of GBs of data (at least 30 million records). Because the internal state will exceed 1 TB, I am using external KV storage (e.g. Bigtable) as durable storage for it. The issue is that, for the duration of each window, I need to keep a large temporal cache in memory, backed by the external KV storage, to improve the latency and responsiveness of the pipeline.
If the cached state were long-lived, off-heap memory might be a good option. For temporal state that must be cleared, however, I am leaning toward on-heap memory, in which case the heap would have to be large, which is not optimal. Also, GC cannot be triggered on demand, so an on-heap cache could potentially cause an OOM.
What are the best practices for storing a large in-memory cache that must be garbage collected at some interval (once per window)? Note: this is a Scala app on the JVM.
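To make the on-heap option concrete, here is a minimal sketch of a read-through cache that lives for exactly one window. It is only an illustration of the pattern described in the question, not a recommendation: KvStore, WindowCache and WindowedPipeline are hypothetical names, and the store is a stand-in for a real client such as Bigtable.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for the external KV store (e.g. a Bigtable client).
trait KvStore {
  def get(key: String): Option[String]
}

// One cache instance per window; all names here are illustrative.
final class WindowCache(store: KvStore) {
  private val entries = TrieMap.empty[String, Option[String]]

  // Read-through lookup: go to the store only on a cache miss.
  def get(key: String): Option[String] =
    entries.getOrElseUpdate(key, store.get(key))

  def size: Int = entries.size
}

// Holds the cache for the current window and swaps it when the window closes.
final class WindowedPipeline(store: KvStore) {
  @volatile private var cache = new WindowCache(store)

  def lookup(key: String): Option[String] = cache.get(key)

  def cachedEntries: Int = cache.size

  // Dropping the reference makes the whole old cache unreachable at once,
  // so the collector can reclaim it without per-entry eviction.
  def closeWindow(): Unit = cache = new WindowCache(store)
}
```

Swapping the reference does not force a collection; it only makes the old window's entries eligible for one, which is the limitation the question already points out.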

Related

Can Flink handle ~50 GB of state for a single table/window?

I am building a streaming analytic that requires ~50 GB of initial state in-memory for a single table. ~50 GB is the amount of RAM used when I load the state into a Scala HashMap[String,String].
Can Flink handle having ~50 GB of state for a single table that grows over time?
Will I be able to perform lookups and updates to this table in a streaming fashion?
Notes:
I cannot change the types to anything smaller.
The state is used as a lookup for mapping one String to another String.
It would take like three years for the state to double to 100 GB (aggressive estimate as the current state required ten years to produce).
This Flink blog claims that the state size should not be a problem but I thought I would double check before spinning it up. Terabytes of state are mentioned.
https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
50-100 GB for a single table in Flink state is not a problem.
But to be clear, when we talk about having huge amounts of state in Flink (e.g., terabytes) we are talking about keyed state that is sharded across many parallel tasks. Yes, you can have a single table that is very large, but any given instance will only have a subset of the rows of that table.
Note that you will need to choose a state backend -- either a heap-based state backend that will keep the state in memory, as objects on the JVM heap, or the RocksDB state backend, that will keep the state as serialized bytes on disk with an in-memory cache.
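To illustrate what "sharded across many parallel tasks" means for a String-to-String lookup table, here is a simplified sketch in plain Scala. Flink's actual key assignment goes through key groups; the hash-modulo scheme and the names KeyedSharding, taskFor and shard below are assumptions made only for illustration.

```scala
// Simplified illustration of keyed state spread over parallel tasks.
// The modulo scheme is an assumption, chosen only to show that each
// task ends up holding a subset of the rows.
object KeyedSharding {
  // Pick the task index responsible for a key.
  def taskFor(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Split one logical table into per-task partitions.
  def shard(table: Map[String, String], parallelism: Int): Map[Int, Map[String, String]] =
    table.groupBy { case (k, _) => taskFor(k, parallelism) }
}
```

With a parallelism of 4, each task would hold roughly a quarter of the rows, and a lookup for a given key always lands on the same task.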

MongoDB: Disk I/O % utilization on Data Partition has gone

Recently I got this alert from MongoDB Atlas:
Disk I/O % utilization on Data Partition has gone above 70 on nvme2n1
But I have no idea how to track down the problematic query, index, part of the code, or collection.
How can I analyze this to find the root cause?
Not an answer, but I have seen that many people have faced a similar problem.
In my case the root cause was this: we had a collection of huge documents, each containing an array of data (in fact, a list of coordinates with some metadata), and we updated a document once for every coordinate we added, plus some additional operations.
As far as I know, MongoDB cannot fetch just part of a document; it fetches the full document. When we fetched many different large documents, they did not fit into MongoDB's in-memory cache, so each access went to disk, which led to this issue.
So we split this document into several, and that fixed the issue. While we need frequent access to update or add data, we keep it in separate documents; once processing is done, we gather these documents back into one big document for "history check" purposes.
Recently, we hit the alert "Disk I/O % utilization on Data Partition has gone above 90" on MongoDB Atlas after an instance reboot during maintenance. After a discussion with Atlas support, we now understand this metric clearly.
Understanding Disk I/O % Utilization
The definition of Disk I/O % Utilization and Disk I/O % utilization on Data Partition per doc
Disk I/O % Utilization alerts indicate that the percentage of time during which requests are being issued reaches a specified threshold.
Disk I/O % utilization on Data Partition occurs if the percentage of time during which requests are being issued to any partition that contains the MongoDB collection data meets or exceeds the threshold.
Two traps in iostat: %util and svctm
Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
This means if there was even just one I/O operation in progress for a given time period, the operating system would report 100% Disk Util, as the disk was in use 100% of that time.
Thus, the disk utilization percentage by itself is NOT an indicator of stress on the disk relative to its maximum IOPS capacity.
Having disk utilization at 100% does not in itself imply there is an issue. Disk utilization is the percentage of time requests are issued to any partition containing the MongoDB collection data. This includes requests from any process, not just MongoDB processes. Modern disk storage can sustain multiple I/O operations simultaneously, so having a ~100% utilization is not unusual, because it just means that the disk is constantly processing at least one operation during the 100% interval.
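The point about parallel requests can be checked with a small calculation. The sketch below computes utilization the way it is defined above, as the share of an interval during which at least one request is in flight. DiskUtil and the request timings are made up for illustration and are not taken from iostat or Atlas.

```scala
// Each request is a (startMs, endMs) pair; the values used with this are made up.
object DiskUtil {
  // Percentage of the interval during which at least one request was in flight.
  def utilization(requests: Seq[(Long, Long)], intervalMs: Long): Double = {
    val sorted = requests.sortBy(_._1)
    // Merge overlapping requests so parallel I/O is not counted twice.
    val merged = sorted.foldLeft(List.empty[(Long, Long)]) {
      case ((s, e) :: rest, (ns, ne)) if ns <= e => (s, math.max(e, ne)) :: rest
      case (acc, r) => r :: acc
    }
    val busyMs = merged.map { case (s, e) => e - s }.sum
    100.0 * busyMs / intervalMs
  }
}
```

One request that spans a full 1000 ms interval and 32 simultaneous requests over the same interval both come out at 100%, even though the second case is 32 times the load. That is why the percentage alone says little about how close the disk is to its IOPS limit.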
Conclusion
We should look at a combination of all the available disk-related metrics, as well as IOWait in the System CPU when diagnosing potential disk performance-related issues.
Possible actions to help resolve Disk Utilization % alerts
Optimize your queries
Create an Index to Support Read Operations
Pay attention to Query Selectivity and Covered Query
Use the Atlas Performance Advisor to view slow queries and suggested indexes.
Review Indexing Strategies for possible further indexing improvements.
Analyze Query Performance to review how your queries are using your indexes.
Analyze Profile to optimize the long execution time query
Increase hardware resources, such as instance size and IOPS on Atlas
Source: Mongo Doc
As the alert says, it is due to high utilization of the disk. The most common cause is unoptimized queries with a poor Query Targeting Ratio, or simply reading/writing a lot of documents from/to the disk in a relatively short time window.
In order to identify these queries, start with the Profiler and look for the operations with a poor Examined:Returned ratio. You can also refer to the Performance Advisor to see if it suggests any indexes on the inefficient operations. Since Profiler's window is limited to the last 24 hours, you can also refer to your logs to identify the Slow Queries.
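As a rough illustration of what "poor Examined:Returned ratio" means, the sketch below ranks some made-up operations by documents examined per document returned. ProfiledOp, QueryTargeting and the field names are hypothetical and do not mirror the Atlas Profiler schema.

```scala
// Hypothetical profiler entries; field names are illustrative, not the Atlas schema.
final case class ProfiledOp(name: String, docsExamined: Long, docsReturned: Long) {
  // Documents scanned per document returned; 1.0 is ideal.
  // When nothing is returned, the examined count itself is used as the ratio.
  def ratio: Double =
    if (docsReturned == 0) docsExamined.toDouble
    else docsExamined.toDouble / docsReturned
}

object QueryTargeting {
  // Operations whose ratio exceeds the threshold, worst first.
  def worstOffenders(ops: Seq[ProfiledOp], threshold: Double): Seq[ProfiledOp] =
    ops.filter(_.ratio > threshold).sortBy(op => -op.ratio)
}
```

An operation that examines 50000 documents to return 10 has a ratio of 5000 and is a likely candidate for a new index, while one that examines 300 to return 100 is probably fine.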
Ultimately, the effort to solve this is tri-directional:
Optimizing the query execution with efficient indexing and filtering strategies
Keep a check on the volume of data being read/written in one go.
Increase the IOPS of the cluster
For official reference, checkout the documentation here.

Scala concurrency performance issues

I have a data mining app.
There is 1 Mining Actor which receives and processes a Json containing 1000 objects. I put these into a list and, for each element, log the data by sending it to 1 Logger Actor which logs data into many files.
Processing the list sequentially, my app uses 700MB and takes ~15 seconds of 20% cpu power to process (4 core cpu). When I parallelize the list, my app uses 2GB and ~ the same amount of time and cpu to process.
My questions are:
Since I parallelized the list and thus the computation, shouldn't the compute-time decrease?
I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?
Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?
Without more details, any answer is a guess. However, even a guess might point you to the right direction.
Parallelized execution should decrease the running time but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that loggers are not the bottleneck here.
The memory usage might jump if your threads allocate a lot of memory each. In that case, the more threads you have the more memory will be required. I don't know what kind of collection you are parallelizing on, but most are stored in memory, completely. Yes, the garbage collector will free everything except resources that you must release explicitly, such as files.
How many threads for reading and writing to the hard disk?
The memory increases because I send messages faster than the Logger can write, so the Mailbox balloons in size until the Logger has processed the messages and the GC kicks in.
I solved this by writing state to a protocol buffer file. Before doing any writes, I compare with the protobuf file because reads are significantly cheaper than writes. My resource usage is now 10% for 2 seconds, and less than 400MB RAM.
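For the mailbox growth itself, another way to cap memory is to bound the queue so that a fast producer has to wait for the slow consumer. This is a minimal sketch using a plain java.util.concurrent queue instead of an actor library; BoundedMailbox is a hypothetical name and this is not the approach the poster ended up using.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Illustrative stand-in for an actor mailbox with a fixed capacity.
final class BoundedMailbox[A](capacity: Int) {
  private val queue = new ArrayBlockingQueue[A](capacity)

  // Blocks the producer while the mailbox is full, so memory stays bounded.
  def send(msg: A): Unit = queue.put(msg)

  // Non-blocking variant: returns false instead of growing the queue.
  def trySend(msg: A): Boolean = queue.offer(msg)

  // Blocks the consumer until a message is available.
  def receive(): A = queue.take()

  def size: Int = queue.size
}
```

With a capacity of a few thousand messages, the miner would be throttled to the logger's write speed instead of piling up messages on the heap.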

Garbage Collection issues on MapPartitions

I currently have a mapPartitions job which is flatMapping each value in the
iterator, and I'm running into an issue where there will be major GC costs
on certain executions. Some executors will take 20 minutes, 15 of which are
pure garbage collection, and I believe that a lot of it has to do with the
ArrayBuffer that I am outputting. Does anyone have any suggestions as to how
I can do some form of a stream output?
Also, does anyone have any advice in general for tracking down/addressing GC
issues in spark?
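On the "stream output" part of the question: mapPartitions takes a function from Iterator to Iterator, so one option is to return a lazy iterator instead of filling an ArrayBuffer. The sketch below uses plain Scala iterators and no Spark dependency. StreamingFlatMap and expand are hypothetical placeholders for the real per-record logic.

```scala
object StreamingFlatMap {
  // Hypothetical per-record expansion; stands in for the real flatMap logic.
  def expand(value: Int): Iterator[Int] =
    Iterator.tabulate(3)(i => value * 10 + i)

  // Suitable as the body of mapPartitions: returns a lazy Iterator,
  // so outputs are produced on demand rather than buffered.
  def processPartition(input: Iterator[Int]): Iterator[Int] =
    input.flatMap(expand)
}
```

Assuming an RDD[Int] named rdd, this would be called as rdd.mapPartitions(StreamingFlatMap.processPartition). Because nothing is materialized per partition, the expanded records need not all be held in memory at once.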
Please refer to the below documentation from official page of Spark tuning. I hope it will at least help to give direction to your analysis:
Memory Management Overview
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.
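The two settings quoted above translate into region sizes as follows. This is just the arithmetic from the quoted text written out in Scala. SparkMemoryRegions is an illustrative helper, not a Spark API, and the 4300 MB heap is an arbitrary example.

```scala
object SparkMemoryRegions {
  private val ReservedMb = 300.0 // fixed reservation quoted in the tuning guide

  // M: unified execution + storage region.
  def unifiedMb(heapMb: Double, memoryFraction: Double = 0.6): Double =
    (heapMb - ReservedMb) * memoryFraction

  // R: the part of M where cached blocks are immune to eviction.
  def protectedStorageMb(
      heapMb: Double,
      memoryFraction: Double = 0.6,
      storageFraction: Double = 0.5
  ): Double =
    unifiedMb(heapMb, memoryFraction) * storageFraction
}
```

For a 4300 MB heap with the defaults, M is (4300 - 300) * 0.6 = 2400 MB and R is 2400 * 0.5 = 1200 MB, leaving about 1600 MB for user data structures and Spark's internal metadata.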

What is mongodb behavior regarding keeping loaded indexes in ram?

Say I have a single collection in mongodb with only one index, and I require the index for the entire life cycle of the application using that mongo collection.
I would like to know about the behaviour of mongodb.
In this case once the index is loaded into memory, will mongodb keep it in the ram?
Thanks
The first thing MongoDB will knock out of RAM will be the LRU (least recently used) piece of data. So if you only have one index, chances are it will continue to be used pretty regularly and it should stay in memory.
Source
Unfortunately you cannot currently pin a collection or index in memory. MongoDB uses memory mapped files to load collections and indexes into memory. As your activities touch various pieces of your database through queries, updates, insertions and deletions, that data will get loaded into memory. This is referred to as the working set. If the total memory required to load the working set is less than available memory, no problem.
If not, MongoDB is going to use an LRU algorithm to pick what to unload from memory. This is why it's so important to understand the concept of the working set and how it relates to your available memory.
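To show why a frequently used index tends to stay resident under LRU, here is a small generic LRU cache in Scala. It is only a model of the eviction policy, built on java.util.LinkedHashMap in access-order mode. LruCache is a hypothetical name and this is not how MongoDB or the operating system implements it.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Minimal LRU cache built on LinkedHashMap's access-order mode.
final class LruCache[K, V](capacity: Int) {
  private val entries = new JLinkedHashMap[K, V](16, 0.75f, true) {
    // Called after each insertion; returning true evicts the eldest entry.
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > capacity
  }

  // A successful get marks the entry as most recently used.
  def get(key: K): Option[V] = Option(entries.get(key))

  def put(key: K, value: V): Unit = { entries.put(key, value); () }

  // Does not change the access order.
  def contains(key: K): Boolean = entries.containsKey(key)
}
```

With a capacity of 2, inserting "index" and "docA", reading "index", then inserting "docB" evicts "docA": the entry that keeps getting touched survives, which is the behaviour the answers describe for a regularly used index.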
This writeup from the documentation should be helpful:
How do I calculate how much RAM I need for my application?
The amount of RAM you need depends on several factors, including but
not limited to:
The relationship between database storage and working set.
The operating system’s cache strategy for LRU (Least Recently Used)
The impact of journaling
The number or rate of page faults and other MMS gauges to detect when you need more RAM
Each database connection thread will need up to 1 MB of RAM. MongoDB
defers to the operating system when loading data into memory from
disk. It simply memory maps all its data files and relies on the
operating system to cache data. The OS typically evicts the
least-recently-used data from RAM when it runs low on memory. For
example if clients access indexes more frequently than documents, then
indexes will more likely stay in RAM, but it depends on your
particular usage.
To calculate how much RAM you need, you must calculate your working
set size, or the portion of your data that clients use most often.
This depends on your access patterns, what indexes you have, and the
size of your documents. Because MongoDB uses a thread per connection
model, each database connection also will need up to 1MB of RAM,
whether active or idle.
If page faults are infrequent, your working set fits in RAM. If fault
rates rise higher than that, you risk performance degradation. This is
less critical with SSD drives than with spinning disks.
http://docs.mongodb.org/manual/faq/diagnostics/
You can use the serverStatus command to get an estimate of your current working set:
db.runCommand( { serverStatus: 1, workingSet: 1 } )