I have a situation where the number of JBoss EAP 6.1 server log files keeps increasing. The average number of server.log files per day in April was 330, with a maximum of 500.
In particular, since the upper limit on the number of GW server.log rotations per day is set to 300, that limit is already being exceeded.
When the upper limit is exceeded, old files get deleted.
Please tell me how to avoid this file deletion. (Do I need to increase the folder size, or adjust the rotation settings?)
As the number of users is expected to increase, the number of files is expected to keep growing as well.
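For reference, in EAP 6 this kind of cap normally comes from the max-backup-index of a size-rotating-file-handler in the logging subsystem of standalone.xml. A minimal sketch, where the handler name FILE and the values are placeholders (raising the index only trades deletion for extra disk usage, so the log directory must have room for it):

<size-rotating-file-handler name="FILE">
    <file relative-to="jboss.server.log.dir" path="server.log"/>
    <!-- rotate once the current file reaches this size -->
    <rotate-size value="50m"/>
    <!-- number of rotated files kept before the oldest is deleted -->
    <max-backup-index value="500"/>
</size-rotating-file-handler>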
Our pipeline is developed with the Apache Beam Go SDK. I'm trying to profile the CPU usage of all workers by setting the flag --cpu_profiling=gs://gs_location: https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/dataflow/dataflow.go
The job finished after consuming 16.636 vCPU hours with a maximum of 104 workers.
As a result, a bunch of files named "profprocess_bundle-*" are written to the specified GCS location.
I then downloaded these files, unzipped them all, and visualized the results with pprof (https://github.com/google/pprof).
So here are my questions:
How is the total time in the profiling result collected? The sampled time (1.06 hrs) is much shorter than the vCPU time (16.636 hrs) reported by Dataflow.
What is the number in the file name "profprocess_bundle-*"? I thought it might correspond to a worker index, but the maximum is larger than the number of workers and the numbers are not contiguous: the maximum is 122, yet there are only 66 files.
When you set --cpu_profiling, profiling starts when the SDK worker begins processing a bundle (a batch of input elements that goes through a subgraph of your pipeline DAG, sometimes also referred to as a work item) and ends when that processing finishes. A job can contain many bundles, which is why the total vCPU time is larger than the sampled period.
As mentioned, the number in profprocess_bundle-* is the id of the bundle being profiled.
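If you also want a single job-wide picture instead of one profile per bundle, one option (a sketch, assuming a reasonably recent pprof build, which accepts several profile sources and merges their samples) is:

pprof -proto profprocess_bundle-* > merged.pb.gz
pprof -http=:8080 merged.pb.gz

The merged profile still only covers the sampled bundle-processing time, so it will not add up to the vCPU hours billed by Dataflow.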
I have Kafka receiving 1 GB of data every minute from certain events, due to which the number of open files is going above 1,000,000. I am not sure which setting needs to be changed to reduce this number. Can anyone suggest a quick fix? Should I increase log.segment.bytes=1073741824 to 10 GB to reduce the number of files being created, or increase log.retention.check.interval.ms=300000 to 15 minutes so that fewer files are checked for deletion?
Increasing the size of the segments will reduce the number of files maintained by the broker, with the tradeoff that only closed segments get cleaned or compacted.
The other alternative is to reconsider what type of data you're sending. For example, if you are sending files or other large binary blobs, consider using filesystem URIs rather than pushing the whole payload through a topic.
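For illustration, a hedged sketch of the segment-size change in server.properties (the values are placeholders, not recommendations; note that log.segment.bytes is an int-typed setting, so it tops out at around 2 GiB rather than 10 GB):

# larger segments -> fewer files per partition, but retention/compaction only act on closed segments
log.segment.bytes=2147483647
# this only controls how often the cleaner looks for deletable segments;
# raising it does not reduce the number of files, it just delays cleanup
log.retention.check.interval.ms=300000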
I am reading the question Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND and trying to apply the solution suggested by Hrishikesh Mishra, as we face a similar issue, so I increased the broker setting max.incremental.fetch.session.cache.slots to 2000 (the default is 1000). But now I wonder how I can monitor the actual number of used incremental fetch session cache slots. In Prometheus I see the kafka_server_fetchsessioncache_numincrementalfetchpartitionscached metric, and a PromQL query shows that on each of the three brokers the number is now significantly over 2000 (2703, 2655 and 2054), so I am not sure I am looking at the right metric. There is also kafka_server_fetchsessioncache_incrementalfetchsessionevictions_total, which shows zero on all brokers.
OK, there is also kafka_server_fetchsessioncache_numincrementalfetchsessions, which shows roughly 500 on each of the three brokers, for a total of about 1500. That is between 1000 and 2000, so maybe that is the metric controlled by max.incremental.fetch.session.cache.slots?
Actually, as of now, there are already more than 700 incremental fetch sessions on each broker, i.e. more than 2100 in total, so the limit of 2000 evidently applies per broker, meaning the number in the whole cluster can go as high as 6000. The reason the number is currently below 1000 on each broker is that the brokers were restarted after the configuration change.
And the question is: how can this allocation be checked at the level of an individual consumer? A query such as:
count by (__name__) ({__name__=~".*fetchsession.*"})
returns only this table:
Element Value
kafka_server_fetchsessioncache_incrementalfetchsessionevictions_total{} 3
kafka_server_fetchsessioncache_numincrementalfetchpartitionscached{} 3
kafka_server_fetchsessioncache_numincrementalfetchsessions{} 3
The metric named kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions is the correct way to monitor the number of FetchSessions.
The size is configurable via max.incremental.fetch.session.cache.slots. Note that this setting is applied per-broker, so each broker can cache up to max.incremental.fetch.session.cache.slots sessions.
The other metric you saw, kafka.server:type=FetchSessionCache,name=NumIncrementalFetchPartitionsCached, is the total number of partitions used across all fetch sessions. Many fetch sessions span several partitions, so it is expected to be a larger number.
As you said, the low number of FetchSessions you saw was likely due to the restart.
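Assuming the same exporter and metric names shown in the question, a sketch of the per-broker view against the configured limit could be:

# cached incremental fetch sessions on each broker (the slots limit applies per broker)
kafka_server_fetchsessioncache_numincrementalfetchsessions

# cluster-wide total, for context
sum(kafka_server_fetchsessioncache_numincrementalfetchsessions)

# a sustained non-zero eviction rate means a broker's cache is full and sessions are being pushed out
rate(kafka_server_fetchsessioncache_incrementalfetchsessionevictions_total[5m])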
I am trying to understand the behaviour of WAL files. The WAL-related settings of the database are as follows:
"min_wal_size" "2GB"
"max_wal_size" "20GB"
"wal_segment_size" "16MB"
"wal_keep_segments" "0"
"checkpoint_completion_target" "0.8"
"checkpoint_timeout" "15min"
The number of WAL files is always 1281 or higher:
SELECT COUNT(*) FROM pg_ls_dir('pg_xlog') WHERE pg_ls_dir ~ '^[0-9A-F]{24}';
-- count 1281
As I understand it, this means the WAL files currently never fall below max_wal_size (1281 * 16 MB = 20496 MB, i.e. roughly max_wal_size)?
I would expect the number of WAL files to drop below the maximum right after a checkpoint completes and data is synced to disk, but this is clearly not the case. What am I missing?
As per the documentation:
The number of WAL segment files in pg_xlog directory depends on min_wal_size, max_wal_size and the amount of WAL generated in previous checkpoint cycles. When old log segment files are no longer needed, they are removed or recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of log output rate, max_wal_size is exceeded, the unneeded segment files will be removed until the system gets back under this limit. Below that limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and removes the rest
So you are probably observing the "recycle" effect: the old WAL files are being renamed instead of removed. This saves some disk I/O, especially on busy systems.
Bear in mind that once a particular file has been recycled, it will not be reconsidered for removal/recycle again until it has been used (i.e., the relevant LSN is reached and checkpointed). That may take a long time if your system suddenly becomes less active.
If your server is very busy and then abruptly becomes mostly idle, you can get into a situation where the log files remain at max_wal_size for a very long time. At the time the server was deciding whether to remove or recycle the files, it was consuming them quickly and so chose to recycle up to max_wal_size for predicted future use rather than remove them. Once recycled, they will not be removed until they have been used (you could argue that that is a bug), and if the server is now mostly idle it will take a very long time for them to be used and thus removed.
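As a quick sanity check of that steady state (a sketch assuming a 9.6-era server, where pg_size_bytes() is available), the observed 1281 is simply max_wal_size divided by the segment size, plus the segment currently being written:

SELECT pg_size_bytes(current_setting('max_wal_size'))
     / pg_size_bytes(current_setting('wal_segment_size')) AS recyclable_segments;
-- 20 GB / 16 MB = 1280 recycled/reserved segments; add the one in use and you get 1281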
Can somebody please explain the cause of this error:
You have reached maximum pool size for given partition
In the latest 2.1.x versions, you do not get this exception any more.
You merely wait until a connection becomes available.
But I will explain it anyway. To increase multiprocessor scalability, the pool is split into partitions, and several threads work together on a single partition.
Each partition has a queue; when the connection limit for that queue is reached, the exception is thrown. But again, that is no longer the case in the latest version.
So the best approach to fix this issue is to upgrade to the latest version and set a maximum connection limit. It would help if you added more information to your question, but I suppose you are using OrientGraphFactory, which in the latest version has a maximum connection limit equal to the number of CPU cores.
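If you are indeed using OrientGraphFactory, a minimal sketch of setting an explicit pool limit after upgrading could look like this (the URL, credentials and pool sizes are placeholders):

import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

public class PoolExample {
    public static void main(String[] args) {
        // setupPool(min, max) bounds the number of pooled graph instances;
        // on recent 2.1.x a caller waits for a free instance instead of failing
        OrientGraphFactory factory =
                new OrientGraphFactory("remote:localhost/mydb", "admin", "admin")
                        .setupPool(1, 32);

        OrientGraph graph = factory.getTx();   // borrow a pooled, transactional graph
        try {
            // ... work with the graph ...
        } finally {
            graph.shutdown();                  // return the instance to the pool
        }
        factory.close();
    }
}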