Java multiple processes writing to same folder performance issue - Red Hat

We have 4 Java instances, each with 20 processes writing to the same or different folders on Linux, and we see a performance issue when they write to the same folder. Below is our configuration:
Java Instance 1 - 20 processes writing to FOLDER 1 - write throughput 200,000 per hour
Java Instance 2 - 20 processes writing to FOLDER 1 - write throughput 200,000 per hour
Java Instance 3 - 20 processes writing to FOLDER 2 - write throughput 400,000 per hour
Java Instance 4 - 20 processes writing to FOLDER 4 - write throughput 400,000 per hour
Instances 3 and 4 have double the throughput of Instances 1 and 2.
OS - Red Hat Linux
CPU - 150+
Memory - 200+
Please advise whether writing to the same folder on a Linux file system causes performance degradation, given that Linux treats a folder like a file.
Below is our Java IO code for reference:
File outFile = new File(outputFilePathTmp);
BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(outFile));
marshaller.marshal(baseDocument.getBusinessDocument(), bufferedWriter);
// Close (and thereby flush) the writer before reading the file size;
// otherwise buffered data may not have reached the file yet.
bufferedWriter.close();
String outputFilePathXml = regroupedXmlPath + File.separator + outputFileName.substring(0, indx) + "- Regrouped.xml";
outFile.setReadable(true, false);
outFile.setWritable(true, false);
outFile.setExecutable(true, false);
long outFileLength = outFile.length() / 1024; // size in KB
baseDocument.setFileSize(outFileLength);
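Not from the original post, but if contention on the single directory turns out to be the bottleneck, one common mitigation is to shard the output across subdirectories so that concurrent writers are not all inserting entries into the same directory. A minimal sketch (the base directory, writer-id scheme, and file names below are hypothetical):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ShardedOutput {
    // Place each writer's files in its own subdirectory so concurrent writers
    // do not all update the same directory's entries.
    static Path shardedPath(String baseDir, int writerId, String fileName) throws IOException {
        Path dir = Paths.get(baseDir, "writer-" + writerId);
        Files.createDirectories(dir);
        return dir.resolve(fileName);
    }

    public static void main(String[] args) throws IOException {
        Path out = shardedPath("/tmp/out", 7, "doc-0001-Regrouped.xml");
        Files.write(out, "<BusinessDocument/>".getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + Files.size(out) / 1024 + " KB to " + out);
    }
}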

Related

Memory leak (?) using IO::Socket::Async (on FreeBSD 13.1)

In processing a stream of logs (via UDP) in a Raku (v2022.07) app, I'm
hitting what appears to be a memory leak using IO::Socket::Async.
I pulled the code out into a simpler program which I've included below
(~ identical to code at https://docs.raku.org/type/IO::Socket::Async):
#!/usr/bin/env raku
my $socket = IO::Socket::Async.bind-udp('localhost', 24225);
react {
    whenever $socket.Supply -> $v {
        print $v if $v.chars > 0;
    };
};
It leaks substantial RAM - I let it run about 12 hours and
when I checked -- still running (on a 1 TB RAM machine) -- with
ps auwwx [pid]
it showed 314974456 and 20739784 for VSZ and RSS (so, roughly 300G virtual size and 20G resident).
[btw, the UDP traffic is fairly light - an average of 350 (~100 byte) packets/sec (spikes to ~1000/sec)]
So I rewrote the above in Perl 5 (after similarly leaky results with
a couple of Raku variants), which stabilizes quickly at about 8 MB resident - that's fine/stable/etc. -
but I'd prefer this process to feed a Raku channel (without a separate Perl process, file
tailing, etc.).
My environment: FreeBSD 13.1-RELEASE-p2 GENERIC amd64 and raku:
v2022.07 built on MoarVM 2022.07 (installed with rakubrew).
I'm guessing this is unique to Raku on FreeBSD, but I'm not sure.
I did attempt to upgrade (rakubrew) to v2022.12 to see if the problem was resolved there -
but in rebuilding modules (zef), too many failed (some issue with
Digest/Digest::HMAC) - so I had to revert to 2022.07.
I'd be grateful for any suggestions for addressing the leak, or for alternative
ways of reading from a UDP port.
Not exactly a solution to your problem, but you can monitor memory usage from within your Raku code using a built-in feature:
use Telemetry;
say T{"max-rss"};
Also remember that Supply decodes Unicode characters by default. If your protocol is binary, you can pass :bin to the socket's Supply to avoid treating binary data as text.

Full frequent garbage collection in wildfly 8.2.0 Final

We moved from JBoss AS 7.1.1 to WildFly 8.2.0.Final. After the upgrade, we are seeing frequent full garbage collections when running a 60-user load test. The full GCs were not able to recover any memory. On analysis we found that org.apache.jasper.runtime.BodyContentImpl was holding 1 GB of retained heap. We found the root cause in PerThreadTagHandlerPool.java and patched TagHandlerPool:
diff --git a/src/main/java/org/apache/jasper/runtime/TagHandlerPool.java b/src/main/java/org/apache/jasper/runtime/TagHandlerPool.java
index eaa8560..c6c785f 100644
--- a/src/main/java/org/apache/jasper/runtime/TagHandlerPool.java
+++ b/src/main/java/org/apache/jasper/runtime/TagHandlerPool.java
@@ -53,7 +53,7 @@ public class TagHandlerPool {
result = null;
}
}
- if( result==null ) result=new PerThreadTagHandlerPool();
+ if( result==null ) result=new TagHandlerPool();
result.init(config);
return result;
This fixed the memory leak issue; however, we are now seeing frequent full GCs every 2 minutes when running the load test. The full GC is able to recover memory. On analyzing a heap dump in Eclipse MAT, we found that most of the heap is reported as remainder (350 MB), io.undertow.server.session.InMemorySessionManager occupies around 17 MB, and org.hibernate.internal.SessionFactoryImpl occupies around 17.5 MB.
We tried multiple options:
1. Our max heap is 1536m; we decreased it to 1024m and increased it to 2048m and 4096m. No benefit.
2. Changed -XX:NewRatio to 3, but no help.
We would appreciate your input.

akka custom fork-join-executor dispatcher behaves differently on OSX and RHEL

When I deploy a Play Framework application (using the Akka framework) to a production machine, it behaves differently than on my development workstation.
This is a system that receives a batch of device IP addresses, performs some processing on each device, and aggregates the results after all devices in the batch have been processed. The processing isn't very CPU intensive.
I basically have 2 types of actors: a BatchActor and a DeviceActor. For the devices, I've created an actor backed by a RoundRobinPool router and a custom dispatcher. I'm attempting to process ~500 devices at a time (in parallel).
The issue is that when I run this code on my OS X machine, it runs as I would expect.
For instance, if I submit a batch of 200 device IP addresses, the application running on my workstation processes all the devices in parallel.
However, when I copy this application to the production machine, Red Hat Enterprise Linux (RHEL), and run it, submitting the same list of devices, it only processes 1 to 2 devices at a time.
What do I need to do to fix this issue?
The relevant code is as follows:
object Application extends Controller {
  ...
  val numberOfWorkers = 500
  val workers = Akka.system.actorOf(Props[DeviceActor]
    .withRouter(RoundRobinPool(nrOfInstances = numberOfWorkers))
    .withDispatcher("my-dispatcher")
  )
  def batchActor(config: BatchConfig) =
    Akka.system.actorOf(BatchActor.props(workers, config), s"batch-${config.batchId}")
  ...
  def batch = Action(parse.json) { request =>
    request.body.validate[BatchConfig] match {
      case config: BatchConfig => {
        ...
        val batch = batchActor(config)
        batch ! BatchActorProtocol.Start
        Ok(Json.toJson(status))
      }
      ...
    }
  }
The application.conf configuration section looks like the following:
my-dispatcher {
  # Dispatcher is the name of the event-based dispatcher
  type = Dispatcher
  # What kind of ExecutionService to use
  executor = "fork-join-executor"
  # Configuration for the fork join pool
  fork-join-executor {
    # Min number of threads to cap factor-based parallelism number to
    parallelism-min = 1000
    # Parallelism (threads) ... ceil(available processors * factor)
    parallelism-factor = 100.0
    # Max number of threads to cap factor-based parallelism number to
    parallelism-max = 5000
  }
  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 500
}
Inside the BatchActor I'm simply parsing the list of devices and feeding it to the workers:
class BatchActor(val workers: ActorRef, val config: BatchConfig) extends Actor {
  ...
  def receive = {
    case Start => start
    ...
  }
  private def start = {
    ...
    devices.map { devices =>
      results(devices.host) = None
      workers ! DeviceWork(self, config, devices, steps)
    }
    ...
  }
after which the WorkerActor submits a result object back to the BatchActor.
My workstation: OS X - v10.9.3
java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
production machine: Red Hat Enterprise Linux Server release 6.5 (Santiago)
java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Software:
Scala: v2.11.2
SBT: v0.13.6
Play: v2.3.5
Akka: v2.3.4
I'm using typesafe activator/sbt to start the application. The command is as follows:
cd <project dir>
./activator run -Dhttp.port=6600
Any help appreciated. I've been stuck on this issue for a couple of days now.
I believe you have too much parallelism in your code, i.e., you are creating too many threads in your dispatcher. How many cores do you have on your Red Hat box? I've never seen such high values used. A lot of threads in a fork-join pool may result in a large number of context switches. Try just using the default dispatcher and see whether that fixes your issue. You can also change the min and max parallelism values to 2 or 3 times the number of cores you have (see the sketch after the quoted configuration below).
fork-join-executor {
  # Min number of threads to cap factor-based parallelism number to
  parallelism-min = 1000
  # Parallelism (threads) ... ceil(available processors * factor)
  parallelism-factor = 100.0
  # Max number of threads to cap factor-based parallelism number to
  parallelism-max = 5000
}
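A hedged sketch of what a more conservative dispatcher configuration might look like - the core count and the exact numbers below are assumptions, not values taken from the question:
my-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    # Assuming roughly a 16-core box: cap the pool at about 2-3x the core count.
    parallelism-min = 16
    parallelism-factor = 2.0
    parallelism-max = 48
  }
  # A low throughput value lets the dispatcher switch between actors more fairly.
  throughput = 1
}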
Another thing to try is to create an uber jar using sbt-assembly and then deploy that instead of using activator to deploy it.
Finally, you can look inside your JVMs using something like VisualVM or YourKit.
After hours spent trying different things, including:
- doing research on different threading implementations on Linux - pthreads vs. NPTL
- reading through all the VM documentation on threading
- ulimits
- trying various changes in the Play and Akka framework configurations
- and finally a complete re-write of the thread management using Scala futures, etc.
Nothing seemed to work. Then I did a detailed comparison, and the only thing that was different was that I used the Oracle HotSpot JVM on my laptop and the OpenJDK JVM on the production machine.
So I installed the Oracle JVM on the production machine and that seemed to fix the issue. Even though I couldn't determine the ultimate cause, it seems that the default installation of OpenJDK on RHEL is compiled or configured differently enough not to allow spawning of ~500 threads at a time.
I'm sure I'm missing something, but after ~ 3 days of searching I couldn't find it.
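One way to test the thread-spawning hypothesis outside of Play/Akka - this is a minimal sketch, not from the original post - is to try to start ~500 plain Java threads on both JVMs and see whether they all come up:
public class ThreadSpawnTest {
    public static void main(String[] args) {
        final Object lock = new Object();
        int started = 0;
        for (int i = 0; i < 500; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    synchronized (lock) {
                        // Park the thread; we only care whether it can be started.
                        try { lock.wait(); } catch (InterruptedException ignored) { }
                    }
                }
            });
            t.setDaemon(true); // let the JVM exit once main() finishes
            try {
                t.start();
                started++;
            } catch (OutOfMemoryError e) {
                // HotSpot reports "unable to create new native thread" this way
                // when the OS refuses to create more threads.
                System.out.println("Failed after " + started + " threads: " + e);
                break;
            }
        }
        System.out.println("Started " + started + " threads on "
                + System.getProperty("java.vm.name") + " " + System.getProperty("java.version"));
    }
}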

pymongo cursor getMore takes long time

I am having trouble with the time it takes for my Python script to iterate over a data set. The data set is about 40k documents. This is large enough to cause the pymongo cursor to issue multiple fetches, which are internal and abstracted away from the developer. I simplified my script down as much as possible to demonstrate the problem:
from pymongo import Connection
import time

def main():
    starttime = time.time()
    cursor = db.survey_answers.find()
    counter = 0
    lastsecond = -1
    for entry in cursor:
        if int(time.time() - starttime) != lastsecond:
            print "loop number:", counter, " seconds:", int(time.time() - starttime)
            lastsecond = int(time.time() - starttime)
        counter += 1
    print (time.time() - starttime), "seconds for the mongo query to get rows:", counter

connection = Connection(APPSERVER)  # either localhost or hostname depending on test
db = connection.beacon

if __name__ == "__main__":
    main()
My setup is as follows. I have 4 separate hosts: one APPSERVER running mongos, and 3 shard hosts, each holding the primary of its own replica set and secondaries of the other two.
I can run this from one of the shard servers (with the connection pointing to the APPSERVER hostname) and I get:
loop number: 0 seconds: 0
loop number: 101 seconds: 2
loop number: 7343 seconds: 5
loop number: 14666 seconds: 8
loop number: 21810 seconds: 10
loop number: 28985 seconds: 13
loop number: 36078 seconds: 15
16.0257680416 seconds for the mongo query to get rows: 41541
So it's obvious what's going on here: the first batch of a cursor request is 100 documents, and each subsequent one is 4 MB worth of data, which appears to be just over 7k documents for me. And each fetch costs 2-3 seconds!
I thought I could fix this problem by moving my application closer to the mongos instance. I ran the above code on APPSERVER (with the connection pointing to localhost), hoping to decrease the network usage ... but it was worse!
loop number: 0 seconds: 0
loop number: 101 seconds: 9
loop number: 7343 seconds: 19
loop number: 14666 seconds: 28
loop number: 21810 seconds: 38
loop number: 28985 seconds: 47
loop number: 36078 seconds: 53
53.5974030495 seconds for the mongo query to get rows: 41541
The cursor batch sizes are exactly the same in both tests, which is nice, but each cursor fetch costs 9-10 seconds here!
I know I have four separate hosts that need to communicate, so this can't be instant. But I will need to iterate over collections of maybe 10m records. At 2 seconds per 7k, that would take just shy of an hour! I can't have this!
Btw, I'm new to the Python/MongoDB world; I'm used to PHP and MySQL, where I would expect this to process in a fraction of a second:
$q = mysql_query("select * from big_table"); // let's say 10m rows here ....
$c = 0;
while ($r = mysql_fetch_row($q))
    $c++;
echo $c." rows examined";
Can somebody explain the gargantuan difference between the pymongo (~1 hour) and PHP/MySQL (<1 sec) approaches I've presented? Thanks!
I was able to figure this out with the help of A. Jesse Jiryu Davis. It turns out I didn't have the C extensions installed. I wanted to run another test without the shards so I could rule out network latency as an issue. I got a fresh clean host, set up mongo, imported my data, and ran my script, and it took the same amount of time. So I know the sharding/replica sets didn't have anything to do with the problem.
Before the fix, I was able to print:
pymongo.has_c(): False
pymongo version 2.3
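(For reference, a minimal check along these lines prints those values; this is my sketch rather than the author's script, but pymongo.has_c() and pymongo.version are real attributes of the driver:)
import pymongo
print "pymongo.has_c():", pymongo.has_c()
print "pymongo version", pymongo.version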
I then followed the instructions to install the dependencies for c extensions:
yum install gcc python-devel
Then I reinstalled the pymongo driver:
git clone git://github.com/mongodb/mongo-python-driver.git pymongo
cd pymongo/
python setup.py install
I reran my script and it now prints:
pymongo.has_c(): True
pymongo version 2.3+
And it takes about 1.8 seconds to run as opposed to the 16 above. That still seems long to fetch 40k records and iterate over them, but it's a significant improvement.
I will now run these updates on my prod (sharded, replica set) environment to hopefully see the same results.
UPDATE
I updated my pymongo driver on my prod environment and there was an improvement, though not as much. It took about 2.5-3.5 seconds over a few tests. I presume the sharded setup accounts for the difference here. That still seems incredibly slow for iterating over 40k records.

mongodb higher faults on Windows than on Linux

I am executing the C# code below:
for (;;)
{
    Console.WriteLine("Doc# {0}", ctr++);
    BsonDocument log = new BsonDocument();
    log["type"] = "auth";
    BsonDateTime time = new BsonDateTime(DateTime.Now);
    log["when"] = time;
    log["user"] = "staticString";
    BsonBoolean bol = BsonBoolean.False;
    log["res"] = bol;
    coll.Insert(log);
}
When I run it against a MongoDB instance (version 2.0.2) running on a virtual 64-bit Linux machine with just 512 MB of RAM, I get about 5k inserts with 1-2 faults, as reported by mongostat after a few minutes.
When the same code is run against a MongoDB instance (version 2.0.2) running on a physical Windows machine with 8 GB of RAM, I get 2.5k inserts with about 80 faults, as reported by mongostat after a few minutes.
Why are more faults occurring on Windows? I can see the following message in the logs:
[DataFileSync] FlushViewOfFile failed 33 file
Journaling is disabled on both instances.
Also, are 5k inserts on a virtual machine with 1-2 faults good enough, or should I be expecting better insert performance?
Looks like this is a known issue - https://jira.mongodb.org/browse/SERVER-1163
The page fault counter on Windows is in fact the total page fault count, which includes both hard and soft page faults.
Process: Page Faults/sec. This is an indication of the number of page faults that
occurred due to requests from this particular process. Excessive page faults from a
particular process are usually an indication of bad coding practices. Either the
functions and DLLs are not organized correctly, or the data set that the application
is using is being called in a less than efficient manner.