We have a Jetty server with 6 threads in a task executor, and it is outperforming a setup of 1 master and 4 slaves (2 local and 2 remote) with a partition grid size of 4. So for a 10k file, even with 4 partitions, the processing time does not drop by a factor of 4. Even allowing for some overhead from the remote partitions, we are surprised by how slow it is. Has anyone benchmarked remote partitioning to validate that the overhead of partitioning is not the culprit behind the slow performance? Any pointers on our slow performance?
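For anyone benchmarking this, a locally partitioned step with the same grid size is a useful way to isolate the remote-messaging overhead. Below is a hedged sketch only (assuming Spring Batch partitioning, which the master/slave and grid-size terminology refers to; the bean and step names are placeholders, not our actual configuration):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class LocalPartitionBaseline {

    @Bean
    public TaskExecutor partitionTaskExecutor() {
        // same thread budget as the task executor mentioned above
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(6);
        executor.setMaxPoolSize(6);
        return executor;
    }

    @Bean
    public Step masterStep(StepBuilderFactory steps, Step slaveStep, Partitioner partitioner) {
        // gridSize 4 matches the partition grid size above; because the partitions
        // run in-process here, any extra time seen with the remote setup is
        // attributable to the messaging/serialization overhead of remote partitioning
        return steps.get("masterStep")
                .partitioner("slaveStep", partitioner)
                .step(slaveStep)
                .gridSize(4)
                .taskExecutor(partitionTaskExecutor())
                .build();
    }
}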
We are trying to load data from Oracle into Postgres using NiFi.
We are using PutDatabaseRecord to load the data (which is in Avro format).
We are using ExecuteSQL to extract the data, which is very fast, but we can see that even though we are using 150+ threads for PutDatabaseRecord, it maintains an average of only about 1 GB of writes per 5 minutes.
If we have 3 PutDatabaseRecord processors (say, one processor per table), each with 50 threads, it still maintains an average of about 1 GB per 5 minutes (e.g., 250 MB for the first processor, 350 MB for the second and 400 MB for the third, or some other combination, but still roughly 1 GB overall).
We are really not sure whether it is the Postgres database that is limiting the write throughput or whether the limit is on the NiFi side.
We need help deciding whether to change NiFi properties or some Postgres settings to improve the data loading performance.
One observation is that data extraction from Oracle is very fast, and we can see the NiFi queues filling up quickly and waiting to be processed by the PutDatabaseRecord processors.
If you have a single NiFi instance, there will be a limit on how much data you can push through regardless of the number of threads (once the number of threads reaches the number of cores on your machine). To increase throughput, you could set up a 3-5 node NiFi cluster and run the PutDatabaseRecord processors in parallel; then you should see 3-5 GB of throughput to Postgres (as long as Postgres can handle that).
I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6GB (i.e., about 400MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machines. The Spark application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner.
Besides the read being very slow (about 1.5 minutes; although, as far as I understand, full scans are generally a bad idea on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds to begin their readings. The red bars correspond to "Task Deserialization Time", but my understanding is that they are simply idle (if there are concurrent stages, these executors work on something else and then come back to this stage only after the 25 seconds).
For some other reason, after some time the concurrent allocation of tasks is suspended and then resumes all at once (at about 55 seconds from the start of the job); you can see it in the middle of the picture, where a whole bunch of tasks are started at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?
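For reference, this is roughly how the read is wired up. A hedged sketch only, assuming the MongoDB Spark connector 2.x Java API; the mongos URI, database/collection names and shard key are placeholders, not the real ones:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

public class MongoFullScan {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mongo-full-scan")
                // placeholder mongos URI and namespace
                .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycoll")
                // MongoShardedPartitioner: one Spark partition per chunk of the sharded collection
                .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
                .config("spark.mongodb.input.partitionerOptions.shardkey", "_id") // assumed shard key
                .getOrCreate();

        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
        System.out.println("partitions=" + rdd.getNumPartitions() + " docs=" + rdd.count());
        spark.stop();
    }
}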
I am using a cluster of 12 virtual machines, each of which has 16 GB of memory and 6 cores (except the master node, which has only 2 cores). Each worker node was assigned 12 GB of memory and 4 cores.
When I submit a Spark application to YARN, I set the number of executors to 10 (of the 12 machines, 1 serves as the cluster manager and 1 hosts the application master), and to maximize the parallelism of my application, most of my RDDs have 40 partitions, the same as the total number of executor cores.
The following is the problem I encountered: in some random stages, some tasks take far longer to process than others, which results in poor parallelism. As we can see in the first picture, executor 9 spent over 30 s on its tasks while other tasks finished within 1 s. Furthermore, the cause of the extra time also varies: sometimes it is just computation, but sometimes it is scheduler delay, deserialization or shuffle read. As we can see, the cause in the second picture is different from the one in the first.
My guess is that once a task gets assigned to a specific slot, there are not enough resources on the corresponding machine, so the JVM ends up waiting for CPU. Is my guess correct? And how should I configure my cluster to avoid this situation?
(First picture: time spent computing. Second picture: scheduler delay and deserialization.)
To get a specific answer you need to share more about what you're doing, but most likely the partitions you get in one or more of your stages are unbalanced, i.e. some are much bigger than others. The result is a slowdown, since each partition is handled by a single task, so the largest partitions dictate the stage's runtime. One way to solve it is to increase the number of partitions or to change the partitioning logic.
When a big task finishes, shipping its data to the downstream tasks takes longer as well, which is why other tasks may also take long.
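To make that concrete, here is a hedged sketch (Spark's Java API; the input path and partition count are placeholders) of checking how balanced the partitions actually are and rebalancing them before the slow stage:

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionBalance {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("partition-balance"));
        JavaRDD<String> data = sc.textFile("hdfs:///data/input"); // placeholder path

        // Count records per partition; a few much larger numbers indicate skew,
        // i.e. a handful of tasks doing most of the work while the rest sit idle.
        List<Integer> sizes = data.glom().map(List::size).collect();
        System.out.println("records per partition: " + sizes);

        // repartition() does a full shuffle into evenly sized partitions;
        // 40 matches the total number of executor cores mentioned above.
        JavaRDD<String> balanced = data.repartition(40);
        balanced.count(); // trigger the job

        sc.stop();
    }
}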
Below is the configuration:
2 JBoss application nodes
5 listeners on each application node, with 50 threads each; the listeners support clustering and are set up as active-active, so they run on both app nodes
The listener simply gets the message and logs the information into the database
50000 messages are posted into ActiveMQ using JMeter.
Here is the observation on first execution:
Total 50000 messages are consumed in approx 22 mins.
first 0-10000 messages consumed in 1 min approx
10000-20000 messages consumed in 2 mins approx
20000-30000 messages consumed in 4 mins approx
30000-40000 messages consumed in 6 mins approx
40000-50000 messages consumed in 8 mins
So we see that message consumption time increases as the number of messages grows.
Second execution without restarting any of the servers:
50000 messages consumed in 53 mins approx!
But after deleting the ActiveMQ data folder and restarting ActiveMQ, performance improves again, and then degrades as more data enters the queue!
I tried multiple configurations in activemq.xml, but with no success...
Has anybody faced a similar issue and found a solution? Let me know. Thanks.
I've seen similar slowdowns in our production systems when pending message counts go high. If you're flooding the queues then the MQ process can't keep all the pending messages in memory, and has to go to disk to serve a message. Performance can fall off a cliff in these circumstances. Increase the memory given to the MQ server process.
It also looks as though the disk storage layout is not particularly efficient - perhaps each message is stored as a file in a single directory? This can make access times rise as traversing the directory takes longer.
50000 messages in more than 20 minutes seems like very low throughput.
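For reference, a hedged sketch of raising those limits on an embedded broker (the values are illustrative only; the equivalent settings live under the systemUsage element in activemq.xml):

import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.usage.SystemUsage;

public class BrokerLimits {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.addConnector("tcp://0.0.0.0:61616");

        SystemUsage usage = broker.getSystemUsage();
        // Memory for in-flight/pending messages; too small a limit forces the broker
        // to page messages to disk, which is where performance falls off a cliff.
        usage.getMemoryUsage().setLimit(1024L * 1024 * 1024);      // 1 GB, illustrative
        usage.getStoreUsage().setLimit(10L * 1024 * 1024 * 1024);  // 10 GB persistent store
        usage.getTempUsage().setLimit(5L * 1024 * 1024 * 1024);    // 5 GB temp for non-persistent spooling

        broker.start();
    }
}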
The following configuration works well for me (these are just pointers; you may already have tried some of them, but see if they work for you).
1) Server and queue/topic policy entry
// server
server.setDedicatedTaskRunner(false);
// queue policy entry
policyEntry.setMemoryLimit(queueMemoryLimit); // 32mb
policyEntry.setOptimizedDispatch(true);
policyEntry.setLazyDispatch(true);
policyEntry.setReduceMemoryFootprint(true);
policyEntry.setProducerFlowControl(true);
policyEntry.setPendingQueuePolicy(new StorePendingQueueMessageStoragePolicy());
2) If you are using KahaDB for persistence, then use a per-destination adapter (MultiKahaDBPersistenceAdapter). This keeps the storage folders separate for each destination and reduces synchronization effort. Also, if you are not worried about abrupt server restarts (due to any technical reason), then you can reduce the disk sync effort with
kahaDBPersistenceAdapter.setEnableJournalDiskSyncs(false);
3) Try increasing the memory usage, temp and storage disk usage values at server level.
4) If possible increase prefetchSize in prefetch policy. This will improve performance but also increases the memory footprint of consumers.
5) If possible use transactions in consumers. This will help to reduce the message acknowledgement handling and disk sync efforts by server.
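For point 5, here is a minimal sketch of a transacted JMS consumer (the broker URL, queue name and batch handling are placeholders, and the actual database write is elided):

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class TransactedConsumer {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder URL
        Connection connection = factory.createConnection();
        connection.start();
        // A transacted session groups acknowledgements, so the broker syncs to disk
        // once per commit instead of once per message.
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer = session.createConsumer(session.createQueue("TEST.QUEUE")); // placeholder queue
        while (true) {
            Message msg = consumer.receive(5000);
            if (msg == null) break;
            // ... persist the message contents to the database here ...
            session.commit(); // in practice, commit every N messages for even less overhead
        }
        connection.close();
    }
}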
Point 5 mentioned by #hemant1900 solved the problem :) Thanks.
5) If possible use transactions in consumers. This will help to reduce the message acknowledgement handling and disk sync efforts by server.
The problem was in my code. I had not used a transaction to persist the data in the consumer, which is bad programming anyway... I know :(
But I didn't expect that it could have caused this issue.
Now 50000 messages are getting processed in less than 2 mins.
I have a cluster, and I execute wholeTextFiles, which should pull about a million text files that sum up to approximately 10 GB in total.
I have one NameNode and two DataNodes with 30 GB of RAM each, 4 cores each. The data is stored in HDFS.
I don't set any special parameters, and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?
I'm just starting out and have never needed to optimize a job before.
EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.
EDIT 2: benchmark assessment
So I tried repartitioning after the wholeTextFiles call, but the problem is the same, because the initial read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded the cluster performs really well... I get the following warning when dealing with the data (for 200k files) on the wholeTextFiles call:
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Could that be a reason for the bad performance? How do I get around it?
Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s...
It seems that by increasing the number of partitions in wholeTextFiles(path, partitions), I get better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to find the limit...
To summarize my recommendations from the comments:
HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you can have is limited (~100M blocks is a max for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really significant.
Default settings should always be reviewed. By default, Spark starts on YARN with 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512m of RAM (--executor-memory), giving you only 2 threads with 512 MB of RAM each, which is really small for real-world tasks.
So my recommendation is:
Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel
Use sc.wholeTextFiles to read the files and then dump them into a compressed SequenceFile (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration (a hedged sketch follows below)
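This is not the code from the linked post, but a hedged Java sketch of that second point, assuming the classic mapred output-format API and placeholder paths (SnappyCodec also assumes the Snappy native libraries are installed; swap in another codec otherwise):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PackSmallFiles {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("pack-small-files"));

        // (path, content) pairs; the second argument is a minimum-partitions hint,
        // so the read is split across more tasks than the default
        JavaPairRDD<String, String> files = sc.wholeTextFiles("hdfs:///raw/files", 64); // placeholder path

        // Configure block-level Snappy compression for the SequenceFile output
        JobConf conf = new JobConf(sc.hadoopConfiguration());
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);

        // Convert to Writables and write one compressed SequenceFile per partition
        files.mapToPair(kv -> new Tuple2<>(new Text(kv._1()), new Text(kv._2())))
             .saveAsHadoopFile("hdfs:///packed/files", Text.class, Text.class,
                               SequenceFileOutputFormat.class, conf);

        // Subsequent jobs read the packed data far faster than a million small files
        JavaPairRDD<String, String> reloaded = sc
                .sequenceFile("hdfs:///packed/files", Text.class, Text.class)
                .mapToPair(kv -> new Tuple2<>(kv._1().toString(), kv._2().toString())); // copy out of reused Writables
        System.out.println("reloaded files: " + reloaded.count());

        sc.stop();
    }
}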