NiFi PutDatabaseRecord to Postgres: improving performance when loading data into a Postgres database

We are trying to load data into Postgres from Oracle using NiFi.
We are using PutDatabaseRecord to load the data (which is in Avro format) and ExecuteSQL to extract it.
Extraction is very fast, but even though we are using 150+ threads for PutDatabaseRecord, it maintains an average of only 1 GB of writes per 5 minutes.
If we have 3 PutDatabaseRecord processors (say, one processor per table) with 50 threads each, the total is still about 1 GB per 5 minutes (e.g., 250 MB for the first processor, 350 MB for the second, and 400 MB for the third, or some other combination, but still roughly 1 GB overall).
We are not sure whether the write throughput is being limited on the Postgres side or on the NiFi side.
We need help deciding whether to change NiFi properties or Postgres settings to improve the data loading performance.
One observation: data extraction from Oracle is very fast, and we can see the NiFi queues filling up quickly while waiting to be processed by PutDatabaseRecord.

If you have a single NiFi instance, there is a limit on how much data you can push through regardless of the number of threads (once the thread count exceeds the number of cores on the machine). To increase throughput, you could set up a 3-5 node NiFi cluster and run the PutDatabaseRecord processors in parallel; then you should see 3-5 GB of throughput to Postgres (as long as PG can handle that).
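To tell whether the ceiling is on the Postgres side or the NiFi side, it can help to measure raw write throughput against the database from a plain client, outside of NiFi. Below is a minimal sketch using psycopg2 and COPY; the table name, columns, and connection settings are placeholders for your environment.

import io
import time
import psycopg2

# Placeholder connection details for the target Postgres instance.
conn = psycopg2.connect(host="pg-host", dbname="target_db",
                        user="loader", password="secret")
conn.autocommit = True

def time_copy(num_rows=1_000_000):
    # Build an in-memory CSV payload of synthetic rows.
    buf = io.StringIO()
    for i in range(num_rows):
        buf.write(f"{i},row-{i},2024-01-01\n")
    payload_mb = buf.tell() / (1024 * 1024)
    buf.seek(0)

    start = time.time()
    with conn.cursor() as cur:
        # COPY is roughly the upper bound of single-connection write speed.
        cur.copy_expert("COPY load_test(id, name, created_on) FROM STDIN WITH CSV", buf)
    elapsed = time.time() - start
    print(f"COPY: {payload_mb:.1f} MB in {elapsed:.1f} s ({payload_mb / elapsed:.1f} MB/s)")

time_copy()

If a single COPY connection already tops out near the same rate NiFi achieves, the limit is on the Postgres/storage side; if COPY is much faster, the bottleneck is more likely in NiFi (thread count vs. cores, or per-record insert overhead).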

Related

Is it possible to configure batch size for SingleStore Kafka pipeline?

I'm using SingleStore to load events from Kafka. I created a Kafka pipeline with the following script:
create pipeline `events_stream`
as load data kafka 'kafka-all-broker:29092/events_stream'
batch_interval 10000
max_partitions_per_batch 6
into procedure `proc_events_stream`
fields terminated by '\t' enclosed by '' escaped by '\\'
lines terminated by '\n' starting by '';
And SingleStore is failing with an OOM error like the following:
Memory used by MemSQL (4537.88 Mb) has reached the 'maximum_memory' setting (4915 Mb) on this node. Possible causes include (1) available query execution memory has been used up for table memory (in use table memory: 71.50 Mb) and (2) the query is large and complex and requires more query execution memory than is available
I'm quite confused about why 4 GB is not enough to read from Kafka in batches.
Is it possible to configure a batch size for the pipeline to avoid memory issues and make the pipeline more predictable?
Unfortunately, the current version of SingleStore's pipelines only has a global batch size; it cannot be set individually per pipeline or stored procedure.
However, in general, each pipeline batch has some overhead, so running 1 batch for 10,000 messages should be better in terms of total resources than 100 or even 10 batches for the same number of messages. If the stored procedure is relatively light, the delay you are experiencing is likely dominated by the network download of 500 MB.
It is worth checking whether those 10,000 large messages are arriving on the same partition or on several. In SingleStore's pipelines, each database partition downloads messages from a different Kafka partition, which helps parallelize the workload. However, if the messages all arrive on just one partition, you are not getting the benefits of parallel execution; a quick check is sketched below.
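As a rough way to check that distribution, the sketch below asks the brokers for the first and last offsets of each partition of the topic; the broker address and topic name are taken from the pipeline definition above and may need adjusting for your setup.

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="kafka-all-broker:29092")
topic = "events_stream"

partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
begin = consumer.beginning_offsets(partitions)
end = consumer.end_offsets(partitions)

for tp in partitions:
    # A heavily skewed count on one partition means the pipeline cannot
    # spread the download across database partitions.
    print(f"partition {tp.partition}: ~{end[tp] - begin[tp]} messages")

If nearly all of the volume sits on one Kafka partition, repartitioning the topic (or changing the producer's partitioning key) will do more for throughput than any batch-size tuning.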

Apache Nifi : Oracle To Mongodb data transfer

I want to transfer data from Oracle to MongoDB using Apache NiFi. Oracle has a total of 9 million records.
I have created a NiFi flow using the QueryDatabaseTable and PutMongoRecord processors. The flow is working, but it has some performance issues.
After starting the flow, the records queued between SplitJson and PutMongoRecord keep increasing.
Is there any way to slow down the rate at which SplitJson puts records into the queue?
OR
Increase the rate of insertion in PutMongoRecord?
Right now, 100k records are inserted in 30 minutes; how can this be sped up?
@Vishal. The solution you are looking for is to increase the concurrency of PutMongoRecord.
You can also experiment with the batch size setting in the processor's configuration tab.
You can also reduce the execution time of SplitJson. However, you should remember that this processor is going to take 1 flowfile and make a LOT of flowfiles regardless of the timing.
How much you can increase concurrency depends on how many NiFi nodes you have and how many CPU cores each node has. Be experimental and methodical here: move up in single increments (1, 2, 3, etc.) and test your flow at each increment. If you only have 1 node, you may not be able to tune the flow to your performance expectations; tune the flow instead for stability and as fast as you can get it, then consider scaling.
How much you can increase concurrency and batch size also depends on the MongoDB data source and the total number of connections you can get from NiFi to Mongo.
In addition to Steven's answer, there are two properties on QueryDatabaseTable that you should experiment with:
Max Results Per Flowfile
Use Avro Logical Types
With the latter, you might be able to do a direct shift from Oracle to MongoDB, because it will convert Oracle date types into Avro ones, and those should in turn be converted directly into proper Mongo date types. Max Results Per Flowfile should also let you specify appropriate batching without having to use the extra processors.
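On the Mongo side, the reason batch size matters can be seen with a small pymongo sketch: inserting documents one at a time pays a network round trip per document, while a batched insert sends thousands in one call. The connection string, database, and collection names below are placeholders.

import time
from pymongo import MongoClient

coll = MongoClient("mongodb://mongo-host:27017")["target_db"]["oracle_import"]
docs = [{"_id": i, "name": f"row-{i}"} for i in range(10_000)]

start = time.time()
for d in docs:
    coll.insert_one(d)                    # one round trip per document
print(f"insert_one loop: {time.time() - start:.1f} s")

coll.delete_many({})

start = time.time()
coll.insert_many(docs, ordered=False)     # one batched call
print(f"insert_many:     {time.time() - start:.1f} s")

PutMongoRecord with a larger batch size behaves more like the second case, which is why raising it (within what the Mongo server and its connection pool can absorb) usually raises the insert rate.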

Loading data to Postgres RDS is still slow after tuning parameters

We have created an RDS Postgres instance (m4.xlarge) with 200 GB of storage (Provisioned IOPS). We are trying to upload data from the company data mart to 23 tables in RDS using DataStage. However, the uploads are quite slow; it takes about 6 hours to load 400K records.
Then I started tuning the following parameters according to Best Practices for Working with PostgreSQL:
autovacuum 0
checkpoint_completion_target 0.9
checkpoint_timeout 3600
maintenance_work_mem {DBInstanceClassMemory/16384}
max_wal_size 3145728
synchronous_commit off
Other than these, I also turned off Multi-AZ and backups. SSL is still enabled, though I'm not sure whether that changes anything. After all the changes, there is still not much improvement. DataStage is already uploading data in parallel (~12 threads). Write IOPS is around 40/sec. Is this value normal? Is there anything else I can do to speed up the data transfer?
In PostgreSQL, you have to wait one full round trip (latency) for each insert statement, where the latency is measured between the database and the machine the data is being loaded from.
In AWS you have many options to improve performance.
For starters, you can copy your raw data onto an EC2 instance and import from there; however, you will likely not be able to use your DataStage tool unless it can be installed directly on the EC2 instance.
You can configure DataStage to use batch processing, where each insert statement actually contains many rows; generally, the more rows per statement, the faster (a sketch of the effect follows below).
Disable data compression and make sure you've done everything you can to minimize latency between the two endpoints.
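As a rough illustration of the batching advice (not a DataStage configuration), the sketch below uses psycopg2's execute_values to pack many rows into each INSERT, so the client pays the round-trip latency to RDS once per batch rather than once per row. The endpoint, table, and column names are placeholders.

import psycopg2
from psycopg2.extras import execute_values

# Placeholder RDS endpoint and credentials.
conn = psycopg2.connect(host="myinstance.xxxx.rds.amazonaws.com",
                        dbname="target_db", user="loader", password="secret")

rows = [(i, f"row-{i}") for i in range(400_000)]

with conn, conn.cursor() as cur:
    # page_size controls how many rows go into each generated INSERT statement.
    execute_values(cur,
                   "INSERT INTO staging_table (id, name) VALUES %s",
                   rows,
                   page_size=10_000)

With 400K rows and even 1-2 ms of round-trip latency, per-row inserts spend several minutes on latency alone, while 10,000-row batches reduce that to a few dozen round trips.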

Large number of connections to kdb

I have a grid with over 10,000 workers, and I'm using qpython to append data to kdb. Currently, with 1,000 workers, about 40 workers fail to connect and send data on the first try, and top shows the q process at 100% CPU when that happens. As I scale to 10k workers, the problem will escalate. The volume of data is only 100 MBs. I've tried running extra slaves, but kdb tells me I can't use them with the -P option, which I'm guessing I need in order to use qpython. Any ideas how to scale to support 10k workers? My current idea is to write a server in between that buffers write requests and passes them to kdb; is there a better solution?
It amazes me that you're willing to dedicate 10,000 CPUs to Python but only a single one to kdb.
Simply run more kdb cores (on other ports) and then enable another process to receive the updates from the ingestion cores. The tickerplant (u.q) is a good model for this; a sketch of how the workers might spread their writes is given below.
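One hedged sketch of the worker side, assuming each worker keeps using qpython: hash the worker id onto one of several ingestion ports and retry with backoff when a process is busy. The host, ports, table name, and the exact insert call are placeholders for however your workers currently append data.

import random
import time
import numpy
from qpython import qconnection

KDB_PORTS = [5010, 5011, 5012, 5013]   # one ingestion q process per port

def send_with_retry(worker_id, table, data, retries=5):
    port = KDB_PORTS[worker_id % len(KDB_PORTS)]
    for attempt in range(retries):
        q = qconnection.QConnection(host="kdb-host", port=port)
        try:
            q.open()
            try:
                # Placeholder append, mirroring the usual qpython pattern,
                # e.g. q.sendSync('insert', numpy.string_('trade'), rows).
                q.sendSync("insert", numpy.string_(table), data)
                return
            finally:
                q.close()
        except Exception:
            # Back off with jitter so retries from many workers don't line up.
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError(f"worker {worker_id} could not reach kdb on port {port}")

The ingestion processes can then forward their buffered updates to a single writer, following the tickerplant (u.q) pattern mentioned above.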

Zookeeper Overall data size limit

I am new to Zookeeper and am trying to understand whether it fits my use case.
I have 10 million hierarchical data items which I want to store in Zookeeper.
That is 10M key-value pairs, with the key and value being about 1 KB each.
So the total data size is approximately 20 GB (10M * 2 KB) without replication.
I know the zNode data size limit is 1 MB (which can be changed).
Questions:
Will Zookeeper be able to support 20 GB of data with no performance impact?
Is there a maximum size after which performance degrades?
Is there a limit on the total number of nodes?
Zookeeper is in no way suitable for this use case. Zookeeper keeps dumping/snapshotting the data tree periodically, which means it would be dumping the whole 20 GB of data every few minutes. Moreover, Zookeeper nodes in a cluster/ensemble are essentially replicas of each other, so the whole data set is replicated to every Zookeeper node; there is no data partitioning either. Zookeeper is not a database.
I guess for your use case it would be much better to go with a database or a distributed cache (Redis/Hazelcast, etc.); a sketch of how the data could map onto Redis follows below.
Anyway, there is no limit on the total number of nodes in Zookeeper.
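If Redis is the route taken, a minimal sketch of the alternative might look like the following: hierarchical paths become colon-separated key names, and a pipeline batches the writes so the 10M-entry load does not pay a round trip per key. The host and key layout are assumptions, not a prescribed schema.

import redis

r = redis.Redis(host="redis-host", port=6379)

def bulk_load(items, batch_size=5_000):
    # items yields (path, value) pairs, e.g. ("root/region/node-42", b"...~1KB...").
    pipe = r.pipeline(transaction=False)
    pending = 0
    for path, value in items:
        pipe.set(path.replace("/", ":"), value)   # stored as "root:region:node-42"
        pending += 1
        if pending >= batch_size:
            pipe.execute()
            pending = 0
    if pending:
        pipe.execute()

At ~2 KB per entry, 10M entries is on the order of 20 GB plus overhead, so the Redis (or Hazelcast) cluster would need to be sized and partitioned accordingly.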