I have a grid with over 10,000 workers, and I'm using qpython to append data to kdb. Currently, with 1,000 workers, about 40 of them fail to connect and send data on the first try; top shows q at 100% CPU when that happens. As I scale to 10k workers, the problem will escalate. The volume of data is only 100MBs. I've tried running extra slaves, but kdb tells me I can't use them with the -P option, which I'm guessing I need in order to use qpython. Any ideas on how to scale to support 10k workers? My current idea is to write a server in between that buffers write requests and passes them to kdb. Is there a better solution?
It amazes me that you're willing to dedicate 10,000 cpus to Python but only a single one to Kdb.
Simply run more Kdb processes (on other ports), and then have another process receive the updates from those ingestion processes. The tickerplant (u.q) is a good model for this.
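As a rough illustration of the client side of that suggestion, here is a minimal sketch of how each worker might choose which ingestion process to write to. The port numbers and the helper names `pick_port` and `load_per_port` are hypothetical, and the actual qpython connection is left out; the sketch only shows that a simple modulo on the worker id spreads connections evenly.

```python
# Hypothetical ports, one per q ingestion process (not taken from the thread).
INGEST_PORTS = [5010, 5011, 5012, 5013]


def pick_port(worker_id: int, ports=INGEST_PORTS) -> int:
    """Map a worker to one ingestion port so connections are spread evenly."""
    return ports[worker_id % len(ports)]


def load_per_port(n_workers: int, ports=INGEST_PORTS) -> dict:
    """Count how many workers end up on each port."""
    counts = {p: 0 for p in ports}
    for worker_id in range(n_workers):
        counts[pick_port(worker_id, ports)] += 1
    return counts


if __name__ == "__main__":
    # With 10,000 workers and 4 ports, each process serves 2,500 workers.
    print(load_per_port(10_000))
```

Each worker would then open its connection to the chosen port instead of a single shared one, and the downstream process collects from the ingestion processes.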
We are trying to load data from Oracle into Postgres using NiFi.
We use PutDatabaseRecord to load the data (which is in Avro format).
We use ExecuteSQL to extract the data, which is very fast, but we can see that,
even though we are using 150+ threads for PutDatabaseRecord, it sustains an average of only 1 GB of data writes per 5 minutes.
Suppose we have 3 PutDatabaseRecord processors (say, one processor per table), each with 50 threads: the total still averages 1 GB per 5 minutes (e.g., 250 MB for the first processor, 350 MB for the second and 400 MB for the third, or some other combination, but still 1 GB overall).
We are not sure whether the Postgres database is limiting the write rate or whether the limit is on the NiFi side.
We need help working out whether to change NiFi properties or Postgres settings to improve the data loading performance.
One observation: data extraction from Oracle is very fast, and we can see the NiFi queues filling very quickly and waiting to be processed by PutDatabaseRecord.
If you have a single NiFi instance, there is a limit on how much data you can push through regardless of the number of threads (once the number of threads reaches the number of cores on your machine). To increase throughput, you could set up a 3-5 node NiFi cluster and run the PutDatabaseRecord processors in parallel; you should then see 3-5 GB of throughput to Postgres (as long as PG can handle that).
I am running a ksqlDB streaming application that consists of a large number of queries (>60 queries), including many joins and aggregations. My data comes from various sources, and requires plenty of manipulation to produce the desired processed data, hence the large number of queries. I've run this set of queries on a single machine, using interactive mode, and it produces the right results. But I observe an increasing consumer lag when I increase the amount of data fed into the application.
I read on ksqlDB's Capacity Planning page that I can scale by adding more servers, which is what I plan to do.
Under Important Sizing Factors, it's also stated that "You should avoid running a large number of queries on one ksqlDB cluster. Instead, use interactive mode to play with your data and develop sets of queries that function together. Then, run these in their own headless cluster." However, I am unsure how to do this, because my queries all depend on each other.
Does anyone have any general recommendations on how to deploy a large number of interdependent ksql queries? As an added requirement, the data is refreshed each day and each day's data is independent, so I need to do some sort of refresh of the queries each day.
I think that's just a recommendation: if you can, group the queries that depend on each other, and then split those groups across servers running in headless mode.
Another way, if you use interactive mode, is to partition your topics and add more ksql servers to your cluster. This lets ksql split the workload across the cluster, with each server consuming and processing its share of the partitions. Say you have 4 partitions per topic and 2 servers: one server processes 2 partitions and the other server processes the other 2. This should decrease the workload on each server.
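To make the 4-partitions, 2-servers example concrete, here is a small sketch of an even split. It is only an illustration: the real assignment is decided by Kafka's consumer group protocol, not by this code, and the server names and the `assign_partitions` helper are hypothetical.

```python
def assign_partitions(num_partitions: int, servers: list) -> dict:
    """Spread partition ids over servers round-robin (illustration only)."""
    assignment = {server: [] for server in servers}
    for partition in range(num_partitions):
        owner = servers[partition % len(servers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # 4 partitions over 2 servers: each server gets 2 partitions.
    print(assign_partitions(4, ["ksql-1", "ksql-2"]))
    # Adding servers beyond the partition count leaves some of them idle.
    print(assign_partitions(4, ["ksql-1", "ksql-2", "ksql-3", "ksql-4", "ksql-5"]))
```

The second call shows why the partition count caps the useful number of servers: with 4 partitions, a fifth server has nothing to process.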
Another improvement is to reduce the number of Kafka Streams threads. Each query you create runs with 4 Kafka Streams threads by default. The more threads there are, the more work the server does in parallel; with a large number of queries, performance decreases and lag increases. Try 1 thread and see if that works. To configure it, set ksql.streams.num.stream.threads=1 in ksql-server.properties.
I am trying to write large amounts of data to Dynamo using AmazonDynamoDBAsyncClient, and I am trying to understand what the best practice for handling throttling is.
For example, I have a capacity of 3000 writes and at a given moment I have, let's say, 100,000 records I'd like to write. I don't need them all in immediately, but I am trying to figure out the best way to get them in.
This application is running in a distributed environment, so there may be 5 executors all trying to do this at the same time. Would the best way to handle this be the approach below, where I sleep the write process if we hit the throttle? Or should I be doing something to avoid the throttle completely? In fact, is my code even doing what I think it is, which is retrying the data after waiting a second?
try {
  amazonDynamoAsyncDb.updateItemAsync(updateRequest)
} catch {
  case e: ThrottlingException => {
    Thread.sleep(1000)
    // retry here, but how?
  }
}
The AWS SDK for Java will retry throttled requests 10 times by default before throwing a ProvisionedThroughputExceededException. If your items are small (1 KB or less) and you are performing the writes from EC2 in the same region as your table, you can assume each write will take around 10 ms. That means each processing thread can do about 100 writes per second. To scale your writes to 3000 writes per second, you would need 30 threads and 30 HTTP connections. 3000 small (1 KB) writes per second translates to a data throughput of about 2.93 MB per second. Thus, for this write load, it does not appear that EC2 hardware could become a bottleneck. I recommend you do some measurements to figure out how long it takes to write each of your items on average, and scale your threads and HTTP connections appropriately.
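The sizing arithmetic above can be written down as a short sketch so it is easy to redo with measured numbers. The function names are hypothetical, and the 10 ms latency and 1 KB item size are the answer's assumptions rather than measurements.

```python
import math


def threads_needed(target_writes_per_s: float, latency_ms: float) -> int:
    """Threads (and HTTP connections) needed if each write blocks for latency_ms."""
    return math.ceil(target_writes_per_s * latency_ms / 1000.0)


def throughput_mb_per_s(target_writes_per_s: float, item_kb: float) -> float:
    """Payload throughput in MB/s, using 1 MB = 1024 KB."""
    return target_writes_per_s * item_kb / 1024.0


def seconds_to_drain(records: int, writes_per_s: float) -> float:
    """Time to write a backlog when running at the full provisioned rate."""
    return records / writes_per_s


if __name__ == "__main__":
    print(threads_needed(3000, 10))          # threads for 3000 writes/s at 10 ms each
    print(throughput_mb_per_s(3000, 1))      # MB/s for 1 KB items
    print(seconds_to_drain(100_000, 3000))   # seconds for the 100,000-record backlog
```

If the measured latency turns out to be, say, 25 ms rather than 10 ms, the same function gives the larger thread count needed to reach the provisioned rate.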
I have a general question regarding Apache Spark and how to distribute data from driver to executors.
I load a file with 'scala.io.Source' into a collection. Then I parallelize the collection with 'SparkContext.parallelize'. Here the issue begins: when I don't specify the number of partitions, the number of workers is used as the partition count, the task is sent to the nodes, and I get a warning that the recommended task size is 100kB while my task size is e.g. 15MB (60MB file / 4 nodes). The computation then ends with an 'OutOfMemory' exception on the nodes. When I parallelize to more partitions (e.g. 600 partitions, to get 100kB per task), the computations are performed successfully on the workers, but an 'OutOfMemory' exception is raised after some time in the driver. In this case, I can open the Spark UI and observe how the memory of the driver is slowly consumed during the computation. It looks like the driver holds everything in memory and doesn't store the intermediate results on disk.
My questions are:
Into how many partitions to divide RDD?
How to distribute data 'the right way'?
How to prevent memory exceptions?
Is there a way to tell the driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Thanks
How to distribute data 'the right way'?
You will need a distributed file system, such as HDFS, to host your file. That way, each worker can read a piece of the file in parallel. This will deliver better performance than serializing and sending the data from the driver.
How to prevent memory exceptions?
Hard to say without looking at the code. Most operations will spill to disk. If I had to guess, I'd say you are using groupByKey?
Into how many partitions to divide RDD?
I think the rule of thumb (for optimal parallelism) is 2-4x the number of cores available for your job. As you have done, you can trade time for memory usage.
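The two ways of choosing a partition count that appear in this thread, the cores-based rule of thumb and the task-size target from the question, can be compared with a small sketch. The helper names and the 16-core figure are hypothetical; the 60 MB and 100 kB values come from the question, taken here as decimal units.

```python
import math


def partitions_for_cores(total_cores: int, factor: int = 3) -> int:
    """Rule of thumb: 2-4x the cores available to the job (factor picks the multiple)."""
    return total_cores * factor


def partitions_for_task_size(data_bytes: int, target_task_bytes: int) -> int:
    """Partitions needed so that each task carries at most target_task_bytes."""
    return math.ceil(data_bytes / target_task_bytes)


if __name__ == "__main__":
    # Hypothetical cluster with 16 cores in total.
    print(partitions_for_cores(16, factor=2), partitions_for_cores(16, factor=4))
    # The question's case: a 60 MB file at roughly 100 kB per task.
    print(partitions_for_task_size(60_000_000, 100_000))
```

The gap between the two results is one way to see why shipping a 60 MB collection from the driver is awkward: the task-size target pushes the partition count far above what the core count alone would suggest.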
Is there a way to tell the driver/worker to swap? Is it a configuration option or does it have to be done 'manually' in program code?
Shuffle spill behavior is controlled by the property spark.shuffle.spill. It's true (=spill to disk) by default.
Can I use MSMQ to reduce the number of synchronous write operations to a database and instead have the records written to the database every X number of minutes?
You can't reduce the number of write operations by queuing them, but you can use a message queue to cluster the writes together.
That might be a bit more efficient (by dint of sharing a single connection), and could also let you schedule the writes at a convenient time if you wanted to ('every X minutes' wouldn't do that, but you could perform the writes during low usage times).
The increased complexity of that arrangement will normally outweigh the benefits - what do you really want to achieve?
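For illustration, the batching arrangement described in this answer might look like the sketch below. It is not MSMQ code: Python's standard-library `queue.Queue` stands in for the message queue and an in-memory `sqlite3` database stands in for the real one, and the table name `events` and the helpers `drain` and `flush` are hypothetical. The number of rows written is unchanged; only the number of connections and transactions goes down.

```python
import queue
import sqlite3


def drain(pending: queue.Queue, max_items: int = 1000) -> list:
    """Pull up to max_items waiting records off the queue without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(pending.get_nowait())
        except queue.Empty:
            break
    return batch


def flush(conn: sqlite3.Connection, pending: queue.Queue) -> int:
    """Write the queued records in one transaction; return how many were written."""
    batch = drain(pending)
    if batch:
        with conn:  # commits on success, rolls back if the insert fails
            conn.executemany(
                "INSERT INTO events (payload) VALUES (?)",
                [(record,) for record in batch],
            )
    return len(batch)


# Usage: producers enqueue records; a scheduled job calls flush() every X minutes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")
pending = queue.Queue()
for i in range(5):
    pending.put(f"record-{i}")
written = flush(conn, pending)
```

A scheduler (or a simple timer loop) would call `flush` at the chosen interval; whether that is worth the added moving parts is exactly the question the answer raises.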