How do I increase the number of workers in Druid while using it through Imply?

I'm running Druid through Imply's setup and I want to increase the number of Druid workers, but I don't know exactly where in Imply's configuration I should make the change. Can anybody please help me with this?

I finally found where the number of workers for Druid can be configured in Imply's setup. It is covered in Druid's documentation, but the documentation is hard for newcomers to follow.
The configuration file is located at:
imply-directory/conf-quickstart/druid/middleManager/runtime.properties
We have to add a new property called druid.worker.capacity, which specifies the number of workers for Druid:
druid.worker.capacity=3
For instance, the line above instructs Druid to run 3 workers.
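For reference, a minimal sketch of how that file might look. The druid.service line is typical of the quickstart configuration and shown only for context; per the Druid docs, the default capacity is the number of CPU cores minus one.
# conf-quickstart/druid/middleManager/runtime.properties (excerpt)
druid.service=druid/middleManager
# number of task slots (workers) this MiddleManager offers
druid.worker.capacity=3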

Related

How to estimate RAM and CPU per Kubernetes Pod for a Spring Batch processing job?

I'm trying to estimate hardware resources for a Kubernetes cluster to handle the following scenarios:
On a daily basis I need to read 46.3 million XML messages of roughly 10 KB each from a queue and then insert them into a Spark instance and a Sybase DB instance. I need to come up with an estimate of how many pods I will need to process this amount of data, and how much RAM and how many vCPUs will be required per pod, in order to determine the characteristics of the cluster's nodes. The reason behind all this is that we have some budget restrictions and we need an idea of the sizing before starting the corresponding development.
The second scenario is the same as the one already described but 18.65 times bigger, i.e. 833.33 million XML messages per day. This is expected to be the case within a couple of years.
So far we plan to use Spring Batch with partitioned steps. I need guidance on how to determine the ideal Spring Batch configuration, the required RAM and CPU per pod, and the number of pods.
I will greatly appreciate any comments from your side.
Thanks in advance.
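As a rough starting point, here is a back-of-envelope throughput sketch for the figures quoted above, assuming uniform arrival over 24 hours and about 10 KB per message (real peak rates will be higher):
public class ThroughputEstimate {
    public static void main(String[] args) {
        long messagesPerDay = 46_300_000L;      // scenario 1; use 833_330_000L for scenario 2
        long messageSizeBytes = 10 * 1024L;     // ~10 KB per XML message
        double messagesPerSecond = messagesPerDay / 86_400.0;
        double gbPerDay = messagesPerDay * messageSizeBytes / 1e9;
        // prints roughly "536 msg/s average, 474 GB/day" for scenario 1
        System.out.printf("%.0f msg/s average, %.0f GB/day%n", messagesPerSecond, gbPerDay);
    }
}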

Kafka on multiple EC2 instances

I am new to Kafka and trying to do a project. I wanted to set it up the way it would be done in a real-life deployment, but I am somewhat confused. While searching the internet I found that if I want 3 brokers and 3 ZooKeeper nodes, to provide a replication factor of 2 and a quorum, I need 6 EC2 instances. I've been looking through YouTube for examples, but as far as I can see all of them run multiple brokers on one machine. From my understanding it's better to keep the ZooKeeper nodes and the brokers on separate VMs, so if one goes down I still have all the rest. Can you confirm that?
Also, I'm wondering how to set partitioning. Is it important to decide when creating a topic, or can I change it later when I need to scale?
Thanks in advance.
So far I've only been looking for information on YouTube and Google.
My suggestion would be to use MSK Serverless and forget how many machines exist.
Kafka 3.3.1 doesn't need ZooKeeper anymore. ZooKeeper doesn't need to run on separate machines (although that is recommended), and you can also run multiple brokers on one server, so I'm not sure why you would need 6 instances for a replication factor of 2.
Regarding partitions: yes, create them ahead of time (over-provision, if necessary), since you cannot easily move data across partitions when you do scale.
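If it helps, here is a minimal sketch of creating a topic with an explicit partition count and replication factor via Kafka's Java AdminClient; the broker address, topic name, and numbers are placeholders:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            // over-provision partitions up front; replication factor 2 as discussed above
            NewTopic topic = new NewTopic("my-topic", 12, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}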

Lettuce Redis client doesn't balance load between slave servers

I'm setting up a master/slave Redis topology using the Lettuce client. My readPreference is slave_preferred and the topology has three slaves and one master.
The issue I'm experiencing is that once the StatefulRedisMasterSlaveConnection is established, all queries go to the same slave instead of balancing the load across all available slaves.
I have also tried adding a commons-pool2 connection pool as per the documentation, but the behaviour seems to be the same.
I have also tried using a static topology discovery as well as a dynamic one.
Is there a way to balance the load between slaves and not have all queries go to the same slave?
Thank you
The short answer is no.
The longer answer is:
Lettuce selects a replica per slot hash and uses the selected replica for read operations until the next topology refresh. This is by design to reduce CPU overhead.
You might want to follow ticket #834, which is about adding load-balancing/round-robin capabilities across read replicas.
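For context, here is a sketch of the kind of setup the question describes, using the Lettuce 5.x master/slave API (the URI and key are placeholders); even with SLAVE_PREFERRED, reads stick to the selected replica until the next topology refresh:
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

public class ReplicaReadSketch {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create();
        StatefulRedisMasterSlaveConnection<String, String> connection =
                MasterSlave.connect(client, StringCodec.UTF8, RedisURI.create("redis://localhost:6379"));
        connection.setReadFrom(ReadFrom.SLAVE_PREFERRED); // prefer replicas for reads
        System.out.println(connection.sync().get("some-key")); // served by the one selected replica
        connection.close();
        client.shutdown();
    }
}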

Mongo Spark connector write issues

We're observing a significant increase in write duration, which eventually results in timeouts.
We're using a replica-set-based MongoDB cluster.
It only happens on the peak days of the week, due to high volume.
We've tried deploying additional nodes, but it hasn't helped.
Attaching the screenshots.
We're using the Mongo Spark connector 2.2.1 on Databricks with Apache Spark 2.2.1.
Any recommendations to optimise write speed will be truly appreciated.
How many workers are there? Please check the DAG and executor metrics for the job. If all writes are happening from a single executor, try repartitioning the dataset based on the number of executors:
MongoSpark.save(dataset.repartition(50), writeConf);
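Building on that line, here is a hedged sketch that derives the partition count from the cluster's configured parallelism instead of hard-coding 50; the method name and structure are illustrative, not from the original job:
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.config.WriteConfig;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MongoWriteSketch {
    // Repartition so writes are spread across the executors that are actually available.
    static void save(SparkSession spark, Dataset<Row> dataset) {
        int parallelism = spark.sparkContext().defaultParallelism(); // roughly executors * cores per executor
        WriteConfig writeConf = WriteConfig.create(spark);           // picks up spark.mongodb.output.* settings
        MongoSpark.save(dataset.repartition(parallelism), writeConf);
    }
}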

Zookeeper for Data Storage?

I want an external config store for some of my services, and the data can be in formats like JSON, YAML, or XML. My use case is that I can save my configs, change them dynamically, and read them very frequently. Is ZooKeeper a good solution for this? Also, my configs add up to at most 500 MB.
ZooKeeper is under consideration because it has synchronization, versioning (I will be changing configs a lot), and can notify the depending services of changes to a config. Please tell me whether ZooKeeper can serve as a data store and is a good fit for this use case, and suggest alternatives if possible.
ZooKeeper may be used as a data store, but:
The size of a single node should not exceed 1 MB.
Fetching a huge number of nodes from ZooKeeper takes time, so you need to use caches. You can use the Curator PathChildrenCache recipe. If you have a tree structure in your zNodes you can use TreeCache, but be aware that TreeCache had memory leaks in various 2.x versions of Curator.
ZooKeeper notifications are a nice feature, but if you have a pretty big cluster you might end up with too many watchers, which puts stress on your ZooKeeper cluster.
Please find more information about ZooKeeper failure reasons.
So generally speaking, ZooKeeper can be used as a data store if the data is organized as key/value pairs and a value doesn't exceed 1 MB. To get fast access to the data you should use caches on the application side: see the Curator PathChildrenCache recipe.
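As a rough illustration of the PathChildrenCache approach (the connection string and the /configs path are placeholders):
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.framework.recipes.cache.PathChildrenCacheEvent;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConfigCacheSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Caches the children of /configs locally and delivers change notifications.
        PathChildrenCache cache = new PathChildrenCache(client, "/configs", true);
        cache.getListenable().addListener((c, event) -> {
            if (event.getType() == PathChildrenCacheEvent.Type.CHILD_UPDATED) {
                System.out.println("Config changed: " + event.getData().getPath());
            }
        });
        cache.start();
    }
}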
Alternatives are etcd and Consul.