Druid Tuning Configuration - druid

iam beginner using druid and kafka.
I want to create interactive data realtime kafka-druid. I still confuse what tuning should i change of this configuration?
Thank in advance

It depends:
on your size of kafka's messages
on your producer's capacity and the rate of messages produced to Kafka
on your Kafka's servers, partitioning and lots of other factors.
on your druid's deployment model (single or cluster)
and ....
But the most important thing I see missing here is your task count, which means the amount of parallel processing (and since yours is Kafka, it means parallel consumers). Increase it and make sure your Druid host (or your middle manager host if it is a cluster) has the adequate cores for your tasks. And make sure in your middle-manager you have increased the total number of available tasks:
druid.worker.capacity

Related

Running a single kafka s3 sink connector in standalone vs distributed mode

I have a kafka topic "mytopic" with 10 partitions and want to use S3 sink connector to sink records to an S3 bucket. For scaling purposes it should be running on multiple nodes to write partitions data in parallel to the same S3 bucket.
In Kafka connect user guide and actually many other blogs/tutorials it's recommended to run workers in distributed mode instead of standalone to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is following:
If I run in distributed mode, I will end up having only 1 worker processing all the partitions, since it's considered one connector task.
Instead I should run in standalone mode in multiple nodes. In that case I will have a consumer group and achieve parallel processing of partitions.
In above described standalone scenario I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and other standalone workers will handle the freed partitions.
Is my understaning correct or am I missing something?
Unfortunately I couldn't find much information on this topic other than this google groups discussion, where the author came to the same conclusion as I did.
In theory, that might work, but you'll end up ssh-ing to multiple machines, having basically the same config files, and just not using the connect-distributed command instead of connect-standalone.
You're missing the part about Connect server task rebalancing, though, which communicates over the Connect server REST ports
The underlying task code is all the same, only the entrypoint and offset storage are different. So, why not just use distributed if you have multiple machines?
You don't need to run, multiple instances of standalone processes, the Kafka workers are taking care of distributing the tasks, rebalancing, offset management under the distributed mode, you need to specify the same group id ...

What is the StreamSets architecture?

I am not very clear about the architecture even after going through tutorials. How do we scale streamset in a distributed environment? Let's say, our input data velocity increases from origin then how to ensure that SDC doesn't give performance issues? How many daemons will be running? Will it be Master worker architecture or peer to peer architecture?
If there are multiple daemons running on multiple machines (e.g. one sdc along with one NodeManager in YARN) then how it will show centralized view of data i.e. total record count etc.?
Also please do let me know architecture of Dataflow performance manager. Which all daemons are there in this product?
StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases, this can be done automatically, for example Cluster Batch mode runs SDC as a MapReduce job on the Hadoop / MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application to run as many pipeline instances as there are Kafka partitions.
In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.

Scaling Kafka stream application across multiple users

I have a setup where I'm pushing events to kafka and then running a Kafka Streams application on the same cluster. Is it fair to say that the only way to scale the Kafka Streams application is to scale the kafka cluster itself by adding nodes or increasing Partitions?
In that case, how do I ensure that my consumers will not bring down the cluster and ensure that the critical pipelines are always "on". Is there any concept of Topology Priority which can avoid a possible downtime? I want to be able to expose the streams for anyone to build applications on without compromising the core pipelines. If the solution is to setup another kafka cluster, does it make more sense to use Apache storm instead, for all the adhoc queries? (I understand that a lot of consumers could still cause issues with the kafka cluster, but at least the topology processing is isolated now)
It is not recommended to run your Streams application on the same servers as your brokers (even if this is technically possible). Kafka's Streams API offers an application-based approach -- not a cluster-based approach -- because it's a library and not a framework.
It is not required to scale your Kafka cluster to scale your Streams application. In general, the parallelism of a Streams application is limited by the number of partitions of your app's input topics. It is recommended to over-partition your topic (the overhead for this is rather small) to guard against scaling limitations.
Thus, it is even simpler to "offer anyone to build applications" as everyone owns their application. There is no need to submit apps to a cluster. They can be executed anywhere you like (thus, each team can deploy their Streams application the same way by which they deploy any other application they have). Thus, you have many deployment options from a WAR file, over YARN/Mesos, to containers (like Kubernetes). Whatever works best for you.
Even if frameworks like Flink, Storm, or Samza offer cluster management, you can only use such tools that are integrated with those frameworks (for example, Samza requires YARN -- no other options available). Let's say you have already a Mesos setup, you can reuse it for your Kafka Streams applications -- no need for a dedicated "Kafka Streams cluster" (because there is no such thing).
An application’s processor topology is scaled by breaking it into
multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based
on the input stream partitions for the application, with each task
assigned a list of partitions from the input streams (i.e., Kafka
topics).
The assignment of partitions to tasks never changes so that each task
is a fixed unit of parallelism of the application. Tasks can then
instantiate their own processor topology based on the assigned
partitions; they also maintain a buffer for each of its assigned
partitions and process messages one-at-a-time from these record
buffers.
As a result stream tasks can be processed independently and in
parallel without manual intervention.
It is important to understand that Kafka Streams is not a resource
manager, but a library that “runs” anywhere its stream processing
application runs. Multiple instances of the application are executed
either on the same machine, or spread across multiple machines and
tasks can be distributed automatically by the library to those running
application instances.
The assignment of partitions to tasks never changes; if an application
instance fails, all its assigned tasks will be restarted on other
instances and continue to consume from the same stream partitions.
The processing of the stream happens in the machines where the application is running.
I recommend you to have a look to this guide, it can help you to better understand the way Kafka Streams work.

Kafka recommended system configuration

I'm expecting our influx in to Kafka to raise to around 2 TB/day over a period of time. I'm planning to setup a Kafka cluster with 2 brokers (each running on separate system). What is the recommended hardware configuration for handling 2 TB/day ?
To use as a base you could look here: https://docs.confluent.io/4.1.1/installation/system-requirements.html#hardware
You need to know the amount of messages you get per second/hour because this will determine the size of your cluster. For HD, it's not necessary to get SSD because the system will use RAM to store the data first. Still you could need quite speed hard disk to ensure that the flushing process of the queue will not slow your system.
I would also recommend to use 3 kafka broker and 3 or 4 zookeeper server too.

How to scale single node Kafka to multiple node cluster?

I am going to install Kafka for company messaging. The plan is to first install the kafka on a single huge machine and scale it to 4-5 machines (a cluster) later if needed.
I have little experience about kafka. Want to ask whether it is possible to scale by just changing the parameter in broker configuration and install zookeeper on newly joined machine.
Or how can I roughly do this in the easiest way ? More specifically Cloudera Kafka in CDH.
Thanks
To scale Kafka you will have to add more partitions to topics if needed to using kafka-topics.sh. And than reassign partitions to your new brokers using kafka-reassign-partitions.sh.
The reassign utility will replicate and dispatch your data automatically. You can do it for a whole topic or for a selective set of partitions.
The complete documentation is here. Just take a look at section 6.