What is the minimum server composition for HBase?

What is the minimum server composition for HBase?
Fully distributed, using sharding, but not using Hadoop.
It's for a production environment.
I'm hoping for an answer laid out like this:
Server 1: Zookeeper
Server 2: Region server
... and more
Thank you.

The minimum is one; see pseudo-distributed mode. The moving parts involved are:
Assuming that you are running on HDFS (which you should be doing):
1 HDFS NameNode
1 or more HDFS Secondary NameNode(s)
1 or more HDFS DataNode(s)
For MapReduce (if you want it):
1 MapReduce JobTracker
1 or more MapReduce TaskTracker(s) (Usually same machines as datanodes)
For HBase itself:
1 or more HBase Master(s) (Hot backups are a good idea)
1 or more HBase RegionServer(s) (Usually same machines as datanodes)
1 or more Thrift Servers (if you need to access HBase from outside the network it is on)
For ZooKeeper:
3-5 ZooKeeper nodes
The number of machines you need really depends on how much reliability you require in the face of hardware failure, and for which kinds of nodes. The only node of the above that does not (yet) support hot failover or other recovery in the face of hardware failure is the HDFS NameNode, though that is being fixed in more recent Hadoop releases.
You typically want to set the HDFS replication factor to 3, so that you can take advantage of rack awareness.
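The replication factor itself is an HDFS-side setting; a minimal hdfs-site.xml sketch (dfs.replication is the standard HDFS property, and 3 is also its default, shown here only for explicitness):

<!-- hdfs-site.xml (sketch) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>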
So after that long diatribe, I'd suggest at a minimum (for a production deployment):
1x HDFS NameNode
1x JobTracker / Secondary NameNode
3x ZK Nodes
3x DataNode / RegionServer nodes (And if you want to run MapReduce, TaskTracker)
1x Thrift Server (Only if accessing HBase from outside of the network it is running on)
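Wiring this together on the HBase side mostly means pointing HBase at HDFS and at the ZooKeeper ensemble. A minimal hbase-site.xml sketch, assuming hypothetical hostnames like namenode.example.com and zk1.example.com through zk3.example.com:

<!-- hbase-site.xml (sketch) -->
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.com:8020/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>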

Related

How many ZooKeeper servers do we need in order to support 18 Kafka machines?

We have 6 Kafka machines (physical machines, Dell hardware)
and 3 ZooKeeper servers.
We want to add 12 Kafka machines to the cluster.
In that case, how many ZooKeeper servers should there be
in order to support 18 Kafka machines?
Well, your question was tagged with Hadoop, but for Kafka alone, 3 will "work", while 5-7 is "better".
But these should be dedicated ZooKeeper servers for Kafka, not shared with Hadoop services such as the NameNode, Hive, HBase, etc., especially at the level of 30+ Hadoop servers. This is because ZooKeeper is very latency-sensitive and needs plenty of memory to handle these workloads.
This can be done in Ambari with host-specific configs, by not letting Ambari use its templates to populate the single ZooKeeper quorum that it tracks (this is painful enough to hunt down in every service that it may be worth skipping Ambari for configs entirely in favor of Puppet or Ansible, etc., but I digress).
Keep in mind that only a third of your cluster (the original 6 brokers) will hold data at first: adding brokers will not move existing data or cause replicas for existing topics to be assigned to the new brokers.
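For reference, growing from 3 to 5 ZooKeeper nodes is mostly a zoo.cfg change: every host in the ensemble lists all five servers (hostnames here are hypothetical; 2888/3888 are ZooKeeper's standard quorum and leader-election ports), and each host also needs a matching myid file in its dataDir.

# zoo.cfg (sketch)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888
server.5=zk5.example.com:2888:3888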

What is the StreamSets architecture?

I am not very clear about the architecture even after going through the tutorials. How does StreamSets scale in a distributed environment? Let's say the input data velocity from the origin increases: how do we ensure that SDC doesn't run into performance issues? How many daemons will be running? Is it a master/worker architecture or a peer-to-peer architecture?
If there are multiple daemons running on multiple machines (e.g. one SDC along with one NodeManager in YARN), how will it show a centralized view of the data, i.e. total record count etc.?
Also, please explain the architecture of Dataflow Performance Manager. Which daemons are part of that product?
StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases this can be done automatically: for example, Cluster Batch mode runs SDC as a MapReduce job on the Hadoop / MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application, running as many pipeline instances as there are Kafka partitions.
In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.

ZooKeeper on the same node as Kafka?

I am setting up a Kafka + ZooKeeper cluster. Let's say I want 3 Kafka brokers. I am wondering if I can set up 3 machines with Kafka on them and then run the ZooKeeper cluster on the same nodes, so that each machine runs a Kafka broker plus a ZooKeeper node, instead of having 3 machines for Kafka and 3 machines for ZooKeeper (6 in total).
What are the advantages and disadvantages? These machines will most probably be dedicated to running Kafka/ZooKeeper. I am wondering whether I can reduce costs a bit without sacrificing performance.
We have been running ZooKeeper and Kafka brokers on the same nodes in a production environment for years without any problems. The cluster runs at very high QPS and IO traffic, so I'd say our experience suits most scenarios.
The advantage is quite simple: saving machines. Kafka brokers are IO-intensive, while ZooKeeper nodes don't cost much disk IO or CPU, so they won't disturb each other on most occasions.
But do remember to keep watching your CPU and IO usage (not only disk but also network), and increase cluster capacity before they hit a bottleneck.
I don't see any disadvantages, because we have very good cluster capacity planning.
It makes sense to colocate them when the Kafka cluster is small, 3-5 nodes. But keep in mind that this is a colocation of two applications that are both sensitive to disk I/O. The workloads, and how chatty they are with the local ZooKeepers, also play an important role here, especially from a page-cache memory-usage perspective.
Once the Kafka cluster grows to a dozen or more nodes, colocating a ZooKeeper on each node accordingly will create quorum overhead (such as slower writes and more nodes in quorum checks), so a separate ZooKeeper cluster has to be put in place.
Overall, if the Kafka cluster's usage is low from the start and you want to save some costs, it is reasonable to start colocated, but have a migration strategy for setting up a separate ZooKeeper cluster so you are not caught off guard once the Kafka cluster has to be scaled horizontally.
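As a sketch of the colocated 3-machine layout (hostnames are hypothetical): each machine runs one ZooKeeper node and one broker, and every broker lists the full local ensemble in its server.properties.

# server.properties on each broker (sketch)
broker.id=1    # use 2 and 3 on the other two machines
log.dirs=/var/lib/kafka
zookeeper.connect=node1.example.com:2181,node2.example.com:2181,node3.example.com:2181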

Kafka recommended system configuration

I'm expecting our influx into Kafka to rise to around 2 TB/day over a period of time. I'm planning to set up a Kafka cluster with 2 brokers (each running on a separate system). What is the recommended hardware configuration for handling 2 TB/day?
To use as a base you could look here: https://docs.confluent.io/4.1.1/installation/system-requirements.html#hardware
You need to know the number of messages you get per second/hour, because this will determine the size of your cluster. For disks, SSDs are not strictly necessary, because the system buffers data in RAM (the page cache) first. Still, you may need fairly fast disks to ensure that flushing the log to disk does not slow your system down.
I would also recommend using 3 Kafka brokers, and 3 or 5 ZooKeeper servers (an odd number, so the ensemble keeps a clear majority).
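As a back-of-the-envelope sanity check (assuming a perfectly uniform arrival rate, which real traffic rarely is, so size for peaks):

2 TB/day ≈ 2,000,000 MB / 86,400 s ≈ 23 MB/s of average ingest
with a replication factor of 3 ≈ 70 MB/s of cluster-wide write traffic

That average is well within a single disk's sequential write throughput, which is why the message rate and burstiness usually matter more than the raw daily volume.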

Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra

I guess I'm not yet fully understanding how Spark works.
Here is my setup:
I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.
I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
The 3-node Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:
Machine 1        Machine 2         Machine 3         Machine 4
Spark Master     Spark Worker      Spark Worker      Spark Worker
(no Cassandra)   Cassandra node    Cassandra node    Cassandra node
The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:
a Driver instance is started on the Spark Master
the Driver starts one Executor on each Spark Worker
the Driver distributes my application to each Executor
my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042
However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).
What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.
Or, does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and instructs the Executors to read "their" part of the data?
If someone could enlighten me, that would be much appreciated.
The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. That includes creating logical and physical plans and applying optimizations. To be able to do that, it has to have access to the data source schema and possibly other information, such as statistics. Implementation details vary from source to source, but generally speaking it means that the data source should be accessible from all nodes, including the application master.
At the end of the day your expectations are almost correct: chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
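To illustrate, here is a minimal sketch using the DataStax spark-cassandra-connector (the keyspace and table names are hypothetical). The practical consequence is that spark.cassandra.connection.host must name a contact point reachable from the driver as well as from the executors, which is exactly why 127.0.0.1 fails when the driver host runs no Cassandra node:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-read-sketch")
      // Must be reachable from the driver too: the driver connects here
      // to fetch schema and token-range metadata before scheduling tasks.
      .set("spark.cassandra.connection.host", "machine-2")

    val sc = new SparkContext(conf)

    // Executors read their token ranges directly from Cassandra;
    // only metadata (and this final count) flows through the driver.
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    println(s"rows: ${rows.count()}")

    sc.stop()
  }
}

With a reachable contact point, the connector can still honor your data-locality goal: it preferentially schedules each token range's task on the worker whose local Cassandra node owns that range.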