How to list zookeeper znodes in a reliable way? - apache-zookeeper

I would like to list all of my subnodes, say: ls /mynode
Unfortunately, the above command doesn't work when there is a massive number of subnodes. The reason is the buffer limit. Even if we raise it via jute.maxbuffer, we can still hit the new limit.
So what can we do if we want to list all of our nodes?
1. Does ZK support paging? No.
2. Does ZK support wildcards? No.
3. Does ZK support filtering? No.
What is the solution?

Related

Kafka on multiple instances of ec2

I am new to Kafka and trying to do a project. I wanted to do it as it would be done in real life, but I am kinda confused. While searching through the internet I found that if I want to have 3 brokers and 3 zookeepers, to provide replication factor = 2 and quorum, I need 6 EC2 instances. I am looking through YouTube to find some examples, but as far as I can see all of them show multiple brokers on one cluster. From my understanding it's better to keep the ZKs and all brokers separate, each on its own VM, so if one goes down I still have all of the rest. Can you confirm that?
Also, I'm wondering how to set partitioning. Is it important when first creating a topic, or can I change that later when I need to scale?
Thanks in advance
So far I have only looked for information on YouTube and Google.
My suggestion would be to use MSK Serverless and forget how many machines exist.
Since Kafka 3.3.1, KRaft mode is production-ready, so Zookeeper is no longer required. If you do use Zookeeper, it doesn't need to run on separate machines (although that is recommended). You can also run multiple brokers on one server... So I'm not sure why you would need 6 instances for a replication factor of 2.
Regarding partitions, yes, create them ahead of time (over-provision, if necessary), since you cannot easily move data across partitions when you do scale.
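To illustrate why changing the partition count later is awkward, here is a minimal sketch of key-based partitioning. It assumes keys are assigned by hashing the key modulo the partition count; the hash is a stand-in (CRC32) rather than the murmur2 hash Kafka's default partitioner uses, and the function names and `device-N` keys are made up for the example.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing it modulo the partition count.

    Assumption: CRC32 is only a stand-in here; Kafka's default
    partitioner uses murmur2, but the modulo effect is the same.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count, new_count):
    """Return the keys whose partition changes when the count changes."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


# Hypothetical message keys, e.g. one per device.
keys = [f"device-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=3, new_count=6)
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Records already written stay in their old partition, so after a resize a key's older and newer records can sit in different partitions, which breaks per-key ordering. Over-provisioning up front avoids that.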

How to do stream processing with Redpanda?

Redpanda seems easy to work with, but how would one process streams in real-time?
We have a few thousand IoT devices that send us data every second. We would like to get the running average of the data from the last hour for each of the devices. Can the built-in WebAssembly stuff be used for this, or do we need something like Materialize?
Given that it is marketed as "Kafka Compatible," any Kafka library should work with Redpanda, including Kafka Streams, KSQL, Apache Spark, Flink, Storm, etc.
Thanks, folks. Since it hasn't been mentioned I'm going to add my own answer as well here.
We ended up using Bytewax.
It worked great with our existing Kubernetes setup. It supports stateful operations and scales horizontally into multiple pods if needed. It's pretty performant (1), and since it's basically just a Python program, it can be customized to read from and write to whatever you want.
(1) The Bytewax pod actually uses less CPU than our KafkaJS pod, which just stores all messages to a DB.
Here's more information about stream processors that work with Redpanda.
https://redpanda.com/blog/kafka-stream-processors
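The running average from the question can be sketched independently of any particular stream processor. This is plain Python, not the Bytewax or Kafka Streams API: the class name `RunningAverage` and its methods are hypothetical, and it assumes timestamps arrive in order for each device. In a real deployment this state would live inside whatever stateful operator the chosen processor provides.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # one hour, as in the question


class RunningAverage:
    """Per-device sliding-window average (illustrative, in-memory only)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window = window_seconds
        self.samples = defaultdict(deque)  # device_id -> deque of (ts, value)
        self.sums = defaultdict(float)     # device_id -> sum of values in window

    def add(self, device_id, timestamp, value):
        """Record one reading and return the device's current average."""
        q = self.samples[device_id]
        q.append((timestamp, value))
        self.sums[device_id] += value

        # Evict readings that fell out of the window.
        cutoff = timestamp - self.window
        while q and q[0][0] <= cutoff:
            _, old_value = q.popleft()
            self.sums[device_id] -= old_value

        return self.sums[device_id] / len(q)


avg = RunningAverage()
print(avg.add("sensor-1", timestamp=0, value=10.0))
print(avg.add("sensor-1", timestamp=1, value=20.0))
```

Keeping a running sum next to the deque makes each update cheap, at the cost of holding one hour of readings per device in memory.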

Zookeeper for Data Storage?

I want an external config store for some of my services, and the data can be in formats like JSON, YAML, or XML. The use case is that I can save my configs and change them dynamically, and reads of these configs will be very frequent. Is Zookeeper a good solution for this? Also, my configs are at most 500MB.
The reason Zookeeper is under consideration is that it has synchronization, versioning (as I will be changing configs a lot), and can notify the dependent services of config changes. Kindly tell me whether Zookeeper can be a data store and whether it is the best fit for this use case, and give any other suggestions if possible.
Zookeeper may be used as a data store, but:
The size of a single znode should not exceed 1MB.
Getting a huge number of nodes from Zookeeper takes time, so you need to use caches. You can use the Curator PathChildrenCache recipe. If your znodes form a tree structure you can use TreeCache, but be aware that TreeCache had memory leaks in various 2.x versions of Curator.
Zookeeper notifications are a nice feature, but if you have a pretty big cluster you might have too many watchers, which puts stress on your Zookeeper cluster.
Please find more information about zookeeper failure reasons.
So generally speaking, Zookeeper can be used as a datastore if the data is organized as key/value and each value doesn't exceed 1MB. To get fast access to the data you should use caches on the application side: see the Curator PathChildrenCache recipe.
Alternatives are etcd and Consul.
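As a rough illustration of the 1MB rule above, the following sketch checks whether a serialized config would fit in a single znode before you try to write it. The limit constant is a conservative assumption (a little under the roughly 1MB default), and the function names and sample configs are hypothetical; no Zookeeper client is involved.

```python
import json

# Assumption: a conservative threshold, slightly under the ~1MB default limit.
ZNODE_LIMIT_BYTES = 1_000_000


def serialized_size(config: dict) -> int:
    """Size in bytes of the config once encoded as compact UTF-8 JSON."""
    return len(json.dumps(config, separators=(",", ":")).encode("utf-8"))


def fits_in_znode(config: dict, limit: int = ZNODE_LIMIT_BYTES) -> bool:
    """True if the serialized config is small enough for one znode."""
    return serialized_size(config) <= limit


small = {"service": "billing", "timeout_ms": 500}
large = {"blob": "x" * 2_000_000}  # stands in for an oversized config

print(fits_in_znode(small))  # True
print(fits_in_znode(large))  # False
```

A 500MB config fails this check by a wide margin, so it would have to be broken into many small key/value entries or kept in a different store.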

Shard and rebalance via CLI

Is it possible to rebalance shards via the RethinkDB command line?
I tried to do it, but all the data remains in one of the shards. In the web interface I can rebalance automatically.
Thanks!
This screencast (starting from 8:48) explains how to set up a cluster with a mix of command line and web interface.
In the documentation: Sharding and replication (section: Sharding via the command-line interface) there is some explanation on how to set up split points.
Unfortunately, there is little documentation on such specific tasks right now.
You can shard with the CLI, but this is done by manually setting split points (and not by setting the number of shards).
The syntax is:
split shard <TABLE> <SPLIT-POINT>
The web interface infers what good split points are based on the distribution of keys. The CLI currently doesn't do it.

Maximum servers in a ZooKeeper ensemble cluster?

Use case: 100 Servers in a pool; I want to start a ZooKeeper service on each Server and Server applications (ZooKeeper client) will use the ZooKeeper cluster (read/write). Then there is no single point of failure.
Is this solution possible for this use case? What about the performance?
What if there are 1000 Servers in the pool?
If you are simply trying to avoid a single point of failure, then you only need 3 servers. In a 3 node ensemble, a single failure can be tolerated with the remaining 2 nodes forming the quorum. The more servers you have the worse write performance will be. And 100 servers is the extreme of this, if ZK can even handle it.
However, having that many clients is no problem at all. Zookeeper has active deployments with many more than 1000 clients. If you find that you need more servers to handle your read load, you can always add Observers. I highly recommend you join the mailing list. It is an excellent way to quickly have your questions answered, and likely in much more detail than anyone will give you on SO.
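The quorum arithmetic behind the 3-server recommendation can be sketched in a few lines. This assumes simple majority quorums among voting servers (Observers do not vote); the function names are made up for the example.

```python
def quorum_size(voting_servers: int) -> int:
    """Smallest strict majority of the voting servers."""
    return voting_servers // 2 + 1


def tolerated_failures(voting_servers: int) -> int:
    """How many voting servers can fail while a majority remains."""
    return voting_servers - quorum_size(voting_servers)


for n in (3, 4, 5, 100):
    print(f"{n} servers: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 servers adds no fault tolerance, which is why ensembles usually have an odd size. Every write must be acknowledged by a quorum, so a 100-server ensemble would need 51 acknowledgements per write.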
Maybe zookeeper is not the right tool?
Hazelcast does what you want, I think. You can have hundreds of peers, and if the master is lost a new one is elected from all the peers.
You don't need to use all of Hazelcast. You can use just the maps, or just the worker pools, or just the synchronisation primitives, or just the messaging, etc.