Flush data from Historical Node Memory to Deep Storage - druid

I had initially set up a Druid cluster with 2 historical nodes (30 GB of memory each), 2 MiddleManager nodes, one node running the coordinator and overlord, and 1 broker node.
After running it successfully for 3-4 weeks, I saw that my tasks were staying in the running state even after the window period. I then happened to add one more historical node with the same configuration, and this got my tasks working fine again.
What this suggested was that all the data ingested into Druid stays in memory, and that I would have to keep adding historical nodes.
Is there a way to flush some of the data from memory to deep storage, so that it only gets loaded back into memory when a query is fired against that set of data?
Each of my historical nodes has 30 GB of RAM. Config:
druid.processing.buffer.sizeBytes=1073741824
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":32212254720}]
druid.port=7080
druid.service=druid/historical
druid.server.maxSize=100000000000
druid.server.http.numThreads=50
druid.processing.numThreads=5
druid.query.groupBy.maxResults=10000000
druid.query.groupBy.maxOnDiskStorage=10737418240

Well, as mentioned in the question, my problem was that I had to launch a new node every few days without knowing why. The root cause was the disk space on each historical node.
Essentially, even though Druid pushes data to deep storage, it also keeps a local copy of all the data on the historical nodes.
So you can only store as much data as the sum of the 'druid.server.maxSize' values across all the historical nodes.
If you don't wish to scale horizontally, you can add disk to the historical nodes, increase the value of this config, and restart them.
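For example, a node with roughly 200 GB of local disk reserved for segments might use something like the following (hypothetical sizes, not a recommendation; as a rule of thumb, druid.server.maxSize should not exceed the total maxSize of the segment-cache locations):
# hypothetical values for a node with a larger local disk
druid.server.maxSize=200000000000
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":200000000000}]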

Related

What happens if I add a new node to a CockroachDB cluster with more storage than existing nodes?

Having a 3x 1TB CockroachDB cluster, what happens if I add a single 4 TB node? Presumably only some of the 4TB can be used as not all can be replicated? If I add 3 new nodes with 4TB each, can all disk space be replicated/used?
I think everything you say is right. If you only have a single large store, you need sufficiently many small nodes around it in order to fill the large one.
Our disk balancing tries to keep equal amounts of data on each store until a store is almost full, at which point it will prefer less full ones.

Apache Spark Auto Scaling properties - Add Worker on the Fly

During the execution of a Spark program, let's say it reads 10 GB of data into memory and just does a filter, a map, and then saves the result to another storage system.
Can I auto-scale the cluster based on the load, and for instance add more worker nodes to the program, if it eventually needs to handle 1 TB instead of 10 GB?
If this is possible, how can it be done?
It is possible to some extent, using dynamic allocation, but the behavior depends on job latency, not on direct usage of a particular resource.
You have to remember that, in general, Spark can handle data larger than memory just fine; memory problems are usually caused by user mistakes or vicious garbage-collection cycles. Neither of these can easily be solved by "adding more resources".
If you are using a cloud platform to create the cluster, you can use its auto-scaling functionality; that will scale the cluster horizontally (the number of nodes will change).
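A minimal sketch of the dynamic allocation route, e.g. in spark-defaults.conf or via --conf at submit time (the executor counts are placeholder values; dynamic allocation generally also needs the external shuffle service, or shuffle tracking on newer Spark versions):
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=50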
Agree with #user8889543 - you can read much more data than fits in your memory.
As for adding more resources on the fly, it depends on your cluster type.
I use standalone mode, and I have code that adds machines on the fly; they attach to the master automatically, and then my cluster has more cores and memory.
If you only have one job/program in the cluster then it is pretty simple. Just set
spark.cores.max
to a very high number and the job will always take all the cores of the cluster.
If you have several jobs in the cluster it becomes complicated, as mentioned in #user8889543's answer.
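For instance, in spark-defaults.conf or via --conf (the value is just an arbitrarily high placeholder, not a recommendation):
# grab every core the standalone cluster has, including cores from workers added later
spark.cores.max=1000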

Azure Service Fabric reliable collections and memory

Let's say I'm running a Service Fabric cluster on 5 D1 class (1 core, 3.5 GB RAM, 50 GB SSD) VMs, and that I'm running 2 reliable services on this cluster, one stateless and one stateful. Let's assume that the replica target is 3.
How do I calculate how much my reliable collections can hold?
Let's say I add one or more stateful services. Since I don't really know how the framework distributes services, do I need to take the most conservative approach and assume that all of my stateful services may end up on a single node, and that their cumulative memory needs to be below the RAM available on a single machine?
TLDR - Estimating the expected capacity of a cluster is part art, part science. You can likely get a good lower bound which you may be able to push higher, but for the most part deploying things, running them, and collecting data under your workload's conditions is the best way to answer this question.
1) In general, the collections on a given machine are bounded by the amount of available memory or the amount of available disk space on a node, whichever is lower. Today we keep all data in the collections in memory and also persist it to disk. So the maximum amount that your collections across the cluster can hold is generally (Amount of available memory in the cluster) / (Target Replica Set Size).
Note that "Available Memory" is whatever is left over from other code running on the machines, including the OS. In your above example though you're not running across all of the nodes - you'll only be able to get 3 of them. So, (unrealistically) assuming 0 overhead from these other factors, you could expect to be able to put about 3.5 GB of data into that stateful service replica before you ran out of memory on the nodes on which it was running. There would still be 2 nodes in the cluster left empty.
Let's take another example. Let's say that it is about the same as your example above, except in this case you set up the stateful service to be partitioned. Let's say you picked a partition count of 5. So now on each node, you have a primary replica and 2 secondary replicas from other partitions. In this case, each partition would only be able to hold a maximum of around 1.16 GB of state, but now overall you can pack 5.83 GB of state into the cluster (since all nodes can now be utilized fully). Incidentally, just to prove out the math works, that's (3.5 GB of memory per node * 5 nodes in the cluster) [17.5] / (target replica set size of 3) = 5.83.
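A quick sketch of that arithmetic, using the figures from the example above (swap in your own numbers; this ignores OS and other overhead):
# rough upper bound on reliable-collection state across the cluster
memory_per_node_gb = 3.5
node_count = 5
target_replica_set_size = 3
max_state_gb = memory_per_node_gb * node_count / target_replica_set_size
print(round(max_state_gb, 2))  # ~5.83 GB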
In all of these examples, we've also assumed that memory consumption for all partitions and all replicas is the same. A lot of the time that turns out to not be true (at least temporarily) - some partitions can end up with more or less work to do and hence have uneven resource consumption. We also assumed that the secondaries were always the same as the primaries. In the case of the amount of state, it's probably fair to assume that these will track fairly evenly, though for other resource consumption it may not (just something to keep in mind). In the case of uneven consumption, this is really where the rest of Service Fabric's Cluster Resource Management will help, since we can come to know about the consumption of different replicas and pack them efficiently into the cluster to make use of the available space. Automatic reporting of consumption of resources related to state in the collections is on our radar and something we want to do, so in the future, this would be automatic but today you'd have to report this consumption on your own.
2) By default, we will balance the services according to the default metrics (more about metrics is here). So by default, the different replicas of those two different services could end up on the same machine, but in your example you'll end up with 4 nodes with 1 replica from a service on them and then 1 node with two replicas from the two different services. This means that each service (each with 1 partition, as per your example) would only be able to consume 1.75 GB of memory, for a total of 3.5 GB in the cluster. This is again less than the total available memory of the cluster since there are some portions of nodes that you're not utilizing.
Note that this is the maximum possible consumption, presuming no consumption outside the service itself. Taking this as your maximum is not advisable. You'll want to reduce it for several reasons, but the most practical reason is to ensure that in the presence of upgrades and failures there's sufficient available capacity in the cluster. As an example, let's say that you have 5 Upgrade Domains and 5 Fault Domains. Now let's say that a fault domain's worth of nodes fails while you have an upgrade going on in an upgrade domain. This means that (a little less than) 40% of your cluster capacity can be gone at any time, and you probably want enough room left over on the remaining nodes to continue. This means that if your cluster previously could hold 5.83 GB of state (from our prior calculations), in reality you probably don't want to put more than about 3.5 GB of state in it, since with more than that the service may not be able to get back to 100% healthy (note also that we don't build replacement replicas immediately, so the nodes would have to be down for your ReplicaRestartWaitDuration before you ran into this case). More information about metrics, capacity, buffered capacity (which you can use to ensure that room is left on nodes for the failure cases), and fault and upgrade domains is covered in this article.
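Continuing the sketch from above for that failure scenario (illustrative numbers only; roughly 2 of the 5 nodes are assumed unavailable at once):
# capacity left when a fault domain fails during an upgrade domain's upgrade
memory_per_node_gb = 3.5
target_replica_set_size = 3
surviving_nodes = 5 - 2
usable_state_gb = memory_per_node_gb * surviving_nodes / target_replica_set_size
print(round(usable_state_gb, 2))  # ~3.5 GB is the safer planning number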
There are some other things that practically will limit the amount of state you'll be able to store. You'll want to do several things:
Estimate the size of your data. You can make a reasonable estimate up-front of how big your data is by calculating the size of each field your objects hold. Be sure to take into consideration 64-bit references. This will give you a lower-bound starting point.
Storage overhead. Each object you store in a collection will come with some overhead for storing that object. In the reliable collections, depending on the collection and the operations currently in flight (copy, enumerations, updates, etc.), this overhead can range from between 100 and around 700 bytes per item (row) stored in the collections (see the rough sketch below). Do know also that we're always looking for ways to reduce the amount of overhead we introduce.
We also strongly recommend running your service over some period of time and measuring actual resource consumption via performance counters. Simulating some sort of real workload and then measuring the actual usage of the metrics you care about will serve you pretty well. The reason we recommend this in particular is that you will be able to see consumption from things like which CLR object heap your objects end up placed in, how often GC is running, if there's leaks, or other things like this which will impact the amount of memory you can actually utilize.
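As a back-of-the-envelope version of the estimate described above (the field sizes and item count are made up; the per-item overhead range and the 64-bit reference size come from the points above):
# hypothetical object: two longs plus a ~64-byte string payload
fields_bytes = 8 + 8 + 64
reference_overhead = 8        # 64-bit references
per_item_overhead = 700       # worst case of the 100-700 byte range above
item_count = 1_000_000
estimated_gb = item_count * (fields_bytes + reference_overhead + per_item_overhead) / 1024**3
print(round(estimated_gb, 2))  # ~0.73 GB as a lower-bound starting point, not a measurement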
I know that this has been a long answer but I hope you find it helpful and complete.

MongoDB replication and EBS vs ephemeral

I've read all of the MongoDB-related documentation about the recommended practices for deploying Mongo on AWS, but I don't understand the recommendation to install on EBS with RAID-10 (pdf) to avoid data loss.
This seems like admitting that replication doesn't work. Why shouldn't one run Mongo using ephemeral drives and a cluster of 5 servers doing replication?
Performance is much greater and latency is predictable on local disks.
If a server goes down, the EBS backed store would have to be resynced with the replica anyway. Sure you have the data, but it is already out of date.
Using EBS makes for a much more complicated setup. You need to use LVM or some other layer if you want to take snapshots, since EBS snapshots won't work across RAID. You need to monitor and manage your RAID array and rebuild in the case of failure or if one of the EBS volumes has performance issues.
What exactly does using EBS protect against if one has backups and a large replica set? It's almost admitting that replica sets won't protect you against data loss. (Ignoring for the moment the race condition when writes have been sent to secondaries and a failure on the master happens before acknowledgements have been sent.)
Why shouldn't one run Mongo using ephemeral drives and a cluster of 5 servers doing replication?
AWS is not perfect; it can have a network failure which results in the entire replica set being down. With ephemeral storage you would lose all your data. Plus, block devices survive restarts of nodes.
Those are a few things; I am sure there are more.
If a server goes down, the EBS backed store would have to be resynced with the replica anyway.
Only from the point it went down. If that is a considerable amount of time then yes, it might be easier to copy the data directory from one replica to the other.
Using EBS makes for a much more complicated setup. You need to use LVM or some other layer if you want to take snapshots, since EBS snapshots won't work across RAID.
You don't really need RAID within AWS itself; I mean, they already provide redundancy for each of your block devices, and replica set members are as good as throwaway copies. You can get by with one block device per node.
What exactly does using EBS protect against if one has backups and a large replica set?
It safeguards your sanity; restoring a backup of sizeable data across 10-odd members and resetting all the firewall/user permissions, OS, etc. could be... well... nasty.
I mean, imagine having to set up your OS again every single time you restart it.
It's almost admitting that replica sets won't protect you against dataloss.
Hmm, you must have misread somewhere, because THAT is not what they guarantee. It is true that it is harder to lose data with replica sets (if they are set up right), but they are actually designed to give high availability (HA).
Backups, journaling, and other consistency methods are designed to prevent data loss.
So where do you see the recommendation to run RAID-10 on EBS for MongoDB? Their docs list it as an option but specifically recommend only EBS and Provisioned IOPS.
For almost all deployments EBS will be the better choice. For production systems we recommend using
EBS-optimized EC2 instances
Provisioned IOPS (PIOPS) EBS volumes
http://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
We run all of our mongodb instances at EC2 and all of them use EBS storage volumes with production instances using provisioned IO. Here's why:
Bringing back a failed member is faster. If an instance fails and needs to be stopped and restarted (not that frequent but it does happen) we can just detach the storage and re-attach it to another instance. Mongod comes up fine, recovers via the journal and then re-syncs with the primary for only the delta in data since the failure. This is a big deal when you have large data sets that may take many hours to restore from scratch. Storing the data on an ephemeral drive does not provide this capability.
Backups are easier (at least for replica sets under 1 TB). With a single EBS storage volume (up to 1 TB) we can take snapshots of a live secondary. As long as the journal is on the same storage volume, the backup will be consistent. No need for a dedicated secondary for backups that has to be brought offline to back up. (A sketch of this snapshot workflow follows at the end of this answer.)
No need for RAID except for multiple-TB replica sets or for performance. EBS is already RAID behind the scenes for redundancy. We do use RAID when a replica set grows beyond 1 TB in storage, but that's it; we have not yet hit a point where a single high-IOPS EBS volume can't provide sufficient performance.
Provisioned IOPS give decent control of performance vs. cost. Being able to select EBS storage rated up to 4000 IOPS has allowed us to scale up performance for production systems (at higher cost) while still gaining the benefits of EBS storage. We use regular EBS volumes at lower cost for test systems.
Copying production data off for use in a test environment is much easier for large data sets. Snapshot the volumes, create a new storage volume from the snapshot and you're up and running.
I certainly can imagine future deployments using ephemeral storage (particularly as SSD costs drop) for certain high performance situations but EBS has been fairly reliable and dependable for us. Of course your experience and needs can and will differ but for us following the recommendation from MongoDB has served us well. In fact it's been reliable enough that for some environments we've moved to 1 Primary, 1 Secondary and an Arbiter, which helps with cost savings. Probably would not have done that without the ease of recovery and overall reliability of using EBS volumes on the Primary and Secondary.
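A sketch of the snapshot-based backup and copy-for-testing workflow described above, using boto3; the region, volume ID, availability zone, and IOPS figure are hypothetical placeholders, and in practice you would confirm the journal lives on the volume being snapshotted:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region

# Snapshot the single EBS data volume of a live secondary. Per the answer above,
# the snapshot is consistent as long as the journal is on the same volume.
snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # hypothetical volume ID
    Description="mongodb secondary backup",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Later, build a fresh volume from that snapshot, e.g. to seed a test environment
# or to re-attach to a replacement instance after a failure.
vol = ec2.create_volume(
    SnapshotId=snap["SnapshotId"],
    AvailabilityZone="us-east-1a",  # hypothetical AZ
    VolumeType="io1",               # provisioned IOPS, as recommended in the answer
    Iops=4000,
)
print(snap["SnapshotId"], vol["VolumeId"])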

Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections?

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like rsync and sorting out the conflict resolution ourselves.
You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means you can tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication factor would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.
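Roughly what such a store definition might look like in Voldemort's stores.xml (a sketch only; the store name, persistence engine, serializers, and the replication factor of 2 are placeholder choices):
<store>
  <name>my_store</name>
  <persistence>bdb</persistence>
  <routing>client</routing>
  <replication-factor>2</replication-factor>
  <required-reads>1</required-reads>
  <required-writes>1</required-writes>
  <key-serializer><type>string</type></key-serializer>
  <value-serializer><type>string</type></value-serializer>
</store>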
I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "NetworkTopologyStrategy"). For example, you can specify that there should always be two replicas in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
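In current CQL terms, that placement policy might look something like this (the keyspace name, datacenter names, and replica counts are assumptions; the DC names must match what your snitch reports):
CREATE KEYSPACE app_data
  WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 2,
    'DC2': 2
  };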
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).
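For example, the datacenter-aware consistency levels let a client get quorum-style guarantees without waiting on the remote datacenter; in cqlsh that might look like the following, where LOCAL_QUORUM confines the quorum to the local datacenter and LOCAL_ONE trades consistency for the fastest local behaviour:
CONSISTENCY LOCAL_QUORUM
CONSISTENCY LOCAL_ONE
Drivers expose the same levels per statement, so the choice can also be made per query rather than per session.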