What is the proper terminology for Amazon Redshift servers?
Specifically, if I'm querying against Redshift and I don't know or care whether it's a cluster, what do I call it? Is it an instance, a server, or even a cluster? Or is it something else?
Cluster = externally addressable group of instances, composed of:
Node = an instance in the cluster, acting as either a:
Leader Node = accepts connections, plans queries, returns results
Compute Node = stores data, executes queries (more or less…)
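In practice you connect to the cluster's single endpoint and treat it like one PostgreSQL-compatible server; the leader node is what accepts the connection and plans the query. A minimal sketch with psycopg2 (the endpoint, database, and credentials below are hypothetical):

```python
# Minimal sketch: connecting to a Redshift cluster endpoint with psycopg2.
# The hostname points at the cluster; the leader node accepts the connection.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # cluster endpoint (hypothetical)
    port=5439,          # Redshift's default port
    dbname="dev",
    user="analyst",
    password="...",
)
with conn.cursor() as cur:
    # The leader node plans this query; compute nodes execute it.
    cur.execute("SELECT COUNT(*) FROM sales;")
    print(cur.fetchone())
```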
The reason I am asking is, I have a resource-intensive collection that degrades performance of its entire database. I need to decide whether to migrate other collections away to a different database within the same cluster or to a different cluster altogether.
The answer, I think, depends on the under-the-hood implementation. Does a poorly performing collection take resources only from its own database, or from the cluster as a whole?
Hosted on Atlas.
I would suggest first looking at your logical and schema design and trying to optimize it, but if that does not work, then consider the following:
"In MongoDB Atlas, all databases within a cluster share the same set of nodes (servers) and are subject to the same resource limitations. Each database has its own logical namespace and operates independently from the other databases, but they share the same underlying hardware resources, such as CPU, memory, and I/O bandwidth.
So, if you have a resource-intensive collection that is degrading performance for its entire database, migrating other collections to a different database within the same cluster may not significantly improve performance if the resource bottleneck is at the cluster level. In this case, you may need to consider scaling up the cluster or upgrading to a higher-tier plan to increase the available resources and improve overall cluster performance."
Reference: https://www.mongodb.com/community/forums/t/creating-a-new-database-vs-a-new-collection-vs-a-new-cluster/99187/2
The term "cluster" is overloaded. It can refer to a replica set or to a sharded cluster.
A sharded cluster is effectively a group of replica sets with a query router (mongos) in front.
If you are using a sharded cluster, you can design a sharding strategy that will put the busy collection on its own shard, the rest of the data on the other shard(s), and still have a common point to query them both.
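For example, here is a minimal pymongo sketch of that strategy, assuming a mongos on localhost, a busy collection app.events sharded on tenantId, and shards named shardBusy and shardRest (all of these names are hypothetical):

```python
# Sketch: use zone sharding to pin one busy collection to its own shard.
from pymongo import MongoClient
from bson.min_key import MinKey
from bson.max_key import MaxKey

client = MongoClient("mongodb://localhost:27017")  # connect to the mongos
admin = client.admin

# Associate each shard with a zone.
admin.command("addShardToZone", "shardBusy", zone="busy")
admin.command("addShardToZone", "shardRest", zone="general")

# Pin the busy collection's entire shard-key range to the "busy" zone,
# so the balancer only places its chunks on shardBusy.
admin.command(
    "updateZoneKeyRange",
    "app.events",
    min={"tenantId": MinKey()},
    max={"tenantId": MaxKey()},
    zone="busy",
)
```

Other collections can be pinned to the "general" zone in the same way, so the busy collection and the rest of the data never compete for the same shard, while mongos still gives you a common point to query everything.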
Is there a way to deal with different server types in a sharded cluster? According to the MongoDB documentation, the balancer attempts to achieve an even distribution of chunks across all shards in the cluster. So it seems to be based purely on the amount of data.
However, when you add new servers to an existing sharded cluster, the new servers typically have more disk space, faster disks, and more powerful CPUs. Especially when you run an application for several years, this situation is likely to arise.
Does the balancer take such differences into account, or do you have to ensure that all servers in a sharded cluster have similar performance and resources?
You are correct that the balancer assumes that all parts of the cluster are of similar hardware. However, you can use zone sharding to custom-tailor the behaviour of the balancer.
To quote from the zone sharding docs page:
In sharded clusters, you can create zones of sharded data based on the shard key. You can associate each zone with one or more shards in the cluster. A shard can associate with any number of zones. In a balanced cluster, MongoDB migrates chunks covered by a zone only to those shards associated with the zone.
Using zones, you can specify data distribution to be by location, by hardware spec, by application/customer, and others.
To directly answer your question, the use case you'll be most interested in would be Tiered Hardware for Varying SLA or SLO. Please see the link for a tutorial on how to achieve this.
Note that defining the zones is a design decision on your part, and there is currently no automated way for the server to do this for you.
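As a rough illustration of the tiered-hardware pattern with pymongo (the shard names, namespace, shard key, and cutoff date below are all hypothetical), the manual zone definitions might look like this:

```python
# Sketch: route recent data to a powerful new shard and older data to a
# slower, larger shard, using zone ranges on a date-based shard key.
from datetime import datetime
from pymongo import MongoClient
from bson.min_key import MinKey
from bson.max_key import MaxKey

client = MongoClient("mongodb://localhost:27017")  # connect to the mongos
admin = client.admin

# Associate each shard with a hardware tier.
admin.command("addShardToZone", "shardFast", zone="recent")
admin.command("addShardToZone", "shardSlow", zone="archive")

cutoff = datetime(2023, 1, 1)  # hypothetical boundary between tiers

# Recent orders stay on the faster hardware...
admin.command("updateZoneKeyRange", "app.orders",
              min={"orderDate": cutoff}, max={"orderDate": MaxKey()},
              zone="recent")
# ...older orders live on the slower, larger shard.
admin.command("updateZoneKeyRange", "app.orders",
              min={"orderDate": MinKey()}, max={"orderDate": cutoff},
              zone="archive")
```

Note that with a date-based split like this, the cutoff has to be moved forward periodically so that the "recent" zone keeps meaning recent.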
Small note: the balancer balances the cluster purely using the shard key; it doesn't take the amount of data into account at all. Thus, with an improperly designed shard key, it is possible to have some shards overflowing with data while others are completely empty. In a pathological mis-design case, some chunks are not divisible, leading to a situation where the cluster is forever unbalanced until an extensive redesign is done.
How can I write data to multiple MongoDB instances and keep the data synchronized among them, just like in MariaDB?
Currently we use a replica set in MongoDB, but it seems to support writing data to only one node, and this may cause pressure issues if write requests increase.
A sharded cluster is not appropriate for me.
Thanks.
Please read the docs (Replication in MongoDB)
Of the data bearing nodes, one and only one member is deemed the primary node, while the other nodes are deemed secondary nodes
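A minimal pymongo sketch of what that means in practice (the replica set name and hosts are hypothetical): no matter which members you list, the driver discovers the topology and sends every write to the single primary; at best you can offload reads to secondaries with a read preference.

```python
# Sketch: writes always go to the primary of a replica set.
from pymongo import MongoClient, ReadPreference

# Listing several members only helps the driver discover the topology;
# every write is still routed to the current primary.
client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"
)
client.test.events.insert_one({"msg": "hello"})  # executed on the primary

# Reads can be spread to secondaries to take some load off the primary.
coll = client.test.get_collection(
    "events", read_preference=ReadPreference.SECONDARY_PREFERRED
)
print(coll.find_one())
```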
I'm new to Couchbase and NoSQL technologies in general, but I'm working on a web chat application running on Node.js using Express and some other modules.
I've chosen to work with NoSQL to store sessions and all needed data on the server side. But I don't really understand some important features of Couchbase: what is a cluster, or a bucket? Where can I find some clear definitions of how the server works?
Couchbase uses the term cluster in the same way as many other products: a Couchbase cluster is simply a collection of machines running as a co-ordinated, distributed system of Couchbase nodes.
A Bucket is a Couchbase specific term that is roughly analogous to a 'database' in traditional RDBMS terms. A Bucket provides a container for grouping your data, both in terms of organisation and grouping of similar data and resource allocation. You can configure your buckets separately, providing different quotas, different IO priorities and different security settings on a per bucket basis. Buckets are also the primary method for namespacing documents in Couchbase.
For further information, the Architecture and Concepts overview in the Couchbase documentation, specifically data storage, is a good starting point. A somewhat outdated but still relevant video, Introduction to Couchbase, might also be useful to you.
Even though it's answered, I hope the following will be helpful for someone.
A Couchbase cluster contains nodes. Nodes contain buckets. Buckets contain documents. Documents can be retrieved in multiple ways: by their keys, by querying with N1QL, or by using Views. (Ref)
As specified in the Couchbase Documentation,
Node
A single Couchbase Server instance running on a physical server, virtual machine, or a container. All nodes are identical: they consist of the same components and services and provide the same interfaces.

Cluster
A cluster is a collection of nodes that are accessed and managed as a single group. Each node is an equal partner in orchestrating the cluster to provide facilities such as operational information (monitoring) or managing cluster membership of nodes and health of nodes.

Clusters are scalable. You can expand a cluster by adding new nodes and shrink a cluster by removing nodes.

The Cluster Manager is the main component that orchestrates the cluster level operations. For more information, see Cluster Manager.

Bucket
A bucket is a logical container for a related set of items such as key-value pairs or documents. Buckets are similar to databases in relational databases. They provide a resource management facility for the group of data that they contain. Applications can use one or more buckets to store their data. Through configuration, buckets provide segregation along the following boundaries:

Cache and IO management
Authentication
Replication and Cross Datacenter Replication (XDCR)
Indexing and Views
For further info: Couchbase Terminology
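To make those terms concrete, here is a minimal sketch assuming the Couchbase Python SDK 3.x+ (the question is about Node.js, but the concepts map one-to-one; the hostname, bucket name, and credentials are hypothetical):

```python
# Sketch: cluster -> bucket -> collection -> documents.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Connect to the cluster (the group of nodes) as a single logical unit.
cluster = Cluster.connect(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)

# Open a bucket -- roughly "a database" -- then a collection inside it.
bucket = cluster.bucket("chat")
collection = bucket.default_collection()

# Key-value access: store and fetch a document by its key.
collection.upsert("session::42", {"user": "alice", "expires": 1700000000})
print(collection.get("session::42").content_as[dict])

# The same data can also be queried with N1QL across the bucket.
result = cluster.query("SELECT c.* FROM `chat` AS c LIMIT 5")
for row in result.rows():
    print(row)
```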
I have a production sharded cluster of PostgreSQL machines where sharding is handled at the application layer. (Created records are assigned a system generated unique identifier - not a UUID - which includes a 0-255 value indicating the shard # that record lives on.) This cluster is replicated in RDS so large read queries can be executed against it.
I'm trying to figure out the best option for accessing this data within Spark.
I was thinking of creating a small dataset (a text file) that contains only the shard names, i.e., integration-shard-0, integration-shard-1, etc. Then I'd partition this dataset across the Spark cluster so ideally each worker would only have a single shard name (but I'd have to handle cases where a worker has more than one shard). Then when I create a JdbcRDD I'd actually create 1..n such RDDs, one for each shard name residing on that worker, and merge the resulting RDDs together.
This seems like it would work but before I go down this path I wanted to see how other people have solved similar problems.
(I also have a separate Cassandra cluster available as second datacenter for analytic processing which I will be accessing with Spark.)
I ended up writing my own ShardedJdbcRDD; a preliminary version can be found at the following gist:
https://gist.github.com/cfeduke/3bca88ed793ddf20ea6d
At the time of writing, this version doesn't support use from Java, only Scala (I may update it). It also doesn't have the same sub-partitioning scheme that JdbcRDD has, for which I will eventually create an overloaded constructor. Basically, ShardedJdbcRDD will query your RDBMS shards across the cluster; if you have at least as many Spark slaves as shards, each slave will get one shard for its partition.
A future overloaded constructor will support the same range query that JdbcRDD has so if there are more Spark slaves in the cluster than shards the data can be broken up into smaller sets through range queries.
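As a rough Python/PySpark illustration of the same idea (not the Scala class above; the shard hostnames, table, and credentials are hypothetical, and the PostgreSQL JDBC driver must be on the classpath), you can read each application-level shard with the DataFrame JDBC reader and union the results, letting Spark spread the per-shard reads across the workers:

```python
# Sketch: read every application-level shard and union into one DataFrame.
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("sharded-read").getOrCreate()

# One entry per physical shard (hypothetical names and count).
shards = [f"integration-shard-{i}" for i in range(4)]

def read_shard(shard: str) -> DataFrame:
    # Each shard is assumed to be reachable as its own PostgreSQL host/database.
    return (spark.read.format("jdbc")
            .option("url", f"jdbc:postgresql://{shard}.example.com:5432/app")
            .option("dbtable", "events")
            .option("user", "readonly")
            .option("password", "secret")
            .load())

# With enough executors, the per-shard reads run in parallel, which is
# close in spirit to "one shard per Spark worker".
all_events = reduce(DataFrame.union, (read_shard(s) for s in shards))
all_events.createOrReplaceTempView("events")
```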