MongoDB Atlas Projects/Clusters

I have recently finished a course on the MERNG tech stack and would now like to take what I have learnt and create a project of my own. While setting up MongoDB Atlas (free tier) for my new project, I thought of these questions:
Should I start a new project on my MongoDB Atlas account and create a new cluster in that project? Or just create a new cluster in the previous project? (I would assume I should start a new project as the new one has no relevance to the previous one)
Why would/can you have more than one cluster for one project?
I'm still fairly new to this tech stack and would like some clarity on these questions, so I apologise in advance if these come across as stupid. Thanks.

If you're on the free tier, it's somewhat irrelevant, as you can only have a single free cluster per project.
As to why you might want more than one cluster for a single project: it's mostly relevant for bigger, more complex projects, and I'd expect a personal project to fit comfortably in a single cluster. Where I work, we mostly use clusters to separate domains between teams; it's also one of the easiest permission restrictions to set up. When you really get down to it, multiple clusters are a means of organizing a project. You may also want different configurations between your clusters: maybe one cluster needs frequent backups less than the others, and since backups are very costly, you want to back up frequently only what needs it.
Update
You might also want to explore sharding to remain on a single cluster, but that is a costly and complex solution compared to maintaining multiple clusters, since finding a shard key that distributes the load evenly is not a trivial task. We've also moved away from separating clusters by domain; we now separate databases by domain, and the databases are then distributed across clusters to balance the load.
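For reference, the sharding route mentioned above uses the standard mongosh helpers. A minimal sketch, run against a sharded cluster's mongos router; the host, database (`shop`), collection (`orders`), and shard key (`customerId`) are all hypothetical:

```shell
# Hypothetical sketch: shard one collection on an existing sharded cluster.
mongosh --host my-mongos.example.com --eval '
  sh.enableSharding("shop");                                    // allow the "shop" database to be sharded
  sh.shardCollection("shop.orders", { customerId: "hashed" });  // a hashed key spreads load evenly
  sh.status();                                                  // inspect the chunk distribution
'
```

Picking a hashed key sidesteps hot spots from monotonically increasing keys, but it sacrifices efficient range queries on that field, which is part of why choosing a shard key is not trivial.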

Related

Merging Kubernetes clusters to minimize the number of clusters

This should be the topic of my bachelor thesis. At the moment I am looking for literature or general information, but I can't really find any. Do you have more information on this topic? I want to find out whether it makes sense to run the dev and test stages on one cluster instead of running each stage on its own cluster.
If it turns out to be a good idea, I also want to find out how I can consolidate the clusters.
That's a nice question, and actually a huge topic to cover. In short: yes and no, you can set up a single cluster for all of your environments.
But in general, everyone needs to consider various things before merging all the environments into a single cluster. Some of them include: the number of services you are running on k8s, the number of Ops engineers you have on hand to manage and maintain the existing cluster without any issues, and the locations of the different teams who use the cluster (if you take latency into consideration).
There are many advantages and disadvantages to merging everything into one.
Advantages include having a single, smaller cluster that is easy to manage and maintain: you can spread out your nodes with labels and deploy your dev-labelled applications on dev nodes, and so on. But this can also waste resources, since restricting where pods may be deployed stops k8s from making its own scheduling decisions. People could argue about this topic for hours if we set up a debate.
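The label-based separation described above can be sketched with kubectl; the node names, namespace-free deployment, and image here are all hypothetical:

```shell
# Hypothetical sketch: pin dev workloads to dev-labelled nodes in a merged cluster.
kubectl label node worker-1 env=dev    # mark worker-1 as a dev node
kubectl label node worker-2 env=test   # mark worker-2 as a test node

# A deployment then selects nodes via spec.template.spec.nodeSelector:
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-dev
spec:
  replicas: 1
  selector:
    matchLabels: { app: my-app, env: dev }
  template:
    metadata:
      labels: { app: my-app, env: dev }
    spec:
      nodeSelector:
        env: dev             # schedule only on nodes labelled env=dev
      containers:
      - name: my-app
        image: my-app:latest
EOF
```

This is exactly the trade-off mentioned above: the `nodeSelector` guarantees separation, but it also prevents the scheduler from using spare capacity on other nodes.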
Resource costs: imagine you have a cluster of 3 worker nodes and 3 masters in prod, a cluster of 2 workers and 3 masters in dev, and a cluster of 2 workers and 3 masters in test. The cost is huge, as you are allocating 9 masters in total; if you merge dev and test into one cluster, you save the cost of 3 VMs.
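The control-plane savings in that example are just arithmetic; a quick sketch using the numbers above:

```shell
# Three separate clusters (prod, dev, test), each with a 3-master control plane.
separate_masters=$((3 + 3 + 3))
# Merge dev and test so they share one 3-master control plane alongside prod's.
merged_masters=$((3 + 3))
echo "masters before: $separate_masters, after: $merged_masters, saved: $((separate_masters - merged_masters))"
```

The worker nodes are unchanged in both layouts; the saving comes purely from the duplicated control planes.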
K8s is still very new to many DevOps engineers in many organisations, and many of us need a place to experiment and figure things out with the latest versions of the software before they can be rolled out to prod. This is the biggest point of all, because downtime is very costly and many organisations cannot afford it, no matter what. If everything is in a single cluster, it is difficult to debug problems. One example is upgrading from Helm 2.0 to 3.0, which involves losing Helm's release data; one needs to research and work that out in advance.
As I said, team location is another factor. Imagine you have an offshore testing team: if you merge dev and test into a single cluster, the testing team might face network latency when working with the product, and all of us have deadlines. This is arguable, but network latency still needs to be considered.
In short: yes and no. This is a very debatable question, and we could keep adding pros and cons to the list forever, but it is advisable to have a separate cluster for each environment until you become some kind of Kubernetes guru who understands each and every packet of data inside your cluster.
This is already achievable using namespaces:
https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
With namespaces you are able to isolate workloads; merging the clusters and separating the environments with different namespaces would achieve what you're thinking of.
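A minimal sketch of that namespace-per-stage approach; the namespace names, `app.yaml` manifest, and quota values are hypothetical:

```shell
# Hypothetical sketch: one cluster, one namespace per stage.
kubectl create namespace dev
kubectl create namespace test

# Deploy the same application manifest into each stage's namespace.
kubectl -n dev  apply -f app.yaml
kubectl -n test apply -f app.yaml

# A ResourceQuota keeps one stage from starving the other.
cat <<'EOF' | kubectl -n test apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: test-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
EOF
```

Namespaces give logical isolation (names, RBAC, quotas) but not the hard failure isolation of separate clusters, which is the core trade-off being debated above.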

Reading material for distributed systems from a practical perspective

I have been developing a distributed application on top of Akka. As the application matures, I am running into problems related to distributed systems. For example:
during a rolling update of my cluster, some nodes have the old jar file and some have the new one (because some nodes have been updated and others have not). This means the code has to support both the old and the new versions at the same time.
similarly, during a rolling update I can have the old and the new config on different nodes at the same time.
currently I am using a Postgres database as the backend. If the VM hosting the database is down for an update, none of the other nodes can write any data.
I have a basic idea of how to fix the problems above, but I would also like to know how others have solved them. So, is there a book that focuses on distributed systems from a practical perspective?

What is the recommended way to upgrade a dataproc cluster?

Dataproc seems to be designed to be Stateless / Immutable. Is this assumption correct? Should we just quit right now if we are planning to deploy a Hive/Presto data warehouse?
We are struggling to find any documentation that suggests how one should care for a cluster once it has been provisioned:
How to upgrade components?
How to install tools (e.g. Hue etc) after a cluster was established?
How to secure access to data + services once deployed?
The FAQ entry "Can I run a persistent cluster?" doesn't really address this either.
The internet suggests we should just create a new cluster whenever we have a problem. As a developer I'm quite happy with the "minimize state" argument, but I work in the enterprise world, which likes solutions like Hive (and its metadata store), Hue and Zeppelin, and wants to connect external tools like Tableau to a cluster.
The documentation should really make clear which use cases Dataproc excels at (batch, on-demand and short-lived workloads) versus what it isn't really designed for (e.g. OLAP).
Dataproc indeed provides the most benefit for on-demand use cases, but this isn't necessarily at odds with being used for OLAP. The main idea is that the stateful components can all be separated from the "processing" resources so that you can better adjust resources according to needs at different points in time.
The recommended architecture for your Hive metadata is to keep your Hive metastore backend off the cluster, e.g. in a CloudSQL instance; many are able to use Dataproc in this way with short-lived or semi-short-lived clusters (e.g. keeping a pool of live clusters but deleting/recreating the oldest each day or each week) combined with initialization actions pointing the Hiveserver at CloudSQL: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/cloud-sql-proxy
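That setup can be sketched with the gcloud CLI. The cluster name, region, bucket path, and Cloud SQL instance name below are hypothetical; the init action is the cloud-sql-proxy one linked above:

```shell
# Hypothetical sketch: a disposable Dataproc cluster whose Hive metastore
# lives in Cloud SQL, so the cluster itself holds no state worth keeping.
gcloud dataproc clusters create my-cluster \
  --region us-central1 \
  --initialization-actions gs://my-bucket/cloud-sql-proxy/cloud-sql-proxy.sh \
  --metadata hive-metastore-instance=my-project:us-central1:my-metastore-sql
# Delete and recreate the cluster freely; Hive metadata survives in Cloud SQL
# and table data survives in GCS.
```

With this layout, "upgrading" the cluster is just creating a replacement cluster on a newer image version and pointing it at the same metastore, then deleting the old one.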
In this world, the stateful metastore pieces are all in CloudSQL and bulk storage is all in GCS. Some clusters might sync from GCS to local HDFS for performance reasons (especially if running HDFS on local-SSD), but even for interactive OLAP use cases, this isn't usually necessary; running queries directly against GCS works fine too. There are admittedly some performance pitfalls for older formats due to longer round-trip latency to GCS, but a bit of tuning can bring it mostly in-line; here's a (non-google-owned) blog post about Presto on Dataproc going over some of those.
This also provides much easier ways to handle traditional cluster admin; upgrades are just swapping out entire clusters, additional tools should be done in initialization actions for easy reproducibility on new clusters, and you can more easily define security perimeters at a per-cluster granularity.

MongoDB: is each cluster on a different server, or are they all on one?

I am starting to use MongoDB and am developing my first project with it. I cannot predict the volume of clients and usage it is going to receive, but I want to build it from the beginning to handle high volume.
I have heard about clusters and I have seen the demonstrations on MongoDB's official website.
And here is my question (split into smaller sub-questions):
Are clusters different servers, or are they just pieces of one big server?
Maybe it seems a bit unrelated, but how does Facebook, or any huge database, handle its data across countries? I mean, they have users in Asia and in America, surely on different servers; how does the system know which server should host, aggregate and deliver the data? Is it automatic, or is it a tool that a third party supplies to such large databases?
If I am using clusters, can I still just insert data into the database and let Mongo distribute it across the cluster on its own, or do I have to do that manually?
I have a cloud VPS. Should I continue working with this for Mongo, or should I seriously consider AWS / Google Cloud Platform / etc.?
And another important thing: I'm from Israel, and the clouds I mentioned above are probably hosted in Europe at the nearest, or even further away.
That will probably cause high latency, won't it?
Thanks.

What are the challenges in embedding text search (Lucene/Solr/Hibernate Search) in applications that are hosted at client sites

We have an enterprise Java web app that our customers (external) deploy on their intranets. I am exploring different full-text search options: Lucene/Solr/Hibernate Search, and one common concern is the deployment/administration/tuning overhead.
This is particularly challenging in our case, since we do not host these applications. From what I have seen, most uses of these technologies have been in hosted applications. Our customers typically deploy our application in a clustered environment and do not have any experience with Lucene/Solr.
Does anyone have any experience with this? What challenges have you encountered with this approach? How did you overcome them? At this point I am trying to determine if this is feasible.
Thank you
It is very feasible to deploy applications that use Lucene (or Solr) onto client sites.
Some things to keep in mind:
Administration
* You need a way to version your index, so if/when you change the document structure in the index, it can be upgraded.
* You therefore need a good way to force a re-index of all existing data. It is probably also a good idea to provide an option that lets an admin trigger a re-index.
* You could also provide an admin option to allow optimize() to be called on your index, or have this scheduled. Best to test the actual impact this will have first, since it may not be needed depending on the shape of your index.
Deployment
If you are deploying into a clustered environment, the simplest (and fastest in terms of dev speed and runtime speed) solution could be to create the index on each node.
Tuning
* Do you have a reasonable approximation of the dataset you will be indexing? You will need to ensure you understand how your index scales (in both speed and size), since what you consider a reasonable dataset size may not match your clients'. Therefore, you at least need to be able to tell clients which factors will lead to an overly large index and possibly slower performance.
There are two advantages to embedding Lucene in your app over sending queries to a separate Solr cluster: performance and ease of deployment/installation. Embedding Lucene means running it in the same JVM, so there are no additional server round trips; commits should be batched in a separate thread. Embedding Lucene also just means including a few more JAR files on your classpath, so there is no separate install for Solr.
If your app is cluster-aware, then the embedded Lucene option becomes highly problematic: an update made on one node of the cluster needs to be searchable from any node, and synchronizing the Lucene index across all nodes yields no better performance than using Solr. With Solr 4, you may find the administration to be less of a barrier to entry for your customers. Check out the literature on the (grossly misnamed) SolrCloud.