What is the recommended way to upgrade a dataproc cluster? - google-cloud-dataproc

Dataproc seems to be designed to be stateless/immutable. Is this assumption correct? Should we just quit right now if we are planning to deploy a Hive/Presto data warehouse?
We are struggling to find any documentation that explains how one should care for a cluster once it has been provisioned:
How to upgrade components?
How to install tools (e.g. Hue) after a cluster has been established?
How to secure access to data + services once deployed?
The FAQ entry "Can I run a persistent cluster?" doesn't really address this either.
The internet suggests we should just create a new cluster whenever we have a problem. As a developer I'm quite happy with the "minimize state" argument, but I work in the enterprise world, which likes solutions such as Hive (and its metadata store), Hue, and Zeppelin, and wants to connect external tools like Tableau to a cluster.
The documentation should really make it clear which use cases Dataproc excels at (batch, on-demand, and short-lived workloads) versus what it isn't really designed for (e.g. OLAP).

Dataproc indeed provides the most benefit for on-demand use cases, but this isn't necessarily at odds with being used for OLAP. The main idea is that the stateful components can all be separated from the "processing" resources so that you can better adjust resources according to needs at different points in time.
The recommended architecture for your Hive metadata is to keep the metastore backend off the cluster, e.g. in a Cloud SQL instance; many use Dataproc this way with short-lived or semi-short-lived clusters (for example, keeping a pool of live clusters but deleting/recreating the oldest each day or week), combined with an initialization action that points HiveServer at Cloud SQL: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/cloud-sql-proxy
In this world, the stateful metastore pieces all live in Cloud SQL and bulk storage all lives in GCS. Some clusters might sync from GCS to local HDFS for performance reasons (especially if running HDFS on local SSD), but even for interactive OLAP use cases this isn't usually necessary; running queries directly against GCS works fine too. There are admittedly some performance pitfalls for older formats due to the longer round-trip latency to GCS, but a bit of tuning can bring it mostly in line; here's a (non-Google-owned) blog post about Presto on Dataproc going over some of those.
This also makes traditional cluster administration much easier: upgrades become a matter of swapping out entire clusters, additional tools should be installed via initialization actions for easy reproducibility on new clusters, and you can more easily define security perimeters at per-cluster granularity.
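As a concrete sketch of that pattern, here's roughly what cluster creation might look like with the google-cloud-dataproc Python client. The project, region, Cloud SQL instance name, and the "hive-metastore-instance" metadata key are assumptions based on the cloud-sql-proxy action's README, so check the repo linked above for the exact contract:

```python
# Sketch: create a date-stamped cluster whose Hive metastore lives in
# Cloud SQL. All names below are placeholders.
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project,
    "cluster_name": "analytics-20190601",  # date-stamped so the pool can be rotated
    "config": {
        "gce_cluster_config": {
            # The cloud-sql-proxy init action reads this key to find the
            # Cloud SQL instance backing the metastore (see its README).
            "metadata": {"hive-metastore-instance": f"{project}:{region}:hive-metastore"},
        },
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "initialization_actions": [
            # Runs on every node at creation time, so each new cluster is
            # reproduced from the same recipe.
            {"executable_file": "gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh"}
        ],
    },
}

client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()  # block until the cluster is running
```

An "upgrade" is then just creating a fresh cluster from the same recipe (new image version, new init actions) and deleting the old one; the metastore in Cloud SQL and the data in GCS outlive any individual cluster.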

Related

MongoDB Atlas Projects/Clusters

I have recently finished a course on the MERNG tech stack and now I would like to take what I have learnt and create a project of my own. During the setup of MongoDB Atlas (free tier) for my new project, I thought of these questions:
Should I start a new project on my MongoDB Atlas account and create a new cluster in that project? Or just create a new cluster in the previous project? (I would assume I should start a new project as the new one has no relevance to the previous one)
Why would/can you have more than one cluster for one project?
I'm still fairly new to this tech stack and would like some clarity on these questions, so I apologise in advance if these come across as stupid. Thanks.
If you're on the free tier, it's somewhat irrelevant, as you can only have a single free cluster per project.
As to why you might want more than one cluster for a single project, it's mostly relevant for bigger and more complex projects; I would expect a personal project to fit within a single cluster. Where I work, we mostly use clusters to separate domains between teams. It's also one of the easiest permission restrictions to set up. If you really get down to it, multiple clusters are a means of organizing a project, and you may want different configurations between your clusters: perhaps frequent backups are less necessary for one cluster, and since backups are very costly, you want to back up frequently only what needs it.
Update
You might also want to explore sharding to remain on a single cluster, but that is a costly and complex solution compared to maintaining multiple clusters, since finding a shard key that distributes the load evenly is not a trivial task. We've also moved away from separating clusters by domain; we now separate databases by domain. Databases are then distributed across clusters to balance the load.
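For a sense of what's involved, this is roughly how a collection gets distributed in a sharded deployment; a minimal PyMongo sketch against a placeholder mongos router, with a hashed shard key chosen to spread the load evenly:

```python
from pymongo import MongoClient

# Placeholder address of a mongos router in an already-sharded deployment.
client = MongoClient("mongodb://mongos.example.net:27017")

# Enable sharding for the database, then shard the collection on a
# hashed key; hashing trades range queries for an even distribution.
client.admin.command("enableSharding", "orders_db")
client.admin.command(
    "shardCollection",
    "orders_db.orders",
    key={"customerId": "hashed"},
)
```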

Horizontally Scaling Database Guide

We want to horizontally scale our existing MongoDB database, which is running on one server. Due to an increased user base, we can't scale it vertically anymore; we need to scale it horizontally through sharding.
MongoDB provides a good tutorial on sharding, but we need to do it in a short amount of time, and we are not experts in this.
It seems there are multiple options available, like Google Cloud and Amazon RDS. All we want is to keep using our own database but achieve sharding through some other service.
So my questions are:
1. Is it possible to build a fail-safe cluster architecture in less than a week using MongoDB sharding, with a team that has no prior experience in this?
2. If not, do services like Google Cloud SQL and Amazon RDS provide a mechanism to use our database with their sharding service?
Can anyone with expertise in this just guide me in this direction?
I tried MongoDB Atlas and it looks pretty good: https://www.mongodb.com/cloud/atlas
It creates a cluster for you by default. Maybe you can give it a try:
MongoDB Atlas delivers the world’s leading database for modern applications as a fully automated cloud service engineered and run by the same team that builds the database. Proven operational and security practices are built in, automating time-consuming administration tasks such as infrastructure provisioning, database setup, ensuring availability, global distribution, backups, and more. The easy-to-use UI and API let you spend more time building your applications and less time managing your database.
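Once Atlas has created the cluster, connecting from application code is just a connection string; a minimal PyMongo sketch with a placeholder SRV URI (Atlas shows your real one under "Connect"):

```python
from pymongo import MongoClient  # mongodb+srv URIs also need the dnspython package

# Placeholder URI; copy the real one from the Atlas "Connect" dialog.
client = MongoClient(
    "mongodb+srv://appUser:<password>@cluster0.example.mongodb.net/"
    "?retryWrites=true&w=majority"
)
print(client.admin.command("ping"))  # verify connectivity
```

Sharding, backups, and failover are then configured from the Atlas UI rather than by hand, which is why the application code stays this small.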

What is the cheapest Google Compute Engine architecture for sharded MongoDB development setup?

After weeks of developing my various microservices, GC Pub/Sub and GC Functions using a basic MongoDB server, I would like to test the entire data flow using what I would use in production: a sharded MongoDB cluster. I've never used these and would like to get myself familiar with setting them up, updating, etc.
Costs are an issue at this stage, especially for testing. Therefore, what is the most cost-effective way to setup a (test) MongoDB sharded cluster on Google Compute Engine?
The easiest approach for you is to use Cloud Launcher for your deployment. It lets you choose the number of nodes and the machine types, so you can deploy something that suits your budget. You will be billed according to the resources you deploy and can use this online calculator to get an estimate. A drawback is that there does not seem to be a direct way to add nodes or change machine types without manual reconfiguration.
While configuring your deployment, the appropriate number of nodes and an arbiter will be created. Once you have tested, you might want to think about more complex architectures that are redundant against failures in a region (those will certainly increase your cost, since they mean having additional nodes).
You can also consider running Mongo on GKE; it would be easier to scale, but it requires that you get familiar with Kubernetes. Kubernetes Engine is also charged according to the resources used by the cluster.
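Whichever option you pick, once the test deployment is up you can sanity-check the sharded topology from Python; a sketch assuming PyMongo and a placeholder mongos address:

```python
from pymongo import MongoClient

# Placeholder: the external IP of the mongos router from your deployment.
client = MongoClient("mongodb://203.0.113.10:27017")

# listShards runs against a mongos and enumerates the shard replica sets.
for shard in client.admin.command("listShards")["shards"]:
    print(shard["_id"], shard["host"])
```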

MongoDB: is each cluster on a different server, or are they all on one?

I am starting to use MongoDB, and this is my first project with it. I cannot predict the volume of clients and usage it is going to receive, but I want to build it from the beginning to handle high volume.
I have heard about clusters and saw the demonstrations on the official MongoDB website.
And here is my question (cut into small sub-questions):
Are clusters separate servers, or are they just pieces of one big server?
Maybe it seems a bit unrelated, but how does Facebook, or any huge database, handle its data across countries? I mean, they have users in Asia and in America. Surely they use different servers, but how does the system know which server should host, aggregate, and deliver the data? Is it automatic, or is it a tool that a third party supplies to such large databases?
If I am using clusters, can I still just insert the data into the database and Mongo will distribute it across the cluster on its own, or do I have to do that manually?
I have a cloud VPS. Should I continue working with this for Mongo, or should I seriously consider AWS / Google Cloud Platform / etc.?
And another important thing: I'm from Israel, and the clouds I mentioned above are probably in Europe at the closest, or even farther away.
That will probably cause high latency, won't it?
Thanks.

What are the pros and cons of DynamoDB with respect to other NoSQL databases?

We use MongoDB database add-on on Heroku for our SaaS product. Now that Amazon launched DynamoDB, a cloud database service, I was wondering how that changes the NoSQL offerings landscape?
Specifically for cloud based services or SaaS vendors, how will using DynamoDB be better or worse as compared to say MongoDB? Are there any cost, performance, scalability, reliability, drivers, community etc. benefits of using one versus the other?
For starters, it will be fully managed by Amazon's expert team, so you can bet that it will scale very well with virtually no input from the end user (developer).
Also, since it's built and managed by Amazon, you can assume that they have designed it to work very well with their infrastructure, so performance should be top notch. In addition to being specifically built for their infrastructure, they have chosen to use SSDs as storage, so right from the start disk throughput will be significantly higher than other data stores on AWS that are HDD-backed.
I haven't seen any drivers yet, and I think it's too early to tell how the community will react to this, but I suspect that Amazon will have drivers for all of the most popular languages, and the community will likely receive this well - and in turn create additional drivers and tools.
Using MongoDB through an add-on for Heroku effectively turns MongoDB into a SaaS product as well.
In reality, one would be comparing whatever service a chosen provider offers to what Amazon can offer, rather than comparing one persistence solution to another.
This is very hard to do. Each provider will have varying levels of service at different price points, and one might consider the ability to run it on one's own hardware locally for development purposes a welcome option.
I think the key difference to consider is that MongoDB is software you can install anywhere (including on AWS, at another cloud service, or in-house), whereas DynamoDB is a SaaS available exclusively as a hosted service from Amazon (AWS). If you want to retain the option of hosting your application in-house, DynamoDB is not an option. If hosting outside of AWS is not a consideration, then DynamoDB should be your default choice unless very specific features are of higher consideration.
There's a table in the following link that summarizes the attributes of DynamoDB and Cassandra:
http://www.datastax.com/dev/blog/amazon-dynamodb
Something that needs improvement for DynamoDB to become more usable is the ability to index attributes other than the primary key.
UPDATE 1 (06/04/2013)
On 04/18/2013, Amazon announced support for Local Secondary Indexes, which made DynamoDB far more usable:
http://aws.amazon.com/about-aws/whats-new/2013/04/18/amazon-dynamodb-announces-local-secondary-indexes/
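For illustration, this is roughly what a Local Secondary Index looks like at table-creation time; a sketch using the current Python SDK (boto3) with placeholder names. An LSI shares the table's hash key but adds an alternate range key, i.e. exactly the indexing-beyond-the-primary-key capability discussed above:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="Orders",  # placeholder table
    AttributeDefinitions=[
        {"AttributeName": "customerId", "AttributeType": "S"},
        {"AttributeName": "orderDate", "AttributeType": "S"},
        {"AttributeName": "status", "AttributeType": "S"},
    ],
    # LSIs require a composite (hash + range) primary key on the table.
    KeySchema=[
        {"AttributeName": "customerId", "KeyType": "HASH"},
        {"AttributeName": "orderDate", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "status-index",
            # Same hash key as the table, alternate range key.
            "KeySchema": [
                {"AttributeName": "customerId", "KeyType": "HASH"},
                {"AttributeName": "status", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```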
I have to be honest; I was very excited when I heard about the new DynamoDB, and I did attend the webinar yesterday. However, it's very difficult to make a decision right now, as everything they said was still quite vague; I have no idea what functionality will be allowed / exposed through their service.
The one thing I do know is that scaling is automatically handled; which is pretty awesome, yet there are still so many unknowns that it's tough to really make a great analysis until all the facts are in and we can start using it.
Thus far I still see mongo as working much better for me (personally) in the project undertaking that I've been working on.
Like most DB decisions, it's really going to come down to a project by project decision of what's best for your need.
I anxiously await more information on the product; for now, though, it is in beta, and I wouldn't jump ship to adopt the latest and greatest only to be a tester :)
I think one of the key differences between DynamoDB and other NoSQL offerings is the provisioned throughput - you pay for a specific throughput level on a table, and provided you keep your data well-partitioned, you can always expect that throughput to be met. So as your application load grows, you can scale up and keep your performance more or less constant.
Amazon DynamoDB seems like a pretty decent NoSQL solution. It is fast, and it is pretty easy to use. Other than having an AWS account, there really isn't any setup or maintenance required. The feature set and API are fairly small right now compared to MongoDB/CouchDB/Cassandra, but I would expect them to grow over time as feedback from the developer community is received. Right now, all of the official AWS SDKs include a DynamoDB client.
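For instance, the provisioned-throughput model is explicit at table creation; a sketch using the current Python SDK (boto3) with placeholder names:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="Sessions",  # placeholder table
    AttributeDefinitions=[{"AttributeName": "sessionId", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "sessionId", "KeyType": "HASH"}],
    # You pay for this capacity whether or not you use it, and requests
    # beyond it are throttled (see the cons list below).
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
)
```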
Pros
Lightning fast (uses SSDs internally)
Really (really) reliable (the chances of write failures are low)
Seamless scaling (no need to do manual sharding)
Works as a web service (no server, no configuration, no installation)
Easily integrates with other AWS features (you can export the whole table to S3 or use EMR, etc.)
Replication is managed internally, so the chance of accidental data loss is negligible.
Cons
Very (very) limited querying (see the sketch after this list).
Scanning is painful (I remember one scan through the Java SDK running for 6 hours).
Pre-defined throughput, which means a sudden increase in load beyond the set throughput will be throttled.
Throughput is partitioned as the table is sharded internally (which means that if you provisioned 1000 units and the table is split into two partitions, and you are reading only the latest data (from one partition), your effective read throughput is only 500).
No joins; limited indexing (basically two indexes).
No views, triggers, scripts, or stored procedures.
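To make the querying limitation concrete, here's a boto3 sketch reusing the placeholder table from above: Query must pin the hash key with an equality condition, while filtering on any non-key attribute degenerates into a full Scan:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Fine: a keyed lookup on the hash key.
dynamodb.query(
    TableName="Sessions",
    KeyConditionExpression="sessionId = :sid",
    ExpressionAttributeValues={":sid": {"S": "abc-123"}},
)

# Painful: filtering on a non-key attribute still reads the entire table.
dynamodb.scan(
    TableName="Sessions",
    FilterExpression="userAgent = :ua",
    ExpressionAttributeValues={":ua": {"S": "Mozilla/5.0"}},
)
```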
It's really good as an alternative to session storage in a scalable application. Another good use would be logging/auditing in an extensive system. It is NOT preferable for a feature-rich application with frequent enhancements or changes.