Use multiple service accounts in Google Dataproc

Can we use multiple service accounts within one Dataproc cluster?
Let's say I have 3 buckets:
Service account A has r/w access to bucket A, with r access to buckets B and C.
Service account B has r/w access to bucket B, with r access to buckets A and C.
Service account C has r/w access to bucket C, with r access to buckets A and B.
Can I have a cluster spun up with service account D, but use each of the above defined service accounts (A, B and C) within the jobs to get appropriate access to the buckets?

The GCS Connector for Hadoop can be configured to use a different service account than what is provided by the GCE metadata server. Using this mechanism, it would be possible to access different buckets using different service accounts.
To use a JSON keyfile instead of the metadata server, the configuration key "google.cloud.auth.service.account.json.keyfile" should be set to the location of a JSON keyfile that is local to each node in the cluster. How to set that key depends on the context in which the filesystem is being accessed. For a standard MR job that only accesses a single bucket, you can set that key/value pair on the JobConf. If you're accessing GCS via the Hadoop FileSystem interface, you can specify that key/value pair in the Configuration object used when acquiring the appropriate FileSystem instance.
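For example, a minimal PySpark sketch of the FileSystem-style approach might look like this. The bucket name and keyfile path are placeholders, the property names follow the GCS connector's keyfile auth options, and the keyfile must already be present at that path on every node:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point the GCS connector at a keyfile instead of the GCE metadata server.
# The path below is hypothetical and must exist on every node of the cluster.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("google.cloud.auth.service.account.enable", "true")
hadoop_conf.set("google.cloud.auth.service.account.json.keyfile",
                "/etc/secrets/service-account-a.json")

# Reads from this bucket are now authorized as service account A.
df = spark.read.text("gs://bucket-a/some/path/")
df.show()
```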
That said, Dataproc does not attempt to segregate individual applications from each other. So if your intent is a multi-tenant cluster, there are not sufficient security boundaries around individual jobs to guarantee that a job will not maliciously attempt to grab credentials from another job.
If your intent is not multi-tenant clusters, consider creating a task-specific service account that is allowed read or write access to all the buckets it will need to interact with. For example, if you have a job 'meta-analysis' that reads and writes to multiple buckets, you can create a service account meta-analysis that has the permissions required for that job.

With this relatively new feature (6 months in GA at the time of writing) you can try Dataproc cooperative multi-tenancy, which maps the user accounts that submit jobs to Dataproc onto service accounts.
Here is an excellent article by Google's engineers:
https://cloud.google.com/blog/topics/developers-practitioners/dataproc-cooperative-multi-tenancy

Related

User limitation on Postgresql synthetic monitoring using Airflow

I am trying to write a synthetic monitoring for my on-prem PostgreSQL service, using Airflow. The monitoring should report whether a cluster is available for creating tables, writing and reading data, and deleting tables.
The clusters on my service are using SSL certificates for authentication, which means a client is required to provide a suitable client certificate in order to connect to the cluster.
Currently, I have implemented my monitoring by creating a global user which has a certificate with permissions on all the clusters. The user has permissions to create, write and read only on one schema, dedicated to this monitoring. Using Airflow, I connect with this user to each of my PostgreSQL clusters and try to create a table, write to it, read it, and then delete it. If one of the actions fails, the DAG writes a log describing the reason for the failure.
My main problem with this solution is not being able to limit such a powerful user, which has access to all of my clusters. If an intruder got hold of the user's client certificate, they could blow up the DB storage by writing huge amounts of data, or overload it with queries and bring the cluster down.
I am looking for ideas for limiting this user so it can act only for its purpose, the simple actions required for this monitoring, and cannot be exploited by an attacker. Alternatively, I would appreciate any suggestions for a different implementation of this monitoring.
I searched for built-in PostgreSQL configurations that would allow me to limit the dedicated monitoring schema or limit the amount of queries performed by the user.
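For reference, a minimal sketch of the check described above, assuming psycopg2 and placeholder connection details and certificate paths:

```python
import psycopg2

# Placeholder connection details; the client certificate authenticates the
# dedicated monitoring user against the cluster.
conn = psycopg2.connect(
    host="cluster-1.example.internal",
    dbname="postgres",
    user="monitoring",
    sslmode="verify-full",
    sslcert="/etc/monitoring/client.crt",
    sslkey="/etc/monitoring/client.key",
    sslrootcert="/etc/monitoring/ca.crt",
)

try:
    with conn, conn.cursor() as cur:
        # Everything happens in the schema dedicated to this monitoring.
        cur.execute("CREATE TABLE monitoring.healthcheck (id int, note text)")
        cur.execute("INSERT INTO monitoring.healthcheck VALUES (1, 'ok')")
        cur.execute("SELECT note FROM monitoring.healthcheck")
        assert cur.fetchone()[0] == "ok"
        cur.execute("DROP TABLE monitoring.healthcheck")
finally:
    conn.close()
```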

Cloud Formation Template design

What factors do folk take into account when deciding to write one large CF template, or nest many smaller ones? The use case I have in mind is RDS-based, where I'll need to define RDS instances, VPC security groups, and parameter and option groups, as well as execute some custom Lambda resources.
My gut feel is that this should be split, perhaps by resource type, but I was wondering if there is a generally accepted practice on this.
My current rule of thumb is to split resources by deployment units - what deploys together, goes together.
I want to have the smallest deployable stack, because it's fast to deploy or fail if there's an issue. I don't follow this rule religiously. For example, I often group Lambdas together (even unrelated ones, depends on the size of the project), as they update only if the code/config changed and I tend to push small updates where only one Lambda changed.
I also often have a stack of shared resources that are used (Fn::Import-ed) throughout the other stacks like a KMS key, a shared S3 Bucket, etc.
Note that I have a CD process set up for every stack, hence the rule.
My current setup requires deployment of a VPC (with endpoints), RDS & an application (API Gateway, Lambdas). I have broken them down as follows:
VPC stack: a shared resource with 1 VPC per region with public & private subnets, VPC endpoints, S3 bucket, NAT gateways, ACLs, security groups.
RDS stack: I can have multiple RDS clusters inside a VPC, so it made sense to keep it separate. Also, this is created after the VPC, as it needs VPC resources such as private subnets and security groups. This cluster is shared by multiple application stacks.
Application stack: This deploys API gateway & Lambdas (basically a serverless application) with the above RDS cluster as the DB.
So, in general, I pretty much follow what @Milan Cermak described. But in my case, these deployments are done when needed (not as part of CD), so exported parameters are stored in the Parameter Store of AWS Systems Manager.
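As an illustration, a rough boto3 sketch of pushing a stack output into Parameter Store after one deployment and reading it back before the next (all parameter names and values are made up):

```python
import boto3

ssm = boto3.client("ssm")

# After deploying the VPC stack, store one of its outputs (e.g. a subnet ID)
# for later stacks to consume. The parameter name is arbitrary.
ssm.put_parameter(
    Name="/shared/vpc/private-subnet-id",
    Value="subnet-0123456789abcdef0",
    Type="String",
    Overwrite=True,
)

# A later deployment (e.g. the RDS stack) reads it back.
subnet_id = ssm.get_parameter(
    Name="/shared/vpc/private-subnet-id"
)["Parameter"]["Value"]
print(subnet_id)
```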

How can I use dataproc to pull data from bigquery that is not in the same project as my dataproc cluster?

I work for an organisation that needs to pull data from one of our client's BigQuery datasets using Spark, and given that both the client and we use GCP, it makes sense to use Dataproc to achieve this.
I have read Use the BigQuery connector with Spark, which looks very useful; however, it seems to assume that the Dataproc cluster, the BigQuery dataset and the storage bucket for the temporary BigQuery export are all in the same GCP project, and that is not the case for me.
I have a service account key file that allows me to connect to and interact with our client's data stored in BigQuery. How can I use that service account key file in conjunction with the BigQuery connector and Dataproc in order to pull data from BigQuery and interact with it in Dataproc? To put it another way, how can I modify the code provided at Use the BigQuery connector with Spark to use my service account key file?
To use service account keyfile authorization you need to set the mapred.bq.auth.service.account.enable property to true and point the BigQuery connector to a service account JSON keyfile using the mapred.bq.auth.service.account.json.keyfile property (at cluster or job level). Note that this property value is a local path, which is why you need to distribute the keyfile to all the cluster nodes beforehand, using an initialization action, for example.
Alternatively, you can use any authorization method described here, but you need to replace the fs.gs properties prefix with mapred.bq for the BigQuery connector.
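As a hedged sketch in PySpark (the keyfile path is a placeholder and must already be present on every node, e.g. distributed via an initialization action):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Point the BigQuery connector at the client's keyfile instead of the
# cluster's own service account. The path is hypothetical and must exist
# on every node of the cluster.
hconf.set("mapred.bq.auth.service.account.enable", "true")
hconf.set("mapred.bq.auth.service.account.json.keyfile",
          "/etc/secrets/client-sa.json")

# The rest of the linked "Use the BigQuery connector with Spark" example
# can then run unchanged against the client's dataset.
```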

Kubernetes: Databases & DB Users

We are planning to use Kube for Postgres deployments. Our applications will be microservices, each with a separate schema (or logical database). For security's sake, we'd like to have separate users for each schema/logical_db.
I suppose that the DB/schema and user should be created by Kube, so the application itself does not need to have access to the DB admin account.
In Stolon it seems it is only possible to create a single user and a single database, and this seems to be the case for other HA Postgres charts as well.
Question: What is the preferred way in Microservices in Kube to create DB users?
When it comes to creating users, as you said, most charts and containers have environment variables for creating a user at boot time. However, most of them do not consider the possibility of creating multiple users at boot time.
What other containers do is, as you said, keep the root credentials in k8s secrets so they can access the database and create the proper schemas and users. This does not necessarily need to be done in the application logic; it can, for example, be done by an init container that sets up the proper database for your application to run.
https://kubernetes.io/docs/concepts/workloads/pods/init-containers
This way you would have a pod with two containers: one for your application and an init container for setting up the DB.
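A rough sketch of what that init container might run, assuming psycopg2 and admin credentials injected from a Kubernetes Secret as environment variables (all names are placeholders):

```python
import os
import psycopg2

# Admin credentials injected from a Kubernetes Secret as environment variables.
conn = psycopg2.connect(
    host=os.environ["PGHOST"],
    dbname=os.environ.get("PGDATABASE", "postgres"),
    user=os.environ["PGADMIN_USER"],
    password=os.environ["PGADMIN_PASSWORD"],
)
conn.autocommit = True

app_user = os.environ["APP_DB_USER"]        # assumed trusted config, not user input
app_password = os.environ["APP_DB_PASSWORD"]

with conn.cursor() as cur:
    # Create a dedicated role and schema for this microservice if missing.
    cur.execute("SELECT 1 FROM pg_roles WHERE rolname = %s", (app_user,))
    if cur.fetchone() is None:
        # psycopg2 interpolates the password as a quoted literal client-side.
        cur.execute(f"CREATE ROLE {app_user} LOGIN PASSWORD %s", (app_password,))
    cur.execute(f"CREATE SCHEMA IF NOT EXISTS {app_user} AUTHORIZATION {app_user}")

conn.close()
```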

Cloud Foundry for SaaS

I am implementing a service broker for my SaaS application on Cloud Foundry.
On create-service of my SaaS application, I also create an instance of another service (say service-A), i.e. a new service instance of service-A is created for every tenant that on-boards my application.
The details of the newly created service instance (service-A) are passed to my service broker via an environment variable.
To be able to process this newly injected environment variable, the service broker needs to be restaged/restarted.
This means downtime for the service broker for every new customer that on-boards.
I have following questions:
1) How are these kinds of use cases handled in Cloud Foundry?
2) Why did Cloud Foundry choose to use environment variables to pass the information required to use a service? It seems limiting, as it requires an application restart.
As a first guess, your service could be some kind of API provided to a customer. This API must store the data it is sent in some database (e.g. MongoDB or MySQL). So MongoDB or MySQL would be what you call service-A.
Since you want the performance of the API endpoints for your customers to be independent of each other, you are provisioning dedicated databases for each of your customers, that is for each of the service instances of your service.
You are right in that you would need to restage your service broker if you were to get the credentials to these databases from the environment of your service broker. Or at least you would have to re-read the VCAP_SERVICES environment variable. Yet there is another solution:
Use the CC-API to create the services, and bind them to whatever app you like. Then use the CC-API again to query the bindings of this app. This will include the credentials. Here is the link to the API docs for this endpoint:
https://apidocs.cloudfoundry.org/247/apps/list_all_service_bindings_for_the_app.html
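A rough sketch of that query, assuming the requests library; the API URL, app GUID and OAuth token are placeholders you could obtain via the cf CLI (e.g. cf app --guid and cf oauth-token):

```python
import requests

# Placeholder values for the target foundation, app and token.
api_url = "https://api.example.com"
app_guid = "bc8d3381-5868-4be2-9093-a92662acbbac"
token = "bearer eyJ..."

resp = requests.get(
    f"{api_url}/v2/apps/{app_guid}/service_bindings",
    headers={"Authorization": token},
)
resp.raise_for_status()

# Each binding's entity contains the credentials of the bound service instance.
for binding in resp.json()["resources"]:
    print(binding["entity"]["credentials"])
```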
It sounds like you are not using services in the 'correct' manner. It's very hard to tell without more detail of your use case. For instance, why does your broker need to have this additional service attached?
To answer your questions:
1) Not like this. You're using service bindings to represent data, rather than using them as backing services. Many service brokers (I've written quite a few) need to dynamically provision things like Cassandra clusters, but they keep some state about which Cassandra clusters belong to which CF service in a data store of their own. The broker does not bind to each thing it is responsible for creating.
2) Because 12 Factor applications should treat backing services as attached, static resources. It is not normal to, say, add a new MySQL database to a running application.