How to obtain the GCP project name within SQL run against BigQuery from Composer/Airflow

I would like to write a query to be executed by Composer/Airflow (python BigQueryOperator referring to an SQL file) along the lines of
SELECT col1, col2, ... FROM `{{ GCP_PROJECT }}.dataset.table` ...
I wish for the GCP project to be parametrised in the SQL so I can deploy the same SQL file to the production/development (prod/dev) environments and test in dev without attempting to query prod tables that the dev environment does not have access to.
Is this something that would already be set up?
On this point I couldn't find any helpful examples in the Composer guide, apart from the note that GCP_PROJECT is already reserved, and I'm not sure how to pass an environment variable on to the templating anyway. Thanks.

I haven't tested it myself, but I think something like this should work:
import os

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(...)

def print_env_var():
    print(os.environ["GCP_PROJECT"])

print_context = PythonOperator(
    task_id="gcp_project",
    python_callable=print_env_var,
    dag=dag,
)
Based on the Google documentation, it's not recommended to use reserved environment variables (https://cloud.google.com/composer/docs/how-to/managing/environment-variables#reserved_names):
Cloud Composer uses these reserved names for variables that internal
processes use. Do not refer to reserved names in your workflows.
Variable values can change without notice.
I think a better approach to handling staging/production environments is to set variables in Airflow itself and read them in the DAG. Here's a great resource that goes into a lot of detail about Airflow variables.
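A minimal sketch of that approach, assuming an Airflow Variable named gcp_project has been set to the right project id in each Composer environment (the variable name, SQL file name and task id below are illustrative, not from the original answer):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.models import Variable

# Read the environment-specific project id set via the Airflow UI or CLI.
gcp_project = Variable.get("gcp_project")

query = BigQueryOperator(
    task_id="run_parametrised_query",
    sql="my_query.sql",          # templated .sql file in the DAG folder
    use_legacy_sql=False,
    params={"gcp_project": gcp_project},
    dag=dag,                      # assumes this sits in a DAG file where `dag` is defined
)

The SQL file would then reference the table as `{{ params.gcp_project }}.dataset.table`.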

Related

Azure Synapse SQL CICD with multiple environments

This might be a stupid question, but it seems very difficult to find information about Synapse with multiple environments.
We have a dev/test/prod environment setup and need to create partially automated CI/CD pipelines between them. The only problem is that we cannot build dynamic SQL scripts that query from the respective storage accounts, so that the scripts could be identical no matter the environment: dev Synapse using data from dev storage, and so on. A dedicated SQL pool can benefit from stored procedures, and I could pass parameters there if that works. But what about the serverless pool? What is the correct way?
I've tried looking at options for OPENROWSET with the DATA_SOURCE argument as well as the EXTERNAL DATA SOURCE expression, without any luck. Also, no one seems to offer any information about this, so I'm beginning to wonder whether this whole perspective is wrong.
This kind of "external" file reading is new to me; I may have been trying to put this into a SQL Server context in my head.
Thank you for your time!
Okay, the serverless pool does support both stored procedures and dynamic SQL, yet you currently cannot call them straight from Synapse Pipelines.
You have to either trigger those procedures via Spark notebooks or create separate Synapse Analytics linked services for each of your databases in the Synapse serverless pool and work from there.

Convert Terraform Templates to Cloudformation Templates

I want to convert existing Terraform templates (HCL) to AWS CloudFormation templates (JSON/YAML).
I basically want to find security issues with these templates through cfn_nag.
An approach I have already tried was converting the HCL to JSON and then passing the template to cfn_nag, but it failed since the two templates have different structures.
Can anyone please provide any suggestions here?
A rather convoluted way of achieving this is to use Terraform to stand up actual AWS environments, and then to use AWS’s CloudFormer to extract CloudFormation templates (JSON or YAML) from what Terraform has built, at which point you can use cfn-nag.
CloudFormer has some limitations, in that not all AWS resources are currently supported (RDS security groups, for example), but it will get you all the basic AWS resources.
Don't forget to remove all the environments, including CloudFormer's, to minimise the cost.
You want to use static code analysis to find security issues in your Terraform setup.
Converting Terraform to CloudFormation in order to use cfn-nag is one way. However, there are now tools that operate directly on the Terraform setup.
I would recommend taking a look at terrascan. It is built on terraform_validate.
https://github.com/bridgecrewio/checkov/ runs security scanning for both Terraform and CloudFormation.
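For example, checkov can be pointed straight at a Terraform directory, so no conversion step is needed. A rough sketch (the directory path is a placeholder, and checkov is normally just run from the command line):

import subprocess

# Scan a Terraform directory directly with checkov instead of converting
# the HCL to CloudFormation first.
result = subprocess.run(["checkov", "-d", "./terraform"], capture_output=True, text=True)
print(result.stdout)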

How to implement the "One Binary" principle with Docker

The One Binary principle explained here:
http://programmer.97things.oreilly.com/wiki/index.php/One_Binary states that one should...
"Build a single binary that you can identify and promote through all the stages in the release pipeline. Hold environment-specific details in the environment. This could mean, for example, keeping them in the component container, in a known file, or in the path."
I see many DevOps engineers arguably violate this principle by creating one Docker image per environment (i.e. my-app-qa, my-app-prod and so on). I know that Docker favours immutable infrastructure, which implies not changing an image after deployment, and therefore not uploading or downloading configuration post deployment. Is there a trade-off between immutable infrastructure and the one binary principle, or can they complement each other? When it comes to separating configuration from code, what is the best practice in a Docker world? Which one of the following approaches should one take...
1) Creating a base binary image and then having a configuration Dockerfile that augments this image by adding environment-specific configuration (i.e. my-app -> my-app-prod).
2) Deploying a binary-only docker image to the container and passing in the configuration through environment variables and so on at deploy time.
3) Uploading the configuration after deploying the Docker file to a container
4) Downloading configuration from a configuration management server from the running docker image inside the container.
5) Keeping the configuration in the host environment and making it available to the running Docker instance through a bind mount.
Is there another better approach not mentioned above?
How can one enforce the one binary principle using immutable infrastructure? Can it be done, or is there a trade-off? What is the best practice?
I've about 2 years of experience deploying Docker containers now, so I'm going to talk about what I've done and/or know to work.
So, let me first begin by saying that containers should definitely be immutable (I even mark mine as read-only).
Main approaches:
use configuration files by setting a static entrypoint and overriding the configuration file location by overriding the container startup command - that's less flexible, since one would have to commit the change and redeploy in order to enable it; not fit for passwords, secure tokens, etc
use configuration files by overriding their location with an environment variable - again, depends on having the configuration files prepped in advance; not fit for passwords, secure tokens, etc
use environment variables - that might need a change in the deployment code, thus lessening the time to get the config change live, since it doesn't need to go through the application build phase (in most cases); deploying such a change can be pretty easy. Here's an example - if deploying a containerised application to Marathon, changing an environment variable could potentially just start a new container from the last used container image (potentially even on the same host), which means that this could be done in mere seconds; not fit for passwords, secure tokens, etc., and especially so in Docker
store the configuration in a k/v store like Consul, make the application aware of that, and let it even be dynamically reconfigurable. A great approach for launching features simultaneously - possibly even across multiple services; if implemented with a solution such as HashiCorp Vault, which provides secure storage for sensitive information, you could even have ephemeral secrets (an example would be the PostgreSQL secret backend for Vault - https://www.vaultproject.io/docs/secrets/postgresql/index.html)
have an application or script create the configuration files before starting the main application - store the configuration in a k/v store like Consul, use something like consul-template in order to populate the app config; a bit more secure - since you're not carrying everything over through the whole pipeline as code
have an application or script populate the environment variables before starting the main application - an example for that would be envconsul; not fit for sensitive information - someone with access to the Docker API (either through the TCP or UNIX socket) would be able to read those
I've even had a situation in which we were populating variables into an AWS instance's user_data and injecting them into the container on startup (with a script that modifies the container's JSON config on startup).
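To make the environment-variable approach above concrete, here is a minimal sketch (the setting names and file path are made up for illustration) of an application resolving its config at startup from the environment, with a baked-in defaults file as a fallback:

import json
import os

def load_config(path="/app/config/defaults.json"):
    # Defaults baked into the immutable image.
    with open(path) as f:
        config = json.load(f)
    # Environment variables injected at deploy time override the defaults,
    # so the same image can be promoted unchanged from dev to prod.
    for key in config:
        value = os.environ.get(key.upper())
        if value is not None:
            config[key] = value
    return config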
The main things that I'd take into consideration:
what are the variables that I'm exposing, and when and where am I getting their values from (could be the CD software, or something else) - for example, you could publish the AWS RDS endpoint and credentials to the instance's user_data, potentially even EC2 tags with some IAM instance profile magic
how many variables do we have to manage and how often do we change some of them - if we have a handful, we could probably just go with environment variables, or use environment variables for the most commonly changed ones and variables stored in a file for those that we change less often
and how fast do we want to see them changed - if it's a file, it typically takes more time to deploy it to production; if we're using environment variables, we can usually deploy those changes much faster
how do we protect some of them - where do we inject them and how - for example Ansible Vault, HashiCorp Vault, keeping them in a separate repo, etc
how do we deploy - that could be a JSON config file sent to an deployment framework endpoint, Ansible, etc
what's the environment that we're having - is it realistic to have something like Consul as a config data store (Consul has 2 different kinds of agents - client and server)
I tend to prefer the most complex case of having them stored in a central place (k/v store, database) and having them changed dynamically, because I've encountered the following cases:
slow deployment pipelines - which makes it really slow to change a config file and have it deployed
having too many environment variables - this could really grow out of hand
having to turn on a feature flag across the whole fleet (consisting of tens of services) at once
an environment in which there is a real drive to increase security by better handling sensitive config data
I've probably missed something, but I guess that should be enough of a trigger to think about what would be best for your environment.
How I've done it in the past is to incorporate tokenization into the packaging process after a build is executed. These tokens can be managed in an orchestration layer that sits on top to manage your platform tools. So for a given token, there is a matching regex or XPath expression. That token is linked to one or many config files, depending on the relationship that is chosen. Then, when this build is deployed to a container, a platform service (i.e. config management) will replace these tokens with the correct value for its environment. These values would most likely be pulled from a vault.
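As a purely illustrative sketch of that token-replacement idea (the token format, file names and values are hypothetical, not from the answer):

import re

# Replace __TOKEN__ placeholders in a config template with values for the
# target environment; in practice the values would come from a vault or
# config-management service rather than a literal dict.
TOKEN = re.compile(r"__([A-Z0-9_]+)__")

def render_config(template_path, output_path, values):
    with open(template_path) as f:
        text = f.read()
    rendered = TOKEN.sub(lambda m: values[m.group(1)], text)
    with open(output_path, "w") as f:
        f.write(rendered)

render_config("app.conf.tmpl", "app.conf", {"DB_HOST": "db.prod.internal"})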

Best practice for getting RDS password to docker container on ECS

I am using Postgres Amazon RDS and Amazon ECS for running my docker containers.
The question is. What is the best practice for getting the username and password for the RDS database into the docker container running on ECS?
I see a few options:
Build the credentials into the Docker image. I don't like this, since then everyone with access to the image can get the password.
Put the credentials in the userdata of the launch configuration used by the autoscaling group for ECS. With this approach, all Docker images running on my ECS cluster have access to the credentials. I don't really like that either; that way, if a blackhat finds a security hole in any of my services (even services that do not use the database), he will be able to get the credentials for the database.
Put the credentials in S3 and limit access to that bucket with an IAM role that the ECS server has. Same drawbacks as putting them in the userdata.
Put the credentials in the Task Definition of ECS. I don't see any drawbacks here.
What is your thoughts on the best way to do this? Did I miss any options?
regards,
Tobias
Building it into the container is never recommended. It makes it hard to distribute and change.
Putting it into the ECS instances does not help your containers use it. They are isolated, and you'd end up with the credentials on all instances instead of just where the containers that need them are.
Putting them into S3 means you'll have to write that functionality into your container, and it's another place to have configuration.
Putting them into your task definition is the recommended way. You can use the environment portion for this. It's flexible. It's also how PaaS offerings like Heroku and Elastic Beanstalk use DB connection strings for Ruby on Rails and other services. The last benefit is that it makes it easy to use your containers against different databases (like dev, test, prod) without rebuilding containers or building weird functionality.
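As a hedged sketch of what that looks like when done through the API rather than the console (the family, image and credential values below are placeholders; in practice the values would come from your deployment tooling):

import boto3

# Register a task definition revision with the DB credentials supplied
# through the container's environment section.
ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="my-app",
    containerDefinitions=[
        {
            "name": "my-app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
            "memory": 512,
            "environment": [
                {"name": "DB_HOST", "value": "mydb.abc123.us-east-1.rds.amazonaws.com"},
                {"name": "DB_USER", "value": "app"},
                {"name": "DB_PASSWORD", "value": "example-only"},
            ],
        }
    ],
)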
The accepted answer recommends configuring environment variables in the task definition. This configuration is buried deep in the ECS web console. You have to:
Navigate to Task Definitions
Select the correct task and revision
Choose to create a new revision (not allowed to edit existing)
Scroll down to the container section and select the correct container
Scroll down to the Env Variables section
Add your configuration
Save the configuration and task revision
Choose to update your service with the new task revision
This tutorial has screenshots that illustrate where to go.
Full disclosure: This tutorial features containers from Bitnami and I work for Bitnami. However the thoughts expressed here are my own and not the opinion of Bitnami.
For what it's worth, while putting credentials into environment variables in your task definition is certainly convenient, it's generally regarded as not particularly secure -- other processes can access your environment variables.
I'm not saying you can't do it this way -- I'm sure there are lots of people doing exactly this, but I wouldn't call it "best practice" either. Using Amazon Secrets Manager or SSM Parameter Store is definitely more secure, although getting your credentials out of there for use has its own challenges and on some platforms those challenges may make configuring your database connection much harder.
Still -- it seems like a good idea that anyone running across this question be at least aware that using the task definition for secrets is ... shall we say ... frowned upon?
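For completeness, a minimal sketch of the more secure route mentioned above, where the container fetches the password itself at startup using the task's IAM role (the parameter name is a placeholder):

import boto3

# Fetch the DB password from SSM Parameter Store instead of baking it into
# the task definition's environment variables.
ssm = boto3.client("ssm")
response = ssm.get_parameter(Name="/my-app/prod/db_password", WithDecryption=True)
db_password = response["Parameter"]["Value"]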

Using Ansible to automatically configure AWS autoscaling group instances

I'm using Amazon Web Services to create an autoscaling group of application instances behind an Elastic Load Balancer. I'm using a CloudFormation template to create the autoscaling group + load balancer and have been using Ansible to configure other instances.
I'm having trouble wrapping my head around how to design things such that when new autoscaling instances come up, they can automatically be provisioned by Ansible (that is, without me needing to find out the new instance's hostname and run Ansible for it). I've looked into Ansible's ansible-pull feature but I'm not quite sure I understand how to use it. It requires a central git repository which it pulls from, but how do you deal with sensitive information which you wouldn't want to commit?
Also, the current way I'm using Ansible with AWS is to create the stack using a CloudFormation template, then get the hostnames as output from the stack, and then generate a hosts file for Ansible to use. This doesn't feel quite right – is there a "best practice" for this?
Yes, another way is simply to run your playbooks locally once the instance starts. For example, you can create an EC2 AMI for your deployment that, in its rc.local file (Linux), calls ansible-playbook -i <inventory-only-with-localhost-file> <your-playbook>.yml. rc.local is almost the last script run at startup.
You could just store that sensitive information in your EC2 AMI, but this is a very wide topic and really depends on what kind of sensitive information it is. (You can also use private git repositories to store sensitive data).
If, for example, your playbooks get updated regularly, you can create a cron entry in your AMI that runs every so often and actually runs your playbook, to make sure your instance configuration is always up to date, thus avoiding having to "push" from a remote workstation.
This is just one approach there could be many others and it depends on what kind of service you are running, what kind data you are using, etc.
I don't think you should use Ansible to configure new auto-scaled instances. Instead, use Ansible to configure a new image, from which you will create an AMI (Amazon Machine Image), and have AWS autoscaling launch from that instead.
On top of this, you should also use Ansible to easily update your existing running instances whenever you change your playbook.
Alternatives
There are a few ways to do this. First, I wanted to cover some alternative ways.
One option is to use Ansible Tower. This creates a dependency though: your Ansible Tower server needs to be up and running at the time autoscaling or similar happens.
The other option is to use something like packer.io and build fully-functioning server AMIs. You can install all your code into these using Ansible. This doesn't have any non-AWS dependencies, and has the advantage that it means servers start up fast. Generally speaking building AMIs is the recommended approach for autoscaling.
Ansible Config in S3 Buckets
The alternative route is a bit more complex, but has worked well for us when running a large site (millions of users). It's "serverless" and only depends on AWS services. It also supports multiple Availability Zones well, and doesn't depend on running any central server.
I've put together a GitHub repo that contains a fully-working example with Cloudformation. I also put together a presentation for the London Ansible meetup.
Overall, it works as follows:
Create S3 buckets for storing the pieces that you're going to need to bootstrap your servers.
Save your Ansible playbook and roles etc in one of those S3 buckets.
Have your autoscaling process run a small shell script. This script fetches things from your S3 buckets and uses them to "bootstrap" Ansible.
Ansible then does everything else.
All secret values, such as database passwords, are stored in CloudFormation parameter values. The 'bootstrap' shell script copies these into an Ansible fact file.
So that you're not dependent on external services being up, you also need to save any build dependencies (e.g. any .deb files, package install files or similar) in an S3 bucket. You want this because you don't want to require ansible.com or similar to be up and running for your autoscale bootstrap script to be able to run. Generally speaking, I've tried to only depend on Amazon services like S3.
In our case, we then also use AWS CodeDeploy to actually install the Rails application itself.
The key bits of the config relating to the above are:
S3 Bucket Creation
Script that copies things to S3
Script to bootstrap Ansible. This is the core of the process. It also writes the Ansible fact files based on the CloudFormation parameters.
Use the Facts in the template.
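As a loose illustration of that bootstrap step (the answer uses a shell script; Python with boto3 is used here only for the sketch, and the bucket, key and path names are made up):

import json
import subprocess

import boto3

# Pull the playbook bundle out of the bootstrap S3 bucket and unpack it.
s3 = boto3.client("s3")
s3.download_file("my-bootstrap-bucket", "ansible/playbooks.tar.gz", "/tmp/playbooks.tar.gz")
subprocess.run(["tar", "-xzf", "/tmp/playbooks.tar.gz", "-C", "/etc/ansible"], check=True)

# Secrets passed in as CloudFormation parameters get written to a local fact file.
facts = {"db_password": "value-from-cloudformation-parameter"}
with open("/etc/ansible/facts.d/app.fact", "w") as f:
    json.dump(facts, f)

# Run the playbook against the instance itself.
subprocess.run(
    ["ansible-playbook", "-i", "localhost,", "-c", "local", "/etc/ansible/site.yml"],
    check=True,
)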