Is there a way to deploy scheduled queries to GCP directly through a github action, with a configurable schedule? - github

Currently using GCP BigQuery UI for scheduled queries, everything is manually done.
Wondering if there's a way to automatically deploy to GCP using a config JSON that contains the scheduled query's parameters and scheduled times through github actions?
So far, this is one option I've found that makes it more "automated":
- store query in a file on Cloud Storage. When invoking Cloud Function, you read the file content and you perform a bigQuery job on it.
- have to update the file content to update the query
- con: read file from storage, then call BQ: 2 api calls and query file to manage
Currently using DBT in other repos to automate and make this process quicker: https://docs.getdbt.com/docs/introduction
Would prefer the github actions version though, just haven't found a good docu yet :)

Related

Consume a REST API and store into BigQuery

We are trying to ingest data from Zoho Creator which exposes data through REST APIs.
Currently we are manually running a python script which retrieves the data and converts it into an AVRO file. Then we invoke a BQ load command to load it into BigQuery as large string objects per request.
What are the options to deploy this on GCP? Looking for minimal coding/configuration options.
We are thinking of a Python operator on Composer or Cloud functions.
Thanks.
You can do exactly the same thing with Cloud Functions or Cloud Run.
With Cloud Functions you have to wrap your script in a Cloud Functions function pattern
With Cloud Run, you have to wrap your script in a function and then to deploy a webserver which call this function. Finally, you need to build a container that you can deploy on Cloud Run. It seems more boiler plate but it offers more flexibility and runtime customization.
Do you want to call it on a scheduler? User Cloud Scheduler for that.
EDIT 1
If you use a Cloud Composer python operator, it can also work. But I don't like this solution:
You execute your business logic inside Composer, a workflow orchestrator. In term of separation of concern, it's not so good
The runtime isolation isn't guaranteed. I prefer Cloud Function for that.
If tomorrow you need to perform 100 requests in parallel, Composer isn't so scalable, Cloud Functions is.
There are a lot of change that Cloud Functions processing will be free. Cloud Composer cost at least $400 per month. If you already have a cluster, why not, but if not....
Cloud Run (more than Cloud Functions) is portable. if you need to deploy that elsewhere, you can. With composer you are sticky to the Python operator and you need rework to achieve the portability.

Spring Cloud Data Flow: versioned streams

I'm implementing a stream pipe with Spring Cloud Data Flow.
My problem is that I configured MANUALLY the pipe (e.g. http | log_sink) in the server and it will be lost if I reset that server (think in an Amazon EC2 instance that can be hard reseted).
Which is the suggested way to keep versioning of pipes using SCDF?
Thanks.
I am summarizing the discussion from the comments.
To automate the promotion of Stream/Task workloads from lower to higher-level environments, the recommended approach would be the use of SCDF's Java DSL. With this, users can programmatically register, create, deploy, or launch stream/task in a repeatable manner and across many different platforms simultaneously (if there's a need for it). The Boot App built with the Java DSL can be versioned in Git, and it can be CD/GitOps friendly. With sufficient generalization to this App, it can also be reused by many different teams by overriding the defaults.
We put this for use in the product proper for or IT and Acceptance tests, which run on every upstream commit daily across multiple Kubernetes and Cloud Foundry installations.
Alternatively, all of the register, create, deploy, or launch stream/task commands can also be dumped in a text or a property file. Once when you have the file, the dataflow:>script --file command can help slurp in all the commands in each of the new environments — see docs.

In teraform, is there a way to refresh the state of a resource using TF files without using CLI commands?

I have a requirement to refresh the state of a resource "ibm_is_image" using TF files without using CLI commands ?
I know that we can import the state of a resource using "terraform import ". But I should do the same using IaC in TF files.
How to achieve this ?
Example:
In workspace1, I create a resource "f5_custom_image" which gets deleted later from command line. In workspace2, the same code in TF file will assume that "f5_custom_image" already exists and it fails to read the custom image resource. So, my code has to refresh the terraform state of this resource for every execution of "terraform apply":
resource "ibm_is_image" "f5_custom_image" {
depends_on = ["data.ibm_is_images.custom_images"]
href = "${local.image_url}"
name = "${var.vnf_vpc_image_name}"
operating_system = "centos-7-amd64"
timeouts {
create = "30m"
delete = "10m"
}
}
In Terraform's model, an object is fully managed by a single Terraform configuration and nothing else. Having an object be managed by multiple configurations or having an object be created by Terraform but then deleted later outside of Terraform is not a supported workflow.
Terraform is intended for managing long-lived architecture that you will gradually update over time. It is not designed to manage build artifacts like machine images that tend to be created, used, and then destroyed.
The usual architecture for this sort of use-case is to consider the creation of the image as a "build" step, carried out using some other software outside of Terraform, and then we use Terraform only for the "deploy" step, at which point the long-lived infrastructure is either created or updated to use the new image.
That leads to a build and deploy pipeline with a series of steps like this:
Use separate image build software to construct the image, and record the id somewhere from which it can be retrieved using a data source in Terraform.
Run terraform apply to update the long-lived infrastructure to make use of the new image. The Terraform configuration should include a data block to read the image id from wherever it was recorded in the previous step.
If desired, destroy the image using software outside of Terraform once Terraform has completed.
When implementing a pipeline like this, it's optional but common to also consider a "rollback" process to use in case the new image is faulty:
Reset the recorded image id that Terraform is reading from back to the id that was stored prior to the new build step.
Run terraform apply to update the long-lived infrastructure back to using the old image.
Of course, supporting that would require retaining the previous image long enough to prove that the new image is functioning correctly, so the normal build and deploy pipeline would need to retain at least one historical image per run to roll back to. With that said, if you have a means to quickly recreate a prior image during rollback then this special workflow isn't strictly needed: instead, you can implement rollback instead by "rolling forward" to an image constructed with the prior configuration.
An example software package commonly used to prepare images for use with Terraform on other cloud vendors is HashiCorp Packer, but sadly it looks like it does not have IBM Cloud support and so you may need to look for some similar software that does support IBM Cloud, or write something yourself using the IBM Cloud SDK.

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post would help on how to trigger Dataflow pipelines from App Engine or Cloud Functions?
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions

Using Ansible to automatically configure AWS autoscaling group instances

I'm using Amazon Web Services to create an autoscaling group of application instances behind an Elastic Load Balancer. I'm using a CloudFormation template to create the autoscaling group + load balancer and have been using Ansible to configure other instances.
I'm having trouble wrapping my head around how to design things such that when new autoscaling instances come up, they can automatically be provisioned by Ansible (that is, without me needing to find out the new instance's hostname and run Ansible for it). I've looked into Ansible's ansible-pull feature but I'm not quite sure I understand how to use it. It requires a central git repository which it pulls from, but how do you deal with sensitive information which you wouldn't want to commit?
Also, the current way I'm using Ansible with AWS is to create the stack using a CloudFormation template, then I get the hostnames as output from the stack, and then generate a hosts file for Ansible to use. This doesn't feel quite right – is there "best practice" for this?
Yes, another way is just to simply run your playbooks locally once the instance starts. For example you can create an EC2 AMI for your deployment that in the rc.local file (Linux) calls ansible-playbook -i <inventory-only-with-localhost-file> <your-playbook>.yml. rc.local is almost the last script run at startup.
You could just store that sensitive information in your EC2 AMI, but this is a very wide topic and really depends on what kind of sensitive information it is. (You can also use private git repositories to store sensitive data).
If for example your playbooks get updated regularly you can create a cron entry in your AMI that runs every so often and that actually runs your playbook to make sure your instance configuration is always up to date. Thus avoiding having "push" from a remote workstation.
This is just one approach there could be many others and it depends on what kind of service you are running, what kind data you are using, etc.
I don't think you should use Ansible to configure new auto-scaled instances. Instead use Ansible to configure a new image, of which you will create an AMI (Amazon Machine Image), and order AWS autoscaling to launch from that instead.
On top of this, you should also use Ansible to easily update your existing running instances whenever you change your playbook.
Alternatives
There are a few ways to do this. First, I wanted to cover some alternative ways.
One option is to use Ansible Tower. This creates a dependency though: your Ansible Tower server needs to be up and running at the time autoscaling or similar happens.
The other option is to use something like packer.io and build fully-functioning server AMIs. You can install all your code into these using Ansible. This doesn't have any non-AWS dependencies, and has the advantage that it means servers start up fast. Generally speaking building AMIs is the recommended approach for autoscaling.
Ansible Config in S3 Buckets
The alternative route is a bit more complex, but has worked well for us when running a large site (millions of users). It's "serverless" and only depends on AWS services. It also supports multiple Availability Zones well, and doesn't depend on running any central server.
I've put together a GitHub repo that contains a fully-working example with Cloudformation. I also put together a presentation for the London Ansible meetup.
Overall, it works as follows:
Create S3 buckets for storing the pieces that you're going to need to bootstrap your servers.
Save your Ansible playbook and roles etc in one of those S3 buckets.
Have your Autoscaling process run a small shell script. This script fetches things from your S3 buckets and uses it to "bootstrap" Ansible.
Ansible then does everything else.
All secret values such as Database passwords are stored in CloudFormation Parameter values. The 'bootstrap' shell script copies these into an Ansible fact file.
So that you're not dependent on external services being up you also need to save any build dependencies (eg: any .deb files, package install files or similar) in an S3 bucket. You want this because you don't want to require ansible.com or similar to be up and running for your Autoscale bootstrap script to be able to run. Generally speaking I've tried to only depend on Amazon services like S3.
In our case, we then also use AWS CodeDeploy to actually install the Rails application itself.
The key bits of the config relating to the above are:
S3 Bucket Creation
Script that copies things to S3
Script to copy Bootstrap Ansible. This is the core of the process. This also writes the Ansible fact files based on the CloudFormation parameters.
Use the Facts in the template.