Are there any docs on how you envision the data engineer workflow looking with Mage?

We run several environments with different data residency requirements, so we would like to follow a declarative GitOps workflow if possible.
Essentially, a data engineer writes a new pipeline in Mage and creates a PR; once the PR is merged, it is deployed to all Mage environments.
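That "merge, then fan out to every environment" step could look roughly like the sketch below, as a plain GitHub Actions workflow. This is only an illustration: the environment names and the deploy script are assumptions, not Mage-specific tooling.

```yaml
# Hypothetical workflow: on merge to main, sync pipeline definitions to every environment.
name: deploy-pipelines
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [eu-west, us-east, ap-south]   # assumed environment names
    steps:
      - uses: actions/checkout@v4
      - name: Sync pipeline code to ${{ matrix.environment }}
        run: ./scripts/deploy.sh "${{ matrix.environment }}"   # hypothetical deploy script
```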

Related

Semantic Versioning on multiple services in the same Github repository using GH Actions

Our team uses a mono-repo with several microservices and some common packages shared between them.
I am tasked with adding CI/CD automation. Traditionally I rely on Git tags for the sem-ver and use commit messages to decide on major/minor/patch; the semantic-release Node library does a good job of automating this.
The problem here is that it is a mono-repo, so commits and tags only work for a single global sem-ver, whereas in my case each microservice needs its own sem-ver.
One thought I have is maintaining a JSON manifest to store the versions of the services. By blocking direct pushes to the main branch, I can guarantee this file is not changed on master except by the CI/CD actions.
I would also like to get some ideas from the community: what would you do in this situation, or what have you done in similar situations in the past?
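For what it's worth, one common way to get an independent sem-ver per service without a global tag namespace is a path-filtered release workflow per service, with semantic-release scoped by its tagFormat option. A minimal sketch, assuming a services/<name> layout and per-service config (file names and paths are illustrative):

```yaml
# .github/workflows/release-service-a.yml (hypothetical layout and names)
name: release-service-a
on:
  push:
    branches: [main]
    paths:
      - "services/service-a/**"   # only run a release when this service changes
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0            # semantic-release needs the full history and tags
      - uses: actions/setup-node@v4
      - run: npm ci
        working-directory: services/service-a
      # The per-service .releaserc sets tagFormat (e.g. "service-a-v${version}")
      # so each service keeps its own tag namespace instead of a single global one.
      - run: npx semantic-release
        working-directory: services/service-a
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```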

Release pipeline using several build pipelines?

I'm facing the following issue: I have one Git repo with a Node.js application. The application is divided into several components, namely: server, client, microserviceA, microserviceB. There is also a directory named shared with some shared code used by all the other components.
I have a pipeline for each of the components that only runs tests, triggered on pull request to master. Each pipeline only runs when the PR contains changes relevant to it, e.g. server-ci will run only when there were changes in the server component, etc.
Now, on merge to master, I would like to build the components and deploy them to a staging server. Currently what I have is as follows: for each component (besides shared) I have another build pipeline (<component>-build) which, on merge to master, builds the corresponding component (depending on the changes made, as above). I have one Release pipeline which takes all these build pipelines as artifacts and deploys them to the staging server. The good thing about this is that a merge to master which includes only changes in client will build only client and not all the other components.
The problem is that a merge to master that contains changes to several components will have more than one build pipeline running, and each of them will trigger the Release pipeline. This is not desirable.
A possible solution I thought about was to have only one build pipeline which runs on merge to master, but then I'd have to build ALL the components on each merge, which is inefficient.
What is the best way to deal with such situation?
In the release stage settings you can configure the number of parallel deployments to be 1, so concurrent triggers are queued instead of deploying at the same time.
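For reference, the per-component triggering described in the question is usually expressed with path filters in each component's pipeline YAML. A sketch for the client build pipeline, with assumed paths and commands:

```yaml
# client-build pipeline (illustrative; adjust paths and build commands to your repo)
trigger:
  branches:
    include:
      - master
  paths:
    include:
      - client/*
      - shared/*          # rebuild the client when the shared code changes too
pool:
  vmImage: ubuntu-latest
steps:
  - script: npm ci && npm run build
    workingDirectory: client
    displayName: Build client
  - publish: client/dist   # publish the build output as a pipeline artifact for the Release pipeline
    artifact: client-drop
```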

Development and Production Environments with GitHub flow

At work, we're now using GitHub, and with that GitHub flow. My understanding of GitHub flow is that there is a master branch and feature branches. Unlike git flow, there is no develop branch.
This works quite well on projects that we've done, and simplifies things.
However, for our products we have a development and a production environment. For the production environment we use the master branch, whereas for the development environment we're not sure what to do.
The only idea I can think of is:
When a branch is merged with master, redeploy master using GitHub Actions.
When any branch other than master is pushed, set up a GitHub Action to deploy that branch to the development environment.
Currently, for projects that require a development environment, we're essentially using git flow (features -> develop -> master).
Do you think my idea is sensible, and if not what would you recommend?
Edit:
Just to clarify, I'm asking the best way to implement development with GitHub Flow and not git flow.
In my experience, GitHub Flow with multiple environments works like this. Merging to master does not automatically deploy to production. Instead, merging to master creates a build artifact that can be promoted through environments using ChatOps tooling.
For example, pushing to master creates a build artifact named something like my-service-47cbd6c, which is a combination of the service name and the short commit hash. This is pushed to an artifact repository of some kind. The artifact can then be deployed to various environments using tooling such as ChatOps-style slash commands to trigger the deploy. This tooling could also have checks to make sure test environments are not skipped, for example. Finally, the artifact is promoted to production.
So for your use case with GitHub Actions, what I would suggest is this:
Pushing to master creates the build artifact and automatically deploys it to the development environment.
Test in development
Promote the artifact by deploying to production using a slash command. The action slash-command-dispatch would help you with this.
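A minimal sketch of the first two steps as a single GitHub Actions workflow, assuming hypothetical build and deploy scripts:

```yaml
name: build-and-deploy-dev
on:
  push:
    branches: [master]
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      artifact: ${{ steps.meta.outputs.artifact }}
    steps:
      - uses: actions/checkout@v4
      - id: meta
        name: Name the artifact after the service and short commit hash
        run: echo "artifact=my-service-$(git rev-parse --short HEAD)" >> "$GITHUB_OUTPUT"
      - name: Build and push to the artifact repository
        run: ./build.sh "${{ steps.meta.outputs.artifact }}"         # hypothetical build/push script
  deploy-dev:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Deploy the artifact to the development environment
        run: ./deploy.sh dev "${{ needs.build.outputs.artifact }}"   # hypothetical deploy script
```

The production deploy would then live in a separate workflow triggered by the slash command rather than by the push.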
You might also consider the notion of environments (as illustrated here)
Recently (Feb. 2021), you can:
Limit which branches can deploy to an environment
You can now limit which branches can deploy to an environment using Environment protection rules.
When a job tries to deploy to an environment with Deployment branches configured, Actions will check the value of github.ref against the configuration; if it does not match, the job will fail and the run will stop.
The Deployment branches rule can be configured to allow:
All branches – Any branch in the repository can deploy
Protected branches – Only branches with protection rules
Selected branches – Branches matching a set of name patterns
That means you can define a job that deploys to the dev environment, and that job, as a condition, will only run if it was triggered by a commit pushed to a given branch (master in your case).
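In workflow terms that could look roughly like this; the environment name and condition are illustrative, and the Deployment branches rule itself lives in the repository's environment settings rather than in the YAML:

```yaml
jobs:
  deploy-dev:
    runs-on: ubuntu-latest
    # The job targets the 'development' environment; its protection rules
    # (including Deployment branches) are configured in the repo settings.
    environment: development
    # Belt and braces: only run for commits pushed to master.
    if: github.ref == 'refs/heads/master'
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh development   # hypothetical deploy script
```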
For anyone facing the same question or wanting to simplify their process away from Gitflow, I'd recommend taking a look at this article. Whilst it doesn't talk about GitHub flow explicitly, it does effectively provide one solution to the OP.
Purists may consider this to be not strictly Gitflow, but to my mind it's a simple tweak that makes the deployment and CI/CD strategy more explicit in Git. I prefer this approach rather than adding some magic to the tooling, which can make a process harder for devs to follow and understand.
I think the Gitflow intro is written fairly pragmatically as well:
Different teams may have different deployment strategies. For some, it may be best to deploy to a specially provisioned testing environment. For others, deploying directly to production may be the better choice...
The diagram in the article sums it up well:
Here, Master corresponds to Gitflow's main, and the useful addition is the temporary release branch from which you can deploy to other environments such as development. What is worth considering is what you choose to call this temporary branch: in the article it's considered a release branch, in your process it may be a test branch, etc.
You can take or leave the squashing and tagging and the tooling will change between teams. Equally you may or may not care about actual version numbers.
This isn't a million miles away from VonC's answer; the difference is that the process is more tightly defined and leans more towards having multiple developers merge into a single branch and apply fixes in order to get a new version ready for production. It may well be that you configure the deployment of this temporary branch via a naming convention, as in his answer.
The way I've implemented this flow is using PRs. I did it with Azure DevOps, but I'd say that the same can be achieved with GitHub Actions.
When you have a branch that you intend to test and eventually merge to master and release to production, you create a PR from that branch to master. The PR will trigger a pipeline, which will run your build, static analysis and tests. If that passes, the PR is deployed to a test environment where further automated and manual testing can happen. The PR can be reviewed and approved by other developers and, if you need to, by QA after manual testing. You can configure GitHub PR rules to enforce the approvals. Once approved, you can merge the PR to master.
What happens once in master is independent of the workflow above, but most likely a new pipeline will be triggered, which will build a release candidate and run the whole path to production (with or without manual intervention).
One of the tricks is how the PR pipeline decides which environment to deploy the PR to. I can think of three options (the first one is sketched below):
Create an environment on the fly which will be killed once the PR is merged or closed. This is the most advanced and flexible option. This would require the system to publish the environment location to the PR.
Have a pool of environments and have the automation figure out which are free and automatically choose one. The environments could be stopped, so you find an environment which is stopped, start it up and deploy there. Once the PR is closed/merged, stop the environment again. You can publish the environment location to the PR.
Add a label to the PR indicating the environment (e.g. env-1, env-2, etc.). This is the simplest option, but it requires that developers look at the open PRs to see which environments are already in use by other PRs, to avoid overwriting other people's code.
With all these options, once the PR is created, you can just push new commits to the branch and the environment will be updated.
You also need to decide what you want to do when a new commit is pushed to master. You most likely want to trigger a new PR build to update the environments with the latest master, but you can do this automatically or manually, depending on how busy your master is.
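A rough GitHub Actions sketch of the first option (an on-the-fly environment per PR); the URL scheme and scripts are assumptions:

```yaml
name: pr-review-environment
on:
  pull_request:
    branches: [master]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./build.sh && ./run-tests.sh          # hypothetical build and test scripts
  deploy-review:
    needs: build-test
    runs-on: ubuntu-latest
    environment:
      name: pr-${{ github.event.number }}          # one environment per open PR
      url: https://pr-${{ github.event.number }}.dev.example.com   # hypothetical URL scheme
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh "pr-${{ github.event.number }}"           # hypothetical deploy script
```

New commits to the PR branch re-run this workflow, which gives you the "environment gets updated" behaviour; tearing the environment down when the PR is closed or merged would need a separate workflow listening to the pull_request closed event.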
Nathan, adding a development branch is a good idea: you can work on development changes in the new branch and test them in the dev environment, and after getting sign-off to move to the production environment, you can merge your changes into the master branch.
Don't forget to perform regression testing on the merged master branch, to verify that both old and new features work fine before releasing your code for installation in production.

Azure DevOps: Why does new pipeline commit the yaml file to default branch

I created a new pipeline in Azure DevOps, and created a new branch for it.
As a result, DevOps automatically committed the YAML file for the new pipeline to my 'development' branch.
None of the other pipelines I've created have YAML files committed into the repo...
Why does it do this?
Do we have to keep the YAML file there?
It has nothing to do with the source code of the application, so it doesn't seem to make sense why it's stored there.
YAML is code for how your application is deployed, and thus it is part of the source code. By putting it under source control, you can keep track of version changes and any additional changes to parameters or variables that are determined or inserted in the build process.
This is opposed to the older way of doing things, where pipelines were updated via the UI rather than source control and did not have peer reviews, branching and merging, and the additional policies that can be applied.
This, on top of YAML pipelines for releases going GA the other week, will make YAML in a repo even more powerful, as the YAML will not only build but also release code.
In Azure DevOps Services we define pipelines using either YAML syntax or the user interface (Classic). So there are two kinds of pipelines: YAML pipelines and Classic UI (Classic build and release) pipelines.
None of the other pipelines I've created have YAML files committed into the repo...
Why does it do this?
It's expected behavior when defining pipelines using YAML syntax: the pipeline is versioned with your code, and it follows the same branching structure.
And one advantage for this is: A change to the build process might cause a break or result in an unexpected outcome. Because the change is in version control with the rest of your codebase, you can more easily identify the issue.
To sum up, the YAML pipeline is added to version control, and this is by-design behavior. If you don't want this behavior, feel free to use Classic Build and Classic Release pipelines instead; that's also a good choice! For the differences between these formats, check Feature availability. Hope it helps :)
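For context, the file being committed is just a small azure-pipelines.yml that lives next to the application code, something like this sketch (contents are illustrative):

```yaml
# azure-pipelines.yml, committed alongside the application code so the build
# definition is versioned and branched together with it
trigger:
  branches:
    include:
      - development
      - master
pool:
  vmImage: ubuntu-latest
steps:
  - script: npm ci && npm test
    displayName: Build and test
```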

Azure datafactory deployment automation from multiple branches

I want to create an automated deployment pipeline for Azure Data Factory.
For a single stream of development we can configure it using this doc:
https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment
But when it comes to deploying to two different test data factories for parallel feature development (in two different branches), it does not work, because the adf_publish branch that gets generated is specific to one data factory.
Currently we are doing the deployment using PowerShell scripts, passing the list of objects that need to be deployed.
Our repo is in Azure devops.
I tried
Linking the repo to multiple data factories, but that caused issues, perhaps when finding the deltas to publish.
Creating forks of the repo instead of branches so that adf_publish can be separate for every data factory - but this approach does not work when there is a conflict that needs a manual merge, because testing would be required again instead of moving to prod.
adf_publish gets generated whenever you publish. Publishing takes whatever you have in your repo and updates the data factory with it.
To develop multiple features in parallel, you just need to use "Save". Save commits your changes to the branch you are actually working on, and other branches do the same. Whenever you want to publish, you first need to make a pull request from your branch to master, then publish. Any merge conflicts should be resolved when merging everything into the master branch. Then just publish and there shouldn't be any conflicts, and adf_publish will be generated after that.
Hope this helped!
A GitHub repository can be associated with only one data factory, and you are only allowed to publish to the Data Factory service from your collaboration branch. Check this.
It seems there is no direct and easy way to accomplish this. If you fork the repo as a workaround, you may have to resolve the conflicts before merging, as Martin suggested.
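For completeness, the pattern in the linked doc is to deploy the ARM template that publishing generates on adf_publish, overriding the factory name per target. A sketch only; the service connection, resource group, region, variable and folder names below are assumptions (the folder on adf_publish is named after the source factory):

```yaml
trigger:
  branches:
    include:
      - adf_publish
pool:
  vmImage: ubuntu-latest
steps:
  - task: AzureResourceManagerTemplateDeployment@3
    inputs:
      deploymentScope: Resource Group
      azureResourceManagerConnection: my-azure-connection    # assumed service connection
      subscriptionId: $(subscriptionId)                      # assumed pipeline variable
      resourceGroupName: rg-adf-test                         # assumed target resource group
      location: West Europe
      templateLocation: Linked artifact
      csmFile: dev-data-factory/ARMTemplateForFactory.json
      csmParametersFile: dev-data-factory/ARMTemplateParametersForFactory.json
      # Point the same template at a different target factory per environment.
      overrideParameters: -factoryName "test-data-factory-2"
```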