Any contrib package for Apache Beam where I can commit a Dataflow pipeline?

I made a Dataflow pipeline that connects Pub/Sub to BigQuery. Any ideas where the right place would be to commit this upstream in Apache Beam?
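For reference, a minimal sketch of the kind of pipeline described, assuming the Beam Python SDK; the project, subscription, table, and schema names are placeholders, not details from the question:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Streaming pipeline: Pub/Sub subscription -> BigQuery table.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/my-sub")
         | "ParseJson" >> beam.Map(json.loads)
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
             "my-project:my_dataset.my_table",
             schema="id:STRING,payload:STRING",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))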

Related

AWS Glue - version control and setting up for continuous integration

We are in the process of setting up the CI/CD process for an AWS Glue ETL process. The existing ETL process contains the following AWS Glue components: crawlers, registered tables in the catalog, jobs, triggers, and workflows.
Obviously the first step is to set up a code repository and link the existing artifacts from the components mentioned above to the repository, which ideally needs to let developers perform check-ins and pull requests from the tool (something similar to ADF and Databricks). However, as far as we have explored, AWS Glue does not integrate with any source code repository that directly provides this feature, unless we are missing something.
Hence, what is the method to set up the environment for CI (I'm not talking about CD yet)? The link below gives a reference for CI/CD:
https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-serverless-aws-glue-etl-applications-using-aws-developer-tools/
However, it mentions at the beginning that the job code and the AWS CloudFormation template file for deploying the ETL jobs are both committed to version control, so it is not clear how this is done for the ongoing, regular commits from developers.
"However, as far as we have explored, AWS Glue does not integrate with any source code repository that directly provides this feature, unless we are missing something."
Correct: Glue does not have version control integration.
I develop (Python and CloudFormation) locally in VS Code and use its Git integration. I use a container if I want to test something locally, but Glue also has a Dev Endpoint for similar tasks.
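In this setup, the Glue job scripts are just Python files in the repository. A minimal sketch of such a script; the database, table, and bucket names are purely illustrative:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Glue passes the job name (and any other --arguments) at run time.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a catalog table and write it back out as Parquet;
    # all names here are placeholders.
    frame = glue_context.create_dynamic_frame.from_catalog(
        database="my_database", table_name="my_table")
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/output/"},
        format="parquet")

    job.commit()

Because the script is plain code, the usual Git workflow (branches, check-ins, pull requests) applies to it like any other source file.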

Apache Kafka patch release process

How will Kafka release patch updates?
How will users get to know about Kafka patch updates?
Kafka is typically distributed as a zip/tar archive containing the binaries used to start, stop, and manage Kafka. You may want to:
Subscribe to https://kafka.apache.org/downloads by generating a feed for it.
Subscribe to any feeds that give you updates.
Write a script that periodically checks https://downloads.apache.org/kafka/ for new Kafka releases and notifies you or downloads them (a sketch follows below).
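A hedged sketch of that release-check idea in Python; the regex assumes the mirror's directory listing links each version as href="X.Y.Z/", which could change:

    import re
    import urllib.request

    LISTING_URL = "https://downloads.apache.org/kafka/"

    def latest_kafka_version():
        html = urllib.request.urlopen(LISTING_URL).read().decode("utf-8")
        # Version directories appear in the listing as href="3.7.0/" etc.
        versions = re.findall(r'href="(\d+\.\d+\.\d+)/"', html)
        return max(versions, key=lambda v: tuple(map(int, v.split("."))))

    if __name__ == "__main__":
        print("Latest Kafka release:", latest_kafka_version())

Run it from cron (or any scheduler) and compare the result against the version you last saw.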
Kafka version numbers typically follow a major.minor.patch format.
Every time there is a new Kafka release, we download the latest archive, reuse the old configuration files (making changes if required), and start Kafka using the new binaries. The upgrade process is fully documented in the Upgrading section at https://kafka.apache.org/documentation
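The download step of that flow might look like the following sketch; the version and Scala build are placeholders, and the config copy-over is left as a comment:

    import tarfile
    import urllib.request

    # Placeholder version and Scala build; check the listing for the
    # actual latest release.
    version, scala = "3.7.0", "2.13"
    url = (f"https://downloads.apache.org/kafka/{version}/"
           f"kafka_{scala}-{version}.tgz")

    archive, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive) as tar:
        tar.extractall(path="/opt")
    # Next: copy the existing server.properties into the new install
    # (adjusting it if the release notes require) and restart the
    # broker from the new binaries.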
For production environments, we have several options:
1. Using a managed Kafka service (e.g., on AWS, Azure, Confluent)
In this case, we need not worry about patching and security updates to Kafka, because they are taken care of by the service provider. On AWS, you will typically get notifications in the console about when your Kafka update is scheduled.
A managed Kafka service is an easy way to get started for production environments.
2. Using self-hosted Kafka on Kubernetes (e.g., using Strimzi)
If you are running Kafka in a Kubernetes environment, you can use the Strimzi operator and helm upgrade to move to the version you require. First refresh the chart information from the repository using helm repo update.
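For illustration, the two Helm steps driven from Python (to keep one language across these examples); the release name and chart version are assumptions, not values from the answer:

    import subprocess

    # Refresh chart metadata from the configured repositories.
    subprocess.run(["helm", "repo", "update"], check=True)

    # Upgrade the Strimzi operator release to a chosen chart version
    # (release name and version here are placeholders).
    subprocess.run(
        ["helm", "upgrade", "strimzi-kafka-operator",
         "strimzi/strimzi-kafka-operator", "--version", "0.40.0"],
        check=True)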
Managed services and Kubernetes operators make this easy; managing Kafka clusters manually is comparatively difficult.

Is it possible to use Databricks-Connect along with GitHub to make changes to my Azure Databricks notebooks from an IDE?

My aim is to make changes to my Azure Databricks notebooks using an IDE rather than in Databricks, while at the same time implementing some sort of version control.
Reading the Databricks-Connect documentation, it doesn't look like it supports this kind of functionality. I was wondering if anyone else has tried to do this and had any success?

Apache Ambari local repository (Cloudera)

I have a production cluster using Ambari from Hortonworks. Cloudera has now blocked all access to the HDP repository, because a paid support license is needed.
This hit us really hard because we have a big infrastructure using Ambari, Kafka, and Storm.
I'm trying to build Ambari from source, but I think a local HDP repository is needed.
Does anyone know how to build a repo starting from the Kafka and Storm sources?

Is it possible to implement the Spring Batch job repository using any of the latest versions of MongoDB with transactional support?

I have gone through several implementations of a job repository using MongoDB but couldn't find a stable one that supports transactions in the job repository. I have also read a note that MongoDB is not recommended for the job repository because it does not support transactions. So I need to know whether it is possible to implement the job repository using any of the latest versions of MongoDB with transactional support.
"I have also read a note that MongoDB is not recommended for the job repository because it does not support transactions."
MongoDB added support for transactions in v4. There is a feature request against Spring Batch to use MongoDB as a job repository: https://github.com/spring-projects/spring-batch/issues/877, but this feature has not been implemented yet.
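To illustrate the transactional support that arrived in MongoDB v4, here is a sketch of a multi-document transaction via pymongo against a replica set; it does not make Spring Batch use MongoDB as a job repository, and the connection string and collection names are placeholders:

    from pymongo import MongoClient

    # Transactions require MongoDB 4.0+ running as a replica set.
    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    db = client.test

    def transfer(session):
        # Both updates commit or abort together.
        db.accounts.update_one(
            {"_id": "a"}, {"$inc": {"balance": -10}}, session=session)
        db.accounts.update_one(
            {"_id": "b"}, {"$inc": {"balance": 10}}, session=session)

    with client.start_session() as session:
        session.with_transaction(transfer)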