Special characters (accent, apostrophe, trema) work in custom Source tests, but no longer when deployed in dockerized Streamsets

I've written a custom Streamsets origin. Some of the records contain characters like é or ë. When running my automated tests I can validate that the data is emitted as a list of SDC Records as intended.
When I use my custom origin in a pipeline on a dockerized Streamsets Data Collector, however, all of those special characters are displayed in the UI (preview) and pushed to my target as '?'.
Is Streamsets interpreting the output of my origin and applying some character encoding?

The problem was not in the custom origin or Streamsets at all; rather, it was an issue with the Docker container itself. The official Streamsets container from which I inherit is based on Alpine Linux. No locale support is installed by default, so the trick is to add it yourself.
This post helped me install locale support in my container and configure it. Afterwards, everything worked as expected.
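For reference, a minimal sketch of what such a Dockerfile change can look like. The base image tag and the C.UTF-8 approach are assumptions on my part; the linked post may instead install a full glibc locale package.

# Hypothetical sketch: force a UTF-8 locale in an Alpine-based Streamsets image.
# The base image tag is a placeholder; use whatever image you actually inherit from.
FROM streamsets/datacollector:latest

# musl-based Alpine ships without full locale support; C.UTF-8 is usually enough
# to stop accented characters from being mangled into '?'.
ENV LANG=C.UTF-8 \
    LC_ALL=C.UTF-8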

Related

Could open-telemetry collector be used without instrumentation?

We are building an application with Argo Workflows. We are wondering if we could just set up the OpenTelemetry Collector inside our Kubernetes cluster and start using it as a stdout exporter into the Elastic stack. Couldn't find information on whether OTel can export logs without instrumentation. Any thoughts?
The short answer is: yes. For logs you don't need instrumentation, in principle.
However, logs are still in development, so you'd need to track upstream and deal with the fact that you're operating against a moving target. There are a number of upstream components you can use; for example, the Filelog Receiver and the Elasticsearch Exporter in a pipeline. I recently did a POC and demo (just ignore the custom collector part and use upstream) that you could use as a starting point.
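To illustrate the shape of such a pipeline, a minimal collector config wiring the Filelog Receiver to the Elasticsearch Exporter might look roughly like this; the log path, endpoint and index are placeholders, and option names shift between collector releases, so check the current upstream docs.

# Hypothetical collector config: tail container log files and ship them to Elasticsearch.
receivers:
  filelog:
    include: [ /var/log/containers/*.log ]      # placeholder path

exporters:
  elasticsearch:
    endpoints: [ "http://elasticsearch:9200" ]  # placeholder endpoint
    logs_index: app-logs                        # placeholder index name

service:
  pipelines:
    logs:
      receivers: [ filelog ]
      exporters: [ elasticsearch ]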
As indicated on the following OpenTelemetry (OTel) documentation page:
https://opentelemetry.io/docs/concepts/data-collection/
the Receiver component of the OTel Collector can be either push- or pull-based. For example, using the following OTel Java instrumentation agent:
https://github.com/open-telemetry/opentelemetry-java-instrumentation
a Spring Boot jar can be bootstrapped with such an agent, without changing any code inside the Spring Boot app, and still push its telemetry data to the OTel Collector. Another example is to use a pulling instance of the OTel Collector to pull telemetry data out of a Redis runtime, as in the following example:
https://signoz.io/blog/redis-opentelemetry/
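For the Spring Boot case, attaching the agent is typically just a JVM flag; the jar names and the collector endpoint below are placeholders.

# Hypothetical example: attach the OTel Java agent to an uninstrumented Spring Boot jar
# and point it at a collector's OTLP endpoint (names and ports are placeholders).
java -javaagent:./opentelemetry-javaagent.jar \
     -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
     -jar my-spring-boot-app.jar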

Make Airflow KubernetesPodOperator clear image cache without setting image_pull_policy on DAG?

I'm running Airflow on Google Composer. My tasks are KubernetesPodOperators, and by default for Composer, they run with the Celery Executor.
I have just updated the Docker image that one of the KubernetesPodOperators uses, but my changes aren't being reflected. I think the cause is that it is using a cached version of the image.
How do I clear the cache for the KubernetesPodOperator? I know that I can set image_pull_policy=Always in the DAG, but I want it to use cached images in the future, I just need it to refresh the cache now.
Here is my KubernetesPodOperator (except for the commented line):
from airflow.contrib.operators import kubernetes_pod_operator  # Composer / Airflow 1.10 import path

processor = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='processor',
    name='processor',
    arguments=[filename],
    namespace='default',
    pool='default_pool',
    image='gcr.io/proj/processor',
    # image_pull_policy='Always'  # I want to avoid this so I don't have to update the DAG
)
Update - March 3, 2021
I still do not know how to make the worker nodes in Google Composer reload their images once while using the :latest tag on images (or using no tag, as the original question states).
I do believe that #rsantiago's comment would work, i.e. doing a rolling restart. A downside of this approach, as I see it, is that by default in Composer the worker nodes run in the same node pool as the Airflow infra itself. This means that doing a rolling restart could affect the Airflow scheduling system, web interface, etc. as well, although I haven't tried it, so I'm not sure.
The solution that my team has implemented is adding version numbers to each image release, instead of using no tag, or the :latest tag. This ensures that you know exactly which image should be running.
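For illustration, pinning a version tag in the operator from the question would look like this (the tag value is hypothetical):

processor = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='processor',
    name='processor',
    arguments=[filename],
    namespace='default',
    pool='default_pool',
    image='gcr.io/proj/processor:1.4.2',  # explicit, hypothetical version tag instead of :latest / no tag
)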
Another thing that has helped is adding core.logging_level=DEBUG to the "Airflow Configuration Overrides" section of Composer. This will output the command that launched the Docker image. If you're using version tags as suggested, this will display that tag.
I would like to note that setting up local debugging has helped tremendously. I am using PyCharm with the docker image as a "Remote Interpreter" which allows me to do step-by-step debugging inside the image to be confident before I push a new version.

Publish a Service Fabric Container project

I can't manage to publish a container. In my case I want to put an MVC4 web role into the container, but actually what's inside the container does not matter.
Their primary tutorial for using a container to lift-and-shift old apps uses Continuous Delivery. The average user does not always need this.
Instead of Continuous Delivery, one may use Visual Studio's support for Docker Compose:
Connect-ServiceFabricCluster <mycluster> and then New-ServiceFabricComposeApplication -ApplicationName <mytestapp> -Compose docker-compose.yml
But following exactly their tutorial still leads to errors. The application appears in the cluster but immediately outputs an error event:
"SourceId='System.Hosting', Property='Download:1.0:1.0'. Error during
download. Failed to download container image fabrikamfiber.web"
Am I missing a whole step that they expect to be obvious? Even placing the image in my Docker Hub registry myself did not help. Or does it need to be Azure Container Registry?
Docker Hub should work fine; ACR is not required.
These blog posts may help:
about running containers
about docker compose on Service Fabric
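Since the error above is a plain image-download failure, it is also worth double-checking that the compose file references a fully qualified image that has actually been pushed to a registry the cluster can reach. A hypothetical minimal docker-compose.yml might look like this (image name, tag and port are placeholders):

version: '3'
services:
  fabrikamfiber.web:
    image: mydockerhubuser/fabrikamfiber.web:latest   # must exist in a registry reachable by the cluster
    ports:
      - "80:80"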

Is it possible to do a rolling update and retain same version in Kubernetes?

Background:
We're currently using a continuous delivery pipeline and at the end of the pipeline we deploy the generated Docker image to some server(s) together with the latest application configuration (set as environment variables when starting the Docker container). The continuous delivery build number is used as version for the Docker image and it's currently also this version that gets deployed to the server(s).
Sometimes though we need to update the application configuration (environment variables) and reuse an existing Docker image. Today we simply deploy an existing Docker image with an updated configuration.
Now we're thinking of switching to Kubernetes instead of our home-built solution. Thus it would be nice for us if the version number generated by our continuous delivery pipeline is reflected as the pod version in Kubernetes as well (even if we deploy the same version of the Docker image that is currently deployed but with different environment variables).
Question:
I've read the documentation of rolling-update but it doesn't indicate that you can do a rolling-update and only change the environment variables associated with a pod without changing its version.
Is this possible?
Is there a workaround?
Is this something we should avoid altogether and use a different approach that is more "Kubernetes friendly"?
A rolling update just scales down one replication controller and scales up another one. It therefore deletes the old pods and creates new pods at a controlled rate. So, if the new replication controller JSON file has different env vars and the same image, then the new pods will have that too.
In fact, even if you don't change anything in the JSON file except one label value (you have to change some label), you will get new pods with the same image and env. I guess you could use this to do a rolling restart?
You get to pick what label(s) you want to change when you do a rolling update. There is no formal Kubernetes notion of a "version". You can make a label called "version" if you want, or "contdelivver" or whatever.
I think if I were in your shoes, I would look at two options:
Option 1: put (at least) two labels on the RCs, one for the Docker image version (which, IIUC, is also a continuous delivery version) and one for the "environment version". This could be a git commit, if you store your environment vars in git, or something more casual. So, your pods could have labels like "imgver=1.3,envver=a34b87", or something like that (see the sketch after these two options).
Option 2: store the current best known replication controller, as a json (or yaml) file in version control (git, svn, whatevs). Then use the revision number from version control as a single label (e.g. "version=r346"). This is not the same as your continuous delivery label.
It is a label for the whole configuration of the pod.
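As a rough sketch of option 1 (using the replication controllers of that era), both labels could be carried on the controller and the pod template; the name, versions, image and env var below are all placeholders:

apiVersion: v1
kind: ReplicationController
metadata:
  name: myapp-1-3-a34b87               # placeholder name
  labels:
    imgver: "1.3"                      # continuous delivery / image version
    envver: "a34b87"                   # version of the environment configuration
spec:
  replicas: 3
  selector:
    imgver: "1.3"
    envver: "a34b87"
  template:
    metadata:
      labels:
        imgver: "1.3"
        envver: "a34b87"
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:1.3    # same image, redeployed when envver changes
        env:
        - name: SOME_SETTING           # placeholder environment variable
          value: "new-value"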

Using Ansible to automatically configure AWS autoscaling group instances

I'm using Amazon Web Services to create an autoscaling group of application instances behind an Elastic Load Balancer. I'm using a CloudFormation template to create the autoscaling group + load balancer and have been using Ansible to configure other instances.
I'm having trouble wrapping my head around how to design things such that when new autoscaling instances come up, they can automatically be provisioned by Ansible (that is, without me needing to find out the new instance's hostname and run Ansible for it). I've looked into Ansible's ansible-pull feature but I'm not quite sure I understand how to use it. It requires a central git repository which it pulls from, but how do you deal with sensitive information which you wouldn't want to commit?
Also, the current way I'm using Ansible with AWS is to create the stack using a CloudFormation template, then get the hostnames as output from the stack, and then generate a hosts file for Ansible to use. This doesn't feel quite right – is there a "best practice" for this?
Yes, another way is simply to run your playbooks locally once the instance starts. For example, you can create an EC2 AMI for your deployment that, in the rc.local file (Linux), calls ansible-playbook -i <inventory-only-with-localhost-file> <your-playbook>.yml. rc.local is almost the last script run at startup.
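A hypothetical rc.local along those lines (the paths, inventory file and playbook name are placeholders):

#!/bin/sh
# Hypothetical /etc/rc.local: configure this instance against itself at boot.
# Paths, the inventory file and the playbook name are placeholders.
/usr/bin/ansible-playbook \
    -i /opt/ansible/localhost_inventory \
    -c local \
    /opt/ansible/site.yml >> /var/log/ansible-boot.log 2>&1
exit 0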
You could just store that sensitive information in your EC2 AMI, but this is a very broad topic and really depends on what kind of sensitive information it is. (You can also use private git repositories to store sensitive data.)
If, for example, your playbooks get updated regularly, you can create a cron entry in your AMI that runs every so often and actually runs your playbook, to make sure your instance configuration is always up to date, thus avoiding having to "push" from a remote workstation.
This is just one approach; there could be many others, and it depends on what kind of service you are running, what kind of data you are using, etc.
I don't think you should use Ansible to configure new auto-scaled instances. Instead, use Ansible to configure a new image, create an AMI (Amazon Machine Image) from it, and have AWS autoscaling launch from that.
On top of this, you should also use Ansible to easily update your existing running instances whenever you change your playbook.
Alternatives
There are a few ways to do this. First, I wanted to cover some alternative ways.
One option is to use Ansible Tower. This creates a dependency though: your Ansible Tower server needs to be up and running at the time autoscaling or similar happens.
The other option is to use something like packer.io and build fully functioning server AMIs. You can install all your code into these using Ansible. This doesn't have any non-AWS dependencies, and has the advantage that servers start up fast. Generally speaking, building AMIs is the recommended approach for autoscaling.
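As a sketch only, a legacy-style Packer JSON template that bakes an AMI with an Ansible provisioner could look roughly like this; the region, source AMI, instance type, SSH user and playbook path are all placeholders.

{
  "builders": [{
    "type": "amazon-ebs",
    "region": "eu-west-1",
    "source_ami": "ami-0123456789abcdef0",
    "instance_type": "t3.micro",
    "ssh_username": "ubuntu",
    "ami_name": "myapp-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "ansible",
    "playbook_file": "./playbook.yml"
  }]
}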
Ansible Config in S3 Buckets
The alternative route is a bit more complex, but has worked well for us when running a large site (millions of users). It's "serverless" and only depends on AWS services. It also supports multiple Availability Zones well, and doesn't depend on running any central server.
I've put together a GitHub repo that contains a fully-working example with Cloudformation. I also put together a presentation for the London Ansible meetup.
Overall, it works as follows:
Create S3 buckets for storing the pieces that you're going to need to bootstrap your servers.
Save your Ansible playbook and roles etc in one of those S3 buckets.
Have your Autoscaling process run a small shell script. This script fetches things from your S3 buckets and uses them to "bootstrap" Ansible (sketched below).
Ansible then does everything else.
All secret values such as Database passwords are stored in CloudFormation Parameter values. The 'bootstrap' shell script copies these into an Ansible fact file.
So that you're not dependent on external services being up, you also need to save any build dependencies (e.g. .deb files, package install files or similar) in an S3 bucket. You want this because you don't want to require ansible.com or similar to be up and running for your Autoscale bootstrap script to be able to run. Generally speaking, I've tried to depend only on Amazon services like S3.
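To make the shape of that bootstrap step concrete, here is a hypothetical sketch; the bucket name, paths and the DB_PASSWORD parameter are placeholders, and the real, fully worked scripts are in the repo linked above.

#!/bin/sh
# Hypothetical autoscaling bootstrap: pull the playbook from S3, write facts, run Ansible.
# Bucket names, paths and the DB_PASSWORD value (from a CloudFormation parameter) are placeholders.
aws s3 sync s3://my-ansible-bucket/playbooks /opt/ansible

# Turn CloudFormation parameter values into local Ansible facts.
mkdir -p /etc/ansible/facts.d
cat > /etc/ansible/facts.d/app.fact <<EOF
{ "db_password": "${DB_PASSWORD}" }
EOF

ansible-playbook -i localhost, -c local /opt/ansible/site.yml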
In our case, we then also use AWS CodeDeploy to actually install the Rails application itself.
The key bits of the config relating to the above are:
S3 Bucket Creation
Script that copies things to S3
Script to copy Bootstrap Ansible. This is the core of the process. This also writes the Ansible fact files based on the CloudFormation parameters.
Use the Facts in the template.