How does data-prepper find a processor? - opensearch

I am new to data-prepper. I am following the instructions given on the Getting Started web page of data-prepper. The simple pipeline example mentions a configuration file and a pipeline file. One of the pipeline file examples mentions 'processor', as shown below:
simple-sample-pipeline:
  workers: 2
  delay: "5000"
  source:
    random:
  processor:
    - string_converter:
        upper_case: true
  sink:
    - stdout:
I am executing this pipeline as per the instructions given: a data-prepper docker container is started, and through the docker run command the pipeline and configuration files are copied into the container and the pipeline is executed. Based on this example, I need to understand the following:
I would like to know how this pipeline finds the processor. If I develop my own processor, where do I keep it and how would the pipeline find this new processor?
For development of new plugins, is the approach of running the data-prepper docker container and copying in the configuration/pipeline files the only approach?

Related

Cannot run ASP.NET Core Web API on Azure Devops deployment group (self-hosted)

I'm working on a simple deployment pipeline with Azure DevOps. I created a deployment pipeline running on a self-hosted Ubuntu deployment group.
The pipeline looks like this:
1. Download artifacts from the CI pipeline (created with dotnet publish)
2. Stop the running deployment
3. Unzip the ASP.NET Core Web API to the deployment directory
4. Run the new deployment with dotnet MyApp.dll
The first two steps work as expected. However, when the dotnet MyApp.dll command is run, the process runs for 10 seconds with the following "error" message being printed at the end:
The STDIO streams did not close within 10 seconds of the exit event from process '/usr/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.
The deployment task is reported as successful despite the message and the app not running. I tried to work around this behaviour by using nohup & and redirecting the command output. After some research I found that all processes started by a pipeline agent are stopped after the agent's work is done, meaning this behaviour is intended and my understanding of Azure deployments/agents is wrong.
How do I deploy and run my app in an automated way on my own ubuntu machine using azure devops pipelines?
You are already on the right track.
All the processes launched in the pipeline will be finished/cleaned up in the "Finalize Job" step when the pipeline is over.
If you don't want the process to be closed, please try setting the variable Process.clean = false to stop the "Finalize Job" step from killing all processes.
But the next time you run the pipeline, you will need to stop the app yourself before starting it again.

Apache Beam on FlinkRunner doesn't read from Kafka

I'm trying to run Apache Beam backed by a local Flink cluster in order to consume from a Kafka topic, as described in the documentation for ReadFromKafka.
The code is basically this pipeline plus some other setup as described in the Beam examples:
with beam.Pipeline() as p:
    lines = p | ReadFromKafka(
        consumer_config={'bootstrap.servers': bootstrap_servers},
        topics=[topic],
    ) | beam.WindowInto(beam.window.FixedWindows(1))
    output = lines | beam.FlatMap(lambda x: print(x))
    output | WriteToText(output)
Since I attempted to run on Flink, I followed this doc for Beam on Flink and did the following:
--> I downloaded the binaries for Flink 1.10 and followed these instructions to properly set up the cluster.
I checked the logs for the server and the task instance. Both were properly initialized.
--> Started Kafka using docker, exposing it on port 9092.
--> Executed the following in the terminal:
python example_1.py --runner FlinkRunner --topic myTopic --bootstrap_servers localhost:9092 --flink_master localhost:8081 --output output_folder
The terminal outputs:
2.23.0: Pulling from apache/beam_java_sdk
Digest: sha256:3450c7953f8472c2312148a2a8324b0115fd71b3a7a01a3b017f6de69d89dfe1
Status: Image is up to date for apache/beam_java_sdk:2.23.0
docker.io/apache/beam_java_sdk:2.23.0
But then, after writing some messages to myTopic, the terminal remains frozen and I don't see anything in the output folder. I checked flink-conf.yaml, and given these two lines:
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
I assumed that the port for the jobs would be 6123 instead of 8081 as specified in the Beam documentation, but the behaviour for both ports is the same.
I'm very new to Beam/Flink, so I'm not quite sure what it can be. I have two hypotheses as of now, but can't quite figure out how to investigate them:
1. Something related to the port that Beam uses to communicate with Flink in order to send the jobs.
2. The Expansion Service for the Python SDK mentioned in the apache.beam.io.external.ReadFromKafka docs:
Note: To use these transforms, you need to start a Java Expansion Service. Please refer to the portability documentation on how to do that. Flink Users can use the built-in Expansion Service of the Flink Runner’s Job Server. The expansion service address has to be provided when instantiating the transforms.
But reading the portability documentation, it refers me back to the same doc for Beam on Flink.
Could someone please help me out?
Edit: I was writing to the topic using the Debezium Source Connector for PostgreSQL and seeing the behaviour mentioned above. But when I tried to write to the topic manually, the application crashed with the following:
RuntimeError: org.apache.beam.sdk.util.UserCodeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[]
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
You are doing everything correctly; the Java Expansion Service no longer needs to be started manually (see the latest docs). Also, Flink serves the web UI at 8081, but accepts job submission there just as well, so either port works fine.
It looks like you may be running into the issue that Python's TextIO does not yet support streaming.
Additionally, there is the complication that when running Python pipelines on Flink, the actual code runs in a docker image, and so if you are trying to write to a "local" file it will be a file inside the image, not on your machine.

How do I make Scala code run on an EMR cluster using the SDK?

I wrote code in Scala to launch a cluster in EMR. I also have a Spark application written in Scala, and I want to run this Spark application on the EMR cluster. But is it possible for me to do this in the first script (the one that launches the EMR cluster)? I want to do all of this with the SDK, not through the console or CLI. It has to be a kind of automation, not a single manual job (or at least with the manual work minimized).
Basically;
Launch EMR cluster -> Run Spark job on EMR -> Terminate after the job finishes
How do I do it if possible?
Thanks.
// Assumes an AmazonElasticMapReduce client (emr), an existing clusterId,
// and the spark-submit arguments (params) are already defined.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs(params);

final StepConfig sparkStep = new StepConfig()
        .withName("Spark Step")
        .withActionOnFailure("CONTINUE")
        .withHadoopJarStep(sparkStepConf);

AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
        .withSteps(new ArrayList<StepConfig>(){{ add(sparkStep); }});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
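If you want a single script that also creates the cluster and tears it down when the job finishes, one way is to submit the step as part of a RunJobFlowRequest and set KeepJobFlowAliveWhenNoSteps to false. Below is a minimal Scala sketch along those lines, assuming the AWS SDK for Java v1 EMR client; the region, release label, instance types, IAM role names, S3 jar path and main class are placeholders, not values from the question:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model._
import scala.collection.JavaConverters._

object LaunchTransientSparkCluster {
  def main(args: Array[String]): Unit = {
    val emr = AmazonElasticMapReduceClientBuilder.standard()
      .withRegion("us-east-1")                       // placeholder region
      .build()

    // spark-submit arguments; the jar path and main class are placeholders
    val sparkStepConf = new HadoopJarStepConfig()
      .withJar("command-runner.jar")
      .withArgs(List("spark-submit", "--class", "com.example.MySparkApp",
        "s3://my-bucket/jars/my-spark-app.jar").asJava)

    val sparkStep = new StepConfig()
      .withName("Spark Step")
      .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
      .withHadoopJarStep(sparkStepConf)

    val request = new RunJobFlowRequest()
      .withName("transient-spark-cluster")
      .withReleaseLabel("emr-5.30.0")                // placeholder release label
      .withApplications(new Application().withName("Spark"))
      .withSteps(sparkStep)
      .withServiceRole("EMR_DefaultRole")            // assumes the default EMR roles exist
      .withJobFlowRole("EMR_EC2_DefaultRole")
      .withInstances(new JobFlowInstancesConfig()
        .withInstanceCount(3)
        .withMasterInstanceType("m5.xlarge")
        .withSlaveInstanceType("m5.xlarge")
        .withKeepJobFlowAliveWhenNoSteps(false))     // terminate once the step is done

    val result = emr.runJobFlow(request)
    println(s"Started cluster ${result.getJobFlowId}")
  }
}

With ActionOnFailure.TERMINATE_CLUSTER on the step and KeepJobFlowAliveWhenNoSteps(false) on the instances, the cluster shuts itself down whether the job succeeds or fails, which covers the launch -> run -> terminate flow without any manual step.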
If you are looking just for automation, you should read about pipeline orchestration:
EMR is the AWS service which allows you to run distributed applications.
AWS Data Pipeline is an orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2).
If you'd just like to run a Spark job consistently, I would suggest creating a data pipeline and configuring it to have one step, which is to run the Scala Spark jar on the master node using a "shellcommandactivity". Another benefit is that the jar you are running can be stored in AWS S3 (the object storage service); you'd just provide the S3 path to your Data Pipeline and it will pick up that jar, log onto the EMR cluster it has brought up (with the configurations you've provided), clone that jar onto the master node, and run the jar with the configuration provided in the "shellcommandactivity". Once the job exits (successfully or with an error), it will kill the EMR cluster so you aren't paying for it, and it will log the output.
Please read more about it here: https://aws.amazon.com/datapipeline/ and https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
And if you'd like, you can trigger this pipeline via the AWS SDK or even set the pipeline to run on a schedule.
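For the "trigger it via the SDK" part, activating an already-defined pipeline is a single call. Here is a small Scala sketch, assuming the AWS SDK for Java v1 Data Pipeline client; the pipeline id below is a placeholder:

import com.amazonaws.services.datapipeline.DataPipelineClientBuilder
import com.amazonaws.services.datapipeline.model.ActivatePipelineRequest

// Activate an existing Data Pipeline definition; the pipeline id is a placeholder.
val dataPipeline = DataPipelineClientBuilder.defaultClient()
dataPipeline.activatePipeline(
  new ActivatePipelineRequest().withPipelineId("df-EXAMPLE1234567890"))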

Specify order of pipelines and dependencies

I'm having a hard time getting a grasp on this, to be honest.
Right now my lab project is as follows:
PR to master -> triggers the pre-build pipeline as a condition to merge the code ->
On merge, the infrastructure pipe runs only if any changes happened in my Infrastructure folder ->
On merge, I want to run my deploy pipeline to deploy my web app to Azure.
The pipes in question do the things they ought to, i.e.:
The pre-build pipe builds, publishes an artifact, runs unit tests, and validates ARM templates.
The infra pipe deploys the necessary infra for my web app, such as the resource group, app plan, app service, and key vault.
The deploy pipe downloads the artifact produced in pre-build and deploys it to a staging slot, then swaps it into the production slot.
What I can't seem to get to work is the pipeline chaining through dependencies: if changes happen to both infra and web app code in master, I want the infra pipe to run first and the deploy pipe only if it succeeds.
If I merge only app code, I want only the deploy pipe to run, regardless of whether the infra pipe ran or not.
If I merge only infra code, I want only the infra pipe to run.
If I merge both app and infra code, I want both the infra and deploy pipes to run, in that specific order.
I feel this shouldn't be all that hard to accomplish, but I've spent way too much time trying to solve it to no avail. Is anyone able to help? :)
Edit:
Hey, sorry @HughLin-MSFT, I've been trying to work around this a bit, since we're trying to avoid running scripts left and right. :)
I saw you have build queuing planned in an upcoming release, so for now I think we might have to wait for that.
If I were to merge my deploy and infra pipes, can I use:
trigger:
  branches:
    include:
    - master
  paths:
    include:
    - Infrastructure/*
at the stage level and somehow skip a stage instead?
I've seen multiple articles mention "Continue if skipped", but I can't find any information on how to actually skip a stage.
For the first and second cases, you just need to set path filters in Triggers; the pipeline then only triggers when a file at the specified path is changed. Please refer to this.
For the third case, you can try adding two agent jobs in the infra pipe: add the Trigger Azure DevOps Pipeline task to the second agent job to trigger the deploy pipe, and then set "Only when all previous jobs have succeeded" in the "Run this job" drop-down box for job2. In addition, you need to add a PowerShell task before the Trigger Azure DevOps Pipeline task and use a script to detect whether there is app code: run job2 if there is, and cancel job2 if not.
Update:
First, you can create a new pipeline and create a variable: changedcode.
Use the Builds - Get REST API to get the commit, then get the changed code folder with the Commits - Get Changes REST API.
Assign the changed code folder name as the value of the changedcode variable.
Set custom conditions for the agent jobs. In the Infra job, if the changedcode variable value is Infra, run the Infra job; in that job, use the Builds - Queue REST API or the Trigger Azure DevOps Pipeline task to trigger the Infra pipeline. The same is true for the Deploy job; the only difference is the custom condition expression.
Here is a sample structure in yaml:
variables:
  changedcode: ""

jobs:
- job:
  steps:
  - powershell: |
      # Get the changed code folder with the REST API
- job: Infra
  condition: containsValue(variables.changedcode, 'Infra')
  steps:
  - powershell: |
      # Queue the Infra pipeline with the REST API or the Trigger Azure DevOps Pipeline task
- job: Deploy
  condition: and(containsValue(variables.changedcode, 'deploy'), ...)
  steps:
  - powershell: |
      # Queue the Deploy pipeline with the REST API or the Trigger Azure DevOps Pipeline task

How to copy yarn ssh logs automatically using scala to blob storage

We have a requirement to download the yarn ssh logs to blob storage automatically. I found that the yarn logs do get added to the storage account under the /app-logs/user/logs/ etc. path, but they are in a binary format and there is no documented way to convert them into text format. So we are trying to run the external command yarn logs -applicationId <application_id> using Scala at the end of our application run, to capture the logs and save them to blob storage, but we are facing issues with that. We are looking for a solution to get these logs automatically downloaded to the storage account as part of the Spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local to blob storage. These commands work fine when I SSH into the head node of the Spark cluster and run them, but they are not working when executed from a Jupyter notebook or a Scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss#sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands from the Jupyter notebook, the first command works fine to redirect to a local file, but the second one, which moves the file to blob storage, fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a DataFrame and writing the DataFrame to blob storage. It succeeded for small logs, but for huge logs it failed with the error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container#storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer "Manage logs for Azure HDInsight cluster".
Hope this helps.
Currently, you will need to use the 'yarn logs' command to view YARN logs.
As regards your requirement, there are two methods to achieve this:
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within blob storage. This will do a differential copy every day at a specific time of the day. For this one, I had to use Azure Data Factory to achieve the scheduling. Quite easy, and no manual copying or coding required.
However, because the YARN application logs are stored in TFile binary format and can only be read using the 'yarn logs' command, you need another tool to read the files from the destination later on. You can use the tool here: https://github.com/shanyu/hadooplogparser
Alternatively, you can have your own simple script that converts the logs to a readable file before the transfer. Sample script below:
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob@sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
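If you would rather keep this inside the Scala application itself (as the original question asks) instead of a separate shell step, the same two commands can be run with scala.sys.process. This is only a minimal sketch, assuming the yarn and hadoop CLIs are on the PATH of the node running the driver and that the cluster already has access to the storage account; the application id, container and storage account names are placeholders:

import java.io.File
import scala.sys.process._

// Application id, container and storage account below are placeholders.
val applicationId = "application_1561088998595_0001"
val localFile = new File(s"/tmp/yarnlog_$applicationId.txt")

// Dump the aggregated YARN logs into a local file.
// #> writes stdout to the file directly, so no shell redirection is needed.
val dumpExit = (Seq("yarn", "logs", "-applicationId", applicationId) #> localFile).!

// Copy the local file into blob storage with the HDFS client.
val copyExit = Seq("hadoop", "fs", "-copyFromLocal", "-f",
  localFile.getAbsolutePath,
  "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/logs/").!

if (dumpExit != 0 || copyExit != 0)
  sys.error(s"Log copy failed: yarn exit=$dumpExit, hadoop exit=$copyExit")

Unlike !!, the .! call returns the exit code instead of throwing on a non-zero result, which makes failures easier to inspect from a notebook; and because a plain command string passed to sys.process is not interpreted by a shell, writing the output with #> is more reliable than embedding a > redirection in the command string.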
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the YARN application logs, which means the logs will be retained indefinitely. To do this, change the config "yarn.log-aggregation.retain-seconds" to the value -1. This config can be found in yarn-site.xml.
Once this is done, you can always read your YARN application logs anytime from the cluster using the YARN UI or CLI.
Hope this helps.