Inconsistent behavior on the functioning of the dataflow templates? - scala

When i create a dataflow template, the characteristics of Runtime parameters are not persisted in the template file.
At runtime, if i try to pass a value for this parameter, i take a 400 error
I'm using Scio 0.3.2, scala 2.11.11 with apache beam (0.6).
My parameters are the following :
trait XmlImportJobParameters extends PipelineOptions {
def getInput: ValueProvider[String]
def setInput(value: ValueProvider[String]): Unit
}
They are registred with this code
val options = PipelineOptionsFactory.fromArgs(cmdlineArgs: _*).withValidation().as[XmlImportJobParameters](classOf[XmlImportJobParameters])
PipelineOptionsFactory.register(classOf[XmlImportJobParameters])
implicit val (sc, args) = ContextAndArgs(cmdlineArgs)
To create the template i call sbt with this parameters :
run-main jobs.XmlImportJob --runner=DataflowRunner --project=MyProject --templateLocation=gs://myBucket/XmlImportTemplate --tempLocation=gs://myBucket/staging --instance=myInstance
If i pass explicitly --input, it becomes a StaticValue instead of RuntimeValue, and this time, i can see it in the template file.
The template is called from a google function watching a bucket storage (inspired from https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/) :
...
dataflow.projects.templates.create({
projectId: projectId,
resource: {
parameters: {
input: `gs://${file.bucket}/${file.name}`
},
jobName: jobs[job].name,
gcsPath: 'gs://MyBucket/MyTemplate'
}
}
...
The 400 error :
problem running dataflow template, error was: { Error: (109c1c52dc52fec7): The workflow could not be created. Causes: (109c1c52dc52fb8e): Found unexpected parameters: ['input' (perhaps you meant 'runner')] at Request._callback (/user_code/node_modules/googleapis/node_modules/google-auth-library/lib/transporters.js:85:15) at Request.self.callback (/user_code/node_modules/googleapis/node_modules/request/request.js:188:22) at emitTwo (events.js:106:13) at Request.emit (events.js:191:7) at Request.<anonymous(/user_code/node_modules/googleapis/node_modules/request/request.js:1171:10) at emitOne (events.js:96:13) at Request.emit (events.js:188:7) at IncomingMessage.<anonymous> (/user_code/node_modules/googleapis/node_modules/request/request.js:1091:12) at IncomingMessage.g (events.js:291:16) at emitNone (events.js:91:20) code: 400, errors: [ { message: '(109c1c52dc52fec7): The workflow could not be created. Causes: (109c1c52dc52fb8e): Found unexpected parameters: [\'input\' (perhaps you meant \'runner\')]', domain: 'global', reason: 'badRequest' } ] }
Same error when i try this :
gcloud beta dataflow jobs run xmlJobImport --gcs-location gs://MyBucket/MyTemplate --parameters input=gs://MyBucket/file.csv
=>
(gcloud.beta.dataflow.jobs.run) INVALID_ARGUMENT: (260a4f3f738a8ad9): The workflow could not be created. Causes: (260a4f3f738a8f96): Found unexpected parameters: ['input' (perhaps you meant 'runner'), 'projectid' (perhaps you meant 'project'), 'table' (perhaps you meant 'zone')]
The current settings are :
Current Settings:
appName: XmlImportJob$
autoscalingAlgorithm: THROUGHPUT_BASED
input: RuntimeValueProvider{propertyName=input, default=null, value=null}
instance: StaticValueProvider{value=staging}
jobName: xml-import-job
maxNumWorkers: 1
network: staging
numWorkers: 1
optionsId: 0
project: myProjectId
projectid: StaticValueProvider{value=myProjectId}
provenance: StaticValueProvider{value=ac3}
record: StaticValueProvider{value=BIEN}
root: StaticValueProvider{value=LISTEPA}
runner: class org.apache.beam.runners.dataflow.DataflowRunner
stableUniqueNames: WARNING
streaming: false
subnetwork: regions/europe-west1/subnetworks/net-staging
table: StaticValueProvider{value=annonce}
tempLocation: gs://-flux/staging/xmlImportJob/
templateLocation: gs://-flux-templates/XmlImportTemplate
workerMachineType: n1-standard-1
zone: europe-west1-c
Environement

Coping the answer from the issue:
Scio does not currently expose ValueProvider based APIs - we now have an issue open for this #696
A working example would be something like:
object WordCount {
def main(cmdlineArgs: Array[String]): Unit = {
val (sc, args) = ContextAndArgs(cmdlineArgs)
sc.customInput("input", TextIO.read().from(sc.optionsAs[XmlImportJobParameters].getInput))
.map(_.toUpperCase)
.saveAsTextFile(args("output"))
sc.close()
}
}
For the job above, to create template:
run-main com.example.WordCount --runner=DataflowRunner --project=<project> --templateLocation=gs://<template-bucket> --tempLocation=gs://<temp-location> --output=gs://<example-of-static-arg-output>
To submit job:
gcloud beta dataflow jobs run rav-test --gcs-location=gs://<template-bucket> --parameters=input=gs://<runtime-value>

Related

Airflow Dataproc serverless job creator doesnt take python parameters

I'm trying to setup a Dataproc Serverless Batch Job from google cloud composer using the DataprocCreateBatchOperator operator that takes some arguments that would impact the underlying python code. However I'm running into the following error:
error: unrecognized arguments: --run_timestamp "2022-06-17T13:22:51.800834+00:00" --temp_bucket "gs://pipeline/spark_temp_bucket/hourly/" --bucket "pipeline" --pipeline "hourly"
This is how my operator is setup:
create_batch = DataprocCreateBatchOperator(
task_id="hourly_pipeline",
project_id="dev",
region="us-west1",
batch_id="".join(random.choice(string.ascii_lowercase + string.digits + "-") for i in range(40)),
batch={
"environment_config": {
"execution_config": {
"service_account": "<service_account>",
"subnetwork_uri": "<uri>
}
},
"pyspark_batch": {
"main_python_file_uri": "gs://pipeline/code/pipeline_feat_creation.py",
"args": [
'--run_timestamp "{{ ts }}"',
'--temp_bucket "gs://pipeline/spark_temp_bucket/hourly/"',
'--bucket "pipeline"',
'--pipeline "hourly"'
],
"jar_file_uris": [
"gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.0.jar"
],
}
}
)
Regarding the args array: I tried setting the parameters with and without encapsulating them with "". I've also already did a gcloud submit that worked like so:
gcloud dataproc batches submit pyspark "gs://pipeline/code/pipeline_feat_creation.py" \
--batch=jskdnkajsnd-test-10 --region=us-west1 --subnet="<uri>" \
-- --run_timestamp "2020-01-01" --temp_bucket gs://pipeline/spark_temp_bucket/hourly/ --bucket pipeline --pipeline hourly
The error I was running into was that I wasn't adding a = after each parameter; I've also eliminated the " encapsulation around each parameter. This is how the args are now setup:
"args": [
'--run_timestamp={{ ts }}',
'--temp_bucket=gs://pipeline/spark_temp_bucket/hourly/',
'--bucket=pipeline',
'--pipeline=hourly'
]

I am getting a vmSize error when trying to deploy a batch service pool using bicep

I have the following Bicep code:
resource pool 'Microsoft.Batch/batchAccounts/pools#2021-06-01' = {
name: '${bs.name}/run-python'
properties: {
scaleSettings: {
fixedScale: {
nodeDeallocationOption: 'TaskCompletion'
targetDedicatedNodes: 1
}
}
deploymentConfiguration: {
cloudServiceConfiguration: {
osFamily: '6'
}
}
vmSize: 'standard_A1_v2'
startTask: {
commandLine: 'cmd /c "pip install azure-storage-blob pandas"'
userIdentity: {
autoUser: {
elevationLevel: 'NonAdmin'
scope: 'Pool'
}
}
waitForSuccess: true
}
}
dependsOn: [
bs
]
}
Where I try to create a pool for my batch service, but when I try to deploy it, I get the following error:
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.","details":[{"code":"PropertyName","message":"vmSize"}]}
So I think the problem is the vmSize, but there I cannot find concrete example on what the value should be.
CloudServiceConfiguration Pools are deprecated. Please retry using VirtualMachineConfiguration.

Chainlink Node: job type 'webhook' has not been registered with the job.Spawner

I have a toml job spec that looks like:
type = "webhook"
schemaVersion = 1
observationSource = """
send_to_bridge [type=bridge name="vesper-test" requestData="{\\"data\\": {\\"quote\\":\\"USD\\"}}"]
send_to_bridge
"""
However, I'm getting the following error:
job type 'webhook' has not been registered with the job.Spawner
How do I fix?
Set FEATURE_WEBHOOK_V2=true in your .env

elastic4s NoNodeAvailableException when connecting through TcpClient.transport

Am trying to get my hands on elastic4s by one of the samples as given
here
But I keep getting the below exception when am trying to connect through TcpClient.transport :
Exception in thread "main" NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{IFyYWnE_S4aHRVxT9v60LQ}{dockerhost}{192.168.99.100:9300}]]
Am trying to connect to an elastic instance on docker, elastic version is 2.3.4
Here is my code below dependencies.
import com.sksamuel.elastic4s.{ElasticClient, ElasticsearchClientUri, TcpClient}
import org.elasticsearch.action.support.WriteRequest.RefreshPolicy
import org.elasticsearch.common.settings.Settings
import com.sksamuel.elastic4s.ElasticDsl._
object Main extends App {
//val settings = Settings.builder().put("cluster.name", "elasticsearch").build()
val client = TcpClient.transport(ElasticsearchClientUri("elasticsearch://dockerhost:9300"))
client.execute {
bulk(
indexInto("myindex" / "mytype").fields("country" -> "Mongolia", "capital" -> "Ulaanbaatar"),
indexInto("myindex" / "mytype").fields("country" -> "Namibia", "capital" -> "Windhoek")
).refresh(RefreshPolicy.WAIT_UNTIL)
}.await
val result = client.execute {
search("myindex").matchQuery("capital", "ulaanbaatar")
}.await
println(result.hits.head.sourceAsString)
client.close()
}
build.gradle:
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-core_2.11', version: '5.4.9'
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-tcp_2.11', version: '5.4.9'
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-http_2.11', version: '5.4.9'
compile group: 'com.sksamuel.elastic4s', name: 'elastic4s-streams_2.11', version: '5.4.9'
Any help regarding this issue would be helpful.
I am asked this question a lot, and 99% of the time, the answer to this question is
The cluster name is not the default (elasticsearch) and therefore it must be specified in the connection string.
The server is not setup to listen outside of localhost. https://www.elastic.co/blog/elasticsearch-unplugged

cloudify custom workflow missing cloudify_agent runtime information

I want to develop my own workflow named "backup" in cloudify with my own plugin, but when i ran that workflow, the below error occured
'backup' workflow execution failed: RuntimeError: Workflow failed: Task failed 'script_runner.tasks.run' -> Missing cloudify_agent runtime information. This most likely means that the Compute node never started successfully
I don't understand why, anybody can solved me this problem?
Here is my main blueprint code and plugin code
My main blueprint
tosca_definitions_version: cloudify_dsl_1_2
imports:
- plugins/backup.yaml
- types/types.yaml
node_templates:
mynode:
type: cloudify.nodes.Compute
properties:
ip: "ip"
agent_config:
install_method: none
user: "user"
key: "key_uri"
myapp:
type: cloudify.nodes.ApplicationModule
interfaces:
test_platform_backup:
backup:
implementation: scripts/backup.sh
inputs:
port: 6969
post_backup:
implementation: scripts/post_backup.sh
relationships:
- type: cloudify.relationships.contained_in
target: mynode
My plugin code:
from cloudify.decorators import workflow
from cloudify.workflows import ctx
from cloudify.workflows.tasks_graph import forkjoin
#workflow
def backup(operation, type_name, operation_kwargs, is_node_operation, **kwargs):
graph = ctx.graph_mode()
send_event_starting_tasks = {}
send_event_done_tasks = {}
for node in ctx.nodes:
if type_name in node.type_hierarchy:
for instance in node.instances:
send_event_starting_tasks[instance.id] = instance.send_event('Starting to run operation')
send_event_done_tasks[instance.id] = instance.send_event('Done running operation')
for node in ctx.nodes:
if type_name in node.type_hierarchy:
for instance in node.instances:
sequence = graph.sequence()
if is_node_operation:
operation_task = instance.execute_operation(operation, kwargs=operation_kwargs)
else:
forkjoin_tasks = []
for relationship in instance.relationships:
forkjoin_tasks.append(relationship.execute_source_operation(operation))
forkjoin_tasks.append(relationship.execute_target_operation(operation))
operation_task = forkjoin(*forkjoin_tasks)
sequence.add(
send_event_starting_tasks[instance.id],
operation_task,
send_event_done_tasks[instance.id])
for node in ctx.nodes:
for instance in node.instances:
for rel in instance.relationships:
instance_starting_task = send_event_starting_tasks.get(instance.id)
target_done_task = send_event_done_tasks.get(rel.target_id)
if instance_starting_task and target_done_task:
graph.add_dependency(instance_starting_task, target_done_task)
return graph.execute()
It seems that your VM did not start.
From your code I can't understand what you are trying to do.
You don't install and agent and you don't have a fabric connection to the VM, yet you are trying to run operations on the VM.
You should either install an agent, E.g remove the "install_method: none", or add a fabric connection to the VM and run the operations with it.