Dataflow template is not taking input parameters - dataflow

I have a Dataflow template created with the command below:
mvn compile exec:java \
-Dexec.mainClass=com.StarterPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=jason-qa \
--stagingLocation=gs://jason_test/dataflow/staging \
--region=asia-east1 \
--zone=asia-east1-a \
--subnetwork=regions/asia-east1/subnetworks/dmz-asia-east1 \
--templateLocation=gs://jason_test/dataflow/Template \
--campaignId="
I execute the template with the command below:
gcloud dataflow jobs run jason203 \
--project=jason-qa \
--region=asia-east1 \
--gcs-location gs://jason_test/dataflow/Template \
--parameters campaignId=run-test323,output=gs://jason_test/dataflow/counts
The code is copied from the Count example, with a few changes:
public interface MyOptions extends PipelineOptions {
    @Description("campaign id")
    @Required
    @Default.String("default-test123")
    ValueProvider<String> getCampaignId();
    void setCampaignId(ValueProvider<String> campaignId);
}
static void run(MyOptions options) {
    Pipeline p = Pipeline.create(options);
    String id = options.getCampaignId().get();
    p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
        .apply(new Count())
        .apply(MapElements.via(new FormatAsTextFn()))
        .apply("WriteCounts", TextIO.write().to(options.getOutput() + id));
    p.run();
}
From the Dataflow job summary --> Pipeline options I can see campaignId = run-test323,
but the file written to the bucket is named -00000-of-00003 (it should be run-test323-00000-of-00003).

You should remove the .get() call from the code; then it will work. A ValueProvider is only populated when the templated job actually runs, so calling .get() while the template graph is being built does not see the campaignId supplied at job submission; instead, pass the ValueProvider itself into a transform that accepts one.
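A minimal sketch of that change (assuming getOutput() remains an ordinary String option fixed when the template is built and campaignId is the only runtime parameter; the NestedValueProvider wrapping is one way to defer the concatenation, not necessarily how the original pipeline did it):

import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;

static void run(MyOptions options) {
    Pipeline p = Pipeline.create(options);
    // Do NOT call options.getCampaignId().get() here: at template-construction
    // time the runtime value is not available yet.
    String outputPrefix = options.getOutput();
    ValueProvider<String> outputPath =
        NestedValueProvider.of(options.getCampaignId(), id -> outputPrefix + id);
    p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
        .apply(new Count())
        .apply(MapElements.via(new FormatAsTextFn()))
        // TextIO.write().to(...) accepts a ValueProvider, so campaignId is
        // resolved only when the templated job runs, not when the template is created.
        .apply("WriteCounts", TextIO.write().to(outputPath));
    p.run();
}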

Related

How do I pass a "blob" to AWS sns publish on the command line?

I am trying to publish a message to an SNS topic from the command line, which includes binary data.
How do I pass this binary data into the message-attributes field, as indicated in https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sns/publish.html for --message-attributes?
This is my command:
awslocal sns publish \
--topic-arn ${AWS_TOPIC_ARN} \
--subject="Test Subject" \
--message "Data for today..." \
--message-attributes=file://sns_input.json
and my input:
{
"ProtoData": {
"DataType": "Binary",
"BinaryValue": **BLOB**
}
}
My "BLOB" data is in file://sample_proto_data.db

PySpark-streaming: How to access files sent using --files

I am running a PySpark streaming client with Kafka and want to send files to the cluster.
I am using the --files option:
spark-submit --master yarn \
--deploy-mode client \
--jars "/home/aiman/testing_aiman/spark-sql-kafka-0-10_2.11-2.4.0-cdh6.3.4.jar" \
--files /home/aiman/testing_aiman/kafka.keystore.uat.jks#keystore.jks,/home/aiman/testing_aiman/kafka.truststore.uat.jks#truststore.jks \
sparkStreamingTest.py
and trying to access the files using SparkFiles.get():
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.appName("Test Streaming").getOrCreate()
# Get the Keystore File and Truststore File
keystore = str(SparkFiles.get('keystore.jks'))
truststore = str(SparkFiles.get('truststore.jks'))
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","kafka.server.com:9093") \
.option("subscribe","TEST_TOPIC") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SSL") \
.option("kafka.ssl.keystore.location", keystore) \
.option("kafka.ssl.keystore.password", "abcd") \
.option("kafka.ssl.key.password","abcd") \
.option("kafka.ssl.truststore.type","JKS") \
.option("kafka.ssl.truststore.location", truststore) \
.option("kafka.ssl.truststore.password","abcd") \
.option("kafka.ssl.enabled.protocols","TLSv1") \
.option("kafka.ssl.endpoint.identification.algorithm","") \
.load()
....
...
but I am still getting a NoSuchFileException:
Caused by: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /tmp/spark-4578a498-f96d-4c8a-a716-e128d90531fb/userFiles-5792bc5c-d513-4aa3-9014-26df66ace1db/keystore.jks of type JKS
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:357)
at org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:240)
at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:141)
... 55 more
Caused by: java.nio.file.NoSuchFileException: /tmp/spark-4578a498-f96d-4c8a-a716-e128d90531fb/userFiles-5792bc5c-d513-4aa3-9014-26df66ace1db/keystore.jks
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:350)
... 57 more
Where am I going wrong?
Instead of using the SparkFiles.get() method to get the absolute path, I used the file names directly, and also removed the #keystore.jks and #truststore.jks aliases from the --files option in the spark-submit command:
spark-submit --master yarn \
--deploy-mode client \
--jars "/home/aiman/testing_aiman/spark-sql-kafka-0-10_2.11-2.4.0-cdh6.3.4.jar" \
--files /home/aiman/testing_aiman/kafka.keystore.uat.jks,/home/aiman/testing_aiman/kafka.truststore.uat.jks \
sparkStreamingTest.py
Using the actual file names:
#Commenting the SparkFiles.get() method
#keystore = str(SparkFiles.get('keystore.jks'))
#truststore = str(SparkFiles.get('truststore.jks'))
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","kafka.server.com:9093") \
.option("subscribe","TEST_TOPIC") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SSL") \
.option("kafka.ssl.keystore.location", "kafka.keystore.uat.jks") \
.option("kafka.ssl.keystore.password", "abcd") \
.option("kafka.ssl.key.password","abcd") \
.option("kafka.ssl.truststore.type","JKS") \
.option("kafka.ssl.truststore.location", "kafka.truststore.uat.jks") \
.option("kafka.ssl.truststore.password","abcd") \
.option("kafka.ssl.enabled.protocols","TLSv1") \
.option("kafka.ssl.endpoint.identification.algorithm","") \
.load()

Does HTTPie have the equivalent of curl's -d option?

I want to query a REST API with HTTPie. I usually do so with curl, with which I am able to specify maxKeys and startAfterFilename, e.g.
curl --location --request GET -G \
"https://some.thing.some.where/data/v1/datasets/mydataset/versions/2/files" \
-d maxKeys=100 \
-d startAfterFilename=YYYMMDD_HHMMSS.file \
--header "Authorization: verylongtoken"
How can I use those -d options in HTTPie?
In your case the command looks like this:
http -F https://some.thing.some.where/data/v1/datasets/mydataset/versions/2/files \
Authorization:verylongtoken \
startAfterFilename=="YYYMMDD_HHMMSS.file" \
maxKeys=="100"
There are, though, a number of ways to pass data with HTTPie. For example:
http POST http://example.com/posts/3 \
Origin:example.com \ # : HTTP headers
name="John Doe" \ # = string
q=="search" \ # == URL parameters (?q=search)
age:=29 \ # := for non-strings
list:='[1,3,4]' \ # := json
file@file.bin \ # @ attach file
token=@token.txt \ # =@ read from file (text)
user:=@user.json # :=@ read from file (json)
Or, in the case of forms
http --form POST example.com \
name="John Smith" \
cv=@document.txt

Yocto - 'tools-profile' in IMAGE_FEATURES (added via EXTRA_IMAGE_FEATURES) is not a valid image feature

I am trying to enable tools-profile in Yocto, but I get an error saying that tools-profile is not a valid image feature. How can I debug this? How do I check why it is failing? Here is how I tried it.
Here is my bblayers.conf
LCONF_VERSION = "7"
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
${TOPDIR}/../poky/meta \
${TOPDIR}/../poky/meta-poky \
${TOPDIR}/../poky/meta-yocto-bsp \
${TOPDIR}/../layers/meta-gplv2 \
${TOPDIR}/../layers/meta-xilinx/meta-xilinx-bsp \
${TOPDIR}/../layers/openembedded-core/meta \
${TOPDIR}/../layers/meta-openembedded/meta-oe \
${TOPDIR}/../layers/meta-openembedded/meta-multimedia \
${TOPDIR}/../layers/meta-openembedded/meta-networking \
${TOPDIR}/../layers/meta-openembedded/meta-python \
${TOPDIR}/../layers/meta-custom \
"
BBLAYERS_NON_REMOVABLE ?= " \
${TOPDIR}/../poky/meta \
${TOPDIR}/../poky/meta-poky \
"
In the local.conf, I have added the following.
EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile"
Probably too late but I will answer this question as this happened to me yesterday.
The issue is that you have an image that inherits only image.bbclass. If you look into image.bbclass you will see that it doesn't know anything about tools-profile, but core-image.bbclass does.
All you have to do is change inherit image to inherit core-image in the image recipe that is throwing the error.
In my case it was swupdate-image.

Does Scala have "Options" to parse command-line arguments in spark-submit, just like Java? [duplicate]

In order to parse command line arguments while using spark-submit:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.Pickup --master=yarn --conf spark.ui.port=0000 --driver-class-path /home/hdpusr/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpusr/jars/postgresql-42.1.4.jar,/home/hdpusr/jars/postgresql-42.1.4.jar --executor-cores 4 --executor-memory 4G --keytab /home/hdpusr/hdpusr.keytab --principal hdpusr@DEVUSR.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Spark_APP --conf spark.executor.extraClassPath=/home/hdpusr/jars/greenplum.jar sparkload_2.11-0.1.jar ORACLE
I am passing a database name, ORACLE, which I parse in the code as:
def main(args: Array[String]): Unit = {
val dbtype = args(0).toString
.....
}
Is there a way I can give it a name like "--dbname" and then check for that option in spark-submit to get the option's value?
Ex:
SPARK_MAJOR_VERSION=2 spark-submit --class com.partition.source.Pickup --master=yarn --conf spark.ui.port=0000 --driver-class-path /home/hdpusr/jars/postgresql-42.1.4.jar --conf spark.jars=/home/hdpusr/jars/postgresql-42.1.4.jar,/home/hdpusr/jars/postgresql-42.1.4.jar --executor-cores 4 --executor-memory 4G --keytab /home/hdpusr/hdpusr.keytab --principal hdpusr@DEVUSR.COM --files /usr/hdp/current/spark2-client/conf/hive-site.xml,testconnection.properties --name Spark_APP --conf spark.executor.extraClassPath=/home/hdpusr/jars/greenplum.jar sparkload_2.11-0.1.jar --dbname ORACLE
In Java there are two classes from the org.apache.commons.cli package which can be used to do the same:
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public static void main(String[] args) {
    Options options = new Options();
    Option input = new Option("s", "ssn", true, "source system names");
    input.setRequired(false);
    options.addOption(input);
    CommandLineParser parser = new DefaultParser();
    HelpFormatter formatter = new HelpFormatter();
    CommandLine cmd = null;
    try {
        cmd = parser.parse(options, args);
        if (cmd.hasOption("s")) {
            // The -s/--ssn argument was supplied: run the Recon only for the received SSNs.
        }
    } catch (ParseException e) {
        formatter.printHelp("utility-name", options);
        e.printStackTrace();
        System.exit(1);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
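For reference, once parsing succeeds the value itself comes from getOptionValue; a small sketch based on the setup above (the printed message is only illustrative):

// inside the try block, after parser.parse(options, args)
if (cmd.hasOption("s")) {
    String sourceSystems = cmd.getOptionValue("s"); // e.g. "--ssn ORACLE" yields "ORACLE"
    System.out.println("Received source system names: " + sourceSystems);
}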
Could anyone let me know if it is possible to name the command-line arguments and parse them accordingly?
If you use --dbname=ORACLE, for example:
// requires the Typesafe Config library for ConfigException
import com.typesafe.config.ConfigException

val pattern = """--dbname=(.*)""".r
val params = args.map {
  case pattern(value) => value
  case arg => throw new ConfigException.Generic(s"""unable to parse command-line argument "$arg"""")
}
\s matches whitespace, so you could use it to support the --dbname ORACLE form instead, but it is easier if you just use the single --dbname=ORACLE string.
Here you can see all the possibilities.
If we are not particular about the key name, we can prefix it with spark., in this case spark.dbname, and pass it as a conf argument, e.g. spark-submit --conf spark.dbname=<> ..., or add it to spark-defaults.conf.
In the user code we can then access the key as sparkContext.getConf.get("spark.dbname").