PySpark-streaming: How to access files sent using --files

I am running a PySpark streaming client with Kafka and I want to send files to the cluster.
I am using the --files option:
spark-submit --master yarn \
--deploy-mode client \
--jars "/home/aiman/testing_aiman/spark-sql-kafka-0-10_2.11-2.4.0-cdh6.3.4.jar" \
--files /home/aiman/testing_aiman/kafka.keystore.uat.jks#keystore.jks,/home/aiman/testing_aiman/kafka.truststore.uat.jks#truststore.jks \
sparkStreamingTest.py
and I am trying to access the files using SparkFiles.get():
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.appName("Test Streaming").getOrCreate()
# Get the Keystore File and Truststore File
keystore = str(SparkFiles.get('keystore.jks'))
truststore = str(SparkFiles.get('truststore.jks'))
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","kafka.server.com:9093") \
.option("subscribe","TEST_TOPIC") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SSL") \
.option("kafka.ssl.keystore.location", keystore) \
.option("kafka.ssl.keystore.password", "abcd") \
.option("kafka.ssl.key.password","abcd") \
.option("kafka.ssl.truststore.type","JKS") \
.option("kafka.ssl.truststore.location", truststore) \
.option("kafka.ssl.truststore.password","abcd") \
.option("kafka.ssl.enabled.protocols","TLSv1") \
.option("kafka.ssl.endpoint.identification.algorithm","") \
.load()
....
...
but I am still getting a NoSuchFileException:
Caused by: org.apache.kafka.common.KafkaException: Failed to load SSL keystore /tmp/spark-4578a498-f96d-4c8a-a716-e128d90531fb/userFiles-5792bc5c-d513-4aa3-9014-26df66ace1db/keystore.jks of type JKS
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:357)
at org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:240)
at org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:141)
... 55 more
Caused by: java.nio.file.NoSuchFileException: /tmp/spark-4578a498-f96d-4c8a-a716-e128d90531fb/userFiles-5792bc5c-d513-4aa3-9014-26df66ace1db/keystore.jks
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:350)
... 57 more
Where am I going wrong?
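For what it's worth, here is a quick way to compare what each side resolves (a rough diagnostic sketch along the lines of the setup above, using the keystore.jks alias from the --files option):
import os
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test Streaming").getOrCreate()

# The path the driver builds for the alias: SparkFiles.get() simply joins the
# name onto the driver's temporary userFiles directory.
print(SparkFiles.get("keystore.jks"))

# The file names one executor actually sees in its own working directory,
# which is where YARN localizes the --files copies.
print(spark.sparkContext
      .parallelize([0], 1)
      .map(lambda _: sorted(os.listdir(".")))
      .collect())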

Instead of using the SparkFiles.get() method to get an absolute path, use the file names directly, and remove the #keystore.jks and #truststore.jks aliases from the --files option in the spark-submit command. With --files, the keystores are localized into each YARN container's working directory under their original file names, so a plain relative name resolves where the Kafka client opens it, which the SparkFiles.get() path did not:
spark-submit --master yarn \
--deploy-mode client \
--jars "/home/aiman/testing_aiman/spark-sql-kafka-0-10_2.11-2.4.0-cdh6.3.4.jar" \
--files /home/aiman/testing_aiman/kafka.keystore.uat.jks,/home/aiman/testing_aiman/kafka.truststore.uat.jks \
sparkStreamingTest.py
Using the actual file names:
# Commented out the SparkFiles.get() calls
#keystore = str(SparkFiles.get('keystore.jks'))
#truststore = str(SparkFiles.get('truststore.jks'))
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","kafka.server.com:9093") \
.option("subscribe","TEST_TOPIC") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SSL") \
.option("kafka.ssl.keystore.location", "kafka.keystore.uat.jks") \
.option("kafka.ssl.keystore.password", "abcd") \
.option("kafka.ssl.key.password","abcd") \
.option("kafka.ssl.truststore.type","JKS") \
.option("kafka.ssl.truststore.location", "kafka.truststore.uat.jks") \
.option("kafka.ssl.truststore.password","abcd") \
.option("kafka.ssl.enabled.protocols","TLSv1") \
.option("kafka.ssl.endpoint.identification.algorithm","") \
.load()
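For completeness, a small sanity check can confirm that the relative names are visible both to the driver (in client mode they resolve against the spark-submit working directory) and inside an executor container (where --files places its copies). A minimal sketch, reusing the same session:
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test Streaming").getOrCreate()

def visible(name):
    # name, whether it exists relative to the current working directory,
    # and what that working directory is
    return name, os.path.exists(name), os.getcwd()

names = ["kafka.keystore.uat.jks", "kafka.truststore.uat.jks"]
for n in names:
    print("driver  :", visible(n))
print("executor:", spark.sparkContext.parallelize(names, 1).map(visible).collect())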

Related

Update/Replace value in Mongo Database using Mongo Spark Connector (Pyspark) v10x

I am using the versions below:
mongo-spark-connector:10.0.5
Spark version 3.1.3
And I configure the mongo-spark-connector as follows:
spark = SparkSession.builder \
.appName("hello") \
.master("yarn") \
.config("spark.executor.memory", "4g") \
.config('spark.driver.memory', '2g') \
.config('spark.driver.cores', '4') \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector:10.0.5') \
.config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.27.1.jar') \
.enableHiveSupport() \
.getOrCreate()
I want to ask how to update and replace a value in a Mongo database.
I read the question Updating mongoData with MongoSpark, but that approach works with mongo-spark v2.x; with mongo-spark v10 and above it fails.
Example:
I have these following attributes:
from bson.objectid import ObjectId
data = {
'_id' : ObjectId("637367d5262dc89a8e318d09"),
'database' : database_name,
"table" : table,
"latestSyncAt": lastestSyncAt,
"lastest_id" : str(lastest_id)
}
df = spark.createDataFrame(data)
How do I update or replace the _id attribute value in the Mongo database using the mongo-spark-connector?
Thank you very much for your support.
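For reference, a minimal sketch of what a replace-by-_id write might look like with the 10.x connector. It assumes the documented v10 write options operationType and idFieldList and the spark.mongodb.write.connection.uri key (worth verifying against the docs for your exact release), and it keeps _id as a plain string because a Spark DataFrame has no native ObjectId type; if the existing documents store _id as a real bson ObjectId, a string value will not match them, and updating through pymongo directly may be simpler:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("mongo-replace-sketch") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.5") \
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017") \
    .getOrCreate()

# All names and values below are placeholders; _id is kept as a string.
row = {
    "_id": "637367d5262dc89a8e318d09",
    "database": "my_database",
    "table": "my_table",
    "latestSyncAt": "2022-11-15T10:00:00",
    "lastest_id": "12345",
}
df = spark.createDataFrame([row])

# idFieldList names the column(s) that identify the document, and
# operationType "replace" asks the connector to replace the matching
# document instead of inserting a new one.
df.write.format("mongodb") \
    .mode("append") \
    .option("database", "my_database") \
    .option("collection", "my_collection") \
    .option("idFieldList", "_id") \
    .option("operationType", "replace") \
    .save()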

Does HTTPie have the equivalent of curl's -d option?

I want to query a REST API with HTTPie. I am used to doing this with curl, which lets me specify maxKeys and startAfterFilename, e.g.
curl --location --request GET -G \
"https://some.thing.some.where/data/v1/datasets/mydataset/versions/2/files" \
-d maxKeys=100 \
-d startAfterFilename=YYYMMDD_HHMMSS.file \
--header "Authorization: verylongtoken"
How can I use those -d options in HTTPie?
In your case the command looks like this:
http -F https://some.thing.some.where/data/v1/datasets/mydataset/versions/2/files \
Authorization:verylongtoken \
startAfterFilename=="YYYMMDD_HHMMSS.file" \
maxKeys=="100"
There are, though, a number of other ways to pass data with HTTPie. For example:
http POST http://example.com/posts/3 \
Origin:example.com \ # : HTTP headers
name="John Doe" \ # = string
q=="search" \ # == URL parameters (?q=search)
age:=29 \ # := for non-strings
list:='[1,3,4]' \ # := json
file@file.bin \ # @ attach file
token=@token.txt \ # =@ read from file (text)
user:=@user.json # :=@ read from file (json)
Or, in the case of forms
http --form POST example.com \
name="John Smith" \
cv=@document.txt

Yocto - 'tools-profile' in IMAGE_FEATURES (added via EXTRA_IMAGE_FEATURES) is not a valid image feature

I am trying to enable tools-profile in Yocto, but I get an error saying that tools-profile is not a valid image feature. How can I debug this and check why it is failing? Here is how I tried it.
Here is my bblayers.conf
LCONF_VERSION = "7"
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
${TOPDIR}/../poky/meta \
${TOPDIR}/../poky/meta-poky \
${TOPDIR}/../poky/meta-yocto-bsp \
${TOPDIR}/../layers/meta-gplv2 \
${TOPDIR}/../layers/meta-xilinx/meta-xilinx-bsp \
${TOPDIR}/../layers/openembedded-core/meta \
${TOPDIR}/../layers/meta-openembedded/meta-oe \
${TOPDIR}/../layers/meta-openembedded/meta-multimedia \
${TOPDIR}/../layers/meta-openembedded/meta-networking \
${TOPDIR}/../layers/meta-openembedded/meta-python \
${TOPDIR}/../layers/meta-custom \
"
BBLAYERS_NON_REMOVABLE ?= " \
${TOPDIR}/../poky/meta \
${TOPDIR}/../poky/meta-poky \
"
In the local.conf, I have added the following.
EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile"
Probably too late, but I will answer this question as the same thing happened to me yesterday.
The issue is that you have an image recipe that inherits only image.bbclass. If you look into image.bbclass you will see that it doesn't know anything about tools-profile, but core-image.bbclass does.
All you have to do is change inherit image to inherit core-image in the image recipe that is throwing the error.
In my case it was swupdate-image.

Add OP-TEE to Yocto

I am attempting to build a Yocto image incorporating OP-TEE. I'm used to the output from OP-TEE's build repo (bl1.bin etc.) and I can't get the Yocto build to produce the same. Also, no /dev/tee devices exist (so maybe the kernel isn't configured), although xtest and optee-examples have been installed. I am first attempting to build against QEMU ARMv8.
Here is my config so far:
local.conf
MACHINE ?= "qemuarm64"
PREFERRED_PROVIDER_virtual/kernel = "linux-linaro-aarch64"
IMAGES_CLASSES = "image_types_bios image_types_uefi"
DISTRO ?= "poky"
PACKAGE_CLASSES ?= "package_rpm"
EXTRA_IMAGE_FEATURES ?= "debug-tweaks"
PATCHRESOLVE = "noop"
BB_DISKMON_DIRS ??= "\
STOPTASKS,${TMPDIR},1G,100K \
STOPTASKS,${DL_DIR},1G,100K \
STOPTASKS,${SSTATE_DIR},1G,100K \
STOPTASKS,/tmp,100M,100K \
ABORT,${TMPDIR},100M,1K \
ABORT,${DL_DIR},100M,1K \
ABORT,${SSTATE_DIR},100M,1K \
ABORT,/tmp,10M,1K"
PACKAGECONFIG_append_pn-qemu-system-native = " sdl"
CONF_VERSION = "1"
OPTEEOUTPUTMACHINE = "vexpress"
OPTEEMACHINE = "vexpress-qemu_armv8a"
CORE_IMAGE_EXTRA_INSTALL += "optee-client optee-examples optee-os optee-test python-pycrypto"
INSANE_SKIP_optee-examples = "ldflags"
DISTRO_FEATURES_append = "optee"
bblayers.conf
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
/mnt/raid/yocto_stuff/arm64/poky/meta \
/mnt/raid/yocto_stuff/arm64/poky/meta-poky \
/mnt/raid/yocto_stuff/arm64/poky/meta-yocto-bsp \
/mnt/raid/yocto_stuff/arm64/meta-linaro/meta-optee \
/mnt/raid/yocto_stuff/arm64/meta-linaro/meta-linaro \
/mnt/raid/yocto_stuff/arm64/meta-openembedded/meta-oe \
/mnt/raid/yocto_stuff/arm64/meta-openembedded/meta-networking \
/mnt/raid/yocto_stuff/arm64/meta-openembedded/meta-python \
"
Thanks for any help.

Dataflow template is not taking input parameters

I have a Dataflow template created with the command below:
mvn compile exec:java \
-Dexec.mainClass=com.StarterPipeline \
-Dexec.args="--runner=DataflowRunner \
--project=jason-qa \
--stagingLocation=gs://jason_test/dataflow/staging \
--region=asia-east1 \
--zone=asia-east1-a \
--subnetwork=regions/asia-east1/subnetworks/dmz-asia-east1 \
--templateLocation=gs://jason_test/dataflow/Template \
--campaignId="
I execute the template with the command below:
gcloud dataflow jobs run jason203 \
--project=jason-qa \
--region=asia-east1 \
--gcs-location gs://jason_test/dataflow/Template \
--parameters campaignId=run-test323,output=gs://jason_test/dataflow/counts
The code is copied from the Count example, with a few changes:
public interface MyOptions extends PipelineOptions {
@Description("campaign id")
@Required
@Default.String("default-test123")
ValueProvider<String> getCampaignId();
void setCampaignId(ValueProvider<String> campaignId);
}
static void run(MyOptions options) {
Pipeline p = Pipeline.create(options);
String id = options.getCampaignId().get();
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply(new Count())
.apply(MapElements.via(new FormatAsTextFn()))
.apply("WriteCounts", TextIO.write().to(options.getOutput() + id));
p.run();
}
From the Dataflow job summary --> Pipeline options I can see campaignId: run-test323,
but the result file name in the bucket is -00000-of-00003 (it should be run-test323-00000-of-00003).
You should remove the .get() call on the ValueProvider; then it will work. When the template is built, the runtime parameter does not exist yet, so .get() bakes in whatever value is available at construction time (here the empty --campaignId=) instead of the value supplied when the job is run. Leave the value wrapped in the ValueProvider and pass it to a transform that accepts a ValueProvider (for example via ValueProvider.NestedValueProvider), so it is only resolved at job execution time.