PySpark Streaming from CloudKarafka on Google Dataproc

Service: GCP Dataproc & Dataproc Hub
Service: CloudKarafka (www.cloudkarafka.com), a simple and quick way to launch your Kafka service.
1: GCP Dataproc Hub gives Spark 2.4.8 out of the box.
2: Create a Dataproc cluster with a specific image version.
gcloud shell:
gcloud dataproc clusters create dataproc-spark312 --image-version=2.0-ubuntu18 --region=us-central1 --single-node
3: Export the cluster configuration and save it to a GCS bucket
gcloud dataproc clusters export dataproc-spark312 --destination dataproc-spark312.yaml --region us-central1
gsutil cp dataproc-spark312.yaml gs://gcp-learn-lib/
4: Create an env file
DATAPROC_CONFIGS=gs://gcp-learn-lib/dataproc-spark312.yaml
NOTEBOOKS_LOCATION=gs://gcp-learn-notebooks/notebooks
DATAPROC_LOCATIONS_LIST=a,b,c
Save as dataproc-hub-config.env and upload it to the GCS bucket
5: Create a Dataproc Hub instance and link it to the cluster configuration above.
Section: "custom env setup"
key: container-env-file
value: gs://gcp-learn-lib/dataproc-hub-config.env
6: Complete the creation.
7: Click on the Jupyter link.
8: It should show your cluster; select it, and select the region (the same as the original cluster from step 2).
9: Dataproc -> Clusters -> click the cluster (its name starts with hub-) -> VM Instances -> SSH
Download the Kafka jars into the Spark jars directory:
cd /usr/lib/spark/jars/
wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.6.2/commons-pool2-2.6.2.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.6.0/kafka-clients-2.6.0.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.1.2/spark-sql-kafka-0-10_2.12-3.1.2.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10_2.12/3.1.2/spark-streaming-kafka-0-10_2.12-3.1.2.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10-assembly_2.12/3.1.2/spark-streaming-kafka-0-10-assembly_2.12-3.1.2.jar
10: Prepare the JAAS and CA files
vi /tmp/cloudkarafka_gcp_oct2021.jaas
vi /tmp/cloudkarafka_gcp_oct2021.ca
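For reference, a minimal SCRAM JAAS entry of the kind this setup expects (SCRAM-SHA-256, as used in the code below) might look like the following sketch; the username and password placeholders are assumptions, so substitute the credentials from your CloudKarafka console:
KafkaClient {
  org.apache.kafka.common.security.scram.ScramLoginModule required
  username="YOUR_CLOUDKARAFKA_USERNAME"
  password="YOUR_CLOUDKARAFKA_PASSWORD";
};
The .ca file typically holds the broker CA certificate in PEM format, as provided by CloudKarafka.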
11: Append the following to the end of /etc/spark/conf/spark-defaults.conf:
sudo vi /etc/spark/conf/spark-defaults.conf
spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/tmp/cloudkarafka_gcp_oct2021.jaas -Dsasl.jaas.config=/tmp/cloudkarafka_gcp_oct2021.jaas -Dssl.ca.location=/tmp/cloudkarafka_gcp_oct2021.ca
spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/tmp/cloudkarafka_gcp_oct2021.jaas
12: Go back to JupyterHub, restart the kernel, and test your code:
query = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "server1:9094,server2:9094,server3:9094") \
    .option("subscribe", "foo-default") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "SCRAM-SHA-256") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream.format("kafka") \
    .option("kafka.bootstrap.servers", "server1:9094,server2:9094,server3:9094") \
    .option("topic", "foo-test") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "SCRAM-SHA-256") \
    .option("checkpointLocation", "/tmp/stream/kafkatest") \
    .start()
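If you run this as a script rather than interactively in the notebook, you will likely also want to block on the returned query, e.g.:
query.awaitTermination()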
13: Enjoy
==============

Related

How to pass different filenames to Spark using Scala

I have the below code on the cluster:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  //----------Write Logic Here--------------------------
  //Read csv file
  val df = spark.read.format("csv").load("books.csv") //Here I want to accept a parameter
  df.show()
  spark.stop
}
I want to pass different files to spark.read.format using the spark-submit command.
The files are on my Linux box.
I used this:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties,${csv_file} \
abc.jar
However, the program just looks for the path under the root folder of the HDFS cluster and throws a file-not-found exception.
Can anyone please help me use the file from the file path I mention? I want my Spark program to read the file from the path I specify, not from the root.
I tried:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  val filepath = args(0)
  //----------Write Logic Here--------------------------
  //Read csv file
  val df = spark.read.format("csv").load(filepath) //Here I want to accept a parameter
  df.show()
  spark.stop
}
I used the below to submit, which doesn't work:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties \
abc.jar ${csv_file}
But the program is not picking up the file. Can anyone please help?
The local file URL format should be:
csv_file="file:///usr/usr1/Test.csv".
Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
I don't have a cluster at hand right now, so I cannot test it. However:
You submit code to YARN, so it will deploy the Spark driver on one of the cluster's nodes, but you don't know which one.
When reading a path starting with "file://" (or with no scheme), Spark will look for the file on the file system of the node the driver is running on.
As you've seen, using spark-submit --files will copy the file into the working directory of the Spark driver. That path is kind of arbitrary, and you should not try to infer it.
But it might work to pass just the filename as the argument to spark.read and let the Spark driver look for it in its working directory (I didn't check):
spark-submit \
  ... \
  --files ...,/path/to/your/file.csv \
  abc.jar file.csv
=> The proper/standard way to do it: first copy your file(s) to HDFS, or another distributed file system the Spark cluster has access to. Then give the Spark app the HDFS file path to use. Something like (again, not tested):
hdfs dfs -put /path/to/your/file.csv /user/your/data
spark-submit ... abc.jar hdfs:///user/your/data/file.csv
For info, if you don't know: to use the hdfs command you need the HDFS client installed on your machine (the actual hdfs command), with suitable configuration pointing at the HDFS cluster. There is also usually some security configuration to do on the cluster side for the client to communicate with it, but that is another issue that depends on where HDFS is running (local, AWS, ...).
Replace ${csv_file} at the end of your spark-submit command with `basename ${csv_file}`:
spark2-submit \
... \
--files myprop.properties,${csv_file} \
abc.jar `basename ${csv_file}`
basename strips the directory part from the full path leaving only the file name:
$ basename /usr/usr1/foo.csv
foo.csv
That way Spark will copy the file to the staging directory and the driver program should be able to access it by its relative path. If the cluster is configured to stage on HDFS, the executors will also have access to the file.

Setting Dynamic Properties in Dataproc Job

Here's what I am trying to accomplish. I want to create a workflow template so that I can spin up a cluster, run a job, and delete the cluster. Within the job, I want to pass in properties that can be set dynamically. For example, set a property to the current date.
Below is a simple example. I use the date function correctly, but it is evaluated at creation time, so it looks like it will always be 12/31/2020 if I set up the workflow today. I know I can delete the job and add it back to the template for each run, but I was hoping for a simpler way.
gcloud dataproc workflow-templates create workflow-mk-test --region us-east1 --project data-engineering-doz4
gcloud dataproc workflow-templates set-managed-cluster workflow-mk-test \
--cluster-name=cluster-mk-test \
--project data-engineering-doz4 \
--image-version=1.3-ubuntu18 \
--bucket data-engineering-dev \
--region us-east1 \
--subnet ml-data-engineering-east1 \
--no-address \
--zone us-east1-b \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 15 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 15
gcloud dataproc workflow-templates add-job pyspark gs://data-engineering-dev/jobs/millard-test.py \
--workflow-template=workflow-mk-test \
--step-id=test-job \
--region=us-east1 \
--project=data-engineering-doz4 \
-- date `date -v -1d '+%Y/%m/%d'` \
--output-location s3n://missionlane-data-engineering-dev-us-east-1/delete-me/`date -v -1d '+%Y/%m/%d'`
Dynamic properties generated by running shell commands are not a supported feature of Dataproc jobs. In this case, you might want to consider making the logic part of your job, i.e., getting the current date dynamically in millard-test.py.
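As a rough sketch of that suggestion (the bucket path is copied from the question; the variable names are made up for illustration), millard-test.py could compute yesterday's date itself instead of relying on shell substitution at template-creation time:
from datetime import datetime, timedelta

# Compute "yesterday" at job run time rather than at workflow-template creation time.
yesterday = (datetime.utcnow() - timedelta(days=1)).strftime("%Y/%m/%d")

# Assumed output layout, mirroring the path used in the question.
output_location = "s3n://missionlane-data-engineering-dev-us-east-1/delete-me/" + yesterday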

How to start master and slave on EMR

I'm new to EMR and I can't run my Spark application on EMR.
My question is how I can run start-master.sh and start-slave.sh on EMR.
I put these 2 commands into a bash file and uploaded it to S3 for the bootstrap action.
aws emr create-cluster --release-label $release_label \
--instance-groups InstanceGroupType="MASTER",InstanceCount=$instance_count,InstanceType=$instance_type,BidPrice=0.2,Name="MASTER" \
InstanceGroupType="CORE",InstanceType=$instance_type,InstanceCount=$instance_count,BidPrice=0.2,Name="CORE" \
--auto-terminate \
--use-default-roles \
--name knx-attribution-spark-$product-$environment-$build_number \
--log-uri s3://knx-logs/emr/knx-attribution-spark-$product-$environment \
--ec2-attributes KeyName=$keypair,SubnetId=$subnet,EmrManagedMasterSecurityGroup=$sg1,EmrManagedSlaveSecurityGroup=$sg1,AdditionalMasterSecurityGroups=$sg2,AdditionalSlaveSecurityGroups=$sg2 \
--tags Name="knx-emr-attribution-spark-$product-$environment" Environment=$environment \
--applications Name=Spark Name=Hadoop \
--bootstrap-actions Path="s3://${BOOTSTRAP_FILE}" \
--steps Type=Spark,Name=Stage,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,--packages,org.mongodb.spark:mongo-spark-connector_2.11:2.3.0,--driver-memory,8g,--executor-memory,4g,--num-executors,4,--py-files,s3://${FILE_ZIP},--master,spark://127.0.0.1:7077,s3://${BUCKET}]
the bootstrap file is:
./spark/sbin/start-master.sh
./spark/sbin/start-slave.sh spark://127.0.0.1:7077
and it always throws this error:
/emr/instance-controller/lib/bootstrap-actions/1/install_lib.sh: line 4: start-master.sh: command not found
Make sure the .sh files are executable before you try.
If not, try this command to make them executable:
chmod +x start-master.sh
And then try running the scripts.
You cannot run Spark on Amazon EMR in standalone mode.
On EMR, Spark is run on YARN rather than in Standalone mode. Unless there’s something else wrong that I’m not seeing, the only thing you should need to change is to remove “--master,spark://127.0.0.1:7077” from the arguments of your spark-submit step. The correct args to use would be “--master,yarn”, but that is the default on EMR, so you don’t need to specify that.
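For example, the --steps argument from the question would become the following (everything except --master is copied from the original command; --master,yarn can also be dropped entirely since it is the EMR default):
--steps Type=Spark,Name=Stage,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,--packages,org.mongodb.spark:mongo-spark-connector_2.11:2.3.0,--driver-memory,8g,--executor-memory,4g,--num-executors,4,--py-files,s3://${FILE_ZIP},--master,yarn,s3://${BUCKET}]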

Attempting to write csv file in sftp mode with Spark on Yarn in a Kerberos environment

I'm trying to write a Dataframe into a csv file and put this csv file into a remote machine. The Spark job is running on Yarn into a Kerberos cluster.
Below is the error I get when the job tries to write the csv file to the remote machine:
diagnostics: User class threw exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=dev, access=WRITE,
inode="/data/9/yarn/local/usercache/dev/appcache/application_1532962490515_15862/container_e05_1532962490515_15862_02_000001/tmp/spark_sftp_connection_temp178/_temporary/0":hdfs:hdfs:drwxr-xr-x
In order to write this csv file, I'm using the following parameters in a method that writes the file in sftp mode:
def writeToSFTP(df: DataFrame, path: String) = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    .option("delimiter", ",")
    .save(path)
}
I'm using the Spark SFTP Connector library as described in the link : https://github.com/springml/spark-sftp
The script which is used to launch the job is :
#!/bin/bash
kinit -kt /home/spark/dev.keytab dev@CLUSTER.HELP.FR
spark-submit --class fr.edf.dsp.launcher.LauncherInsertion \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 5g \
--executor-memory 5g \
--queue dev \
--files /home/spark/dev.keytab#user.keytab,\
/etc/krb5.conf#krb5.conf,\
/home/spark/jar/dev-application-SNAPSHOT.conf#app.conf \
--conf "spark.executor.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
/home/spark/jar/dev-SNAPSHOT.jar > /home/spark/out.log 2>&1&
The csv files are not written into HDFS. Once the DataFrame is built, I try to send it to the remote machine. I suspect a Kerberos issue with the sftp Spark connector: YARN can't contact the remote machine...
Any help is welcome, thanks.
Add a temporary location where you have write access, and do not worry about cleaning it up, because these files will be deleted once the sftp transfer is done:
def writeToSFTP(df: DataFrame, path: String) = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    .option("hdfsTempLocation", "/user/currentuser/")
    .option("delimiter", ",")
    .save(path)
}

CDH Spark Streaming consumer for Kerberos Kafka

Has anyone tried to use Spark Streaming (PySpark) as a consumer for Kerberos-secured Kafka in CDH?
I searched CDH and only found some examples in Scala.
Does that mean CDH does not support this?
Can anyone help with this?
CDH supports the PySpark-based Structured Streaming API for connecting to a Kerberos-secured Kafka cluster as well. I also found it hard to find example code. You can refer to the sample code below, which is well tested and implemented in a CDH prod environment.
Note: points to consider in the sample code below.
Adjust the package versions based on your environment.
Specify the right JAAS and keytab file locations in the spark-submit command and in the config parameters in the code.
This code is given as an example that reads from a Kerberos-enabled Kafka topic and writes to an HDFS location.
spark/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,com.databricks:spark-avro_2.11:3.2.0 --conf spark.ui.port=4055 --files /home/path/spark_jaas,/home/bdpda/bdpda.headless.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas" pysparkstructurestreaming.py
PySpark code: pysparkstructurestreaming.py
from pyspark.sql import SparkSession

# Spark session (the DStream StreamingContext is not needed for Structured Streaming):
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()

# Kafka topic details:
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'

# Creating the readStream DataFrame:
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
    .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.client.id", "Client_id") \
    .option("kafka.sasl.kerberos.service.name", "kafka") \
    .option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
    .option("kafka.ssl.truststore.password", "password_rd") \
    .option("kafka.sasl.kerberos.keytab", "/home/path.keytab") \
    .option("kafka.sasl.kerberos.principal", "path") \
    .load()

df1 = df.selectExpr("CAST(value AS STRING)")

# Creating the writeStream query:
query = df1.writeStream \
    .option("path", "target_directory") \
    .format("csv") \
    .option("checkpointLocation", "chkpint_directory") \
    .outputMode("append") \
    .start()

# Block until the streaming query terminates.
query.awaitTermination()