CDH Spark Streaming consumer for Kerberos Kafka - pyspark

Has anyone tried to use Spark Streaming (pyspark) as a consumer for Kerberos-secured Kafka in CDH?
I searched CDH and only found some examples in Scala.
Does that mean CDH does not support this?
Can anyone help with this?

CDH also supports the PySpark Structured Streaming API for connecting to a Kerberos-secured Kafka cluster. I also found it hard to find example code. You can refer to the sample code below, which has been tested and deployed in a CDH production environment.
Note: points to consider in the sample code below:
Adjust the package versions to match your environment.
Specify the correct JAAS and keytab file locations in the spark-submit command and in the config parameters in the code.
This code is given as an example that reads a topic from a Kerberos-enabled Kafka cluster and writes it to an HDFS location.
spark/bin/spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,com.databricks:spark-avro_2.11:3.2.0 \
--conf spark.ui.port=4055 \
--files /home/path/spark_jaas,/home/bdpda/bdpda.headless.keytab \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas" \
pysparkstructurestreaming.py
Pyspark code: pysparkstructurestreaming.py
from pyspark.sql import SparkSession

# Spark session (Structured Streaming does not need a StreamingContext):
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()
# Kafka topic details:
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'
# Create the readStream DataFrame:
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
    .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.client.id", "Clinet_id") \
    .option("kafka.sasl.kerberos.service.name", "kafka") \
    .option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
    .option("kafka.ssl.truststore.password", "password_rd") \
    .option("kafka.sasl.kerberos.keytab", "/home/path.keytab") \
    .option("kafka.sasl.kerberos.principal", "path") \
    .load()
# Kafka delivers the value as binary, so cast it to a string:
df1 = df.selectExpr("CAST(value AS STRING)")
# Start the writeStream query that writes to HDFS:
query = df1.writeStream \
    .format("csv") \
    .option("path", "target_directory") \
    .option("checkpointLocation", "chkpint_directory") \
    .outputMode("append") \
    .start()
# Block until the streaming query terminates:
query.awaitTermination()
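The spark-submit command above points java.security.auth.login.config at a JAAS file (spark_jaas) whose contents are not shown in the answer. As a minimal sketch, assuming a keytab-based login through the standard Krb5LoginModule, such a file could be generated like this. The helper name render_jaas and the principal are hypothetical placeholders; the keytab path is the one from the command above.

```python
def render_jaas(keytab_path: str, principal: str) -> str:
    """Return the text of a JAAS file with a KafkaClient section (keytab login)."""
    return (
        "KafkaClient {\n"
        "  com.sun.security.auth.module.Krb5LoginModule required\n"
        "  useKeyTab=true\n"
        "  storeKey=true\n"
        f'  keyTab="{keytab_path}"\n'
        f'  principal="{principal}";\n'
        "};\n"
    )


# Hypothetical principal; replace it with the one that matches your keytab.
jaas_text = render_jaas("/home/bdpda/bdpda.headless.keytab", "user@EXAMPLE.COM")
print(jaas_text)
```

The Kafka client looks up the section named KafkaClient, so that name should stay as it is even if the file name changes.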

Related

Read Json Kafka message without schema

Currently we are working on real-time data feeds containing JSON data.
While reading the examples at
https://sparkbyexamples.com/spark/spark-streaming-with-kafka/
it looks like we need a schema for the Kafka JSON message.
Is there any other way to process the data without a schema?
Try the code below after starting ZooKeeper, the Kafka server, and the other required services.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .option("startingOffsets", "latest") \
    .load()  # use "earliest" instead of "latest" to read the topic from the beginning
print("Printing Schema of transaction_detail_df: ")
df.printSchema()
transaction_detail_df1 = df.selectExpr("CAST(value AS STRING)")
trans_detail_write_stream = transaction_detail_df1 \
    .writeStream \
    .trigger(processingTime='2 seconds') \
    .option("truncate", "false") \
    .format("console") \
    .start()
trans_detail_write_stream.awaitTermination()
Just adjust the basic configuration and you should be able to see the output.
You can use the get_json_object Spark SQL function to parse data out of a JSON string without defining any additional schema.
You can simply use the cast function to deserialize the binary key/value, as the example shows.
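To make the get_json_object suggestion concrete, here is a minimal sketch that builds the selectExpr strings for a few top-level JSON fields. The helper name json_field_exprs and the field names are hypothetical, and the field names are assumed to be plain identifiers; get_json_object returns each extracted value as a string.

```python
def json_field_exprs(fields, column="value"):
    """Build selectExpr strings that pull top-level JSON fields out of a Kafka column."""
    return [
        f"get_json_object(CAST({column} AS STRING), '$.{name}') AS {name}"
        for name in fields
    ]


# Hypothetical field names; replace them with keys from your own messages.
exprs = json_field_exprs(["transaction_id", "amount"])
for expr in exprs:
    print(expr)

# With the streaming DataFrame from the example above, this would be used as:
#   parsed_df = df.selectExpr(*exprs)
```

Fields that are missing from a message come back as null rather than raising an error, which is convenient when the feed has no fixed schema.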

Databricks - overwriteSchema

Multiple times I've had an issue while updating a delta table in Databricks where overwriting the Schema fails the first time, but is then successful the second time. The solution to my problem was to simply run it again, and I'm unable to reproduce at this time. If it happens again I'll come back and post the exact error message, but it was in essence a Schema Mismatch error. Has anyone else had a similar problem?
overwriteSchema = True
DF.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", overwriteSchema) \
.partitionBy(datefield) \
.saveAsTable(deltatable)
The option value should be passed as a string, not a Boolean:
.option("overwriteSchema", "True")
DF.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "True") \
.partitionBy(datefield) \
.saveAsTable(deltatable)

How to pass different filenames to spark using scala

I have the code below on the cluster:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  //----------Write Logic Here--------------------------
  //Read csv file
  val df = spark.read.format("csv").load("books.csv") // Here I want to accept a parameter
  df.show()
  spark.stop
}
I want to pass different files to spark.read.format using spark-submit command.
The files are on my linux box.
I used this :
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties,${csv_file} \
abc.jar
However, the program just looks for the path starting from the root folder of the HDFS cluster and throws a file-not-found exception.
Can anyone please help me use the file from the file path I mention? I want my Spark program to read the file from the path I give, not from the root.
I tried:
def main(args: Array[String]) {
  val spark = SparkSession.builder.appName("SparkData").getOrCreate()
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  import spark.implicits._
  import spark.sql
  val filepath = args(0)
  //----------Write Logic Here--------------------------
  //Read csv file
  val df = spark.read.format("csv").load(filepath) // Here I want to accept a parameter
  df.show()
  spark.stop
}
I used the command below to submit it, which doesn't work:
csv_file="/usr/usr1/Test.csv"
spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files myprop.properties \
abc.jar ${csv_file}
But the program is not picking up the file. Can anyone please help?
The local files URL format should be:
csv_file="file:///usr/usr1/Test.csv".
Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
I don't have a cluster at hand right now, so I cannot test it. However:
You submit the code to YARN, so it will deploy the Spark driver on one of the cluster's nodes, but you don't know which one.
When reading a path that starts with "file://", Spark looks for the file on the local file system of the node the code is running on. A path with no scheme is resolved against the cluster's default file system, which is typically HDFS on YARN, and that matches the error you saw.
As you've seen, spark-submit --files copies the file into the starting folder of the Spark driver (so on the master node). The path is kind of arbitrary, and you should not try to infer it.
But maybe it would work to pass just the filename as the argument to spark.read and let the Spark driver look for it in its starting folder (but I didn't check):
spark-submit \
... \
--files ...,/path/to/your/file.csv \
abc.jar file.csv
=> The proper/standard way to do it is to first copy your file(s) to HDFS, or to another distributed file system the Spark cluster has access to. Then you can give the Spark app the HDFS file path to use. Something like this (again, I didn't test it):
hdfs dfs -put /path/to/your/file.csv /user/your/data
spark-submit ... abc.jar hdfs:///user/your/data/file.csv
For info, in case you don't know: to use the hdfs command, you need to have an HDFS client installed on your machine (the actual hdfs command), with a suitable configuration pointing to the HDFS cluster. There is also usually some security configuration to do on the cluster so that the client can communicate with it. But that is another issue, which depends on where HDFS is running (local, AWS, ...).
Replace ${csv_file} at the end of your spark-submit command with basename ${csv_file}:
spark2-submit \
... \
--files myprop.properties,${csv_file} \
abc.jar `basename ${csv_file}`
basename strips the directory part from the full path leaving only the file name:
$ basename /usr/usr1/foo.csv
foo.csv
That way Spark will copy the file to the staging directory and the driver program should be able to access it by its relative path. If the cluster is configured to stage on HDFS, the executors will also have access to the file.
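The basename approach can also be sketched as a small Python launcher that assembles the argument list. This is only an illustration: the function name build_submit_args is hypothetical, the jar, class, and properties file names are the ones from the question, and the resource flags are left out for brevity.

```python
import os
import shlex


def build_submit_args(csv_file, jar="abc.jar", main_class="driver_class"):
    """Ship csv_file with --files and pass only its base name to the application."""
    return [
        "spark2-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--files", f"myprop.properties,{csv_file}",
        jar,
        os.path.basename(csv_file),  # name of the shipped copy in the container's working directory
    ]


args = build_submit_args("/usr/usr1/Test.csv")
# Print the command in a copy-pasteable form; it is not executed here.
print(" ".join(shlex.quote(a) for a in args))
```

To actually launch the job, the list could be handed to subprocess.run(args, check=True) on a machine where spark2-submit is on the PATH.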

PySpark dataframe show() gives 'Unsupported class file major version 58' Error

I am trying to read in from a mongo DB using pyspark and the mongo-spark connector. Reading in the data works fine, but when I try to see the dataframe with df.show() I get the error
IllegalArgumentException: 'Unsupported class file major version 58'
This is what my code looks like right now.
spark = SparkSession.builder.appName("myApp") \
.config("spark.mongodb.input.uri", "...") \
.config("spark.mongodb.output.uri", "...") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.4.1') \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.show()
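A note on the error message: the "class file major version" is the version number stored in the header of a compiled Java .class file, and for Java 5 and later it equals the Java release number plus 44. The sketch below (the function name is hypothetical) decodes the number from the error. Version 58 corresponds to Java 14, which suggests the job is running on a newer JVM than this Spark build can handle; that is an inference from the message, not something verified against this setup.

```python
def java_release_for_major(major: int) -> int:
    """Map a class file major version to the Java release that produces it (Java 5+)."""
    return major - 44


print(java_release_for_major(58))  # 14
print(java_release_for_major(52))  # 8
```

If that reading is right, pointing JAVA_HOME at a Java 8 installation before starting PySpark would be the first thing to try.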

Attempting to write csv file in sftp mode with Spark on Yarn in a Kerberos environment

I'm trying to write a DataFrame to a csv file and put this csv file on a remote machine. The Spark job is running on YARN in a Kerberos-secured cluster.
Below is the error I get when the job tries to write the csv file to the remote machine:
diagnostics: User class threw exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=dev, access=WRITE,
inode="/data/9/yarn/local/usercache/dev/appcache/application_1532962490515_15862/container_e05_1532962490515_15862_02_000001/tmp/spark_sftp_connection_temp178/_temporary/0":hdfs:hdfs:drwxr-xr-x
To write this csv file, I'm using the following parameters in a method that writes the file over SFTP:
def writeToSFTP(df: DataFrame, path: String) = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    .option("delimiter", ",")
    .save(path)
}
I'm using the Spark SFTP Connector library as described in the link : https://github.com/springml/spark-sftp
The script which is used to launch the job is :
#!/bin/bash
kinit -kt /home/spark/dev.keytab dev@CLUSTER.HELP.FR
spark-submit --class fr.edf.dsp.launcher.LauncherInsertion \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 5g \
--executor-memory 5g \
--queue dev \
--files /home/spark/dev.keytab#user.keytab,\
/etc/krb5.conf#krb5.conf,\
/home/spark/jar/dev-application-SNAPSHOT.conf#app.conf \
--conf "spark.executor.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
/home/spark/jar/dev-SNAPSHOT.jar > /home/spark/out.log 2>&1&
The csv files are not written to HDFS. Once the DataFrame is built, I try to send it to the remote machine. I suspect a Kerberos issue with the SFTP Spark connector: YARN can't contact the remote machine...
Any help is welcome, thanks.
Add a temporary location where you have write access. You do not need to worry about cleaning it up, because these files are deleted once the SFTP transfer is done:
def writeToSFTP(df: DataFrame, path: String) = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    .option("hdfsTempLocation", "/user/currentuser/")
    .option("delimiter", ",")
    .save(path)
}