checkpointLocation requires correct hdfs path - pyspark

I am new to Spark and trying to work with streams.
I wrote the following code:
streaming_query = input_df.groupBy("sensor_id").count().writeStream \
    .format("console") \
    .outputMode("update") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .option("checkpointLocation", os.path.join(basepath, "Spark-Streaming-Demo")) \
    .trigger(processingTime="1 second") \
    .start()
streaming_query.awaitTermination()
However, when running this piece of code I get the error message:
java.lang.IllegalStateException: Error reading delta file file:/path/on/local-system/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0,part=0),dir = file:/path/on/local-system/state/0/0]: file:/path/on/local-system/state/0/0/1.delta does not exist
I think checkpointLocation should be an HDFS path, not a local absolute path.
How do I find out the HDFS URI for os.path.join(basepath, "Spark-Streaming-Demo")?
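One way to take the guesswork out of this (a sketch, not a verified fix for this exact error) is to pass checkpointLocation as an explicit URI, so Spark does not silently resolve the path against the default filesystem. The target directory below is a hypothetical placeholder, and the fs.defaultFS lookup goes through a private py4j handle (_jsc), which is widely used but not a public API:

# read the configured default filesystem (e.g. "hdfs://namenode:8020")
default_fs = spark.sparkContext._jsc.hadoopConfiguration().get("fs.defaultFS")

# hypothetical target directory; prefix it with the filesystem URI
checkpoint_uri = default_fs + "/user/me/Spark-Streaming-Demo"
# or pin it to the local filesystem explicitly when running locally:
# checkpoint_uri = "file:///tmp/Spark-Streaming-Demo"

streaming_query = input_df.groupBy("sensor_id").count().writeStream \
    .format("console") \
    .outputMode("update") \
    .option("checkpointLocation", checkpoint_uri) \
    .trigger(processingTime="1 second") \
    .start()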

Related

Unable to run PySpark (Kafka to Delta) in local and getting SparkException: Cannot find catalog plugin class for catalog 'spark_catalog'

I'm trying to write PySpark code that reads from a Kafka topic and publishes to a Delta table, but it is not working.
Sample Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *

spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "demo.topic") \
    .option("startingOffsets", "latest") \
    .load() \
    .withColumn("current_timestamp", unix_timestamp()) \
    .withColumn("value_str", col("value").cast(StringType())) \
    .select("current_timestamp", "value_str")

stream = kafka_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
    .toTable("events")

stream.awaitTermination()
Spark version: 3.3.1
Command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 kafka_to_delta.py
Console:
Traceback (most recent call last):
File "/Users/user/Desktop/python-module/kafka_to_delta.py", line 24, in <module>
stream = kafka_df.writeStream \
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 1468, in toTable
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.toTable.
: org.apache.spark.SparkException: Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.spark.sql.delta.catalog.DeltaCatalog
at org.apache.spark.sql.errors.QueryExecutionErrors$.catalogPluginClassNotFoundForCatalogError(QueryExecutionErrors.scala:1638)
at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:65)
at org.apache.spark.sql.connector.catalog.CatalogManager.loadV2SessionCatalog(CatalogManager.scala:67)
at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$2(CatalogManager.scala:86)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
at org.apache.spark.sql.connector.catalog.CatalogManager.$anonfun$v2SessionCatalog$1(CatalogManager.scala:86)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.connector.catalog.CatalogManager.v2SessionCatalog(CatalogManager.scala:85)
at org.apache.spark.sql.connector.catalog.CatalogManager.catalog(CatalogManager.scala:51)
at org.apache.spark.sql.connector.catalog.CatalogManager.currentCatalog(CatalogManager.scala:122)
at org.apache.spark.sql.connector.catalog.LookupCatalog.currentCatalog(LookupCatalog.scala:34)
at org.apache.spark.sql.connector.catalog.LookupCatalog.currentCatalog$(LookupCatalog.scala:34)
at org.apache.spark.sql.catalyst.analysis.Analyzer.currentCatalog(Analyzer.scala:187)
at org.apache.spark.sql.connector.catalog.LookupCatalog$CatalogAndIdentifier$.unapply(LookupCatalog.scala:111)
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.delta.catalog.DeltaCatalog
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
at org.apache.spark.sql.connector.catalog.Catalogs$.load(Catalogs.scala:55)
... 25 more
23/01/29 18:00:53 INFO SparkContext: Invoking stop() from shutdown hook
Do I need to specify a catalog and schema before running this code? And what is the best practice for doing this?
Try adding io.delta:delta-core_2.12:2.1.0 to the list of packages passed via --packages (you also need to make sure that you use a matching version of spark-sql-kafka):
spark-submit --packages \
org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 \
kafka_to_delta.py
or remove --packages and specify the Kafka dependency inside the script, although it's generally better not to hard-code specific versions in the code and instead provide all such options outside of it.
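For the in-script route, both coordinates can go into the spark.jars.packages config before the session is created; a minimal sketch, with versions copied from the answer above rather than verified against any particular cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,"
            "io.delta:delta-core_2.12:2.1.0") \
    .getOrCreate()

Note that spark.jars.packages only takes effect when the session's JVM is created, so this works with a plain spark-submit kafka_to_delta.py but cannot be changed on an already-running session.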

Spark read stream from kafka using delta tables

I'm trying to read a stream from a Kafka topic using Spark Streaming/Python. I can read the messages and dump them to a bronze table with the default Kafka message schema, but I cannot cast the key and value from binary to string. I've tried the following approaches; none of them worked:
approach 1:
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", SSL_TRUST_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.location", SSL_KEY_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.password", SSL_KEY_STORE_PASSWORD)
    .option("kafka.ssl.truststore.password", SSL_TRUST_STORE_PASSWORD)
    .option("kafka.ssl.key.password", SSL_KEY_PASSWORD)
    .option("kafka.ssl.keystore.type", "JKS")
    .option("kafka.ssl.truststore.type", "JKS")
    .option("failOnDataLoss", "false")
    .load()).selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

@dlt.table(
    comment="the raw message from the kafka topic",
    table_properties={"pipelines.reset.allowed": "false"}
)
def kafka_bronze():
    return raw_kafka_events
error:
Failed to merge fields 'key' and 'key'. Failed to merge incompatible data types BinaryType and StringType
approach 2:
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", SSL_TRUST_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.location", SSL_KEY_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.password", SSL_KEY_STORE_PASSWORD)
    .option("kafka.ssl.truststore.password", SSL_TRUST_STORE_PASSWORD)
    .option("kafka.ssl.key.password", SSL_KEY_PASSWORD)
    .option("kafka.ssl.keystore.type", "JKS")
    .option("kafka.ssl.truststore.type", "JKS")
    .option("failOnDataLoss", "false")
    .load())

raw_kafka_events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

@dlt.table(
    comment="the raw message from the kafka topic",
    table_properties={"pipelines.reset.allowed": "false"}
)
def kafka_bronze():
    return raw_kafka_events
No error message, but when I later checked the kafka_bronze table, the key and value columns were still in binary format.
approach 3: added a kafka_silver table:
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", SSL_TRUST_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.location", SSL_KEY_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.password", SSL_KEY_STORE_PASSWORD)
    .option("kafka.ssl.truststore.password", SSL_TRUST_STORE_PASSWORD)
    .option("kafka.ssl.key.password", SSL_KEY_PASSWORD)
    .option("kafka.ssl.keystore.type", "JKS")
    .option("kafka.ssl.truststore.type", "JKS")
    .option("failOnDataLoss", "false")
    .load())

@dlt.table(
    comment="the raw message from the kafka topic",
    table_properties={"pipelines.reset.allowed": "false"}
)
def kafka_bronze():
    return raw_kafka_events

@dlt.table(comment="real schema for kafka payload",
           temporary=False)
def kafka_silver():
    return (
        # kafka streams are (timestamp, value)
        # value contains the kafka payload
        dlt.read_stream("kafka_bronze")
            .select(col("key").cast("string"))
            .select(col("value").cast("string"))
    )
error:
Column 'value' does not exist.
How can I cast the key/value to string after reading them from the Kafka topic? I'd prefer to dump the string-valued key/value to the bronze table, but if that's impossible, I can dump them to the silver table instead.
First, it's recommended to define that raw_kafka_events variable inside the function, so it will be local to that function.
In the second approach, your problem is that you just do raw_kafka_events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") without assigning it back to the variable, like this: raw_kafka_events = raw_kafka_events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").
The other problem is that when you use expressions like CAST(key AS STRING), the fields get a new name matching that expression. Change them to CAST(key AS STRING) AS key and CAST(value AS STRING) AS value - this should fix the first approach's merge error.
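Combined, the corrected assignment for the second approach would look like this sketch:

raw_kafka_events = raw_kafka_events.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value")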
In the third approach, you have two select statements chained:
def kafka_silver():
    return (
        # kafka streams are (timestamp, value)
        # value contains the kafka payload
        dlt.read_stream("kafka_bronze")
            .select(col("key").cast("string"))
            .select(col("value").cast("string"))
    )
but after the first select you get a dataframe with only one column, key. You need to change the code to:
dlt.read_stream("kafka_bronze") \
    .select(col("key").cast("string").alias("key"),
            col("value").cast("string").alias("value"))

PySpark how to correctly start streaming query

I am working on a structured streaming program. I have the code below:
def dataload(sparkSession):
    df = sparkSession.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "bar") \
        .option("startingOffsets", "earliest") \
        .load()
    return df

if __name__ == '__main__':
    sparkSession = createSparkSession()
    df = dataload(sparkSession)
    df.select("value") \
        .foreach(lambda bytebuffer: processFrame(bytebuffer=bytebuffer.value)) \
        .groupBy(window("timestamp", "10 minutes", "10 minutes"), "key") \
        .count() \
        .writeStream \
        .queryName("qraw") \
        .outputMode("append") \
        .format("console") \
        .start().awaitTermination()
With the above code I keep getting the error:
pyspark.sql.utils.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
kafka
I do not know where the error is.
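The likely trigger is that df.select("value").foreach(...) is an eager DataFrame action, so Spark tries to execute the streaming source outside of writeStream. A hedged sketch of one way to restructure it, splitting the per-record handling and the windowed count into two streaming queries (processFrame, dataload, and createSparkSession are the question's own functions; writeStream.foreach with a plain function requires Spark 2.4+, and outputMode is switched to update because the aggregation has no watermark):

from pyspark.sql.functions import window

if __name__ == '__main__':
    sparkSession = createSparkSession()
    df = dataload(sparkSession)

    # per-record handling belongs in a writeStream sink,
    # not in a DataFrame action
    frames_query = df.select("value").writeStream \
        .foreach(lambda row: processFrame(bytebuffer=row.value)) \
        .start()

    # the windowed count goes out through its own sink; "update" mode,
    # because without a watermark "append" would also be rejected
    counts_query = df.groupBy(window("timestamp", "10 minutes", "10 minutes"), "key") \
        .count() \
        .writeStream \
        .queryName("qraw") \
        .outputMode("update") \
        .format("console") \
        .start()

    frames_query.awaitTermination()
    counts_query.awaitTermination()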

pyspark kafka streaming data handler

I'm using Spark 2.3.2 with PySpark and just figured out that foreach and foreachBatch are not available on the DataStreamWriter object in this configuration. The problem is that the company's Hadoop is 2.6, and Spark 2.4 (which provides what I need) doesn't work (the SparkSession crashes). Is there another alternative to send data to a custom handler and process streaming data?
This is my code so far:
def streamLoad(self, customHandler):
    options = self.options
    self.logger.info("Retrieving the schema based on the JSON structure")
    jsonStrings = ['{"sku":"9","ean":"4","name":"DVD","description":"foo description","categories":[{"code":"M02_BLURAY_E_DVD_PLAYER"}],"attributes":[{"name":"attrTeste","value":"Teste"}]}']
    myRDD = self.spark.sparkContext.parallelize(jsonStrings)
    jsonSchema = self.spark.read.json(myRDD).schema  # Maybe there is a way to serialize this

    self.logger.info("Starting the Kafka stream [options: {}]".format(str(options)))
    df = self.spark \
        .readStream \
        .format("kafka") \
        .option("maxFilesPerTrigger", 1) \
        .option("kafka.bootstrap.servers", options["kafka.bootstrap.servers"]) \
        .option("startingOffsets", options["startingOffsets"]) \
        .option("subscribe", options["subscribe"]) \
        .option("failOnDataLoss", options["failOnDataLoss"]) \
        .load() \
        .select(
            col('value').cast("string").alias('json'),
            col('key').cast("string").alias('kafka_key'),
            col("timestamp").cast("string").alias('kafka_timestamp')
        ) \
        .withColumn('pjson', from_json(col('json'), jsonSchema)).drop('json')

    query = df \
        .writeStream \
        .foreach(customHandler) \
        .start()  # .foreach doesn't work in Spark 2.3.x; alternatives, please?
    query.awaitTermination()
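One alternative that predates foreach support in PySpark (added in 2.4) is the legacy DStream API, where foreachRDD hands each micro-batch to a custom handler. A hedged sketch, assuming the spark-streaming-kafka-0-8 package that still shipped for Spark 2.3 and reusing the question's options dict and customHandler:

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def streamLoadDStream(self, customHandler):
    options = self.options
    # 5-second micro-batches on the existing SparkContext
    ssc = StreamingContext(self.spark.sparkContext, 5)

    stream = KafkaUtils.createDirectStream(
        ssc,
        [options["subscribe"]],
        {"metadata.broker.list": options["kafka.bootstrap.servers"]})

    # each record arrives as a (key, value) tuple of raw strings,
    # not a Row, so customHandler needs to accept that shape
    stream.foreachRDD(lambda rdd: rdd.foreach(customHandler))

    ssc.start()
    ssc.awaitTermination()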

kafka to pyspark structured streaming, parsing json as dataframe

I am experimenting with Spark structured streaming (Spark v2.2.0) to consume JSON data from Kafka. However, I encountered the following error:
pyspark.sql.utils.StreamingQueryException: 'Missing required configuration "partition.assignment.strategy" which has no default value.
Does anyone know why? The job was submitted using the spark-submit command below.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 sparksstream.py
This is the entire Python script.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("test") \
    .getOrCreate()

# Define schema of json
schema = StructType() \
    .add("Session-Id", StringType()) \
    .add("TransactionTimestamp", IntegerType()) \
    .add("User-Name", StringType()) \
    .add("ID", StringType()) \
    .add("Timestamp", IntegerType())

# load data into spark-structured streaming
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxxx:9092") \
    .option("subscribe", "topicName") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))

# Print output
query = df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
Use this instead to submit:
spark-submit \
    --conf "spark.driver.extraClassPath=$SPARK_HOME/jars/kafka-clients-1.1.0.jar" \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
    sparksstream.py
This assumes that you have downloaded the kafka-clients*.jar into your $SPARK_HOME/jars folder.