I'm using PySpark 3.2.1 to write to a Kafka broker:
data_frame \
    .selectExpr('CAST(id AS STRING) AS key', "to_json(struct(*)) AS value") \
    .write \
    .format('kafka') \
    .option('topic', topic) \
    .option('kafka.bootstrap.servers', 'localhost:9092') \
    .mode('append') \
    .save()
I'm facing a real engineering issue: how can I make this write an atomic operation and make the producer idempotent?
Any suggested solutions would be appreciated, thanks.
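One possible direction, as a minimal sketch and not a definitive answer: Spark's Kafka integration guide describes the Kafka sink as at-least-once, so a retried task can still produce duplicates and the write as a whole is not atomic. What can be requested is idempotence at the producer level, since options prefixed with `kafka.` are forwarded to the underlying Kafka producer. The helper names `kafka_sink_options` and `write_to_kafka` are hypothetical; `enable.idempotence` and `acks` are standard Kafka producer settings.

```python
def kafka_sink_options(bootstrap_servers, topic):
    """Build options for Spark's Kafka sink with producer idempotence requested."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        # Options prefixed with "kafka." are passed through to the Kafka producer.
        "kafka.enable.idempotence": "true",
        "kafka.acks": "all",
    }


def write_to_kafka(data_frame, bootstrap_servers, topic):
    """Batch-write a DataFrame to Kafka (assumes it has an `id` column, as in the question)."""
    (
        data_frame
        .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
        .write
        .format("kafka")
        .options(**kafka_sink_options(bootstrap_servers, topic))
        .mode("append")
        .save()
    )
```

Producer idempotence only covers retries inside a single producer session. If a Spark task is re-run, the same rows may be sent again, so consumers would probably still need to deduplicate on the message key.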
I am trying to use Dataproc on Google Cloud Platform for my Spark Streaming jobs.
I use Kafka as my source and write the data to MongoDB. It's working fine, but after the job fails it starts reading the messages from my Kafka topic from the beginning instead of from where it stopped.
Here is my config for reading from Kafka:
clickstreamTestDf = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("subscribe", "customer_experience")
    .option("failOnDataLoss", "false")
    .option("startingOffsets", "earliest")
    .load()
)
And here is my write stream code:
finished_df.writeStream \
    .format("mongodb") \
    .option("spark.mongodb.connection.uri", connectionString) \
    .option("spark.mongodb.database", "Company-Environment") \
    .option("spark.mongodb.collection", "customer_experience") \
    .option("checkpointLocation", "gs://firstsparktest_1/checkpointCustExp") \
    .option("forceDeleteTempCheckpointLocation", "true") \
    .outputMode("append") \
    .start()
Do I need to set startingOffsets to latest? I tried but it still didn't read from where it stopped.
Can I use checkpointLocation like this? Is it okay to use a directory in google storage?
I want to run the streaming job, stop it, delete the Dataproc cluster and then create a new one the next day and continue reading from where it left off. Is that possible and how?
Really need some help here!
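As a hedged sketch of the restart scenario described above: for streaming queries, Spark's Kafka source documents that `startingOffsets` applies only when a new query is started, and that a resumed query picks up from its checkpoint. The offsets should therefore survive a cluster deletion only if the checkpoint directory is on durable storage and the same path is reused by the next run. A `gs://` path should qualify on Dataproc, assuming the new cluster can read and write the same bucket. The function names are hypothetical, and `format("mongodb")` is an assumption based on the option keys in the question. The sketch leaves out `forceDeleteTempCheckpointLocation`, which concerns temporary checkpoint locations and not an explicitly configured one.

```python
def mongo_sink_options(connection_uri, checkpoint_dir):
    """Writer options for the MongoDB sink; the checkpoint path must stay the same across runs."""
    return {
        "spark.mongodb.connection.uri": connection_uri,
        "spark.mongodb.database": "Company-Environment",
        "spark.mongodb.collection": "customer_experience",
        # Durable location that outlives the cluster, e.g. a GCS bucket.
        "checkpointLocation": checkpoint_dir,
    }


def start_mongo_query(finished_df, connection_uri, checkpoint_dir):
    """Start (or resume) the streaming write; offsets are recovered from checkpoint_dir if it exists."""
    return (
        finished_df.writeStream
        .format("mongodb")  # assumption: MongoDB Spark connector 10.x
        .options(**mongo_sink_options(connection_uri, checkpoint_dir))
        .outputMode("append")
        .start()
    )
```

With this layout, `startingOffsets` in the reader matters only for the very first run against an empty checkpoint directory, so switching it to `latest` would not be expected to change where a resumed query starts.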
I am doing structured streaming with SparkSession.readStream and writing the result to a Hive table, but it does not seem to let me use time-based micro-batches. I need batches of 5 seconds: all the messages that arrive within 5 seconds should form one batch, and that batch should be written to the Hive table.
Right now it reads the messages as and when they are posted to the Kafka topic, and each message becomes one record in the table.
Working Code
def hive_write_batch_data(data, batchId):
kafka_config = {
    "startingOffsets": offsetValue,
    "kafka.ssl.keystore.location": "kafka.keystore.uat.jks",
    "kafka.ssl.keystore.password": "abcd123",
    "kafka.ssl.truststore.location": "kafka.truststore.uat.jks",
}
df = spark.readStream \
    .format("kafka") \
    .options(**kafka_config) \
    .load()
data = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)","offset","timestamp","partition")
data_new = data.select(col("offset"), col("partition"), col("key"), json_tuple(col("value"), "product_code", "rec_time"))
data_new.writeStream \
    .foreachBatch(hive_write_batch_data) \
    .start()
Problem Statement
Since each message is treated as one record in the Hive table, a separate Parquet file is created for each record, which can trigger Hive's small-file issue.
I need to create time-based batches so that multiple records get inserted into the Hive table in one batch. The only thing I found with time-based support is KafkaUtils, using ssc = StreamingContext(sc, 5), but it does not create DataFrames.
How can I use KafkaUtils to read batches into DataFrames?
Adding a trigger worked. I added a trigger to the stream writer:
# The trigger starts a micro-batch every 5 seconds
data_new.writeStream \
    .trigger(processingTime="5 seconds") \
    .foreachBatch(hive_write_batch_data) \
    .start()
Found the article here
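For completeness, here is a minimal sketch of how the trigger and the `foreachBatch` handler could fit together. The body of `hive_write_batch_data` is not shown in the question, so the version below is an assumption: it appends each micro-batch to a Hive table whose name, `my_db.kafka_events`, is hypothetical. The `coalesce(1)` call is also an assumption, added to keep the number of files per batch low.

```python
HIVE_TABLE = "my_db.kafka_events"  # hypothetical table name


def hive_write_batch_data(data, batchId):
    """Append one micro-batch (a regular DataFrame) to the Hive table."""
    # coalesce(1) writes a single file per batch; drop it for large batches.
    data.coalesce(1).write.mode("append").saveAsTable(HIVE_TABLE)


def start_batched_query(data_new, interval="5 seconds"):
    """Start the stream so that one micro-batch is processed per interval."""
    return (
        data_new.writeStream
        .trigger(processingTime=interval)
        .foreachBatch(hive_write_batch_data)
        .start()
    )
```

With a processing-time trigger, everything that arrived since the previous batch is handed to the handler as one DataFrame, so the table should receive one write per interval instead of one per message.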
I am running PySpark on AWS Glue and trying to connect to an Aurora MySQL database with a third-party JDBC driver (MySQL Connector/J, not the AWS one). The problem is that I do not know how to pass the certificate (.pem) so that I can connect to the database successfully.
spark = SparkSession.builder.enableHiveSupport() \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28") \
    .appName($JOB) \
    .getOrCreate()
url = "jdbc:mysql://host_name:3306/db_name"
crtf = "tls_ca_cert"  # location of the certificate
df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", psw) \
    .option("ssl", True) \
    .option("sslmode", "require") \
    .option("SSLServerCertificate", crtf) \
    .load()
This is how I am trying to read the db. Obviously I am missing something. Any help would be appreciated!
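One approach that might work, sketched under assumptions: as far as I know, Connector/J 8.0 reads trusted CA certificates from a Java truststore and not from a raw .pem file, and it is configured through connection properties such as `sslMode`, `trustCertificateKeyStoreUrl` and `trustCertificateKeyStorePassword`. The helper name `build_mysql_ssl_url`, the truststore path, and the password below are hypothetical. The truststore would have to be created from the .pem beforehand and be readable by the driver and executors.

```python
from urllib.parse import urlencode

# Assumed preparation step (run once, outside Spark), importing the CA cert into a truststore:
#   keytool -importcert -alias aurora-ca -file tls_ca_cert.pem \
#           -keystore truststore.jks -storepass changeit


def build_mysql_ssl_url(host, port, database, truststore_path, truststore_password):
    """Build a Connector/J JDBC URL that verifies the server certificate against a truststore."""
    params = {
        "sslMode": "VERIFY_CA",
        "trustCertificateKeyStoreUrl": "file:" + truststore_path,
        "trustCertificateKeyStorePassword": truststore_password,
    }
    # Keep ':' and '/' readable in the file URL; other special characters are escaped.
    return "jdbc:mysql://{}:{}/{}?{}".format(host, port, database, urlencode(params, safe=":/"))


url = build_mysql_ssl_url("host_name", 3306, "db_name", "/tmp/truststore.jks", "changeit")
```

The resulting `url` would then replace the plain one in `.option("url", url)`, and the `ssl`, `sslmode` and `SSLServerCertificate` options would no longer be needed if the driver picks everything up from the URL.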
I am trying to build a refined log from each row inside foreach and store it in a Kafka topic, as follows:
import json
import pandas as pd
from pyspark.sql import SparkSession

def refine(df):
    log = df.value
    event_logs = json.dumps(get_event_logs(log))  # get_event_logs is a function that refines the row/log
    pdf = pd.DataFrame({"value": event_logs}, index=[0])
    spark = SparkSession.builder.appName("myAPP").getOrCreate()
    df = spark.createDataFrame(pdf)
    query = df.selectExpr("CAST(value AS STRING)") \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "intest") \
        .save()
And I am calling it using the following code.
query = streaming_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .foreach(refine) \
    .start()
But the refine function is somehow unable to find the Kafka package I passed when submitting the code. I believe the executor has no access to the Kafka package sent via the following command:
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 ...
because when I submit my code, I get the following error message:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
So, my question is: how can I sink data into Kafka inside foreach? As a side question, I want to know whether it is a good idea to create another session inside foreach. I had to redeclare the session inside foreach because the existing session of the main driver couldn't be used there, due to serializability issues.
P.S: If I try to sink it to console (...format("console")) inside foreach, then it works without any error.
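A possible alternative, offered as a sketch and not as a confirmed fix: `foreachBatch` hands each micro-batch to a function as a regular DataFrame on the driver, so the refinement can run as a UDF and the result can be written with the ordinary batch Kafka writer, without creating a second SparkSession. `get_event_logs` below is a hypothetical stand-in for the question's own function, and `refine_value` and `publish_refined_batch` are names introduced only for illustration.

```python
import json


def get_event_logs(log):
    """Hypothetical stand-in for the question's own refining function."""
    return {"raw": log}


def refine_value(log):
    """Refine one log line and serialize it as a JSON string."""
    return json.dumps(get_event_logs(log))


def publish_refined_batch(batch_df, batch_id):
    """foreachBatch handler: refine the `value` column and write the batch to Kafka."""
    # Imported here so the pure helpers above can be used without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    refine_udf = udf(refine_value, StringType())
    (
        batch_df
        .select(refine_udf(col("value").cast("string")).alias("value"))
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "intest")
        .save()
    )
```

It would be wired in with `streaming_df.writeStream.foreachBatch(publish_refined_batch).start()`. This does not by itself resolve a missing Kafka package on the classpath, but it keeps the Kafka write on the session that was started with `--packages`.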
I need to connect from my Azure Databricks notebook (cloud subscription) to a PostgreSQL database hosted in an on-prem subscription and load a PostgreSQL table into a Spark DataFrame. Please let me know if anybody has worked on this. I know I can run the PySpark code below to read the data from a table, but I need help with how to establish a connection from my Azure Databricks notebook to the on-prem PostgreSQL database.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()