Databricks: Azure Queue Storage structured streaming key not found error - pyspark

I am trying to write an ETL pipeline for Azure Queue Storage (AQS) streaming data. Here is my code:
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

CONN_STR = dbutils.secrets.get(scope="kvscope", key="AZURE-STORAGE-CONN-STR")

schema = StructType([
    StructField("id", IntegerType()),
    StructField("parkingId", IntegerType()),
    StructField("capacity", IntegerType()),
    StructField("freePlaces", IntegerType()),
    StructField("insertTime", TimestampType())
])

stream = spark.readStream \
    .format("abs-aqs") \
    .option("fileFormat", "json") \
    .option("queueName", "freeparkingplaces") \
    .option("connectionString", CONN_STR) \
    .schema(schema) \
    .load()

display(stream)
When I run this, I get java.util.NoSuchElementException: key not found: eventType.
Here is what my queue looks like:
Can you spot and explain what the problem is?

The abs-aqs connector isn’t for consuming data from AQS; it’s for discovering new files in Blob Storage using events reported to AQS. That’s why you specify the file format option and the schema - these parameters are applied to the files, not to the messages in AQS.
As far as I know (I could be wrong), there is no Spark connector for AQS, and it’s usually recommended to use Event Hubs or Kafka as the messaging solution.
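For example, if the same parking events were routed to an Event Hub instead, a structured streaming read could look roughly like this (a hedged sketch: it assumes the com.microsoft.azure:azure-eventhubs-spark connector is installed on the cluster and that a secret named EVENTHUB-CONN-STR holds the Event Hubs connection string; neither comes from the original question):
from pyspark.sql.functions import col, from_json

# Assumed secret name; the connector expects the connection string encrypted with its helper.
EH_CONN_STR = dbutils.secrets.get(scope="kvscope", key="EVENTHUB-CONN-STR")
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(EH_CONN_STR)
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The payload arrives in the binary `body` column; parse it with the schema from the question.
parsed = raw.select(from_json(col("body").cast("string"), schema).alias("data")).select("data.*")
display(parsed)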

Related

pyspark: how to perform structured streaming using KafkaUtils

I am doing structured streaming using SparkSession.readStream and writing it to a Hive table, but it does not seem to allow me time-based micro-batches, i.e. I need a batch of 5 seconds. All the messages should form a batch of 5 seconds, and the batch data should get written to the Hive table.
Right now it reads the messages as and when they are posted to the Kafka topic, and each message becomes one record in the table.
Working Code
from pyspark.sql.functions import col, json_tuple

def hive_write_batch_data(data, batchId):
    data.write.format("parquet").mode("append").saveAsTable("test.my_table")

kafka_config = {
    "checkpointLocation": "/user/aiman/temp/checkpoint",
    "kafka.bootstrap.servers": "kafka.bootstrap.server.com:9093",
    "subscribe": "TEST_TOPIC",
    "startingOffsets": offsetValue,  # offsetValue is defined elsewhere in the job
    "kafka.security.protocol": "SSL",
    "kafka.ssl.keystore.location": "kafka.keystore.uat.jks",
    "kafka.ssl.keystore.password": "abcd123",
    "kafka.ssl.key.password": "abcd123",
    "kafka.ssl.truststore.type": "JKS",
    "kafka.ssl.truststore.location": "kafka.truststore.uat.jks",
    "kafka.ssl.truststore.password": "abdc123",
    "kafka.ssl.enabled.protocols": "TLSv1",
    "kafka.ssl.endpoint.identification.algorithm": ""
}

df = spark.readStream \
    .format("kafka") \
    .options(**kafka_config) \
    .load()

data = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "offset", "timestamp", "partition")

data_new = data.select(col("offset"), col("partition"), col("key"), json_tuple(col("value"), "product_code", "rec_time")) \
    .toDF("offset", "partition", "key", "product_code", "rec_time")

data_new.writeStream \
    .foreachBatch(hive_write_batch_data) \
    .start() \
    .awaitTermination()
Problem Statement
Since each message is treated as one record in the Hive table, a separate Parquet file is created for each record, which can trigger Hive's small-file issue.
I need to create a time-based batch so that multiple records get inserted into the Hive table in one batch. For that I only found KafkaUtils to have time-based support, using ssc = StreamingContext(sc, 5), but it does not create DataFrames.
How should I use KafkaUtils to create batches that are read into DataFrames?
Adding a trigger worked. I added a trigger to the stream writer:
data_new.writeStream \
    .trigger(processingTime="5 seconds") \
    .foreachBatch(hive_write_batch_data) \
    .start() \
    .awaitTermination()
Found the article here
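One extra knob worth mentioning (not part of the original answer): even with a 5-second trigger, each micro-batch may still contain several partitions, so the batch function from the question can coalesce before writing to keep the number of Parquet files per trigger small:
def hive_write_batch_data(data, batchId):
    # Coalesce each micro-batch so a single 5-second trigger produces one file
    # instead of one file per partition (tune the value for your data volume).
    data.coalesce(1) \
        .write.format("parquet") \
        .mode("append") \
        .saveAsTable("test.my_table")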

how to connect to mongodb Atlas from databricks cluster using pyspark

This is my simple code in a notebook:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>#mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
    .getOrCreate()

df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, in Advanced Options under the Spark tab, add the following two config parameters:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note that the connection string should look like this (with your appropriate user, password, and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
Use the following Python code in your notebook and it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/database?retryWrites=true&w=majority") \
    .option("database", "my_database") \
    .option("collection", "my_collection") \
    .load()
You can use the following to write to MongoDB:
events_df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
    .save()
Hope this works for you. Please let others know if it worked.

How to use foreach in PySpark to write to a kafka topic?

I am trying to refine the log created for each row through foreach and want to store it in a Kafka topic as follows:
import json
import pandas as pd
from pyspark.sql import SparkSession

def refine(df):
    log = df.value
    event_logs = json.dumps(get_event_logs(log))  # a function to refine the row/log
    pdf = pd.DataFrame({"value": event_logs}, index=[0])
    spark = SparkSession.builder.appName("myAPP").getOrCreate()
    df = spark.createDataFrame(pdf)
    query = df.selectExpr("CAST(value AS STRING)") \
        .write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "intest") \
        .save()
And I am calling it using the following code.
query = streaming_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .foreach(refine) \
    .start()
query.awaitTermination()
But the refine function is somehow unable to find the Kafka package I sent while submitting the code. I believe the executor has no access to the Kafka package sent via the following command:
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 ...
because when I submit my code, I get the following error message:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
So, my question is: how do I sink data into Kafka inside foreach? And as a side question, I want to know whether it's a good idea to create another session inside foreach; I had to redeclare the session inside foreach because the existing session of the main driver couldn't be used inside foreach due to serializability issues.
P.S.: If I try to sink it to the console (...format("console")) inside foreach, it works without any error.
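A hedged sketch of an alternative worth trying (it is not from the original post): instead of opening a new SparkSession inside foreach, apply the refinement as a UDF on the driver-defined plan and write the result directly to the Kafka sink, so no session or writer has to be created on the executors. get_event_logs, the broker, and the topic are the ones from the question; the checkpoint path is made up.
import json
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def refine_udf(value):
    # get_event_logs is the refinement function from the question
    return json.dumps(get_event_logs(value))

query = streaming_df \
    .select(refine_udf(col("value").cast("string")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "intest") \
    .option("checkpointLocation", "/tmp/kafka_sink_checkpoint") \
    .start()
query.awaitTermination()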

Filtering millions of files with pySpark and Cloud Storage

I am facing the following task: I have individual files (megabyte-sized) stored in a Google Cloud Storage bucket, grouped in directories by date (each directory contains around 5k files). I need to look at each file (XML), filter the proper ones, and put them into Mongo or write them back to Google Cloud Storage in, say, Parquet format. I wrote a simple PySpark program that looks like this:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = (
    SparkSession
    .builder
    .appName('myApp')
    .config("spark.mongodb.output.uri", "mongodb://<mongo_connection>")
    .config("spark.mongodb.output.database", "test")
    .config("spark.mongodb.output.collection", "test")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)
spark_context = spark.sparkContext
spark_context.setLogLevel("INFO")
sql_context = pyspark.SQLContext(spark_context)

# configure Hadoop
hadoop_conf = spark_context._jsc.hadoopConfiguration()
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

# DataFrame schema
schema = StructType([
    StructField('filename', StringType(), True),
    StructField("date", DateType(), True),
    StructField("xml", StringType(), True)
])

# -------------------------
# Main operation
# -------------------------
# get all files
files = spark_context.wholeTextFiles('gs://bucket/*/*.gz')
rows = files \
    .map(lambda x: custom_checking_map(x)) \
    .filter(lambda x: x is not None)

# transform to DataFrame
df = sql_context.createDataFrame(rows, schema)

# write to mongo
df.write.format("mongo").mode("append").save()

# write back to Cloud Storage
df.write.parquet('gs://bucket/test.parquet')

spark_context.stop()
I tested it on a subset (a single directory, gs://bucket/20191010/*.gz) and it works. I deployed it on a Google Dataproc cluster, but I doubt anything is happening, since the logs stop after 19/11/06 15:41:40 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1573054807908_0001
I am running a 3-worker cluster with 4 cores, 15 GB RAM, and 500 GB HDD each. Spark version 2.3.3, Scala 2.11, mongo-connector-spark_2.11-2.3.3.
I am new to Spark, so any suggestions are appreciated. Normally I would do this work using Python multiprocessing, but I wanted to move to something "better", and now I am not sure.
It could take a significant amount of time to list a very large number of files in GCS - most probably your job "hangs" while the Spark driver lists all the files before processing starts.
You will achieve much better performance by listing all directories first and then processing the files in each directory - for the best performance you can process directories in parallel, but given that each directory has 5k files and your cluster has only 3 workers, it could be good enough to process the directories sequentially.
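A rough sketch of that suggestion, reusing the names from the question (spark_context, hadoop_conf, sql_context, custom_checking_map, schema); the bucket layout and output path are illustrative assumptions:
# List the date directories once via the Hadoop FileSystem API, then expand the
# much smaller per-directory glob instead of gs://bucket/*/*.gz all at once.
jvm = spark_context._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI("gs://bucket"), hadoop_conf)
statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path("gs://bucket/"))
date_dirs = [s.getPath().toString() for s in statuses if s.isDirectory()]

for directory in date_dirs:
    files = spark_context.wholeTextFiles(directory + "/*.gz")
    rows = files.map(custom_checking_map).filter(lambda x: x is not None)
    df = sql_context.createDataFrame(rows, schema)
    df.write.format("mongo").mode("append").save()
    df.write.mode("append").parquet("gs://bucket/output/")  # assumed output location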

Loading data from Amazon redshift to HDFS

I'm trying to load data from Amazon Redshift to HDFS.
val df = spark.read.format("com.databricks.spark.redshift")
  .option("forward_spark_s3_credentials", "true")
  .option("url", "jdbc:redshift://xxx1")
  .option("user", "xxx2")
  .option("password", "xxx3")
  .option("query", "xxx4")
  .option("driver", "com.amazon.redshift.jdbc.Driver")
  .option("tempdir", "s3n://xxx5")
  .load()
This is the Scala code I'm using. When I do df.count() and df.printSchema(), it gives me the right schema and count. But when I do df.show() or try to write it to HDFS, it says:
S3ServiceException:The AWS Access Key Id you provided does not exist in our records.,Status 403,Error InvalidAccessKeyId
You need to export the environment variables below to write to S3:
export AWS_SECRET_ACCESS_KEY=XXX
export AWS_ACCESS_KEY_ID=XXX
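If you drive the same read from PySpark, an alternative that often works (a hedged sketch, not part of the original answer) is to put the credentials on the Hadoop configuration used for the s3n:// tempdir:
# Equivalent of the exported variables, set on the Hadoop configuration (PySpark).
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "XXX")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "XXX")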