PySpark structured streaming read Kafka to delta table

Exploring PySpark Structured Streaming and Databricks. I want to write a Spark Structured Streaming job that reads all the data from a Kafka topic and publishes it to Delta tables.
Let's assume I'm using the latest version and Kafka has the following details.
kafka topic name: ABC
kafka broker: localhost:9092
sample data: name=qwerty&company_name=stackoverflow&profession=learner
I want to store the kafka topic data in the delta table with the following fields:
timestamp, company_name, data
2022-11-14 07:50:00+0000, StackOverflow, name=qwerty&company_name=stackoverflow&profession=learner
Is there a way that I can see delta table data in console?

You can read and display your data using Spark. Something like:
people_df = spark.read.format("delta").load(table_path)  # table_path points at the Delta table location
display(people_df)  # display() works in Databricks notebooks
# or
people_df.show(5)
Then you can submit this like any other Spark job. Refer to the docs for more details.
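For the Kafka-to-Delta part of the question, here is a minimal sketch of what the streaming job could look like. It assumes the broker and topic from the question, uses placeholder paths for the table and checkpoint, and pulls company_name out of the payload with a regex (keeping the raw, lower-case value):
from pyspark.sql import functions as F

# Read the raw stream from the Kafka topic in the question.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "ABC")
       .option("startingOffsets", "earliest")
       .load())

# Kafka delivers the payload as binary; keep the whole string as "data"
# and extract company_name from the query-string-style payload.
parsed = (raw
          .selectExpr("timestamp", "CAST(value AS STRING) AS data")
          .withColumn("company_name",
                      F.regexp_extract("data", "company_name=([^&]*)", 1))
          .select("timestamp", "company_name", "data"))

# Write the stream to a Delta table (both paths are placeholders).
query = (parsed.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/abc")
         .start("/tmp/delta/abc"))
The read/display snippet above can then be pointed at the same table path to inspect the data in the console.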

Related

How to store data on a topic, passed as a message through a producer, such that it can be loaded into a database's respective columns

So far I have followed this blog to produce a message onto a topic using Kafka with Spring.
The message I'm going to pass is just the name. Now I would like to stream from this topic, add an incremental value as an ID along with the name, and store the result on OutputTopic. From there, I would like to store the data in Cassandra.
My table structure in cassandra as follows:-
CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text
);
In which format should the data be on the output topic so that I can easily store it in the Cassandra table?
How can I achieve the above functionality?
Once you've got the data published on a Kafka topic, you can just use the DataStax Kafka connector for Apache Cassandra, DataStax Enterprise and Astra DB.
The connector lets you save records from a Kafka topic to Cassandra tables. It is open-source so it's free to use.
There's a working example here that includes a Cassandra table schema, details of how to configure the Kafka sink connector and map a topic to the CQL table. Cheers!
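As a rough sketch only - the property names below follow the DataStax Apache Kafka Connector docs as best I recall and should be verified against the linked example, and the topic, keyspace and connection details are placeholders - registering the sink connector through the Kafka Connect REST API could look like this:
import json
import requests  # assumes the Kafka Connect REST API is reachable on localhost:8083

connector = {
    "name": "emp-cassandra-sink",
    "config": {
        # Connector class per the DataStax Apache Kafka Connector (verify for your version).
        "connector.class": "com.datastax.oss.kafka.sink.CassandraSinkConnector",
        "topics": "OutputTopic",
        "contactPoints": "localhost",
        "loadBalancing.localDc": "datacenter1",
        # Map fields of the record value onto the columns of the emp table.
        "topic.OutputTopic.my_keyspace.emp.mapping":
            "emp_id=value.emp_id, emp_name=value.emp_name",
    },
}

resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()
That mapping is essentially the answer to the format question above: the record value on OutputTopic should be structured (for example JSON or Avro) with emp_id and emp_name fields that the mapping can reference.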

Avro data not appearing in ksql query

I am trying to set up a topic with an Avro schema on Confluent Platform (with Docker).
My topic is running and I have messages.
I also configured the Avro schema for the value of this specific topic.
Still, I can't use the data from, for example, ksql.
Any idea of what I am doing wrong?
EDIT 1:
So what I expect is:
From the Confluent Platform topic view, I expect to see the value in a readable format (not raw Avro) once the schema is in the registry.
From KSQL, I tried to create a Stream with the following command:
CREATE STREAM hashtags
WITH (KAFKA_TOPIC='mytopic',
VALUE_FORMAT='AVRO');
But when I try to visualize my created stream, no data shows up.

Use cases for using multiple queries for Spark Structured streaming

I have a requirement to stream from multiple Kafka topics [Avro based] and put the data into Greenplum with a small modification to the payload.
The Kafka topics are defined as a list in a configuration file, and each Kafka topic will have one target table.
I am looking for a single Spark Structured Streaming application, where an update to the configuration file is enough to start listening to new topics or stop listening to existing ones.
I am looking for help as I am confused about using a single query vs multiple:
val query1 = df.writeStream.start()
val query2 = df.writeStream.start()
spark.streams.awaitAnyTermination()
or
df.writeStream.start().awaitTermination()
Under which use cases should multiple queries be used over a single query?
Apparently, you can use a regex pattern for consuming the data from different Kafka topics.
Let's say you have topic names like "topic-ingestion1" and "topic-ingestion2" - then you can create a regex pattern such as "topic-ingestion.*" to consume data from all matching topics.
Once a new topic matching your regex pattern gets created, Spark will automatically start streaming data from it.
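A minimal sketch of that, using the Kafka source's subscribePattern option (the broker address and the pattern itself are just examples):
# Subscribe to every topic matching the regex instead of a fixed list.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribePattern", "topic-ingestion.*")
      .load())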
Reference:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching
You can use the parameter "spark.kafka.consumer.cache.timeout" to specify your cache timeout.
From the Spark documentation:
spark.kafka.consumer.cache.timeout - The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor.
Let's say you have multiple sinks: you are reading from Kafka and writing into two different locations like HDFS and HBase - then you can branch your application logic into two writeStreams.
If the sink (Greenplum) supports batch mode of operations, then you can look at the foreachBatch() function from Spark Structured Streaming. It allows you to reuse the same batchDF for both operations.
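A sketch of that foreachBatch approach, reusing one batchDF for two sinks; the Greenplum connection details are placeholders (Greenplum is typically reachable through a PostgreSQL-compatible JDBC driver, but verify the URL and driver for your setup):
# df is the streaming DataFrame read from Kafka, as above.
def write_to_sinks(batch_df, batch_id):
    # Cache so the same micro-batch is not recomputed for each sink.
    batch_df.persist()
    # Sink 1: Greenplum over JDBC (connection details are placeholders).
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://greenplum-host:5432/mydb")
        .option("dbtable", "target_table")
        .option("user", "gpuser")
        .option("password", "secret")
        .mode("append")
        .save())
    # Sink 2: for example, a copy on HDFS.
    batch_df.write.mode("append").parquet("hdfs:///data/topic_copy")
    batch_df.unpersist()

query = (df.writeStream
         .foreachBatch(write_to_sinks)
         .option("checkpointLocation", "/tmp/checkpoints/multi-sink")
         .start())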

Spark streaming job writing to HDFS in JSON format

I have made a Spark Streaming job that polls messages from Kafka and stores them in JSON format on HDFS. I got an example from here: https://github.com/sryza/simplesparkavroapp/blob/specifics/src/main/scala/com/cloudera/sparkavro/SparkSpecificAvroWriter.scala
There is another job that creates a Hive table based on Avro with the following properties - AvroContainerInputFormat/AvroContainerOutputFormat.
Now I'm facing the problem that the produced JSON files are not visible when querying the Hive table.
It seems that the input/output formats are different.
Did someone have a similar problem?

How to use Kafka consumer in spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire data of specific topics in Kafka on a daily basis.
For Spark Streaming, I know that createDirectStream only needs a list of topics and some configuration information as arguments.
However, I realized that createRDD would have to include all of the topic, partition, and offset information.
I want to make batch processing as convenient as streaming in Spark.
Is it possible?
I suggest you read this text from Cloudera.
This example shows you how to get the data from Kafka exactly once, persisting the offsets in Postgres and relying on its ACID guarantees.
I hope that will solve your problem.
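As a side note: if upgrading is an option, newer versions of Spark (2.2+, as far as I remember) let the Structured Streaming Kafka source run plain batch queries, so you don't have to spell out partitions and offsets yourself. A sketch with placeholder broker and topic names:
# One-off batch read of everything currently in the topic.
df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))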