Spark streaming job writing to HDFS in JSON format - Scala

I have made a Spark streaming job that polls messages from Kafka and stores them in JSON format on HDFS. I started from this example: https://github.com/sryza/simplesparkavroapp/blob/specifics/src/main/scala/com/cloudera/sparkavro/SparkSpecificAvroWriter.scala
There is another job that creates a Hive table based on Avro with the following properties: AvroContainerInputFormat / AvroContainerOutputFormat.
Now I'm facing the problem that the produced JSON files are not visible when querying the Hive table.
It seems the input/output formats don't match.
Has anyone had a similar problem?
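Since the Hive table is declared with AvroContainerInputFormat/AvroContainerOutputFormat, it will only pick up Avro data files, so JSON written under the table location won't be readable. One way out is to write Avro from the streaming job instead. Below is a minimal sketch (not your exact job), assuming Spark 2.4+ with the built-in Avro source and the spark-sql-kafka package; the broker, topic, and paths are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KafkaToAvroOnHdfs").getOrCreate()

// Read the Kafka topic (placeholder broker/topic)
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .load()

// Keep the payload as a string column; apply your own parsing here
val records = kafka.selectExpr("CAST(value AS STRING) AS value")

// Write Avro files under the Hive table's location so the
// AvroContainer input/output formats can read them
records.writeStream
  .format("avro")
  .option("path", "hdfs:///warehouse/mydb.db/mytable")
  .option("checkpointLocation", "hdfs:///checkpoints/mytable")
  .start()
  .awaitTermination()

The other direction also works: keep writing JSON and declare the Hive table over a JSON SerDe (e.g. org.apache.hive.hcatalog.data.JsonSerDe) instead of the Avro container formats. Either way, the file format on disk and the table's storage format have to agree.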

Related

PySpark Structured Streaming: read Kafka into a Delta table

I'm exploring PySpark Structured Streaming and Databricks. I want to write a Spark Structured Streaming job that reads all the data from a Kafka topic and publishes it to Delta tables.
Let's assume I'm using the latest version and Kafka has the following details:
Kafka topic name: ABC
Kafka broker: localhost:9092
Sample data: name=qwerty&company_name=stackoverflow&profession=learner
I want to store the Kafka topic data in the Delta table with the following fields:
timestamp, company_name, data
2022-11-14 07:50:00+0000, StackOverflow, name=qwerty&company_name=stackoverflow&profession=learner
Is there a way I can see the Delta table data in the console?
You can read and display your data using Spark, something like:
people_df = spark.read.load(table_path)
display(people_df)
# or
people_df.show(5)
Then you can submit this like any other Spark job. Refer to the docs for more details.
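For the writing side of the question (Kafka to Delta with timestamp, company_name, data), here is a rough sketch in Scala; the same options carry over one-to-one to PySpark. It assumes the spark-sql-kafka and Delta Lake packages are on the classpath, and the regexp_extract pattern, checkpoint location, and table path are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("KafkaToDelta").getOrCreate()

// Read the raw topic (broker and topic name taken from the question)
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "ABC")
  .load()

// Keep the Kafka timestamp, pull company_name out of the key=value payload,
// and keep the full payload as `data`
val parsed = raw
  .selectExpr("timestamp", "CAST(value AS STRING) AS data")
  .withColumn("company_name", regexp_extract(col("data"), "company_name=([^&]*)", 1))
  .select("timestamp", "company_name", "data")

// Append to a Delta table at a placeholder path
parsed.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/abc")
  .start("/tmp/delta/abc")

Once the query is running, the read/display snippet above (or spark.read.format("delta").load(...)) is how you look at what has landed in the table.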

Why does the console sink not work in update output mode in Apache Spark Structured Streaming?

I wrote a very simple application that reads a stream of CSV files, does some aggregation, and writes the stream to the console. This is the error it reports:
Data source v2 streaming sinks does not support Update mode
I did everything according to the book. Where is the problem?
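For context, this is roughly the shape of the pipeline being described: a streaming aggregation is exactly what update mode is for, and on current Spark 3.x releases the console sink accepts it, so the error usually comes down to the specific Spark version/build in use. The schema and input path below are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("CsvAggToConsole").getOrCreate()

// Streaming file sources need the schema up front (placeholder columns)
val schema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

val csv = spark.readStream
  .schema(schema)
  .csv("/tmp/input-csv")

// The aggregation is what makes update mode applicable
val totals = csv.groupBy("id").agg(sum("amount").as("total"))

totals.writeStream
  .format("console")
  .outputMode("update")
  .start()
  .awaitTermination()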

Design a streaming pipeline using Spark Structured Streaming and Databricks Delta to handle multiple tables

I am designing a streaming pipeline where I need to consume events from a Kafka topic.
A single Kafka topic can carry data from around 1000 tables, with each record arriving as JSON. I have the following problems to solve.
Route each message to a separate folder based on its table: this is done using Spark Structured Streaming with partitionBy on the table name.
Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. I can't find a good way to infer the JSON schema dynamically and write to the Delta tables dynamically, since the records arrive as JSON strings. How can this be done?
Since I have to process so many tables, do I need to write that many streaming queries? How can this be solved?
Thanks
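One common way to handle both points with a single streaming query is foreachBatch, fanning out per table inside each micro-batch. The sketch below is only an outline: it assumes each JSON record carries its table name in a `table` field, and the Kafka settings, Delta paths, and checkpoint location are placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("MultiTableRouter").getOrCreate()

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "all-tables-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  // assumes each record names its table in a `table` field
  .withColumn("table_name", get_json_object(col("json"), "$.table"))

// Fan out per table inside each micro-batch: one streaming query overall
val writeBatch: (DataFrame, Long) => Unit = (batch, _) => {
  batch.persist()
  val tables = batch.select("table_name").distinct().collect().map(_.getString(0))
  tables.foreach { t =>
    // a per-table schema could be applied here with from_json before writing
    batch.filter(col("table_name") === t)
      .write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save(s"/delta/$t")
  }
  batch.unpersist()
}

raw.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/checkpoints/all-tables")
  .start()

This keeps it to one streaming query; whether a single query is acceptable for ~1000 tables depends on batch sizes, so grouping tables into a handful of such queries is a reasonable middle ground.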

How to write a partitioned Parquet file in Apache Beam (Java)

I am new to Apache Beam and not sure how to accomplish this task.
I want to write a partitioned parquet file using Apache Beam in Java.
Data is read from Kafka, and I want the file to have a new partition every hour. The timestamp column is present in the data.
Try using FixedWindows for that. There is an example of a windowed WordCount that writes every window into a separate text file, so I believe it can be adapted to your case.

StreamSets Design Of Ingestion

Dear all,
I am considering options for how to use StreamSets properly in a generic Data Hub architecture:
I have several data types (CSV, TSV, JSON, binary from IoT) that need to be captured by CDC, saved into a Kafka topic in their original format, and then sunk to the HDFS data lake as-is.
Then another StreamSets pipeline will consume from this Kafka topic, convert each data type to a common JSON format, perform validation, masking, metadata handling, etc., and save to another Kafka topic.
The same JSON messages will also be saved into the HDFS data lake in Avro format for batch processing.
I will then use Spark Streaming to consume the same JSON messages for real-time processing, assuming the JSON data is ready and can be further enriched with other data for scalable, complex transformations.
I have not used StreamSets for the further processing, relying instead on Spark Streaming for the scalable, complex transformations, which is therefore not part of SLA management (as the Spark jobs are not triggered from within StreamSets). Also, I could not use the Kafka Schema Registry with Avro in this design to validate the JSON schema; the JSON schema is validated with custom logic embedded into StreamSets as JavaScript.
What can be done better in the above design?
Thanks in advance...
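On the Spark Streaming side of this design, one thing that can be tightened is the JSON schema validation: rather than relying only on the JavaScript logic inside StreamSets, the Spark consumer can enforce a schema itself and split out records that don't parse. A rough sketch under that assumption (broker, topic, and fields are placeholders, and a hand-maintained StructType stands in for a schema registry):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("JsonTopicConsumer").getOrCreate()

// Hand-maintained schema standing in for a registry (placeholder fields)
val schema = new StructType()
  .add("event_id", StringType)
  .add("event_time", TimestampType)
  .add("payload", StringType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "common-json-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(col("json"), from_json(col("json"), schema).as("rec"))

// Malformed JSON comes back as a null struct, so it can be routed to a
// dead-letter sink instead of being silently dropped
val valid   = parsed.filter(col("rec").isNotNull).select("rec.*")
val invalid = parsed.filter(col("rec").isNull).select("json")

valid can then be enriched and written on for the real-time path, while invalid goes to a quarantine topic or path; this keeps the validation rules versioned together with the Spark job instead of only in the StreamSets JavaScript stage.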
Your pipeline design looks good.
However, I would recommend consolidating several of those steps using Striim.
Striim has built-in CDC (change data capture) from all the sources you listed, plus databases.
It has native Kafka integration, so you can write to and read from Kafka in the same pipeline.
Striim also has built-in caches and processing operators for enrichment, so you don't need to write Spark code for it. Everything is done through our simple UI.
You can try it out here:
https://striim.com/instant-download
Full disclosure: I'm a PM at Striim.