Why does the console sink not work in update output mode in Apache Spark Structured Streaming? - spark-structured-streaming

I wrote a very simple application that reads a stream of CSV files, does some aggregation, and writes the stream to the console. This is the error it reports:
Data source v2 streaming sinks does not support Update mode
I did everything according to the book. Where is the problem?
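For reference, here is a minimal sketch of the kind of application I mean (the paths, schema, and column names are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("csv-stream-agg").getOrCreate()

# Read a stream of CSV files from a directory (file sources need an explicit schema)
events = (spark.readStream
          .option("header", "true")
          .schema("name STRING, value INT")     # placeholder schema
          .csv("/tmp/input-csv"))               # placeholder input directory

# A simple aggregation, then write the running result to the console in update mode
counts = events.groupBy("name").agg(count("*").alias("cnt"))

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())

query.awaitTermination()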

Related

Talend Big Data Streaming not supporting subjob

I am trying to read messages from a Kafka topic and load them into a tCacheOutput. Then, using tCacheInput in a subjob, I read the data, convert it into a format, join it with another table, and load it into SQL Server. The problem is that Big Data Streaming is not allowing me to run it. It says:
java.lang.Exception: In Talend, Spark Streaming jobs only support one job. Please deactivate the extra jobs.
Is there another way to achieve this? My job looks like this:

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides data for Kafka. The condition is that the PostgreSQL database is not empty, and I want to catch any CDC change in it. At the end, I want to grab the Kafka messages and process them in a stream with Spark, so I can get analysis of what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data for this case comes from an API that we want to monitor, and there are limited examples of database-to-database streaming processing. I have done this process before from Kafka to another database, but now I need to transform and aggregate the data (I'm not using Confluent; I rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I'm not sure what the issue is.
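For example, here is a minimal sketch of handling records individually inside each micro-batch with foreachBatch (the bootstrap servers, topic name, and process_record handler are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-per-record").getOrCreate()

# Read the CDC events from Kafka as a stream
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder
       .option("subscribe", "cdc-events")                     # placeholder topic
       .load())

def handle_batch(batch_df, batch_id):
    # Spark delivers a micro-batch, but you can still act on every record in it
    # (collect() is only reasonable here for small batches; it is for illustration)
    for row in batch_df.selectExpr("CAST(value AS STRING) AS value").collect():
        process_record(row.value)    # placeholder: your per-record logic

query = raw.writeStream.foreachBatch(handle_batch).start()
query.awaitTermination()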
If you want to process per record, then use the Spring Boot Kafka setup for consuming Kafka messages; it can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
This article discusses the single-message processing approach: https://medium.com/#contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b

Spark stream job writing to Hdfs in a Json format

I have made a Spark Streaming job that polls messages from Kafka and stores them in JSON format on HDFS. I got an example from here: https://github.com/sryza/simplesparkavroapp/blob/specifics/src/main/scala/com/cloudera/sparkavro/SparkSpecificAvroWriter.scala
There is another job that creates a Hive table based on Avro with the following properties: AvroContainerInputFormat/AvroContainerOutputFormat.
Now I'm facing a problem: the produced JSON files are not visible when querying the Hive table.
It seems that the input/output formats are different.
Has anyone had a similar problem?
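One direction I am considering is to write the stream as Avro container files instead of JSON, so the files line up with the Avro input format on the Hive side. A rough sketch only (assuming Spark 2.4+ with the built-in Avro source; servers, topic, and paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs-avro").getOrCreate()

messages = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder
            .option("subscribe", "events")                         # placeholder topic
            .load()
            .selectExpr("CAST(value AS STRING) AS value"))

# Write Avro container files so they match AvroContainerInputFormat on the Hive side
query = (messages.writeStream
         .format("avro")                                   # built-in Avro source in Spark 2.4+
         .option("path", "/data/events-avro")              # placeholder HDFS path
         .option("checkpointLocation", "/tmp/chk-avro")    # placeholder
         .start())

query.awaitTermination()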

How to push Avro event to Memsql

I am working on a POC to implement real-time analytics, where we have the following components.
Confluent Kafka: gets events from third-party services in Avro format (an event contains many fields, up to 40). We are also using the Kafka Schema Registry to deal with different kinds of event formats.
I am trying to use MemSQL for analytics, for which I have to push events to a MemSQL table in a specific format.
I have gone through the MemSQL website, blogs, etc., but most of them suggest using the Spark-MemSQL connector, in which you can transform the data we are getting from Confluent Kafka.
I have a few questions:
Can I use a simple Java/Go application in place of Spark?
Is there any utility provided by Confluent Kafka and MemSQL?
Thanks.
I recommend using MemSQL Pipelines. https://docs.memsql.com/memsql-pipelines/v6.0/kafka-pipeline-quickstart/
In current versions of MemSQL, you'll need to set up a transform, which is a small Golang or Python script that reads in the Avro and outputs TSV. Instructions on how to do that are here: https://docs.memsql.com/memsql-pipelines/v6.0/transforms/. The tl;dr is that you need a script which does something like the following (a Python sketch; avro_to_tsv is a placeholder for your own conversion, and the exact framing of the length prefix is described in the transform docs):
import struct, sys

while True:
    header = sys.stdin.buffer.read(8)                 # 8-byte record-length prefix
    if len(header) < 8: break                         # end of input
    record_size = struct.unpack("<q", header)[0]      # assumes little-endian; check the transform docs
    avro_record = sys.stdin.buffer.read(record_size)
    sys.stdout.write(avro_to_tsv(avro_record))        # placeholder: your Avro-to-TSV conversion
Stay tuned for native Avro support in MemSQL.

StreamSets Design Of Ingestion

Dear all,
I am considering options for how to use StreamSets properly in a given generic Data Hub architecture:
I have several data types (CSV, TSV, JSON, binary from IoT) that need to be captured by CDC and saved into a Kafka topic in their as-is format, and then sunk to the HDFS data lake as-is.
Then another StreamSets pipeline will consume from this Kafka topic, convert the data (depending on data type) to a common JSON format, perform validations, masking, metadata handling, etc., and save it to another Kafka topic.
The same JSON messages will be saved into the HDFS data lake in Avro format for batch processing.
I will then use Spark Streaming to consume the same JSON messages for real-time processing, assuming the JSON data is ready and can be further enriched with other data for scalable, complex transformation; a sketch of that step follows.
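As a rough sketch of that consumption step (the bootstrap servers, topic name, schema, and reference path are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.appName("json-events-enrichment").getOrCreate()

# Consume the common-format JSON messages from the validated Kafka topic
json_events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")     # placeholder
               .option("subscribe", "validated-events")                 # placeholder topic
               .load()
               .select(from_json(col("value").cast("string"),
                                 "event_id STRING, payload STRING")     # placeholder schema
                       .alias("e"))
               .select("e.*"))

# Enrich with static reference data before the complex transformations
reference = spark.read.parquet("/data/reference")    # placeholder path
enriched = json_events.join(reference, "event_id", "left")

query = enriched.writeStream.format("console").start()
query.awaitTermination()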
I have not used StreamSets for further processing and am relying on Spark Streaming for scalable, complex transformations, which is not part of the SLA management (as the Spark jobs are not triggered from within StreamSets). Also, I could not use the Kafka Schema Registry with Avro in this design to validate the JSON schema; instead, the JSON schema is validated with custom logic embedded in StreamSets as JavaScript.
What can be done better in the above design?
Thanks in advance...
Your pipeline design looks good.
However, I would recommend consolidating several of those steps using Striim.
Striim has built-in CDC (change data capture) from all the sources you listed, plus databases.
It has native Kafka integration, so you can write to and read from Kafka in the same pipeline.
Striim also has built-in caches and processing operators for enrichment, so you don't need to write Spark code to do the enrichment; everything is done through our simple UI.
You can try it out here:
https://striim.com/instant-download
Full disclosure: I'm a PM at Striim.