Talend Big Data Streaming not supporting subjob - talend

I am trying to read messages from a Kafka topic and load them into a tCacheOutput. Then, using tCacheInput in a subjob, I am reading the data, converting it into a format, joining it with another table, and loading it into SQL Server. The problem is that Big Data Streaming is not allowing me to run it. It says:
java.lang.Exception: In Talend, Spark Streaming jobs only support one job. Please deactivate the extra jobs.
Is there another way to achieve it? My job looks like this:

Related

How to do batch processing on kafka connect generated datasets?

Suppose we have batch jobs producing records into Kafka and we have a Kafka Connect cluster consuming records and moving them to HDFS. We want the ability to run batch jobs later on the same data, but we want to ensure that the batch jobs see all of the records generated by the producers. What is a good design for this?
You can run any MapReduce, Spark, Hive, etc. query on the data, and you will get all records that have thus far been written to HDFS. It will not see data that has not yet been consumed by the sink from the producers, but this has nothing to do with Connect or HDFS; that is a pure Kafka limitation.
Worth pointing out that Apache Pinot is a better place to combine Kafka streaming data and have batch query support.
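As a rough illustration of that answer (not from the original thread), a minimal Spark batch job in Scala over data the HDFS sink connector has already landed might look like the sketch below. The path, file format (Avro here), and column name are assumptions about a hypothetical setup; adjust them to match your connector's topics.dir and format.class settings.

// Minimal sketch, assuming a Kafka Connect HDFS sink writing Avro under /topics/<topic>/.
import org.apache.spark.sql.SparkSession

object BatchOverConnectOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-over-connect-output")
      .getOrCreate()

    // Hypothetical layout written by the sink connector.
    val records = spark.read
      .format("avro")
      .load("hdfs:///topics/my-batch-topic")

    // Any batch query sees everything flushed to HDFS so far; records still
    // sitting in Kafka (not yet consumed by the sink) are simply not here.
    records.groupBy("some_key").count().show()

    spark.stop()
  }
}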

how to implement a short lived queues for ETL

I am looking for a suggestion on how we can implement short-lived queues (topics) to perform an ETL; after the ETL is completed, the queue (topic) and its data are not needed anymore.
Here is the scenario: when a particular job runs, it has to run a query to extract data from a database (assume Teradata) and load it into a topic. Then a Spark job is kicked off to process all the records in that topic, after which the Spark job stops. After that, the topic and the data in it are not needed anymore.
For this I see Kafka and Redis Streams as the two options. Redis Streams looks to me like the more appropriate tool because of the ease of creating and destroying topics; with Kafka it seems to require additional custom handlers for creating and dropping the topics (see the sketch below), and I also don't want to overload Kafka with too many topics.
I am open and happy to hear from you if there is another, better alternative out there.
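For reference, and not from the original thread: a rough Scala sketch of the kind of "custom handler" the question alludes to, creating a short-lived topic with Kafka's AdminClient before the ETL and deleting it afterwards. The topic name, partition count, and bootstrap servers are placeholders.

// Sketch: create an ephemeral topic for one ETL run, then drop it when done.
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object EphemeralTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = AdminClient.create(props)
    try {
      // Create the staging topic for this ETL run (3 partitions, replication factor 1).
      val topic = new NewTopic("etl-run-2024-01-01", 3, 1.toShort)
      admin.createTopics(Collections.singleton(topic)).all().get()

      // ... extract from Teradata, produce into the topic, run the Spark job ...

      // Tear the topic down once the ETL has finished with it.
      admin.deleteTopics(Collections.singleton("etl-run-2024-01-01")).all().get()
    } finally {
      admin.close()
    }
  }
}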

Design stream pipeline using spark structured streaming and databricks delta to handle multiple tables

I am designing a streaming pipeline where I need to consume events from a Kafka topic.
A single Kafka topic can carry data from around 1000 tables, with the data arriving as JSON records. I now have the problems below to solve.
First, route each message to a separate folder based on its table: this is done using Spark Structured Streaming with partitionBy on the table name.
Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. I am not able to find the best solution for this: inferring the JSON schema dynamically and writing to the Delta tables dynamically. How can this be done, given that the records arrive as JSON strings?
Since I have to process so many tables, do I need to write that many streaming queries? How can this be solved? (See the sketch below.)
Thanks
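One commonly suggested pattern for this kind of fan-out (not an answer from the original thread) is a single streaming query whose foreachBatch splits each micro-batch by table name and appends to per-table Delta paths. A minimal Scala sketch follows; the column names (table_name, payload), the topic, and the paths are assumptions.

// Sketch: one streaming query, dynamic routing to per-table Delta locations.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object MultiTableDeltaSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-table-delta").getOrCreate()
    import spark.implicits._

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "all-tables")
      .load()
      .select(
        get_json_object($"value".cast("string"), "$.table_name").as("table_name"),
        $"value".cast("string").as("payload"))

    val query = raw.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // For each table present in this micro-batch, infer the schema of its
        // JSON payloads and append to that table's Delta location.
        val tables = batch.select("table_name").distinct().as[String].collect()
        tables.foreach { t =>
          val subset = batch.filter($"table_name" === t).select("payload").as[String]
          val parsed = spark.read.json(subset) // schema inferred per batch
          parsed.write.format("delta").mode("append").save(s"/delta/$t")
        }
      }
      .option("checkpointLocation", "/checkpoints/multi-table")
      .start()

    query.awaitTermination()
  }
}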

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The PostgreSQL database is not empty, and I want to catch any CDC change in it. In the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis of what is happening at the moment the CDC event happens.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data is an API that we want to monitor, and there are limited examples of database-to-database streaming processing. I have done the process before from Kafka to another database, but I needed to transform and aggregate the data (I'm not using Confluent; I rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
If you want to process per record, then use the Spring Boot Kafka setup for consuming Kafka messages; that can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
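To make the "you can process the individual records" point concrete, here is a minimal Scala sketch (not from the original answer) using a ForeachWriter, which Structured Streaming invokes once per record even though data arrives in micro-batches. The topic name and the body of process() are placeholders.

// Sketch: per-record handling of CDC events inside Structured Streaming.
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

object PerRecordCdc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-record-cdc").getOrCreate()

    val cdc = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "pg-cdc-topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    val query = cdc.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, epochId: Long): Boolean = true
        def process(row: Row): Unit = {
          // Called once per CDC record; react here (call a service, upsert, etc.).
          println(row.getString(0))
        }
        def close(errorOrNull: Throwable): Unit = ()
      })
      .start()

    query.awaitTermination()
  }
}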
https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single-message processing approach.

Spark structured streaming PostgreSQL updateStateByKey

How can I update the state of an OUTPUT table via a Spark Structured Streaming computation triggered by changes in an INPUT PostgreSQL table?
As a real-life scenario: the USERS table has been updated for user_id = 0002. How can I trigger the Spark computation for that user only and write/update the results to another table?
Although there is no out-of-the-box solution, you can implement it in the following way.
You can use LinkedIn's Databus or other similar tools that mine the database logs and produce corresponding events to Kafka; such a tool tracks the changes in the database bin logs. You can write a Kafka connector to transform and filter the data, and then consume the events from Kafka and process them into any sink format you want.
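As a rough Scala sketch of that last consume-and-process step (an illustration, not part of the original answer): read the CDC events from Kafka, extract the changed keys, and recompute/write results for just those keys in each micro-batch. The topic, JSON field names, table names, and JDBC details are all assumptions.

// Sketch: CDC-driven recomputation for only the users touched in each micro-batch.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object CdcDrivenRecompute {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cdc-driven-recompute").getOrCreate()
    import spark.implicits._

    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "users-cdc")
      .load()
      .select(get_json_object($"value".cast("string"), "$.user_id").as("user_id"))

    changes.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // Recompute only for the users touched in this micro-batch and write
        // the results to the output table (append here; a true upsert would
        // need custom SQL over JDBC or a MERGE-capable sink such as Delta).
        val touched = batch.distinct()
        val users = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://localhost:5432/app")
          .option("dbtable", "users")
          .option("user", "app").option("password", "secret")
          .load()
        users.join(touched, "user_id")
          .write.format("jdbc")
          .option("url", "jdbc:postgresql://localhost:5432/app")
          .option("dbtable", "output_table")
          .option("user", "app").option("password", "secret")
          .mode("append")
          .save()
      }
      .option("checkpointLocation", "/checkpoints/cdc-recompute")
      .start()
      .awaitTermination()
  }
}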