StreamSets Design Of Ingestion

Dears,
I am considering how to use StreamSets properly in a generic Data Hub architecture:
I have several data types (CSV, TSV, JSON, binary from IoT) that need to be captured via CDC, written as-is to a Kafka topic, and then sunk as-is into the HDFS Data Lake.
Another StreamSets pipeline will then consume from this Kafka topic, convert the data (depending on its type) into a common JSON format, perform validations, masking, metadata handling, etc., and save it to another Kafka topic.
The same JSON message will also be saved into the HDFS Data Lake in Avro format for batch processing.
I will then use Spark Streaming to consume the same JSON messages for real-time processing, assuming the JSON data is ready and can be further enriched with other data for scalable, complex transformations.
I am not using StreamSets for this further processing; I rely on Spark Streaming for the scalable, complex transformations, which fall outside SLA management (as the Spark jobs are not triggered from within StreamSets). Also, I could not use the Schema Registry with Avro in this design to validate the JSON schema, so the JSON schema is validated by custom JavaScript logic embedded in StreamSets.
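For context, here is a rough sketch of how I picture the Spark side consuming the curated JSON topic (sketched with Structured Streaming; the broker address, topic name, and schema fields are placeholders, not my real ones):

# Requires the spark-sql-kafka package on the classpath
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("curated-json-consumer").getOrCreate()

# Placeholder schema; the real one would mirror the common JSON format
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "curated-json-topic")          # placeholder topic
       .load())

# from_json yields null for rows that do not match the schema, which gives a
# lightweight structural validation in the absence of a Schema Registry
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .where(col("e").isNotNull())
          .select("e.*"))

query = (events.writeStream
         .format("console")   # stand-in for the real enrichment/sink step
         .outputMode("append")
         .start())
query.awaitTermination()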
What can be done better in the above design?
Thanks in advance...

Your pipeline design looks good.
However, I would recommend consolidating several of those steps using Striim.
Striim has built-in CDC (change data capture) from all the sources you listed, plus databases.
It has native Kafka integration, so you can write to and read from Kafka in the same pipeline.
Striim also has built-in caches and processing operators for enrichment, so you don't need to write Spark code for it. Everything is done through our simple UI.
You can try it out here:
https://striim.com/instant-download
Full disclosure: I'm a PM at Striim.

Related

Streaming CDC changes with Kafka and Spark still processes them in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The PostgreSQL database is not empty, and I want to capture any CDC change in it. In the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can analyze what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the data source is an API that we want to monitor, and there are only limited examples of database-to-database stream processing. I have done this before from Kafka to another database, but now I need to transform and aggregate the data (I'm not using Confluent; I rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
If you want to process record by record, then use the Spring Boot Kafka setup for consuming Kafka messages; it can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
This article discusses the single-message processing approach: https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b
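If you do stay on Spark, here is a minimal sketch (in PySpark; the broker, topic name, and per-record logic are assumptions on my part) of handling each record individually even though delivery happens in micro-batches:

# Requires the spark-sql-kafka package on the classpath
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-per-record").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
          .option("subscribe", "postgres-cdc-topic")             # assumed topic name
          .load())

def process_record(row):
    # Called once per record, even though Spark delivers records in micro-batches
    print(row.key, row.value)   # replace with your real per-record logic

query = (stream.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
         .writeStream
         .foreach(process_record)
         .start())
query.awaitTermination()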

How to push Avro events to MemSQL

I am working on a POC to implement real-time analytics, where we have the following components.
Confluent Kafka: receives events from third-party services in Avro format (an event contains many fields, up to 40). We are also using the Schema Registry to deal with different kinds of event formats.
I am trying to use MemSQL for the analytics, for which I have to push the events into a MemSQL table in a specific format.
I have gone through the MemSQL website, blogs, etc., but most of them suggest using the Spark-MemSQL connector, in which you can transform the data you are getting from Confluent Kafka.
I have a few questions:
Can I use a simple Java/Go application in place of Spark?
Is there any utility provided by Confluent Kafka or MemSQL for this?
Thanks.
I recommend using MemSQL Pipelines. https://docs.memsql.com/memsql-pipelines/v6.0/kafka-pipeline-quickstart/
In current versions of MemSQL, you'll need to set up a transform, which will be a small Golang or Python script that reads in the Avro and outputs TSV. Instructions on how to do that are here: https://docs.memsql.com/memsql-pipelines/v6.0/transforms/, but the tl;dr is that you need a script which does roughly the following:
import sys, struct

while True:
    header = sys.stdin.buffer.read(8)             # 8-byte length prefix per record
    if not header:
        break
    record_size = struct.unpack("<q", header)[0]  # assumes a little-endian size field
    sys.stdout.write(AvroToTSV(sys.stdin.buffer.read(record_size)))  # AvroToTSV is a placeholder, as in the original sketch
Stay tuned for native Avro support in MemSQL.

Is there a way to push historical data into Druid over HTTP?

I have an IoT project and want to use Druid as a time-series DBMS. Sometimes an IoT device may lose the network connection and will re-transfer historical data as well as real-time data when it reconnects to the server. I know Druid can ingest real-time data over HTTP push/pull and historical data over HTTP pull or the Kafka Indexing Service (KIS), but I can't find documentation about ingesting historical data over HTTP push.
Is there a way to send historical data into Druid over HTTP push?
I see a few options here:
Keep pushing the historical data to the same Kafka topic (or other streaming source) and reject events based on the message timestamp inside Druid. This simplifies your application architecture and lets Druid handle the rejection of expired events.
Use batch ingestion for the historical data. You push the historical data to another Kafka topic, run a Spark/Gobblin/any other indexing job to get the data into HDFS, and then do a batch ingestion into Druid. But remember that Druid overwrites any real-time segments with batch segments for the specified windowPeriod, so if the historical data is not complete, you run into data loss. To prevent this, you could always pump the real-time data into Hadoop as well, periodically de-duplicate the HDFS data, and ingest it into Druid. As you can see, this is a more complicated architecture, but it can result in minimal data loss.
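For that periodic de-duplication step, a small Spark batch job is usually enough; a minimal sketch, assuming made-up HDFS paths and key columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-historical").getOrCreate()

# Assumed paths and key columns; adjust to your actual layout
raw = spark.read.json("hdfs:///iot/raw/")
deduped = raw.dropDuplicates(["device_id", "event_timestamp"])
deduped.write.mode("overwrite").parquet("hdfs:///iot/clean/")
# The cleaned path is then what the Druid batch (Hadoop) ingestion job reads from.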
If I were you, I would simplify and send all the data to the same streaming source, like Kafka. I would index segments in Druid based on my message's timestamp and not the current time (which I believe is the default).
The Kafka indexing service, released recently, guarantees exactly-once ingestion.
Refer to the link below: http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
If you still want to ingest over HTTP, you can check out Tranquility Server. It has some mechanisms built in for handling duplicates.

Apache NiFi: Validate the FlowFile data created by ConsumeKafka

I am pretty new to NiFi. We already have the setup done and are able to consume the Kafka messages.
In the NiFi UI, I created a processor with ConsumeKafka_0_10. When the messages are published (by a different process), my processor is able to pick up the required data/messages properly.
I go to "Data Provenance" and can see that the correct data is received.
However, I want the next step to be a validator that reads the flowfile from ConsumeKafka and does basic validation (a user-supplied script would be good).
How do we do that, or which processor works here?
Also, is there any way to convert the flowfile input format into CSV or JSON format?
You have a few options. Depending on the flowfile content format, you can use ValidateRecord with a *Reader record reader controller service configured to validate it. If you already have a script to do this in Groovy/Javascript/Ruby/Python, ExecuteScript is also a solution.
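If you go the ExecuteScript route, a rough Jython sketch of a validator could look like the following (the assumption that the content is JSON with a required "id" field is mine, purely for illustration):

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class ValidateCallback(InputStreamCallback):
    def __init__(self):
        self.valid = False
    def process(self, inputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        try:
            record = json.loads(text)
            self.valid = "id" in record   # hypothetical required field
        except ValueError:
            self.valid = False

# session, REL_SUCCESS and REL_FAILURE are variables bound by ExecuteScript
flowFile = session.get()
if flowFile is not None:
    callback = ValidateCallback()
    session.read(flowFile, callback)
    session.transfer(flowFile, REL_SUCCESS if callback.valid else REL_FAILURE)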
Similarly, to convert the flowfile content into CSV or JSON, use a ConvertRecord processor with a ScriptedReader and a CSVRecordSetWriter or JsonRecordSetWriter to output the correct format. These processors use the Apache NiFi record structure internally to convert between arbitrary input/output formats with high performance. Further reading is available at blogs.apache.org/nifi and bryanbende.com.

Flume or Kafka's equivalent for MongoDB

In the Hadoop world, Flume or Kafka is used to stream or collect data and store it in Hadoop. I am just wondering whether MongoDB has a similar mechanism or tool to achieve the same?
MongoDB is just the database layer, not a complete solution like the Hadoop ecosystem. I actually use Kafka along with Storm to store data in MongoDB in cases where there is a very large flow of incoming data that needs to be processed and stored.
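Even without Storm, a plain consumer process can land messages in MongoDB. Here is a minimal sketch assuming the kafka-python and pymongo clients, with placeholder topic, database, and collection names:

import json
from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    "events",                                   # placeholder topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
collection = MongoClient("mongodb://localhost:27017")["mydb"]["events"]

for message in consumer:
    # Each Kafka message becomes one MongoDB document
    collection.insert_one(message.value)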
Although Flume is frequently used and treated as a member of the Hadoop ecosystem, it's not impossible to use it with other sources/sinks, and MongoDB is no exception. In fact, Flume is flexible enough to be extended to create your own custom sources/sinks. See this project, for example: it is a custom Flume-to-MongoDB sink.