I am working on a POC to implement real-time analytics, with the following components.
Confluent Kafka: receives events from third-party services in Avro format (each event contains many fields, up to 40). We are also using the Kafka Schema Registry to deal with different kinds of event formats.
I am trying to use MemSQL for the analytics, for which I have to push events into a MemSQL table in a specific format.
I have gone through the MemSQL website, blogs, etc., but most of them suggest using the Spark MemSQL connector, in which you can transform the data coming from Confluent Kafka.
I have a few questions:
Can I use a simple Java/Go application in place of Spark?
Is there any utility provided by Confluent Kafka or MemSQL for this?
Thanks.
I recommend using MemSQL Pipelines. https://docs.memsql.com/memsql-pipelines/v6.0/kafka-pipeline-quickstart/
In current versions of MemSQL, you'll need to set up a transform, which will be a small Golang or Python script that reads in the Avro and outputs TSV. Instructions on how to do that are here: https://docs.memsql.com/memsql-pipelines/v6.0/transforms/, but the tl;dr is that you need a script which does the following:
import struct, sys
while True:
    header = sys.stdin.buffer.read(8)  # 8-byte record-length prefix; check the transform docs for the exact byte order (little-endian assumed here)
    if not header:
        break
    record_size = struct.unpack("<q", header)[0]
    avro_record = sys.stdin.buffer.read(record_size)
    sys.stdout.write(avro_to_tsv(avro_record))  # avro_to_tsv: your Avro-to-TSV conversion (sketched below)
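The Avro-to-TSV part is up to you. As a rough illustration only, here's a minimal sketch of such a helper using fastavro (my assumption, not something MemSQL requires), with a made-up two-field schema; it assumes the payload is plain Avro binary with a schema the script already knows. If your producers use the Confluent Schema Registry wire format, you'd first have to strip the magic byte and 4-byte schema ID (or resolve the schema by that ID).

import io
import fastavro  # assumption: fastavro is installed in the transform's environment

# Hypothetical two-field schema for illustration; your real events have ~40 fields.
SCHEMA = fastavro.parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"},
               {"name": "value", "type": "string"}],
})
FIELDS = ["id", "value"]

def avro_to_tsv(avro_record: bytes) -> str:
    # Decode one Avro-encoded record and emit one tab-separated row.
    rec = fastavro.schemaless_reader(io.BytesIO(avro_record), SCHEMA)
    return "\t".join(str(rec[name]) for name in FIELDS) + "\n"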
Stay tuned for native Avro support in MemSQL.
As I've read in Kafka: The Definitive Guide, Kafka Connect can simplify the task of loading CSV files into Kafka. But since we don't write any code for business logic (like Python/Java code), what should I do if I want to read data from CSV, combine it with data from different sources to generate a new message, or even add data from system logs to that new message, before loading it into Kafka? Is Kafka Connect still a good approach in this use case?
The source for this answer is this Stack Overflow thread: Kafka Connect - Modifying records before writing into HDFS sink.
You have several options.
Single Message Transforms (SMTs): great for lightweight changes as messages pass through Connect. They are configuration-file-based, and extensible using the provided API if an existing transform doesn't do what you want (see the sketch after this list). See the discussion here on when SMTs are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS.
KSQL is built on the Kafka Streams API, which is a Java library and gives you the power to transform your data as much as you'd like.
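To make the SMT option concrete, here's a hedged sketch of registering a sink connector through the Connect REST API, with the built-in InsertField transform stamping a static field onto every record value. The connector class, topic, HDFS URL, and field names are placeholders for your own setup, and InsertField expects structured values (e.g. Avro or JSON with schemas), not raw strings.

import json
import requests  # assumption: the Connect REST API is reachable on localhost:8083

connector = {
    "name": "hdfs-sink-with-smt",  # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "csv-events",                 # placeholder topic
        "hdfs.url": "hdfs://namenode:8020",     # placeholder HDFS URL
        "flush.size": "1000",
        # Single Message Transform: add a static field to every record value
        "transforms": "addSource",
        "transforms.addSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.addSource.static.field": "data_source",
        "transforms.addSource.static.value": "csv-import",
    },
}

resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()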
I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides data for Kafka. The PostgreSQL database is not empty, and I want to capture any CDC change in it. At the end, I want to consume the Kafka messages and process them as a stream with Spark, so I can analyze what is happening at the same time the CDC events occur.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. In the articles I have seen, the source of data for this case is usually an API we want to monitor, and there are few examples of database-to-database stream processing. I have done this before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent; I rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I'm not sure what the issue is.
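For example, here's a minimal PySpark sketch using foreachBatch, which hands you each micro-batch as a DataFrame and therefore every individual record in it (broker address, topic name, and checkpoint path are placeholders, and the spark-sql-kafka connector package is assumed to be available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-stream").getOrCreate()

# Read the CDC topic as a stream of Kafka records.
cdc = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
       .option("subscribe", "pg.cdc.events")                 # placeholder topic
       .load()
       .selectExpr("CAST(value AS STRING) AS json"))

def handle_batch(df, epoch_id):
    # Called once per micro-batch; iterate, aggregate, or write out per record here.
    df.show(truncate=False)

query = (cdc.writeStream
         .foreachBatch(handle_batch)
         .option("checkpointLocation", "/tmp/cdc-checkpoint")  # placeholder path
         .start())
query.awaitTermination()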
If you want to process per record, then use the Spring Boot Kafka setup for consuming Kafka messages; it can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
https://medium.com/#contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single message processing approach.
Dear all,
I am considering options for how to use StreamSets properly in a given generic Data Hub architecture:
I have several data types (CSV, TSV, JSON, binary from IoT) that need to be captured by CDC and saved into a Kafka topic in their original format, and then written to the HDFS Data Lake as-is.
Then another StreamSets pipeline will consume from this Kafka topic, convert the data (depending on its type) to a common JSON format, perform validations, masking, add metadata, etc., and save the result to another Kafka topic.
The same JSON messages will be saved into the HDFS Data Lake in Avro format for batch processing.
I will then use Spark Streaming to consume the same JSON messages for real-time processing, assuming the JSON data is ready and can be further enriched with other data for scalable, complex transformations.
I have not used StreamSets for the further processing, relying instead on Spark Streaming for the scalable complex transformations, which is therefore not part of the SLA management (as the Spark jobs are not triggered from within StreamSets). Also, I could not use the Kafka Schema Registry with Avro in this design to validate the JSON schema; instead, the JSON schema is validated by custom logic embedded in StreamSets as JavaScript.
What can be done better in the above design?
Thanks in advance...
Your pipeline design looks good.
However, I would recommend consolidating several of those steps using Striim.
Striim has built-in CDC (change data capture) from all the sources you listed, plus databases.
It has native Kafka integration, so you can write to and read from Kafka in the same pipeline.
Striim also has built-in caches and processing operators for enrichment, so you don't need to write Spark code to do it. Everything is done through our simple UI.
You can try it out here:
https://striim.com/instant-download
Full disclosure: I'm a PM at Striim.
I'm new to Kafka/AWS. My requirement is to load data from several sources into a DW (Redshift).
One of my sources is PostgreSQL. I found a good article on using Kafka to sync data into Redshift.
That article is good enough for syncing the data from PostgreSQL to Redshift, but my requirement is to transform the data before loading it into Redshift.
Can somebody help me with how to transform the data in Kafka (PostgreSQL -> Redshift)?
Thanks in Advance
Jay
Here's an article I just published on exactly this pattern, describing how to use Apache Kafka's Connect API, and KSQL (which is built on Kafka's Streams API) to do streaming ETL: https://www.confluent.io/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
You should check out Debezium for streaming events from Postgres into Kafka.
For this, you can use any stream-processing application, be it Storm, Spark, or Kafka Streams. These applications consume data from different sources, and the data transformation can be done on the fly. All three have their own advantages and complexity.
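If you'd rather keep it to a plain consumer/producer loop than a full framework, a minimal sketch looks like this; it assumes the confluent-kafka-python client, JSON-encoded change events, and made-up topic and field names, with the transformed topic being what your Redshift sink then loads:

import json
from confluent_kafka import Consumer, Producer  # assumption: confluent-kafka-python client

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",       # placeholder broker
    "group.id": "pg-to-redshift-transform",      # placeholder consumer group
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["pg.public.orders"])         # placeholder CDC topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Example transformation: derive a dollar amount from a cents field.
    record["amount_usd"] = round(record.get("amount_cents", 0) / 100.0, 2)
    producer.produce("orders_transformed", json.dumps(record).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks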
In the Hadoop world, Flume or Kafka is used to stream or collect data and store it in Hadoop. I am just wondering whether MongoDB has some similar mechanisms or tools to achieve the same?
MongoDB is just the database layer, not a complete solution like the Hadoop ecosystem. I actually use Kafka along with Storm to store data in MongoDB in cases where there is a very large flow of incoming data that needs to be processed and stored.
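As a rough sketch of just the consume-and-store step (leaving out the Storm processing layer, and assuming the confluent-kafka-python and pymongo clients with placeholder broker, topic, and collection names):

import json
from confluent_kafka import Consumer  # assumption: confluent-kafka-python client
from pymongo import MongoClient       # assumption: pymongo is available

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "mongo-loader",             # placeholder consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])              # placeholder topic
collection = MongoClient("mongodb://localhost:27017")["analytics"]["events"]

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    collection.insert_one(json.loads(msg.value()))  # one document per Kafka message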
Although Flume is frequently used and treated as a member of the Hadoop ecosystem, it's not impossible to use it with other sources/sinks, and MongoDB is no exception. In fact, Flume is flexible enough to be extended to create your own custom sources/sinks. See this project, for example: it's a custom Flume-to-MongoDB sink.