Google Pub/Sub vs Kafka comparison on pipeline restart - apache-kafka

I am trying to write an ingestion application on GCP using Apache Beam. It should run in streaming mode, reading data from Kafka or Pub/Sub topics and then ingesting the data into a data store.
While it seems straightforward to write this with Pub/Sub and Apache Beam, my question is: what happens if my ingestion fails or is restarted? Does it read all the data from the start of the Pub/Sub topic again, or can it, like Kafka, resume from the latest committed offsets in the topic?

Pub/Sub messages are persisted until they are delivered and acknowledged by the subscribers, which receive pending messages from their subscription. Once a message is acknowledged, it is removed from the subscription's queue. So if your pipeline fails or is restarted and reads from the same subscription, it resumes with the messages that have not yet been acknowledged; it does not re-read everything from the start of the topic.
For more information regarding the message flow, check this document.
Hope it helps.
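For illustration, a minimal sketch of such a streaming pipeline with the Beam Python SDK is below; the project and subscription names are placeholders, and the final step stands in for your real sink. Because it reads from a subscription rather than the topic itself, a restarted pipeline resumes with the messages that were never acknowledged instead of replaying the whole topic.
```python
# Minimal streaming read from a Pub/Sub subscription with Apache Beam.
# "my-project" and "my-subscription" are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        # Reading from a subscription means unacknowledged messages are
        # redelivered after a restart; acknowledged ones are not re-read.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-subscription")
        | "Decode" >> beam.Map(lambda raw: raw.decode("utf-8"))
        | "Ingest" >> beam.Map(print)  # replace with your actual sink
    )
```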

Related

Can Kafka publish messages to AWS Lambda

I have to publish messages from a Kafka topic to Lambda, process them, and store them in a database using a Spring Boot application. I did some research and found something to consume messages from Kafka:
public Function<KStream<String, String>, KStream<String, String>> process(){}
However, I'm not sure whether this is only used to publish the consumed messages to another Kafka topic, or whether it can be used as an event source for Lambda. I need some guidance on consuming Kafka messages and converting them into an event source for Lambda.
Brokers do not push. Consumers always poll.
The code shown is for the Kafka Streams API, which primarily writes to new Kafka topics. While you could fire HTTP events from your code to invoke a Lambda, that's not recommended.
Alternatively, Kafka is already supported as a Lambda event source, so you don't need to write any consumer code.
https://aws.amazon.com/about-aws/whats-new/2020/12/aws-lambda-now-supports-self-managed-apache-kafka-as-an-event-source/
This is possible from MSK or a self-managed Kafka cluster.
process them and store in a database
Your Lambda could process the data and send it to a new Kafka topic using a producer. You could then use MSK Connect, or run your own Kafka Connect cluster elsewhere, to dump records into a database. No Spring/Java code would be necessary.
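As a rough illustration of what the Kafka event source mapping hands to your function, here is a minimal Python handler sketch. The event field names follow the documented Kafka event shape; save_to_database is a hypothetical placeholder for your persistence code.
```python
# Sketch of a Lambda handler for an MSK / self-managed Kafka event source.
import base64
import json


def handler(event, context):
    # Records are grouped per topic-partition, e.g. "mytopic-0".
    for _topic_partition, records in event.get("records", {}).items():
        for record in records:
            # Message values arrive base64-encoded.
            payload = base64.b64decode(record["value"]).decode("utf-8")
            save_to_database(json.loads(payload))


def save_to_database(message):
    # Hypothetical placeholder: swap in your real database client.
    print(message)
```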

Kafka to BigQuery, best way to consume messages

I need to get messages into my BigQuery tables and I want to know what is the best way to consume those messages.
My Kafka servers, which are on AWS, produce AVRO messages, and from what I saw, Dataflow needs to receive JSON-format messages. So I googled and found an article explaining how to push messages to Pub/Sub, but in that kind of architecture they create a Kafka VM on GCP to produce the messages.
What I need to know is:
Is it possible to receive AVRO messages on Pub/Sub from external Kafka servers, then deserialize the messages using my schema, send them to Dataflow, and finally write them to BigQuery tables?
Or do I need to create a Kafka VM and use it to consume messages from the external servers?
This might seem a bit confusing, but it is where I am right now. The main goal here is to get messages from Kafka (AVRO format) at AWS into BigQuery tables. Any suggestions are very welcome.
Thanks a lot in advance
The Kafka Connect BigQuery Connector may be exactly what you need. It is a Kafka sink connector that allows you to export messages from Kafka directly to BigQuery. The README page provides detailed configuration instructions, including how to point the connector at your Kafka topics and how to enter the information for the destination BigQuery table. The connector should also be able to retrieve the AVRO schema automatically from your Schema Registry.
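As a hedged sketch of what registering that connector could look like, the example below posts a configuration to the Kafka Connect REST API from Python. The property names follow the connector's README but may differ between versions, and the Connect host, topic, GCP project, dataset, key file, and Schema Registry URL are all placeholders.
```python
# Register the BigQuery sink connector via the Kafka Connect REST API.
import json
import requests

connector = {
    "name": "kafka-to-bigquery",
    "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "topics": "my-avro-topic",                 # placeholder topic
        "project": "my-gcp-project",               # placeholder GCP project
        "defaultDataset": "my_dataset",            # placeholder dataset
        "keyfile": "/secrets/gcp-service-account.json",
        # With the Avro converter the schema is fetched from Schema Registry,
        # so there is no need to convert the messages to JSON first.
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

response = requests.post(
    "http://connect-worker:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
response.raise_for_status()
print(response.json())
```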

Reading all offsets from Kafka in Python

I am trying to read all messages in a Kafka topic. I am using the Confluent Cloud service, so I don't run Kafka on my localhost. I set the configuration as: 'enable.auto.commit': 'True', 'auto.offset.reset': 'earliest', 'default.topic.config': {'auto.offset.reset': 'smallest'}. However, it gives me no messages, or if I send a message from the producer at the same time, it gives only that message, not the messages from all earlier offsets.
How can I read the messages from all offsets in Python?
I haven't used the Confluent Cloud service, but if you want to consume from all offsets, there are a few things to pay attention to (see the sketch after this list):
Use a new consumer group.id that has not consumed any data yet.
Set 'auto.offset.reset=earliest' (or 'auto.offset.reset=smallest', depending on your Kafka client version).
Pay attention to the topic's retention period: messages that have already expired can no longer be read.
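Putting those points together, here is a minimal sketch using confluent-kafka-python against Confluent Cloud. The broker address, API key/secret, and topic name are placeholders; the fresh group.id combined with auto.offset.reset=earliest is what makes the consumer start from the beginning of each partition.
```python
# Read a topic from the earliest retained offset with a brand-new group id.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "<broker>:9092",        # placeholder
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",                # placeholder
    "sasl.password": "<api-secret>",             # placeholder
    "group.id": "read-everything-1",             # group id with no committed offsets
    "auto.offset.reset": "earliest",             # start from the beginning
    "enable.auto.commit": True,
})

consumer.subscribe(["my-topic"])                 # placeholder topic
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()
```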

Put() vs Flush() in Kafka Connector Sink Task

I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the https://kafka.apache.org/documentation/#connect documentation and am confused about where the logic of sending records has to be implemented. Please help me understand how records are processed internally and whether put() or flush() should be used to process records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's ElasticSearch sink connector, on the other hand, simply accumulates all of the messages for a series of put(...) methods and only writes them to Elasticsearch during flush(...).
The frequency with which offsets are committed for source and sink connectors is controlled by the worker's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but frequent enough to cap the potential amount of re-processing should a connector task die unexpectedly. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying which messages had been processed. Thus, only after restarting from such a failure will the connector potentially re-process some messages that it had already processed just prior to the failure. And it is because messages may be seen at least once that message processing should be idempotent. Take all of this plus your connectors' behavior into account when determining appropriate values for this setting.
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.

Apache Kafka Consumer-Producer Confusion

I know what Producers and Consumers are, but the official documentation says:
It is a streaming platform.
It is an enterprise messaging system.
Kafka has connectors which import and export data from databases and other systems.
What does this mean?
I know Producers are client applications that send data to the Kafka broker, and Consumers are also client applications that read data from the Kafka broker.
But my question is, can a Consumer push data to the Kafka broker?
And as per my understanding, if a Consumer wants to push data to the Kafka broker, it becomes a Producer. Is that correct?
1. It is a streaming platform.
It is used to distribute data using a publish-subscribe model, with a storage layer and a processing layer.
2. It is an enterprise messaging system.
Big Data infrastructure is largely open source, yet the big data market is worth approximately $40B per year and keeps growing, much of it in the hardware that hosts this software. Despite the open source nature of much of the software, there's a lot of money to be made.
3. Kafka has connectors which import and export data from databases and other systems.
Kafka Connect provides connectors, i.e. source connectors, sink connectors, and the JDBC connector. It provides a facility for importing data from sources and exporting it to multiple targets.
Producers: they can only push data to a Kafka broker, or we can say publish data.
Consumers: they can only pull data from the Kafka broker.
A producer produces/puts/publishes messages and a consumer consumes/gets/reads messages.
A consumer can only read; when you want to write, you need a producer. A consumer cannot become a producer.
A producer only pushes data to a Kafka broker.
A consumer only pulls data from a Kafka broker.
However, you can have a program that is both a producer and a consumer.
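For example, here is a minimal sketch (confluent-kafka-python, with placeholder broker and topic names) of a single program acting as both: it consumes from one topic and produces a transformed copy of each message to another.
```python
# One process that is both a consumer and a producer.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "copy-app",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["input-topic"])          # placeholder topics
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        transformed = msg.value().decode("utf-8").upper()  # trivial transform
        producer.produce("output-topic", transformed.encode("utf-8"))
        producer.poll(0)                     # serve delivery callbacks
finally:
    producer.flush()
    consumer.close()
```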