How does Beam estimate watermarks? - apache-beam

I am a beginner with Apache Beam and very curious to understand its internals.
I have read some pages and watched some videos, and they all explain how watermarks help to handle the completeness and lateness of an infinite stream.
Basically, handling late data. But none of them explained how Apache Beam actually estimates the watermark.
Can you help me understand the basics of watermarks?
How does Apache Beam estimate the watermarks?
Could you also point me to some docs that can help me understand the basics of this?

The Beam programming guide is quite thorough on this topic:
Beam programming guide
See in particular sections 8.4 (Watermarks and late data) and 8.4.1 (Managing late data).
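As for where the watermark comes from: in Beam it is estimated by each source (the I/O connector typically advances its watermark based on the timestamps of the records it has already read), and the runner propagates it through the pipeline. To make the "managing late data" part concrete, here is a minimal Beam Java sketch of windowing with a late-firing trigger and allowed lateness; the input PCollection and the concrete durations are assumptions for illustration, not anything prescribed by the guide:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class LateDataExample {

    // 'events' is assumed to be an unbounded PCollection<String> read from some source
    static PCollection<KV<String, Long>> countWithLateData(PCollection<String> events) {
        return events
            // One-minute fixed windows based on event time
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Fire when the watermark passes the end of the window,
                // then fire again for every late element that still arrives
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))
                // Keep window state for 10 minutes past the watermark
                // so late data within that bound is still processed
                .withAllowedLateness(Duration.standardMinutes(10))
                .accumulatingFiredPanes())
            .apply(Count.perElement());
    }
}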

Related

ACID issues: managing distributed transactions using Kafka Streams

While researching distributed transactions in a domain-service (microservice) architecture, I found this article talking about how to solve this problem using Kafka Streams.
Concretely, it ends up creating a KTable by joining two streams in order to maintain the current order states.
On the other hand, I read this other article, kafka is not a database, which talks about the ACID issues that can arise when using Kafka as a database.
I can't quite reconcile the two articles.
Could somebody guide me a bit?
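For illustration only, the pattern the first article describes (joining two streams and keeping the latest order state in a KTable) looks roughly like the following Kafka Streams sketch; the topic names, types, join window, and the derived state values are assumptions, not taken from the article:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.StreamJoined;

public class OrderStateTopology {

    static KTable<String, String> build(StreamsBuilder builder) {
        // Orders and payments, both keyed by order id (topic names are assumptions)
        KStream<String, String> orders =
            builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> payments =
            builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));

        // Left join the two streams within a 10-minute window and derive an order state
        KStream<String, String> orderStates = orders.leftJoin(
            payments,
            (order, payment) -> payment == null ? "PENDING" : "PAID",  // placeholder logic
            JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)),
            StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        // Keep only the latest state per order id as a KTable
        return orderStates.toTable();
    }
}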

Filtering in Kafka and other streaming technologies

I am currently doing some research about which stream processing technology to use. So far I have looked at message queueing technologies and streaming frameworks. I am now leaning towards Apache Kafka or Google Pub/Sub.
The requirements I have:
Deliver, read and process messages/events in real time.
Persistence in the messages/events.
Ability to filter messages/events in real time without having to read the entire topic. For example: if I have a topic called 'details', I want to be able to filter out the messages/events of that topic where an attribute of an event equals a certain value.
Ability to see if the producer to a certain topic or queue is finished.
Ability to delete messages/events in a topic based on an attribute within an event equaling a certain value.
Ordering in messages/events.
My question is: what is the best framework/technology for these use cases? From what I have read so far, Kafka doesn't provide out-of-the-box filtering of messages/events in topics, while Google Pub/Sub does offer a filtering feature.
Any suggestions and experience would be welcome.
As per the requirements you mentioned, Kafka seems like a nice fit. Using Kafka Streams or KSQL you can perform filtering in real time; here is an example: https://kafka-tutorials.confluent.io/filter-a-stream-of-events/confluent.html
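Roughly what that tutorial does, as a minimal Kafka Streams sketch; the topic names, the attribute being filtered on, and the broker address are assumptions:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class FilterDetails {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-details");      // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("details", Consumed.with(Serdes.String(), Serdes.String()))
            // Keep only events whose (assumed) JSON payload contains the desired attribute value;
            // in practice you would deserialize the event and test the field properly
            .filter((key, value) -> value != null && value.contains("\"status\":\"ACTIVE\""))
            .to("details-filtered", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}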
What you need is more than just integration and data transfer; you need something similar to what is known as an ETL tool. Here you can find more about ETL and the tools available in GCP: https://cloud.google.com/learn/what-is-etl

Execute a calculation on each incoming data record in kafka

I am consuming a Kafka topic and I want to execute a calculation whenever a new data record arrives. The calculation should be done on the data of the incoming record and the two previous ones (as shown in the picture linked here). Is it possible to somehow buffer the last two records so that I can operate on them together with the new record?
Example
As mentioned by mike, this is a very broad question. From what you wrote, this looks like something you could do quite well with Kafka Streams. You might want to have a look at this intro.
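As a rough illustration of how that could look with Kafka Streams, here is a sketch that keeps the two previous values per key in a state store and combines them with each new record; topic names, serdes, the broker address, and the placeholder calculation are all assumptions:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class LastThreeRecords {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "last-three-calc");     // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker

        StreamsBuilder builder = new StreamsBuilder();

        // State store that keeps the previous two values per key
        builder.addStateStore(Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("previous-values"),
            Serdes.String(), Serdes.String()));

        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
            .transformValues(() -> new ValueTransformerWithKey<String, String, String>() {
                private KeyValueStore<String, String> store;

                @Override
                @SuppressWarnings("unchecked")
                public void init(ProcessorContext context) {
                    store = (KeyValueStore<String, String>) context.getStateStore("previous-values");
                }

                @Override
                public String transform(String key, String value) {
                    String prev1 = store.get(key + ":prev1");  // most recent previous record
                    String prev2 = store.get(key + ":prev2");  // record before that
                    // Placeholder "calculation": combine the three values; replace with real logic
                    String result = value + "|" + prev1 + "|" + prev2;
                    // Shift the buffer for the next record
                    store.put(key + ":prev2", prev1);
                    store.put(key + ":prev1", value);
                    return result;
                }

                @Override
                public void close() { }
            }, "previous-values")
            .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}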
You can also achieve this simply by using KSQL. KSQL is a SQL streaming engine for Apache Kafka. It provides an easy-to-use, interactive, SQL-like interface for stream processing on Kafka, without the need to write code in a programming language like Java or Python.
You can find the tutorials here: https://docs.confluent.io/current/ksqldb/tutorials/index.html

How to do the transformations in Kafka (PostgreSQL -> Redshift)

I'm new to Kafka/AWS. My requirement is to load data from several sources into a DW (Redshift).
One of my sources is PostgreSQL. I found a good article on using Kafka to sync data into Redshift.
The article is good enough for syncing the data from PostgreSQL to Redshift, but my requirement is to transform the data before loading it into Redshift.
Can somebody help me understand how to transform the data in Kafka (PostgreSQL -> Redshift)?
Thanks in advance,
Jay
Here's an article I just published on exactly this pattern, describing how to use Apache Kafka's Connect API, and KSQL (which is built on Kafka's Streams API) to do streaming ETL: https://www.confluent.io/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
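The overall shape of that pattern is: a Connect source (for example Debezium for PostgreSQL) writes change events to a Kafka topic, a stream processor transforms them into a new topic, and a Redshift/JDBC sink connector loads the result. The article uses KSQL for the middle step; here is a rough alternative sketch of that middle step using the Kafka Streams API directly, where the topic names, broker address, and the transformation itself are assumptions:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class PgToRedshiftTransform {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pg-to-redshift-transform");  // assumed
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");         // assumed

        StreamsBuilder builder = new StreamsBuilder();

        // "pg.public.orders" is an assumed topic produced by a Connect/Debezium source;
        // "orders_transformed" is an assumed topic read by the Redshift sink connector.
        builder.stream("pg.public.orders", Consumed.with(Serdes.String(), Serdes.String()))
            // Placeholder transformation: normalize the payload before it reaches Redshift.
            // Replace with real parsing/enrichment/filtering logic.
            .mapValues(value -> value == null ? null : value.toUpperCase())
            .to("orders_transformed", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}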
You should check out Debezium for streaming events from Postgres into Kafka.
For this, you can use any streaming framework, be it Storm, Spark, or Kafka Streams. These applications consume data from different sources, and the data transformation can be done on the fly. All three have their own advantages and complexities.

How to send matching Kafka data to another topic

How can I push data that matches between topic 1 and topic 2 into another topic 3 while messages flow from producer to consumer?
I have not worked with Spark, but I can give you some direction from an Apache Storm perspective: Apache Storm
Build a topology with 2 Kafka spouts, each consuming from topic1 and topic2.
Consume this data in a bolt and compare the data. You may use a single bolt or a series of successive bolts. You may need some persistence, e.g. MongoDB, or something such as Redis or Memcached, depending on your comparison logic.
Push the common data to a new Kafka topic: Send data to kafka from Storm using kafka bolt
This is a very Apache Storm specific solution, and may not be the most ideal, suitable, or efficient one, but it aims to give the general idea.
Here is a link to the basic concepts in Storm: Storm Concepts
I've been working with Spark for over six months now, and yes, it is absolutely possible; to be honest, it's fairly simple. But bringing in Spark is a bit of overkill for this problem. What about Kafka Streams? I have never worked with it, but it should solve exactly this kind of problem.
If you want to use Spark:
Use the Spark Kafka integration (I used spark-streaming-kafka-0-10) to consume and produce the data; it should be very simple. Then look at the Spark Streaming API in the documentation.
A simple join of the 2 DStreams should solve the problem. If you want to store the data that doesn't match, you can window it or use the updateStateByKey function. I hope it helps someone. Good luck :)
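Since Kafka Streams comes up in both answers, here is a rough sketch of the windowed stream-stream join approach there (not the Spark version); the topic names, serdes, broker address, join window, and the way "matching" records are combined are all assumptions:

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class MatchingTopics {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-matcher");       // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> topic1 =
            builder.stream("topic1", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> topic2 =
            builder.stream("topic2", Consumed.with(Serdes.String(), Serdes.String()));

        // Inner join: only records whose keys appear in both topics within the
        // 5-minute window are emitted, i.e. the "matching" data.
        topic1.join(
                topic2,
                (v1, v2) -> v1 + "," + v2,                                     // how matches are combined (assumed)
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
            .to("topic3", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}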