Using PostgreSQL as a Flink sink: why can't Kryo serialize PgConnection?

I wrote a PostgreSQL sink:
class PGTwoPhaseCommitSinkFunction
  extends TwoPhaseCommitSinkFunction[Row, PgConnection, Void](
    new KryoSerializer[PgConnection](classOf[PgConnection], new ExecutionConfig),
    VoidSerializer.INSTANCE)
But when I use it, I find that PgConnection can't be serialized, and the exception is:
com.esotericsoftware.kryo.KryoException: Error constructing instance of class: sun.nio.cs.UTF_8
How can I handle this? Thanks.

The transaction object specified as the second generic parameter should first of all be Serializable, and I don't think it's right to use PgConnection for it. Instead, it should be a custom lightweight transaction state object holding transaction metadata such as an id. For example, below is the transaction state used by FlinkKafkaProducer:
/**
 * State for handling transactions.
 */
@VisibleForTesting
@Internal
static class KafkaTransactionState {

    private final transient FlinkKafkaInternalProducer<byte[], byte[]> producer;

    @Nullable
    final String transactionalId;

    final long producerId;

    final short epoch;

    KafkaTransactionState(String transactionalId, FlinkKafkaInternalProducer<byte[], byte[]> producer) {
        this(transactionalId, producer.getProducerId(), producer.getEpoch(), producer);
    }

    KafkaTransactionState(FlinkKafkaInternalProducer<byte[], byte[]> producer) {
        this(null, -1, (short) -1, producer);
    }

    KafkaTransactionState(
            @Nullable String transactionalId,
            long producerId,
            short epoch,
            FlinkKafkaInternalProducer<byte[], byte[]> producer) {
        this.transactionalId = transactionalId;
        this.producerId = producerId;
        this.epoch = epoch;
        this.producer = producer;
    }

    boolean isTransactional() {
        return transactionalId != null;
    }

    ...
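For Postgres, a comparable lightweight state could carry just the identifier used with PREPARE TRANSACTION / COMMIT PREPARED and keep the actual connection transient. The sketch below is purely illustrative (the class and field names are made up, not taken from any released connector):

import java.io.Serializable;
import java.sql.Connection;

public class PgTransactionState implements Serializable {
    private static final long serialVersionUID = 1L;

    // Identifier passed to PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED.
    final String preparedTransactionId;

    // The live connection is never serialized; it is re-opened after a restart.
    transient Connection connection;

    PgTransactionState(String preparedTransactionId, Connection connection) {
        this.preparedTransactionId = preparedTransactionId;
        this.connection = connection;
    }
}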
However, as is rightly pointed out in this thread, writing and maintaining a two-phase-commit sink is a tricky and very difficult task, and it's better to use the Table API with the JDBC connector plus the Postgres driver.
Here is an example of a pipeline writing data into Postgres.
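That example aside, a minimal sketch of such a Table API + JDBC pipeline could look like the following. The table name, columns, URL, and credentials are made up, and it assumes flink-connector-jdbc plus the Postgres driver are on the classpath:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PostgresJdbcSinkSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
            TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // A JDBC sink table backed by Postgres; adjust URL, credentials, and columns.
        tableEnv.executeSql(
            "CREATE TABLE people_sink (" +
            "  id BIGINT," +
            "  name STRING," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://localhost:5432/mydb'," +
            "  'table-name' = 'people'," +
            "  'username' = 'postgres'," +
            "  'password' = 'secret'" +
            ")");

        // Any query result can then be written with a regular INSERT INTO.
        tableEnv.executeSql("INSERT INTO people_sink VALUES (1, 'John'), (2, 'Jane')");
    }
}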

Related

Deleting element from Store after ExpirePolicy

Environment: I am running Apache Ignite v2.13.0 for the cache, and the cache store persists to MongoDB v3.6.0. I am also using Spring Boot (Java).
Question: When I have an expiration policy set, how do I remove the corresponding data from my persistent database?
What I have attempted: I tried using a CacheEntryExpiredListener, but my print statement never gets triggered. Is this the proper way to solve the problem?
Here is a sample bit of code:
@Service
public class CacheRemovalListener implements CacheEntryExpiredListener<Long, Metrics> {

    @Override
    public void onExpired(Iterable<CacheEntryEvent<? extends Long, ? extends Metrics>> events) throws CacheEntryListenerException {
        for (CacheEntryEvent<? extends Long, ? extends Metrics> event : events) {
            System.out.println("Received a " + event);
        }
    }
}
Use Continuous Query to get notifications about Ignite data changes.
ExecutorService mongoUpdateExecutor = Executors.newSingleThreadExecutor();

CacheEntryUpdatedListener<Integer, Integer> lsnr = new CacheEntryUpdatedListener<Integer, Integer>() {
    @Override
    public void onUpdated(Iterable<CacheEntryEvent<? extends Integer, ? extends Integer>> evts) {
        for (CacheEntryEvent<?, ?> e : evts) {
            if (e.getEventType() == EventType.EXPIRED) {
                // Use a separate executor to avoid blocking Ignite threads
                mongoUpdateExecutor.submit(() -> removeFromMongo(e.getKey()));
            }
        }
    }
};

var qry = new ContinuousQuery<Integer, Integer>()
    .setLocalListener(lsnr)
    .setIncludeExpired(true);

// Start receiving updates.
var cursor = cache.query(qry);

// Stop receiving updates.
cursor.close();
Note 1: EXPIRED events should be enabled explicitly with ContinuousQuery#setIncludeExpired.
Note 2: Query listeners should not perform any heavy/blocking operations. Offload that work to a separate thread/executor.
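For completeness, removeFromMongo(...) in the snippet above is a hypothetical helper. With Spring Boot it could be a thin wrapper around MongoTemplate; the collection name and key-to-_id mapping below are assumptions:

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

public class MongoRemovalHelper {

    private final MongoTemplate mongoTemplate;

    public MongoRemovalHelper(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Delete the document whose _id matches the expired cache key.
    public void removeFromMongo(Object key) {
        mongoTemplate.remove(Query.query(Criteria.where("_id").is(key)), "metrics");
    }
}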

Joining streams Flink doesn't work with Kafka consumer

I'm trying to join two streams, one created from a collection of elements and one consumed from Kafka.
Code snippet:
public static void main(String[] args) {
    KafkaSource<JsonNode> kafkaSource = ...
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Kafka messages: {"name": "John"}
    final DataStream<JsonNode> dataStream1 = env.fromSource(kafkaSource, waterMark(), "Kafka").rebalance()
        .assignTimestampsAndWatermarks(waterMark());

    final DataStream<String> dataStream2 = env.fromElements("John", "Zbe", "Abe")
        .assignTimestampsAndWatermarks(waterMark());

    dataStream1
        .join(dataStream2)
        .where(new KeySelector<JsonNode, String>() {
            @Override
            public String getKey(JsonNode value) throws Exception {
                return value.get("name").asText();
            }
        })
        .equalTo(new KeySelector<String, String>() {
            @Override
            public String getKey(String value) throws Exception {
                return value;
            }
        })
        .window(SlidingEventTimeWindows.of(Time.minutes(50) /* size */, Time.minutes(10) /* slide */))
        .apply(new JoinFunction<JsonNode, String, String>() {
            @Override
            public String join(JsonNode first, String second) throws Exception {
                return first + " " + second;
            }
        }).print();

    env.execute();
}
watermark
private static <T> WatermarkStrategy<T> waterMark() {
    return new WatermarkStrategy<T>() {
        @Override
        public WatermarkGenerator<T> createWatermarkGenerator(
                org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier.Context context) {
            return new AscendingTimestampsWatermarks<>();
        }

        @Override
        public TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
            return (event, timestamp) -> System.currentTimeMillis();
        }
    };
}
After running this code, there is no joined data in the output. Am I going wrong somewhere?
Apache Flink version: 1.13.2
The problem is probably related to watermarking. Since you're not using event-time-based timestamps, try changing SlidingEventTimeWindows to SlidingProcessingTimeWindows and see if it then produces results.
The underlying problem is probably a lack of data. The rebalance() on the Kafka stream guarantees that idle partitions won't stall the watermarks unless all partitions are idle. But if this is an unbounded streaming job, unless you have some data that falls after the first window, the watermark won't advance far enough to trigger the first window.
Options:
Send some data with larger timestamps
Configure the Kafka source as a bounded stream by using the .setBounded(...) option on the KafkaSource builder (see the sketch at the end of this answer)
Stop the job using the --drain option (docs)
The fact that dataStream2 is bounded is also a problem, but I'm not sure how much of one. At best this will prevent any windows after the first one from producing any results (since datastream joins are inner joins).
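A minimal sketch of the second option follows, with an assumed topic name and plain string values. setBounded(...) makes the source finish at the current end offsets, so a final watermark is emitted and the last open window can fire:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

KafkaSource<String> boundedSource = KafkaSource.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setTopics("names")                              // hypothetical topic
    .setGroupId("join-demo")
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setBounded(OffsetsInitializer.latest())         // read up to the current end offsets, then finish
    .build();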

Invoking Kafka Interactive Queries from inside a Stream

I have a particular requirement: invoking an Interactive Query from inside a stream. This is because I need to create a new stream whose data comes from the state store. Truncated code below:
tempModifiedDataStream.to(topic.getTransformedTopic(), Produced.with(Serdes.String(), Serdes.String()));

GlobalKTable<String, String> myMetricsTable = builder.globalTable(
    topic.getTransformedTopic(),
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
        topic.getTransformedStoreName() /* table/store name */)
        .withKeySerde(Serdes.String()) /* key serde */
        .withValueSerde(Serdes.String()) /* value serde */
);

KafkaStreams streams = new KafkaStreams(builder.build(), kStreamsConfigs());

KStream<String, String> tempAggrDataStream = tempModifiedDataStream
    .flatMap((key, value) -> {
        try {
            List<KeyValue<String, String>> result = new ArrayList<>();
            ReadOnlyKeyValueStore<String, String> keyValueStore =
                streams.store(
                    topic.getTransformedStoreName(),
                    QueryableStoreTypes.keyValueStore());
In the last line, to access the state store I need the KafkaStreams object, but the topology is finalized when I create the KafkaStreams object. The problem with this approach is that tempAggrDataStream is therefore not part of the topology, so that part of the code never gets executed. And I can't move the KafkaStreams definition further down, because then I can't call the Interactive Query.
I am a bit new to Kafka Streams, so is this something silly on my side?
If you want to send the whole content of the topic downstream after each data modification, I think you should rather use the Processor API.
You could create an org.apache.kafka.streams.kstream.Transformer with a state store.
For each processed message, it will update the state store and send the whole content downstream.
This is not very efficient, because it forwards the entire content of the topic/state store (which can be thousands or millions of records) for every processed message.
If you only need the latest value, it is enough to set your topic's cleanup.policy to compact and, on the consuming side, use a KTable, which gives you the abstraction of a table (a snapshot of the stream).
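As a rough illustration of that simpler alternative, reading the compacted topic as a KTable could look like the fragment below; the topic and store names are made up:

// Read the compacted topic as a KTable; the table always holds the latest value per key.
KTable<String, String> latest = builder.table(
    "transformed-topic",                                   // hypothetical compacted topic
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("latest-values-store")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.String()));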
Back to the Transformer approach: sample Transformer code for forwarding the whole content of the state store is shown below. All the work is done in the transform(String key, String value) method.
public class SampleTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private String stateStoreName;
    private KeyValueStore<String, String> stateStore;
    private ProcessorContext context;

    public SampleTransformer(String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        stateStore = (KeyValueStore) context.getStateStore(stateStoreName);
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        stateStore.put(key, value);
        stateStore.all().forEachRemaining(keyValue -> context.forward(keyValue.key, keyValue.value));
        return null;
    }

    @Override
    public void close() {
    }
}
More information about the Processor API can be found here:
https://docs.confluent.io/current/streams/developer-guide/processor-api.html
https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html
How to combine the Processor API with the Streams DSL is described here:
https://kafka.apache.org/documentation/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration
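As a rough sketch of that integration (topic, store name, and serde choices are assumptions), the SampleTransformer above could be attached to the DSL by registering a state store on the builder and referencing it by name in transform():

StreamsBuilder builder = new StreamsBuilder();
String storeName = "content-store";  // hypothetical store name

// Register the key-value store that SampleTransformer expects.
builder.addStateStore(
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(storeName),
        Serdes.String(),
        Serdes.String()));

// Attach the transformer and give it access to the store by name.
builder.<String, String>stream("in-topic")
    .transform(() -> new SampleTransformer(storeName), storeName)
    .to("out-topic");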

Stream re-partitioning on DSL toplogy with selectKey and transform

I feel like I am probably missing something very basic, but I'll ask anyway.
There is an input topic with multiple partitions. I'm using selectKey as part of a DSL topology, and the selectKey always returns the same value. My expectation is that after the internal re-partitioning triggered by selectKey(), the next processor in the topology will be called on the same partition for the same key. However, the next processor, transform(), is called on different partitions for the same key.
Topology:
Topology buildTopology() {
    final StreamsBuilder builder = new StreamsBuilder();
    builder
        .stream("in-topic", Consumed.with(Serdes.String(), new JsonSerde<>(CatalogEvent.class)))
        .selectKey((k, v) -> "key")
        .transform(() -> new Processor())
        .print();
    return builder.build();
}
The Processor class used by transform():
public class Processor implements Transformer<String, CatalogEvent, KeyValue<String, DispEvent>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, DispEvent> transform(String key, CatalogEvent catalogEvent) {
        System.out.println("key:" + key + " partition:" + context.partition());
        return null;
    }

    @Override
    public KeyValue<String, DispEvent> punctuate(long timestamp) {
        // TODO Auto-generated method stub
        return null;
    }

    @Override
    public void close() {
        // TODO Auto-generated method stub
    }
}
"in-topic" has two messages with random UUIDs as keys i.e. "8f45e552-8886-4781-bb0c-79ca98f9d927", "a794ed2a-6f7d-4522-a7ac-27c51a64fa28", the payload is the same for both messages
The output from Processor::transform for two UUIDs are
key:key partition: 2
key:key partition: 0
How can I change the topology to make sure that messages with the same key will be arrived on the same partition - I need it to ensure that messages with the same key will go to the same local Kafka store instance (for inserting or updating).
For process() and [flat]transform[Values]() there is no auto-repartitioning. You need to insert a manual repartition() (or through() in older versions) call to repartition the data. If you compare their JavaDocs with those of groupBy() or join() (which do support auto-repartitioning), you will see that auto-repartitioning is not mentioned for them.
The reason is that those methods are part of the Processor API integration into the DSL and thus are not DSL operators. Their semantics are unknown, so Kafka Streams cannot tell whether they require repartitioning after the key was changed. To avoid unnecessary repartitioning, no auto-repartitioning is performed.
There is also a corresponding Jira: https://issues.apache.org/jira/browse/KAFKA-7608
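For illustration, here is a sketch of the topology with a manual repartition step inserted after selectKey(). It targets a newer Kafka Streams version: repartition() and Printed.toSysOut() require 2.6+, while on older versions you would use through() with an explicit topic instead:

Topology buildTopology() {
    final StreamsBuilder builder = new StreamsBuilder();
    builder
        .stream("in-topic", Consumed.with(Serdes.String(), new JsonSerde<>(CatalogEvent.class)))
        .selectKey((k, v) -> "key")
        // Force a repartition so all records with the new key land on one partition
        // before transform() runs.
        .repartition(Repartitioned.with(Serdes.String(), new JsonSerde<>(CatalogEvent.class)))
        .transform(() -> new Processor())
        .print(Printed.toSysOut());
    return builder.build();
}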

Processor node in kafka streams

I am working on a processor node in Kafka Streams. As a simple example, I have written the code below just to filter on UserID. Is this the correct way of building a processor node in Kafka Streams?
However, the code below doesn't compile and throws this error: The method filter(Predicate<? super Object,? super Object>) in the type KStream<Object,Object> is not applicable for the arguments (new Predicate<String,String>(){})
KStreamBuilder builder = new KStreamBuilder();

builder.stream(topic)
    .filter(new Predicate<String, String>() {
        //@Override
        public boolean test(String key, String value) {
            Hashtable<Object, Object> message;
            // put your processor logic here
            return message.get("UserID").equals("1");
        }
    })
    .to(streamouttopic);

final KafkaStreams streams = new KafkaStreams(builder, props);
final CountDownLatch latch = new CountDownLatch(1);

// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
    @Override
    public void run() {
        streams.close();
        latch.countDown();
    }
});

try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}
System.exit(0);
Can someone guide me please?
builder.stream(topic) returns a KStream<Object,Object> because you don't specify the generic types, and <Object,Object> is not compatible with <String,String>.
If you know that the actual type is KStream<String,String>, you can specify the types as follows:
builder.<String,String>stream(topic)
    .filter(...)
To answer your question about "processor nodes": yes, adding a filter() adds a processor node internally. Note that at the DSL level you usually don't need to think in terms of processors.
If you want to use processors explicitly, you can use the Processor API instead of the DSL. Check out the WordCount example: https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountProcessorDemo.java
Note that when using the DSL, the code is internally translated into a processor topology, which is the runtime model of Kafka Streams.
Probably you are using a Predicate class from another package. You need to use:
import org.apache.kafka.streams.kstream.Predicate;
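Putting both fixes together (the explicit generics and the Kafka Streams Predicate rather than java.util.function.Predicate), a minimal sketch of the corrected snippet could look like this; the string-matching filter body is just a stand-in for the real UserID check:

KStreamBuilder builder = new KStreamBuilder();
builder.<String, String>stream(topic)
    .filter((key, value) -> value.contains("\"UserID\":\"1\""))  // hypothetical stand-in for the real UserID logic
    .to(streamouttopic);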