Is Kafka streaming to Ignite running in Transactional mode ACID? - apache-kafka

I have a distributed database in Apache Ignite and an Apache Kafka streaming service that streams data to the Ignite cluster. The Kafka streamer works as follows:
Create an Ignite node to find the cluster
Start the Kafka streamer singleton as a service in the cluster
Shut down the Ignite node
The Ignite cluster is in Transactional mode; however, I am unsure whether this guarantees ACID or only enables it. Could this streaming service to Ignite be considered ACID?
Here is the code for the Kafka streamer:
public class IgniteKafkaStreamerService implements Service {
    private static final long serialVersionUID = 1L;

    @IgniteInstanceResource
    private Ignite ignite;

    private KafkaStreamer<String, JSONObject> kafkaStreamer = new KafkaStreamer<>();
    private IgniteLogger logger;

    public static void main(String[] args) throws InterruptedException {
        TcpDiscoverySpi spi = new TcpDiscoverySpi();
        TcpDiscoveryMulticastIpFinder ipFinder = new TcpDiscoveryMulticastIpFinder();
        // Set Multicast group.
        //ipFinder.setMulticastGroup("228.10.10.157");
        // Set initial IP addresses.
        // Note that you can optionally specify a port or a port range.
        ipFinder.setAddresses(Arrays.asList("127.0.0.1:47500..47509"));
        spi.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        // Override default discovery SPI.
        cfg.setDiscoverySpi(spi);

        Ignite ignite = Ignition.getOrStart(cfg);

        // Deploy data streamer service on the server nodes.
        ClusterGroup forServers = ignite.cluster().forServers();
        IgniteKafkaStreamerService streamer = new IgniteKafkaStreamerService();
        ignite.services(forServers).deployClusterSingleton("KafkaService", streamer);

        ignite.close();
    }

    @Override
    public void init(ServiceContext ctx) {
        logger = ignite.log();

        IgniteDataStreamer<String, JSONObject> stmr = ignite.dataStreamer("my_cache");
        stmr.allowOverwrite(true);
        stmr.autoFlushFrequency(1000);

        List<String> topics = new ArrayList<>();
        topics.add(0, "IoTData");

        kafkaStreamer.setIgnite(ignite);
        kafkaStreamer.setStreamer(stmr);
        kafkaStreamer.setThreads(4);
        kafkaStreamer.setTopic(topics);

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "NiFi-consumer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.1.242:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("group.id", "hello");
        kafkaStreamer.setConsumerConfig(props);

        kafkaStreamer.setSingleTupleExtractor(msg -> {
            JSONObject jsonObj = new JSONObject(msg.value().toString());
            String key = jsonObj.getString("id") + "," + new Date(msg.timestamp());
            JSONObject value = jsonObj.accumulate("date", new Date(msg.timestamp()));
            return new AbstractMap.SimpleEntry<>(key, value);
        });
    }

    @Override
    public void execute(ServiceContext ctx) {
        kafkaStreamer.start();
        logger.info("KafkaStreamer started.");
    }

    @Override
    public void cancel(ServiceContext ctx) {
        kafkaStreamer.stop();
        logger.info("KafkaStreamer stopped.");
    }
}

KafkaStreamer uses IgniteDataStreamer implementations under the hood. IgniteDataStreamer isn't transactional by nature, so there are no transactional guarantees: the cache's Transactional atomicity mode only enables transactions for explicit cache operations, it does not make the streamer's bulk updates ACID.
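If ACID semantics are needed, the writes would have to go through explicit cache operations inside a transaction rather than through the data streamer. A minimal sketch, assuming the cache "my_cache" above is configured with TRANSACTIONAL atomicity and reusing the key/value produced by the tuple extractor:

// Sketch only: transactional put instead of IgniteDataStreamer.addData().
IgniteCache<String, JSONObject> cache = ignite.cache("my_cache");

try (Transaction tx = ignite.transactions().txStart(
        TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
    cache.put(key, value); // key/value as built by the single-tuple extractor
    tx.commit();           // everything in this block commits atomically or not at all
}

This gives per-transaction atomicity and isolation, at the cost of the streamer's batching throughput.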

Related

How to Commit Kafka Offsets Manually in Flink

I have a Flink job that consumes a Kafka topic and sinks it to another topic. The job uses auto.commit with a 3-minute interval (checkpointing is disabled), so the monitoring side always shows a 3-minute lag. We want to monitor the processing in real time without that lag, so we would like the FlinkKafkaConsumer to commit the offset immediately after the sink function.
Is there a way to achieve this within the Flink framework?
Or any other options?
In the code below, I try to create a KafkaConsumer instance and call commitSync() to make this work, but it does not work.
public class CEPJobTest {
    private final static String TOPIC = "test";
    private final static String BOOTSTRAP_SERVERS = "localhost:9092";

    public static void main(String[] args) throws Exception {
        System.out.println("start cep test job...");
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "console-consumer-cep");
        properties.setProperty("enable.auto.commit", "false");
        // offset interval
        //properties.setProperty("auto.commit.interval.ms", "500");

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(), properties);
        // set commit offset by checkpoint
        consumer.setCommitOffsetsOnCheckpoints(false);
        System.out.println("checkpoint enabled:" + consumer.getEnableCommitOnCheckpoints());

        DataStream<String> stream = env.addSource(consumer);
        stream.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) throws Exception {
                return new Date().toString() + ": " + value;
            }
        }).print();

        // here, I want to commit offset manually after processing message...
        KafkaConsumer<?, ?> kafkaConsumer = new KafkaConsumer(properties);
        kafkaConsumer.commitSync();

        env.execute("Flink Streaming");
    }

    private static Consumer<Long, String> createConsumer() {
        final Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "KafkaExampleConsumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, LongDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        final Consumer<Long, String> consumer = new KafkaConsumer<>(props);
        return consumer;
    }
}
This does not work the way your code expects.
env.execute() submits the job to the cluster; only then does execution start. The code before that line only builds the job graph, it does not execute anything.
To commit after the sink, you should put the commit in your sink function:
class MySink extends RichSinkFunction<String> {
    @Override
    public void invoke(String value, Context context) throws Exception {
        KafkaConsumer<?, ?> kafkaConsumer = new KafkaConsumer<>(properties);
        kafkaConsumer.commitSync();
    }
}
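In practice the commit also needs a consumer that is created once (not per record) and an explicit offset to commit; the sink only sees the mapped value, so the partition and offset have to be carried along with each record somehow. A rough sketch under those assumptions (the Tuple3 layout, and getting offsets into the sink via a custom deserialization schema, are assumptions, not part of the original job):

// Sketch only: commit the offset of each record right after it is processed.
// Assumes elements arrive as Tuple3<String, Integer, Long> = (value, partition, offset).
class OffsetCommittingSink extends RichSinkFunction<Tuple3<String, Integer, Long>> {
    private final Properties props; // must contain key/value deserializers and group.id
    private transient KafkaConsumer<String, String> consumer;

    OffsetCommittingSink(Properties props) {
        this.props = props;
    }

    @Override
    public void open(Configuration parameters) {
        consumer = new KafkaConsumer<>(props); // one consumer per sink instance, not per record
    }

    @Override
    public void invoke(Tuple3<String, Integer, Long> record, Context context) {
        consumer.commitSync(Collections.singletonMap(
                new TopicPartition("test", record.f1),
                new OffsetAndMetadata(record.f2 + 1)));
    }

    @Override
    public void close() {
        consumer.close();
    }
}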

Spring Cloud Sleuth with Reactor Kafka

I'm using Reactor Kafka in a Spring Boot Reactive app, with Spring Cloud Sleuth for distributed tracing.
I've set up Sleuth to use a custom propagation key from a header named "traceId".
I've also customized the log format to print the header in my logs, so a request like
curl -H "traceId: 123456" -X POST http://localhost:8084/parallel
will print 123456 in every log anywhere downstream starting from the Controller.
I would now like this header to be propagated via Kafka too. I understand that Sleuth has built-in instrumentation for Kafka, so the header should be propagated automatically; however, I'm unable to get this to work.
From my Controller, I produce a message onto a Kafka topic, and then have another Kafka consumer pick it up for processing.
Here's my Controller:
@RestController
@RequestMapping("/parallel")
public class BasicController {
    private Logger logger = Loggers.getLogger(BasicController.class);
    KafkaProducerLoadGenerator generator = new KafkaProducerLoadGenerator();

    @PostMapping
    public Mono<ResponseEntity> createMessage() {
        int data = (int) (Math.random() * 100000);
        return Flux.just(data)
                .doOnNext(num -> logger.info("Generating document for {}", num))
                .map(generator::generateDocument)
                .flatMap(generator::sendMessage)
                .doOnNext(result ->
                        logger.info("Sent message {}, offset is {} to partition {}",
                                result.getT2().correlationMetadata(),
                                result.getT2().recordMetadata().offset(),
                                result.getT2().recordMetadata().partition()))
                .doOnError(error -> logger.error("Error in subscribe while sending message", error))
                .single()
                .map(tuple -> ResponseEntity.status(HttpStatus.OK).body(tuple.getT1()));
    }
}
Here's the code that produces messages onto the Kafka topic:
@Component
public class KafkaProducerLoadGenerator {
    private static final Logger logger = Loggers.getLogger(KafkaProducerLoadGenerator.class);
    private static final String bootstrapServers = "localhost:9092";
    private static final String TOPIC = "load-topic";
    private KafkaSender<Integer, String> sender;
    private static int documentIndex = 0;

    public KafkaProducerLoadGenerator() {
        this(bootstrapServers);
    }

    public KafkaProducerLoadGenerator(String bootstrapServers) {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.CLIENT_ID_CONFIG, "load-generator");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        SenderOptions<Integer, String> senderOptions = SenderOptions.create(props);
        sender = KafkaSender.create(senderOptions);
    }

    @NewSpan("generator.sendMessage")
    public Flux<Tuple2<DataDocument, SenderResult<Integer>>> sendMessage(DataDocument document) {
        return sendMessage(TOPIC, document)
                .map(result -> Tuples.of(document, result));
    }

    public Flux<SenderResult<Integer>> sendMessage(String topic, DataDocument document) {
        ProducerRecord<Integer, String> producerRecord = new ProducerRecord<>(topic, document.getData(), document.toString());
        return sender.send(Mono.just(SenderRecord.create(producerRecord, document.getData())))
                .doOnNext(record -> logger.info("Sent message to partition={}, offset={} ", record.recordMetadata().partition(), record.recordMetadata().offset()))
                .doOnError(e -> logger.error("Error sending message " + documentIndex, e));
    }

    public DataDocument generateDocument(int data) {
        return DataDocument.builder()
                .header("Load Data")
                .data(data)
                .traceId("trace" + data)
                .timestamp(Instant.now())
                .build();
    }
}
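One way to check whether any header reaches the broker at all is to attach the custom header by hand to the outgoing ProducerRecord; a minimal sketch (currentTraceId() is a hypothetical helper standing in for however the active trace value is obtained, it is not part of the code above):

// Sketch only: manually add the custom "traceId" header before handing the
// record to the KafkaSender, to verify that headers show up in Kafka Tool.
ProducerRecord<Integer, String> producerRecord =
        new ProducerRecord<>(topic, document.getData(), document.toString());
producerRecord.headers().add("traceId",
        currentTraceId().getBytes(StandardCharsets.UTF_8)); // currentTraceId() is hypothetical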
My consumer looks like this:
@Component
@Scope(scopeName = ConfigurableBeanFactory.SCOPE_PROTOTYPE)
public class IndividualConsumer {
    private static final Logger logger = Loggers.getLogger(IndividualConsumer.class);
    private static final String bootstrapServers = "localhost:9092";
    private static final String TOPIC = "load-topic";
    private int consumerIndex = 0;

    public ReceiverOptions setupConfig(String bootstrapServers) {
        Map<String, Object> properties = new HashMap<>();
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        properties.put(ConsumerConfig.CLIENT_ID_CONFIG, "load-topic-consumer-" + consumerIndex);
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "load-topic-multi-consumer-2");
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, IntegerDeserializer.class);
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, DataDocumentDeserializer.class);
        return ReceiverOptions.create(properties);
    }

    public void setIndex(int i) {
        consumerIndex = i;
    }

    @EventListener(ApplicationReadyEvent.class)
    public Disposable consumeMessage() {
        ReceiverOptions<Integer, DataDocument> receiverOptions = setupConfig(bootstrapServers)
                .subscription(Collections.singleton(TOPIC))
                .addAssignListener(receiverPartitions -> logger.debug("onPartitionsAssigned {}", receiverPartitions))
                .addRevokeListener(receiverPartitions -> logger.debug("onPartitionsRevoked {}", receiverPartitions));
        Flux<ReceiverRecord<Integer, DataDocument>> messages = Flux.defer(() -> {
            KafkaReceiver<Integer, DataDocument> receiver = KafkaReceiver.create(receiverOptions);
            return receiver.receive();
        });
        Consumer<? super ReceiverRecord<Integer, DataDocument>> acknowledgeOffset = record -> record.receiverOffset().acknowledge();
        return messages
                .publishOn(Schedulers.newSingle("Parallel-Consumer"))
                .doOnError(error -> logger.error("Error in the reactive chain", error))
                .delayElements(Duration.ofMillis(100))
                .doOnNext(record -> {
                    logger.info("Consumer {}: Received from partition {}, offset {}, data with index {}",
                            consumerIndex,
                            record.receiverOffset().topicPartition(),
                            record.receiverOffset().offset(),
                            record.value().getData());
                })
                .doOnNext(acknowledgeOffset)
                .doOnError(error -> logger.error("Error receiving record", error))
                .retryBackoff(100, Duration.ofSeconds(5), Duration.ofMinutes(5))
                .subscribe();
    }
}
I would expect Sleuth to automatically carry over the built-in Brave trace and the custom headers to the consumer, so that the trace covers the entire transaction.
However, I have two problems:
1. The generator bean doesn't get the same trace as the one in the Controller; it uses a different (and new) trace for every message sent.
2. The trace isn't propagated from the Kafka producer to the Kafka consumer.
I can resolve #1 above by replacing the generator bean with a simple Java class and instantiating it in the controller. However, that means I can't autowire other dependencies, and in any case it doesn't solve #2.
I am able to load an instance of the bean brave.kafka.clients.KafkaTracing, so I know it's being loaded by Spring. However, it doesn't look like the instrumentation is working: I inspected the content on Kafka using Kafka Tool, and no headers are populated on any message.
In fact, the consumer doesn't have a trace at all.
2020-05-06 23:57:32.898 INFO parallel-consumer:local [123-21922,578c510e23567aec,578c510e23567aec] 8180 --- [reactor-http-nio-3] rja.parallelconsumers.BasicController : Generating document for 23965
2020-05-06 23:57:32.907 INFO parallel-consumer:local [52e02d36b59c5acd,52e02d36b59c5acd,52e02d36b59c5acd] 8180 --- [single-11] r.p.kafka.KafkaProducerLoadGenerator : Sent message to partition=17, offset=0
2020-05-06 23:57:32.908 INFO parallel-consumer:local [123-21922,578c510e23567aec,578c510e23567aec] 8180 --- [single-11] rja.parallelconsumers.BasicController : Sent message 23965, offset is 0 to partition 17
2020-05-06 23:57:33.012 INFO parallel-consumer:local [-,-,-] 8180 --- [parallel-5] r.parallelconsumers.IndividualConsumer : Consumer 8: Received from partition load-topic-17, offset 0, data with index 23965
In the log above, [123-21922,578c510e23567aec,578c510e23567aec] is [custom-trace-header, brave traceId, brave spanId]
What am I missing?

What happens to the timestamp of a message in a stream when it's mapped into another stream?

I have an application where I process a stream and convert it into another. Here is a sample:
public void run(final String... args) {
    final Serde<Event> eventSerde = new EventSerde();
    final Properties props = streamingConfig.getProperties(
            applicationName,
            concurrency,
            Serdes.String(),
            eventSerde
    );
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, EXACTLY_ONCE);
    props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimestampExtractor.class);

    final StreamsBuilder builder = new StreamsBuilder();
    KStream<String, Event> eventStream = builder.stream(inputStream);

    final Serde<Device> deviceSerde = new DeviceSerde();
    eventStream
            .map((key, event) -> {
                final Device device = modelMapper.map(event, Device.class);
                return new KeyValue<>(key, device);
            })
            .to("device_topic", Produced.with(Serdes.String(), deviceSerde));

    final Topology topology = builder.build();
    final KafkaStreams streams = new KafkaStreams(topology, props);
    streams.start();
}
Here are some details about the app:
Spring Boot 1.5.17
Kafka 2.1.0
Kafka Streams 2.1.0
Spring Kafka 1.3.6
Although a timestamp is set in the messages inside the input stream, I also plug in an implementation of TimestampExtractor to make sure that a proper timestamp is attached to all messages (as other producers may send messages into the same topic).
Within the code, I receive a stream of events and I basically convert them into different objects and eventually route those objects into different streams.
I'm trying to understand whether the initial timestamp I set is still attached to the messages published into device_topic in this particular case.
The receiving end (of device stream) is like this:
@KafkaListener(topics = "device_topic")
public void onDeviceReceive(final Device device, @Header(KafkaHeaders.RECEIVED_TIMESTAMP) final long timestamp) {
    log.trace("[{}] Received device: {}", timestamp, device);
}
Unfortunately, the printed timestamp seems to be wall clock time. Is this the expected behaviour, or am I missing something?
Spring Kafka 1.3.x uses a very old 0.11 client; perhaps it doesn't propagate the timestamp. I just tested with Boot 2.1.3 and Spring Kafka 2.2.4 and the timestamp is propagated ok...
@SpringBootApplication
@EnableKafkaStreams
public class So54771130Application {

    public static void main(String[] args) {
        SpringApplication.run(So54771130Application.class, args);
    }

    @Bean
    public ApplicationRunner runner(KafkaTemplate<String, String> template) {
        return args -> {
            template.send("so54771130", 0, 42L, null, "baz");
        };
    }

    @Bean
    public KStream<String, String> stream(StreamsBuilder builder) {
        KStream<String, String> stream = builder.stream("so54771130");
        stream
                .map((k, v) -> {
                    System.out.println("Mapping:" + v);
                    return new KeyValue<>(null, "bar");
                })
                .to("so54771130-1");
        return stream;
    }

    @Bean
    public NewTopic topic1() {
        return new NewTopic("so54771130", 1, (short) 1);
    }

    @Bean
    public NewTopic topic2() {
        return new NewTopic("so54771130-1", 1, (short) 1);
    }

    @KafkaListener(id = "so54771130", topics = "so54771130-1")
    public void listen(String in, @Header(KafkaHeaders.RECEIVED_TIMESTAMP) long ts) {
        System.out.println(in + "#" + ts);
    }
}
and the output is:
Mapping:baz
bar#42

sending DataStream from VM socket to Kafka and receiving on Host OS's Flink program : Deserialization Issue

I am sending a data stream from a VM to Kafka's test topic (running on the host OS at IP 192.168.0.12) using the code below:
public class WriteToKafka {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Use ingestion time => TimeCharacteristic == EventTime + IngestionTimeExtractor
        env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

        DataStream<JoinedStreamEvent> joinedStreamEventDataStream = env
                .addSource(new JoinedStreamGenerator()).assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.0.12:9092");
        properties.setProperty("zookeeper.connect", "192.168.0.12:2181");
        properties.setProperty("group.id", "test");

        DataStreamSource<JoinedStreamEvent> stream = env.addSource(new JoinedStreamGenerator());
        stream.addSink(new FlinkKafkaProducer09<JoinedStreamEvent>("test", new TypeInformationSerializationSchema<>(stream.getType(), env.getConfig()), properties));
        env.execute();
    }
}
JoinedStreamEvent is of type DataStream<Tuple3<Integer, Integer, Integer>>; it basically joins two streams, respirationRateStream and heartRateStream:
public JoinedStreamEvent(Integer patient_id, Integer heartRate, Integer respirationRate) {
    Patient_id = patient_id;
    HeartRate = heartRate;
    RespirationRate = respirationRate;
}
There is another Flink program running on the host OS that tries to read the data stream from Kafka. I am using localhost here, as Kafka and ZooKeeper are running on the host OS.
public class ReadFromKafka {

    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "test");

        DataStream<String> message = env.addSource(new FlinkKafkaConsumer09<String>("test", new SimpleStringSchema(), properties));
        /* DataStream<JoinedStreamEvent> message = env.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
                new , properties)); */

        message.print();
        env.execute();
    } //main
} //ReadFromKafka
I am getting output something like this
I think I need to implement a deserializer of type JoinedStreamEvent. Can someone please give me an idea of how I should write the deserializer for JoinedStreamEvent of type DataStream<Tuple3<Integer, Integer, Integer>>?
Please let me know if something else needs to be done.
P.S. - I thought of writing the following deserializer, but I don't think it is right:
DataStream<JoinedStreamEvent> message = env.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
new TypeInformationSerializationSchema<JoinedStreamEvent>() , properties));
I was able to receive events from the VM in the same format by writing a custom serializer and deserializer, used by both the VM and the host OS programs, as shown below.
public class JoinSchema implements DeserializationSchema<JoinedStreamEvent>, SerializationSchema<JoinedStreamEvent> {

    @Override
    public JoinedStreamEvent deserialize(byte[] bytes) throws IOException {
        return JoinedStreamEvent.fromstring(new String(bytes));
    }

    @Override
    public boolean isEndOfStream(JoinedStreamEvent joinedStreamEvent) {
        return false;
    }

    @Override
    public TypeInformation<JoinedStreamEvent> getProducedType() {
        return TypeExtractor.getForClass(JoinedStreamEvent.class);
    }

    @Override
    public byte[] serialize(JoinedStreamEvent joinedStreamEvent) {
        return joinedStreamEvent.toString().getBytes();
    }
} //JoinSchema
Please note that you may have to write a fromstring() method in your event-type class, as I have added below to the JoinedStreamEvent class:
public static JoinedStreamEvent fromstring(String line) {
    String[] token = line.split(",");
    Integer val1 = Integer.valueOf(token[0]);
    Integer val2 = Integer.valueOf(token[1]);
    Integer val3 = Integer.valueOf(token[2]);
    return new JoinedStreamEvent(val1, val2, val3);
} //fromstring
Events were sent from the VM using the code below:
stream.addSink(new FlinkKafkaProducer09<JoinedStreamEvent>("test", new JoinSchema(), properties));
Events were received using the following code:
public static void main(String[] args) throws Exception {
    // create execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "localhost:9092");
    properties.setProperty("zookeeper.connect", "localhost:2181");
    properties.setProperty("group.id", "test");

    DataStream<JoinedStreamEvent> message = env.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
            new JoinSchema(), properties));
    message.print();
    env.execute();
} //main

Too many connections in MongoDB and Spark

My Spark Streaming application stores the data in MongoDB.
Unfortunately, each Spark worker opens too many connections while storing the data in MongoDB.
Here is my Spark to MongoDB code:
public static void main(String[] args) {
    int numThreads = Integer.parseInt(args[3]);
    String mongodbOutputURL = args[4];
    String masterURL = args[5];

    Logger.getLogger("org").setLevel(Level.OFF);
    Logger.getLogger("akka").setLevel(Level.OFF);

    // Create a Spark configuration object to establish connection between the application and spark cluster
    SparkConf sparkConf = new SparkConf().setAppName("AppName").setMaster(masterURL);

    // Configure the Spark microbatch with interval time
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60 * 1000));

    Configuration config = new Configuration();
    config.set("mongo.output.uri", "mongodb://host:port/database.collection");

    // Set the topics that should be consumed from Kafka cluster
    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    String[] topics = args[2].split(",");
    for (String topic : topics) {
        topicMap.put(topic, numThreads);
    }

    // Establish the connection between kafka and Spark
    JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

    JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
        @Override
        public String call(Tuple2<String, String> tuple2) {
            return tuple2._2();
        }
    });

    JavaPairDStream<Object, BSONObject> save = lines.mapToPair(new PairFunction<String, Object, BSONObject>() {
        @Override
        public Tuple2<Object, BSONObject> call(String input) {
            BSONObject bson = new BasicBSONObject();
            bson.put("field1", input.split(",")[0]);
            bson.put("field2", input.split(",")[1]);
            return new Tuple2<>(null, bson);
        }
    });

    // Store the records in database
    save.saveAsNewAPIHadoopFiles("prefix", "suffix", Object.class, Object.class, MongoOutputFormat.class, config);

    jssc.start();
    jssc.awaitTermination();
}
How can I control the number of connections at each worker?
Am I missing any configuration parameters?
Update 1:
I am using Spark 1.3 with the Java API.
I was not able to perform a coalesce() operation, but I was able to do repartition(2).
Now the number of connections is under control, but I think the connections are not being closed, or are not being reused at the worker.
Please see the screenshot below (streaming interval of 1 minute, 2 partitions).
You can try mapPartitions, which works at the partition level instead of the record level, i.e., a task executing on one node will share one database connection instead of opening one for every record (see the sketch below).
Also, I guess you could pre-partition the data (not the stream RDD); Spark is smart enough to use that to reduce shuffling.
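For illustration, here is a rough sketch of that suggestion against the lines DStream from the question; it is not from the original post, and it assumes the Spark 1.x Java API (FlatMapFunction returning an Iterable) and the same comma-separated record layout:

// Sketch: process records per partition so a single connection (opened where
// indicated and closed after the loop) can serve the whole partition.
JavaDStream<BSONObject> docs = lines.mapPartitions(
        new FlatMapFunction<Iterator<String>, BSONObject>() {
            @Override
            public Iterable<BSONObject> call(Iterator<String> records) {
                List<BSONObject> out = new ArrayList<>();
                // a MongoClient could be opened here and reused for every record below
                while (records.hasNext()) {
                    String[] fields = records.next().split(",");
                    BSONObject bson = new BasicBSONObject();
                    bson.put("field1", fields[0]);
                    bson.put("field2", fields[1]);
                    out.add(bson);
                }
                // ... and closed here, once per partition
                return out;
            }
        });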
I was able to solve the issue by using foreachRDD.
I establish the connection and close it within each partition of every micro-batch.
myRDD.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
            @Override
            public void call(Iterator<String> record) throws Exception {
                MongoClient mongo = new MongoClient(server, port);
                DB db = mongo.getDB(database);
                DBCollection targetTable = db.getCollection(collection);
                BasicDBObject doc = new BasicDBObject();
                while (record.hasNext()) {
                    String currentRecord = record.next();
                    String[] delim_records = currentRecord.split(",");
                    doc.append("column1", insert_time);
                    doc.append("column2", delim_records[1]);
                    doc.append("column3", delim_records[0]);
                    targetTable.insert(doc);
                    doc.clear();
                }
                mongo.close();
            }
        });
        return null;
    }
});
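If connections are still not reused across batches, another common pattern is one lazily created client per executor JVM, shared by all partitions and batches. This is only a sketch; the class name, host and port are illustrative, not taken from the post:

// Sketch: one MongoClient per executor JVM, created on first use and reused
// across partitions and micro-batches instead of being opened and closed each time.
public class MongoClientHolder {
    private static MongoClient client;

    public static synchronized MongoClient get() {
        if (client == null) {
            client = new MongoClient("host", 27017); // assumed host/port
            // close the shared client when the executor JVM shuts down
            Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
                @Override
                public void run() {
                    client.close();
                }
            }));
        }
        return client;
    }
}

Inside foreachPartition, the code above would then call MongoClientHolder.get() instead of new MongoClient(...) and skip mongo.close().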