Too many connections in MongoDB and Spark

My Spark Streaming application stores its data in MongoDB.
Unfortunately, each Spark worker opens too many connections while writing to MongoDB.
Here is my Spark to MongoDB code:
public static void main(String[] args) {
int numThreads = Integer.parseInt(args[3]);
String mongodbOutputURL = args[4];
String masterURL = args[5];
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
// Create a Spark configuration object to establish connection between the application and spark cluster
SparkConf sparkConf = new SparkConf().setAppName("AppName").setMaster(masterURL);
// Configure the Spark microbatch with interval time
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60*1000));
Configuration config = new Configuration();
config.set("mongo.output.uri", "mongodb://host:port/database.collection");
// Set the topics that should be consumed from Kafka cluster
Map<String, Integer> topicMap = new HashMap<String, Integer>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
// Establish the connection between kafka and Spark
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
JavaPairDStream<Object, BSONObject> save = lines.mapToPair(new PairFunction<String, Object, BSONObject>() {
@Override
public Tuple2<Object, BSONObject> call(String input) {
BSONObject bson = new BasicBSONObject();
bson.put("field1", input.split(",")[0]);
bson.put("field2", input.split(",")[1]);
return new Tuple2<>(null, bson);
}
});
// Store the records in database
save.saveAsNewAPIHadoopFiles("prefix", "suffix", Object.class, Object.class, MongoOutputFormat.class, config);
jssc.start();
jssc.awaitTermination();
}
How can I control the number of connections at each worker?
Am I missing any configuration parameters?
Update 1:
I am using Spark 1.3 with the Java API.
I was not able to perform coalesce(), but I was able to do a repartition(2) operation.
Now the number of connections is under control.
But I think the connections are not being closed or reused at the workers.
Screenshot (not included here): streaming interval of 1 minute and 2 partitions.

You can try mapPartitions, which works at the partition level instead of the record level, i.e. a task executing on one node shares one database connection instead of opening one for every record.
Also, I guess you could use a pre-partitioning (not on the stream RDD); Spark is smart enough to utilize this to reduce shuffles.
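A minimal sketch of that mapPartitions idea, assuming the Spark 1.x Java API and the MongoDB Java driver already used above; the host, port, database and collection names are placeholders:

JavaDStream<String> written = lines.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> records) {
        // One client per partition instead of one per record (placeholder address).
        MongoClient mongo = new MongoClient("host", 27017);
        DBCollection coll = mongo.getDB("database").getCollection("collection");
        List<String> processed = new ArrayList<>();
        while (records.hasNext()) {
            String record = records.next();
            String[] fields = record.split(",");
            coll.insert(new BasicDBObject("field1", fields[0]).append("field2", fields[1]));
            processed.add(record);
        }
        // The single partition-level connection is closed here.
        mongo.close();
        return processed;
    }
});

Note that mapPartitions is a lazy transformation, so it still needs an output action; for a pure side-effect write, foreachRDD with foreachPartition (as in the answer below) is the more natural fit.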

I was able to solve the issue by using foreachRDD.
I establish the connection inside each partition and close it after every DStream batch.
myRDD.foreachRDD(new Function<JavaRDD<String>, Void>() {
@Override
public Void call(JavaRDD<String> rdd) throws Exception {
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
@Override
public void call(Iterator<String> record) throws Exception {
// server, port, database and collection are placeholders assumed to be defined elsewhere
MongoClient mongo = new MongoClient(server, port);
DB db = mongo.getDB(database);
DBCollection targetTable = db.getCollection(collection);
BasicDBObject doc = new BasicDBObject();
while (record.hasNext()) {
String currentRecord = record.next();
String[] delim_records = currentRecord.split(",");
doc.append("column1", insert_time);
doc.append("column2", delim_records[1]);
doc.append("column3",delim_records[0]);
targetTable.insert(doc);
doc.clear();
}
mongo.close();
}
});
return null;
}
});

Related

How to Commit Kafka Offsets Manually in Flink

I have a Flink job that consumes a Kafka topic and sinks it to another topic. The job uses auto.commit with a 3-minute interval (checkpointing disabled), so on the monitoring side there is a 3-minute lag. We want to monitor the processing in real time without that lag, so we would like the FlinkKafkaConsumer to commit the offset immediately after the sink function.
Is there a way to achieve this within the Flink framework?
Or any other options?
In the code below, near the end of main(), I create a KafkaConsumer instance and call commitSync() to try to make this work, but it does not.
public class CEPJobTest {
private final static String TOPIC = "test";
private final static String BOOTSTRAP_SERVERS = "localhost:9092";
public static void main(String[] args) throws Exception {
System.out.println("start cep test job...");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "console-consumer-cep");
properties.setProperty("enable.auto.commit", "false");
// offset interval
//properties.setProperty("auto.commit.interval.ms", "500");
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(),
properties);
//set commitoffset by checkpoint
consumer.setCommitOffsetsOnCheckpoints(false);
System.out.println("checkpoint enabled:"+consumer.getEnableCommitOnCheckpoints());
DataStream<String> stream = env.addSource(consumer);
stream.map(new MapFunction<String, String>() {
@Override
public String map(String value) throws Exception {
return new Date().toString() + ": " + value;
}
}).print();
//here, I want to commit offset manually after processing message...
KafkaConsumer<?, ?> kafkaConsumer = new KafkaConsumer(properties);
kafkaConsumer.commitSync();
env.execute("Flink Streaming");
}
private static Consumer<Long, String> createConsumer() {
final Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
props.put(ConsumerConfig.GROUP_ID_CONFIG, "KafkaExampleConsumer");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, LongDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG,false);
final Consumer<Long, String> consumer = new KafkaConsumer<>(props);
return consumer;
}
}
This does not work the way your code tries to do it.
env.execute() submits the job to the cluster; only then does execution start. The code before that line merely builds the job graph rather than executing anything.
To commit after the sink, you should put the commit inside your sink function:
class mySink extends RichSinkFunction {
override def invoke(...) = {
val kafkaConsumer = new KafkaConsumer(properties);
kafkaConsumer.commitSync();
}
}
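A rough Java equivalent of that sketch, assuming Flink 1.7+ and a plain Kafka client created alongside Flink's own consumer (class name, generics, and wiring are assumptions, not code from the answer):

import java.util.Properties;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterSink extends RichSinkFunction<String> {

    private final Properties consumerProperties;
    private transient KafkaConsumer<String, String> kafkaConsumer;

    public CommitAfterSink(Properties consumerProperties) {
        this.consumerProperties = consumerProperties;
    }

    @Override
    public void open(Configuration parameters) {
        // A separate, plain Kafka consumer used only for committing offsets.
        // The properties must include key/value deserializer settings.
        kafkaConsumer = new KafkaConsumer<>(consumerProperties);
    }

    @Override
    public void invoke(String value, Context context) {
        // ... handle/produce "value" here ...
        kafkaConsumer.commitSync(); // commit once the record has been handled
    }

    @Override
    public void close() {
        if (kafkaConsumer != null) {
            kafkaConsumer.close();
        }
    }
}

Be aware that commitSync() with no arguments only commits offsets from this consumer's own poll() calls, so in practice you would have to track the offsets yourself and use commitSync(Map<TopicPartition, OffsetAndMetadata>), or simply rely on Flink's checkpoint-based committing.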

Is Kafka streaming to Ignite running in Transactional mode ACID?

I have a distributed database in Apache Ignite and an Apache Kafka streaming service that streams data into the Ignite cluster. The Kafka streamer works as follows:
Create ignite node to find cluster
Start kafka streamer singleton as a service in the cluster
Shut down the ignite node
The Ignite cluster is in Transactional mode; however, I am unsure whether this guarantees ACID or only enables it. Could this streaming service to Ignite be considered ACID?
Here is the code for the kafka streamer:
public class IgniteKafkaStreamerService implements Service {
private static final long serialVersionUID = 1L;
@IgniteInstanceResource
private Ignite ignite;
private KafkaStreamer<String, JSONObject> kafkaStreamer = new KafkaStreamer<>();
private IgniteLogger logger;
public static void main(String[] args) throws InterruptedException {
TcpDiscoverySpi spi = new TcpDiscoverySpi();
TcpDiscoveryMulticastIpFinder ipFinder = new TcpDiscoveryMulticastIpFinder();
// Set Multicast group.
//ipFinder.setMulticastGroup("228.10.10.157");
// Set initial IP addresses.
// Note that you can optionally specify a port or a port range.
ipFinder.setAddresses(Arrays.asList("127.0.0.1:47500..47509"));
spi.setIpFinder(ipFinder);
IgniteConfiguration cfg = new IgniteConfiguration();
// Override default discovery SPI.
cfg.setDiscoverySpi(spi);
Ignite ignite = Ignition.getOrStart(cfg);
// Deploy data streamer service on the server nodes.
ClusterGroup forServers = ignite.cluster().forServers();
IgniteKafkaStreamerService streamer = new IgniteKafkaStreamerService();
ignite.services(forServers).deployClusterSingleton("KafkaService", streamer);
ignite.close();
}
@Override
public void init(ServiceContext ctx) {
logger = ignite.log();
IgniteDataStreamer<String, JSONObject> stmr = ignite.dataStreamer("my_cache");
stmr.allowOverwrite(true);
stmr.autoFlushFrequency(1000);
List<String> topics = new ArrayList<>();
topics.add(0,"IoTData");
kafkaStreamer.setIgnite(ignite);
kafkaStreamer.setStreamer(stmr);
kafkaStreamer.setThreads(4);
kafkaStreamer.setTopic(topics);
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "NiFi-consumer");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.1.242:9092");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());
props.put("group.id", "hello");
kafkaStreamer.setConsumerConfig(props);
kafkaStreamer.setSingleTupleExtractor(msg -> {
JSONObject jsonObj = new JSONObject(msg.value().toString());
String key = jsonObj.getString("id") + "," + new Date(msg.timestamp());
JSONObject value = jsonObj.accumulate("date", new Date(msg.timestamp()));
return new AbstractMap.SimpleEntry<>(key, value);
});
}
@Override
public void execute(ServiceContext ctx) {
kafkaStreamer.start();
logger.info("KafkaStreamer started.");
}
@Override
public void cancel(ServiceContext ctx) {
kafkaStreamer.stop();
logger.info("KafkaStreamer stopped.");
}
}
KafkaStreamer uses an IgniteDataStreamer implementation under the hood. IgniteDataStreamer is not transactional by nature, so there are no transactional guarantees.
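If transactional guarantees are required, one option (not part of this answer's code) is to write through an IgniteCache whose atomicityMode is TRANSACTIONAL, using explicit transactions instead of the data streamer. A minimal sketch, assuming the cache name and value type from the code above:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;
import org.json.JSONObject;

public class TransactionalCacheWriter {

    // Puts a single entry inside an explicit Ignite transaction.
    // Only meaningful if "my_cache" is configured with CacheAtomicityMode.TRANSACTIONAL.
    public static void write(Ignite ignite, String key, JSONObject value) {
        IgniteCache<String, JSONObject> cache = ignite.cache("my_cache");
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
            cache.put(key, value);
            tx.commit();
        }
    }
}

This trades the data streamer's batched throughput for per-operation transactional guarantees, so the performance profile is very different.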

What happens to the timestamp of a message in a stream when it's mapped into another stream?

I have an application where I process a stream and convert it into another one. Here is a sample:
public void run(final String... args) {
final Serde<Event> eventSerde = new EventSerde();
final Properties props = streamingConfig.getProperties(
applicationName,
concurrency,
Serdes.String(),
eventSerde
);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, EXACTLY_ONCE);
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimestampExtractor.class);
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, Event> eventStream = builder.stream(inputStream);
final Serde<Device> deviceSerde = new DeviceSerde();
eventStream
.map((key, event) -> {
final Device device = modelMapper.map(event, Device.class);
return new KeyValue<>(key, device);
})
.to("device_topic", Produced.with(Serdes.String(), deviceSerde));
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
}
Here are some details about the app:
Spring Boot 1.5.17
Kafka 2.1.0
Kafka Streams 2.1.0
Spring Kafka 1.3.6
Although a timestamp is set in the messages inside the input stream, I also register an implementation of TimestampExtractor to make sure that a proper timestamp is attached to all messages (as other producers may send messages into the same topic).
Within the code, I receive a stream of events, convert them into different objects, and eventually route those objects into different streams.
I'm trying to understand whether the initial timestamp I set is still attached to the messages published to device_topic in this particular case.
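For reference, the EventTimestampExtractor configured above is not shown in the question; a typical implementation looks roughly like this (the Event accessor is an assumption):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Object value = record.value();
        if (value instanceof Event) {
            // Assumes Event carries the business timestamp it was created with.
            return ((Event) value).getTimestamp();
        }
        // Otherwise fall back to the timestamp already attached to the record.
        return record.timestamp();
    }
}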
The receiving end (of device stream) is like this:
@KafkaListener(topics = "device_topic")
public void onDeviceReceive(final Device device, @Header(KafkaHeaders.RECEIVED_TIMESTAMP) final long timestamp) {
log.trace("[{}] Received device: {}", timestamp, device);
}
Unfortunately, the printed timestamp seems to be wall-clock time. Is this the expected behaviour, or am I missing something?
Spring Kafka 1.3.x uses a very old 0.11 client; perhaps it doesn't propagate the timestamp. I just tested with Boot 2.1.3 and Spring Kafka 2.2.4 and the timestamp is propagated ok...
@SpringBootApplication
@EnableKafkaStreams
public class So54771130Application {
public static void main(String[] args) {
SpringApplication.run(So54771130Application.class, args);
}
@Bean
public ApplicationRunner runner(KafkaTemplate<String, String> template) {
return args -> {
template.send("so54771130", 0, 42L, null, "baz");
};
}
@Bean
public KStream<String, String> stream(StreamsBuilder builder) {
KStream<String, String> stream = builder.stream("so54771130");
stream
.map((k, v) -> {
System.out.println("Mapping:" + v);
return new KeyValue<>(null, "bar");
})
.to("so54771130-1");
return stream;
}
@Bean
public NewTopic topic1() {
return new NewTopic("so54771130", 1, (short) 1);
}
@Bean
public NewTopic topic2() {
return new NewTopic("so54771130-1", 1, (short) 1);
}
@KafkaListener(id = "so54771130", topics = "so54771130-1")
public void listen(String in, @Header(KafkaHeaders.RECEIVED_TIMESTAMP) long ts) {
System.out.println(in + "#" + ts);
}
}
and the output is:
Mapping:baz
bar#42

How to read all the records in a Kafka topic

I am using Kafka (kafka_2.12-2.1.0) with Spring Kafka on the client side and have got stuck with an issue.
I need to load an in-memory map by reading all the existing messages within a Kafka topic. I did this by starting a new consumer (with a unique consumer group id and the offset reset set to earliest). Then I iterate over the consumer (poll method) to get all messages, and stop when the consumer records come back empty.
But I noticed that the first few iterations of polling return empty consumer records, and only then does it start returning the actual records. This breaks my logic, because our code concludes there are no records in the topic.
I have tried a few other ways (like using offset numbers) but haven't been able to come up with any solution, apart from keeping a record somewhere else that tells me how many messages are in the topic and need to be read before I stop.
Any ideas, please?
To my understanding, what you are trying to achieve is to have a map constructed in your application based on the values that are already in a specific topic.
For this task, instead of manually polling the topic, you can use a KTable in the Kafka Streams DSL, which will automatically build a readable key-value store that is fault tolerant, replication enabled, and automatically filled with new values.
You can do this simply by calling groupByKey on a stream and then using aggregate.
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> myKStream = builder.stream(Serdes.String(), Serdes.Long(), "topic_name");
KTable<String, Long> totalCount = myKStream.groupByKey().aggregate(this::initializer, this::aggregator);
(The actual code may vary depending on the Kafka version, your configuration, etc.)
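With the newer StreamsBuilder DSL (Kafka Streams 2.x, which the question uses), the same idea looks roughly like this; as with the fragment above, the topic name, serdes and aggregation logic are placeholders:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Long> myKStream =
        builder.stream("topic_name", Consumed.with(Serdes.String(), Serdes.Long()));
KTable<String, Long> totalCount = myKStream
        .groupByKey()
        .aggregate(
                () -> 0L,                                     // initializer
                (key, value, aggregate) -> aggregate + value, // aggregator (placeholder logic)
                Materialized.with(Serdes.String(), Serdes.Long()));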
Read more about Kafka Streams concepts in the Kafka Streams documentation.
Then I iterate over the consumer (poll method) to get all messages and stop when the consumer records become empty
Kafka is a message streaming platform. Any data you stream is updated continuously, and you probably should not consume it in a way that expects consumption to stop after a certain number of messages. How will you handle a new message that arrives after you stop the consumer?
Also, the reason you are getting empty records at first is probably related to the records being in different partitions, etc.
What is your specific use case here? There might be a good way to do it with Kafka semantics itself.
You have to use two consumers: one to load the offsets and another to read all the records.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;
public class KafkaRecordReader {
static final Map<String, Object> props = new HashMap<>();
static {
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
props.put(ConsumerConfig.CLIENT_ID_CONFIG, "sample-client");
}
public static void main(String[] args) {
final Map<TopicPartition, OffsetInfo> partitionOffsetInfos = getOffsets(Arrays.asList("world", "sample"));
final List<ConsumerRecord<byte[], byte[]>> records = readRecords(partitionOffsetInfos);
System.out.println(partitionOffsetInfos);
System.out.println("Read : " + records.size() + " records");
}
private static List<ConsumerRecord<byte[], byte[]>> readRecords(final Map<TopicPartition, OffsetInfo> offsetInfos) {
final Properties readerProps = new Properties();
readerProps.putAll(props);
readerProps.put(ConsumerConfig.CLIENT_ID_CONFIG, "record-reader");
final Map<TopicPartition, Boolean> partitionToReadStatusMap = new HashMap<>();
offsetInfos.forEach((tp, offsetInfo) -> {
partitionToReadStatusMap.put(tp, offsetInfo.beginOffset == offsetInfo.endOffset);
});
final List<ConsumerRecord<byte[], byte[]>> cachedRecords = new ArrayList<>();
try (final KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(readerProps)) {
consumer.assign(offsetInfos.keySet());
for (final Map.Entry<TopicPartition, OffsetInfo> entry : offsetInfos.entrySet()) {
consumer.seek(entry.getKey(), entry.getValue().beginOffset);
}
boolean close = false;
while (!close) {
final ConsumerRecords<byte[], byte[]> consumerRecords = consumer.poll(Duration.ofMillis(100));
for (final ConsumerRecord<byte[], byte[]> record : consumerRecords) {
cachedRecords.add(record);
final TopicPartition currentTp = new TopicPartition(record.topic(), record.partition());
if (record.offset() + 1 == offsetInfos.get(currentTp).endOffset) {
partitionToReadStatusMap.put(currentTp, true);
}
}
boolean done = true;
for (final Map.Entry<TopicPartition, Boolean> entry : partitionToReadStatusMap.entrySet()) {
done &= entry.getValue();
}
close = done;
}
}
return cachedRecords;
}
private static Map<TopicPartition, OffsetInfo> getOffsets(final List<String> topics) {
final Properties offsetReaderProps = new Properties();
offsetReaderProps.putAll(props);
offsetReaderProps.put(ConsumerConfig.CLIENT_ID_CONFIG, "offset-reader");
final Map<TopicPartition, OffsetInfo> partitionOffsetInfo = new HashMap<>();
try (final KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(offsetReaderProps)) {
final List<PartitionInfo> partitionInfos = new ArrayList<>();
topics.forEach(topic -> partitionInfos.addAll(consumer.partitionsFor(topic)));
final Set<TopicPartition> topicPartitions = partitionInfos
.stream()
.map(x -> new TopicPartition(x.topic(), x.partition()))
.collect(Collectors.toSet());
consumer.assign(topicPartitions);
final Map<TopicPartition, Long> beginningOffsets = consumer.beginningOffsets(topicPartitions);
final Map<TopicPartition, Long> endOffsets = consumer.endOffsets(topicPartitions);
for (final TopicPartition tp : topicPartitions) {
partitionOffsetInfo.put(tp, new OffsetInfo(beginningOffsets.get(tp), endOffsets.get(tp)));
}
}
return partitionOffsetInfo;
}
private static class OffsetInfo {
private final long beginOffset;
private final long endOffset;
private OffsetInfo(long beginOffset, long endOffset) {
this.beginOffset = beginOffset;
this.endOffset = endOffset;
}
@Override
public String toString() {
return "OffsetInfo{" +
"beginOffset=" + beginOffset +
", endOffset=" + endOffset +
'}';
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
OffsetInfo that = (OffsetInfo) o;
return beginOffset == that.beginOffset &&
endOffset == that.endOffset;
}
@Override
public int hashCode() {
return Objects.hash(beginOffset, endOffset);
}
}
}
Adding to the above answer from @arshad: the reason you are not getting the records is that you have already read them. See this answer; using earliest or latest does not matter on the consumer after you have a committed offset for the partition.
I would use a seek to the beginning, or to a particular offset if you knew the starting offset.
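A minimal sketch of that approach with a plain consumer, reusing the kind of setup shown in the code above (the topic name is a placeholder):

// Assign the partitions explicitly, then rewind before polling.
List<TopicPartition> partitions = consumer.partitionsFor("topic_name").stream()
        .map(p -> new TopicPartition(p.topic(), p.partition()))
        .collect(Collectors.toList());
consumer.assign(partitions);
consumer.seekToBeginning(partitions);     // start from the earliest offset
// consumer.seek(partitions.get(0), 42L); // or seek a single partition to a known offset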

Mongo db java driver - resource management with client

Background
I have a remotely hosted server that runs a Java VM with custom server code for a multiplayer real-time quiz game. The server deals with matchmaking, rooms, lobbies, etc. I'm also using a MongoDB instance on the same host, which holds all the questions for the mobile quiz game.
This is my first attempt at such a project, and although I'm competent in Java, my Mongo skills are novice at best.
Client Singleton
My server contains a static singleton of the Mongo client:
public class ClientSingleton
{
private static ClientSingleton uniqueInstance;
// The MongoClient class is designed to be thread safe and shared among threads.
// We create only 1 instance for our given database cluster and use it across
// our application.
private MongoClient mongoClient;
private MongoClientOptions options;
private MongoCredential credential;
private final String password = "xxxxxxxxxxxxxx";
private final String host = "xx.xx.xx.xx";
private final int port = 38180;
/**
*
*/
private ClientSingleton()
{
// Setup client credentials for DB connection (user, db name & password)
credential = MongoCredential.createCredential("XXXXXX", "DBName", password.toCharArray());
options = MongoClientOptions.builder()
.connectTimeout(25000)
.socketTimeout(60000)
.connectionsPerHost(100)
.threadsAllowedToBlockForConnectionMultiplier(5)
.build();
try
{
// Create client (server address(host,port), credential, options)
mongoClient = new MongoClient(new ServerAddress(host, port),
Collections.singletonList(credential),
options);
}
catch (UnknownHostException e)
{
e.printStackTrace();
}
}
/**
* Double-checked locking method to initialise our client singleton class
*
*/
public static ClientSingleton getInstance()
{
if(uniqueInstance == null)
{
synchronized (ClientSingleton.class)
{
if(uniqueInstance == null)
{
uniqueInstance = new ClientSingleton();
}
}
}
return uniqueInstance;
}
/**
* @return our mongo client
*/
public MongoClient getClient() {
return mongoClient;
}
}
Notes here:
The Mongo client is new to me, and I understand that failure to properly utilise connection pooling is one major "gotcha" that greatly impacts MongoDB performance. I also understand that creating new connections to the db is expensive, so I should try to re-use existing connections.
I've not left the socket timeout and connect timeout at their defaults (i.e. infinite); if a connection hangs for some reason, I think it would get stuck forever!
I set the number of milliseconds the driver waits before a connection attempt is aborted; for connections made through a Platform-as-a-Service (where the server is hosted), a higher timeout (e.g. 25 seconds) is advised. I also set the number of milliseconds the driver waits for a response from the server for all types of requests (queries, writes, commands, authentication, etc.). Finally, I set threadsAllowedToBlockForConnectionMultiplier to 5, which allows up to 500 threads (5 x 100 connections per host) to wait, in a FIFO queue, for their turn on the db.
Server Zone
The zone gets a game request from the client and receives the metadata string for the quiz type, in this case "Episode 3". The zone creates a room for the user or allows the user to join a room with that property.
Server Room
The room then establishes a db connection to the Mongo collection for the quiz type:
// Get client & collection
mongoDatabase = ClientSingleton.getInstance().getClient().getDB("DBName");
mongoColl = mongoDatabase.getCollection("GOT");
// Query mongo db with meta data string request
queryMetaTags("Episode 3");
Notes here:
Following a game, or rather after a room idle time, the room gets destroyed; this idle time is currently set to 60 minutes. I believe that if connections per host is set to 100, then while a room is idle it is holding on to valuable connection resources.
Question
Is this a good way to manage my client connections?
If I have several hundred concurrently connected games, each accessing the db to pull questions, should I free up the client connection after that request so other rooms can use it? How should this be done? I'm concerned about possible bottlenecks here!
Mongo Query FYI
// Query our collection documents metaTag elements for a matching string
// @SuppressWarnings("deprecation")
public void queryMetaTags(String query)
{
// Query to search all documents in current collection
List<String> continentList = Arrays.asList(new String[]{query});
DBObject matchFields = new
BasicDBObject("season.questions.questionEntry.metaTags",
new BasicDBObject("$in", continentList));
DBObject groupFields = new BasicDBObject( "_id", "$_id").append("questions",
new BasicDBObject("$push","$season.questions"));
//DBObject unwindshow = new BasicDBObject("$unwind","$show");
DBObject unwindsea = new BasicDBObject("$unwind", "$season");
DBObject unwindepi = new BasicDBObject("$unwind", "$season.questions");
DBObject match = new BasicDBObject("$match", matchFields);
DBObject group = new BasicDBObject("$group", groupFields);
@SuppressWarnings("deprecation")
AggregationOutput output =
mongoColl.aggregate(unwindsea,unwindepi,match,group);
String jsonString = null;
JSONObject jsonObject = null;
JSONArray jsonArray = null;
ArrayList<JSONObject> ourResultsArray = new ArrayList<JSONObject>();
// Loop for each document in our collection
for (DBObject result : output.results())
{
try
{
// Parse our results so we can add them to an ArrayList
jsonString = JSON.serialize(result);
jsonObject = new JSONObject(jsonString);
jsonArray = jsonObject.getJSONArray("questions");
for (int i = 0; i < jsonArray.length(); i++)
{
// Put each of our returned questionEntry elements into an ArrayList
ourResultsArray.add(jsonArray.getJSONObject(i));
}
}
catch (JSONException e1)
{
e1.printStackTrace();
}
}
pullOut10Questions(ourResultsArray);
}
The way I've done this is to use Spring to create a MongoClient Bean. You can then autowire this bean wherever it is needed.
For example:
MongoConfig.java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.tescobank.insurance.telematics.data.connector.config.DatabaseProperties;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.net.UnknownHostException;
@Configuration
public class MongoConfig {
private @Autowired DatabaseProperties properties;
@Bean
public MongoClient fooClient() throws UnknownHostException {
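// mongo(...) is a helper from the original answer (not shown here); it is assumed to build a MongoClient from the given URI.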
return mongo(properties.getFooDatabaseURI());
}
}
Class Requiring Mongodb connection:
@Component
public class DatabaseUser {
private MongoClient mongoClient;
....
@Autowired
public DatabaseUser(MongoClient mongoClient) {
this.mongoClient = mongoClient;
}
}
Spring will then create the connection and wire it in where required. What you've done seems very complex and perhaps recreates functionality you would get for free by using a tried and tested framework such as Spring. I'd generally try to avoid singletons too if I could. I've had no performance issues using MongoDB connections in this way.