Creating a global connection object across an executor in Spark when pushing data to MongoDB

In the linked post it was suggested to create a connection pool that is available across multiple RDDs in the Spark Streaming job:
rdd.foreachPartition { iter =>
  val client = MongoClient(host, port)
  val col = client.getDatabase("testDataBase").getCollection("testCollection")
  // I am basically inserting the data in the iterator into testCollection
}
However, I was not able to figure out how to create a connection pool that returns a connection object to a MongoDB collection. I was able to use foreachPartition to create a single connection for the whole partition. Can someone please let me know how to create a connection object that is available across the executor for reuse?

The MongoDB Spark Connector internally uses broadcast variables to achieve this:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
So you should be able to share the MongoClient and connection pool across tasks.
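As a rough sketch of the per-executor reuse described here, without pulling in the connector: keep the MongoClient in a singleton object so it is created lazily, once per executor JVM, and reused by every task that runs there (the connection URI and the database/collection names are placeholders taken from the question).

import org.mongodb.scala.MongoClient

// Initialised lazily the first time a task on this executor touches it,
// then shared by all subsequent tasks/partitions in the same JVM.
object SharedMongoClient {
  lazy val client: MongoClient = MongoClient("mongodb://host:27017")
}

rdd.foreachPartition { iter =>
  val col = SharedMongoClient.client
    .getDatabase("testDataBase")
    .getCollection("testCollection")
  // insert the documents from this partition via col, reusing the pooled client
}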

The MongoDB Spark connector doesn't help with collecting exceptions; how do we do that? Also, if we insert in batches and one insert fails, it stops the remaining inserts.
The MongoDB driver lets you insert multiple documents at once, and you can set ordered = false so it still inserts the remaining documents even when some fail because of duplicates or timeouts.
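For reference, a sketch of what that unordered batch insert looks like with the MongoDB Scala driver (col is the MongoCollection from the question's snippet; documents stands for an assumed sequence of documents to insert):

import org.mongodb.scala.model.InsertManyOptions

// ordered(false): keep inserting the rest of the batch even if individual documents fail
// (e.g. duplicate keys); the failures are reported together when the write finishes.
// The async Scala driver returns an Observable that still has to be subscribed to or awaited.
col.insertMany(documents, InsertManyOptions().ordered(false))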

Related

Data streamed from Kafka to Postgres and missing seconds later

I am trying to save data from a local Kafka instance to local Postgres with Spark Streaming. I have configured all connections and parameters, and data actually gets to the database. However, it is there only for a couple of seconds; after that, the table simply becomes empty. If I stop the app as soon as there is some data in Postgres, the data persists, so I suppose I have missed some parameter for streaming in Spark or something in the Kafka configuration files. The code is in Java, not Scala, so there is Dataset instead of DataFrame.
I tried setting spark.driver.allowMultipleContexts to true, but this has nothing to do with contexts. When I run a count on the database while the complete data set is streaming in the background, there are always about 1700 records, which suggests there might be some parameter for batch size.
censusRecordJavaDStream.map(e -> {
    Row row = RowFactory.create(e.getAllValues());
    return row;
}).foreachRDD(rdd -> {
    Dataset<Row> censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType());
    censusDataSet
        .write()
        .mode(SaveMode.Overwrite)
        .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties);
});
My goal is to stream data from Kafka and save it to Postgres. Each record has a unique ID, which is used as the key in Kafka, so there should be no conflicts regarding the primary key or double entries. For current testing purposes, I am using a small subset of about 100 records; the complete dataset is over 300 MB.
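One thing worth noting about the snippet above: SaveMode.Overwrite drops and rewrites the target table on every call to write(), i.e. on every micro-batch, which would explain the table appearing to empty itself; a streaming sink normally wants SaveMode.Append. A Scala sketch of the same foreachRDD sink with append semantics (censusRecordDStream stands for the mapped DStream of Rows from the question; table name, JDBC URL and connectionProperties as above):

import org.apache.spark.sql.SaveMode

censusRecordDStream.foreachRDD { rdd =>
  val censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType())
  censusDataSet.write
    .mode(SaveMode.Append)   // add this batch's rows instead of replacing the whole table
    .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties)
}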

Apache Spark: When not to use mapPartition and foreachPartition?

I know that when we want to initialize some resource for a whole group of RDD elements rather than for individual elements, we should ideally use mapPartition and foreachPartition, for example when initializing a JDBC connection for each partition of data. But are there scenarios where we should not use either of them and should instead use the plain vanilla map() and foreach() transformation and action?
When you write Spark jobs that use either mapPartitions or foreachPartition, you can only modify the partition data or iterate over it, respectively. The anonymous function passed as a parameter is executed on the executors, so there is no viable way to run code that involves all the nodes, e.g. df.reduceByKey, from one particular executor. Such code should be executed only from the driver node. Thus only from the driver code can you access DataFrames, Datasets, and the SparkSession.
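For concreteness, a minimal sketch of the partition-level shape for the JDBC example from the question (createConnection and enrich are hypothetical helpers, not a real API; the point is that the connection cost is paid once per partition, and the iterator must be materialised before the connection is closed):

// mapPartitions: transform a whole partition while sharing one expensive resource.
val enriched = rdd.mapPartitions { records =>
  val conn = createConnection()                       // one JDBC connection per partition
  val out = records.map(r => enrich(conn, r)).toList  // materialise before closing the connection
  conn.close()
  out.iterator
}

// plain map(): perfectly fine when there is no per-record setup cost to amortise.
val doubled = rdd.map(x => x * 2)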
Please find here a detailed discussion of this issue and possible solutions.

Saving data to ElasticSearch in Spark task

While processing a stream of Avro messages through Kafka and Spark, I am saving the processed data as documents in an Elasticsearch index.
Here's the code (simplified):
directKafkaStream.foreachRDD(rdd -> {
    rdd.foreach(avroRecord -> {
        byte[] encodedAvroData = avroRecord._2();
        MyType t = deserialize(encodedAvroData);

        // Creating the ElasticSearch Transport client
        Settings settings = Settings.builder()
                .put("client.transport.ping_timeout", 5, TimeUnit.SECONDS).build();
        TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));

        // id and name come from the deserialized record t (omitted in this simplified snippet)
        IndexRequest indexRequest = new IndexRequest("index", "item", id)
                .source(jsonBuilder()
                        .startObject()
                        .field("name", name)
                        .field("timestamp", new Timestamp(System.currentTimeMillis()))
                        .endObject());
        UpdateRequest updateRequest = new UpdateRequest("index", "item", id)
                .doc(jsonBuilder()
                        .startObject()
                        .field("name", name)
                        .field("timestamp", new Timestamp(System.currentTimeMillis()))
                        .endObject())
                .upsert(indexRequest);

        client.update(updateRequest).get();
        client.close();
    });
});
Everything works as expected; the only problem is performance: saving to ES takes some time, and I suppose this is due to the fact that I open/close an ES Transport client for each RDD. The Spark documentation suggests that this approach is quite correct: as far as I understand, the only possible optimisation is using rdd.foreachPartition, but I only have one partition, so I am not sure that this would be beneficial.
Any other solution to achieve better performance?
Because you create a new connection every time you process a record of the RDD.
So I think using foreachPartition will give better performance regardless of there being only one partition, because it lets you bring your ES connection instance outside the per-record loop and reuse it inside it.
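A Scala sketch of that shape (createTransportClient and writeToEs are hypothetical helpers standing in for the client setup and update/upsert calls shown in the question):

rdd.foreachPartition { records =>
  val client = createTransportClient()                  // opened once per partition/task
  records.foreach(record => writeToEs(client, record))  // client reused for every record
  client.close()                                        // closed once the whole partition is written
}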
I would stream the processed messages back onto a separate Kafka topic, and then use Kafka Connect to land them in Elasticsearch. This decouples your Spark-specific processing from getting the data into Elasticsearch.
Example of it in action: https://www.confluent.io/blog/blogthe-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/

Cassandra insert performance using spark-cassandra connector

I am a newbie to Spark and Cassandra. I am trying to insert into a Cassandra table using the spark-cassandra connector, as below:
import java.util.UUID

import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.DateTime
import com.datastax.spark.connector._

case class TestEntity(id: UUID, category: String, name: String, value: Double, createDate: DateTime, tag: Long)

object SparkConnectorContext {
  val conf = new SparkConf(true).setMaster("local")
    .set("spark.cassandra.connection.host", "192.168.xxx.xxx")
  val sc = new SparkContext(conf)
}

object TestRepo {
  def insertList(list: List[TestEntity]) =
    SparkConnectorContext.sc.parallelize(list).saveToCassandra("testKeySpace", "testColumnFamily")
}

object TestApp extends App {
  val start = System.currentTimeMillis()
  TestRepo.insertList(Utility.generateRandomData())
  val end = System.currentTimeMillis()
  val timeDiff = end - start
  println("Difference (in millis) = " + timeDiff)
}
When I insert using the above method (a list with 100 entities), it takes 300-1100 milliseconds.
I tried inserting the same data using the phantom library; that takes only 20-40 milliseconds.
Can anyone tell me why the Spark connector takes this much time for the insert? Am I doing anything wrong in my code, or is it not advisable to use the spark-cassandra connector for insert operations?
It looks like you are including the parallelize operation in your timing. Also since you have your spark worker running on a different machine than Cassandra, the saveToCassandra operation will be a write over the network.
Try configuring your system to run the spark workers on the Cassandra nodes. Then create an RDD in a separate step and invoke an action like count() on it to load the data into memory. Also you might want to persist() or cache() the RDD to make sure it stays in memory for the test.
Then time just the saveToCassandra of that cached RDD.
You might also want to look at the repartitionByCassandraReplica method offered by the Cassandra connector. That would partition the data in the RDD based on which Cassandra node the writes need to go to. In that way you exploit data locality and often avoid doing writes and shuffles over the network.
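A sketch of that call, reusing the keyspace/table from the question (the exact signature and the requirement that the RDD's fields include the table's partition key may vary by connector version):

import com.datastax.spark.connector._

val entities = SparkConnectorContext.sc.parallelize(Utility.generateRandomData())

// Re-group the RDD so each Spark partition holds rows owned by a single Cassandra replica;
// the subsequent write then mostly stays node-local instead of crossing the network.
val byReplica = entities.repartitionByCassandraReplica("testKeySpace", "testColumnFamily")
byReplica.saveToCassandra("testKeySpace", "testColumnFamily")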
There are some serious problems with your "benchmark":
Your data set is so small that you're mostly measuring only the job setup time. Saving 100 entities should take on the order of single milliseconds on a single node, not seconds. Also, saving only 100 entities gives the JVM no chance to compile the code you run into optimized machine code.
You included spark context initialization in your measurement. JVM loads classes lazily, so the code for spark initialization is really called after the measurement is started. This is an extremely costly element, typically performed only once per whole spark application, not even per job.
You're performing the measurement only once per launch. This means the Spark context setup and job setup time are also measured incorrectly, because the JVM has to load all the classes for the first time and HotSpot probably has no chance to kick in.
To summarize, you're very likely measuring mostly class loading time, which is dependent on the size and number of classes loaded. Spark is quite a large thing to load and a few hundred milliseconds are not surprising at all.
To measure insert performance correctly (a sketch follows this list):
use a larger data set
exclude one-time setup from the measurement
do multiple runs sharing the same Spark context and discard a few initial ones, until you reach steady-state performance.
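A sketch of such a measurement, reusing the objects from the question (ideally with a much larger generated data set; the run count is arbitrary):

import com.datastax.spark.connector._

// Build the RDD once, force it into memory, then time only saveToCassandra over several runs.
val rdd = SparkConnectorContext.sc.parallelize(Utility.generateRandomData()).cache()
rdd.count()  // materialise the RDD and pay the one-time class-loading/setup cost up front

for (run <- 1 to 10) {  // treat the first few runs as warm-up and report the later ones
  val start = System.currentTimeMillis()
  rdd.saveToCassandra("testKeySpace", "testColumnFamily")
  println(s"run $run: ${System.currentTimeMillis() - start} ms")
}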
BTW, if you enable the debug logging level, the connector logs the insert times for every partition in the executor logs.

Spark streaming merge data

My understanding is that Spark Streaming serialises the closure (e.g. map, filter, etc.) and executes it on worker nodes (as explained here). Is there some way of sending the results back to the driver program and performing further operations on the local machine?
In our specific use case, we are trying to turn the results produced by Spark into an observable stream (using RxScala).
Someone posted a comment but deleted it afterwards. They suggested using collect() on the RDD. A simple test showed that collect() gathers the data from the worker nodes back to the driver node, where further code runs locally; exactly what I needed.
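A sketch of that pattern for the streaming case (the RxScala PublishSubject is just one way to expose "an observable stream on the driver"; MyResult and dstream are placeholders for the question's types):

import rx.lang.scala.subjects.PublishSubject

val subject = PublishSubject[MyResult]()   // lives on the driver

dstream.foreachRDD { rdd =>
  // The map/filter closures ran on the executors; collect() ships the resulting
  // elements back to the driver, where local code (Rx, logging, ...) can consume them.
  rdd.collect().foreach(subject.onNext)
}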